Demystifying the Rendering Pipeline in Arm Mali GPUs

Table of Contents

  1. Introduction
  2. The Rendering Pipeline
    • Object Creation
    • Vertex Animation
    • Triangle Rasterization
    • Pixel Shading
    • Incremental Rendering
    • Processing Stages in the Pipeline
  3. CPU and GPU Roles
    • CPU Responsibilities
    • Command Processing Stage
    • Geometry Processing Stage
    • Pixel Processing Stage
    • Uploading Data Resources
  4. Shader Programs and Buffers
    • Shader Programs
    • Buffers
    • Textures
    • Descriptors
    • Configuration of Draw Calls
  5. Compute Processing Stage
    • Separate Pipeline
    • Inputs and Outputs
    • Procedural Generation
    • Legacy Technique
  6. Geometry Processing Stage Group
    • Vertex Shading Stage
    • Tessellation Shading Stage
    • Geometry Shading Stage
    • Primitive Assembly
    • View Volume Culling
  7. Modern Mali GPU Approach
    • Early ZS Testing
    • Splitting the Vertex Shader
    • Buffer Layout Optimization
    • Off-Screen Draw Call Culling
  8. Pixel Processing Pipeline
    • Fragment Shader
    • Rasterization
    • Early ZS and Late ZS Testing
    • Blending
  9. Parallel Processing in the Pipeline
    • Macro-Stages in the Pipeline
    • Overlapping Processing Units
    • Synchronous vs Asynchronous APIs
    • Optimizing Pipeline Performance
  10. Conclusion

Introduction

In this article, we explore the rendering pipeline used by graphics APIs such as OpenGL ES and Vulkan. We walk through the stages of the pipeline and the role each plays in producing an image, from object creation to pixel shading, and examine the responsibilities of the CPU and GPU along the way. We then look at shader programs, buffers, and textures, and at the descriptors that configure draw calls. Next, we cover the compute processing stage and its role in procedural generation, before diving into the geometry processing stage group and its component stages. From there we discuss the modern approach adopted by Mali GPUs and the benefits it brings, followed by the pixel processing pipeline, including rasterization, early and late ZS testing, and blending. The article concludes with a discussion of parallel processing in the pipeline and how it affects rendering efficiency.

The Rendering Pipeline

The rendering pipeline is a series of processing stages that work together to transform a 3D object into a 2D image that can be displayed on a screen. Let's take a closer look at each stage of the pipeline.

Object Creation

The rendering process begins with an object created by an artist. These objects are described using a mesh of triangles that define the shape of the object. It's important to note that these objects are hollow; we are only modeling their outer shape, not their internal contents.

Vertex Animation

For each frame of rendering, we animate each vertex in the mesh to place the object, correctly sized and oriented, in our virtual environment and on the screen. The CPU computes the transformation matrices that describe the desired movement and perspective; the per-vertex transformations themselves are applied later by the GPU's geometry processing stage.

Triangle Rasterization

Once the vertices are transformed, the GPU takes over and rasterizes the visible triangles. Rasterization involves determining which pixels need to be colored in based on the vector form of each triangle. These pixels are then color shaded and written out into the framebuffer.

Pixel Shading

The pixel processing stage consumes the rasterized triangles and executes the fragment coloring operations. This stage is where the final color values for each pixel are computed based on lighting, textures, and other shading techniques.

Incremental Rendering

By repeating the above process for every object in the scene, we can incrementally render the entire image that is to be displayed. This incremental rendering allows for efficient handling of complex scenes with many objects.

Processing Stages in the Pipeline

The rendering pipeline consists of four main stage groups: the CPU, the command processing stage, the geometry processing stage, and the pixel processing stage.

The CPU, running the application and the graphics driver, is responsible for pre-flight tasks such as animation and physics, uploading data to the GPU's memory, and submitting rendering commands.

The command processing stage, inside the GPU, interprets the commands sent by the CPU and coordinates the data processing stages of the GPU. It ensures that the correct resources are available during the draw call.

The geometry processing stage takes the input meshes and animates each vertex into the correct location in screen-space. It also generates additional per-vertex data, such as lighting, and emits a stream of primitives, usually triangles, for pixel processing.

The pixel processing stage consumes the primitive data generated by the geometry processing stage. It rasterizes the primitives to generate pixel coverage and executes the fragment shading operations.

CPU and GPU Roles

In the rendering pipeline, the CPU and the GPU play distinct roles in the overall process. Let's explore their responsibilities in more detail.

CPU Responsibilities

The CPU is responsible for running the application and the graphics driver. It performs pre-flight tasks such as animation and physics calculations, and uploads any data resources it generates into the GPU's memory. The CPU also sends rendering commands to the GPU, which define the operations to be performed during the rendering process.
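
To make this concrete, here is a minimal sketch of the CPU's per-frame work in OpenGL ES, written in C. The helper functions update_physics() and compute_mvp(), the uniform name u_mvp, and the assumption that vertex attributes are already configured are all illustrative, not taken from any specific engine.

    #include <GLES3/gl3.h>

    /* Provided elsewhere by the application (hypothetical helpers). */
    extern void update_physics(void);
    extern void compute_mvp(float mvp[16]);

    void render_frame(GLuint program, GLuint ibo, int index_count)
    {
        float mvp[16];
        update_physics();   /* pre-flight work: animation and physics    */
        compute_mvp(mvp);   /* build this frame's transformation matrix  */

        glUseProgram(program);                       /* select shaders   */
        glUniformMatrix4fv(glGetUniformLocation(program, "u_mvp"),
                           1, GL_FALSE, mvp);        /* upload data      */

        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  /* bind resources   */
        glDrawElements(GL_TRIANGLES, index_count,
                       GL_UNSIGNED_SHORT, 0);        /* submit command   */
    }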

Command Processing Stage

The command processing stage, located inside the GPU, is responsible for interpreting the rendering commands issued by the CPU. It coordinates the GPU's data processing stages and ensures that the necessary resources are available for each draw call. The command processor reads the descriptors describing the draw call and uses this information to parametrize the rest of the pipeline.

Geometry Processing Stage

The geometry processing stage takes the input meshes provided by the content artists and performs various operations on them. It animates each vertex into the correct location in screen-space based on the transformation matrices provided by the CPU. The geometry processing stage also generates additional per-vertex data, such as per-vertex lighting, which is needed in subsequent stages of the pipeline. Finally, it emits a stream of primitives, such as triangles, for the pixel processing stage to consume.

Pixel Processing Stage

The pixel processing stage consumes the primitive data generated by the geometry processing stage. It rasterizes the primitives to determine which pixels need to be colored in, based on the vector form of each triangle. The pixel processing stage then executes the fragment shading operations to compute the final color values for each pixel. The resulting pixels are written out to the framebuffer for display.

Uploading Data Resources

To run the rendering pipeline, the CPU must upload different types of data resources to the GPU. Shader programs, which define the processing operations to be performed on each vertex or fragment, are one type of resource. Buffers, which contain input vertex data and other scene information, are another important resource. Textures, both color and non-color, can be used as input for the shader stages and as framebuffer attachments to store the output of the rendering process. Descriptors are control plane structures that provide supplemental information to the GPU. Each resource requires a descriptor to describe its location in memory and how it should be accessed. Rendering commands reference the appropriate descriptors, providing the GPU with the necessary information to complete the operations.

Shader Programs and Buffers

Shader programs and buffers are essential components in the rendering pipeline. Let's take a closer look at each of these components.

Shader Programs

Shader programs define the processing operations to be performed on each vertex or fragment during the rendering process. They are written in a specialized shading language such as GLSL (the OpenGL Shading Language); for Vulkan, shaders are compiled ahead of time into SPIR-V (Standard Portable Intermediate Representation) binaries. There are different types of shader programs, including vertex shaders, fragment shaders, geometry shaders, tessellation control shaders, and tessellation evaluation shaders. Each shader program is executed on the corresponding processing stage of the pipeline. Shaders can read input data from vertex buffers and write output data to buffers, textures, or the framebuffer.
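
As a sketch, the C fragment below compiles and links a trivial vertex and fragment shader pair through the OpenGL ES API. The GLSL sources and the a_position/u_mvp names are illustrative, and production code would also check the compile and link status.

    #include <GLES3/gl3.h>

    static const char *vs_src =
        "#version 300 es\n"
        "layout(location = 0) in vec4 a_position;\n"
        "uniform mat4 u_mvp;\n"
        "void main() { gl_Position = u_mvp * a_position; }\n";

    static const char *fs_src =
        "#version 300 es\n"
        "precision mediump float;\n"
        "out vec4 o_color;\n"
        "void main() { o_color = vec4(1.0, 0.5, 0.2, 1.0); }\n";

    static GLuint compile(GLenum type, const char *src)
    {
        GLuint shader = glCreateShader(type);
        glShaderSource(shader, 1, &src, NULL);  /* attach source text  */
        glCompileShader(shader);                /* compile for the GPU */
        return shader;   /* real code checks GL_COMPILE_STATUS here    */
    }

    GLuint build_program(void)
    {
        GLuint prog = glCreateProgram();
        glAttachShader(prog, compile(GL_VERTEX_SHADER, vs_src));
        glAttachShader(prog, compile(GL_FRAGMENT_SHADER, fs_src));
        glLinkProgram(prog);   /* link the vertex and fragment stages  */
        return prog;
    }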

Buffers

Buffers contain various types of data used in the rendering process. They hold input vertex data, primitive connectivity data, and other scene information, such as transformation matrices for each object or information on dynamic light sources. Buffers play a crucial role in providing the necessary input for the geometry processing stage. They are typically accessed by shader programs to fetch or write data during processing. Buffers are allocated and managed by the CPU, and their contents are transferred to the GPU's memory before rendering starts.
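
A minimal example of this flow in OpenGL ES, assuming a current GL context; the GL_STATIC_DRAW hint tells the driver the data is uploaded once and drawn many times.

    #include <GLES3/gl3.h>
    #include <stddef.h>

    /* Copy CPU-side vertex data into a GPU-visible buffer object. */
    GLuint upload_vertices(const float *data, size_t bytes)
    {
        GLuint vbo;
        glGenBuffers(1, &vbo);                 /* allocate a buffer name */
        glBindBuffer(GL_ARRAY_BUFFER, vbo);    /* select it for upload   */
        glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)bytes,
                     data, GL_STATIC_DRAW);    /* transfer to GPU memory */
        return vbo;
    }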

Textures

Textures contain image data, both color and non-color, that can be used by the shader stages as input. Textures can be 2D, 3D, or cube maps, and they can store various data formats, such as RGBA color values or depth information. They are commonly used for applying textures to objects or performing advanced shading techniques, such as normal mapping or shadow mapping. Textures can also serve as framebuffer attachments, where the output of the rendering process is stored for later use. Like buffers, textures are allocated and managed by the CPU, and their contents are transferred to the GPU's memory.
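
For illustration, a basic 2D RGBA texture upload in OpenGL ES; the pixel data is assumed to be tightly packed, and mipmaps and wrap modes are omitted for brevity.

    #include <GLES3/gl3.h>

    /* Create a 2D texture and transfer image data into GPU memory. */
    GLuint upload_texture(const unsigned char *rgba, int width, int height)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA,    /* mip level 0 */
                     width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, rgba);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        return tex;
    }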

Descriptors

Descriptors are control plane structures that provide supplemental information to the GPU. They describe the location and format of resources in memory, as well as how they should be accessed during the rendering process. Each resource required by a rendering command needs a descriptor to specify its configuration. Descriptors play a crucial role in parametrizing the rendering pipeline, as they provide the GPU with the necessary information to complete the operations. When a CPU issues a command to the GPU, the command processor reads the descriptor and uses it to set up the rest of the pipeline.
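
Vulkan makes this control plane explicit in the API. As a minimal sketch, the C fragment below declares a descriptor set layout with a single uniform buffer visible to the vertex stage; error handling is omitted.

    #include <vulkan/vulkan.h>

    /* Describe one uniform buffer binding for the vertex shader stage. */
    VkDescriptorSetLayout make_layout(VkDevice device)
    {
        VkDescriptorSetLayoutBinding binding = {
            .binding         = 0,
            .descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
            .descriptorCount = 1,
            .stageFlags      = VK_SHADER_STAGE_VERTEX_BIT,
        };
        VkDescriptorSetLayoutCreateInfo info = {
            .sType        = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
            .bindingCount = 1,
            .pBindings    = &binding,
        };
        VkDescriptorSetLayout layout;
        vkCreateDescriptorSetLayout(device, &info, NULL, &layout);
        return layout;   /* real code would check the VkResult */
    }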

Configuration of Draw Calls

Each rendering command issued by the CPU references a descriptor that describes the draw call's configuration. The descriptor provides information about the resources needed to complete the operation, such as the buffers, shader programs, and textures required. The GPU uses this information to set up the geometry processing and pixel processing stages correctly. The configuration of draw calls determines how objects are rendered and what shaders are applied to them.

Compute Processing Stage

While the main rendering pipeline consists of the CPU, the command processing stage, the geometry processing stage, and the pixel processing stage, there is an additional stage called the compute processing stage. Let's take a closer look at this stage and its role in the rendering pipeline.

Separate Pipeline

The compute processing stage is a separate pipeline that exists outside of the main rendering pipeline. Unlike the other stages, which form a strict sequence of processing units, the compute processing stage operates in a more loosely connected way. It consumes inputs from main memory and writes outputs back to main memory, rather than directly interacting with the other stages of the pipeline.

Inputs and Outputs

The compute processing stage takes in various inputs from main memory, such as buffers and textures, and performs computations on them. These computations can involve complex algorithms or procedural generation techniques. The outputs of the compute processing stage are modified buffers and textures, which can then be used by the other stages of the pipeline.

Procedural Generation

One of the common use cases for the compute processing stage is procedural generation. This involves generating content, such as meshes or textures, on the fly using algorithms or formulas. For example, the compute processing stage can be used to create procedural terrain or generate particle systems based on user-defined parameters.
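
A small sketch of this pattern in OpenGL ES 3.1, with a compute shader that fills a storage buffer with procedurally generated values. The shader source and the assumption that the element count is a multiple of the work-group size (64) are illustrative.

    #include <GLES3/gl31.h>

    /* Illustrative compute shader: one procedural value per element. */
    static const char *cs_src =
        "#version 310 es\n"
        "layout(local_size_x = 64) in;\n"
        "layout(std430, binding = 0) buffer Out { float data[]; };\n"
        "void main() {\n"
        "    uint i = gl_GlobalInvocationID.x;\n"
        "    data[i] = sin(float(i) * 0.1);\n"
        "}\n";

    void run_compute(GLuint program, GLuint ssbo, GLuint element_count)
    {
        glUseProgram(program);   /* program built from cs_src */
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
        glDispatchCompute(element_count / 64, 1, 1);     /* launch work groups  */
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  /* make writes visible */
    }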

Legacy Technique

While the geometry shading stage (discussed later) offers flexibility and the ability to create primitives programmatically, it is now considered a legacy technique. Most of the tasks that were traditionally performed by the geometry shading stage can be solved more efficiently using compute shaders instead, which is why compute has largely replaced it in modern rendering pipelines.

Geometry Processing Stage Group

The geometry processing stage group consists of multiple component pipeline stages that are responsible for processing the input meshes and generating screen-space primitives. Let's take a closer look at each stage in this group.

Vertex Shading Stage

The vertex shading stage is the first stage in the geometry processing stage group. It consumes the stream of vertices generated by the application and animates each vertex in the mesh. The vertex shader is a programmable stage that runs on each vertex and computes the final position of the vertex in screen-space. The vertex shader can also compute additional per-vertex data, such as lighting factors or texture coordinates, which is needed in later stages of the pipeline.
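
A representative vertex shader, shown here as GLSL ES embedded in a C string; the attribute and uniform names are illustrative. It computes the clip-space position and forwards a texture coordinate as per-vertex data for later stages.

    static const char *vertex_src =
        "#version 300 es\n"
        "layout(location = 0) in vec4 a_position;\n"
        "layout(location = 1) in vec2 a_texcoord;\n"
        "uniform mat4 u_mvp;\n"
        "out vec2 v_texcoord;\n"
        "void main() {\n"
        "    gl_Position = u_mvp * a_position; /* final position        */\n"
        "    v_texcoord  = a_texcoord;         /* data for later stages */\n"
        "}\n";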

Tessellation Shading Stage

The tessellation shading stage is an optional stage that can be used to programmatically subdivide the primitives in a model. This allows for more detailed geometry to be built on the fly. The tessellation shading stage consists of three sub-stages: the control shader, the tessellator, and the evaluation shader. The control shader determines the amount of subdivision to apply, the tessellator creates the new primitives based on the control shader's output, and the evaluation shader pushes and pulls the tessellated geometry into the correct position.
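
As a sketch, here is a tessellation control shader in GLSL ES (on OpenGL ES, tessellation requires ES 3.2 or the EXT_tessellation_shader extension) that requests a fixed subdivision level for each triangle patch. Real code would usually compute the levels per patch, for example based on distance from the camera.

    static const char *tcs_src =
        "#version 320 es\n"
        "layout(vertices = 3) out;\n"
        "void main() {\n"
        "    if (gl_InvocationID == 0) {\n"
        "        gl_TessLevelInner[0] = 4.0;  /* amount of subdivision */\n"
        "        gl_TessLevelOuter[0] = 4.0;\n"
        "        gl_TessLevelOuter[1] = 4.0;\n"
        "        gl_TessLevelOuter[2] = 4.0;\n"
        "    }\n"
        "    /* pass the patch control points through unchanged */\n"
        "    gl_out[gl_InvocationID].gl_Position =\n"
        "        gl_in[gl_InvocationID].gl_Position;\n"
        "}\n";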

Geometry Shading Stage

The geometry shading stage is another optional stage that can be used to programmatically create or destroy new primitives. It operates on the input primitives provided by the previous stages of the pipeline, using them as a source of control data to define the geometry shader's operation. One common use case for the geometry shading stage is the creation of particle systems on the fly. The input mesh can be used to define the time step and animation state of the particles that need to be created.

Primitive Assembly

The primitive assembly stage takes the stream of vertices generated by the previous stages of the pipeline and groups them back into the point, line, and triangle primitives that are needed by the later stages. This stage can use either implicit assignment, where the vertex ordering defines each primitive, or explicit construction, where a dedicated index buffer is used to associate vertices with primitives. Explicit construction is generally more efficient, as it allows each vertex to be reused more effectively.
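
A small example of explicit construction in OpenGL ES: two triangles share two of their vertices through an index buffer, so each shared vertex is stored and shaded once.

    #include <GLES3/gl3.h>

    /* Four vertices, two triangles; vertices 1 and 2 are reused. */
    static const unsigned short quad_indices[6] = {
        0, 1, 2,
        2, 1, 3,
    };

    void draw_quad(GLuint ibo)   /* ibo holds quad_indices */
    {
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);
    }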

View Volume Culling

Once the primitives are in their final coordinate system, they undergo view volume culling. This step involves determining which primitives are potentially visible inside the view volume and discarding those that are not. Primitives that are outside of the view volume can be safely discarded, as they do not map to any screen pixels. Primitives that are facing away from the camera, typically the back-face of a model, can also be culled by performing a facing test.
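
Enabling the facing test in OpenGL ES takes only a few lines; this assumes the conventional counter-clockwise winding for front faces.

    #include <GLES3/gl3.h>

    void enable_backface_culling(void)
    {
        glEnable(GL_CULL_FACE);
        glFrontFace(GL_CCW);   /* counter-clockwise triangles face the camera */
        glCullFace(GL_BACK);   /* discard the back-face of each model         */
    }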

The geometry processing stage group is responsible for transforming the input meshes into the correct screen-space primitives. Each stage in the group performs a specific set of operations, allowing for fine control over both the shape and appearance of the rendered objects.

Modern Mali GPU Approach

The Mali GPUs, specifically those based on the Bifrost architecture such as the Mali-G71, have introduced a modern approach to the rendering pipeline. Let's explore the benefits and features of this approach.

Early ZS Testing

One of the key features in modern Mali GPUs is early ZS testing. Early ZS testing allows the GPU to discard occluded samples before fragment shading, resulting in significant performance improvements. To benefit from this feature, the application needs to send geometry in a front-to-back render order. This ensures that the objects closer to the camera are processed first, allowing the GPU to discard occluded samples early in the pipeline.
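
One way to achieve this order is to sort the opaque draw list by distance from the camera before submission. The draw_t structure and its camera_distance field below are hypothetical placeholders for whatever per-draw state an application keeps.

    #include <stdlib.h>

    typedef struct {
        float camera_distance;   /* precomputed distance to the camera */
        /* ... plus whatever draw state the application tracks ...     */
    } draw_t;

    /* Ascending distance: nearest draws first, so they fill the depth
     * buffer early and occluded fragments behind them are rejected.   */
    static int nearer_first(const void *a, const void *b)
    {
        float da = ((const draw_t *)a)->camera_distance;
        float db = ((const draw_t *)b)->camera_distance;
        return (da > db) - (da < db);
    }

    void sort_opaque_draws(draw_t *draws, size_t count)
    {
        qsort(draws, count, sizeof(draw_t), nearer_first);
    }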

Splitting the Vertex Shader

In the modern Mali GPU approach, the vertex shader is split into two pieces by the shader compiler. The first part computes the position, while the second part computes the remaining non-position output attributes. Only the position shader runs before early ZS testing, which allows for efficient culling based on the position of the primitives. The varying shader, which computes the non-position attributes, runs after the early ZS testing and only for fragments that contribute to visible primitives. This split shader approach maximizes the efficiency of the GPU's processing units.

Buffer Layout Optimization

Another aspect of the modern Mali GPU approach is buffer layout optimization. By carefully arranging the data in the buffers, applications can maximize the efficiency of the GPU's memory access patterns. This optimization can minimize the amount of data fetched or written by the GPU, resulting in improved performance. Efficient buffer layout optimization requires knowledge of the GPU's memory access patterns and the specific requirements of the rendering workload.
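
One layout that tends to suit the split vertex shader described above is to keep positions packed alone in their own buffer, so position-only processing reads the minimum amount of memory, with the remaining attributes interleaved in a second buffer. The buffer names in this OpenGL ES sketch are hypothetical.

    #include <GLES3/gl3.h>

    void setup_vertex_streams(GLuint position_vbo, GLuint attribute_vbo)
    {
        /* Stream 0: vec3 positions only, tightly packed. */
        glBindBuffer(GL_ARRAY_BUFFER, position_vbo);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE,
                              3 * sizeof(float), (const void *)0);
        glEnableVertexAttribArray(0);

        /* Stream 1: normal (vec3) and UV (vec2) interleaved. */
        glBindBuffer(GL_ARRAY_BUFFER, attribute_vbo);
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE,
                              5 * sizeof(float), (const void *)0);
        glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE,
                              5 * sizeof(float),
                              (const void *)(3 * sizeof(float)));
        glEnableVertexAttribArray(1);
        glEnableVertexAttribArray(2);
    }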

Off-Screen Draw Call Culling

Modern Mali GPUs support off-screen draw call culling, which allows applications to efficiently cull draw calls that are entirely off-screen. This technique helps to avoid wasting the power budget on rendering objects that have no impact on the final image. By using software-based culling techniques, such as bounding box tests or frustum culling, applications can skip the rendering of objects that are entirely outside the view volume.
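
A typical software test is a conservative bounding-sphere check against the six frustum planes, sketched below in C; the plane_t layout (inward-pointing normals, ax + by + cz + w = 0) is an assumption of this example.

    /* Plane in the form ax + by + cz + w = 0, normal pointing inward. */
    typedef struct { float x, y, z, w; } plane_t;

    int sphere_visible(const plane_t planes[6],
                       float cx, float cy, float cz, float radius)
    {
        for (int i = 0; i < 6; ++i) {
            float dist = planes[i].x * cx + planes[i].y * cy +
                         planes[i].z * cz + planes[i].w;
            if (dist < -radius)
                return 0;   /* entirely outside one plane: cull the draw */
        }
        return 1;           /* potentially visible: submit the draw */
    }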

The modern Mali GPU approach brings significant performance improvements to the rendering pipeline. By leveraging early ZS testing, splitting the vertex shader, optimizing buffer layouts, and implementing off-screen draw call culling, Mali GPUs can achieve higher rendering efficiency and deliver better overall performance for mobile content.

Pixel Processing Pipeline

The pixel processing pipeline is responsible for generating the final color values for each pixel in the rendered image. Let's take a closer look at the stages involved in this pipeline.

Fragment Shader

The fragment shader is the central stage of the pixel processing pipeline. It takes the rasterized triangles as input and computes the final color value for each fragment, based on the shading operations defined by the application. The fragment shader is a programmable stage that can access resources such as textures, buffers, and uniform values to achieve complex shading effects.

Rasterization

After the geometry processing stage, the primitive data, such as triangles, goes through the rasterization stage. Rasterization determines which parts of the primitives correspond to each pixel in the framebuffer. This process involves comparing the ideal vector form of each triangle against a per-pixel sample mask. In the traditional case, each covered pixel generates a single invocation of the fragment shader, but the invocation rate can be higher (using sample-rate shading) or lower (using variable rate shading) than one-to-one.

Early ZS and Late ZS Testing

After rasterization, the quads it generates are submitted for early ZS (depth and stencil) testing. Early ZS testing discards samples that are occluded by other objects or that fail the stencil test, thus avoiding unnecessary fragment shading. This form of testing happens before fragment shading and allows for early elimination of occluded fragments. Late ZS testing handles all of the depth and stencil test and update operations that could not be performed early. Late ZS testing happens after fragment shading and can result in the discarding of fragments that were not discarded by early ZS testing.
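
Enabling depth testing in OpenGL ES is straightforward. Note that a fragment shader which uses discard, or which writes gl_FragDepth, can force the depth and stencil update back to the late ZS stage, losing the early-out.

    #include <GLES3/gl3.h>

    void setup_depth_testing(void)
    {
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);   /* keep fragments nearer than stored depth */
        glDepthMask(GL_TRUE);   /* allow depth buffer writes               */
    }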

Blending

The final stage in the pixel processing pipeline is the blend stage. This stage is responsible for blending transparent pixels into the framebuffer, based on the application's requested blend function. The blend function determines how the colors of the transparent pixels are mixed with the existing colors in the framebuffer. The blend stage is a fixed-function stage that operates on the output color values generated by the fragment shader.
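
For reference, the classic alpha-blending configuration in OpenGL ES, which computes src.rgb * src.a + dst.rgb * (1 - src.a); blended geometry is normally drawn after the opaque pass, sorted back-to-front.

    #include <GLES3/gl3.h>

    void enable_alpha_blending(void)
    {
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    }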

The pixel processing pipeline performs the necessary operations to compute the final color values for each pixel in the rendered image. Rasterization, early ZS and late ZS testing, and blending are important stages in this pipeline, ensuring the correct rendering of objects and the proper handling of transparency.

Parallel Processing in the Pipeline

Parallel processing plays a vital role in optimizing the rendering pipeline and achieving high performance. Let's explore the concept of parallel processing and its impact on the efficiency of the pipeline.

Macro-Stages in the Pipeline

The rendering pipeline can be divided into three macro-stages: the application building a draw call on the CPU, the geometry processing on the GPU, and the pixel processing on the GPU. These macro-stages run in sequence, with each stage working on a different part of the rendering process. To keep all processing units busy and achieve peak performance, multiple operations need to be in flight concurrently.

Overlapping Processing Units

The goal of parallel processing is to overlap the execution of different pipeline stages to fully utilize the processing units. When all processing units are busy, the pipeline operates at peak efficiency, enabling the timely generation of the final rendered image. Overlapping the execution of the CPU and GPU tasks, as well as within individual stages like the geometry and pixel processing stages, maximizes the pipeline's overall performance.

Synchronous vs Asynchronous APIs

Historically, APIs like OpenGL ES have followed a synchronous behavioral model, even though the underlying processing in modern hardware is inherently asynchronous. To maintain the illusion of synchronicity provided by the API, the CPU and GPU tasks are closely synchronized. In contrast, Vulkan, designed for modern hardware, provides an asynchronous programming model. It gives the application more control over processing dependencies and allows for more parallelism. To achieve optimal performance, developers must define dependencies between data producers and consumers and strike a balance to fully utilize parallel processing capabilities.
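
Even in OpenGL ES, some of the underlying asynchrony is visible through sync objects (ES 3.0 and later). The sketch below inserts a fence and polls it instead of blocking, letting the CPU keep working while the GPU catches up.

    #include <GLES3/gl3.h>

    void poll_gpu_progress(void)
    {
        /* Insert a fence after the work submitted so far. */
        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

        /* ... submit more work, run application logic ... */

        /* Poll with a zero timeout rather than blocking the CPU. */
        GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
        if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
            /* everything submitted before the fence has completed */
        }
        glDeleteSync(fence);
    }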

Optimizing Pipeline Performance

To optimize a pipelined processing architecture, it's crucial to identify the slowest pipeline stage. The slowest stage determines the overall performance of the rendering process. To improve performance, attention should be focused on optimizing the slowest stage. For example, if the CPU takes longer to construct draw calls than the GPU takes to process them, CPU-side optimizations should be pursued to reduce draw construction costs and minimize the impact on the overall pipeline performance.

Maintaining parallel processing within the rendering pipeline is key to achieving high performance and efficiency. Overlapping tasks and ensuring all processing units are utilized can significantly improve the overall rendering speed.

Conclusion

The rendering pipeline is a complex system of stages that work together to transform 3D objects into 2D images on a screen. From object creation to pixel shading, each stage in the pipeline performs specific tasks to achieve the final rendered image. The CPU and GPU play crucial roles in the pipeline, with the CPU responsible for application and graphics driver tasks, and the GPU handling the various processing stages. Shader programs and buffers are vital components, enabling the GPU to process vertex and fragment data effectively. The pipeline includes stages such as compute processing, geometry processing, and pixel processing, each with specific purposes and capabilities. The modern approach used by Mali GPUs brings improved performance through early ZS testing, splitting the vertex shader, buffer layout optimization, and off-screen draw call culling. The pixel processing pipeline generates the final color values for each pixel, with stages such as rasterization, ZS testing, and blending. Parallel processing within the pipeline is essential for high performance, overlapping tasks so that all processing units stay busy. By understanding the intricacies of the rendering pipeline, developers can optimize their rendering workloads and deliver efficient and visually appealing content.

Resource: Arm Developer website

Browse More Content