PTX Hello World
When writing CUDA C++ applications, the GPU kernels (aka __global__
functions) are compiled down to Parallel Thread Execution (PTX). PTX is a well-documented assembly-like language. Unfortunately, I couldn’t find a “Hello, world” equivalent for PTX, so I decided to write this small introduction.
The easiest way to start writing PTX is to embed it inside a CUDA C++ source file, using asm(...)
.
For example, here’s a CUDA kernel,
// main.cu
#include <iostream>
__global__ void add_kernel(int a, int b, int* result) {
int sum;
asm("add.s32 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // inline PTX
*result = sum;
}
int main() {
int a = 2, b = 3, result = 0;
int* d_result;
cudaMalloc(&d_result, sizeof(int));
add_kernel<<<1,1>>>(a, b, d_result);
cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
printf("Result: %d\n", result);
cudaFree(d_result);
return 0;
}
that adds two numbers. This is compiled as
nvcc main.cu -o main.bin && ./main.bin
# -> Result: 5
Now, the asm looks like a function call, but instead contains :
to separate its “arguments”. The syntax of asm
looks like this:
asm("assembly_code" : output_operands : input_operands : clobbers);
Where "assembly_code"
contains a PTX instruction with placeholder variables %0, %1, %2
. These placeholder variables are replaced with : output_operands : input_operands
. The : clobbers
are an optional scratchpad space (we don’t need it in this case).
So the add.s32 %0, %1, %2
PTX instruction tells the GPU to add 32-bit registers %1 %2
as integers and store the result to %0
. The result is the int32 value of 5. The .s32
suffix means the operands and result are signed 32-bit integers.
The "=r"(sum)
means that we are supplying variable sum
that is meant as an output operand (=
) and should reside in a register (r
). Similarly, "r"(a), "r"(b)
means that variables a
and b
are supplied as inputs, and should also reside in registers.
The "=r"
is called a “constraint”. There are many other constraints and you can read about them in GCC docs.
Our asm
instruction should be copied and pasted in the final PTX. PTX can be generated via
nvcc -ptx main.cu -o main.ptx
Looking at main.ptx
, we see:
// main.ptx
.version 8.4
.target sm_52
.address_size 64
// .globl _Z10add_kerneliiPi
.visible .entry _Z10add_kerneliiPi(
.param .u32 _Z10add_kerneliiPi_param_0,
.param .u32 _Z10add_kerneliiPi_param_1,
.param .u64 _Z10add_kerneliiPi_param_2
)
{
.reg .b32 %r<4>;
.reg .b64 %rd<3>;
ld.param.u32 %r2, [_Z10add_kerneliiPi_param_0];
ld.param.u32 %r3, [_Z10add_kerneliiPi_param_1];
ld.param.u64 %rd1, [_Z10add_kerneliiPi_param_2];
cvta.to.global.u64 %rd2, %rd1;
// begin inline asm
add.s32 %r1, %r2, %r3;
// end inline asm
st.global.u32 [%rd2], %r1;
ret;
}
Errors and debugging
Writing inline asm
can be hard because making accidental errors is easy. Let’s see a couple of errors:
1. Typo in the Instruction Name
Here the PTX instruction name is incorrect:
asm("ad.s32 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // 'ad' instead of 'add'
Error:
ptxas /tmp/tmpxft_0000bab7_00000000-6_main.ptx, line 30; error : Not a name of any known instruction: 'ad'
ptxas fatal : Ptx assembly aborted due to errors
2. Wrong Operand Type
The add.s64
instruction requires 64-bit operands but is supplied with 32-bit operands instead:
asm("add.s64 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // sum, a, b are int (32-bit), while add.s64 needs 64 bit operands
Error:
ptxas /tmp/tmpxft_0000be0d_00000000-6_main.ptx, line 30; error : Arguments mismatch for instruction 'add'
ptxas fatal : Ptx assembly aborted due to errors
3. Missing Output Operand
The output operand isn’t supplied:
asm("add.s32 %0, %1, %2;" : : "r"(a), "r"(b)); // No output operand
Error:
main.cu(12): warning #549-D: variable "sum" is used before its value is set
*result = sum;
main.cu(11): error: Internal Compiler Error (codegen): "asm operand index requested is larger than the number of asm operands provided!"
4. Invalid Register Constraint
Register constraint is incorrect:
asm("add.s32 %0, %1, %2;" : "=m"(sum) : "r"(a), "r"(b)); // '=m' is not valid for PTX
Error:
main.cu(13): error: asm constraint letter 'm' is not allowed inside a __device__/__global__ function
asm("add.s32 %0, %1, %2;" : "=m"(sum) : "r"(a), "r"(b));
Conclusion
Writing PTX by hand is hard and error-prone. However, just like hand-writing CPU assembly, hand-written PTX can be used to extract peak performance from important kernels, for instance a matrix multiplication. PTX instructions can be written using inline asm
statements with a gcc
syntax. If you are interested in learning more about PTX, take a look at NVIDIA PTX documentation.