When writing CUDA C++ applications, the GPU kernels (aka __global__ functions) are compiled down to Parallel Thread Execution (PTX). PTX is a well-documented assembly-like language. Unfortunately, I couldn’t find a “Hello, world” equivalent for PTX, so I decided to write this small introduction.

The easiest way to start writing PTX is to embed it inside a CUDA C++ source file, using asm(...).

For example, here’s a CUDA kernel,

// main.cu
#include <iostream>

__global__ void add_kernel(int a, int b, int* result) {
  int sum;
  asm("add.s32 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // inline PTX
  *result = sum;
}

int main() {
  int a = 2, b = 3, result = 0;
  int* d_result;
  cudaMalloc(&d_result, sizeof(int));
  add_kernel<<<1,1>>>(a, b, d_result);
  cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  printf("Result: %d\n", result);
  cudaFree(d_result);
  return 0;
}

that adds two numbers. This is compiled as

nvcc main.cu -o main.bin && ./main.bin
# -> Result: 5

Now, the asm looks like a function call, but instead contains : to separate its “arguments”. The syntax of asm looks like this:

asm("assembly_code" : output_operands : input_operands : clobbers);

Where "assembly_code" contains a PTX instruction with placeholder variables %0, %1, %2. These placeholder variables are replaced with : output_operands : input_operands. The : clobbers are an optional scratchpad space (we don’t need it in this case).

So the add.s32 %0, %1, %2 PTX instruction tells the GPU to add 32-bit registers %1 %2 as integers and store the result to %0. The result is the int32 value of 5. The .s32 suffix means the operands and result are signed 32-bit integers.

The "=r"(sum) means that we are supplying variable sum that is meant as an output operand (=) and should reside in a register (r). Similarly, "r"(a), "r"(b) means that variables a and b are supplied as inputs, and should also reside in registers.

The "=r" is called a “constraint”. There are many other constraints and you can read about them in GCC docs.

Our asm instruction should be copied and pasted in the final PTX. PTX can be generated via

nvcc -ptx main.cu -o main.ptx

Looking at main.ptx, we see:

// main.ptx
.version 8.4
.target sm_52
.address_size 64

	// .globl	_Z10add_kerneliiPi

.visible .entry _Z10add_kerneliiPi(
	.param .u32 _Z10add_kerneliiPi_param_0,
	.param .u32 _Z10add_kerneliiPi_param_1,
	.param .u64 _Z10add_kerneliiPi_param_2
)
{
	.reg .b32 	%r<4>;
	.reg .b64 	%rd<3>;


	ld.param.u32 	%r2, [_Z10add_kerneliiPi_param_0];
	ld.param.u32 	%r3, [_Z10add_kerneliiPi_param_1];
	ld.param.u64 	%rd1, [_Z10add_kerneliiPi_param_2];
	cvta.to.global.u64 	%rd2, %rd1;
	// begin inline asm
	add.s32 %r1, %r2, %r3;
	// end inline asm
	st.global.u32 	[%rd2], %r1;
	ret;

}

Errors and debugging

Writing inline asm can be hard because making accidental errors is easy. Let’s see a couple of errors:

1. Typo in the Instruction Name

Here the PTX instruction name is incorrect:

asm("ad.s32 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // 'ad' instead of 'add'

Error:

ptxas /tmp/tmpxft_0000bab7_00000000-6_main.ptx, line 30; error   : Not a name of any known instruction: 'ad'
ptxas fatal   : Ptx assembly aborted due to errors

2. Wrong Operand Type

The add.s64 instruction requires 64-bit operands but is supplied with 32-bit operands instead:

asm("add.s64 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b)); // sum, a, b are int (32-bit), while add.s64 needs 64 bit operands

Error:

ptxas /tmp/tmpxft_0000be0d_00000000-6_main.ptx, line 30; error   : Arguments mismatch for instruction 'add'
ptxas fatal   : Ptx assembly aborted due to errors

3. Missing Output Operand

The output operand isn’t supplied:

asm("add.s32 %0, %1, %2;" : : "r"(a), "r"(b)); // No output operand

Error:

main.cu(12): warning #549-D: variable "sum" is used before its value is set
      *result = sum;
main.cu(11): error: Internal Compiler Error (codegen): "asm operand index requested is larger than the number of asm operands provided!"

4. Invalid Register Constraint

Register constraint is incorrect:

asm("add.s32 %0, %1, %2;" : "=m"(sum) : "r"(a), "r"(b)); // '=m' is not valid for PTX

Error:

main.cu(13): error: asm constraint letter 'm' is not allowed inside a __device__/__global__ function
      asm("add.s32 %0, %1, %2;" : "=m"(sum) : "r"(a), "r"(b));

Conclusion

Writing PTX by hand is hard and error-prone. However, just like hand-writing CPU assembly, hand-written PTX can be used to extract peak performance from important kernels, for instance a matrix multiplication. PTX instructions can be written using inline asm statements with a gcc syntax. If you are interested in learning more about PTX, take a look at NVIDIA PTX documentation.