Jun 23, 2026

Profiling Your First CUDA App with GPUFlight Trace

A practical walkthrough of using gpufl trace to capture a CUDA application and upload the result to the GPUFlight dashboard.

GPU profiling usually starts with a simple question:

“I launched some CUDA kernels. What actually happened on the GPU?”

In this post, we will profile a tiny CUDA vector-add application with gpufl trace, then upload the captured logs to the GPUFlight dashboard to inspect kernel events and the timeline.

This is the narrative version of the hands-on tutorial. The full tutorial source lives here:

gpu-flight/gpufl-tutorial/tutorial-01

What `gpufl trace` Does

gpufl trace runs your application through the GPUFlight launcher. The launcher injects GPUFlight into the target process, captures CUDA activity, and writes local trace logs.

For a CUDA application, the trace can show:

kernel launches
kernel names
launch dimensions
stream information
memory activity
synchronization activity
a timeline of what happened during the run

The important part is that the application does not need to link the GPUFlight SDK for this path. If you can launch the program from a terminal, you can start with gpufl trace.

The Sample CUDA App

The sample application is intentionally small: it allocates three vectors, launches a vector-add kernel 50 times, synchronizes, copies the result back, and validates the output.

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Full source:

vector_add.cu
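
To make the kernel's index math concrete, the following Python sketch emulates it on the CPU. It is an illustration, not the tutorial's code: the 256-thread block size and the helper names `blocks_for` and `emulate_vector_add` are assumptions, so check vector_add.cu for the launch configuration the sample really uses.

```python
def blocks_for(n: int, threads_per_block: int) -> int:
    """Ceiling division: the smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_vector_add(a, b, threads_per_block=256):
    """Visit every (block, thread) pair the way a 1D GPU grid would."""
    n = len(a)
    c = [0.0] * n
    for block_idx in range(blocks_for(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Mirrors blockIdx.x * blockDim.x + threadIdx.x in the kernel.
            idx = block_idx * threads_per_block + thread_idx
            # Same guard as the kernel: threads past the end of the data do nothing.
            if idx < n:
                c[idx] = a[idx] + b[idx]
    return c


if __name__ == "__main__":
    # 1000 elements need 4 blocks of 256 threads; the last 24 threads stay idle.
    a = [float(i) for i in range(1000)]
    b = [2.0 * x for x in a]
    c = emulate_vector_add(a, b)
    print(blocks_for(len(a), 256), c[999])
```

With the 1048576 elements reported by the sample and the assumed 256-thread block, `blocks_for` returns 4096. If the sample uses a different block size, the grid dimension changes accordingly.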

Build the Sample

On Windows, build the CUDA sample with CMake and Visual Studio:

cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release

Then run it once without GPUFlight:

.\build\Release\gpufl_tutorial_01.exe

Expected output:

Vector add completed successfully: 50 kernel launches, 1048576 elements

Capture a Trace

After building gpufl-client, point a PowerShell variable at the launcher:

$gpufl = "C:\path\to\gpufl-client\build-windows\daemon\launcher\Release\gpufl.exe"

Then run the sample through gpufl trace:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  -- .\build\Release\gpufl_tutorial_01.exe

The output directory now contains a generated session folder:

gpufl-logs/
  <session-id>/
    device.1.log.gz
    sass.1.log.gz
    scope.1.log.gz
    system.1.log.gz
    system.2.log.gz

Those files are the local source of truth for the trace.
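
If you want to look at the files before uploading them, a small script can list each session folder and peek at the first lines of a log. This is a minimal sketch that assumes the .log.gz files are ordinary gzip-compressed text; it makes no assumption about the record format inside them, and the function names `list_sessions` and `peek` are made up for this example.

```python
import gzip
from pathlib import Path


def list_sessions(log_root):
    """Map each session folder name to the sorted names of its .log.gz files."""
    root = Path(log_root)
    return {
        session.name: sorted(p.name for p in session.glob("*.log.gz"))
        for session in sorted(root.iterdir())
        if session.is_dir()
    }


def peek(log_path, max_lines=5):
    """Return the first few decompressed lines of one log file."""
    lines = []
    # errors="replace" keeps the sketch from failing if a log is not pure UTF-8.
    with gzip.open(log_path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if len(lines) >= max_lines:
                break
            lines.append(line.rstrip("\n"))
    return lines


if __name__ == "__main__":
    root = Path("gpufl-logs")  # the --output directory used above
    if root.is_dir():
        for session, files in list_sessions(root).items():
            print(session)
            for name in files:
                print("  ", name, peek(root / session / name, max_lines=1))
```

Run it from the directory that contains gpufl-logs. It only reads the files, so the trace stays untouched for upload.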

Upload the Trace in the Dashboard

Open the GPUFlight dashboard and go to Uploads.

If you do not have an account yet, register here:

https://app.gpuflight.com/register

Drag the generated log files from the session folder under gpufl-logs into the upload area. GPUFlight detects the session and shows an upload plan.

The upload flow looks like this:

Files are received.
The session begins uploading.
Upload rows move to completed.
The session appears under Sessions.
You can open the session and inspect kernels and the timeline.

What You See After Upload

For this tiny vector-add app, the dashboard should show:

one GPU
50 kernel launches
the vector_add kernel
per-launch timing
grid and block dimensions
occupancy-related values
a timeline view showing launches on a wall-clock axis

The kernel table answers “what ran and how long did it take?”

The timeline answers “when did it happen?”

Streaming Upload with `gpufl-agent`

Drag-and-drop upload is great for a first trace, but sometimes you want the trace command to upload as it runs. For that, use --upload with gpufl-agent.

First, create an API key in the dashboard under Settings > API keys.

Then set the upload environment:

$env:GPUFL_BACKEND_URL = "https://api.gpuflight.com"
$env:GPUFL_API_KEY = "gpfl_xxxxxxxxxxxx"
$agentJar = "C:\path\to\gpufl-agent.jar"
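
A missing or blank variable is an easy mistake at this step, so it can help to check the environment first. The sketch below assumes that both GPUFL_BACKEND_URL and GPUFL_API_KEY must be non-empty for an upload to work; the helper name `missing_upload_settings` is hypothetical and not part of GPUFlight.

```python
import os

# Variable names taken from the setup above; treating both as required is an assumption.
REQUIRED_UPLOAD_VARS = ("GPUFL_BACKEND_URL", "GPUFL_API_KEY")


def missing_upload_settings(env=None):
    """Return the names of required upload variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_UPLOAD_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_upload_settings()
    if missing:
        print("Set these before using --upload:", ", ".join(missing))
    else:
        print("Upload environment looks complete.")
```

Run it from the same PowerShell session in which you set the variables, since `$env:` assignments apply only to that session and the processes it starts.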

Run the trace with upload enabled:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  --upload `
  --agent-jar $agentJar `
  -- .\build\Release\gpufl_tutorial_01.exe

The agent tails the log files and streams them to the backend. When the run completes, the session is already on its way to the dashboard.

When to Use This Workflow

Use gpufl trace when:

you have an existing CUDA program
you can launch it from a terminal
you want a quick activity trace without changing source code
you want to inspect kernel events and a timeline in the dashboard

Use the embedded GPUFlight SDK instead when:

you own the application source
you want explicit GFL_SCOPE regions
you want GPUFlight initialized directly inside the process
you want tighter control over when capture starts and stops

Next Steps

The next useful step is adding application context. Raw CUDA kernel names tell you what ran, but named scopes tie that work to phases of your application, which makes traces easier to understand.

For traces captured through the launcher with gpufl trace, NVTX ranges are a lightweight way to label phases without linking the GPUFlight SDK. For embedded GPUFlight, GFL_SCOPE gives you GPUFlight-owned scope events.

Related docs: