Profiling Your First CUDA App with GPUFlight Trace
A practical walkthrough of using gpufl trace to capture a CUDA application and upload the result to the GPUFlight dashboard.
GPU profiling usually starts with a simple question:
I launched some CUDA kernels. What actually happened on the GPU?
In this post, we will profile a tiny CUDA vector-add application with gpufl trace, then upload the captured logs to the GPUFlight dashboard to inspect kernel events and the timeline.
This is the narrative version of the hands-on tutorial. The full tutorial source lives here:
gpu-flight/gpufl-tutorial/tutorial-01
What gpufl trace Does
gpufl trace runs your application through the GPUFlight launcher. The launcher injects GPUFlight into the target process, captures CUDA activity, and writes local trace logs.
For a CUDA application, the trace can show:
- kernel launches
- kernel names
- launch dimensions
- stream information
- memory activity
- synchronization activity
- a timeline of what happened during the run
The important part is that the application does not need to link the GPUFlight SDK for this path. If you can launch the program from a terminal, you can start with gpufl trace.
The Sample CUDA App
The sample application is intentionally small: it allocates three vectors, launches a vector-add kernel 50 times, synchronizes, copies the result back, and validates the output.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
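For orientation, here is a minimal sketch of what the host side of such a program might look like. It is not the tutorial's actual source: the block size of 256, the input values, and the helper names `check` and `blocks_for` are assumptions made for illustration. Only the element count and the launch count are taken from the expected output shown later in this post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: exit with a readable message if a CUDA call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    const int n = 1 << 20;      // 1048576 elements, as in the expected output
    const int launches = 50;    // as in the expected output
    const int threads = 256;    // assumed block size
    const size_t bytes = n * sizeof(float);

    // Assumed input values: 1.0f + 2.0f is exactly 3.0f, so validation is simple.
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n, 0.0f);

    // Allocate the three device vectors and copy the inputs over.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc(a)");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc(b)");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc(c)");
    check(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    // Launch the kernel repeatedly on the default stream, then wait for the GPU.
    for (int i = 0; i < launches; ++i) {
        vector_add<<<blocks_for(n, threads), threads>>>(d_a, d_b, d_c, n);
    }
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // Copy the result back and validate every element.
    check(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost), "copy c");
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) {
            std::fprintf(stderr, "Mismatch at index %d: %f\n", i, h_c[i]);
            return EXIT_FAILURE;
        }
    }
    std::printf("Vector add completed successfully: %d kernel launches, %d elements\n",
                launches, n);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

Each step in this sketch (allocation, copies, launches, synchronization) corresponds to one of the activity categories listed earlier, which is why even a program this small gives the trace something to show.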
Full source: see gpu-flight/gpufl-tutorial/tutorial-01.
Build the Sample
On Windows, build the CUDA sample with CMake and Visual Studio:
cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
Then run it once without GPUFlight:
.\build\Release\gpufl_tutorial_01.exe
Expected output:
Vector add completed successfully: 50 kernel launches, 1048576 elements
Capture a Trace
After building gpufl-client, point a PowerShell variable at the launcher:
$gpufl = "C:\path\to\gpufl-client\build-windows\daemon\launcher\Release\gpufl.exe"
Then run the sample through gpufl trace:
& $gpufl trace `
    --name tutorial-01-vector-add `
    --output .\gpufl-logs `
    -- .\build\Release\gpufl_tutorial_01.exe
The output directory now contains a generated session folder:
gpufl-logs/
    <session-id>/
        device.1.log.gz
        sass.1.log.gz
        scope.1.log.gz
        system.1.log.gz
        system.2.log.gz
Those files are the local source of truth for the trace.
Upload the Trace in the Dashboard
Open the GPUFlight dashboard and go to Uploads.
If you do not have an account yet, register here:
https://app.gpuflight.com/register
Drag the generated log files from gpufl-logs into the upload area. GPUFlight detects the session and shows an upload plan.
The upload flow looks like this:
- Files are received.
- The session begins uploading.
- Upload rows move to completed.
- The session appears under Sessions.
- You can open the session and inspect kernels and the timeline.
What You See After Upload
For this tiny vector-add app, the dashboard should show:
- one GPU
- 50 kernel launches
- the vector_add kernel
- per-launch timing
- grid and block dimensions
- occupancy-related values
- a timeline view showing launches on a wall-clock axis
The kernel table answers “what ran and how long did it take?”
The timeline answers “when did it happen?”
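If you want a rough in-process number to compare against the dashboard's timing, CUDA events are one standard way to get it. The sketch below is not part of the tutorial source: the block size, the zero-filled inputs, and the `average_ms` helper are assumptions, and error checking is omitted to keep it short. Note that it measures the whole loop, so the average includes any gaps between launches and will not match a pure per-kernel duration exactly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: average time per launch in milliseconds.
float average_ms(float total_ms, int launches) {
    return total_ms / static_cast<float>(launches);
}

int main() {
    const int n = 1 << 20;
    const int launches = 50;
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);  // all-zero bytes are 0.0f, enough for timing
    cudaMemset(b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event before and after the launch loop on the default stream.
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        vector_add<<<blocks, threads>>>(a, b, c, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    std::printf("%d launches: %.3f ms total, %.3f ms average\n",
                launches, total_ms, average_ms(total_ms, launches));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```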
Streaming Upload with gpufl-agent
Drag-and-drop upload is great for a first trace, but sometimes you want the trace command to upload as it runs. For that, use --upload with gpufl-agent.
First, create an API key in the dashboard under Settings > API keys.
Then set the upload environment:
$env:GPUFL_BACKEND_URL = "https://api.gpuflight.com"
$env:GPUFL_API_KEY = "gpfl_xxxxxxxxxxxx"
$agentJar = "C:\path\to\gpufl-agent.jar"
Run the trace with upload enabled:
& $gpufl trace `
    --name tutorial-01-vector-add `
    --output .\gpufl-logs `
    --upload `
    --agent-jar $agentJar `
    -- .\build\Release\gpufl_tutorial_01.exe
The agent tails the log files and streams them to the backend. When the run completes, the session is already on its way to the dashboard.
When to Use This Workflow
Use gpufl trace when:
- you have an existing CUDA program
- you can launch it from a terminal
- you want a quick activity trace without changing source code
- you want to inspect kernel events and a timeline in the dashboard
Use the embedded GPUFlight SDK instead when:
- you own the application source
- you want explicit GFL_SCOPE regions
- you want GPUFlight initialized directly inside the process
- you want tighter control over when capture starts and stops
Next Steps
The next useful step is adding application context. Raw CUDA kernel names are useful, but named scopes make traces easier to understand.
For launch-time profiling, NVTX ranges are a lightweight way to label phases without linking the GPUFlight SDK. For embedded GPUFlight, GFL_SCOPE gives you GPUFlight-owned scope events.
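As a rough illustration of the NVTX route, the sketch below wraps two phases of a vector-add run in named ranges using the NVTX C API (`nvtxRangePushA` and `nvtxRangePop`) from the header-only NVTX 3 shipped with recent CUDA toolkits. The `ScopedRange` helper and the range names are hypothetical, and how the ranges appear in a GPUFlight trace is not shown here. Error checking is omitted for brevity; on Linux, NVTX 3 typically also needs `-ldl` at link time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical RAII helper: opens an NVTX range on construction and
// closes it when the object goes out of scope.
struct ScopedRange {
    explicit ScopedRange(const char* name) { nvtxRangePushA(name); }
    ~ScopedRange() { nvtxRangePop(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

int main() {
    const int n = 1 << 20;
    const int launches = 50;
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    {
        ScopedRange range("setup");  // illustrative range name
        cudaMalloc(&a, bytes);
        cudaMalloc(&b, bytes);
        cudaMalloc(&c, bytes);
        cudaMemset(a, 0, bytes);
        cudaMemset(b, 0, bytes);
    }
    {
        ScopedRange range("vector_add_loop");  // illustrative range name
        for (int i = 0; i < launches; ++i) {
            vector_add<<<blocks, threads>>>(a, b, c, n);
        }
        cudaDeviceSynchronize();  // keep GPU work inside the labeled phase
    }

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    std::printf("done\n");
    return 0;
}
```

When no profiling tool is attached, the NVTX calls do very little, so leaving the ranges in place costs almost nothing at run time.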