I built a GPU compute language
Hey guys! Big news today: VecTec is officially working.
If you've been following my blog, you know I've been building Techlang, my own compiled programming language. If you haven't, go check out the previous blog posts first, since this one builds on everything I've already built.
What is VecTec?
VecTec is a companion language to Techlang that runs code on the GPU instead of the CPU.
Here's what it looks like:
// arrays.vtec — runs on the GPU
kernel addArrays(ArrayOf(float) a, ArrayOf(float) b) returns ArrayOf(float) {
    int id = threadId();
    return a[id] + b[id];
}
And calling it from Techlang:
// main.tec — runs on the CPU
!import(std.tec) as std;
!import(arrays.vtec) as gpu;
function main() returns none {
    ArrayOf(float) a = {1.0, 2.0, 3.0, 4.0};
    ArrayOf(float) b = {5.0, 6.0, 7.0, 8.0};
    ArrayOf(float) result = gpu.addArrays(a, b); // runs on the GPU
}
That's it. No CUDA boilerplate. No manual memory management. No copying data to and from the GPU by hand. Just write a kernel and call it like a normal function.
The compiler handles everything automatically.
Why does this matter?
GPU programming is notoriously painful. Here's what the equivalent CUDA code looks like:
// the CUDA equivalent — brace yourself
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void addArrays(float* a, float* b, float* result, int size) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        result[id] = a[id] + b[id];
    }
}
int main() {
    int size = 4;
    float a[] = {1.0, 2.0, 3.0, 4.0};
    float b[] = {5.0, 6.0, 7.0, 8.0};
    float result[4];
    float *d_a, *d_b, *d_result;
    cudaMalloc(&d_a, size * sizeof(float));
    cudaMalloc(&d_b, size * sizeof(float));
    cudaMalloc(&d_result, size * sizeof(float));
    cudaMemcpy(d_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size * sizeof(float), cudaMemcpyHostToDevice);
    addArrays<<<1, size>>>(d_a, d_b, d_result, size);
    cudaMemcpy(result, d_result, size * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < size; i++) {
        printf("%f\n", result[i]);
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_result);
    return 0;
}
40+ lines of boilerplate vs 15 lines of clean Techlang/VecTec code. And this is a simple example — real CUDA programs are much more complex.
VecTec eliminates all of that.
How it works under the hood
The magic happens in three stages:
Stage 1 — VecTec compilation
When the Techlang compiler sees !import(arrays.vtec) as gpu,
it automatically invokes the VecTec compiler on the kernel file.
VecTec uses LLVM's NVPTX backend to compile the kernel to PTX — NVIDIA's GPU assembly language. This is the same approach used by Julia and other high-performance languages.
Stage 2 — Runtime wrapper generation
The compiler then automatically generates a C wrapper that:
- Embeds the PTX as a string constant
- Declares CUDA functions for memory allocation and kernel launching
- Wraps each kernel in a clean C function
When you call gpu.addArrays(a, b) in Techlang,
the compiler generates calls to this wrapper which handles:
- Allocating GPU memory
- Copying data to the GPU
- Launching the kernel with the right number of threads
- Waiting for completion
- Copying results back
- Freeing GPU memory
All invisible to the programmer.
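To make the six steps concrete, here is a rough C sketch of what a generated wrapper for addArrays might look like. It is not the actual generated code. Host malloc and memcpy stand in for the GPU calls so the sketch runs on any machine, and the comments name the CUDA driver API call a real wrapper would plausibly use at each step. The function name gpu_addArrays, the equal-length check, and the caller-frees convention for the result are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper for the addArrays kernel (CPU stand-in, no GPU needed).
   Returns a newly allocated result array that the caller must free,
   or NULL on failure. */
float *gpu_addArrays(const float *a, size_t aSize, const float *b, size_t bSize)
{
    assert(a != NULL && b != NULL);
    if (aSize == 0 || aSize != bSize) return NULL; /* assumption: equal lengths only */
    size_t bytes = aSize * sizeof(float);

    /* 1. Allocate GPU memory (a real wrapper: cuMemAlloc). */
    float *result   = malloc(bytes); /* host-side output buffer */
    float *d_a      = malloc(bytes);
    float *d_b      = malloc(bytes);
    float *d_result = malloc(bytes);
    if (!result || !d_a || !d_b || !d_result) {
        free(result); free(d_a); free(d_b); free(d_result);
        return NULL;
    }

    /* 2. Copy the inputs to the GPU (cuMemcpyHtoD). */
    memcpy(d_a, a, bytes);
    memcpy(d_b, b, bytes);

    /* 3. Launch one thread per element (cuLaunchKernel).
          Here a plain loop plays the role of the threads. */
    for (size_t id = 0; id < aSize; id++) {
        d_result[id] = d_a[id] + d_b[id];
    }

    /* 4. Wait for completion (cuCtxSynchronize). Nothing to wait for on the CPU. */

    /* 5. Copy the results back (cuMemcpyDtoH). */
    memcpy(result, d_result, bytes);

    /* 6. Free GPU memory (cuMemFree). */
    free(d_a);
    free(d_b);
    free(d_result);
    return result;
}
```

Calling gpu_addArrays(a, 4, b, 4) with the arrays from the example returns {6.0, 8.0, 10.0, 12.0}. The point is the order of operations: every call pays for allocation and two copies in each direction, which is the price of hiding memory management from the programmer.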
Stage 3 — Linking
The compiled Techlang binary, the GPU wrapper, and the standard library all get linked together into a single native executable that runs on both CPU and GPU.
The compiler pipeline
arrays.vtec
↓
VecTec compiler (LLVM NVPTX backend)
↓
arrays.ptx (GPU assembly)
↓
Auto-generated C wrapper with embedded PTX
↓
nvcc compiles wrapper → arrays_runtime.o

main.tec
↓
Techlang compiler (LLVM x86 backend)
↓
main.o

main.o + arrays_runtime.o + stdlib.o
↓
gcc links everything
↓
Native binary (runs on CPU + GPU)
Both languages share the same frontend — the lexer, parser, and semantic analyzer are identical. Only the backend differs — x86 for Techlang, NVPTX for VecTec.
Building VecTec
The hardest parts were:
The NVPTX target setup
LLVM's NVPTX backend works differently from x86. Kernels need special metadata annotations to be recognized as GPU entry points, and thread ID registers are accessed through LLVM intrinsics rather than normal function calls.
The opaque pointer problem (again)
Modern LLVM uses opaque pointers which means you can't ask a pointer "what type do you point to?" — you have to track that information yourself. This was tricky for array parameters since the element type needs to be known when generating GEP instructions.
The CUDA API version
The CUDA driver API changed cuCtxCreate to require 4 arguments
in newer versions instead of 3.
This caused a confusing compile error that took a while to track down.
The automatic size inference
When you call gpu.addArrays(a, b),
the compiler needs to know how many GPU threads to launch.
It automatically uses the size of the first array parameter —
so if a has 4 elements, 4 GPU threads are launched,
one per element.
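For small arrays, one block with one thread per element is enough, but CUDA caps a block at 1024 threads, so larger arrays have to be split across several blocks. The post doesn't say how VecTec does this, so the following is only a sketch of one common scheme: a fixed block size (256 here, an arbitrary assumption) and enough blocks to cover every element. LaunchDims and launch_dims_for are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block size; anything up to CUDA's 1024-thread block limit works. */
#define THREADS_PER_BLOCK 256u

/* Hypothetical 1D launch configuration. */
typedef struct {
    unsigned int grid_x;  /* number of blocks */
    unsigned int block_x; /* threads per block */
} LaunchDims;

/* Pick a launch configuration that gives at least one thread per element. */
LaunchDims launch_dims_for(size_t n)
{
    assert(n > 0);
    LaunchDims d;
    if (n <= THREADS_PER_BLOCK) {
        /* Small array: a single block with exactly n threads. */
        d.grid_x = 1;
        d.block_x = (unsigned int)n;
    } else {
        /* Large array: round the block count up so no element is left out. */
        d.grid_x = (unsigned int)((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
        d.block_x = THREADS_PER_BLOCK;
    }
    return d;
}
```

With 4 elements this gives 1 block of 4 threads, matching the example. With 1000 elements it gives 4 blocks of 256, which is 1024 threads, so the last 24 threads have no element to work on. That is why the kernel needs to know the array size and guard with a check like id < size, as the CUDA version above does.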
The manual array size parameter
A call like gpu.addArrays(a, b) doesn't carry the array sizes, but under the hood CUDA also wants the size of each array, so the actual call is addArrays(a, aSize, b, bSize). I had to change the PTX generator to insert the sizes itself when generating the function call.
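If the wrapper launches the kernel through the driver API's cuLaunchKernel, the sizes would travel as extra kernel parameters. That function takes an array of pointers, one per parameter, each pointing at the storage that holds the argument's value. Here is a minimal sketch of building that array in the order (a, aSize, b, bSize). ArrayArg and pack_kernel_params are hypothetical names, and whether VecTec lays its parameters out this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one array argument once its length is attached. */
typedef struct {
    unsigned long long device_ptr; /* stand-in for a CUdeviceptr handle */
    unsigned long long size;       /* element count */
} ArrayArg;

/* Fill params[] in the order (a, aSize, b, bSize, ...).
   Each entry points at the storage holding the argument value, which is the
   layout cuLaunchKernel expects for its kernelParams array.
   Returns the number of entries written, or 0 if params[] is too small. */
size_t pack_kernel_params(ArrayArg *arrays, size_t n_arrays,
                          void **params, size_t capacity)
{
    assert(arrays != NULL && params != NULL);
    if (capacity < 2 * n_arrays) return 0;
    for (size_t i = 0; i < n_arrays; i++) {
        params[2 * i]     = &arrays[i].device_ptr; /* the array itself */
        params[2 * i + 1] = &arrays[i].size;       /* its length, inserted by the compiler */
    }
    return 2 * n_arrays;
}
```

For the two-array example this produces four entries. The entries point into the ArrayArg values, so those have to stay alive until the launch call has been made.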
What's next for VecTec?
Right now VecTec supports basic kernels with array inputs and outputs. The next steps are:
- Shared memory: fast per-thread-block memory for more complex algorithms
- 2D and 3D thread grids: for matrix operations and image processing
- More built-ins: blockId(), blockDim() for advanced kernels
- Automatic optimization: let LLVM's GPU optimizer do more work
Try it yourself
Techlang and VecTec are both open source and available on GitHub. You'll need an NVIDIA GPU and the CUDA toolkit installed.