VibeTensor: Can AI Build a Deep Learning Framework from Scratch?
NVIDIA researchers have released VibeTensor, a complete deep learning runtime generated by LLM-based coding agents. With over 60,000 lines of AI-written C++/CUDA code at its core, the project offers a revealing look at both the possibilities and the limitations of AI-built system software.

LLM-generated code has become commonplace, but can AI agents write an entire deep learning system software stack spanning tens of thousands of lines? VibeTensor, an open-source project released by NVIDIA researchers, provides one answer.
In this post, we explore VibeTensor—a deep learning runtime fully generated by AI coding agents—examining its architecture, development methodology, and limitations.
What is VibeTensor?
VibeTensor is a deep learning system software stack implemented by LLM-powered coding agents under high-level human guidance. It's not a simple Python binding wrapper, but a complete runtime that includes a tensor/storage system, schema-free dispatcher, reverse-mode autograd engine, and CUDA memory management (streams, events, graphs).
Code Scale
According to the paper, VibeTensor's codebase consists of:
| Component | Lines of Code |
|---|---|
| C++/CUDA Core Runtime | 63,543 LOC |
| Plugins | 17,500 LOC |
| Python Overlay | 9,016 LOC |
| Node.js/TypeScript | 2,010 LOC |
| AI Kernel Suite | 55,882 LOC |
| Test Code | 53,955 LOC |
Key Features
- PyTorch-style Eager Execution: Code executes immediately and generates dynamic graphs.
- Multi-language Support: Built on a C++20 core with Python interface via nanobind, plus an experimental Node.js/TypeScript interface.
- Extensibility: DLPack interoperability, stable C ABI for dynamic plugins, and hooks for custom kernels written in Triton or CUTLASS.
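To make "eager execution with dynamic graphs" concrete, here is a minimal pure-Python sketch of reverse-mode autograd in the same style. This is a toy illustration of the concept, not VibeTensor's actual API; all names are invented for the example.

```python
# Toy reverse-mode autograd: each op runs eagerly and records how to
# propagate gradients, so the graph is built dynamically as code executes.
class Scalar:
    def __init__(self, value, parents=()):
        self.value = value        # forward result, computed immediately
        self.grad = 0.0
        self._parents = parents   # (parent_node, local_gradient) pairs

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Scalar(self.value + other.value,
                      parents=((self, 1.0), (other, 1.0)))

    def backward(self):
        # Topologically order the recorded graph, then apply the chain
        # rule in reverse, accumulating gradients into each parent.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node._parents:
                parent.grad += local_grad * node.grad

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x        # the graph is recorded as each op executes
z.backward()
print(z.value, x.grad, y.grad)   # 15.0, dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

A real runtime does the same bookkeeping over tensors and CUDA kernels instead of Python floats, but the eager-plus-taping structure is identical.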
How AI Builds Systems: Vibe-Coded
The most interesting aspect of this project is its development methodology. Researchers treated agents as black boxes and used the following workflow:
- Goal Setting: Humans specify scope and invariants.
- Code Generation: Agents propose and apply diffs.
- Verification: Instead of line-by-line human review, validity was verified through builds, tests, and differential checks against other implementations (like PyTorch).
In other words, tests served as specifications, and agents wrote and modified code to pass these tests. The entire development took approximately two months.
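The "tests as specification" loop rests on differential checking: an agent-written implementation is accepted only if its output matches a trusted reference within tolerance. A minimal sketch of that idea, using a hand-written softmax checked against a naive reference standing in for PyTorch (all names here are illustrative, not from the paper):

```python
import math

def softmax_candidate(xs):
    # Agent-proposed implementation under test: max-subtracted for stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_reference(xs):
    # Trusted reference (the textbook definition), playing the role of
    # the established framework the paper checks against.
    total = sum(math.exp(x) for x in xs)
    return [math.exp(x) / total for x in xs]

def differential_check(candidate, reference, inputs, atol=1e-9):
    # Accept the candidate only if it agrees with the reference on all inputs.
    for xs in inputs:
        got, want = candidate(xs), reference(xs)
        if any(abs(g - w) > atol for g, w in zip(got, want)):
            return False
    return True

cases = [[0.0, 1.0, 2.0], [-5.0, 0.0, 5.0], [3.0, 3.0, 3.0]]
print(differential_check(softmax_candidate, softmax_reference, cases))  # True
```

Scaled up across thousands of such checks, the test suite becomes the de facto specification the agents iterate against.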
System Architecture
VibeTensor's structure consists of:
- Language Bindings: Python (nanobind), Node.js (N-API)
- Core: Dispatcher (Router) → Autograd Engine (Reverse Mode)
- Execution: Cache Allocator, CUDA Graph, Advanced Indexing
- Kernels: CPU/CUDA Operator Kernels + External Plugins
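The dispatcher's job in this pipeline is to route an operation name plus a device to a concrete kernel, with no fixed operator schema. A toy model of that idea in a few lines of Python (this sketches the general pattern, not VibeTensor's actual code):

```python
# Toy schema-free dispatcher: kernels register under (op_name, device)
# keys, and dispatch is a table lookup at call time. External plugins
# could extend the table the same way, without a central operator schema.
_KERNELS = {}

def register(op, device):
    def deco(fn):
        _KERNELS[(op, device)] = fn
        return fn
    return deco

@register("add", "cpu")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("add", "cuda")
def add_cuda(a, b):
    # Stand-in: a real runtime would launch a CUDA kernel here.
    return [x + y for x, y in zip(a, b)]

def dispatch(op, device, *args):
    try:
        return _KERNELS[(op, device)](*args)
    except KeyError:
        raise NotImplementedError(f"no kernel for {op} on {device}")

print(dispatch("add", "cpu", [1, 2], [3, 4]))   # [4, 6]
```

The registry-plus-lookup design is what lets the stable C ABI plugins mentioned above add kernels without recompiling the core.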
Performance and the Frankenstein Effect
VibeTensor successfully trained models like CIFAR-10 ViT and miniGPT end-to-end on NVIDIA H100 and Blackwell GPUs. However, it showed significant performance gaps compared to PyTorch.
End-to-End Training Performance (H100)
| Workload | Slowdown vs PyTorch |
|---|---|
| Sequence Reversal | 3.04× |
| CIFAR-10 ViT | 5.76× |
| miniGPT (Shakespeare) | 5.79× |
On Blackwell GPUs, performance ranged from 1.72× to 6.15× slower depending on the workload.
The Frankenstein Composition Effect
Researchers named this performance degradation phenomenon the Frankenstein composition effect:
- Each subsystem (e.g., tensor operations, autograd) appears correct and reasonable individually.
- However, when combined, they create inefficient bottlenecks because global performance goals weren't considered.
The specific technical cause is a non-reentrant global backward gate—a process-wide try-locked mutex that simplifies safety but serializes independent backward work. This ultimately starves high-performance backend kernels, reducing GPU utilization.
Bottleneck flow: User Script → Frontend (High Latency) → Autograd Engine → Global Lock (SERIALIZED, bottleneck) → Backend Kernels (GPU underutilization) → Result
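The effect of a process-wide backward gate can be illustrated with Python threads: backward passes that are fully independent, and could overlap, instead run one at a time behind a single global lock. This is a toy model under the assumption that each backward pass holds the gate for its entire duration, as the paper describes:

```python
import threading

# A single process-wide gate, analogous to the non-reentrant global
# backward mutex described in the paper.
_backward_gate = threading.Lock()
trace = []

def backward(task_id, steps):
    # Independent backward work, but the whole pass runs under one lock,
    # so concurrent calls are serialized instead of overlapping.
    with _backward_gate:
        for step in range(steps):
            trace.append((task_id, step))

threads = [threading.Thread(target=backward, args=(i, 3)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the gate, each task's steps appear as one uninterrupted run
# in the trace: the two backward passes never interleave.
print(trace)
```

With the lock held end to end, the GPU sees kernel launches from only one backward pass at a time, which is exactly the serialization that starves the backend kernels.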
Kernel and Multi-GPU Experiments
Despite performance limitations, VibeTensor includes notable high-performance components.
AI-Generated Triton Kernel Performance
Some kernels outperformed PyTorch's default implementations:
| Operation | Speedup vs PyTorch |
|---|---|
| RMSNorm (forward) | 6.3× |
| Rotary embeddings (forward) | 5.33× |
| Attention (forward, causal) | 1.54× |
| Attention (backward, causal) | 1.26× |
| LayerNorm (forward+backward) | 1.06× |
However, small-batch GQA prefill sometimes trailed FlashAttention, running at 0.67× its forward speed.
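For reference, RMSNorm, the kernel with the largest reported speedup, normalizes each vector by its root mean square: y_i = x_i / sqrt(mean(x²) + eps) · w_i. A plain-Python sketch of the forward pass (the standard RMSNorm definition, not code from the project):

```python
import math

def rmsnorm_forward(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    ms = sum(v * v for v in x) / len(x)        # mean of squares
    inv_rms = 1.0 / math.sqrt(ms + eps)        # one rsqrt per vector
    return [v * inv_rms * w for v, w in zip(x, weight)]

y = rmsnorm_forward([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
print(y)   # roughly [0.577, 1.155, 1.155]
```

The operation is memory-bound and fuses into a single pass over the data, which is why a well-written Triton kernel can beat an unfused baseline by a wide margin.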
Multi-GPU Support
The project includes an experimental Fabric subsystem and Ring Allreduce plugin targeting Blackwell GPUs:
| GPUs | Batch per GPU | Iteration Time | Throughput |
|---|---|---|---|
| 1 | 65,536 | 29.88 ms | 2.19×10⁶ samples/s |
| 4 | 65,536 | 70.98 ms | 3.69×10⁶ samples/s |
Weak scaling from one to four GPUs yielded a 1.69× throughput improvement (3.69×10⁶ vs. 2.19×10⁶ samples/s).
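The ring allreduce algorithm behind such a plugin can be simulated in pure Python: each of n ranks ends up with the element-wise sum of all ranks' buffers by passing chunks around a ring (a reduce-scatter phase followed by an allgather phase), so each rank transfers only 2·(n−1)/n of the buffer in total. This sketch shows the standard algorithm; the actual plugin would move chunks over GPU fabric rather than Python lists.

```python
def ring_allreduce(buffers):
    """Simulated ring allreduce: every rank ends with the element-wise sum.

    Phase 1 (reduce-scatter): in step s, rank r passes its running sum of
    chunk (r - s) to rank r+1, which accumulates it; after n-1 steps each
    rank owns one fully reduced chunk. Phase 2 (allgather): the reduced
    chunks circulate around the ring until every rank has all of them.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]

    def span(c):
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    for s in range(n - 1):                      # reduce-scatter
        for r in range(n):
            dst = (r + 1) % n
            for i in span(r - s):
                data[dst][i] += data[r][i]

    for s in range(n - 1):                      # allgather
        for r in range(n):
            dst = (r + 1) % n
            for i in span(r + 1 - s):
                data[dst][i] = data[r][i]

    return data

ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [7, 7, 7, 7]]
result = ring_allreduce(ranks)
print(result[0])   # [118, 229, 340, 451], and every rank holds the same sums
```

As a sanity check on the reported scaling: four GPUs process 4×65,536 samples in 70.98 ms versus 65,536 in 29.88 ms on one GPU, and (4×65,536/70.98) ÷ (65,536/29.88) ≈ 1.69.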
Conclusions and Implications
VibeTensor is a research prototype serving as a milestone in AI-assisted software engineering, not a production-ready framework.
This project demonstrates that coding agents can coherently generate complex system software, from language bindings down to CUDA memory management. At the same time, it clearly exposes a structural limitation of AI coding: code that is individually correct but globally suboptimal.
Key Takeaways
- AI agents can generate over 60,000 lines of complex system code.
- Test-based verification alone can produce functionally correct systems.
- However, global optimization still requires human intervention.
- Individual component correctness doesn't guarantee overall system efficiency.
Resources
- Paper: VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
- GitHub: https://github.com/NVLabs/vibetensor
Note: This project is released for research purposes only and is not recommended for production use.