VibeTensor: Can AI Build a Deep Learning Framework from Scratch?
NVIDIA researchers have released VibeTensor, a complete deep learning runtime generated by LLM-based coding agents. With over 60,000 lines of AI-written C++/CUDA code at its core, the project offers a revealing look at both the possibilities and the limitations of AI-built system software.

LLM-generated code has become commonplace, but can AI agents write an entire deep learning system software stack spanning tens of thousands of lines? VibeTensor, an open-source project released by NVIDIA researchers, provides one answer.
In this post, we explore VibeTensor—a deep learning runtime fully generated by AI coding agents—examining its architecture, development methodology, and limitations.
What is VibeTensor?
VibeTensor is a deep learning system software stack implemented by LLM-powered coding agents under high-level human guidance. It's not a simple Python binding wrapper, but a complete runtime that includes a tensor/storage system, schema-free dispatcher, reverse-mode autograd engine, and CUDA memory management (streams, events, graphs).
Code Scale
According to the paper, VibeTensor's codebase consists of:
| Component | Lines of Code |
|---|---|
| C++/CUDA Core Runtime | 63,543 LOC |
| Plugins | 17,500 LOC |
| Python Overlay | 9,016 LOC |
| Node.js/TypeScript | 2,010 LOC |
| AI Kernel Suite | 55,882 LOC |
| Test Code | 53,955 LOC |
Key Features
- PyTorch-style Eager Execution: Code executes immediately and generates dynamic graphs.
- Multi-language Support: Built on a C++20 core with Python interface via nanobind, plus an experimental Node.js/TypeScript interface.
- Extensibility: DLPack interoperability, stable C ABI for dynamic plugins, and hooks for custom kernels written in Triton or CUTLASS.
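To make "eager execution with dynamic graphs" concrete, here is a minimal pure-Python sketch of reverse-mode autograd in the same style. This is a toy illustration of the concept, not VibeTensor's actual API; all names are invented for the example.

```python
# Toy reverse-mode autograd: each op runs eagerly and records how to
# propagate gradients, so the graph is built dynamically as code executes.
class Scalar:
    def __init__(self, value, parents=()):
        self.value = value        # forward result, computed immediately
        self.grad = 0.0
        self._parents = parents   # (parent_node, local_gradient) pairs

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Scalar(self.value + other.value,
                      parents=((self, 1.0), (other, 1.0)))

    def backward(self):
        # Topologically order the recorded graph, then apply the chain
        # rule in reverse, accumulating gradients into each parent.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node._parents:
                parent.grad += local_grad * node.grad

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x        # the graph is recorded as each op executes
z.backward()
print(z.value, x.grad, y.grad)   # 15.0, dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

A real runtime does the same bookkeeping over tensors and CUDA kernels instead of Python floats, but the eager-plus-taping structure is identical.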
How AI Builds Systems: Vibe-Coded
The most interesting aspect of this project is its development methodology. Researchers treated agents as black boxes and used the following workflow:
- Goal Setting: Humans specify scope and invariants.
- Code Generation: Agents propose and apply diffs.
- Verification: Instead of line-by-line human review, validity was verified through builds, tests, and differential checks against other implementations (like PyTorch).
In other words, tests served as specifications, and agents wrote and modified code to pass these tests. The entire development took approximately two months.
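The "tests as specification" loop rests on differential checking: an agent-written implementation is accepted only if its output matches a trusted reference within tolerance. A minimal sketch of that idea, using a hand-written softmax checked against a naive reference standing in for PyTorch (all names here are illustrative, not from the paper):

```python
import math

def softmax_candidate(xs):
    # Agent-proposed implementation under test: max-subtracted for stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_reference(xs):
    # Trusted reference (the textbook definition), playing the role of
    # the established framework the paper checks against.
    total = sum(math.exp(x) for x in xs)
    return [math.exp(x) / total for x in xs]

def differential_check(candidate, reference, inputs, atol=1e-9):
    # Accept the candidate only if it agrees with the reference on all inputs.
    for xs in inputs:
        got, want = candidate(xs), reference(xs)
        if any(abs(g - w) > atol for g, w in zip(got, want)):
            return False
    return True

cases = [[0.0, 1.0, 2.0], [-5.0, 0.0, 5.0], [3.0, 3.0, 3.0]]
print(differential_check(softmax_candidate, softmax_reference, cases))  # True
```

Scaled up across thousands of such checks, the test suite becomes the de facto specification the agents iterate against.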
System Architecture
VibeTensor's structure consists of:
- Language Bindings: Python (nanobind), Node.js (N-API)
- Core: Dispatcher (Router) → Autograd Engine (Reverse Mode)
- Execution: Cache Allocator, CUDA Graph, Advanced Indexing
- Kernels: CPU/CUDA Operator Kernels + External Plugins
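The dispatcher's job in this pipeline is to route an operation name plus a device to a concrete kernel, with no fixed operator schema. A toy model of that idea in a few lines of Python (this sketches the general pattern, not VibeTensor's actual code):

```python
# Toy schema-free dispatcher: kernels register under (op_name, device)
# keys, and dispatch is a table lookup at call time. External plugins
# could extend the table the same way, without a central operator schema.
_KERNELS = {}

def register(op, device):
    def deco(fn):
        _KERNELS[(op, device)] = fn
        return fn
    return deco

@register("add", "cpu")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("add", "cuda")
def add_cuda(a, b):
    # Stand-in: a real runtime would launch a CUDA kernel here.
    return [x + y for x, y in zip(a, b)]

def dispatch(op, device, *args):
    try:
        return _KERNELS[(op, device)](*args)
    except KeyError:
        raise NotImplementedError(f"no kernel for {op} on {device}")

print(dispatch("add", "cpu", [1, 2], [3, 4]))   # [4, 6]
```

The registry-plus-lookup design is what lets the stable C ABI plugins mentioned above add kernels without recompiling the core.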
Performance and the Frankenstein Effect
VibeTensor successfully trained models like CIFAR-10 ViT and miniGPT end-to-end on NVIDIA H100 and Blackwell GPUs. However, it showed significant performance gaps compared to PyTorch.
End-to-End Training Performance (H100)
| Workload | Slowdown vs PyTorch |
|---|---|
| Sequence Reversal | 3.04× |
| CIFAR-10 ViT | 5.76× |
| miniGPT (Shakespeare) | 5.79× |
On Blackwell GPUs, performance ranged from 1.72× to 6.15× slower depending on the workload.
The Frankenstein Composition Effect
Researchers named this performance degradation phenomenon the Frankenstein composition effect:
- Each subsystem (e.g., tensor operations, autograd) appears correct and reasonable individually.
- However, when combined, they create inefficient bottlenecks because global performance goals weren't considered.
The specific technical cause is a non-reentrant global backward gate—a process-wide try-locked mutex that simplifies safety but serializes independent backward work. This ultimately starves high-performance backend kernels, reducing GPU utilization.
Bottleneck flow: User Script → Frontend (High Latency) → Autograd Engine → Global Lock (SERIALIZED, bottleneck) → Backend Kernels (GPU underutilization) → Result
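The effect of a process-wide backward gate can be illustrated with Python threads: backward passes that are fully independent, and could overlap, instead run one at a time behind a single global lock. This is a toy model under the assumption that each backward pass holds the gate for its entire duration, as the paper describes:

```python
import threading

# A single process-wide gate, analogous to the non-reentrant global
# backward mutex described in the paper.
_backward_gate = threading.Lock()
trace = []

def backward(task_id, steps):
    # Independent backward work, but the whole pass runs under one lock,
    # so concurrent calls are serialized instead of overlapping.
    with _backward_gate:
        for step in range(steps):
            trace.append((task_id, step))

threads = [threading.Thread(target=backward, args=(i, 3)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the gate, each task's steps appear as one uninterrupted run
# in the trace: the two backward passes never interleave.
print(trace)
```

With the lock held end to end, the GPU sees kernel launches from only one backward pass at a time, which is exactly the serialization that starves the backend kernels.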
Kernel and Multi-GPU Experiments
Despite performance limitations, VibeTensor includes notable high-performance components.
AI-Generated Triton Kernel Performance
Some kernels outperformed PyTorch's default implementations:
| Operation | Speedup vs PyTorch |
|---|---|
| RMSNorm (forward) | 6.3× |
| Rotary embeddings (forward) | 5.33× |
| Attention (forward, causal) | 1.54× |
| Attention (backward, causal) | 1.26× |
| LayerNorm (forward+backward) | 1.06× |
However, small-batch GQA prefill sometimes trailed FlashAttention, running at 0.67× its forward speed.
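For reference, RMSNorm, the kernel with the largest reported speedup, normalizes each vector by its root mean square: y_i = x_i / sqrt(mean(x²) + eps) · w_i. A plain-Python sketch of the forward pass (the standard RMSNorm definition, not code from the project):

```python
import math

def rmsnorm_forward(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    ms = sum(v * v for v in x) / len(x)        # mean of squares
    inv_rms = 1.0 / math.sqrt(ms + eps)        # one rsqrt per vector
    return [v * inv_rms * w for v, w in zip(x, weight)]

y = rmsnorm_forward([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
print(y)   # roughly [0.577, 1.155, 1.155]
```

The operation is memory-bound and fuses into a single pass over the data, which is why a well-written Triton kernel can beat an unfused baseline by a wide margin.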
Multi-GPU Support
The project includes an experimental Fabric subsystem and Ring Allreduce plugin targeting Blackwell GPUs:
| GPUs | Batch per GPU | Iteration Time | Throughput |
|---|---|---|---|
| 1 | 65,536 | 29.88 ms | 2.19×10⁶ samples/s |
| 4 | 65,536 | 70.98 ms | 3.69×10⁶ samples/s |
Weak scaling from one to four GPUs yielded a 1.69× throughput improvement (3.69×10⁶ vs. 2.19×10⁶ samples/s).
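The ring allreduce algorithm behind such a plugin can be simulated in pure Python: each of n ranks ends up with the element-wise sum of all ranks' buffers by passing chunks around a ring (a reduce-scatter phase followed by an allgather phase), so each rank transfers only 2·(n−1)/n of the buffer in total. This sketch shows the standard algorithm; the actual plugin would move chunks over GPU fabric rather than Python lists.

```python
def ring_allreduce(buffers):
    """Simulated ring allreduce: every rank ends with the element-wise sum.

    Phase 1 (reduce-scatter): in step s, rank r passes its running sum of
    chunk (r - s) to rank r+1, which accumulates it; after n-1 steps each
    rank owns one fully reduced chunk. Phase 2 (allgather): the reduced
    chunks circulate around the ring until every rank has all of them.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]

    def span(c):
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    for s in range(n - 1):                      # reduce-scatter
        for r in range(n):
            dst = (r + 1) % n
            for i in span(r - s):
                data[dst][i] += data[r][i]

    for s in range(n - 1):                      # allgather
        for r in range(n):
            dst = (r + 1) % n
            for i in span(r + 1 - s):
                data[dst][i] = data[r][i]

    return data

ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [7, 7, 7, 7]]
result = ring_allreduce(ranks)
print(result[0])   # [118, 229, 340, 451], and every rank holds the same sums
```

As a sanity check on the reported scaling: four GPUs process 4×65,536 samples in 70.98 ms versus 65,536 in 29.88 ms on one GPU, and (4×65,536/70.98) ÷ (65,536/29.88) ≈ 1.69.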
Conclusions and Implications
VibeTensor is a research prototype serving as a milestone in AI-assisted software engineering, not a production-ready framework.
This project demonstrates that coding agents can coherently generate complex system software, from language bindings down to CUDA memory management. At the same time, it clearly exposes a structural limitation of AI coding: code that is individually correct but globally suboptimal.
Key Takeaways
- AI agents can generate over 60,000 lines of complex system code.
- Test-based verification alone can produce functionally correct systems.
- However, global optimization still requires human intervention.
- Individual component correctness doesn't guarantee overall system efficiency.
Resources
- Paper: VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
- GitHub: https://github.com/NVLabs/vibetensor
Note: This project is released for research purposes only and is not recommended for production use.