
    AI Swarms: The Trillion-Token Experiment

    A new experiment involving “agentic coding” reveals that the future of software development isn’t one smart AI—it’s hundreds of them working in a hierarchy.

    • Massive Scale: A new report details an experiment running hundreds of AI agents concurrently for weeks, resulting in projects with over 1.6 million lines of code.
    • Structural Breakthrough: After failing with “flat” collaborative structures, the team succeeded by adopting a strict “Planner-Worker” hierarchy that mimics corporate management.
    • Proof of Concept: The autonomous swarm successfully built a web browser, migrated a massive codebase from Solid to React, and optimized a video engine by 25x without human intervention.

    In a move that signals a major shift in how artificial intelligence might soon handle software engineering, a new report has detailed an ambitious experiment in “scaling long-running autonomous coding.” The project, conducted by the team behind Cursor, moved beyond the current standard of single-agent assistance to deploy hundreds of concurrent agents working toward a single goal.

    The results were staggering. Over the course of several weeks, these AI swarms wrote millions of lines of code and consumed billions of tokens, effectively managing projects that would typically occupy human engineering teams for months.

    The Bottleneck of the Single Agent

    The report begins by addressing a common frustration in the AI industry: while today’s models (like GPT-4 or Claude 3) are excellent at focused tasks, they struggle with complexity. A single agent cannot effectively hold the context of a massive architecture in its head.

    The researchers’ hypothesis was simple: if one agent is limited, what happens if you run hundreds in parallel? However, the path to scaling wasn’t just about adding more compute—it was a lesson in organizational management.

    The Chaos of “Flat” Collaboration

    According to the findings, the team’s first attempt to organize the agents was a flat, democratic structure. Agents were given equal status and coordinated via a shared file, using a “locking” mechanism to claim tasks—similar to how a database handles concurrent writes.

    The result was gridlock. “Twenty agents would slow down to the effective throughput of two or three,” the report states. Agents would hold locks too long, crash without releasing them, or ignore the system entirely.
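A minimal sketch of that "flat" scheme helps show why it stalls: each agent takes an exclusive lock on a shared queue file before claiming a task, so every claim serializes behind the lock, and a crashed agent that never releases it blocks everyone. All names here are illustrative, not from the report.

```python
import fcntl
import json

def claim_task(queue_path: str, agent_id: str):
    """Claim the first unowned task in a shared JSON queue file.

    Hypothetical sketch of lock-based coordination: the exclusive
    flock serializes all agents, much like a database serializing
    concurrent writes.
    """
    with open(queue_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until the lock is free
        try:
            tasks = json.load(f)
            for task in tasks:
                if task.get("owner") is None:
                    task["owner"] = agent_id  # claim the first open task
                    f.seek(0)
                    json.dump(tasks, f)
                    f.truncate()
                    return task
            return None  # nothing left to claim
        finally:
            # An agent that crashes before this line never releases the
            # lock, stalling the whole swarm -- the failure mode the
            # report describes.
            fcntl.flock(f, fcntl.LOCK_UN)
```

With twenty agents polling this queue, throughput collapses to whatever one lock-holder can do at a time, which matches the "two or three" effective throughput the report quotes.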

    When the team switched to “optimistic concurrency control”—letting agents work freely and only failing if files changed—a psychological bottleneck emerged. Without leadership, the agents became risk-averse. They avoided complex architectural changes, preferring small, safe edits. The swarm was working, but it wasn’t building anything meaningful.
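Optimistic concurrency control inverts the locking approach: agents work freely against the version of a file they read, and a write is rejected only if the file changed underneath them. A minimal, illustrative sketch (an in-memory stand-in, not the team's implementation) might look like this:

```python
import threading

class VersionedFile:
    """Toy model of optimistic concurrency control.

    Readers record the version they saw; a write with a stale
    version is rejected and the agent must re-read and retry.
    The internal mutex only guards the atomic check-and-write,
    not the agent's work in between.
    """

    def __init__(self, content: str = "") -> None:
        self._content = content
        self._version = 0
        self._mutex = threading.Lock()

    def read(self) -> tuple[str, int]:
        with self._mutex:
            return self._content, self._version

    def write(self, new_content: str, expected_version: int) -> bool:
        with self._mutex:
            if self._version != expected_version:
                return False  # someone else wrote first: re-read and retry
            self._content = new_content
            self._version += 1
            return True
```

The scheme removes the gridlock, but as the report notes, it also makes failure visible: a sweeping architectural change touches many files and is far more likely to hit a stale version, which is exactly the risk the agents learned to avoid.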

    The Solution: A “Planner-Worker” Hierarchy

    Success was finally achieved when the team imposed a strict hierarchy, effectively creating a digital assembly line.

    1. Planners: High-level agents that explore the codebase and generate tasks. They can spawn “sub-planners,” creating a recursive management structure.
    2. Workers: Agents that execute specific tasks without worrying about the bigger picture. They grind on tickets and push code.
    3. Judges: Agents that review progress at the end of cycles to decide the next steps.

    This structure allowed the system to scale to massive projects without individual agents suffering from tunnel vision.

    “Impossible” Results: Browsers, Emulators, and Rust

    To prove the viability of this system, the team unleashed their swarm on several monumental tasks.

    The headline achievement was the creation of a web browser from scratch. The agents ran for nearly a week, generating over 1 million lines of code across 1,000 files. The report notes that despite the complexity, new agents entering the system could easily orient themselves and contribute to the same branch with minimal conflict.

    Other successful experiments included:

    • A Massive Migration: The swarm performed an in-place migration of the Cursor codebase from Solid to React, executing over 266,000 additions and 193,000 deletions over three weeks.
    • Performance Optimization: A long-running agent rewrote a video rendering engine in Rust, achieving a 25x speed increase and implementing complex features like smooth zooming and motion blur.
    • Ongoing Projects: The team currently has swarms building a Windows 7 emulator (currently at 14.6k commits) and an Excel clone (1.6 million lines of code).

    What We Learned About Models

    The experiment also provided rare insights into the performance of unreleased or specific model versions in long-horizon tasks.

    The report highlighted that GPT-5.2 proved superior for long-running autonomous work, excelling at maintaining focus and following instructions over extended periods. In contrast, Opus 4.5 was described as prone to “yielding back control quickly” and taking shortcuts.

    Interestingly, the team found that role specialization was key. GPT-5.2 was the superior “Planner,” while other models, including those specifically trained for coding like GPT-5.1-codex, were better utilized elsewhere.

    The Future of Coding

    While the system is not yet perfect—agents still occasionally drift or run too long—the experiment offers a definitive answer to a lingering question in the tech world: Can autonomous coding scale?

    The answer appears to be yes. By treating AI agents not as individual assistants but as a coordinated workforce, the report suggests we are approaching a future where human developers act as architects, overseeing swarms of AI workers that build our software while we sleep.
