Looking at task completion % for tasks of set lengths (10 min, 30 min, 2h) is a great idea and would be a very useful eval to keep people's expectations grounded. Reminds me of Anthropic demonstrating Claude 3.7 playing Pokemon and getting through 3 gym leaders. Our evals all focus on "immediate" intelligence, i.e. the quality of a single output, but that's the chatbot use case. Agents need to be able to operate independently for increasingly long periods of time; that's what will make AI autonomous and closer to the idea of a drop-in worker. Compute and context length do seem like they could be a very big hurdle, especially if staying coherent for hours requires exponentially more compute - e.g. a 5h task requiring 100x more compute than a 30 min task. Then again, maybe not many tasks truly require 5 straight hours; instead they can be broken down into multiple 30 min tasks. Still interesting, and it certainly stretches my timelines a little bit - proliferation is still probably around 5-10 years away at the earliest.
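To make the exponential worry concrete, here's the toy napkin math I'm gesturing at. The 100x-for-5h figure is just my hypothetical above, and the per-step growth factor is derived from it rather than measured anywhere:

```python
# Toy model: compute cost vs. task length if staying coherent gets
# exponentially more expensive. Numbers are illustrative, derived from the
# hypothetical "5h task = 100x a 30 min task" above, not from any benchmark.
BASE_TASK_H = 0.5          # 30-minute reference task
TARGET_TASK_H = 5.0        # 5-hour task
TARGET_MULTIPLIER = 100.0  # assumed compute ratio for the 5h task

steps = (TARGET_TASK_H - BASE_TASK_H) / BASE_TASK_H  # 9 extra 30-min increments
growth_per_step = TARGET_MULTIPLIER ** (1 / steps)   # ~1.67x per extra 30 min

def relative_compute(task_hours: float) -> float:
    """Compute cost relative to a single 30-min task under this assumption."""
    extra_steps = (task_hours - BASE_TASK_H) / BASE_TASK_H
    return growth_per_step ** extra_steps

monolithic = relative_compute(5.0)       # ~100x one 30-min task
decomposed = 10 * relative_compute(0.5)  # ten 30-min subtasks: 10x
print(f"5h in one shot: ~{monolithic:.0f}x; decomposed into 30-min chunks: {decomposed:.0f}x")
```

Which is why decomposition matters so much here: if long tasks really can be chunked, most of the exponential penalty disappears.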
Nice thought! I agree that the Pokemon eval is one of several great directions for actually informative future evals. I'm unsure on proliferation though: given the speed at which we've seen e.g. GPT-4-level capabilities decrease in price for both training and inference, we should expect future capabilities to proliferate fast as well.
This essay is great. My challenge is whether it's overfit to the last few months of progress. It seems very anchored on inference-time compute, but that wouldn't have been the focus 6 months ago. Is it reasonable to think this will remain the important axis for scaling for more than a year or two?
Good challenge, thanks! I agree there may be another paradigm. I think that:
(a) if there is no new paradigm, the current set of tools we have can take us very far (and achieve automated researchers) with just a huge amount of schlep and grind to get good data;
(b) if there is a new paradigm, the current one can help us find it via speeding up research progress, and either way, whatever the New Thing is will probably benefit from compute scale and having good data to some degree—which has broadly held true throughout the history of AI.
Progress * ;)
Doh, fixed, thanks! Your posts were quite helpful in refining my model of progress here, keep up the good work.
Excellent article! I think the view of AGI progress via a growing footprint of vertically specialized distilled models yields really interesting conclusions. My main challenge would be the underlying assumption that distillation will be highly effective across different domains, despite the continued jaggedness of capabilities. Please correct me if I'm wrong, but offhand I don't think we've seen a huge weight of evidence for that yet.
Another interesting implication of this is that we'll have capable agents for the most valuable kinds of work first - if we're constrained on needing custom models per vertical and on compute for long-horizon tasks, development will focus on the clusters of use cases that are worth that level of investment and run rate.
Oh, I didn't mean to imply that the specialization would necessarily result in separate models. Another interesting fact right now is that model merging is pretty strong, so if a single lab is developing models for different verticals, it can in principle just combine them with almost no downside, and then distill the merged model into a fairly general but cheaper agent.
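For what it's worth, the naive version of the merge I have in mind is just parameter averaging across same-architecture checkpoints. A rough sketch, with toy models standing in for the vertical specialists (real merge methods are of course more sophisticated than this):

```python
import torch.nn as nn

def merge_state_dicts(state_dicts, weights=None):
    """Naive weight-space merge: per-parameter weighted average.

    Assumes every checkpoint shares the exact same architecture and keys;
    fancier methods (task vectors, trimming, etc.) build on this basic idea.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Toy stand-ins for two vertical-specialist checkpoints of the same base model.
specialist_a = nn.Linear(16, 4)
specialist_b = nn.Linear(16, 4)

merged = nn.Linear(16, 4)
merged.load_state_dict(
    merge_state_dicts([specialist_a.state_dict(), specialist_b.state_dict()])
)
```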
You're right that distillation + jaggedness of capabilities means that some capabilities will survive the distill better than others, and some verticals may be constrained by this (implying higher prices). So far, distillation works particularly poorly for memorization-based capabilities involving recall of facts from the training set. In many cases you can probably get around this by having large context windows and providing the relevant knowledge for the task, but this could seriously affect some important tasks.
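A sketch of that workaround, in case it's unclear what I mean: instead of relying on the distilled model's memorized facts, retrieve the relevant documents and put them in the (large) context ahead of the task. The retriever here is a toy keyword-overlap scorer and the corpus strings are made up purely for illustration:

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_prompt(task: str, corpus: list[str], k: int = 2) -> str:
    """Put the k most relevant documents into the context ahead of the task."""
    top = sorted(corpus, key=lambda d: overlap_score(task, d), reverse=True)[:k]
    return "Reference material:\n" + "\n\n".join(top) + f"\n\nTask: {task}"

corpus = [
    "Clause 12.3 caps liability at the total fees paid in the prior 12 months.",
    "The 2024 safety filing lists three recall events for the braking module.",
    "Quarterly revenue grew 8%, driven by the enterprise tier.",
]
print(build_prompt("Summarize the liability cap and any recall history.", corpus))
```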
Well written! I learned some new things, looking forward to the next essays.
"Assuming that this 60m-AI agent might be based on a model of similar size to DeepSeek’s r1 (or indeed larger), then it requires at least a full set of 8xGH200 GPUs to run. This is the size of NVIDIA’s latest AI server, costing $300k. Some basic napkin math shows that OpenAI’s announced Stargate project could sustain running at least ~3.1 million of these agents 24/7 by 2029 for a cost of $500b, (albeit not counting batching, so a very loose lower bound)."
I had a long conversation with Grok about this, and the numbers seem off by roughly 2 orders of magnitude, even if you assume quite high tokens/s requirements for the agents (text, reasoning, vision, tool calling, and computer use):
"Final Estimate
Range:
Single-stream (3,872 tokens/s): 18–129 agents.
Aggregate (14,800 tokens/s, 20% overhead): 57–394 agents.
Realistic Scenario: Mid-range multimodal agent (93 tokens/s):
~41 agents (single-stream).
~127–159 agents (aggregate, adjusted).
Comparison to Text-Only: Text-only mid-range (30 tokens/s) supported 400–450 agents. Multimodal increases demand ~3x, reducing capacity to ~1/3 (e.g., 419 ÷ 3 ≈ 140 agents), consistent with ~127–159.
Final Answer: The 8xGH200 server running DeepSeek-R1 could support approximately 127–159 multimodal AI agents continuously, assuming a typical demand of 93 tokens/s per agent (vision, tool use, computer use + reasoning). For lighter (30 tokens/s) or heavier (206 tokens/s) workloads, this scales to 394 or 57 agents, respectively, at maximum utilization."
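Plugging Grok's per-server estimate back into the essay's own numbers makes the gap explicit. The $500b, $300k/server and ~3.1 million agents are from the quoted passage; 127–159 agents/server is Grok's figure above; everything else is derived:

```python
STARGATE_BUDGET_USD = 500e9   # from the quoted passage
SERVER_COST_USD = 300e3       # 8xGH200 server, per the quote
ESSAY_AGENT_COUNT = 3.1e6     # the essay's "~3.1 million agents"

servers = STARGATE_BUDGET_USD / SERVER_COST_USD   # ~1.67M servers
implied_per_server = ESSAY_AGENT_COUNT / servers  # ~1.9 agents/server (no batching)
print(f"Essay implies ~{implied_per_server:.1f} agents per server")

for agents_per_server in (127, 159):              # Grok's batched range
    total = servers * agents_per_server
    print(f"{agents_per_server}/server -> ~{total / 1e6:.0f}M agents "
          f"(~{total / ESSAY_AGENT_COUNT:.0f}x the essay's figure)")
```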
Have you considered that the step from one-day agents to a week or a month may in fact be very fast, because at that point most problems with coherence have already been solved?
Will AI talent (employees) be a rare, hard-to-replace, hard-to-replicate asset?
> despite clearly not being trained with RL to write stories.
[citation needed]