Looking at task completion % for tasks of set lengths (10 min, 30 min, 2h) is a great idea and would be a very useful eval to keep people's expectations grounded. Reminds me of Anthropic demonstrating Claude 3.7 playing Pokemon and getting through 3 gym leaders. Our evals all focus on "immediate" intelligence, i.e. the quality of a single output, but that's the chatbot use case. Agents need to be able to operate independently for increasingly long periods of time; that's what will make AI autonomous and closer to the idea of a drop-in worker. Compute and context length do seem like they could be a very big hurdle, especially if staying coherent for hours requires exponentially more compute - e.g. a 5h task requiring 100x more compute than a 30 min task. Then again, maybe not many tasks truly require 5 straight hours; instead they can be broken down into multiple 30 min tasks. Still interesting, and it certainly stretches my timelines a little bit - proliferation is still probably around 5-10 years away at the earliest.
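To make the exponential worry concrete, here's the toy napkin math I'm gesturing at. The 100x-for-5h figure is just my hypothetical above, and the per-step growth factor is derived from it rather than measured anywhere:

```python
# Toy model: compute cost vs. task length if staying coherent gets
# exponentially more expensive. Numbers are illustrative, derived from the
# hypothetical "5h task = 100x a 30 min task" above, not from any benchmark.
BASE_TASK_H = 0.5          # 30-minute reference task
TARGET_TASK_H = 5.0        # 5-hour task
TARGET_MULTIPLIER = 100.0  # assumed compute ratio for the 5h task

steps = (TARGET_TASK_H - BASE_TASK_H) / BASE_TASK_H  # 9 extra 30-min increments
growth_per_step = TARGET_MULTIPLIER ** (1 / steps)   # ~1.67x per extra 30 min

def relative_compute(task_hours: float) -> float:
    """Compute cost relative to a single 30-min task under this assumption."""
    extra_steps = (task_hours - BASE_TASK_H) / BASE_TASK_H
    return growth_per_step ** extra_steps

monolithic = relative_compute(5.0)       # ~100x one 30-min task
decomposed = 10 * relative_compute(0.5)  # ten 30-min subtasks: 10x
print(f"5h in one shot: ~{monolithic:.0f}x; decomposed into 30-min chunks: {decomposed:.0f}x")
```

Which is why decomposition matters so much here: if long tasks really can be chunked, most of the exponential penalty disappears.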
Nice thought! I agree that the Pokemon eval is one of several great directions for actually informative future evals. I'm unsure on proliferation though: given the speed at which we've seen e.g. GPT-4-level capabilities decrease in price for both training and inference, we should expect future capabilities to proliferate fast as well.
This essay is great. My challenge is whether it's overfit to the last few months of progress. It seems very anchored on inference-time compute, but that wouldn't have been the focus 6 months ago. Is it reasonable to think this will remain the important axis for scaling for more than a year or two?
Good challenge, thanks! I agree there may be another paradigm. I think that:
(a) if there is no new paradigm, the current set of tools we have can take us very far (and achieve automated researchers) with just a huge amount of schlep and grind to get good data;
(b) if there is a new paradigm, the current one can help us find it via speeding up research progress, and either way, whatever the New Thing is will probably benefit from compute scale and having good data to some degree—which has broadly held true throughout the history of AI.
Progress * ;)
Doh, fixed, thanks! Your posts were quite helpful in refining my model of progress here, keep up the good work.
Excellent article! I think the view of AGI progress via a growing footprint of vertically specialized distilled models yields really interesting conclusions. My main challenge would be the underlying assumption that distillation will be highly effective across different domains, despite the continued jaggedness of capabilities. Please correct me if I'm wrong, but offhand I don't think we've seen a huge weight of evidence for that yet.
Another interesting implication of this is that we'll have capable agents for the most valuable kinds of work first - if we're constrained on needing custom models per vertical and on compute for long-horizon tasks, development will focus on the clusters of use cases that are worth that level of investment and run rate.
Oh, I didn't mean to imply that the specialization would necessarily result in separate models. Another interesting fact right now is that model merging is pretty strong, so if a single lab is developing models for different verticals, it can in principle just combine them with almost no downside, and then distill the merged model into a fairly general but cheaper agent.
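For what it's worth, the naive version of the merge I have in mind is just parameter averaging across same-architecture checkpoints. A rough sketch, with toy models standing in for the vertical specialists (real merge methods are of course more sophisticated than this):

```python
import torch.nn as nn

def merge_state_dicts(state_dicts, weights=None):
    """Naive weight-space merge: per-parameter weighted average.

    Assumes every checkpoint shares the exact same architecture and keys;
    fancier methods (task vectors, trimming, etc.) build on this basic idea.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Toy stand-ins for two vertical-specialist checkpoints of the same base model.
specialist_a = nn.Linear(16, 4)
specialist_b = nn.Linear(16, 4)

merged = nn.Linear(16, 4)
merged.load_state_dict(
    merge_state_dicts([specialist_a.state_dict(), specialist_b.state_dict()])
)
```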
You're right that distillation + jaggedness of capabilities means that some capabilities will survive the distill better than others, and some verticals may be constrained by this (implying higher prices). So far, distillation works particularly poorly for memorization-based capabilities involving recall of facts from the training set. In many cases you can probably get around this by having large context windows and providing the relevant knowledge for the task, but this could seriously affect some important tasks.
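A sketch of that workaround, in case it's unclear what I mean: instead of relying on the distilled model's memorized facts, retrieve the relevant documents and put them in the (large) context ahead of the task. The retriever here is a toy keyword-overlap scorer and the corpus strings are made up purely for illustration:

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_prompt(task: str, corpus: list[str], k: int = 2) -> str:
    """Put the k most relevant documents into the context ahead of the task."""
    top = sorted(corpus, key=lambda d: overlap_score(task, d), reverse=True)[:k]
    return "Reference material:\n" + "\n\n".join(top) + f"\n\nTask: {task}"

corpus = [
    "Clause 12.3 caps liability at the total fees paid in the prior 12 months.",
    "The 2024 safety filing lists three recall events for the braking module.",
    "Quarterly revenue grew 8%, driven by the enterprise tier.",
]
print(build_prompt("Summarize the liability cap and any recall history.", corpus))
```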
Well written! I learned some new things, looking forward to the next essays.
"Assuming that this 60m-AI agent might be based on a model of similar size to DeepSeek’s r1 (or indeed larger), then it requires at least a full set of 8xGH200 GPUs to run. This is the size of NVIDIA’s latest AI server, costing $300k. Some basic napkin math shows that OpenAI’s announced Stargate project could sustain running at least ~3.1 million of these agents 24/7 by 2029 for a cost of $500b, (albeit not counting batching, so a very loose lower bound)."
I had a long conversation with Grok about this, and the numbers seem off by roughly 2 orders of magnitude, even if you assume quite high tokens/s requirements for the agents (text, reasoning, vision, tool calling, and computer use):
"Final Estimate
Range:
Single-stream (3,872 tokens/s): 18–129 agents.
Aggregate (14,800 tokens/s, 20% overhead): 57–394 agents.
Realistic Scenario: Mid-range multimodal agent (93 tokens/s):
~41 agents (single-stream).
~127–159 agents (aggregate, adjusted).
Comparison to Text-Only: Text-only mid-range (30 tokens/s) supported 400–450 agents. Multimodal increases demand ~3x, reducing capacity to ~1/3 (e.g., 419 ÷ 3 ≈ 140 agents), consistent with ~127–159.
Final Answer: The 8xGH200 server running DeepSeek-R1 could support approximately 127–159 multimodal AI agents continuously, assuming a typical demand of 93 tokens/s per agent (vision, tool use, computer use + reasoning). For lighter (30 tokens/s) or heavier (206 tokens/s) workloads, this scales to 394 or 57 agents, respectively, at maximum utilization."
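Plugging Grok's per-server estimate back into the essay's own numbers makes the gap explicit. The $500b, $300k/server and ~3.1 million agents are from the quoted passage; 127–159 agents/server is Grok's figure above; everything else is derived:

```python
STARGATE_BUDGET_USD = 500e9   # from the quoted passage
SERVER_COST_USD = 300e3       # 8xGH200 server, per the quote
ESSAY_AGENT_COUNT = 3.1e6     # the essay's "~3.1 million agents"

servers = STARGATE_BUDGET_USD / SERVER_COST_USD   # ~1.67M servers
implied_per_server = ESSAY_AGENT_COUNT / servers  # ~1.9 agents/server (no batching)
print(f"Essay implies ~{implied_per_server:.1f} agents per server")

for agents_per_server in (127, 159):              # Grok's batched range
    total = servers * agents_per_server
    print(f"{agents_per_server}/server -> ~{total / 1e6:.0f}M agents "
          f"(~{total / ESSAY_AGENT_COUNT:.0f}x the essay's figure)")
```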
Have you considered that the step from one-day agents to a week or a month may in fact be very fast, because at that point most problems with coherence have already been solved?
Will AI talent (employees) be a rare, hard-to-replace, hard-to-replicate asset?
> despite clearly not being trained with RL to write stories.
[citation needed]