Secret Agent CPU, Revisited

From Architecture Call to Market Model

Ben Bajarin

May 26, 2026

When we first published Secret Agent CPU in March, the goal was twofold: to understand how inference changes when the workload moves from answering a prompt to completing a workflow, and to build a market sizing model for how much agentic use cases could increase the TAM for server CPUs. Our research found that in traditional inference, the GPU carries the model math and the CPU can sit closer to a host-control role. Agentic inference creates a different operating problem, and with it an opportunity for new infrastructure. The system has to keep the workflow alive while the model waits on tools, and it has to preserve state, check permissions, and interact with external systems. The GPU remains essential, but the return on that GPU increasingly depends on whether the surrounding infrastructure can keep the agent loop moving.

That was the architectural point in the original report. The CPU becomes more important because agentic inference pushes more work into the execution layer around the model. This is where the old AI infrastructure framing starts to break down. A prompt-response system can be measured mostly through accelerator throughput. A production agent has to be measured through the whole inference cell: accelerator racks that generate tokens, CPU execution racks that keep environments alive, and the memory fabric that keeps state close enough to use.

We emphasize this distinction because demand per user changes as AI moves from chat into work. A lightweight assistant may only need a narrow CPU-side footprint while the model responds. A more capable agent can fan out into multiple live environments as it performs a host of CPU-specific tasks: retrieving data, running code, calling tools, verifying outputs, and waiting on systems of record. In that world, infrastructure limits are measured both by tokens per second and by the number of simultaneous agentic execution environments each active user can sustain.
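To make the two limits concrete, here is a minimal sketch of served-user capacity for one inference cell, assuming capacity is simply the tighter of a GPU-side token limit and a CPU-side environment limit. The function name, parameters, and every number are hypothetical placeholders chosen for illustration, not figures from our model.

```python
def cell_capacity(gpu_tokens_per_sec, tokens_per_sec_per_user,
                  cpu_environments, environments_per_user):
    """Return (served_users, bottleneck) for one inference cell.

    Assumption: a user is served only if the cell can supply both the
    token throughput and the live execution environments that user needs.
    """
    gpu_limit = gpu_tokens_per_sec // tokens_per_sec_per_user
    cpu_limit = cpu_environments // environments_per_user
    if gpu_limit <= cpu_limit:
        return gpu_limit, "gpu"
    return cpu_limit, "cpu"


# Hypothetical cell: same hardware, two different usage profiles.
chat_users, chat_bottleneck = cell_capacity(100_000, 50, 4_000, 1)
agent_users, agent_bottleneck = cell_capacity(100_000, 50, 4_000, 8)

print(chat_users, chat_bottleneck)    # lightweight assistant profile
print(agent_users, agent_bottleneck)  # agent that fans out to 8 environments
```

With these placeholder inputs the chat profile is limited by GPU throughput, while the agent profile hits the CPU-side environment limit first, which is the shift in binding constraint described above.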

The new development is that this orchestration layer is becoming more visible as a product category. Recent market sizing work has started to separate server CPU demand into more useful buckets, but the more important signal is the emergence of dedicated CPU orchestration racks as part of the inference architecture. These racks are likely to sit in line with GPU and XPU racks as part of the complete agentic inference system. That moves the thesis beyond a simple head-node attach story. Head-node CPUs still matter inside accelerator systems, but our original Agentic CPU thesis was always closer to a dedicated orchestration-layer thesis.

That shift is what changed our model. If the CPU role is limited to head-node content inside accelerator systems, the opportunity is real but more bounded. If dedicated CPU racks become part of the production inference path, the server CPU TAM expands differently because the CPU becomes part of the execution fabric around the model. In the full report, we raise our 2030 server CPU TAM base-case scenario by roughly 25%-30% versus the original framework, while keeping the higher cases tied to clearer evidence that dedicated orchestration racks become a repeatable hyperscaler procurement pattern.
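The difference between a head-node attach story and a dedicated-rack story can be sketched as a simple attach calculation. This is an illustrative simplification, not the model in the full report: the function, the attach ratios, and the fleet size are all hypothetical placeholders, and the output should not be read as a forecast.

```python
def cpu_demand_units(accelerators, head_node_cpus_per_accel,
                     orchestration_cpus_per_accel=0.0):
    """CPU units implied by an accelerator fleet under an attach assumption.

    Assumption: CPU demand scales linearly with accelerator count, split into
    head-node CPUs inside accelerator systems and CPUs in dedicated
    orchestration racks.
    """
    return accelerators * (head_node_cpus_per_accel + orchestration_cpus_per_accel)


# All inputs below are illustrative placeholders, not figures from the report.
ACCELERATORS = 1_000_000
head_node_only = cpu_demand_units(ACCELERATORS, head_node_cpus_per_accel=0.25)
with_dedicated_racks = cpu_demand_units(
    ACCELERATORS,
    head_node_cpus_per_accel=0.25,
    orchestration_cpus_per_accel=0.5,
)

print(head_node_only, with_dedicated_racks)
```

The point of the sketch is structural: the head-node term is bounded by how accelerator systems are built, while the orchestration term moves with how much agentic execution is deployed around them.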

We are still keeping range discipline and, if anything, still modeling this conservatively. The evidence supports a higher base case, while the most aggressive outcomes still require more production proof. Recent public commentary has put a large CPU revenue marker into the market, including standalone CPU servers and CPU content inside larger AI systems. The exact mix still needs cleaner disclosure, but the direction supports the notion that dedicated CPU capacity is becoming part of the inference system rather than a distant bull-case abstraction.

The second layer is memory. Once dedicated CPU racks become part of the agentic inference architecture, CPU-attached memory intensity becomes one of the best signals to monitor. SOCAMM and LPDDR-based server memory are not literally HBM for CPUs. The technologies differ in packaging, bandwidth, economics, and supply chain. The strategic analogy is better stated as: HBM keeps accelerator math engines fed, while dense CPU-attached memory may become part of the fabric that keeps agentic CPU racks productive. As more state sits near the CPU tier, the market has to underwrite memory capacity, bandwidth, and power efficiency as part of the CPU story.

The full report includes:

Our revised Creative Strategies server CPU TAM framework and the uplift from the original Secret Agent CPU model.
A gross-versus-net demand model that separates existing server CPU demand from new agent-native orchestration demand.
A revised architecture view that places dedicated inline CPU orchestration racks closer to the center of the original thesis.
A hyperscaler and neocloud framework for understanding who absorbs early CPU demand and who may need to fund more surrounding infrastructure.
A beneficiary map that extends the read-through beyond CPUs into memory, storage, networking, rack integration, power, cooling, and client execution.
A scenario-based company model for Intel, AMD, and ARM/custom silicon, focused on dollar growth and share shift rather than stock recommendations.
A CPU-attached memory model that explains why agentic orchestration may create a second-order DRAM and SOCAMM opportunity.
Sensitivity work for CPU attach, AI CPU ASPs, memory per orchestration CPU, and agentic infrastructure intensity per user.
An inference-cell framework showing why served-user capacity depends on both GPU token throughput and CPU-side execution environments.
An evidence scorecard separating what is proven, what is directionally supported, and what still needs deployment proof.
A monitoring dashboard for dedicated rack orders, CPU attach per accelerator, memory bandwidth per CPU, and SOCAMM adoption.
