The Diligence Stack - By Creative Strategies

Memory in the Age of Inference

How Concurrent User And Agent Sessions Turn Memory Into A System Architecture Problem

Ben Bajarin
May 28, 2026
Memory's $200B Inflection (Ben Bajarin, Feb 19)

A lot has happened since our first anchor report on memory back in February. We argued that the first phase of the AI memory story was about repricing, coupled with the market's need to understand how the demand cycle has changed with AI compute. As we anticipated, HBM became scarce, conventional DRAM tightened, NAND began to benefit from AI storage demand, and memory moved from a background input in the server bill of materials to one of the more visible constraints in the AI infrastructure stack. That repricing is still a driver because it gave the market a clear signal that memory had become large enough to affect AI infrastructure economics. The next question is more durable: what does memory normalize into as inference becomes a larger share of AI workloads?

We think the answer depends on how inference scales. Early inference could often be understood as a prompt-response workload. A user asks a question, the system generates an answer, and the economics can be framed mostly around cost per token. That framing remains useful, but it becomes incomplete as AI usage shifts toward many concurrent user and agent sessions. The system has to keep those sessions useful while work is happening. It has to preserve context, maintain state, retrieve information, manage tool calls, and carry a workflow forward across multiple steps. The user may only see a short answer or a completed task, but underneath, the system is holding much more live state than the interface suggests. Thinking and reasoning models have fundamentally changed the demand for inference compute and for the entire inference infrastructure.

At scale, inference-specific compute is bound by how many concurrent users and agents it can handle at one time. That is the central memory and storage shift taking place. Inference memory demand should be modeled around how many live sessions the system has to support, and where the state for those sessions resides. Longer context pulls on accelerator memory and host memory because more information has to stay close to execution. KV cache turns concurrency into a capacity problem because every active session carries memory with it. Agentic workflows extend the problem further because the system has to remember what the agent is doing while it is doing it. As these workflows become more persistent, the memory question becomes less about a static bill of materials and more about the operating capacity of the AI system. In Storage Shock, we framed the storage version of this problem as the need to keep more data warm enough to be called and used. The memory version is that more inference state has to stay live enough for the system to continue the work. Context is part of that state, but the larger issue is continuity: the system has to preserve enough of the session’s active memory so the next step can happen without rebuilding the workflow from scratch.
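To make the concurrency point concrete, here is a minimal sketch of the standard KV cache sizing arithmetic for a transformer: one key and one value vector per layer, per KV head, per token. The model shape, context lengths, and the 40 GiB budget are hypothetical placeholders chosen for round numbers, not figures from the report or from any specific model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, for illustration only."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for FP16/BF16


def kv_bytes_per_token(m: ModelShape) -> int:
    # The factor of 2 covers one key vector plus one value vector.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value


def kv_bytes_per_session(m: ModelShape, context_tokens: int) -> int:
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return kv_bytes_per_token(m) * context_tokens


def max_live_sessions(m: ModelShape, context_tokens: int, kv_budget_bytes: int) -> int:
    # How many sessions fit if each one holds its full context in the budget.
    return kv_budget_bytes // kv_bytes_per_session(m, context_tokens)


GIB = 2**30
shape = ModelShape(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
budget = 40 * GIB  # assumed memory left for KV cache after weights

short_context = max_live_sessions(shape, 8_192, budget)
long_context = max_live_sessions(shape, 32_768, budget)
print(short_context, long_context)
```

Under these assumptions each token carries 128 KiB of cache, so quadrupling the context per session cuts the number of sessions the same memory can hold to a quarter. Real serving stacks complicate this with paging, quantization, and prefix sharing, which the sketch ignores.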

The Agentic AI Storage Shock (Ben Bajarin, May 21)

This is why memory increasingly needs to be viewed as a hierarchy rather than a single component category. HBM remains the clearest near-term constraint because it is tied directly to accelerator roadmaps and high-bandwidth execution. Server DRAM becomes more important as AI head-node and CPU-side execution demand rise. NAND and enterprise SSDs become more relevant when some state can tolerate additional latency and move closer to the serving path. Other layers remain earlier in their proof cycle, but they are useful signals for where the architecture may go if agentic infrastructure becomes more repeatable.

For stakeholders, the important point is that memory intensity can rise with utilization as well as with server shipments. A fleet can become more memory constrained when the same installed base is serving more live sessions for longer periods of time. That means the memory model has to account for how long sessions remain active, how much state they retain, and how efficiently the system can move that state across the hierarchy. The central question is whether enough of the session can stay close to compute to preserve performance, while lower-cost tiers absorb the state that does not need to remain in the hottest layer.
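The utilization effect can be sketched with Little's law, which says that in steady state the average number of sessions in a system equals the arrival rate times the average time each session stays active. The arrival rate, session durations, and per-session state size below are hypothetical inputs chosen for illustration, not estimates from the report.

```python
def live_sessions(arrivals_per_second: float, mean_session_seconds: float) -> float:
    # Little's law: average concurrency = arrival rate x average time in system.
    return arrivals_per_second * mean_session_seconds


def live_state_gib(
    arrivals_per_second: float,
    mean_session_seconds: float,
    state_gib_per_session: float,
) -> float:
    # Memory that must stay live is concurrency times state held per session.
    return live_sessions(arrivals_per_second, mean_session_seconds) * state_gib_per_session


# Same fleet, same arrival rate; only the session length changes.
prompt_response = live_state_gib(2.0, 30.0, 0.5)   # short prompt-response sessions
agentic = live_state_gib(2.0, 600.0, 0.5)          # longer-running agent sessions
print(prompt_response, agentic)
```

In this sketch, stretching the average session from 30 seconds to 10 minutes raises the live state from 30 GiB to 600 GiB with no change in traffic or hardware, which is the sense in which memory pressure can rise with utilization alone.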

This also changes how we should think about cyclicality. Commodity memory will still cycle, and familiar variables such as pricing, supply, inventory, and customer digestion will continue to drive revisions in the segments of the industry that still follow that dynamic. However, the more salient point is that AI-attached memory can behave differently when it is tied to qualification, roadmap certainty (LTAs), and system output. In AI infrastructure, memory decisions are moving earlier into system design because they affect how much useful inference work the platform can support.

Standards remain essential because they make the ecosystem buildable, but the interface increasingly becomes the floor rather than the source of differentiation. The value shifts toward suppliers whose memory fits the system roadmap, performs within the power and thermal envelope, exposes enough usable capacity to software, and helps the platform support more concurrent inference work. That is why memory should be forecast by where inference state lives. Accelerator-attached memory follows accelerator and ASIC roadmaps, server DRAM follows AI head-node and CPU execution demand, and NAND or enterprise SSDs become more relevant when context and persistent state can tolerate more latency. Memory will still cycle, but a cycle-only lens is too blunt once inference creates a hierarchy of memory constraints that shape operating capacity and value capture.

What’s Inside the Full Report

  • A full framework for moving from the memory pricing shock to the next question: whether inference scale changes the structural role of memory after pricing cools.

  • A detailed explanation of why concurrent user and agent sessions turn memory into a live-state capacity problem.

  • A breakdown of where inference state lives across HBM, server DRAM, SOCAMM/LPDDR, NAND/eSSD, and CXL.

  • A model bridge for forecasting memory demand by tier rather than using one blended memory line.

  • A deeper discussion of KV cache, context length, active state, and why memory pressure can rise before accelerator unit growth fully explains the move.

  • A framework for why CPU core count may become a memory-attach signal in agentic inference systems.

  • Scenario logic for translating CPU-side memory assumptions into long-term DRAM demand sensitivities.

  • A beneficiary map separating current evidence from later-cycle opportunities across memory suppliers, SSDs, controllers, interface silicon, rack integration, and CXL.

  • A disciplined view of what the thesis does and does not require, including why commodity memory can still cycle while AI-attached memory behaves differently.

  • A monitoring dashboard for what would confirm, weaken, or force a rethink of the thesis.
