The Local Token Stack
Who Benefits from Private AI Factories
This is the third note in our AI infrastructure series, and it builds on the first two. Report 1, “Confidential AI,” framed confidential computing as the permission and pricing layer for AI infrastructure — what decides which sensitive, regulated, and sovereign tokens can run inside an approved trust boundary, and why the relevant metric shifts from tokens per watt to protected tokens per watt. Report 2, “AI Server Demand Is Becoming Three Markets,” sized the buildout by ownership rather than location, separating hyperscaler-owned capacity, the neocloud and third-party AI factory layer, and the enterprise or private AI factory that companies, governments, and regulated industries buy and run inside their own walls. This report goes inside that third market. It sharpens which workloads actually justify local token generation and maps the beneficiaries — the companies that capture value as enterprises generate more tokens on infrastructure they own or control, close to their own data.
As we have been laying out, most enterprise AI analysis still starts from the assumption that the enterprise is a pure cloud customer. That remains true for many workloads today, AI included, but it no longer captures the full direction of travel. We are increasingly confident that certain AI workloads will work their way back toward infrastructure the enterprise owns, controls, or has dedicated access to. Our conversations at Dell Technologies World and HPE Discover reinforced that view.
We are not talking about every AI workload, and we are not arguing for a broad reversal back to the old on-prem server cycle. The workloads that start to qualify are the ones that become persistent internal processes: agents running in support, code assistants embedded into developer workflows, fraud systems scoring transactions continuously, or knowledge systems sitting close to proprietary enterprise data. Once those workflows run every day, token generation starts to behave less like experimental software consumption and more like an operating input.
That changes the buying conversation. Finance begins to care about recurring token cost. Security begins to care about where data moves and who can touch it. Infrastructure teams begin to care about utilization, latency, governance, and whether the workload is predictable enough to justify dedicated capacity. Public cloud remains the better answer for burst, experimentation, frontier model access, and workloads where elasticity matters more than control. But the more repeatable the workload becomes, the more the enterprise begins to ask a familiar infrastructure question: if we use this capacity constantly, should we keep renting it by the unit or control more of the stack ourselves?
That is the on-prem qualification. The private AI factory is not a universal destination for enterprise AI. It is the infrastructure response to workloads where utilization, data sensitivity, governance, and workflow value line up. Where those variables line up, local token generation becomes easier to justify. Where they do not, public cloud remains the default.
The Workload Has To Earn Its Way On-Prem
We want to emphasize again that this shift is early. What we are outlining are the problems and challenges we hear from enterprise customers as agentic AI gets deployed inside their organizations. This is not a sweeping “AI moves back on-prem” call, and it is not just a new label for the old enterprise server cycle. Public cloud has real advantages: elasticity, model availability, global reach, lower operating burden, and the ability to experiment without building dedicated infrastructure. For many enterprises, most AI will stay there.
However, as enterprises have begun to deploy agentic AI, it has become clear that certain classes of workloads are leading them to ask new questions and think about longer-term strategy. When a workload runs continuously, touches sensitive data, and creates enough internal value, the economics start to change. A support agent running all day, a developer assistant embedded into the engineering workflow, or a fraud system scoring transactions continuously has a different utilization profile than a pilot project. At steady utilization, the customer eventually asks the same question it has asked in every compute cycle: is this cheaper to rent by the unit, or to control directly because we use it constantly?
That is data-center utilization logic applied to tokens. The language is new because the unit is new, but the underlying infrastructure math is the same. The asset used occasionally is usually easier to rent. The asset used constantly eventually invites an ownership or dedicated-capacity discussion.
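The utilization math above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a pricing model: the `OwnedCapacity` class, the function names, and every input figure (monthly fixed cost, maximum token output, rented price per million tokens) are hypothetical placeholders chosen to keep the arithmetic easy to follow. It treats owned capacity as a fixed monthly cost spread across the tokens actually generated and compares that with a flat per-token rental price.

```python
from dataclasses import dataclass


@dataclass
class OwnedCapacity:
    """Hypothetical dedicated inference capacity; all figures are illustrative."""
    monthly_fixed_cost: float    # amortized hardware, power, and operations (USD/month)
    max_tokens_per_month: float  # token output at 100% utilization


def owned_cost_per_million(cap: OwnedCapacity, utilization: float) -> float:
    """Effective USD per million tokens on owned capacity at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens = cap.max_tokens_per_month * utilization
    return cap.monthly_fixed_cost / tokens * 1_000_000


def break_even_utilization(cap: OwnedCapacity, rented_price_per_million: float) -> float:
    """Utilization at which owned cost per million tokens equals the rented price.

    A result above 1.0 means owned capacity never breaks even at that price.
    """
    return cap.monthly_fixed_cost * 1_000_000 / (
        cap.max_tokens_per_month * rented_price_per_million
    )


# Illustrative inputs only, not market data.
cap = OwnedCapacity(monthly_fixed_cost=50_000.0, max_tokens_per_month=100e9)
rented_price = 2.00  # assumed USD per million tokens when renting by the unit

threshold = break_even_utilization(cap, rented_price)
print(f"break-even utilization: {threshold:.0%}")
for u in (0.10, 0.25, 0.50, 0.90):
    cost = owned_cost_per_million(cap, u)
    print(f"utilization {u:.0%}: owned cost ${cost:.2f} per million tokens")
```

With these assumed inputs, owned capacity is more expensive per token than renting when it is lightly used and cheaper once utilization is sustained above the break-even point. The sketch deliberately ignores elasticity, model access, and operating burden, which are the reasons public cloud can still win even above that threshold.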
Why The Local Case Is Getting Better
Local inference is getting cheaper as accelerators improve and smaller models become good enough for defined production tasks. The software stack for running inference on owned hardware is also becoming more deployable. These factors all matter because many enterprises do not want to take on infrastructure projects that do not lead to a return on investment. They need something their infrastructure teams can operate, govern, secure, and support.
Data gravity strengthens the local case. Many enterprise workflows already sit close to internal databases, file systems, identity layers, permission structures, and proprietary data. Moving that data to an external model can be expensive, slow, and operationally awkward. In some cases, the challenge is less cost than approval. For regulated, sovereign, or IP-sensitive workloads, the ability to prove where the data runs, who can touch it, and how the system is governed can determine whether the project moves into production at all.
That is why cost-per-token is only part of the discussion. Control, auditability, latency, data movement, and workflow integration all matter. None of those variables moves every workload into controlled infrastructure, but where they line up, the local-token case becomes much easier to assess.
Agents Are The Utilization Case
The workload category we are most focused on is persistent internal agents. These systems do not behave like occasional chatbot sessions. They run continuously, call tools, query databases, maintain state, and interact with enterprise systems across many steps. That makes them one of the cleaner utilization cases for owned or controlled AI capacity.
It also makes the governance problem harder. A chatbot can be controlled at the application boundary. An agent that retrieves, reasons, acts, and writes back across internal systems creates a different kind of control challenge. It touches more systems, creates more audit requirements, and raises the importance of identity, permissions, observability, and policy enforcement.
That does not mean every agent needs to run locally. We would be careful about making that leap. The more grounded point is that the highest-control agentic workflows are likely to be among the first places enterprises ask for dedicated AI infrastructure. When agentic AI becomes a real production architecture, rather than a proof-of-concept layer, it could pull token generation toward controlled capacity faster than any single regulated vertical does.
The Server Is Only The Anchor
The private AI factory should no longer be modeled as a simple server sale. The server anchors the deployment, but the investment question is the attach stack around it: storage, networking, data protection, power and cooling, security, operating software, financing, lifecycle services, and integration work. For every dollar of AI server hardware, the diligence question is how many additional dollars follow, what margin they carry, and how repeatable the deployment becomes.
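The attach question can be made concrete with a small calculation. The sketch below is illustrative only: the function name `attach_summary`, the layer names, and every revenue and margin figure are hypothetical and do not describe any vendor. It computes two numbers for a single deployment: attach dollars per dollar of server revenue, and the blended gross margin once attach is included.

```python
from typing import Dict, Tuple


def attach_summary(
    server_revenue: float,
    server_margin: float,
    attach: Dict[str, Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (attach dollars per server dollar, blended gross margin).

    `attach` maps a layer name to (revenue, gross margin as a fraction).
    """
    if server_revenue <= 0:
        raise ValueError("server_revenue must be positive")
    attach_revenue = sum(rev for rev, _ in attach.values())
    gross_profit = server_revenue * server_margin + sum(
        rev * margin for rev, margin in attach.values()
    )
    total_revenue = server_revenue + attach_revenue
    return attach_revenue / server_revenue, gross_profit / total_revenue


# Hypothetical deployment: figures are placeholders, not estimates.
deal = {
    "storage": (30.0, 0.40),
    "networking": (20.0, 0.50),
    "services": (10.0, 0.30),
}
ratio, blended = attach_summary(server_revenue=100.0, server_margin=0.10, attach=deal)
print(f"attach per server dollar: {ratio:.2f}")
print(f"blended gross margin: {blended:.1%}")

# The same server sale with no attach keeps only the server margin.
_, box_only = attach_summary(server_revenue=100.0, server_margin=0.10, attach={})
print(f"server-only gross margin: {box_only:.1%}")
```

Under these assumed inputs, the same server sale produces a very different margin profile depending on what follows it, which is why we treat attach, rather than server revenue alone, as the variable to track. Repeatability would need deal-level data across deployments and is not modeled here.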
This is where backlog size can mislead. A large AI server backlog may say more about supply-chain access than durable earnings power. Backlog quality depends on what attaches to the compute sale and whether that attach turns into a broader operating layer around local token generation. A vendor that sells the GPU box and stops there may get revenue, but the quality of that revenue is different from a vendor that captures storage, networking, services, software, financing, and lifecycle management behind the same deployment.
That distinction is important because private AI infrastructure could either become another lower-margin hardware cycle or a more durable enterprise platform cycle. The difference will come down to attach, utilization, repeat deployments, and margin behavior as the market scales.
Dell And HPE Are The Clearest Test Cases
Dell and HPE are the two clearest public test cases, but they are approaching the market from different parts of the stack. Dell is the scale, storage, financing, and full-rack local-token platform case. Its advantage starts with distribution, supply-chain scale, AI server volume, and a large installed storage footprint. The strategic read is that Dell is trying to turn AI server demand into a broader enterprise AI Factory motion, with compute pulling storage, data management, services, financing, and lifecycle attach behind it.
HPE is taking a different route. We read HPE as more of a networking-led private cloud integration case. Juniper, Aruba, GreenLake, and its sovereign enterprise focus give it a different wedge into the same customer problem. In HPE’s version of the thesis, the network and operating layer become part of the control plane for private AI. That matters more if agentic systems force enterprises to rethink identity, policy, traffic flow, observability, and governance around AI workloads.
Both companies are aligning around the same broader shift, but from different starting points. Dell starts with the AI factory and tries to attach the stack behind it. HPE starts with networking, private cloud operations, and governance, then tries to pull compute and storage alongside that control plane. The full report goes deeper on where each company is advantaged, where the execution risk sits, and how the broader vendor map forms around this local-token stack.
Inside the full report
▪ A workload-by-workload suitability matrix, with our directional estimate of how much of each workload’s token generation lands on owned capacity versus public cloud.
▪ The full private AI factory attach stack: the layers a single server sale pulls, who benefits in each, and the evidence that would confirm the attach is real.
▪ The Dell versus HPE scorecard across nine dimensions, and where Lenovo’s hybrid and edge model fits without diluting the comparison.
▪ Why Cisco’s networking opportunity is a security and observability control-plane story rather than switching alone, and how HPE through Juniper attacks the same problem from the other direction.
▪ The agentic storage shift: why persistent AI turns storage from a passive repository into part of the inference loop, and reprices the data layer.
▪ How to separate confidential-compute silicon from the software that monetizes it, and which layer actually captures the recurring revenue.
▪ A revenue-quality ladder for grading any beneficiary’s AI revenue, plus a diligence checklist of what to watch and the bear case that mirrors it.





