Tokenomics and the Fixed-Cost Economics of AI Factories
A companion model to The Inference Payback that turns revenue per MW into rack-hour math: committed GPU capacity, paid tokens, and the falsifiers that would break the factory economics.
From Inference Payback to Rack-Hour Economics
Our last report, The Inference Payback, framed the next AI infrastructure question at the factory level. Token growth is useful evidence of demand, but it does not tell us whether AI factories can earn their cost of capital. The more important test is profitable demand density: how much external, paid, high-margin inference a deployed megawatt can support after power, leases, depreciation, financing, utilization, and refresh cycles.
This report moves one layer lower in the same economic question. The prior report looked at whether inference can fund the broader AI buildout. This report looks at the production mechanics underneath that payback case. The unit of analysis is the rack-hour: how many sellable tokens a fixed AI factory can produce per hour, at what realized price, and with how much of annual capacity allocated to inference rather than model development, internal research, idle reserve, or improvement loops.
That is the reason for the model. Revenue per megawatt and gross profit per megawatt are the board-level payback metrics. Paid tokens per rack-hour are the operating mechanism underneath them. A factory can look large in GPUs and megawatts while still underperforming economically if too much capacity is absorbed by work that does not monetize well or does not monetize at all.
When an AI lab rents a large GPU cluster on an annualized basis, that capacity starts to behave like a fixed-cost production asset. The company has committed to a cost base. The operating question becomes how many revenue-generating tokens that factory can produce per hour, per rack, and per year.
This is the context behind Jensen Huang’s “AI factory” language. A modern AI data center is a production asset. Training, post-training, research, internal evaluation, and inference all compete for the same factory capacity. Inference is the clearest path to direct revenue generation, which makes rack-hour allocation one of the central economic variables in the model.
The purpose of this report is to give readers a way to translate AI infrastructure scale into token economics. GPU count, megawatts, rack architecture, GPU-hour pricing, throughput, utilization, and realized token pricing are all relevant, but they only become useful when they are connected in one model. The rack-hour framework lets us ask how much paid inference capacity a factory can produce and how much fixed cost that capacity has to absorb.
OpenAI and Stargate are useful reference cases because they have more public information and estimates on scale, power requirements, architecture, and possible rental economics than most comparable buildouts. The framework is broader than OpenAI. The same logic applies to any company renting or operating large-scale AI compute, including frontier labs, hyperscalers, and neocloud platforms. The specific inputs will vary by architecture, contract, workload mix, pricing model, and utilization, but the economic question is the same: how many paid tokens can the deployed infrastructure produce against its fixed cost base?
One scope note. This report is focused on the AI lab or compute customer side of the transaction. We are asking how a company that has committed to a GPU-hour bill converts that fixed cost into paid inference. The provider-side model is different. Oracle, CoreWeave, Crusoe, Nebius, hyperscalers, and other neoclouds have their own questions around capex recovery, power pricing, financing, ROIC, customer concentration, and residual value. Those economics matter, but they sit one layer away from this model.
The output should be read as a capacity framework, not a revenue forecast. We are not trying to identify a single Stargate-like price or produce a definitive operating model for one facility. We are giving readers a way to underwrite the payback logic of any large AI factory: start with the fixed compute commitment, translate it into rack-hour burden, estimate sustained token throughput, apply realized token pricing, and then test how much of the annual factory can be kept in revenue-generating inference.
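The chain of steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's actual model. The GPU count (roughly 100K), the 72 GPUs per NVL72 rack, and the blended $6.67 per 1M tokens come from the reference case described in this report; the blended figure is a list-price blend, not realized revenue. The $3.00 GPU-hour price, the 100,000 tokens per second per rack, and the 50% inference share are hypothetical placeholders chosen only to make the arithmetic visible. The names FactoryInputs and rack_hour_economics are likewise illustrative, and the fixed cost here covers GPU rental only.

```python
import math
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


@dataclass
class FactoryInputs:
    gpus: int                          # committed GPUs
    gpus_per_rack: int                 # 72 for an NVL72 rack
    gpu_hour_price: float              # USD per GPU-hour (hypothetical input)
    tokens_per_rack_second: float      # sustained tokens/s per rack (hypothetical input)
    realized_price_per_million: float  # USD per 1M total tokens
    inference_share: float             # share of annual rack-hours in paid inference


def rack_hour_economics(f: FactoryInputs) -> dict:
    # Step 1: fixed compute commitment, expressed per rack-hour.
    racks = math.ceil(f.gpus / f.gpus_per_rack)
    cost_per_rack_hour = f.gpus_per_rack * f.gpu_hour_price
    annual_fixed_cost = racks * cost_per_rack_hour * HOURS_PER_YEAR

    # Step 2: sustained throughput, expressed per rack-hour.
    tokens_per_rack_hour = f.tokens_per_rack_second * 3600

    # Step 3: only the inference share of rack-hours produces paid tokens.
    inference_rack_hours = racks * HOURS_PER_YEAR * f.inference_share
    annual_paid_tokens = inference_rack_hours * tokens_per_rack_hour

    # Step 4: apply realized pricing and spread the whole fixed cost
    # over the paid tokens actually produced.
    annual_revenue = annual_paid_tokens / 1e6 * f.realized_price_per_million
    fixed_cost_per_million = annual_fixed_cost / (annual_paid_tokens / 1e6)

    return {
        "racks": racks,
        "cost_per_rack_hour": cost_per_rack_hour,
        "annual_fixed_cost": annual_fixed_cost,
        "annual_paid_tokens": annual_paid_tokens,
        "annual_revenue": annual_revenue,
        "fixed_cost_per_million": fixed_cost_per_million,
        "spread_per_million": f.realized_price_per_million - fixed_cost_per_million,
    }


# Placeholder scenario: every input marked hypothetical above is an assumption.
example = rack_hour_economics(
    FactoryInputs(
        gpus=100_000,
        gpus_per_rack=72,
        gpu_hour_price=3.00,
        tokens_per_rack_second=100_000,
        realized_price_per_million=6.67,
        inference_share=0.50,
    )
)
for name, value in example.items():
    print(f"{name}: {value:,.2f}")
```

Because the full fixed cost is divided by paid tokens only, halving the inference share doubles the fixed cost per 1M tokens even when throughput and price are unchanged. That is why allocation sits alongside throughput and pricing as a first-order input.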
That is the central question for the inference era. Competitive advantage will accrue to the companies that convert deployed infrastructure into paid token volume most efficiently, with utilization, model architecture, serving software, and pricing power all compounding into better factory economics.
In the full report, paid subscribers get:
A rack-hour companion model to The Inference Payback that translates revenue per MW and gross profit per MW into token throughput, realized pricing, and fixed-factory cost burden.
A Stargate Abilene reference case using roughly 100K GB200 GPUs, 1,389-1,400 NVL72 racks, and three GPU-hour rental scenarios.
A pricing-mix bridge from GPT-5.5-style list prices to a blended $6.67 per 1M total tokens, with the caveat that list price is not realized revenue.
Throughput sensitivities for GB200 and GB300 racks, including active-parameter, decode, batch-size, and serving-efficiency assumptions.
An annualized factory model that separates theoretical rack-hours from inference rack-hours, training, research, idle reserve, and internal improvement loops.
A fixed-cost-per-token sensitivity showing when the factory burden remains manageable and when throughput, pricing, or allocation break the spread.
A scope boundary between the AI lab/customer model and the neocloud or hyperscaler owner model, so the two payback questions do not get blended together.
A monitoring framework for realized token pricing, sustained rack throughput, inference allocation, demand absorption, serving-stack efficiency, and the thesis breakers that would make the model fail.



