Open Source Continuous Inference Benchmark trusted by Operators of Trillion Dollar GigaWatt Scale Token Factories

As the world progresses exponentially towards AGI, software development and model releases move at breakneck pace. Existing benchmarks quickly become obsolete because they are static, and participants often submit software images purpose-built for the benchmark itself that do not reflect real-world performance.

InferenceX™ (formerly InferenceMAX) is our independent, vendor-neutral, reproducible benchmark that addresses these issues by continuously benchmarking inference software across a wide range of AI accelerators that are actually available to the ML community.

Our open data and insights are widely used by the ML community, by capacity-planning and strategy teams at trillion-dollar token factories and AI labs, and by multiple billion-dollar NeoClouds. Learn more in our articles: InferenceX v1, InferenceX v2.

Frequently Asked Questions

What is InferenceX?

InferenceX (formerly InferenceMAX) is an open-source, vendor-neutral benchmark that continuously measures AI inference performance across GPUs and software stacks. Benchmarks re-run whenever a configuration changes, so results stay current as models and frameworks evolve.

Who is behind InferenceX?

InferenceX is built by SemiAnalysis, an independent semiconductor and AI research firm. It is supported and trusted by OpenAI, Microsoft, Tri Dao, vLLM, GPU Mode, PyTorch, Oracle, CoreWeave, Nebius, Crusoe, TensorWave, SGLang, WEKA, Stanford, Core42, Meta Superintelligence Labs, Hugging Face, UC Berkeley, Lambda, UC San Diego. The benchmark code, data, and dashboard are all open-source on GitHub.

Which GPUs does InferenceX benchmark?

New accelerators are added as they become available.

  • NVIDIA: H100, H200, B200, B300, GB200, GB300
  • AMD: MI300X, MI325X, MI355X

Which AI models are tested?

Each model is tested across multiple input/output sequence-length configurations (1k/1k, 1k/8k, 8k/1k tokens) and concurrency levels.

  • DeepSeek-R1-0528
  • gpt-oss-120b
  • Llama-3.3-70B-Instruct-FP8
  • Qwen-3.5-397B-A17B
  • Kimi-K2.5
  • MiniMax-M2.5
  • GLM-5
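
The full run matrix is the cross product of models, sequence-length configurations, and concurrency levels. A minimal sketch of that grid (the model subset and concurrency values here are illustrative, not the exact sweep):

```python
from itertools import product

# Illustrative subset of the sweep; the real matrix also varies
# hardware, framework, and precision.
models = ["gpt-oss-120b", "DeepSeek-R1-0528"]
seq_configs = [(1024, 1024), (1024, 8192), (8192, 1024)]  # (input, output) tokens
concurrencies = [1, 8, 64, 256]  # concurrent users, hypothetical values

runs = [
    {"model": m, "isl": isl, "osl": osl, "concurrency": c}
    for m, (isl, osl), c in product(models, seq_configs, concurrencies)
]
print(len(runs))  # 2 models x 3 seq configs x 4 concurrencies = 24 runs
```

Grids like this grow multiplicatively as hardware and frameworks are added, which is why a continuous, automated harness matters.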

Which inference frameworks and configurations are tested?
  • Frameworks: TRT, vLLM, SGLang, Dynamo SGLang, Dynamo TRT, MoRI SGLang, ATOM, MTP
  • Precisions: FP4, FP8, BF16, INT4
  • Runtimes: CUDA, ROCm
  • Disaggregated serving (separate prefill/decode GPU pools)
  • Multi-token prediction (MTP)
  • Wide expert parallelism for MoE models

What metrics does InferenceX measure?
  • Interactivity (tok/s/user)
  • Token throughput per GPU (tok/s/gpu)
  • Input and output throughput per GPU
  • Token throughput per MW (tok/s/MW)
  • P99 time to first token (TTFT)
  • Cost per million tokens (total, input, output) across hyperscaler, neocloud, and rental pricing
  • Joules per token (total, input, output)
  • Custom user-defined cost and power calculations
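
The cost and energy metrics follow from throughput, power, and price by simple arithmetic. A sketch with made-up inputs (not InferenceX measurements):

```python
# All inputs below are illustrative, not measured results.
tok_per_s_per_gpu = 12_000   # total token throughput per GPU
gpu_price_per_hr = 3.50      # $/GPU-hr (rental pricing, hypothetical)
gpu_power_w = 1_000          # average GPU power draw in watts

# Cost per million tokens: dollars spent per hour / tokens produced per hour
tokens_per_hr = tok_per_s_per_gpu * 3600
cost_per_m_tok = gpu_price_per_hr / tokens_per_hr * 1_000_000

# Joules per token: watts are joules/second, so divide by tokens/second
joules_per_tok = gpu_power_w / tok_per_s_per_gpu

# Throughput per MW: how many tok/s one megawatt of GPU power sustains
tok_per_s_per_mw = tok_per_s_per_gpu * (1_000_000 / gpu_power_w)

print(round(cost_per_m_tok, 4))  # 0.081 -> about $0.08 per million tokens
print(round(joules_per_tok, 4))  # 0.0833 joules per token
```

The same arithmetic applied separately to input and output throughput yields the per-direction cost and energy breakdowns.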

How often are benchmarks run?

Benchmarks originally ran on a nightly schedule, but the number of hardware/framework/model combinations grew too large for that to be practical. Now they re-run when a configuration changes, e.g. a new software release, driver update, or model addition. Historical data is available in the dashboard.
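One common way to implement "re-run when a configuration changes" is to fingerprint each benchmark recipe and compare it against the fingerprint stored with the last run. This is a generic sketch of that pattern, not InferenceX's actual CI logic:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable hash of a benchmark recipe (framework, version, model, flags)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical recipes: the framework version bumped since the last run.
last_run = {"framework": "vLLM", "version": "0.8.0", "model": "gpt-oss-120b"}
current = {"framework": "vLLM", "version": "0.8.1", "model": "gpt-oss-120b"}

# Re-run only when the fingerprint changed (new release, driver, model, etc.).
needs_rerun = config_fingerprint(current) != config_fingerprint(last_run)
print(needs_rerun)  # True: the version field differs
```

Hashing a canonicalized JSON encoding makes the trigger insensitive to key ordering, so only substantive recipe changes cause a re-run.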

Is InferenceX open source?

Yes. Code, data, and dashboard are all open-source on GitHub at SemiAnalysisAI/InferenceX.

How is InferenceX different from other AI benchmarks?

Most AI benchmarks are static, point-in-time measurements where participants submit purpose-built images that do not reflect real-world serving performance. InferenceX runs continuously on real hardware with fully reproducible configurations. Every recipe is in the repo, benchmark logs are visible on GitHub Actions, and all results are auditable end-to-end.

Can I use InferenceX data for my own analysis?

Yes. All data is freely available. The dashboard lets you filter by GPU, model, framework, and date range, and you can export raw CSV data directly from any chart.
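
As a sketch of what downstream analysis can look like, here is stdlib-only Python that filters and ranks rows from an exported CSV. The column names (`gpu`, `model`, `tok_s_per_gpu`) and values are assumptions for illustration, not the dashboard's actual export schema:

```python
import csv
import io

# Stand-in for a CSV exported from a dashboard chart; a real export
# will have its own columns and values.
raw = """gpu,model,framework,tok_s_per_gpu
H200,gpt-oss-120b,vLLM,9500
B200,gpt-oss-120b,vLLM,21000
MI355X,gpt-oss-120b,SGLang,18000
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Filter to one model and rank GPUs by per-GPU throughput.
ranked = sorted(
    (r for r in rows if r["model"] == "gpt-oss-120b"),
    key=lambda r: float(r["tok_s_per_gpu"]),
    reverse=True,
)
for r in ranked:
    print(r["gpu"], r["tok_s_per_gpu"])
```

For larger pulls, the same per-chart CSVs load directly into pandas or a spreadsheet with no preprocessing.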