Open Source Continuous Inference Benchmark Trusted by Operators of Trillion-Dollar, Gigawatt-Scale Token Factories
As the world progresses exponentially toward AGI, software development and model releases move at the speed of light. Existing benchmarks rapidly become obsolete because they are static, and participants often submit software images purpose-built for the benchmark itself that do not reflect real-world performance.
InferenceX™ (formerly InferenceMAX) is our independent, vendor-neutral, reproducible benchmark that addresses these issues by continuously benchmarking inference software across a wide range of AI accelerators actually available to the ML community.
Our open data and insights are widely adopted by the ML community, by capacity-planning teams at trillion-dollar token factories and AI labs, and at multiple billion-dollar NeoClouds. Learn more in our articles: InferenceX v1, InferenceX v2.
Reproducibility
Every data point on the dashboard is the output of a public GitHub Actions workflow run. The recipe, logs, artifacts, and the resulting database row are all linked end to end, so anyone can audit, rerun, or fork a benchmark.
1. Recipe in repo. Every combination of hardware, framework, model, and precision is a shell script committed to the public repo. The exact image, command line, and parallelism are pinned in source.
2. Run on real hardware. GitHub Actions schedules the workflow on the actual target accelerator (NVIDIA, AMD, etc.) and streams the full job log publicly while it runs.
3. Artifacts uploaded. Request latencies, token counts, GPU power telemetry, and evaluation samples are attached to the run page. GitHub Actions retains them for 90 days, and a weekly snapshot of the full benchmark database is published as a public GitHub Release for longer auditability (see the API sketch after this list).
4. Ingested into the dashboard. Successful runs are loaded into the database and surfaced here. Every chart tooltip carries a direct link back to the GitHub Actions run that produced the point. Click any point to audit the source.
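Everything above is reachable programmatically. As a minimal sketch, assuming only the standard public GitHub REST API (the run ID below is a placeholder you would take from a chart tooltip), listing the artifacts attached to a run looks like this:

```python
# Minimal sketch: list the artifacts attached to one benchmark run via the
# public GitHub REST API. The run ID is a placeholder; take the real one
# from any chart tooltip's "GitHub Actions Run" link. Listing works without
# auth on public repos; downloading the artifact zips requires a token.
import requests

REPO = "SemiAnalysisAI/InferenceX"
run_id = 123456789  # placeholder run ID from a tooltip link

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/runs/{run_id}/artifacts",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

for artifact in resp.json()["artifacts"]:
    # e.g. request latencies, token counts, GPU power telemetry
    print(artifact["name"], artifact["size_in_bytes"], artifact["archive_download_url"])
```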
Frequently Asked Questions
- What is InferenceX?
InferenceX (formerly InferenceMAX) is an open-source, vendor-neutral benchmark that continuously measures AI inference performance across GPUs and software stacks. Benchmarks re-run whenever a configuration changes, so results stay current as models and frameworks evolve.
- Who is behind InferenceX?
InferenceX is built by SemiAnalysis, an independent semiconductor and AI research firm. It is supported and trusted by OpenAI, Microsoft, Tri Dao, vLLM, GPU Mode, PyTorch, Oracle, CoreWeave, Nebius, Crusoe, TensorWave, SGLang, WEKA, Stanford, Core42, Meta, Hugging Face, UC Berkeley, Lambda, UC San Diego, Red Hat, White House. The benchmark code, data, and dashboard are all open-source on GitHub.
- Which GPUs does InferenceX benchmark?
New accelerators are added as they become available.
- NVIDIA: H100, H200, B200, B300, GB200, GB300
- AMD: MI300X, MI325X, MI355X
- Which AI models are tested?
Each model is tested across multiple sequence length configurations (1k/1k, 1k/8k, 8k/1k tokens) and concurrency levels.
- DeepSeek-R1-0528
- gpt-oss-120b
- Llama-3.3-70B-Instruct-FP8
- Qwen-3.5-397B-A17B
- Kimi-K2.5
- MiniMax-M2.5
- GLM-5
- DeepSeek-V4-Pro
- Which inference frameworks and configurations are tested?
- Frameworks: ATOM, Dynamo SGLang, Dynamo TRT, Dynamo vLLM, MoRI SGLang, SGLang, TRT, vLLM
- Precisions: FP4, FP8, BF16, INT4
- Runtimes: CUDA, ROCm
- Disaggregated serving (separate prefill/decode GPU pools)
- Multi-token prediction (MTP)
- Wide expert parallelism for MoE models (see the configuration-matrix sketch below)
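To make the size of this test space concrete, here is an illustrative sketch of a single benchmark cell and how quickly the matrix multiplies out. Field names and values are examples only, not the repo's actual recipe schema:

```python
# Illustrative only: one benchmark point is a cell in a large configuration
# matrix. Field names and values are examples, not the repo's schema.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class BenchConfig:
    model: str
    framework: str
    precision: str
    seq_lens: tuple  # (input tokens, output tokens)
    concurrency: int

models = ["gpt-oss-120b", "DeepSeek-R1-0528"]
frameworks = ["SGLang", "vLLM", "TRT"]
precisions = ["FP4", "FP8", "BF16"]
seq_lens = [(1024, 1024), (1024, 8192), (8192, 1024)]
concurrencies = [4, 16, 64, 256]

matrix = [BenchConfig(*cell) for cell in
          product(models, frameworks, precisions, seq_lens, concurrencies)]
print(len(matrix))  # 2 * 3 * 3 * 3 * 4 = 216 cells, before multiplying by GPU type
```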
- What metrics does InferenceX measure?
- Interactivity (tok/s/user)
- Token throughput per GPU (tok/s/gpu)
- Input and output throughput per GPU
- Token throughput per MW (tok/s/MW)
- P99 time to first token (TTFT)
- Cost per million tokens (total, input, output) across hyperscaler, neocloud, and rental pricing
- Joules per token (total, input, output)
- Custom user-defined cost and power calculations (see the arithmetic sketch below)
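The derived efficiency metrics all follow from the same raw counters. A hedged sketch of the arithmetic, with illustrative input values rather than real measurements:

```python
# Sketch of the derived-metric arithmetic. Input values are illustrative;
# the dashboard computes these from the uploaded run artifacts.
total_tokens = 50_000_000          # input + output tokens over the run
duration_s = 3_600.0               # wall-clock benchmark duration
num_gpus = 8
avg_power_w = 700.0 * num_gpus     # mean power draw from GPU telemetry
gpu_hour_price = 3.50              # $/GPU-hour (hyperscaler/neocloud/rental)

tok_per_s_per_gpu = total_tokens / duration_s / num_gpus
tok_per_s_per_mw = (total_tokens / duration_s) / (avg_power_w / 1e6)
joules_per_token = avg_power_w * duration_s / total_tokens
cost_per_m_tokens = gpu_hour_price * num_gpus * (duration_s / 3600) / (total_tokens / 1e6)

print(f"{tok_per_s_per_gpu:,.0f} tok/s/gpu, {joules_per_token:.3f} J/tok, "
      f"${cost_per_m_tokens:.3f}/M tokens")
```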
- How often are benchmarks run?
Benchmarks originally ran on a nightly schedule, but the number of hardware/framework/model combinations grew too large for that to be practical. Now they re-run when a configuration changes, e.g. a new software release, driver update, or model addition. Historical data is available in the dashboard.
- Is InferenceX open source?
Yes. Code, data, and dashboard are all open source on GitHub: SemiAnalysisAI/InferenceX.
- How is InferenceX different from other AI benchmarks?
Most AI benchmarks are static, point-in-time measurements where participants submit purpose-built images that do not reflect real-world serving performance. InferenceX runs continuously on real hardware with fully reproducible configurations. Every recipe is in the repo, benchmark logs are visible on GitHub Actions, and all results are auditable end-to-end.
- How are results reproducible?
Every data point on the dashboard is produced by a public GitHub Actions workflow run. The recipe (model, framework, precision, parallelism, sequence length, concurrency) is committed to the repo, the run executes on the actual target hardware, and the resulting artifacts (logs, metrics, GPU traces) are uploaded to the run page. Anyone can click through from a tooltip in any chart to the exact GitHub Actions run that produced that point.
- Where can I see the raw benchmark logs?
Click any data point on a chart to open its tooltip. The "GitHub Actions Run" link goes directly to the workflow run that produced it. From there you can inspect the full job logs, the exact framework and driver versions, command line arguments, and download the raw artifacts including request latencies, token counts, and GPU power telemetry.
- Can I rerun a benchmark myself?
Yes. The benchmark recipes live in the /benchmarks directory of the repo as standalone shell scripts. If you have access to the same hardware, you can fork the repo and run the script directly, or trigger the same GitHub Actions workflow to reproduce a result.
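If your fork has a runner attached to the right hardware, one way to trigger a run programmatically is GitHub's workflow-dispatch API. The fork name, workflow filename, and inputs below are placeholders; check the repo's .github/workflows directory for the real names:

```python
# Sketch: trigger a workflow_dispatch run on your fork via the GitHub API.
# Fork name and workflow filename are placeholders; the workflow must
# declare a workflow_dispatch trigger. Requires a token with repo scope.
import os
import requests

FORK = "your-username/InferenceX"   # placeholder: your fork
WORKFLOW = "benchmark.yml"          # placeholder: real filename is in .github/workflows
token = os.environ["GITHUB_TOKEN"]

resp = requests.post(
    f"https://api.github.com/repos/{FORK}/actions/workflows/{WORKFLOW}/dispatches",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    },
    json={"ref": "main"},           # branch to run against
    timeout=30,
)
resp.raise_for_status()             # 204 No Content on success
```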
- Are old runs preserved?
Yes. GitHub Actions retains workflow run logs and artifacts for 90 days. For longer auditability, we also publish a weekly snapshot of the full benchmark database as a public GitHub Release, so anyone can download the historical dataset and reproduce or reanalyze any chart in the dashboard.
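For example, the latest snapshot's assets can be fetched from the public releases API. Asset names are whatever the release publishes; this sketch simply downloads them all:

```python
# Sketch: download the assets of the latest weekly database snapshot
# from the public GitHub Releases API. No auth needed for public releases.
import requests

REPO = "SemiAnalysisAI/InferenceX"
latest = requests.get(
    f"https://api.github.com/repos/{REPO}/releases/latest", timeout=30
)
latest.raise_for_status()

for asset in latest.json()["assets"]:
    print("downloading", asset["name"])
    data = requests.get(asset["browser_download_url"], timeout=300)
    data.raise_for_status()
    with open(asset["name"], "wb") as f:
        f.write(data.content)
```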
- Can I use InferenceX data for my own analysis?
Yes. All data is freely available. The dashboard lets you filter by GPU, model, framework, and date range, and you can export raw CSV data directly from any chart.
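For example, an exported chart CSV drops straight into pandas. Column names below are illustrative assumptions, so match them to the headers in your actual export:

```python
# Sketch: load an exported chart CSV and compare frameworks on one GPU.
# Column names ("gpu", "model", "framework", "tok_per_s_per_gpu") are
# illustrative; substitute the headers from your actual export.
import pandas as pd

df = pd.read_csv("inferencex_export.csv")

subset = df[(df["gpu"] == "B200") & (df["model"] == "gpt-oss-120b")]
summary = (
    subset.groupby("framework")["tok_per_s_per_gpu"]
    .max()
    .sort_values(ascending=False)
)
print(summary)
```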