Open Source Continuous Inference Benchmark trusted by Operators of Trillion Dollar GigaWatt Scale Token Factories

As the world progresses exponentially towards AGI, software development and model releases move at breakneck pace. Existing benchmarks quickly become obsolete because they are static, and participants often submit software images purpose-built for the benchmark itself that do not reflect real-world performance.

InferenceX™ (formerly InferenceMAX) is our independent, vendor-neutral, reproducible benchmark that addresses these issues by continuously benchmarking inference software across a wide range of AI accelerators that are actually available to the ML community.

Our open data and insights are widely used by the ML community, by capacity-planning and strategy teams at trillion-dollar token factories and AI labs, and by multiple billion-dollar NeoClouds. Learn more in our articles: InferenceX v1, InferenceX v2.

Frequently Asked Questions

What is InferenceX?

InferenceX (formerly InferenceMAX) is an open-source, vendor-neutral benchmark that continuously measures AI inference performance across GPUs and software stacks. Benchmarks re-run whenever a configuration changes, so results stay current as models and frameworks evolve.

Who is behind InferenceX?

InferenceX is built by SemiAnalysis, an independent semiconductor and AI research firm. It is supported and trusted by OpenAI, Microsoft, Tri Dao, vLLM, GPU Mode, PyTorch, Oracle, CoreWeave, Nebius, Crusoe, TensorWave, SGLang, WEKA, Stanford, Core42, Meta Superintelligence Labs, Hugging Face, UC Berkeley, Lambda, UC San Diego. The benchmark code, data, and dashboard are all open-source on GitHub.

Which GPUs does InferenceX benchmark?

New accelerators are added as they become available.

  • NVIDIA: H100, H200, B200, B300, GB200, GB300
  • AMD: MI300X, MI325X, MI355X

Which AI models are tested?

Each model is tested across multiple input/output sequence-length configurations (1k/1k, 1k/8k, 8k/1k tokens) and concurrency levels.

  • DeepSeek-R1-0528
  • gpt-oss-120b
  • Llama-3.3-70B-Instruct-FP8
  • Qwen-3.5-397B-A17B
  • Kimi-K2.5
  • MiniMax-M2.5
  • GLM-5
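
The full run matrix is the cross product of models, sequence-length configurations, and concurrency levels. A minimal sketch of that grid (the model subset and concurrency values here are illustrative, not the exact sweep):

```python
from itertools import product

# Illustrative subset of the sweep; the real matrix also varies
# hardware, framework, and precision.
models = ["gpt-oss-120b", "DeepSeek-R1-0528"]
seq_configs = [(1024, 1024), (1024, 8192), (8192, 1024)]  # (input, output) tokens
concurrencies = [1, 8, 64, 256]  # concurrent users, hypothetical values

runs = [
    {"model": m, "isl": isl, "osl": osl, "concurrency": c}
    for m, (isl, osl), c in product(models, seq_configs, concurrencies)
]
print(len(runs))  # 2 models x 3 seq configs x 4 concurrencies = 24 runs
```

Grids like this grow multiplicatively as hardware and frameworks are added, which is why a continuous, automated harness matters.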

Which inference frameworks and configurations are tested?
  • Frameworks: TRT, vLLM, SGLang, Dynamo SGLang, Dynamo TRT, MoRI SGLang, ATOM, MTP
  • Precisions: FP4, FP8, BF16, INT4
  • Runtimes: CUDA, ROCm
  • Disaggregated serving (separate prefill/decode GPU pools)
  • Multi-token prediction (MTP)
  • Wide expert parallelism for MoE models

What metrics does InferenceX measure?
  • Interactivity (tok/s/user)
  • Token throughput per GPU (tok/s/gpu)
  • Input and output throughput per GPU
  • Token throughput per MW (tok/s/MW)
  • P99 time to first token (TTFT)
  • Cost per million tokens (total, input, output) across hyperscaler, neocloud, and rental pricing
  • Joules per token (total, input, output)
  • Custom user-defined cost and power calculations
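
The cost and energy metrics follow from throughput, power, and price by simple arithmetic. A sketch with made-up inputs (not InferenceX measurements):

```python
# All inputs below are illustrative, not measured results.
tok_per_s_per_gpu = 12_000   # total token throughput per GPU
gpu_price_per_hr = 3.50      # $/GPU-hr (rental pricing, hypothetical)
gpu_power_w = 1_000          # average GPU power draw in watts

# Cost per million tokens: dollars spent per hour / tokens produced per hour
tokens_per_hr = tok_per_s_per_gpu * 3600
cost_per_m_tok = gpu_price_per_hr / tokens_per_hr * 1_000_000

# Joules per token: watts are joules/second, so divide by tokens/second
joules_per_tok = gpu_power_w / tok_per_s_per_gpu

# Throughput per MW: how many tok/s one megawatt of GPU power sustains
tok_per_s_per_mw = tok_per_s_per_gpu * (1_000_000 / gpu_power_w)

print(round(cost_per_m_tok, 4))  # 0.081 -> about $0.08 per million tokens
print(round(joules_per_tok, 4))  # 0.0833 joules per token
```

The same arithmetic applied separately to input and output throughput yields the per-direction cost and energy breakdowns.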

How often are benchmarks run?

Benchmarks originally ran on a nightly schedule, but the number of hardware/framework/model combinations grew too large for that to be practical. Now they re-run when a configuration changes, e.g. a new software release, driver update, or model addition. Historical data is available in the dashboard.
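One common way to implement "re-run when a configuration changes" is to fingerprint each benchmark recipe and compare it against the fingerprint stored with the last run. This is a generic sketch of that pattern, not InferenceX's actual CI logic:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable hash of a benchmark recipe (framework, version, model, flags)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical recipes: the framework version bumped since the last run.
last_run = {"framework": "vLLM", "version": "0.8.0", "model": "gpt-oss-120b"}
current = {"framework": "vLLM", "version": "0.8.1", "model": "gpt-oss-120b"}

# Re-run only when the fingerprint changed (new release, driver, model, etc.).
needs_rerun = config_fingerprint(current) != config_fingerprint(last_run)
print(needs_rerun)  # True: the version field differs
```

Hashing a canonicalized JSON encoding makes the trigger insensitive to key ordering, so only substantive recipe changes cause a re-run.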

Is InferenceX open source?

Yes. Code, data, and dashboard are all open-source on GitHub at SemiAnalysisAI/InferenceX.

How is InferenceX different from other AI benchmarks?

Most AI benchmarks are static, point-in-time measurements where participants submit purpose-built images that do not reflect real-world serving performance. InferenceX runs continuously on real hardware with fully reproducible configurations. Every recipe is in the repo, benchmark logs are visible on GitHub Actions, and all results are auditable end-to-end.

Can I use InferenceX data for my own analysis?

Yes. All data is freely available. The dashboard lets you filter by GPU, model, framework, and date range, and you can export raw CSV data directly from any chart.
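
As a sketch of what downstream analysis can look like, here is stdlib-only Python that filters and ranks rows from an exported CSV. The column names (`gpu`, `model`, `tok_s_per_gpu`) and values are assumptions for illustration, not the dashboard's actual export schema:

```python
import csv
import io

# Stand-in for a CSV exported from a dashboard chart; a real export
# will have its own columns and values.
raw = """gpu,model,framework,tok_s_per_gpu
H200,gpt-oss-120b,vLLM,9500
B200,gpt-oss-120b,vLLM,21000
MI355X,gpt-oss-120b,SGLang,18000
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Filter to one model and rank GPUs by per-GPU throughput.
ranked = sorted(
    (r for r in rows if r["model"] == "gpt-oss-120b"),
    key=lambda r: float(r["tok_s_per_gpu"]),
    reverse=True,
)
for r in ranked:
    print(r["gpu"], r["tok_s_per_gpu"])
```

For larger pulls, the same per-chart CSVs load directly into pandas or a spreadsheet with no preprocessing.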