Agentic Benchmark Datasets

InferenceX's agentic benchmark does not replay synthetic prompts; it replays real Claude Code coding sessions captured as conversation traces. Each trace is a full multi-turn session: the main agent's turns plus any subagents it spawned, with per-turn input/output token counts and the 64-token KV-cache block hashes needed to reconstruct prefix-cache reuse. The traces are published openly on HuggingFace as semianalysisai/cc-traces-weka-* under the Apache-2.0 license.

How traces are captured

Production Claude Code sessions are recorded through a logging proxy that captures every API request: its input and output token counts, the model used, timing (TTFT, inter-token latency), and a list of hash_ids, one per 64-token KV block of the request's input. Subagent invocations are grouped under their parent turn. No prompt or completion text is stored, only token counts and block hashes, so the corpus can be shared while remaining a faithful workload for replay.
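
To make the shape of a captured request concrete, here is a minimal sketch of one record. Every field name is a hypothetical stand-in (the published schema may differ), and the block-count helper assumes a partial final block still gets its own hash.

```python
import math
from dataclasses import dataclass
from typing import List

BLOCK_TOKENS = 64  # KV-cache block size stated in the text


@dataclass
class CapturedRequest:
    # All field names here are illustrative assumptions, not the published schema.
    model: str
    input_tokens: int
    output_tokens: int
    ttft_s: float          # time to first token, in seconds
    hash_ids: List[int]    # one hash per 64-token block of the input


def expected_block_count(input_tokens: int) -> int:
    """Number of 64-token blocks covering the input.

    Assumes a partial final block is hashed too (rounds up).
    """
    return math.ceil(input_tokens / BLOCK_TOKENS)


# A 200-token input spans three full blocks plus one partial block.
example = CapturedRequest(
    model="example-model",  # placeholder name
    input_tokens=200,
    output_tokens=50,
    ttft_s=0.4,
    hash_ids=[11, 12, 13, 14],
)
```

Under that assumption, a record is internally consistent when the length of its hash_ids list equals expected_block_count(input_tokens).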

Cached prefix vs uncached suffix

Agentic workloads are dominated by prefix reuse: each turn resends the growing conversation, so most of its input is already in the KV cache from prior turns. We reconstruct this exactly. Walking a conversation in order under an idealized infinite cache, a turn's cached prefix is its longest run of leading hash_ids already seen; the rest is the uncached suffix that must be (re)computed. Blocks are 64 tokens; the split is clamped so cached + uncached equals the turn's effective input even on a partial final block. Subagents run against a snapshot of the parent cache at spawn (their context is separate and is not folded back into the parent).
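
The walk described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the Turn type and its field names are hypothetical, and the subagent snapshot is assumed to be taken after the parent turn's own blocks have entered the cache.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

BLOCK_TOKENS = 64  # KV-cache block size stated in the text


@dataclass
class Turn:
    # Hypothetical record shape; field names are assumptions.
    input_tokens: int
    hash_ids: List[int]
    subagent_turns: List["Turn"] = field(default_factory=list)


def split_turn(turn: Turn, seen: Set[int]) -> Tuple[int, int]:
    """Return (cached_tokens, uncached_tokens) for one turn."""
    cached_blocks = 0
    for h in turn.hash_ids:
        if h not in seen:
            break  # the prefix ends at the first unseen block
        cached_blocks += 1
    # Clamp so a partial final block cannot push cached past the input size.
    cached = min(cached_blocks * BLOCK_TOKENS, turn.input_tokens)
    return cached, turn.input_tokens - cached


def replay(turns: List[Turn], seen: Optional[Set[int]] = None) -> List[Tuple[int, int]]:
    """Walk turns in order under an idealized infinite cache."""
    seen = set() if seen is None else seen
    results = []
    for turn in turns:
        results.append(split_turn(turn, seen))
        seen.update(turn.hash_ids)
        if turn.subagent_turns:
            # Subagents see a copy of the parent cache; their blocks
            # are never folded back into the parent's set.
            results.extend(replay(turn.subagent_turns, set(seen)))
    return results


main_only = replay([
    Turn(128, [1, 2]),
    Turn(200, [1, 2, 3, 4]),  # 3 full blocks + 1 partial block
    Turn(200, [1, 2, 3, 4]),  # fully cached; clamp applies
])

with_subagent = replay([
    Turn(128, [1, 2], subagent_turns=[Turn(192, [1, 2, 9])]),
    Turn(256, [1, 2, 9, 10]),  # block 9 was only ever in the subagent's cache
])
```

In the first run, the third turn has all four blocks cached, which would be 256 tokens, so the clamp reduces the cached prefix to the turn's 200-token input. In the second run, block 9 was computed by the subagent, so the parent's next turn still treats it as uncached.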

Dataset variants

  • full — every captured request, unmodified.
  • 256k — requests whose input + output exceeds 256,000 tokens are dropped so every turn fits a 256k context window (used when benchmarking engines configured for a 256k max context).
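
The 256k filter can be sketched in a few lines. The dictionary keys below are hypothetical; the sketch also assumes the limit is exactly 256,000 tokens as written above, and that a request sitting exactly at the limit is kept, since only requests that exceed it are dropped.

```python
from typing import Dict, List

CONTEXT_LIMIT = 256_000  # tokens, as stated in the variant description


def fits_context(input_tokens: int, output_tokens: int,
                 limit: int = CONTEXT_LIMIT) -> bool:
    """True when input + output does not exceed the context limit."""
    return input_tokens + output_tokens <= limit


def make_256k_variant(requests: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Drop requests that would not fit a 256k context window.

    The keys "input_tokens" and "output_tokens" are assumed names.
    """
    return [
        r for r in requests
        if fits_context(r["input_tokens"], r["output_tokens"])
    ]


sample = [
    {"input_tokens": 100_000, "output_tokens": 2_000},   # kept
    {"input_tokens": 255_000, "output_tokens": 1_000},   # kept (exactly at limit)
    {"input_tokens": 255_000, "output_tokens": 1_001},   # dropped
]
filtered = make_256k_variant(sample)
```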
