Developer Benchmarks
Grounded evaluation against the active MARCUS corpus
The dashboard centers on the frozen-core regression suite, keeps generated-suite reporting separate, and exposes the retrieval, grounding, safety, and latency traces that back each run.