Benchmarks

Evergreen benchmark framing for capability, retrieval, and reliability evaluation.

No rankings without a reproducible methodology and primary sources.

General knowledge and reasoning

Use task-level controls and contamination checks before cross-model comparisons.
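
One common heuristic for a contamination check is n-gram overlap between a benchmark item and a reference corpus. The sketch below is illustrative only and is not a method prescribed by this page: the function names, the whitespace tokenization, and the default n = 8 are all assumptions.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(item: str, corpus: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in any corpus document.

    Hypothetical helper: 0.0 means no overlap, 1.0 means every n-gram
    of the item was found in the corpus.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Items scoring above a chosen threshold can be flagged for review or excluded before any cross-model comparison; the threshold itself is a judgment call that should be disclosed with the methodology.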

Evaluate retrieval quality separately from generation quality, then test long-context stress behavior.
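
As a minimal sketch of scoring the two stages separately, retrieval can be measured with recall@k over document IDs while generation is measured with a separate answer metric. The function names and the choice of recall@k and normalized exact match are assumptions made for illustration, not metrics named by this page.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of relevant document IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def exact_match(prediction: str, reference: str) -> bool:
    """Generation metric: case-insensitive match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


# Report the two numbers side by side rather than folding them into one score,
# so a retrieval failure is not mistaken for a generation failure.
retrieval_score = recall_at_k(["d1", "d2", "d3"], {"d2", "d9"}, k=2)
generation_ok = exact_match(" Paris ", "paris")
```

Keeping the metrics apart makes it possible to rerun the generation step with gold documents supplied directly, which isolates generation quality from retrieval errors.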

Tool-calling claims require deterministic execution logs, test harness disclosure, and replay checks.
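
A replay check can be sketched as follows: each tool call is logged with its arguments and a digest of its result, and the log is later re-executed to confirm the digests still match. The log format, the helper names, and the use of SHA-256 over canonical JSON are assumptions for illustration; the sketch also assumes tools are deterministic and return JSON-serializable values.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

Tool = Callable[..., Any]


def digest(value: Any) -> str:
    """SHA-256 of the value's canonical JSON form (keys sorted)."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_call(tools: Dict[str, Tool], name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a tool and return a log entry with a digest of its result."""
    result = tools[name](**args)
    return {"tool": name, "args": args, "result_digest": digest(result)}


def replay_mismatches(log: List[Dict[str, Any]], tools: Dict[str, Tool]) -> List[int]:
    """Re-execute every logged call; return indices whose digest differs."""
    mismatches = []
    for index, entry in enumerate(log):
        result = tools[entry["tool"]](**entry["args"])
        if digest(result) != entry["result_digest"]:
            mismatches.append(index)
    return mismatches


# Usage: record once, then replay against the same tool set.
tools = {"add": lambda a, b: a + b}
log = [record_call(tools, "add", {"a": 1, "b": 2})]
clean_replay = replay_mismatches(log, tools)
```

An empty list from the replay means every logged call reproduced; any returned index points at a call whose result changed between the original run and the replay.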

SWE-bench
Owner: SWE-bench paper

Owner and method provenance are documented in the linked primary papers.