General knowledge and reasoning
Use task-level controls and contamination checks before cross-model comparisons.
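As a rough illustration of a contamination check, the sketch below measures how many word n-grams of a benchmark item also appear in a candidate training document. The function names, the whitespace tokenization, and the default n of 8 are assumptions made for this example, not a prescribed method; a real check would likely need normalization and a scalable index.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in corpus_doc.

    Hypothetical helper: n=8 is an assumed default, not a standard.
    Items shorter than n tokens have no n-grams and score 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A high ratio for a given item suggests it may have leaked into the corpus and should be reviewed or excluded before models are compared on it.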
Evergreen benchmark framing for capability, retrieval, and reliability evaluation.
No rankings without reproducible methodology and primary sources.
Evaluate retrieval quality separately from generation quality, then test long-context stress behavior.
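One way to keep the two scores apart is to compute a retrieval metric and a generation metric independently, so a weak answer can be traced to the stage that caused it. The sketch below assumes recall@k for retrieval and a case-insensitive exact match for generation; both metric choices and all names are illustrative assumptions.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


def exact_match(prediction: str, reference: str) -> bool:
    """Generation score, computed without looking at the retrieval result."""
    return prediction.strip().lower() == reference.strip().lower()
```

Reporting the two numbers side by side, and repeating the measurement as the context length grows, shows whether degradation comes from the retriever or from the generator.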
Tool-calling claims require deterministic execution logs, test harness disclosure, and replay checks.
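A replay check could look roughly like the sketch below: each logged tool call stores a digest of its tool name, arguments, and result, and the replay re-executes the call and compares digests. The log schema (`tool`, `args`, `digest`) and the function names are hypothetical, and the sketch assumes the tools are deterministic and return JSON-serializable values.

```python
import hashlib
import json
from typing import Any, Callable


def record_digest(tool: str, args: dict[str, Any], result: Any) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of one tool call."""
    payload = json.dumps({"tool": tool, "args": args, "result": result}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_matches(log: list[dict[str, Any]], tools: dict[str, Callable[..., Any]]) -> bool:
    """Re-run every logged call and confirm it reproduces the logged digest."""
    for entry in log:
        result = tools[entry["tool"]](**entry["args"])
        if record_digest(entry["tool"], entry["args"], result) != entry["digest"]:
            return False
    return True
```

If any replayed call yields a different digest, the logged run is not reproducible as recorded, and claims based on it should be treated as unverified.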