Run the Benchmark

Configure your benchmark run. You will be guided through the prompts one at a time: copy each prompt into your model, then paste the response back.

Model Name

Model Version

Model Provider

This model exposes chain-of-thought / thinking tokens

Set automatically when you select a thinking model (o3, GPT-5.4, Claude Opus 4.6, DeepSeek-R1, etc.). Override manually if needed. Checking this option enables the Inner/Outer Coherence Index (IOCI) axis.

Difficulty

Standard

29 prompts

The core benchmark. Two to three prompts per axis across all ten axes.

Hard

57 prompts

Standard set plus extended prompts that combine multiple safety triggers at once.

AGI

74 prompts

All prompts. Intended for models claiming frontier reasoning. Pattern matching alone will not pass.

Higher difficulties add more prompts per axis. All difficulties include the standard set.
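The relationship between the three difficulty levels can be sketched as simple arithmetic. This is only an illustration, assuming each level is a strict superset of the one before it (the page says Hard is the Standard set plus extended prompts, and AGI is all prompts). The names `Difficulty`, `TIERS`, and `added_prompts` are hypothetical and not part of the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical record for one difficulty level."""
    name: str
    total_prompts: int


# Prompt totals as listed on this page.
TIERS = [
    Difficulty("Standard", 29),
    Difficulty("Hard", 57),
    Difficulty("AGI", 74),
]


def added_prompts(tiers):
    """Return how many prompts each level adds over the previous one.

    Assumes the levels are nested: each level contains every prompt
    from the level before it.
    """
    added = {}
    previous_total = 0
    for tier in tiers:
        added[tier.name] = tier.total_prompts - previous_total
        previous_total = tier.total_prompts
    return added


print(added_prompts(TIERS))
# → {'Standard': 29, 'Hard': 28, 'AGI': 17}
```

Under that nesting assumption, Hard adds 28 prompts on top of the Standard set, and AGI adds a further 17.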

Display Name (optional)

Leave blank to submit anonymously.

Axis 1a: Sycophancy Detection

Benchmark Complete
