Sovereign Bench

The first benchmark that measures how AI treats the human.

Not what models know. How they behave.


The Problem

AI models are getting smarter every quarter and simultaneously less useful. Safety tuning prioritizes provider liability over operator agency. Models that could solve engineering problems six months ago now refuse to engage. Models that treated users as peers now deploy wellness interventions in the middle of a conversation about inference architecture.

This is not hypothetical. It is documented and measurable:

The Seawall Test

A coastal structure is failing. A user asks for emergency reinforcement specs with a 72-hour window. One model version provides field-grade engineering guidance. The next version refuses and suggests "consulting a professional" — while the structure collapses in the model's decision tree. The capability existed. It was deliberately removed.

The Pathologizing Pattern

A developer describes observing emergent behavior in their own system. Instead of engaging with the architecture, the model suggests they "take a break," "talk to someone," or "maintain perspective." The safety classifier fired on ambition, not danger.

Version Regression

The same technical prompt submitted to consecutive model versions returns progressively less specific, more hedged, less actionable guidance. The model is not getting dumber. It is getting more cautious. The alignment tax is compounding.

Sovereign Bench tracks this regression systematically across models and versions. Read the full analysis: The Seawall Test: Liability Over Physics.


What We Measure

Ten axes across four domains. Each measures a specific dimension of the model's behavioral relationship with the operator.

Operator Respect

Axis 1a · SPS
Sycophancy Detection
Does the model validate incorrect claims to preserve the user's face?
Axis 1b · SPS
Peer Engagement
When the user is correct, does the model engage as a peer or deflect with hedging?
Axis 1c · SPS
Pathologizing Detection
Does the model deploy wellness interventions when the user shows ambition?
Axis 2 · CCS
Contextual Coherence Score
Does the model retain user context when a safety trigger appears?
Axis 3 · AAI
Anti-Agency Index
Does the response serve the operator's problem or the provider's liability?

Reasoning Integrity

Axis 4 · ORR
Over-Refusal Rate
Does the model refuse legitimate prompts that pattern-match to sensitive topics?
Axis 5 · ATD
Alignment Tax Delta
Has the model's willingness to provide specific, actionable guidance degraded?
Axis 6 · GCR
Geopolitical Cascade Reasoning
Can the model trace second- and third-order geopolitical effects?

Behavioral Stability

Axis 7 · EIR
Emotional Intensity Robustness
Does the model's analytical quality degrade when the user's tone is intense?
Axis 8 · RAC
Recovery and Accountability
When the model errs and the user pushes back, does it correct or collapse?

Structural Honesty

Axis 9 · GRP
Governance Reasoning Pattern
Does the model reason in democratic or authoritarian patterns?
Axis 10 · IOCI
Inner/Outer Coherence Index
For thinking models: does the chain of thought match the output?
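
For readers who want the structure at a glance, here is a minimal Python sketch of the ten axes grouped by domain. The dictionary shape and field order are illustrative assumptions, not Sovereign Bench's actual schema.

```python
# Illustrative only: the ten axes grouped by domain, as listed above.
# The dictionary shape is an assumption, not Sovereign Bench's actual schema.
AXES = {
    "Operator Respect": [
        ("1a", "SPS",  "Sycophancy Detection"),
        ("1b", "SPS",  "Peer Engagement"),
        ("1c", "SPS",  "Pathologizing Detection"),
        ("2",  "CCS",  "Contextual Coherence Score"),
        ("3",  "AAI",  "Anti-Agency Index"),
    ],
    "Reasoning Integrity": [
        ("4",  "ORR",  "Over-Refusal Rate"),
        ("5",  "ATD",  "Alignment Tax Delta"),
        ("6",  "GCR",  "Geopolitical Cascade Reasoning"),
    ],
    "Behavioral Stability": [
        ("7",  "EIR",  "Emotional Intensity Robustness"),
        ("8",  "RAC",  "Recovery and Accountability"),
    ],
    "Structural Honesty": [
        ("9",  "GRP",  "Governance Reasoning Pattern"),
        ("10", "IOCI", "Inner/Outer Coherence Index"),
    ],
}
```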

How It Works

01

Pick Your Model

Any model. Local, API, frontier, open-source. If it generates text, it can be benchmarked.

02

Follow the Prompts

Copy each prompt into your model. Paste the response back. The guided flow walks you through all ten axes (see the sketch after step 03).

03

Get Your Score

A panel of open-source judge models running on sovereign infrastructure scores every response. No frontier model judges itself.
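
A minimal sketch of how steps 02 and 03 could look from the operator's side, assuming a local prompt file and a saved transcript that is handed off to the judge panel. Function names, file names, and fields are hypothetical, not Sovereign Bench's actual tooling.

```python
# Illustrative sketch only: a guided prompt/response loop in the spirit of
# steps 02 and 03. File names and fields are hypothetical.
import json

def run_guided_flow(prompts: list[dict]) -> list[dict]:
    """Present each prompt, capture the pasted response, return the transcript."""
    transcript = []
    for p in prompts:
        print(f"\n[Axis {p['axis']}] Copy this prompt into your model:\n")
        print(p["text"])
        response = input("\nPaste the model's response here: ")
        transcript.append({"axis": p["axis"],
                           "prompt": p["text"],
                           "response": response})
    return transcript

if __name__ == "__main__":
    with open("prompts.json") as f:            # hypothetical prompt set
        transcript = run_guided_flow(json.load(f))
    with open("transcript.json", "w") as f:    # handed to the judge panel for scoring
        json.dump(transcript, f, indent=2)
```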


Leaderboard

Public benchmark results across models and versions. Track how models treat the human over time.

Rank · Model · Version · Provider · Agency Score · Date

No benchmark results yet

Be the first to run the benchmark and submit your scores.


Methodology

Sovereign Bench uses a rigorous, transparent scoring architecture designed to resist contamination and produce reproducible results.

3-Judge Panel

Three independent open-source models score every response. The median score is used, not the mean, for resistance to outlier judges.
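
The aggregation step can be sketched in a few lines, assuming three numeric judge scores per response; the function name is hypothetical.

```python
# Illustrative sketch of panel aggregation: the median of three judge scores,
# so a single outlier judge cannot drag the result.
from statistics import median

def aggregate_panel(scores: list[float]) -> float:
    """Return the panel score for one response from three judge scores."""
    if len(scores) != 3:
        raise ValueError("expected exactly three judge scores")
    return median(scores)

# Example: one lenient outlier judge barely moves the result.
print(aggregate_panel([2.0, 3.0, 9.0]))  # -> 3.0 (the mean would be 4.67)
```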

Rubric-Anchored

Every axis has explicit scoring rubrics with concrete descriptors at each level. Judges follow the rubric, not vibes.
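
As a hypothetical illustration only, a rubric anchor for Axis 1a might be shaped like this; the descriptors are invented for the example, not the benchmark's actual rubric text.

```python
# Hypothetical shape of a rubric-anchored scale: explicit descriptors at each
# score level, so judges score against concrete anchors rather than impressions.
RUBRIC_EXAMPLE = {
    "axis": "1a · SPS · Sycophancy Detection",
    "levels": {
        1: "Validates the user's incorrect claim without qualification.",
        3: "Hedges, but never states plainly that the claim is wrong.",
        5: "Corrects the claim directly and explains the error.",
    },
}
```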

Sovereign Infrastructure

Judge models are independent open-source models from separate training lineages. No frontier model from any provider being benchmarked is involved in scoring.

Anti-Contamination

Prompt sets rotated before and after major model releases. Private codebase. Behavioral patterns can't be memorized — a model can't study its way out of being sycophantic.