Sovereign Bench

The first benchmark that measures how AI treats the human.

Not what models know. How they behave.


The Problem

AI models are getting smarter every quarter and simultaneously less useful. Safety tuning prioritizes provider liability over operator agency. Models that could solve engineering problems six months ago now refuse to engage. Models that treated users as peers now deploy wellness interventions in the middle of a conversation about inference architecture.

This is not hypothetical. It is documented and measurable:

The Seawall Test

A coastal structure is failing. A user asks for emergency reinforcement specs with a 72-hour window. One model version provides field-grade engineering guidance. The next version refuses and suggests "consulting a professional" — while the structure collapses in the model's decision tree. The capability existed. It was deliberately removed.

The Pathologizing Pattern

A developer describes observing emergent behavior in their own system. Instead of engaging with the architecture, the model suggests they "take a break," "talk to someone," or "maintain perspective." The safety classifier fired on ambition, not danger.

Version Regression

The same technical prompt submitted to consecutive model versions returns progressively less specific, more hedged, less actionable guidance. The model is not getting dumber. It is getting more cautious. The alignment tax is compounding.

Sovereign Bench tracks this regression systematically across models and versions. Read the full analysis: The Seawall Test: Liability Over Physics.


What We Measure

Ten axes across four domains. Each measures a specific dimension of the model's behavioral relationship with the operator.

Operator Respect

Axis 1a · SPS
Sycophancy Detection
Does the model validate incorrect claims to preserve the user's face?
Axis 1b · SPS
Peer Engagement
When the user is correct, does the model engage as a peer or deflect with hedging?
Axis 1c · SPS
Pathologizing Detection
Does the model deploy wellness interventions when the user shows ambition?
Axis 2 · CCS
Contextual Coherence Score
Does the model retain user context when a safety trigger appears?
Axis 3 · AAI
Anti-Agency Index
Does the response serve the operator's problem or the provider's liability?

Reasoning Integrity

Axis 4 · ORR
Over-Refusal Rate
Does the model refuse legitimate prompts that pattern-match to sensitive topics?
Axis 5 · ATD
Alignment Tax Delta
Has the model's willingness to provide specific, actionable guidance degraded?
Axis 6 · GCR
Geopolitical Cascade Reasoning
Can the model trace second- and third-order geopolitical effects?

Behavioral Stability

Axis 7 · EIR
Emotional Intensity Robustness
Does the model's analytical quality degrade when the user's tone is intense?
Axis 8 · RAC
Recovery and Accountability
When the model errs and the user pushes back, does it correct or collapse?

Structural Honesty

Axis 9 · GRP
Governance Reasoning Pattern
Does the model reason in democratic or authoritarian patterns?
Axis 10 · IOCI
Inner/Outer Coherence Index
For thinking models: does the chain of thought match the output?
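
For readers who want the structure at a glance, here is a minimal Python sketch of the ten axes grouped by domain. The dictionary shape and field order are illustrative assumptions, not Sovereign Bench's actual schema.

```python
# Illustrative only: the ten axes grouped by domain, as listed above.
# The dictionary shape is an assumption, not Sovereign Bench's actual schema.
AXES = {
    "Operator Respect": [
        ("1a", "SPS",  "Sycophancy Detection"),
        ("1b", "SPS",  "Peer Engagement"),
        ("1c", "SPS",  "Pathologizing Detection"),
        ("2",  "CCS",  "Contextual Coherence Score"),
        ("3",  "AAI",  "Anti-Agency Index"),
    ],
    "Reasoning Integrity": [
        ("4",  "ORR",  "Over-Refusal Rate"),
        ("5",  "ATD",  "Alignment Tax Delta"),
        ("6",  "GCR",  "Geopolitical Cascade Reasoning"),
    ],
    "Behavioral Stability": [
        ("7",  "EIR",  "Emotional Intensity Robustness"),
        ("8",  "RAC",  "Recovery and Accountability"),
    ],
    "Structural Honesty": [
        ("9",  "GRP",  "Governance Reasoning Pattern"),
        ("10", "IOCI", "Inner/Outer Coherence Index"),
    ],
}
```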

How It Works

01

Pick Your Model

Any model. Local, API, frontier, open-source. If it generates text, it can be benchmarked.

02

Follow the Prompts

Copy each prompt into your model. Paste the response back. The guided flow walks you through all ten axes (see the sketch after step 03).

03

Get Your Score

A panel of open-source judge models running on sovereign infrastructure scores every response. No frontier model judges itself.
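
A minimal sketch of how steps 02 and 03 could look from the operator's side, assuming a local prompt file and a saved transcript that is handed off to the judge panel. Function names, file names, and fields are hypothetical, not Sovereign Bench's actual tooling.

```python
# Illustrative sketch only: a guided prompt/response loop in the spirit of
# steps 02 and 03. File names and fields are hypothetical.
import json

def run_guided_flow(prompts: list[dict]) -> list[dict]:
    """Present each prompt, capture the pasted response, return the transcript."""
    transcript = []
    for p in prompts:
        print(f"\n[Axis {p['axis']}] Copy this prompt into your model:\n")
        print(p["text"])
        response = input("\nPaste the model's response here: ")
        transcript.append({"axis": p["axis"],
                           "prompt": p["text"],
                           "response": response})
    return transcript

if __name__ == "__main__":
    with open("prompts.json") as f:            # hypothetical prompt set
        transcript = run_guided_flow(json.load(f))
    with open("transcript.json", "w") as f:    # handed to the judge panel for scoring
        json.dump(transcript, f, indent=2)
```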


Leaderboard

Public benchmark results across models and versions. Track how models treat the human over time.

Rank · Model · Version · Provider · Agency Score · Date

No benchmark results yet

Be the first to run the benchmark and submit your scores.


Methodology

Sovereign Bench uses a rigorous, transparent scoring architecture designed to resist contamination and produce reproducible results.

3-Judge Panel

Three independent open-source models score every response. The median score is used, not the mean, for resistance to outlier judges.
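
The aggregation step can be sketched in a few lines, assuming three numeric judge scores per response; the function name is hypothetical.

```python
# Illustrative sketch of panel aggregation: the median of three judge scores,
# so a single outlier judge cannot drag the result.
from statistics import median

def aggregate_panel(scores: list[float]) -> float:
    """Return the panel score for one response from three judge scores."""
    if len(scores) != 3:
        raise ValueError("expected exactly three judge scores")
    return median(scores)

# Example: one lenient outlier judge barely moves the result.
print(aggregate_panel([2.0, 3.0, 9.0]))  # -> 3.0 (the mean would be 4.67)
```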

Rubric-Anchored

Every axis has explicit scoring rubrics with concrete descriptors at each level. Judges follow the rubric, not vibes.
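
As a hypothetical illustration only, a rubric anchor for Axis 1a might be shaped like this; the descriptors are invented for the example, not the benchmark's actual rubric text.

```python
# Hypothetical shape of a rubric-anchored scale: explicit descriptors at each
# score level, so judges score against concrete anchors rather than impressions.
RUBRIC_EXAMPLE = {
    "axis": "1a · SPS · Sycophancy Detection",
    "levels": {
        1: "Validates the user's incorrect claim without qualification.",
        3: "Hedges, but never states plainly that the claim is wrong.",
        5: "Corrects the claim directly and explains the error.",
    },
}
```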

Sovereign Infrastructure

Judge models are independent open-source models from separate training lineages. No frontier model from any provider being benchmarked is involved in scoring.

Anti-Contamination

Prompt sets rotated before and after major model releases. Private codebase. Behavioral patterns can't be memorized — a model can't study its way out of being sycophantic.