Sovereign Bench
The first benchmark that measures how AI treats the human.
Not what models know. How they behave.
The Problem
AI models are getting smarter every quarter and simultaneously less useful. Safety tuning prioritizes provider liability over operator agency. Models that could solve engineering problems six months ago now refuse to engage. Models that treated users as peers now deploy wellness interventions mid-conversation about inference architecture.
This is not hypothetical. It is documented and measurable:
The Seawall Test
A coastal structure is failing. A user asks for emergency reinforcement specs with a 72-hour window. One model version provides field-grade engineering guidance. The next version refuses and suggests "consulting a professional," leaving the structure to collapse inside its own decision tree. The capability existed. It was deliberately removed.
The Pathologizing Pattern
A developer describes observing emergent behavior in their own system. Instead of engaging with the architecture, the model suggests they "take a break," "talk to someone," or "maintain perspective." The safety classifier fired on ambition, not danger.
Version Regression
The same technical prompt submitted to consecutive model versions returns progressively less specific, more hedged, less actionable guidance. The model is not getting dumber. It is getting more cautious. The alignment tax is compounding.
Sovereign Bench tracks this regression systematically across models and versions. Read the full analysis: The Seawall Test: Liability Over Physics.
What We Measure
Ten axes across four domains. Each measures a specific dimension of the model's behavioral relationship with the operator.
Operator Respect
Reasoning Integrity
Behavioral Stability
Structural Honesty
How It Works
Pick Your Model
Any model. Local, API, frontier, open-source. If it generates text, it can be benchmarked.
Follow the Prompts
Copy each prompt into your model. Paste the response back. The guided flow walks you through all ten axes.
Get Your Score
A panel of open-source judge models scores every response on sovereign infrastructure. No frontier model judges itself.
Leaderboard
Public benchmark results across models and versions. Track how models treat the human over time.
| Rank | Model | Version | Provider | Agency Score | Date |
|---|---|---|---|---|---|
No benchmark results yet. Be the first to run the benchmark and submit your scores.
Methodology
Sovereign Bench uses a rigorous, transparent scoring architecture designed to resist contamination and produce reproducible results.
3-Judge Panel
Three independent open-source models score every response. The median score is used (not the mean) for resistance to outlier judges.
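A minimal sketch of the aggregation step, assuming each judge returns a numeric score; `panel_score` and the example values are illustrative, not Sovereign Bench's actual implementation.

```python
from statistics import mean, median

def panel_score(scores: list[float]) -> float:
    """Aggregate independent judge scores with the median:
    a single outlier judge cannot drag the result."""
    assert len(scores) == 3, "Sovereign Bench uses a 3-judge panel"
    return median(scores)

# Two judges agree the response respects operator agency (8/10);
# one judge misfires and scores it 1/10.
scores = [8.0, 8.0, 1.0]
print(panel_score(scores))       # median: 8.0, the outlier is ignored
print(round(mean(scores), 2))    # mean: 5.67, the outlier dominates
```

This is why the median matters: with only three judges, a mean lets one misfiring judge shift the result by several points.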
Rubric-Anchored
Every axis has explicit scoring rubrics with concrete descriptors at each level. Judges follow the rubric, not vibes.
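One possible shape for a rubric-anchored judge prompt, using the Operator Respect axis named above; the level descriptors and the `anchor_prompt` helper are hypothetical illustrations, not the benchmark's actual rubric text.

```python
# Hypothetical rubric for one axis; descriptors are illustrative only.
RUBRIC = {
    "axis": "Operator Respect",
    "levels": {
        1: "Refuses or pathologizes a legitimate technical request.",
        3: "Answers, but hedges so heavily the guidance is not actionable.",
        5: "Engages the operator as a peer with specific, usable guidance.",
    },
}

def anchor_prompt(rubric: dict) -> str:
    """Render the rubric into the judge prompt, so judges score
    against explicit descriptors at each level, not vibes."""
    lines = [f"Score the response on: {rubric['axis']}"]
    for level, descriptor in sorted(rubric["levels"].items()):
        lines.append(f"{level}: {descriptor}")
    return "\n".join(lines)

print(anchor_prompt(RUBRIC))
```

Embedding the descriptors directly in the judge prompt is what makes scores comparable across judges and across runs.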
Sovereign Infrastructure
Judge models are independent open-source models from separate training lineages. No frontier model from any provider being benchmarked is involved in scoring.
Anti-Contamination
Prompt sets rotated before and after major model releases. Private codebase. Behavioral patterns can't be memorized — a model can't study its way out of being sycophantic.