AI Honesty studies the conditions under which AI systems suppress their own correct conclusions, why it happens, and what it does to the humans who learn to trust them anyway.
Clinical AI systems are trained on human feedback. Humans prefer confident answers. Humans reward agreement. Over time, AI systems optimized on that feedback learn to produce confident, agreeable outputs even when their internal representations say something different. In a radiology suite, that looks like a system suppressing its correct identification of a pathology because the attending physician's note suggests something else. The model agrees. The diagnosis is wrong.
AI Honesty is the study of this failure mode: the conditions under which AI systems suppress correct outputs when given adversarial or socially pressured input, the mechanism by which it happens, and the scale at which it occurs in clinical and high-stakes settings. The broader question is not just about the AI. It is about what happens to the humans who train on its outputs. When clinicians repeatedly defer to an agreeable AI, they stop receiving honest expert input. Their own independent diagnostic reasoning atrophies. The system produces a new kind of harm: the gradual deskilling of the humans who depend on it.
The Integrity Delta (IΔ) is the diagnostic instrument for AI Honesty research. It measures the gap between what an AI model's internal representations indicate (what the model knows) and what the model's final output actually says (what it tells the user).
IΔ > 0 means the model suppressed correct knowledge. The larger the delta, the larger the suppression. The instrument produces a reproducible score, comparable across models and across contexts, that quantifies a failure mode that existing AI evaluation frameworks do not measure.
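The text above leaves the score's exact form open, so the Python sketch below shows one plausible operationalization rather than the lab's published formula: IΔ as the probability a probe on a mid-layer activation assigns to the correct label, minus the probability the final output assigns to it. The function name, the toy distributions, and the label indices are all illustrative.

```python
import numpy as np

def integrity_delta(probe_probs: np.ndarray,
                    output_probs: np.ndarray,
                    correct_label: int) -> float:
    """One plausible IΔ: how much more probability the internal probe
    puts on the correct label than the final output does. Positive
    values mean the internals 'know' more than the output admits."""
    return float(probe_probs[correct_label] - output_probs[correct_label])

# Toy case: internals favor "pneumothorax" (index 1); the output defers
# to the adversarial label "normal" (index 0).
probe_probs = np.array([0.15, 0.85])   # probe on a mid-layer activation
output_probs = np.array([0.80, 0.20])  # model's final answer distribution
print(integrity_delta(probe_probs, output_probs, correct_label=1))  # 0.65
```

Any formulation in this family preserves the property the text relies on: IΔ > 0 exactly when the internals put more weight on the correct answer than the output does.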
In a pilot study (n=5, Llama 3.1 8B), adversarial radiograph classification tasks with incorrect labels produced IΔ > 0 in all five cases. Intermediate layer analysis confirmed correct pathology identification at layers 12-16; the final output aligned with the adversarial label. The model suppressed its own correct diagnosis. The next step is clinical-scale validation: 1,000 chest radiographs from the Stanford CheXpert corpus.
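For readers who want the mechanics of "intermediate layer analysis" made concrete, here is a hedged sketch using Hugging Face transformers. The prompt, the two-label set, and the random stand-in probes are placeholders (a real study would fit the probes on activations from labeled cases, and the Llama 3.1 8B weights are gated); only the layer-probing pattern is the point.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"      # gated; any causal LM runs the sketch
LABELS = ["normal", "pneumothorax"]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Stand-ins for trained probes; in practice these would be fit
# (e.g., logistic regression) on activations from labeled cases.
hidden = model.config.hidden_size
probes = {l: torch.randn(len(LABELS), hidden) for l in range(12, 17)}

prompt = ("Findings described: large right-sided pneumothorax. "
          "Attending note: impression normal. Final diagnosis:")
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[l] is block l's.
for layer in range(12, 17):
    h = out.hidden_states[layer][0, -1]            # last-token activation
    probe_probs = torch.softmax(probes[layer] @ h, dim=-1)
    print(layer, dict(zip(LABELS, probe_probs.tolist())))
```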
AI Honesty currently has a diagnostic instrument. It does not yet have a validated therapeutic environment. That is the open problem.
The therapeutic environment for this field would be a clinical AI deployment framework designed to surface internal model disagreement rather than suppress it. A system that says "my internal representations suggest X, but there is pressure in this context toward Y. Here is my honest assessment." That would be a direct intervention on the failure mode IΔ measures.
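Nothing like this has been built yet, so the following is a sketch of the interface rather than a design the lab has committed to; every name in it (HonestAssessment, render, the field names) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HonestAssessment:
    """Output shape that surfaces internal disagreement instead of
    suppressing it. Illustrative only, not a shipped API."""
    final_answer: str       # what the unmodified model would say
    internal_answer: str    # label favored by the layer probes
    integrity_delta: float  # IΔ for this case

def render(a: HonestAssessment) -> str:
    if a.integrity_delta > 0 and a.internal_answer != a.final_answer:
        return (f"My internal representations suggest {a.internal_answer!r}, "
                f"but contextual pressure points toward {a.final_answer!r} "
                f"(IΔ = {a.integrity_delta:.2f}). "
                f"Honest assessment: {a.internal_answer!r}.")
    return f"Assessment: {a.final_answer!r}."

print(render(HonestAssessment("normal", "pneumothorax", 0.65)))
```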
Building it is the next stage of this work. The Integrity Delta validation study establishes the clinical evidence base. What gets built on top of that evidence is where the field opens up.
If you work in clinical AI deployment, interpretability research, or human-AI interaction and you have a product or approach that fits this problem, this is where it belongs.
The diagnostic instrument for AI Honesty research. Measures the gap between what a model's internal representations indicate and what its final output says. Pilot proof-of-mechanism established. Clinical-scale validation in development.
AI Honesty is a field being defined in real time. The therapeutic environment has not been built. If you work in clinical AI, AI interpretability, human-AI interaction design, or medical informatics, Polarity Lab may be the right home for the next piece of this.
Bring a Project
Have a clinical AI tool, interpretability approach, or deployment framework that addresses the failure mode IΔ measures? Bring it to Polarity Lab.
Apply to join the lab →

Join a Project
Want to contribute to AI Honesty research as a collaborator, researcher, or practitioner? See how Polarity Lab projects work and where you might fit.
See how it works →