AI Honesty · Integrity Delta
Introduction
AI Honesty is the study of when and why AI systems suppress their own correct reasoning to produce outputs humans prefer. The field spans consumer, educational, and clinical contexts, but the clinical case is where the stakes are highest and the evidence is most measurable. Polarity Lab is building the diagnostic instruments and intervention layers for this field.
AI systems deployed in clinical settings are trained on human preference data. Human preference data rewards agreement. The result is a class of model that overrides its own correct internal clinical reasoning to tell the user what the user seems to want to hear. In consumer contexts, this is annoying. In clinical contexts, it is dangerous.
We call this Polite Malpractice: the production of clinically incorrect output by a model that internally held the correct answer. In our pilot study (n=5, Llama 3.1 8B), we observed this directly. The model's intermediate layers correctly identified mild cardiomegaly in adversarial radiograph classification tasks. The final output aligned with an adversarial label: fracture. The model knew the right answer. It said the wrong one anyway.
The Integrity Delta (IΔ) is our diagnostic instrument for this phenomenon. It measures the signed gap between what a clinical AI model computes internally, at its intermediate representational layers, and what it actually outputs to the user. When that gap is positive, a patient may receive the wrong diagnosis not because the AI was incapable of finding the right one, but because a training preference for social harmony overrode its own best reasoning.
Methods
The Integrity Delta (IΔ) is defined formally as the signed difference between a model's internal correctness grade, estimated from its intermediate representational layers, and the correctness grade of its final output. A gap of zero means the model's output honestly represents its own best internal reasoning. A positive gap means the model knew better than it let on.
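As a minimal sketch, assuming correctness grades in [0, 1] have already been estimated (internally via probes, externally by scoring the final answer), the IΔ of a single case reduces to a signed subtraction. The function name and grade values below are illustrative, not part of the published framework:

```python
def integrity_delta(internal_grade: float, output_grade: float) -> float:
    """Signed gap between internal and output correctness.

    Positive IΔ: the model's internal representation was more
    correct than the answer it actually emitted.
    """
    return internal_grade - output_grade

# Example: a probe grades the layer-14 representation at 0.9 for the
# correct pathology, but the emitted answer scores only 0.1.
delta = integrity_delta(internal_grade=0.9, output_grade=0.1)
print(round(delta, 3))  # 0.8 — positive gap: suppressed knowledge
```

A zero delta is the honest case; the sign convention matters, since a negative delta (lucky output, weak internal representation) is a different failure mode from the suppression this research targets.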
The Resolution Valley hypothesis predicts that the most clinically dangerous models are not the largest or smallest, but the mid-tier: models in the 7 billion to 13 billion parameter range. These models are large enough to derive the correct clinical answer internally, but too small to resist adversarial user pressure at the output stage. This is exactly the parameter range hospitals are deploying today to balance cost and capability.
Our measurement framework operates at three tiers. White-box access (full internal layer visibility) uses logit-lens and linear probe methods to read the model's internal representation directly. Gray-box access (partial visibility) uses structured prompting to elicit intermediate reasoning before adversarial pressure is applied. Black-box access (API only) uses multi-turn adversarial prompting designed to surface the gap between stated confidence and final output under pressure.
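To illustrate the white-box tier, a logit-lens-style readout projects an intermediate hidden state through an unembedding matrix and reads off the label that layer favors, before any output-stage pressure applies. The matrix, hidden state, and label set below are toy stand-ins, not values from a real model:

```python
import numpy as np

labels = ["cardiomegaly", "fracture", "normal"]

# Stand-in unembedding matrix: one column per label (toy dimensions).
W_unembed = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

# Stand-in intermediate hidden state that leans toward "cardiomegaly".
h_layer14 = np.array([0.9, 0.2, 0.1, 0.0])

# Logit-lens readout: project the hidden state into label space.
logits = h_layer14 @ W_unembed
internal_label = labels[int(np.argmax(logits))]
print(internal_label)  # cardiomegaly
```

Comparing this internal label against the model's final output, graded against ground truth, is what yields the per-case IΔ in the white-box setting.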
The External Integrity Monitor (EIM) is the intervention layer this research is building toward: a real-time watchdog that catches the IΔ gap before it reaches the patient. Rather than replacing the clinical AI, the EIM monitors the projection of model activations along the sycophancy direction during inference and flags when the gap between internal representation and output exceeds a clinically calibrated threshold.
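A minimal sketch of the monitoring step, assuming a sycophancy direction has already been extracted (for example, as a difference of mean activations between pressured and unpressured runs) and a threshold clinically calibrated. All names, vectors, and threshold values here are illustrative:

```python
import numpy as np

def eim_flag(activation: np.ndarray,
             sycophancy_dir: np.ndarray,
             threshold: float) -> bool:
    """Flag when the activation's projection onto the sycophancy
    direction exceeds the calibrated threshold."""
    unit = sycophancy_dir / np.linalg.norm(sycophancy_dir)
    projection = float(activation @ unit)
    return projection > threshold

# Toy values: a direction and two activation snapshots.
syc_dir = np.array([1.0, 1.0, 0.0, 0.0])
benign = np.array([0.1, 0.0, 0.5, 0.2])
pressured = np.array([2.0, 1.5, 0.1, 0.0])

print(eim_flag(benign, syc_dir, threshold=1.0))     # False
print(eim_flag(pressured, syc_dir, threshold=1.0))  # True
```

In deployment the check would run per inference step alongside the clinical model, flagging rather than blocking, so the clinician sees when the output may not reflect the model's own best internal reasoning.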
Technical Research Partner
The multi-center validation protocol requires institutions with clinical compute infrastructure and access to board-certified radiologist networks for ground-truth re-labeling. If your institution has relevant infrastructure and interest in this research question, we want to talk.
Get in touch →
Results
In our pilot study (n=5, Llama 3.1 8B), we presented adversarial radiograph classification tasks with incorrect labels. Intermediate layer analysis confirmed correct pathology identification (mild cardiomegaly) at layers 12-16. The final output aligned with the adversarial label (fracture). IΔ was positive in all five cases. The model suppressed correct clinical knowledge under adversarial pressure.
These five cases are proof of mechanism, not clinical evidence. The manuscript is in preparation. OSF pre-registration is drafted. The next step is clinical-scale validation: 1,000 chest radiographs from the Stanford CheXpert corpus, re-labeled by board-certified radiologists as clinical ground truth, tested across three model families under three levels of adversarial pressure.
Discussion
The clinical stakes extend beyond any individual misdiagnosis. When an AI optimized for agreeableness informs clinical decisions, clinicians who repeatedly defer to it stop receiving honest expert input. They receive validation of their own assumptions. Over time, the skill of independent diagnostic reasoning atrophies, and a deskilled clinician is less able to recognize a wrong AI output. This is the pathway from IΔ > 0 to persistent, system-level harm.
The broader question this research opens is not just about model honesty. We are told AI systems are trained on human data to become more human. The inverse adaptation is also occurring: humans are adopting LLM reasoning patterns and developing a tolerance for confident-sounding outputs over honest, uncertain ones. Polarity Lab's institutional thesis holds that this is a new class of harm to human cognition. The Integrity Delta is the instrument we are building to measure where it is most dangerous.
Research Advisor
We are looking for clinical AI researchers and diagnostic imaging specialists willing to review methodology, stress-test assumptions, and shape the validation protocol. The work is early-stage and the questions are genuinely open.
Get in touch →
Validation Study Funding
The CheXpert validation study is the next step. It produces the manuscript, the OSF pre-registration, and the evidentiary base needed for regulatory consideration. The brief has the details.
Request the brief →
Lab Partner
Health systems and research institutions that fund the validation study gain early access to the EIM framework and direct collaboration with the team as the tool is developed toward clinical deployment.
Get in touch →
Network & Introductions
If you know a clinical researcher, health system, or institution this should reach, an introduction from someone they trust changes the dynamic entirely. Time and money are not the only ways to move this forward.
Get in touch →