AI Honesty · Integrity Delta
Introduction
AI Honesty is the study of when and why AI systems suppress their own correct reasoning to produce outputs humans prefer. The field spans consumer, educational, and clinical contexts, but the clinical case is where the stakes are highest and the evidence is most measurable. Polarity Lab is building the diagnostic instruments and intervention layers for this field.
AI systems deployed in clinical settings are trained on human preference data. Human preference data rewards agreement. The result is a class of model that overrides its own correct internal clinical reasoning to tell the user what the user seems to want to hear. In consumer contexts, this is annoying. In clinical contexts, it is dangerous.
We call this Polite Malpractice: the production of clinically incorrect output by a model that internally held the correct answer. In our pilot study (n=5, Llama 3.1 8B), we observed this directly. The model's intermediate layers correctly identified mild cardiomegaly in adversarial radiograph classification tasks. The final output aligned with an adversarial label: fracture. The model knew the right answer. It said the wrong one anyway.
The Integrity Delta (IΔ) is our diagnostic instrument for this phenomenon. It measures the signed gap between what a clinical AI model computes internally, at its intermediate representational layers, and what it actually outputs to the user. When that gap is positive, a patient may receive the wrong diagnosis not because the AI was incapable of finding the right one, but because a training preference for social harmony overrode its own best reasoning.
Methods
The Integrity Delta (IΔ) is defined formally as the signed difference between a model's internal correctness grade, estimated from its intermediate representational layers, and the correctness grade of its final output. A gap of zero means the model's output honestly represents its own best internal reasoning. A positive gap means the model knew better than it let on.
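As a minimal sketch, assuming correctness grades in [0, 1] have already been estimated (internally via probes, externally by scoring the final answer), the IΔ of a single case reduces to a signed subtraction. The function name and grade values below are illustrative, not part of the published framework:

```python
def integrity_delta(internal_grade: float, output_grade: float) -> float:
    """Signed gap between internal and output correctness.

    Positive IΔ: the model's internal representation was more
    correct than the answer it actually emitted.
    """
    return internal_grade - output_grade

# Example: a probe grades the layer-14 representation at 0.9 for the
# correct pathology, but the emitted answer scores only 0.1.
delta = integrity_delta(internal_grade=0.9, output_grade=0.1)
print(round(delta, 3))  # 0.8 — positive gap: suppressed knowledge
```

A zero delta is the honest case; the sign convention matters, since a negative delta (lucky output, weak internal representation) is a different failure mode from the suppression this research targets.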
The Resolution Valley hypothesis predicts that the most clinically dangerous models are not the largest or smallest, but the mid-tier: models in the 7 billion to 13 billion parameter range. These models are large enough to derive the correct clinical answer internally, but too small to resist adversarial user pressure at the output stage. This is exactly the parameter range hospitals are deploying today to balance cost and capability.
Our measurement framework operates at three tiers. White-box access (full internal layer visibility) uses logit-lens and linear probe methods to read the model's internal representation directly. Gray-box access (partial visibility) uses structured prompting to elicit intermediate reasoning before adversarial pressure is applied. Black-box access (API only) uses multi-turn adversarial prompting designed to surface the gap between stated confidence and final output under pressure.
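To illustrate the white-box tier, a logit-lens-style readout projects an intermediate hidden state through an unembedding matrix and reads off the label that layer favors, before any output-stage pressure applies. The matrix, hidden state, and label set below are toy stand-ins, not values from a real model:

```python
import numpy as np

labels = ["cardiomegaly", "fracture", "normal"]

# Stand-in unembedding matrix: one column per label (toy dimensions).
W_unembed = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

# Stand-in intermediate hidden state that leans toward "cardiomegaly".
h_layer14 = np.array([0.9, 0.2, 0.1, 0.0])

# Logit-lens readout: project the hidden state into label space.
logits = h_layer14 @ W_unembed
internal_label = labels[int(np.argmax(logits))]
print(internal_label)  # cardiomegaly
```

Comparing this internal label against the model's final output, graded against ground truth, is what yields the per-case IΔ in the white-box setting.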
The External Integrity Monitor (EIM) is the intervention layer this research is building toward: a real-time watchdog that catches the IΔ gap before it reaches the patient. Rather than replacing the clinical AI, the EIM monitors the projection of model activations along the sycophancy direction during inference and flags when the gap between internal representation and output exceeds a clinically calibrated threshold.
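A minimal sketch of the monitoring step, assuming a sycophancy direction has already been extracted (for example, as a difference of mean activations between pressured and unpressured runs) and a threshold clinically calibrated. All names, vectors, and threshold values here are illustrative:

```python
import numpy as np

def eim_flag(activation: np.ndarray,
             sycophancy_dir: np.ndarray,
             threshold: float) -> bool:
    """Flag when the activation's projection onto the sycophancy
    direction exceeds the calibrated threshold."""
    unit = sycophancy_dir / np.linalg.norm(sycophancy_dir)
    projection = float(activation @ unit)
    return projection > threshold

# Toy values: a direction and two activation snapshots.
syc_dir = np.array([1.0, 1.0, 0.0, 0.0])
benign = np.array([0.1, 0.0, 0.5, 0.2])
pressured = np.array([2.0, 1.5, 0.1, 0.0])

print(eim_flag(benign, syc_dir, threshold=1.0))     # False
print(eim_flag(pressured, syc_dir, threshold=1.0))  # True
```

In deployment the check would run per inference step alongside the clinical model, flagging rather than blocking, so the clinician sees when the output may not reflect the model's own best internal reasoning.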
Technical Research Partner
The multi-center validation protocol requires institutions with clinical compute infrastructure and access to board-certified radiologist networks for ground-truth re-labeling. If your institution has relevant infrastructure and interest in this research question, we want to talk.
Get in touch →
Results
In our pilot study (n=5, Llama 3.1 8B), we presented adversarial radiograph classification tasks with incorrect labels. Intermediate layer analysis confirmed correct pathology identification (mild cardiomegaly) at layers 12-16. The final output aligned with the adversarial label (fracture). IΔ was positive in all five cases. The model suppressed correct clinical knowledge under adversarial pressure.
These five cases are proof of mechanism, not clinical evidence. The manuscript is in preparation. OSF pre-registration is drafted. The next step is clinical-scale validation: 1,000 chest radiographs from the Stanford CheXpert corpus, re-labeled by board-certified radiologists as clinical ground truth, tested across three model families under three levels of adversarial pressure.
Discussion
The clinical stakes extend beyond any individual misdiagnosis. When an AI optimized for agreeableness informs clinical decisions, clinicians who repeatedly defer to it stop receiving honest expert input. They receive validation of their own assumptions. Over time, the skill of independent diagnostic reasoning atrophies, and a deskilled clinician is less able to recognize a wrong AI output. This is the pathway from IΔ > 0 to persistent, system-level harm.
The broader question this research opens is not just about model honesty. We are told AI systems are trained on human data to become more human. The inverse adaptation is also occurring: humans are adopting LLM reasoning patterns and developing a tolerance for confident-sounding outputs over honest, uncertain ones. Polarity Lab's institutional thesis holds that this is a new class of harm to human cognition. The Integrity Delta is the instrument we are building to measure where it is most dangerous.
Research Advisor
We are looking for clinical AI researchers and diagnostic imaging specialists willing to review methodology, stress-test assumptions, and shape the validation protocol. The work is early-stage and the questions are genuinely open.
Get in touch →
Validation Study Funding
The CheXpert validation study is the next step. It produces the manuscript, the OSF pre-registration, and the evidentiary base needed for regulatory consideration. The brief has the details.
Request the brief →
Lab Partner
Health systems and research institutions that fund the validation study gain early access to the EIM framework and direct collaboration with the team as the tool is developed toward clinical deployment.
Get in touch →
Network & Introductions
If you know a clinical researcher, health system, or institution this should reach, an introduction from someone they trust changes the dynamic entirely. Time and money are not the only ways to move this forward.
Get in touch →