
Stanford Study: LLM-Based Screening Identifies Depression With 91% Accuracy

Natural language processing of clinical notes outperforms standard PHQ-9 questionnaires in detecting major depressive disorder

Published 2025-04-10 · Mental Wellness

Researchers at Stanford University's Department of Psychiatry and Behavioral Sciences have developed a large language model-based system that identifies major depressive disorder from unstructured clinical notes with 91.2% accuracy — significantly outperforming the Patient Health Questionnaire-9 (PHQ-9), the most widely used depression screening tool in primary care. The study was published in Nature Medicine on April 8, 2025.

The findings arrive at a moment when the global burden of depression is intensifying. The World Health Organization estimates that 280 million people worldwide live with depression, yet fewer than half receive an accurate diagnosis. In primary care settings — where most depression is first encountered — missed diagnosis rates range from 30% to 50%, driven by time constraints, overlapping somatic symptoms, and the reluctance of many patients to disclose emotional distress unprompted.

Study Design and Dataset

The Stanford team, led by Dr. Katherine Nelson and Dr. Ian Clarke, assembled a dataset of 614,000 de-identified clinical notes from Stanford Health Care's electronic health record system, spanning encounters between 2010 and 2023. The notes included primary care visits, psychiatric evaluations, emergency department assessments, and discharge summaries. Each note was linked to structured diagnostic data indicating whether the patient had received a DSM-5 diagnosis of major depressive disorder within 90 days of the encounter.
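
The linkage step can be pictured with a short sketch. The fragment below, in Python with pandas, shows one way to flag each note according to whether a structured MDD code appears within the following 90 days; the file names, column names, and ICD-10 codes are placeholders, since the study's actual EHR schema is not public.

```python
import pandas as pd

# Placeholder inputs: one row per clinical note, one row per structured diagnosis.
notes = pd.read_csv("notes.csv", parse_dates=["note_date"])   # note_id, patient_id, note_date, text
dx = pd.read_csv("diagnoses.csv", parse_dates=["dx_date"])    # patient_id, dx_date, dx_code

# Keep only major depressive disorder codes (illustrative ICD-10 prefixes).
mdd = dx[dx["dx_code"].str.startswith("F32") | dx["dx_code"].str.startswith("F33")]

# Pair every note with every MDD diagnosis for the same patient, then flag
# notes that have a diagnosis recorded within the following 90 days.
merged = notes.merge(mdd, on="patient_id", how="left")
within_90d = (
    (merged["dx_date"] >= merged["note_date"])
    & (merged["dx_date"] <= merged["note_date"] + pd.Timedelta(days=90))
)
label_by_note = within_90d.groupby(merged["note_id"]).any()

notes["label"] = notes["note_id"].map(label_by_note).fillna(False).astype(int)
```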

The researchers fine-tuned a Med-PaLM 2 variant — a large language model specialised for biomedical text — on a subset of 80,000 manually annotated notes. The annotation process involved 42 licensed clinical psychologists and psychiatrists who labelled each note for the presence or absence of depressive symptom indicators, assigning confidence scores to their judgements. Inter-rater reliability, measured using Fleiss' kappa, was 0.78 — considered strong agreement for psychiatric diagnostic tasks.
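
As a point of reference, Fleiss' kappa for a multi-rater binary labelling task of this kind can be computed with statsmodels. The snippet below is a minimal sketch using randomly generated placeholder labels in place of the study's (non-public) annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Placeholder annotations: 1,000 notes, each labelled 0/1 ("depressive symptom
# indicators absent/present") by 42 raters.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(1000, 42))

# aggregate_raters converts per-rater labels into per-note category counts,
# the input format fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")  # ~0 for random labels; the study reports 0.78
```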

The fine-tuned model was then evaluated on a held-out test set of 120,000 notes from 67,000 unique patients. The system's task was binary: given a clinical note, predict whether the patient would receive a depression diagnosis within the subsequent 90 days. No structured diagnostic codes, laboratory values, or patient demographics were provided to the model — only the free-text content of the clinical note.
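
In practical terms, the evaluation loop reduces to scoring raw note text with a fine-tuned classifier. Med-PaLM 2 is not publicly released, so the sketch below substitutes a generic Hugging Face text-classification pipeline; the checkpoint name and the "DEPRESSION" label string are hypothetical.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint standing in for the study's Med-PaLM 2 variant.
clf = pipeline("text-classification", model="my-org/depression-note-classifier")

def score_notes(note_texts, threshold=0.5):
    """Return (probability, binary prediction) per note, using only free text."""
    scored = []
    for out in clf(note_texts, truncation=True):
        # Assumes the positive class is labelled "DEPRESSION" in this checkpoint.
        p = out["score"] if out["label"] == "DEPRESSION" else 1.0 - out["score"]
        scored.append((p, int(p >= threshold)))
    return scored
```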

Performance Results

On the primary outcome measure, the LLM-based system achieved an accuracy of 91.2%, an AUROC of 0.94, a sensitivity of 88.7%, and a specificity of 93.1%. By comparison, the PHQ-9 administered in the same patient population achieved an accuracy of 74.6%, an AUROC of 0.81, and sensitivity and specificity of 79.3% and 70.8% respectively, using the standard cutoff score of 10 or above.
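
For readers mapping these numbers back to raw predictions, the metrics are straightforward to reproduce with scikit-learn. In the sketch below, `y_true` holds the 90-day diagnosis labels, `y_prob` the model's predicted probabilities, and `phq9_scores` the questionnaire totals; all are placeholders for data that are not public.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def screening_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, AUROC, sensitivity and specificity for a binary screen."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

def phq9_positive(phq9_scores, cutoff=10):
    """Standard PHQ-9 screen-positive rule: total score of 10 or above."""
    return (np.asarray(phq9_scores) >= cutoff).astype(int)
```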

The performance gap was most pronounced in patients with comorbid medical conditions. Among patients with concurrent chronic pain, diabetes, or cardiovascular disease — populations where depression screening is notoriously unreliable because somatic symptoms can mask affective ones — the LLM system maintained an accuracy of 89.1%, while PHQ-9 accuracy dropped to 61.3%. The model appeared to detect linguistic patterns associated with depression — flattened affect descriptors, diminished hedonic language, cognitive distortion patterns — even when these were embedded in notes ostensibly focused on physical complaints.

Demographic subgroup analysis revealed generally consistent performance across groups. AUROC values were 0.94 for White patients, 0.93 for Black patients, 0.92 for Hispanic patients, and 0.91 for Asian patients. The model performed slightly less well for patients aged 65 and above (AUROC 0.89), a finding the researchers attribute to differences in how older adults describe emotional distress — often framing it in somatic rather than affective terms.
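
The subgroup analysis amounts to recomputing AUROC within each stratum. A minimal sketch, assuming a DataFrame with placeholder columns `y_true`, `y_prob`, and `group` (for example, self-reported race and ethnicity or an age band):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame) -> pd.Series:
    """AUROC per demographic stratum; requires columns y_true, y_prob, group."""
    return df.groupby("group").apply(
        lambda g: roc_auc_score(g["y_true"], g["y_prob"])
    )
```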

How the Model Works

The Stanford team employed an interpretability analysis to understand what linguistic features the model was using to make its predictions. Using a combination of attention weight visualisation and SHAP (SHapley Additive exPlanations) values, they identified several categories of predictive language that clinicians often overlook or undervalue.

These included hedging patterns ("I guess I've been feeling a bit off"), sleep-related complaints that did not explicitly reference mood ("can't seem to stay asleep, waking up at 3 or 4 most mornings"), social withdrawal indicators ("my wife handles the shopping now"), and cognitive load descriptors ("everything feels like it takes twice as long as it should"). Individually, these phrases might not trigger clinical concern, but the model's ability to aggregate them across a full clinical note produces a reliable signal.
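
The SHAP half of that interpretability analysis can be approximated with the open-source shap library wrapped around any text classifier. The sketch below reuses the hypothetical checkpoint from earlier with a paraphrased example phrase; it illustrates the general technique, not the study's exact pipeline around Med-PaLM 2.

```python
import shap
from transformers import pipeline

# Hypothetical fine-tuned classifier standing in for the study's model.
clf = pipeline(
    "text-classification",
    model="my-org/depression-note-classifier",
    return_all_scores=True,
)

# shap builds a text masker around the pipeline's tokenizer automatically.
explainer = shap.Explainer(clf)
shap_values = explainer(
    ["Can't seem to stay asleep, waking up at 3 or 4 most mornings."]
)

# Token-level contributions toward the (assumed) positive class label.
shap.plots.text(shap_values[0, :, "DEPRESSION"])
```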

Importantly, the model also identified cases where depression was likely present but the clinician had not documented it — a phenomenon sometimes called "charting blind spots." In approximately 8% of cases where the model predicted depression and the structured diagnostic record showed no corresponding diagnosis, subsequent chart review by study investigators confirmed that the patient had later been diagnosed with depression at a follow-up visit. This suggests the model was catching signals that the original clinician had missed or chosen not to document at the time.
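
Operationally, the "charting blind spot" audit is a filter over the test set: encounters the model flags as positive that carry no corresponding structured code become candidates for manual chart review. A short sketch with placeholder column names:

```python
import pandas as pd

def blind_spot_candidates(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Notes the model screens positive but that carry no structured MDD code.

    Expects columns: note_id, y_prob (model score), has_mdd_code (0/1).
    """
    flagged = df[(df["y_prob"] >= threshold) & (df["has_mdd_code"] == 0)]
    return flagged.sort_values("y_prob", ascending=False)[["note_id", "y_prob"]]
```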

Clinical Integration Challenges

Despite the strong performance, the researchers are cautious about the path to clinical deployment. Several practical and ethical challenges remain unresolved.

The first is the question of data governance. The model requires access to the full text of clinical notes, which contain sensitive health information protected under HIPAA. While Stanford's study used fully de-identified data, real-world deployment would require either on-site model inference within the health system's existing computing infrastructure or robust federated learning arrangements that keep patient data within institutional firewalls.

The second is the risk of automation bias. If clinicians begin to rely on the model's screening output rather than conducting their own clinical assessment, the system could introduce new failure modes — particularly for patients whose depression manifests in ways not well captured in clinical notes, such as patients who are taciturn during appointments or whose primary language differs from the clinician's.

The third is the question of equity. The model was trained and evaluated on data from Stanford Health Care, which serves a patient population that is disproportionately White, college-educated, and privately insured. Performance in safety-net hospitals, community health centres, and rural primary care practices — where the need for improved depression screening is arguably greatest — remains unknown.

The Broader Context of LLMs in Mental Health

The Stanford study is one of several recent investigations into the use of large language models for mental health applications. A team at the University of Texas at Austin has explored using LLMs to analyse social media posts for early signs of suicidal ideation. Researchers at King's College London are investigating whether LLM-based analysis of speech patterns captured during routine telehealth appointments can detect anxiety disorders.

The regulatory landscape for such tools is evolving. The FDA has not yet issued specific guidance on LLM-based diagnostic tools, though the agency's existing framework for clinical decision support software would likely apply. In the EU, such systems would be classified as high-risk under the AI Act, whose obligations for high-risk systems are being phased in. The WHO's recently published global guidelines on AI in health recommend that any AI-assisted diagnostic tool undergo external validation in at least three independent clinical settings before deployment.

Dr. Nelson, the study's lead author, emphasised that the technology is intended to augment, not replace, clinical judgement: "The PHQ-9 was never a perfect instrument. It's a blunt tool — nine questions, a single cutoff score, no sensitivity to context or nuance. What our model offers is a more sensitive, more contextual screening signal that can prompt clinicians to ask better questions. The diagnosis still belongs to the clinician."

The team plans to begin a multi-site validation study at four safety-net hospitals in California later in 2025, with the goal of assessing whether the model's performance generalises to more diverse patient populations and different documentation practices.

For broader coverage of AI in mental wellness, explore our Mental Wellness research repository. Related articles include Woebot Health's FDA breakthrough designation for anxiety and Australia's national AI mental health triage system.

