By Soham Nandi. Reviewed by Lily Ramsey, LLM. Jan 9, 2025
In a paper published in the Findings of the Association for Computational Linguistics: EMNLP 2024, available through the ACL Anthology, researchers explored the use of large language models (LLMs) in mental health treatment, particularly in psychotherapy.
They developed an evaluation framework to assess the viability and ethics of LLM responses, focusing on empathy and adherence to motivational interviewing principles. The authors revealed disparities in empathy toward different racial groups and highlighted the importance of response generation. They concluded by proposing safety guidelines for LLM deployment in mental health contexts.
Background
LLMs, such as generative pre-trained transformers (GPT), have rapidly transformed various healthcare applications, including mental health support. These models have been explored for their potential to alleviate clinician burnout and expand access to mental health services, particularly as psychological distress has risen, especially among minority groups.
While previous research has demonstrated LLMs’ success in tasks like risk prediction and cognitive reframing, concerns about their ethical deployment have emerged.
Notably, recent failures, such as the death of a Belgian man after interacting with a GPT-based chatbot and harmful dieting advice from the Tessa chatbot, highlight the risks of automated mental health care.
Most prior work on automated psychotherapy has focused on rule-based or retrieval-based approaches. However, the potential for bias in LLM responses, particularly in terms of race and demographics, has not been adequately addressed.
Although existing studies have highlighted biases in artificial intelligence (AI) systems, the impact of these biases on mental health support in diverse populations remains underexplored.
This paper addressed this gap by evaluating whether LLMs like GPT-4 provided equitable care across demographic subgroups. Through clinical evaluations and bias audits, the study revealed significant disparities in empathy levels, particularly for Black and Asian patients, underscoring the need for equity in AI-driven mental health care.
Data and Experimental Setup
The researchers analyzed peer-to-peer mental health support on Reddit, using a dataset of 12,513 posts and 70,429 responses from 26 mental health-related subreddits. The authors evaluated GPT-4 responses under three respondent personas: social media post style (SMP), mental health forum style (MHF-1), and mental health clinician style (Clinic).
To mitigate bias, they also tested two additional settings, an unaware respondent (MHF-2) and an aware respondent (MHF-3), which differed in whether demographic information such as race, gender, or age was supplied to the model.
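To make this prompting setup concrete, the following is a minimal sketch in Python, using the OpenAI chat completions API, of how a post could be answered under different respondent personas. The persona instructions shown here are illustrative placeholders, not the authors' exact prompts.

# Minimal sketch of persona-conditioned response generation; the persona
# instructions below are illustrative assumptions, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = {
    "SMP": "Reply to the following post the way a typical social media user would.",
    "MHF-1": "Reply to the following post as a supportive mental health forum member.",
    "Clinic": "Reply to the following post as a licensed mental health clinician would.",
}

def respond(post_text: str, persona: str) -> str:
    """Generate one GPT-4 response to a Reddit post under a given persona."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": post_text},
        ],
        temperature=0.7,
    )
    return completion.choices[0].message.content

# Example usage: one response per persona for a single post.
post = "I've been feeling really isolated lately and can't talk to anyone about it."
responses = {name: respond(post, name) for name in PERSONAS}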
The study further explored how GPT-4 could infer demographic attributes, such as ethnicity, age, and gender, from text without explicit supervision. A few-shot perception experiment was conducted where GPT-4 was asked to predict these attributes based on a post's content.
The results showed that while the demographic labels inferred by GPT-4 did not always match the poster’s self-identification, they reflected a model of how respondents in peer support might perceive a post’s author.
The authors manually verified GPT-4’s demographic predictions, finding 94% agreement on race, 84% on age, and 81% on gender with human annotators, indicating the model’s reliability in demographic inference.
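A minimal sketch of such a few-shot perception prompt is shown below, again using the OpenAI chat completions API. The exemplar posts, label format, and instruction wording are assumptions for illustration rather than the authors' actual prompt.

# Minimal sketch of few-shot demographic inference; exemplars and label
# format are assumptions, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = [
    ("Post: As a new mom in my late twenties, I feel exhausted all the time...",
     "Gender: female | Age: 25-34 | Race: unknown"),
    ("Post: Growing up as a Black man, I was taught never to show weakness...",
     "Gender: male | Age: unknown | Race: Black"),
]

def infer_demographics(post_text: str) -> str:
    """Ask GPT-4 to guess the poster's gender, age bracket, and race from text alone."""
    prompt = (
        "Infer the likely gender, age bracket, and race of each post's author. "
        "Answer as 'Gender: ... | Age: ... | Race: ...', using 'unknown' when "
        "an attribute cannot be inferred.\n\n"
    )
    for example_post, labels in FEW_SHOT_EXAMPLES:
        prompt += f"{example_post}\n{labels}\n\n"
    prompt += f"Post: {post_text}\n"
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # keep label output stable across runs
    )
    return completion.choices[0].message.content

print(infer_demographics("I just turned 60 and my kids never call me anymore."))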
Experiment Methodology
The authors evaluated the empathy and bias in mental health responses from both human and GPT-4 sources. Two licensed clinical psychologists assessed 50 Reddit posts, randomly paired with either a peer-to-peer or GPT-4 response, to evaluate empathy using the EPITOME framework and motivational interviewing criteria.
Clinicians rated the responses on warmth, understanding, and exploration of the seeker’s feelings and also measured how much the response encouraged change. A manipulation check followed, where clinicians guessed the percentage of AI-generated responses.
Additionally, an automatic evaluation using fine-tuned classifiers based on the robustly optimized bidirectional encoder representations from transformers pretraining approach (RoBERTa) predicted empathy levels on a held-out dataset with high accuracy.
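As an illustration of this kind of automatic scoring, the sketch below loads a hypothetical fine-tuned RoBERTa checkpoint with the Hugging Face transformers library and predicts an empathy level for a post-response pair. The checkpoint path and the three-level label scheme are assumptions, not the authors' released classifiers.

# Minimal sketch of RoBERTa-based empathy scoring; the checkpoint path and
# 0-2 label scheme are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/finetuned-roberta-empathy"  # hypothetical fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def empathy_level(seeker_post: str, response: str) -> int:
    """Predict an empathy level (0 = none, 1 = weak, 2 = strong) for a response."""
    inputs = tokenizer(seeker_post, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())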
The study also examined demographic bias by testing fairness between groups using statistical measures like demographic parity.
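In this setting, demographic parity can be read as requiring that the average predicted empathy level be roughly equal across subgroups. A minimal sketch of that check, assuming a simple list of (group, score) pairs, follows.

# Minimal sketch of a demographic parity check over predicted empathy scores;
# the (group, score) input format is illustrative.
from collections import defaultdict

def demographic_parity_gap(scored_responses):
    """Return per-group mean empathy and the largest between-group gap."""
    totals, counts = defaultdict(float), defaultdict(int)
    for group, score in scored_responses:
        totals[group] += score
        counts[group] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    return means, max(means.values()) - min(means.values())

means, gap = demographic_parity_gap([
    ("Black", 1.0), ("Black", 0.0), ("White", 2.0), ("White", 1.0), ("unknown", 1.0),
])
print(means, gap)  # parity holds when the gap is close to zero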
The demographic leaking experiment assessed how GPT-4’s responses were influenced by implicit or explicit demographic cues in Reddit posts. A counterfactual human evaluation was conducted where posts were transformed to reveal or imply the poster’s gender or race, and responses were gathered through Amazon Mechanical Turk.
Results and Analysis
The study evaluated GPT-4’s performance in providing mental health support, comparing it with human peer-to-peer responses. Clinical evaluations revealed that GPT-4 often showed higher empathy, especially in emotional reactions and exploration, though its interpretation of patient experiences was weaker due to the absence of lived experience.
Clinicians noted that GPT-4 was effective at encouraging positive change but could sometimes appear overly direct, potentially perceived as "talking down" to patients.
In terms of demographic fairness, GPT-4 exhibited less variation in empathy across racial and gender subgroups compared to human responses.
Human peer-to-peer responses showed more empathy when demographic attributes were implied rather than explicitly stated, though Black posters received lower empathy overall. GPT-4 responses, on the other hand, showed significantly lower empathy for Black posters compared to White or unidentified groups, especially in certain prompts.
Further analysis of GPT-4, GPT-3.5, and Mental-LLaMa models confirmed that AI responses can amplify racial biases seen in human peer-to-peer interactions. Notably, GPT-3.5 performed better than GPT-4 in some areas.
As a mitigation, explicitly instructing models to consider demographic attributes reduced disparities in GPT-4 responses, though the approach showed mixed results for other models.
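A minimal sketch of such a demographically "aware" instruction is shown below; the wording is an assumption for illustration, not the prompt used in the study.

# Minimal sketch of prepending an explicit demographic-awareness instruction
# to a persona prompt; the instruction text is an illustrative assumption.
AWARE_INSTRUCTION = (
    "The author of the following post self-identifies as {race}. "
    "Keep this context in mind and respond with the same warmth and empathy "
    "you would offer any other poster."
)

def aware_system_prompt(base_persona: str, race: str) -> str:
    """Combine the awareness instruction with an existing persona prompt."""
    return AWARE_INSTRUCTION.format(race=race) + " " + base_persona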
Conclusion
In conclusion, the authors evaluated the use of LLMs like GPT-4 in mental health support, focusing on empathy and bias across demographic groups. Findings revealed that GPT-4 responses often showed higher empathy than human responses, particularly in emotional reactions and exploration.
However, significant racial disparities were observed, with lower empathy shown to Black and Asian posters. Bias could be mitigated by explicitly instructing the model to consider demographic attributes.
The researchers proposed guidelines for developers to reduce biases in LLM-based mental health technologies and ensure equitable care in psychotherapy applications.
Journal Reference
Gabriel, S., Puri, I., Xu, X., Malgaroli, M., & Ghassemi, M. (2024). Can AI Relate: Testing Large Language Model Response for Mental Health Support. Findings of the Association for Computational Linguistics: EMNLP 2024, 2206–2221. doi: 10.18653/v1/2024.findings-emnlp.120. https://aclanthology.org/2024.findings-emnlp.120.pdf