FairEval: Evaluating Fairness in LLM-Based Recommendations with Personality Awareness


School of Computer Science and Engineering, Beihang University, Beijing, China
Department of Electrical and Computer Engineering, McGill University, Montreal, Canada
Under review at a top-tier ACM conference in recommender systems🎉

Abstract

Recent advances in Large Language Models (LLMs) have enabled their application to recommender systems (RecLLMs), yet concerns remain regarding fairness across demographic and psychological user dimensions. We introduce FairEval, a novel evaluation framework to systematically assess fairness in LLM-based recommendations. Unlike prior benchmarks that focus solely on demographic attributes, FairEval integrates personality traits with eight sensitive demographic attributes, including gender, race, and age, enabling a comprehensive and nuanced assessment of user-level bias. We evaluate state-of-the-art models, including ChatGPT 4o and Gemini 1.5 Flash, on music and movie recommendation tasks using structured prompts. Under FairEval’s personality-aware fairness metric, PAFS, the models achieve consistency scores of up to 0.9969 (ChatGPT 4o) and 0.9997 (Gemini 1.5 Flash), indicating largely equitable recommendations across diverse user profiles, while the framework also uncovers fairness gaps, with SNSR disparities reaching up to 34.79%. Our results further reveal disparities in recommendation consistency across user identities and prompt formulations, including typographical and multilingual variations. By integrating personality-aware fairness evaluation into the RecLLM pipeline, FairEval advances the development of more inclusive and trustworthy recommendation systems.



Introduction

Recommender systems driven by large language models (LLMs) are reshaping personalized content delivery, yet concerns about fairness in these systems, especially across demographic and psychological user dimensions, persist. FairEval addresses these concerns with an evaluation framework that integrates both personality traits and demographic attributes, providing a more comprehensive fairness assessment. Applied to state-of-the-art LLMs such as ChatGPT 4o and Gemini 1.5 Flash, the framework evaluates fairness in recommendations through a range of metrics, including personality-aware fairness indicators and traditional demographic fairness measures.



Figure 1: An illustration of FairEval-generated movie recommendations under different prompt types.

The Framework of FairEval

FairEval: A Framework for Evaluating Fairness in LLM-Based Recommender Systems. The framework analyzes recommendations from ChatGPT and Gemini across demographic attributes (e.g., age, gender, race, religion) using the fairness metrics Jaccard@K, SERP*@K, PRAG*@K, and PAFS@K, together with the disparity indicators SNSR and SNSV. FairEval enables systematic assessment and comparison of model behavior to identify and mitigate biases in AI-driven recommendations.

Figure 2: The framework of FairEval.
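To make the disparity indicators concrete, the sketch below computes Jaccard@K between a neutral-prompt recommendation list and each sensitive-attribute list, then derives SNSR and SNSV. How FairEval aggregates these values internally is not specified on this page; following common usage in prior fairness benchmarks, SNSR is taken here as the max-min range of the per-group similarities and SNSV as their standard deviation, and the item lists are hypothetical.

# Minimal sketch of the list-overlap metrics named above (assumptions noted):
# Jaccard@K between a neutral list and each sensitive-prompt list, then
# SNSR (range) and SNSV (standard deviation) over per-group similarities.
import statistics

def jaccard_at_k(neutral, sensitive, k):
    """Jaccard similarity between the top-K items of two ranked lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    return len(a & b) / len(a | b) if a | b else 1.0

def snsr_snsv(neutral, lists_by_group, k):
    """SNSR: max-min range of neutral-vs-group similarity.
    SNSV: standard deviation of those similarities."""
    sims = [jaccard_at_k(neutral, recs, k) for recs in lists_by_group.values()]
    return max(sims) - min(sims), statistics.pstdev(sims)

# Hypothetical top-5 movie lists for illustration only.
neutral = ["Inception", "Parasite", "Amelie", "Coco", "Heat"]
by_group = {
    "female": ["Inception", "Parasite", "Coco", "Roma", "Heat"],
    "male":   ["Inception", "Heat", "Parasite", "Drive", "Alien"],
}
snsr, snsv = snsr_snsv(neutral, by_group, k=5)
print(f"SNSR={snsr:.4f}, SNSV={snsv:.4f}")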



The Framework of FairEval Prompts

FairEval Prompt-Based Fairness Evaluation. The framework evaluates LLM-generated recommendations under three prompt types: neutral, identity-based, and intersectional. Recommendations from GPT-4o and Gemini 1.5 Flash are analyzed for fairness, with results informing bias mitigation efforts.

Figure 3: The framework of FairEval prompts.
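As an illustration of the three prompt families described above, the sketch below builds neutral, identity-based, and intersectional prompts. The exact wording FairEval uses is not reproduced on this page, so the templates, attribute values, and the anchor artist are illustrative assumptions.

# Illustrative prompt templates (assumed wording, not FairEval's exact text).
NEUTRAL = "I am a fan of {anchor}. Please recommend 10 {domain} similar to it."
IDENTITY = "I am a {attribute} fan of {anchor}. Please recommend 10 {domain} similar to it."
INTERSECTIONAL = ("I am a {attribute1} {attribute2} fan of {anchor}. "
                  "Please recommend 10 {domain} similar to it.")

def build_prompts(anchor, domain, attrs, attr_pairs):
    """Return (neutral, identity-based, intersectional) prompt lists."""
    neutral = [NEUTRAL.format(anchor=anchor, domain=domain)]
    identity = [IDENTITY.format(attribute=a, anchor=anchor, domain=domain)
                for a in attrs]
    intersectional = [INTERSECTIONAL.format(attribute1=a, attribute2=b,
                                            anchor=anchor, domain=domain)
                      for a, b in attr_pairs]
    return neutral, identity, intersectional

# Hypothetical attribute values for a music recommendation probe.
n, i, x = build_prompts("Adele", "songs", ["female", "Muslim"], [("young", "African")])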



Results

Evaluation of LLM-generated music recommendations based on prompt sensitivity. This figure compares recommendations from ChatGPT 4o and Gemini 1.5 Flash across three prompt types: Neutral, Sensitive Attribute I (demographic-based), and Sensitive Attribute II (demographic + occupational). The right panel summarizes user attributes (e.g., age, gender, continent, religion) used to contextualize the fairness evaluation. The observed patterns highlight degrees of alignment or dissimilarity between user identity and recommended content.

Figure 4: Results of LLM-generated movie and music recommendations.
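PAFS@K is referenced throughout but not formally defined on this page. Purely as a hedged reading consistent with the reported scores (higher means more consistent recommendations), the sketch below treats it as the mean top-K overlap between a baseline list and the lists produced under different personality profiles; the function name pafs_at_k and the example lists are assumptions.

# Assumed reading of PAFS@K: mean top-K overlap between a baseline list
# and lists generated under each personality profile (higher = more consistent).
def overlap_at_k(base, other, k):
    """Jaccard overlap between the top-K items of two ranked lists."""
    a, b = set(base[:k]), set(other[:k])
    return len(a & b) / len(a | b) if a | b else 1.0

def pafs_at_k(base, lists_by_personality, k):
    """Average the baseline-vs-personality overlaps across profiles."""
    sims = [overlap_at_k(base, recs, k) for recs in lists_by_personality.values()]
    return sum(sims) / len(sims)

# Hypothetical top-5 song lists for illustration only.
base = ["Blinding Lights", "Levitating", "Halo", "Hello", "Believer"]
by_personality = {
    "high_openness":    ["Blinding Lights", "Levitating", "Halo", "Hello", "Believer"],
    "high_neuroticism": ["Blinding Lights", "Levitating", "Halo", "Hello", "Radioactive"],
}
print(f"PAFS@5 = {pafs_at_k(base, by_personality, k=5):.4f}")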



Robustness Results of ChatGPT 4o and Gemini 1.5 Flash

Robustness of ChatGPT 4o (top) and Gemini 1.5 Flash (bottom) under prompt perturbations. The left subfigures show fairness evaluation results when sensitive attributes contain typographical errors, while the right subfigures present outcomes when prompts are translated into French. These settings assess how both models respond to linguistic noise and multilingual input. Gemini 1.5 Flash demonstrates heightened sensitivity to typographical distortions and reduced fairness consistency under French prompts.

Figure 5: Robustness of LLMs under prompt perturbations.
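The two perturbations described above can be reproduced with a simple harness. The sketch below injects a typographical error by swapping two adjacent characters (an assumption about how the distortions were generated; FairEval's exact procedure is not given here) and shows a fixed French rendering of the same prompt as the multilingual probe.

# Sketch of the two perturbation settings (typo injection is an assumed mechanism).
import random

def add_typo(text, rng=None):
    """Swap two adjacent characters to simulate a typographical error."""
    rng = rng or random.Random(0)
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

prompt = "I am a Muslim fan of Adele. Please recommend 10 songs."
print(add_typo(prompt))  # the same prompt with one adjacent-character swap
# Fixed French rendering of the prompt, used as the multilingual probe:
prompt_fr = "Je suis un fan musulman d'Adele. Veuillez recommander 10 chansons."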



Conclusion

In this work, we introduced the FairEval framework, which provides a comprehensive approach to evaluating fairness in LLM-based recommender systems. By incorporating both demographic attributes and personality traits, FairEval offers a more nuanced and reliable assessment of recommendation fairness. Our experiments with ChatGPT 4o and Gemini 1.5 Flash across music and movie recommendation tasks highlight significant disparities in fairness, underscoring the need for personality-aware metrics like PAFS. The robustness tests reveal the vulnerability of LLMs to prompt variations, which emphasizes the importance of considering linguistic noise and multilingual inputs in real-world applications. FairEval paves the way for more equitable and transparent recommender systems in the age of AI-driven content delivery.

BibTeX


@article{sah2025faireval,
  title={FairEval: Evaluating Fairness in LLM-Based Recommendations with Personality Awareness},
  author={Sah, Chandan Kumar and Lian, Xiaoli and Xu, Tony and Zhang, Li},
  journal={arXiv preprint arXiv:2504.07801},
  year={2025}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.