Emotional intelligence (if it even exists) is a skill we consider important in teachers, care providers, and colleagues. So why not in technology? But can large language models like ChatGPT understand, recognise, assess, or regulate emotions? And, perhaps more impressively, can they pass standardised tests about them? A new study published in Communications Psychology investigated this, with surprisingly positive results.
The researchers had six leading language models – including ChatGPT-4, Claude 3.5 and Gemini 1.5 – complete five standardised emotional intelligence tests. These tests measure, for example, whether you understand why someone feels a certain way, or what would be a good way to deal with your own or others' emotions. The result: the models scored an average of 81% correct, while human participants in earlier validation studies averaged 56%. So ChatGPT-4 was better at recognising and regulating emotions than the average person – at least on these standardised tests.
But it didn’t stop there. The researchers then asked ChatGPT-4 to generate new test items in the same style and structure as the original tests. These ‘AI tests’ were presented to 467 participants, who did not know the questions came from a language model. Here too the results were striking: the ChatGPT versions were just as difficult, realistic, and clear as the originals. Internal consistency was comparable, and the differences in validity and clarity were statistically small (all below a Cohen’s d of 0.25).
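That “statistically small” criterion is the standardised mean difference, Cohen’s d: the gap between two group means divided by their pooled standard deviation, with |d| < 0.25 counting as a negligible-to-small effect here. A minimal sketch in Python – the item-difficulty scores below are purely illustrative, not the study’s data:

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: standardized mean difference between two samples,
    using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Hypothetical proportion-correct scores per item for original vs.
# ChatGPT-generated test versions (illustrative values only).
original = [0.55, 0.60, 0.52, 0.58, 0.61]
generated = [0.56, 0.60, 0.53, 0.58, 0.61]

d = cohens_d(generated, original)
print(f"Cohen's d = {d:.3f}")  # a value below 0.25 would count as small here
```

A d of 0.25 means the group means differ by a quarter of a standard deviation – small enough that, in the study’s equivalence framework, the original and generated tests behave interchangeably in practice.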
Of course, there are caveats. Some ChatGPT items were similar in content to existing questions (though rarely verbatim), and there are legitimate questions about what the scores really mean. A good test result does not mean that a model understands anything, let alone feels anything, the way people do. And empathy is more than answering multiple-choice questions correctly.
Yet the authors’ conclusion is clear: if we define emotional intelligence as the ability to reason correctly about feelings, LLMs score remarkably well. That opens up perspectives for applications in healthcare, education, and HR. At the same time, it invites philosophical reflection on the difference between human and artificial empathy – and on what, exactly, remains uniquely human about it.
Abstract of the study:
Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen’s d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.