Sometimes it is tempting to blog only about big, bold new insights. Personally, however, I find it equally important to highlight solid studies that add nuance to what we already know: research that helps us better understand what actually happens when we use a particular didactic approach. This post is one of those.
Peer assessment is one of those ideas that sounds wonderful in theory: students judge each other’s work, take more ownership, and deepen their understanding of the criteria. Yet one question keeps returning: how fair are students to one another? And more specifically: do they give their friends higher scores than they deserve?
A new study by Mitsuko Tanaka in Japan examined exactly this issue. First-year university students rated each other’s English presentations. A week later, they completed a questionnaire indicating how well they knew each peer and who counted as a friend. Tanaka then applied advanced models to examine what drove the scores. This was not a quick classroom survey but a substantial analysis using Rasch modelling*, crossed random-effects models**, and other statistical tools. For anyone working on the reliability of peer assessment, this is interesting because it does not begin with ideals but with variability and potential bias.
So what did the study find? There is a friendship bias, but it is small. For every step from “I barely know this classmate” to “we are friends outside class”, the score increased by about 0.16 points out of 35. That is almost negligible. A bit like saying someone you like smiles a little more when you walk by. Detectable, but hardly dramatic.
A second finding is that female students gave slightly higher ratings than male students. Again, a small effect. No major shifts, no structural unfairness. And interestingly, the presenter’s gender did not matter. There was no evidence that boys judged boys more harshly or that girls were softer on girls. The common fear that such biases severely distort peer evaluation was not supported.
But perhaps the most important result is not the bias itself but the reliability of the ratings. The correlation between student and teacher assessments was .67, which is right in line with what meta-analyses in this field typically find. That strengthens the credibility of this dataset. The scores were also internally consistent. In other words, although small biases exist, peer assessment in this study behaved very much as earlier research would lead us to expect.
Is the result universally applicable? Probably not. These were Japanese students in a culture where harmony and group cohesion play a stronger role than in many European classrooms. The context was non-anonymous, the task involved oral presentations, and the gender distribution was uneven. At the same time, the underlying lesson is recognisable: human judgements are human judgements. They are never entirely objective, but they are also not as erratic as we sometimes fear.
What does this mean for classroom practice? Generally, peer assessment can be useful, as long as its purpose is clear. If it becomes high stakes, if it directly affects grades or progression, then caution is needed, and triangulation with other sources is essential. If the goal is learning rather than grading, it does not need to be a perfect system. Students learn by watching, comparing, and working with criteria. Minor deviations caused by friendship or personality do not fundamentally change that.
The study, therefore, offers reassurance. There is bias, but it is not dramatic. Students assess each other reasonably fairly, even when they know exactly who they are judging. The variation in strictness between individual raters appears to be larger than any friendship effect. You lose more reliability through one very strict or one very lenient student than through friendship. That may be the most important lesson: human judgement is by definition variable, but that does not make it worthless.
Image: https://commons.wikimedia.org/wiki/File:Peer_Evaluation_%2827478%29_-_The_Noun_Project.svg
* In case you are wondering: A Rasch model is a statistical model used to analyse test or questionnaire data by placing both item difficulty and participant ability on the same scale. The model assumes that the probability of a correct response depends only on the difference between a person’s ability and an item’s difficulty. This makes it possible to compare and scale items fairly, regardless of which students responded to which items.
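To give a feel for what that means, here is the basic dichotomous form of the model, the textbook version rather than necessarily the exact variant used in this study (rating data such as these presentation scores usually call for a rating-scale or many-facet extension of the same idea). With \(\theta_p\) for the ability of person \(p\) and \(b_i\) for the difficulty of item \(i\):

\[ P(X_{pi} = 1) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)} \]

The further a person’s ability lies above an item’s difficulty, the closer the probability of success gets to 1; when ability and difficulty are equal, it is exactly 0.5.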
** Crossed random effects are used when two (or more) independent sources of variation each influence the outcome without being nested within one another. For example, students complete multiple tasks and tasks are completed by multiple students. In such a model, both the student and the task receive their own random effect. This allows you to estimate their separate contributions to performance differences in a reliable way.
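As a rough sketch of the example in this footnote (students and tasks; the symbols are purely illustrative, not taken from the study), such a model can be written as:

\[ y_{ij} = \mu + u_i + v_j + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \quad v_j \sim N(0, \sigma_v^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2) \]

Here \(y_{ij}\) is the score of student \(i\) on task \(j\), \(u_i\) captures how well (or, in the rater case, how strictly) student \(i\) performs overall, and \(v_j\) captures how easy or difficult task \(j\) is. Because every student meets several tasks and every task is met by several students, the two random effects are crossed rather than nested.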