Do student evaluations measure teaching effectiveness?

March 29, 2014

For a couple of weeks now, there has been an energetic discussion on the Higher Education Teaching and Learning discussion board on LinkedIn around the question, “Do student evaluations measure teaching effectiveness?” In the course of some 335 comments, including several of mine, the discussion has predictably gone around the circle several times.

There seem to be two main points at issue. First, ought students to be offering up opinions about the effectiveness of their instructors? Second, if such opinions are useful, how ought they to be gathered, and what use ought to be made of them?

In any discussion of this size, there is unlikely to be a clear consensus. There does seem to be general agreement that students are certainly entitled to their opinions, although there is less consensus that such opinions actually mean anything in pedagogical terms, and still less regarding how they ought to be applied. It is clear from some of the stories relayed that such evaluations are often used for punitive or leverage purposes rather than for potential instructor improvement. But there is also a sense that student feedback can be helpful if collected appropriately and administered fairly.

The usual approach is a one-shot survey administered at the end of the class, asking for students’ summary ratings of different aspects of class content and behavior. These surveys are seldom based on any explicit pedagogical model or tested systematically. They usually consist of a few items with some face credibility. They persist more because some value may emerge out of comparability across time than because anyone really thinks that the items themselves say anything useful. Even when multiple questions are asked, they are generally so highly intercorrelated as to be indistinguishable. This can be quite frustrating to the instructor trying to tease out of the overall pattern things that s/he ought to do more or less of, or the supervisor trying to mine these data for advice to offer the developing instructor.
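To make the intercorrelation point concrete, here is a minimal sketch in Python. The three item names and all of the responses are invented purely for illustration; they are not taken from any real evaluation form. It computes the ordinary Pearson correlation between each pair of items.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 1-5 ratings from eight students on three survey items.
# These numbers are made up for illustration only.
responses = {
    "clarity":      [5, 4, 4, 3, 2, 5, 1, 3],
    "organization": [5, 4, 5, 3, 2, 4, 1, 3],
    "enthusiasm":   [4, 4, 5, 3, 2, 5, 1, 2],
}

# Correlation for every pair of items.
pairwise = {
    (a, b): pearson(responses[a], responses[b])
    for a, b in combinations(responses, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

When every pair comes out this close to 1, the separate items are effectively one overall impression asked three times, which is why they offer so little guidance about what specifically to change.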

Obviously, this approach is methodologically unsound. One-shot retrospective-recall surveys are notoriously skewed both by the most recent experiences and by completely extraneous factors such as personal mood, health issues, and the nature of the commute that day. No competent survey researcher, or even a would-be researcher taking his/her first class in research methods, would ever hang a finding of any importance on such a one-shot measure. It’s an indicator of the lack of understanding about research that such survey results are still considered by administrators to be in any way credible, let alone credible to the degree of hanging career-threatening decisions on them. This approach fundamentally disrespects the data by coercing from them a story that they are unequipped to tell – which, as you may recall, I find unacceptable.

There is a large body of research demonstrating exactly this, including studies specifically directed to the question of class evaluations. So why do we persist, if we know going in that the results are basically worthless? I recall yea many years ago when I was taking my first survey research class. At the end of the term, when the professor was distributing the evaluation survey, several of us pointed out that we’d just finished discussing how unreliable such measures were, so why would we even be doing this? Our professor sighed and suggested that we ask the professor in the organizational behavior class that many of us were also taking; the answer was more to be found under the heading of organizational pathology than research methods. Administrators just go with the cheapest, easiest, and most familiar measures they can find.

Some people worry that any student evaluation results are going to be biased by class rigor, as students go easy on instructors who are easy on them and punish stricter grading. Actually, research indicates that most students give somewhat higher overall ratings to the teachers who really help them learn something important rather than those who are merely easy. But when opinions are aggregated across a large number of students, each of whom may have a somewhat different basis for his/her opinion, this point may get lost. Any set of ratings will have both a central tendency and a distribution. Not infrequently, the distribution is more informative than the central tendency. If all the students’ ratings of an instructor are concentrated in a narrow range, that may say one thing; if s/he is rated highly by some and toward the bottom by others, that may say something different.
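A small Python sketch illustrates how the same average can hide very different distributions. The two sets of ratings below are hypothetical, invented for this example, and `summarize` is just an illustrative helper name.

```python
from collections import Counter
from statistics import fmean, pstdev


def summarize(ratings):
    """Return the mean, the spread, and the count of each score from 1 to 5."""
    counts = Counter(ratings)
    return {
        "mean": fmean(ratings),
        "spread": pstdev(ratings),  # population standard deviation
        "histogram": {score: counts.get(score, 0) for score in range(1, 6)},
    }


# Hypothetical 1-5 ratings for two instructors, made up for illustration.
consensus = [3, 3, 3, 3, 3, 3, 2, 4, 3, 3]   # nearly everyone says "average"
polarized = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]   # half love, half hate

for label, ratings in (("consensus", consensus), ("polarized", polarized)):
    print(label, summarize(ratings))
```

Both instructors average exactly 3, so a report showing only the mean would treat them as identical. The spread and the histogram are what reveal that the second instructor is dividing the class.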

Even those who value student input worry about whether the value of a particular class is always immediately apparent to them. The utility of foundation classes, for example, may not be evident until the student later understands what the foundation is for, and thus the evaluations of those classes (and by extension the instructors who teach them) may be biased downward. So some have suggested that students ought not to be asked about the utility of a particular class until later on in their program. Of course, this is likely to be even more biased by selective recall. There might be some value to a student’s comparative rating of a class or instructor; that is, whether a particular class/instructor was more or less useful to the student’s learning than what s/he had usually encountered. However, it’s unlikely that the students’ criteria for making such assessments would remain stable over time. The more you learn, the better you become at determining what it is that helps or retards your learning.

Longitudinal data are always more informative than one-shot data. In this case, it might be interesting to track linear trends in either the overall evaluation of instructors or the ratings of individual students, but this might not be allowed under student privacy rules. And of course, if a measure doesn’t mean anything, having more instances of it isn’t likely to add much illumination.
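For what it’s worth, tracking a linear trend amounts to fitting a least-squares line through the ratings over time. Here is a rough Python sketch, assuming one average rating per term; the term averages are invented for illustration and `trend_slope` is a hypothetical helper name.

```python
def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical average ratings for one instructor over five consecutive terms.
# These numbers are made up for illustration only.
term_averages = [3.2, 3.4, 3.5, 3.7, 3.9]

slope = trend_slope(term_averages)
print(f"Average change per term: {slope:+.2f} rating points")
```

A positive slope says only that the numbers are drifting upward. Whether that drift reflects better teaching is exactly the question the underlying measure cannot answer on its own.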

In short, after a great deal of back-and-forth by some very smart folks, the answer to the question originally posed is that teacher evaluations produce data. The degree to which these data can be assumed to be a proxy for actual teaching effectiveness is, however, highly debatable, being contingent on the subject, the setting, the context, and the phases of the moon, as well as whether the proverbial butterfly in the Amazon jungle (or the warehouse) flapped its wings today. But since there are definitely real consequences to these measures, the debate needs to be continued.