Course Evaluations (SETs)

It has been a few years since I have very carefully read student evaluations of the courses I teach. Why? That’s the easy question. By and large college students have no experience with anything, so why should I pay any attention to how they view the courses I teach or how I teach them? The difficult question is “Why do colleges pay so much attention to student course evaluations?” That’s a more interesting question.

It has been argued that course evaluations address four areas:

1) diagnostic FEEDBACK to faculty about the effectiveness of their teaching; 2) a measure of teaching effectiveness to be used in PERSONNEL DECISIONS; 3) information for students to use in INSTRUCTOR/COURSE SELECTION; 4) an outcome or a process description for RESEARCH ON TEACHING. Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11(3), 253-388.

Interestingly, there have been hundreds of research papers published on students’ evaluations of teaching effectiveness (SETs). The literature generally recognizes two purposes: (1) formative and (2) summative. Formative assessments deal with improving individual teaching methods, while summative assessments deal with evaluating the teacher. Before I delve more deeply into the literature, I have to confess that something bothers me about this. I have been teaching college classes for about 14 years and I would not be comfortable with giving college students control over much of anything. Why? First, in general, they are immature. There is a physiological question whether their brains are fully developed and, frankly, their behaviors are bizarrely predictable. For example, last week included two of the warmest days of the Spring, with highs in the 80s. Fully 40% of my students didn’t attend class because “sun.” I regularly supervise independent study courses with graduating seniors and getting them to work is like pulling teeth without pain-numbing medicine.

It is difficult to explain what bothers me about this. When I question SETs, supporters just mumble things like “students are fair” or “people have looked at this,” but they ALL refuse to engage the issue(s). How about this example? Let’s survey every attendee of a live play in the United States. We can ask them questions about the set, the actors, the story, the director, the venue, etc. What do we learn? If I were to take such a survey, I confess that I really have no expertise in anything related to theater. I might be able to identify really bad acting, but beyond that, I’m not sure. I’ve been to many plays and theater performances, but I am not qualified to tell you much more than that I liked this play or didn’t like that one. The question then becomes: is there a relevant explanation behind why I like one play, but not another? College administrators would emphatically argue yes.

For example, “If student ratings influence personnel decisions, it is recommended that only crude judgments (for example, exceptional, adequate, and unacceptable) of instructional effectiveness be used (d’Apollonia and Abrami 1997). Because there is no single definition of what makes an effective teacher, committees and administration should avoid making fine discriminating decisions; for example, committees should not compare ratings across classes because students across classes are different and courses have different goals, teaching methods, and content, among other characteristics (McKeachie 1997).” Algozzine, B., Gretes, J., Flowers, C., Howley, L., Beattie, J., Spooner, F., … & Bray, M. (2004). Student evaluation of college teaching: A practice in search of principles. College Teaching, 52(4), 134-141. These authors continue by noting that “[s]till other authors are critical of using any aggregate measures of teaching performance (Damron 1995; Haskell 1997a; Mason, Steagall, and Fabritius 1995; Sproule 2000; Widlak, McDaniel, and Feldhusen 1973). They argue that an effective teaching metric does not exist and that students’ opinions are not necessarily based on fact or valid (Sproule 2000). Haskell (1997a) suggested that SET infringes on the instructional responsibilities of faculty by providing a control mechanism over curriculum, content, grading, and teaching methodology, which is a serious, unrecognized infringement on academic freedom.” Algozzine et al. (2004).

Like jurors, it appears that students take their responses seriously. Spencer, K. J., & Schmelkin, L. P. (2002). Student perspectives on teaching and its evaluation. Assessment & Evaluation in Higher Education, 27(5), 397-409. However, like juries, students can be wrong or mistaken. These authors further note in the conclusions that students “wish to have an impact but their lack of (a) confidence in the use of the results; and (b) knowledge of just how to influence teaching, is reflected in the observation that they do not even consult the public results of student ratings.”

Jackson, M. J., & Jackson, W. T. (2015). The Misuse of Student Evaluations of Teaching: Implications, Suggestions and Alternatives. Academy of Educational Leadership Journal, 19(3), 165, examined “how to best use [teaching evaluations] as an indicator of teaching effectiveness.” (at p. 167). The authors address this question resignedly, recognizing that administrators will continue to use SETs as summative measures of teaching quality. Those authors further recommend that SETs, when used in a summative capacity, measure only three broad categories: below average, average, and above average. (at pp. 167-68). The sample examined by these authors was found not to be normally distributed, which makes it problematic to use the mean class score as a measure of effectiveness/quality. (at pp. 168-70). The authors conclude that SETs should likely be used as a formative assessment for improvement (as originally intended), but if used as a summative measure of performance, that “[i]t is strongly suggested that the current practice of comparing an instructor’s average score to the average of the department or college be avoided. Instead either a global score or an average of individual dimensions of the SET should be normalized. From this distribution identify the outliers, the faculty members scoring above or below one standard deviation from the normalized mean. It is suggested that those faculty members scoring within the mid or average category be viewed as scoring the same. Statistically, the scores of these individuals are not significantly different. The outliers should be considered and either recognized as exceptionally strong and/or weak.” (at p. 171).
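
The three-category suggestion quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads “normalized” as a plain z-score (score minus the group mean, divided by the group standard deviation), which may not be the transformation Jackson and Jackson had in mind, and the function name, instructor labels, and scores are all hypothetical.

```python
from statistics import mean, pstdev


def classify_set_scores(scores):
    """Sort instructors into three broad bands by z-score.

    Assumption: "normalized" is read here as a simple z-score against the
    group mean and (population) standard deviation. Anyone within one
    standard deviation of the mean is treated as "average".
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    labels = {}
    for name, score in scores.items():
        # If every score is identical there is no spread, so nobody is an outlier.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        if z > 1:
            labels[name] = "above average"
        elif z < -1:
            labels[name] = "below average"
        else:
            labels[name] = "average"
    return labels


# Hypothetical mean SET scores on a 5-point scale (made-up numbers).
example_scores = {"A": 4.1, "B": 3.9, "C": 4.0, "D": 4.9, "E": 2.8, "F": 4.2}
labels = classify_set_scores(example_scores)
for name, label in labels.items():
    print(name, label)
```

With these made-up numbers only D and E fall outside one standard deviation; everyone else lands in the same “average” band, which is the point of the recommendation: small differences in the middle are not treated as meaningful.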

As a previous blog post (“Learning Outcomes”) suggests, I have a problem with the last suggestion. My fear is that faculty have had time to adjust to being measured by student surveys and have been able to devise strategies, like Kip’s, to artificially increase SET scores at the expense of rigor, with diminished learning outcomes. Now, I am obviously making the assumption that a desired goal of attending college is learning and that faculty evaluations should promote that goal. To that end, McCallum, L. W. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Research in Higher Education, 21(2), 150-158, noted that “all of these measures attempt to assess degree of student learning as the primary criterion. The techniques range from the most-frequently used common final examinations across course sections to be evaluated (Centra, 1977; Orpen, 1980) to nationally prepared normative examinations (Gessner, 1973).” (at p. 151). One obvious problem with these approaches is whether they measure learning outcomes effectively.

For example, when I teach upper division courses, my goals have less to do with delivering information and facts. When I teach business law, my primary goal is issue identification rather than rote memorization of legal facts. This is premised on the ideas that (1) anyone can go online and figure out how the law generally treats a specific issue, (2) it is impossible to meaningfully teach a large proportion of the law related to specific legal issues, and (3) students forget much specific information rather quickly. Thus, if you gave a common, standardized exam to my class, they would likely perform worse than a class taught by an instructor who emphasized memorizing legal requirements. However, I would hope that when my students take their places in the work world, they will be able to observe a set of facts, deduce the potential legal issues inherent in those facts, and perform basic research to refine and resolve those issues.

With this in mind, Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature. Journal of Marketing Education, 31(1), 16-30, conducted a meta-analysis of research that examined the link between learning and SETs. The author started off by noting that “[e]ssentially, no one has given a widely accepted definition of what “good” teaching is, nor has a universally agreeable criterion of teaching effectiveness been established (J. V. Adams, 1997; Kulik, 2001).” (at p. 16).  Some researchers have even argued that good teaching and “the most” learning are not related (at p. 17).  The author further notes that, as with most aspects of SET research, the conclusions of researchers are mixed.  One interesting aspect is the apparent division between “pure” education researchers (those researchers largely employed in education colleges) and researchers who primarily work in business colleges and publish in more business-related journals. Educational researchers generally support use of SETs, while researchers outside education are generally skeptical and tend to believe that any relationship between SETs and learning is accidental (at p. 17).

I share the author’s view that “[i]f both learning and SET are related to good teaching, then SET should be found to be related to learning. A test of this assertion has been hindered by several methodological difficulties, the most fundamental of which is how learning can be measured”  (at p. 18). The author then continues by noting three prominent means used to measure student learning: the connection between SETs and grades, the student perception of learning, and the relationship between grades and learning (at p. 18).

I once surveyed three large sections of principles of micro and asked them what their primary motivation in class was. It was multiple choice and included learning, parental demands, and getting a high grade. 85% of students chose getting a high grade as their primary motivator. So, when the author notes that grades are important to students, I agree (at p. 18). The author’s main point in this discussion is the possibility of a quid pro quo with grades and SET scores. Another issue is the extent to which students reliably perceive their own learning. The author noted that prior research found that “[s]tudents’ perceived grades need not be strongly related to their actual grades (Baird, 1987; Clayson, 2005a; Flowers, Osterlind, Pascarella, & Pierson, 2001; Sheehan & DuPrey, 1999; Williams & Ceci, 1997),” (at p. 18). Some of this confusion may be related to what is now termed the Dunning-Kruger Effect. The third issue relates to whether or not actual grades reflect actual learning. The author finds the literature to suggest that students’ grades likely do not reflect students’ actual learning (at p. 19).

The question then becomes, “how do we measure student learning?” Clayson (2009) notes five suggested methods (at p. 19): (1) using mean class grades rather than individual grades, (2) common tests across multiple sections controlling for instructor variance, (3) difference in pre- and post-test scores, (4) performance in future classes controlling for student characteristics, and (5) using standardized, subject-specific tests.

Clayson spent a significant amount of the paper discussing the relationship between SETs and course rigor. First, he notes that a number of researchers have found that courses perceived as more rigorous received lower SET scores (at pp. 19-20). Five arguments were posited to explain these results. First, the negative relationship may be the result of methodological artifact (the author’s own research showed that the date of the survey mattered), (at p. 20). Second, the relative level of rigor may be a more appropriate measure than the absolute level of rigor, as students do not object if they believe the rigor is appropriate for the course (at p. 20). Third, students may choose to avoid courses known to be more rigorous. One paper, “Wilhelm (2004) compared course evaluations, course worth, grading leniency, and course workload as factors of business students choosing classes. Her findings indicated that ‘students are 10 times more likely to choose a course with a lenient grader, all else being equal’”, (at p. 21). Fourth, some researchers have found a chain where rigor is “positively related to the students’ perceptions of learning, but negatively linked to instructional fairness, which made its total effect on the [SET] negative,” (at p. 20). Finally, it is possible that students view education differently than researchers assume. “A survey of 750 freshmen in business classes revealed that almost 86% did not equate educational excellence with learning. More than 96% of the students did not cite ‘knowledgeable’ as a desirable quality of a good instructor (Chonko, Tanner, & Davis, 2002). Students do not generally believe that a demand for rigor is an important characteristic of a good teacher (Boex, 2000; Chonko et al., 2002; Clayson, 2005b). Furthermore, students seem to have decoupled their perception of grades from study habits,” (at p. 20).

Given these issues, Clayson (2009) conducted a meta-analysis of prior studies that used common examinations across classes. In general, results show a small, statistically insignificant, positive relationship between learning (as measured by common exam scores) and SET scores. The studies showing a strong positive relationship tend to be older and to come from education/psychology classes located in education or liberal arts colleges (at pp. 24-25). These results were between-class results and did not hold within-class (at p. 25).

The author provides a single summary explanation, “[o]bjective measures of learning are unrelated to the SET. However, the students’ satisfaction with, or perception of, learning is related to the evaluations they give,” (at p. 26). Finally, the author notes that “[t]o a certain extent, the explanation can be summed up by a rather dark statement about human behavior by the American journalist and author Donald R. P. Marquis, who once wrote, ‘If you make people think they’re thinking, they’ll love you. If you really make them think, they’ll hate you’ (as cited in Morley & Evertt, 1965, p. 237),” (at p. 27). This conclusion is supported by the anecdotal discussion of “Kip” in my earlier referenced post “Learning Outcomes.”

Clayson (2009) did note two papers that used common examination results to try to measure the connection between SETs and learning outcomes. The first, Soper, J. C. (1973). Soft research on a hard subject: Student evaluations reconsidered. The Journal of Economic Education, 5(1), 22-26, looked at economics classes that took the TUCE test (the Test of Understanding in College Economics) both pre- and post-course. The author found that SET scores did not significantly explain changes between the pre- and post-course TUCE scores, and in many cases the coefficients were negative.

In the second study, Marlin Jr, J. W., & Niss, J. F. (1980). End-of-course evaluations as indicators of student learning and instructor effectiveness. The Journal of Economic Education, 11(2), 16-27, the authors proposed an educational production function. The authors examined outputs that included “measures of cognitive performance: grade in the course, improvement in knowledge as indicated by test performance, ability to reason as indicated by performance on test questions requiring application of theory, and retention of knowledge over time. Other measures of output are indicated by changes in student attitudes and time spent on the course,” (at p. 17). Inputs included “three general categories; institutional (I), student (A and E), and teacher (T and V),” (at p. 17). For cognitive ability the authors examined course grades, examination scores, a “gap-closing” measure of pre- and post-TUCE scores, and TUCE scores obtained after a lapse of one semester.
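
For readers unfamiliar with “gap-closing” measures, here is a rough Python sketch of how such a measure is commonly computed. The formula is an assumption on my part: the share of the distance between the pre-test score and the maximum possible score that the student closes by the post-test. I have not confirmed that Marlin and Niss used exactly this definition, and the function name and the 30-point maximum are hypothetical.

```python
def gap_closing(pre, post, max_score):
    """Fraction of the pre-test "gap" (max_score - pre) closed by the post-test.

    Assumed definition: (post - pre) / (max_score - pre).
    Returns None when the pre-test is already a perfect score, because
    there is no gap left to close and the ratio is undefined.
    """
    if not (0 <= pre <= max_score and 0 <= post <= max_score):
        raise ValueError("scores must lie between 0 and max_score")
    if pre == max_score:
        return None
    return (post - pre) / (max_score - pre)


# Hypothetical student: 12 of 30 before the course, 21 of 30 after.
print(gap_closing(12, 21, 30))  # 0.5, i.e. half of the 18-point gap was closed
```

The appeal of a measure like this over a raw difference is that a student who starts near the top has little room to improve; dividing by the remaining gap puts strong and weak starters on a more comparable footing.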

Variable inputs measured relating to the instructors included “teacher personal attributes, and we include such matters as empathy for the student, ability to respond to student needs, effort at teaching the course in an understandable manner, and basic preparation in the subject matter. The course attributes include text selection, method of presentation, examination procedures and policies, and general difficulty of the course,” (at p. 19). The authors examined 289 students across 8 sections of economics classes in the Fall 1978 semester, (at p. 20). The authors found that student-specific variables explained most of student learning, (at p. 23). The authors further concluded that “if there is a correlation between educational output and student ratings of the variable teacher inputs, we can conclude that student evaluations can be used as surrogates for direct evaluation and do indeed measure the level of teacher input. Since the canonical correlations of student ratings and outputs are significant and since the canonical correlation coefficients are reasonably high, we conclude that student evaluations can be used to measure teacher effectiveness,” (at p. 24).

I have a couple of problems with these results. First, the authors suggest a model of a production function involving 5 outputs and multiple inputs that they characterize as fixed, variable, etc. However, they completely abandon this theoretical model as an estimation framework and use it, instead, as an argument for an ad hoc examination of certain variables that “should” influence certain other variables. Second, they use canonical correlations for estimating these relationships. Canonical correlations are rarely used in economics because correlation does not equal causation and it feels more like a “throw everything at the wall and see what sticks” procedure. Additionally, canonical correlations suffer from at least three limitations, “(1) the deficiency of the canonical Rc statistic as an indicator of the variance shared by the sets, (2) weight instability and correlation maximization, and (3) problems associated with attempts to partition the sets into correlated constructs,” Lambert, Z. V., & Durand, R. M. (1975). Some precautions in using canonical analysis. Journal of Marketing Research, 12(4), 468-475, (at p. 469). Finally, the authors state that multiple output variables (five) make it problematic to use multivariable regression techniques, but then freely combine output variables in their canonical analysis.

What are the conclusions from all this? First, there is a lot of existing research into SETs and I doubt any college administrator has taken the time to work through it. I have spent several hours over more than a week writing this short blog post and barely made more than a dent in the research record. Second, the results in the research record are clearly split between “SETs are highly valuable” and “SETs are completely worthless.” Third, the literature spans nearly 100 years and some more recent papers have suggested that students have changed over time. Just yesterday an article appeared in Gothamist touting perceived and recorded changes in youth labeled “millennials.” If, as some research has suggested, student attitudes toward learning are changing and have changed, then research on SETs from, say, 20-plus years ago is likely no longer useful. As Clayson (2009) pointed out, it was this body of research from education colleges that most strongly supported the use of SETs to evaluate learning. Finally, my own experience and observation from teaching college for the last 14 years is that, as teachers, we are largely the same. On a 5-point scale, most of us will regularly fall between 3.5 and 4.5, and the instructors who regularly rate below or above that range need to be investigated more carefully. Kip regularly scores close to 5. My beliefs/perceptions align with the general feel from the literature that suggests SETs, when used as a summative tool, should be used only to identify possible poor and excellent performers.

What about the formative role of SETs? In the past I have taken great steps to elicit student feedback to use to improve teaching and course delivery. I have encountered a few issues. Anecdotally, I have had students spend several minutes completing their SET in my class and offer very detailed recommendations. Once, when I was teaching at Colorado College, I adopted several of the recommendations given by one of my principles students. None of them worked. The reason they didn’t work was that he was pretty atypical. For example, most students don’t read the assigned text, or if they do, they do so without much effort or enthusiasm (I surveyed my students about their primary learning/study tools and 10% chose the assigned text). I have, however, received advice from students who ALWAYS diligently read the text. How valuable is such advice in general? So, yes, I am naturally skeptical when I read SET questions like “Objectives for course were clearly presented.”

I usually teach 400 to 500 students across 3 classes and I have found many (most?) students will not read the syllabus, and many (most?) struggle with paying attention. I receive around 2500 email messages from students during a typical semester and most of the questions are answered in the syllabus, in mass emails I have sent, in announcements posted on the course website, and by statements I have made during lectures. I wrote a research paper that showed students were much more productive when doing out-of-class activities as opposed to in-class activities, so I decided to “bite the bullet” and organize a variety of in-class, extra-credit activities. As soon as one such activity started, 40 out of 180 students got up and walked out the door. One student came to my office hours to complain that the exercise seemed pointless (admittedly, it’s hard to coordinate 180 students by yourself), but pointless? I followed up these exercises by asking the classes whether they preferred doing in-class exercises for extra credit or listening to me lecture, and 65% chose lecture.

I am convinced that students love chalk-and-talk precisely because it does not require them to think. They can sit, look at their smartphone, text their friends (#fomo), watch YouTube videos, etc. In-class exercises require thought, exertion, activity. When I was in Virginia I taught a Law and Economics class, which was largely lecture. Honestly, even I felt bored, so I read my SETs with trepidation. The students really liked the class! I continue to try new things to (hopefully) increase learning outcomes, but reading and acting on SETs is not high on my list.

