What AERA misses about value-added teacher evaluation

“Democracy is the worst form of government except all the others that have been tried.” Winston Churchill
Last week, the American Educational Research Association (AERA) issued a statement on the use of Value-Added Models (VAM) for teacher evaluation. As a proud AERA member, I’m both happy to see my profession demand responsible use of the tools of our trade and a little sad that AERA seems willing to throw the baby out with the bathwater.
What is VAM? A review
VAM uses student standardized test scores and multivariate statistical methods to evaluate student progress over the course of a school year. By using a control variable for the student’s previous year’s test score, VAM can measure how much gain a student has made during the current year and compare that gain to those made by all other similar students across a state. Say Bill is a white, male student from a rural area who got a 452 on Virginia’s 3rd grade math end-of-year exam. VAM will generate an expected score for Bill’s 4th grade math exam based on what other white, male students from a rural area who got a 452 on the 3rd grade exam got in the past. VAM holds that if Bill’s actual score is better than that expected score, Bill has demonstrated more growth than expected.
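The Bill example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the records in `HISTORY`, the group label, and the function names `expected_score` and `growth_vs_expected` are all invented here, and real VAM implementations generally rely on multivariate regression rather than the exact matching shown below.

```python
from statistics import mean

# Hypothetical historical records: (group, grade-3 score, grade-4 score).
# Every number here is invented for illustration only.
HISTORY = [
    ("white-male-rural", 452, 460),
    ("white-male-rural", 452, 470),
    ("white-male-rural", 452, 465),
    ("white-male-rural", 480, 495),
    ("black-female-urban", 452, 468),
]


def expected_score(group, prior_score, history=HISTORY):
    """Average grade-4 score of past students with the same group and prior score."""
    matches = [g4 for grp, g3, g4 in history if grp == group and g3 == prior_score]
    if not matches:
        raise ValueError("no comparable students in the historical data")
    return mean(matches)


def growth_vs_expected(group, prior_score, actual_score, history=HISTORY):
    """Positive result: the student scored above the expected score."""
    return actual_score - expected_score(group, prior_score, history)
```

Under these made-up numbers, `growth_vs_expected("white-male-rural", 452, 472)` compares a hypothetical 4th grade score of 472 for Bill against the average of his three comparable predecessors; a positive result means he showed more growth than expected.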
When we aggregate students to the classroom level, VAM becomes a method for teacher evaluation (and many other things, but that’s another post). Let’s say that Mrs. Reyes teaches Bill and 29 other 4th graders. Bill and 20 of his classmates made more progress (according to VAM) than their peers across the state. Most of us would agree that if 21 of 30 students make more progress than expected, Mrs. Reyes probably did a great job with those kids.
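The classroom-level step can be sketched the same way. The following is a minimal illustration, not how any state actually computes teacher scores: the function name `classroom_summary` and all of the scores are assumptions made up for this example, and the expected scores are taken as given rather than estimated.

```python
def classroom_summary(actual, expected):
    """Summarize how one class performed relative to expected scores."""
    if not actual or len(actual) != len(expected):
        raise ValueError("need one expected score per student")
    # Residual: how far each student landed above or below expectation.
    residuals = [a - e for a, e in zip(actual, expected)]
    n_above = sum(r > 0 for r in residuals)
    return {
        "n_students": len(residuals),
        "n_above_expected": n_above,
        "share_above_expected": n_above / len(residuals),
        "mean_residual": sum(residuals) / len(residuals),
    }


# Invented scores for a class of 30: 21 students beat expectation, 9 fall short.
expected = [460] * 30
actual = [465] * 21 + [455] * 9
summary = classroom_summary(actual, expected)
```

Here `summary["n_above_expected"]` is 21 of 30 students, mirroring the Mrs. Reyes example. Reporting the mean residual alongside the share is a design choice in this sketch: the share alone ignores how far above or below expectation each student landed.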
What AERA got right
The logic of VAM is compelling, but AERA rightly reminds us that the devil is in the details. To produce accurate results, VAM requires high-quality standardized tests that are designed with teacher evaluation in mind and administered consistently for years without changes. AERA points out that many states that have implemented VAM teacher evaluation simply do not have tests of sufficient quality, or are trying to use VAM after making significant changes to test content, which effectively means those states are comparing apples to oranges. Under these conditions, VAM will struggle to produce valid teacher evaluations.
AERA also tries to nip in the bud several ways in which policymakers and other observers have suggested extending VAM’s reach. As the statement notes, VAM should not be used to evaluate principals and teacher training programs because it cannot reliably isolate their effects. I would add that too many degrees of separation exist between the student and those levels for VAM to produce valid information on principal or teacher training program effectiveness. To hold a teacher responsible for a child’s learning in that teacher’s content area is fair. To hold a principal or a university’s college of education responsible for that learning, when neither has had enough direct contact with the student to explain it, is ludicrous. For similar reasons, VAM cannot support the more common practice of shared attribution, where all teachers in a school are judged on whether students make VAM progress in specific subjects. An art teacher teaches art and should not be held responsible for her students’ reading scores.
What AERA got wrong
While AERA’s statement correctly urges caution and criticizes sloppy VAM implementation, it ultimately left me disappointed. It focuses far too much on VAM’s negatives without any discussion of its positives or its accuracy relative to other leading methods of teacher evaluation.
The statement contains at least one strawman argument that only adds to the rising strain of anti-intellectual criticism of VAM. The statement warns that VAM should not be “used to have a high-stakes, dispositive weight in evaluations.” AERA well knows that not a single state has implemented, plans to implement, or is even considering a system of teacher evaluation with a teacher’s VAM score as its sole or even primary method. Even the most ardent VAM supporters understand that a multiple methods approach that includes teacher observation data is necessary. Not a single serious commentator would reduce teacher evaluation exclusively to students’ performance on one standardized test.
“All the others”
More importantly, the AERA statement offers its criticisms of VAM without a full description of the problems facing other methods of teacher evaluation. It calls for a substantial investment in evaluation methods such as “teacher observation data and peer assistance and review models that provide formative and summative assessments of teaching,” as if these methods were as new as VAM. They are not. Teacher observation and peer assistance and review have been the primary methods by which teachers have been evaluated for most of the history of public education in the United States.
The primary criticism of such methods is that they may be far more likely to give teachers favorable ratings than their performance warrants. In their seminal study, Jacob and Lefgren found that principals’ average rating of all their teachers was an 8.7 out of 10. In other words, principals rated the average teacher as exceptional. Even the very worst teachers in the study received a 6 out of 10, or above average. I have seen similar patterns in the districts with which I am familiar, where upwards of 90 percent of teachers are rated as excellent.
Lest I be accused of teacher bashing, let me explicitly state that I do not reject out of hand the possibility that 90 percent of teachers in the United States deserve to be rated as excellent. The United States spends shamefully little on income assistance, health care, and other social welfare programs that help students come to school ready to learn. The rise of single parent households and of households where both parents must work means our children get to spend less time with their primary caregivers than ever. Schools themselves are woefully underfunded. Given all of these challenges, our schools’ “failure” to produce the results we expect may be predetermined. Ninety percent of our teachers may really be exceptional, but even exceptional teachers can produce only so many miracles.
However, I am equally open to the possibility that the universally high observational ratings indicate deep, inherent problems with the method itself. Most principals have strong incentives to rate their teachers highly and no clear incentives to keep ratings accurate. As a veteran of one school system’s HR department, I know too well that it is almost impossible to fire a teacher for poor performance, so claims that 90 percent of teachers are exceptional are very hard for me to accept.
The future of teacher evaluation
I come not to bury observational evaluation, but to praise it. I value how a principal and one’s colleagues evaluate a teacher, because they see that teacher’s performance each day. Their evaluations can lend a depth to our understanding of that teacher’s work that a standardized test simply cannot provide.
However, I also recognize that education is a results-oriented business. We depend on schools to teach our children the skills they need to lead a fulfilling life. It seems appropriate to test students on how well they have mastered those skills and to evaluate teachers in part on their students’ mastery, provided our standard of evaluation is reasonable and fair.
My professional opinion as a quantitative researcher and educator is that the best contemporary VAM approaches create that reasonable, fair standard. The use of previous years’ test scores controls for the level of learning with which students start the year, which means that teachers are held responsible only for their students’ progress during the year. The use of controls for key student demographic factors helps ensure that teachers are not penalized for teaching large populations of at-risk students. As the AERA statement points out, VAM must always be used correctly and will always have technical problems associated with its use, but this is equally true of all methods of teacher evaluation.
In short, I believe that observational evaluations and VAM both have important, complementary roles to play in a fair, rigorous system of teacher evaluation, and I hope more states take steps to ensure that both are used well.

Thanks to Cara Jackson for her assistance in the preparation of this post.

Elizabeth Sobka