Failed tests

(Image by Joy Olivia Miller)

Linking teacher merit pay to standardized-test scores compromises learning and creates incentives to cheat.

Jan–Feb/12

The concept of measuring teacher performance based on student standardized-test scores reminds Derek Neal of the 1970s Saturday Night Live commercial parody about the household cleaner that’s also a dessert topping. “I call this Shimmer floor wax and education policy,” he says, summing up what he considers the ridiculous linking of those two metrics, a practice that has become increasingly common in the era of No Child Left Behind.

In October the Wall Street Journal reported that 23 states and the District of Columbia use test scores, at least in part, to evaluate teachers. Eleven states use those results to determine tenure. The trend represents a shift away from decades of teacher compensation and job security based on seniority or education level with minimal attention to student performance. The $4.35 billion federal Race to the Top program instituted in 2009 accelerated the shift, granting funds based on result-oriented teacher evaluations, often focused on test scores.

Neal, a professor in economics and the Committee on Education, insists it’s a “logical impossibility” that standardized tests, as they’re most often administered, could assess both teachers and students without compromising teacher integrity, student learning, or both. “The idea is that we want faculty held accountable for what students learn, so the tool that we use to measure what students learn is the tool that we should use to hold faculty accountable,” Neal says. “It’s all rhetorically very pleasing, but it has nothing to do with the economics of how you design incentive systems.”

For standardized tests to show a correlation between student scores and teacher performance, they must be comparable from year to year and, therefore, predictable. “Any test that is very predictable will fail the requirement of being well designed for use in an incentives system,” Neal says, “because if it’s predictable, there will necessarily be a hidden action—which is, find a way to get a copy of the test and have [students] memorize the answers.”

Other types of what he calls “funny business” point to the disproportionate importance placed on testing. A 2005 study reported that Virginia educators increased the sugar content of school meals served on exam days because low glucose levels have been associated with poor scores. Some teachers have gone to the extreme of committing fraud. Steven Levitt, the William B. Ogden distinguished service professor of economics, and Brian A. Jacob, PhD’01, uncovered evidence that from 1993 to 2000 some Chicago Public Schools teachers changed student answers on standardized tests before submitting them.

Neal’s research suggests that, whether teachers use honest or nefarious methods, using the same test to measure professional competence and student achievement fails both objectives. In a 2011 National Bureau of Economic Research working paper, “The Design of Performance Pay in Education,” he finds that even when test scores improve—as they often do when teachers have a stake in the results—the growth tends to reflect mastery of test-taking techniques as opposed to the subject matter itself. Neal’s paper reviews studies from Kenya, Israel, Portugal, England, and throughout the United States. In that worldwide data, he says, “I see very weak evidence that the movement toward assessment-based accountability has increased real skill levels rather than test-taking skill levels.”

When they’re evaluated on student scores, teachers are motivated to focus on tactics specific to a test. Neal cites a 2002 Journal of Human Resources paper by Harvard professor Dan Koretz describing a Kentucky school district’s standardized-test results. Third graders performed at a fourth-grade math level—until the district switched testing companies. “They ordered a test that was supposed to cover the exact same curriculum, but they ordered it from a different company,” Neal said in a 2010 lecture. With a rueful laugh, he added, “Lo and behold, they weren’t as special anymore.”

Over time results on the new test rebounded to the levels achieved on the previous one, but when Koretz gave the first company’s exam to a subset of students who had prepared for the second version, the results dropped again. “The results don’t always turn out this starkly,” Neal says, “but it’s clear there’s a lot of evidence out there that when you put in these high-stakes programs, you get gains that are specific to a type of assessment.”

Despite the nodding heads he sees during presentations to policy audiences, Neal senses little momentum for the wholesale change he considers necessary. He advocates designing tests that do not repeat questions or formats from year to year, and limiting multiple-choice problems so that class time is not spent on tactics such as when to guess and when to skip a question.

Neal also argues that teachers should not be evaluated as a monolithic whole as if, for example, all the fifth-grade math teachers do the same job. “I think there are a lot of people in the policy community that want to say that’s exactly the case,” he says, “and I think that’s stupid.” Because suburban and inner-city schools—or honors and remedial classes—have students with different backgrounds and skill levels, Neal says that teachers should be judged according to “appropriately defined comparison sets.” Within those comparison sets, salary bonuses can be more fairly distributed. He proposes a “pay for percentile” plan outlined in a 2011 paper written with Gadi Barlevy of the Federal Reserve Bank of Chicago. Software that Neal developed and offers for free allows students to be classified according to academic history and demographic factors. How well those students fare within their groups then determines a teacher’s relative performance and merit pay.
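The idea of ranking students within comparison sets can be illustrated with a short sketch. This is not Neal's software, and it is not the exact scheme in the Neal and Barlevy paper; it is a simplified illustration that assumes each student already carries a comparison-group label. The field names (`teacher`, `group`, `score`), the function name `percentile_index`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def percentile_index(students):
    """Average within-group rank of each teacher's students.

    `students` is a list of dicts with hypothetical keys:
    'teacher', 'group' (the comparison set), and 'score'.
    Each student is compared only with students in the same
    group who are taught by other teachers.
    """
    by_group = defaultdict(list)
    for s in students:
        by_group[s["group"]].append(s)

    ranks = defaultdict(list)
    for s in students:
        peers = [p["score"] for p in by_group[s["group"]]
                 if p["teacher"] != s["teacher"]]
        if not peers:
            continue  # no comparable students elsewhere, so no rank
        beaten = sum(1 for score in peers if score < s["score"])
        tied = sum(1 for score in peers if score == s["score"])
        # Share of comparable peers outscored; ties count as half.
        ranks[s["teacher"]].append((beaten + 0.5 * tied) / len(peers))

    return {teacher: mean(values) for teacher, values in ranks.items()}


# Made-up data: three teachers, each with one student in each group.
students = [
    {"teacher": "A", "group": "low", "score": 60},
    {"teacher": "A", "group": "high", "score": 85},
    {"teacher": "B", "group": "low", "score": 50},
    {"teacher": "B", "group": "high", "score": 90},
    {"teacher": "C", "group": "low", "score": 55},
    {"teacher": "C", "group": "high", "score": 80},
]
index = percentile_index(students)
print(index)
```

In this made-up data, teacher B has the single highest-scoring student, yet teacher A ends up with the highest index, because A's students do well relative to comparable students rather than in absolute terms. A bonus pool could then be divided according to that index within each comparison set.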

To Neal’s frustration, changes of that magnitude seldom enter the public debate. Instead, discussion tends to focus on nips and tucks to No Child Left Behind and on fine-tuning test design rather than on reforming the process to remove the inherent temptations on teachers. “I’m arguing, no,” Neal says, “you’ve got to junk it and start over.”
