Skeptophilia (skep-to-fil-i-a) (n.) - the love of logical thought, skepticism, and thinking critically. Being an exploration of the applications of skeptical thinking to the world at large, with periodic excursions into linguistics, music, politics, cryptozoology, and why people keep seeing the face of Jesus on grilled cheese sandwiches.

Saturday, July 20, 2013

Teacher scores and error bars

One of the first rules of handling data that students learn in science classes is the concept of "significant figures."  Although the rules for determining whether a particular digit in a measurement or calculation is significant (i.e. reliably accurate) are a little complicated, the whole idea boils down to a simple concept:

When you do a calculation that combines various pieces of measured data, the result cannot be any more accurate than the least accurate piece of data that went into the calculation.

It's why you see "error bars" around data points in scientific papers.  You have to keep in mind how precise the data is, so that when you combine different measurements, you know how accurate the result is.  And the difficulty is that error is cumulative; the more pieces of data you combine, the greater the cumulative error becomes, and the lower the confidence that the outcome is actually right.
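The arithmetic of that accumulation is easy to sketch.  Here's a minimal example (with made-up numbers -- these aren't anyone's actual scores) of worst-case error propagation, where the uncertainties of summed measurements simply add:

```python
def combine(measurements):
    """Sum a list of (value, uncertainty) pairs.  In the worst case,
    the uncertainties add, so the total is never more certain than
    the sum of its components' error bars."""
    total = sum(value for value, _ in measurements)
    error = sum(err for _, err in measurements)
    return total, error

# Three hypothetical measurements: (points, plus-or-minus)
parts = [(60, 2), (20, 8), (10, 10)]
score, error = combine(parts)
print(f"{score} +/- {error}")  # prints "90 +/- 20"
```

(Independent random errors actually add in quadrature, which is slightly kinder -- but either way, the error bars only grow as you pile measurements together.)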

Which brings me to how teachers' grades are being calculated in New York state.

Our grades this year are a composite of three measures.  60% of our grade comes from numerical scores assigned by our principal from classroom observations; 40% comes from our students' performance on tests (20% each from two different sets of tests).  This year, my two blocks of twenty percentage points came from my AP Biology exam results and the total of my students' results on my in-class final exams.  So, here are my results:

I got 60/60 on classroom observations.  I got 20/20 on my AP Biology exam results, which is mystifying for two reasons: (1) the exam itself was a poorly designed exercise in frustration, as I described in a previous blog post; and (2) three of my 27 students got a 2 on the exam, which is below the benchmark, so my score should have been knocked down a peg because of that.

I got a 10/20 on my in-class final exam results.

Why?  A combination of reasons.  The state, in its desperation to pretend that all outcomes are quantifiable, required that for the purposes of calculating our "teacher grade," the exit exam score had to be compared to a "pre-test."  My pre-test, in AP Biology, was the combination of the students' Regents (Introductory) Biology and Regents Chemistry final exams -- both markedly easier tests.  Every student in my class scored below their pre-test score on my rigorous, college-level final, so in the state's eyes it looks as if the year they spent in my class actively made them stupider.

I also got graded down because of the three students in my elective who chose not to take the final exam.  You might ask yourself why the teacher should be blamed for a student's choice to skip the day of the final.  The state has a ready answer: "It is the teacher's responsibility to make certain that all students complete the requirements of the course."  (That's a direct quote, folks.)

So, my overall grade this year is a 90, which you'd think I'd be pretty pleased with.  Actually, I'm not, because my grade -- supposedly a measure of my effectiveness as a teacher -- isn't a 90 at all.  What should it be, then?  Damned if I know.  To get that score, we combined three measurements that were gauging different things, at different accuracies.

Remember error bars?

Were my classroom observation scores accurate?  I'd say so, and I'm not just saying that because I scored well.  The principal I work for is outstanding, and has a real sense of what good classroom teaching is.  Of the three measures, I'd say that this is the one I'm the most confident of.

How about the 40% that came from test scores?  Honestly, I'd say that number has a wobble factor of at least ten points either way.  In part, the test score outcomes are due to my effectiveness as a teacher; it'd be a sad state of affairs if how my students performed had nothing to do with me at all.  But are there other factors involved?

Of course.  On the plus side, there's the hard work the students put in.  Dedication to a class they've enjoyed.  Good study skills.  Raw intelligence.

On the minus side, there's poverty.  Cognitive disabilities.  Lack of parental support.  Bad attitude.  Frustration.  Laziness. 

To name a few.

So, really, how confident are you that my grade of 90 is actually a reflection of my effectiveness as a teacher?  Because that confidence can't be any higher than your confidence in the least accurate measure that went into calculating it.

The funny thing is, this statistical concept is taught in every Educational Statistics class in the world, and yet the powers-that-be in the State Department of Education have been completely unresponsive to claims that the way they're handling numbers is spurious.  Of course, I don't know why we should expect anything different; the way they handle scaling final exams in New York state is also spurious, and they have feigned deafness to objections from teachers on that count, too.

As an example, on the state biology final, students have consistently needed to get 46% of the answers correct to earn a scaled score of 65 [passing], while on the physics exam, the fraction of correct answers needed for a 65 has varied from 59% to 67%.  Yes, that's correct: there have been years when exam scores in physics were scaled downward.  When questioned about how this can possibly be fair, Carl Preske, Education Specialist at the New York State Department of Education, responded (this is a direct quote):
I promised myself that I would not join in any discussion of negative curve and the quality of the questions.  So much for promises, unless you personally have a degree in tests and measurements  I doubt that you have the expertise that the twenty teachers who have worked on each question.  Secondly if you lack a degree in psychometrics than [sic] comments on negative curves are useless. That being said,  each subject area established their own cut points for 65 and 85 more than 10 years ago: we (those constructing the physics exam) have the advantage of having a much larger number of difficult questions to place on each exam than does Chemistry  and with that greater number of difficult questions we are able to avoid what you prefer to call a negative.  Since we have about 20-25 questions above the 65 cut point we are able to stretch out the top 35 scaled credits,  Chemistry has between12 and 18 questions above the cut point over which they may scale the 35 credits.   If you wish to remove the "negative curve" than [sic] please find a way to generate 20 difficult questions to give to the test writing group each year.
Well, that was lucid.
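To make the "negative curve" concrete, here's a toy scaling function -- my own sketch, assuming simple linear interpolation between cut points, not the state's actual conversion charts:

```python
def scale(raw_pct, cut_65):
    """Map a raw percent-correct to a 0-100 scaled score,
    interpolating linearly below and above the passing cut."""
    if raw_pct <= cut_65:
        return raw_pct / cut_65 * 65
    return 65 + (raw_pct - cut_65) / (100 - cut_65) * 35

# Biology-style curve: 46% of answers correct earns a scaled 65.
print(round(scale(46, 46)))  # prints 65
# Physics-style negative curve: with the cut at 67%, a raw 60%
# scales DOWN to about 58.
print(round(scale(60, 67)))  # prints 58
```

Under a curve like the second one, a student can answer well over half the questions correctly and still watch the scaled score come out below the raw percentage.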

So, we're basing teachers' scores on a combination of metrics based on the scaled scores of flawed tests.

Remember the idea of error being cumulative?  ("Your score is a 90!  ± 50 points!")

Now, you may be thinking: what real difference does a teacher's score make?  How can it be used against them?  My own opinion is that we are, country-wide, moving toward using teachers' end-of-the-year scores for purposes of awarding (or revoking) tenure, job retention, and (ultimately) raises and salary.  None of that has happened yet.  But already, these scores are being considered reliable enough to be used as a criterion for awarding grant money.  Just last week I saw an offer of research grant money open to teachers -- but only if you were considered "Highly Effective," that is, if you scored a 91 or higher for the year.

That's right, folks.  If I'd gotten one point higher, I would be able to apply for a four-year research grant worth $15,000/year.  But I'm only "Effective," not "Highly Effective," so there you are.

The whole thing is intensely frustrating, because it seems like all of the rank-and-file teachers grasp the problem with this immediately, and none of the higher-ups in the State Department of Education are even willing to admit that what they're doing is statistically invalid.  Their attitude seems to be that if it can be converted to numbers, it's real.  And if it's real, it can be converted to numbers.

Oh, and if it can be converted to numbers, it's valid.  Right?

Of course right.

Me, I'm just going to keep loping along doing what I've always done, teacher score be damned.  I told a colleague this year that I didn't care what I got as long as it was above a 65, because if I "failed" I'd have to do more paperwork, which makes me sound like one of my less-motivated students.  But I know that what I do in the classroom works; I know I'm effective.  Whether I got a 90, or a 100, or a 72, means absolutely nothing, neither in the statistical sense nor in any other sense.  What we do as teachers has an inherently unquantifiable aspect to it.  How can you measure students' excitement?  Or creativity?  Or the sense of wonder they get at learning about the world?  Or the moment that a kid decides, "I love this subject.  I want to spend the rest of my life doing this"?

But the b-b stackers in the state capitol don't, apparently, recognize any of that as valuable.  It's a good thing that most of us teachers still do.
