Skeptophilia (skep-to-fil-i-a) (n.) - the love of logical thought, skepticism, and thinking critically. Being an exploration of the applications of skeptical thinking to the world at large, with periodic excursions into linguistics, music, politics, cryptozoology, and why people keep seeing the face of Jesus on grilled cheese sandwiches.

Monday, June 27, 2016

Score card

It's the last week of June, and I just wrapped up another school year.  My 29th overall, which still seems kind of impossible to me until I realize that a child of a former student graduated from high school this year.  Then it seems pretty real, along with a realization of "Good lord, I'm getting old."

So I've been at this for a long time, and with, I think, some measure of success.  Which is why I read my letter from the school district awarding me my numerical grade for the school year with a mixture of amusement and irritation.

I won't leave you hanging; I got an 81.  I got an 84 last year and a 91 the year before that, so according to the state rating scale, I'm becoming incrementally less competent.  It can't, of course, be that the metric is flawed, that the three grades compare different assessments of different students, put together in different ways.  No, in the minds of the geniuses at NYSED, this number means something fundamental about my effectiveness as a teacher.

In fact, that's what an 81 gets you: a designation of "Effective."  You have to have a 92 to be "Highly Effective."  If you're below 75, you're "Developing."  I'm glad I didn't land in that category.  If after 29 years at this game I'm not "Developed," I don't hold out much hope.

What amused me most about all of this nonsense was the paragraph that said, and I quote:
Please remember that your scores are confidential and should not be shared in any way.  In accordance with state regulations, the parent of a child in your class may request your composite score and rating as well as that of the principal.  For your own protection, teachers are strongly discouraged from sharing their own scores outside of the district process.
Which is a recommendation I'm happy to toss to the wind (along with the aforementioned letter).  If we keep our scores and the way they were generated under wraps, it allows the statistics gurus at the State Education Department to keep everyone under the impression that they actually know what they're doing.

[image courtesy of the Wikimedia Commons]

Let me get specific, here.  My 91 two years ago was based upon the scores of my Critical Thinking classes and my AP Biology class.  Critical Thinking is an elective, and while the day-to-day work is difficult (requiring a lot of thinking, surprisingly) the material that is suitable for an exam at the end of the year is actually quite easy.  So my students performed brilliantly, as I would expect.  Additionally, that year's AP class was an extremely talented group who knocked the final exam clear out of the park.

Fast forward to last year, when my score was based on a combination of my Regents (Introductory) Biology class and my AP Biology classes.  Because of a strange policy of piling students with identified learning disabilities into the same class, half of last year's Regents Biology class was made up of such students.  Many of them were hard-working and wonderful to teach, but it's unsurprising that that part of my grade went down.  My two AP classes last year were a friendly, cheerful lot who also happened to be somewhat motivationally challenged, and who by the end of the school year were far more invested in playing Cards Against Humanity than in studying for my final.  So that accounts for the remainder of the decline.

This year, my score was once again a composite of Regents and AP Biology, but this time my Regents classes were among the most talented, hardest-working freshmen and sophomores I've ever had.  My AP class was small but outstanding; because of the way the scoring is done, though, each student had to score higher on my (very difficult) final exam than a target determined by his or her score on the (far easier) Regents Biology exam for that student's result to count in my favor.  On the part of my assessment that came from my AP class, I got a grand total of three points out of a possible twenty -- mostly because of students who got an 81 or 82 on an exam where their target was 85.
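To make the mechanics concrete, here's a rough sketch in Python of how a target-based block like that plays out.  The all-or-nothing comparison matches what I described above, but the point formula itself is my assumption for illustration, not NYSED's published algorithm.

```python
# Hypothetical sketch of a target-based scoring block.  Assumption:
# a student counts in the teacher's favor only if his or her score
# on the course final meets a target derived from an earlier exam.

def target_block_points(results, max_points=20):
    """results: list of (final_score, target_score) pairs, one per student."""
    hits = sum(final >= target for final, target in results)
    return round(max_points * hits / len(results))

# Strong absolute scores that fall just short of the target earn nothing:
ap_class = [(82, 85), (81, 85), (88, 85), (81, 85), (84, 85)]
print(target_block_points(ap_class))  # 1 hit out of 5 students -> 4 of 20
```

Under a rule like this, an 84 against a target of 85 is worth exactly as much to the teacher as a zero would be.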

So my three scores in three consecutive years have absolutely nothing to do with one another, and (I would argue) nothing whatsoever to do with my competence as a teacher.  But because no idea is so stupid that someone can't tinker with it to make it even stupider, the State Department of Education has informed us that next year we'll be assessed a different way.  Our joy at hearing this pronouncement was short-lived, because once we heard how they're going to score us, we all rolled our eyes so hard it looked like the email was inducing grand mal seizures.

Next year, unless over half of your students are in classes that take a mandated state exam at the end of the year, 50% of your score will be based on an average of the "Big Five" exams, the ones that all students have to take to graduate -- English, US History, Algebra I, Global History, and Biology.  (The other half, fortunately, will be based on evaluation by an administrator.)  If you think you must have misread that, you didn't; half of the high school band teacher's grade (for example) will come from students' scores on exams that she had absolutely nothing to do with.  Even I, who teach one of the Big Five, am not exempt -- fewer than half of my students next year will be in Regents Biology, so I'll be getting the composite score, too.

But don't worry!  Because students mostly score pretty well on these exams, and the score will be calculated using the time-honored statistical technique of averaging averages, we'll all look like we're brilliant.  So in effect, they took an evaluation metric that was almost completely meaningless, and changed it so as to make it completely meaningless.
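Here's what averaging averages looks like in practice, with invented numbers purely for illustration:

```python
# Invented figures, just to show the arithmetic.  The unweighted
# mean of the five building-wide exam averages:
big_five = {"English": 82, "US History": 85, "Algebra I": 74,
            "Global History": 80, "Biology": 83}

composite = sum(big_five.values()) / len(big_five)
print(round(composite, 1))  # 80.8 -- the same number for the band
                            # teacher, the art teacher, and me

# And because it's an average of averages, a 400-student English
# exam and a 40-student Algebra I exam count as equally informative.
```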

Because that's clearly how you want an evaluation system to work.

All of this, it must be said, comes from the drive toward "data-driven instruction" -- converting every damn thing we do into numbers.  Couple this with a push toward tying those numbers to tenure, retention, and merit pay, along with a fundamental distrust of the teachers themselves, and we now have a system that is so far removed from any measure of reliability that it's almost funny.

Almost.  Because NYSED, and other state educational agencies, look upon all of this as being deadly serious.  It's all very well for me -- a veteran teacher of nearly three decades who is looking to retire in the next few years -- to laugh about this.  I wouldn't be laughing if I were a new teacher, however, and I'd be laughing even less if I were a college student considering education as a profession.

In fact, it'd make me look closely at what other career options I had.

Saturday, July 20, 2013

Teacher scores and error bars

One of the first rules of handling data that students learn in science classes is the concept of "significant figures."  Although the rules for determining whether a particular digit in a measurement or calculation is significant (i.e. reliably accurate) are a little complicated, the whole idea boils down to a simple concept:

When you do a calculation that combines various pieces of measured data, the result cannot be any more accurate than the least accurate piece of data that went into the calculation.

It's why you see "error bars" around data points in scientific papers.  You have to keep in mind how precise the data is, so that when you combine different measurements, you know how accurate the result is.  And the difficulty is that error is cumulative; the more pieces of data you combine, the greater the cumulative error becomes, and the lower the confidence that the outcome is actually right.
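A one-line arithmetic example makes the point (the measurements are invented):

```python
# A room measured as 12.3 m by 4.1 m (three and two significant
# figures, respectively) doesn't have an area known to four digits.
area = 12.3 * 4.1
print(f"{area:.4f}")  # 50.4300 -- what the calculator reports
print(f"{area:.2g}")  # 50 -- all the precision the inputs justify
```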


Which brings me to how teachers' grades are being calculated in New York state.

Our grades this year are a composite of three measures.  60% of our grade comes from numerical scores assigned by our principal from classroom observations; 40% comes from our students' performance on tests (20% each from two different sets of tests).  This year, my two blocks of twenty percentage points came from my AP Biology exam results and from my students' results on my in-class final exams.  So, here are my results:

I got 60/60 on classroom observations.  I got 20/20 on my AP Biology exam results, which is mystifying for two reasons: (1) the exam itself was a poorly designed exercise in frustration, as I described in a previous blog post; and (2) three of my 27 students got a 2 on the exam, which is below the benchmark, so my score should have been knocked down a peg because of that.

I got a 10/20 on my in-class final exam results.

Why?  A combination of reasons.  The state, in their desperation to pretend that all outcomes are quantifiable, required that for the purposes of calculating our "teacher grade," the exit exam score had to be compared to a "pre-test."  My pre-test, in AP Biology, was the combination of the students' Regents (Introductory) Biology and Regents Chemistry final exams -- both markedly easier tests.  Every student in my class scored below their pre-test score on my rigorous, college-level final, so in the state's eyes it looks like the year they spent in my class actively made them stupider.

I also got graded down because of the three students in my elective who chose not to take the final exam.  You might ask yourself why the teacher should be blamed for a student's choice to skip the day of the final.  The state has a ready answer: "It is the teacher's responsibility to make certain that all students complete the requirements of the course."  (That's a direct quote, folks.)

So, my overall grade this year is a 90, which you'd think I'd be pretty pleased with.  Actually, I'm not, because my grade -- supposedly a measure of my effectiveness as a teacher -- isn't a 90 at all.  What should it be, then?  Damned if I know.  That score combines three measurements that gauge different things, at different accuracies.

Remember error bars?

Were my classroom observation scores accurate?  I'd say so, and I'm not just saying that because I scored well.  The principal I work for is outstanding, and has a real sense of what good classroom teaching is.  Of the three measures, I'd say that this is the one I'm the most confident of.

How about the 40% that came from test scores?  Honestly, I'd say that number has a wobble factor of at least ten points either way.  In part, the test score outcomes are due to my effectiveness as a teacher; it'd be a sad state of affairs if how my students performed had nothing to do with me at all.  But are there other factors involved?

Of course.  On the plus side, there's the hard work the students put in.  Dedication to a class they've enjoyed.  Good study skills.  Raw intelligence.

On the minus side, there's poverty.  Cognitive disabilities.  Lack of parental support.  Bad attitude.  Frustration.  Laziness. 

To name a few.

So, really, how confident are you that my grade of 90 is actually a reflection of my effectiveness as a teacher?  Because that confidence can't be any higher than the least accurate measure that went into calculating it.
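To put a number on that, here's a sketch using the state's 60/20/20 weighting.  The error bars are my assumptions: a modest one on the observation block, plus the "ten points either way" wobble I just ascribed to the test-based 40 points.

```python
from math import sqrt

# Composite score with propagated uncertainty.  The 60/20/20 weights
# are the state's; the sigmas are assumed for illustration.
blocks = [
    ("observations", 60, 5),   # points earned out of 60, assumed +/-5
    ("test scores",  30, 10),  # 20 (AP) + 10 (final) out of 40, assumed +/-10
]

composite = sum(points for _, points, _ in blocks)
sigma = sqrt(sum(s ** 2 for _, _, s in blocks))  # independent errors add in quadrature
print(f"{composite} +/- {sigma:.0f}")  # 90 +/- 11
```

Even under those fairly charitable assumptions, the composite is uncertain by about eleven points -- which matters when, as you'll see below, a single point can separate "Effective" from "Highly Effective."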

The funny thing is, this statistical concept is one that is taught in every Educational Statistics class in the world, and yet the powers-that-be in the State Department of Education have been completely unresponsive to claims that the way they're handling numbers is spurious.  Of course, I don't know why we should expect any different; the way they handle scaling final exams in New York state is also spurious, and they have feigned deafness to objections from teachers on that count, too.

As an example, on the state biology final, students have consistently needed to get 46% of the answers correct to earn a scaled score of 65 [passing], while on the physics exam, the fraction of correct answers students need for a 65 has varied from 59% to 67%.  Yes, that's correct; there have been years where exam scores in physics have been scaled downward.  When questioned about how this can possibly be fair, Carl Preske, Education Specialist at the New York State Department of Education, responded (this is a direct quote):
I promised myself that I would not join in any discussion of negative curve and the quality of the questions.  So much for promises, unless you personally have a degree in tests and measurements  I doubt that you have the expertise that the twenty teachers who have worked on each question.  Secondly if you lack a degree in psychometrics than [sic] comments on negative curves are useless. That being said,  each subject area established their own cut points for 65 and 85 more than 10 years ago: we (those constructing the physics exam) have the advantage of having a much larger number of difficult questions to place on each exam than does Chemistry  and with that greater number of difficult questions we are able to avoid what you prefer to call a negative.  Since we have about 20-25 questions above the 65 cut point we are able to stretch out the top 35 scaled credits,  Chemistry has between12 and 18 questions above the cut point over which they may scale the 35 credits.   If you wish to remove the "negative curve" than [sic] please find a way to generate 20 difficult questions to give to the test writing group each year.
Well, that was lucid.
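For anyone who hasn't met Regents scaling: each exam's raw percent is converted to a scaled score using a conversion chart published for that administration.  Here's a toy piecewise-linear version -- the anchor points below are invented, not the real charts -- that shows what a "negative curve" means:

```python
# Toy raw-to-scaled conversion by linear interpolation between anchor
# points.  Real Regents conversion charts are published per exam;
# these breakpoints are invented for illustration.

def scale(raw_pct, anchors):
    """anchors: list of (raw_percent, scaled_score) pairs, ascending."""
    r0, s0 = anchors[0]
    for r1, s1 in anchors[1:]:
        if raw_pct <= r1:
            return s0 + (s1 - s0) * (raw_pct - r0) / (r1 - r0)
        r0, s0 = r1, s1
    return anchors[-1][1]

biology = [(0, 0), (46, 65), (100, 100)]  # 46% correct scales up to a 65
physics = [(0, 0), (67, 65), (100, 100)]  # some years ~67% needed for a 65

print(round(scale(60, biology)))  # 74 -- comfortably passing
print(round(scale(60, physics)))  # 58 -- scaled *below* the raw percent
```

The same raw 60% lands at a 74 on one chart and a 58 on the other; the latter is the "negative curve" the teachers were objecting to.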

So, we're basing teachers' scores on a combination of metrics based on the scaled scores of flawed tests.

Remember the idea of error being cumulative?  ("Your score is a 90!  ± 50 points!")

Now, you may be thinking, what real difference does a teacher's score make?  How can it be used against them?  My own opinion is that we are, country-wide, moving toward using teachers' end-of-the-year scores for purposes of awarding (or revoking) tenure, job retention, and (ultimately) raises and salary.  None of that has happened yet.  But already, these scores are being considered reliable enough to be used as a criterion for awarding grant money.  Just last week I saw an offer of research grant money that was open to teachers -- but only if you were considered "Highly Effective," that is, you scored a 91 or higher for the year.

That's right, folks.  If I'd gotten one point higher, I would be able to apply for a four-year research grant worth $15,000/year.  But I'm only "Effective," not "Highly Effective," so there you are.

The whole thing is intensely frustrating, because it seems like all of the rank-and-file teachers grasp the problem with this immediately, and none of the higher-ups in the State Department of Education are even willing to admit that what they're doing is statistically invalid.  Their attitude seems to be that if it can be converted to numbers, it's real.  And if it's real, it can be converted to numbers.

Oh, and if it can be converted to numbers, it's valid.  Right?

Of course right.

Me, I'm just going to keep loping along doing what I've always done, teacher score be damned.  I told a colleague this year that I didn't care what I got as long as it was above a 65, because if I "failed" I'd have to do more paperwork -- which makes me sound like one of my less-motivated students.  But I know that what I do in the classroom works; I know I'm effective.  Whether I got a 90, or a 100, or a 72 means absolutely nothing, neither in the statistical sense nor in any other sense.  What we do as teachers has an inherently unquantifiable aspect to it.  How can you measure students' excitement?  Or creativity?  Or the sense of wonder they get at learning about the world?  Or the moment when a kid decides, "I love this subject.  I want to spend the rest of my life doing this"?

But the b-b stackers in the state capitol don't, apparently, recognize any of that as valuable.  It's a good thing that most of us teachers still do.