Skeptophilia (skep-to-fil-i-a) (n.) - the love of logical thought, skepticism, and thinking critically. Being an exploration of the applications of skeptical thinking to the world at large, with periodic excursions into linguistics, music, politics, cryptozoology, and why people keep seeing the face of Jesus on grilled cheese sandwiches.

Friday, May 2, 2014

Computerized essay, computerized grade

Honestly, I didn't need another reason to hate the increasing barrage of standardized tests that has come to characterize the American approach to public education.

I've seen enough of its ill effects already.  Demoralized kids, who daily face curricula that have turned into a hodgepodge of minutiae and generalities, with little emphasis on connections or critical thinking.  The "teach-to-the-test" mentality becoming abundant amongst teachers and administrators -- driven, it must be said, not by laziness or ineptitude, but because they are now being evaluated by how well the students perform on these metrics.  Writing that is graded on meeting a set of bullet-point rubrics that often have little to do with depth of understanding, creativity, or nuance.

But just yesterday, I found yet another reason to despise the direction our educational system is going.  Because apparently, the latest push in the educational assessment world is to take essays -- the last bastion of expressive thought in an increasingly fill-in-the-bubbles world -- and score them by computer.


I'm not making this up.  There is now software out there -- Intellimetric, eRater, and Project Essay Grade, for example -- that developers claim can take an essay written by a student on a computer and come up with a score that closely matches the score that would be given by a trained human reader.  There's also "WriteToLearn Automated Language Assessment" -- offered by none other than Pearson Education, which seems to be becoming to the educational world what Monsanto is to the environmentalists.

Proponents say that humans are fallible, biased, tire easily, can be sloppy, can cheat.  Which is all true, of course.  But the people who are using machine scoring of essays are confident to the point of hubris: "ETS has been at the forefront of research in automated scoring of open-ended items for over two decades," reads the description of the use of automated scoring protocols on the Educational Testing Service website, "with a long list of significant, peer-reviewed research publications as evidence of our activity in the field.  ETS scientists have published on automated scoring issues in the major journals of the educational measurement, computational linguistics and language testing fields.  Their work has also resulted in 19 U.S. patents related to applying NLP in assessment, significantly more than any other organization."

And it's already in use.  The GMAT (Graduate Management Admission Test) essays are at least in part machine scored, and the PARCC (Partnership for Assessment of Readiness for College and Careers) assessments are supposed to be following suit in the 2014-2015 school year.  So evidently, the ETS people and their pals at Pearson and other educational software development corporations have done their public relations jobs well.  The educational establishment, it seems, is sold on automated essay scoring.

Which puts them in a kind of awkward position apropos of a piece of research published just yesterday, which showed that a simple piece of software called "BABEL" (Basic Automated B.S. Essay Language), developed by programmer Les Perelman at MIT, can produce a high-scoring essay, according to the automated scoring software -- despite the fact that the output of BABEL is meaningless gibberish.

Which implies that students could do the same.  Further implying that what the automated scoring programs are detecting is not writing quality.

Given the prompt to write about "privacy," BABEL produced an essay that scored a 5.4 out of a possible 6, according to the automated scoring software, despite the fact that it contained the sentence, "Privateness has not been and undoubtedly never will be lauded, precarious, and decent."  The whole essay was written that way, i.e., complete and utter bullshit, composed of random words strung together into fancy-sounding complex sentences with lots of commas and subordinate clauses. Since the automated scoring software was looking for complexity of sentence structure, word length, and word commonness as some of its criteria for the overall score, and could not actually discern the meaning (or lack thereof) of the passage, the program was fooled.

Which means, of course, that there's no reason that humans couldn't similarly game the program.  Learn how to string some ten-dollar words together, put in a few cool phrases like "will certainly be, despite suggestions to the contrary," and figure out how to do parallel construction, and apparently it doesn't matter if you're saying anything that's meaningful.
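To make that concrete, here's a toy sketch in Python of a scorer that looks only at the kinds of surface features described above: sentence length, word length, word commonness, and commas.  It is emphatically not how Intellimetric or eRater actually work (those are proprietary, and I have no idea what's inside them); the function names, the tiny "common word" list, and the weights are all made up for illustration.  The only point is that nothing in it ever asks whether the sentence means anything.

```python
import re

# Hypothetical, tiny stand-in for a "common word" list. A real system
# would presumably use frequencies from a large corpus (assumption).
COMMON_WORDS = {
    "the", "a", "an", "and", "or", "but", "is", "are", "was", "be", "been",
    "to", "of", "in", "it", "has", "not", "will", "never", "i", "we", "you",
    "my", "our", "people", "should", "keep", "their", "own", "that", "this",
}


def surface_features(text):
    """Measure surface complexity only; meaning is never examined."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return {
            "words_per_sentence": 0.0,
            "avg_word_length": 0.0,
            "rare_word_fraction": 0.0,
            "commas_per_sentence": 0.0,
        }
    rare = [w for w in words if w not in COMMON_WORDS]
    return {
        "words_per_sentence": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "rare_word_fraction": len(rare) / len(words),
        "commas_per_sentence": text.count(",") / len(sentences),
    }


def toy_score(text):
    """Combine the features with invented weights, clamped to a 1-6 scale."""
    f = surface_features(text)
    raw = (
        0.1 * f["words_per_sentence"]
        + 0.5 * f["avg_word_length"]
        + 2.0 * f["rare_word_fraction"]
        + 0.5 * f["commas_per_sentence"]
    )
    return round(max(1.0, min(6.0, raw)), 1)


# The BABEL sentence quoted above, versus a plain sentence that says something.
GIBBERISH = ("Privateness has not been and undoubtedly never will be "
             "lauded, precarious, and decent.")
PLAIN = "People should keep their own secrets."

if __name__ == "__main__":
    print("gibberish:", toy_score(GIBBERISH))
    print("plain:    ", toy_score(PLAIN))
```

Run it, and the BABEL sentence outscores the plain one, simply because it is longer, has bigger and rarer words, and uses more commas.  A scorer built this way rewards the look of sophistication, and the look is all it can see.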

Look, I know I'm a bit of a Luddite, but it's not that I don't trust technology per se.  I just think that thus far, it has some significant limitations.  We are not yet -- and chances are, won't be for some time -- within hailing distance of a sentient computer that would be able to understand the nuance and connotation of written or spoken language.  All of the apps and programs and bells and whistles that seem to be taking the educational world by storm are no replacement for a truly engaging teacher.  Even if the software improves dramatically, I would question its utility as anything more than a clever teaching tool, and something that a skilled teacher really can get along just fine without.

But I find the idea that we are further mechanizing the act of teaching -- an act that is, when done well, far more an art form than it is a science -- profoundly repulsive.  As of two years ago, we were told by the state of New York that we teachers are not trusted sufficiently to grade our own final exams, so we have to give the exams to other teachers to score.  Now, apparently, we're moving to taking the assessment of our students out of human hands entirely.

Next, I fear, we will see the teachers themselves replaced by software -- with the ETS and Pearson and so on lauding the changes as visionary, and describing the "peer-reviewed research" the scientists on their payroll are doing, that shows how effective it all is.  "Students learn best with an interactive computer-based tutorial," I can see the press release saying.  "We have been at the forefront of non-teacher-based instruction for decades!"

No teachers necessary, right?  Just some low-paid aides to keep the kids pointed at the computer screens.  Consider the savings to the taxpayer!

More and more we are seeing an emphasis on processing children through factory-model schools, as if they were little automata that could be tweaked and turned and programmed and all come out identically "career and college ready."  There is scant emphasis on creative, original thought, because, after all, how could you assess that, turn it into a number?  And you know, if you can't quantify it, it doesn't exist.

And I suspect that Perelman's result with BABEL will be met with a thunderous silence.  The educational establishment has a sorry history of ignoring research that would cause them to have to make a shift in the status quo, especially when said status quo is making a lot of money for the corporations that are now holding the purse strings.

Easier, apparently, to brush off a 5.4/6 on a nonsense essay than it is to admit that the entire system is headed in exactly the wrong direction.


  1. Here's an example of how the automated testing companies regard this sort of thing:

    It's always interesting to see how people justify themselves -- I can imagine the culture in the companies that produce this software. These researchers can be ignored because they're acting maliciously and in bad faith, apparently. It's a known issue which they have already addressed by writing a FAQ entry. Test takers, of course, would never behave in that way. So there.

    I'd like to see how these programs rate samples of simple, powerful writing that's survived the ages, if you disable the plagiarism check.

    It's no secret to anyone who's (attempted to) read a journal of literary criticism that one can use a lot of big words and complex sentences while saying nothing that makes any sense. And those articles do get published, so you could say that the automated checkers are fairly accurate at predicting academic success. Ability to create complex sentences is regarded as an important academic skill; ability to communicate clearly is merely useful.

  2. This was on As It Happens on Tuesday. Listen to the garbage output that is generated by BABEL, which then gets a mark in the 90th percentile.

  3. "No teachers necessary, right?  Just some low-paid aides to keep the kids pointed at the computer screens.  Consider the savings to the taxpayer!"

    Let me jump right off the deep end (cannonball!!!) and say that spitting out little automatons is in our nation's interest...
    The largest employer in America (Walmart) needs neither intelligent/skilled labor nor an informed electorate... and it has been proven that they will use their money to sway our political leaders to enact self-serving legislation. No conspiracy theory necessary.

    Game, set, match... I suppose?

    Well, not entirely. Public opinion (i.e., this blog) is instrumental in retaining the common sense our society needs to beat back (or at least slow down) the progress of these selfish organizations.