Skeptophilia (skep-to-fil-i-a) (n.) - the love of logical thought, skepticism, and thinking critically. Being an exploration of the applications of skeptical thinking to the world at large, with periodic excursions into linguistics, music, politics, cryptozoology, and why people keep seeing the face of Jesus on grilled cheese sandwiches.

Wednesday, February 26, 2014

Academic gibberish

About three years ago, I wrote a post on the problem with scientific jargon.  The gist of my argument was that while specialist vocabulary is critical in the sciences, its purpose should be to enhance clarity of speech and writing, and if it does not accomplish that, it is pointless.  Much of woo-wooism, in fact, comes about because of mushy definitions of words like "energy" and "field" and "frequency"; the best scientific communication uses language precisely, leaving little room for ambiguity and misunderstanding.

That doesn't mean that learning scientific language isn't difficult, of course.  I've made the point more than once that the woo-woo misuse of terminology springs from basic intellectual laziness.  The problem is, though, that because the language itself requires hard work to learn, the use of scientific vocabulary and academic syntax can cross the line from being precise and clear into deliberate obscurantism, a Freemason-like Guarding of the Secret Rituals.  There is a significant incentive, it seems, to use scientific jargon as obfuscation, to prevent the uninitiated from understanding what is going on.

[image courtesy of the Wikimedia Commons]

The scientific world just got a demonstration of that unfortunate tendency with the announcement yesterday that 120 academic papers have been withdrawn by publishers, after computer scientist Cyril Labbé of Joseph Fourier University (Grenoble, France) demonstrated that they hadn't, in fact, been written by the people listed on the author line...

... they were, in fact, computer-generated gibberish.

Labbé developed software specifically designed to detect papers produced by SciGen, a random academic-paper generator created by some waggish types at MIT.  The creators of SciGen set out to prove that meaningless jargon strings would still make it into publication -- and succeeded beyond their wildest dreams.  “I wasn’t aware of the scale of the problem, but I knew it definitely happens.  We do get occasional emails from good citizens letting us know where SciGen papers show up,” says Jeremy Stribling, who co-wrote SciGen when he was at MIT.
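SciGen works by recursively expanding a hand-written context-free grammar stuffed with computer-science buzzwords, which is why its output is locally plausible but globally meaningless.  Here's a minimal sketch of the technique; the toy grammar below is my own invention for illustration, not SciGen's actual (and vastly larger) rule set:

```python
import random

# Each rule maps a symbol to a list of possible expansions.  Anything
# not found as a key in the grammar is treated as a terminal word.
GRAMMAR = {
    "SENTENCE": [
        ["We", "VERB", "that", "NOUN_PHRASE", "can", "VERB", "NOUN_PHRASE", "."],
        ["Clearly,", "NOUN_PHRASE", "must", "VERB", "NOUN_PHRASE", "."],
    ],
    "NOUN_PHRASE": [["ADJ", "NOUN"], ["the", "NOUN"]],
    "VERB": [["demonstrate"], ["refute"], ["visualize"], ["synthesize"]],
    "ADJ": [["stochastic"], ["pervasive"], ["autonomous"], ["interposable"]],
    "NOUN": [["hash tables"], ["the Turing machine"],
             ["consistent hashing"], ["symmetric encryption"]],
}

def expand(symbol, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol not in GRAMMAR:          # terminal: emit the word as-is
        return [symbol]
    words = []
    for sym in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(sym, rng))
    return words

def gibberish_sentence(seed=None):
    """Generate one grammatically-shaped but meaningless sentence."""
    rng = random.Random(seed)
    return " ".join(expand("SENTENCE", rng)).replace(" .", ".")

print(gibberish_sentence(seed=42))
```

Every sentence it emits parses as English, which is exactly the trap: syntactic well-formedness is cheap to fake, and any review process that only skims for it will wave this stuff through.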

The result has left a lot of folks in the academic world red-faced.  Monika Stickel, director of corporate communications at IEEE, a major publisher of academic papers, said that the publisher "took immediate action to remove the papers" and has "refined our processes to prevent papers not meeting our standards from being published in the future."

More troubling, of course, is how they got past the publishers in the first place, because I think this goes deeper than substandard (worthless, actually) papers slipping by careless readers.  Myself, I have to wonder if anyone can actually read some of the technical papers that are currently out there, and understand them well enough to determine whether they make sense or not.  Now, up front I have to say that despite my scientific background, I am a generalist through and through (some would say "dilettante," to which I say: guilty as charged, your honor).  I can usually read papers on population genetics and cladistics with a decent level of understanding; but even papers in the seemingly related field of molecular genetics zoom past me so fast they barely ruffle my hair.

Are we approaching an era when scientists are becoming so specialized, and so sunk in jargon, that their likelihood of reaching anyone who is not a specialist in exactly the same field is nearly zero?

It would be sad if this were so, but I fear that it is.  Take a look, for example, at the following little quiz I've put together for your enjoyment.  Below are eight quotes, some from legitimate academic journals and some generated using SciGen.  See if you can determine which are which.
  1. On the other hand, DNS might not be the panacea that cyberinformaticians expected. Though conventional wisdom states that this quandary is mostly surmounted by the construction of the Turing machine that would allow for further study into the location-identity split, we believe that a different solution is necessary.
  2. Based on ISD empirical literature, is suggested that structures like ISDM might be invoked in the ISD context by stakeholders in learning or knowledge acquisition, conflict, negotiation, communication, influence, control, coordination, and persuasion. Although the structuration perspective does not insist on the content or properties of ISDM like the previous strand of research, it provides the view of ISDM as a means of change.
  3. McKeown uses intersecting multiple hierarchies in the domain knowledge base to represent the different perspectives a user might have. This partitioning of the knowledge base allows the system to distinguish between different types of information that support a particular fact. When selecting what to say the system can choose information that supports the point the system is trying to make, and that agrees with the perspective of the user.
  4. For starters, we use pervasive epistemologies to verify that consistent hashing and RAID can interfere to realize this objective. On a similar note, we argue that though linked lists and XML are often incompatible, the acclaimed relational algorithm for the visualization of the Internet by Kristen Nygaard et al. follows a Zipf-like distribution.
  5. Interaction machines are models of computation that extend TMs with interaction to capture the behavior of concurrent systems, promising to bridge the fields of computation theory and concurrency theory.
  6. Unlike previous published work that covered each area individually (antenna-array design, signal processing, and communications algorithms and network throughput) for smart antennas, this paper presents a comprehensive effort on smart antennas that examines and integrates antenna-array design, the development of signal processing algorithms (for angle of arrival estimation and adaptive beamforming), strategies for combating fading, and the impact on the network throughput.
  7. The roadmap of the paper is as follows. We motivate the need for the location-identity split. Continuing with this rationale, we place our work in context with the existing work in this area. Third, to address this obstacle, we confirm that despite the fact that architecture can be made interposable, stable, and autonomous, symmetric encryption and access points are continuously incompatible.
  8. Lastly, we discuss experiments (1) and (4) enumerated above. Error bars have been elided, since most of our data points fell outside of 36 standard deviations from observed means. On a similar note, note that active networks have more jagged seek time curves than do autogenerated neural networks.
Ready for the answers?

#1:  SciGen.
#2:  Daniela Mihailescu and Marius Mihailescu, "Exploring the Nature of Information Systems Development Methodology: A Synthesized View Based on a Literature Review," Journal of Service Science and Management, June 2010.
#3:  Robert Kass and Tom Finin, "Modeling the User in Natural Language Systems," Computational Linguistics, September 1988.
#4:  SciGen.
#5:  Dina Goldin and Peter Wegner, "The Interactive Nature of Computing: Refuting the Strong Church-Turing Thesis," Kluwer Academic Publishers, May 2007.
#6:  Salvatore Bellofiore et al., "Smart Antenna System Analysis, Integration, and Performance on Mobile Ad-Hoc Networks (MANETs)," IEEE Transactions on Antennas and Propagation, May 2002.
#7:  SciGen.
#8:  SciGen.

How'd you do?  If you're like most of us, I suspect that telling them apart was guesswork at best.

Now, to reiterate: I'm not saying that scientific terminology per se is detrimental to understanding.  As I tell my students, having a uniform, standard, and precise vocabulary is critical.  Put a different way, we all have to speak the same language.  But this doesn't excuse murky writing and convoluted syntax, which often seem to me to exist as much to keep non-scientists from figuring out what the hell the author is trying to say as to provide rigor.

And the Labbé study illustrates pretty clearly that it is not just a stumbling block for relative laypeople like myself.  That 120 computer-generated SciGen papers slipped past the eyes of the scientists themselves points to a more pervasive, and troubling, problem.

Maybe it's time to revisit the topic of academic writing, from the standpoint of seeing that it accomplishes what it was originally intended to accomplish: informing, teaching, enhancing knowledge and understanding.  Not, as seems to have happened these days, simply serving as a coded message so well encrypted that sometimes not even the members of the Inner Circle can elucidate its meaning.


  1. Part of the root cause here is that from the perspective of the researcher, the purpose of scientific writing is not informing, teaching, or enhancing knowledge and understanding. It's employment. In the publish-or-perish world of the current scientific industry, the rate of publication and the degree to which you cite other important people and are cited by important people is at least as important, if not more so, than the content of the research.

  2. What I'd like to know is how many of those papers were cited in other papers.

  3. After reading the original Nature article linked to in this blog it was easier to understand how these papers slipped through. Contrary to what the author of this blog stated, these papers did NOT make it past peer review. Unlike empirical research articles, conference submissions are not subject to peer review prior to being accepted. This doesn't excuse the negligence of the conference organizers who were supposed to read those papers. I just doubt that these papers would have made it past the careful scrutiny of an independent set of content experts tasked with reviewing them for publication in an academic journal.

    1. Thanks for the correction, which I will edit my post to reflect.

  4. Our results in this experiment are consistent with a ∂σ≤0.001, implying a stochastic intermediacy range within normal tolerances and denying the intercedent inapplicability of the Stoßmann-Freihofer recalculation of ₯≡₰. No other conclusion is warranted by the Cooper-Hofstadter Conjecture as substantiated by our previous paper, Countervalancy Interactions Between Anarthrous and Non-Anarthrous Implicatory Particle Variance.

  5. The disestablishmentarianistic modality incorporated within the bimodular phase induction heliosphere exacerbates the coagulative dimorphist excurbishment. Kerchunking the mesosphere requires a basal rodametric metal oxide silicon field effect transistor with a minimum frequency of 500ohms and a maximum resistance of eleventy-twelve hertz. The output results in condensed water vapor generated within the exhaust matrix of an aircraft, which freezes as icellined crystals. These crytalline matrices then descend through a gravity modality of travel, to the terrestrial substrate, wherein they can be conflagrated, resulting in carbonization, dioxide emission, and conspiracy by way of illuminati.

    Can haz PHD now, plz.

  6. Okay, so I scored 100% in sorting the samples. I have a BSc from 20-odd years ago in life sciences, but more importantly I'm a writer, and I'm working/studying at the MA level in that field.

    Here's the thing: the SciGen stuff is grammatically and linguistically f__ked. I personally wouldn't have the faintest clue as to the meaning or context of some of the terms, but simple analysis of the linguistic meaning-structure of the samples was absolutely solid in distinguishing SciGen from real stuff.

    Of course, this is a limited case. I couldn't tell you whether a paper written by real people (and therefore, presumably, having a functional meaning structure in linguistic terms) was real or bullshit. A scientist operating in a field outside my expertise could certainly snow me.

    But unless the computer generated stuff picks up its grammatical game, someone with my skillset will easily distinguish it.

    So: we can at least partly solve the problem. Maybe we should put beta-readers in place before the actual peer review process: people who review the papers just to see if they actually make sense, before they get passed on to the science types who analyse that sense and decide whether it's solid enough to publish.
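The commenter's beta-reader idea can even be crudely mechanized as a first-pass filter.  A toy sketch of the idea follows; note that the phrase list is my own guess at SciGen's trademark tics (drawn from the quiz excerpts above), and that Labbé's real detector reportedly compared word-frequency profiles against known SciGen output rather than matching keywords:

```python
# Toy gibberish screen: score a passage by what fraction of known
# SciGen "tells" appear in it.  Phrases here are illustrative only.
SCIGEN_TELLS = [
    "location-identity split",
    "consistent hashing",
    "we motivate the need for",
    "zipf-like distribution",
    "pervasive epistemologies",
    "error bars have been elided",
]

def gibberish_score(text):
    """Return the fraction of signature phrases found in the passage."""
    lower = text.lower()
    hits = sum(phrase in lower for phrase in SCIGEN_TELLS)
    return hits / len(SCIGEN_TELLS)

sample = ("We motivate the need for the location-identity split, "
          "placing our work in context with existing work in this area.")
print(round(gibberish_score(sample), 2))
```

A nonzero score wouldn't prove a paper is fake, and a zero score certainly wouldn't prove it's real; the point is only that the cheapest possible screen already catches boilerplate gibberish, which makes it all the more striking that 120 such papers reached publication.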