Skeptophilia: linguistics

Showing posts with label linguistics.

Monday, February 9, 2026

Spell check

The connection between a spoken language and its written form, known as its orthography, is seldom straightforward.

Much has been made of the non-intuitive symbol-to-sound correspondence in English -- you've probably seen the old quip that "learning English spelling is rough, but it can be taught through tough, thorough thought, though." There are two main reasons for the often weird pronunciation rules (and multiple exceptions) in English. First, there's a general rule of thumb that the older a language's writing system is, the more time it's had to diverge from pronunciation. (Put a different way, pronunciation tends to shift faster than written language does.) Second, English is an amalgam of Germanic/Old English and Romance/Norman French, each of which had its own (different) pronunciation rules, with borrow-words added in from just about every culture English-speakers have contacted.

Honestly, though, for a strange writing-to-pronunciation correspondence, I don't think any language in the world can beat Irish and Scottish Gaelic. In what sensible system would the feminine name Caoimhe be pronounced "kwee-va"?

Now, don't get me wrong, I think Irish and Scottish Gaelic are both gorgeous languages. I just look at the written forms and think, "I can't even make a guess at how that's pronounced."

Of course, there's no problem that arose naturally and organically that humans can't make worse out of sheer cussedness. Deliberate misspellings in (for example) business names make me wonder how any child grows up knowing how to spell correctly. Near my village there used to be a children's dance studio -- now long out of business -- called, I shit you not, "The Shug'r-n-Spyce Skool of Dance."

And I'm with Dave Barry, who said that any business tacking an extra "e" onto the end of words to make them look old and quaint should be taxed at a higher rate. ("Ye Olde Curiositie Shoppe.")

We add another layer of weirdness when there are ill-advised attempts to meld English with non-English alphabets. There's a whole thing called "faux Cyrillic," where Cyrillic letters are thrown in to give something a pseudo-Russian flavor. Just look at the header on the game Tetris -- it's almost always written "TETЯIS." The problem is, "Я" isn't pronounced /r/, it's pronounced /ya/, so the game spelled this way would be "Tetyais."

Then, there's this sign in front of a Greek restaurant that I saw while visiting family in Santa Fe, New Mexico:

"Ooh, my favorite! Grssk Rthtphsssrphs!"

Throw into the mix the recent development of "Textspeak" -- lmfao and brb and ttyl are so commonly used that they don't even flag as misspellings -- and you have the makings of a confused mess. These sorts of conventions aren't only created to speed up communication, however; they can also be used to hide -- like "Leet," an online spelling convention originating in the late 1980s to allow hackers to communicate with each other on message boards without alerting the moderators by using forbidden keywords. (An example of Leet is that "elite hacker" is written "31337 H4XØR" -- the first word using 3, 1, and 7 for the letters E, L, and T, respectively, so the first word is "eleet.")
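For the curious, here's a minimal sketch of how that kind of substitution works. The mapping covers only the three letters from the "eleet" example above; the names LEET_MAP and to_leet are my own inventions for illustration, and real Leet has many competing variants, so don't take this as any kind of standard.

```python
# Illustrative mapping only: just the three substitutions from the "eleet" example.
# LEET_MAP and to_leet are hypothetical names, not part of any library.
LEET_MAP = {"e": "3", "l": "1", "t": "7"}


def to_leet(text: str) -> str:
    """Replace each mapped letter with its digit; leave everything else alone."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)


print(to_leet("eleet"))  # 31337
```

Anything not in the mapping passes through untouched, which is why a keyword filter looking for the plain spelling would miss the disguised one.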

It's always a struggle to stay one step ahead of bad actors, and there are scammers who have used this kind of technique to get people to respond to scam emails (or click on their websites), by substituting one similar, but non-English, character in a legitimate-looking website address. "Citibank.com," for example, might turn into "Citibɑnk.com" -- substituting the IPA symbol "ɑ" for the standard "a" -- and unless you were looking closely, you might well not notice the difference, and click your way to a website that is definitely not the real Citibank.
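A rough way to catch this sort of lookalike is to list every character in an address that falls outside plain ASCII. The sketch below does that with Python's standard unicodedata module; the function name suspicious_chars is made up for illustration, and I'm assuming the lookalike is the IPA letter U+0251. It's a naive check, not how browsers or mail filters actually handle internationalized domain names.

```python
import unicodedata


def suspicious_chars(domain: str) -> list[tuple[str, str]]:
    """Return (character, Unicode name) pairs for every non-ASCII character.

    Hypothetical helper for illustration; a legitimate non-English domain
    would also be flagged, so this is a prompt to look closer, not a verdict.
    """
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in domain
        if ord(ch) > 127
    ]


# "\u0251" is the IPA "ɑ" standing in for an ordinary "a".
print(suspicious_chars("Citib\u0251nk.com"))
print(suspicious_chars("Citibank.com"))
```

The first call reports the impostor letter by name; the second returns an empty list, because every character in the genuine address is ordinary ASCII.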

So what we end up with is a mishmash of problems that arose from a combination of the vagaries of language evolution and deliberate attempts to mess things up further, along with a good measure of pure idiocy:

As Julius Caesar so famously said, "Vspph, vphdph, vphcph."

In the above case, there's also an apparent disregard of what my tattoo artist said to me -- "Be sure it's what you want, because that shit's permanent."

So that's this morning's musings on some weird features of written language. Understand that I'm not one of those types who rails at every alteration from the King's English -- I'm about as far from a prescriptivist as you can get. I can't help but wonder, though, if some of what's happened has actually made it more difficult to be understood.

Of course, if you're a ЯUSSIAИ 31337 H4XØR, that's probably exactly what you wanted.

****************************************

Wednesday, December 31, 2025

The mystery of the Travellers

Monday's post, about the difficulty of defining the term race, prompted a loyal reader of Skeptophilia to ask if I'd ever heard of the Irish Travellers.

I asked if they were Romani (colloquially referred to as Gypsies, although that term is now usually considered a slur) who live in Ireland. She said no -- the story is more interesting than that.

And indeed it is.

The Travellers, or Mincéirí, are a generally nomadic group of people whose origins are shrouded in mystery, but who by some accounts have lived on the island as an identifiable group since at least the twelfth century C.E. In the Irish language they're called An Lucht Siúil -- "the walking people." They have a distinct style of dress, including an emphasis on beadwork and embroidery, and their own sets of tunes and songs. They even speak a separate language -- Shelta -- which contains words from Irish and English, as well as a number of what appear to be neologisms. It's not been well-studied, because as a group with a history of persecution, the Travellers are (understandably) reluctant to share their knowledge with outsiders. What's known of it, though, seems to be unintelligible to speakers of both Irish and English, and to qualify as an actual separate language (i.e., not a dialect or a pidgin).

Despite the fact that they've experienced discrimination, and the difficulty of maintaining their lifestyle in the face of an increasingly homogenized, technological world, there are still over thirty thousand people in Ireland who self-identify as Travellers.

A Traveller caravan in July 1954 [Image credit: National Library of Ireland]

Their origins are a mystery. There are Romani in Ireland, just as there are in most European countries; although they occupy a societal niche similar to that of the Travellers, they seem to be unrelated. (Genetic studies of Romani have shown fairly conclusively that they are an Indo-Aryan people who made their way into Europe something like a thousand years ago from what is now the Indian state of Rajasthan.) An analysis of the genetics of the Travellers has found that they are essentially Irish in origin, although they have been reproductively isolated from the rest of the population since at least the eleventh century C.E., and possibly before. This study concluded that while related, the Travellers are as distinct from the rest of the Irish as the Icelanders are from the Norwegians.

How could this have happened? One hypothesis -- and it's no more than that -- is that the ancestors of the Travellers belonged to an itinerant profession that was looked down upon and segregated not because of genetic unrelatedness, but because of social stigma (similar to the Dalits of India). Like many people with a history of oppression, they are struggling to maintain their language, culture, and identity, and have finally achieved recognition by the Irish government as a distinct ethnic group worthy of protection.

Logo of the All-Travellers Forum (Mincéir Whiden is Shelta for "Travellers talking")

This group highlights once again the difficulty of defining what we mean by race or ethnicity. Genetically, the Travellers are very similar to the Irish, and seem to share a common origin some time in the last millennium. Their language, Shelta, probably started out being a pidgin of Old Gaelic and Middle English, but now (like the Kreyòl language of Haiti) has evolved and strengthened into an actual complex and complete language. Culturally, they're distinct enough to warrant governmental recognition and at least some efforts toward protection and support.

This is hardly the only such case known. Here in the United States, we've got the Melungeons of eastern Tennessee and Kentucky and southwestern Virginia, the Brass Ankles of South Carolina, and the Redbones of southwestern Louisiana, all of whom seem from genetic studies to be "tri-racial isolates" descended from a combination of sub-Saharan African, Native American, and western European ancestors, but who -- like the Travellers -- have been separate long enough to develop their own distinct cultures. My mother's people, the Cajuns, are another such case; they're predominantly of Nova Scotian French ancestry, but have a good admixture of Indigenous Canadian, French Creole, Spanish, and German ancestry, and by virtue of being isolated for a good two centuries, have developed a unique culture and language. My having learned French as a child from my older relatives means I have a strong Cajun accent when I speak it. When I've visited Québec, I've often found it difficult to understand and be understood -- another example, to pilfer a quip from Oscar Wilde, of two countries separated by the same language.

So there you have it. Thank you to the reader who suggested the topic; I always love it when my research for this blog results in my learning something I hadn't known about. I find human genetics, ethnicity, language, and migration patterns endlessly fascinating -- explaining my choice of a field for my master's degree, and the frequency with which the topic shows up here at Skeptophilia. And I suppose we shouldn't be surprised that the truth is more complex than our desire to pigeonhole reality would suggest. As Ursula Le Guin put it, "I never knew anybody who found life simple. I think a time or a life looks simple only if you leave out the details."

****************************************

Saturday, December 13, 2025

The cost of beauty

I have a great admiration for poets.

They have an amazing way of collapsing a tremendous amount of punch into a small space. The best poets use language in vivid and surprising ways, often contravening the laws of grammar to evoke powerful images and emotions. Consider, for example, Peter Viereck's striking poem "The Lyricism of the Weak:"

I sit here with the wind is in my hair;
I huddle like the sun is in my eyes;
I am (I wished you'd contact me) alone.

A fat lot you'd wear crape if I was dead.
It figures, who I heard there when I phoned you;
It figures, when I came there, who has went.

Dogs laugh at me, folks bark at me since then;
"She is," they say, "no better than she ought to";
I love you irregardless how they talk.

You should of done it (which it is no crime)
With me you should of done it, what they say.
I sit here with the wind is in my hair.

Viereck's twisted syntax and use of questionable forms like "irregardless" and "should of" might be wrong -- more about that in a moment -- but man, they work.

In my opinion, though, no one used language in more startling and creative ways than e. e. cummings. He's the name I come up with whenever I'm asked, "Who is your favorite poet?" He uses words the way a skilled Impressionist painter uses color. Some of his best -- from paeans to joy and love like "if everything happens that can't be done" to emotional sucker punches like "me up at does" and "anyone lived in a pretty how town" -- actually use ungrammaticality as a tool.

This is why I was of two minds when I read an interesting paper by Thom Scott-Phillips of Central European University called, "Why Do Humans Have Linguistic Intuition?" Why, for example, do we intuitively recognize that the sentence in English "I don't want to go to the cinema" is okay but "I don't want going to the cinema" is not? Why, in Viereck's poem, did the word "is" in the first line sound like a linguistic hiccup?

Scott-Phillips contends that we expect the speaker (or writer) to be expressing him/herself using the "principle of optimal relevance:"

Informally, "optimally relevant" means "efficient use of cognitive resources." More formally, the relevance of a stimulus is the trade off between the cognitive costs and the cognitive benefits created by attending to and processing the stimulus; and stimuli are optimally relevant if and only if neither costs not [sic] benefits can be improved without making the other worse off. Cognitive costs are, in the most general sense, the opportunity costs of attention; and in the specific context of communication this effectively means audience processing costs. Cognitive benefits are, in the most general sense, the impact that attention has on future decision making; and in the specific context of communication this effectively means accurate enough identification of the communicator’s intended meaning. Putting all this together, the Communicative Principle of Relevance implies that when interpreting communicative stimuli, audiences presume that no alternative stimulus could suggest the same (or a very similar) meaning at lower processing cost for the audience.

He compares our sense of sentences that "feel wrong" to our immediate (and intuitive) recognition of the weirdness of "impossible objects:"

[Image is in the Public Domain]

He further states that our linguistic alarm bells go off in one of three situations:

The sentence appears to have no plausible cognitive benefits in the first place (i.e. no meaning can be determined), such that there is no possible trade off of costs and benefits (i.e. no relevance).
The sentence deviates from conventional use without any plausible change in interpretation, however small or nuanced. Such sentences raise the cognitive costs of interpretation with no plausible change in benefits.
There are mutual contradictions between the functions of two (or more) constructions within a sentence, rendering the optimisation of cognitive costs and cognitive benefits impossible.

Well, okay, but.

I can accept this in the case of technical communication, or even (most) common conversation, where the main goal is simply being understood. (Although given how often misunderstandings take place, perhaps "simply" is itself the wrong choice of words.) But what about language being used to evoke emotion? In Viereck's poem, the non-standard grammar was a mirror of the disordered thoughts of the jilted lover he was writing about, and in context works brilliantly. I mean, try straightening out the sentence structure and "correcting" the wording -- what you'll have left is an empty complaint you probably wouldn't remember five minutes from now.

On the other hand, I first ran into "The Lyricism of the Weak" when I was in college, something like forty-five years ago -- and it popped into my mind immediately when I read Scott-Phillips's paper.

My point is, strange and unexpected syntax -- ungrammatical usage -- might have "no plausible cognitive benefits" in a scientific paper, a news report, or a conversation with your significant other. But in poetry, and in beautifully-written prose fiction as well, the cognitive costs are worth it. Consider the epic smackdown King Théoden of Rohan gave to Saruman in J. R. R. Tolkien's The Two Towers:

"We will have peace," said Théoden at last thickly and with an effort. Several of the Riders cried out gladly. Théoden held up his hand. "Yes, we will have peace," he said, now in a clear voice, "we will have peace, when you and all of your works have perished -- and the works of your dark master to whom you would deliver us. You are a liar, Saruman, and a corrupter of men's hearts. You hold out your hand to me, and I perceive only a finger of the claw of Mordor. Cruel and cold! Even if your war on me was just -- as it was not, for were you ten times as wise you would have no right to rule me and mine for your own profit as you desired -- even so, what will you say of your torches in Westfold and the children that lie dead there? And they hewed Háma's body before the gates of the Hornburg, after he was dead. When you hang from a gibbet at your window for the sport of your own crows, I will have peace with you and Orthanc. So much for the House of Eorl. A lesser son of great sires am I, but I do not need to lick your fingers. Turn elsewhither. But I fear your voice has lost its charm."

There's some non-standard grammar in there, and a few words (like "elsewhither") that wouldn't show up in common vocabulary. It's not written simply -- with "optimal relevance" -- but wow. I defy you to find a single word you could change in that passage without lessening its impact.

Honestly, I suspect that Scott-Phillips wouldn't disagree; he did, after all, say that the problem arose when a sentence "deviates from conventional use without any plausible change in interpretation, however small or nuanced," and it's that nuance that I'm talking about here. But I think sometimes a strange, even jarring, turn of phrase can be preferable to more straightforward diction. Think of your own favorite example of evocative writing (and feel free to post some examples in the comments!), and consider the damage if some grammar prescriptivist insisted that it all be written according to "the rules."

For me, the cognitive cost of reading something beautiful is one I'm willing, even eager, to pay.

****************************************

Friday, November 28, 2025

Wandering in the Tower of Babel

How many languages are there in the world?

Seems like it should be an easy question, right? Not so much. Just like the issue of biological species (that I touched upon in Wednesday's post, about dogs and wolves), figuring out where to draw dividing lines in linguistics isn't simple. How different do two modes of speech or writing have to be to constitute separate languages?

Here's an admittedly rather facile example from my own experience. I grew up speaking both French and English; for three of my grandparents, and my mom, French was their first language. What I heard, though, was Cajun French, a dialect brought into southern Louisiana by people who had been exiled from Acadia (now called Nova Scotia) in the mid-eighteenth century. (In my mother's family's case, they did a thirty-year stint in France first, and left on the 1785 Acadian Expeditions when the king of France decided they weren't fitting in and basically paid them to go away. Just as well; they missed the French Revolution, which broke out only four years later. A couple of years after that, the king regretted not going with them.)

What's interesting, though, is that when I go to Québec, I have a really hard time understanding spoken French. Part of it is that admittedly, I'm a bit rusty; I haven't been around francophones for forty years. But the accent is so different from what I'm used to that it often befuddles me. Further still are French-based creoles like Haitian Creole, Antillean Creole, and Seychellois; those have a lot of French vocabulary but a great admixture of words (and grammatical structures) from African languages, particularly from western Volta-Congo languages such as Fongbe and Igbo (for the first two) and Bantu languages (for Seychellois).

And I can verify that Haitian Creole and French aren't mutually intelligible. A well-meaning principal I worked for was welcoming in a young lady who was a refugee from Haiti, and told her that I was someone she could speak to in her native language -- assuming that Haitian Creole and French were close enough that we could chat. She and I had a good laugh when we found out that neither of us spoke the other's language, so we had to get by on her broken English supplemented when necessary with my rather ill-remembered French, when the words were at least close enough to help.

Another example is Breton, a Celtic language related to Welsh that's spoken in Brittany. My band recorded a couple of songs in Breton (here's one example), and a friend noticed how much it sounded like French. Like Haitian Creole, though, it really is a different language, with its own grammar, syntax, and lexicon -- but enough borrowed words and pronunciation influence from French that it has a superficially French sound.

So even with currently extant languages, it's hard to know where to draw the lines. Standard French, Cajun French, and Québecois are usually considered close enough to count as the same language (more specifically, as three dialects -- the linguistic analog to a subspecies). Breton, and the creoles I listed, are universally considered to be separate languages. The current estimate is that there are now around seven thousand languages spoken in the world, although that number gets revised all the time as we learn more about them.

The situation becomes even more difficult when you start considering languages across time. Languages evolve, despite the prescriptivists' best efforts, and -- once again, like with biological evolution -- it's an open debate where you draw the line. English today is pretty similar to English spoken in England and eastern North America in the eighteenth century; a few different words and some odd (to our ears) grammar, is all. Go back to Shakespeare's day, and it was more different still, although -- with practice -- modern readers can see a performance of Macbeth or As You Like It and understand what's being said. (And even, in the latter, be able to laugh at most of the bawdy jokes.) Back in Chaucer's times, today's English speakers would have a difficult time of it. And actual Old English -- no, Shakespeare isn't "Old English," even if you've heard it called that -- is a completely different language, not mutually intelligible with Modern English. For example, can you identify this passage?

Fæder ure þu þe eart on heofonum,
si þin nama gehalgod.
To becume þin rice, gewurþe ðin willa,
on eorðan swa swa on heofonum.

If you recognized it as the opening lines of the Lord's Prayer, you're either a language nerd or else really good at picking out patterns. Old English was a language related to a dialect of West Germanic spoken in Saxony -- on the border of what is now Denmark and Germany -- and so unlike Modern English that if you went back to tenth-century England, you'd need a handy phrase book to be understood.

The reason all this comes up is that I stumbled upon a site listing "spurious languages" -- written or spoken systems that we once thought were languages, and now we are kind of saying, "Um, maybe not." And there are a lot of them. Some, like Malakhel -- an Eastern Iranian dialect spoken in the Waziristan region of Pakistan -- were, like Cajun French, found to be close enough to an existing language (Ormuri) that the two were combined. Some, like Dazawa, a Chadic language spoken in northern Nigeria, are so poorly studied we honestly don't know if they're separate languages or not; in the case of Dazawa, there are only a handful of speakers, and most of them have switched to speaking the majority language of Hausa, so it might be too late to find out. Some, like Palpa, a language supposedly spoken by a small group of people in Nepal, are probably due to inaccuracies in study, and may never have existed in the first place.

Then there are languages (probably?) that are known from only an inscription or two, so there's not enough information available even to make a firm determination. One of many examples is Noric, a presumed Celtic language spoken in the Roman province of Noricum (present-day Austria and Slovenia), known from a grand total of three short inscriptions.

The Grafenstein fragment, one of three known examples of the Noric language [Image is in the Public Domain]

It's written in Old Italic script, an alphabet also used for the only-distantly-related Etruscan language. This particular one appears to be a record of a financial transaction. Another, found near Ptuj, Slovenia, says:

𐌀𐌓𐌕𐌄𐌁𐌖𐌈𐌆𐌁𐌓𐌏𐌙𐌈𐌖𐌉 (ARTEBUDZBROGDUI)

It's thought to be a personal name -- Artebudz, son of Brogduos. Linguists suspect that the name Artebudz comes from Celtic root words meaning "bear penis," which I think we can all agree is a hell of a name.

The thing is, with only short bits to analyze, any determination of what this language was, who (other than Bear Penis) spoke it, how widespread and long-lived it was, and how it was related to other languages at the time is little more than educated speculation.

It's astonishing to think that even as small as the world has gotten, what with near-instantaneous digital communication, international travel, and maps of damn near the entire planet, we still have a hard time pinning down language. It's fluid, ever-changing, dynamic, with new forms cropping up all the time and old ones dying out or being subsumed. But that's part of the fascination of linguistics, isn't it? Something like speech and the written word, that most of us take for granted, is actually phenomenally complex, to the point of being nearly impossible to pigeonhole.

It's no wonder the ancient Israelites thought of the myth of the Tower of Babel to explain it all. Even wandering amongst its many rooms is enough to boggle the mind.

****************************************

Tuesday, November 18, 2025

The old gods

My M.A. is in historical linguistics, focusing particularly on northern European languages and how they interacted in (relatively) recent times. (While "recent," to a linguist, isn't quite as out of line with common usage as compared to how it's used by geologists, it bears mentioning that my earliest point of research is around fifteen hundred years ago.) One of the difficulties I ran into was that two of the languages I studied -- Old English and Old Norse -- descend from a common root a very long time ago, so they share some similarities that are "genetic." A simple example is that the Old English word for home (hām) and the Old Norse word (heim) are both descended from a reconstructed Proto-Germanic root *haimaz. So if a word in Modern English comes from an Old Norse borrow-word -- one that came into English following the Viking invasions in the ninth and tenth centuries -- how could you differentiate that from a word that had been there all along, descending from the common roots of the two languages?

The most effective method relies on the fact that during the time following the split between the ancestors of Old English and Old Norse, each of the languages evolved in different directions. To take just one of many examples I used, some time around the eighth century, a pronunciation shift occurred called palatalization. This is when words with a stop (p, t, d, g, and so on) followed by a front vowel (i or e) eventually "palatalize" the consonant, usually to y, j, or ch. (It's driven by ease of pronunciation, and it's still happening today -- it's why in fast speech most people pronounce "don't you" as something like /dontchu/.)

In any case, words with /gi/ and /ge/ combinations in Old English all got palatalized to /yi/ and /ye/. It's why we have yield (Old English gieldan), yet (Old English gīet) and yellow (Old English geolu), to name three. So how do we have any /gi/ and /ge/ words left? Well, if they were borrowed -- mostly from the Norse-speaking invaders -- after the palatalization shift happened, they missed their chance. So most of our words with that combination (gift, get, girth, gear, and so on) are Old Norse loan-words.
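If you like seeing rules spelled out mechanically, here's a toy sketch of that reasoning, using only the example words from this post. The function names are hypothetical, and it works on spelling rather than sound, which is a big simplification: a French loan like "general" would be flagged too, and the vowels don't get updated, so treat it as an illustration of the logic and not a real diagnostic.

```python
# Toy model of the gi-/ge- palatalization argument; all names are hypothetical.
FRONT_VOWELS = set("ieīē")  # plain and long (macron) spellings of i and e


def palatalize_initial_g(old_english: str) -> str:
    """Turn a word-initial g into y when a front vowel follows it."""
    if (
        len(old_english) >= 2
        and old_english[0] == "g"
        and old_english[1] in FRONT_VOWELS
    ):
        return "y" + old_english[1:]
    return old_english


def looks_unpalatalized(modern: str) -> bool:
    """True if a modern spelling still begins with gi- or ge-."""
    return modern.lower().startswith(("gi", "ge"))


for word in ["gieldan", "gīet", "geolu"]:
    print(word, "->", palatalize_initial_g(word))

for word in ["gift", "get", "girth", "gear", "yield", "yet", "yellow"]:
    print(word, "candidate loan" if looks_unpalatalized(word) else "inherited pattern")
```

The first loop applies the shift to the Old English forms; the second flags modern words that apparently missed it, which is the clue that they arrived after the shift was over.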

That's just one of the patterns I used, but it gives you the flavor of how this sort of work is done. It comes down to differentiating genetic relationships between languages (inherited from common ancestry) from incidental relationships (through migration, cultural contact, and borrowing). Anyhow, the point is, I've been steeped in this kind of research for a long time. (Since "recently," in fact.)

But what I didn't know is that the same techniques have been brought to bear not on linguistics, but on religion, myth, and belief patterns. The work I saw was done on Indo-European speaking cultures (encompassing languages from the British Isles all the way to India), but there's no reason the same techniques couldn't be used for other linguistic/cultural groups.

When I found out about it, my immediate thought was, "Brilliant! That makes total sense." Deities can be "inherited" (passed down within a culture) or "borrowed" (adopted because of cultural contact), just like words can. The names are a big clue; so, of course, are the physical, personal, and spiritual attributes. Some of the more obvious ones -- here called by their reconstructed Proto-Indo-European names -- include *Dyḗws Ph₂tḗr, the daylight-sky god; his consort *Dʰéǵʰōm, the earth mother; his daughter *H₂éwsōs, the dawn goddess; his sons the Divine Twins; *Seh₂ul, the sun god; and *Meh₁not, the moon goddess.

When you start seeing the patterns, they jump out at you. *Dyḗws Ph₂tḗr directly led to Zeus, Jupiter, the Vedic sky god Dyáus, the Albanian sky god Zojz, and the Norse war god Týr. To take only one other example -- *H₂éwsōs, the goddess of dawn, gave rise to the Greek Eos, the Vedic Ushas, the Lithuanian Aušrinė, and the Germanic Ēostre or Ostara -- from whose name we get our word Easter. (The word Easter has nothing to do with the Babylonian goddess Ishtar, despite the rather hysterical post to that effect that seems to get passed around every spring. The two sound a little similar but have no cultural or linguistic connection other than that.)

Aurora, Goddess of Dawn, by Giovanni Francesco Barbieri (1621) [Image is in the Public Domain]

What I find most fascinating about all this is how conservative cultures can be. If the name of a dawn goddess in the three-thousand-year-old Indian Rig Veda is linguistically and thematically connected to the name of a similar goddess revered in eighth century C.E. Scandinavia, how far back do her roots go? That there is any similarity considering the geographical separation and the long passage of time is somewhere beyond remarkable.

Our beliefs are remarkably resistant to change, and when a belief is hooked to something in a language, that bit of language becomes frozen, too. Well, not frozen, exactly, but really sluggish. The old gods, it seems, are still with us.

Changed, perhaps, but still recognizable.

****************************************

Saturday, August 30, 2025

The universal language

Sometimes I have thoughts that blindside me.

The last time that happened was a couple of days ago, while I was working in my office and our puppy, Jethro, was snoozing on the floor. Well, as sometimes happens to dogs, he started barking and twitching in his sleep, and followed it up with sinister-sounding growls -- all the more amusing because while awake, Jethro is about as threatening as your average plush toy.

So my thought, naturally, was to wonder what he was dreaming about. Which got me thinking about my own dreams, and recalling some recent ones. I remembered some images, but mostly what came to mind were narratives -- first I did this, then the slimy tentacled monster did that.

That's when the blindside happened. Because Jethro, clearly dreaming, was doing all that without language.

How would thinking occur without language? For almost all humans, our thought processes are intimately tied to words. In fact, the experience of having a thought that isn't describable using words is so unusual that we have a word for it -- ineffable.

Mostly, though, our lives are completely, um, effable. So much so that trying to imagine how a dog (or any other animal) experiences the world without language is, for me at least, nearly impossible.

What's interesting is how powerful this drive toward language is. There have been studies of pairs of "feral children" who grew up together but with virtually no interaction with adults, and in several cases those children invented spoken languages with which to communicate -- each complete with its own syntax, morphology, and phonetic structure.

A fascinating study that came out in the Proceedings of the National Academy of Sciences, detailing research by Manuel Bohn, Gregor Kachel, and Michael Tomasello of the Max Planck Institute for Evolutionary Anthropology, showed that you don't even need the extreme conditions of feral children to induce the invention of a new mode of symbolic communication. The researchers set up Skype conversations between monolingual English-speaking children in the United States and monolingual German-speaking children in Germany, but simulated a computer malfunction where the sound didn't work. They then instructed the children to communicate as best they could anyhow, and gave them some words/concepts to try to get across.

They started out with some easy ones. "Eating" resulted in the child miming eating from a plate, unsurprisingly. But they moved to harder ones -- like "white." How do you communicate the absence of color? One girl came up with an idea -- she was wearing a polka-dotted t-shirt, and pointed to a white dot, and got the idea across.

But here's the interesting part. When the other child later in the game had to get the concept of "white" across to his partner, he didn't have access to anything white to point to. He simply pointed to the same spot on his shirt that the girl had pointed to earlier -- and she got it immediately.

Language is defined as arbitrary symbolic communication. Arbitrary because with the exception of a few cases like onomatopoeic words (bang, pow, ping, etc.) there is no logical connection between the sound of a word and its referent. Well, here we have a beautiful case of the origin of an arbitrary symbol -- in this case, a gesture -- that gained meaning only because the recipient of the gesture understood the context.

I'd like to know if such a gesture-language could gain another characteristic of true language -- transmissibility. "It would be very interesting to see how the newly invented communication systems change over time, for example when they are passed on to new 'generations' of users," said study lead author Manuel Bohn, in an interview with Science Daily. "There is evidence that language becomes more systematic when passed on."

In time, might you end up with a language that was so heavily symbolic and culturally dependent that understanding it would be impossible for someone who didn't know the cultural context -- like the Tamarians' language in the brilliant, poignant, and justifiably famous Star Trek: The Next Generation episode "Darmok"?

"Sokath, his eyes uncovered!"

It's through cultural context, after all, that languages start developing some of the peculiarities (also seemingly arbitrary) that led Edward Sapir and Benjamin Whorf to develop the hypothesis that now bears their names -- that the language we speak alters our brains and changes how we understand abstract concepts. In K. David Harrison's brilliant book The Last Speakers, he tells us about a conversation with some members of a nomadic tribe in Siberia who always described positions of objects relative to the four cardinal directions -- so at the moment my coffee cup wouldn't be on my right, it would be south of me. When Harrison tried to explain to his Siberian friends how we describe positions, at first he was greeted with outright bafflement.

Then, they all erupted in laughter. How arrogant, they told him, that you see everything as relative to your own body position -- as if when you turn around, suddenly the entire universe changes shape to compensate for your movement!

[Image available under a license at https://commons.wikimedia.org/wiki/File:Human_Language_Families_Map.PNG]

Another interesting example of this was the subject of a 2017 study by linguists Emanuel Bylund and Panos Athanasopoulos, which focused not on our experience of space but of time. And they found something downright fascinating. Some languages (like English) are "future-in-front," meaning we think of the future as lying ahead of us and the past behind us, turning time into something very much like a spatial dimension. Other languages retain the spatial aspect, but reverse the direction -- such as the Peruvian language of Aymara. For Aymara speakers, the past is in front, because you can remember it, just as you can see what's in front of you. The future is behind you -- therefore invisible.

Mandarin takes the spatial axis and turns it on its head -- the future is down, the past is up (so the literal translation of the Mandarin expression for "next week" is "down week"). Asked to order photographs of someone in childhood, adolescence, adulthood, and old age, Mandarin speakers will place them vertically, with the youngest on top. English and Swedish speakers tend to think of time as a line running from left (past) to right (future); Spanish and Greek speakers tend to picture time as a spatial volume, as if it were something filling a container (so emptier = past, fuller = future).

All of which underlines how fundamental to our thinking language is. And further baffles me when I try to imagine how other animals think. Because whatever Jethro was imagining in his dream, he was clearly understanding and interacting with it -- even if he didn't know to attach the word "squirrel" to the concept.

****************************************

Monday, August 25, 2025

Tall tales and folk etymologies

My master's degree is in historical linguistics, and one of the first things I learned was that it's tricky to tell if two words are related.

Languages are full of false cognates, pairs of words that look alike but have different etymologies -- in other words, their similarities are coincidental. Take the words police and (insurance) policy. Look like they should be related, right?

Nope. Police comes from the Latin politia (meaning "civil administration"), which in turn comes from polis, "city." (So it's a cognate to the last part of words like metropolis and cosmopolitan.) Policy -- as it is used in the insurance business -- comes from the Old Italian poliza (a bill or receipt) and back through the Latin apodissa to the Greek ἀπόδειξις (meaning "a written proof or declaration"). To make matters worse, the other definition of policy -- a practice of governance -- comes from politia, so it's related to police but not to the insurance meaning of policy.

Speaking of government -- and another example of how you can't trust what words look like -- you might never guess that the word government and the word cybernetics are cousins. Both of them come from the Greek κυβερνητικός -- a mechanism used to steer a ship.

My own research was about the extent of borrowing between Old Norse, Old English, and Old Gaelic, as a consequence of the Viking invasions of the British Isles that started in the eighth century C.E. The trickiest part was that Old Norse and Old English are themselves related languages; both of them belong to the Germanic branch of the Indo-European language family. So there are some legitimate cognates there, words that did descend in parallel in both languages. (A simple example is the English day and Norwegian dag.) So how do you tell if a word in English is there because it descended peacefully from its Proto-Germanic roots, or was borrowed from Old Norse-speaking invaders rather late in the game?

It isn't simple. One group of words I'm fairly sure are Old Norse imports is the set that have a hard /g/ sound followed by an /i/ or an /e/, because some time around 700 C.E. the native Old English /gi/ and /ge/ words were palatalized to /yi/ and /ye/. (Two examples are yield and yellow, which come from the Anglo-Saxon gieldan and geolu respectively.) So if we have surviving words with a /gi/ or /ge/ -- gift, get, gill, gig -- they must have come into the language after 700, as they escaped getting palatalized to *yift, *yet, *yill, and *yig. Those words -- and over a hundred more I was able to identify, using similar sorts of arguments -- came directly from Old Norse.
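For the programmers in the audience, the logic of that argument can be sketched in a few lines of Python. This is a toy illustration only: the function name palatalize is made up for the example, and it works on modern spellings as a crude stand-in for the actual sounds, which real historical phonology would never do.

```python
def palatalize(word):
    """Toy version of the sound change described above:
    word-initial g before i or e becomes y.
    Assumption: spelling stands in for pronunciation, which is a simplification."""
    if len(word) >= 2 and word[0] == "g" and word[1] in "ie":
        return "y" + word[1:]
    return word

# Words from the paragraph above that kept their hard g
survivors = ["gift", "get", "gill", "gig"]

# What they would look like had they been in the language before the change.
# The asterisk marks a hypothetical, unattested form.
hypothetical = ["*" + palatalize(w) for w in survivors]
```

Since the starred forms don't exist and the hard-g forms do, the inference is that those words arrived after the rule had stopped operating.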

[Image licensed under the Creative Commons M. Adiputra, Globe of language, CC BY-SA 3.0]

Anyhow, the whole topic comes up because I've been seeing this thing going around on social media headed, "Did You Know...?" with a list of a bunch of words, and the curious and funny origins they supposedly have.

And almost all of them are wrong.

I've refrained from saying anything to the people who posted it, because I don't want to be the "Well, actually..." guy. But it rankled enough that I felt impelled to write a post about it, so this is kind of a broadside "Well, actually...", which I'm not sure is any nicer. But in any case, here are a few of the more egregious "folk etymologies," as these fables are called -- just to set the record straight.

History doesn't come from "his story," i.e., a deliberate way to tell men's stories and exclude women's. The word's origins have nothing to do with men at all. It comes from the Greek ἱστορία, "inquiry."
Snob is not a contraction of the Latin sine nobilitate ("without nobility"). It's only attested back to the 1780s and is of unknown origin.
Marmalade doesn't have its origin with Mary Queen of Scots, who supposedly asked for it when she had a headache, leading her French servants to say "Marie est malade." The word is much older than that, and goes back to the Portuguese marmelada, meaning "quince jelly," and ultimately to the Greek μελίμηλον, "apples preserved in honey."
Nasty doesn't come from the biting and vitriolic nineteenth-century political cartoonist Thomas Nast. In fact, it predates Nast by several centuries (witness Hobbes's comment about medieval life being "poor, nasty, brutish, and short," which was written in 1651). Nasty probably comes from the Dutch nestig, meaning "dirty."
Pumpernickel doesn't have anything to do with Napoleon and his alleged horse Nicole who supposedly liked brown bread, leading Napoleon to say that it was "Pain pour Nicole." Its actual etymology is just as weird, though; it comes from the medieval German words pumpern and nickel and translates, more or less, to "devil's farts."
Crap has very little to do with Thomas Crapper, who perfected the design of the flush toilet, although it certainly sounds like it should (and his name and accomplishment probably repopularized the word's use). Crapper's unfortunate surname comes from cropper, a Middle English word for "farmer." As for crap, it seems to come from Medieval Latin crappa, "chaff," but its origins before that are uncertain.
Last, but certainly not least, fuck is not an acronym. For anything. It's not from "For Unlawful Carnal Knowledge," whatever Van Halen would have you believe, and those words were not hung around adulterers' necks as they sat in the stocks. It also doesn't stand for "Fornication Under Consent of the King," which comes from the story that in bygone years, when a couple got married, if the king liked the bride's appearance, he could claim the right of "jus primae noctis" (also called "droit du seigneur"), wherein he got to spend the first night of the marriage with the bride. (Apparently this did happen, but rarely, as it was a good way for the king to seriously piss off his subjects.) But the claim is that afterward -- and now we're in the realm of folk etymology -- the king gave his official permission for the bride and groom to go off and amuse themselves as they wished, at which point he stamped the couple's marriage documents "Fornication Under Consent of the King," meaning it was now legal for the couple to have sex with each other. The truth is, this is pure fiction. The word fuck comes from a reconstructed Proto-Germanic root *fug, meaning "to strike." There are cognates (same meaning, different spelling) in just about every Germanic language there is. In English, the word is one of the most amazing examples of lexical diversification I can think of; there's still the original sexual definition, but consider -- just to name a few -- "fuck that," "fuck around," "fuck's sake," "fuck up," "fuck-all," "what the fuck?", and "fuck off." Versatile fucking word, that one.

So anyway. Hope that sets the record straight. I hate coming off like a know-it-all, but in this case I actually do know what I'm talking about. A general rule of thumb (which has nothing to do with the diameter of the stick you're allowed to beat your wife with) is, "don't fuck with a linguist." No acronym needed to make that clear.

****************************************

Saturday, July 19, 2025

Footprints

The southern tip of mainland Italy is called Calabria. It's a strikingly beautiful place, containing three national parks (Pollino National Park, Sila National Park and Aspromonte National Park), and a stretch of coastline -- near Reggio, facing across the Straits of Messina to Sicily -- that poet Gabriele D'Annunzio called "the most beautiful kilometer in Italy." It's a region blessed with more than its share of dramatic scenery.

[Image licensed under the Creative Commons Cliff at Tropea, Italy, Sep 2005 , CC BY-SA 2.5]

Calabria forms the "toe of Italy's boot." I remember noticing the country's odd shape when I was a kid and first became fascinated with maps (a fascination that remains with me today), and wondering why it looked like that; back then, when plate tectonics was still a new science, I doubt they really understood it on a level any deeper than "it's near a plate margin, and that moves stuff around." Today, we have a much more detailed understanding of the geology of the area, and it is complex.

Tectonic map of southern Italy and Sicily [Image licensed under the Creative Commons Jpvandijk, J.P. van Dijk, Janpieter van Dijk, Johannes Petrus van Dijk, CentralMediterranean-GeotectonicMap, CC BY-SA 4.0]

On its simplest level, the entire southern half of Italy is being pushed to the southeast, and it's riding up and over the northern edge of the African Plate. This process is responsible not only for the volcanism of the region -- Mount Etna being the most obvious example -- but the massive earthquakes that have shaped it, in part creating the gorgeous topography. (It also has made it a dangerous place to live. The Messina Earthquake of 1908, with an epicenter right across the straits from Calabria, had a magnitude of 7.1 and killed an estimated eighty thousand people, most of them in the first three minutes after the quake struck and the majority of the buildings collapsed.)

As interesting as the geology of the region is, that's not what spurred me to write about the topic today. What I'd like to tell you about is Calabria's tremendous linguistic diversity, an embarrassment of riches packed into a small geographical area. The main language, of course, is standard Italian, but a great many people there (especially in the southern parts) speak Calabrian, a Greek-influenced-Latin derivative that is mostly mutually intelligible with Italian but has some distinct vocabulary and pronunciations.

Then there's Grecanico, which is derived from an archaic dialect of Byzantine Greek, and is spoken by a group of people descended from folks who settled in the region more than a thousand years ago and have somehow maintained their ethnic identity the whole time. It's written with the Latin, not Greek, alphabet -- but other than that has more in common with Thessalian Greek than with Italian.

Another language that has little to do with Italian is Arbëresh, a dialect of Albanian brought in with migrants during the Late Middle Ages. From some of its idiosyncrasies, it appears to be related to Tosk Albanian, a group of dialects spoken in the southern parts of Albania, near the border of Greece. It's astonishing that we can still identify the part of the world the ancestors of the Arbëreshë people came from centuries ago -- by the peculiarities of the language they have spoken during the more than six hundred years they've lived in isolated communities in Calabria.

Finally, there's Gardiol, which is related to Occitan (also known as Provençal or Languedoc), the Romance language widely spoken in the southern half of France. Like with Calabrian (and also Catalan in Spain), most Occitan speakers in France speak the majority language as well, but use Occitan when speaking with family, friends, and locals. The ancestors of the speakers of Gardiol were Waldensian "heretics" who fled persecution in France in the thirteenth century and found a refuge in a thinly populated part of northern Calabria. Once again -- amazingly -- they've retained their ethnic identity and language through all the vagaries of time since their arrival.

All of that -- and standard Italian as well -- in an area of around fifteen thousand square kilometers, a little more than the size of the state of Connecticut.

UNESCO describes all four of these languages -- Calabrian, Grecanico, Arbëresh, and Gardiol -- as "in serious danger of disappearing." It's sad to think of these footprints of history vanishing, and taking along with them pieces of human culture that somehow had persisted for centuries. I understand why this happens; in modern life, speaking and writing the dominant language is not only useful, it's often essential for getting a job and making a living. These little pockets of other languages survived better when people had little mobility and even less connectedness to others living far away. In today's world, they seem doomed.

Change is the fate of all things, but it inevitably comes with a sense of loss. The linguistic diversity of the beautiful region of Calabria will, very likely, soon be gone. Like biodiversity loss, this diminishes the richness of our world. I hope that linguists are working to catalog and study these unique languages -- before the last native speakers are gone forever.

****************************************

Tuesday, July 8, 2025

Linguistic Calvinball

I've written here before about the monumental difficulty of translating written text when you (1) don't know what the character-to-sound correspondence is (including whether the script is alphabetic, syllabic, or ideographic), (2) don't know what language the script represents, and (3) don't know whether it's read left-to-right, right-to-left, or alternating every other line (boustrophedonic script). This was what Arthur Evans, Alice Kober, and Michael Ventris were up against with the Linear B script of Crete. That they succeeded is a testimony not only to their skill as linguists and to their sheer dogged persistence, but to the fact that they had absolutely astonishing pattern-recognition ability. Despite my MA in linguistics and decent background in a handful of languages, I can't imagine taking on such a task, much less succeeding at it.

The problem becomes even thornier when you consider that what appears to be a script might be asemic -- something that looks like a real written language but is actually meaningless. (Just a couple of months ago, I wrote here about an asemic text called A Book From the Sky that the creator himself said was nonsense, but that hasn't stopped people from trying to translate it anyhow.)

Which brings us to the Rohonc Codex.

The first certain mention of the Rohonc Codex is in the nineteenth century, although a 1743 catalog of the Rohonc (now the city of Rechnitz, Austria) Library might refer to it -- it says, "Magyar imádságok, volumen I in 12" ("Hungarian prayers in one volume, size duodecimo").

As you'll see, whether the text represents prayers, or is even in Hungarian, is very much an open question. The size matches; duodecimo means "twelve sheets, approximately 127 millimeters by 187 millimeters in size," and given that some of the earliest guesses about the book's contents were that it was a prayerbook in archaic Hungarian, it's possible that the catalog entry refers to the Codex. The paper it's written on appears to be sixteenth-century Venetian in origin, but of course this doesn't mean that's when the book was written -- only that it's unlikely to be any older than that.

One page of the Rohonc Codex [Image is in the Public Domain]

The drawings are rather crude, and the lettering doesn't resemble any known script, although various linguists have compared it to Hungarian runes, Dacian, a dialect of early Romanian, and some variant of Hindi. Others think it's simply a forgery -- asemic, in other words -- with a sizable number attributing it to the antiquarian Sámuel Nemes, who was known to have forged other documents.

There's no sure connection between Nemes and the Rohonc Codex, however. He's not known ever to have handled the document, and certainly never mentioned it. So this seems as tentative as all the other explanations.

Attempts to use the statistical distribution of clusters of symbols, invoking such patterns as Zipf's Law -- the tendency across languages for a word's frequency to be inversely proportional to its frequency rank -- have also failed.
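If you're curious what a Zipf's Law check looks like in practice, here's a minimal sketch in Python. To be clear, this is not the method anyone actually applied to the Codex; it's a generic illustration, the function names are my own, and it assumes the text has already been split into word-like tokens (which, for an undeciphered script, is itself a big assumption). It fits a straight line to log(frequency) against log(rank); for text that follows Zipf's Law, the slope comes out near -1.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) versus log(rank).
    Needs at least two distinct tokens. A slope near -1 is what
    Zipf's Law predicts; a slope near 0 means a flat distribution."""
    pairs = rank_frequency(tokens)
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up sample in which the word at rank r occurs exactly 120 / r times
sample = []
for rank, word in enumerate(["a", "b", "c", "d", "e"], start=1):
    sample.extend([word] * (120 // rank))
```

The catch is that a Zipf-like slope is only a weak hint. Some kinds of gibberish produce one too, so passing the test doesn't prove a text is language.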

Like with A Book From the Sky, this hasn't stopped hopeful scholars from claiming success. Some of the claimed solutions have been eye-rollingly bad, like the one proposed in 1996 by one Attila Nyíri of Hungary. Nyíri combined some Sumerian symbols with chance resemblances to the Latin alphabet, and used such expedients as rearranging letters and letting the same symbol correspond to more than one sound, and still came up with gibberish like, Eljött az Istened. Száll az Úr. Ó. Vannak a szent angyalok. Azok. Ó. ("Your God has come. The Lord flies. Oh. There are the holy angels. Them. Oh.")

I'm perhaps to be excused for being reminded of the Dick and Jane readers. "Oh, Jane, see Spot. See Spot run. Oh, Spot, don't roll in that dead squirrel. Oh."

Another attempt, this one only marginally more plausible, was made by Romanian linguist Viorica Enăchiuc, who hypothesized that the document (1) is read right-to-left and bottom-to-top, and (2) was written in a Dacian dialect of Latin. This one came up with lines like Solrgco zicjra naprzi olto co sesvil cas ("O Sun of the live let write what span the time"), which still isn't exactly what I'd call lucid writing.

Then there's the Indian linguist Mahesh Kumar Singh, who said the Codex is written left-to-right and top-to-bottom in Hindi, using an obscure variant of the ancient Brahmi script. Singh translated one passage as, He bhagwan log bahoot garib yahan bimar aur bhookhe hai / inko itni sakti aur himmat do taki ye apne karmo ko pura kar sake ("Oh, my God! Here the people is very poor, ill and starving, therefore give them sufficient potency and power that they may satisfy their needs.") His "translation," though, was immediately excoriated by other linguists, who said that he was playing fast-and-loose with the script interpretation, and had come up with symbol-to-sound correspondences that were convenient to how he wanted the translation to come out, not what was supported in other texts.

So the whole enterprise has turned into the linguistic version of Calvinball (from Bill Watterson's brilliant Calvin and Hobbes). If you make up the rules as you go, and never play by the same rules twice, anything can happen.

The upshot of it all is that the Rohonc Codex is still undeciphered, if there's even anything there to decipher. Like the more famous Voynich Manuscript, it retains its aura of attractive mystery, because most of us can't resist a puzzle, even if a lot of the best linguists think the script is nonsense. Because how do you prove decisively that something isn't sensible language?

After all, there are still people who think that Donald Trump's speeches make sense, even when he says shit like, "I saw engines about three, four years ago. These things were coming—cylinders, no wings, no nothing—and they’re coming down very slowly, landing on a raft in the middle of the ocean someplace, with a circle, boom! Reminded me of the Biden circles that he used to have, right? He’d have eight circles, and he couldn’t fill ’em up. But then I heard he beat us with the popular vote. He couldn’t fill up the eight circles. I always loved those circles, they were so beautiful, so beautiful to look at."

So maybe "Oh. There are the holy angels. Them. Oh," isn't so bad.

In any case, I'm sure there'll be further attempts to solve it. Which falls into the "no harm if it amuses you" department. And who knows? Maybe there's a team made up of this century's Evans/Kober/Ventris triumvirate who will actually succeed.

All I know is that attempting it is way above my pay grade.

****************************************