Skeptophilia (skep-to-fil-i-a) (n.) - the love of logical thought, skepticism, and thinking critically. Being an exploration of the applications of skeptical thinking to the world at large, with periodic excursions into linguistics, music, politics, cryptozoology, and why people keep seeing the face of Jesus on grilled cheese sandwiches.

Tuesday, July 23, 2019

Cracking the code

Being a linguistics geek, I've written before on some of the greatest "mystery languages" -- including Linear B (a Cretan script finally deciphered by Alice Kober and Michael Ventris), the still-undeciphered Linear A, and even some recent inventions like the scripts in the Voynich Manuscript and the Codex Seraphinianus (neither of which at present has been shown to represent an actual language -- they may just be strings of random symbols).

The obvious difficulty in translating a script when you do not know what language it represents starts (but doesn't come close to ending) with the problem that there are three rough categories into which written languages fall -- alphabetic (where each symbol represents a sound, as in English), syllabic (where each symbol represents a syllable, as in the Japanese hiragana), and logographic (where each symbol represents a word or idea, as in Chinese).  Even once you know that, deciphering the language is a daunting task.  Some languages (such as English) are usually SVO (subject-verb-object); others (such as Japanese) are SOV (subject-object-verb); a few (such as Gaelic) are VSO (verb-subject-object).  Imagine starting from zero -- knowing nothing about sound-to-character correspondence, nothing about what language is represented, nothing about the preferred word order.

Oh, and then there's the question of whether the language is inflected (words change form depending on how they're used in a sentence, as in Latin, Greek, and Finnish), agglutinative (new words are created by stringing together morphemes, as in Turkish, Tagalog, and the Bantu languages), or isolating (words are largely invariant, and how they're used in the sentence is shown by word order and small grammatical markers, as in Chinese and Yoruba).

Suffice it to say the whole task is about as close to impossible as you can get, making Kober and Ventris's success that much more astonishing.

A sample of the Linear B script [Image: Sharon Mollerus, NAMA Linear B tablet of Pylos, licensed under the Creative Commons CC BY 2.0]

So that's why I was so fascinated by a link sent to me by my buddy Andrew Butters (fellow author and blogger at Potato Chip Math), which describes new AI software developed at MIT that is tackling -- and solving -- some of these linguistic conundrums.

There's just one hitch: you have to know, or at least guess at, a related language, the theory being that symbols and spellings change more slowly than pronunciation and meaning (which is one reason why English has such bizarre spelling -- consider the sounds made by the "gh" letter combination in ghost, rough, lough, hiccough, and through).  So the AI wouldn't work so well on invented scripts like the ones in the Voynich Manuscript and the Codex Seraphinianus.

But otherwise, it's impressive.  Developed by Jiaming Luo and Regina Barzilay from MIT and Yuan Cao from Google's AI lab, the software was trained on sound-letter correspondences in known languages, and then allowed to tackle Linear B.  It looked for patterns such as the ones Kober and Ventris found by brute force -- the commonness of various symbols, their positions in words, their likelihood of occurring adjacent to other symbols -- and then compared that to ancient Greek.
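
Just to give a flavor of what that distributional bookkeeping looks like, here's a toy sketch in Python (purely illustrative -- this is not the researchers' code, and the "words" below are made up): it counts how often each symbol appears, where each symbol tends to sit within a word, and which pairs of symbols occur side by side.

from collections import Counter, defaultdict

# Hypothetical words in an undeciphered script, with each symbol written as a letter.
corpus = ["kono", "konoso", "tiri", "tirijo", "pame", "pamede"]

symbol_freq = Counter()               # how common each symbol is overall
position_freq = defaultdict(Counter)  # symbol -> counts of initial / medial / final position
bigram_freq = Counter()               # how often each symbol follows another

for word in corpus:
    for i, sym in enumerate(word):
        symbol_freq[sym] += 1
        if i == 0:
            position_freq[sym]["initial"] += 1
        elif i == len(word) - 1:
            position_freq[sym]["final"] += 1
        else:
            position_freq[sym]["medial"] += 1
        if i > 0:
            bigram_freq[(word[i - 1], sym)] += 1

print(symbol_freq.most_common(3))    # the most frequent symbols
print(dict(position_freq["o"]))      # where "o" tends to appear in a word
print(bigram_freq.most_common(3))    # the most frequent adjacent pairs

Statistics roughly like these are what Kober tabulated by hand; the software's version just operates at scale and then searches for a consistent mapping onto a known language such as Greek.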

The AI got the right answer 67% of the time.  Which is amazing for a first pass.

A press release from MIT describes the software's technique in more detail:
[T]he process begins by mapping out these relations for a specific language. This requires huge databases of text. A machine then searches this text to see how often each word appears next to every other word. This pattern of appearances is a unique signature that defines the word in a multidimensional parameter space. Indeed, the word can be thought of as a vector within this space. And this vector acts as a powerful constraint on how the word can appear in any translation the machine comes up with. 
These vectors obey some simple mathematical rules. For example: king – man + woman = queen. And a sentence can be thought of as a set of vectors that follow one after the other to form a kind of trajectory through this space. 
The key insight enabling machine translation is that words in different languages occupy the same points in their respective parameter spaces. That makes it possible to map an entire language onto another language with a one-to-one correspondence.
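To see what that vector arithmetic looks like in practice, here's a minimal sketch with hand-made two-dimensional vectors (the numbers are invented for illustration, and this is not the MIT team's code -- a real system learns hundreds of dimensions from co-occurrence counts over huge amounts of text):

import numpy as np

# Toy, hand-made 2-D "embeddings" (dimension 0 ~ royalty, dimension 1 ~ gender).
embeddings = {
    "king":     np.array([0.9,  0.9]),
    "queen":    np.array([0.9, -0.9]),
    "prince":   np.array([0.7,  0.9]),
    "princess": np.array([0.7, -0.9]),
    "man":      np.array([0.1,  0.9]),
    "woman":    np.array([0.1, -0.9]),
    "apple":    np.array([-0.5, 0.0]),
}

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

best = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, embeddings[w]),
)
print(best)   # queen

The cross-lingual step the excerpt describes then amounts to finding a transformation that lines one language's cloud of word-vectors up with another's, so that words with the same meaning end up near each other.
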
All of which is pretty damn cool.  What they're planning on tackling next, I don't know.  After all, there are a great many undeciphered (or poorly understood) scripts out there, so I suspect there are a lot to choose from.  In any case, it's an exciting step toward solving some long-standing linguistic mysteries -- and being able to hear the voices of people who have been silent for centuries.

************************************

The subject of Monday's blog post gave me the idea that this week's Skeptophilia book recommendation should be a classic -- Konrad Lorenz's Man Meets Dog.  This book, written back in 1949, is an analysis of the history and biology of the human/canine relationship, and is a must-read for anyone who owns, or has ever owned, a doggy companion.

Given that it's seventy years old, some of the factual information in Man Meets Dog has been superseded by new research -- especially about the genetic relationships between various dog breeds, and between domestic dogs and other canid species in the wild.  But his behavioral analysis is impeccable, and the book is written in his typical lucid, humorous style, with plenty of anecdotes that other dog lovers will no doubt relate to.  It's a delightful read!

[Note: if you purchase this book using the image/link below, part of the proceeds goes to support Skeptophilia!]





