Not that I like having my worldview called into question, mind you; but I have to admit there's a certain thrill in discovering that there are subtleties I had never considered. Take, for example, Benford's Law, which I first heard about a while back while listening to the radio program Freakonomics. In any reasonably unrestricted data set, what should be the relative frequencies of the first digit? Put another way, if I were to take a set of numbers (like the populations of all of the incorporated villages, towns, and cities in the United States) and look only at the first digits, how many of them would be 1s, 2s, 3s, and so on?
At first glance, I saw no reason the distribution should be anything but equal. That's what a set of random numbers means, right? And how are the populations of municipalities ranging from ten people all the way up to several million anything other than a collection of random numbers?
Well, you've probably already guessed this isn't right. Lining up the frequencies of 1s through 9s in order, you get a smoothly decreasing, logarithmic distribution: about 30% of the first digits are 1s, all the way down to only about 5% being 9s.
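You can see the pattern for yourself with a few lines of Python. This is just a sketch, and my choice of the powers of 2 as a test set is my own (any data spanning several orders of magnitude behaves similarly); the expected frequency of leading digit d under Benford's Law is log10(1 + 1/d):

```python
# Tally the first digits of 2^0 through 2^999 and compare each
# observed frequency to the Benford prediction log10(1 + 1/d).
import math
from collections import Counter

counts = Counter(str(2**n)[0] for n in range(1000))
for d in range(1, 10):
    observed = counts[str(d)] / 1000
    expected = math.log10(1 + 1 / d)
    print(d, round(observed, 3), round(expected, 3))
```

Run it and the leading digit 1 shows up roughly 30% of the time, with 9 trailing far behind, just as the law predicts.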
Why is this? Well, the simple answer is that the statisticians are still arguing about it. But it does give a way to catch when a supposedly real data set has been altered or fudged; the real data set will conform to Benford's Law, and (very likely) the altered one won't.
Another interesting one, and in fact the reason why I was thinking about this topic, is Zipf's Law, named after American linguist George Kingsley Zipf, who first attempted a mathematical explanation of why it works. Zipf's Law looks at the frequencies of different words in long passages of text, and finds that there's an inverse relationship, similar to what we saw with Benford's Law. In English, the most commonly used word is "the." The next most common ("of") has half that frequency. The third ("and") has one-third the frequency. And on down the line; the tenth most frequent word occurs at one-tenth the frequency of the most common one, and so forth.
Zipf's Law has been tested in dozens of different languages, including conlangs like Esperanto, and it always holds. So does the related pattern called the Brevity Law (there's an inverse relationship between the length of a word and how commonly it's used), and -- to me the most fascinating -- the Law of Hapax Legomenon, which states that in long passages of text, about half of the words will only occur once (the name comes from the Greek ἅπαξ λεγόμενον, meaning "being said once").
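All three word-frequency patterns can be measured with the same simple machinery: count words, rank them, and see how frequency falls off with rank and what fraction of words are hapax legomena. A minimal sketch, with a toy sentence of my own as a stand-in; genuinely testing these laws requires book-length text, where a short sample only hints at the pattern:

```python
# Rank words by frequency (Zipf) and compute the share of words
# occurring exactly once (hapax legomena) in a passage of text.
import re
from collections import Counter

def zipf_table(text, top=5):
    words = re.findall(r"[a-z']+", text.lower())
    freqs = Counter(words)
    ranked = freqs.most_common()
    hapax = sum(1 for _, count in ranked if count == 1)
    return ranked[:top], hapax / len(freqs)

sample_text = "the quick brown fox jumps over the lazy dog and the dog sleeps"
top, hapax_share = zipf_table(sample_text)
print(top)         # most frequent words with their counts
print(hapax_share) # fraction of distinct words used only once
```

Fed a whole novel instead of one sentence, the ranked counts should fall off roughly as 1/rank, and the hapax share should land near one-half.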
Where things get really interesting is that these three laws -- Zipf's Law, the Brevity Law, and the Law of Hapax Legomenon -- may have relevance to the search for extraterrestrial intelligence. Say we pick up what seems like radio-wave-encoded language from another star system. The difficulty is obvious: translating a passage from another language when we don't know the sound-to-meaning correspondence is mind-bogglingly difficult (although it has been accomplished, most famously in Alice Kober's and Michael Ventris's decipherment of the Linear B script of Crete).
The task seems even more hopeless for an alien language, which shares no genetic roots with any human language; thus the most useful tool we have -- noting similarities with known related languages -- is a non-starter. Just like Dr. Ellie Arroway in Contact, we'd be faced first with the seemingly insurmountable problem of figuring out whether it is an actual alien language, and not just noise or gibberish.
The three laws I mentioned may solve at least that much of the problem. The fact that they've been shown to govern the frequency distribution of every language tested, including completely unrelated ones like Japanese and Swahili, suggests that they might represent a universal tendency. Just as Benford's Law can help statisticians identify falsified data sets, the three laws of word frequency distribution might help us tell if what we've picked up is truly language.
It still leaves the linguists with the daunting task of figuring out what it all means, but at least they won't be working fruitlessly on something that turns out to be mere noise.
I find the whole thing fascinating, not only from the alien angle (which you'd probably predict I'd love) but because it once again demonstrates that our intuition about things can lead us astray. Who would have guessed, for example, that half of the words in a long passage of text would occur only once? I love the way science, and scientific analysis, can correct our fallible "common sense" about how things work.
And, as with Zipf, Brevity, and Hapax Legomenon, open up doors to understanding things we never dreamed of.
Ever get frustrated by scientists making statements like "It's not possible to emulate a human mind inside a computer," "faster-than-light travel is fundamentally impossible," or "time travel into the past will never be achieved"?
Take a look at physicist Chiara Marletto's The Science of Can and Can't: A Physicist's Journey Through the Land of Counterfactuals. In this ambitious, far-reaching new book, Marletto looks at the phrase "this isn't possible" as a challenge -- and perhaps, a way of opening up new realms of scientific endeavor.
Each chapter looks at a different open problem in physics, and considers what we currently know about it -- and, more importantly, what we don't know. With each one, she looks into the future, speculating about how each might be resolved, and what those resolutions would imply for human knowledge.
It's a challenging, fascinating, often mind-boggling book, well worth a read for anyone interested in the edges of scientific knowledge. Find out why eminent physicist Lee Smolin calls it "Hugely ambitious... essential reading for anyone concerned with the future of physics."
[Note: if you purchase this book using the image/link below, part of the proceeds goes to support Skeptophilia!]