Skeptophilia: Language machines

Tuesday, December 20, 2022

Language machines

If you've ever used Google Translate, you've probably noticed that it can be a little wonky.

Take, for example, the anecdote about the French guy who was wooing an American woman long-distance, and texted her, "Prends une photo coquine pour moi." ("Take a naughty picture for me.") The woman wasn't certain what that meant, so she popped it into Google Translate, and was told it meant, "Take a photo for me, slut."

I think my favorite, though, is some feedback that a company called Koyu Matcha Green Tea received via their website, from a customer in Finland. When they ran what the customer wrote through a Finnish-to-English Google Translate, it came out as the following:

If it resonated with cold to the bone? Matcha Latté is guaranteed fireman, green tea with hot steamed milk. Behold, thou hast already tasted.

Um... thanks? We think?

The difficulty is that languages are complex entities, full of idioms and peculiarities and exceptions, so trying to find a mechanistic, totally rule-based way to characterize them is somewhere beyond tricky. But because of the work of a Ph.D. student at the University of Cambridge, we have come one step closer to doing exactly that -- at least for Sanskrit.

About 2,500 years ago, a man named Dakṣiputra Pāṇini living in what is now northwestern Pakistan wrote a work called Aṣṭādhyāyī, which laid out a set of rules for the morphology -- the way stems, prefixes, suffixes, and so on combine -- of the Sanskrit language. An English example of linking together these fragments, called morphemes, is the word incomprehensibly -- made up of in- (prefix meaning "not"), comprehend (stem of the word, altered to replace /d/ with /s/), -ible (suffix meaning "capable of"), and -ly (adverbial marker), in that order.

Imagine trying to come up with a list of rules for all the ways morphemes can combine in English, such that the rules only produced well-formed words and not garbled messes like iblecomprehendlyin.

That's what Pāṇini tried to do for Sanskrit.
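To get a feel for the English version of the problem, here's a minimal Python sketch. It assumes a drastically simplified model in which every morpheme sits in one of four slots and the slots have to appear in order; the names SLOT_ORDER, MORPHEMES, is_well_formed, and spell are made up for illustration, and real English morphology is nowhere near this tidy.

```python
# Toy model only: the slot assignments are an illustrative assumption,
# not a real grammar of English.
SLOT_ORDER = ["prefix", "stem", "adjective_suffix", "adverb_suffix"]

MORPHEMES = {
    "in": "prefix",
    "comprehens": "stem",  # "comprehend" with /d/ already changed to /s/
    "ible": "adjective_suffix",
    "ly": "adverb_suffix",
}


def is_well_formed(parts):
    """True if there is exactly one stem and the slots strictly advance."""
    slots = [SLOT_ORDER.index(MORPHEMES[p]) for p in parts]
    has_one_stem = slots.count(SLOT_ORDER.index("stem")) == 1
    strictly_increasing = all(a < b for a, b in zip(slots, slots[1:]))
    return has_one_stem and strictly_increasing


def spell(parts):
    """Join morphemes, with one toy spelling rule: -ible + -ly becomes -ibly."""
    return "".join(parts).replace("iblely", "ibly")


good = ["in", "comprehens", "ible", "ly"]
garbled = ["ible", "comprehens", "ly", "in"]

print(is_well_formed(good), spell(good))
print(is_well_formed(garbled))
```

The first ordering passes the check and spells out "incomprehensibly"; the scrambled one is rejected. Four morphemes and one spelling tweak already took a dozen lines, which gives some idea of the scale of writing rules for an entire language.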

The problem is that Pāṇini's rules seemed sometimes to lead to self-contradictions. Given a particular combination of morphemes, there are often two or more rules that apply, so which should you use? Linguists analyzing the rule-set discovered that Pāṇini had written a "metarule" -- a rule determining how other rules should be applied -- which said that if two rules seem to conflict, the "later rule should take precedence." Everyone had interpreted this to mean that the one mentioned later in the book was the more important.

But that sometimes led to ungrammatical words. So something was off, but what?

Enter Cambridge student Rishi Rajpopat, who had been toiling over Pāṇini's rules for months. Then he had a brainstorm: what if the problem was that the metarule itself had been misinterpreted? He reread the metarule to mean that if two rules are in conflict, the one that applies to the latter part of the word (the suffix) takes precedence over the one that applies to the first part of the word (the stem).

With that one change in interpretation, Pāṇini's rule system works to combine morphemes and produces grammatically correct words almost one hundred percent of the time.

Which, of course, is a cause for much rejoicing amongst both linguists and people who are attempting to create high-quality translation software.
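The difference between the two readings of the metarule is easier to see side by side. Here's a rough Python sketch, under the assumption that each rule can be boiled down to its position in the book and the part of the word it acts on. The Rule class, its fields, and the two example rules are hypothetical stand-ins, not actual rules from the Aṣṭādhyāyī.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str        # hypothetical label, not a real rule number
    book_order: int  # position in the text (higher means listed later)
    target: str      # "stem" (earlier part of word) or "suffix" (later part)


def pick_traditional(a, b):
    """Traditional reading: the rule listed later in the book wins."""
    return a if a.book_order > b.book_order else b


def pick_reinterpreted(a, b):
    """Reading described above: the rule acting on the later part
    of the word (the suffix) wins."""
    if a.target == b.target:
        # This sketch only covers stem-versus-suffix conflicts.
        raise ValueError("both rules act on the same part of the word")
    return a if a.target == "suffix" else b


# Two made-up rules that conflict on the same word.
stem_rule = Rule("stem-change", book_order=200, target="stem")
suffix_rule = Rule("suffix-change", book_order=100, target="suffix")

print(pick_traditional(stem_rule, suffix_rule).name)
print(pick_reinterpreted(stem_rule, suffix_rule).name)
```

With these made-up inputs, the traditional reading picks the stem rule because it comes later in the book, while the reinterpreted reading picks the suffix rule because of where in the word it applies. Same rules, same word, different winner.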

I wonder, though, how any such attempt would fare for English. English is an amalgam of a Germanic root language, with heavy borrowing from French, Latin, and Spanish, and less-frequent (but still significant) borrowing from Old Norse, Italian, Greek, Dutch, Gaelic, and several Indigenous American languages. This has introduced spellings, pronunciations, and morphologies that defy easy characterization.

Even some of the simple rules you learned in elementary school can't be applied with anything like real consistency. "I before e except after c" -- unless your weird foreign neighbor Keith forfeits eight beige sleighs to a feisty caffeinated weightlifter.

You see the difficulty.

So as much as I'm impressed by Rajpopat's accomplishment, I don't think it's going to go very far toward fixing Google Translate's problem.

No matter. The delight of being told the tea is so good it's "guaranteed fireman" makes up for any potential awkwardness incurred because you accidentally called your girlfriend an unpleasant name while attempting to initiate sexytimes. You gotta take the good with the bad.
