For decades, linguists have racked their brains over the question of precisely how the syntax of different languages varies. PhD candidate Martin Kroon has developed a computer system that brings us closer to finding an answer. His PhD defence is on 10 November.

Knowing about the similarities and differences between languages will bring us one step closer to understanding how our brains work. After all, discovering a structure that is shared by different languages could tell us a great deal about how the brain handles language. Until now, however, it has proved difficult to identify all the ways in which languages are the same or different. ‘This is all done manually, but there are an awful lot of languages and basically an infinite number of sentences you can generate in them,’ Kroon explains. This means that there’s a risk of bias. ‘You have to select in advance what you’re going to compare, which can cause you to overlook things or conversely to confirm things that don’t occur very often at all.’

Compressing language

Kroon therefore decided to take a different approach: a computer system that makes it possible to compare different languages on a larger scale. ‘I mainly used transcripts of EU meetings, because they’re translated into all the European Union languages,’ he says, and then explains how he applied two methods to the data. ‘First, I was impressed by the Minimum Description Length (MDL) principle. This is actually a matter of compression, the same as you do on your computer: how do you make big data as small as possible, so that they fit into a zip file? To do this, MDL searches for patterns that occur frequently but are not too long. In Dutch, for example, this could be “article+noun”. This pattern is easy to compress and you won’t find it in Czech, for example, because Czech doesn’t have articles.’
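To give a rough feel for the compression idea Kroon describes, here is a minimal Python sketch. It is not his system: the part-of-speech tag sequences are made-up toy data, the function names are hypothetical, and the score simply counts symbols saved rather than bits, which is a strong simplification of MDL. Overlapping occurrences are also counted naively.

```python
from collections import Counter

# Toy, hand-made part-of-speech sequences (assumption: not real EU transcript data).
DUTCH = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "NOUN"],
]
CZECH = [
    ["NOUN", "VERB", "NOUN"],
    ["NOUN", "VERB"],
    ["PRON", "VERB", "NOUN"],
]

def ngram_counts(sentences, n):
    """Count every run of n consecutive tags across all sentences."""
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def compression_gain(pattern, count):
    """Crude stand-in for description length, measured in symbols.

    Each occurrence shrinks from len(pattern) symbols to one new symbol,
    but the pattern must be stored once in a codebook together with that symbol.
    """
    saved = count * (len(pattern) - 1)
    codebook_cost = len(pattern) + 1
    return saved - codebook_cost

def best_patterns(sentences, max_len=3, top=3):
    """Rank patterns of length 2..max_len by gain, highest first."""
    scored = []
    for n in range(2, max_len + 1):
        for pattern, count in ngram_counts(sentences, n).items():
            scored.append((compression_gain(pattern, count), pattern))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]

if __name__ == "__main__":
    print(best_patterns(DUTCH))
    print(best_patterns(CZECH))
```

On this toy data, the frequent and short "DET NOUN" pattern is the only one in the Dutch sample that pays for its own codebook entry, while it never appears in the Czech sample at all. Longer patterns save more per occurrence but occur less often and cost more to store, which is the trade-off behind "frequent but not too long".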

He found that the system worked. Patterns in the transcripts emerged, indicating syntactic similarities and differences. At the same time, however, the computer would often find differences that on closer inspection had very little to do with syntax. ‘Some texts were translated manually, so you couldn’t compare them syntactically any more,’ says Kroon. ‘For instance, the original English “to the matter at hand” was translated into Dutch as “en nu het eigenlijke onderwerp” (= “and now the actual subject”). This means the same thing, but it’s completely different in terms of syntax and structure.’