In October 2016, a research paper emerged from the Google research lab with the astonishing title, Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. The Google researchers seemed to be suggesting that their system could match the abilities of human translators. Indeed, they reported that their system produced translations that were nearly indistinguishable from human translations. Not surprisingly, the claim was met with considerable skepticism. After all, while MT had been progressing steadily in recent years, it was universally acknowledged that MT output was far from human-level quality. Indeed, many would argue that MT had some fundamental limitations that made it unlikely to approach human-level quality any time in the foreseeable future.

The Google research paper was not mere hype. The paper describes a revolutionary new approach to machine translation: neural machine translation. The researchers performed detailed, systematic comparisons with the current state-of-the-art systems, which are based on SMT, statistical machine translation. Not only did the system clearly outperform the best current SMT systems, in many cases it produced translations that seemed just as good as those produced by human translators.

Neural MT is completely different from statistical MT. It relies on neural networks, which have been revolutionizing the entire field of natural language processing. As the name suggests, neural networks are inspired by the structure and operation of the human brain, which is a vast network of interconnected computing elements called neurons. Here's a simplified depiction of a neural network, with input nodes connected to output nodes via a hidden layer.

In statistical machine translation, each word is treated as a kind of atomic object. An SMT system for French and English has entries in its translation table, like red goes to rouge, house goes to maison, etc.
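A translation table of this kind can be pictured as a plain lookup. Here is a minimal sketch; the entries and the `translate_word` helper are invented for illustration, not taken from any real SMT system:

```python
# Toy SMT-style translation table: each word is an atomic symbol,
# and translation is a blind lookup with no notion of meaning.
translation_table = {
    "red": "rouge",
    "house": "maison",
    "girl": "fille",
}

def translate_word(english_word):
    # Return the French entry, or the word unchanged if it is unknown.
    return translation_table.get(english_word, english_word)

print(translate_word("red"))    # -> rouge
print(translate_word("house"))  # -> maison
```

The point is that nothing in the table connects red to other colors or to red things; the entries could be arbitrary symbols.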
But the words don't mean anything to the system. The rules might as well look like this: EnglishWord195 goes to FrenchWord783, EnglishWord453 goes to FrenchWord361, and so on. In fact, SMT systems convert words to numbers like this, because they can be stored more efficiently. But this is hopelessly primitive compared to human understanding of words. When we encounter a word that we know, something much more interesting happens. Think about the word red. Of course, we notice that it's a word we know, but the experience is so much richer. We're reminded of things that are red. Other colors might come to mind, perhaps especially colors that are similar to red. The word brings with it an entire cloud of meaning.

You might think that this is just the way it is: computers can learn specific rules about words, but the words don't mean anything to a computer. But now, with the advent of neural networks, that has changed completely. To understand how this is done, it helps to imagine how humans come to understand these clouds of meaning. Here's one idea of how that might happen. A child learning a language hears the word red for the first time in a sentence like, the rose is red. At this point the child has very little sense of what red means. Later, the child is told to pick up a red block, hears a description of red and green blocks, and so on. As the child learns the language, she hears the word red in hundreds or thousands of contexts. Think of what is contained in those contexts. There will be massive variation, of course, much of it random and uninteresting. But it might well be that roses are mentioned multiple times, and there are perhaps quite a few mentions of apples. In general, the contexts will contain a lot of red things and probably also lots of mentions of other colors. When the child hears the word red, she thinks of all the other words that have appeared in those contexts.
Especially those words that appear frequently, like red things and perhaps color words. This is what happens with all words. As we learn a word, we develop a very rich network of connections between that word and the other words that we know. For a computer to actually understand a word, it also needs this meaning cloud. With neural networks, something like this is now possible. In recent years, researchers have developed techniques to produce something called word embeddings, with a process reminiscent of the child's learning process described above.

To produce these embeddings, a neural network is created and given the following task: for a given word, predict how likely other words are to appear near it. The network is exposed to a massive amount of text data to train it in this task. The text is presented to the network in the following way: for each word, the network sees pairs consisting of that word together with another word that appeared nearby. After a while, the network will learn to predict fairly accurately how likely a given pair of words is. The training process of a neural network consists in adjusting the weights of the connections inside the network. In this case, there might be 300 weights for each word in the language, and these weights get tuned for each word to help the network better predict nearby words. The point of this is not the prediction itself; rather, the whole point is to obtain the tuned weights for each word, what researchers call word embeddings. These represent the clouds of meaning for each word. Many researchers believe they represent major progress towards real understanding of words.

Here's what an embedding like this looks like for, say, house. It is simply a list of 300 numbers between zero and one. In the interests of space, we don't show the entire list. How can we possibly tell if this is a good representation of house?
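The nearby-word setup just described can be sketched in miniature. In this toy version, a word's "embedding" is simply its counts of nearby words in a tiny invented corpus, rather than 300 trained weights; the corpus and window size are made up for illustration:

```python
from collections import Counter

# Tiny corpus standing in for the massive text a real network sees.
corpus = [
    "the rose is red",
    "the apple is red",
    "the sky is blue",
]

window = 2  # how far away a word may be and still count as "nearby"

# For each word, count which other words appear within the window.
contexts = {}
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        ctx = contexts.setdefault(word, Counter())
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                ctx[words[j]] += 1

# The "cloud" around red now includes rose and apple, but not blue.
print(contexts["red"])
```

A real embedding replaces these raw counts with weights tuned by the prediction task, but the underlying intuition, that a word is characterized by the company it keeps, is the same.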
It's pretty hard to tell anything from directly inspecting this list of 300 decimal numbers, but there are lots of interesting things to do with these lists, or vectors, as they're usually called. For example, one can compare the vectors for different words and see how similar they are. Here we see a list of the words whose vectors are most similar to the vector for house: houses, bungalow, apartment, bedroom, townhouse. This seems to be a pretty good attempt to produce the words most similar to house, just based on the similarity of the vectors. Here are the words with the most similar vectors to woman: girl, teenage_girl, teenager, lady.

Remember, these word vectors are just lists of numbers. Like individual numbers, vectors of numbers can be added and subtracted. This gives some interesting possibilities. Consider the meaning of the word king. It combines the elements ruler and male, plus something extra about how the position is usually inherited. What if we wanted to figure out the female counterpart of king? Well, if you think of king as ruler plus male, then king minus male would give a ruler, and king minus male plus female should give a ruler plus female, which is the definition of queen. Amazingly, this is exactly what you can do with word vectors. We can produce the following vector, call it q, which is king minus male plus female. When we list the most similar vectors to q, the most similar one is queen.

We've seen that these word vectors support analogical reasoning: man is to king as woman is to queen. Actually, lots of analogies work this way with the word vectors. Paris minus France plus Poland equals Warsaw, sushi minus Japan plus Italy equals pasta, and so on. As we've seen, word embeddings are remarkable in how they capture abstract properties of words, and they're now commonly used in conjunction with neural networks for many language tasks, including translation. There's another aspect of neural networks that contrasts sharply with SMT systems.
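The vector arithmetic above can be tried out on a tiny scale. The three-dimensional vectors below are hand-made for illustration, with dimensions loosely meaning (royalty, maleness, femaleness), and man and woman stand in for male and female; real embeddings have hundreds of learned dimensions, but the arithmetic and the similarity measure are the same:

```python
import math

# Hand-made toy vectors along dimensions (royalty, maleness, femaleness).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# q = king - man + woman
q = [k - m + w for k, m, w in zip(vectors["king"],
                                  vectors["man"],
                                  vectors["woman"])]

# The word whose vector is most similar to q.
best = max(vectors, key=lambda word: cosine(vectors[word], q))
print(best)  # -> queen
```

With these toy vectors, q works out to exactly queen's vector; with real embeddings queen is merely the nearest neighbor, which is what makes the result remarkable.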
Remember that SMT systems are restricted to N-grams, short word sequences of a specific maximum length. We saw that Chomsky had argued that unbounded dependencies are a fundamental part of human language. If you restrict consideration to sequences of a given maximum length, a typical limitation might be five words, then there will be dependencies that the system simply has no chance of understanding. Neural networks need not be limited this way, so they have the ability to learn the unbounded dependencies that Chomsky was referring to. These unbounded dependencies occur in all ordinary sentences. As we have seen, this is a frequent problem in English-to-German translation, where German verbs can move over a large number of words compared to their starting point in an English sentence. Interestingly, the Google paper considered exactly this language pair, English to German, and found striking improvements in translation quality compared to the best SMT systems.

In the past few years, neural networks have generated a great deal of excitement, particularly in computational linguistics. They've shown the ability to learn more effectively than other machine learning methods for machine translation and many other language tasks. However, it's not so easy to tell exactly what these systems are learning. The case of zero-shot translation illustrates this very clearly. Remember that SMT systems are trained with large amounts of parallel data, that is, texts from the source language that have been translated into the target language. To train an SMT system to translate from English to French, it is necessary to have large amounts of English texts that have been translated into French. Usually many millions of words are required for good quality translation. This is typically the case for neural translation as well: parallel data is required for each language pair. But this is where zero-shot translation comes in.
Amazingly, Google researchers have found that they can build a neural machine translation system that can learn to translate a given language pair without having seen any parallel data for that pair. In one experiment, the researchers built a multilingual NMT system, and they gave it parallel data for English to Spanish and English to Portuguese. Then, they showed that the system could generate reasonably good quality Portuguese-to-Spanish translations without ever having seen Portuguese-to-Spanish data during training. How can this be? What could it be that the neural network is learning? It's generally difficult to tell what is being learned by a neural network, since they can involve very large numbers of parameters that are tuned during the training process. It's very hard to say what the network is learning that enables it to translate from Spanish to Portuguese without being presented with a single example of translated text from Spanish to Portuguese.

The Google researchers have an interesting speculation. The network, they suggest, is beginning to learn an interlingua. Interlingua is a somewhat mysterious notion with deep historical roots. It describes a universal language that could be thought to lie between all the languages to be translated. Enlightenment philosophers were fascinated by the idea of a universal language that would be the foundation for rational thought. It also has a more practical motivation for machine translation. In an important sense, interlingua would make the problem much simpler. Think about a system to translate between four different languages, say English, Spanish, French, and Portuguese. For each of these four languages, there would be three systems; for example, English would be paired with Spanish, French, and Portuguese. So that would be four times three, or 12 systems in all. But things are different if translation is always from, or to, interlingua.
Then, we need four systems to translate each language to interlingua, and four systems to translate each language back from interlingua. In general, for N languages, we would need N times N minus one systems, whereas with interlingua we only need two times N systems. For four languages, interlingua means we only need eight instead of 12 systems. Not such a big deal, but what if we have 100 or more languages, like Google does? On the standard approach, we would need 100 times 99, or 9,900 systems, whereas with interlingua we only need two times 100, or 200 systems. That starts to look like a big difference.

But what is interlingua? The notion has an intuitive appeal. When you translate a text, you first have to understand the text. Let's say you're supposed to translate the sentence "The girl kicked the ball" into French. Before you start thinking about French words and grammar, you form a mental image of the described situation, with a girl and a ball. There's clearly some thought process intervening between the source sentence and the construction of the target sentence, and you would probably form the same mental image from that sentence even if you were going to translate it into Spanish. That is one way to think about interlingua. It's a mental language that we use to understand the sentences that we are to translate. You could think of it as the brain's native language. In fact, something like interlingua must exist. When we translate "The girl kicked the ball" to French, we first have a mental representation of the sentence that will then provide the basis of the French translation. But what would that representation look like? What is it, actually? One possible answer is that the network is starting to develop general representations of linguistic meaning: the kinds of logical representations that Montague described nearly 50 years ago. We don't really know if that's what's starting to happen, but there are hints.
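As an aside, the system-counting arithmetic from the interlingua discussion above is easy to verify. The two helper functions here are pure bookkeeping, not part of any real translation system:

```python
def direct_systems(n):
    # One system per ordered language pair (source, target).
    return n * (n - 1)

def interlingua_systems(n):
    # One system into interlingua and one out of it, per language.
    return 2 * n

for n in (4, 100):
    print(n, direct_systems(n), interlingua_systems(n))
# 4 languages: 12 vs 8; 100 languages: 9,900 vs 200
```

The direct count grows quadratically while the interlingua count grows linearly, which is why the gap only becomes dramatic at Google's scale.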
The first point is simply that zero-shot translation works, at least to some extent. That shows that when the system is presented with training data involving translation from Spanish to English, it's learning something about Spanish which it is then able to apply when translating Spanish to Portuguese. What it's learning could be some aspects of the meaning of the Spanish sentences. The researchers tried to explore this further. They came up with a way to visualize the internal state of the network as it was translating a sentence. They placed these visualizations in a two-dimensional space and found that sentences in different languages tended to be close to each other if their meanings were the same. This is exactly what you would expect if the network is mapping sentences from different languages into the same hidden meaning representation.
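The kind of comparison behind that visualization can be sketched with toy data. The "internal state" vectors below are entirely invented to show what "close to each other" means; real encoder states are high-dimensional and learned, and the actual Google result came from projecting them into two dimensions:

```python
import math

# Invented toy "internal states": two meanings, each in three languages.
states = {
    ("en", "the girl kicked the ball"): [0.9, 0.1, 0.2],
    ("es", "la nina pateo la pelota"):  [0.8, 0.2, 0.1],
    ("pt", "a menina chutou a bola"):   [0.9, 0.2, 0.2],
    ("en", "the house is red"):         [0.1, 0.9, 0.8],
    ("es", "la casa es roja"):          [0.2, 0.8, 0.9],
    ("pt", "a casa e vermelha"):        [0.1, 0.8, 0.8],
}

def dist(u, v):
    # Euclidean distance between two state vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Same meaning across languages: small distance.
same = dist(states[("en", "the girl kicked the ball")],
            states[("es", "la nina pateo la pelota")])
# Different meanings, even in the same language pair: larger distance.
different = dist(states[("en", "the girl kicked the ball")],
                 states[("es", "la casa es roja")])
print(same < different)  # -> True
```

If the network really maps different languages into a shared meaning representation, this is the pattern the researchers' two-dimensional plots should show, and did show.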