Deep learning, a technique derived from neural networks (an early AI technology), is poised to revolutionize machine translation. Google recently open sourced its word2vec package, which analyzes English language texts to discover the meanings and relationships of words. The results are pretty impressive and point toward a significant advance in machine translation technology in the next few years.
Current machine translation systems rely on statistical techniques to train systems. Statistical machine translation works by feeding a large corpus of aligned texts, one in each language, to train the system. The system does not understand either language in any meaningful sense, but simply learns that Hello is highly correlated with Hola in Spanish. Given enough text, it can produce decent quality output that is suitable for comprehension, but not useful for publication without post-editing.
Deep learning has the potential to change all of this by enabling the construction of systems that autonomously learn how words within each language relate to each other, their synonyms, and other linguistic structures. This will enable machine translation systems to understand the material they are translating.
For example, the word2vec package enables you to find the “distance” to the closest words or phrases from an input phrase. Type in “France”, and it will display:
spain 0.678515 belgium 0.665923 netherlands 0.652428 italy 0.633130 switzerland 0.622323 luxembourg 0.610033 portugal 0.577154 russia 0.571507 germany 0.563291 catalonia 0.534176
Google is investing heavily in deep learning technology, as it outperforms traditional machine learning systems by a wide margin, and as machine learning is closely related to search. The better their system understands the queries users are making, the better the search results will be. (I suspect they have already implemented this in some of their search tools, as I’ve noticed a significant improvement in related links in search results in the past year).
Machine translation vendors would be wise to take a close look at word2vec and related projects, as they have the potential to render current techniques obsolete. (Its a good bet that Google’s translation team is already working on this problem). Otherwise, they stand a good chance of being left in the dust by new systems and companies.