It seems machine translation is not only a big trend in the translation industry, but it’s become something of a buzzword outside of the industry, too. Machine translation is not a new phenomenon; for decades, academic researchers have been looking into the possibility of using a machine to translate one language into another without human intervention.
The fact that machine translation has become freely available online has changed most people’s behaviour (at least online): you can now get the gist of an article or a website written in a language you do not understand with a few clicks.
Machine translation engines are now being used by professional translators as well. The latest development is the use of artificial intelligence to make the engines more accurate, which has led some to predict that machines will take over the translation tasks performed by humans.
We sat down with the machine translation (MT) specialist in STP’s technology team, Mattia Ruaro, to discuss MT in the industry and at STP. Mattia is a translator by training and has become a key part of STP’s technology team after starting out in a project management role.
In this first part, we’ll talk to Mattia about what machine translation is and how machine translation engines can be used – and trained.
So, Mattia, how does MT work?
Machine translation is the technology that allows an engine to translate from one natural language to another. Thus far, natural language has basically also meant written language. Machine translation has been around for decades, but there has been a lot of progress in the last 20 years.
There are several types of MT engines; the rule-based ones came first, then the statistical ones and after that the more recent neural machine translation. Every new type of MT has followed the same pattern: the technology has been developed, it’s been trialled and used with a lot of enthusiasm – and then people have discovered its limitations.
While the latest technology, neural MT, is surrounded by a lot of hype – some even predict it will replace human translators – it has limitations, too. This cycle seems to apply to all the different technologies: none of them is actually quite the miracle solution it is hyped up to be at the start.
What are the differences between statistical machine translation (SMT) and rule-based machine translation (RBMT)?
In essence, rule-based machine translation does what it says on the tin; the engine operates according to a set of rules, which are inputted by the developer. Nothing apart from the rules regulates the output from the engine.
The limitations of purely rule-based machine translation were discovered quickly. You need to input all the rules manually and sometimes a long list of exceptions, which is just not viable in a commercial environment, since it takes far too long.
The only exception to this is situations where your source language and your target language are closely related. This means that the languages are very close in terms of their lexicon and the semantics of that lexicon, as well as being structurally similar. Since you don’t need to input lots of different rules, you save a lot of effort.
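To make the rule-based idea concrete, here is a deliberately tiny sketch (an illustrative toy, not any real engine): for closely related languages such as Danish and Swedish, even a small hand-built lexicon with unknown words passed through unchanged can produce a usable draft. The lexicon entries below are assumptions chosen for the example.

```python
# Toy rule-based MT sketch (illustrative only, not a real system).
# Every "rule" here is a manual lexicon entry, which is exactly why
# purely rule-based MT scales so poorly: someone has to type them all in.

LEXICON = {  # Danish -> Swedish, hand-picked illustrative entries
    "jeg": "jag",
    "elsker": "älskar",
    "hunde": "hundar",
}

def translate_rule_based(sentence: str) -> str:
    """Word-for-word substitution; unknown words pass through unchanged."""
    return " ".join(LEXICON.get(word, word) for word in sentence.lower().split())

print(translate_rule_based("jeg elsker hunde"))  # -> "jag älskar hundar"
```

Because the languages are structurally similar, the word order can simply be preserved; for distant language pairs you would also need reordering rules, and the rule list quickly becomes unmanageable.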
Statistical engines are different: they draw on data to create patterns – this is a more recent approach. It’s basically about feeding the engine as much data as possible and the engine finding patterns in that data.
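The data-driven idea can be sketched in a few lines (a toy example under simplifying assumptions: word-aligned sentence pairs and word-for-word translation, whereas real SMT engines model phrases and word order probabilistically):

```python
# Toy statistical MT sketch (illustrative only): learn translation
# counts from a tiny "parallel corpus" and pick the most frequent
# candidate for each source word. No rules are written by hand; the
# patterns come entirely from the data the engine is fed.
from collections import Counter, defaultdict

corpus = [  # (source, target) sentence pairs, aligned word-for-word here
    ("the house", "das haus"),
    ("the car", "das auto"),
    ("a house", "ein haus"),
]

counts: dict[str, Counter] = defaultdict(Counter)
for src, tgt in corpus:
    for s, t in zip(src.split(), tgt.split()):
        counts[s][t] += 1

def translate_statistical(sentence: str) -> str:
    """Pick the most frequently observed translation for each word."""
    return " ".join(
        counts[w].most_common(1)[0][0] if w in counts else w
        for w in sentence.split()
    )

print(translate_statistical("the house"))  # -> "das haus"
```

The quality of the output depends entirely on how much data there is and how representative it is, which is why customising the training data matters so much.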
At STP, which types of MT engines out of the ones you mention have been used?
All of them. We tried rule-based engines for translating between Scandinavian languages, which are closely related. So, we would use a rule-based engine to produce output to help with a text we were translating from Danish into Swedish, for example.
For the past 4–5 years, statistical engines have been more viable for us business-wise. Lately, we have been experimenting with neural machine translation. We started with only English into Finnish for neural MT, but we are now in the process of trialling it with other language pairs as well. So far, it seems to be working well in terms of the fluency of the output, but it still has some difficulties processing terminology, particularly when it comes to specialised areas. Only time – and extensive testing – will tell how much better this technology truly is.
Thus far, which language pairs has MT been most successful for? What about text domains?
For us at STP, the differences have been bigger between different domains than between different language pairs. The big advantage of statistical engines over rule-based ones has been customisability. It’s all about the data you feed the engine.
If you only input data for one domain, you can get rather good results, since you are training the engine for a narrow scope of material. This has been successful for software, mechanical engineering, financial and business – the latter is a bit of a catch-all term for things like website content, newsletters, HR documentation and so on.
But MT has certainly not been successful for all domains. For example, we haven’t had much success with medical engines. Medical texts are heavily regulated, and machine-translation suggestions can become more of a hindrance than a help when you’re having to follow multiple glossaries and style guides.
Is it not possible to train an engine with the help of glossaries and other resources?
Yes, with glossaries, certainly. Style guides are guidelines and they do not contain absolute rules, most of the time, so they are more difficult to implement. It also has to be said that these resources are only as useful as the client makes them.
Another issue with glossaries and resources is that they are often specific to one client – creating and training an engine for just one client is a big investment of time, effort and money. So, we need to be sure that it will be of use in the future – it’s a risky investment for a language service provider to make.
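One simple way a glossary can be put to work alongside an engine is as a QA-style check on the output (a hypothetical sketch, not a description of STP’s process; the glossary entries are invented for illustration): flag segments where a mandated source term appears but its required target term does not.

```python
# Hypothetical glossary check on MT output (illustrative sketch).
# Flags source terms whose mandated target translation is missing.

GLOSSARY = {  # English -> Swedish, invented illustrative entries
    "invoice": "faktura",
    "agreement": "avtal",
}

def glossary_violations(source: str, target: str) -> list[str]:
    """Return source terms present in the source whose mandated
    target term is absent from the target segment."""
    src, tgt = source.lower(), target.lower()
    return [s for s, t in GLOSSARY.items() if s in src and t not in tgt]

print(glossary_violations("Send the invoice", "Skicka räkningen"))  # -> ['invoice']
```

A check like this is easy to automate for hard terminology rules, whereas the softer guidance in a style guide has no such mechanical test, which is one reason glossaries are the easier resource to implement.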
How do you train the MT engine to give you good-quality output?
By having a lot of good data to begin with. If you’re looking for material to input, make sure it’s clean, flowing text and just text. It’s much better to clean the data than to feed the engine unnecessary clutter.
Once the first batch of data has been inputted, you should start using it and get feedback from translators to see if you can tweak the engine.
Ideally, you would prepare the data to make it easier for the MT engine: you’d get rid of extra formatting and tags and make it easier for the engine to parse. MT engines will struggle with extremely long segments and fragmented content.
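The kind of clean-up described above can be sketched as follows (an assumed pipeline, not STP’s actual tooling; the tag pattern and length cut-off are illustrative assumptions):

```python
# Sketch of pre-training data clean-up (assumed pipeline): strip markup
# tags, collapse whitespace, and drop segments that are empty or
# extremely long, so the engine is fed clean, flowing text and just text.
import re

MAX_WORDS = 80  # assumed cut-off for "extremely long" segments

def clean_segment(segment: str):
    """Return a cleaned segment, or None if it should be filtered out."""
    text = re.sub(r"<[^>]+>", " ", segment)   # remove inline tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    if not text or len(text.split()) > MAX_WORDS:
        return None                           # don't train on it
    return text

raw = ["<b>Hello</b>  world", "   ", "Fine segment."]
cleaned = [c for s in raw if (c := clean_segment(s)) is not None]
print(cleaned)  # -> ['Hello world', 'Fine segment.']
```

The point is that filtering happens before training: it is cheaper to discard cluttered segments up front than to have the engine learn patterns from tags and fragments.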
If it’s possible to get feedback and train the engine based on that, I would recommend this. The cycle of preparing the input, training the engine and asking for feedback should be repeated regularly.
This practice of continuously improving MT engines is actually part of the machine translation post-editing standard ISO 18587, for which STP received certification in March this year – you have to make sure that there’s a constant loop of feedback and improvement!
In part 2, you can read more about Mattia’s thoughts on neural machine translation and how STP has approached using machine translation as another technology to help translators in their work.