Building a Field-Specific Neural Machine Translation Engine - Focusing on Data Collection and Performance Verification (Part 1)

Introduction
Since 2015, advances in neural network technology in the AI industry have given rise to a new type of automated machine translation called Neural Machine Translation (NMT).
Although NMT has existed for only a few years, it has quickly become the mainstream machine translation method because it achieves higher translation accuracy than conventional Rule-Based Machine Translation (RBMT) and Statistical Machine Translation (SMT).
Now, let's move on to field-specific translation. There is a strong perception that field-specific translation is better left to human translation than to machine translation.
The reasons for this are:
1. Specialized documents are difficult to translate, and the quality gap compared with human translation is large.
2. The translation output for a given term may differ each time, and mistranslations inappropriate for the field may be output.
In reality, however, there are various limitations to manual field-specific translation when considering overall workload, cost, and quality.
At the same time, depending on the application, customers increasingly accept machine translation as long as it delivers a reasonably satisfactory level of quality.
Now that NMT is booming, is it possible to create a field-specific NMT engine that can meet these expectations?
If so, how can it be built, and how well does it perform?
Let us examine these questions.
First Element of Engine Creation: Data
Needless to say, to create a field-specific NMT engine, you need data. And the data itself must be bilingual and field-specific.
Bilingual data is data in which sentences and their translations are aligned with each other. For example, each Japanese sentence is linked to its Chinese translation.
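To make the idea of aligned bilingual data concrete, here is a minimal sketch of one way to represent and store sentence pairs. The `SentencePair` class, the helper names, and the tab-separated layout are assumptions made for illustration; the article does not specify a storage format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence pair (hypothetical structure for illustration)."""
    source: str  # e.g. a Japanese sentence
    target: str  # e.g. its Chinese translation


def write_pairs(pairs, stream):
    """Write pairs as tab-separated lines: source<TAB>target."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for pair in pairs:
        writer.writerow([pair.source, pair.target])


def read_pairs(stream):
    """Read tab-separated lines back into SentencePair objects."""
    reader = csv.reader(stream, delimiter="\t")
    return [SentencePair(row[0], row[1]) for row in reader if len(row) == 2]


pairs = [
    SentencePair("ファイルを保存します。", "保存文件。"),
    SentencePair("設定を変更してください。", "请更改设置。"),
]

# Round-trip through an in-memory buffer; a real corpus would use a file.
buffer = io.StringIO()
write_pairs(pairs, buffer)
buffer.seek(0)
restored = read_pairs(buffer)
```

Keeping one pair per line makes it easy to count, filter, and shuffle the corpus with ordinary text tools before training.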
Field-specific means, for example, that an engine for the sports field should be built from sports data. This is because general-purpose bilingual data may yield translations that do not suit a particular field, and data from other fields may adversely affect the results.
NMT requires a larger volume of bilingual data than conventional SMT.
Therefore, the first problem to be solved is to secure a large amount of field-specific bilingual data.
Methods of collecting field-specific bilingual data
In practice, however, bilingual data is not easy to collect, and the field-specific requirement makes it even harder.
So, how can we collect "large volumes" of "field-specific" bilingual data?
There are three possible methods:
1. Extracting Bilingual Data
(1) If sentence-level bilingual data does not exist, secure a certain amount of data in both languages that correspond to each other, e.g., at the file level.
(2) From the file-level data in (1), determine the sentence-level correspondence and extract "bilingual" sentence pairs.
2. Extracting Field-Similarity Data
(1) If bilingual data itself exists, but data from multiple fields are mixed and not distinguished, first select a small amount of field-specific bilingual data from the data.
(2) Using sentence similarity calculation, search for data similar to the selected field-specific data.
3. Language Conversion
(1) Secure monolingual data.
(2) Convert it to another language.
(3) Treat it as bilingual data.
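As an illustration of method 2, the sketch below scores each candidate pair by its similarity to a small set of field-specific seed sentences and keeps those above a threshold. The use of character bigrams with cosine similarity, the function names, and the threshold of 0.2 are all assumptions chosen for this example; the article does not say which similarity measure is used.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count character n-grams; needs no word tokenizer for Japanese or Chinese."""
    text = text.strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def select_in_field(seed_sentences, candidates, threshold=0.2):
    """Keep (source, target) pairs whose source resembles any seed sentence."""
    seed_vectors = [char_ngrams(s) for s in seed_sentences]
    selected = []
    for source, target in candidates:
        vector = char_ngrams(source)
        score = max((cosine(vector, seed) for seed in seed_vectors), default=0.0)
        if score >= threshold:
            selected.append((source, target, score))
    return selected


# A tiny hand-picked seed set for the IT field (illustrative data only).
seeds = ["サーバーを再起動してください。"]
mixed = [
    ("サーバーを再起動します。", "重新启动服务器。"),
    ("選手がゴールを決めました。", "球员进球了。"),
]
selected = select_in_field(seeds, mixed)
```

With this toy data only the server-related pair is kept, because the sports sentence shares no character bigrams with the seed. On a real corpus the threshold would need tuning, and a stronger similarity measure may be preferable.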
So far, we have introduced three methods for collecting bilingual data. For this article, we actually collected Japanese-Chinese bilingual data specific to the IT field using the first method above, extracting bilingual data.
Many websites that publish IT-related technical documents offer each page in multiple languages.
Therefore, it is possible to obtain a large amount of Japanese-Chinese bilingual data from such sites.
The data obtained this way corresponds only at the page (file) level, so we built an alignment tool to bring the Japanese-Chinese data into correspondence at the sentence level.
The term "alignment" here refers to the automatic mapping of sentences, and the method will be introduced at another time.
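For readers who want a feel for what sentence alignment involves, here is a simplified, length-based sketch. It is not the alignment tool described in this article. It splits each side into sentences and uses dynamic programming to choose 1-1, 1-2, or 2-1 groupings whose character lengths match best. The function names, the `ratio` parameter (expected target characters per source character), and the merge penalty of 1 are assumptions for illustration.

```python
import re


def split_sentences(text):
    """Split on Japanese/Chinese sentence-final punctuation, keeping it."""
    parts = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def align(src, tgt, ratio=1.0):
    """Align two sentence lists using 1-1, 1-2 and 2-1 groupings by length."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                # Penalise length mismatch; merged groups pay a small extra cost.
                penalty = 0 if (di, dj) == (1, 1) else 1
                step = abs(src_len * ratio - tgt_len) + penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return []  # no alignment possible with the allowed groupings
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append(("".join(src[pi:i]), "".join(tgt[pj:j])))
        i, j = pi, pj
    return pairs[::-1]


ja_page = "再起動します。設定を変更します。その後、保存してください。"
zh_page = "重新启动。更改设置，然后保存。"
# ratio=0.7 assumes Chinese text is somewhat shorter than Japanese text.
pairs = align(split_sentences(ja_page), split_sentences(zh_page), ratio=0.7)
```

In this toy example the three Japanese sentences are matched to two Chinese sentences by merging the second and third Japanese sentences into one group. Length alone is a weak signal, so a practical tool would likely combine it with dictionaries or other lexical cues.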
As a result of this process, we were able to collect approximately 600,000 sentence pairs of Chinese-Japanese bilingual data.
Although this is still a small amount of data compared with what is used to train state-of-the-art general-purpose NMT engines, it is enough for a prototype NMT engine.
In the next article, we will describe the actual process of prototyping a field-specific Chinese-Japanese/Japanese-Chinese NMT engine using the IT-field Chinese-Japanese bilingual data collected here.