
I want to develop highly accurate AI translations for a specific field
I want to improve the accuracy of a multilingual chatbot
I want to predict what will be said during a conversation or Q&A session
I'm looking for a bilingual corpus that supports languages of Southeast Asian countries
I want to create a terminological dictionary specialized for an industry or field
We have been developing machine translation since the 1980s and have extensive experience in delivering bilingual corpora to major companies.
With our many years of experience and know-how, we provide multilingual corpora that contribute to problem solving.
We have bilingual corpora in various fields such as tourism, medicine, law, finance, and intellectual property.
We also provide bilingual corpora of dialogues such as press conferences and Q&A sessions.
We can provide more than 1 million bilingual pairs!
You can also use the corpus as a translation memory for the translation support tool you are using.
In addition to data translated from Japanese into English, we have bilingual corpora translated from English into Southeast Asian languages and rare languages.
Bilingual corpora are translated by native speakers of each language and thus reflect nuances and common phrases unique to each language.
In addition to the field and data type, the translation direction of the source and target texts is also managed so that the desired bilingual corpus can be extracted.
We can also create multilingual terminology dictionaries.
We have a collection of precautions for safe work at factories and work sites in a conversational format for each industry.
Many fields are covered, from primary industries such as agriculture, forestry, and fisheries, to reports on economics, finance, IT, and nuclear power.
Data necessary for foreigners to enter and live in Japan, such as laws related to immigration and visa status and medical interviews, are available in multiple languages.
Research department of a major telecommunications company / Research institute of a broadcasting station / AI-based machine translation engine development company, etc.
Bilingual corpora are increasingly necessary for developing and tuning machine translation and multilingual generative AI. In addition, when translators start working in a new field, an in-house translation memory greatly improves the efficiency of the translation process. However, it is not easy to collect a systematic bilingual corpus on one's own; it is more efficient to procure data from external sources and focus on developing the desired deliverables and on the translation work itself.
The following six points are important for selecting a bilingual corpus.
1) Language combination
2) Field
3) Quality
4) Quantity
5) Data type
6) Context
The "language combination" in 1) refers to the language pair that the translator works in or that the machine translation system is trained on, for example Japanese and English or Japanese and Chinese. Language combination is likewise the most important factor when registering data in the translation memory of a translation support tool. Furthermore, if you are looking for more natural expressions, which language is the source text is also important. For example, even within what is loosely called a "Japanese-English bilingual corpus", a Japanese source text translated into English naturally differs from an English source text translated into Japanese, both in the fluency of the English and in the background of the translation.
The "field" in 2) refers to domains such as tourism, medicine, law, economics, and science and technology. Training on data from the field in which you want to improve accuracy helps accelerate development. When building a language model to improve machine translation accuracy in a specific field, around 100,000 pairs is a rough guide for seeing an effect. Needless to say, the field also matters when creating a terminology dictionary from a bilingual corpus. In our experience, demand in Japan first centered on the tourism field, when the country was promoting tourism and focusing on inbound visitors, and then on the medical field, to strengthen medical care for foreigners visiting or residing in Japan. Over the years, needs have shifted to the business field: presentation materials used to explain business activities, and speeches at conferences and events together with the recorded question-and-answer sessions that follow them.
The "quality" in 3) refers, in other words, to accuracy, and depends on how the bilingual corpus was created. Ideally, the data is translated by humans. If the data was produced by machine translation and has not been checked or corrected by a person, the quality will naturally be lower. Even in human-translated data, one source sentence is sometimes translated into two or more target sentences; if your application requires the source and target to correspond sentence by sentence, this affects quality. Data translated with excessive abbreviation is likewise not a good-quality bilingual corpus.
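As a rough illustration of the last two points, the following Python sketch flags pairs that are not aligned one sentence to one sentence, or whose target is suspiciously short or long relative to the source. The `Pair` class, the punctuation-based sentence counter, and the ratio thresholds are all assumptions made for this example; real thresholds would need tuning per language pair, and the check does not replace human review.

```python
import re
from dataclasses import dataclass


@dataclass
class Pair:
    source: str
    target: str


# Assumed sentence-ending punctuation for English and Japanese text.
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def count_sentences(text: str) -> int:
    """Count sentences by splitting on end punctuation (a rough heuristic)."""
    parts = [p for p in _SENTENCE_END.split(text) if p.strip()]
    return max(len(parts), 1)


def quality_flags(pair: Pair, min_ratio: float = 0.25, max_ratio: float = 4.0) -> list[str]:
    """Return a list of warning flags; an empty list means no issue was detected."""
    if not pair.source.strip() or not pair.target.strip():
        return ["empty_side"]
    flags = []
    # One source sentence rendered as several target sentences (or vice versa).
    if count_sentences(pair.source) != count_sentences(pair.target):
        flags.append("not_one_to_one")
    # A very small or very large character ratio can hint at over-abbreviation
    # or added material; the default bounds are illustrative assumptions.
    ratio = len(pair.target) / len(pair.source)
    if ratio < min_ratio or ratio > max_ratio:
        flags.append("length_ratio")
    return flags
```

A pair that returns an empty list has simply passed these two heuristics; flagged pairs are candidates for manual checking rather than automatic deletion.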
As for the "quantity" in 4), tens of thousands of pairs in a particular field are enough to be useful to translators when registered in the translation memory or dictionary of a translation support tool used in-house. For machine learning in a specific field, such as when developing a machine translation engine, 100,000 pairs are said to be effective to a certain extent. When developing a general-purpose machine translation engine, on the other hand, tens of millions of pairs are said to be necessary. The amount of bilingual corpus required thus varies greatly depending on the application.
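The rules of thumb above can be written down as a small lookup, sketched here in Python. The use-case names and the exact cut-off values are assumptions for illustration: the text only gives orders of magnitude ("tens of thousands", "100,000", "tens of millions"), so the lower end of each range is used.

```python
# Rough minimum corpus sizes in pairs, taken from the orders of magnitude
# mentioned in the text. The keys and exact numbers are illustrative assumptions.
ROUGH_MINIMUM_PAIRS = {
    "translation_memory": 10_000,   # "tens of thousands" for in-house TM use
    "domain_mt": 100_000,           # field-specific machine translation
    "general_mt": 10_000_000,       # "tens of millions" for general-purpose MT
}


def is_enough(use_case: str, available_pairs: int) -> bool:
    """Check a corpus size against the rough guide for the given use case."""
    return available_pairs >= ROUGH_MINIMUM_PAIRS[use_case]


def shortfall(use_case: str, available_pairs: int) -> int:
    """Return how many more pairs the rough guide suggests (0 if already enough)."""
    return max(ROUGH_MINIMUM_PAIRS[use_case] - available_pairs, 0)
```

For example, `shortfall("domain_mt", 60_000)` gives the number of additional pairs the rough guide would suggest before field-specific training.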
The "data type" in 5) indicates whether the original file from which the bilingual corpus was created is a report, white paper, presentation, press conference, Q&A session, and so on. Reports and white papers are more effective for training on written language, while corpora created from announcements, press conferences, or Q&A sessions are more effective for training on spoken language. Because our bilingual corpus manages attributes in detail, it can be extracted by data type.
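To show what attribute-based extraction might look like in practice, here is a minimal Python sketch. The `CorpusEntry` fields and the `extract` function are hypothetical; they do not describe the actual schema or tooling of any delivered corpus, only one way to filter pairs by translation direction, field, and data type.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CorpusEntry:
    source: str
    target: str
    source_lang: str  # language the original text was written in
    target_lang: str  # language it was translated into
    field: str        # e.g. "finance", "medical" (hypothetical labels)
    data_type: str    # e.g. "report", "qa_session" (hypothetical labels)


def extract(
    entries: Iterable[CorpusEntry],
    *,
    source_lang: Optional[str] = None,
    target_lang: Optional[str] = None,
    field: Optional[str] = None,
    data_type: Optional[str] = None,
) -> list[CorpusEntry]:
    """Keep entries that match every attribute that was specified."""
    wanted = {
        "source_lang": source_lang,
        "target_lang": target_lang,
        "field": field,
        "data_type": data_type,
    }
    # Attributes left as None are not used as filters.
    active = {name: value for name, value in wanted.items() if value is not None}
    return [
        entry
        for entry in entries
        if all(getattr(entry, name) == value for name, value in active.items())
    ]
```

Calling `extract(corpus, source_lang="ja", data_type="qa_session")` would, under these assumptions, return only spoken-style pairs whose original text was Japanese.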
The "context" in 6) means whether or not there is a semantic link between multiple sentences. For instance, example sentences in a dictionary only need to contain a specific headword and have no contextual relationship to the other example sentences, so they are considered "out of context". In contrast, a report is "contextual" because it consists of multiple sentences describing a certain event or events. Similarly, press conferences and Q&A sessions are "contextual" because multiple speakers take turns speaking. To generate more accurate output not only in machine translation but also in chatbots, it is now more important than ever that the bilingual corpus used for training be "contextual".
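One way a contextual corpus can be used is to attach the preceding sentences of the same document to each pair before training. The Python sketch below assumes each row carries a document ID and a position within that document; this row layout and the `with_context` helper are assumptions for illustration, not a description of any particular data format.

```python
from collections import defaultdict


def with_context(rows, window=2):
    """Attach up to `window` preceding source sentences to each pair.

    `rows` is an iterable of (doc_id, position, source, target) tuples,
    an assumed layout for this example.
    """
    docs = defaultdict(list)
    for doc_id, position, source, target in rows:
        docs[doc_id].append((position, source, target))

    examples = []
    for doc_id, sentences in docs.items():
        # Restore the original sentence order within each document.
        sentences.sort(key=lambda item: item[0])
        for i, (_, source, target) in enumerate(sentences):
            start = max(0, i - window)
            context = [src for _, src, _ in sentences[start:i]]
            examples.append(
                {"doc_id": doc_id, "context": context, "source": source, "target": target}
            )
    return examples
```

Isolated dictionary example sentences would each form their own single-sentence "document" here and so always receive an empty context, which is exactly the "out of context" case described above.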
Conclusion
When selecting a bilingual corpus, you need to consider points 1) to 6) above according to your application. In particular, when performing machine learning for a specific field, it is essential to consider quality, data type, and the presence or absence of context. To make sure you use the right bilingual corpus for your purposes, please check our free sample data first.