Support for building large-scale language models required for the development and tuning of AI translation engines and generative AI

We support the construction of large-scale language models required for the development and tuning of AI translation engines and generative AI.

More than 1 million pairs
Support for more than 10 languages
Extensive delivery record

Click here to request a sample

We solve machine learning problems related to AI translation and Q&A!

I want to develop highly accurate AI translation for a specific field
I want to improve the accuracy of a multilingual chatbot
I want to predict what will be said during a conversation or Q&A session
I'm looking for a bilingual corpus that supports the languages of Southeast Asian countries
I want to create a terminology dictionary specialized for an industry or field

Six Strengths of Kodensha

More than 40 years of experience

We have been developing machine translation since the 1980s and have extensive experience in delivering bilingual corpora to major companies. Drawing on our many years of experience and know-how, we provide multilingual corpora that contribute to problem solving.

Strengths in specialized fields

We have bilingual corpora in a range of fields, including tourism, medicine, law, finance, and intellectual property.

We also provide bilingual corpora of dialogue, such as press conferences and Q&A sessions.

More than 1 million pairs

We can provide a bilingual corpus of more than 1 million pairs. You can also use it as a translation memory for the translation support tool you are using.

Rare languages

In addition to data translated from Japanese into English, we have bilingual corpora translated from English into Southeast Asian languages and other rare languages.

Natural expressions

Bilingual corpora are translated by native speakers of each language and thus reflect the nuances and common phrases unique to each language.

Semi-customized

In addition to the field and data type, we also manage the translation direction between source and target texts, so that the bilingual corpus you need can be extracted. We can also create multilingual terminology dictionaries.

Click here for inquiries and sample requests.

Support for specialized fields such as the following

Bilingual corpus of conversational business instructions and specialized fields

We have collected safety precautions for work at factories and work sites, in a conversational format, for each industry. Coverage ranges from primary industries such as agriculture, forestry, and fisheries to reports on economics, finance, IT, and nuclear power.

Multilingual data for foreign residents and visitors to Japan

Data necessary for foreigners to enter and live in Japan, such as laws related to immigration and visa status and medical interviews, are available in multiple languages.

Case Studies

Research department of a major telecommunications company / Research institute of a broadcasting station / AI-based machine translation engine development company, etc.

Frequently Asked Questions

What is a bilingual corpus?: A bilingual corpus is a collection of sentences in one language paired with their translations in one or more other languages. Our most extensive corpus covers Japanese and English.

What languages and language combinations are available in a bilingual corpus?: We have the most extensive corpus for Japanese and English. We also support English and Chinese, Spanish and French, Indonesian and Portuguese, and other combinations.

If I purchase a bilingual corpus, are there any restrictions on its use?: Yes, we have bilingual corpora that can be used only for machine learning, and bilingual corpora that can be published on websites and in educational materials. In the latter case, we will discuss the terms and conditions with you prior to purchase.

What is the minimum purchase unit or price for a bilingual corpus?: It is said that about 100,000 pairs are required for use in machine learning, and our customers often purchase in such units. The price will be discussed separately depending on the intended use and the quantity purchased.

How are bilingual corpora delivered?: We can deliver the data as text files (CSV, TSV, etc.) or Excel files upon request.
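
As a minimal sketch of how a delivered TSV file might be read, the Python below assumes a hypothetical layout: a header row followed by two tab-separated columns, source sentence then target sentence. The actual column layout of a delivery is not described here, and the function name `read_pairs` is illustrative only.

```python
import csv
import io


def read_pairs(tsv_file):
    """Yield (source, target) tuples from a tab-separated file object.

    Assumed layout (hypothetical): one header row, then two columns,
    source sentence and target sentence.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    for row in reader:
        # Ignore rows that are incomplete or have an empty side.
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            yield row[0].strip(), row[1].strip()


# Illustrative in-memory data standing in for a delivered file.
sample = io.StringIO(
    "source\ttarget\n"
    "おはようございます。\tGood morning.\n"
    "\t\n"
    "ありがとう。\tThank you.\n"
)
pairs = list(read_pairs(sample))
print(len(pairs))
```

For a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`; the encoding is also an assumption and should be confirmed at delivery.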

Key Points for Selecting a Bilingual Corpus

Bilingual corpora are becoming increasingly necessary for developing and tuning machine translation and multilingual generative AI. In addition, when translators start working in a new field, an in-house translation memory greatly improves the efficiency of the translation process. However, collecting a systematic bilingual corpus on one's own is not easy; it is more efficient to procure data from external sources and focus on developing the desired deliverables or on the translation work itself.

The following six points are important for selecting a bilingual corpus.
1) Language combination
2) Field
3) Quality
4) Quantity
5) Data type
6) Context

The "language combination" in 1) refers to the language pair that the translator works in or that the machine translation engine is trained on, for example Japanese and English or Japanese and Chinese. It is likewise the most important factor when registering data in the translation memory of a translation support tool. If you are looking for more natural expressions, which language is the source text also matters. Even within what is loosely called a "bilingual corpus of Japanese and English," a Japanese source text translated into English naturally differs from an English source text translated into Japanese, both in the fluency of the English and in the background of the translation.

The "field" in 2) refers to domains such as tourism, medicine, law, economics, and science and technology. Training on data from the field in which you want to improve accuracy helps accelerate development. When building a language model to improve machine translation accuracy in a specific field, roughly 100,000 pairs is a rule of thumb for seeing an effect. The field is equally important when creating a terminology dictionary from a bilingual corpus. In our experience, demand in Japan centered on tourism when the country was promoting itself as a tourism-oriented nation and focusing on inbound visitors, and on the medical field when medical care for foreigners visiting or residing in Japan needed to be strengthened. Over the years, needs have shifted further to the business field: presentation materials used to explain business activities, and speeches at conferences and events together with the recorded question-and-answer sessions that follow them.

The "quality" in 3) means accuracy, and it depends on how the bilingual corpus was created. Data translated by humans is preferable. If the data was machine translated and has not been checked or corrected by a person, the quality will naturally be lower. Even with human translation, one source sentence is sometimes rendered as two or more target sentences; if your application requires source and target to correspond sentence by sentence, this affects quality. A corpus translated with excessive abbreviation is also not a good-quality bilingual corpus.
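
The one-to-one correspondence issue described above can be illustrated with a rough check. The sketch below is an assumption-laden example, not a description of how any corpus is actually validated: it counts sentences by splitting on terminal punctuation, which will miscount abbreviations such as "Mr." or decimal numbers, and the helper names are hypothetical.

```python
import re

# Naive sentence-end pattern covering Japanese and English terminal
# punctuation. This is an assumption made for illustration only.
_SENT_END = re.compile(r"[。！？.!?]+")


def count_sentences(text):
    """Count sentence-like segments by splitting on terminal punctuation."""
    return len([s for s in _SENT_END.split(text) if s.strip()])


def is_one_to_one(source, target):
    """Return True when source and target each contain exactly one sentence."""
    return count_sentences(source) == 1 and count_sentences(target) == 1


pairs = [
    ("今日は雨です。", "It is raining today."),
    ("今日は雨ですが、出かけます。", "It is raining today. I will go out anyway."),
]

# Pairs where one source sentence was rendered as several target sentences
# (or vice versa) are flagged for review rather than silently dropped.
flagged = [p for p in pairs if not is_one_to_one(*p)]
print(len(flagged))
```

Flagging rather than deleting keeps the decision with a human reviewer, which fits the point that quality ultimately depends on human checking.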

As for the "quantity" in 4), tens of thousands of pairs in a particular field are enough to be useful to translators when registered in the translation memory or word dictionary of a translation support tool used in-house. For machine learning in a specific field, such as developing a domain-specific machine translation engine, 100,000 pairs are said to be effective to a certain extent. Developing a general-purpose machine translation engine, by contrast, is said to require tens of millions of pairs. The amount of bilingual corpus required therefore varies greatly depending on the application.

The "data type" in 5) indicates the kind of original document from which the bilingual corpus was created: a report, white paper, presentation, press conference, Q&A session, and so on. To train on written language, reports and white papers are more effective; to train on spoken language, use a bilingual corpus created from presentations, press conferences, or Q&A sessions. Because our bilingual corpus manages attributes in detail, it can be extracted by data type.
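
To make attribute-based extraction concrete, here is a small sketch in Python. The record structure, the attribute names (`field`, `data_type`, `source_lang`, `target_lang`), and the attribute values are all hypothetical; they only illustrate the idea of filtering pairs by field, data type, and translation direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One bilingual pair with hypothetical attribute names."""
    source: str
    target: str
    source_lang: str  # language of the original text
    target_lang: str  # language of the translation
    field: str        # e.g. "medicine", "tourism"
    data_type: str    # e.g. "report", "qa_session"


def extract(pairs, *, field=None, data_type=None, source_lang=None, target_lang=None):
    """Return the pairs matching every attribute that was specified."""
    wanted = {
        "field": field,
        "data_type": data_type,
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    active = {k: v for k, v in wanted.items() if v is not None}
    return [p for p in pairs if all(getattr(p, k) == v for k, v in active.items())]


# Illustrative data only.
corpus = [
    Pair("診察券をお持ちですか。", "Do you have your patient ID card?", "ja", "en", "medicine", "qa_session"),
    Pair("Please fill in this form.", "この用紙にご記入ください。", "en", "ja", "medicine", "qa_session"),
    Pair("観光客数は増加した。", "The number of tourists increased.", "ja", "en", "tourism", "report"),
]

# Spoken-language pairs whose original text is Japanese.
spoken_ja_en = extract(corpus, data_type="qa_session", source_lang="ja", target_lang="en")
print(len(spoken_ja_en))
```

Keeping the source and target languages as separate attributes is what allows translation direction to be selected, rather than treating Japanese-English data as a single undirected set.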

The "presence or absence of context" in 6) means whether there is a semantic link between multiple sentences. For example, example sentences in a dictionary need only contain a specific headword, and they have no connection to the other example sentences, so they are considered "context-free." In contrast, a report is "contextual" because it consists of multiple sentences describing one or more events. Press conferences and Q&A sessions are likewise "contextual" because multiple speakers take turns speaking. To generate more accurate output, in chatbots as well as in machine translation, it is now more important than ever that the bilingual corpus used for training be "contextual."

Conclusion
When selecting a bilingual corpus, consider points 1) to 6) above in light of your application. For machine learning in a specific field in particular, quality, data type, and the presence of context are essential considerations. To make sure you use the right bilingual corpus for your purpose, please check our free sample data first.

Click here for inquiries and sample requests.
