Support for building large-scale language models required for the development and tuning of AI translation engines and generative AI

We support the construction of large-scale language models required for the development and tuning of AI translation engines and generative AI.

  • More than 1 million pairs
  • Support for more than 10 languages
  • Extensive delivery record

We solve machine learning problems related to AI translation and Q&A!

  •  I want to develop highly accurate AI translation specialized for a specific field
  •  I want to improve the accuracy of a multilingual chatbot
  •  I want to predict what will be said during a conversation or Q&A session
  •  I'm looking for a bilingual corpus that supports languages of Southeast Asian countries
  •  I want to create a terminology dictionary specialized for our industry or field

Six Strengths of Kodensha

  •  More than 40 years of experience

    We have been developing machine translation since the 1980s and have extensive experience in delivering bilingual corpora to major companies. Drawing on these years of experience and know-how, we provide multilingual corpora that contribute to problem solving.

  •  Strengths in specialized fields

    We have bilingual corpora in various fields such as tourism, medicine, law, finance, and intellectual property. We also provide bilingual corpora of dialogues such as press conferences and Q&A sessions.

  •  More than 1 million pairs

    We can provide more than 1 million pairs of bilingual corpus data! You can also use the data as a translation memory for the translation support tool you are using.

  •  Rare languages

    In addition to data translated from Japanese into English, we have bilingual corpora of translations from English into Southeast Asian languages and rare languages.

  •  Natural expressions

    Our bilingual corpora are translated by native speakers of each language and thus reflect the nuances and common phrases unique to each language.

  •  Semi-customized

    In addition to the field and data type, the translation direction of the source and target texts is also managed, so the desired bilingual corpus can be extracted. We can also create multilingual terminology dictionaries.

Support for specialized fields such as the following

Case Studies

Research department of a major telecommunications company / Research institute of a broadcasting station / AI-based machine translation engine development company, etc.

Frequently Asked Questions

What is a bilingual corpus?
A bilingual corpus is a collection of sentences in one language paired with their translations in one or more other languages. Our most extensive corpus is for Japanese and English.
Which languages and language combinations does the bilingual corpus cover?
Our most extensive corpus is for Japanese and English. We also support English and Chinese, Spanish and French, Indonesian and Portuguese, and other combinations.
If I purchase a bilingual corpus, are there any restrictions on its use?
Yes. Some of our bilingual corpora can be used only for machine learning, while others can be published on websites and in educational materials. In the latter case, we will discuss the terms and conditions with you prior to purchase.
What is the minimum purchase unit or price for a bilingual corpus?
It is said that about 100,000 pairs are required for use in machine learning, and our customers often purchase in such units. The price will be discussed separately depending on the intended use and the quantity purchased.
How are bilingual corpora delivered?
We can deliver the data as text files (CSV, TSV, etc.) or Excel files upon request.
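
As an illustration of working with a text delivery, here is a minimal Python sketch for reading a TSV file into sentence pairs. The two-column layout (source text, then target text, with no header row) and the function name `read_pairs` are assumptions made for this example; the actual column layout of a delivered file may differ.

```python
import csv
import io


def read_pairs(handle, delimiter="\t"):
    """Yield (source, target) tuples from a delimited text stream.

    Assumes column 0 is the source sentence and column 1 is the target
    sentence (a hypothetical layout). Incomplete rows are skipped.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            yield source, target


# Made-up sample data standing in for a delivered TSV file.
sample = "こんにちは\tHello\nありがとう\tThank you\n\n"
pairs = list(read_pairs(io.StringIO(sample)))
print(len(pairs), "pairs loaded")
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.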

Key Points for Selecting a Bilingual Corpus

Bilingual corpora are becoming increasingly necessary in the development and tuning of machine translation and multilingual generative AI. In addition, when translators start working in a new field, an in-house translation memory greatly improves the efficiency of the translation process. However, it is not easy to collect a systematic bilingual corpus on your own. It is more efficient to procure data from external sources and focus on developing the desired deliverables or on the translation work itself.

The following six points are important for selecting a bilingual corpus.
1) Language combination
2) Field
3) Quality
4) Quantity
5) Data type
6) Context

The "language combination" in (1) refers to the language pair that the translator works in, or that is used for training when developing machine translation: for example, Japanese and English, or Japanese and Chinese. Language combination is likewise the most important factor when registering data in the translation memory of a translation support tool. Furthermore, if you are looking for more natural expressions, which language is the source text also matters. For example, even within what is simply called a "bilingual corpus of Japanese and English," a Japanese source text translated into English naturally differs from an English source text translated into Japanese in the fluency of the English and in the background of the translation.

The "field" in (2) refers to fields such as tourism, medicine, law, economics, and science and technology. Training on data from the fields in which you want to improve accuracy helps accelerate development. When building a language model to improve machine translation accuracy in a specific field, around 100,000 pairs is a rough guide to the amount needed to see an effect. Needless to say, the field is also important when creating a terminology dictionary from a bilingual corpus. In our experience, the fields in demand in Japan have included tourism, when the country was focusing on inbound tourism as part of its tourism promotion efforts, and medicine, where bilingual corpora were needed to strengthen medical care for foreigners visiting or residing in Japan. Over the years, needs have also shifted toward the business field, such as presentation materials used to explain business activities, and recordings of speeches at conferences and events together with the question-and-answer sessions that follow.

The "quality" in (3) refers to accuracy, and depends on how the bilingual corpus was created. It is preferable that the data be translated by humans. If the data was produced by machine translation and has not been checked or corrected by a person, the quality will naturally be lower. Even with human-translated data, there are cases where one source text is translated into two or more target texts; if your application requires the source and target to correspond one sentence to one sentence, this affects quality. Furthermore, data translated with excessive abbreviations does not make a good-quality bilingual corpus either.
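
The one-to-many case described above can be checked mechanically. The following is a minimal sketch, assuming the corpus is already loaded as a list of `(source, target)` tuples; the function name `one_to_many_sources` is hypothetical, and this is only one simple check, not a complete quality assessment.

```python
from collections import defaultdict


def one_to_many_sources(pairs):
    """Return sources that appear with more than one distinct target.

    `pairs` is an iterable of (source, target) tuples. The result maps
    each such source to a sorted list of its distinct targets.
    """
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


# Made-up sample data for illustration only.
sample_pairs = [
    ("ありがとう", "Thank you"),
    ("ありがとう", "Thanks"),
    ("こんにちは", "Hello"),
    ("こんにちは", "Hello"),
]
flagged = one_to_many_sources(sample_pairs)
print(flagged)
```

Whether flagged sources should be removed, merged, or kept depends on the application; for translation memory use, multiple valid targets may be perfectly acceptable.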

As for the "quantity" in (4), tens of thousands of pairs in a particular field are enough to be useful to translators when registered in the translation memory or word dictionary of a translation support tool used in-house. For machine learning in a specific field, such as when developing a machine translation engine, 100,000 pairs are said to be effective to a certain extent. In contrast, developing a general-purpose machine translation engine is said to require tens of millions of pairs. The amount of bilingual corpus data required therefore varies greatly depending on the application.

The "data type" in (5) indicates the kind of original document from which the bilingual corpus was created: a report, white paper, presentation, press conference, Q&A session, and so on. To train on written language, a bilingual corpus created from reports or white papers is more effective; to train on spoken language, one created from presentations, press conferences, or Q&A sessions is more effective. Because our bilingual corpus manages attributes in detail, it is possible to extract a bilingual corpus by data type.
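
To make attribute-based extraction concrete, here is a minimal Python sketch. The `Pair` record, its attribute names (`field`, `data_type`, `direction`), the attribute values, and the `extract` function are all hypothetical and chosen for this example; they do not describe the actual schema used to manage the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    source: str
    target: str
    field: str      # hypothetical value, e.g. "medicine"
    data_type: str  # hypothetical value, e.g. "report" or "qa_session"
    direction: str  # hypothetical value, e.g. "ja-en" (source-target)


def extract(pairs, *, field=None, data_type=None, direction=None):
    """Return the pairs matching every attribute that was specified.

    Attributes left as None are not used for filtering.
    """
    wanted = {"field": field, "data_type": data_type, "direction": direction}
    return [
        pair
        for pair in pairs
        if all(value is None or getattr(pair, name) == value
               for name, value in wanted.items())
    ]


# Made-up sample data for illustration only.
corpus = [
    Pair("熱はありますか。", "Do you have a fever?", "medicine", "qa_session", "ja-en"),
    Pair("本報告書は…", "This report ...", "finance", "report", "ja-en"),
    Pair("Where does it hurt?", "どこが痛みますか。", "medicine", "qa_session", "en-ja"),
]
spoken_medical = extract(corpus, field="medicine", data_type="qa_session")
print(len(spoken_medical))
```

Adding `direction="ja-en"` to the call narrows the result to pairs whose source text is Japanese.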

The "presence or absence of context" in (6) means whether or not there is a semantic link between multiple sentences. For example, each example sentence in a dictionary only needs to contain a specific headword, and has no connection to the other example sentences. Such data is therefore considered "out of context." In contrast, a report is "contextual" because it consists of multiple sentences describing a certain event or events. Similarly, press conferences and Q&A sessions are "contextual" because multiple speakers take turns speaking. To generate more accurate output not only in machine translation but also in chatbots, it is now more important than ever that the bilingual corpus used for training be "contextual."

Conclusion
When selecting a bilingual corpus, it is necessary to consider points (1) to (6) above, depending on the application. In particular, when performing machine learning for a specific field, it is essential to consider quality, data type, and the presence or absence of context. To make sure you use the right bilingual corpus for your purposes, please check our free sample data first.
