We asked the developer about machine processing in the creation of the Japanese-Korean dictionary.

Hello to everyone visiting the Kodensha website! Nice to meet you. My name is Satake, and I work at Kodensha.

To enrich our website, we have started the "Kodensha Development Office Blog" to share a variety of information from our development office.

I am a sales assistant with no science background and little familiarity with technology, so I interviewed our developers and wrote up the "Kodensha Development Office Blog" in the form of a dialogue. Please read on (*^_^*) Let's get right to the point!

Kodensha offers a variety of services related to translation and text input, such as software development, human-powered translation and interpretation, mobile and Internet-related content development, and ASP services.
In fact, in addition to the above services, we are also engaged in a variety of development projects every day using a wide range of technologies!

In this first installment of the "Development Office Blog," we spoke with Mr. Kawakami of Development Section 4 about machine processing in the creation of a Japanese-Korean dictionary!
Mr. Kawakami, thank you very much for your time!

Dictionary construction is possible through natural language processing!

Satake: I understand that you are developing machine processing for creating Japanese-Korean dictionaries, but what exactly are you working on? As a Korean language lover, I have a feeling this will be very exciting! Thank you for speaking with me today!

Kawakami: Thank you for having me. First, let me briefly explain the work I was in charge of this time: creating a Japanese-Korean dictionary from a very large amount of Japanese-Korean bilingual text through statistical natural language processing. The corpus contains more than 10 million bilingual translation pairs, from which bilingual words are extracted automatically. In this development, we confirmed a high correspondence accuracy between the Japanese and Korean words.

Satake: ... I'm sorry to start off with a basic question, but may I start by asking what "natural language processing" is?

Kawakami: Natural language processing is a technology that allows computers to process the languages people use in their daily lives, such as Japanese, English, Chinese, and Korean (natural languages). It is applied to the predictive conversion and kanji conversion in the IME included in our well-known software, such as ChineseWriter11.

Satake: I see! So natural language processing is used in familiar places! And as for extracting bilingual words from Japanese-Korean bilingual texts, take this example:
Japanese: "I like anime" / Korean: "저는 애니메이션을 좋아합니다"
→ Automatic extraction of bilingual words: "I | 저 / (topic particle) | 는 / anime | 애니메이션 / (particle) | 을 / like | 좋아 / (polite ending) | 합니다". Is that what happens?

Kawakami: Yes, that's right. High correspondence accuracy means that, for a given word, the Japanese and Korean counterparts are correctly matched to each other at a high rate, as in the example above.
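
To make the idea of statistical extraction concrete, here is a minimal sketch of one simple approach: scoring candidate word pairs by how often they appear together in translated sentence pairs (the Dice coefficient). It is an illustration only, not the method Kodensha used, which the interview does not describe. The toy corpus, the romanized Japanese tokens, and the names `dice_scores` and `best_match` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Toy bilingual corpus, invented for illustration. Each entry pairs a
# pre-segmented Japanese sentence (romanized) with its Korean translation.
CORPUS = [
    (["watashi", "wa", "anime", "ga", "suki", "desu"],
     ["저", "는", "애니메이션", "을", "좋아", "합니다"]),
    (["watashi", "wa", "eiga", "ga", "suki", "desu"],
     ["저", "는", "영화", "를", "좋아", "합니다"]),
    (["anime", "wa", "omoshiroi", "desu"],
     ["애니메이션", "은", "재미있", "습니다"]),
]


def dice_scores(corpus):
    """Return {(ja_word, ko_word): Dice coefficient} from sentence-level co-occurrence."""
    ja_count, ko_count, pair_count = Counter(), Counter(), Counter()
    for ja_tokens, ko_tokens in corpus:
        # Count each word at most once per sentence pair.
        ja_set, ko_set = set(ja_tokens), set(ko_tokens)
        ja_count.update(ja_set)
        ko_count.update(ko_set)
        pair_count.update(product(ja_set, ko_set))
    return {
        (ja, ko): 2 * both / (ja_count[ja] + ko_count[ko])
        for (ja, ko), both in pair_count.items()
    }


def best_match(scores, ja_word):
    """Return the Korean word with the highest score for ja_word.

    Raises ValueError if ja_word never appears in the corpus.
    """
    candidates = {ko: s for (ja, ko), s in scores.items() if ja == ja_word}
    return max(candidates, key=candidates.get)


scores = dice_scores(CORPUS)
print(best_match(scores, "anime"))  # 애니메이션
```

With only three sentence pairs, most words tie with several candidates; "anime" separates cleanly because it appears in two sentences with different neighbors. A corpus of millions of pairs gives this kind of statistic far more to work with.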

Satake: Ten million translation pairs is impressive in itself, and being able to extract words automatically from that many to create a Japanese-Korean dictionary sounds very useful for organizing the data.

Kawakami: This is a bit off topic, but loanwords are read quite differently in Japanese and Korean. Taking the example above, the English word "animation" becomes
"anime" in Japanese and "애니메이션" in Korean.

Satake: That's right! The words I was most surprised not to be understood with in Korea were "McDonald's" and "Burger King"! In Korean, they are "맥도날드" and "버거킹". By the way, a hamburger is "햄버거" (henbogo).

Written down, they may not look that different, but when they come up in actual conversation, you really can't make yourself understood at all (cry)! It is interesting to see how different the pronunciations are, even for loanwords!

By the way, you mentioned the word "corpus" earlier. I hear it often, but what does it mean?

What is a corpus? Explaining a term you may have heard often

Kawakami: A corpus is a language resource: a large collection of written and spoken language organized as a database.
In this case, the two languages are Japanese and Korean, so this kind of corpus is called a bilingual corpus. In other words, a bilingual corpus contains sentences and phrases from different languages stored as translation pairs.

Satake: A bilingual corpus is a corpus constructed for use as training data in natural language processing such as machine translation, isn't it?

Kawakami: Yes. In this case, we focused on the words that appear in the bilingual corpus and used them to extract Japanese-Korean bilingual words.
In addition, bilingual corpora are used in various fields such as natural language processing, language education, and artificial intelligence (AI), and demand for them is growing every year. They are especially important in neural machine translation, which automatically learns the translation process from large bilingual corpora, and in statistical machine translation!

Satake: It really is being applied in a variety of fields! So, by training on bilingual corpora, it is possible to build systems and improve translation accuracy?

Kawakami: That's right. By the way, do you know the word "morphological analysis," which is an important word in the study of natural language processing?

What is morphological analysis, a major topic in the field of natural language processing?

Satake: Morphological analysis... That is a term I have never heard before! Please explain it to me!

Kawakami: "Morphological analysis" is a technique for breaking sentences down into the smallest meaningful units (morphemes) and attaching a part-of-speech tag to each one. Breaking a sentence or phrase into morphemes helps in analyzing its grammar and meaning.
For example, the Japanese sentence for "I exercise in the park" is broken down as follows: "I (pronoun) / wa (particle) / park (noun) / de (particle) / exercise (noun) / shi (verb) / masu (auxiliary verb)".
To use an analogy: if a sentence is the road to a destination, morphological analysis is like dividing that road at each district it passes through and labeling each stretch with the district (town) name.
Doesn't this technique sound like something you learned somewhere before?

Satake: I wonder where... Ah! I see, it's the same as the part-of-speech breakdown I learned in Japanese class long ago!
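
As a rough illustration of the morphological analysis described above, the following sketch splits a sentence by greedy longest match against a tiny hand-made lexicon and tags each piece. Real analyzers rely on large dictionaries and statistical models, so treat this as a toy. The Japanese sentence is an assumed back-translation of "I exercise in the park," and `LEXICON` and `analyze` are hypothetical names.

```python
# Tiny hand-made lexicon: surface form -> part of speech (illustrative only).
LEXICON = {
    "私": "pronoun",
    "は": "particle",
    "公園": "noun",
    "で": "particle",
    "運動": "noun",
    "し": "verb",
    "ます": "auxiliary verb",
}
MAX_LEN = max(len(word) for word in LEXICON)


def analyze(sentence):
    """Split sentence into (morpheme, part_of_speech) pairs by longest match."""
    result = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in LEXICON:
                result.append((piece, LEXICON[piece]))
                i += length
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            result.append((sentence[i], "unknown"))
            i += 1
    return result


for morpheme, pos in analyze("私は公園で運動します"):
    print(morpheme, pos)
```

The output lists seven morphemes, matching the breakdown in the interview: pronoun, particle, noun, particle, noun, verb, auxiliary verb.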

Kawakami: Actually, morphological analysis is used in various tools that we often use.
For example, when you search for "tourist attractions in Osaka" on an Internet search engine, morphological analysis first splits the query into "Osaka / 's / tourist attractions", and the search is then run on those words.
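
The search example can be sketched as follows, assuming the query has already been split and tagged. The sketch drops particles and looks up the remaining words in a small inverted index. Dropping particles is an assumption made here, not something the interview states, and the document collection and the names `build_index` and `search` are invented for illustration.

```python
# Hypothetical mini document collection: document id -> words it contains.
DOCUMENTS = {
    1: ["Osaka", "castle", "tourist attractions", "guide"],
    2: ["Kyoto", "tourist attractions", "map"],
    3: ["Osaka", "food", "guide"],
}


def build_index(documents):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = {}
    for doc_id, words in documents.items():
        for word in words:
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, tagged_query):
    """Return ids of documents that contain every non-particle query word."""
    keywords = [word for word, pos in tagged_query if pos != "particle"]
    if not keywords:
        return set()
    hits = [index.get(word, set()) for word in keywords]
    return set.intersection(*hits)


index = build_index(DOCUMENTS)
query = [("Osaka", "noun"), ("'s", "particle"), ("tourist attractions", "noun")]
print(search(index, query))  # {1}
```

Only document 1 contains both "Osaka" and "tourist attractions", so it is the single hit.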

Satake: I didn't realize that morphological analysis is also used in the search engine I usually use ... . I was surprised.

Kawakami: Yes, it is. Morphological analysis is also used in machine translation and artificial intelligence (AI).

Satake: I see that morphological analysis is used in all sorts of places. When I first heard the term "morphological analysis" on its own, it sounded difficult and complicated, but after hearing some examples of how it is applied, it is starting to feel familiar!

We restore the word spacing (segmentation) information unique to Korean!

Satake: What did you do in the morphological analysis of this corpus?

Kawakami: During morphological analysis, the word-spacing (segmentation) information characteristic of Korean is lost, so we added a process to restore it.

Satake: The "segmented writing" (word spacing) mentioned here refers to a writing convention that makes sentences easier to read by inserting spaces between words and phrases as needed.
어제 친구와 밥을 먹었습니다. / I had dinner with a friend yesterday.
  ↑ Spaces are inserted like this.
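
One way such a restoration step could work is sketched below: record where the spaces fell in the original sentence, then regroup the analyzer's morphemes along those boundaries. This is not Kodensha's actual process, which the interview does not detail. It assumes the morphemes, joined together, reproduce the original text exactly (real Korean analyzers may change surface forms), and `group_by_spacing` is a hypothetical name.

```python
def group_by_spacing(original, morphemes):
    """Regroup morphemes into the space-delimited chunks of the original text."""
    chunks = original.split()
    if "".join(morphemes) != "".join(chunks):
        raise ValueError("morphemes do not reproduce the original text")

    # Character offsets (ignoring spaces) at which each original chunk starts.
    starts = set()
    offset = 0
    for chunk in chunks:
        starts.add(offset)
        offset += len(chunk)

    groups = []
    offset = 0
    for morpheme in morphemes:
        if offset in starts:
            groups.append([])  # a space preceded this morpheme in the original
        groups[-1].append(morpheme)
        offset += len(morpheme)
    return groups


original = "어제 친구와 밥을 먹었습니다."
# Assumed analyzer output for the sentence above (illustrative segmentation).
morphemes = ["어제", "친구", "와", "밥", "을", "먹", "었", "습니다", "."]

groups = group_by_spacing(original, morphemes)
print(groups)
print(" ".join("".join(group) for group in groups))
```

The first line printed shows the morphemes grouped into four chunks, and the second reproduces the original sentence with its spaces restored.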

Satake: Was this your first effort to create a dictionary by machine processing like this?

Kawakami: Actually, we have done Chinese-Japanese dictionary creation work before, and that technology is the foundation for this project. As an example, we were involved in the following work. For more details, please see this report.
(Reference) Research project on dictionary creation in 2015
A survey on the development of dictionaries for machine translation of Chinese patent documents and the quality evaluation of machine translation

Satake: So your experience at that time led to this development! The high accuracy of the correspondence between the two languages also gives us high expectations for further development in the future!

Kawakami: In the creation of our bilingual corpus, our two strengths, natural language processing technology and human translation, are organically combined.

What did you think?
In this article, we interviewed the developer of a technology that statistically processes and constructs dictionaries from a vast amount of data.

Natural language processing? Corpus? Morphological analysis? I was not familiar with any of the technical terms, but I was intrigued by the explanation of the terms and the application examples!

It was a great learning experience to be reminded that the services we use casually in our daily lives actually combine various technologies. At the same time, it made me reflect on how much I still have to learn as an employee of a software development company...

I hope that I can continue to learn as I share development information with you through this development blog.

The bilingual corpus introduced here can be used for various purposes, including applications, systems, and research and development.
If you have any questions about bilingual corpora or natural language processing, please do not hesitate to contact us.

Please look forward to our next "Development Office Blog"!
We welcome your comments and suggestions.