Interview with the Person in Charge of Machine Learning and Evaluation of the "kode-AI Translation Cloud API"! (Part 2)

 Building a Field-Specific Neural Machine Translation Engine from the Ground Up

Interview with a person in charge of "kode-AI Translation Cloud API" about the evaluation of "kode-AI Translation Cloud API"!

This article is the second part, continuing from the previous installment.

Click here to read the first part of this blog.

 

 

3. Evaluation Results: What Is the BLEU Score? (Part 2)

 

Satake: So, comparing before and after training on the 10,000 sentence pairs, how much did translation accuracy improve on the separate 1,000 sentences that were not used for training?
And the benchmark for judging the quality of those 1,000 English translations is the English text originally produced by human translators, is that right?

Shibata: Yes, that is correct. By the way, Mr. Satake, do you know what the "BLEU score" I mentioned earlier is?

Satake: Do you mean a mechanically evaluated value?

Shibata: The "BLEU value" is a mechanical evaluation of the similarity between the 1,000 English sentences used for evaluation, which are not used for training, and the 10,000 sentences that were first translated by human translators.

It is a score that evaluates the similarity between the human translation and the result of automatic translation as a percentage.
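For reference, a corpus-level BLEU score like the ones discussed here can be computed with the open-source sacrebleu library. Below is a minimal sketch with made-up example sentences; it is an illustration only, not part of the kode-AI service.

```python
import sacrebleu

# Machine translations of the held-out evaluation sentences
# (two made-up examples standing in for the 1,000 used in the trial).
hypotheses = [
    "The train bound for Shinjuku will arrive at platform 2.",
    "Please stand behind the yellow line.",
]

# Human reference translations, one stream aligned with the hypotheses.
references = [
    [
        "The train for Shinjuku will arrive at track 2.",
        "Please wait behind the yellow line.",
    ]
]

# corpus_bleu returns a BLEUScore object; .score is the 0-100 value
# comparable to the figures quoted in this interview.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```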

Satake: Got it, now I really understand what the "BLEU score" is!

Shibata: The machine evaluation (BLEU score) before and after training came out as follows.
[Before] 27.80 ⇒ [After] 54.49

Satake: So the score on the 1,000 evaluation sentences nearly doubled after training on the 10,000 sentence pairs!

Shibata: That's right. And the results of the manual evaluation were as follows.
[Pre-training average] 54.0 ⇒ [Post-training average] 71.4
(*Evaluation criteria: a six-grade scale scored out of 100 points, focusing on whether the translated text is understandable)

Satake: That's roughly a 30% improvement! So the training proved effective even when judged by human evaluators.
It really shows that custom training the engine is worthwhile! What challenges did you face in this trial?

 

4. Discussion

Shibata: Well, I would say it was selecting the evaluation data in a well-balanced way.
For example, among the 10,000 sentence pairs there are many cases where the sentence pattern is identical and only the proper nouns differ slightly.
When extracting 1,000 sentences from such a pool, the experiment would be meaningless if the selected data were all similar in content.
That is why we first grouped similar Japanese sentences together and then selected a balanced sample from the groups, which was quite difficult.
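As an illustration of the kind of grouping-then-sampling Mr. Shibata just described, here is a minimal sketch of pattern-based balanced sampling. It is a simplification, not the team's actual tooling: real data would need proper nouns masked (e.g. with an NER model) rather than just digits, and all function names here are made up.

```python
import random
import re
from collections import defaultdict

def pattern_key(sentence: str) -> str:
    """Crude sentence-pattern signature: mask digits so sentences that
    differ only in figures (times, platform numbers, ...) fall into the
    same group. Real data would also need proper nouns masked."""
    return re.sub(r"\d+", "#", sentence)

def balanced_sample(sentences: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick k evaluation sentences spread across pattern groups, so
    near-duplicate sentences do not dominate the evaluation set."""
    groups = defaultdict(list)
    for s in sentences:
        groups[pattern_key(s)].append(s)

    rng = random.Random(seed)
    buckets = list(groups.values())
    rng.shuffle(buckets)

    sample: list[str] = []
    # Round-robin over the groups, taking one sentence per group per
    # pass, until k sentences have been collected.
    while len(sample) < k and any(buckets):
        for bucket in buckets:
            if bucket and len(sample) < k:
                sample.append(bucket.pop(rng.randrange(len(bucket))))
    return sample

corpus = [
    "Platform 2: the 10:05 train is delayed.",
    "Platform 3: the 11:20 train is delayed.",
    "Please mind the gap between the train and the platform.",
]
print(balanced_sample(corpus, k=2))
```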

Satake: You also had to make that selection by eye, which must have been a patient, time-consuming process.
Based on these results, do you have any ideas for future improvements?

Shibata: The "BLEU value," which is a measure of the similarity between the result translation and the reference translation, increased significantly, so I think we can conclude that the learning was very effective.
In the human evaluation, except for mistranslations of proper nouns, the results were generally comprehensible.
As for mistranslations of proper nouns, further improvement in translation quality can be expected by using the dictionary function to cover them.
In fact, when we tried registering the translation in the dictionary afterwards, we found that it reproduced nearly 100% of the translation!

Satake: So the key is to register proper nouns such as station names with the dictionary function in advance!
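The interview does not describe how the dictionary function works internally, but a common pattern for such glossary features is placeholder substitution: protect registered terms before translation and restore the fixed target terms afterwards. Here is a minimal sketch under that assumption; the station names, the translate() call, and the helper names are all hypothetical illustrations, not the kode-AI API.

```python
# Illustrative sketch of a placeholder-based glossary, not the actual
# kode-AI implementation. Station names and translate() are hypothetical.

GLOSSARY = {  # registered source term -> fixed target translation
    "高輪ゲートウェイ": "Takanawa Gateway",
    "山手線": "the Yamanote Line",
}

def protect(text: str) -> tuple[str, dict[str, str]]:
    """Swap each registered source term for a placeholder the engine is
    unlikely to alter, remembering which target term restores it."""
    restore = {}
    for i, (src, tgt) in enumerate(GLOSSARY.items()):
        placeholder = f"__TERM{i}__"
        if src in text:
            text = text.replace(src, placeholder)
            restore[placeholder] = tgt
    return text, restore

def restore_terms(translated: str, restore: dict[str, str]) -> str:
    """Replace the placeholders in the MT output with the registered
    target-language terms."""
    for placeholder, tgt in restore.items():
        translated = translated.replace(placeholder, tgt)
    return translated

masked, restore = protect("山手線は高輪ゲートウェイに停車します。")
# translated = translate(masked)  # hypothetical call to the MT engine
translated = "__TERM1__ stops at __TERM0__."  # stand-in for MT output
print(restore_terms(translated, restore))
# -> "the Yamanote Line stops at Takanawa Gateway."
```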

Shibata: Another point for improvement is that we would like to experiment with larger amounts of data in the future.
In fact, the 10,000 sentence pairs used this time are not a large amount of data by machine-learning standards.
If we run experiments with more data, I think we can expect even better training results!

Satake: The results of this experiment show that the training was very effective, so I am looking forward to future trials.
In what kinds of situations do you expect the system to be used in practice?

Shibata: I feel that the system can be used for multilingual broadcasts inside train stations, on trains, and indoors in department stores. If bilingual data has been accumulated, and if the patterns for the parts excluding proper nouns are roughly established, it could be put to practical use. Above all, it will help reduce costs and time since there is no need to manually translate from scratch.

Satake: Today we asked Mr. Shibata about the machine-learning evaluation of the neural engine, whose translation accuracy is expected to keep improving! Thank you very much for your time, Mr. Shibata!

Shibata: Thank you very much.

Satake: What kind of interesting stories will we hear from the developers in the next installment of "Physical Interviews in the Development Office"?