Neural machine translation system and method based on sub-word token units with added explicit word-alignment information
- 10635753 (2020.04.28)
17HS1700, Development of core technology for knowledge-augmented real-time simultaneous interpretation,
- The present invention provides a method of generating training data to which explicit word-alignment information is added without impairing sub-word tokens, and a neural machine translation method and apparatus that include this method. The method of generating training data includes the steps of: (1) separating basic word boundaries through morphological analysis or named entity recognition of a sentence of a bilingual corpus used for learning; (2) extracting explicit word-alignment information from the sentence of the bilingual corpus used for learning; (3) further dividing the words separated in step (1) into sub-word tokens; (4) generating new source language training data by using the output of step (1) and the output of step (3); and (5) generating new target language training data by using the explicit word-alignment information generated in step (2) and the target language outputs of steps (1) and (3).
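The five steps above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: whitespace splitting stands in for morphological analysis or named entity recognition in step (1), the word alignment of step (2) is passed in as `(source_index, target_index)` pairs as if produced by an external aligner, fixed-size chunking stands in for a learned sub-word model in step (3), and the `@@` continuation marker and `|index` tag are a hypothetical output format. The function names are likewise hypothetical.

```python
from typing import Dict, List, Tuple

CONT = "@@"  # continuation marker (assumed format, not from the patent)


def split_words(sentence: str) -> List[str]:
    """Step (1) stand-in: whitespace split instead of morphological analysis or NER."""
    return sentence.split()


def split_subwords(word: str, size: int = 3) -> List[str]:
    """Step (3) stand-in: fixed-size chunks instead of a learned sub-word model."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [p + CONT for p in pieces[:-1]] + pieces[-1:]


def tag_tokens(words: List[str], tags: Dict[int, int]) -> List[str]:
    """Split each word into sub-words and attach a tag to its first sub-word only,
    so the sub-word pieces themselves are left unchanged."""
    out: List[str] = []
    for idx, word in enumerate(words):
        pieces = split_subwords(word)
        if idx in tags:
            pieces[0] = f"{pieces[0]}|{tags[idx]}"
        out.extend(pieces)
    return out


def generate_training_pair(
    src_sentence: str,
    tgt_sentence: str,
    alignment: List[Tuple[int, int]],  # step (2): (source word index, target word index)
) -> Tuple[List[str], List[str]]:
    src_words = split_words(src_sentence)
    tgt_words = split_words(tgt_sentence)
    # Step (4): source side carries each word's own position.
    src_tags = {i: i for i in range(len(src_words))}
    # Step (5): target side carries the index of the aligned source word.
    # If a target word has several alignments, this toy version keeps the last one.
    tgt_tags = {t: s for s, t in alignment}
    return tag_tokens(src_words, src_tags), tag_tokens(tgt_words, tgt_tags)


src, tgt = generate_training_pair(
    "I love translation",
    "Uebersetzung liebe ich",
    [(0, 2), (1, 1), (2, 0)],
)
print(src)  # ['I|0', 'lov@@|1', 'e', 'tra@@|2', 'nsl@@', 'ati@@', 'on']
print(tgt)  # ['Ueb@@|2', 'ers@@', 'etz@@', 'ung', 'lie@@|1', 'be', 'ich|0']
```

In this sketch the alignment tag is attached only to the first sub-word of each word, which is one possible way to carry word-level alignment alongside sub-word tokens; the encoding actually claimed in the patent may differ.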