ETRI Knowledge Sharing Platform



Detailed Information

Journal Article: Sentence-Chain Based Seq2seq Model for Corpus Expansion
Cited 10 times in Scopus
Authors
정의석, 박전규
Publication Date
August 2017
Source
ETRI Journal, v.39, no.4, pp.455-466
ISSN
1225-6463
Publisher
Electronics and Telecommunications Research Institute (ETRI)
DOI
https://dx.doi.org/10.4218/etrij.17.0116.0074
Funded Project
16MS1700, Development of Core Technology for Spontaneous-Speech Spoken Dialogue Processing for Language Learning, 이윤근
Abstract
This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation tasks; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples: the first two sentences in a triple are fed to the encoder of the seq2seq model, while the last sentence becomes the target sequence for the decoder. Using only internal resources, evaluation results show an approximately 7.6% relative perplexity improvement over a baseline language model of Korean text. Additionally, in comparison with a previous study, the sentence-chain approach reduces the size of the training data by 38.4% while generating 1.4 times the number of n-grams, with superior performance on English text.
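The triple construction described above can be sketched in a few lines. Note this is a simplified, hypothetical illustration: the paper builds sentence chains from lexical chains, whereas the sketch below simply slides a window over consecutive sentences as a stand-in. The function name `make_chain_triples` is invented for this example.

```python
def make_chain_triples(sentences):
    """Form (encoder_input, decoder_target) training pairs from sentence
    triples: the first two sentences of each triple are concatenated as the
    encoder input, and the third sentence is the decoder's target sequence.

    Simplification: here a 'chain' is just three consecutive sentences; the
    paper instead selects chain members via lexical-chain analysis.
    """
    pairs = []
    for i in range(len(sentences) - 2):
        first, second, third = sentences[i], sentences[i + 1], sentences[i + 2]
        pairs.append((first + " " + second, third))
    return pairs

# Toy document of four sentences -> two overlapping triples.
doc = ["the cat sat", "it purred softly", "then it slept", "we watched"]
pairs = make_chain_triples(doc)
# pairs[0] == ("the cat sat it purred softly", "then it slept")
```

Each generated pair is then a standard seq2seq training example; sentences decoded from unseen encoder inputs form the expanded corpus used to retrain the language model.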
Keywords
Corpus expansion, Lexical chain, Sentence chain, Seq2seq model
KSP Suggested Keywords
Data Augmentation, Data sparseness, Language generation, Language model, Lexical Chain, Recurrent Neural Network(RNN), Sequential data, chain based, superior performance, training data
This work is available under the Korea Open Government License (KOGL) Type 4: attribution, non-commercial use only, no derivative works.