ETRI Knowledge Sharing Platform

Details

Journal Article: Sentence-Chain Based Seq2seq Model for Corpus Expansion
Cited 13 times in Scopus · Downloaded 103 times
Authors
정의석, 박전규
Publication Date
August 2017
Source
ETRI Journal, vol. 39, no. 4, pp. 455-466
ISSN
1225-6463
Publisher
Electronics and Telecommunications Research Institute (ETRI)
DOI
https://dx.doi.org/10.4218/etrij.17.0116.0074
Research Project
16MS1700, Development of Core Technology for Spontaneous-Speech Spoken Dialogue Processing for Language Learning, 이윤근
Abstract
This study focuses on a method for sequential data augmentation that alleviates data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation tasks: it can generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples: the first two sentences in a triple are fed to the encoder, while the last sentence becomes the target sequence for the decoder. Using only internal resources, evaluation results show a relative perplexity improvement of approximately 7.6% over a baseline language model of Korean text. Additionally, compared with a previous study, the sentence-chain approach reduces the size of the training data by 38.4% while generating 1.4 times as many n-grams, with superior performance on English text.
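The following is a minimal sketch of the training-pair construction described in the abstract: each sentence chain is a triple whose first two sentences form the encoder input and whose third sentence is the decoder target. The chain-extraction step itself is not specified here, so a sliding window over consecutive sentences stands in as a placeholder, and all function and variable names are illustrative, not taken from the paper.

```python
from typing import Iterable, List, Tuple


def make_seq2seq_pairs(chains: Iterable[List[str]]) -> List[Tuple[str, str]]:
    """Turn sentence-chain triples into seq2seq training pairs.

    Per the abstract: the first two sentences of each triple form the
    encoder input; the last sentence is the decoder target.
    """
    pairs = []
    for chain in chains:
        if len(chain) != 3:
            continue  # the model is trained on triples only
        s1, s2, s3 = chain
        encoder_input = s1 + " " + s2  # concatenate the first two sentences
        decoder_target = s3            # the third sentence is the target
        pairs.append((encoder_input, decoder_target))
    return pairs


def consecutive_triples(sentences: List[str]) -> List[List[str]]:
    """Placeholder chain extraction: sliding windows of three consecutive
    sentences. This is an assumption for illustration; the paper builds
    chains of related sentences rather than simply adjacent ones."""
    return [sentences[i:i + 3] for i in range(len(sentences) - 2)]


# Example usage (hypothetical corpus):
# pairs = make_seq2seq_pairs(consecutive_triples(corpus_sentences))
```

In the corpus-expansion setting, a seq2seq model trained on such pairs decodes a new sentence from each encoder input, and the generated sentences are added to the language-model training corpus, whose coverage is then evaluated by perplexity against the baseline.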
KSP Suggested Keywords
Data Augmentation, Data sparseness, Language generation, Language model, Recurrent Neural Network(RNN), Sequential data, chain based, superior performance, training data
This work is available under the Korea Open Government License (KOGL) Type 4: Source Attribution + No Commercial Use + No Modification.