ETRI Knowledge Sharing Platform

Learning to Embed Multi-Modal Contexts for Situated Conversational Agents
Cited 8 times in Scopus; downloaded 177 times
Authors
Haeju Lee, Oh Joon Kwon, Yunseon Choi, Minho Park, Ran Han, Yoonhyung Kim, Jinhyeon Kim, Youngjune Lee, Haebin Shin, Kangwook Lee, Kee-Eung Kim
Issue Date
2022-07
Citation
Findings of the Association for Computational Linguistics: NAACL 2022, pp.813-830
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.18653/v1/2022.findings-naacl.61
Abstract
The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 challenge aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. This approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks using a single unified model at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.
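The abstract describes casting all four SIMMC 2.0 subtasks as a single joint sequence-to-sequence problem. Below is a minimal sketch of that framing, assuming a BART-style model from HuggingFace Transformers and a hypothetical task-prefix output format; the model choice, separator tokens, and object-descriptor encoding are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch: one encoder-decoder handles disambiguation, coreference,
# dialog state tracking, and response generation in a single pass.
# The prompt/output format below is an assumption for illustration.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical input encoding: dialog history plus flattened descriptors
# of objects visible in the scene (standing in for visual inputs).
dialog_history = "User: Do you have that jacket in a smaller size?"
scene_objects = "<OBJ_12> black jacket price 49.99 <OBJ_31> red coat price 89.99"
source = f"{dialog_history} <SEP> {scene_objects}"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=128)

# Before fine-tuning the output is meaningless; after joint training it
# would contain structured spans for all four subtasks, e.g.:
#   "disambiguate: no <SEP> coref: <OBJ_12> <SEP> belief: REQUEST:GET ...
#    <SEP> response: Yes, we have it in size S."
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```

Training a single model on such concatenated targets is one way a unified encoder-decoder can share parameters across subtasks instead of maintaining separate task-specific models.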
KSP Keywords
Conversational Agents, Dialog systems, Multi-modal, Response retrieval, Task-oriented, Unified model, coreference resolution, dialog state tracking
This work is distributed under the terms of the Creative Commons License (CC BY).