ETRI Knowledge Sharing Platform

Single-anchored Multi-modal Dense Video Captioning for Esports Broadcasts Commentaries
Authors
Ari Yu, Jinwoo Hyun, Hyeong-Gyu Jang, Sung-Yun Park, Sang-Kwang Lee
Issue Date
2025-10
Citation
International ACM Workshop on Multimedia Content Analysis in Sports (MMSports) 2025, pp.31-38
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1145/3728423.3759412
Abstract
The popularity and industrial scale of esports broadcasting are expanding rapidly. However, research on automated commentary generation tailored to esports remains in its early stages compared to traditional sports such as soccer and baseball. Processing esports videos directly with existing dense video captioning models is challenging due to the integration of real-time scoreboards and complex game mechanics. To address this challenge, we present a model optimized for esports commentary generation. We constructed and analyzed the esports broadcast commentaries dataset, which comprises 703 matches and 25,262 timestamped commentaries from the 2022 and 2023 League of Legends Champions Korea regular seasons. Based on this analysis, we propose a two-stage framework, termed the Single-anchored Multi-modal Dense Video Captioning model. In the first stage, a spotting sub-model detects game events by processing scoreboard time-series data using optical character recognition at 1 frame per second. Through rule-based noise correction, this stage achieves high temporal precision, attaining an F1-score of 99.7%. In the second stage, a captioning sub-model pools visual features centered on the detected anchor and generates fluent, contextually relevant commentary using an LSTM-based decoder. Experimental results demonstrate that the proposed model outperforms existing esports captioning baselines. Furthermore, qualitative evaluations indicate that the model effectively captures dynamic in-game scenes, producing commentary closely aligned with that of professional broadcasts. Therefore, our model holds significant practical potential as an automated commentary solution that can be readily deployed in amateur matches without human commentators.
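The first stage described above reads the on-screen scoreboard with OCR at 1 fps and applies rule-based noise correction before spotting events. As an illustration only, the sketch below shows one plausible such rule: majority-vote smoothing of a per-second kill-count series to suppress single-frame OCR misreads, with event anchors placed wherever the corrected count increases. The function names, the smoothing rule, and the data are all hypothetical; the paper's actual correction rules are not specified in this abstract.

```python
from collections import Counter

def correct_ocr_noise(readings, window=3):
    """Majority-vote smoothing over a sliding window to suppress
    single-frame OCR misreads in a 1 fps scoreboard time series.
    (Hypothetical rule; stands in for the paper's rule-based correction.)"""
    smoothed = []
    for i in range(len(readings)):
        lo = max(0, i - window // 2)
        hi = min(len(readings), i + window // 2 + 1)
        smoothed.append(Counter(readings[lo:hi]).most_common(1)[0][0])
    return smoothed

def spot_events(kill_counts):
    """Anchor an event at each second where the corrected count rises."""
    return [t for t in range(1, len(kill_counts))
            if kill_counts[t] > kill_counts[t - 1]]

# Example: a misread spike at t=5 (7 instead of 2) is smoothed away,
# leaving genuine increments at t=2 and t=8 as event anchors.
raw = [1, 1, 2, 2, 2, 7, 2, 2, 3, 3, 3]
clean = correct_ocr_noise(raw)    # [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]
anchors = spot_events(clean)      # [2, 8]
```

In the second stage, each detected anchor would serve as the center for pooling visual features that the LSTM-based decoder conditions on when generating the commentary sentence.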
KSP Keywords
Early stages, F1-score, First stage, Industrial scale, League of Legends, Multi-modal, Noise correction, Optical Character Recognition, Proposed model, Real-time, Rule-based
This work is distributed under the terms of the Creative Commons License (CCL): CC BY-NC-ND.