ETRI-Knowledge Sharing Platform


Multimodal Alzheimer’s disease recognition from image, text and audio
Authors
Byounghwa Lee, Hwa Jeon Song, Young-Jin Park, Byung Ok Kang
Issue Date
2025-08
Citation
Scientific Reports, v.15, pp.1-14
ISSN
2045-2322
Publisher
Springer Nature
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1038/s41598-025-14998-7
Abstract
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that significantly affects cognitive function. One widely used diagnostic approach involves analyzing patients’ verbal descriptions of pictures. While prior studies have primarily focused on speech- and text-based models, the integration of visual context is still at an early stage. This study proposes a novel multimodal AD prediction model that integrates image, text, and audio modalities. The image and text modalities are processed using a vision-language model and structured as a bipartite graph before fusion, while all three modalities are integrated through a combination of co-attention-based intermediate fusion and late fusion, enabling effective inter-modality cooperation. The proposed model achieves an accuracy of 90.61%, outperforming state-of-the-art models. Furthermore, an ablation study quantifies the contribution of each modality using Shapley values, which serve as the foundation for a novel auxiliary loss function that adaptively adjusts modality importance during training. The findings indicate that integrating image, text, and audio modalities via a co-attention-based intermediate fusion enhances AD classification performance. Additionally, this study analyzes modality-specific attention patterns and key linguistic tokens, demonstrating that audio and text provide complementary cues for AD classification.
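As a rough illustration of how Shapley values can attribute a model's performance to individual modalities, the sketch below computes exact Shapley values for the three modalities named in the abstract. It is not the authors' implementation: the value function `toy_value`, the per-modality gains in `TOY_GAIN`, and the synergy term `PAIR_BONUS` are made-up placeholders, not results from the paper. In practice the value function would be a measured score (for example, validation accuracy) of a model restricted to a given subset of modalities.

```python
from itertools import permutations
from math import factorial

MODALITIES = ("image", "text", "audio")


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical numbers for illustration only (NOT taken from the paper).
TOY_GAIN = {"image": 0.05, "text": 0.20, "audio": 0.15}
PAIR_BONUS = 0.06  # assumed extra gain when text and audio are both present


def toy_value(coalition):
    """Toy stand-in for 'score of a model using only these modalities'."""
    score = sum(TOY_GAIN[m] for m in coalition)
    if {"text", "audio"} <= coalition:
        score += PAIR_BONUS
    return score


phi = shapley_values(MODALITIES, toy_value)
for name in MODALITIES:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy setup the synergy bonus is split evenly between text and audio, and the three values sum to the score of the full three-modality coalition, which is the efficiency property that makes Shapley values convenient as per-modality importance weights.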
KSP Keywords
AD classification, Bipartite graph, Classification Performance, Cognitive function, Disease recognition, Inter-modality, Language Model, Proposed model, Shapley values, Visual Context, late fusion
This work is distributed under the terms of a Creative Commons License (CC BY-NC-ND).