ETRI Knowledge Sharing Platform : GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection

Type: Conference Paper

Title: GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection

Citation: International Conference on Learning Representations (ICLR) 2026, pp.1-27

Abstract: Large language models (LLMs) are increasingly deployed in real-world applications but remain highly vulnerable to jailbreak prompts that bypass safety guardrails and elicit harmful outputs. We propose GraphShield, a graph-theoretic jailbreak detector that models information routing inside the LLM as token-layer graphs. Unlike prior defenses that rely on surface cues or costly gradient signals, GraphShield captures network-level dynamics in a lightweight and model-agnostic way by extracting multi-scale structural and semantic features that reveal jailbreak signatures. Extensive experiments on LLaMA-2-7B-Chat and Vicuna-7B-v1.5 show that GraphShield reduces attack success rates to 1.9% and 7.8%, respectively, while keeping refusal rates on benign prompts at 7.1% and 6.8%, significantly improving the robustness-utility trade-off compared to strong baselines. These results demonstrate that graph-theoretic modeling of network-level dynamics provides a principled and effective framework for robust jailbreak detection in LLMs.

Keywords: Jailbreak Detection, Graph-Based Features, Large Language Models (LLMs), Safety and Robustness in LLMs

KSP Keywords: Graph-based features, Graph-theoretic, Multi-scale, Network-level, Real-world applications, Safety and robustness, Success rate, Trade-off, Language models, Semantic features

This work is distributed under the terms of a Creative Commons License (CCL): CC BY.
