NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks

SLEEM, Lama; FRANCOIS, Jérôme; LI, Lujun; Foucher, Nathan; GENTILE, Niccolo; STATE, Radu

doi:10.1109/ccnc65079.2026.11366297

Article (Scientific journals)

2026 • In CCNC, p. 1-7

Peer reviewed

Permalink
https://hdl.handle.net/10993/68158

Files

Full Text

2511.11784v2 (1).pdf

Author postprint (652.47 kB)

Details

Disciplines :

Computer science

Author, co-author :

SLEEM, Lama ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

FRANCOIS, Jérôme ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

LI, Lujun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

Foucher, Nathan ; Institut National Polytechnique de Toulouse, Toulouse, France

GENTILE, Niccolo ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences > Department of Behavioural and Cognitive Sciences > Team Conchita D AMBROSIO ; Foyer S.A., Leudelange, Luxembourg

STATE, Radu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

External co-authors :

yes

Language :

English

Title :

NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks

Publication date :

09 January 2026

Journal title :

CCNC

Pages :

1-7

Peer reviewed :

Peer reviewed

Additional URL :

http://xplorestaging.ieee.org/ielx8/11366253/11366254/11366297.pdf?arnumber=11366297

Available on ORBilu :

since 05 April 2026
