SLEEM, Lama ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
FRANCOIS, Jérôme ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
LI, Lujun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
Foucher, Nathan ; Institut National Polytechnique de Toulouse, Toulouse, France
GENTILE, Niccolo ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences > Department of Behavioural and Cognitive Sciences > Team Conchita D AMBROSIO ; Foyer S.A., Leudelange, Luxembourg
STATE, Radu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
External co-authors :
yes
Language :
English
Title :
NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili et al., "A survey on large language models: Applications, challenges, limitations, and practical usage," Authorea Preprints, vol. 3, 2023.
OpenAI, "GPT-4V(ision) System Card," https://cdn.openai.com/papers, 2023, accessed: 2024-05-06.
E. Kasneci, K. Sesler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., "Chatgpt for good? on opportunities and challenges of large language models for education," Learning and individual differences, vol. 103, p. 102274, 2023.
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, vol. 1, no. 2, 2023.
B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, vol. 1, p. 3, 2020.
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
F. Perez and I. Ribeiro, "Ignore previous prompt: Attack techniques for language models," arXiv preprint arXiv:2211.09527, 2022.
P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, "Jailbreaking black box large language models in twenty queries," arXiv preprint arXiv:2310.08419, 2023.
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, "Universal and transferable adversarial attacks on aligned language models," arXiv preprint arXiv:2307.15043, 2023.
A. Robey, E. Wong, H. Hassani, and G. J. Pappas, "Smoothllm: Defending large language models against jailbreaking attacks," arXiv preprint arXiv:2310.03684, 2023.
X. Zhang, C. Zhang, T. Li, Y. Huang, X. Jia, X. Xie, Y. Liu, and C. Shen, "A mutation-based method for multi-modal jailbreaking attack detection," CoRR, 2023.
T. Rebedea, L. Derczynski, S. Ghosh, M. N. Sreedhar, F. Brahman, L. Jiang, B. Li, Y. Tsvetkov, C. Parisien, and Y. Choi, "Guardrails and security for LLMs: Safe, secure and controllable steering of LLM applications," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts), Y. Arase, D. Jurgens, and F. Xia, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13-15. [Online]. Available: https://aclanthology.org/2025.acl-tutorials.8/
S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, "Efficient formal safety analysis of neural networks," Advances in neural information processing systems, vol. 31, 2018.
J. Ji, B. Hou, A. Robey, G. J. Pappas, H. Hassani, Y. Zhang, E. Wong, and S. Chang, "Defending large language models against jailbreak attacks via semantic smoothing," arXiv preprint arXiv:2402.16192, 2024.
M. Pisano, P. Ly, A. Sanders, B. Yao, D. Wang, T. Strzalkowski, and M. Si, "Bergeron: Combating adversarial attacks through a conscience-based alignment framework," arXiv preprint arXiv:2312.00029, 2023.
M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau, "Llm self defense: By self examination, llms know they are being tricked," arXiv preprint arXiv:2308.07308, 2023.
H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine et al., "Llama guard: Llm-based input-output safeguard for human-ai conversations," arXiv preprint arXiv:2312.06674, 2023.
G. Alon and M. Kamfonas, "Detecting language model attacks with perplexity," arXiv preprint arXiv:2308.14132, 2023.
J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, "On detecting adversarial perturbations," arXiv preprint arXiv:1702.04267, 2017.
Y. Liu, G. Shen, G. Tao, Z. Wang, S. Ma, and X. Zhang, "Complex backdoor detection by symmetric feature differencing," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15003-15013.
L. Li, L. Sleem, N. Gentile, G. Nichil, and R. State, "Exploring the impact of temperature on large language models: hot or cold?" 2025. [Online]. Available: https://arxiv.org/abs/2506.07295
M. Anschütz, D. M. Lozano, and G. Groh, "This is not correct! negation-aware evaluation of language generation systems," arXiv preprint arXiv:2307.13989, 2023.
P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer et al., "Jailbreakbench: An open robustness benchmark for jailbreaking large language models," arXiv preprint arXiv:2404.01318, 2024.
M. Andriushchenko, F. Croce, and N. Flammarion, "Jailbreaking leading safety-aligned llms with simple adaptive attacks," arXiv preprint arXiv:2404.02151, 2024.
W. Luo, S. Ma, X. Liu, X. Guo, and C. Xiao, "Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks," arXiv preprint arXiv:2404.03027, 2024.
G. Borbély and A. Kornai, "Sentence length," in Proceedings of the 16th Meeting on the Mathematics of Language, P. de Groote, F. Drewes, and G. Penn, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2019, pp. 114-125. [Online]. Available: https://aclanthology.org/W19-5710/
B. Sigurd, M. Eeg-Olofsson, and J. van de Weijer, "Word length, sentence length and frequency-zipf revisited," Studia Linguistica, vol. 58, pp. 37-52, 04 2004.
M. Cutts, Oxford guide to plain English. Oxford university press, 2020.
F. T. Liu, K. Ting, and Z.-H. Zhou, "Isolation forest," 01 2009, pp. 413-422.