Evaluating large language models effectiveness for flow-based intrusion detection: a comparative study with ML and DL baselines

Mehavilla, Lorena; García, José; Alesanco, Álvaro; Rodríguez, María
doi:10.1007/s10462-025-11432-2
000168101 001__ 168101
000168101 005__ 20260126155509.0
000168101 0247_ $$2doi$$a10.1007/s10462-025-11432-2
000168101 0248_ $$2sideral$$a147679
000168101 037__ $$aART-2026-147679
000168101 041__ $$aeng
000168101 100__ $$aMehavilla, Lorena$$uUniversidad de Zaragoza
000168101 245__ $$aEvaluating large language models effectiveness for flow-based intrusion detection: a comparative study with ML and DL baselines
000168101 260__ $$c2026
000168101 5060_ $$aAccess copy available to the general public$$fUnrestricted
000168101 5203_ $$aThis paper presents the first systematic benchmark evaluating Large Language Models (LLMs), specifically GPT-2, GPT-Neo-125M, and LLaMA-3.2-1B, as standalone classifiers for intrusion detection, covering both binary and multiclass classification tasks, using structured Zeek logs derived from the CIC IoT 2023 dataset. We compare their performance against established and widely used Machine Learning (XGBoost, Random Forest, Decision Tree) and Deep Learning models (MLP, GRU, LeNet-5) across key evaluation metrics: detection effectiveness (precision, recall, and F1-score), inference speed, and resource consumption. All models are consistently trained and rigorously evaluated on the CIC IoT 2023 dataset, ensuring fair, reproducible, and transparent comparisons. Our findings indicate that while LLMs achieve strong F1-scores exceeding 95% and do not fully utilize available GPU resources, they still do not outperform top-performing ML models. Notably, XGBoost achieves a higher F1-score of 96.96% while using only 4% of the available CPU. These results emphasize the practical trade-offs between detection capability, inference efficiency, and hardware requirements when applying LLMs in flow-based IDS contexts, particularly in resource-constrained environments such as IoT or edge deployments.
000168101 536__ $$9info:eu-repo/grantAgreement/ES/DGA/T31-20R$$9info:eu-repo/grantAgreement/ES/MCINN/PID2022-136476OB-I00
000168101 540__ $$9info:eu-repo/semantics/openAccess$$aby$$uhttps://creativecommons.org/licenses/by/4.0/deed.es
000168101 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/publishedVersion
000168101 700__ $$aRodríguez, María$$uUniversidad de Zaragoza
000168101 700__ $$0(orcid)0000-0001-9485-7678$$aGarcía, José$$uUniversidad de Zaragoza
000168101 700__ $$0(orcid)0000-0002-5254-1402$$aAlesanco, Álvaro$$uUniversidad de Zaragoza
000168101 7102_ $$15008$$2560$$aUniversidad de Zaragoza$$bDpto. Ingeniería Electrón.Com.$$cÁrea Ingeniería Telemática
000168101 773__ $$g59, 2 (2026), [38 pp.]$$pArtif. intell. rev.$$tARTIFICIAL INTELLIGENCE REVIEW$$x0269-2821
000168101 8564_ $$s2958161$$uhttps://zaguan.unizar.es/record/168101/files/texto_completo.pdf$$yPublished version
000168101 8564_ $$s1265242$$uhttps://zaguan.unizar.es/record/168101/files/texto_completo.jpg?subformat=icon$$xicon$$yPublished version
000168101 909CO $$ooai:zaguan.unizar.es:168101$$particulos$$pdriver
000168101 951__ $$a2026-01-26-14:50:32
000168101 980__ $$aARTICLE