170998 20260430151736.0 doi 10.1177/17562872261436905 sideral 148950 ART-2026-148950 eng Borque-Fernando, Ángel Universidad de Zaragoza (orcid)0000-0003-0178-4567 Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation 2026 Abstract Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain. Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers. Design: This expert-validated study evaluated six configurations of three top LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer. Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations. Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder (56.7% optimal, 93% optimal/acceptable), penile (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable).
Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously due to recency bias and the rapid evolution of LLM technology. Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology. Access copy available to the general public Unrestricted info:eu-repo/semantics/openAccess by https://creativecommons.org/licenses/by/4.0/deed.es info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Navarro, Denis Universidad de Zaragoza (orcid)0000-0002-0795-8743 Doblare, Manuel Universidad de Zaragoza (orcid)0000-0001-8741-6452 Esteban, Luis M. (orcid)0000-0002-3007-302X Perez-Fentes, Daniel Álvarez-Maestro, Mario Medina-López, Rafael A. Rodríguez Faba, Oscar Rubio-Briones, José Fernández-Pello, Sergio Fernández-Gómez, Jesús María Fernández Aparicio, Tomás Guerrero Ramos, Félix Izquierdo, Laura Álvarez-Ossorio Fernández, José Luis 5008 785 Universidad de Zaragoza Dpto. Ingeniería Electrón.Com. Área Tecnología Electrónica 5004 605 Universidad de Zaragoza Dpto. Ingeniería Mecánica Área Mec.Med.Cont. y Teor.Est. 1013 817 Universidad de Zaragoza Dpto. Cirugía Área Urología 18 (2026) Therapeutic Advances in Urology 1756-2872 3603222 http://zaguan.unizar.es/record/170998/files/texto_completo.pdf Published version 2729312 http://zaguan.unizar.es/record/170998/files/texto_completo.jpg?subformat=icon icon Published version oai:zaguan.unizar.es:170998 articulos driver 2026-04-30-13:58:43 ARTICLE