Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation
Abstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain.
Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers.
Design: This expert-validated study evaluated six configurations of three top LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer.
Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations.
Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder (56.7% optimal, 93% optimal/acceptable), penile (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously given recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.

Language: English
DOI: 10.1177/17562872261436905
Year: 2026
Published in: Therapeutic Advances in Urology 18 (2026)
ISSN: 1756-2872

Type and form: Article (Published version)
Area (Department): Área Tecnología Electrónica (Dpto. Ingeniería Electrón.Com.)
Area (Department): Área Mec.Med.Cont. y Teor.Est. (Dpto. Ingeniería Mecánica)
Area (Department): Área Urología (Dpto. Cirugía)


Creative Commons You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.


Exported from SIDERAL (2026-04-30-13:58:43)





 Record created 2026-04-30, last modified 2026-04-30


Published version:
 PDF