Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation
Abstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain.
Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers.
Design: This expert-validated study evaluated six configurations of three top LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive bladder cancer, non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer.
Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations.
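The aggregation described above, from individual 1–5 Likert ratings to the "optimal" (score 5) and "optimal/acceptable" (scores 4–5) rates reported in the Results, can be sketched as follows. This is an illustrative sketch only; the function name and sample ratings are hypothetical and not taken from the study's data.

```python
from collections import Counter

def summarize_ratings(ratings):
    """Aggregate 1-5 Likert ratings into two rates:
    'optimal' (score 5) and 'optimal/acceptable' (scores 4-5)."""
    counts = Counter(ratings)
    n = len(ratings)
    optimal = counts[5] / n
    acceptable = (counts[4] + counts[5]) / n
    return optimal, acceptable

# Hypothetical ratings from ten evaluators for one model's responses
ratings = [5, 4, 5, 3, 5, 4, 2, 5, 5, 4]
opt, acc = summarize_ratings(ratings)
print(f"optimal: {opt:.1%}, optimal/acceptable: {acc:.1%}")
# -> optimal: 50.0%, optimal/acceptable: 80.0%
```

In the study itself, each of the 10,500 evaluations would feed into per-model and per-tumor versions of these two rates.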
Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder cancer (56.7% optimal, 93% optimal/acceptable), penile cancer (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously given recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain-language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.

Language: English
DOI: 10.1177/17562872261436905
Year: 2026
Published in: Therapeutic Advances in Urology 18 (2026)
ISSN: 1756-2872
Type and form: Article (definitive version)
Area (Department): Área Tecnología Electrónica (Dpto. Ingeniería Electrón.Com.)
Area (Department): Área Mec.Med.Cont. y Teor.Est. (Dpto. Ingeniería Mecánica)
Area (Department): Área Urología (Dpto. Cirugía)