<?xml version="1.0" encoding="UTF-8"?>
<collection>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:invenio="http://invenio-software.org/elements/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:identifier>doi:10.1177/17562872261436905</dc:identifier><dc:language>eng</dc:language><dc:creator>Borque-Fernando, Ángel</dc:creator><dc:creator>Navarro, Denis</dc:creator><dc:creator>Doblare, Manuel</dc:creator><dc:creator>Esteban, Luis M.</dc:creator><dc:creator>Perez-Fentes, Daniel</dc:creator><dc:creator>Álvarez-Maestro, Mario</dc:creator><dc:creator>Medina-López, Rafael A.</dc:creator><dc:creator>Rodríguez Faba, Oscar</dc:creator><dc:creator>Rubio-Briones, José</dc:creator><dc:creator>Fernández-Pello, Sergio</dc:creator><dc:creator>Fernández-Gómez, Jesús María</dc:creator><dc:creator>Fernández Aparicio, Tomás</dc:creator><dc:creator>Guerrero Ramos, Félix</dc:creator><dc:creator>Izquierdo, Laura</dc:creator><dc:creator>Álvarez-Ossorio Fernández, José Luis</dc:creator><dc:title>Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation</dc:title><dc:identifier>ART-2026-148950</dc:identifier><dc:description>Abstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain.
Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers.
Design: This expert-validated study evaluated six configurations of three leading LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer.
Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations.
Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder cancer (56.7% optimal, 93% optimal/acceptable), penile cancer (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously given recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.</dc:description><dc:date>2026</dc:date><dc:source>http://zaguan.unizar.es/record/170998</dc:source><dc:doi>10.1177/17562872261436905</dc:doi><dc:identifier>http://zaguan.unizar.es/record/170998</dc:identifier><dc:identifier>oai:zaguan.unizar.es:170998</dc:identifier><dc:identifier.citation>Therapeutic Advances in Urology 18 (2026)</dc:identifier.citation><dc:rights>by</dc:rights><dc:rights>https://creativecommons.org/licenses/by/4.0/deed.es</dc:rights><dc:rights>info:eu-repo/semantics/openAccess</dc:rights></dc:dc>

</collection>