

Exploring LLM-Based Data Synthesis Strategies for Aligning Medical Consultation Preferences


Abstract: This research explores the application of reinforcement learning from artificial intelligence feedback (RLAIF) to enhance medical consultation models, aiming to address the challenges of preference-aligned data synthesis while reducing dependence on medical experts. Specifically, we investigate the use of RLAIF in medical dialogue generation, focusing on two primary challenges: accurately reflecting doctors' preferences and overcoming the unreliability of existing automated assessment systems. To address these issues, we propose a two-stage approach for synthesizing preference-aligned datasets. In the first stage, we leverage the dialogue-continuation capability of a large language model to sample diverse, contextually consistent dialogue branches through a one-shot learning intervention. In the second stage, we model doctors' preferences through both outcome and process feedback: a rule-based reward system is used for outcome feedback, and a planning-based reward strategy for process feedback. To validate our approach, we develop the Chinese Standardized Patient Test (CSPT) dataset, which emphasizes user guidance, instruction following, and synthesis ability, and construct an objective assessment system based on standardized patient testing. Experimental results demonstrate that our data synthesis approach performs well across five datasets, achieving a 17.6% improvement in diagnostic accuracy with outcome feedback and a 23.3% improvement with process feedback.
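As a rough illustration of the two-stage pipeline summarized above, the sketch below samples candidate dialogue branches (with a random stub standing in for the LLM dialogue-continuation call) and ranks them with a simple rule-based outcome reward to form a chosen/rejected preference pair. All names here (`sample_branches`, `outcome_reward`, the length penalty) are illustrative assumptions, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Branch:
    turns: list        # dialogue turns in this sampled continuation
    diagnosis: str     # final diagnosis the branch arrives at

def sample_branches(context, n=4, seed=0):
    """Stage 1 (stubbed): sample n candidate dialogue branches.
    A real system would call an LLM to continue the consultation."""
    rng = random.Random(seed)
    diagnoses = ["influenza", "common cold", "pneumonia"]
    return [Branch(turns=context + [f"doctor_turn_{i}"],
                   diagnosis=rng.choice(diagnoses))
            for i in range(n)]

def outcome_reward(branch, gold_diagnosis, max_turns=6):
    """Stage 2 (illustrative): rule-based outcome reward.
    Correct final diagnosis dominates; overly long consultations
    receive a small penalty."""
    correct = 1.0 if branch.diagnosis == gold_diagnosis else 0.0
    length_penalty = 0.1 * max(0, len(branch.turns) - max_turns)
    return correct - length_penalty

def build_preference_pair(context, gold_diagnosis):
    """Rank sampled branches by reward and pair best vs. worst
    as a chosen/rejected preference example."""
    branches = sample_branches(context)
    scored = sorted(branches,
                    key=lambda b: outcome_reward(b, gold_diagnosis),
                    reverse=True)
    return {"chosen": scored[0], "rejected": scored[-1]}

pair = build_preference_pair(["patient: I have a fever and a cough"],
                             "influenza")
```

Process feedback would replace `outcome_reward` with a reward computed over the intermediate consultation steps (e.g., adherence to a diagnostic plan) rather than the final diagnosis alone.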

