Exploring LLM-based Data Synthesis Strategies for Medical Consultation Preference Alignment
Abstract
This research explores the application of Reinforcement Learning from Artificial Intelligence Feedback (RLAIF) to enhance healthcare consultation models, with the aim of addressing the challenges of preference-aligned data synthesis while reducing dependence on medical experts. Specifically, we investigate the use of RLAIF for generating medical dialogues, focusing on two primary challenges: accurately reflecting physicians' preferences and the unreliability of existing automated assessment systems. To address these issues, we propose a two-stage approach for synthesizing preference-aligned datasets. In the first stage, we leverage the dialogue-continuation capability of a large language model to sample diverse, contextually aligned dialogue branches, using one-shot learning to intervene in the generation process. In the second stage, we model doctors' preferences through both outcome and process feedback: a rule-based reward system is used for outcome feedback, while a planning-based reward strategy is employed for process feedback. To validate our approach, we develop the Chinese Standardized Patient Test (CSPT) dataset, which emphasizes user guidance, instruction following, and synthesis ability, and construct an objective assessment system based on standardized patient testing. Experimental results demonstrate that our data synthesis approach performs well across five datasets, achieving a 17.6% improvement in diagnostic accuracy with outcome feedback and a 23.3% improvement with process feedback.
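As a rough illustration of the outcome-feedback idea summarized above, the sketch below shows one way a rule-based outcome reward could score sampled dialogue branches by comparing each branch's final diagnosis against a reference label, so that preferred and dispreferred branches can be ranked. The class, function names, scoring weights, and keyword-matching rule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed names and scoring rules, not the paper's implementation):
# rank sampled dialogue branches with a rule-based outcome reward that checks
# whether the final diagnosis in each branch matches a reference diagnosis.

from dataclasses import dataclass


@dataclass
class DialogueBranch:
    turns: list[str]          # alternating patient/doctor utterances
    final_diagnosis: str      # diagnosis extracted from the last doctor turn


def outcome_reward(branch: DialogueBranch, reference_diagnosis: str) -> float:
    """Return 1.0 for an exact match, 0.5 for partial keyword overlap, else 0.0."""
    pred = branch.final_diagnosis.strip().lower()
    ref = reference_diagnosis.strip().lower()
    if pred == ref:
        return 1.0
    # Partial credit when the prediction mentions the reference condition.
    if ref and ref in pred:
        return 0.5
    return 0.0


def rank_branches(
    branches: list[DialogueBranch], reference_diagnosis: str
) -> list[tuple[DialogueBranch, float]]:
    """Sort branches by reward so preference pairs can be formed from the extremes."""
    scored = [(b, outcome_reward(b, reference_diagnosis)) for b in branches]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    branches = [
        DialogueBranch(turns=["..."], final_diagnosis="Acute appendicitis"),
        DialogueBranch(turns=["..."], final_diagnosis="Gastroenteritis"),
    ]
    for branch, reward in rank_branches(branches, "acute appendicitis"):
        print(branch.final_diagnosis, reward)
```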