How to Make the Most Out of LLM Production Data: Simulated User Feedback

By Pasquale Antonante, Ph.D. | Apr 2024

In this section we will show how to use the open-source library continuous-eval to create simulated user feedback. Consider a Q&A chatbot application. After deployment, users begin rating responses with thumbs up or down, indicating a need for performance enhancement. For this example we will use the example dataset named `correctness` in continuous-eval:

```python
dataset = Dataset(example_data_downloader("correctness"))
# Samples are annotated with "correct", "incorrect" or "refuse-to-answer"
# We remove the samples where the LLM refused to answer (i.e., said "I don't know")
dataset.filter(lambda x: x["annotation"] != "refuse-to-answer")
dataset.sample(300)  # Only for this example: randomly sample 300 examples
```

As we mentioned, we want to create some custom criteria. We leverage the `LLMBasedCustomMetric` class to define the Tone and Conciseness metrics. To do so, we need to define each metric and provide a scoring rubric.

For the tone:

```python
tone = LLMBasedCustomMetric(
    name="Tone",
    definition=(
        "The Tone/Content Issues metric evaluates the appropriateness and accuracy "
        "of the tone and content in responses to specific questions. It focuses on "
        "ensuring that the tone is professional and suitable for the context, and "
        "that the content accurately addresses the question without unnecessary "
        "deviations or inaccuracies. This metric is crucial for maintaining a "
        "professional image and ensuring clear, direct communication."
    ),
    scoring_rubric="""Use the following rubric to assign a score to the answer based on its tone:
- Score 1: The response is inappropriate or inaccurate, with a tone that is either too informal, overly strong, or not suited to the professional context. The content may be irrelevant, incorrect, or fail to directly address the question posed.
- Score 2: The response is mostly appropriate and accurate but may contain minor tone or content issues. The tone is generally professional but may slip into informality or unnecessary strength in places. The content addresses the question but may include minor inaccuracies or unnecessary details.
- Score 3: The response is appropriate and accurate, with a tone that is professional and suited to the context. The content directly and correctly addresses the question without unnecessary deviations or inaccuracies.""",
    scoring_function=ScoringFunctions.Numeric(min_val=1, max_val=3),
    model_parameters={"temperature": 0},
)
```

while for conciseness:

```python
conciseness = LLMBasedCustomMetric(
    name="Conciseness",
    definition=(
        "Conciseness in communication refers to the expression of ideas in a clear "
        "and straightforward manner, using the fewest possible words without "
        "sacrificing clarity or completeness of information. It involves eliminating "
        "redundancy, verbosity, and unnecessary details, focusing instead on "
        "delivering the essential message efficiently."
    ),
    scoring_rubric="""Use the following rubric to assign a score to the answer based on its conciseness:
- Score 1: The answer is overly verbose, containing a significant amount of unnecessary information, repetition, or redundant expressions that do not contribute to the understanding of the topic.
- Score 2: The answer includes some unnecessary details or slightly repetitive information, but the excess does not severely hinder understanding.
- Score 3: The answer is clear, direct, and to the point, with no unnecessary words, details, or repetition.""",
    scoring_function=ScoringFunctions.Numeric(min_val=1, max_val=3),
    model_parameters={"temperature": 0},
)
```

We use Tone and Conciseness together with more standard metrics; in particular we will consider:

- Answer Correctness (`DeterministicAnswerCorrectness` and `LLMBasedAnswerCorrectness`)
- Answer Relevance (`LLMBasedAnswerRelevance`)
- Style Consistency (`LLMBasedStyleConsistency`)
- Readability (`FleschKincaidReadability`)

The next step is to put all the metrics together and specify what field of the dataset should be used to compute the metrics.
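Conceptually, an LLM-based custom metric turns the definition and rubric into a grading prompt and parses a numeric score out of the model's reply. The following is a minimal, library-independent sketch of that idea; the `llm` callable and the prompt format are illustrative assumptions, not continuous-eval's internals:

```python
import re


def rubric_score(llm, definition, rubric, question, answer, min_val=1, max_val=3):
    """Ask an LLM to grade `answer` against a rubric; return an integer score.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    prompt = (
        f"{definition}\n\n{rubric}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Reply with a single integer between {min_val} and {max_val}."
    )
    reply = llm(prompt)
    match = re.search(r"\d+", reply)  # pull the first integer out of the reply
    if match is None:
        raise ValueError(f"No score found in reply: {reply!r}")
    # Clamp to the rubric's range in case the model goes out of bounds
    return min(max(int(match.group()), min_val), max_val)


# Example with a stubbed model that always replies "Score: 2"
print(rubric_score(lambda p: "Score: 2", "Tone metric.", "Rubric...", "Q?", "A."))  # -> 2
```

Setting `temperature` to 0, as in the snippets above, makes this grading step as deterministic as the model allows.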
To do that we can use the `SingleModulePipeline`:

```python
pipeline = SingleModulePipeline(
    dataset=dataset,
    eval=[
        DeterministicAnswerCorrectness().use(
            answer=dataset.answer,
            ground_truth_answers=dataset.ground_truths,
        ),
        LLMBasedAnswerCorrectness().use(
            question=dataset.question,
            answer=dataset.answer,
            ground_truth_answers=dataset.ground_truths,
        ),
        LLMBasedAnswerRelevance().use(question=dataset.question, answer=dataset.answer),
        LLMBasedStyleConsistency().use(
            answer=dataset.answer, ground_truth_answers=dataset.ground_truths
        ),
        FleschKincaidReadability().use(answer=dataset.answer),
        tone.use(
            question=dataset.question,
            answer=dataset.answer,
            ground_truth_answers=dataset.ground_truths,
        ),
        conciseness.use(
            question=dataset.question,
            answer=dataset.answer,
            ground_truth_answers=dataset.ground_truths,
        ),
    ],
)
```

and run all the metrics using the `EvaluationManager`:

```python
eval_manager = EvaluationManager(pipeline)
# The dataset already contains the model output, so we just set the evaluation results
eval_manager.evaluation.results = dataset.data
eval_manager.run_metrics()  # Note: there is no progress bar, it might take a few minutes
```

The next step is to train the simulated user feedback predictor:

```python
datasplit = DataSplit(
    X=eval_manager.metrics.to_pandas(),
    y=map(lambda x: 1 if x == "correct" else 0, dataset["annotation"]),
    split_ratios=SplitRatios(train=0.6, test=0.2, calibration=0.2),
)

# We use the train and calibration sets to train the classifier
predictor = EnsembleMetric(training=datasplit.train, calibration=datasplit.calibration)
```

This simulated user feedback predictor correctly predicts the human feedback in the test split 96.67% of the time.

We can leverage the proposed approach to better understand what is important to the user. Below is the importance of every metric as learned by the simulated user feedback predictor.

Learned importance of every metric by the simulated user feedback predictor.
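To make the idea behind the predictor concrete: it is a classifier trained to map per-sample metric scores to the binary thumbs-up/down label, from which feature importances can then be read off. Here is a self-contained sketch of that pattern using scikit-learn and synthetic data; the feature names, the random-forest choice, and the synthetic labels are illustrative assumptions, not `EnsembleMetric`'s actual implementation (which also uses the calibration split to calibrate predicted probabilities):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for eval_manager.metrics.to_pandas(): one column per metric score
feature_names = ["answer_correctness", "relevance", "style", "readability", "tone", "conciseness"]
X = rng.random((300, len(feature_names)))
# Stand-in for the thumbs-up/down labels: here driven mostly by correctness
y = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.random(300) > 0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
# Which metrics the classifier relies on, analogous to the importance plot below
for name, imp in sorted(zip(feature_names, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:20s} {imp:.2f}")
```

Because the synthetic labels above are driven mostly by the first feature, the classifier's importances concentrate there; on real production data, the importances reveal which metrics actually drive user preference.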
Image by the author.

Looking at the plot, we see that Correctness (including token overlap, which is another measure of correctness) and Relevance to the question are the most important predictors of user preference. But the user also weighs tone and style consistency into the decision. At the same time, we can see that conciseness and readability are not as important. Reviewing this graph provides valuable insight into user preferences, giving a clear indication of what elements are essential and what can be adjusted if compromises need to be made.

Collecting user feedback is challenging, yet it is the most important information for developers of large language model (LLM) applications. By simulating user feedback during offline testing, we significantly reduce the time it takes for feedback to travel from the field back to developers, while maintaining positive user relationships.

In practice, our approach has proven to closely mirror actual human responses, outperforming traditional methods that rely on isolated LLM responses. This strategy allows for the incremental improvement of generative AI applications, fostering continuous refinement and greater congruence with what users expect.

Note: We will soon publish a research paper with more details on this methodology. Stay tuned!