# FAQ & Troubleshooting
1. How should the user choose the expected output for generative tasks (e.g., summarization, rephrasing)?
When evaluating generative tasks, the user should create expected outputs that are concise but semantically accurate, rather than overly verbose or rigid. The expected output should not enforce a single phrasing; instead, it should reflect the core intent of the task and allow for multiple valid variations. For tasks like summarization or rephrasing, where linguistic variability is natural, the user should avoid metrics such as Exact Accuracy that rely on character-level or string-level matches. Instead, the user is encouraged to use Contextual Accuracy or LLM-based evaluations (LLM-as-a-Judge), which are better suited to scoring semantic similarity and meaning alignment.
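The sketch below illustrates the difference between the two scoring styles. It is a minimal, hypothetical example: `call_judge_model` stands in for whatever LLM client the benchmarking setup already uses, and the judge prompt wording is only an assumption about how such a rubric could be phrased.

```python
# Minimal sketch: why exact matching fails for generative tasks, and how a
# semantic (LLM-as-a-Judge) check can be structured instead.
# `call_judge_model` is a hypothetical placeholder for an existing LLM client.

def exact_accuracy(expected: str, actual: str) -> float:
    # String-level comparison: any valid rephrasing scores 0.
    return 1.0 if expected.strip() == actual.strip() else 0.0

JUDGE_PROMPT = """You are grading a summarization task.
Expected intent: {expected}
Model output: {actual}
Score 1 if the output preserves the core meaning of the expected intent,
otherwise score 0. Reply with only the number."""

def semantic_accuracy(expected: str, actual: str, call_judge_model) -> float:
    # LLM-as-a-Judge: scores meaning alignment rather than exact phrasing.
    reply = call_judge_model(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return float(reply.strip())

expected = "The report warns that coastal flooding risk is rising."
actual = "According to the report, the danger of floods along the coast is increasing."
print(exact_accuracy(expected, actual))  # 0.0, even though the rephrasing is correct
```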
2. How should the user use benchmarking to enforce compliance or policy-aligned outputs?
If the organization has internal guidelines or regulatory requirements for responses, the user should embed these constraints in the evaluation logic, either through a custom prompt using the LLM-as-a-Judge framework or by using the Faithfulness metric and defining the policy in the policy instructions box. By configuring the benchmarking engine to include a faithfulness check or custom evaluators, the user can ensure that agent responses comply with the required policies. Faithfulness and LLM-driven rating metrics are especially effective at flagging policy violations, hallucinations, or deviations from authorized phrasing in enterprise or compliance-sensitive applications.
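As an illustration of embedding policy constraints in a custom judge prompt, consider the sketch below. The policy rules, the prompt wording, and the `call_judge_model` callable are all assumptions standing in for the organization's own guidelines and LLM client.

```python
# Illustrative sketch of a policy-aware LLM-as-a-Judge evaluator.
# POLICY_RULES and `call_judge_model` are placeholders, not product APIs.

POLICY_RULES = """1. Never state guaranteed investment returns.
2. Always reference the official terms of service.
3. Do not mention competitor products by name."""

COMPLIANCE_JUDGE_PROMPT = """You are a compliance reviewer.
Policy:
{policy}

Agent response:
{response}

Return PASS if the response follows every policy rule, otherwise return
FAIL followed by the number of the violated rule."""

def check_compliance(response: str, call_judge_model) -> str:
    # Embeds the policy text directly in the evaluation prompt so the judge
    # scores compliance, not just general quality.
    prompt = COMPLIANCE_JUDGE_PROMPT.format(policy=POLICY_RULES, response=response)
    return call_judge_model(prompt).strip()
```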
3. What is the recommended number of inputs per benchmark run?
To derive meaningful and statistically valid insights from a benchmark test, the user should aim to include at least 10 to 20 input scenarios. These scenarios should represent a diverse set of queries—including common cases, known failure points, and edge conditions—to ensure comprehensive coverage and fair comparison across agent configurations or models.
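The hypothetical input set below shows one way to organize such a run. The field names are illustrative rather than a required schema; the point is to mix common cases, known failure points, and edge conditions within a single benchmark of 10 to 20 scenarios.

```python
# Hypothetical shape of a benchmark input set (field names are assumptions).
from collections import Counter

benchmark_inputs = [
    {"tag": "common",
     "input": "Summarize this week's release notes.",
     "expected": "A short summary listing the main new features and fixes."},
    {"tag": "failure_point",
     "input": "Summarize a document that consists mostly of tables.",
     "expected": "A summary describing the table contents without inventing prose."},
    {"tag": "edge_case",
     "input": "",  # empty input
     "expected": "A clarifying question or graceful refusal, not a fabricated summary."},
    # ... extended to roughly 10-20 scenarios for statistically meaningful results
]

# Quick check that each category is represented.
print(Counter(case["tag"] for case in benchmark_inputs))
```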
4. What should the user do if all models score poorly on a given input?
If all models perform poorly on a specific input, this can signal a deeper issue such as an overly rigid or unrealistic expected output, ambiguity in the input phrasing, or insufficient clarity in the prompt. In such cases, the user should first consider refining the expected output to better reflect the true intent of the task. Another useful strategy is to run the same input through a high-performing reference model, such as GPT-4, to establish a performance reference point. Additionally, revisiting and refining the prompt instructions may clarify task expectations and improve model alignment.
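A simple triage routine along these lines is sketched below. The `run_agent` and `score` callables are hypothetical stand-ins for the existing benchmarking calls, and the 0.5 threshold is an arbitrary example value.

```python
# Sketch of triaging an input on which every candidate model scores poorly:
# compare against a strong reference model to decide whether the input or
# the expected output is the real problem.

def triage_input(input_text, expected, models, reference_model, run_agent, score):
    results = {m: score(expected, run_agent(m, input_text)) for m in models}
    reference_score = score(expected, run_agent(reference_model, input_text))
    if reference_score < 0.5:
        # Even the reference model fails: the expected output or the input
        # phrasing is probably at fault, not the candidate models.
        return "revise the expected output or prompt", results, reference_score
    return "candidate models underperform on this input", results, reference_score
```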
5. How can the user check if a new prompt or model change has caused a regression?
To detect regressions effectively, the user should save a prior benchmark run as a baseline. Whenever changes are made—whether it's a new prompt template, a switch to a different model, or fine-tuning of the agent—the user should re-run the same benchmark and compare results. This approach allows for precise tracking of changes in metrics like accuracy, latency, cost, or relevance, helping the user identify and mitigate any silent performance drops or quality regressions introduced by the update.
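The sketch below shows what such a baseline comparison can look like once both runs are exported. The metric names, the dictionary layout, and the 5% tolerance are assumptions; the idea is a per-metric delta with a simple regression flag.

```python
# Minimal sketch of comparing a new benchmark run against a saved baseline.
# Values and metric names below are illustrative only.

baseline = {"accuracy": 0.86, "latency_s": 1.9, "cost_usd": 0.012, "relevance": 0.81}
new_run  = {"accuracy": 0.79, "latency_s": 2.4, "cost_usd": 0.011, "relevance": 0.82}

HIGHER_IS_BETTER = {"accuracy": True, "latency_s": False, "cost_usd": False, "relevance": True}
TOLERANCE = 0.05  # relative change tolerated before flagging a regression

for metric, old in baseline.items():
    new = new_run[metric]
    change = (new - old) / old
    worse = change < -TOLERANCE if HIGHER_IS_BETTER[metric] else change > TOLERANCE
    flag = "REGRESSION" if worse else "ok"
    print(f"{metric:10s} {old:8.3f} -> {new:8.3f} ({change:+.1%}) {flag}")
```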
6. Can the user estimate the cost before deploying a model to production?
Yes, benchmarking provides visibility into estimated model usage costs before deployment. The user should include the Cost and Token Count metrics when configuring a benchmark test. These metrics provide estimates of token-based pricing and help the user understand how much each input is likely to cost in terms of inference. With this information, the user can compare different models under the same conditions and choose the most cost-effective option for their use case or budget constraints.
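For intuition on how token counts translate into cost estimates, the back-of-envelope sketch below can help. The per-1K-token prices are placeholders rather than real vendor rates, and the model names are invented; the current pricing for the models being compared should be substituted in.

```python
# Back-of-envelope cost estimate from token counts. Prices are illustrative only.

PRICING = {  # USD per 1,000 tokens: (input rate, output rate)
    "model_a": (0.0005, 0.0015),
    "model_b": (0.0100, 0.0300),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate

# Compare the same workload across models: one typical input, then a 20-input run.
for model in PRICING:
    per_input = estimate_cost(model, input_tokens=800, output_tokens=300)
    print(f"{model}: ~${per_input:.4f} per input, ~${per_input * 20:.4f} for a 20-input run")
```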