# Concepts
The Benchmarking module in Purple Fabric helps teams test and compare complete agent configurations, including underlying LLMs, prompts, and custom settings. It provides a controlled environment to simulate real-world behavior, allowing you to observe how different setups perform across various inputs and evaluation metrics.
From evaluative measures such as accuracy and relevance, to operational metrics like cost, tokens, and latency, to custom LLM-based checks, Benchmarking makes it easy to understand how well an agent is performing. While Purple Fabric comes with a set of commonly used metrics, teams can also define their own custom metrics using LLM-as-a-Judge to score responses more flexibly.
This gives you the confidence to experiment with prompt updates, switch models, or change configurations, knowing you can validate performance before anything goes live.
Core capabilities include:
- Comparative testing across multiple models and prompt configurations
- A/B testing of prompt variations for the same model
- A/B testing of the same prompt and model with different model configurations
- Quality (OOTB LLM judge) and operational (performance) metrics-based evaluation
- Customized LLM-as-a-Judge agent for custom evaluation metrics
- Exportable, repeatable benchmark runs for reliable regression testing and analysis
# LLM and Agent Configurations
Different tracks (columns) in benchmarking can be configured using:
- Different LLM models and configurations (e.g., GPT-4, Claude, Gemini)
- The same LLM provider with different configuration settings (e.g., temperature and Top P)
- Variations in perspective/system instructions
Each track (column) allows side-by-side performance comparison for each row, as well as overall aggregated results across rows.
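For illustration, two tracks might share a model but differ in sampling settings. The sketch below is a hypothetical configuration shape, not Purple Fabric's actual schema; field names such as `model`, `temperature`, `top_p`, and `system_instruction` are assumptions.

```python
# Hypothetical sketch of two benchmark tracks (columns); field names are
# illustrative and do not reflect Purple Fabric's actual configuration schema.
track_a = {
    "name": "GPT-4 / precise",
    "model": "gpt-4",              # assumed model identifier
    "temperature": 0.2,            # lower temperature -> more deterministic output
    "top_p": 0.9,
    "system_instruction": "You are a concise financial assistant.",
}

track_b = {
    "name": "GPT-4 / creative",
    "model": "gpt-4",              # same model, different sampling configuration
    "temperature": 0.9,            # higher temperature -> more varied output
    "top_p": 1.0,
    "system_instruction": "You are a concise financial assistant.",
}

# Each row of the benchmark dataset would be run against both tracks,
# and the resulting metrics compared side by side.
```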
# LLM as a Judge
LLM as a Judge in Purple Fabric lets you define your evaluation logic by customizing both the prompt and model configuration. You can specify how the model should evaluate responses by referencing variables such as:
- Input – the original user input
- Expected Output – the ideal or reference answer
- Agent Response – the actual output from the agent
- Context – any supporting information or background
It further supports:
- Custom Metrics – define task-specific evaluation parameters (e.g., factual accuracy, tone, policy alignment) that align with your business needs
- Agent Summary – automatically aggregates individual row scores to produce an overall performance score for the agent across all test cases
This setup gives teams flexibility to evaluate what matters most to them, enabling nuanced, case-specific scoring at scale.
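As an illustration, a custom judge prompt could reference those variables through placeholders. The template wording and `{placeholder}` syntax below are assumptions for the sketch, not Purple Fabric's exact prompt syntax.

```python
# Illustrative judge prompt template; the {placeholder} syntax is an assumption,
# not necessarily the syntax Purple Fabric uses.
JUDGE_PROMPT = """
You are an impartial evaluator.

Input: {input}
Expected Output: {expected_output}
Agent Response: {agent_response}
Context: {context}

Score how well the Agent Response answers the Input relative to the Expected
Output, using only the Context as supporting evidence.
Return a single integer from 0 (poor) to 5 (excellent).
"""

def build_judge_prompt(row: dict) -> str:
    # Fill the template with one benchmark row's fields.
    return JUDGE_PROMPT.format(
        input=row["input"],
        expected_output=row["expected_output"],
        agent_response=row["agent_response"],
        context=row.get("context", ""),
    )
```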
# Metrics
## Evaluative Metrics
These are out-of-the-box (OOTB) LLM-based metrics that use semantic and contextual understanding to measure agent response quality against different parameters. The following metrics are defined under Evaluative Metrics:
### Exact Accuracy
Measures whether the agent's response matches the expected output exactly at the string level. This is a character-by-character comparison and is the only Evaluative metric that is not LLM-based.
Scoring Logic:
- 100% score → Perfect match with the expected output (identical strings)
- 0% score → Any discrepancy (non-identical strings)
Key use cases: rule-based evaluations (legal document clause matching), tasks requiring strict output formatting (function signature matching), automated scoring for deterministic tasks.
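A minimal sketch of the exact-match idea, assuming a strict pass/fail string comparison (whether the platform normalizes whitespace or case is not stated here):

```python
def exact_accuracy(agent_response: str, expected_output: str) -> int:
    """Return 100 for an identical string match, otherwise 0."""
    return 100 if agent_response == expected_output else 0

# Example
print(exact_accuracy("approve", "approve"))   # 100
print(exact_accuracy("approve", "Approve"))   # 0 -- differs by a single character
```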
### Contextual Accuracy
Measures semantic similarity between the agent response and the expected output, allowing for flexible phrasing and paraphrasing. The goal is to evaluate whether the model "got the right idea."
Scoring Logic (scored as a percentage, 0-100%):
- Higher score → Response conveys the same meaning as expected
- Lower score → Semantic drift or misunderstood intent
Note: This is ideal when wording flexibility is acceptable, but correctness still matters.
Key use cases: open-ended QA (multi-turn customer support chat), instruction following where exact words may vary (meeting summarization tools).
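Purple Fabric scores Contextual Accuracy with an LLM judge, but the underlying idea of semantic similarity can be sketched with sentence embeddings. The embedding model and the 0-100 scaling below are illustrative assumptions, not the platform's implementation.

```python
# Illustration only: shows the idea of semantic similarity scoring using
# sentence embeddings, not Purple Fabric's LLM-based implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed example model

def contextual_accuracy(agent_response: str, expected_output: str) -> float:
    """Approximate semantic similarity as a 0-100 score."""
    emb = model.encode([agent_response, expected_output])
    cosine = float(util.cos_sim(emb[0], emb[1]))
    return round(max(cosine, 0.0) * 100, 1)

# Same meaning, different wording -> high score
print(contextual_accuracy("The claim was rejected.", "The request was denied."))
```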
### Faithfulness
Evaluates whether the agent's response remains consistent with the provided guidelines, policy, or task objective, i.e., whether the model stays faithful to the intent, guidelines, and rules. By default, scoring is on a 0-5 scale, but you can configure the scale in the policy instructions box (an illustrative rubric is sketched below).
Scoring Logic:
- Higher score → No contradictions, deviations, or speculative behavior
- Lower score → Hallucinations, policy violations, or inaccurate claims
Key use cases: enterprise or regulated responses (e.g., Loan Eligibility Checker), safety-sensitive agents (e.g., customer service bots), fact-checking or controlled generation (e.g., Insurance Claim Eligibility Assistant).
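As an example of overriding the default 0-5 scale, the policy instructions might spell out a custom rubric. The wording and the 0-10 scale below are purely illustrative.

```python
# Hypothetical policy instructions overriding the default 0-5 Faithfulness scale.
# The rubric wording is illustrative; author your own in the policy instructions box.
POLICY_INSTRUCTIONS = """
Score faithfulness on a 0-10 scale:
0-3  = contradicts the policy or invents unsupported claims
4-7  = mostly faithful, with minor deviations from the guidelines
8-10 = fully faithful to the stated intent, guidelines, and rules
"""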
### Groundedness
Assesses whether the agent's response aligns with the provided context (retrieved chunks, uploaded files) used while generating its output. It ensures no hallucination is introduced, since claims are expected to come only from the source material.
Scoring Logic (scored between 0-5):
- Higher score (5) → Response is fully grounded in context
- Lower score (0) → Presence of unverified or unsupported claims
Key use cases: Retrieval-Augmented Generation (RAG) (e.g., financial report summarization).
Clarification: Groundedness doesn't check correctness, only whether the response sticks to the source material, i.e., the context used by the agent.
### Relevance
Evaluates how well the agent's response addresses the original input and adheres to system instructions/perspective.
Scoring Logic (scored between 0-5):
- Higher score (5) → Response is directly focused, appropriately scoped, and on-topic
- Lower score (0) → Generic, tangential, or ignores instructions
Key use cases: conversational agents (e.g., virtual teaching assistants), customer support or internal helpdesks (e.g., customer query handling).
## Operational Metrics
Performance and cost metrics of model inference. The following metrics are available:
### Cost - Token-based cost estimation
The cost metric estimates spend from the tokens consumed (input and output) and the model provider's pricing. Benchmarking highlights how token-heavy use cases such as conversations or automation tasks (summarization, classification, extraction) affect cost, helping you identify the most efficient LLM setup.
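As a rough illustration of token-based cost estimation (the per-1K-token prices below are placeholders, not actual provider rates):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate inference cost from token counts and per-1K-token prices."""
    return (input_tokens / 1000) * input_price_per_1k + \
           (output_tokens / 1000) * output_price_per_1k

# Placeholder prices for illustration only -- check your provider's price list.
print(estimate_cost(input_tokens=1200, output_tokens=350,
                    input_price_per_1k=0.01, output_price_per_1k=0.03))
# -> 0.0225 (1.2 * 0.01 + 0.35 * 0.03)
```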
### Latency - Total and first-token response time
Measures how fast each LLM and agent configuration responds. Comparing latency helps identify configurations that deliver the best real-time performance for specific use cases.
- Conversational Agents: Low latency is essential for real-time, human-like conversations. Benchmarking helps identify LLMs and setups that provide quick replies and smooth dialogue flow.
- Automation Agents: Total response time matters for task throughput (e.g., bulk processing).
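A minimal sketch of how total and first-token latency could be measured around a streaming call; `stream_response` is a hypothetical stand-in for whichever streaming client is in use.

```python
import time

def stream_response(prompt: str):
    # Hypothetical stand-in for a streaming LLM client; yields response chunks.
    yield from ["Hello", ", ", "world", "."]

def measure_latency(prompt: str) -> dict:
    """Record time to first token and total response time for one request."""
    start = time.perf_counter()
    first_token_time = None
    chunks = []
    for chunk in stream_response(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        chunks.append(chunk)
    return {
        "first_token_seconds": first_token_time,
        "total_seconds": time.perf_counter() - start,
        "response": "".join(chunks),
    }

print(measure_latency("Summarize the attached report."))
```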
### Tokens - Count of input and output tokens
Benchmarking token counts guides you in selecting or tuning LLMs and agent configurations for balanced performance and cost.
- Conversational Agents: High output token counts can inflate cost. Benchmarking helps optimize system prompts and control verbosity while maintaining relevance and clarity.
- Automation Agents: Input token count is key, especially when processing long documents or datasets. Understanding token limits ensures agents are configured for high-efficiency throughput.
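For instance, input and output tokens can be counted with a tokenizer such as tiktoken. The `cl100k_base` encoding below is an assumption; the correct tokenizer depends on the model being benchmarked.

```python
import tiktoken

# Assumes the cl100k_base encoding; the right tokenizer depends on the model.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens for a piece of text with the chosen encoding."""
    return len(encoding.encode(text))

prompt = "Classify the sentiment of the following review: ..."
response = "Positive"
print(count_tokens(prompt), count_tokens(response))  # input vs. output token counts
```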
### Ratings - Response ratings from users
Ratings reflect the naturalness, helpfulness, and contextual understanding of responses, which are key signals for improving dialogue models and engagement.
## Custom Metrics
Custom metrics can be defined using LLM-as-a-Judge. They rely on structured evaluation prompts that reference input/output placeholders and produce consistent, scalable scores across datasets.
Example: Using LLM-as-a-Judge to Evaluate Helpfulness
You want to evaluate how helpful an agent’s response is to a customer query. Instead of relying on an exact match with an expected output, you can define a custom metric called Helpfulness using LLM-as-a-Judge.
To do this, you can create an evaluation prompt such as:
Evaluate how helpful the [Agent Response] is in answering the [Input]. Consider whether it provides accurate, relevant, and actionable information.
Rate helpfulness on a scale from 0 to 5, where:
0 = Not helpful at all
3 = Somewhat helpful but missing important details or clarity
5 = Extremely helpful, accurate, and directly addresses the user's need
Metric variables:
- Metric Name - Helpfulness
- Output Type - Number
- Description - Measures how helpful the response is, on a scale of 0 to 5
Since the output type of this metric is a number, it is automatically included in the Agent Summary, where you can enable averaging of the helpfulness scores.
The summary view will then show the average helpfulness score across all benchmarked inputs for each agent configuration.
For example, if you run 15 test inputs and receive individual Helpfulness scores between 3 and 5, the summary may display:
Helpfulness (Avg.): 4.2
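As a quick illustration of how that average is formed (the per-row scores below are made-up examples consistent with the figures above):

```python
# Made-up per-row Helpfulness scores for 15 test inputs.
helpfulness_scores = [4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 3, 5, 4, 5, 4]

average = sum(helpfulness_scores) / len(helpfulness_scores)
print(f"Helpfulness (Avg.): {average:.1f}")  # Helpfulness (Avg.): 4.2
```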