# Concepts

The Benchmarking module in Purple Fabric helps teams test and compare complete agent configurations, including underlying LLMs, prompts, and custom settings. It provides a controlled environment to simulate real-world behavior, allowing you to observe how different setups perform across various inputs and evaluation metrics.

From evaluative measures such as accuracy and relevance, to operational metrics like cost, tokens, and latency, to custom LLM-based checks, Benchmarking makes it easy to understand how well an agent is performing. While Purple Fabric comes with a set of commonly used metrics, teams can also define their own custom metrics using LLM-as-a-Judge to score responses more flexibly.

This gives teams the confidence to experiment with prompt updates, switch models, or change configurations, knowing they can validate performance before anything goes live.

Core capabilities include:

  • Comparative testing across multiple models and prompt configurations

  • A/B testing of prompt variations for the same model

  • A/B testing for the same prompt and model, but with different model configurations

  • Quality (OOTB LLM judge) and operational (performance) metrics-based evaluation

  • Customized LLM-as-a-Judge agent for custom evaluation metrics

  • Exportable, repeatable benchmark runs for reliable regression testing and analysis

# LLM and Agent Configurations

Different tracks (columns) in benchmarking can be configured using:

  • Different LLM models and configurations (e.g., GPT-4, Claude, Gemini)

  • Same LLM provider with different configuration settings (e.g., temperature and top-p)

  • Variations in perspective/system instructions

Each track (column) allows side-by-side performance comparison for each row, as well as overall aggregated results across rows.
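
As an illustration, three tracks covering these variations might look like the following sketch. The field names and values are hypothetical and do not reflect Purple Fabric's actual configuration schema:

```python
# Hypothetical sketch of three benchmark tracks (columns).
# Field names and values are illustrative only, not Purple Fabric's schema.
tracks = [
    {   # Track A: baseline model and sampling settings
        "name": "gpt-4-baseline",
        "model": "gpt-4",
        "temperature": 0.2,
        "top_p": 1.0,
        "system_instruction": "You are a concise support assistant.",
    },
    {   # Track B: same model, different sampling configuration
        "name": "gpt-4-higher-temp",
        "model": "gpt-4",
        "temperature": 0.9,
        "top_p": 0.95,
        "system_instruction": "You are a concise support assistant.",
    },
    {   # Track C: same model and sampling, different perspective/system instruction
        "name": "gpt-4-formal-tone",
        "model": "gpt-4",
        "temperature": 0.2,
        "top_p": 1.0,
        "system_instruction": "You are a formal, policy-aware support assistant.",
    },
]

# Each row (test input) is run against every track, so results can be compared
# side by side per row and aggregated per track across all rows.
```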

# LLM as a Judge

LLM as a Judge in Purple Fabric lets you define your evaluation logic by customizing both the prompt and model configuration. You can specify how the model should evaluate responses by referencing variables such as:

  • Input – the original user input

  • Expected Output – the ideal or reference answer

  • Agent Response – the actual output from the agent

  • Context – any supporting information or background

It further supports:

  • Custom Metrics – You can define task-specific evaluation parameters (e.g., factual accuracy, tone, policy alignment) that align with your business needs

  • Agent Summary – Automatically aggregates individual row scores to produce an overall performance score for the agent across all test cases

This setup gives teams flexibility to evaluate what matters most to them, enabling nuanced, case-specific scoring at scale.
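
A minimal sketch of what such an evaluation prompt could look like is shown below; the placeholder syntax, prompt wording, and helper function are assumptions for illustration, not Purple Fabric's exact template format:

```python
# Hypothetical judge prompt referencing the four variables described above.
# The {input}/{expected_output}/{agent_response}/{context} placeholder syntax
# is assumed; Purple Fabric's actual template format may differ.
JUDGE_PROMPT = """
You are an impartial evaluator.

User input:
{input}

Expected output (reference answer):
{expected_output}

Agent response (to be evaluated):
{agent_response}

Supporting context:
{context}

Score the agent response from 0 to 5 for factual accuracy and policy alignment,
and explain the score in one sentence.
""".strip()

def build_judge_prompt(row: dict) -> str:
    """Fill the template with values from one benchmark row (illustrative helper)."""
    return JUDGE_PROMPT.format(
        input=row["input"],
        expected_output=row["expected_output"],
        agent_response=row["agent_response"],
        context=row.get("context", "N/A"),
    )
```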

# Metrics

# Evaluative Metrics

These are out-of-the-box (OOTB) LLM-based metrics that use semantic and contextual understanding to measure agent response quality against different parameters. The metrics defined under Evaluative Metrics are listed below:

# Exact Accuracy

Measures whether the agent's response matches the expected output exactly at the string level. The comparison is character by character, and this is the only Evaluative metric that is not LLM-based.

Scoring Logic:

100% score → Perfect match with the expected output (identical strings)

0% score → Any discrepancy; non-identical strings

Key use cases: Rule-based evaluations (e.g., legal document clause matching), tasks requiring strict output formatting (e.g., function signature matching), automated scoring for deterministic tasks
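
Conceptually, the check reduces to a strict string comparison, as in this minimal sketch (any normalization such as trimming or case folding applied by the platform is not assumed here):

```python
def exact_accuracy(agent_response: str, expected_output: str) -> int:
    """Return 100 for an identical string, 0 otherwise.

    Sketch only: the raw strings are compared character by character,
    with no trimming or case folding assumed.
    """
    return 100 if agent_response == expected_output else 0

print(exact_accuracy("Clause 4.2 applies.", "Clause 4.2 applies."))  # 100
print(exact_accuracy("Clause 4.2 applies.", "Clause 4.2 applies"))   # 0 (missing period)
```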

# Contextual Accuracy

Measures semantic similarity between the agent response and the expected output, allowing for flexible phrasing and paraphrasing. The goal is to evaluate whether the model "got the right idea."

Scoring Logic (scored as a percentage between 0-100%):

Higher score → Response conveys the same meaning as expected

Lower score → Semantic drift or misunderstood intent

Note: This is ideal when wording flexibility is acceptable, but correctness still matters.

Key use cases: Open-ended QA (e.g., multi-turn customer support chat), instruction following where exact wording may vary (e.g., meeting summarization tools)
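
Purple Fabric scores this metric with an LLM judge, but the underlying idea of rewarding meaning rather than wording can be sketched with sentence embeddings. The model name and scaling below are arbitrary choices for illustration:

```python
# Embedding-based approximation of semantic similarity (illustrative only;
# the product's Contextual Accuracy metric is LLM-based, not embedding-based).
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

def contextual_accuracy_sketch(agent_response: str, expected_output: str) -> float:
    """Return a rough 0-100 score for how close the meanings are."""
    embeddings = model.encode([agent_response, expected_output], convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    return max(0.0, cosine) * 100

print(contextual_accuracy_sketch(
    "The meeting was moved to Friday at 10am.",
    "Friday 10:00 is the new meeting time.",
))  # high score despite different wording
```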

# Faithfulness

Evaluates whether the agent's response remains consistent with the provided guidelines, policy, or task objective, i.e., whether the model stays faithful to the intent, guidelines, and rules. By default, scoring is on a 0-5 scale, but you can configure the scale in the policy instructions box.

Scoring Logic:

Higher score → No contradictions, deviations, or speculative behavior

Lower score → Hallucinations, policy violations, or inaccurate claims

Key use cases: Enterprise or regulated responses (e.g., Loan Eligibility Checker), safety-sensitive agents (e.g., Customer Service bots), fact-checking or controlled generation (e.g., Insurance Claim Eligibility Assistant)
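
A rough sketch of how an LLM judge could be prompted for this metric is shown below. The prompt wording and the call_llm helper are assumptions; Purple Fabric ships its own out-of-the-box judge and lets you adjust the scale via the policy instructions box:

```python
# Hypothetical faithfulness judge. The prompt and call_llm() are placeholders,
# not Purple Fabric APIs.
FAITHFULNESS_PROMPT = """
Policy / guidelines:
{policy}

Agent response:
{agent_response}

On a scale of 0 to 5, how faithful is the response to the policy above?
Penalize contradictions, speculative claims, and rule violations.
Answer with a single integer.
""".strip()

def score_faithfulness(policy: str, agent_response: str, call_llm) -> int:
    """call_llm is any function that sends a prompt to a judge model and returns its text reply."""
    reply = call_llm(FAITHFULNESS_PROMPT.format(policy=policy, agent_response=agent_response))
    return int(reply.strip())
```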

# Groundedness

Assesses whether the agent's response aligns with the provided context (retrieved chunks, uploaded files) when generating its output. It guards against hallucination by checking that the response draws only on the source material.

Scoring Logic (scored between 0-5):

Higher score (5) → Response is fully grounded in context

Lower score (0) → Presence of unverified or unsupported claims

Key use cases - Retrieval-Augmented Generation (RAG) (e.g., Financial report summarization)

Clarification: Groundedness doesn’t check correctness, only whether the response sticks to the source material, i.e., the context used by the agent.
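
To make that distinction concrete, here is a deliberately crude, non-LLM stand-in that only asks "is this traceable to the context?", never "is this true?". The token-overlap heuristic and threshold are arbitrary assumptions for illustration; the real metric is LLM-based:

```python
# Crude illustration of groundedness: what fraction of the response can be
# traced back to the supplied context. This token-overlap heuristic only
# demonstrates the "sticks to the source" idea, not the product's scoring.
def groundedness_sketch(agent_response: str, context: str) -> float:
    context_words = set(context.lower().split())
    sentences = [s for s in agent_response.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        overlap = len(words & context_words) / max(len(words), 1)
        if overlap >= 0.6:  # arbitrary threshold for "supported by context"
            supported += 1
    return 5 * supported / max(len(sentences), 1)  # map the fraction to a 0-5 score

context = "The policy covers water damage up to 5000 USD per claim."
print(groundedness_sketch("The policy covers water damage up to 5000 USD.", context))  # 5.0
print(groundedness_sketch("The policy also covers earthquakes worldwide.", context))   # 0.0
```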

# Relevance

Evaluates how well the agent's response addresses the original input and adheres to the system instructions/perspective.

Scoring Logic (scored between 0-5):

Higher score (5) → Response is directly focused, appropriately scoped, and on-topic

Lower score (0) → Generic, tangential, or ignores instructions

Key use cases: Conversational agents (e.g., virtual teaching assistants), customer support or internal helpdesk (e.g., customer query handling)

# Operational Metrics

Performance and Cost Metrics of Model Inference.

Metrics available:

# Cost - Token-based cost estimation

Cost metrics measure the tokens consumed (input and output) based on model provider pricing. Benchmarking highlights how token-heavy use cases like conversations or automation tasks (summarization, classification, extraction) impact cost, helping identify the most efficient LLM setup.
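
As a toy example of the arithmetic involved, the sketch below estimates per-row cost from token counts. The model names and per-1K-token rates are made up; substitute your provider's actual pricing:

```python
# Illustrative token-based cost estimate. Model names and prices per 1K tokens
# are placeholders, not real provider pricing.
PRICING_PER_1K = {
    "model-a": {"input": 0.0030, "output": 0.0060},
    "model-b": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING_PER_1K[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

# A summarization row: long document in, short summary out.
print(round(estimate_cost("model-a", input_tokens=6000, output_tokens=400), 4))  # 0.0204
print(round(estimate_cost("model-b", input_tokens=6000, output_tokens=400), 4))  # 0.0036
```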

# Latency - Total and first-token response time

Measures how fast each LLM and agent configuration responds. Comparing latency helps identify configurations that deliver the best real-time performance for specific use cases.

Conversational Agents: Low latency is essential for real-time, human-like conversations. Benchmarking helps identify LLMs and setups that provide quick replies and smooth dialogue flow.

Automation Agents: Total response time matters for task throughput (e.g., bulk processing).
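
A minimal sketch of measuring both numbers around a streaming call is shown below; stream_completion stands in for whatever function streams the model's output and is not a Purple Fabric API:

```python
import time

def measure_latency(stream_completion, prompt: str) -> dict:
    """Measure time-to-first-token and total response time for one call.

    stream_completion is a placeholder for any function that yields response
    chunks as they arrive (not a Purple Fabric API).
    """
    start = time.perf_counter()
    first_token_at = None
    for _chunk in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    return {
        "first_token_s": (first_token_at or end) - start,
        "total_s": end - start,
    }

# Demo with a fake stream that simulates network and generation delay.
def fake_stream(prompt):
    time.sleep(0.2)            # delay before the first token arrives
    for word in "hello there friend".split():
        time.sleep(0.05)       # delay between tokens
        yield word

print(measure_latency(fake_stream, "hi"))
```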

# Tokens - Count of input and output tokens

Benchmarking token count guides you in selecting or tuning LLMs and agent configurations for balanced performance and cost.

Conversational Agents: High-output tokens can inflate cost. Benchmarking helps optimize system prompts and control verbosity while maintaining relevance and clarity.

Automation Agents: The input token count is key, especially when processing long documents or datasets. Understanding token limits ensures agents are configured for high-efficiency throughput.
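
For a rough sense of the numbers, tokens can be counted offline before a run, for example with the tiktoken library. This uses OpenAI-style tokenization; other providers tokenize differently, so treat the counts as comparison estimates rather than exact billing figures:

```python
# Rough token counting with tiktoken (pip install tiktoken).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # OpenAI-style encoding

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

document = "Quarterly revenue grew 12% year over year, driven by strong demand in APAC."
summary = "Revenue up 12%, led by APAC."
print(count_tokens(document), "input tokens")
print(count_tokens(summary), "output tokens")
```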

# Ratings - Response ratings from users

Ratings reflect naturalness, helpfulness, and contextual understanding, which are key for improving dialogue models and engagement.

# Custom Metrics

Custom metrics can be defined using LLM-as-a-Judge. They rely on structured evaluation prompts that reference input/output placeholders and produce consistent, scalable scores across datasets.

Example: Using LLM-as-a-Judge to Evaluate Helpfulness

You want to evaluate how helpful an agent’s response is to a customer query. Instead of relying on an exact match with an expected output, you can define a custom metric called Helpfulness using LLM-as-a-Judge.

To do this, you can create an evaluation prompt such as:

Evaluate how helpful the Agent Response is in answering the Input. Consider whether it provides accurate, relevant, and actionable information.

Rate helpfulness on a scale from 0 to 5, where:

0 = Not helpful at all

3 = Somewhat helpful but missing important details or clarity

5 = Extremely helpful, accurate, and directly addresses the user's need

Metric Variables:

Metric Name - Helpfulness

Output Type - Number

Description - Measures how helpful the response is, on a scale of 0 to 5

Since the output type of this metric is a number, it is automatically included in the Agent Summary, where you can enable averaging of the Helpfulness scores.

The summary view will then show the average helpfulness score across all benchmarked inputs for each agent configuration.

For example, if you run 15 test inputs and receive individual Helpfulness scores between 3 and 5, the summary may display:

Helpfulness (Avg.): 4.2
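
Behind that summary line, the aggregation is a simple average of the per-row judge scores for each agent configuration. The sketch below uses made-up scores (chosen so one track averages 4.2, matching the example above); it is not real benchmark output:

```python
# Per-row Helpfulness scores from the LLM judge, keyed by agent configuration.
# Scores are illustrative, not real benchmark results.
row_scores = {
    "track-a": [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 4, 5, 4, 3],  # 15 test inputs
    "track-b": [4, 3, 4, 4, 3, 5, 4, 3, 4, 4, 3, 4, 4, 3, 4],
}

for track, scores in row_scores.items():
    average = sum(scores) / len(scores)
    print(f"{track}: Helpfulness (Avg.): {average:.1f}")
# track-a: Helpfulness (Avg.): 4.2
# track-b: Helpfulness (Avg.): 3.7
```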