# Overview
# What is the Benchmarking Module?
The Benchmarking module in Purple Fabric is an evaluation tool that enables teams to simulate, compare, and validate the performance of AI agents across models, prompts, and configurations before they are released into production. It combines Evaluative, Operational, and Custom metrics in a controlled testing environment to surface insights into agent quality, consistency, and efficiency.
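Conceptually, a benchmark run loops the same test set through each candidate configuration, scores every response with evaluative metrics (for example, exact accuracy), and records operational metrics (latency, token usage) along the way. The sketch below is a minimal, self-contained illustration of that idea, assuming a stubbed `run_agent` function and invented configuration names; it is not the Purple Fabric API.

```python
import time

# Hypothetical test set: prompts paired with expected ("golden") answers.
TEST_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

# Two hypothetical configurations to compare side by side (a "benchmark grid").
CONFIGS = [
    {"name": "baseline", "model": "model-a", "temperature": 0.0},
    {"name": "candidate", "model": "model-b", "temperature": 0.7},
]


def run_agent(prompt: str, config: dict) -> dict:
    """Stand-in for an agent call; a real run would invoke the configured model."""
    time.sleep(0.01)  # simulate latency
    answer = "Paris" if "France" in prompt else "4"
    return {"answer": answer, "tokens": len(prompt.split()) + 1}


def benchmark(test_set: list, configs: list) -> list:
    """Score each configuration on evaluative and operational metrics."""
    results = []
    for config in configs:
        correct, total_tokens, total_latency = 0, 0, 0.0
        for case in test_set:
            start = time.perf_counter()
            response = run_agent(case["prompt"], config)
            total_latency += time.perf_counter() - start
            total_tokens += response["tokens"]
            # Evaluative metric: exact accuracy (string match against the expected answer).
            if response["answer"].strip().lower() == case["expected"].strip().lower():
                correct += 1
        results.append({
            "config": config["name"],
            "exact_accuracy": correct / len(test_set),
            "avg_latency_s": total_latency / len(test_set),
            "total_tokens": total_tokens,
        })
    return results


if __name__ == "__main__":
    for row in benchmark(TEST_SET, CONFIGS):
        print(row)
```

Each row of the output corresponds to one cell of the benchmark grid, which is what makes side-by-side comparison and regression testing across versions straightforward.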
# Key Use Cases
- Evaluate Agent Performance via Evaluative Metrics such as exact accuracy, groundedness, faithfulness, relevance, and contextual accuracy.
- Conduct Regression Testing across agent versions to detect unintended drops in quality after prompt or configuration changes.
- Compare LLMs and Configurations (e.g., GPT-4 vs. Claude vs. Gemini, temperature, top-p changes) using a visual benchmark grid.
- Test Prompt Variants (A/B Testing) to understand which prompt style produces more reliable or cost-efficient responses.
- Surface Weak Spots in agent behavior through row-level inspection and detailed metrics reporting.
- Optimize Operational Metrics such as token usage, latency, and cost before scaling to production.
- Apply Custom Metrics such as tone, compliance alignment, or brand style consistency using LLM-as-a-Judge capabilities (see the sketch after this list).
- Audit & Track Over Time with exportable, repeatable runs and historical performance summaries for every agent setup.
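For custom metrics such as tone or compliance alignment, LLM-as-a-Judge typically means sending the agent's response to a separate judge model along with a rubric, then parsing the verdict into a score. The sketch below is a hypothetical illustration of that pattern under stated assumptions: `call_judge_model` is a stub, and `JUDGE_PROMPT` and the rubric wording are invented for the example rather than taken from Purple Fabric.

```python
import json

JUDGE_PROMPT = """You are evaluating an AI agent's response against a brand-style rubric.
Rubric: the response must be professional, concise, and free of unsupported claims.
Question: {question}
Response: {response}
Reply with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""


def call_judge_model(prompt: str) -> str:
    """Stub for the judge LLM call; a real implementation would hit a model endpoint."""
    return '{"score": 4, "reason": "Professional and concise, with one vague claim."}'


def judge_response(question: str, response: str) -> dict:
    """Build the judge prompt, call the judge model, and parse its verdict."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    verdict = json.loads(raw)
    # Normalise the 1-5 rubric score into a 0-1 custom metric value.
    return {"score": (verdict["score"] - 1) / 4, "reason": verdict["reason"]}


if __name__ == "__main__":
    print(judge_response(
        "Summarise our refund policy.",
        "Refunds are processed within 5 business days of approval.",
    ))
```

A custom metric defined this way can be scored alongside evaluative and operational metrics in the same benchmark run, so judge-based scores appear in the same grid as accuracy, latency, and cost.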
# Where to Next?
- Try a Quick Start: benchmark a conversational agent.
- Read Concepts Related to Benchmarking: deep dive into Evaluative Metrics, Operational Metrics, LLM-as-a-Judge, and Benchmark Run configuration.