
# Overview

# What is the Benchmarking Module?

The Benchmarking module in Purple Fabric is an evaluation tool that lets teams simulate, compare, and validate the performance of AI agents across models, prompts, and configurations before releasing them to production. It combines Evaluative, Operational, and Custom metrics in a controlled testing environment to surface insights into agent quality, consistency, and efficiency.

# Key Use Cases

  • Evaluate Agent Performance via Evaluative Metrics like exact accuracy, groundedness, faithfulness, relevance, and contextual accuracy (see the sketches after this list).

  • Conduct Regression Testing across agent versions to detect unintended drops in quality after prompt or config changes.

  • Compare LLMs and Configurations (e.g., GPT-4 vs. Claude vs. Gemini, or changes to temperature and top-p) using a visual benchmark grid.

  • Test Prompt Variants (A/B Testing) to understand which prompt style produces more reliable or cost-efficient responses.

  • Surface Weak Spots in agent behavior through row-level inspection and detailed metrics reporting.

  • Optimize Operational Metrics like token usage, latency, and cost before scaling to production.

  • Apply Custom Metrics like tone, compliance alignment, or brand style consistency using LLM-as-a-Judge capabilities.

  • Audit & Track Over Time with exportable, repeatable runs and historical performance summaries for every agent setup.
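
To make the evaluative and operational metrics above concrete, here is a minimal, self-contained sketch of a benchmark run in Python. It assumes nothing about the Purple Fabric API: `AgentConfig`, `call_agent`, and `run_benchmark` are hypothetical names, and the agent call is stubbed so the example runs offline. A real run would replace the stub with a call to your agent or model provider.

```python
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentConfig:
    """Hypothetical agent setup: model name, prompt template, and sampling settings."""
    name: str
    model: str
    prompt_template: str
    temperature: float = 0.0


def call_agent(config: AgentConfig, question: str) -> str:
    """Stub for the real agent call; swap in your agent or provider SDK here."""
    prompt = config.prompt_template.format(question=question)
    # A canned answer keeps the sketch runnable without any external service.
    return "Paris" if "capital of France" in prompt else "unknown"


def run_benchmark(config: AgentConfig, dataset: list[dict]) -> dict:
    """Run one config over a labelled dataset and collect evaluative + operational metrics."""
    latencies, correct, tokens = [], 0, 0
    for row in dataset:
        start = time.perf_counter()
        answer = call_agent(config, row["question"])
        latencies.append(time.perf_counter() - start)
        # Exact accuracy: strict string match against the expected answer.
        correct += int(answer.strip().lower() == row["expected"].strip().lower())
        # Crude token proxy; real runs would use the provider's reported usage.
        tokens += len(answer.split())
    return {
        "config": config.name,
        "exact_accuracy": correct / len(dataset),
        "avg_latency_s": mean(latencies),
        "total_tokens": tokens,
    }


if __name__ == "__main__":
    dataset = [{"question": "What is the capital of France?", "expected": "Paris"}]
    configs = [
        AgentConfig("baseline", "gpt-4", "Answer concisely: {question}"),
        AgentConfig("candidate", "claude-3", "Answer concisely: {question}", temperature=0.2),
    ]
    # Side-by-side results for each configuration support regression testing
    # and prompt A/B comparison, the same pattern a benchmark grid presents visually.
    for cfg in configs:
        print(run_benchmark(cfg, dataset))
```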

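Custom metrics such as tone or compliance alignment are typically scored by a judge model rather than by string comparison. The sketch below is illustrative only: `JUDGE_PROMPT`, `judge_score`, and `tone_metric` are hypothetical names, and the judge call is replaced with a toy heuristic so the example runs without an LLM.

```python
JUDGE_PROMPT = """You are a strict reviewer. Rate the response below from 1 to 5
for adherence to a formal, compliance-friendly tone. Reply with the number only.

Response:
{response}
"""


def judge_score(response: str) -> int:
    """Stub for the judge model call; a real setup would send JUDGE_PROMPT to an LLM
    and parse the numeric rating from its reply."""
    # Toy heuristic standing in for the judge so the sketch runs offline.
    return 5 if "please" in response.lower() else 3


def tone_metric(responses: list[str]) -> float:
    """Average judge rating across a batch of agent responses (a custom metric)."""
    scores = [judge_score(r) for r in responses]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    batch = ["Please find the requested report attached.", "here u go"]
    print(f"tone score: {tone_metric(batch):.2f} / 5")
```

In practice the judge's rating is parsed from the model's reply and averaged across the benchmark dataset like any other metric.
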
# Role-based Quick Links

| Role | Description | Quick Link |
| --- | --- | --- |
| AI Product Owners | View agent performance trends and access explainability reports | Agent Monitoring Dashboard and Explainable AI |
| Compliance & Risk Teams | Access audit logs, review usage history, and validate policy adherence | View Auditability and Explainable AI |
| Data Scientists | Analyze model drift, monitor token spikes, and evaluate latency fluctuations | Agent Monitoring Dashboard |
| Business Stakeholders | Get a high-level overview of agents and key operational metrics at a glance | Agent Monitoring Dashboard |

# Where to Next?

  • Try a Quick Start: Benchmark a conversational agent.

  • Read Concepts Related to Benchmarking: Deep dive into Evaluative Metrics, Operational Metrics, LLM-as-a-Judge, and Benchmark Run configuration.