# Quick Start
Get to Your Agent:
Navigate to Expert Agent Studio, find and open the agent you want to test, and click on Benchmark.

Start Fresh:
You’ll see past benchmarks (if any). To begin a new one, just click Start New. Give it a name that helps you remember what it’s for - something like New Prompt A/B Test.
Set Up Your Tracks:
- Your current agent’s model and prompt will already appear as the first track (column)
- You can add more tracks by clicking the + button – try out different prompts or models
For each track, you can customize (see the sketch after this list):
* Prompt text
* Model
* Model settings (e.g., temperature, top-p)
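
To make the track settings concrete, here is a rough sketch of what one track’s configuration amounts to. The key names and values are illustrative assumptions only; in the studio you set all of this through the UI, not through a file.

```json
// Illustrative sketch only - these key names are assumptions, not the
// studio's actual schema. (Comments are for annotation; plain JSON
// does not allow them.)
{
  "name": "Track B - shorter prompt",
  "model": "gpt-4o",
  "prompt": "You are a support agent. Answer concisely using the provided context.",
  "settings": {
    "temperature": 0.2,
    "top_p": 0.9
  }
}
```

Lower temperature and top-p values make outputs more deterministic, which is usually helpful when comparing tracks because it reduces run-to-run noise.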

Add Inputs and Expected Outputs:
This is what the agent will be tested against.
Option 1: Add Manually
Just click “+ Field” to add a row.
You can then type in:
* Input (user question or query)
* Expected Output (ideal response)
* Optionally, add files as context to support the input query

Option 2: Bulk Upload
Click “Import” and upload a JSON file with multiple inputs and expected outputs. You can download the template for a reference sample of the JSON format.
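
As a rough guide, a bulk-upload file pairs each input with its expected output, along these lines. The key names below are assumptions; download the template from the Import dialog for the authoritative schema.

```json
// Illustrative sketch - key names are assumptions; use the downloadable
// template for the real schema. (Comments are for annotation; plain
// JSON does not allow them.)
[
  {
    "input": "How do I reset my password?",
    "expected_output": "Go to Settings > Security and select Reset Password."
  },
  {
    "input": "Which plans include SSO?",
    "expected_output": "SSO is included in the Enterprise plan."
  }
]
```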

Pick Your Metrics:
Click the little bar graph icon to open Benchmark Settings. You can choose from:
- Evaluative Metrics (e.g., Exact Match, Faithfulness, Relevance)
- Operational Metrics (e.g., token usage, cost)
- Custom Metrics (you can create your own using LLM-as-a-Judge; see the sketch below)

Each metric helps you understand how well your agent is doing.
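For the custom option, an LLM-as-a-Judge metric works by giving a judge model a grading prompt that scores each response, typically against the input and the expected output. The sketch below is an assumption about what such a metric might contain; the real metric is defined through Benchmark Settings.

```json
// Illustrative sketch of an LLM-as-a-Judge metric - field names are
// assumptions; the actual metric is configured in Benchmark Settings.
// (Comments are for annotation; plain JSON does not allow them.)
{
  "name": "Politeness",
  "judge_prompt": "Given the user input, the agent's response, and the expected output, rate from 1 to 5 how polite and professional the agent's response is. Return only the number.",
  "scale": { "min": 1, "max": 5 }
}
```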
Run the Benchmark:
Once everything’s ready (inputs + tracks + metrics), select the rows you want to test and hit Run Prompt.
New: You can click the stop icon to halt a benchmark run if it is taking too long or the execution has gone wrong.

Check and Export Results:
After the run is done, you’ll see performance scores for each row and track.
Simply click Export, select the information you want and the file format, and then download the file.
The 'Summary Results' option is available only when metrics are selected for the export.