This section is designed to help you quickly find clear, actionable guidance on how to use the Benchmark tool effectively.
Refer to the following pages to learn more:
Benchmark Conversational Agents
Benchmark Automation Agents
Configure Benchmarking Metrics
Use Evaluative Metrics
Configure Operational Metrics
Defining Custom Metrics
#Benchmark Conversational Agents
#Manual Setup
Manually enter the Input and Expected Output. This is useful when only a small number of inputs need to be tested
Add a field, which adds a row
Enter the 'Input'
Provide the 'Expected Output' (ideal response) manually
You can toggle ON the 'Auto GT Capture' option to automatically capture the model’s generated output as ground truth whenever you run the prompt. Refer to Automated Ground Truth Capture for Agents to learn more about this feature
Optional: you can add files as context for your input by clicking the 'Add Files' icon
Adding Files as Context
To add files as context to your input,
Click 'Add Files'; the Add Files window appears
Upload files directly by dragging and dropping them, or by browsing your system
You can also click the Document Library icon to select files from the library and add them as context. All uploaded files appear inside the ‘Available Files’ tab
You can reuse already-uploaded files from this tab
A file added to any row is retained in subsequent rows as context for the conversation. For example, File A added to Row 1 is automatically added to the rows below Row 1 unless you explicitly delete it. If File A is then deleted from Row 1, it is automatically removed from all subsequent rows, taking it out of the conversation context
A maximum of 10 files can be added as context to a single row
#Bulk Setup
Import JSON files that contain Input and Expected Output pairs. This is useful when a large number of queries need to be tested
Click on Import
Download the ‘Document Template’ (it illustrates the JSON structure supported by the system)
Prepare the JSON file according to the template
Upload the file by dragging and dropping or browsing through the system
Click Submit
If you need to add multiple files as context during bulk import, include the “doc id” and “doc name” for each file so that the system can attach it as context in the row
Click the Documents tab (next to the Import button)
The Documents dialog box will appear, and you can directly import files here
After the import is complete, an option to download the file list will appear
Click on the ‘Download file list’ button
A CSV file is downloaded. It contains the ‘doc id’ and ‘doc name’ of each uploaded file, which you can reference within the JSON import file
OR
The Documents dialog box will appear, and you can choose ‘Select files from Doc Library’
Doc Library will open
You can select the folder and then select the files for which you want to download the file list
You can also import additional files into the existing folders to update the library and then download the document list
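The import file pairs each Input with its Expected Output and references uploaded files by their doc id and doc name. As a minimal sketch only: the field names below are hypothetical, and the authoritative structure is the ‘Document Template’ you download from the Import dialog, with the doc id / doc name values taken from the downloaded file list.

```python
import json

# Hypothetical structure -- confirm field names against the downloaded
# 'Document Template'. The "doc id" / "doc name" values come from the
# CSV obtained via 'Download file list'.
rows = [
    {
        "input": "What is the refund window for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days of purchase.",
        "documents": [
            {"doc id": "doc-101", "doc name": "refund_policy.pdf"},
        ],
    },
    {
        "input": "Does the refund apply to add-ons?",
        "expected_output": "Add-ons are refundable only if purchased with the plan.",
        # No documents here: per the retention rule above, File A from
        # the previous row stays in context unless explicitly deleted.
        "documents": [],
    },
]

with open("benchmark_import.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```

Remember the 10-files-per-row limit when listing documents for a row.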
#Benchmark Automation Agents
#Manual Setup
Manually enter the Input and Expected Output. This is useful when only a small number of queries need to be tested
Click on + Field, which will add a row within the benchmarking schema
Enter the Input
Provide the Expected Output (ideal response) manually
You can toggle ON the 'Auto GT Capture' option to automatically capture the model’s generated output as ground truth whenever you run the prompt. Refer to Automated Ground Truth Capture for Agents to learn more about this feature
The input and output can have multiple fields, depending on the input and output schema you have defined while building the agent
For multi-occurrence fields, you can click on 'Add Value' to add multiple values for the same field
Note - For each line item added in the output schema, a separate field for expected output will be added.
If you selected ‘File’ as the input data type, click Browse and upload the file
#Bulk Import
Click Import
Download the ‘Document Template’ (it illustrates the JSON structure supported by the system)
Prepare the JSON file according to the template
Upload the file by dragging and dropping or browsing through the system
Click Submit
If ‘File’ is a data type for an Input, add the “doc id” and “doc name” for each input file so that the system can use it as an input. Click the Documents tab (next to the Import button)
The Documents dialog box appears, and you can directly import files here
After the import is complete, an option to download the file list appears
Click the ‘Download file list’ button
A CSV file is downloaded. It contains the ‘doc id’ and ‘doc name’ of each uploaded file, which you can reference within the import file
OR
The Documents dialog box will appear, and you can choose ‘Select files from Doc Library’; the Doc Library opens
You can select the folder and then select the files for which you want to download the file list
You can also import additional files into the existing folders to update the library and then download the document list
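The downloaded file list ties each uploaded file's doc id to its doc name so you can reference files in the import file. As an illustrative sketch only: the CSV headers and the input/output field names below are assumptions; check them against the actual file list and Document Template you download.

```python
import csv
import json

# Stand-in for the CSV obtained via 'Download file list' (assumed
# headers: "doc id" and "doc name" -- verify against the real file).
with open("file_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["doc id", "doc name"])
    writer.writerow(["doc-201", "invoice_march.pdf"])

# Read the file list back and reference each file inside the import row.
with open("file_list.csv", newline="", encoding="utf-8") as f:
    files = [{"doc id": r["doc id"], "doc name": r["doc name"]}
             for r in csv.DictReader(f)]

# Hypothetical row for an agent whose input schema has a 'File' field
# named "invoice" and whose output schema has a "total" field.
row = {
    "input": {"invoice": files[0]},
    "expected_output": {"total": "1,250.00"},
}
print(json.dumps(row, indent=2))
```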
After adding the Input and Expected Output for all the rows you want to include in the benchmark, you can proceed to add different tracks (Model and Prompt Variations) and configure the benchmarking metrics to start the benchmarking run.
These actions are not dependent on any specific order; you can add tracks and set metrics in whichever sequence suits your workflow.
#Configure Benchmarking Metrics
Configure metrics by clicking the Benchmark Settings button (the bar-graph icon) and selecting from three metric types: Operational, Evaluative, and Custom
#Use Evaluative Metrics
You can select the metrics you want to evaluate for each track
You can also expand a metric by clicking it to view its prompt in read-only mode. This helps you understand how the prompt is structured for each OOTB metric so you can select metrics accordingly
You can select one or more metrics depending on how you want to evaluate
Exact Match and Contextual Match evaluate how closely the Agent Response matches the Expected Output
For the Faithfulness metric, you provide the guidelines against which the Agent's response is evaluated: the rules, expectations, or constraints the LLM response must follow. These serve as the reference for the evaluation
Groundedness evaluates how well the Agent Response aligns with the provided context (for cases like RAG or a file as input)
Relevance evaluates how well the response aligns with the system instructions (the prompt)
Each of these Evaluative metrics has a cost associated with it for evaluation
#Configure Operational Metrics
You can select from the operational metrics available.
The ratings metric is selected by default; other metrics can be selected or deselected.
#Defining Custom Metrics (LLM-as-a-Judge)
You can define a custom metric, which you can configure according to the use case.
Click + Create new LLM-as-a-Judge
Enter a name for your metric. This should clearly describe what you’re evaluating (e.g., Clarity, Empathy, Policy Compliance)
Enter the Metric description
Select the LLM model that will be the LLM-as-a-judge
In the Perspective area, write the evaluation prompt that instructs the LLM how to judge the output. Instructions for writing a well-structured prompt are provided in the prompt box
You can use the supported parameters:
Input – the original user query (optional)
Expected Output – ideal response (optional)
Agent Response – actual model output (required)
Context – documents or context used (optional)
You can insert these parameters into the prompt instructions by typing “{” and selecting the appropriate parameter from the drop-down list
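An evaluation prompt typically combines these parameters with your judging criteria. The prompt below is a hypothetical example (the metric and wording are invented); in the product the placeholders are inserted via the “{” drop-down rather than typed literally.

```python
# Hypothetical LLM-as-a-Judge evaluation prompt. Only Agent Response is
# required; Input, Expected Output, and Context are optional parameters.
judge_prompt = (
    "Rate the clarity of the response on a 1-5 scale.\n"
    "User query: {Input}\n"
    "Ideal answer: {Expected Output}\n"
    "Response to evaluate: {Agent Response}\n"
    "Return only the number."
)
print(judge_prompt)
```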
Configure the metric variables for each row run
Provide a name for the metric and the data type to be used for scoring
Number: Use this for rating scales (e.g., 1–5 or 0–100)
Boolean: Use this for Yes/No questions
Text: Use this for qualitative feedback or open-ended comments
Optionally, describe the metric. While not required, this helps both the LLM and human reviewers understand the purpose of the metric
You can add multiple metrics as per your evaluation requirements by clicking the + icon on the right side
Configure the Agent Summary, which is the overall metric aggregated across all rows to assess agent performance for the entire run
Note: Only Number and Boolean metric variables are available for score aggregations (Agent Summary). Average can be aggregated for the Number data type, and ‘Count of’ for the Boolean data type
Click Save
All the saved metrics appear as a list
You can select or deselect specific metrics in benchmarking, with scores updating without triggering additional LLM runs, reducing both cost and latency.
You can select the custom metrics you want to enable for evaluation
The Custom metrics have a cost associated with them.
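The Agent Summary aggregation rules above (Average for Number variables, ‘Count of’ for Boolean variables) can be illustrated with a small sketch; the metric names and per-row scores here are invented for the example and do not reflect any real run.

```python
# Per-row scores a judge might produce for two custom metric variables:
# "clarity" (Number, 1-5 scale) and "policy_compliant" (Boolean).
per_row_scores = [
    {"clarity": 4, "policy_compliant": True},
    {"clarity": 5, "policy_compliant": True},
    {"clarity": 3, "policy_compliant": False},
]

def agent_summary(rows):
    """Aggregate row scores: Average for Number, 'Count of' for Boolean."""
    clarity_avg = sum(r["clarity"] for r in rows) / len(rows)
    compliant_count = sum(1 for r in rows if r["policy_compliant"])
    return {"clarity_avg": clarity_avg,
            "policy_compliant_count": compliant_count}

print(agent_summary(per_row_scores))
# -> {'clarity_avg': 4.0, 'policy_compliant_count': 2}
```

Text-type variables carry open-ended feedback and, as noted above, are not available for aggregation.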