# How To

This section is designed to help you quickly find clear, actionable guidance on how to use the Benchmark tool effectively.

Refer to the following pages to learn more:

  • Benchmark Conversational Agents
  • Benchmark Automation Agents
  • Configure Benchmarking Metrics
  • Use Evaluative Metrics
  • Configure Operational Metrics
  • Defining Custom Metrics

# Benchmark Conversational Agents

# Manual Setup

Manually enter the Input and Expected Output. This approach is useful when a smaller number of inputs needs to be tested.

  1. Add a field, which adds a row
  2. Enter the 'Input'
  3. Provide the 'Expected Output' (ideal response) manually
  4. Optional: add files as context for your input by clicking the 'Add Files' icon

Adding Files as Context
To add files as context to your input:

  1. Click 'Add Files'; the Add Files window will appear
  2. Upload files directly by dragging and dropping, or by browsing through the system
  3. You can also click the Document Library icon to select files from the library and add them as context. All uploaded files appear inside the ‘Available Files’ tab
  4. You can reuse previously uploaded files from this tab
  5. A file added to any row is retained in the subsequent rows as context for the conversation. For example, File A added to Row 1 is automatically added to the rows below Row 1, unless you explicitly delete it. If File A is then deleted from Row 1, it is automatically deleted from the subsequent rows as well, removing it from the context of the conversation
  6. Up to 10 files can be added as context to a single row

# Bulk Setup

Import JSON files that contain the Input and Expected Output pairs. This approach is useful when a large number of queries needs to be tested.

  1. Click on Import
  2. Download the ‘Document Template’ (it illustrates the JSON structure supported by the system)
  3. Prepare the JSON file according to the template
  4. Upload the file by dragging and dropping or browsing through the system
  5. Click on submit

If you need to add multiple files as context during bulk import, add the “doc id” and “doc name” for each file so that the system can attach it as context in the row.

  1. Click the Documents tab (placed alongside the Import button)
  2. The Documents dialog box will appear, and you can directly import files here
  3. After the import is complete, an option to download the file list will appear
  4. Click on the ‘Download file list’ button
  5. A CSV file is downloaded; it contains the ‘doc id’ and ‘doc name’ for each uploaded file, which you can reference within the JSON import file

OR

  1. The Documents dialog box will appear; choose ‘Select files from Doc Library’
  2. Doc Library will open
  3. You can select the folder and then select the files for which you want to download the file list
  4. You can also import additional files into the existing folders to update the library and then download the document list
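The JSON import file pairs each input with its expected output and, where needed, references uploaded files by their doc id and doc name. As a rough sketch of how such a file could be generated (the downloadable Document Template is authoritative; every field name here other than "doc id" and "doc name" is an illustrative assumption):

```python
import json

# Hypothetical import rows. The authoritative structure comes from the
# downloadable 'Document Template'; only "doc id" and "doc name" appear
# in this guide -- the other keys are illustrative assumptions.
rows = [
    {
        "input": "What is the refund window for online orders?",
        "expected_output": "Online orders can be refunded within 14 days of delivery.",
        "context_files": [
            {"doc id": "doc-001", "doc name": "refund_policy.pdf"},
        ],
    },
    {
        "input": "Do gift cards qualify for refunds?",
        "expected_output": "Gift cards are non-refundable.",
        "context_files": [
            {"doc id": "doc-001", "doc name": "refund_policy.pdf"},
        ],
    },
]

# Write the import file to upload via the Import dialog.
with open("benchmark_import.json", "w") as f:
    json.dump(rows, f, indent=2)
```

Because the same "doc id" is repeated in both rows above, the referenced file stays in context for the whole conversation, mirroring the row-retention behavior described for the manual setup.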

# Benchmark Automation Agents

# Manual Setup

Manually enter the Input and Expected Output. This approach is useful when a smaller number of queries needs to be tested.

  1. Click + Field, which adds a row within the benchmarking schema
  2. Enter the Input
  3. Provide the Expected Output (ideal response) manually
  4. The input and output can have multiple fields, depending on the input and output schema you defined while building the agent
  5. For multi-occurrence fields, click 'Add Value' to add multiple values for the same field

Note - For each line item added in the output schema, a separate field for expected output will be added.

If you selected ‘File’ as the data type for an input, click Browse and upload the file.

# Bulk Import

  1. Click Import
  2. Download the ‘Document Template’ (it illustrates the JSON structure supported by the system)
  3. Prepare the JSON file according to the template
  4. Upload the file by dragging and dropping or browsing through the system
  5. Click Submit

If ‘File’ is the data type for an Input, you need to add the “doc id” and “doc name” for each input file so that the system can attach it as an input. Click the Documents tab (placed alongside the Import button).

  1. The Documents dialog box appears, and you can directly import files here
  2. After the import is complete, an option to download the file list appears
  3. Click the ‘Download file list’ button
  4. A CSV file is downloaded; it contains the ‘doc id’ and ‘doc name’ for each uploaded file, which you can reference within the import file

OR

  1. The Documents dialog box will appear; choose ‘Select files from Doc Library’, and the Doc Library opens
  2. You can select the folder and then select the files for which you want to download the file list
  3. You can also import additional files into the existing folders to update the library and then download the document list
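Once the file list is downloaded, its ‘doc name’ and ‘doc id’ columns can be used to reference uploaded files from the import rows. A minimal sketch of that lookup (the column names follow this guide, but the real export's exact layout, and the row shape below, are assumptions):

```python
import csv

# For illustration, write a tiny file list like the CSV the platform
# downloads. Column names 'doc id' / 'doc name' follow this guide; the
# real export's exact layout may differ.
with open("file_list.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["doc id", "doc name"])
    writer.writeheader()
    writer.writerow({"doc id": "doc-101", "doc name": "invoice_001.pdf"})

# Build a name -> id lookup so import rows can reference uploaded files.
with open("file_list.csv", newline="") as f:
    doc_ids = {row["doc name"]: row["doc id"] for row in csv.DictReader(f)}

# Hypothetical import row for a 'File' input referencing the upload.
row = {
    "input": {"doc id": doc_ids["invoice_001.pdf"], "doc name": "invoice_001.pdf"},
    "expected_output": {"total_amount": "1,250.00"},
}
```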

After adding the Input and Expected Output for all the rows you want to include in the benchmark, you can proceed to add different tracks (Model and Prompt Variations) and configure the benchmarking metrics to start the benchmarking run.

These actions are not dependent on any specific order; you can add tracks and set metrics in whichever sequence suits your workflow.

# Configure Benchmarking Metrics

Configure metrics by clicking the Benchmark Settings button (the bar-graph icon) and selecting from three metric types: Operational, Evaluative, and Custom.

# Use Evaluative Metrics

  1. Select the metrics you want to evaluate for each track
  2. You can expand a metric by clicking on it to view its prompt in read-only mode. This helps you understand how the prompt is structured for each out-of-the-box (OOTB) metric and select metrics accordingly
  3. You can select one or more metrics, depending on how you want to evaluate
  4. Exact Match or Contextual Match helps you evaluate the match between the Expected Output and the Agent Response
  5. For the Faithfulness metric, you can provide the guidelines against which you want the Agent's response to be evaluated. Specify the rules, expectations, or constraints the LLM response must follow; this is the reference against which the LLM response will be evaluated
  6. Groundedness helps you evaluate how well the Agent Response aligns with the context provided (for cases like RAG or a file as input)
  7. Relevance helps you evaluate how well the response aligns with the system instructions (the prompt)
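Conceptually, an exact match is a strict string comparison, while a contextual match tolerates rephrasing (and is typically judged by an LLM rather than string logic). A minimal sketch of the exact-match idea, not the platform's actual implementation:

```python
def exact_match(expected: str, response: str) -> bool:
    # Strict comparison after trimming surrounding whitespace only;
    # any rewording of the answer fails this check.
    return expected.strip() == response.strip()

print(exact_match("Paris", " Paris "))                # True
print(exact_match("Paris", "The capital is Paris."))  # False
```

The second call is exactly the case where a contextual match would still score the response as correct, which is why the two metrics are offered side by side.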

# Configure Operational Metrics

You can select from the operational metrics available.

The ratings metric is selected by default; other metrics can be selected or deselected.

# Defining Custom Metrics (LLM-as-a-Judge)

You can define a custom metric and configure it according to your use case.

  1. Click + Create new LLM-as-a-Judge

  2. Enter a name for your metric. This should clearly describe what you’re evaluating (e.g., Clarity, Empathy, Policy Compliance)

  3. Enter the Metric description

  4. Select the LLM model that will be the LLM-as-a-judge

  5. In the perspective area, write the evaluation prompt that instructs the LLM how to judge the output. Guidelines for writing the prompt are provided in the prompt box to help you produce a well-structured evaluation

    You can use the supported parameters:

      • Input – the original user query (optional)
      • Expected Output – ideal response (optional)
      • Agent Response – actual model output (required)
      • Context – documents or context used (optional)

  6. You can use these supported parameters inside the prompt instructions by typing “{” and selecting the appropriate parameter from the drop-down list

  7. Configure the metric variables for each row run

  8. Provide a name for the metric and the data type to be used for scoring:

      • Number: use this for rating scales (e.g., 1–5 or 0–100)
      • Boolean: use this for Yes/No questions
      • Text: use this for qualitative feedback or open-ended comments

  9. Optionally, describe the metric. While not required, this helps both the LLM and human reviewers understand the purpose of the metric

  10. You can add multiple metrics as required for evaluation by clicking the + icon on the right side

  11. Configure the Agent Summary, which is the overall metric aggregated across all rows to assess agent performance for the entire run

    Note: Only Number and Boolean metric variables will be available for adding in score aggregations (Agent Summary). Average can be aggregated for the Number data type, and ‘Count of’ can be aggregated for the Boolean data type

  12. Click Save

  13. All the saved metrics appear as a list

  14. Select the custom metrics you want to enable for evaluation
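The aggregation rules from the Agent Summary note above can be sketched as follows; the metric variable names and per-row scores are hypothetical:

```python
# Per-row results for two hypothetical custom metric variables:
# 'clarity' (Number) and 'policy_compliant' (Boolean).
rows = [
    {"clarity": 4, "policy_compliant": True},
    {"clarity": 5, "policy_compliant": False},
    {"clarity": 3, "policy_compliant": True},
]

# Agent Summary: 'Average' is available for Number variables...
avg_clarity = sum(r["clarity"] for r in rows) / len(rows)

# ...and 'Count of' is available for Boolean variables.
compliant_count = sum(1 for r in rows if r["policy_compliant"])

print(avg_clarity)       # 4.0
print(compliant_count)   # 2
```

Text variables carry no numeric value per row, which is why only Number and Boolean variables can participate in these run-level aggregations.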