# Automated Ground Truth Capture for Agents in Benchmarking

Users can now automatically capture Ground Truth (GT) for conversation and automation agents, with intelligent model selection, granular execution control, and manual override capabilities. Together, these streamline benchmarking workflows and keep evaluation standards consistent.

Key Features:

  • Auto GT Capture Toggle: Enable or disable automatic ground truth capture with a single toggle under the Expected Output column header
  • Intelligent Model Selection: Default GT capture from the first model column, with dropdown selection when multiple models are present
  • Manual Override with Lock: Edit any GT field with automatic lock icon indicator, preventing future auto-capture overwrites for manually edited values
  • Document-Assisted Editing: Side overlay with field-level details and document viewer for easy manual GT refinement with source document reference
  • Model Switch Confirmation: Pop-up confirmation when changing GT source model, with options to recapture or retain existing GT values
  • Clear All on Toggle Off: Disabling Auto GT Capture clears all automatically captured GT values while preserving manually edited entries
  • Granular Execution Control: Independent configuration for LLM Generation and Metric Evaluation with model/metric selection and run scope options (All Rows and Columns, Selected Rows and Columns Only, Run for Empty and Failed Rows)
  • Selective Row/Column Processing: Re-run only missing or selected data to avoid overwriting existing outputs and reduce unnecessary costs

Impact:

  • Dramatically reduces benchmarking setup time by eliminating manual GT entry for every row across multiple models
  • Improves evaluation consistency by automatically capturing GT from a designated source model, reducing human error
  • Provides flexibility for iterative refinement through manual GT editing with automatic lock protection
  • Accelerates model comparison cycles by enabling quick GT recapture when switching source models
  • Reduces operational costs by preventing unnecessary LLM calls and metric evaluations through selective execution
  • Enhances user control with independent LLM generation and metric evaluation toggles for targeted benchmarking runs
  • Simplifies complex benchmarking workflows by supporting partial re-runs for failed or missing data without full regeneration

# Ground Truth (GT) Generation for Conversation Agents

In Conversation Agents, the Expected Output (GT) represents the ideal text response for a specific prompt or query. Instead of manually writing every response, you can use a high-performing "Source Model" to automatically generate and store these baselines.

# Step 1: Enable the Toggle

Locate the Expected Output column header in your benchmarking table.

  • Toggle ON "Auto GT Capture": When enabled, the system will automatically save the selected model's response as the GT for every row in the run.
  • Toggle OFF: No automatic capture occurs. The column will remain empty or retain its previous state, allowing for manual entry.
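
Conceptually, the toggle gates a single write into the Expected Output cell. The sketch below illustrates that behavior; the `BenchmarkRow` type and `maybe_capture_gt` helper are invented for illustration, not the platform's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRow:
    prompt: str
    outputs: dict[str, str] = field(default_factory=dict)  # per-model responses
    expected_output: str | None = None                      # the GT cell
    gt_locked: bool = False                                 # True after a manual edit

def maybe_capture_gt(row: BenchmarkRow, response: str, auto_gt_enabled: bool) -> None:
    # Capture only when the toggle is ON and the row is not manually locked.
    if auto_gt_enabled and not row.gt_locked:
        row.expected_output = response
```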

# Step 2: Select the GT Source Model

In the model configuration panel, you will see a list of available LLMs (e.g., GPT-5, Claude 3.5, Gemini 1.5).

  • Radio Button Selection: Click the radio button next to the model name you wish to use as your "Source of Truth." This model’s outputs will populate the Expected Output column when you run the prompt.

# Step 3: The Auto-Capture Workflow

  1. Add Prompts: Input your conversation queries into the rows of the benchmarking table.
  2. Activate Capture: Ensure the Auto GT Capture toggle is ON.
  3. Select Source: Mark the radio button for your preferred GT Source Model.
  4. Run Prompt: Click the Run Prompt button.
  5. Automatic Population: As the system processes the request, the model’s text response is automatically stored under the Expected Output column for each row.
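
Put together, the run amounts to a loop like the following sketch, which reuses the hypothetical `BenchmarkRow` and `maybe_capture_gt` from Step 1 and assumes a `generate(model_name, prompt)` callable for the LLM call:

```python
def run_prompts(rows, model_names, gt_source, auto_gt_enabled, generate):
    """Run every model against every prompt; the GT source model's
    response is additionally stored as the Expected Output."""
    for row in rows:
        for name in model_names:
            response = generate(name, row.prompt)  # the actual LLM call
            row.outputs[name] = response
            if name == gt_source:
                maybe_capture_gt(row, response, auto_gt_enabled)
```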

# Step 4: Manual Refinement and Locking

You can manually override the auto-captured responses to ensure the highest quality baseline.

  • Edit GT: Click the Edit icon inside the Expected Output text box to modify or rewrite the captured response.
  • Manual Lock: Once a GT is manually edited, a person icon appears beside the box to indicate the lock.
  • Persistence: This GT is now "locked." Future auto-capture runs or model changes will not overwrite this specific row.
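
In terms of the earlier sketch, a manual edit writes the new text and flips the lock flag in one step (names remain illustrative):

```python
def edit_gt(row: BenchmarkRow, new_text: str) -> None:
    row.expected_output = new_text
    row.gt_locked = True  # surfaced as the person icon; auto-capture now skips this row
```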

# Step 5: Model Switch Behavior

If you change the GT Source Model (by selecting a different radio button), a confirmation pop-up will appear:

"Do you want to recapture the Ground Truth (GT) using the new model’s outputs, or keep the existing GT?"

  • Option 1: Recapture GT using new model: The Expected Output column is updated with responses from the new model (excluding manually locked rows).
  • Option 2: Keep existing GT: All existing GTs remain unchanged, regardless of the new model selection.
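
Continuing the sketch, a confirmed recapture replays the new source model's already-stored outputs into the GT column, skipping locked rows:

```python
def switch_gt_source(rows, new_source: str, recapture: bool) -> None:
    if not recapture:  # Option 2: keep existing GT untouched
        return
    for row in rows:   # Option 1: recapture from the new model
        if not row.gt_locked and new_source in row.outputs:
            row.expected_output = row.outputs[new_source]
```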

# Step 6: GT Persistence & Evaluation

Establishing a Ground Truth is critical for the long-term evaluation of your AI agents:

  • Benchmarking Metrics: Once GT (auto or manual) is stored, the platform uses it to automatically calculate similarity and accuracy metrics across all future model runs.
  • Cross-Run Consistency: Stored GTs persist across different configurations and model comparisons, ensuring your evaluation baseline remains stable throughout the project lifecycle.
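
As a rough illustration of what scoring against a stored GT looks like, the toy metric below uses character overlap; real platforms typically use embedding- or LLM-based similarity instead:

```python
from difflib import SequenceMatcher

def similarity(candidate: str, ground_truth: str) -> float:
    """Toy similarity score in [0, 1] based on character overlap."""
    return SequenceMatcher(None, candidate, ground_truth).ratio()

print(similarity("Paris is the capital of France.",
                 "The capital of France is Paris."))  # below 1.0 despite identical meaning
```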

# Ground Truth (GT) Generation for Automation Agents

# Step 1: Setting Up Auto GT Capture

Toggle Auto GT Capture: Located under the Expected Output column header.

  • ON: The system automatically populates the GT column using the selected model's output.
  • OFF (Clear All): Turning the toggle OFF acts as a "Clear All" command. It removes all automatically generated GT values, leaving only your manual edits.
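
Reusing the sketch from the conversation-agent section, the Clear All behavior reduces to one pass that clears unlocked rows only:

```python
def disable_auto_gt(rows) -> None:
    for row in rows:
        if not row.gt_locked:           # manual edits survive
            row.expected_output = None  # auto-captured GT is cleared
```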

# Step 2: Select the GT Source Model

  • Default: If only one model column is present, it is used as the GT source.
  • Dropdown Selection: If multiple model columns are added, use the GT Model Dropdown to select which model’s output should serve as the reference.
  • Model Switch: If you change the source model in the dropdown, a confirmation pop-up will ask if you wish to recapture GTs using the new model.

# Step 3: Manual Editing & Document Viewer

To ensure 100% accuracy, you can manually refine any automated GT value.

  • Edit Mode: Click the edit icon in an Expected Output cell.
  • Side Overlay: A side panel will open displaying:
    • Field Level Details: Specific attributes of the data point.
    • Document Viewer: A built-in viewer showing the source document (PDF/Image) so you can verify values without leaving the screen.
  • Manual Lock: Once edited, a person icon appears. This field is now locked and will never be overwritten by the system, even if you change models or toggle Auto GT OFF/ON.
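
For automation agents, the GT is typically field-level rather than free text. A plausible record shape for one field, with invented attribute names, might look like this:

```python
from dataclasses import dataclass

@dataclass
class GTField:
    name: str                       # e.g. "invoice_number" (hypothetical field)
    value: str                      # the expected extracted value
    locked: bool = False            # person icon: manually edited, never auto-overwritten
    source_page: int | None = None  # where the value appears in the source document
```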

# Step 4: Advanced Run Configuration

When you click Run Prompt, the Configure Evaluation window appears, allowing you to control costs and focus on specific data.

LLM Generation Configuration
Decide which models to run and which cells to process:

  • Model Selection: Choose one or more models (e.g., GPT-5, Claude 3.5).
  • Run Scope:
    • All Rows and Columns: A full refresh; overwrites existing outputs.
    • Run for Selected Rows/Columns: Processes only the cells you have highlighted.
    • Run for Empty and Failed Rows: (previously "Missing Only") Processes only cells without data or those that errored in previous runs.
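
The three scopes amount to different filters over the grid of cells, as in this self-contained sketch (the `Cell` type and scope names are shortened for illustration):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    row_id: int
    output: str | None = None
    failed: bool = False

def cells_in_scope(cells, scope, selected_ids=frozenset()):
    if scope == "all":              # full refresh; overwrites everything
        return list(cells)
    if scope == "selected":         # only highlighted rows/columns
        return [c for c in cells if c.row_id in selected_ids]
    if scope == "empty_or_failed":  # fill gaps without touching good outputs
        return [c for c in cells if c.output is None or c.failed]
    raise ValueError(f"unknown scope: {scope}")
```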

Metric Evaluation Configuration
Decide how to score the outputs against the GT:

  • Metrics Selection: Choose from your saved custom metrics.
  • Run Scope: As with generation, you can score All, Selected, or Missing cells (those where a score hasn't yet been calculated).

# Step 5: Independent Execution

You have the flexibility to run generation and evaluation separately or together via toggle switches in the configuration window:

  • Run Generation Only: Use this to see model outputs before deciding to score them.
  • Run Metrics Only: Use this if you have edited your GTs manually and need to update the scores without re-running the expensive LLM calls.
  • Run Both: The standard end-to-end benchmarking workflow.
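
Under the hood, the two toggles are simply independent stages, which is why re-scoring after a manual GT edit costs nothing in LLM calls. A minimal sketch, with `generate` and `score` passed in as placeholder callables:

```python
def run_benchmark(rows, *, run_generation: bool, run_metrics: bool,
                  generate, score) -> None:
    if run_generation:  # the expensive LLM stage
        for row in rows:
            row.output = generate(row.prompt)
    if run_metrics:     # cheap re-scoring against the stored GT
        for row in rows:
            row.score = score(row.output, row.expected_output)
```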