Evaluate and optimize your agents.

Agents are non-deterministic. A prompt change, a model upgrade, or a new tool can shift behavior in ways unit tests cannot catch. Full trace visibility is the feedback loop that turns guesswork into evidence-based improvement.

Why Observation Matters

Turn traces into business intelligence

Every model call, tool execution, and agent decision leaves a trace. The Harness makes them actionable — so you can control costs, assure quality, and ship improvements with confidence.

Understand Failures

A workflow returned the wrong answer? Follow the trace from run.started through every model call and tool execution. See exactly where reasoning diverged.

Measure Token Efficiency

Track input and output tokens per model call, per agent, per workflow. Identify which agents are expensive, which prompts are bloated, and where streaming reduces cost.
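
As a rough illustration of per-agent token accounting, the sketch below sums token usage from exported model-call spans. The `ModelSpan` shape is a simplified, hypothetical stand-in for whatever your backend exports. The attribute keys follow the OpenTelemetry GenAI semantic conventions; whether the Harness sets `gen_ai.agent.name` on model spans is an assumption here.

```typescript
// Hypothetical, simplified shape of an exported model-call span.
interface ModelSpan {
  attributes: Record<string, string | number | undefined>
}

interface TokenTotals {
  input: number
  output: number
  calls: number
}

// Sum input and output tokens per agent name.
function tokensPerAgent(spans: ModelSpan[]): Map<string, TokenTotals> {
  const totals = new Map<string, TokenTotals>()
  for (const span of spans) {
    // Assumption: the agent name is available on the model span.
    const agent = String(span.attributes['gen_ai.agent.name'] ?? 'unknown')
    const entry = totals.get(agent) ?? { input: 0, output: 0, calls: 0 }
    entry.input += Number(span.attributes['gen_ai.usage.input_tokens'] ?? 0)
    entry.output += Number(span.attributes['gen_ai.usage.output_tokens'] ?? 0)
    entry.calls += 1
    totals.set(agent, entry)
  }
  return totals
}
```

Sorting the result by `input + output` gives a quick list of the most expensive agents.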

Detect Latency Spikes

Model call durations, tool execution times, sandbox command latencies — all histogrammed per provider and operation. Spot regressions after provider or prompt changes.

Compare Providers

Running A/B tests between OpenAI and Anthropic? Traces carry provider name, model, token usage, and finish reasons on every span. Compare objectively with data.

Audit Tool Usage

Which tools does each agent call? How often? With what latency? Tool spans carry name, type, MCP transport, and permission decisions — structured observability data you can pipe into your own audit log.
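
A minimal sketch of such an audit summary, assuming tool spans have been exported into a simple `ToolSpan` record (a hypothetical shape). `gen_ai.tool.name` comes from the GenAI semantic conventions and `harness.permission.decision` is named in this documentation; the decision value `'deny'` is an assumption.

```typescript
// Hypothetical, simplified shape of an exported tool span.
interface ToolSpan {
  durationMs: number
  attributes: Record<string, string | undefined>
}

interface ToolStats {
  calls: number
  avgMs: number
  denied: number
}

// Count calls, average latency, and denials per tool.
function auditTools(spans: ToolSpan[]): Record<string, ToolStats> {
  const totals: Record<string, { calls: number; totalMs: number; denied: number }> = {}
  for (const span of spans) {
    const tool = span.attributes['gen_ai.tool.name'] ?? 'unknown'
    const entry = totals[tool] ?? { calls: 0, totalMs: 0, denied: 0 }
    entry.calls += 1
    entry.totalMs += span.durationMs
    // Assumption: a denied call is recorded with the value 'deny'.
    if (span.attributes['harness.permission.decision'] === 'deny') entry.denied += 1
    totals[tool] = entry
  }
  const report: Record<string, ToolStats> = {}
  for (const [tool, t] of Object.entries(totals)) {
    report[tool] = { calls: t.calls, avgMs: t.totalMs / t.calls, denied: t.denied }
  }
  return report
}
```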

Validate Improvements

Ship a prompt improvement? Compare pre/post trace distributions for accuracy, token count, and latency. Evidence-based iteration beats guesswork.
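
One simple way to compare pre/post distributions is to look at percentile shifts rather than averages. The sketch below uses the nearest-rank method on plain number arrays (for example, latencies in milliseconds or token counts per run). The function names are illustrative and not part of the Harness.

```typescript
// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples')
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
  return sorted[rank - 1]
}

interface DistributionDelta {
  p50Delta: number
  p95Delta: number
}

// Positive deltas mean the "after" runs are slower or larger than "before".
function compareDistributions(before: number[], after: number[]): DistributionDelta {
  return {
    p50Delta: percentile(after, 50) - percentile(before, 50),
    p95Delta: percentile(after, 95) - percentile(before, 95),
  }
}
```

A change that leaves the median untouched but moves p95 usually points at a tail problem, such as occasional long tool calls, rather than a uniform slowdown.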

Trace tree: the run becomes a navigable operations graph.
Trace Structure

Every operation is visible

The Harness emits OpenTelemetry traces aligned with the official GenAI semantic conventions. Every model call, tool execution, and agent decision is a span — so your observability platform understands AI operations natively.

  • harness.session.prompt — the entry point
  • invoke_agent {name} — agent reasoning loop
  • chat gpt-4o-mini — model call with token usage
  • execute_tool search_docs — tool invocation
  • harness.sandbox.exec — command execution
  • harness.state.op — persistence operation
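
To make the hierarchy concrete, the sketch below rebuilds an indented tree from flat span records using parent span IDs. The `SpanRecord` shape is hypothetical, and the nesting in the sample data (model call and tool call as siblings under the agent span) is an assumption for illustration, not a statement of how the Harness always nests spans.

```typescript
// Hypothetical, simplified span record with parent linkage.
interface SpanRecord {
  spanId: string
  parentSpanId?: string
  name: string
}

// Render spans as indented lines, children below their parent.
function renderTree(spans: SpanRecord[]): string[] {
  const children = new Map<string | undefined, SpanRecord[]>()
  for (const span of spans) {
    const siblings = children.get(span.parentSpanId) ?? []
    siblings.push(span)
    children.set(span.parentSpanId, siblings)
  }
  const lines: string[] = []
  const visit = (parentId: string | undefined, depth: number): void => {
    for (const span of children.get(parentId) ?? []) {
      lines.push('  '.repeat(depth) + span.name)
      visit(span.spanId, depth + 1)
    }
  }
  visit(undefined, 0) // root spans have no parent
  return lines
}

const sampleTrace: SpanRecord[] = [
  { spanId: '1', name: 'harness.session.prompt' },
  { spanId: '2', parentSpanId: '1', name: 'invoke_agent analyzer' },
  { spanId: '3', parentSpanId: '2', name: 'chat gpt-4o-mini' },
  { spanId: '4', parentSpanId: '2', name: 'execute_tool search_docs' },
]
```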
Content capture: operational metadata stays on; sensitive content is opt-in.
Privacy Gate

Content is opt-in

By default, prompts and responses are not captured in traces or logs. Events are emitted with content fields set to null. Enable captureContent only for deliberate local diagnostics.

  • captureContent: false — default, privacy-first
  • Events emitted but content fields are null
  • Structured object content omitted unless enabled
  • Enable only for local debugging, never in production
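
The gate described above can be pictured as follows. This is only a sketch of the behavior, not the Harness implementation: the event shapes and the `toTelemetryEvent` function are hypothetical, and only the `captureContent` option name comes from this page.

```typescript
// Hypothetical event shapes for illustration.
interface RawEvent {
  type: string
  runId: string
  content: string
}

interface TelemetryEvent {
  type: string
  runId: string
  content: string | null
}

// Metadata always passes through; content only when explicitly enabled.
function toTelemetryEvent(
  event: RawEvent,
  options: { captureContent?: boolean } = {},
): TelemetryEvent {
  const captureContent = options.captureContent ?? false // privacy-first default
  return {
    type: event.type,
    runId: event.runId,
    content: captureContent ? event.content : null,
  }
}
```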
Debugging

Follow the evidence

When something goes wrong, traces tell the story. Here is the systematic approach to finding and fixing issues.

01
Identify the Run

Get the run_id from the UI, API response, or logs. Every trace is rooted at a session prompt or workflow invocation.

02
Inspect the Trace

Open the trace in CloudGrid, Grafana, or your observability platform. See the full hierarchy: session → workflow → agent → model → tool.

03
Find the Divergence

Check model finish reasons, tool outputs, token counts, and error.type attributes. The span with ERROR status points to the failure.

04
Compare Distributions

Look at token usage and operation.duration histograms. Did a prompt change increase token count? Did a provider switch add latency?

05
Check Tool Usage

Tool spans show which tools were called, with what arguments, and how long they took. MCP spans include transport and auth metadata.

06
Validate the Fix

Re-run a smoke test and compare the new trace to the old one. Confirm the fix resolved the issue without introducing regressions.
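
Step 03 can be partly automated. When an error bubbles up, several ancestor spans may also carry ERROR status, so the interesting span is usually the deepest one. The sketch below picks an ERROR span that has no ERROR child. The `TraceSpan` shape is a hypothetical export format.

```typescript
// Hypothetical, simplified span record with status.
interface TraceSpan {
  spanId: string
  parentSpanId?: string
  name: string
  status: 'UNSET' | 'OK' | 'ERROR'
  attributes: Record<string, string | undefined>
}

// Return the first ERROR span that is not the parent of another ERROR span.
function findRootCause(spans: TraceSpan[]): TraceSpan | undefined {
  const failing = spans.filter((span) => span.status === 'ERROR')
  const failingParents = new Set(failing.map((span) => span.parentSpanId))
  return failing.find((span) => !failingParents.has(span.spanId))
}
```

The returned span's `error.type` attribute and finish reason are then the first things to read.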

Failure Triage

What traces reveal

OperationTimeoutError

The run, model, or tool exceeded its time budget. The trace shows which span hit the limit and how long sibling spans took. Tune the defaults or inspect provider latency.

ValidationError

An input or output did not match its schema. The trace includes the Zod issue path and the offending value. Check the agent output span for the raw model response.

ModelError

The provider returned an HTTP, network, or error response. The model span carries normalized metadata: status, provider type, request id, and body summary. No need to dig through raw SDK logs.

Permission Denials

A tool call was gated by a review request or an explicit deny. The tool span carries harness.permission.mode and harness.permission.decision. Audit who approved what.
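
The triage categories above could be encoded as a small lookup for dashboards or alert annotations. This is a sketch under assumptions: it presumes `error.type` carries the error class name and that a denied tool call records `harness.permission.decision` as `'deny'`. Neither value format is confirmed on this page.

```typescript
type SpanAttributes = Record<string, string | undefined>

// Map a failing span's attributes to a short next-step hint.
function triage(attributes: SpanAttributes): string {
  // Assumption: denials are recorded with the value 'deny'.
  if (attributes['harness.permission.decision'] === 'deny') {
    return 'permission: check harness.permission.mode and who reviewed the call'
  }
  // Assumption: error.type holds the error class name.
  switch (attributes['error.type']) {
    case 'OperationTimeoutError':
      return 'timeout: compare sibling span durations, then tune limits or check provider latency'
    case 'ValidationError':
      return 'validation: read the Zod issue path and the raw model response'
    case 'ModelError':
      return 'provider: read status, provider type, and request id on the model span'
    default:
      return 'unclassified: inspect span events and structured logs'
  }
}
```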
Recommended Stack

One system for traces, logs, metrics, and AI evaluation.

The Harness emits standard OTLP telemetry and works with any OpenTelemetry-compatible backend. For teams building with PURISTA, CloudGrid is the recommended choice — a source-available observability platform that combines traces, logs, metrics, dashboards, alerts, and AI evaluation and optimization in a single system.

Why CloudGrid
Self-hosted, source-available

Your telemetry can stay in your network. CloudGrid is built in the open under Apache 2.0 + Commons Clause terms.

OTel-native from the start

Speaks GenAI semantic conventions and OpenInference. AgentRuns, LlmCalls, and ToolCalls are first-class entities, not parsed log lines.

AI evaluation built in

Build datasets, run evaluations, compare baselines and candidates, explore optimization runs, and keep row-level evidence next to the spans it came from.

PURISTA Harness adapter

CloudGrid ships a puristajs/harness eval adapter out of the box. Harness owns all model-provider calls — CloudGrid never holds provider credentials.

cloudgrid.dev →
Getting connected
01
Start CloudGrid

One command: docker compose up. OTLP HTTP on port 4318, gRPC on 4317.

02
Point the exporter

Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318. The Harness reads standard OTel env vars — no Harness config changes needed.

03
Watch runs appear

Every run, from run.started through run.finished, appears as a distributed waterfall with span attributes, events, and token counts.

04
Add AI evaluation

Configure the Harness eval adapter and start building datasets from your production traces. Score with deterministic rules or judge models. Run prompt experiments.
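
If runs do not show up after step 02, it can help to check which URL the trace exporter will actually post to. The sketch below mirrors the OpenTelemetry OTLP/HTTP rule: a signal-specific endpoint is used as-is, while the generic endpoint gets `/v1/traces` appended. `resolveTracesUrl` is an illustrative helper, not a Harness function, and it assumes the Harness uses a standard OTLP/HTTP exporter.

```typescript
// Derive the OTLP/HTTP traces URL from standard OTel environment variables.
function resolveTracesUrl(env: Record<string, string | undefined>): string {
  const specific = env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT']
  if (specific) return specific // signal-specific endpoints are used unchanged
  const base = env['OTEL_EXPORTER_OTLP_ENDPOINT'] ?? 'http://localhost:4318'
  // The generic endpoint is a base URL; the traces path is appended to it.
  return base.replace(/\/+$/, '') + '/v1/traces'
}
```

In Node, calling `resolveTracesUrl(process.env)` at startup and logging the result is a cheap sanity check.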

Operations

Runbook for operators

Structured logs, traces, and verification gates turn incident response from guesswork into a repeatable process.

Verification Gates

Run these checks before every deployment or environment promotion.

```bash
npm run lint
npm run typecheck
npm test
npm run test:contracts
npm run test:integration
npm run test:failure
npm run build
```
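
For CI, the same gates can be run in order and stopped at the first failure. The sketch below keeps the command runner injectable so the logic can be tested without spawning processes. `runGates` and the `Exec` type are illustrative names, not part of the Harness.

```typescript
// Runs a command and resolves with its exit code.
type Exec = (command: string) => Promise<number>

const verificationGates: string[] = [
  'npm run lint',
  'npm run typecheck',
  'npm test',
  'npm run test:contracts',
  'npm run test:integration',
  'npm run test:failure',
  'npm run build',
]

interface GateResult {
  passed: string[]
  failed?: string
}

// Run gates sequentially; stop at the first non-zero exit code.
async function runGates(commands: string[], exec: Exec): Promise<GateResult> {
  const passed: string[] = []
  for (const command of commands) {
    const exitCode = await exec(command)
    if (exitCode !== 0) return { passed, failed: command }
    passed.push(command)
  }
  return { passed }
}
```

In a real pipeline, `exec` would wrap a process spawner such as Node's `child_process.spawn`. Failing fast keeps the cheap checks (lint, typecheck) ahead of the slow ones.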
Service Readiness

Before exposing a harness-backed service to traffic:

  • Verify session creation and direct agent invocation
  • Verify every workflow entrypoint
  • Verify tool and MCP failures map to harness errors
  • Verify cancellation and timeout behavior
  • Verify harness.shutdown() closes adapters cleanly
Recovery steps
01
Identify the run_id from the UI, logs, or API response.

02
Inspect the structured logs for the matching run_id.

03
Inspect the trace in your observability platform if a traceId is present.

04
Check the final run.finished event for normalized error metadata.

05
Fix the provider, tool, or configuration issue.

06
Re-run a smoke test, then call harness.shutdown() on process exit.
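
The last recovery step can be wired up once at startup. The sketch below assumes only that the harness object exposes `shutdown()` returning a promise, as suggested by `harness.shutdown()` on this page. The `Shutdownable` and `Register` types and `installShutdownHook` are hypothetical helpers.

```typescript
// Assumption: harness.shutdown() returns a promise.
interface Shutdownable {
  shutdown(): Promise<void>
}

// Registers a handler for a process signal (injected for testability).
type Register = (signal: string, handler: () => void) => void

function installShutdownHook(
  harness: Shutdownable,
  register: Register,
  signals: string[] = ['SIGINT', 'SIGTERM'],
): void {
  let closing: Promise<void> | undefined
  for (const signal of signals) {
    register(signal, () => {
      // Idempotent: a second signal does not trigger a second shutdown.
      if (!closing) {
        closing = harness.shutdown().catch((error) => {
          console.error('harness shutdown failed', error)
        })
      }
    })
  }
}
```

In Node, the `register` argument could be `(signal, handler) => process.once(signal, handler)`.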

Extending

Add your own dimensions

The Harness emits standard spans. Your application can add custom attributes, events, and baggage to enrich traces with business context.

Custom Span Attributes

Add tenant IDs, user segments, experiment flags, or cost centers to every span via OTel baggage and span processors.

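```typescript
import { context, propagation } from '@opentelemetry/api'

// Attach business dimensions as baggage on a copy of the active context.
const ctx = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({
    'tenant.id': { value: 'acme-corp' },
    'experiment.variant': { value: 'site' },
  })
)

// Baggage only applies to work that runs inside the new context.
context.with(ctx, () => {
  // Invoke the Harness here so its spans are created under this context.
})
```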
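
Baggage travels with the context but does not appear on spans by itself; a span processor has to copy it over. The helper below sketches that copy step with an allowlist, so only intended keys become span attributes. The `BaggageEntries` type mirrors the `[key, { value }]` pairs returned by `Baggage.getAllEntries()` in `@opentelemetry/api`; `baggageToAttributes` itself is a hypothetical helper.

```typescript
// Mirrors the entries returned by Baggage.getAllEntries().
type BaggageEntries = Array<[string, { value: string }]>

// Copy allowlisted baggage entries into a flat attribute map.
function baggageToAttributes(
  entries: BaggageEntries,
  allowedKeys: ReadonlySet<string>,
): Record<string, string> {
  const attributes: Record<string, string> = {}
  for (const [key, entry] of entries) {
    if (allowedKeys.has(key)) attributes[key] = entry.value
  }
  return attributes
}
```

Inside a custom `SpanProcessor`, `onStart(span, parentContext)` could read `propagation.getBaggage(parentContext)?.getAllEntries() ?? []`, pass it through this helper, and call `span.setAttributes(...)` with the result. The allowlist matters because baggage is propagated to downstream services and may contain keys you do not want indexed on every span.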
Workflow-Level Context

Pass business context through workflow handlers. The Harness propagates the active OTel context through every agent and tool call.

```typescript
.workflows(({ workflow }) => ({
  analyze: workflow({
    handler: async (ctx) => {
      // All agent and tool spans inside
      // this workflow inherit the context
      const result = await ctx.agents.analyzer(ctx.input)
      return result
    },
  }),
}))
```

Observe, understand, improve.

Traces are the feedback loop that turns guesswork into evidence. Start with defaults, connect CloudGrid or any OTel-compatible backend, and iterate with confidence.