Evaluate and optimize your agents.

Agents are non-deterministic. A prompt change, a model upgrade, or a new tool can shift behavior in ways unit tests cannot catch. Full trace visibility is the feedback loop that turns guesswork into evidence-based improvement.

Why Observation Matters

Turn traces into business intelligence

Every model call, tool execution, and agent decision leaves a trace. The Harness makes them actionable — so you can control costs, assure quality, and ship improvements with confidence.

Understand Failures

A workflow returned the wrong answer? Follow the trace from run.started through every model call and tool execution. See exactly where reasoning diverged.

Measure Token Efficiency

Track input and output tokens per model call, per agent, per workflow. Identify which agents are expensive, which prompts are bloated, and where streaming reduces cost.
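As a rough illustration of per-agent token accounting, the sketch below sums token usage over exported model-call spans. The `ModelCallSpan` shape and the `tokensByAgent` helper are hypothetical; the attribute keys follow the OpenTelemetry GenAI semantic conventions, and the sketch assumes the agent name has been copied onto each model-call span.

```typescript
// Hypothetical shape of an exported model-call span. Only the attribute keys
// are taken from the OpenTelemetry GenAI semantic conventions.
interface ModelCallSpan {
  attributes: Record<string, string | number | undefined>;
}

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  calls: number;
}

// Sum input and output tokens per agent name. Spans without an agent name
// are grouped under "unknown"; missing token counts are treated as 0.
function tokensByAgent(spans: ModelCallSpan[]): Record<string, TokenTotals> {
  const totals: Record<string, TokenTotals> = {};
  for (const span of spans) {
    const agent = String(span.attributes['gen_ai.agent.name'] ?? 'unknown');
    const entry = totals[agent] ?? { inputTokens: 0, outputTokens: 0, calls: 0 };
    entry.inputTokens += Number(span.attributes['gen_ai.usage.input_tokens'] ?? 0);
    entry.outputTokens += Number(span.attributes['gen_ai.usage.output_tokens'] ?? 0);
    entry.calls += 1;
    totals[agent] = entry;
  }
  return totals;
}
```

Sorting the result by `inputTokens` is usually enough to surface the agents with the heaviest prompts.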

Detect Latency Spikes

Model call durations, tool execution times, sandbox command latencies — all histogrammed per provider and operation. Spot regressions after provider or prompt changes.
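If you export raw durations instead of relying on backend histograms, a per-provider percentile is one way to spot a regression. This is a minimal sketch using the nearest-rank method; the `DurationSample` shape and both helper names are assumptions made for illustration.

```typescript
// Hypothetical flattened duration record, one per model call, tool call,
// or sandbox command.
interface DurationSample {
  provider: string;
  operation: string;
  durationMs: number;
}

// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Group samples by "provider/operation" and compute the p95 of each group.
function p95ByKey(samples: DurationSample[]): Record<string, number> {
  const groups: Record<string, number[]> = {};
  for (const s of samples) {
    const key = `${s.provider}/${s.operation}`;
    const bucket = groups[key] ?? [];
    bucket.push(s.durationMs);
    groups[key] = bucket;
  }
  const result: Record<string, number> = {};
  for (const key of Object.keys(groups)) {
    result[key] = percentile(groups[key] ?? [], 95);
  }
  return result;
}
```

Comparing the output before and after a provider or prompt change gives a quick regression signal without a dashboard.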

Compare Providers

Running A/B tests between OpenAI and Anthropic? Traces carry provider name, model, token usage, normalized finish reason, and raw provider outcome metadata. Compare objectively with data.

Respect Rate Limits

Model failures include retry kind, retry-after delay, attempt count, and rate-limit scope when providers expose it. Short waits stay inside the call budget; long waits are visible to queues and workers.
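The split between short and long waits can be pictured as a small decision function. This is a sketch only: the `RetryInfo` fields, the `decideRetry` helper, and the exponential backoff fallback are assumptions for illustration, not the Harness's actual types or policy.

```typescript
// Hypothetical normalized retry metadata from a failed model call.
interface RetryInfo {
  retryKind: 'none' | 'backoff' | 'retry-after';
  retryAfterMs?: number; // present when the provider exposes a delay
  attempt: number;       // 1-based attempt count
}

type RetryDecision =
  | { action: 'wait-inline'; delayMs: number }
  | { action: 'defer-to-queue'; delayMs: number }
  | { action: 'fail' };

// Wait inline when the delay fits the remaining call budget; otherwise hand
// the delay to a queue or worker so it stays visible outside the call.
function decideRetry(
  info: RetryInfo,
  remainingBudgetMs: number,
  maxAttempts: number,
): RetryDecision {
  if (info.retryKind === 'none' || info.attempt >= maxAttempts) {
    return { action: 'fail' };
  }
  // Assumed fallback: exponential backoff capped at 30 seconds.
  const fallbackMs = Math.min(1000 * Math.pow(2, info.attempt - 1), 30000);
  const delayMs = info.retryAfterMs ?? fallbackMs;
  return delayMs <= remainingBudgetMs
    ? { action: 'wait-inline', delayMs }
    : { action: 'defer-to-queue', delayMs };
}
```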

Audit Tool Usage

Which tools does each agent call? How often? With what latency? Tool spans carry name, type, MCP transport, and permission decisions — structured observability data you can pipe into your own audit log.

Validate Improvements

Ship a prompt improvement? Compare pre/post trace distributions for accuracy, token count, and latency. Evidence-based iteration beats guesswork.
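A pre/post comparison can start with something as simple as a median shift. The helpers below are a hypothetical sketch; in practice the input arrays would come from whatever metric you export per run, such as token counts or latencies.

```typescript
// Median of a list of numbers; NaN for an empty list.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? (sorted[mid] as number)
    : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

interface MedianShift {
  before: number;
  after: number;
  relativeChange: number; // 0.25 means the median rose by 25%
}

// Compare the medians of a metric before and after a change.
function compareMedians(before: number[], after: number[]): MedianShift {
  const b = median(before);
  const a = median(after);
  return { before: b, after: a, relativeChange: (a - b) / b };
}
```

Medians are less sensitive to a few outlier runs than means, which matters when agent runs vary widely in length.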

Trace tree: the run becomes a navigable operations graph.

Trace Structure

Every operation is visible

The Harness emits OpenTelemetry traces aligned with the official GenAI semantic conventions. Every model call, tool execution, and agent decision is a span — so your observability platform understands AI operations natively.

harness.session.prompt — the entry point
invoke_agent {name} — agent reasoning loop
chat gpt-4o-mini — model call with token usage
execute_tool search_docs — tool invocation
harness.sandbox.exec — command execution
harness.state.op — persistence operation
harness.runtime.* — durable lease, checkpoint, resume
harness.context_checkpoint.* — workflow context snapshots
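To make the parent/child structure concrete, the sketch below turns a flat list of exported spans into an indented tree. The `FlatSpan` shape and `renderTree` helper are hypothetical, and the nesting depends on how your backend exports parent span ids.

```typescript
// Hypothetical flat span record, as most trace backends export them.
interface FlatSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
}

// Render spans as indented lines, two spaces per nesting level.
// Spans whose parent is absent from the list are treated as roots.
function renderTree(spans: FlatSpan[]): string[] {
  const known: Record<string, boolean> = {};
  for (const s of spans) known[s.spanId] = true;

  const children: Record<string, FlatSpan[]> = {};
  const roots: FlatSpan[] = [];
  for (const s of spans) {
    if (s.parentSpanId !== undefined && known[s.parentSpanId]) {
      const siblings = children[s.parentSpanId] ?? [];
      siblings.push(s);
      children[s.parentSpanId] = siblings;
    } else {
      roots.push(s);
    }
  }

  const lines: string[] = [];
  const walk = (span: FlatSpan, depth: number): void => {
    lines.push('  '.repeat(depth) + span.name);
    for (const child of children[span.spanId] ?? []) walk(child, depth + 1);
  };
  for (const root of roots) walk(root, 0);
  return lines;
}
```

Given a session span with one agent span that made a model call and a tool call, the output has `harness.session.prompt` at depth 0, the `invoke_agent` span at depth 1, and the `chat` and `execute_tool` spans at depth 2.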

Content capture: operational metadata stays on; sensitive content is opt-in.

Privacy Gate

Content is opt-in

By default, prompts and responses are not captured in traces or logs. Events are emitted with content fields set to null. Widen contentCaptureMode only for deliberate local diagnostics.

contentCaptureMode: 'NO_CONTENT' — default, privacy-first
Events emitted but content fields are null
Structured object content omitted unless enabled
Enable only for local debugging, never in production
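One way to enforce the "never in production" rule is a small guard in your own configuration code. The helper below is hypothetical and not part of the Harness. Only `'NO_CONTENT'` comes from this page; any wider mode is treated as an opaque string because its name is not given here.

```typescript
// The only capture mode named on this page.
const DEFAULT_CAPTURE_MODE = 'NO_CONTENT';

// Hypothetical guard: fall back to NO_CONTENT when nothing was requested or
// when the process runs in production, regardless of the requested mode.
function resolveContentCaptureMode(
  requested: string | undefined,
  nodeEnv: string | undefined,
): string {
  if (requested === undefined || requested === '') return DEFAULT_CAPTURE_MODE;
  if (nodeEnv === 'production') return DEFAULT_CAPTURE_MODE;
  return requested;
}
```

The resolved value would then be passed as `contentCaptureMode` wherever your Harness configuration is built.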

Debugging

Follow the evidence

When something goes wrong, traces tell the story. Here is the systematic approach to finding and fixing issues.

Identify the Run

Get the run_id from the UI, API response, or logs. Every trace is rooted at a session prompt or workflow invocation.

Inspect the Trace

Open the trace in CloudGrid, Grafana, or your observability platform. See the full hierarchy: session → workflow → agent → model → tool.

Find the Divergence

Check model finish reasons, tool outputs, token counts, and error.type attributes. The span with ERROR status points to the failure.
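Because an error usually propagates upward, parent spans often carry ERROR status too. A useful first filter keeps only the innermost failing spans, meaning ERROR spans with no ERROR child. The `TraceSpan` shape and `innermostErrors` helper are assumptions for illustration.

```typescript
// Hypothetical exported span with status and attributes.
interface TraceSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  status: 'UNSET' | 'OK' | 'ERROR';
  attributes: Record<string, string | number | boolean | undefined>;
}

// Return ERROR spans that have no ERROR child. These are the places where
// the failure originated rather than where it was merely propagated.
function innermostErrors(spans: TraceSpan[]): TraceSpan[] {
  const hasFailingChild: Record<string, boolean> = {};
  for (const s of spans) {
    if (s.status === 'ERROR' && s.parentSpanId !== undefined) {
      hasFailingChild[s.parentSpanId] = true;
    }
  }
  return spans.filter((s) => s.status === 'ERROR' && !hasFailingChild[s.spanId]);
}
```

Reading the `error.type` attribute of the returned spans is typically the fastest route to the failure class.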

Compare Distributions

Look at token usage and operation.duration histograms. Did a prompt change increase token count? Did a provider switch add latency?

Check Tool Usage

Tool spans show which tools were called, with what arguments, and how long they took. MCP spans include transport and auth metadata.

Validate the Fix

Re-run a smoke test and compare the new trace to the old one. Confirm the fix resolved the issue without introducing regressions.

Failure Triage

What traces reveal

OperationTimeoutError

The run, model, or tool exceeded its budget. The trace shows which span hit the limit and how long sibling spans took. Tune defaults or inspect provider latency.

ValidationError

Input or output schema mismatch. The trace includes the Zod issue path and the offending value. Check the agent output span for the raw model response.

ModelError

Provider HTTP/network/error response. The model span carries normalized metadata: status, provider type, request id, and body summary. No need to dig through raw SDK logs.

Permission Denials

A tool call was gated by a review request or explicit deny. The tool span carries harness.permission.mode and harness.permission.decision. Audit who approved what.

Recommended Stack

One system for traces, logs, metrics, and AI evaluation.

The Harness emits standard OTLP telemetry and works with any OpenTelemetry-compatible backend. For teams building with PURISTA, CloudGrid is the recommended choice — a source-available observability platform that combines traces, logs, metrics, dashboards, alerts, and AI evaluation and optimization in a single system.

Why CloudGrid

Self-hosted, source-available

Your telemetry can stay in your network. CloudGrid is built in the open under Apache 2.0 + Commons Clause terms.

OTel-native from the start

Speaks GenAI semantic conventions and OpenInference. AgentRuns, LlmCalls, and ToolCalls are first-class entities, not parsed log lines.

AI evaluation built in

Build datasets, run evaluations, compare baselines and candidates, explore optimization runs, and keep row-level evidence next to the spans it came from.

PURISTA Harness adapter

CloudGrid ships a puristajs/harness eval adapter out of the box. Harness owns all model-provider calls — CloudGrid never holds provider credentials.

cloudgrid.dev →

Getting connected

Start CloudGrid

One command: docker compose up. OTLP HTTP on port 4318, gRPC on 4317.

Point the exporter

Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318. The Harness reads standard OTel env vars — no Harness config changes needed.
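For reference, this is roughly how an OTLP/HTTP exporter derives the traces URL from the standard environment variables, per the OpenTelemetry exporter specification: a signal-specific endpoint is used as given, while the generic endpoint gets `/v1/traces` appended. The `resolveTracesUrl` helper is a hypothetical sketch, not Harness code.

```typescript
type Env = Record<string, string | undefined>;

// Resolve the OTLP/HTTP traces URL from standard OTel environment variables.
function resolveTracesUrl(env: Env): string {
  // A signal-specific endpoint is used exactly as configured.
  const specific = env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT'];
  if (specific) return specific;
  // The generic endpoint is a base URL; the signal path is appended.
  const base = env['OTEL_EXPORTER_OTLP_ENDPOINT'] || 'http://localhost:4318';
  return base.replace(/\/+$/, '') + '/v1/traces';
}
```

With `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`, spans are posted to `http://localhost:4318/v1/traces`, which is handy to know when testing connectivity by hand.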

Watch runs appear

Every run.started through run.finished appears as a distributed waterfall with span attributes, events, and token counts.

Add AI evaluation

Configure the eval harness adapter and start building datasets from your production traces. Score with deterministic rules or judge models. Run prompt experiments.

Operations

Runbook for operators

Structured logs, traces, and verification gates turn incident response from guesswork into a repeatable process.

Verification Gates

Run these checks before every deployment or environment promotion.

npm run lint
npm run typecheck
npm test
npm run test:contracts
npm run test:integration
npm run test:failure
npm run build

Service Readiness

Before exposing a harness-backed service to traffic:

Verify session creation and direct agent invocation
Verify every workflow entrypoint
Verify tool and MCP failures map to harness errors
Verify cancellation and timeout behavior
Verify harness.shutdown() closes adapters cleanly

Recovery steps

Identify run_id from the UI, logs, or API response

Inspect structured logs for the matching run_id

Inspect the trace in your observability platform if a traceId is present

Check the final run.finished event for normalized error metadata

Fix provider, tool, or configuration issue

Re-run a smoke test, then call harness.shutdown() on process exit
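Calling `harness.shutdown()` on process exit can be wired up as below. This is a sketch under two assumptions: that `shutdown()` returns a promise, and that the signal source behaves like Node's `process` (which satisfies the minimal `SignalSource` interface used here). The `shutdownOnSignals` helper is hypothetical.

```typescript
// Minimal structural types so the sketch does not depend on Harness typings.
interface Shutdownable {
  shutdown(): Promise<void>; // assumption: shutdown is asynchronous
}

interface SignalSource {
  once(signal: string, handler: () => void): unknown;
}

// Call harness.shutdown() exactly once on the first termination signal.
// The returned promise settles when shutdown has completed or failed.
function shutdownOnSignals(
  harness: Shutdownable,
  source: SignalSource,
  signals: string[] = ['SIGINT', 'SIGTERM'],
): Promise<void> {
  let started = false;
  return new Promise<void>((resolve, reject) => {
    for (const signal of signals) {
      source.once(signal, () => {
        if (started) return; // ignore repeated signals
        started = true;
        harness.shutdown().then(resolve, reject);
      });
    }
  });
}

// Possible usage in a Node service:
//   shutdownOnSignals(harness, process).then(() => process.exit(0));
```

Guarding against repeated signals matters because operators often press Ctrl+C twice, and closing adapters twice may not be safe.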

Extending

Add your own dimensions

The Harness emits standard spans. Your application can add custom attributes, events, and baggage to enrich traces with business context.

Custom Span Attributes

Add tenant IDs, user segments, experiment flags, or cost centers to every span via OTel baggage and span processors.

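import { context, propagation } from '@opentelemetry/api'

// Attach business dimensions as baggage on top of the currently active context.
const ctx = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({
    'tenant.id': { value: 'acme-corp' },
    'experiment.variant': { value: 'site' },
  })
)

// Baggage only applies to work that runs inside this context.
context.with(ctx, () => {
  // invoke the agent or workflow here
})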
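Baggage does not become span attributes on its own; a span processor has to copy it. The core of such a processor is sketched below as a pure function, with an allow-list so only intended keys end up on spans. The `baggageToAttributes` helper and the allow-list are assumptions made for illustration; the `BaggagePair` type mirrors what `Baggage.getAllEntries()` returns in `@opentelemetry/api`.

```typescript
// Structural stand-in for the [key, entry] pairs returned by
// Baggage.getAllEntries() in @opentelemetry/api.
type BaggagePair = [string, { value: string }];

// Copy allow-listed baggage entries into a flat attribute map.
function baggageToAttributes(
  entries: BaggagePair[],
  allowedKeys: string[],
): Record<string, string> {
  const attributes: Record<string, string> = {};
  for (const [key, entry] of entries) {
    if (allowedKeys.indexOf(key) !== -1) {
      attributes[key] = entry.value;
    }
  }
  return attributes;
}

// Inside a SpanProcessor's onStart(span, parentContext), this would be used as:
//   const entries = propagation.getBaggage(parentContext)?.getAllEntries() ?? [];
//   span.setAttributes(baggageToAttributes(entries, ['tenant.id', 'experiment.variant']));
```

The allow-list keeps unrelated baggage set by upstream services from leaking into your telemetry.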

Workflow-Level Context

Pass business context through workflow handlers. The Harness propagates the active OTel context through every agent and tool call.