Evaluate and optimize your agents.
Agents are non-deterministic. A prompt change, a model upgrade, or a new tool can shift behavior in ways unit tests cannot catch. Full trace visibility is the feedback loop that turns guesswork into evidence-based improvement.
Turn traces into business intelligence
Every model call, tool execution, and agent decision leaves a trace. The Harness makes them actionable — so you can control costs, assure quality, and ship improvements with confidence.
A workflow returned the wrong answer? Follow the trace from run.started through every model call and tool execution. See exactly where reasoning diverged.
Track input and output tokens per model call, per agent, per workflow. Identify which agents are expensive, which prompts are bloated, and where streaming reduces cost.
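A minimal sketch of the per-agent roll-up, assuming spans have already been exported as plain objects. The `SpanRecord` shape and the `tokensByAgent` helper are hypothetical. The attribute keys follow the OpenTelemetry GenAI semantic conventions, but whether the Harness sets `gen_ai.agent.name` on model-call spans is an assumption here.

```typescript
// Hypothetical, simplified span shape; real exporter output will differ.
interface SpanRecord {
  name: string;
  attributes: Record<string, string | number | undefined>;
}

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  calls: number;
}

// Sum token usage per agent, skipping spans without token attributes.
function tokensByAgent(spans: SpanRecord[]): Map<string, TokenTotals> {
  const totals = new Map<string, TokenTotals>();
  for (const span of spans) {
    const input = span.attributes["gen_ai.usage.input_tokens"];
    const output = span.attributes["gen_ai.usage.output_tokens"];
    if (typeof input !== "number" || typeof output !== "number") continue;
    // Assumption: the agent name is present on the model-call span itself.
    const agent = String(span.attributes["gen_ai.agent.name"] ?? "unknown");
    const entry = totals.get(agent) ?? { inputTokens: 0, outputTokens: 0, calls: 0 };
    entry.inputTokens += input;
    entry.outputTokens += output;
    entry.calls += 1;
    totals.set(agent, entry);
  }
  return totals;
}

// Illustrative data only, not real measurements.
const sampleSpans: SpanRecord[] = [
  { name: "chat gpt-4o-mini", attributes: { "gen_ai.agent.name": "analyzer", "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300 } },
  { name: "chat gpt-4o-mini", attributes: { "gen_ai.agent.name": "analyzer", "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 150 } },
  { name: "chat gpt-4o-mini", attributes: { "gen_ai.agent.name": "summarizer", "gen_ai.usage.input_tokens": 400, "gen_ai.usage.output_tokens": 90 } },
  { name: "execute_tool search_docs", attributes: {} }, // no token attributes, skipped
];

tokensByAgent(sampleSpans).forEach((usage, agent) => {
  console.log(agent, usage.inputTokens, usage.outputTokens, usage.calls);
});
```

Sorting the resulting totals by input tokens is usually enough to surface the agents with bloated prompts.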
Model call durations, tool execution times, sandbox command latencies — all histogrammed per provider and operation. Spot regressions after provider or prompt changes.
Running A/B tests between OpenAI and Anthropic? Traces carry provider name, model, token usage, and finish reasons on every span. Compare objectively with data.
Which tools does each agent call? How often? With what latency? Tool spans carry name, type, MCP transport, and permission decisions — structured observability data you can pipe into your own audit log.
Ship a prompt improvement? Compare pre/post trace distributions for accuracy, token count, and latency. Evidence-based iteration beats guesswork.
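One way to compare pre/post distributions is to look at the shift in median and tail percentiles. The sketch below uses the nearest-rank percentile definition on plain arrays of per-run token counts. The helper names and the sample numbers are illustrative assumptions, not part of the Harness.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1]!;
}

interface DistributionShift {
  p50Delta: number; // candidate median minus baseline median
  p95Delta: number; // candidate p95 minus baseline p95
}

// Negative deltas mean the candidate uses fewer tokens (or less time).
function compareRuns(baseline: number[], candidate: number[]): DistributionShift {
  return {
    p50Delta: percentile(candidate, 50) - percentile(baseline, 50),
    p95Delta: percentile(candidate, 95) - percentile(baseline, 95),
  };
}

// Illustrative token counts per run, before and after a prompt change.
const baselineTokens = [100, 120, 140, 160, 180];
const candidateTokens = [90, 100, 110, 120, 200];
console.log(compareRuns(baselineTokens, candidateTokens));
```

In this made-up sample the median improves while the tail gets worse, which is the kind of trade-off an average alone would hide.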
Every operation is visible
The Harness emits OpenTelemetry traces aligned with the official GenAI semantic conventions. Every model call, tool execution, and agent decision is a span — so your observability platform understands AI operations natively.
- harness.session.prompt — the entry point
- invoke_agent {name} — agent reasoning loop
- chat gpt-4o-mini — model call with token usage
- execute_tool search_docs — tool invocation
- harness.sandbox.exec — command execution
- harness.state.op — persistence operation
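To make the parent/child relationship between these spans concrete, the sketch below renders a flat list of spans as an indented outline. The `SpanNode` shape is a hypothetical simplification, and the nesting in the sample data is an assumption about how the span names above relate to each other.

```typescript
// Hypothetical minimal span shape: just identity, parent link, and name.
interface SpanNode {
  spanId: string;
  parentSpanId?: string;
  name: string;
}

// Render spans as an indented outline, roots first, children in input order.
// Spans whose parent is missing from the input are not rendered.
function renderOutline(spans: SpanNode[]): string[] {
  const children = new Map<string | undefined, SpanNode[]>();
  for (const span of spans) {
    const siblings = children.get(span.parentSpanId) ?? [];
    siblings.push(span);
    children.set(span.parentSpanId, siblings);
  }
  const lines: string[] = [];
  const visit = (parentId: string | undefined, depth: number): void => {
    for (const span of children.get(parentId) ?? []) {
      lines.push("  ".repeat(depth) + span.name);
      visit(span.spanId, depth + 1);
    }
  };
  visit(undefined, 0);
  return lines;
}

// Assumed nesting for illustration: session prompt -> agent -> model call / tool.
const outlineSpans: SpanNode[] = [
  { spanId: "a", name: "harness.session.prompt" },
  { spanId: "b", parentSpanId: "a", name: "invoke_agent analyzer" },
  { spanId: "c", parentSpanId: "b", name: "chat gpt-4o-mini" },
  { spanId: "d", parentSpanId: "b", name: "execute_tool search_docs" },
];

console.log(renderOutline(outlineSpans).join("\n"));
```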
Content is opt-in
By default, prompts and responses are not captured in traces or logs. Events are emitted with content fields set to null. Enable captureContent only for deliberate local diagnostics.
- captureContent: false — default, privacy-first
- Events emitted but content fields are null
- Structured object content omitted unless enabled
- Enable only for local debugging, never in production
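The opt-in behaviour can be pictured with a small sketch. This is an illustration of the rule described above, not the Harness's implementation: `TelemetryOptions`, `shouldCaptureContent`, `buildContentEvent`, and the event name are all hypothetical, and the production guard is an application-side safeguard assumed for the example.

```typescript
// Hypothetical options object; only captureContent comes from the text above.
interface TelemetryOptions {
  captureContent?: boolean;
}

interface ContentEvent {
  name: string;
  content: string | null; // null whenever capture is disabled
}

// Capture only when explicitly enabled and never in production.
function shouldCaptureContent(options: TelemetryOptions, environment: string): boolean {
  return options.captureContent === true && environment !== "production";
}

// The event is always emitted; only the content field is withheld.
function buildContentEvent(eventName: string, text: string, capture: boolean): ContentEvent {
  return { name: eventName, content: capture ? text : null };
}

const localCapture = shouldCaptureContent({ captureContent: true }, "development");
console.log(buildContentEvent("example.prompt", "Summarize the report", localCapture));

const defaultCapture = shouldCaptureContent({}, "development");
console.log(buildContentEvent("example.prompt", "Summarize the report", defaultCapture));
```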
Follow the evidence
When something goes wrong, traces tell the story. Here is the systematic approach to finding and fixing issues.
Get the run_id from the UI, API response, or logs. Every trace is rooted at a session prompt or workflow invocation.
Open the trace in CloudGrid, Grafana, or your observability platform. See the full hierarchy: session → workflow → agent → model → tool.
Check model finish reasons, tool outputs, token counts, and error.type attributes. The span with ERROR status points to the failure.
Look at token usage and operation.duration histograms. Did a prompt change increase token count? Did a provider switch add latency?
Tool spans show which tools were called, with what arguments, and how long they took. MCP spans include transport and auth metadata.
Re-run a smoke test and compare the new trace to the old one. Confirm the fix resolved the issue without introducing regressions.
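The "find the failing span" step can be sketched as a walk over exported spans: pick an ERROR span that has no ERROR child, then follow parent links back to the root. The `TracedSpan` shape and the sample trace are hypothetical. `error.type` is the standard OpenTelemetry attribute mentioned above.

```typescript
// Hypothetical simplified span shape for offline analysis.
interface TracedSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  status: "OK" | "ERROR" | "UNSET";
  attributes: Record<string, string | number>;
}

// Return the chain of spans from the root down to the deepest failing span.
// Returns an empty array when no span has ERROR status.
function pathToDeepestError(spans: TracedSpan[]): TracedSpan[] {
  const byId = new Map<string, TracedSpan>();
  for (const span of spans) byId.set(span.spanId, span);

  // Parents of ERROR spans only propagate the failure; they are not its origin.
  const parentsOfErrors = new Set<string>();
  for (const span of spans) {
    if (span.status === "ERROR" && span.parentSpanId !== undefined) {
      parentsOfErrors.add(span.parentSpanId);
    }
  }
  const failing = spans.find((s) => s.status === "ERROR" && !parentsOfErrors.has(s.spanId));
  if (failing === undefined) return [];

  const chain: TracedSpan[] = [];
  const seen = new Set<string>(); // guards against malformed cyclic parent links
  let current: TracedSpan | undefined = failing;
  while (current !== undefined && !seen.has(current.spanId)) {
    seen.add(current.spanId);
    chain.unshift(current);
    current = current.parentSpanId === undefined ? undefined : byId.get(current.parentSpanId);
  }
  return chain;
}

// Illustrative trace: a tool timeout that propagates up to the session span.
const failedTrace: TracedSpan[] = [
  { spanId: "1", name: "harness.session.prompt", status: "ERROR", attributes: {} },
  { spanId: "2", parentSpanId: "1", name: "invoke_agent analyzer", status: "ERROR", attributes: {} },
  { spanId: "3", parentSpanId: "2", name: "chat gpt-4o-mini", status: "OK", attributes: {} },
  { spanId: "4", parentSpanId: "2", name: "execute_tool search_docs", status: "ERROR", attributes: { "error.type": "timeout" } },
];

console.log(pathToDeepestError(failedTrace).map((s) => s.name).join(" -> "));
```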
What traces reveal
One system for traces, logs, metrics, and AI evaluation.
The Harness emits standard OTLP telemetry and works with any OpenTelemetry-compatible backend. For teams building with PURISTA, CloudGrid is the recommended choice — a source-available observability platform that combines traces, logs, metrics, dashboards, alerts, and AI evaluation and optimization in a single system.
Your telemetry can stay in your network. CloudGrid is built in the open under Apache 2.0 + Commons Clause terms.
Speaks GenAI semantic conventions and OpenInference. AgentRuns, LlmCalls, and ToolCalls are first-class entities, not parsed log lines.
Build datasets, run evaluations, compare baselines and candidates, explore optimization runs, and keep row-level evidence next to the spans it came from.
CloudGrid ships a puristajs/harness eval adapter out of the box. Harness owns all model-provider calls — CloudGrid never holds provider credentials.
One command: docker compose up. OTLP HTTP on port 4318, gRPC on 4317.
Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318. The Harness reads standard OTel env vars — no Harness config changes needed.
Every run, from run.started through run.finished, appears as a distributed waterfall with span attributes, events, and token counts.
Configure the eval harness adapter and start building datasets from your production traces. Score with deterministic rules or judge models. Run prompt experiments.
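For the endpoint step above, it helps to know which URL an OTLP/HTTP exporter actually posts traces to. Under the OpenTelemetry exporter specification, a signal-specific `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is used as-is, while the generic `OTEL_EXPORTER_OTLP_ENDPOINT` is treated as a base URL with `/v1/traces` appended. The sketch below mirrors that rule. `resolveTracesUrl` is a hypothetical helper for illustration, and the source does not state how the Harness's exporter resolves these values internally.

```typescript
// Resolve the OTLP/HTTP traces URL from standard OTel environment variables.
// The env argument is passed in explicitly to keep the sketch free of Node globals.
function resolveTracesUrl(env: Record<string, string | undefined>): string {
  // A signal-specific endpoint takes precedence and is used unchanged.
  const signalSpecific = env["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"];
  if (signalSpecific) return signalSpecific;

  // Otherwise the generic endpoint is a base URL; the spec default is localhost:4318.
  const base = env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "http://localhost:4318";
  return base.replace(/\/+$/, "") + "/v1/traces";
}

console.log(resolveTracesUrl({ OTEL_EXPORTER_OTLP_ENDPOINT: "http://localhost:4318" }));
```

If spans do not show up after setting the variable, checking that the collector accepts requests on the resolved path is a quick first diagnostic.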
Runbook for operators
Structured logs, traces, and verification gates turn incident response from guesswork into a repeatable process.
Run these checks before every deployment or environment promotion.
npm run lint
npm run typecheck
npm test
npm run test:contracts
npm run test:integration
npm run test:failure
npm run build
Before exposing a harness-backed service to traffic:
- Verify session creation and direct agent invocation
- Verify every workflow entrypoint
- Verify tool and MCP failures map to harness errors
- Verify cancellation and timeout behavior
- Verify harness.shutdown() closes adapters cleanly
Identify the run_id from the UI, logs, or API response.
Inspect the structured logs for the matching run_id.
Inspect the trace in your observability platform if a traceId is present.
Check the final run.finished event for normalized error metadata.
Fix the provider, tool, or configuration issue.
Re-run a smoke test, then call harness.shutdown() on process exit.
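The first steps of this triage can be scripted against structured logs. The sketch below assumes newline-delimited JSON logs with `run_id`, `event`, and `error` fields. Those field names, the `RunError` shape, and the sample log lines are assumptions for illustration; the actual Harness log schema may differ.

```typescript
// Assumed shape of the normalized error metadata on a run.finished event.
interface RunError {
  type?: string;
  message?: string;
}

// Assumed shape of one structured log line.
interface LogEntry {
  run_id?: string;
  event?: string;
  error?: RunError | null;
  [key: string]: unknown;
}

// Parse newline-delimited JSON, skipping blank and non-JSON lines.
function parseLogLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        entries.push(parsed as LogEntry);
      }
    } catch {
      // Not JSON (for example a stack trace line); ignore it.
    }
  }
  return entries;
}

// Return the error metadata from the last run.finished event of a run, if any.
function finalError(entries: LogEntry[], runId: string): RunError | null {
  const finished = entries.filter((e) => e.run_id === runId && e.event === "run.finished");
  const last = finished[finished.length - 1];
  return last?.error ?? null;
}

// Illustrative log lines, not real Harness output.
const rawLogs = [
  '{"run_id":"run-1","event":"run.started"}',
  "plain text noise",
  '{"run_id":"run-2","event":"run.finished","error":null}',
  '{"run_id":"run-1","event":"run.finished","error":{"type":"tool_timeout","message":"search_docs timed out"}}',
].join("\n");

console.log(finalError(parseLogLines(rawLogs), "run-1"));
```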
Add your own dimensions
The Harness emits standard spans. Your application can add custom attributes, events, and baggage to enrich traces with business context.
Add tenant IDs, user segments, experiment flags, or cost centers to every span via OTel baggage and span processors.
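import { propagation, context } from '@opentelemetry/api'
// Attach business dimensions as baggage on a new context.
const ctx = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({
    'tenant.id': { value: 'acme-corp' },
    'experiment.variant': { value: 'site' },
  })
)
// The new context only takes effect once it is active,
// for example via context.with(ctx, fn) around the Harness call.
Pass business context through workflow handlers. The Harness propagates the active OTel context through every agent and tool call.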
.workflows(({ workflow }) => ({
  analyze: workflow({
    handler: async (ctx) => {
      // All agent and tool spans inside
      // this workflow inherit the context
      const result = await ctx.agents.analyzer(ctx.input)
      return result
    },
  }),
}))
Observe, understand, improve.
Traces are the feedback loop that turns guesswork into evidence. Start with defaults, connect CloudGrid or any OTel-compatible backend, and iterate with confidence.