Test agents without calling external APIs.
Every layer of a harness application is testable in isolation. Inject a fake model provider. Contract-test your state store. Assert streaming event sequences. Run fake MCP servers in CI. No real API keys needed.
Four levels, clear boundaries
Start at the bottom. Each level is independently testable. Climb the pyramid only when the previous level passes.
Replace the model with a fake provider
A fake model provider is a plain object that implements the ModelProvider interface. It returns deterministic responses. Real sessions, real sandboxes, real tool calls — only the model output is controlled.
Minimal fake provider
const fakeProvider = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    return {
      object: {
        answer: 'Fake answer from test fixture.',
        citations: [],
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

const harness = defineHarness({ name: 'test' })
  .models({
    fast: {
      provider: fakeProvider,
      model: 'fake-model',
      capabilities: ['object'],
    },
  })
  .agents(({ agent }) => ({
    answerer: agent({
      model: 'fast',
      input: z.object({ question: z.string() }),
      output: z.object({ answer: z.string(), citations: z.array(z.string()) }),
      instructions: 'Answer concisely.',
    }),
  }))
  .build()

const session = await harness.getSession('test-user')
await expect(
  session.agents.answerer.prompt({ question: 'hello?' })
).resolves.toMatchObject({ answer: 'Fake answer from test fixture.' })

Input-aware fake
Inspect the request messages to return different responses for different inputs — useful for testing branching workflows.
const inputAwareFake = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    // Branch on the text of the most recent message.
    const last = req.messages.at(-1)?.content ?? ''
    const isCritical = last.toString().toLowerCase().includes('critical')
    return {
      object: {
        priority: isCritical ? 'P1' : 'P3',
        owner: isCritical ? 'on-call' : 'team-backlog',
        nextAction: isCritical
          ? 'page on-call immediately'
          : 'add to backlog',
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

The harness also ships fakeModelProvider() and fakeMemoryAdapter() helpers from @purista/harness/testing for common fixture patterns.
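When a workflow calls the model several times, a hand-rolled fake can replay a fixed script of responses and record each request for later assertions. The sketch below is one way to do that. The FakeRequest and FakeResult shapes and the scriptedProvider name are assumptions modelled on the examples above, not the library's own types or helpers.

```typescript
// Hypothetical request/response shapes, modelled on the examples above.
// They are assumptions for this sketch, not the library's real types.
interface FakeRequest {
  messages: { role: string; content: string }[]
}

interface FakeResult<T> {
  object: T
  usage: { inputTokens: number; outputTokens: number; totalTokens: number }
  finishReason: 'stop'
}

// Builds a provider that replays `responses` in order and records every request.
function scriptedProvider<T>(responses: T[]) {
  const calls: FakeRequest[] = []
  let next = 0
  return {
    id: 'fake',
    genAiSystem: 'fake',
    calls,
    async object(req: FakeRequest): Promise<FakeResult<T>> {
      calls.push(req)
      if (next >= responses.length) {
        // Fail loudly so an unexpected extra model call cannot pass unnoticed.
        throw new Error(`scriptedProvider: no response left for call ${next + 1}`)
      }
      const object = responses[next] as T
      next += 1
      return {
        object,
        usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
        finishReason: 'stop',
      }
    },
  }
}
```

After the run, provider.calls holds every request in order, so a test can assert both how many model calls happened and what each one was asked.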
Assert event sequences from .stream()
The streaming API emits typed RunEvent values. Collect them in an array and assert the sequence, tool calls, and final output.
Collect and assert events
const events: RunEvent[] = []
for await (const event of session.agents.answerer.stream({
  question: 'What is the policy?',
})) {
  events.push(event)
}

// lifecycle
expect(events.at(0)?.type).toBe('run.started')
expect(events.at(-1)?.type).toBe('run.finished')

// tool calls
const toolEvents = events.filter(e => e.type === 'tool.started')
expect(toolEvents).toHaveLength(1)
expect(toolEvents[0].toolId).toBe('search_docs')

// final output
const finished = events.find(e => e.type === 'run.finished')
expect(finished?.output).toMatchObject({
  answer: expect.any(String),
  citations: expect.any(Array),
})

Event sequence for tool use
run.started: First event. Always emitted.
agent.started: Agent loop begins.
tool.started: One per tool call. Has toolId, toolName, input.
tool.finished: Tool output returned. Has output and durationMs.
agent.finished: Model returned final validated output.
run.finished: Last event. Has output, usage, durationMs.
Workflows emit the same outer run.started / run.finished envelope plus agent.started / agent.finished for each agent invocation inside.
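Asserting the whole event array by index gets brittle once a run emits extra events. A looser check is to verify that the expected types appear in the right relative order. The helpers below are a minimal sketch of that idea. EventLike, hasEventOrder and collect are hypothetical names, and only the type field of each event is assumed.

```typescript
// Minimal event shape for this sketch; the real RunEvent type carries more fields.
interface EventLike {
  type: string
}

// Returns true when `expected` occurs within the event types in the same
// relative order. Other events may appear in between.
function hasEventOrder(events: EventLike[], expected: string[]): boolean {
  let matched = 0
  for (const event of events) {
    if (matched < expected.length && event.type === expected[matched]) {
      matched += 1
    }
  }
  return matched === expected.length
}

// Drains any async iterable (such as the result of .stream()) into an array.
async function collect<T>(stream: AsyncIterable<T>): Promise<T[]> {
  const items: T[] = []
  for await (const item of stream) items.push(item)
  return items
}
```

In a test this could read expect(hasEventOrder(events, ['run.started', 'tool.started', 'tool.finished', 'run.finished'])).toBe(true), which stays valid if further event types are emitted between those four.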
Test tool handlers in isolation
TypeScript tool handlers are plain async functions. Pass a minimal context object and assert both successful output and validation failure behaviour — no harness or session needed.
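For orientation, a handler that would satisfy the tests in this section might look like the sketch below. The ToolContext shape mirrors the fields the tests pass in, and the POLICIES fixture and error messages are illustrative assumptions; the real policyLookup module is not shown in this guide.

```typescript
// Hypothetical context shape, mirroring the fields used by the tests in this section.
interface ToolContext {
  logger: { info: (...args: unknown[]) => void }
  signal: AbortSignal
  toolId: string
  sandbox: undefined
}

// Illustrative fixture data, not real policy content.
const POLICIES: Record<string, string> = {
  deployment: 'Deployments pause during the change freeze.',
}

async function policyLookupHandler(
  ctx: ToolContext,
  input: { topic: string },
): Promise<{ text: string }> {
  // Honour cancellation before doing any work.
  if (ctx.signal.aborted) {
    throw new Error('Aborted')
  }
  const text = POLICIES[input.topic]
  if (text === undefined) {
    throw new Error(`Policy not found: ${input.topic}`)
  }
  ctx.logger.info(`${ctx.toolId}: resolved topic ${input.topic}`)
  return { text }
}
```

Because the handler depends only on its two arguments, each branch (success, unknown topic, aborted signal) can be exercised with a plain object and no session.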
Happy path
import { policyLookupHandler } from './tools/policyLookup'

it('returns policy text for known topic', async () => {
  const ctx = {
    logger: console,
    signal: new AbortController().signal,
    toolId: 'policy_lookup',
    sandbox: undefined,
  }
  const result = await policyLookupHandler(
    ctx,
    { topic: 'deployment' }
  )
  expect(result).toMatchObject({
    text: expect.stringContaining('change freeze'),
  })
})

Validation and error paths
// Shared by both tests below; the cancellation test overrides the signal.
const ctx = {
  logger: console,
  signal: new AbortController().signal,
  toolId: 'policy_lookup',
  sandbox: undefined,
}

it('rejects unknown topic', async () => {
  await expect(
    policyLookupHandler(ctx, { topic: 'does-not-exist' })
  ).rejects.toThrow('Policy not found')
})

it('respects cancellation', async () => {
  const abort = new AbortController()
  abort.abort()
  await expect(
    policyLookupHandler(
      { ...ctx, signal: abort.signal },
      { topic: 'deployment' }
    )
  ).rejects.toThrow()
})

Validate scorer definitions before running evals
Import evaluateDeterministicScorer directly to unit-test scorer definitions without a harness, session, or candidate loop.
import { evaluateDeterministicScorer } from '@purista/harness'

it('contains scorer passes when value is present', async () => {
  const result = evaluateDeterministicScorer(
    {
      type: 'contains',
      path: '/answer',
      value: 'change freeze',
      caseInsensitive: true,
    },
    {
      input: { question: 'Can I deploy on Friday?' },
      output: { answer: 'No, there is a change freeze.' },
    }
  )
  expect(result).toMatchObject({ score: 1, passed: true })
})

it('contains scorer fails gracefully on missing pointer', async () => {
  const result = evaluateDeterministicScorer(
    { type: 'contains', path: '/missing', value: 'foo' },
    { input: {}, output: { answer: 'something' } }
  )
  expect(result.passed).toBe(false)
  expect(result.evidence?.reason).toBe('missing_pointer')
})

Test all four types
contains: value present / absent / case handling
regex: pattern match / no match / flags
attribute-equality: equal values / unequal / missing pointers
json-schema: valid shape / missing required / additional properties
Test scorers before candidate runs. A wrong pointer path or invalid schema returns passed: false silently, so catching it in a unit test is faster than debugging it mid-eval.
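To see why a mistyped pointer path fails quietly instead of throwing, it helps to look at how a contains check over a JSON Pointer can work. The sketch below is an illustrative stand-in, not the implementation behind evaluateDeterministicScorer. The resolvePointer and containsScorer names and the 'not_a_string' reason are assumptions; only the result fields score, passed and evidence.reason, and the 'missing_pointer' reason, come from the examples above.

```typescript
interface ScoreResult {
  score: number
  passed: boolean
  evidence?: { reason: string }
}

// Resolves an RFC 6901 JSON Pointer such as "/answer" or "/items/0/name".
// Returns undefined when any segment is missing.
function resolvePointer(doc: unknown, pointer: string): unknown {
  if (pointer === '') return doc
  const segments = pointer
    .split('/')
    .slice(1)
    .map((s) => s.replace(/~1/g, '/').replace(/~0/g, '~'))
  let current: unknown = doc
  for (const segment of segments) {
    if (current === null || typeof current !== 'object') return undefined
    current = (current as Record<string, unknown>)[segment]
  }
  return current
}

function containsScorer(
  def: { path: string; value: string; caseInsensitive?: boolean },
  output: unknown,
): ScoreResult {
  const target = resolvePointer(output, def.path)
  if (target === undefined) {
    // A typo in `path` ends up here: a failing score, not an exception.
    return { score: 0, passed: false, evidence: { reason: 'missing_pointer' } }
  }
  if (typeof target !== 'string') {
    return { score: 0, passed: false, evidence: { reason: 'not_a_string' } }
  }
  const haystack = def.caseInsensitive ? target.toLowerCase() : target
  const needle = def.caseInsensitive ? def.value.toLowerCase() : def.value
  const passed = haystack.includes(needle)
  return { score: passed ? 1 : 0, passed }
}
```

A scorer that reports passed: false for both a wrong answer and a wrong path is indistinguishable in aggregate eval results, which is why asserting on the evidence reason in a unit test is worthwhile.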
Test MCP tools with local fake servers
Use fixture servers for contract tests. Prove the harness correctly executes commands through the sandbox, validates schemas, and handles errors — without a real MCP server.
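For HTTP-based tools, a fixture server can be as small as a Node http server that always answers with a fixed status and body. The sketch below uses only Node built-ins. startFixtureServer is a hypothetical helper name, and it covers only the failure paths (for example a 401 or a malformed body), since a fixture for successful calls would also have to speak the MCP protocol.

```typescript
import { createServer } from 'node:http'
import type { AddressInfo } from 'node:net'

// Starts a local HTTP server that answers every request with `status` and a JSON body.
// Port 0 lets the OS pick a free port, so parallel test files do not collide.
function startFixtureServer(
  status: number,
  body: unknown,
): Promise<{ url: string; close: () => Promise<void> }> {
  return new Promise((resolve, reject) => {
    const server = createServer((_req, res) => {
      // "connection: close" avoids keep-alive sockets delaying server shutdown.
      res.writeHead(status, {
        'content-type': 'application/json',
        connection: 'close',
      })
      res.end(JSON.stringify(body))
    })
    server.once('error', reject)
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address() as AddressInfo
      resolve({
        url: `http://127.0.0.1:${port}`,
        close: () =>
          new Promise<void>((done, fail) => {
            server.close((err) => (err ? fail(err) : done()))
          }),
      })
    })
  })
}
```

A test would start the fixture in a beforeAll hook, pass fixture.url to whatever builds the harness under test, and call fixture.close() in afterAll.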
Stdio MCP contract tests
// things to assert for mcp_stdio tools:

// 1. command runs through sandbox executor
it('runs command through sandbox exec', async () => {
  const result = await session.agents.diagrammer.prompt(input)
  expect(result.diagramXml).toBeDefined()
})

// 2. fails cleanly when sandbox has no executor
it('throws SandboxNoExecutorError without executor', async () => {
  const noExecHarness = defineHarness({ name: 'test' })
    .sandbox(inMemorySandbox()) // file-only, no exec
    .tools({ drawio: { kind: 'mcp_stdio', ... } })
    .build()
  const s = await noExecHarness.getSession('t')
  await expect(s.agents.diagrammer.prompt(input))
    .rejects.toThrow('SandboxNoExecutorError')
})

// 3. validates input schema before calling server
it('rejects invalid tool input', async () => {
  await expect(
    session.agents.diagrammer.prompt({ badInput: true })
  ).rejects.toThrow('ValidationError')
})

HTTP MCP contract tests
// things to assert for mcp_http tools:

// 1. auth failure surfaces as McpAuthError
it('throws McpAuthError on 401', async () => {
  const harness = buildHarnessWithHttpMcp({
    url: 'http://...', // local fixture server that responds with 401
  })
  const session = await harness.getSession('test')
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('McpAuthError')
})

// 2. malformed responses surface as McpProtocolError
it('throws McpProtocolError on bad response', async () => {
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('McpProtocolError')
})

// 3. server output is validated against the tool's output schema
it('rejects invalid MCP server output', async () => {
  // fixture server returns output that does not match the declared schema
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('ValidationError')
})

// 4. isError results surface as ToolError
it('throws ToolError when MCP signals isError', async () => {
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('ToolError')
})

Prove human-in-the-loop flows work correctly
Review gates are stateful flows. Four invariants must hold. Test each one explicitly.
No mutation before approval
Run the workflow, assert write tools were not called before the review decision resolves.
Review questions visible and submittable
Assert answer choices are returned and that each can be submitted independently.
Decisions are idempotent
Submit the same decision twice. Assert the second submission succeeds or returns a clear idempotency signal.
Stale ids fail cleanly
Submit a decision with a stale review id or run id. Assert a meaningful error, not a silent no-op.
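The last two invariants are easier to test once the expected semantics are written down precisely. The sketch below models them with a hypothetical in-memory ReviewRegistry. It is not the harness API, and the 'recorded' and 'duplicate' statuses are assumed names for whatever idempotency signal the real review store returns.

```typescript
type Decision = { approved: boolean }
type SubmitResult = { status: 'recorded' | 'duplicate'; decision: Decision }

// Hypothetical in-memory model of a review gate, used only to show the
// semantics the tests should pin down. It is not the harness API.
class ReviewRegistry {
  private open = new Set<string>()
  private decided = new Map<string, Decision>()

  // Called when a run pauses at a review gate.
  request(reviewId: string): void {
    this.open.add(reviewId)
  }

  submit(reviewId: string, decision: Decision): SubmitResult {
    const previous = this.decided.get(reviewId)
    if (previous !== undefined) {
      // Idempotent: a repeat submission reports the first decision and changes nothing.
      return { status: 'duplicate', decision: previous }
    }
    if (!this.open.has(reviewId)) {
      // Stale or unknown ids fail loudly instead of becoming a silent no-op.
      throw new Error(`Unknown or stale review id: ${reviewId}`)
    }
    this.open.delete(reviewId)
    this.decided.set(reviewId, decision)
    return { status: 'recorded', decision }
  }
}
```

Whatever shape the real API takes, the tests should pin the same two outcomes: a repeated decision never triggers a second write, and an id that was never issued produces an error the caller can act on.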
it('does not write before review approval', async () => {
  const writesSpy = vi.fn()
  const harness = defineHarness({ name: 'test' })
    .tools({
      write_wiki: {
        description: 'Write to wiki.',
        input: z.object({ content: z.string() }),
        output: z.object({ ok: z.boolean() }),
        handler: async (_ctx, input) => {
          writesSpy(input)
          return { ok: true }
        },
      },
    })
    .workflows(({ workflow }) => ({
      propose_and_review: workflow({
        input: z.object({ topic: z.string() }),
        output: z.object({ approved: z.boolean() }),
        handler: async (ctx) => {
          const proposal = await ctx.agents.writer(ctx.input)
          const decision = await ctx.reviewGate(proposal)
          if (decision.approved) {
            await ctx.tools.write_wiki({ content: proposal.content })
          }
          return { approved: decision.approved }
        },
      }),
    }))
    .build()

  const session = await harness.getSession('test')

  // pause at review gate; do not approve
  const run = session.workflows.propose_and_review.stream({ topic: 'AI policy' })
  for await (const event of run) {
    if (event.type === 'review.requested') break
  }
  expect(writesSpy).not.toHaveBeenCalled()
})

Validate custom adapters with shared contracts
The harness ships shared contract test suites from @purista/harness/testing. Run them against your custom adapters before publishing to verify they behave exactly as the core expects.
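The idea behind a shared contract suite is that one set of checks runs against any implementation, each check receiving a fresh instance from a factory. The sketch below shows the pattern on a deliberately tiny, hypothetical KeyValueAdapter interface. It is not one of the suites exported from @purista/harness/testing.

```typescript
// Hypothetical adapter interface, kept tiny for the sketch.
interface KeyValueAdapter {
  set(key: string, value: string): Promise<void>
  get(key: string): Promise<string | undefined>
  delete(key: string): Promise<void>
}

type Check = [name: string, run: (adapter: KeyValueAdapter) => Promise<boolean>]

// Runs every check against a fresh adapter and returns the names of failed checks.
async function runKeyValueContract(
  factory: () => KeyValueAdapter,
): Promise<string[]> {
  const checks: Check[] = [
    ['round-trips a value', async (a) => {
      await a.set('k', 'v')
      return (await a.get('k')) === 'v'
    }],
    ['returns undefined for a missing key', async (a) => {
      return (await a.get('nope')) === undefined
    }],
    ['delete removes the value', async (a) => {
      await a.set('k', 'v')
      await a.delete('k')
      return (await a.get('k')) === undefined
    }],
  ]
  const failures: string[] = []
  for (const [name, run] of checks) {
    // Each check gets its own instance so state cannot leak between checks.
    const ok = await run(factory()).catch(() => false)
    if (!ok) failures.push(name)
  }
  return failures
}

// Reference implementation used to exercise the contract.
function inMemoryAdapter(): KeyValueAdapter {
  const data = new Map<string, string>()
  return {
    async set(key, value) { data.set(key, value) },
    async get(key) { return data.get(key) },
    async delete(key) { data.delete(key) },
  }
}
```

Passing a factory rather than an instance is the important design choice: it keeps checks independent and lets the suite create as many adapters as it needs, which is also how the contract functions in the examples below are called.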
State store
import { stateStoreContract } from '@purista/harness/testing'
import { myStateStore } from './myStateStore'

describe('myStateStore', () => {
  stateStoreContract(() => myStateStore({
    url: process.env.TEST_DB_URL,
  }))
})

Validates session CRUD, run write/read, message append, event persistence, and concurrent-write behaviour.
Memory adapter
import { memoryAdapterContract } from '@purista/harness/testing'
import { myMemoryAdapter } from './myMemoryAdapter'

describe('myMemoryAdapter', () => {
  memoryAdapterContract(() => myMemoryAdapter({
    url: process.env.TEST_REDIS_URL,
  }))
})

Validates scope isolation, read/write round-trips, list with prefix, delete, TTL expiry, and search when declared.
Sandbox
import { sandboxContract } from '@purista/harness/testing'
import { myCustomSandbox } from './myCustomSandbox'

describe('myCustomSandbox', () => {
  sandboxContract(() => myCustomSandbox())
})

Validates file read/write, executor availability declaration, cancellation, and snapshot/resume when declared.
Coverage gate
The harness core enforces a coverage gate on its own test suite: statements 80%, branches 75%, functions 80%, lines 80%. Adapter packages should enforce their own gates before publishing.
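An adapter package could mirror those numbers in its own test configuration. The fragment below is a sketch that assumes the package is tested with Vitest 1.x or later, where thresholds live under coverage.thresholds. The guide does not say which runner or configuration the core itself uses.

```typescript
// vitest.config.ts (assumes Vitest 1.x or later)
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      // Fail the run when coverage drops below these percentages.
      thresholds: {
        statements: 80,
        branches: 75,
        functions: 80,
        lines: 80,
      },
    },
  },
})
```

With this in place, running the tests with coverage enabled exits non-zero when any threshold is missed, which lets CI block a publish.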
Production-ready. Check the list.
Review the security model and production checklist before deploying to make sure nothing slips through.