Testing

Testing LLM-based applications is notoriously difficult because of their non-deterministic nature. PURISTA provides tools to make your agent tests reliable, fast, and deterministic.

1. Unit Testing Agents

When you run `purista add agent`, a test file is generated automatically. The goal of a unit test is to verify your agent's logic (tool calls, state changes, schema validation) without making real LLM calls.

```ts
import { describe, expect, it, vi } from 'vitest'

import { supportAgent } from './supportAgent.js'
import { MockModel, testAgent } from '@purista/ai'

describe('Support Agent', () => {
  it('should call the ticketing tool if the user reports a bug', async () => {
    // 1. Script the mock model
    const model = new MockModel()
      .on(/broken/i)
      .reply('I have created a ticket for you.')

    const { instance, eventBridge, destroy } = await testAgent(supportAgent, {
      models: {
        'openai:gpt-4o-mini': model
      }
    })

    // 2. Mock the service command
    const createTicketMock = vi.fn().mockResolvedValue({ id: 'ticket-123' })
    eventBridge.registerCommand('ticketing', '1', 'createTicket', createTicketMock)

    // 3. Run the agent
    const result = await instance.invoke({ payload: { prompt: 'My laptop is broken' } })

    // 4. Verify assertions
    expect(createTicketMock).toHaveBeenCalledWith(
      expect.objectContaining({ reason: 'Broken laptop' })
    )
    expect(result.envelopes.some(e => e.frame.kind === 'message')).toBe(true)

    await destroy()
  })
})
```

2. Using the Test Helper (`testAgent`)

The `testAgent` helper is your best friend. It:

  • Sets up an in-memory `EventBridge`.
  • Creates a runtime instance of your agent.
  • Injects mock models and providers.
  • Provides a clean way to register mock commands.
  • Returns `destroy()` to cleanly stop the instance and the bridge.

`MockModel` gives you deterministic scripting:

  • `.on(string | RegExp).reply(string | fn)`
  • `.onJson(matcher).reply(object | fn)`
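
To make the scripting model concrete, here is a simplified, self-contained sketch of how such a chainable matcher/reply API works. This is an illustration of the pattern only, not `@purista/ai`'s actual implementation; the class name `SimpleMockModel` and the `respond` method are invented for this sketch:

```typescript
// Simplified illustration of a chainable mock-model API
// (not the actual @purista/ai implementation).
type Reply = string | ((prompt: string) => string)

class SimpleMockModel {
  private rules: { matcher: string | RegExp; reply: Reply }[] = []
  private pending: string | RegExp | undefined

  // .on() records a matcher and returns `this` so .reply() can follow
  on(matcher: string | RegExp): this {
    this.pending = matcher
    return this
  }

  // .reply() pairs the scripted answer with the last matcher
  reply(reply: Reply): this {
    if (this.pending !== undefined) {
      this.rules.push({ matcher: this.pending, reply })
      this.pending = undefined
    }
    return this
  }

  // Deterministic lookup: the first rule whose matcher hits the prompt wins
  respond(prompt: string): string {
    for (const { matcher, reply } of this.rules) {
      const hit =
        typeof matcher === 'string' ? prompt.includes(matcher) : matcher.test(prompt)
      if (hit) {
        return typeof reply === 'function' ? reply(prompt) : reply
      }
    }
    return 'No scripted reply matched.'
  }
}

const model = new SimpleMockModel()
  .on(/broken/i)
  .reply('I have created a ticket for you.')

console.log(model.respond('My laptop is broken')) // → I have created a ticket for you.
```

Because matching is a plain first-hit lookup, the same prompt always yields the same reply, which is exactly what makes the test deterministic.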

3. Strategies for Reliable Tests

A. Schema Validation

Verify that your agent correctly handles malformed input. Because you've defined `addPayloadSchema`, PURISTA will automatically throw a `HandledError` before the agent even starts.
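
The shape of that fail-fast behavior can be sketched as follows. This is a self-contained simulation, not PURISTA's validation code: the `payloadSchema` object and `invoke` wrapper are hand-rolled stand-ins, with only the `HandledError` name taken from the text above:

```typescript
// Self-contained sketch: payload validation rejects bad input before any
// handler (or model call) runs. Stand-in for addPayloadSchema behavior.
class HandledError extends Error {}

type Payload = { prompt: string }

const payloadSchema = {
  parse(input: unknown): Payload {
    const candidate = input as Partial<Payload>
    if (typeof candidate?.prompt !== 'string' || candidate.prompt.length === 0) {
      throw new HandledError('Invalid payload: prompt must be a non-empty string')
    }
    return { prompt: candidate.prompt }
  },
}

// The handler body is only ever reached with validated input
async function invoke(input: unknown): Promise<string> {
  const payload = payloadSchema.parse(input)
  return `Handling: ${payload.prompt}`
}

// Malformed input fails fast, before the "LLM" would ever be called
invoke({ prompt: 42 }).catch((err) => {
  console.log(err instanceof HandledError) // true
})
```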

B. State/History Checks

If your agent uses `persistConversation`, you can verify the history state after a run:

```ts
const session = await instance.session.load('test-session')
expect(session.data.messages).toHaveLength(2)
```

C. Deterministic Output

Mock the model output to verify how your agent handler processes it (e.g., extracting values from JSON or formatting a string).
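
For example, with a scripted JSON reply from the mock model, the handler's post-processing can be asserted exactly. The JSON shape and the `formatTriage` helper below are invented for illustration:

```typescript
// Sketch: a deterministic (mocked) model reply lets you assert the
// handler's post-processing byte-for-byte.
const mockedModelOutput = '{"sentiment": "negative", "priority": 2}'

// Hypothetical handler logic: extract values from the model's JSON
// and format a human-readable string.
function formatTriage(raw: string): string {
  const parsed = JSON.parse(raw) as { sentiment: string; priority: number }
  return `Sentiment ${parsed.sentiment}, priority P${parsed.priority}`
}

console.log(formatTriage(mockedModelOutput)) // → Sentiment negative, priority P2
```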

4. Evaluation Datasets (Advanced)

For production-ready agents, unit tests are not enough. You need to evaluate the quality of the LLM responses.

PURISTA supports an "Evaluation Mode" where you can run your agent against a dataset of "Golden Questions" and "Expected Answers."

  • Metrics: BLEU, ROUGE, or LLM-as-a-judge scoring.
  • CI/CD: Block deployments if the evaluation score drops below a certain threshold.
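
The core of such an evaluation loop can be sketched as below. The dataset shape, the exact-match metric, and the `runAgent` stub are all illustrative assumptions; a real setup would call your deployed agent and score with BLEU/ROUGE or an LLM-as-a-judge:

```typescript
// Sketch of an evaluation loop with a CI threshold gate.
// Dataset shape and exact-match scoring are illustrative only.
type GoldenCase = { question: string; expected: string }

const dataset: GoldenCase[] = [
  { question: 'What is your refund window?', expected: '30 days' },
  { question: 'Do you ship internationally?', expected: 'yes' },
]

// Stand-in for the agent under evaluation (canned answers for the sketch)
async function runAgent(question: string): Promise<string> {
  const canned: Record<string, string> = {
    'What is your refund window?': '30 days',
    'Do you ship internationally?': 'no',
  }
  return canned[question] ?? ''
}

// Returns true if the score clears the threshold; a CI job would fail
// the build when this returns false.
async function evaluate(cases: GoldenCase[], threshold: number): Promise<boolean> {
  let correct = 0
  for (const c of cases) {
    const answer = await runAgent(c.question)
    if (answer.trim().toLowerCase() === c.expected.toLowerCase()) correct++
  }
  const score = correct / cases.length
  console.log(`score: ${score}`)
  return score >= threshold
}

evaluate(dataset, 0.8).then((passed) => console.log(passed ? 'PASS' : 'FAIL'))
```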

See the AI Basic Example for a complete reference on evaluation datasets.