Mental Model & Philosophy

Resilience Through Patterns

Fault tolerance, retry logic, and graceful failure handling

Resilience in PURISTA is not an afterthought. Every message has delivery semantics, every handler has retry boundaries, and every bridge exposes health diagnostics. The framework defaults to safe behavior and fails fast when guarantees cannot be met.

Delivery semantics

End-to-end message delivery is a combination of three factors:

  1. The selected event bridge
  2. Broker and component configuration
  3. Your handler design (idempotency, retries, side effects)

Common guarantee modes

Mode          | Meaning                              | When to use
At-most-once  | Lower overhead, messages can be lost | Telemetry, metrics, non-critical events
At-least-once | Safer delivery, duplicates expected  | Business events, payment processing
Exactly-once  | Rarely guaranteed end-to-end         | Requires idempotent handlers + deduplication

Exactly-once is a handler property

No broker can guarantee exactly-once delivery across distributed side effects. PURISTA provides the tools (idempotency keys, deduplication, idempotent handlers); you provide the design.
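
One way to approximate exactly-once effects on top of at-least-once delivery is to deduplicate on an idempotency key inside the handler. The following is a minimal sketch, not PURISTA API: `DedupStore`, `InMemoryDedupStore`, and `handleOnce` are hypothetical names, and the in-memory set stands in for a durable store shared by all service instances.

```typescript
// Hypothetical sketch: none of these names are part of PURISTA.
interface DedupStore {
  // Returns true if the key was newly recorded, false if it was seen before.
  markIfNew(key: string): boolean
}

class InMemoryDedupStore implements DedupStore {
  private readonly seen = new Set<string>()

  markIfNew(key: string): boolean {
    if (this.seen.has(key)) return false
    this.seen.add(key)
    return true
  }
}

// Runs the side effect at most once per idempotency key,
// even if the same message is delivered several times.
function handleOnce(
  store: DedupStore,
  key: string,
  effect: () => void,
): 'applied' | 'duplicate' {
  if (!store.markIfNew(key)) return 'duplicate'
  effect()
  return 'applied'
}

const store = new InMemoryDedupStore()
let charges = 0
const first = handleOnce(store, 'payment-123', () => { charges += 1 })
const second = handleOnce(store, 'payment-123', () => { charges += 1 }) // redelivery
```

In a real system the key and the side effect should be committed together (for example in one database transaction). Otherwise a crash between the two steps can lose the effect or repeat it.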

Safe defaults

PURISTA defaults to strict startup validation for reliability-sensitive semantics:

  • If a handler requests delivery behavior a bridge cannot honor, startup fails in strict mode
  • Late command responses after timeout are ignored with a warning
  • Stream sessions use bounded timeout handling and terminal-frame enforcement
  • Queue workers apply bounded retries and dead-letter routing using lifecycle defaults
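
The late-response rule above can be pictured as a registry of pending invocations keyed by correlation id. This is an illustrative sketch only: `PendingCommands`, its methods, and the injected clock are assumptions and do not describe how PURISTA's bridges are implemented.

```typescript
// Hypothetical sketch of "caller timeout is terminal; late responses are ignored".
type Settle = 'delivered' | 'ignored-late' | 'ignored-unknown'

class PendingCommands {
  private readonly deadlines = new Map<string, number>()
  private readonly now: () => number

  constructor(now: () => number) {
    this.now = now
  }

  // Register an outgoing command together with its timeout budget.
  register(correlationId: string, timeoutMs: number): void {
    this.deadlines.set(correlationId, this.now() + timeoutMs)
  }

  // Called when a response arrives from the broker.
  onResponse(correlationId: string): Settle {
    const deadline = this.deadlines.get(correlationId)
    if (deadline === undefined) return 'ignored-unknown'
    this.deadlines.delete(correlationId)
    if (this.now() > deadline) {
      console.warn(`late response for ${correlationId} ignored`)
      return 'ignored-late'
    }
    return 'delivered'
  }
}

// A fake clock keeps the example deterministic.
let clock = 0
const pending = new PendingCommands(() => clock)
pending.register('a', 30_000)
pending.register('b', 30_000)
clock = 10_000
const onTime = pending.onResponse('a') // within budget
clock = 45_000
const late = pending.onResponse('b')   // after the 30 s budget
```

Injecting the clock instead of calling `Date.now()` directly makes the timeout path easy to exercise in unit tests.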

Canonical defaults

Area                          | Default                                                   | Behavior
Command invocation timeout    | Bridge defaultCommandTimeout (30s)                        | Caller timeout is terminal; late responses ignored
Stream invocation timeout     | Bridge defaultCommandTimeout                              | Late frames after timeout/terminal are ignored
Subscription failure handling | mode: 'strict', maxAttempts: 1                            | Startup rejects unsupported semantics
Queue lifecycle retry         | maxAttempts: 10, exponential backoff, retryWindowMs: 24h  | Retries stay bounded; route to DLQ after exhaustion
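
To make the queue retry defaults concrete, here is a hedged sketch of a bounded exponential backoff schedule. Only maxAttempts: 10 and the 24h retry window come from the defaults listed above. The 1 s base delay, the doubling factor, and the 1 h cap are assumptions chosen for illustration, and `backoffDelays` is a hypothetical helper, not a PURISTA function.

```typescript
interface RetryPolicy {
  maxAttempts: number   // total attempts, including the first one
  baseDelayMs: number   // assumed: delay before the first retry
  factor: number        // assumed: multiplier applied per retry
  maxDelayMs: number    // assumed: cap for a single delay
  retryWindowMs: number // stop once the cumulative delay would exceed this
}

// Returns the delay before each retry. When the list is exhausted,
// the message would be routed to the dead-letter queue.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = []
  let elapsed = 0
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const delay = Math.min(
      policy.baseDelayMs * Math.pow(policy.factor, retry),
      policy.maxDelayMs,
    )
    if (elapsed + delay > policy.retryWindowMs) break
    delays.push(delay)
    elapsed += delay
  }
  return delays
}

const delays = backoffDelays({
  maxAttempts: 10,
  baseDelayMs: 1_000,
  factor: 2,
  maxDelayMs: 60 * 60 * 1_000,
  retryWindowMs: 24 * 60 * 60 * 1_000,
})
// delays: 1 s, 2 s, 4 s, ... 256 s (nine retries after the first attempt)
```

Both bounds matter: the attempt count limits how often a poison message is retried, and the window limits how long a stale message can linger.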

Subscription control outcomes

Subscription handlers can return explicit control outcomes:

Outcome       | Meaning
ack           | Settle as successful
retry         | Request retry, optionally with delayMs
deadLetter    | Route directly to dead-letter handling
drop          | Settle and discard with a warning
stop-consumer | Pause the subscription consumer; requires explicit resume

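.setSubscriptionFunction(async function (context, payload) {
  try {
    await context.resources.db.processEvent(payload)
    return { status: 'ack' } // settle as successful
  } catch (err) {
    // A conflict will not succeed on retry, so route it straight to dead-letter handling
    if ((err as { code?: string } | undefined)?.code === 'CONFLICT') {
      return { status: 'deadLetter' }
    }
    // Anything else may be transient: request a retry after 5 seconds
    return { status: 'retry', delayMs: 5000 }
  }
})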
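
The mapping from failure type to outcome is often easier to test when it lives in a pure function. The sketch below models the five outcomes listed above as a TypeScript union. The `SubscriptionOutcome` type, the error codes, and the delay formula are assumptions for illustration rather than PURISTA's exported types.

```typescript
// Hypothetical type mirroring the documented outcomes.
type SubscriptionOutcome =
  | { status: 'ack' }
  | { status: 'retry'; delayMs?: number }
  | { status: 'deadLetter' }
  | { status: 'drop' }
  | { status: 'stop-consumer' }

// Hypothetical classifier: decide how to settle a failed message.
function outcomeForError(code: string | undefined, attempt: number): SubscriptionOutcome {
  switch (code) {
    case 'CONFLICT':
    case 'VALIDATION':
      return { status: 'deadLetter' } // retrying cannot fix the payload
    case 'STALE':
      return { status: 'drop' } // superseded event, safe to discard
    case 'DEPENDENCY_DOWN':
      return { status: 'stop-consumer' } // pause until someone resumes the consumer
    default:
      // Unknown errors are treated as transient, with a growing delay capped at 60 s.
      return { status: 'retry', delayMs: Math.min(1000 * Math.pow(2, attempt), 60_000) }
  }
}
```

A handler can then catch the error once and return `outcomeForError(code, attempt)`, which keeps the retry policy in one place.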

Health and paused-state semantics

Service health treats paused operational state as a first-class signal:

  • Paused queue workers are listed in ServiceHealthState.pausedQueueWorkers
  • Paused subscription consumers are listed in ServiceHealthState.pausedSubscriptionConsumers
  • If either list is non-empty, service health is reported as warn

const health = await service.getHealth()
console.log(health.state) // 'healthy' | 'warn' | 'unhealthy'
console.log(health.pausedQueueWorkers)
console.log(health.pausedSubscriptionConsumers)
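
The warn rule above amounts to a small pure function. The following sketch reuses the field names shown in this section, but the `HealthSnapshot` shape, the `hasFailures` flag, the precedence of unhealthy over warn, and `deriveHealthState` itself are hypothetical and only illustrate the rule.

```typescript
type HealthState = 'healthy' | 'warn' | 'unhealthy'

interface HealthSnapshot {
  hasFailures: boolean // assumed input: any hard failure reported by the service
  pausedQueueWorkers: string[]
  pausedSubscriptionConsumers: string[]
}

// Assumption: hard failures take precedence over paused state.
function deriveHealthState(s: HealthSnapshot): HealthState {
  if (s.hasFailures) return 'unhealthy'
  const paused = s.pausedQueueWorkers.length + s.pausedSubscriptionConsumers.length
  return paused > 0 ? 'warn' : 'healthy'
}
```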

In-flight diagnostics

Event bridges expose in-flight diagnostics by work kind:

const diagnostics = service.getInFlightDiagnostics()
console.log(diagnostics.total) // all in-flight handlers
console.log(diagnostics.byKind) // { command: 3, subscription: 1, stream: 0, generic: 0 }

Use this during graceful shutdown to verify that the in-flight count has drained to zero before teardown.
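
A drain check can be written as a bounded polling loop. This is a minimal sketch under stated assumptions: `waitForDrain` is a hypothetical helper, the `InFlightDiagnostics` interface only models the two fields shown above, and the polling interval is arbitrary.

```typescript
// Models only the fields shown in this section.
interface InFlightDiagnostics {
  total: number
  byKind: Record<string, number>
}

// Polls until nothing is in flight or the deadline passes.
// Resolves to true when drained, false on timeout.
async function waitForDrain(
  read: () => InFlightDiagnostics,
  timeoutMs: number,
  pollMs = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    if (read().total === 0) return true
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs))
  }
  return read().total === 0
}

// Possible usage before teardown (service is assumed to exist):
//   const drained = await waitForDrain(() => service.getInFlightDiagnostics(), 10_000)
//   if (!drained) logger.warn('shutting down with handlers still in flight')
```

Returning a boolean instead of throwing lets the caller decide whether an incomplete drain should block teardown or only be logged.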

Stream reliability

Stream consumers should handle terminal frames explicitly:

  • complete — normal end of stream
  • error — terminal failure
  • cancel — consumer cancelled (normal control path)

Expect exactly one terminal state per session.

for await (const frame of stream) {
  if (frame.type === 'chunk') {
    processChunk(frame.payload)
  } else if (frame.type === 'complete') {
    console.log('stream completed')
    break
  } else if (frame.type === 'error') {
    console.error('stream error:', frame.error)
    break
  } else if (frame.type === 'cancel') {
    console.log('stream cancelled by consumer')
    break
  }
}
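
The "exactly one terminal state" expectation can be checked with a small fold over the frames. The sketch below is illustrative: the `Frame` union only mirrors the frame types used in the loop above, and `consumeFrames` and `SessionResult` are hypothetical names rather than PURISTA types.

```typescript
type Frame =
  | { type: 'chunk'; payload: string }
  | { type: 'complete' }
  | { type: 'error'; error: string }
  | { type: 'cancel' }

interface SessionResult {
  chunks: string[]
  terminal: 'complete' | 'error' | 'cancel' | 'missing'
  ignoredAfterTerminal: number
}

// Folds a frame sequence into a result, accepting only the first terminal frame.
function consumeFrames(frames: Frame[]): SessionResult {
  const result: SessionResult = { chunks: [], terminal: 'missing', ignoredAfterTerminal: 0 }
  for (const frame of frames) {
    if (result.terminal !== 'missing') {
      result.ignoredAfterTerminal += 1 // late frame after the terminal state: ignore it
      continue
    }
    if (frame.type === 'chunk') {
      result.chunks.push(frame.payload)
    } else {
      result.terminal = frame.type
    }
  }
  return result
}
```

A result with terminal set to 'missing' means the stream ended without any terminal frame, which a consumer would typically treat as a failure.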

Designing for resilience

Idempotency

Make command and subscription side effects idempotent:

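// ✅ Idempotent: repeating the same input returns the same result
.setCommandFunction(async function (context, payload) {
  // Derive a stable key from the payload so retries map to the same record
  const key = `user:${payload.email}`
  const existing = await context.resources.db.getByKey(key)
  if (existing) {
    return { userId: existing.id } // already created
  }
  // Two concurrent calls can both pass the check above; a unique constraint on the key closes that gap
  const userId = await context.resources.db.create(payload)
  return { userId }
})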

Timeout budgets

Set explicit timeout and retry budgets:

userServiceV1ServiceBuilder
  .getCommandBuilder('processPayment', 'process a payment')
  .setCommandFunction(async function (context, payload) {
    // Business logic
  })
  // The handler body does not set the timeout. To override the default for this
  // command, configure it on the bridge (defaultCommandTimeout) or via command metadata.
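
A quick way to sanity-check a budget is to compute the worst-case time a caller may wait across all attempts and compare it with the SLA. The helper below is a hypothetical sketch, not PURISTA API. The attempt count and delay are placeholder values; only the 30 s per-attempt timeout matches the default command timeout listed under canonical defaults.

```typescript
interface Budget {
  perAttemptTimeoutMs: number
  maxAttempts: number
  delayBetweenAttemptsMs: number
}

// Worst case: every attempt runs into its timeout, with a fixed delay between attempts.
function worstCaseMs(b: Budget): number {
  return b.maxAttempts * b.perAttemptTimeoutMs + (b.maxAttempts - 1) * b.delayBetweenAttemptsMs
}

function fitsSla(b: Budget, slaMs: number): boolean {
  return worstCaseMs(b) <= slaMs
}

const paymentBudget: Budget = {
  perAttemptTimeoutMs: 30_000,
  maxAttempts: 3,
  delayBetweenAttemptsMs: 5_000,
}
// worstCaseMs(paymentBudget) is 3 * 30 000 + 2 * 5 000 = 100 000 ms
```

If the worst case exceeds the SLA, reduce the per-attempt timeout or the number of attempts rather than relying on the happy path being fast.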

Graceful shutdown

import { gracefulShutdown } from '@purista/core'

gracefulShutdown(logger, [
  eventBridge,
  userService,
  emailService,
])

gracefulShutdown takes a logger, an ordered array of destroyables, and an optional timeout in ms (default 30 000). Each entry is destroyed in sequence.

When to focus on resilience

  • Production systems where downtime is costly
  • Financial transactions (payments, billing, inventory)
  • Multi-step workflows where partial failure is expensive
  • Systems with external dependencies (third-party APIs, databases)

Common pitfalls

  • Non-idempotent side effects. Without idempotency, retries create duplicate data.
  • Ignoring timeout boundaries. Default timeouts may not match your SLA.
  • Assuming exactly-once delivery. Design for at-least-once with idempotency.
  • Not testing failure paths. Test retries, timeouts, and dead-letter routing in CI.
  • Missing drain verification. Always check in-flight diagnostics before shutdown.

Checklist

  • All command and subscription side effects are idempotent
  • Timeout and retry budgets are defined per handler
  • Dead-letter routing is configured for queue workers
  • Graceful shutdown waits for in-flight messages
  • Health checks include paused state
  • Failure paths are tested in CI (not just happy path)
  • Operational runbook documents outage and reconnect behavior
