Retry Metrics and Observability

Prodigy provides comprehensive observability into retry behavior through logging and event tracking, rather than programmatic metrics access. This enables debugging retry issues and monitoring workflow resilience without requiring code changes.

How Retry Observability Works

Retry operations are observable through three primary mechanisms:

  1. Console Logs - Visible during workflow execution
  2. Tracing Infrastructure - Controlled by verbosity flags
  3. Event Tracking - Recorded for MapReduce workflows

Internal Metrics Structure

While retry metrics are tracked internally via the RetryMetrics structure (src/cook/retry_v2.rs:399-422), this data is not exposed as a queryable API for workflow users. Instead, metrics are surfaced through logging at key decision points.

The internal structure tracks:

  - total_attempts - Total number of attempts made
  - successful_attempts - Number of successful operations
  - failed_attempts - Number of failed operations
  - retries - History of retry attempts with delays
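
As a rough illustration of these fields, the sketch below shows what such a structure might look like in Rust. The field types and helper methods are assumptions for illustration; the authoritative definition is the one in src/cook/retry_v2.rs.

use std::time::Duration;

// Illustrative sketch of the metrics fields listed above; types and
// methods are assumptions, not the actual definition in retry_v2.rs.
struct RetryAttempt {
    attempt_number: u32,
    delay: Duration,
}

struct RetryMetrics {
    total_attempts: u64,
    successful_attempts: u64,
    failed_attempts: u64,
    retries: Vec<RetryAttempt>,
}

impl RetryMetrics {
    // Record a successful attempt.
    fn record_success(&mut self) {
        self.total_attempts += 1;
        self.successful_attempts += 1;
    }

    // Record a failed attempt and the delay chosen before the next retry.
    fn record_retry(&mut self, attempt_number: u32, delay: Duration) {
        self.total_attempts += 1;
        self.failed_attempts += 1;
        self.retries.push(RetryAttempt { attempt_number, delay });
    }
}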

Observing Retry Behavior in Workflows

Console Output During Retries

When a command fails and triggers a retry, you'll see log messages like:

Retrying <operation> (attempt 2/5) after 200ms
Retrying <operation> (attempt 3/5) after 400ms

Source: These messages come from the retry executor's logging (src/cook/retry_v2.rs:247-250).
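
The doubling delays (200ms, then 400ms) reflect an exponential backoff schedule. The sketch below shows, purely as an illustration and not Prodigy's actual code, how such a schedule can be computed from an initial delay and a cap:

use std::cmp::min;
use std::time::Duration;

// Illustrative exponential backoff: the delay doubles on each retry and is
// capped at a maximum. Not Prodigy's actual implementation.
fn backoff_delay(retry: u32, initial: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(retry.saturating_sub(1));
    min(initial.saturating_mul(factor), max)
}

fn main() {
    let initial = Duration::from_millis(200);
    let max = Duration::from_millis(10_000);
    for retry in 1..=5 {
        println!("retry {retry}: wait {:?}", backoff_delay(retry, initial, max));
    }
    // Prints 200ms, 400ms, 800ms, 1.6s, 3.2s, matching the log progression above.
}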

Verbosity Flags for Detailed Logs

Use Prodigy's verbosity flags to see more retry details:

# Default - shows retry attempts
prodigy run workflow.yml

# Verbose - shows retry decision-making
prodigy run workflow.yml -v

# Very verbose - shows all retry internals
prodigy run workflow.yml -vv

With -v or higher verbosity, you'll see:

  - Retry attempt counts and delays
  - Circuit breaker state transitions
  - Retry budget consumption
  - Error pattern matching results

Retry Budget Exhaustion

When the retry budget is exhausted, a warning is logged:

Retry budget exhausted for <operation>

Source: src/cook/retry_v2.rs:238-243

This indicates the total retry delay exceeded the configured retry_budget, preventing further attempts even if max_attempts hasn't been reached.
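
Conceptually, the budget acts as a second stopping condition alongside max_attempts. The sketch below illustrates that decision with hypothetical names; it is not Prodigy's actual retry executor:

use std::time::Duration;

// Illustrative budget check: retries stop once the cumulative delay would
// exceed the budget, even if attempts remain. Names are hypothetical.
fn should_retry(
    attempt: u32,
    max_attempts: u32,
    delay_spent: Duration,
    next_delay: Duration,
    retry_budget: Duration,
) -> bool {
    if attempt >= max_attempts {
        return false; // normal stop: attempts exhausted
    }
    if delay_spent + next_delay > retry_budget {
        return false; // "Retry budget exhausted" path: time budget used up
    }
    true
}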

Circuit Breaker Observability

Circuit breaker state changes are logged automatically:

Circuit Breaker Opens

Circuit breaker opened after 5 consecutive failures

Source: src/cook/retry_v2.rs:391

When the failure threshold is exceeded, the circuit breaker opens and blocks further retry attempts until the recovery timeout expires.

Circuit Breaker Transitions

Circuit breaker transitioning to half-open
Circuit breaker closed after successful operation

Source: src/cook/retry_v2.rs:359, 377

These debug-level messages show circuit breaker recovery:

  - Half-open: Testing if failures have cleared (one attempt allowed)
  - Closed: Normal operation resumed after successful attempt
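
Taken together, the three states behave like a small state machine. The sketch below mirrors the log messages quoted above; the type and method names are hypothetical, and the real logic lives in src/cook/retry_v2.rs:

use std::time::{Duration, Instant};

// Illustrative circuit-breaker state machine mirroring the log messages
// above; names and structure are hypothetical, not Prodigy's internals.
#[derive(Clone, Copy)]
enum State {
    Closed { consecutive_failures: u32 },
    Open { opened_at: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    failure_threshold: u32,
    recovery_timeout: Duration,
}

impl CircuitBreaker {
    // Returns true if an attempt is allowed to proceed.
    fn allow_attempt(&mut self) -> bool {
        match self.state {
            State::Closed { .. } | State::HalfOpen => true,
            State::Open { opened_at } => {
                if opened_at.elapsed() >= self.recovery_timeout {
                    // "Circuit breaker transitioning to half-open"
                    self.state = State::HalfOpen;
                    true
                } else {
                    false // open: attempts blocked until the timeout expires
                }
            }
        }
    }

    // "Circuit breaker closed after successful operation"
    fn on_success(&mut self) {
        self.state = State::Closed { consecutive_failures: 0 };
    }

    fn on_failure(&mut self) {
        self.state = match self.state {
            State::Closed { consecutive_failures } => {
                let failures = consecutive_failures + 1;
                if failures >= self.failure_threshold {
                    // "Circuit breaker opened after N consecutive failures"
                    State::Open { opened_at: Instant::now() }
                } else {
                    State::Closed { consecutive_failures: failures }
                }
            }
            // A failure while half-open re-opens the breaker.
            State::HalfOpen => State::Open { opened_at: Instant::now() },
            State::Open { opened_at } => State::Open { opened_at },
        };
    }
}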

Workflow-Level Observability

On-Failure Handler Execution

When retries are configured via an on_failure handler, for example:

- shell: "flaky-command"
  on_failure:
    claude: "/fix-issue"
    max_attempts: 5

Workflow output shows:

  1. Initial command failure
  2. Retry attempt messages with delays
  3. Handler command execution
  4. Final success or failure after exhausting attempts

Example Log Progression

❌ Command failed: exit code 1
Retrying (attempt 2/5) after 100ms
❌ Command failed: exit code 1
Retrying (attempt 3/5) after 200ms
✅ Command succeeded

MapReduce Event Tracking

MapReduce workflows provide additional observability through the event tracking system. Retry attempts are recorded as events:

# View events for a MapReduce job
prodigy events show <job_id>

Events include:

  - Retry attempt counts per work item
  - Individual agent retry behavior
  - Correlation between retries and DLQ items

See Event Tracking for comprehensive event details.

Troubleshooting with Retry Logs

Identifying Retry Configuration Issues

Problem: Commands retry unnecessarily on non-transient errors

Check logs for:

Retrying (attempt 2/5) after 100ms
Retrying (attempt 3/5) after 200ms
...

Solution: Configure retry_on error matchers to only retry specific errors:

retry_config:
  max_attempts: 5
  retry_on:
    - pattern: "connection refused"
    - pattern: "timeout"

See Conditional Retry with Error Matchers.
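
Conceptually, each matcher reduces the retry decision to a check of the failure output against the configured patterns. The sketch below uses plain substring matching as a simplifying assumption; Prodigy's actual matcher semantics (for example, regular expression support) may differ:

// Illustrative retry_on check using plain substring matching; Prodigy's
// actual matcher semantics may differ (e.g. regular expressions).
fn is_retryable(error_output: &str, retry_on: &[&str]) -> bool {
    retry_on.iter().any(|pattern| error_output.contains(pattern))
}

fn main() {
    let retry_on = ["connection refused", "timeout"];
    // Transient network error: retried.
    assert!(is_retryable("curl: (7) connection refused", &retry_on));
    // Non-transient error: fails immediately instead of retrying.
    assert!(!is_retryable("syntax error near line 3", &retry_on));
}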

Debugging Circuit Breaker Behavior

Problem: Circuit breaker opening too aggressively

Check logs for:

Circuit breaker opened after 3 consecutive failures

Solution: Adjust failure_threshold and recovery_timeout:

circuit_breaker:
  failure_threshold: 5      # Increase tolerance
  recovery_timeout_ms: 30000  # Longer recovery window

Analyzing Retry Budget Exhaustion

Problem: Retries stop before max_attempts reached

Check logs for:

Retry budget exhausted for <operation>

Solution: Increase retry_budget or reduce max_delay_ms:

retry_config:
  retry_budget_ms: 120000  # Allow up to 2 minutes of retries
  max_delay_ms: 10000       # Cap individual delays at 10s

See Retry Budget.

Integration with Error Handling

Retry observability integrates with Prodigy's error handling pipeline:

  1. Retry - First line of defense for transient failures
  2. On-Failure Handlers - Second line when retries exhausted
  3. Dead Letter Queue - Final capture for MapReduce failures

Example workflow progression:

Command fails → Retry with backoff → Still failing → Execute on_failure handler →
Handler succeeds → Continue OR Handler fails → DLQ (MapReduce only)

Logs show the complete progression:

❌ shell: flaky-script failed
Retrying (attempt 2/3) after 100ms
❌ shell: flaky-script failed
Retrying (attempt 3/3) after 200ms
❌ shell: flaky-script failed
Executing on_failure handler: claude /debug-failure
✅ on_failure handler succeeded

MapReduce DLQ Relationship

For MapReduce workflows, retry exhaustion leads to DLQ:

Retrying work item 'item-123' (attempt 3/3)
❌ Failed after 3 attempts
→ Item moved to Dead Letter Queue

View DLQ items with retry history:

prodigy dlq show <job_id>

Each DLQ item preserves:

  - Number of retry attempts made
  - Final error message
  - Retry delays used
  - Circuit breaker state at failure

See Dead Letter Queue (DLQ).
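
As a rough sketch, the retry history preserved with a DLQ item can be thought of as a record like the one below. The field names are hypothetical and do not reflect the actual DLQ schema:

use std::time::Duration;

// Hypothetical shape of the retry history kept with a DLQ item; field
// names are illustrative only, not Prodigy's actual DLQ schema.
struct DlqRetryHistory {
    item_id: String,
    attempts_made: u32,
    final_error: String,
    retry_delays: Vec<Duration>,
    circuit_breaker_open: bool,
}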

Monitoring Retry Patterns

Identifying Flaky Operations

Watch for patterns in logs indicating flaky behavior:

# Count retry attempts across workflow execution
prodigy run workflow.yml -v 2>&1 | grep "Retrying" | wc -l

# Find operations that retry most frequently
prodigy run workflow.yml -v 2>&1 | grep "Retrying" | \
  grep -oE 'Retrying [^ ]+' | sort | uniq -c | sort -rn

Measuring Retry Effectiveness

Track retry success vs. failure rates:

# Find successful retries
grep "Retrying" workflow.log | grep -A 5 "succeeded"

# Find exhausted retries
grep "Retry budget exhausted" workflow.log
grep "Failed after .* attempts" workflow.log