Dead Letter Queue (DLQ)

Dead Letter Queue (DLQ)¶

The Dead Letter Queue (DLQ) captures persistently failing work items for analysis and retry while allowing MapReduce jobs to continue processing other items. Instead of blocking the entire workflow when individual items fail, the DLQ provides fault tolerance and enables debugging of failure patterns.

Overview¶

When a map agent fails to process a work item after exhausting retry attempts, the item is automatically sent to the DLQ. This allows the MapReduce job to complete successfully while preserving all failure information for later investigation and reprocessing.

The DLQ integrates with MapReduce through the on_item_failure policy, which defaults to dlq for MapReduce workflows. Alternative policies include retry (immediate retry), skip (ignore failures), stop (halt workflow), and custom (user-defined handling).

flowchart TD
    Start[Work Item Processing] --> Execute[Execute Agent Command]
    Execute --> Success{Success?}
    Success -->|Yes| Next[Process Next Item]
    Success -->|No| Retry{Retry<br/>Attempts<br/>Left?}

    Retry -->|Yes| Backoff[Exponential Backoff]
    Backoff --> Execute

    Retry -->|No| Policy{on_item_failure<br/>Policy}

    Policy -->|dlq| CreateDLQ[Create DeadLetteredItem]
    Policy -->|stop| Halt[Halt Workflow]
    Policy -->|skip| Skip[Mark Skipped]
    Policy -->|custom| Custom[Run Custom Handler]

    CreateDLQ --> SaveDLQ[Save to DLQ Storage]
    SaveDLQ --> Next
    Skip --> Next
    Custom --> Next

    Halt --> End[Workflow Failed]
    Next --> End

    style Success fill:#e8f5e9
    style Policy fill:#fff3e0
    style CreateDLQ fill:#e1f5ff
    style SaveDLQ fill:#e1f5ff
    style Halt fill:#ffebee

Figure: DLQ failure flow showing retry logic and policy-based handling.

Storage Structure¶

DLQ data is stored in the global Prodigy directory using this structure:

~/.prodigy/dlq/{repo_name}/{job_id}/mapreduce/dlq/{job_id}/
├── items/                  # Individual item files
│   ├── item-123.json      # DeadLetteredItem for item-123
│   ├── item-456.json      # DeadLetteredItem for item-456
│   └── ...
└── index.json             # DLQ index metadata

For example:

~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/
├── items/
│   └── item-123.json
└── index.json

Storage Implementation: - GlobalStorage provides the base path: ~/.prodigy/dlq/{repo_name}/{job_id} (src/storage/global.rs:47) - DLQStorage adds subdirectories: mapreduce/dlq/{job_id} (src/cook/execution/dlq.rs:529) - Note: The redundant mapreduce/dlq/{job_id} subdirectory structure is maintained for backward compatibility with earlier DLQ implementations. This ensures existing DLQ data remains accessible after upgrades. - Each failed item is stored as a separate JSON file in the items/ directory (src/cook/execution/dlq.rs:536) - File naming: {item_id}.json (e.g., item-123.json) - index.json maintains a list of all item IDs and metadata for fast lookups (src/cook/execution/dlq.rs:608-637) - Items are loaded into memory on-demand for operations

This global storage architecture enables: - Cross-worktree access: Multiple worktrees working on the same job share DLQ data - Persistent state: DLQ survives worktree cleanup - Centralized monitoring: All failures accessible from a single location - Scalability: Individual item files prevent loading entire DLQ into memory

DLQ Item Structure¶

Each failed item in the DLQ is stored as a DeadLetteredItem with comprehensive failure information:

{
  "item_id": "item-123",                    // (1)!
  "item_data": { "file": "src/main.rs", "priority": 5 },  // (2)!
  "first_attempt": "2025-01-11T10:25:00Z",  // (3)!
  "last_attempt": "2025-01-11T10:30:00Z",
  "failure_count": 3,                       // (4)!
  "failure_history": [                      // (5)!
    {
      "attempt_number": 1,
      "timestamp": "2025-01-11T10:30:00Z",
      "error_type": { "CommandFailed": { "exit_code": 101 } },
      "error_message": "cargo test failed with exit code 101",
      "stack_trace": "thread 'main' panicked at src/main.rs:42...",
      "agent_id": "agent-1",
      "step_failed": "shell: cargo test",
      "duration_ms": 45000,
      "json_log_location": "/Users/user/.local/state/claude/logs/session-abc123.json"
    }
  ],
  "error_signature": "CommandFailed::cargo test failed with exit",  // (6)!
  "reprocess_eligible": true,               // (7)!
  "manual_review_required": false,
  "worktree_artifacts": {                   // (8)!
    "worktree_path": "/Users/user/.prodigy/worktrees/prodigy/agent-1",
    "branch_name": "agent-1",
    "uncommitted_changes": "src/main.rs modified",
    "error_logs": "error.log contents..."
  }
}

Unique identifier for the work item
Original work item data from input JSON
When the item first failed (used for age-based filtering)
Number of retry attempts made
Complete history of all failure attempts with detailed error information
Simplified error pattern for grouping similar failures
Whether item can be automatically retried
Preserved state from failed agent's execution environment

Source: DeadLetteredItem struct (src/cook/execution/dlq.rs:32-43)

Key Fields¶

Source: DeadLetteredItem struct (src/cook/execution/dlq.rs:32-43)

item_id: Unique identifier for the work item
item_data: Original work item data from input JSON
first_attempt: Timestamp of first failure attempt (DateTime)
last_attempt: Timestamp of most recent failure attempt (DateTime)
failure_count: Number of failed attempts (u32)
failure_history: Array of FailureDetail objects capturing each attempt
error_signature: Simplified error pattern for grouping similar failures
reprocess_eligible: Whether item can be retried automatically
manual_review_required: Whether item needs human intervention
worktree_artifacts: Captured state from failed agent's worktree (Optional)

Failure Detail Fields¶

Each attempt in failure_history is a FailureDetail object (src/cook/execution/dlq.rs:46-59):

# Source: src/cook/execution/dlq.rs:46-59 (FailureDetail struct)
{
  "attempt_number": 1,
  "timestamp": "2025-01-11T10:30:00Z",
  "error_type": { "CommandFailed": { "exit_code": 101 } },
  "error_message": "cargo test failed with exit code 101",
  "stack_trace": "thread 'main' panicked at src/main.rs:42...",
  "agent_id": "agent-1",
  "step_failed": "claude: /process '${item}'",
  "duration_ms": 45000,
  "json_log_location": "/path/to/claude-session.json"
}

Field Details: - attempt_number: Sequential attempt counter (u32) - timestamp: When this attempt occurred (DateTime) - error_type: Error classification (see Error Types below) - error_message: Human-readable error description - stack_trace: Optional detailed stack trace - agent_id: ID of the agent that failed - step_failed: Command string that failed (e.g., "shell: cargo test") - duration_ms: How long the attempt ran before failing (u64) - json_log_location: Path to Claude JSON log for debugging (Optional)

Error Types¶

The error_type field uses the ErrorType enum (src/cook/execution/dlq.rs:62-71):

Execution ErrorsGit ErrorsSystem Errors

Timeout: Agent execution exceeded timeout limit

CommandFailed { exit_code: i32 }: Shell or Claude command returned non-zero exit code

Example:

{ "CommandFailed": { "exit_code": 101 } }

ValidationFailed: Post-execution validation command failed

WorktreeError: Git worktree operation failed (creation, cleanup)

MergeConflict: Merge back to parent worktree failed

Common when multiple agents modify the same files.

ResourceExhausted: System resources (memory, disk) exhausted

Unknown: Unclassified error

Review error_message and stack_trace for details.

Worktree Artifacts¶

The WorktreeArtifacts structure captures the agent's execution environment (src/cook/execution/dlq.rs:74-80):

worktree_path: Path to agent's isolated worktree (PathBuf)
branch_name: Git branch created for agent (String)
uncommitted_changes: Files modified but not committed (Option)
error_logs: Captured error output (Option)

Note: Both uncommitted_changes and error_logs are stored as optional strings, not arrays. If present, they contain descriptive text about the changes/logs rather than file lists.

These artifacts are preserved for debugging and can be accessed after failure.

DLQ Commands¶

Planned Feature - Stub Implementation

The DLQ CLI commands are currently implemented as stubs that only print status messages. Full functionality is planned for a future release.

Current Status:

Command definitions exist in src/cli/args.rs:577-677
Stub implementations in src/cli/commands/dlq.rs:1-74
Commands will print confirmation messages but do not execute actual operations

See DLQ Integration for current workflow-level DLQ functionality.

Prodigy provides comprehensive CLI commands for managing the DLQ. The following sections describe the planned command interface.

List Command¶

View DLQ items across jobs:

# List all DLQ items
prodigy dlq list

# List items for specific job
prodigy dlq list --job-id mapreduce-1234567890

# List only retry-eligible items
prodigy dlq list --eligible

# Limit results
prodigy dlq list --limit 10

Inspect Command¶

Examine detailed information for a specific item:

# Inspect item by ID
prodigy dlq inspect item-123

# Inspect item in specific job
prodigy dlq inspect item-123 --job-id mapreduce-1234567890

Output includes: - Full item data - Complete failure history with all attempts - Error messages and stack traces - JSON log locations for debugging - Worktree artifacts

Analyze Command¶

Analyze failure patterns across items:

# Analyze all failures
prodigy dlq analyze

# Analyze specific job
prodigy dlq analyze --job-id mapreduce-1234567890

# Export analysis results
prodigy dlq analyze --export analysis.json

Analysis output includes: - pattern_groups: Failures grouped by error type and message similarity - error_distribution: Histogram of error types - temporal_distribution: Failures over time

Retry Command¶

⚠️ STUB: This command is not yet implemented. Description below shows planned interface.

Reprocess failed items:

# Source: src/cli/args.rs:640-659
# Retry all failed items for a workflow
prodigy dlq retry <workflow_id>

# Control parallelism (default: 10)
prodigy dlq retry <workflow_id> --parallel 10

# Filter items to retry
prodigy dlq retry <workflow_id> --filter 'item.priority >= 5'

# Limit retry attempts per item (default: 3)
prodigy dlq retry <workflow_id> --max-retries 2

# Force retry even if not eligible
prodigy dlq retry <workflow_id> --force

Command Parameters: - <workflow_id>: The workflow/job identifier to retry items from (e.g., "mapreduce-1234567890") - --parallel: Number of concurrent retry workers (default: 10, src/cli/args.rs:653) - --max-retries: Maximum retry attempts per item (default: 3, src/cli/args.rs:649) - --filter: Expression to filter which items to retry (default: all eligible items, src/cli/args.rs:645) - --force: Force retry of items marked as not eligible (default: false, src/cli/args.rs:657)

Supported Flags: --parallel, --max-retries, --filter, --force

Retry Behavior¶

The retry functionality is designed to handle large-scale DLQ reprocessing:

Streaming support: Items are processed incrementally to avoid memory issues with large DLQs
Parallel execution: Uses --parallel flag (default: 10) to control concurrent workers
State updates:
Successful items are removed from DLQ
Failed items remain with updated failure_history
failure_count is incremented
Correlation tracking: Maintains original item IDs and correlation metadata
Interruption safe: Supports stopping and resuming retry operations

Implementation Note: The DLQ reprocessing logic uses the DeadLetterQueue::reprocess method (src/cook/execution/dlq.rs:202-233) which filters for reprocess_eligible items before attempting retry.

Export Command¶

Export DLQ items for external analysis:

# Export to JSON (default)
prodigy dlq export dlq-items.json

# Export to CSV
prodigy dlq export dlq-items.csv --format csv

# Export specific job
prodigy dlq export output.json --job-id mapreduce-1234567890

Stats Command¶

View DLQ statistics:

# Overall DLQ statistics
prodigy dlq stats

# Stats for specific workflow
prodigy dlq stats --workflow-id my-workflow

Output includes: - Total items in DLQ - Items by error type - Average retry count - Retry-eligible vs manual review required - Temporal distribution

Clear Command¶

Remove items from DLQ:

# Clear all items for a workflow (prompts for confirmation)
prodigy dlq clear my-workflow

# Skip confirmation prompt
prodigy dlq clear my-workflow --yes

Purge Command¶

Clean up old DLQ items:

# Purge items older than 30 days
prodigy dlq purge --older-than-days 30

# Purge for specific job
prodigy dlq purge --older-than-days 30 --job-id mapreduce-1234567890

# Skip confirmation
prodigy dlq purge --older-than-days 30 --yes

Debugging with DLQ¶

Accessing Claude JSON Logs¶

Each FailureDetail includes a json_log_location field pointing to the Claude Code JSON log for that execution. This log contains: - Complete conversation history - All tool invocations and results - Error details and stack traces - Token usage statistics

# View JSON log from DLQ item
cat $(prodigy dlq inspect item-123 | jq -r '.failure_history[0].json_log_location')

# Pretty-print with jq
cat /path/to/session.json | jq '.'

For more details on Claude JSON logs, see Retry Metrics and Observability.

Common Debugging Workflow¶

Step-by-Step Debugging

Follow this workflow to diagnose and fix failures systematically:

List failed items:

prodigy dlq list --job-id mapreduce-1234567890

Inspect specific failure:
```
prodigy dlq inspect item-123
```

Examine Claude logs:

cat /path/to/claude-session.json | jq '.messages[-3:]'

Analyze failure patterns:

prodigy dlq analyze --job-id mapreduce-1234567890

Fix underlying issue (code bug, config error, etc.)
Retry failed items:
```
prodigy dlq retry mapreduce-1234567890
```

Integration with MapReduce¶

The DLQ is tightly integrated with MapReduce workflows through the on_item_failure policy:

# Source: Workflow configuration (see src/config/mapreduce.rs for MapConfig schema)
name: my-workflow
mode: mapreduce

map:
  input: "items.json"
  json_path: "$.items[*]"

  # Default policy: send failures to DLQ
  on_item_failure: dlq

  agent_template:
    - claude: "/process '${item}'"

Available Policies¶

dlq (default): Failed items sent to DLQ, job continues
retry: Immediate retry with exponential backoff
skip: Ignore failures, mark as skipped, continue
stop: Halt entire workflow on first failure
custom: User-defined failure handler

Failure Flow¶

Work Item Processing
       ↓
   Command Failed
       ↓
  Retry Attempts (if configured)
       ↓
   Still Failing?
       ↓
on_item_failure: dlq
       ↓
Create DeadLetteredItem
       ↓
Save to ~/.prodigy/dlq/{repo}/{job_id}/mapreduce/dlq/{job_id}/items/{item_id}.json
       ↓
Continue Processing Other Items

Best Practices¶

When to Retry vs Manual Fix¶

Choosing the Right Approach

Before retrying failed items, analyze the failure patterns with prodigy dlq analyze to determine if the issue is transient (safe to retry) or systematic (requires code changes first).

Automatic Retry (via prodigy dlq retry): - Transient failures (network timeouts, resource contention) - Flaky tests or intermittent issues - Items that may succeed with more resources (--max-parallel 1)

Manual Fix (code changes, then retry): - Logic errors in processing code - Invalid assumptions about item data - Missing dependencies or configuration - Systematic failures affecting multiple items

DLQ Management¶

Monitor regularly: Use prodigy dlq stats to track failure rates
Analyze patterns: Run prodigy dlq analyze to identify systematic issues
Clean up old items: Periodically run prodigy dlq purge to remove resolved failures
Set review flags: Mark manual_review_required: true for items needing human investigation

Performance Considerations¶

Resource Management

When retrying large DLQs, start with lower parallelism (e.g., --max-parallel 5) to avoid overwhelming system resources. Monitor the first few retries before increasing parallelism.

Large DLQs: Retry command uses streaming to handle thousands of items efficiently
Parallelism: Tune --max-parallel based on failure type (CPU-bound vs I/O-bound)
Filtering: Use --filter to target specific subsets of failures

Advanced: DLQ Storage Internals¶

For advanced debugging or direct file access, understanding the DLQ storage structure can be helpful.

Index File Structure¶

The index.json file provides metadata about the DLQ (src/cook/execution/dlq.rs:608-637):

# Source: src/cook/execution/dlq.rs:608-637 (index structure)
{
  "job_id": "mapreduce-1234567890",
  "item_count": 3,
  "item_ids": ["item-123", "item-456", "item-789"],
  "updated_at": "2025-01-11T10:30:00Z"
}

This index is automatically updated when items are added or removed from the DLQ.

Direct File Access¶

To inspect DLQ items directly:

# List all DLQ items for a job
ls ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/items/

# View a specific item
cat ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/items/item-123.json | jq

# Count total items
jq '.item_count' ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/index.json

Note: Direct file access is provided for debugging. Always use the Prodigy CLI commands for production operations to ensure data consistency.

Cross-References¶

Checkpoint and Resume: DLQ state preserved in checkpoints
Event Tracking: DLQ operations emit trackable events
Error Handling: Broader error handling strategies
Worktree Architecture: Agent isolation and artifact preservation
Retry Metrics and Observability: Monitoring retry behavior and failures