Dead Letter Queue (DLQ)
Dead Letter Queue (DLQ)¶
The Dead Letter Queue (DLQ) captures persistently failing work items for analysis and retry while allowing MapReduce jobs to continue processing other items. Instead of blocking the entire workflow when individual items fail, the DLQ provides fault tolerance and enables debugging of failure patterns.
Overview¶
When a map agent fails to process a work item after exhausting retry attempts, the item is automatically sent to the DLQ. This allows the MapReduce job to complete successfully while preserving all failure information for later investigation and reprocessing.
The DLQ integrates with MapReduce through the on_item_failure policy, which defaults to dlq for MapReduce workflows. Alternative policies include retry (immediate retry), skip (ignore failures), stop (halt workflow), and custom (user-defined handling).
flowchart TD
Start[Work Item Processing] --> Execute[Execute Agent Command]
Execute --> Success{Success?}
Success -->|Yes| Next[Process Next Item]
Success -->|No| Retry{Retry<br/>Attempts<br/>Left?}
Retry -->|Yes| Backoff[Exponential Backoff]
Backoff --> Execute
Retry -->|No| Policy{on_item_failure<br/>Policy}
Policy -->|dlq| CreateDLQ[Create DeadLetteredItem]
Policy -->|stop| Halt[Halt Workflow]
Policy -->|skip| Skip[Mark Skipped]
Policy -->|custom| Custom[Run Custom Handler]
CreateDLQ --> SaveDLQ[Save to DLQ Storage]
SaveDLQ --> Next
Skip --> Next
Custom --> Next
Halt --> End[Workflow Failed]
Next --> End
style Success fill:#e8f5e9
style Policy fill:#fff3e0
style CreateDLQ fill:#e1f5ff
style SaveDLQ fill:#e1f5ff
style Halt fill:#ffebee
Figure: DLQ failure flow showing retry logic and policy-based handling.
Storage Structure¶
DLQ data is stored in the global Prodigy directory using this structure:
~/.prodigy/dlq/{repo_name}/{job_id}/mapreduce/dlq/{job_id}/
├── items/ # Individual item files
│ ├── item-123.json # DeadLetteredItem for item-123
│ ├── item-456.json # DeadLetteredItem for item-456
│ └── ...
└── index.json # DLQ index metadata
For example:
~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/
├── items/
│ └── item-123.json
└── index.json
Storage Implementation:
- GlobalStorage provides the base path: ~/.prodigy/dlq/{repo_name}/{job_id} (src/storage/global.rs:47)
- DLQStorage adds subdirectories: mapreduce/dlq/{job_id} (src/cook/execution/dlq.rs:529)
- Note: The redundant mapreduce/dlq/{job_id} subdirectory structure is maintained for backward compatibility with earlier DLQ implementations. This ensures existing DLQ data remains accessible after upgrades.
- Each failed item is stored as a separate JSON file in the items/ directory (src/cook/execution/dlq.rs:536)
- File naming: {item_id}.json (e.g., item-123.json)
- index.json maintains a list of all item IDs and metadata for fast lookups (src/cook/execution/dlq.rs:608-637)
- Items are loaded into memory on-demand for operations
This global storage architecture enables: - Cross-worktree access: Multiple worktrees working on the same job share DLQ data - Persistent state: DLQ survives worktree cleanup - Centralized monitoring: All failures accessible from a single location - Scalability: Individual item files prevent loading entire DLQ into memory
DLQ Item Structure¶
Each failed item in the DLQ is stored as a DeadLetteredItem with comprehensive failure information:
{
"item_id": "item-123", // (1)!
"item_data": { "file": "src/main.rs", "priority": 5 }, // (2)!
"first_attempt": "2025-01-11T10:25:00Z", // (3)!
"last_attempt": "2025-01-11T10:30:00Z",
"failure_count": 3, // (4)!
"failure_history": [ // (5)!
{
"attempt_number": 1,
"timestamp": "2025-01-11T10:30:00Z",
"error_type": { "CommandFailed": { "exit_code": 101 } },
"error_message": "cargo test failed with exit code 101",
"stack_trace": "thread 'main' panicked at src/main.rs:42...",
"agent_id": "agent-1",
"step_failed": "shell: cargo test",
"duration_ms": 45000,
"json_log_location": "/Users/user/.local/state/claude/logs/session-abc123.json"
}
],
"error_signature": "CommandFailed::cargo test failed with exit", // (6)!
"reprocess_eligible": true, // (7)!
"manual_review_required": false,
"worktree_artifacts": { // (8)!
"worktree_path": "/Users/user/.prodigy/worktrees/prodigy/agent-1",
"branch_name": "agent-1",
"uncommitted_changes": "src/main.rs modified",
"error_logs": "error.log contents..."
}
}
- Unique identifier for the work item
- Original work item data from input JSON
- When the item first failed (used for age-based filtering)
- Number of retry attempts made
- Complete history of all failure attempts with detailed error information
- Simplified error pattern for grouping similar failures
- Whether item can be automatically retried
- Preserved state from failed agent's execution environment
Source: DeadLetteredItem struct (src/cook/execution/dlq.rs:32-43)
Key Fields¶
Source: DeadLetteredItem struct (src/cook/execution/dlq.rs:32-43)
item_id: Unique identifier for the work itemitem_data: Original work item data from input JSONfirst_attempt: Timestamp of first failure attempt (DateTime) last_attempt: Timestamp of most recent failure attempt (DateTime) failure_count: Number of failed attempts (u32)failure_history: Array ofFailureDetailobjects capturing each attempterror_signature: Simplified error pattern for grouping similar failuresreprocess_eligible: Whether item can be retried automaticallymanual_review_required: Whether item needs human interventionworktree_artifacts: Captured state from failed agent's worktree (Optional)
Failure Detail Fields¶
Each attempt in failure_history is a FailureDetail object (src/cook/execution/dlq.rs:46-59):
# Source: src/cook/execution/dlq.rs:46-59 (FailureDetail struct)
{
"attempt_number": 1,
"timestamp": "2025-01-11T10:30:00Z",
"error_type": { "CommandFailed": { "exit_code": 101 } },
"error_message": "cargo test failed with exit code 101",
"stack_trace": "thread 'main' panicked at src/main.rs:42...",
"agent_id": "agent-1",
"step_failed": "claude: /process '${item}'",
"duration_ms": 45000,
"json_log_location": "/path/to/claude-session.json"
}
Field Details:
- attempt_number: Sequential attempt counter (u32)
- timestamp: When this attempt occurred (DateTimeerror_type: Error classification (see Error Types below)
- error_message: Human-readable error description
- stack_trace: Optional detailed stack trace
- agent_id: ID of the agent that failed
- step_failed: Command string that failed (e.g., "shell: cargo test")
- duration_ms: How long the attempt ran before failing (u64)
- json_log_location: Path to Claude JSON log for debugging (Optional)
Error Types¶
The error_type field uses the ErrorType enum (src/cook/execution/dlq.rs:62-71):
Timeout: Agent execution exceeded timeout limit
CommandFailed { exit_code: i32 }: Shell or Claude command returned non-zero exit code
Example:
ValidationFailed: Post-execution validation command failed
WorktreeError: Git worktree operation failed (creation, cleanup)
MergeConflict: Merge back to parent worktree failed
Common when multiple agents modify the same files.
ResourceExhausted: System resources (memory, disk) exhausted
Unknown: Unclassified error
Review error_message and stack_trace for details.
Worktree Artifacts¶
The WorktreeArtifacts structure captures the agent's execution environment (src/cook/execution/dlq.rs:74-80):
worktree_path: Path to agent's isolated worktree (PathBuf)branch_name: Git branch created for agent (String)uncommitted_changes: Files modified but not committed (Option) error_logs: Captured error output (Option)
Note: Both uncommitted_changes and error_logs are stored as optional strings, not arrays. If present, they contain descriptive text about the changes/logs rather than file lists.
These artifacts are preserved for debugging and can be accessed after failure.
DLQ Commands¶
Planned Feature - Stub Implementation
The DLQ CLI commands are currently implemented as stubs that only print status messages. Full functionality is planned for a future release.
Current Status:
- Command definitions exist in src/cli/args.rs:577-677
- Stub implementations in src/cli/commands/dlq.rs:1-74
- Commands will print confirmation messages but do not execute actual operations
See DLQ Integration for current workflow-level DLQ functionality.
Prodigy provides comprehensive CLI commands for managing the DLQ. The following sections describe the planned command interface.
List Command¶
View DLQ items across jobs:
# List all DLQ items
prodigy dlq list
# List items for specific job
prodigy dlq list --job-id mapreduce-1234567890
# List only retry-eligible items
prodigy dlq list --eligible
# Limit results
prodigy dlq list --limit 10
Inspect Command¶
Examine detailed information for a specific item:
# Inspect item by ID
prodigy dlq inspect item-123
# Inspect item in specific job
prodigy dlq inspect item-123 --job-id mapreduce-1234567890
Output includes: - Full item data - Complete failure history with all attempts - Error messages and stack traces - JSON log locations for debugging - Worktree artifacts
Analyze Command¶
Analyze failure patterns across items:
# Analyze all failures
prodigy dlq analyze
# Analyze specific job
prodigy dlq analyze --job-id mapreduce-1234567890
# Export analysis results
prodigy dlq analyze --export analysis.json
Analysis output includes:
- pattern_groups: Failures grouped by error type and message similarity
- error_distribution: Histogram of error types
- temporal_distribution: Failures over time
Retry Command¶
⚠️ STUB: This command is not yet implemented. Description below shows planned interface.
Reprocess failed items:
# Source: src/cli/args.rs:640-659
# Retry all failed items for a workflow
prodigy dlq retry <workflow_id>
# Control parallelism (default: 10)
prodigy dlq retry <workflow_id> --parallel 10
# Filter items to retry
prodigy dlq retry <workflow_id> --filter 'item.priority >= 5'
# Limit retry attempts per item (default: 3)
prodigy dlq retry <workflow_id> --max-retries 2
# Force retry even if not eligible
prodigy dlq retry <workflow_id> --force
Command Parameters:
- <workflow_id>: The workflow/job identifier to retry items from (e.g., "mapreduce-1234567890")
- --parallel: Number of concurrent retry workers (default: 10, src/cli/args.rs:653)
- --max-retries: Maximum retry attempts per item (default: 3, src/cli/args.rs:649)
- --filter: Expression to filter which items to retry (default: all eligible items, src/cli/args.rs:645)
- --force: Force retry of items marked as not eligible (default: false, src/cli/args.rs:657)
Supported Flags: --parallel, --max-retries, --filter, --force
Retry Behavior¶
The retry functionality is designed to handle large-scale DLQ reprocessing:
- Streaming support: Items are processed incrementally to avoid memory issues with large DLQs
- Parallel execution: Uses
--parallelflag (default: 10) to control concurrent workers - State updates:
- Successful items are removed from DLQ
- Failed items remain with updated
failure_history failure_countis incremented- Correlation tracking: Maintains original item IDs and correlation metadata
- Interruption safe: Supports stopping and resuming retry operations
Implementation Note: The DLQ reprocessing logic uses the DeadLetterQueue::reprocess method (src/cook/execution/dlq.rs:202-233) which filters for reprocess_eligible items before attempting retry.
Export Command¶
Export DLQ items for external analysis:
# Export to JSON (default)
prodigy dlq export dlq-items.json
# Export to CSV
prodigy dlq export dlq-items.csv --format csv
# Export specific job
prodigy dlq export output.json --job-id mapreduce-1234567890
Stats Command¶
View DLQ statistics:
# Overall DLQ statistics
prodigy dlq stats
# Stats for specific workflow
prodigy dlq stats --workflow-id my-workflow
Output includes: - Total items in DLQ - Items by error type - Average retry count - Retry-eligible vs manual review required - Temporal distribution
Clear Command¶
Remove items from DLQ:
# Clear all items for a workflow (prompts for confirmation)
prodigy dlq clear my-workflow
# Skip confirmation prompt
prodigy dlq clear my-workflow --yes
Purge Command¶
Clean up old DLQ items:
# Purge items older than 30 days
prodigy dlq purge --older-than-days 30
# Purge for specific job
prodigy dlq purge --older-than-days 30 --job-id mapreduce-1234567890
# Skip confirmation
prodigy dlq purge --older-than-days 30 --yes
Debugging with DLQ¶
Accessing Claude JSON Logs¶
Each FailureDetail includes a json_log_location field pointing to the Claude Code JSON log for that execution. This log contains:
- Complete conversation history
- All tool invocations and results
- Error details and stack traces
- Token usage statistics
# View JSON log from DLQ item
cat $(prodigy dlq inspect item-123 | jq -r '.failure_history[0].json_log_location')
# Pretty-print with jq
cat /path/to/session.json | jq '.'
For more details on Claude JSON logs, see Retry Metrics and Observability.
Common Debugging Workflow¶
Step-by-Step Debugging
Follow this workflow to diagnose and fix failures systematically:
-
List failed items:
-
Inspect specific failure:
-
Examine Claude logs:
-
Analyze failure patterns:
-
Fix underlying issue (code bug, config error, etc.)
-
Retry failed items:
Integration with MapReduce¶
The DLQ is tightly integrated with MapReduce workflows through the on_item_failure policy:
# Source: Workflow configuration (see src/config/mapreduce.rs for MapConfig schema)
name: my-workflow
mode: mapreduce
map:
input: "items.json"
json_path: "$.items[*]"
# Default policy: send failures to DLQ
on_item_failure: dlq
agent_template:
- claude: "/process '${item}'"
Available Policies¶
dlq(default): Failed items sent to DLQ, job continuesretry: Immediate retry with exponential backoffskip: Ignore failures, mark as skipped, continuestop: Halt entire workflow on first failurecustom: User-defined failure handler
Failure Flow¶
Work Item Processing
↓
Command Failed
↓
Retry Attempts (if configured)
↓
Still Failing?
↓
on_item_failure: dlq
↓
Create DeadLetteredItem
↓
Save to ~/.prodigy/dlq/{repo}/{job_id}/mapreduce/dlq/{job_id}/items/{item_id}.json
↓
Continue Processing Other Items
Best Practices¶
When to Retry vs Manual Fix¶
Choosing the Right Approach
Before retrying failed items, analyze the failure patterns with prodigy dlq analyze to determine if the issue is transient (safe to retry) or systematic (requires code changes first).
Automatic Retry (via prodigy dlq retry):
- Transient failures (network timeouts, resource contention)
- Flaky tests or intermittent issues
- Items that may succeed with more resources (--max-parallel 1)
Manual Fix (code changes, then retry): - Logic errors in processing code - Invalid assumptions about item data - Missing dependencies or configuration - Systematic failures affecting multiple items
DLQ Management¶
- Monitor regularly: Use
prodigy dlq statsto track failure rates - Analyze patterns: Run
prodigy dlq analyzeto identify systematic issues - Clean up old items: Periodically run
prodigy dlq purgeto remove resolved failures - Set review flags: Mark
manual_review_required: truefor items needing human investigation
Performance Considerations¶
Resource Management
When retrying large DLQs, start with lower parallelism (e.g., --max-parallel 5) to avoid overwhelming system resources. Monitor the first few retries before increasing parallelism.
- Large DLQs: Retry command uses streaming to handle thousands of items efficiently
- Parallelism: Tune
--max-parallelbased on failure type (CPU-bound vs I/O-bound) - Filtering: Use
--filterto target specific subsets of failures
Advanced: DLQ Storage Internals¶
For advanced debugging or direct file access, understanding the DLQ storage structure can be helpful.
Index File Structure¶
The index.json file provides metadata about the DLQ (src/cook/execution/dlq.rs:608-637):
# Source: src/cook/execution/dlq.rs:608-637 (index structure)
{
"job_id": "mapreduce-1234567890",
"item_count": 3,
"item_ids": ["item-123", "item-456", "item-789"],
"updated_at": "2025-01-11T10:30:00Z"
}
This index is automatically updated when items are added or removed from the DLQ.
Direct File Access¶
To inspect DLQ items directly:
# List all DLQ items for a job
ls ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/items/
# View a specific item
cat ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/items/item-123.json | jq
# Count total items
jq '.item_count' ~/.prodigy/dlq/prodigy/mapreduce-1234567890/mapreduce/dlq/mapreduce-1234567890/index.json
Note: Direct file access is provided for debugging. Always use the Prodigy CLI commands for production operations to ensure data consistency.
Cross-References¶
- Checkpoint and Resume: DLQ state preserved in checkpoints
- Event Tracking: DLQ operations emit trackable events
- Error Handling: Broader error handling strategies
- Worktree Architecture: Agent isolation and artifact preservation
- Retry Metrics and Observability: Monitoring retry behavior and failures