Preprocessor¶
The preprocessor feature allows ripgrep to search virtually any file type by transforming content before searching. This chapter covers how to use preprocessors to search PDFs, compressed files, and other non-text formats.
Quick Reference
Key Flags:
- --pre COMMAND: Run command to transform files before searching
- --pre-glob GLOB: Only preprocess files matching pattern (strongly recommended)
- -z/--search-zip: Built-in compression support (gzip, bzip2, xz, lz4, lzma, brotli, zstd)
Common Use Cases:
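- PDFs: rg --pre ./preprocess --pre-glob '*.pdf' 'pattern' (./preprocess is a small wrapper that runs pdftotext - -; given only a file path, pdftotext itself writes a .txt file instead of printing to stdout)
- Office docs: rg --pre ./preprocessor --pre-glob '*.{doc,docx}' 'pattern'
- Encrypted files: rg --pre ./decrypt --pre-glob '*.gpg' 'pattern'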
Overview¶
A preprocessor is a command that transforms file content before ripgrep searches it. This enables searching binary formats, encrypted files, compressed archives, and any content that can be converted to text.
The --pre flag takes a command that receives:
- File path as the first argument
- File content on stdin
The preprocessor outputs the transformed content to stdout, which ripgrep then searches.
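The contract above can be tried without ripgrep. The following is a minimal sketch, assuming a toy preprocessor named upper-pre (a hypothetical name) that upper-cases its input. The last command imitates how ripgrep calls a preprocessor: path as the first argument, file content on stdin.

```shell
#!/bin/sh
# Toy demonstration of the preprocessor contract; upper-pre is a made-up name.
workdir=$(mktemp -d)

# The preprocessor: $1 holds the file path, stdin carries the file content,
# and whatever is written to stdout is what would be searched.
cat > "$workdir/upper-pre" <<'EOF'
#!/bin/sh
exec tr '[:lower:]' '[:upper:]'
EOF
chmod +x "$workdir/upper-pre"

printf 'hello preprocessor\n' > "$workdir/sample.txt"

# Imitate ripgrep's invocation: path as argument, content on stdin.
result=$("$workdir/upper-pre" "$workdir/sample.txt" < "$workdir/sample.txt")
printf '%s\n' "$result"
```

This toy transform ignores its path argument. Real preprocessors typically use $1 to decide which conversion to apply.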
Preprocessor and stdin
When ripgrep receives input from stdin (rather than searching files), the preprocessor is not invoked. Preprocessors only run when searching actual files on disk.
How It Works¶
flowchart TD
    File["Binary File<br/>PDF, .gz, .docx"] --> Check{"--pre-glob<br/>match?"}
    Check -->|No Match| Direct["Direct Search<br/>UTF-8 content"]
    Check -->|Match| Pre["Preprocessor<br/>Transform to text"]
    Pre --> Stdout["Text Output<br/>to stdout"]
    Stdout --> Search["ripgrep<br/>Pattern Match"]
    Direct --> Search
    Search --> Results[Search Results]
    style Pre fill:#e1f5ff
    style Search fill:#e8f5e9
    style Check fill:#fff3e0
Figure: Preprocessor execution flow showing conditional transformation based on --pre-glob patterns.
Basic Usage: Searching PDFs¶
One common use case is searching PDF files. While PDFs are primarily visual documents, they often contain searchable text streams.
The Problem¶
Let's try searching a PDF without a preprocessor:
$ rg 'The Commentz-Walter algorithm' 1995-watson.pdf
No results! Even though the text exists in the PDF, ripgrep can't find it because PDFs use a binary format where text may not be encoded as simple UTF-8.
The Solution¶
First, convert the PDF manually to verify the text is there:
$ pdftotext 1995-watson.pdf > 1995-watson.txt
$ rg 'The Commentz-Walter algorithm' 1995-watson.txt
316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
7165:4.4 The Commentz-Walter algorithms
10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
Great! The text is searchable after conversion. Now let's automate this with a preprocessor.
Creating a Simple Preprocessor¶
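Create a shell script named preprocess that wraps pdftotext, reading the PDF from stdin and writing text to stdout:
#!/bin/sh
exec pdftotext - -
Make it executable:
$ chmod +x preprocess
Now search the PDF directly: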
$ rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf
316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
7165:4.4 The Commentz-Walter algorithms
10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
The preprocessor must be resolvable as a command. Either:
- Place it in a directory in your PATH
- Use an absolute or relative path (like ./preprocess)
Performance Benefits¶
Preprocessor-based PDF search can be faster than a specialized tool such as pdfgrep, even on a single file:
$ time rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf -c
6
real 0.697
user 0.684
sys 0.007
$ time pdfgrep 'The Commentz-Walter algorithm' 1995-watson.pdf -c
6
real 1.336
user 1.310
sys 0.023
When searching many PDFs, ripgrep's parallel processing makes the difference even more significant.
Building a Robust Preprocessor¶
The simple preprocessor above fails on non-PDF files:
$ echo foo > not-a-pdf
$ rg --pre ./preprocess 'foo' not-a-pdf
not-a-pdf: preprocessor command failed: '"./preprocess" "not-a-pdf"':
-------------------------------------------------------------------------------
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't find trailer dictionary
Handling Multiple File Types¶
Make the preprocessor conditional on file type. There are two approaches: match on the file extension, or inspect the content with the file command.
flowchart TD
    Start[File Input] --> Approach{"Detection<br/>Strategy"}
    Approach -->|Extension-Based| Ext[Check File Extension]
    Ext --> ExtMatch{*.pdf?}
    ExtMatch -->|Yes| NonEmpty{"File<br/>Non-empty?"}
    ExtMatch -->|No| Cat1[cat]
    NonEmpty -->|Yes| PDF1[pdftotext - -]
    NonEmpty -->|No| Cat1
    Approach -->|Content Sniffing| Sniff[Run 'file' Command]
    Sniff --> FileType{"Content<br/>Type?"}
    FileType -->|PDF| PDF2[pdftotext - -]
    FileType -->|Zstandard| Zstd[pzstd -cdq]
    FileType -->|gzip| Gzip[gzip -cdq]
    FileType -->|Other| Cat2[cat]
    PDF1 --> Out[Output to stdout]
    Cat1 --> Out
    PDF2 --> Out
    Zstd --> Out
    Gzip --> Out
    Cat2 --> Out
    style Ext fill:#e8f5e9
    style Sniff fill:#fff3e0
    style Out fill:#e1f5ff
Figure: Two file type detection strategies showing trade-offs between speed (extension-based) and accuracy (content sniffing).
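Extension-based detection:
#!/bin/sh
case "$1" in
*.pdf)
    # The -s test is true only when the file exists and is non-empty.
    if [ -s "$1" ]; then
        # Read the PDF from stdin and write text to stdout.
        exec pdftotext - -
    else
        exec cat
    fi
    ;;
*)
    # Anything else passes through unchanged.
    exec cat
    ;;
esac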
Pros: Fast, simple, works for well-named files
Cons: Fails if files lack proper extensions
Content sniffing with the file command:
#!/bin/sh
case "$1" in
*.pdf)
    # Handle PDFs by extension first
    if [ -s "$1" ]; then
        exec pdftotext - -
    else
        exec cat
    fi
    ;;
*)
    # Sniff content type for files without clear extensions.
    # -b omits the file name, so the name itself cannot match a pattern.
    case $(file -b "$1") in
    *Zstandard*)
        exec pzstd -cdq
        ;;
    *gzip*)
        exec gzip -cdq
        ;;
    *)
        exec cat
        ;;
    esac
    ;;
esac
Pros: Works even when files lack proper extensions
Cons: Slower due to file command invocation
Optimizing with --pre-glob¶
Performance Impact
Running a preprocessor on every file spawns a new process per file, which can significantly slow down searches. Always use --pre-glob to limit preprocessing to specific file types.
The --pre-glob flag limits preprocessing to files matching a glob pattern.
Performance Impact¶
Compare searching without and with --pre-glob:
# Preprocessor runs on ALL files
$ time rg --pre preprocessor 'fn is_empty' -c
crates/globset/src/lib.rs:1
crates/matcher/src/lib.rs:2
crates/ignore/src/overrides.rs:1
real 0.138
user 0.485
sys 0.209
# Preprocessor runs ONLY on PDF files
$ time rg --pre preprocessor --pre-glob '*.pdf' 'fn is_empty' -c
crates/globset/src/lib.rs:1
crates/matcher/src/lib.rs:2
crates/ignore/src/overrides.rs:1
real 0.008
user 0.010
sys 0.002
The performance difference is dramatic: 17x faster by limiting the preprocessor to PDF files only.
Usage Pattern¶
# Only preprocess PDFs
rg --pre ./preprocessor --pre-glob '*.pdf' 'search term'
# Preprocess multiple file types
rg --pre ./preprocessor --pre-glob '*.{pdf,doc,docx}' 'search term'
# Multiple glob patterns
rg --pre ./preprocessor --pre-glob '*.pdf' --pre-glob '*.doc' 'search term'
Best Practice
Always use --pre-glob when you know which file types need preprocessing. This keeps searches fast by avoiding unnecessary process spawning.
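To reason about which files a set of --pre-glob patterns would send to the preprocessor, the selection can be approximated in plain shell. This sketch uses shell case patterns as a stand-in for ripgrep's glob matcher (an assumption: the two differ in details such as brace alternatives and path handling), and needs_preprocessing is a hypothetical helper name.

```shell
#!/bin/sh
# Approximation only: shell `case` patterns stand in for ripgrep's globs.
# Mirrors --pre-glob '*.{pdf,doc,docx}' from the usage pattern above.
needs_preprocessing() {
    case $1 in
        *.pdf|*.doc|*.docx) return 0 ;;
        *)                  return 1 ;;
    esac
}

for f in report.pdf notes.txt letter.docx; do
    if needs_preprocessing "$f"; then
        echo "$f: preprocessor"
    else
        echo "$f: searched directly"
    fi
done
```

Only the files that reach the first branch would cost an extra process, which is why a narrow glob keeps searches fast.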
Preprocessor Use Cases¶
Beyond PDFs, preprocessors enable searching many file types:
PDF Text Extraction¶
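A minimal sketch of a PDF text-extraction preprocessor, following the extension-based pattern shown earlier in this chapter. It is written as a function (pdf_to_text, a name chosen here for illustration) so the logic can be tried on its own, and it assumes pdftotext is installed.

```shell
#!/bin/sh
# pdf_to_text PATH: read file content on stdin, write searchable text to stdout.
pdf_to_text() {
    case $1 in
        *.pdf)
            if [ -s "$1" ]; then
                pdftotext - -    # PDF on stdin, text on stdout
            else
                cat              # empty file: nothing to convert
            fi
            ;;
        *)
            cat                  # not a PDF: pass through unchanged
            ;;
    esac
}

# When run as a script with a path argument, act as the preprocessor.
if [ "$#" -gt 0 ]; then
    pdf_to_text "$1"
fi
```

Saved as an executable file (for example ./pdf-pre, a hypothetical name), it could be used as rg --pre ./pdf-pre --pre-glob '*.pdf' 'pattern'.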
Compressed Files¶
Built-in Compression Support
Ripgrep has built-in support for compressed files via -z/--search-zip:
- gzip (.gz, .tgz)
- bzip2 (.bz2, .tbz2)
- xz (.xz, .txz)
- lz4 (.lz4)
- lzma (.lzma)
- brotli (.br)
- zstd (.zst, .zstd)
- uncompress (.Z)
Use --pre only for compression formats not covered by -z.
Encrypted Files¶
#!/bin/sh
# Decrypt GPG-encrypted files
case "$1" in
*.gpg) exec gpg --decrypt "$1" 2>/dev/null ;;
*) exec cat ;;
esac
Encoding Conversion¶
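A minimal sketch of an encoding-conversion preprocessor, assuming iconv is available and that Latin-1 files carry a .latin1 extension (a hypothetical naming convention used only for this example). For encodings ripgrep already understands, its -E/--encoding flag may be simpler than a preprocessor.

```shell
#!/bin/sh
# to_utf8 PATH: convert stdin to UTF-8 based on the (hypothetical) extension.
to_utf8() {
    case $1 in
        *.latin1) iconv -f ISO-8859-1 -t UTF-8 ;;  # Latin-1 in, UTF-8 out
        *)        cat ;;                           # pass through unchanged
    esac
}

# When run as a script with a path argument, act as the preprocessor.
if [ "$#" -gt 0 ]; then
    to_utf8 "$1"
fi
```

The function form keeps the conversion easy to test by piping sample bytes through it before handing the script to --pre.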
Document Format Conversion¶
#!/bin/sh
# Convert Microsoft Office documents
# Requires: pandoc or catdoc
case "$1" in
*.docx) exec pandoc -t plain "$1" ;;
*.doc) exec catdoc "$1" ;;
*) exec cat ;;
esac
Relationship with -z/--search-zip¶
Ripgrep provides two ways to handle special file formats:
- -z/--search-zip: Built-in support for common compressed formats including gzip (.gz, .tgz), bzip2 (.bz2, .tbz2), xz (.xz, .txz), lz4 (.lz4), lzma (.lzma), brotli (.br), zstd (.zst, .zstd), and uncompress (.Z)
- --pre: Custom preprocessing for any file transformation
When to use each:
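| Use -z/--search-zip | Use --pre |
|---|---|
| Standard compression formats | Custom formats (PDF, Office docs) |
| No script to write; relies on the standard decompression tools being installed | Requires a preprocessor command you supply |
| One decompression process per compressed file | One preprocessor process per matching file |
| Automatic detection by file extension | You control the logic |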
You can combine both flags if needed:
rg -z --pre ./preprocessor --pre-glob '*.pdf' 'pattern'
Preprocessor Priority
When a file matches a --pre-glob pattern, the preprocessor takes precedence over -z/--search-zip for that file. This means if you specify both flags and a file matches the glob pattern, only the preprocessor runs (the built-in decompression is skipped).
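The precedence rule can be modeled as a small decision function. This is an illustrative model of the behavior described above, not ripgrep's actual code; choose_strategy and its yes/no arguments are invented for the example.

```shell
#!/bin/sh
# choose_strategy PRE_MATCH ZIP_FLAG COMPRESSED
#   PRE_MATCH:  yes if --pre is set and the file matches --pre-glob
#   ZIP_FLAG:   yes if -z/--search-zip was given
#   COMPRESSED: yes if the file has a recognized compression extension
choose_strategy() {
    if [ "$1" = yes ]; then
        echo preprocessor      # the preprocessor wins whenever it applies
    elif [ "$2" = yes ] && [ "$3" = yes ]; then
        echo decompress        # built-in decompression is the second choice
    else
        echo direct            # otherwise the file is searched as-is
    fi
}

choose_strategy yes yes yes   # compressed file that also matches --pre-glob
choose_strategy no  yes yes   # compressed file outside the glob
choose_strategy no  yes no    # ordinary text file
```

In practice this means a preprocessor whose glob covers compressed files has to decompress them itself.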
Performance Considerations¶
Overhead¶
Each preprocessor invocation spawns a new process per file. This overhead can be significant:
- Without --pre-glob: New process for every file searched
- With --pre-glob: New process only for matching files
- Parallelism helps: Multiple files processed simultaneously
Note: The -z/--search-zip flag also spawns decompression processes (one per compressed file), but uses optimized glob matching to automatically detect compression formats by file extension, avoiding the need for manual glob specification.
Optimization Tips¶
- Use --pre-glob to limit preprocessing to necessary files
- Keep preprocessors fast: Avoid expensive transformations if possible
- Use built-in features when available (-z instead of --pre for compression)
- Test preprocessors independently before using with ripgrep
- Consider preprocessing once: For repeated searches, convert files ahead of time
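The last tip, converting ahead of time, can be sketched as a small caching helper. This is one possible approach under stated assumptions: convert_once is a hypothetical helper name, and tr stands in for a real converter so the example runs anywhere.

```shell
#!/bin/sh
# convert_once SRC OUT CMD [ARGS...]
# Run CMD with SRC on stdin and OUT on stdout, but only when OUT is
# missing or older than SRC.
convert_once() {
    src=$1
    out=$2
    shift 2
    if [ ! -e "$out" ] || [ -n "$(find "$src" -newer "$out")" ]; then
        "$@" < "$src" > "$out"
    fi
}

# Demonstration with a stand-in converter (upper-casing via tr).
workdir=$(mktemp -d)
printf 'cached text\n' > "$workdir/doc.src"
convert_once "$workdir/doc.src" "$workdir/doc.txt" tr '[:lower:]' '[:upper:]'
cat "$workdir/doc.txt"
```

For PDFs the converter would be pdftotext - -, for example convert_once paper.pdf cache/paper.txt pdftotext - -, after which plain rg can search the cache directory with no preprocessor at all.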
Benchmarking¶
Test preprocessor impact on your workflow:
# Measure overhead
time rg --pre ./preprocessor 'pattern'
time rg --pre ./preprocessor --pre-glob '*.pdf' 'pattern'
time rg 'pattern' # without preprocessor for comparison
Security Considerations¶
Security Alert
Preprocessors execute arbitrary commands with access to file content. Be cautious when:
- Using untrusted preprocessor scripts
- Searching files from untrusted sources
- Handling file paths in preprocessors
CVE-2021-3013
CVE-2021-3013 affected ripgrep versions before 13 on Windows: when --pre or -z/--search-zip was used, ripgrep could execute a program from the current working directory instead of the intended command. Use ripgrep 13 or later.
Safe Scripting Practices¶
Verify file properties before processing to prevent errors and potential security issues.
# BAD: Vulnerable to command injection
eval "pdftotext $1 -"
# GOOD: Properly quoted
exec pdftotext "$1" -
Always quote variables and avoid eval with user-controlled input.
Use absolute paths for tools when possible to prevent PATH-based attacks.
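These practices can be combined into a couple of defensive helpers. A minimal sketch, with is_safe_input and dash_safe as hypothetical helper names: the first rejects anything that is not a readable regular file, and the second keeps a path that starts with - from being parsed as an option by the tool it is handed to.

```shell
#!/bin/sh
# is_safe_input PATH: succeed only for a readable regular file.
is_safe_input() {
    [ -f "$1" ] && [ -r "$1" ]
}

# dash_safe PATH: print PATH, prefixed with ./ when it starts with a dash,
# so a downstream tool cannot mistake it for a command-line option.
dash_safe() {
    case $1 in
        -*) printf './%s\n' "$1" ;;
        *)  printf '%s\n' "$1" ;;
    esac
}

dash_safe '-n.pdf'
dash_safe 'report.pdf'
```

Inside a preprocessor these would be used along the lines of is_safe_input "$1" || exit 1, followed by exec pdftotext "$(dash_safe "$1")" -.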
Testing and Debugging¶
Development Workflow
Always test preprocessors independently before integrating with ripgrep. This isolates issues and makes debugging faster:
- Test the preprocessor command directly on sample files
- Verify output is correct plain text
- Check exit codes (0 for success)
- Then integrate with ripgrep using --pre
Testing Preprocessors Independently¶
Test your preprocessor before using it with ripgrep:
# Test manually
./preprocessor test.pdf | head -20
# Verify exit code
./preprocessor test.pdf > /dev/null
echo $? # Should be 0 for success
Common Issues¶
Preprocessor not found
Solution: Use absolute/relative path or add to PATH
Preprocessor fails
Solution: Check preprocessor handles file type correctly
No output
Solution: Verify preprocessor outputs to stdout:
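One way to run these checks is a small helper that calls the preprocessor the way ripgrep would and reports a non-zero exit status or empty output. This is a sketch under stated assumptions: check_preprocessor is a hypothetical helper, not part of ripgrep.

```shell
#!/bin/sh
# check_preprocessor CMD FILE
# Call CMD with FILE as its argument and FILE's content on stdin, then
# report whether it failed, printed nothing, or produced output.
check_preprocessor() {
    pre=$1
    file=$2
    if out=$("$pre" "$file" < "$file"); then
        status=0
    else
        status=$?
    fi
    if [ "$status" -ne 0 ]; then
        echo "failed: exit status $status"
        return 1
    elif [ -z "$out" ]; then
        echo "empty: no output on stdout"
        return 1
    fi
    echo "ok"
}

# Example call (the file names are placeholders):
#   check_preprocessor ./preprocessor test.pdf
```

An "empty" result often means the tool wrote its output to a file instead of stdout, which is worth ruling out before looking at ripgrep itself.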
Debugging Flags¶
Use ripgrep's debugging output to see what's happening:
# Print diagnostic messages (written to stderr) while searching
rg --debug --pre ./preprocessor 'pattern'
Understanding Debug Output
The --debug flag makes ripgrep print diagnostic messages to stderr. ripgrep controls its logging through its own flags (--debug, and the more verbose --trace) rather than through the RUST_LOG environment variable. Among other things, the output reports files that were skipped, for example because of an ignore rule, which helps explain why a file never reached the preprocessor.
When a preprocessor command fails, ripgrep reports the failure together with the command's stderr in an error message, as in the not-a-pdf example earlier in this chapter.
Examples¶
Complete Multi-Format Preprocessor¶
A production-ready preprocessor handling multiple formats:
#!/bin/sh
# multi-preprocessor - Handle PDFs, Office docs, and compressed files
# Compression formats based on built-in support in crates/cli/src/decompress.rs:490-532
set -e # (1)!
FILE="$1"
# Check file is non-empty
[ -s "$FILE" ] || exec cat # (2)!
# Try extension-based matching first
case "$FILE" in
*.pdf)
    exec pdftotext - - # (3)!
    ;;
*.docx)
    exec pandoc -t plain "$FILE" # (4)!
    ;;
*.doc)
    exec catdoc "$FILE"
    ;;
*.xlsx|*.xls)
    exec ssconvert -T Gnumeric_stf:stf_csv "$FILE" fd://1 # (5)!
    ;;
esac
# Fall back to content sniffing
case $(file -b "$FILE") in # (6)!
*PDF*)
    exec pdftotext - -
    ;;
*Zstandard*)
    exec pzstd -cdq
    ;;
*gzip*)
    exec gzip -cdq
    ;;
*bzip2*)
    exec bzip2 -cdq
    ;;
*)
    exec cat # (7)!
    ;;
esac
1. Exit immediately on any error to prevent partial transformations
2. Return empty output for empty files instead of failing
3. Uses stdin (-) for input and stdout (-) for output
4. Converts DOCX to plain text format for searching
5. Converts spreadsheets to CSV format on file descriptor 1 (stdout)
6. Falls back to magic number detection for files without proper extensions
7. Pass through unchanged if no transformation needed
Usage:
# Search the document formats the script matches by extension
rg --pre ./multi-preprocessor --pre-glob '*.{pdf,doc,docx,xls,xlsx}' 'search term'
Summary¶
The preprocessor feature makes ripgrep a universal search tool:
- --pre COMMAND: Run command to transform files before searching (implementation in crates/core/flags/defs.rs:5453-5625)
- --pre-glob GLOB: Only preprocess files matching glob pattern
- Use cases: PDFs, compressed files, Office documents, encrypted content
- Performance: Use --pre-glob to minimize overhead
- Security: Validate inputs and avoid command injection
- Alternative: Use -z/--search-zip for built-in compression support
With preprocessors, ripgrep can search virtually any file format that can be converted to text.