Unicode Patterns
Ripgrep enables Unicode mode by default, so character classes, word boundaries, and case-insensitive matching all follow Unicode definitions rather than ASCII ones.
Unicode Character Classes
Use \p{Property} syntax to match Unicode character properties:
# Find emoji
rg '\p{Emoji}'
# Find Greek text
rg '\p{Greek}+'
# Find alphabetic characters (all scripts)
rg '\p{Alphabetic}+'
# Find uppercase letters (all scripts)
rg '\p{Uppercase}+'
How Unicode Properties Work
Unicode properties categorize characters by script, category, or attribute. When ripgrep evaluates a \p{Property} pattern, it checks each character's Unicode metadata:
graph LR
subgraph "Text: 'Hello Σ世界'"
C1[H] --> C2[e]
C2 --> C3[l]
C3 --> C4[l]
C4 --> C5[o]
C5 --> C6[' ']
C6 --> C7[Σ]
C7 --> C8[世]
C8 --> C9[界]
end
subgraph "Pattern: \\p{Greek}+"
P["Check each<br/>character"]
P --> M1{Is Greek?}
end
C7 -."Unicode<br/>property".-> M1
M1 -->|Yes| R1[✓ Match Σ]
style C7 fill:#e8f5e9
style R1 fill:#c8e6c9
Figure: Unicode property matching examines character metadata. Only 'Σ' matches \p{Greek} in this example.
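As a rough illustration of the figure, the same sample string can be piped through ripgrep. This is a minimal sketch that assumes `rg` is on your PATH and that your shell passes UTF-8 text through unchanged; the input is the made-up sample from the diagram.

```shell
# Feed the sample text on stdin and print only the matched part (-o).
printf 'Hello Σ世界\n' | rg -o '\p{Greek}+'   # prints: Σ

# The same text, matched against the Han script instead.
printf 'Hello Σ世界\n' | rg -o '\p{Han}+'     # prints: 世界

# Alphabetic covers letters from every script, so both runs of letters match.
printf 'Hello Σ世界\n' | rg -o '\p{Alphabetic}+'
# prints:
# Hello
# Σ世界
```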
Common Unicode Properties
| Property | Matches |
|---|---|
| \p{Emoji} | Emoji characters |
| \p{Greek} | Greek script |
| \p{Han} | Chinese characters |
| \p{Cyrillic} | Cyrillic script |
| \p{Alphabetic} | Alphabetic characters |
| \p{Uppercase} | Uppercase letters |
| \p{Lowercase} | Lowercase letters |
| \p{White_Space} | Whitespace |
| \p{any} | Any character (including newlines) |
Property Names Are Case-Insensitive
Unicode property names can be written in any case: \p{Greek}, \p{greek}, and \p{GREEK} are all equivalent.
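Using \p{any} for Newline Matching
\p{any} matches any Unicode codepoint, including newlines (\n), regardless of whether "dot all" mode is enabled. This provides an alternative to using . with the --multiline-dotall flag. Either way, a match can only span lines when multiline searching is enabled with -U/--multiline; without it, ripgrep matches one line at a time and --multiline-dotall has no effect.
# Match any character including newlines without --multiline-dotall
rg -U '\p{any}+'
# Equivalent to using . with multiline-dotall
rg -U --multiline-dotall '.+'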
See crates/core/flags/defs.rs:4249-4250 for implementation details.
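A small sketch of the difference, assuming `rg` is installed and using made-up three-line input. It passes -U (--multiline) throughout, since ripgrep otherwise matches one line at a time.

```shell
# \p{any} crosses line breaks without dot-all mode.
printf 'start\nmiddle\nend\n' | rg -U -o 'start\p{any}+end'

# A plain dot stops at each newline, so nothing matches here.
printf 'start\nmiddle\nend\n' | rg -U -o 'start.+end' || echo 'no match'

# With dot-all mode enabled, the dot behaves like \p{any}.
printf 'start\nmiddle\nend\n' | rg -U --multiline-dotall -o 'start.+end'
```

The first and third commands print all three lines as one match; the second reports no match.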
Note on Emoji Matching: \p{Emoji} matches single codepoints that carry the Unicode Emoji property, and the result may vary across regex engines and Unicode versions. The property also covers ASCII digits, # and *, because they begin keycap sequences, so plain numbers match too. Complex emoji (multi-codepoint sequences, skin tone modifiers, or zero-width joiners) may require additional pattern logic. Always test emoji patterns against your own data.
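A sketch of one caveat worth testing for, assuming `rg` with the default engine and made-up input. `\p{Extended_Pictographic}` appears here as an assumed stricter alternative, not as a recommendation from the ripgrep documentation, so check that it fits your data.

```shell
# ASCII digits carry the Unicode Emoji property (they start keycap sequences),
# so \p{Emoji} matches them even though no pictograph is present.
printf 'room 42\n' | rg -o '\p{Emoji}+'                      # prints: 42

# Extended_Pictographic leaves out digits, # and *.
printf 'room 42 🎉\n' | rg -o '\p{Extended_Pictographic}+'   # prints: 🎉
```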
For a comprehensive list of Unicode properties, see the Rust regex Unicode documentation.
Unicode Support Across Regex Engines
The availability and behavior of Unicode properties may vary between ripgrep's default regex engine and the PCRE2 engine. While both support Unicode, PCRE2 provides additional Unicode features and properties.
graph LR
subgraph Default["Default Engine (Rust regex)"]
direction TB
D1["Standard Unicode<br/>Properties"]
D2["Fast Performance"]
D3["UTF-8 Optimized"]
end
subgraph PCRE2["PCRE2 Engine"]
direction TB
P1["Extended Unicode<br/>Properties"]
P2["Advanced Features"]
P3["Perl Compatibility"]
end
Default -.->|"Use when"| U1["Standard properties<br/>sufficient"]
PCRE2 -.->|"Use when"| U2["Need extended<br/>Unicode support"]
style Default fill:#e1f5ff
style PCRE2 fill:#fff3e0
Figure: Default engine provides standard Unicode properties with optimal performance. PCRE2 offers extended Unicode features when needed (enable with --pcre2 or -P).
If you need extended Unicode support not available in the default engine, see the PCRE2 documentation for details.
Additionally, Unicode support depends on the Unicode version built into the regex engine that your ripgrep binary uses, so it can change between ripgrep releases. Character property definitions and emoji classifications evolve across Unicode versions.
Unicode-Aware Metacharacters
By default, these metacharacters are Unicode-aware:
- \w: Matches all Unicode word characters (not just ASCII)
- \s: Matches all Unicode whitespace
- \d: Matches Unicode decimal digits
- \b: Unicode word boundaries
# Match Unicode word characters (includes accented letters, etc.)
rg '\w+'
# Match Unicode whitespace
rg '\s+'
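To see the effect on concrete input, here is a minimal sketch that assumes `rg` is installed and uses made-up sample strings piped on stdin.

```shell
# Unicode mode: accented letters count as word characters.
printf 'naïve café\n' | rg -o '\w+'
# prints:
# naïve
# café

# \d also matches decimal digits outside ASCII, such as fullwidth digits.
printf 'qty ４２\n' | rg -o '\d+'   # prints: ４２
```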
Case-Insensitive Unicode
The -i flag performs Unicode case folding:
# Matches "café", "CAFÉ", "Café", etc.
rg -i 'café'
# Works across all scripts
rg -i 'Σ' # Matches Σ, σ, and final sigma ς (Greek)
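A quick check of what case folding does and does not cover, sketched with made-up input and assuming `rg` is installed.

```shell
# Unicode case folding: the accented forms match, the unaccented one does not.
printf 'CAFÉ\nCafé\ncafe\n' | rg -i 'café'
# prints:
# CAFÉ
# Café
```

Case folding is not the same as accent folding or normalization: ripgrep compares codepoints, so an é typed as e followed by a combining accent is a different sequence from the precomposed é.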
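Disabling Unicode
For ASCII-only searches with better performance, use --no-unicode. With this flag, \w, \d, \s, and \b fall back to their ASCII definitions, and -i folds ASCII letters only.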
Performance Considerations
While Unicode mode provides rich character class support, it can impact performance in certain scenarios:
- Frequent word character matching: Patterns like \w{100} (repeated Unicode word checks) are slower than their ASCII equivalents
- Large files with ASCII-only content: If you know your content is ASCII-only, --no-unicode provides measurable speedup
- Word boundaries: \b and \B use Unicode word definitions, which are more computationally expensive than ASCII boundaries
When to disable Unicode:
- Processing ASCII-only data (source code, logs, configuration files)
- Performance-critical searches on large codebases
- Pattern uses \w, \d, \s, or \b extensively
See crates/core/flags/defs.rs:4930-4934 for implementation context.
Note: --no-unicode affects the entire search, not individual patterns. To turn Unicode off for only part of a pattern, use the inline flag instead, for example (?-u:\w).
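The difference is easiest to see side by side. This is a minimal sketch with made-up input, assuming `rg` is installed; the last command uses the regex engine's inline `(?-u:...)` flag to limit the change to one group.

```shell
# Default (Unicode mode): whole words match.
printf 'naïve café\n' | rg -o '\w+'
# prints:
# naïve
# café

# ASCII-only \w: the accented letters split the words.
printf 'naïve café\n' | rg --no-unicode -o '\w+'
# prints:
# na
# ve
# caf

# Inline flag: ASCII-only \w for this group only, same result as above.
printf 'naïve café\n' | rg -o '(?-u:\w)+'
```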
Common Use Cases
Validating International Names
Pattern for International Names
This pattern validates names that may contain accents, diacritics, or non-Latin scripts:
# Match names with Unicode letters (supports accents, non-Latin scripts)
rg '^\p{Alphabetic}+( \p{Alphabetic}+)*$' # (1)!
# Find names containing specific scripts
rg '\p{Han}+\s+\p{Latin}+' # Chinese + Latin names
- Matches one or more alphabetic characters, followed by optional space-separated words (e.g., "José García", "李明", "Müller")
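To try the name pattern without a data file, the sketch below feeds a few made-up names on stdin. It assumes `rg` is installed and that the names are typed with precomposed accented letters.

```shell
# Only lines made up entirely of letters and single spaces match.
printf '%s\n' 'José García' '李明' 'Müller' 'R2-D2' "O'Brien" \
  | rg '^\p{Alphabetic}+( \p{Alphabetic}+)*$'
# prints:
# José García
# 李明
# Müller
```

Names containing digits, hyphens, or apostrophes are rejected by this pattern, so extend the character class if your data includes them.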
Processing Multilingual Text
# Extract sentences from mixed-script documents
rg '\p{Uppercase}\p{Alphabetic}+.*?[.!?]' # (1)!
# Find all non-ASCII text
rg '[^\p{ASCII}]+' # (2)!
# Identify specific language blocks
rg '\p{Arabic}+' --only-matching # (3)!
- Matches sentences that start with an uppercase letter, continue with alphabetic characters, and end with ., !, or ? (scripts without letter case or with their own sentence punctuation, such as 。, are not covered)
- Negated ASCII property matches any character outside the ASCII range (useful for finding internationalized content)
- Extracts Arabic text blocks; use --only-matching to output just the matched text rather than the whole line
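A short sketch of the non-ASCII search on made-up input, assuming `rg` is installed. It shows how -n and -o change what is reported.

```shell
# -n prints line numbers, which helps locate non-ASCII characters.
printf 'plain ascii\nprice: 10 €\n' | rg -n '[^\p{ASCII}]'    # prints: 2:price: 10 €

# -o isolates the non-ASCII characters themselves.
printf 'plain ascii\nprice: 10 €\n' | rg -o '[^\p{ASCII}]+'   # prints: €
```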
Data Validation
Cleaning Unicode Text
These patterns help identify and clean problematic Unicode characters in data files:
# Preview whitespace normalization (prints rewritten lines; files are not modified)
rg '\p{White_Space}+' --replace ' ' # (1)!
# Find problematic characters
rg '[\p{Cc}--\p{White_Space}]' # (2)!
- Collapses each run of Unicode whitespace (tabs, non-breaking spaces, etc.) to a single regular space in the printed output; ripgrep never edits the files themselves
- Finds non-whitespace control characters (general category Cc) that may cause display or parsing issues
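Both checks can be tried on made-up input, assuming `rg` is installed. The second command is a sketch that uses the Cc (control) general category with class subtraction rather than a printable-character class, and writes the control character U+0001 as the octal escape \001.

```shell
# --replace rewrites the printed output only; the input itself is untouched.
printf 'a\t \tb\n' | rg '\p{White_Space}+' --replace ' '          # prints: a b

# Count lines that contain a control character other than whitespace.
printf 'ok\nbad\001line\n' | rg -c '[\p{Cc}--\p{White_Space}]'     # prints: 1
```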
Working with Emoji
# Find lines containing emoji
rg '\p{Emoji}' # (1)!
# Extract emoji from text
rg '\p{Emoji}+' --only-matching # (2)!
# Find lines without emoji
rg '^[^\p{Emoji}]+$' # (3)!
- Matches any line containing at least one character with the Emoji property
- Extracts runs of emoji without the surrounding text (useful for emoji inventories)
- Matches non-empty lines that contain no Emoji characters; because ASCII digits, # and * carry the Emoji property, lines containing them are excluded as well