Advanced Features

This section covers Debtmap’s advanced analysis capabilities: purity detection, data flow analysis, entropy-based complexity, and context-aware analysis.

Purity Detection

Debtmap detects pure functions - those without side effects that always return the same output for the same input.

What makes a function pure:

No I/O operations (file, network, database)
No mutable global state
No random number generation
No system calls
Deterministic output

Purity detection is optional:

Both is_pure and purity_confidence are Option types
May be None for some functions or languages where detection is not available
Rust has the most comprehensive purity detection support

Four-level purity classification: The PurityLevel enum (src/core/mod.rs:49-62) provides more nuanced classification than the binary is_pure:

StrictlyPure: No mutations whatsoever - pure mathematical functions
LocallyPure: Uses local mutations for efficiency but no external side effects (builder patterns, accumulators, owned mut self)
ReadOnly: Reads external state but doesn’t modify it (constants, &self methods)
Impure: Modifies external state or performs I/O (&mut self, statics, I/O)

This four-level classification enables better scoring for functions that use local mutations for efficiency but are functionally pure (referentially transparent). See Complexity Metrics for how purity affects scoring.

Confidence scoring (when available):

0.9-1.0: Very confident (no side effects detected)
0.7-0.8: Likely pure (minimal suspicious patterns)
0.5-0.6: Uncertain (some suspicious patterns)
0.0-0.4: Likely impure (side effects detected)

Example:

#![allow(unused)]
fn main() {
// Pure: confidence = 0.95
fn calculate_total(items: &[Item]) -> f64 {
    items.iter().map(|i| i.price).sum()
}

// Impure: confidence = 0.1 (I/O detected)
fn save_total(items: &[Item]) -> Result<()> {
    let total = items.iter().map(|i| i.price).sum();
    write_to_file(total)  // Side effect!
}
}

Benefits:

Pure functions are easier to test
Can be safely cached or memoized
Safe to parallelize
Easier to reason about

Data Flow Analysis

Debtmap builds a comprehensive DataFlowGraph that extends basic call graph analysis with variable dependencies, data transformations, I/O operations, and purity tracking.

Call Graph Foundation

Upstream callers - Who calls this function

Indicates impact radius
More callers = higher impact if it breaks

Downstream callees - What this function calls

Indicates dependencies
More callees = more integration testing needed

Example:

{
  "name": "process_payment",
  "upstream_callers": [
    "handle_checkout",
    "process_subscription",
    "handle_refund"
  ],
  "downstream_callees": [
    "validate_payment_method",
    "calculate_fees",
    "record_transaction",
    "send_receipt"
  ]
}

Variable Dependency Tracking

DataFlowGraph tracks which variables each function depends on (src/data_flow/mod.rs:119):

#![allow(unused)]
fn main() {
pub struct DataFlowGraph {
    // Maps function_id -> set of variable names used
    variable_deps: HashMap<FunctionId, HashSet<String>>,
    // ...
}
}

What it tracks:

Function parameters (primary source via extraction adapters)
Local variables accessed in function body
Captured variables (closures)

Note: Variable dependency tracking stores variable names only (as HashSet<String>). It does not track mutability information - that analysis is handled separately by the purity detection system.

Benefits:

Identify functions coupled through shared state
Detect potential side effect chains
Guide refactoring to reduce coupling

Example output:

{
  "function": "calculate_total",
  "variable_dependencies": ["items", "tax_rate", "discount", "total"],
  "parameter_count": 3,
  "local_var_count": 1
}

Data Transformation Patterns

DataFlowGraph tracks data transformations between functions. The TransformationType enum (src/organization/data_flow_analyzer.rs:35-46) classifies transformations by their input/output cardinality:

#![allow(unused)]
fn main() {
pub enum TransformationType {
    Direct,        // A → B (pure transformation)
    Aggregation,   // (A, B) → C (multiple inputs to single output)
    Decomposition, // A → (B, C) (single input to multiple outputs)
    Enrichment,    // A → Result<B> (validation/enrichment with Result/Option)
    Expansion,     // A → Vec<B> (single input to collection)
}
}

Classification logic (src/organization/data_flow_analyzer.rs:124-146):

Multiple input parameters → Aggregation
Return type is Result<T> or Option<T> → Enrichment
Return type is Vec<T> → Expansion
Return type is tuple → Decomposition
Default → Direct

Example usage:

#![allow(unused)]
fn main() {
// Aggregation: (items, discount_rate) → f64
fn calculate_discounted_total(items: &[Item], discount_rate: f64) -> f64 {
    items.iter().map(|i| i.price).sum::<f64>() * (1.0 - discount_rate)
}

// Enrichment: Config → Result<ValidatedConfig>
fn validate_config(config: Config) -> Result<ValidatedConfig> {
    // ...
}

// Expansion: Order → Vec<LineItem>
fn extract_line_items(order: &Order) -> Vec<LineItem> {
    order.items.clone()
}
}

Note: The DataFlowGraph.data_transformations field (src/data_flow/mod.rs:149) stores transformation_type as a String, allowing flexible pattern descriptions beyond the enum variants.

I/O Operation Detection

Tracks functions performing I/O operations for purity and performance analysis:

I/O categories tracked:

File I/O: std::fs, File::open, read_to_string
Network I/O: HTTP requests, socket operations
Database I/O: SQL queries, ORM operations
System calls: Process spawning, environment access
Blocking operations: thread::sleep, synchronous I/O in async

Example detection:

#![allow(unused)]
fn main() {
// Detected I/O operations: FileRead, FileWrite
fn save_config(config: &Config, path: &Path) -> Result<()> {
    let json = serde_json::to_string(config)?;  // No I/O
    std::fs::write(path, json)?;                 // FileWrite detected
    Ok(())
}
}

I/O metadata:

{
  "function": "save_config",
  "io_operations": ["FileWrite"],
  "is_blocking": true,
  "affects_purity": true,
  "async_safe": false
}

Purity Analysis Integration

DataFlowGraph integrates with purity detection to provide comprehensive side effect analysis:

Side effect tracking:

I/O operations (file, network, console)
Global state mutations
Random number generation
System time access
Non-deterministic behavior

Purity confidence factors:

1.0: Pure mathematical function, no side effects
0.8: Pure with deterministic data transformations
0.5: Mixed - some suspicious patterns
0.2: Likely impure - I/O detected
0.0: Definitely impure - multiple side effects

Example analysis:

{
  "function": "calculate_discount",
  "is_pure": true,
  "purity_confidence": 0.95,
  "side_effects": [],
  "deterministic": true,
  "safe_to_parallelize": true,
  "safe_to_cache": true
}

Modification Impact Analysis

DataFlowGraph calculates the impact of modifying a function:

#![allow(unused)]
fn main() {
pub struct ModificationImpact {
    pub function_name: String,
    pub affected_functions: Vec<String>,  // Upstream callers
    pub dependency_count: usize,          // Downstream callees
    pub has_side_effects: bool,
    pub risk_level: RiskLevel,
}
}

Risk level calculation:

Critical: Many upstream callers + side effects + low test coverage
High: Many callers OR side effects with moderate coverage
Medium: Few callers with side effects OR many callers with good coverage
Low: Few callers, no side effects, or well-tested

Example impact analysis:

{
  "function": "validate_payment_method",
  "modification_impact": {
    "affected_functions": 4,
    "dependency_count": 8,
    "has_side_effects": true,
    "risk_level": "High"
  }
}

Note: The affected_functions field contains the count of upstream callers. The actual function names can be obtained from the upstream_callers field in the function metadata.

Using modification impact:

# Analyze impact before refactoring
debtmap analyze . --format json | jq '.functions[] | select(.name == "validate_payment_method") | .modification_impact'

Impact analysis uses:

Refactoring planning: Understand blast radius before changes
Test prioritization: Focus tests on high-impact functions
Code review: Flag high-risk changes for extra scrutiny
Dependency management: Identify tightly coupled components

DataFlowGraph Methods

Key methods for data flow analysis:

#![allow(unused)]
fn main() {
// Add function with its dependencies
pub fn add_function(&mut self, function_id: String, callees: Vec<String>)

// Track variable dependencies
pub fn add_variable_dependency(&mut self, function_id: String, var_name: String)

// Record I/O operations
pub fn add_io_operation(&mut self, function_id: String, io_type: IoType)

// Calculate modification impact
pub fn calculate_modification_impact(&self, function_id: &str) -> ModificationImpact

// Get all functions affected by a change
pub fn get_affected_functions(&self, function_id: &str) -> Vec<String>

// Find functions with side effects
pub fn find_functions_with_side_effects(&self) -> Vec<String>
}

Integration in analysis pipeline:

Parser builds initial call graph
DataFlowGraph extends with variable/I/O tracking
Purity analyzer adds side effect information
Modification impact calculated for each function
Results used in prioritization and risk scoring

Connection to Unified Scoring:

The dependency analysis from DataFlowGraph directly feeds into the unified scoring system’s dependency factor (20% weight):

Dependency Factor Calculation: Functions with high upstream caller count or on critical paths from entry points receive higher dependency scores (8-10)
Isolated Utilities: Functions with few or no callers score lower (1-3) on dependency factor
Impact Prioritization: This helps prioritize functions where bugs have wider impact across the codebase
Modification Risk: The modification impact analysis uses dependency data to calculate blast radius when changes are made

Example:

Function: validate_payment_method
  Upstream callers: 4 (high impact)
  → Dependency Factor: 8.0

Function: format_currency_string
  Upstream callers: 0 (utility)
  → Dependency Factor: 1.5

Both have same complexity, but validate_payment_method gets higher unified score
due to its critical role in the call graph.

This integration ensures that the unified scoring system considers not just internal function complexity and test coverage, but also the function’s importance in the broader codebase architecture.

Entropy-Based Complexity

Advanced pattern detection to reduce false positives.

Token Classification:

#![allow(unused)]
fn main() {
enum TokenType {
    Variable,     // Weight: 1.0
    Method,       // Weight: 1.5 (more important)
    Literal,      // Weight: 0.5 (less important)
    Keyword,      // Weight: 0.8
    Operator,     // Weight: 0.6
}
}

Shannon Entropy Calculation:

H(X) = -Σ p(x) × log₂(p(x))

where p(x) is the probability of each token type.

Dampening Decision:

#![allow(unused)]
fn main() {
if entropy_score.token_entropy < 0.4
   && entropy_score.pattern_repetition > 0.6
   && entropy_score.branch_similarity > 0.7
{
    // Apply dampening
    effective_complexity = base_complexity × (1 - dampening_factor);
}
}

Output explanation:

Function: validate_input
  Cyclomatic: 15 → Effective: 5
  Reasoning:
    - High pattern repetition detected (85%)
    - Low token entropy indicates simple patterns (0.32)
    - Similar branch structures found (92% similarity)
    - Complexity reduced by 67% due to pattern-based code

Entropy Analysis Caching

EntropyAnalyzer includes an LRU-style cache for performance optimization when analyzing large codebases or performing repeated analysis.

Cache Structure

#![allow(unused)]
fn main() {
struct CacheEntry {
    score: EntropyScore,
    timestamp: Instant,
    hit_count: usize,
}
}

Cache configuration:

Default size: 1000 entries
Eviction policy: LRU (Least Recently Used)
Memory per entry: ~128 bytes
Total memory overhead: ~128 KB for default size

Cache Statistics

The analyzer tracks cache performance:

#![allow(unused)]
fn main() {
pub struct CacheStats {
    pub hits: usize,
    pub misses: usize,
    pub evictions: usize,
    pub hit_rate: f64,
    pub memory_usage: usize,
}
}

Example stats output:

{
  "entropy_cache_stats": {
    "hits": 3427,
    "misses": 1573,
    "evictions": 573,
    "hit_rate": 0.685,
    "memory_usage": 128000
  }
}

Hit rate interpretation:

> 0.7: Excellent - many repeated analyses, cache is effective
0.4-0.7: Good - moderate reuse, typical for incremental analysis
< 0.4: Low - mostly unique functions, cache less helpful

Performance Benefits

Typical performance gains:

Cold analysis: 100ms baseline (no cache benefit)
Incremental analysis: 30-40ms (~60-70% faster) for unchanged functions
Re-analysis: 15-20ms (~80-85% faster) for recently analyzed functions

Best for:

Watch mode: Analyzing on file save (repeated analysis of same files)
CI/CD: Comparing feature branch to main (overlap in functions)
Large codebases: Many similar functions benefit from pattern caching

Memory estimation:

Total cache memory = entry_count × 128 bytes

Examples:
- 1,000 entries: ~128 KB (default)
- 5,000 entries: ~640 KB (large projects)
- 10,000 entries: ~1.25 MB (very large)

Cache Management

Automatic eviction:

When cache reaches size limit, oldest entries evicted
Hit count influences retention (frequently accessed stay longer)
Timestamp used for LRU ordering

Cache invalidation:

Function source changes invalidate entry
Cache cleared between major analysis runs
No manual invalidation needed

Configuration (if exposed in future):

[entropy.cache]
enabled = true
size = 1000           # Number of entries
ttl_seconds = 3600    # Optional: expire after 1 hour

Context-Aware Analysis

Debtmap adjusts analysis based on code context:

Pattern Recognition:

Validation patterns (repetitive checks)
Dispatcher patterns (routing logic)
Builder patterns (fluent APIs)
Configuration parsers (key-value processing)

Adjustment Strategies:

Reduce false positives for recognized patterns
Apply appropriate thresholds by pattern type
Consider pattern confidence in scoring

Example:

#![allow(unused)]
fn main() {
// Recognized as "validation_pattern"
// Complexity dampening applied
fn validate_user_input(input: &UserInput) -> Result<()> {
    if input.name.is_empty() { return Err(Error::EmptyName); }
    if input.email.is_empty() { return Err(Error::EmptyEmail); }
    if input.age < 13 { return Err(Error::TooYoung); }
    // ... more similar validations
    Ok(())
}
}

Coverage Integration

Debtmap parses LCOV coverage data for risk analysis:

LCOV Support:

Standard format from most coverage tools
Line-level coverage tracking
Function-level aggregation

Coverage Index:

O(1) exact name lookups (~0.5μs)
O(log n) line-based fallback (~5-8μs)
~200 bytes per function
Thread-safe (Arc<CoverageIndex>)

Performance Characteristics

Index Build Performance:

Index construction: O(n), approximately 20-30ms for 5,000 functions
Memory usage: ~200 bytes per record (~2MB for 5,000 functions)
Scales linearly with function count

Lookup Performance:

Exact match (function name): O(1) average, ~0.5μs per lookup
Line-based fallback: O(log n), ~5-8μs per lookup
Cache-friendly data structure for hot paths

Analysis Overhead:

Coverage integration overhead: ~2.5x baseline analysis time
Target overhead: ≤3x (maintained through optimizations)
Example timing: 53ms baseline → 130ms with coverage (2.45x overhead)
Overhead includes index build + lookups + coverage propagation

When to use coverage integration:

Skip coverage (faster iteration): For rapid development iteration or quick local checks, omit --lcov to get baseline results 2.5x faster
Include coverage (comprehensive analysis): Use coverage integration for final validation, sprint planning, and CI/CD gates where comprehensive risk analysis is needed

Thread Safety:

Coverage index wrapped in Arc<CoverageIndex> for lock-free parallel access
Multiple analyzer threads can query coverage simultaneously
No contention on reads, suitable for parallel analysis pipelines

Memory Footprint:

Total memory = (function_count × 200 bytes) + index overhead

Examples:
- 1,000 functions: ~200 KB
- 5,000 functions: ~2 MB
- 10,000 functions: ~4 MB

Scalability:

Tested with codebases up to 10,000 functions
Performance remains predictable and acceptable
Memory usage stays bounded and reasonable

Generating coverage:

# Rust (using cargo-tarpaulin)
cargo tarpaulin --out lcov --output-dir target/coverage

# Or using cargo-llvm-cov
cargo llvm-cov --lcov --output-path target/coverage/lcov.info

Using with Debtmap:

debtmap analyze . --lcov target/coverage/lcov.info

Coverage dampening: When coverage data is provided, debt scores are dampened for well-tested code:

final_score = base_score × (1 - coverage_percentage)

This ensures well-tested complex code gets lower priority than untested simple code.

Debtmap Documentation

Advanced Features

Purity Detection

Data Flow Analysis

Call Graph Foundation

Variable Dependency Tracking

Data Transformation Patterns

I/O Operation Detection

Purity Analysis Integration

Modification Impact Analysis

DataFlowGraph Methods

Entropy-Based Complexity

Entropy Analysis Caching

Cache Structure

Cache Statistics

Performance Benefits

Cache Management

Context-Aware Analysis

Coverage Integration

Performance Characteristics

See Also

Keyboard shortcuts

Debtmap Documentation