Detection algorithms
Detection is the fourth pipeline stage. It analyzes correlated traces to identify seven types of anti-patterns: N+1 queries, redundant calls, slow operations, excessive fanout, chatty services, connection pool saturation and serialized-but-parallelizable calls.
Shared pattern: borrowed HashMap keys
The N+1, redundant and slow detectors all group spans by a composite key. The key insight is that the spans live in the Trace struct, which outlives the detector function, so the keys can borrow from the spans instead of cloning them:
// N+1: group by (event_type, template)
HashMap<(&EventType, &str), Vec<usize>>
// Redundant: group by (event_type, template, params)
HashMap<(&EventType, &str, &[String]), Vec<usize>>
// Slow: group by (event_type, template)
HashMap<(&EventType, &str), Vec<usize>>
The values are Vec<usize>: indices into trace.spans rather than cloned spans. This keeps the HashMap small and avoids copying the event data.
For a trace with 50 spans, each having a 40-character template string, borrowed keys save 50 × 40 = 2,000 bytes of String allocations per grouping pass.
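The grouping pass can be sketched as follows. This is a minimal illustration, not the project's code: the `Span` and `EventType` definitions are simplified, hypothetical stand-ins, and `group_by_template` is a name chosen for this example.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the real types.
#[derive(Debug, PartialEq, Eq, Hash)]
pub enum EventType {
    Sql,
    HttpOut,
}

pub struct Span {
    pub event_type: EventType,
    pub template: String,
    pub params: Vec<String>,
}

/// Groups span indices by (event_type, template) without cloning any String.
/// The keys borrow from `spans`, so the map cannot outlive the slice.
pub fn group_by_template(spans: &[Span]) -> HashMap<(&EventType, &str), Vec<usize>> {
    let mut groups: HashMap<(&EventType, &str), Vec<usize>> = HashMap::new();
    for (i, span) in spans.iter().enumerate() {
        groups
            .entry((&span.event_type, span.template.as_str()))
            .or_default()
            .push(i);
    }
    groups
}
```

The borrow checker enforces the lifetime relationship described above: the returned map holds references into `spans`, so it has to be dropped before the trace is.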
N+1 detection
Algorithm
- Group spans by (&EventType, &str template)
- Skip groups with fewer than threshold occurrences (default 5)
- Count distinct parameter sets via HashSet<&[String]>
- Skip groups with fewer than threshold distinct params (same params = redundant, not N+1)
- Compute time window between earliest and latest timestamp
- Skip groups where the window exceeds window_limit_ms (default 500ms)
- Assign severity: Critical if >= 10 occurrences, Warning otherwise
Distinct params via borrowed slices
let distinct_params: HashSet<&[String]> = indices
.iter()
.map(|&i| trace.spans[i].params.as_slice())
.collect();
Using &[String] as a HashSet key is a critical design choice:
- No allocation: borrows the existing Vec as a slice reference
- No collision bug: directly compares the full Vec content, unlike a join(",") approach where ["a,b"] and ["a", "b"] would produce the same joined string
Rust's standard library implements Hash and Eq for &[T] when T: Hash + Eq, making this zero-cost.
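The collision can be reproduced in a few lines. The sketch below compares the slice-keyed count with a join-based count; both function names are made up for this example, and the join variant is included only to show the bug.

```rust
use std::collections::HashSet;

/// Counts distinct parameter sets by borrowing each Vec<String> as a slice.
pub fn count_distinct_params(param_sets: &[Vec<String>]) -> usize {
    let distinct: HashSet<&[String]> = param_sets.iter().map(|p| p.as_slice()).collect();
    distinct.len()
}

/// Join-based alternative, shown only to illustrate the collision:
/// it allocates one String per set and merges sets that differ in arity.
pub fn count_distinct_joined(param_sets: &[Vec<String>]) -> usize {
    let distinct: HashSet<String> = param_sets.iter().map(|p| p.join(",")).collect();
    distinct.len()
}
```

For the two sets `["a,b"]` and `["a", "b"]`, the slice-keyed version reports two distinct sets while the joined version reports one.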
Iterator-based window computation
pub fn compute_window_and_bounds_iter<'a>(
mut iter: impl Iterator<Item = &'a str>,
) -> (u64, &'a str, &'a str) {
let Some(first) = iter.next() else {
return (0, "", "");
};
let mut min_ts = first;
let mut max_ts = first;
let mut has_second = false;
for ts in iter {
has_second = true;
if ts < min_ts { min_ts = ts; }
if ts > max_ts { max_ts = ts; }
}
// ...
}
Why iterator instead of &[&str]? The caller would need to collect timestamps into a Vec first:
// Old (allocates):
let timestamps: Vec<&str> = indices.iter().map(|&i| ...).collect();
let (w, min, max) = compute_window_and_bounds(&timestamps);
// New (zero allocation):
let (w, min, max) = compute_window_and_bounds_iter(
indices.iter().map(|&i| trace.spans[i].event.timestamp.as_str())
);
The iterator-based version eliminates one Vec<&str> allocation per detection group. With 3 detectors × multiple groups per trace × thousands of traces, this adds up.
The has_second boolean replaces a count variable that was only used to check count < 2. This avoids incrementing a counter on every iteration.
ISO 8601 timestamp parser
fn parse_timestamp_ms(ts: &str) -> Option<u64> {
let time_part = ts.split('T').nth(1)?;
let time_part = time_part.trim_end_matches('Z');
let mut colon_parts = time_part.split(':');
let hours: u64 = colon_parts.next()?.parse().ok()?;
let minutes: u64 = colon_parts.next()?.parse().ok()?;
let sec_str = colon_parts.next()?;
// ... parse seconds and fractional part
}
Why not chrono? chrono adds ~150KB to the binary and parses ~200ns per timestamp. This hand-rolled parser handles the fixed format (YYYY-MM-DDTHH:MM:SS.mmmZ) in ~5ns by splitting on known delimiters and using iterator .next() calls instead of collecting into Vecs.
The parser uses iterators throughout (split(':') -> .next(), split('.') -> .next()) to avoid allocating intermediate Vec<&str> collections.
The parser computes milliseconds since Unix epoch by parsing both the date (YYYY-MM-DD) and time components. The date-to-days conversion uses the Howard Hinnant algorithm (public domain), which requires no external dependencies.
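As an illustration of the approach, here is a self-contained sketch of such a parser. It is not the project's implementation: the function bodies, the validation rules and the handling of the fractional part (first three digits, missing digits treated as zero) are assumptions made for this example. The date conversion follows Howard Hinnant's days_from_civil algorithm.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (Howard Hinnant's days_from_civil algorithm).
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400); // floor division, also correct for negative years
    let yoe = y - era * 400; // year of era, in [0, 399]
    let mp = i64::from((month + 9) % 12); // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + i64::from(day) - 1; // day of the shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parses "YYYY-MM-DDTHH:MM:SS[.fff]Z" into milliseconds since the Unix epoch.
/// Returns None on malformed input or dates before 1970.
pub fn parse_timestamp_ms(ts: &str) -> Option<u64> {
    let (date_part, time_part) = ts.split_once('T')?;
    let time_part = time_part.trim_end_matches('Z');

    let mut date = date_part.split('-');
    let year: i64 = date.next()?.parse().ok()?;
    let month: u32 = date.next()?.parse().ok()?;
    let day: u32 = date.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }

    let mut time = time_part.split(':');
    let hours: u64 = time.next()?.parse().ok()?;
    let minutes: u64 = time.next()?.parse().ok()?;
    let mut sec_parts = time.next()?.split('.');
    let seconds: u64 = sec_parts.next()?.parse().ok()?;

    // Keep the first three fractional digits; missing digits count as zero.
    let mut millis = 0u64;
    if let Some(frac) = sec_parts.next() {
        let mut scale = 100u64;
        for b in frac.bytes().take(3) {
            if !b.is_ascii_digit() {
                return None;
            }
            millis += u64::from(b - b'0') * scale;
            scale /= 10;
        }
    }

    let days = u64::try_from(days_from_civil(year, month, day)).ok()?;
    Some(days * 86_400_000 + hours * 3_600_000 + minutes * 60_000 + seconds * 1_000 + millis)
}
```

Every step works on iterators over borrowed `&str` slices, so the sketch performs no heap allocation.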
Lexicographic timestamp comparison
Min/max timestamps are found via string comparison: if ts < min_ts { min_ts = ts; }. This works because ISO 8601 timestamps with fixed-width fields (2025-07-10T14:32:01.123Z) sort chronologically when compared lexicographically. This is guaranteed by the ISO 8601 standard, Section 5.3.3.
Sanitizer-aware classification
OpenTelemetry agents and database drivers collapse SQL literals to placeholder tokens before the statement reaches perf-sentinel. The placeholder style depends on the stack: JDBC agents produce bare ?, PostgreSQL native drivers (pgx, asyncpg, sqlx, node-pg) emit $1/$2 (which normalize_sql rewrites to $? with empty params since v0.7.7), Python DB-API drivers emit %s, .NET drivers emit @p0/@Name, and Oracle/SQLAlchemy emit :name. In all cases the sanitized statement reaches perf-sentinel with the placeholder already in place and an empty params vector. The standard distinct_params >= threshold check sees one distinct empty params slice and never fires. The redundant detector then groups all the spans together and misclassifies them as redundant_sql.
The heuristic in crates/sentinel-core/src/detect/sanitizer_aware.rs recovers the correct classification via the following signals, evaluated in order:
- looks_sanitized: every span has a recognized placeholder in its template (?, $?, %s, @alpha, :alpha) and an empty params vector. See template_has_placeholder in sanitizer_aware.rs for the full list. Required to activate the heuristic at all.
- has_orm_scope: at least one OpenTelemetry instrumentation scope on the spans matches a known ORM marker (Hibernate, Spring Data, EF Core, SQLAlchemy, ActiveRecord, GORM, Prisma, Diesel, etc.). Markers are matched with a word-boundary check (preceded and followed by a non-alphanumeric byte), so jpa only fires on spring-data-jpa and friends, never on myappjpastats. A positive match is treated as strong evidence of N+1.
- timing_variance_suggests_n_plus_one: when the scope signal is absent, fall back to the coefficient of variation of duration_us. True N+1 hits different rows with different cache states, so the spread is wider, while cached redundant calls cluster tightly. Threshold 0.5 is empirical.
- sequential_siblings_indexed (Strict mode only): every span shares one non-empty parent_span_id and the group chains prev.end_us <= next.start_us after sort by start time. Bounds are computed in microseconds to avoid the silent truncation of sub-millisecond durations. Substitutes for has_orm_scope on bare-driver stacks (Vert.x reactive PG, pgx, asyncpg, sqlx, Prisma queryRaw) that never emit an ORM scope.
- high_occurrence (Strict mode, all branches): a high occurrence count (>= 3 x n_plus_one_threshold, default 15) serves as both a primary signal and a corroborator. Under the looks_sanitized guard (params empty, template has ?), 15+ identical sanitized templates in one trace is structurally n+1 regardless of ORM scope, sequential siblings, or timing variance. Legacy polling loops below the threshold (typical 5-10 calls per request) stay classified as redundant_sql.
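The word-boundary rule for ORM markers can be sketched as below. This is an illustrative reading of the rule, not the code in sanitizer_aware.rs: the function name `scope_has_marker` is hypothetical, and treating the start and end of the scope string as a boundary is an assumption.

```rust
/// Returns true when `marker` occurs in `scope` (ASCII case-insensitive)
/// and is delimited on both sides by a non-alphanumeric byte.
/// Assumption: the start or end of the string also counts as a boundary.
pub fn scope_has_marker(scope: &str, marker: &str) -> bool {
    let (s, m) = (scope.as_bytes(), marker.as_bytes());
    if m.is_empty() || m.len() > s.len() {
        return false;
    }
    s.windows(m.len()).enumerate().any(|(start, window)| {
        let end = start + m.len();
        window.eq_ignore_ascii_case(m)
            && (start == 0 || !s[start - 1].is_ascii_alphanumeric())
            && (end == s.len() || !s[end].is_ascii_alphanumeric())
    })
}
```

With this rule, `jpa` matches `io.opentelemetry.spring-data-jpa` because it is preceded by `-` and ends the string, but it does not match `myappjpastats`, where it is surrounded by letters.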
The four emission modes (Auto, Strict, Always, Never) are documented in Configuration § "sanitizer_aware_classification" with their precision/recall trade-offs.
Known limit
looks_sanitized cannot tell a sanitized literal ? apart from a PostgreSQL JSONB existence operator (data ? 'key') when the latter happens to appear in a query with no other literals. The harm direction is asymmetric: a misclassified JSONB group flips from redundant_sql to n_plus_one_sql, both of which contribute equally to GreenOps avoidable_io_ops; only the suggestion text differs.
HTTP extension (0.7.8+)
The same dispatch also covers HTTP outbound groups via classify_http_group_indexed. HTTP has no looks_sanitized analogue (the normalizer always collapses path IDs to {id}/{uuid}, params are never erased to empty the way a SQL sanitizer erases them), and no ORM scope concept. The HTTP path therefore relies on a narrower signal set:
- Auto/Always: timing variance alone (CV >= 0.5).
- Strict: a primary signal (HTTP placeholder in the template, high occurrence, or sequential siblings) corroborated by timing variance. Unlike the SQL path, high occurrence alone is not sufficient corroboration for HTTP, because without the looks_sanitized gate a busy polling loop or CDN-warmed repeated call would otherwise be promoted to n_plus_one_http.
Known limit: query-string redaction
N+1 HTTP detection requires the varying parameter to be visible in the span. An N+1 loop that varies a path segment is detected (distinct extracted params, or the {id} placeholder anchors the Strict primary). An N+1 loop that varies a query parameter is invisible when the instrumentation redacts the query string before export. OpenTelemetry .NET System.Net.Http redacts to ?* by default, so every call carries a byte-identical url.full, distinct_params collapses to 1, and the group is correctly classified as redundant_http. The distinguishing parameter was destroyed upstream, so no trace consumer can recover it. See Limitations § "HTTP query-string redaction and N+1 visibility" for the operator-facing workarounds.
Redundant detection
Borrowed slice keys
HashMap<(&EventType, &str, &[String]), Vec<usize>>
The three-part key includes the full params slice, ensuring that two spans with the same template but different params are in different groups. This is the correct behavior: redundant detection flags exact duplicates (same template AND same params).
The use of &[String] instead of joining params into a single string prevents a subtle collision bug: ["a,b"] (one param containing a comma) and ["a", "b"] (two params) would produce the same joined key "a,b" but are semantically different parameter sets.
Severity
- Info (< 5 occurrences): common for config lookups, health checks
- Warning (>= 5 occurrences): likely a loop bug or missing cache
The threshold of 2 (minimum to flag) catches any exact duplicate. Unlike N+1, which requires 5+ occurrences, even 2 identical queries in one request are suspicious and worth flagging at Info level.
ORM bind parameters
ORMs that use named bind parameters (Entity Framework Core with @__param_0, Hibernate with ?1) produce SQL spans where actual parameter values are not visible in db.statement/db.query.text. In this case, N+1 patterns (same query with different values) appear as redundant queries (same template, same visible params), because perf-sentinel cannot distinguish the bound values. Both findings correctly identify the repeated query pattern. ORMs that inline literal values (SeaORM raw statements, JDBC without prepared statements) allow accurate N+1 vs redundant classification.
Sanitizer-aware classification (0.5.7+)
The same shape appears whenever the OpenTelemetry agent runs its SQL statement sanitizer (default ON), since literals are collapsed to ? before the span reaches perf-sentinel. The standard distinct-params rule sees one bucket of empty params and rejects the group, so the redundant detector misclassifies the N+1 as redundant_sql and the operator gets the wrong remediation.
The 0.5.7 sanitizer-aware heuristic recovers the correct classification by running a second pass over the same (event_type, template) groups that the first pass rejected. It activates only when every span in the group has an empty params vector and a recognized placeholder in its template (the on-wire signature of a sanitized N+1). Since v0.7.7 the template_has_placeholder check recognizes five styles: bare ? (JDBC), $? (PostgreSQL native, normalized from $1/$2), %s (Python DB-API), @alpha (.NET, excluding @@ system variables), :alpha (Oracle/SQLAlchemy, excluding :: casts). Truly literal-free queries like SELECT NOW() have no placeholder in the template and do not activate the heuristic. It then evaluates two independent signals:
- Instrumentation scope marker (high confidence). Per-span instrumentation_scopes chains are searched, case-insensitively, for any of the known ORM substrings: spring-data, hibernate, jpa, micronaut-data, jdbi, r2dbc, entityframeworkcore, entity-framework, sqlalchemy, django, active-record/activerecord, gorm, sequelize, prisma, typeorm, mongoose, sea-orm, diesel. Bare SQL drivers like sqlx (Go/Rust), pgx, asyncpg and the Vert.x reactive PG client are intentionally excluded: their n+1 patterns are handled by the sequential-siblings signal instead. A match flips the verdict to LikelyNPlusOne.
- Timing-variance fallback (medium confidence). When no ORM marker is present, the heuristic computes the coefficient of variation (std-dev / mean) of duration_us. True N+1 lookups hit different rows with different cache states, so durations spread out (CV typically 0.4 to 1.0), while cached redundant calls cluster tightly (CV near 0). The threshold of 0.5 is empirical and is the only knob in the heuristic. At least 3 spans are required for a stable variance estimate.
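The timing-variance signal reduces to a few lines of arithmetic. The sketch below is a hedged illustration: the function names are hypothetical, the population standard deviation (dividing by n) is an assumption, and so is the inclusive `>=` comparison against the 0.5 threshold.

```rust
/// Coefficient of variation (std-dev / mean) of span durations in microseconds.
/// Returns None for fewer than 3 samples or a zero mean.
/// Assumption: population standard deviation (divide by n).
pub fn coefficient_of_variation(durations_us: &[u64]) -> Option<f64> {
    if durations_us.len() < 3 {
        return None;
    }
    let n = durations_us.len() as f64;
    let mean = durations_us.iter().map(|&d| d as f64).sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = durations_us
        .iter()
        .map(|&d| {
            let diff = d as f64 - mean;
            diff * diff
        })
        .sum::<f64>()
        / n;
    Some(variance.sqrt() / mean)
}

/// True when the spread of durations is wide enough to suggest N+1.
pub fn variance_suggests_n_plus_one(durations_us: &[u64]) -> bool {
    const CV_THRESHOLD: f64 = 0.5; // empirical threshold quoted in the text
    coefficient_of_variation(durations_us).map_or(false, |cv| cv >= CV_THRESHOLD)
}
```

Four identical 100 µs spans give a CV of 0 and stay below the threshold, while durations of 100, 400, 1000 and 2500 µs give a CV of roughly 0.92 and cross it.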
The configurable [detection] sanitizer_aware_classification mode gates emission across four points on a recall-vs-precision dial: auto (default) emits when either signal fires, strict (0.5.8+) requires a primary signal (ORM scope OR sequential siblings) plus a corroborating signal (variance OR, on the ORM branch, high occurrence count), always reclassifies every sanitized group regardless of signal, and never disables the second pass entirely. Findings emitted by the heuristic carry classification_method = SanitizerHeuristic so consumers can distinguish them from direct classifications. The mode picks where to sit on the trade-off:
- auto favors recall: catches all ORM-induced N+1 because the ORM scope alone fires the verdict, at the cost of absorbing legitimate redundant_sql findings on Spring Data / EF Core stacks (a findById(sameId) called in a loop served from row cache flips to n_plus_one_sql).
- strict favors precision: preserves redundant_sql findings on moderate-count cached identical queries (below the 3 x threshold bar) because the timing-variance signal stays low. Above the bar (default 15 occurrences), any sanitized group fires regardless of ORM scope, sequential siblings, or variance. Recommended when actionable redundant_sql findings are valuable signal in your environment.
Known limits: a real single-param redundancy whose literal happens to be collapsed by the sanitizer (e.g. SELECT * FROM config WHERE key = ? queried 10 times for the same key) cannot be distinguished from an N+1 without scope or variance signal. In auto mode it flips to n_plus_one_sql whenever an ORM scope is present (harm-reducing direction: batch fetch is a strict superset of "cache one value"). In strict mode it stays redundant_sql because the timing variance is low. In always mode it always flips. In never mode the heuristic is bypassed entirely.
The timing-variance signal (timing_variance_suggests_n_plus_one, coefficient of variation > 0.5) carries an asymmetric-harm tuning: a false positive merely swaps redundant_sql for n_plus_one_sql (same avoidable_io_ops weight, only the suggestion text differs), while a false negative leaves a real N+1 silent, so the threshold favors false positives. Under strict the signal becomes load-bearing as the sole corroborator on the ORM branch below the high-occurrence bar, and it has a warm-cache blind spot: a real ORM-induced N+1 against a fully warm row cache (e.g. 100 primary-key lookups with every row in shared_buffers) can cluster within roughly 10% (CV around 0.1) and stay silent. The 0.5 threshold is kept across modes pending empirical validation in the simulation lab; if lab traffic shows it too restrictive under strict, the right follow-up is a [detection] sanitizer_aware_min_cv knob rather than a new global default.
Slow detection
Saturation arithmetic
let threshold_us = threshold_ms.saturating_mul(1000);
// ...
if max_duration_us > threshold_us.saturating_mul(5) {
Severity::Critical
}
saturating_mul returns u64::MAX on overflow instead of wrapping around (a plain multiplication wraps modulo 2^64 in release builds and panics in debug builds). This prevents a malicious or misconfigured threshold_ms = u64::MAX from disabling severity thresholds.
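A small sketch makes the difference concrete. Both functions are hypothetical helpers written for this example; the wrapping variant exists only for contrast and is not something the detector is claimed to contain.

```rust
/// True when the slowest span exceeds five times the configured threshold.
/// Saturating arithmetic keeps a huge threshold huge.
pub fn is_critical(max_duration_us: u64, threshold_ms: u64) -> bool {
    let threshold_us = threshold_ms.saturating_mul(1000);
    max_duration_us > threshold_us.saturating_mul(5)
}

/// Contrast only: the same check with wrapping arithmetic.
/// An oversized threshold can wrap to a tiny value and flag everything.
pub fn is_critical_wrapping(max_duration_us: u64, threshold_ms: u64) -> bool {
    let threshold_us = threshold_ms.wrapping_mul(1000);
    max_duration_us > threshold_us.wrapping_mul(5)
}
```

With `threshold_ms = u64::MAX / 1000 + 1`, the wrapping product is 384 µs, so a 10 ms span would be reported as Critical. The saturating version clamps the threshold to `u64::MAX` and reports nothing.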
Not part of waste ratio
Slow findings have green_impact.estimated_extra_io_ops = 0. They are necessary operations that happen to be slow: they need optimization (indexing, caching), not elimination. Including them in the waste ratio would conflate "avoidable I/O" (N+1, redundant) with "slow I/O" (needs a different fix).
Detection orchestration
pub fn detect(traces: &[Trace], config: &DetectConfig) -> Vec<Finding> {
let mut findings = Vec::new();
for trace in traces {
findings.extend(detect_n_plus_one(trace, ...));
findings.extend(detect_redundant(trace));
findings.extend(detect_slow(trace, ...));
}
findings
}
The detectors run sequentially on each trace. While they could theoretically share a single grouping pass, the key types differ ((&EventType, &str) vs (&EventType, &str, &[String])) and the separate implementations are clearer and independently testable. With typical trace sizes of 10-50 spans, multiple O(n) passes are negligible.
Fanout detection
Algorithm
- Group spans by parent_span_id
- Skip groups where the parent has max_fanout or fewer children (default 20)
- For each parent exceeding the threshold, emit an ExcessiveFanout finding
- Severity: Warning if > max_fanout, Critical if > 3x max_fanout
The fanout detector uses a HashMap<&str, usize> span index for O(1) parent lookup and compute_window_and_bounds to compute the chronological span of child timestamps in a single pass.
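The counting and severity steps can be sketched as follows. This is a simplified illustration under stated assumptions: the `Span` struct is a hypothetical stand-in with only the fields needed here, the tuple return type replaces the real finding type, and the window computation is left out.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Severity {
    Warning,
    Critical,
}

/// Hypothetical minimal span: only the parent link matters for fanout.
pub struct Span {
    pub span_id: String,
    pub parent_span_id: Option<String>,
}

/// Returns (parent_span_id, child_count, severity) for every parent with
/// more than `max_fanout` direct children, ordered by parent id.
pub fn detect_fanout(spans: &[Span], max_fanout: usize) -> Vec<(&str, usize, Severity)> {
    // Count direct children per parent, borrowing the parent id as the key.
    let mut children: HashMap<&str, usize> = HashMap::new();
    for span in spans {
        if let Some(parent) = span.parent_span_id.as_deref() {
            *children.entry(parent).or_insert(0) += 1;
        }
    }
    let mut findings: Vec<(&str, usize, Severity)> = children
        .into_iter()
        .filter(|&(_, count)| count > max_fanout)
        .map(|(parent, count)| {
            let severity = if count > max_fanout.saturating_mul(3) {
                Severity::Critical
            } else {
                Severity::Warning
            };
            (parent, count, severity)
        })
        .collect();
    findings.sort_by_key(|finding| finding.0); // deterministic output order
    findings
}
```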
Not part of waste ratio
Like slow findings, fanout findings have green_impact.estimated_extra_io_ops = 0. Excessive fanout is a structural concern (too many child operations per parent) that needs architectural optimization, not I/O elimination. Both the dedup loop and the green_impact enrichment use FindingType::is_avoidable_io() to make this determination, ensuring a single source of truth.
Cross-trace slow percentiles
In batch mode, detect_slow_cross_trace collects slow spans across all traces and computes p50/p95/p99 percentiles per normalized template. This complements the per-trace slow detection by identifying templates that are consistently slow across multiple requests.
- Only spans exceeding the threshold are collected (pre-filter for performance)
- Only templates appearing in at least 2 distinct traces are reported (single-trace cases are handled by per-trace detection)
- Percentile computation uses the nearest-rank method via div_ceil
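The nearest-rank method picks the value at rank ceil(p / 100 × n) in the sorted sample. A minimal sketch, assuming an already sorted slice and a hypothetical function name:

```rust
/// Nearest-rank percentile of an ascending-sorted slice.
/// `pct` must be in 1..=100. Returns None for an empty slice or invalid `pct`.
pub fn percentile_nearest_rank(sorted: &[u64], pct: usize) -> Option<u64> {
    if sorted.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // rank = ceil(pct / 100 * n), computed in integers to avoid float rounding.
    let rank = (pct * sorted.len()).div_ceil(100);
    sorted.get(rank - 1).copied()
}
```

For five samples, p50 resolves to rank 3 (the middle value) and both p95 and p99 resolve to rank 5 (the maximum), which is the expected behavior of nearest-rank on small samples.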
Chatty service detection
Algorithm
- Filter spans to HTTP outbound only (type: http_out)
- Count total HTTP outbound spans in the trace
- If count < chatty_service_min_calls (default 15), skip
- Collect the top called normalized endpoints for the suggestion message
- Assign severity: Warning if > threshold, Critical if > 3x threshold
Input: trace with N spans
Output: 0 or 1 ChattyService finding
filter spans where type == http_out
if count(http_spans) < chatty_service_min_calls:
    return []
group http_spans by normalized template
sort groups by count descending
top_endpoints = first 5 groups
severity = Critical if count >= 3 * threshold else Warning
emit finding with top_endpoints in suggestion
Complexity: O(n) to filter and count, O(k log k) to sort groups where k is the number of distinct templates. Since k << n in practice, this is effectively O(n).
Difference from fanout
Excessive fanout detects a single parent with too many direct children. Chatty service detects an entire trace with too many outbound HTTP calls, independently of the parent-child structure. A trace can trigger both when a single parent generates all the calls or only chatty service when the calls are spread across multiple parents.
Not part of waste ratio
Chatty service findings have green_impact.estimated_extra_io_ops = 0. The detector flags an architectural concern (too many inter-service calls per request), not a batching opportunity. The calls may all be necessary; the problem is that the service boundary is too fine-grained. FindingType::is_avoidable_io() returns false for ChattyService.
Connection pool saturation detection
Algorithm
- Filter spans to SQL only (type: sql)
- Group SQL spans by service name
- For each service group, compute peak concurrency via sweep-line
- If peak concurrency < pool_saturation_concurrent_threshold (default 10), skip
- Assign severity: Warning if > threshold, Critical if > 3x threshold
Input: trace with N spans, grouped by service
Output: 0 or more PoolSaturation findings (one per service)
for each service in sql_spans_by_service:
    events = []
    for span in service_spans:
        start = parse_timestamp(span.timestamp)
        end = start + span.duration_us
        events.push((start, +1))
        events.push((end, -1))
    sort events by timestamp, with -1 before +1 on ties
    current = 0
    peak = 0
    for (ts, delta) in events:
        current += delta
        peak = max(peak, current)
    if peak >= pool_saturation_concurrent_threshold:
        emit finding
Complexity: O(n log n) for the sort step, O(n) for the sweep. Total: O(n log n) per service group.
Sweep-line tie-breaking
When a span ends and another begins at the exact same microsecond, the algorithm processes the end event (-1) before the start event (+1). This avoids inflating peak concurrency when spans are merely adjacent rather than overlapping.
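The sweep and its tie-breaking rule fit in a short function. This sketch assumes intervals are already expressed as `(start_us, end_us)` pairs on a common clock; the function name is hypothetical.

```rust
/// Peak number of simultaneously open intervals.
/// Each interval is (start_us, end_us) with start_us <= end_us.
pub fn peak_concurrency(intervals: &[(u64, u64)]) -> usize {
    let mut events: Vec<(u64, i32)> = Vec::with_capacity(intervals.len() * 2);
    for &(start, end) in intervals {
        events.push((start, 1));
        events.push((end, -1));
    }
    // Tuples sort by timestamp first, then by delta, so on a tie the
    // end event (-1) is processed before the start event (+1).
    events.sort_unstable();

    let mut current: i32 = 0;
    let mut peak: i32 = 0;
    for (_, delta) in events {
        current += delta;
        peak = peak.max(current);
    }
    peak as usize // peak starts at 0 and only grows, so it is never negative
}
```

Two back-to-back spans `(0, 10)` and `(10, 20)` yield a peak of 1, not 2, because the first span closes before the second one opens at the shared timestamp.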
Not part of waste ratio
Pool saturation findings have green_impact.estimated_extra_io_ops = 0. High concurrency is not avoidable I/O. It signals potential contention on the database connection pool, which is a tuning or architectural concern. FindingType::is_avoidable_io() returns false for PoolSaturation.
Serialized calls detection
Algorithm
- Group sibling spans by parent_span_id
- For each parent group, sort children by end time (ascending)
- Find the longest non-overlapping subsequence via dynamic programming (Weighted Interval Scheduling with unit weights)
- If the optimal sequence has >= serialized_min_sequential (default 3) spans with distinct templates, emit a finding
- Severity: always Info (heuristic, inherent false positive risk)
Input: trace with N spans, grouped by parent_span_id
Output: 0 or more SerializedCalls findings
for each parent_id in spans_by_parent:
    children = spans with this parent_id
    if len(children) < serialized_min_sequential:
        skip
    sort children by end_time ascending
    // Predecessor computation: for each span i, binary search for p(i),
    // the rightmost span j (j < i) whose end_time <= span i's start_time.
    // O(log n) per span.
    // DP recurrence:
    // dp[i] = max(dp[i-1], dp[p(i)] + 1)
    // where dp[i] = longest non-overlapping subsequence in children[0..=i]
    // Backtrack from dp[n-1] to reconstruct the selected spans.
    // Guard: predecessor must be strictly less than current index
    // to guarantee termination on degenerate input (zero-duration spans).
    if len(selected) >= serialized_min_sequential
       AND distinct_templates(selected) > 1:
        emit finding for selected sequence
Complexity: O(n log n) for sorting + O(n log n) for all binary searches + O(n) for the DP fill and backtrack = O(n log n) total per parent group. This is the same asymptotic cost as the simpler greedy approach, but the DP guarantees finding the longest possible non-overlapping sequence. For example, given spans A:[0-200ms], B:[100-150ms], C:[160-300ms], D:[310-400ms], a greedy approach sorted by start time would select {A, D} (length 2), while the DP correctly identifies {B, C, D} (length 3).
The binary search uses partition_point directly on the sorted slice, avoiding a separate predecessor array allocation.
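The DP and its backtrack can be sketched on bare `(start, end)` pairs. This is an illustration of the technique rather than the detector's code: the function name is hypothetical, templates and the distinct-template check are omitted, and `dp` is shifted by one so that `dp[i]` covers the first `i` spans.

```rust
/// Longest sequence of pairwise non-overlapping intervals, in end-time order.
/// Two intervals are compatible when one ends at or before the other starts.
pub fn longest_non_overlapping(spans: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut sorted = spans.to_vec();
    sorted.sort_by_key(|&(_, end)| end);
    let n = sorted.len();

    // Number of spans in sorted[..i] that end at or before sorted[i] starts.
    // The prefix is sorted by end time, so partition_point is a binary search.
    let predecessor =
        |i: usize| sorted[..i].partition_point(|&(_, end)| end <= sorted[i].0);

    // dp[i] = length of the best sequence using only the first i spans.
    let mut dp = vec![0usize; n + 1];
    for i in 0..n {
        dp[i + 1] = dp[i].max(dp[predecessor(i)] + 1);
    }

    // Backtrack. predecessor(i - 1) <= i - 1 < i, so `i` strictly decreases
    // on every iteration, even for zero-duration spans.
    let mut selected = Vec::new();
    let mut i = n;
    while i > 0 {
        let p = predecessor(i - 1);
        if dp[p] + 1 >= dp[i - 1] {
            selected.push(sorted[i - 1]);
            i = p;
        } else {
            i -= 1;
        }
    }
    selected.reverse();
    selected
}
```

On the example above, A `(0, 200)`, B `(100, 150)`, C `(160, 300)` and D `(310, 400)`, the function returns B, C, D.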
Why info only
The detector cannot observe data dependencies between calls. Two sequential calls to different services may be intentionally ordered (e.g. create a record, then notify a dependent service). The info severity signals an investigation opportunity, not a confirmed defect.
Template filtering
The detector skips sequences where all spans share the same normalized template. That pattern is N+1 (same operation repeated with different params), not serialization. By requiring different templates, the detector targets the "fetch user, then fetch orders, then fetch preferences" pattern where the calls are independent and could run concurrently.
Time savings estimate
The finding includes the potential time savings: total_sequential_duration - max_individual_duration. If 3 sequential calls each take 100ms, parallelizing them could reduce latency from 300ms to 100ms, saving 200ms. This is a best-case estimate that assumes no shared resource contention.
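The estimate is simple arithmetic. A minimal sketch, with a hypothetical function name and durations in microseconds:

```rust
/// Best-case latency saved by running the calls concurrently:
/// total sequential duration minus the longest individual call.
pub fn potential_savings_us(durations_us: &[u64]) -> u64 {
    let total = durations_us
        .iter()
        .fold(0u64, |acc, &d| acc.saturating_add(d));
    let longest = durations_us.iter().copied().max().unwrap_or(0);
    total - longest // total >= longest, so this cannot underflow
}
```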
Not part of waste ratio
Serialized call findings have green_impact.estimated_extra_io_ops = 0. Parallelizing calls does not reduce the total number of I/O operations. It reduces latency, not I/O volume. FindingType::is_avoidable_io() returns false for SerializedCalls.
Detection orchestration (updated)
pub fn detect(traces: &[Trace], config: &DetectConfig) -> Vec<Finding> {
let mut findings = Vec::new();
for trace in traces {
findings.append(&mut detect_n_plus_one(trace, ...));
findings.append(&mut detect_redundant(trace));
findings.append(&mut detect_slow(trace, ...));
findings.append(&mut detect_fanout(trace, config.max_fanout));
findings.append(&mut detect_chatty(trace, config.chatty_service_min_calls));
findings.append(&mut detect_pool_saturation(trace, config.pool_saturation_concurrent_threshold));
findings.append(&mut detect_serialized(trace, config.serialized_min_sequential));
}
findings
}
The seven detectors run sequentially on each trace. append(&mut ...) is used instead of extend() to move each detector's findings into the result with a single bulk copy, without iterator overhead. Cross-trace slow percentile analysis runs separately in pipeline.rs after per-trace detection and before scoring.
Cross-trace temporal correlation (daemon mode)
In daemon mode (perf-sentinel watch), perf-sentinel sees findings from all traces over time. The CrossTraceCorrelator detects recurring temporal co-occurrences between findings from different services: "every time the N+1 in order-svc fires, pool saturation appears in payment-svc within 2 seconds."
Internal state
pub struct CrossTraceCorrelator {
occurrences: VecDeque<FindingOccurrence>,
pair_counts: HashMap<PairKey, PairState>,
source_totals: HashMap<CorrelationEndpoint, u32>,
config: CorrelationConfig,
}
Three data structures track the correlation state:
- occurrences: a VecDeque of recent finding occurrences, ordered by timestamp. Each entry records a CorrelationEndpoint (finding_type, service, template) and a timestamp_ms. This is the rolling window.
- pair_counts: a HashMap keyed by PairKey (source endpoint, target endpoint). Each value holds the co-occurrence count, a bounded reservoir of observed lag values, a total_observations counter, a per-pair SplitMix64 PRNG state and first/last seen timestamps. This is the correlation accumulator.
- source_totals: a HashMap counting how many times each CorrelationEndpoint is currently in the window. Used as the denominator in the confidence calculation. Maintained incrementally (incremented on push_back, decremented on pop_front); entries are removed when the count reaches zero so the map stays bounded by the number of distinct endpoints, not the number of occurrences.
The ingest() algorithm
ingest() is called from process_traces after findings are produced and confidence is stamped. It takes a &[Finding] batch and a now_ms timestamp. The algorithm has five steps:
- Evict stale entries. Walk occurrences from front to back, popping entries older than now_ms - window_ms (default 10 minutes) and decrementing source_totals for each evicted endpoint. This is O(k) where k is the number of expired entries.
- Prune stale pair counts. A single HashMap::retain pass over pair_counts removes pairs whose last_seen_ms is outside the window. O(pairs).
- Scan for co-occurrences. For each incoming finding, construct a CorrelationEndpoint. Iterate occurrences backwards (most recent first). For each recent occurrence from a different service within lag_threshold_ms (default 5 seconds), increment the pair counter and record the lag via reservoir sampling (see below). The backwards scan breaks early once it reaches entries beyond the lag threshold, keeping this O(l) where l is the number of occurrences within the lag window.
- Append to window. Push the new finding occurrence onto the back of occurrences and increment its source_totals count.
- Enforce pair cap. If pair_counts.len() exceeds max_tracked_pairs (default 10,000), use select_nth_unstable_by_key (O(n) average) to find the lowest-count entries and remove them until the cap is met. This eviction prioritizes retaining the most significant correlations.
The active_correlations() filter
active_correlations() iterates over pair_counts and applies two thresholds:
- min_co_occurrences (default 5): pairs that have co-occurred fewer times are filtered out.
- min_confidence (default 0.7): confidence is co_occurrence_count / source_total_occurrences. Pairs below this ratio are filtered out.
For each qualifying pair, the function computes median_lag_ms and converts first_seen_ms/last_seen_ms to ISO 8601 via time::millis_to_iso8601.
Reservoir sampling for lag values
A hot pair firing thousands of times within the window would otherwise grow lags_ms without bound (megabytes per pair). To keep memory per pair flat, record_lag uses Algorithm R reservoir sampling capped at MAX_LAG_SAMPLES = 256:
- While the reservoir has space, append unconditionally.
- Once full, draw r uniformly in [0, total_observations) via SplitMix64. If r < MAX_LAG_SAMPLES, replace lags_ms[r]. Conditional on r < k, r is itself uniform in [0, k), so the slot pick is unbiased without a second PRNG draw.
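The two rules above can be sketched as a small reservoir type. This is an illustration, not the correlator's code: `LagReservoir` is a hypothetical stand-in for the lag-related part of PairState, the seed is passed in directly, and the modulo reduction used to draw `r` carries a slight bias that a production implementation might avoid.

```rust
pub const MAX_LAG_SAMPLES: usize = 256;

/// SplitMix64 pseudo-random generator (standard constants).
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for the lag-related fields of a pair's state.
pub struct LagReservoir {
    pub lags_ms: Vec<f64>,
    pub total_observations: u64,
    rng: SplitMix64,
}

impl LagReservoir {
    pub fn new(seed: u64) -> Self {
        Self {
            lags_ms: Vec::with_capacity(MAX_LAG_SAMPLES),
            total_observations: 0,
            rng: SplitMix64::new(seed),
        }
    }

    /// Algorithm R: fill the reservoir first, then replace slot `r` when a
    /// draw in [0, total_observations) lands inside the reservoir.
    pub fn record_lag(&mut self, lag_ms: f64) {
        self.total_observations += 1;
        if self.lags_ms.len() < MAX_LAG_SAMPLES {
            self.lags_ms.push(lag_ms);
            return;
        }
        // Assumption: modulo reduction, which is very slightly biased.
        let r = self.rng.next_u64() % self.total_observations;
        if r < MAX_LAG_SAMPLES as u64 {
            self.lags_ms[r as usize] = lag_ms;
        }
    }
}
```

Because the generator is seeded explicitly, two reservoirs built with the same seed and fed the same lags end up with identical samples.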
The PRNG is a SplitMix64 state per PairState, seeded at construction from now_ms ^ (hash_endpoint(source) << 17) ^ hash_endpoint(target). hash_endpoint is a deterministic FNV-1a over the endpoint's finding_type, service and template strings (NOT the DefaultHasher, which uses a per-process RandomState and would make the correlator non-deterministic across runs). Two daemon runs replaying the same trace file produce identical reservoir samples and therefore identical median lags.
Median lag calculation
The median() helper sorts a clone of the lag values and returns the middle element (odd length) or the midpoint of the two middle elements (even length). Sorting is bounded by MAX_LAG_SAMPLES thanks to the reservoir, so the median computation is O(k log k) with k = 256 regardless of how often the pair fires.
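A sketch of such a helper, assuming `f64` lag values and returning `None` for an empty input (the empty-input behavior is an assumption, since the text does not specify it):

```rust
/// Median of a slice of f64 values, computed on a sorted copy.
pub fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b)); // total order, no panic on NaN
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```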
Memory management
Four mechanisms bound memory usage:
- Rolling window eviction: the occurrences deque is pruned on every ingest() call. Entries older than window_ms are removed and their source_totals count is decremented. Entries reaching count zero are removed from the map.
- Pair count pruning: pair_counts entries whose last_seen_ms falls outside the window are removed.
- Reservoir cap: each PairState.lags_ms is bounded at MAX_LAG_SAMPLES = 256 f64 values (~2 KB per pair), regardless of how often the pair fires.
- Pair cap with lowest-count eviction: when pair_counts.len() exceeds max_tracked_pairs, the least significant pairs (lowest co-occurrence count) are evicted via select_nth_unstable_by_key.
Integration point
The correlator is created conditionally in the daemon's run() function based on config.correlation_enabled (default false). It is wrapped in Arc<Mutex<CrossTraceCorrelator>> and passed to process_traces. After findings are produced and pushed to the FindingsStore, the correlator's ingest() method is called with the findings and the current timestamp.
Batch mode exclusion
The correlator is not used in batch mode (perf-sentinel analyze). Cross-trace correlation requires a stream of findings over time to detect recurring patterns. A single batch run typically processes a fixed set of traces without the temporal dimension needed for meaningful correlation.
Actionable fixes (framework-aware suggestions)
Starting in v0.4.2, a suggested_fix: Option<SuggestedFix> field on Finding carries a framework-specific remediation that goes beyond the generic suggestion string. This field is populated by detect::suggestions::enrich after the per-trace detectors return, inside detect().
Coverage grew in four steps: v1 shipped Java/JPA only; v2 added Quarkus reactive and non-reactive, WebFlux, Helidon SE/MP, EF Core, Diesel and SeaORM; v3 broadened to the seven anti-patterns that previously returned suggested_fix = None (redundant_http, slow_sql, slow_http, excessive_fanout, chatty_service, pool_saturation, serialized_calls) and added Python (Django ORM, SQLAlchemy) with scope detection via the opentelemetry.instrumentation.* prefix; v4 added Go (GORM) and Node.js/TypeScript (Prisma) with scope detection via the @opentelemetry/instrumentation-* prefix and language detection via .go, .js, .ts file extensions. New entries lean on the per-language *Generic tag when the recommendation is framework-agnostic, and reuse a framework-specific tag when the ecosystem ships a canonical primitive worth pointing at. The current state covers Java, C# (.NET 8 to 10), Python, Rust, Go and Node.js across all 10 anti-patterns, each with a generic per-language fallback.
The SuggestedFix struct
pub struct SuggestedFix {
    pub pattern: String,               // "n_plus_one_sql", mirrors the parent finding.type
    pub framework: String,             // "java_jpa" or "java_generic"
    pub recommendation: String,        // short, imperative sentence
    pub reference_url: Option<String>,
}
Serialized in JSON as a nested object under finding.suggested_fix and skipped when absent. Emitted in SARIF under result.fixes[0].description.text (the description-only form of the SARIF 2.1.0 fix object). The CLI renders it as a nested Suggested fix: line right after the generic Suggestion: line.
Framework detector
The detector is a pure function over fields already present on Finding (instrumentation_scopes, code_location, service), all populated at detection time from the span's OTel attributes. No span-level access, no extra allocations. It inspects five signals in order, most reliable first:
- Instrumentation scope chain, captured at OTLP ingest from the originating span and its ancestors (e.g. io.opentelemetry.spring-data-3.0). Most reliable: the scope name is emitted by the agent regardless of how the user names their classes, so it survives user-code naming quirks. Vendor-specific scopes (io.quarkus.*, Microsoft.EntityFrameworkCore) are checked before the standard io.opentelemetry.* / opentelemetry.instrumentation.* / @opentelemetry/instrumentation-* convention scopes. Go and Node are deliberately absent from the convention scope rules: their instrumentations use ecosystem-native scope names (gorm.io/plugin/opentelemetry, @prisma/instrumentation), and the - segment boundary used for Java version suffixes would false-positive on npm package names (pg vs instrumentation-pg-pool).
- Language from ecosystem-native scope prefix. When the scope-chain check misses, the prefix still reveals the language (github.com/ = Go module path, @opentelemetry/instrumentation- or @prisma/ = npm, Microsoft.EntityFrameworkCore / OpenTelemetry.Instrumentation.* = NuGet). The language-generic fallback then fires, so even a span without code.filepath or code.namespace gets a language-appropriate suggestion.
- code_location namespace with filepath-derived language (.java → Java, .cs → C#, .rs → Rust, .py → Python, .go → Go, .js / .ts → Node). Walks that language's rules in declared order; falls back to the language generic when no rule matches.
- code_location namespace alone when filepath is absent: tries every language's rules in order and returns the first hit. No generic fallback in this path because the language cannot be known.
- Service name as a last resort, only for framework names distinctive enough to avoid false positives in arbitrary service names (e.g. helidon in helidon-se-svc). Lowest confidence, only reached when all OTel signals are absent.
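The filepath-derived language step in the third signal can be sketched as below. The extension mapping comes from the list above and language_from_filepath is named elsewhere in this document; the Language variant names and the Option return type are assumptions.

```rust
use std::path::Path;

/// Hypothetical language enum; variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Java,
    Csharp,
    Rust,
    Python,
    Go,
    Node,
}

/// Maps a code.filepath value to a language via its file extension.
/// Returns None when the extension is missing or unknown, in which
/// case the detector moves on to the namespace-only signal.
fn language_from_filepath(filepath: &str) -> Option<Language> {
    let ext = Path::new(filepath).extension()?.to_str()?;
    match ext {
        "java" => Some(Language::Java),
        "cs" => Some(Language::Csharp),
        "rs" => Some(Language::Rust),
        "py" => Some(Language::Python),
        "go" => Some(Language::Go),
        "js" | "ts" => Some(Language::Node),
        _ => None,
    }
}
```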
The namespace match is segment-boundary-aware on both sides: the hint must start at the namespace root or immediately after a separator and must end at the namespace end or immediately before another separator. Boundary characters are . (Java, C#) and :: (Rust). Examples:
- diesel:: matches diesel::query_dsl::FilterDsl and crate::diesel::reexport but not crate::mydiesel::query (leading boundary protects user code that embeds the hint).
- io.helidon matches io.helidon.webserver.Routing but not io.helidongrpc.Foo (trailing boundary protects against user packages whose first segment merely starts with the hint).
- Microsoft.EntityFrameworkCore matches Microsoft.EntityFrameworkCore.Query but not Microsoft.EntityFrameworkCoreCache.Provider.
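One way to express the two-sided boundary check is sketched below. The function name matches_namespace and the SEPARATORS constant are hypothetical, and the real matcher may be structured differently. The sketch treats a hint that already ends in a separator (such as diesel::) as carrying its own trailing boundary.

```rust
/// Separators that delimit namespace segments (Rust, then Java/C#).
const SEPARATORS: [&str; 2] = ["::", "."];

/// Returns true when `hint` occurs in `namespace` on segment boundaries:
/// it must start at the root or right after a separator, and end at the
/// namespace end or right before a separator.
fn matches_namespace(namespace: &str, hint: &str) -> bool {
    if hint.is_empty() {
        return false;
    }
    namespace.match_indices(hint).any(|(start, _)| {
        let before = &namespace[..start];
        let after = &namespace[start + hint.len()..];

        // Leading boundary: namespace root or a preceding separator.
        let leading_ok =
            before.is_empty() || SEPARATORS.iter().any(|&sep| before.ends_with(sep));

        // Trailing boundary: namespace end, a following separator, or a
        // hint that itself ends with a separator.
        let trailing_ok = after.is_empty()
            || SEPARATORS
                .iter()
                .any(|&sep| hint.ends_with(sep) || after.starts_with(sep));

        leading_ok && trailing_ok
    })
}
```

A plain contains() check would accept crate::mydiesel::query and io.helidongrpc.Foo, which is exactly the class of false positives the boundary rule is meant to rule out.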
Per-language rules
Order matters within a language: the first matching framework wins. JPA hints intentionally trail Quarkus reactive hints because org.hibernate.reactive contains org.hibernate.
Java (JAVA_RULES):
| Framework | Namespace hints |
|---|---|
| JavaHelidonMp | io.helidon.microprofile |
| JavaHelidonSe | io.helidon |
| JavaQuarkusReactive | io.quarkus.hibernate.reactive, io.quarkus.panache.reactive, io.quarkus.reactive, org.hibernate.reactive, io.smallrye.mutiny |
| JavaQuarkus | io.quarkus.hibernate.orm, io.quarkus.panache.common, io.quarkus |
| JavaWebFlux | org.springframework.web.reactive, reactor.core |
| JavaJpa | jakarta.persistence, javax.persistence, org.hibernate, org.springframework.data.jpa |
| JavaGeneric (fallback) | (any .java file without the above) |
JavaQuarkusReactive enumerates its reactive sub-packages explicitly. The catch-all io.quarkus belongs to JavaQuarkus (non-reactive), so any reactive Quarkus namespace must be matched by one of the more-specific reactive hints first. Helidon MP must come before Helidon SE because io.helidon.microprofile is a sub-package of io.helidon.
Note on Helidon MP and JPA: Helidon MP entities are JPA-managed under Hibernate. A typical OTel JDBC span on Helidon MP code carries code.namespace = jakarta.persistence.* or org.hibernate.*, which routes to JavaJpa (not JavaHelidonMp). The JavaHelidonMp rule fires when the span comes from Helidon MP plumbing itself (REST resources, CDI containers, MicroProfile Rest Client). For database findings on Helidon MP apps, the JavaJpa recommendation applies.
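The first-match-wins ordering can be illustrated with the sketch below. The rule table mirrors the Java rows above, but the slice type, the detect_java function and the simplified prefix-only matcher are assumptions made for the example; the real matcher is boundary-aware on both sides and not limited to prefixes.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Framework {
    JavaHelidonMp,
    JavaHelidonSe,
    JavaQuarkusReactive,
    JavaQuarkus,
    JavaWebFlux,
    JavaJpa,
    JavaGeneric,
}

/// Declared order is significant: more-specific frameworks come first.
const JAVA_RULES: &[(Framework, &[&str])] = &[
    (Framework::JavaHelidonMp, &["io.helidon.microprofile"]),
    (Framework::JavaHelidonSe, &["io.helidon"]),
    (
        Framework::JavaQuarkusReactive,
        &[
            "io.quarkus.hibernate.reactive",
            "io.quarkus.panache.reactive",
            "io.quarkus.reactive",
            "org.hibernate.reactive",
            "io.smallrye.mutiny",
        ],
    ),
    (
        Framework::JavaQuarkus,
        &["io.quarkus.hibernate.orm", "io.quarkus.panache.common", "io.quarkus"],
    ),
    (Framework::JavaWebFlux, &["org.springframework.web.reactive", "reactor.core"]),
    (
        Framework::JavaJpa,
        &[
            "jakarta.persistence",
            "javax.persistence",
            "org.hibernate",
            "org.springframework.data.jpa",
        ],
    ),
];

/// Simplified matcher (assumption): prefix match ending on a '.' boundary.
fn has_segment_prefix(namespace: &str, hint: &str) -> bool {
    match namespace.strip_prefix(hint) {
        Some(rest) => rest.is_empty() || rest.starts_with('.'),
        None => false,
    }
}

/// Walks the rules in declared order and returns the first hit,
/// falling back to the language generic.
fn detect_java(namespace: &str) -> Framework {
    JAVA_RULES
        .iter()
        .find(|(_, hints)| hints.iter().any(|hint| has_segment_prefix(namespace, hint)))
        .map(|(framework, _)| *framework)
        .unwrap_or(Framework::JavaGeneric)
}
```

With this ordering, a namespace under org.hibernate.reactive resolves to JavaQuarkusReactive before the broader org.hibernate hint of JavaJpa is ever consulted.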
C# (CSHARP_RULES):
| Framework | Namespace hints |
|---|---|
| CsharpEfCore | Microsoft.EntityFrameworkCore, Pomelo.EntityFrameworkCore |
| CsharpGeneric (fallback) | (any .cs file without the above) |
Rust (RUST_RULES):
| Framework | Namespace hints |
|---|---|
| RustDiesel | diesel:: |
| RustSeaOrm | sea_orm:: |
| RustGeneric (fallback) | (any .rs file without the above) |
Mapping table
The mapping is a LazyLock<HashMap<(FindingType, Framework), SuggestedFix>> static named FIXES. A (FindingType, Framework) pair that is missing from the table leaves suggested_fix as None. Current entries:
| Finding type | Framework | Recommendation anchor |
|---|---|---|
| NPlusOneSql | JavaJpa | JOIN FETCH or @EntityGraph, Hibernate User Guide |
| NPlusOneSql | JavaQuarkusReactive | Mutiny Session.fetch() + @NamedEntityGraph, Quarkus Hibernate Reactive guide |
| NPlusOneSql | JavaQuarkus | JPQL/Panache JOIN FETCH, @EntityGraph or Session.fetchProfile, Quarkus Hibernate ORM guide |
| NPlusOneSql | JavaHelidonSe | Helidon DbClient named query with JOIN or :ids JDBC parameter binding |
| NPlusOneSql | JavaHelidonMp | JPA @EntityGraph or JPQL JOIN FETCH (MP entities are JPA-managed under Hibernate) |
| NPlusOneHttp | JavaWebFlux | Flux.merge() / Flux.zip() for parallelism or batch endpoint |
| NPlusOneHttp | JavaQuarkusReactive | Uni.combine().all().unis(...) for parallelism, Mutiny combining guide |
| NPlusOneHttp | JavaQuarkus | CompletableFuture.allOf on ManagedExecutor, batch via Quarkus REST Client |
| NPlusOneHttp | JavaHelidonSe | Helidon SE WebClient + Single.zip / Multi.merge for parallelism or batch endpoint |
| NPlusOneHttp | JavaHelidonMp | MicroProfile Rest Client + CompletableFuture.allOf on the @ManagedExecutorConfig executor or batch endpoint |
| NPlusOneHttp | JavaGeneric | Batch endpoint or request-scoped @Cacheable |
| RedundantSql | JavaQuarkusReactive | @CacheResult or Uni.memoize().indefinitely() |
| RedundantSql | JavaQuarkus | @CacheResult (Quarkus cache extension) or @RequestScoped HashMap deduplication |
| RedundantSql | JavaGeneric | Service-level cache (Caffeine, Spring Cache) |
| NPlusOneSql | CsharpEfCore | .Include() / .ThenInclude(), .AsSplitQuery() for Cartesian explosion |
| RedundantSql | CsharpEfCore | IMemoryCache, scoped DbContext for per-request short-circuit |
| NPlusOneHttp | CsharpGeneric | Task.WhenAll for parallel calls, batch endpoint, response caching on HttpClient |
| NPlusOneSql | RustDiesel | belonging_to + grouped_by or .inner_join / .left_join for single query |
| NPlusOneSql | RustSeaOrm | find_with_related / find_also_related or QuerySelect::join |
| RedundantSql | RustDiesel | moka cache or request-local OnceCell |
| RedundantSql | RustSeaOrm | moka cache or request-local OnceCell |
| NPlusOneHttp | RustGeneric | tokio::join! / futures::future::join_all for parallelism or batch endpoint |
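The lookup shape can be sketched as follows, assuming a toolchain where std::sync::LazyLock is available (Rust 1.80 or later). Only one table row is reproduced; the enums are trimmed, the lookup helper is hypothetical, and the recommendation wording is a placeholder rather than the project's actual string.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum FindingType {
    NPlusOneSql,
    RedundantSql,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Framework {
    JavaJpa,
}

#[derive(Debug, Clone)]
struct SuggestedFix {
    pattern: String,
    framework: String,
    recommendation: String,
    reference_url: Option<String>,
}

/// Built once on first access, then shared read-only.
static FIXES: LazyLock<HashMap<(FindingType, Framework), SuggestedFix>> =
    LazyLock::new(|| {
        let mut fixes = HashMap::new();
        fixes.insert(
            (FindingType::NPlusOneSql, Framework::JavaJpa),
            SuggestedFix {
                pattern: "n_plus_one_sql".to_string(),
                framework: "java_jpa".to_string(),
                // Placeholder wording for illustration only.
                recommendation: "Use JOIN FETCH or @EntityGraph.".to_string(),
                reference_url: None,
            },
        );
        fixes
    });

/// Hypothetical helper: a missing pair yields None, which leaves
/// suggested_fix unset on the finding.
fn lookup(finding_type: FindingType, framework: Framework) -> Option<&'static SuggestedFix> {
    FIXES.get(&(finding_type, framework))
}
```

Keying on the (FindingType, Framework) tuple means adding a mapping is a single insert, with no change to the lookup path.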
Extension path for contributors
To add a new framework:
- Extend the private Framework enum in detect/suggestions.rs.
- Pick a language and append a (Framework, &[hint]) entry to that language's rule slice. Place more-specific frameworks before less-specific ones.
- Add entries to the FIXES static for each (FindingType, Framework) pair you want to map.
- Add unit tests under the tests module in the same file.
To add a new language:
- Extend the Language enum and its rules() / generic() methods.
- Add the file extension match in language_from_filepath.
- Define a new *_RULES slice and a generic fallback variant on Framework.
No wiring changes are needed elsewhere: the detect() orchestrator already calls suggestions::enrich at the end of the per-trace detection pass, and the CLI, JSON and SARIF renderers already handle an optional suggested_fix.