Documentation / 06 · Ingestion & daemon

Ingestion and daemon mode

OTLP conversion

Two-pass design

convert_otlp_request() processes each resource_spans block in two passes:

Pass 1: Build span index:

rust

let span_index: HashMap<&[u8], &Span> = scope_spans.iter()
    .flat_map(|ss| &ss.spans)
    .map(|span| (span.span_id.as_slice(), span))
    .collect();

Pass 2: Convert I/O spans:

rust

for span in &scope.spans {
    if let Some(event) = convert_span(span, service_name, &span_index) {
        events.push(event);
    }
}

Why two passes? In OTLP, a parent span may appear after its child in the protobuf message. The first pass builds a lookup table so that the second pass can resolve source.endpoint from the parent span's http.route attribute. A single-pass approach would miss parent spans defined later in the message.

The index uses &[u8] keys (raw span_id bytes), avoiding hex encoding just for lookup. The span index is capped at 100,000 spans per resource to prevent memory exhaustion from pathological OTLP payloads. A tracing::warn! is emitted when the cap is reached to help operators diagnose degraded parent resolution.

`bytes_to_hex` lookup table

rust

fn bytes_to_hex(bytes: &[u8]) -> String {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let mut buf = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        buf.push(HEX[(b >> 4) as usize]);
        buf.push(HEX[(b & 0x0f) as usize]);
    }
    // All bytes come from HEX (ASCII 0-9, a-f), always valid UTF-8.
    String::from_utf8(buf).expect("hex table is ASCII")
}

This is a well-known optimization for hex encoding. Instead of using write!(hex, "{b:02x}") (which invokes the formatting machinery per byte at ~30ns), the lookup table converts each byte to two hex characters via bit shifting at ~5ns per byte. The Vec<u8> is pre-allocated and the from_utf8 call is infallible since only ASCII hex digits are pushed. No unsafe is needed: the expect is a zero-cost assertion on a condition that cannot fail.

For a 16-byte trace_id + 8-byte span_id, this saves ~600ns per span conversion. At 100,000 events/sec, that is 60ms/sec of avoided overhead.

`nanos_to_iso8601`: Howard Hinnant's Algorithm

Note: This function now lives in time.rs (shared module) and is reused by Jaeger and Zipkin ingestion via micros_to_iso8601.

Converting Unix nanoseconds to YYYY-MM-DDTHH:MM:SS.mmmZ uses the civil date algorithm from Howard Hinnant. The key steps:

Convert nanoseconds to days since epoch + remaining milliseconds
Shift the epoch to March 1, year 0 (by adding 719,468 days)
Compute the era (400-year cycle) and day-of-era
Derive year-of-era, day-of-year, month and day using a lookup-free formula

This avoids the chrono crate (~150KB binary overhead) and its ~200ns parse overhead. The hand-rolled algorithm handles leap years correctly (verified by a test with 2024-02-29).

Event type priority

When a span has both a SQL attribute (db.statement or db.query.text) and an HTTP attribute (http.url or url.full), SQL takes priority. This is intentional: database instrumentation is more specific than HTTP client instrumentation. The SQL attribute carries the actual query text needed for normalization, while the HTTP attribute might represent the same operation at the transport level.

Both legacy (pre-1.21) and stable (1.21+) OTel semantic conventions are supported: db.statement and db.query.text for SQL, http.url and url.full for HTTP, http.method and http.request.method for the HTTP verb, http.status_code and http.response.status_code for the status. This ensures compatibility with both older OTel SDKs and modern Java agents (v2.x).

Clock skew protection

rust

if end_nanos < start_nanos {
    tracing::trace!("Span has end_time < start_time (clock skew?), duration forced to 0");
}
let duration_us = end_nanos.saturating_sub(start_nanos) / 1000;

saturating_sub returns 0 for negative durations instead of wrapping around. A trace-level log helps operators diagnose OTLP integration issues without flooding logs.

The `MetricsSink` trait (`ingest/otlp.rs`)

ingest::otlp produces runtime telemetry on every rejection path (unsupported media type, decode failure, channel full). Before 0.6.0 these calls reached straight into report::metrics::MetricsState, which leaked the downstream metrics implementation upstream and made ingest impossible to use without paying for the Prometheus registry. The MetricsSink trait is the abstraction: MetricsState implements it (in report::metrics) so daemon callers keep the same wiring, and alternative builds (an OpenTelemetry-metrics fork, tests with a counting fake) can plug their own sink without touching ingest. The Send + Sync bounds exist because the gRPC and HTTP paths share the sink across tokio tasks via Arc<dyn MetricsSink>. The impl on MetricsState keeps the same branchless hot-path dispatch the pre-trait inherent methods had: both entry points reach the cached IntCounter children with no label-hashmap lookup.

JSON ingestion

rust

pub fn ingest(&self, raw: &[u8]) -> Result<Vec<SpanEvent>, Self::Error> {
    if raw.len() > self.max_size {
        return Err(JsonIngestError::PayloadTooLarge { ... });
    }
    serde_json::from_slice(raw)
}

The payload size is checked before deserialization. This prevents serde_json from allocating memory for a multi-gigabyte JSON payload before rejecting it.

Auto-format detection

JsonIngest now auto-detects the input format using lightweight byte-level heuristics. It peeks at the first 1-4 KB of the payload:

Starts with { and contains "data" + "spans" in the first 4 KB: Jaeger
Starts with [ and contains "traceId" + "localEndpoint" in the first 1 KB: Zipkin
Otherwise: Native perf-sentinel format

This avoids parsing the full payload into a serde_json::Value for detection, eliminating a 2x parse cost. The heuristic operates on raw bytes (std::str::from_utf8 on a bounded prefix), making it O(1) regardless of payload size.

Boundary sanitization. After parsing, the JSON ingest path validates cloud_region via is_valid_region_id and runs sanitize_span_event on every event, applying the same field-length caps and UTF-8 boundary truncation as the OTLP path. This ensures all downstream code sees consistently sanitized data regardless of ingestion format.

Jaeger JSON ingestion

ingest/jaeger.rs parses the Jaeger JSON export format ({ "data": [{ "traceID": "...", "spans": [...], "processes": {...} }] }). Key mappings:

startTime (microseconds) is converted via micros_to_iso8601 from the shared time.rs module
parent_span_id is extracted from references where refType = "CHILD_OF"
Both legacy and stable OTel semantic conventions are supported in tags

Zipkin JSON v2 ingestion

ingest/zipkin.rs parses the Zipkin JSON v2 format (flat array of span objects). Key differences from Jaeger:

parentId is a direct field (not in a references array)
Tags are a HashMap<String, String> (not an array of key-value objects)
localEndpoint.serviceName provides the service name

Daemon event loop

Architecture

OTLP gRPC (port 4317)   ─┐                                        ┌─ analysis worker ─┐
OTLP HTTP (port 4318)   ─┤─→ mpsc(1024) ─→ TraceWindow ─→ eviction ─→ mpsc(1024) ─→ detect ─→ score ─→ NDJSON
JSON unix socket        ─┘     (select! loop)                      └───────────────────┘

The event loop uses tokio::select! to multiplex:

Receive events from the channel -> normalize -> push into window -> enqueue evictions
Ticker every TTL/2 ms -> evict expired traces -> enqueue
Ctrl+C -> drain all traces -> hand to the worker -> join -> shutdown

detect+score do not run on the select! loop. They run on a single dedicated analysis worker task fed over a bounded channel (see Analysis worker), so a long analysis pass can no longer stall ingestion or eviction.

Normalization outside the lock

rust

// Normalize OUTSIDE the lock:
let normalized: Vec<_> = events.into_iter().map(normalize::normalize).collect();
// Then acquire the lock and push:
let mut w = window.lock().await;
for event in normalized { w.push(event, now_ms); }

Normalization is CPU-bound work (regex, string manipulation). Moving it outside the Mutex lock minimizes lock hold time to just the HashMap operations. Under contention (ticker and receive running concurrently), this prevents the eviction ticker from blocking on normalization.

Trace-level sampling

rust

fn should_sample(trace_id: &str, rate: f64) -> bool {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for b in trace_id.as_bytes() {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0100_0000_01b3); // FNV-1a prime
    }
    (hash as f64 / u64::MAX as f64) < rate
}

The FNV-1a hash is a fast, non-cryptographic hash that produces well-distributed output. The offset basis and prime are the standard 64-bit FNV-1a constants.

Why FNV-1a? Simpler and faster (~2ns for a typical trace_id) than std::hash::DefaultHasher (SipHash, ~10ns). Cryptographic quality is not needed for sampling, only uniform distribution matters.

Deterministic: the same trace_id always produces the same sampling decision, ensuring all events from a trace are either kept or dropped together.

Per-batch caching: the apply_sampling() function filters a batch of events using a HashMap<String, bool> cache. Within a single batch, multiple events may share a trace_id. The cache uses get() before insert() so that trace_id is only cloned for the first event of each trace, not on every cache hit. Extracting this logic into a standalone function keeps the tokio::select! event loop readable.

Bounded channel

rust

let (tx, mut rx) = mpsc::channel::<Vec<SpanEvent>>(1024);

The bounded channel provides backpressure: if the event loop falls behind and the buffer fills to 1024 batches, ingestion senders will await until space is available. This prevents unbounded memory growth from fast producers.

Analysis worker

detect+score are CPU-bound. Running them inline on the select! task meant a long analysis pass blocked the loop from polling rx.recv() and the eviction ticker, so ingestion and eviction liveness depended on analysis latency. Instead, evicted (LRU), expired (TTL) and drained (shutdown) batches are handed to a single analysis worker over a second bounded channel:

rust

let (work_tx, work_rx) = mpsc::channel::<AnalysisBatch>(1024);
let worker = tokio::spawn(run_analysis_worker(work_rx, ctx));

One worker, one channel, FIFO. Analysis stays single-threaded and batches reach the worker in the order they left the window, so the stateful cross-trace correlator observes a deterministic sequence. No thread pool, no spawn_blocking per batch.
Non-blocking enqueue with metered shedding. The loop enqueues with try_reserve (synchronous, never awaits analysis), building the owned CarbonContext only once a slot is reserved so a shed never pays for a discarded clone. When the queue is full (or the worker has stopped) the whole batch is shed and counted via perf_sentinel_analysis_shed_batches_total and perf_sentinel_analysis_shed_traces_total; perf_sentinel_analysis_queue_depth tracks the backlog. Overload is explicit and observable, never a silent drop. The trade-off is intentional: under sustained overload we drop whole batches rather than block ingestion (a liveness/backpressure choice, not a throughput one).
CarbonContext sampled at eviction time. The per-batch CarbonContext (energy scraper snapshots + grid intensity) is built on the loop side when the batch is evicted and travels with it, preserving the previous sampling instant.
Shutdown drains, then joins. On Ctrl+C / SIGTERM the loop drains the window, hands the remainder to the worker with a blocking send (guaranteed delivery, no shedding), closes the channel, and awaits the worker so every buffered and in-flight batch is fully analyzed before returning.
Fail-loud on worker death. A fourth select! arm watches the worker's JoinHandle. If the worker stops before shutdown (a detector panics), run_event_loop returns DaemonError::AnalysisWorkerStopped so the process exits and a supervisor restarts it, rather than staying up while silently analyzing nothing. This restores the fail-loud semantics the inline-detection design had (a panic crashed the daemon).

Security hardening

Unix socket permissions:

rust

use std::os::unix::fs::PermissionsExt;
std::fs::set_permissions(path, std::fs::Permissions::from_mode(0o600));

The 0o600 mode restricts read/write to the socket owner only, preventing other local users from injecting events. If set_permissions fails, the socket file is removed and the listener does not start (fatal error, not a warning).

Connection semaphore:

rust

let semaphore = Arc::new(tokio::sync::Semaphore::new(128));

Limits concurrent JSON socket connections to 128. Without this, a local attacker could open thousands of connections, each consuming a tokio task and buffer memory.

Per-connection byte limit:

rust

const CONNECTION_LIMIT_FACTOR: u64 = 16;
let limited = stream.take(max_payload_size as u64 * CONNECTION_LIMIT_FACTOR);

Each connection is limited to 16 × max_payload_size bytes total (default 16 MB). This prevents a single connection from consuming unbounded memory with a stream of data that never contains a newline.

Request timeouts:

gRPC: tonic::transport::Server::builder().timeout(Duration::from_secs(60))
HTTP: tower::timeout::TimeoutLayer::new(Duration::from_secs(60)) via axum's HandleErrorLayer

These prevent slow/stalled connections from holding resources indefinitely. The HTTP timeout handler emits a tracing::debug! log before returning 408 REQUEST_TIMEOUT, helping operators diagnose slow or stalled clients.

NDJSON output

Findings are emitted as newline-delimited JSON to stdout using serde_json::to_writer with a locked stdout handle to avoid intermediate String allocations and reduce lock contention:

rust

let stdout = std::io::stdout();
let mut lock = stdout.lock();
for finding in &findings {
    if serde_json::to_writer(&mut lock, finding).is_ok() {
        let _ = writeln!(lock);
    }
}

This format is compatible with log aggregation tools (Loki, ELK) that consume line-delimited JSON. Each line is a complete JSON object that can be parsed independently.

Cumulative waste ratio

The Prometheus io_waste_ratio gauge is computed from cumulative counters:

rust

let cumulative_total = metrics.total_io_ops.get();
if cumulative_total > 0.0 {
    metrics.io_waste_ratio.set(metrics.avoidable_io_ops.get() / cumulative_total);
}

This is an all-time average, not a windowed metric. Users who need a recent rate can use Prometheus rate() on the raw counters (total_io_ops, avoidable_io_ops).

Grafana exemplars

The prometheus crate 0.14.0 does not support OpenMetrics exemplars natively. Instead of adding a dependency, exemplar annotations are injected by post-processing the rendered Prometheus text output.

Tracking worst-case trace IDs:

MetricsState stores exemplar data in RwLock-protected fields:

worst_finding_trace: HashMap<(String, String), ExemplarData>, keyed by (finding_type, severity), updated on each record_batch() call
worst_waste_trace: Option<ExemplarData>, the trace_id of the finding with the most avoidable I/O

RwLock is used instead of Mutex because render() (read path) is called frequently by Prometheus scrapes, while record_batch() (write path) is called less often. Multiple concurrent scrapes should not block each other. Lock poisoning is handled gracefully via unwrap_or_else(PoisonError::into_inner), so a panic in one thread does not cascade into crashes on subsequent lock acquisitions.

Exemplar injection:

inject_exemplars() iterates over the rendered text line by line. For perf_sentinel_findings_total{...} lines, it parses the type and severity labels to look up the matching exemplar. For perf_sentinel_io_waste_ratio lines, it appends the waste trace exemplar.

The exemplar format follows the OpenMetrics specification: metric{labels} value # {trace_id="abc123"}. When exemplars are present, the Content-Type header switches from text/plain; version=0.0.4 (Prometheus) to application/openmetrics-text; version=1.0.0 (OpenMetrics) so that Grafana's Prometheus data source can recognize and display exemplar links.

Grafana integration: with exemplars enabled, users can click from a metric spike in Grafana directly to the worst-case trace in Tempo or Jaeger, provided the Prometheus data source has "Exemplars" enabled and a Tempo/Jaeger data source is configured as the trace backend.

pg_stat_statements ingestion

ingest/pg_stat.rs provides a standalone analysis path for PostgreSQL pg_stat_statements exports. Unlike trace-based ingestion, this data has no trace_id or span_id, it cannot feed the N+1/redundant detection pipeline. Instead, it provides hotspot ranking and cross-referencing with trace findings.

Design decisions

Separate from IngestSource: the IngestSource trait returns Vec<SpanEvent>, but pg_stat_statements data does not map to SpanEvent (no trace_id, span_id or timestamp). It produces its own PgStatReport type with rankings.

Auto-format detection: follows the same byte-level heuristic pattern as json.rs. If the first non-whitespace byte is [ or {, parse as JSON; otherwise, parse as CSV. No external csv crate, the CSV parser handles RFC 4180 quoting manually (double-quoted fields, escaped "").

SQL normalization reuse: each query goes through normalize::sql::normalize_sql() to produce a template comparable with trace-based findings. PostgreSQL normalizes queries at the server level (e.g., $1 placeholders), but perf-sentinel re-normalizes for consistency with its own template format.

Four-ranking output

rank_pg_stat(entries, top_n) returns a PgStatReport with four rankings in a stable, position-indexed order: [by_total_time, by_calls, by_mean_time, by_io_blocks]. Each ranking holds the same top-N entries, reordered by its own criterion:

by_total_time: total_exec_time_ms descending. Queries that dominate wall-clock DB time. Primary hotspot signal.
by_calls: calls descending. High-volume queries, often N+1 candidates.
by_mean_time: mean_exec_time_ms descending. Individually slow queries regardless of volume.
by_io_blocks: shared_blks_hit + shared_blks_read descending. Cache-pressure signal: queries that touch the most shared buffer pages, independent of whether those pages were hot or cold. Complements by_total_time when the CPU is idle but the buffer cache churns.

The HTML dashboard's pg_stat tab sub-switcher consumes these four rankings by position, so new rankings are appended (never reordered, never inserted in the middle) to preserve index stability for downstream consumers.

Cross-referencing

cross_reference() accepts &mut [PgStatEntry] and &[Finding]. It builds a HashSet of finding templates and marks entries whose normalized_template matches. This is O(n + m) where n = entries, m = findings. The seen_in_traces flag enables the CLI to highlight queries that appear in both data sources, useful for validating OTLP trace capture fidelity against database-native ground truth.

Automated pg_stat Prometheus scrape

fetch_from_prometheus(endpoint, top_n) queries a Prometheus HTTP API for pg_stat_statements metrics, removing the need for manual CSV export.

Query and conversion

The function builds a PromQL topk(N, pg_stat_statements_seconds_total) instant query and sends it to the Prometheus /api/v1/query endpoint via the shared http_client::fetch_get helper. The response is a standard Prometheus JSON envelope:

json

{
  "data": {
    "result": [
      {
        "metric": { "query": "SELECT ...", "datname": "mydb" },
        "value": [1234567890, "1.234"]
      }
    ]
  }
}

parse_prometheus_response extracts the query (or queryid) label as the raw SQL text, the datname label as the database name and the value as total execution time in seconds. Each result is converted to a PgStatEntry with its SQL normalized through normalize::sql::normalize_sql() for consistency with trace-based findings.

CLI integration

The --prometheus flag on perf-sentinel pg-stat enables this path:

perf-sentinel pg-stat --prometheus http://prometheus:9090 --top 20

This flag is gated behind the daemon feature because it requires the hyper HTTP client stack. The rest of the pg-stat pipeline (ranking, cross-referencing, display) is identical regardless of whether the data came from a file or Prometheus.

The report subcommand exposes the same capability via --pg-stat-prometheus URL, mutually exclusive with its file-based --pg-stat FILE flag (enforced at the clap level via conflicts_with). When either flag is provided, the resulting PgStatReport is embedded into the HTML dashboard's pg_stat tab alongside the four rankings described above. The scrape path is shared with pg-stat --prometheus, no data-fetching code is duplicated.

Tempo ingestion

ingest/tempo.rs provides the post-mortem replay path: it queries a running Grafana Tempo HTTP API, fetches trace bodies as OTLP protobuf, decodes them through the existing convert_otlp_request helper and returns Vec<SpanEvent> to the standard analysis pipeline. Two modes: single-trace-by-ID (--trace-id, one GET /api/traces/{id}) or search-then-fetch (--service --lookback, one GET /api/search followed by per-trace fetches). Gated behind the tempo cargo feature.

Parallel fetch with concurrency cap

The per-trace fetch loop is parallelized via tokio::task::JoinSet guarded by an Arc<Semaphore> capped at FETCH_CONCURRENCY = 16 permits. Each spawned task acquires a permit through acquire_owned before the HTTP call and releases it on drop (RAII). The cap was picked empirically to saturate a remote-Tempo connection over a WAN link (observed ~10-20s for 100 traces vs. ~2m30s in the prior sequential loop) without overwhelming a single query-frontend replica, and is currently hardcoded rather than user-configurable. The pattern mirrors score::cloud_energy::scraper, which parallelizes per-service Prometheus CPU queries the same way.

Timeout split

Two dedicated constants instead of a single value: SEARCH_TIMEOUT = 5s for /api/search (response is a small list of trace IDs, a tight timeout fails fast on a broken endpoint) and FETCH_TRACE_TIMEOUT = 30s for /api/traces/{id} (trace bodies can legitimately be many MiB on a wide fanout request and the query-frontend has to gather spans from ingesters + long-term storage). A single 5 s cap was empirically dropping tens of traces per 100-trace batch on long lookback windows; 30 s matches the Grafana Tempo datasource default. Both timeouts are parameters of the shared fetch_raw helper rather than a single module-level constant, so search and fetch-trace paths can never drift apart.

Ctrl-C and error aggregation

The drain loop is driven by tokio::select! with biased branch ordering: tokio::signal::ctrl_c() is polled before set.join_next() so a pending interrupt is not starved by a flood of completions. On signal, set.abort_all() flags every in-flight task for cancellation; already-completed traces are preserved, aborted tasks resolve to JoinError::is_cancelled() and are silently skipped. The dedicated TempoError::Interrupted variant is returned only when zero traces had completed before the signal, so CI quality-gate paths can distinguish an operator abort from a genuine empty result (NoTracesFound).

Per-trace failures log at debug, not error. A single classified summary line (emit_fetch_summary) is emitted at the end of the loop, bucketed by error kind (timeout, transport, http_status, protobuf_decode, body_read, json_parse, task_panic) so downstream tooling (Loki, CloudWatch) can alert on the right signal without parsing 50 individual ERROR lines on a degraded Tempo. Summary severity tracks the worst class seen: warn if only TraceNotFound skips occurred (expected occasional condition, e.g. a trace rolled out of retention between search and fetch), error otherwise. A unit test (classify_fetch_error_buckets_every_hard_failure_variant) acts as a drift guard so a future variant added to TempoError does not silently fall through to "other".

Daemon query API

The daemon exposes its internal state via HTTP endpoints alongside the existing /v1/traces, /metrics and /health routes on port 4318.

The /health endpoint is a stateless liveness probe for Kubernetes, load balancers and systemd. It returns 200 OK with {"status":"ok","version":"<pkg_version>"}, holds no locks and cannot false-negative under load. It is always exposed, independent of [daemon] api_enabled, which gates only the richer /api/* surface described below.

`FindingsStore` ring buffer

FindingsStore is a thread-safe ring buffer backed by tokio::sync::RwLock<VecDeque<StoredFinding>>. Each StoredFinding wraps a Finding with a stored_at_ms monotonic timestamp.

push_batch(findings, now_ms): builds the new StoredFinding entries outside the lock, then acquires a brief write lock to extend the buffer and drain any excess. Evicts the oldest entries when the buffer exceeds max_size (default 10,000 from config max_retained_findings). The initial capacity is min(max_size, INITIAL_CAPACITY_CEILING) with a 4096 ceiling to amortize reallocations without a surprising RSS hit at startup.
max_size == 0 short-circuit: when set to 0, push_batch returns immediately without allocating. This lets operators who disable the query API (api_enabled = false) reclaim the store's memory by also setting max_retained_findings = 0.
query(filter): acquires a read lock, iterates in reverse (newest first), applies optional service, finding_type and severity filters and returns up to limit results (default 100, capped at MAX_FINDINGS_LIMIT = 1000).
by_trace_id(trace_id): acquires a read lock and returns all findings for a specific trace.

RwLock is chosen over Mutex because process_traces (writer) runs once per tick, while the API handlers (readers) may serve concurrent requests. Multiple read locks do not block each other. Clones happen outside the write lock so readers are not blocked by Finding::clone() allocations.

`QueryApiState` shared state

rust

pub struct QueryApiState {
    pub findings_store: Arc<FindingsStore>,
    pub window: Arc<tokio::sync::Mutex<TraceWindow>>,
    pub detect_config: DetectConfig,
    pub start_time: std::time::Instant,
    pub correlator: Option<Arc<tokio::sync::Mutex<CrossTraceCorrelator>>>,
}

This struct is wrapped in Arc and passed as axum State to all route handlers. It provides access to the findings ring buffer, the trace window (for explain), the detection config (for re-running detectors on explain requests) and the optional cross-trace correlator (for /api/correlations).

API endpoints

Six endpoints are mounted via query_api_router(). The router is only merged into the HTTP stack when [daemon] api_enabled = true (default true). Setting api_enabled = false disables all /api/* routes while keeping OTLP ingestion, /metrics and /health active.

Endpoint	Method	Cap	Description
`/api/findings`	GET	`?limit=` clamped to `MAX_FINDINGS_LIMIT = 1000`	Query recent findings with optional `?service=`, `?type=`, `?severity=`, `?limit=` filters
`/api/findings/{trace_id}`	GET	none	All findings for a specific trace
`/api/explain/{trace_id}`	GET	none	Trace tree with findings inline, built from daemon memory
`/api/correlations`	GET	truncated at `MAX_CORRELATIONS_LIMIT = 1000` (sorted by confidence desc)	Active cross-trace correlations from the correlator. Empty when `correlator` is `None`
`/api/status`	GET	none	Daemon health: version, uptime, active traces, stored findings count
`/api/export/report`	GET	inherits `/api/findings` + `/api/correlations` caps	Full `Report` snapshot as JSON, ready to pipe into `perf-sentinel report --input -`

`/api/export/report` snapshot semantics

The endpoint returns a Report struct shape-identical to analyze --format json, so the response can be piped directly into perf-sentinel report --input - to materialize an HTML dashboard from a live daemon. Fields are populated from the daemon's live state: findings from FindingsStore::query, correlations from CrossTraceCorrelator::active_correlations, analysis.events_processed / traces_analyzed from the metrics counters (lifetime values, for context).

green_summary is refreshed by the event loop after each completed batch. Per-batch view: every numeric field under it (total_io_ops, avoidable_io_ops, io_waste_ratio, co2.*, regions, top_offenders, transport_gco2) reflects the most recent batch only, not a daemon-lifetime aggregate. Operators wanting cumulative GreenOps numbers should scrape the /metrics Prometheus counters instead. The HTML dashboard's GreenOps tab renders only when green_summary.co2 is non-null, so daemons configured with Electricity Maps surface the chip banner naturally once at least one batch has been processed. analysis.duration_ms is 0, not daemon uptime: the batch-pipeline value times a single analysis run, and a daemon snapshot has no such run.

Cold-start handling: the endpoint returns 200 OK with an empty Report envelope (findings: [], green_summary: GreenSummary::disabled(0), warnings: ["daemon has not yet processed any events"]). Pre-0.5.16 this path returned 503 Service Unavailable, which tripped Kubernetes probes and confused CI scripts that treated 5xx as a daemon health problem; the empty envelope lets clients detect cold-start without a status-code mismatch. The cold-start check gates on a double counter (events_processed_total > 0 AND traces_analyzed_total > 0): events can be ingested seconds before the first eviction tick fires (trace_ttl_ms / 2, default 15s), so gating only on events_processed > 0 would expose a window where the cell is still disabled(0). The export_report_requests_total counter is bumped before the cold-start check, so cold-start responses are counted too, consistent with HTTP access-log conventions.

Response size is bounded by MAX_FINDINGS_LIMIT + MAX_CORRELATIONS_LIMIT (1000 + 1000 entries) plus a bounded green_summary (top_offenders capped, regions limited by cloud-region cardinality), worst-case body ~3.5 MB. Acceptable on a loopback bind (the documented posture); the cap deserves review if the daemon is ever bound to a non-loopback interface.

Snapshot atomicity: the handler acquires the FindingsStore read lock and the correlator mutex in sequence, not atomically. The two collections can therefore be one batch apart (findings from generation N, correlations from N+1), which is acceptable for a post-mortem dashboard but not for a strict snapshot contract.

Explain without eviction via `peek_clone`

The /api/explain/{trace_id} handler needs to read a trace's spans from the TraceWindow without promoting it in the LRU cache or evicting it. TraceWindow::peek_clone(trace_id) uses the underlying LruCache::peek() method (read-only, no promotion) and clones the spans into a fresh Vec<NormalizedEvent>. The handler then reconstructs a Trace, runs per-trace detectors and builds the explain tree via explain::build_tree and explain::format_tree_json.

If the trace has already been evicted from the window (TTL expired or LRU displaced), the handler returns {"error": "trace not found in daemon memory"}.

Cross-trace correlator integration

Conditional creation

In daemon::run(), the correlator is created only when config.correlation_enabled is true (default false, opt-in via [daemon.correlation] enabled = true). When created, it is wrapped in Arc<Mutex<CrossTraceCorrelator>>:

rust

let correlator = if config.correlation_enabled {
    Some(Arc::new(Mutex::new(
        CrossTraceCorrelator::new(config.correlation_config.clone()),
    )))
} else {
    None
};

Invocation in `process_traces`

The correlator reference (Option<&Mutex<CrossTraceCorrelator>>) is passed to process_traces. After findings are produced, scored and pushed to the FindingsStore, the correlator's ingest() method is called:

rust

if let Some(correlator) = correlator {
    let evicted = correlator.lock().await.ingest(&findings, now_ms);
    // evicted > 0 bumps perf_sentinel_correlator_pairs_evicted_total
    // and logs a bounded warn (eviction is amortized per 10% of cap).
}

This ordering ensures that the FindingsStore always has the findings before the correlator processes them.

Lock contention on the shared mutex was analyzed and is a non-issue at the intended scale: the single analysis worker is the only writer, the /api/correlations readers run at dashboard frequency over a bounded structure (capped pairs, 256-sample lag reservoirs), enforce_pair_cap early-returns under the cap and uses an O(n) quickselect when it trips, and the tokio mutex yields rather than blocking the runtime.

NDJSON output

Active correlations are not emitted to NDJSON stdout alongside findings. They are exposed via the /api/correlations HTTP endpoint and the perf-sentinel query correlations CLI subcommand. This separation avoids mixing findings (per-trace, per-tick) with correlations (aggregated, cross-trace) in the same output stream.

Daemon ack store: JSONL + concurrency

The daemon-side ack store (crates/sentinel-core/src/daemon/ack.rs) complements the CI-side TOML acknowledgments (crate::acknowledgments) with a runtime API for SRE-on-call use cases. The two sources are unioned at query time, with TOML winning on conflict (immutable baseline shipped via PR review).

File format

Append-only JSONL at ~/.local/share/perf-sentinel/acks.jsonl by default. Each line is one event:

jsonl

{"action":"ack","signature":"<sig>","by":"alice","reason":"...","at":"2026-05-04T13:30:00Z","expires_at":null}
{"action":"unack","signature":"<sig>","by":"alice","at":"2026-05-04T14:00:00Z"}

Compaction at startup

The daemon replays the JSONL into a HashMap<Signature, AckEntry> (apply on Ack, remove on Unack, drop on expiry), then atomically rewrites the file via tmp + rename with only the active entries. A runaway ack/unack loop therefore cannot accumulate forever, the file resets every restart.

Concurrency model

The in-memory map sits behind an RwLock for cheap read snapshots. Disk writes go through a Mutex<File> so concurrent ack/unack calls produce one well-formed JSONL line each. The mutex is held for the entire write + map-update so a failed disk write never leaves the map ahead of the persisted state.

Authorization header parsing

The auth-header helper lives in crates/sentinel-core/src/ingest/auth_header.rs. It parses a user-supplied --auth-header "Name: Value" line into a hyper-safe (HeaderName, HeaderValue) pair, shared between the Tempo and Jaeger-Query subcommands.

The parsed value is marked sensitive so hyper omits it from its own debug output and from HTTP/2 HPACK compression tables. The struct also implements a manual Debug that never prints the value, so a logged AuthHeader never leaks the credential.

Validation rules

Parsing is intentionally strict. Beyond the hyper-level checks (token-only name, VCHAR + SP + HTAB value, so internal tabs and spaces inside the value are preserved as-is and only CR/LF + non-visible ASCII are rejected) the parser also refuses:

Raw inputs longer than 8 KiB, to bound the per-task clone in the Tempo parallel fanout and stop a pathological --auth-header "X: $(cat /dev/urandom | head -c 50M | base64)" at the door. A typical JWT is 2 to 4 KiB, 8 KiB leaves headroom for long multi-claim tokens without opening the door to arbitrary blobs.
Values that are empty after trimming, which would send a pointless Authorization: to the backend and produce a confusing 401.
Header names that would enable request smuggling or authority override if user-supplied: Host, Content-Length, Transfer-Encoding, Connection, Upgrade, TE, Proxy-Connection. Users wanting to tweak those should use a local proxy, not this flag.

Ingestion and daemon mode

OTLP conversion

Two-pass design

bytes_to_hex lookup table

nanos_to_iso8601: Howard Hinnant's Algorithm

Event type priority

Clock skew protection

The MetricsSink trait (ingest/otlp.rs)

JSON ingestion

Auto-format detection

Jaeger JSON ingestion

Zipkin JSON v2 ingestion

Daemon event loop

Architecture

Normalization outside the lock

Trace-level sampling

Bounded channel

Analysis worker

Security hardening

NDJSON output

Cumulative waste ratio

Grafana exemplars

pg_stat_statements ingestion

Design decisions

Four-ranking output

Cross-referencing

Automated pg_stat Prometheus scrape

Query and conversion

CLI integration

Tempo ingestion

Parallel fetch with concurrency cap

Timeout split

Ctrl-C and error aggregation

Daemon query API

FindingsStore ring buffer

QueryApiState shared state

API endpoints

/api/export/report snapshot semantics

Explain without eviction via peek_clone

Cross-trace correlator integration

Conditional creation

Invocation in process_traces

NDJSON output

Daemon ack store: JSONL + concurrency

File format

Compaction at startup

Concurrency model

Authorization header parsing

Validation rules

`bytes_to_hex` lookup table

`nanos_to_iso8601`: Howard Hinnant's Algorithm

The `MetricsSink` trait (`ingest/otlp.rs`)

`FindingsStore` ring buffer

`QueryApiState` shared state

`/api/export/report` snapshot semantics

Explain without eviction via `peek_clone`

Invocation in `process_traces`