perf-sentinel exposed metrics

This document lists all metrics exposed by the perf-sentinel daemon on /metrics (Prometheus text format). The endpoint serves both text/plain; version=0.0.4 (legacy Prometheus) and application/openmetrics-text; version=1.0.0 (OpenMetrics) via content negotiation, and emits exemplars when finding-level traces are available.

Background: Prometheus and OpenMetrics primer

If you have not used Prometheus before, read this short primer before the rest of this document. It assumes you know what HTTP is and what a metric is. It does not assume familiarity with the Prometheus query language or operator. Other perf-sentinel docs cross-reference this primer for Prometheus concepts; see Helm deployment and Query API.

What is Prometheus. Prometheus is a Cloud Native Computing Foundation (CNCF) project, the most widely deployed open-source metrics system in the cloud-native ecosystem. It works by scraping: every 15-60 seconds the Prometheus server makes an HTTP GET to each target's /metrics endpoint, parses the response, and stores the values as time series. perf-sentinel exposes such a /metrics endpoint when running as a daemon. Operators who already run Prometheus add perf-sentinel to their scrape_configs, and the daemon's metrics show up alongside the rest of their infrastructure.

Two text formats served by perf-sentinel. Content negotiation selects which one the scraper gets.

  • text/plain; version=0.0.4 is the original Prometheus exposition format. Stable since 2014.
  • application/openmetrics-text; version=1.0.0 is OpenMetrics, the standardised evolution of the Prometheus format published by the CNCF in 2020. It is mostly a superset, with two practical additions perf-sentinel uses: # UNIT headers on each metric, and exemplars (per-point trace references that let a Grafana panel jump from a metric spike to the exact trace that produced it).
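As a rough illustration of the negotiation described above, a client can ask for OpenMetrics first and list the legacy format as a fallback. The sketch below only builds the request and sends nothing. The helper name build_scrape_request, the host and the port are assumptions made for the example, not part of perf-sentinel.

```python
import urllib.request

OPENMETRICS = "application/openmetrics-text; version=1.0.0"
LEGACY = "text/plain; version=0.0.4"


def build_scrape_request(base_url: str, want_openmetrics: bool = True) -> urllib.request.Request:
    """Build a GET request for /metrics with an Accept header (hypothetical helper)."""
    if want_openmetrics:
        # Prefer OpenMetrics, accept the legacy text format at lower priority.
        accept = f"{OPENMETRICS}, {LEGACY};q=0.5"
    else:
        accept = LEGACY
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(url, headers={"Accept": accept})


# Example address only; use whatever --listen-address the daemon is bound to.
request = build_scrape_request("http://127.0.0.1:9090")
```

Passing the request to urllib.request.urlopen would perform the scrape; the Content-Type of the response shows which format was selected.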

Metric types. Every metric below carries one of three types.

  • Counter, a value that only goes up (for example the number of OTLP spans ingested). Reset to zero only on restart. Always read as rate(metric[5m]) to get a per-second rate, never as the raw value.
  • Gauge, a value that goes up and down (for example the number of in-flight findings, or resident memory). Read as-is.
  • Histogram, a distribution of observations bucketed by value (for example detection latency). Exposed as several time series: _bucket{le=...} per bucket, plus _sum and _count. Read with histogram_quantile(0.99, ...) to get latency percentiles.
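To see why a counter is read as a rate rather than as a raw value, the sketch below computes an average per-second increase from a few scraped samples and tolerates a restart in the middle of the window. It is a simplification: the real rate() function also extrapolates to the window edges, which is omitted here. The function name and sample values are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the window, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

With samples (0, 100), (30, 160), (60, 20), the raw value ends lower than it started because of the restart, yet the computed rate stays positive at 80 / 60 per second.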

Where to learn more. prometheus.io, OpenMetrics spec, exemplars in OpenMetrics.

Process metrics (since 0.5.19, Linux only)

Standard process collector metrics from the prometheus crate's process_collector module. Registered only on Linux (the underlying procfs reads return errors on macOS/Windows). Operators on non-Linux hosts get perf_sentinel_* metrics only.

| Metric | Type | Description |
| --- | --- | --- |
| process_resident_memory_bytes | gauge | RSS of the daemon process. |
| process_virtual_memory_bytes | gauge | Virtual memory size. |
| process_open_fds | gauge | Open file descriptors. |
| process_max_fds | gauge | Maximum allowed file descriptors. |
| process_start_time_seconds | gauge | Unix timestamp of process start. |
| process_cpu_seconds_total | counter | Cumulative CPU time. |

Per-scrape cost. The collector reads /proc/self/{stat,status,limits} and walks /proc/self/fd/ on every scrape. On a daemon with thousands of long-lived OTLP connections plus outbound scrapers, the FD walk can dominate at 1-5 ms per scrape. The Prometheus Registry::gather() lock is held for the duration, so a slow collector blocks concurrent scrapes when several scrapers (Prometheus + vmagent + sidecar) target the same endpoint. Acceptable at the typical 15-60 second scrape interval, worth noting for tighter intervals.

Exposure scope. When the operator binds the metrics endpoint to 0.0.0.0 (Kubernetes Pod default for cluster-internal scraping), the process metrics expose operationally sensitive signals: uptime via process_start_time_seconds (patch / restart inference), file descriptor pressure via process_open_fds and process_max_fds (saturation oracle), and memory footprint via process_resident_memory_bytes. Default --listen-address is 127.0.0.1, which scopes scraping to the same host or the Pod itself. For cluster-wide scraping topologies, gate /metrics behind a Kubernetes NetworkPolicy and prefer Prometheus-side mTLS so a sibling Pod cannot read the daemon's process state freely.
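To make the "saturation oracle" concrete, the sketch below shows how little it takes to derive descriptor pressure and uptime from a scraped response. The parser is deliberately minimal (it skips comments and labelled series) and the sample values are invented for the example.

```python
import time


def parse_unlabelled_samples(text: str) -> dict:
    """Collect `name value` samples; skips comments and labelled series."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


def fd_usage_ratio(samples: dict) -> float:
    """Fraction of the file descriptor limit currently in use."""
    return samples["process_open_fds"] / samples["process_max_fds"]


def uptime_seconds(samples: dict, now: float) -> float:
    """Seconds since process start, given the current unix time."""
    return now - samples["process_start_time_seconds"]


# Invented sample body, shaped like the text exposition format.
SAMPLE = """\
# TYPE process_open_fds gauge
process_open_fds 256
process_max_fds 1024
process_start_time_seconds 1700000000
"""

samples = parse_unlabelled_samples(SAMPLE)
print(fd_usage_ratio(samples), uptime_seconds(samples, time.time()))
```

Any Pod that can reach the endpoint can run the equivalent of these few lines, which is the reason for the NetworkPolicy and mTLS advice above.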

OTLP ingestion metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_otlp_rejected_total | counter | reason | Total OTLP requests rejected by the daemon since start, by reason (since 0.5.19). |
| perf_sentinel_otlp_spans_received_total | counter | (none) | Total OTLP spans received across all requests, before I/O filtering (since 0.8.7). |
| perf_sentinel_otlp_spans_filtered_total | counter | reason | OTLP spans skipped by conversion because they are not analyzable I/O operations (since 0.8.7). |

reason label values:

  • unsupported_media_type (HTTP only): Content-Type is not application/x-protobuf. perf-sentinel does not implement the JSON-encoded OTLP variant.
  • parse_error (HTTP only): protobuf decode failed.
  • channel_full (HTTP and gRPC): the event channel is saturated or closed and the daemon could not enqueue the batch. The enqueue waits up to 2 seconds before rejecting, so short bursts absorb without a rejection while sustained saturation surfaces quickly. The HTTP path returns 503, the gRPC path returns UNAVAILABLE on saturation (both retryable per the OTLP spec) and INTERNAL only when the channel is closed during shutdown.

All 3 reasons are pre-warmed to 0 at startup so dashboards can plot zero-values before the first rejection.

payload_too_large is not counted by this metric. Tower-http (RequestBodyLimitLayer) on the HTTP path and tonic (max_decoding_message_size) on the gRPC path enforce the cap upstream and return 413 / RESOURCE_EXHAUSTED before the application handler runs. Operators concerned with payload size should monitor the upstream proxy or gateway logs, or wire a tower-http rejection counter in their own stack.

The two span-level counters expose the retention ratio of the deliberate I/O filter (only SQL and outbound-HTTP spans are analyzable, see Limitations). A fleet whose instrumentation strips db.statement or http.url converts every request to zero events while requests keep returning success, and only this counter pair makes that visible: perf_sentinel_otlp_spans_received_total rising while perf_sentinel_events_processed_total stays flat means the spans arrive but none carries an analyzable attribute. reason label values of perf_sentinel_otlp_spans_filtered_total, pre-warmed to 0:

  • not_io: span carries no db.* statement and no HTTP url or method (internal span, cache hit, middleware...). Expected to dominate on well-instrumented fleets.
  • missing_db_statement: span has db.system but neither db.statement nor db.query.text. Typical of drivers configured to omit query text.
  • missing_http_url: span has an HTTP method but neither http.url nor url.full.
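The retention ratio mentioned above can be sketched from the two span counters. This assumes that every received span is either counted under exactly one filter reason or retained for analysis; the source does not spell that accounting out, so treat the formula as an approximation. The function name and the returned keys are illustrative.

```python
from typing import Dict, Optional


def span_filter_summary(received: float, filtered_by_reason: Dict[str, float]) -> dict:
    """Summarise how many received spans survive the I/O filter."""
    filtered = sum(filtered_by_reason.values())
    # Assumption: retained = received - filtered (one filter reason per dropped span).
    retained = max(received - filtered, 0)
    ratio: Optional[float] = retained / received if received else None
    return {
        "filtered": filtered,
        "retained": retained,
        "retention_ratio": ratio,
        # Spans arrive but none is analyzable: the silent failure described above.
        "blind": received > 0 and retained == 0,
    }
```

A fleet whose drivers omit query text would show up here as a high missing_db_statement share and a retention ratio close to zero.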

Analysis and findings metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_findings_total | counter | type, severity | Findings detected since daemon start. type mirrors the Finding.finding_type enum, severity is critical / warning / info. Carries OpenMetrics exemplars when a trace_id is available. |
| perf_sentinel_traces_analyzed_total | counter | (none) | Cumulative trace count processed by the event loop. |
| perf_sentinel_events_processed_total | counter | (none) | Cumulative event count processed by the event loop. |
| perf_sentinel_active_traces | gauge | (none) | Currently active traces in the sliding window. |
| perf_sentinel_analysis_queue_depth | gauge | (none) | Batches pending in the analysis worker queue (incremented on enqueue, decremented when the worker picks a batch up). A sustained nonzero value means detect+score is falling behind ingestion. |
| perf_sentinel_stored_findings | gauge | (none) | Findings currently retained in the query ring buffer (since 0.8.8). Pair with perf_sentinel_max_retained_findings for a headroom ratio. |
| perf_sentinel_max_active_traces | gauge | (none) | Configured cap of the sliding window ([daemon] max_active_traces), set once at startup (since 0.8.8). Pair with perf_sentinel_active_traces. The settings advisor hints at 90%. |
| perf_sentinel_analysis_queue_capacity | gauge | (none) | Configured cap of the analysis worker queue ([daemon] analysis_queue_capacity), set once at startup (since 0.8.8). Pair with perf_sentinel_analysis_queue_depth. |
| perf_sentinel_max_retained_findings | gauge | (none) | Configured cap of the findings ring buffer ([daemon] max_retained_findings), set once at startup (since 0.8.8). Pair with perf_sentinel_stored_findings. |
| perf_sentinel_analysis_shed_batches_total | counter | (none) | Analysis batches shed because the worker queue was full or the worker stopped. Replaces the previous implicit drop: every shed is counted here. Alert on rate(...) > 0. |
| perf_sentinel_analysis_shed_traces_total | counter | (none) | Traces dropped by the shed batches counted in perf_sentinel_analysis_shed_batches_total. |
| perf_sentinel_correlator_pairs_evicted_total | counter | (none) | Cross-trace correlator pairs evicted by the max_tracked_pairs cap (since 0.8.7). A sustained rate means the correlation topology exceeds the cap and lowest-count pairs are recycled, so /api/correlations may drop entries between reads. |
| perf_sentinel_slow_duration_seconds | histogram | type | Duration histogram for spans exceeding the slow threshold, by event type (sql or http_out). Buckets: 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 5, 10, 30 seconds. Used by Grafana histogram_quantile() for accurate percentiles across sharded daemon instances. |
| perf_sentinel_export_report_requests_total | counter | (none) | Total GET /api/export/report requests. Includes cold-start responses (200 with empty envelope). |
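The reason a histogram gives accurate percentiles across sharded instances is that cumulative bucket counts can be summed before the quantile is estimated, whereas per-instance percentiles cannot be averaged meaningfully. The sketch below uses the bucket bounds listed for perf_sentinel_slow_duration_seconds and the usual linear interpolation inside a bucket. It is an illustration of the idea, not the Prometheus implementation, and the function names are made up.

```python
import math
from typing import List

# Upper bounds documented for perf_sentinel_slow_duration_seconds, plus +Inf.
BOUNDS = [0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 5, 10, 30, math.inf]


def merge_shards(shards: List[List[float]]) -> List[float]:
    """Sum cumulative bucket counts position by position across daemon instances."""
    return [sum(counts) for counts in zip(*shards)]


def histogram_quantile(q: float, cumulative: List[float]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in zip(BOUNDS, cumulative):
        if count >= rank:
            if math.isinf(upper):
                # The observation lies beyond the last finite bound.
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return math.nan
```

Usage would be histogram_quantile(0.99, merge_shards([shard_a, shard_b])), where each shard list holds the twelve _bucket values in bound order.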

Ack metrics (since 0.5.21)

Operator-driven activity on the daemon ack API (POST / DELETE /api/findings/{signature}/ack). Read-only TOML acks loaded from .perf-sentinel-acknowledgments.toml at daemon startup are not counted, because no operations occur on them after the initial load.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_ack_operations_total | counter | action | Successful ack and unack operations. |
| perf_sentinel_ack_operations_failed_total | counter | action, reason | Failed ack and unack operations, broken down by failure reason. |

action label values: ack, unack.

reason label values:

  • already_acked (HTTP 409, action=ack only): signature already in the daemon JSONL, or covered by a TOML CI baseline that is still active. Both cases collapse on the same series.
  • not_acked (HTTP 404, action=unack only): signature has no active daemon ack record.
  • unauthorized (HTTP 401): [daemon.ack] api_key is set and the request is missing or has an invalid X-API-Key header. The series is pre-warmed at zero, so a non-zero value confirms api_key is configured (the counter only ever increments when auth is enforced).
  • no_store (HTTP 503): daemon ack store is disabled ([daemon.ack] enabled = false, or default storage path could not be resolved at startup).
  • invalid_signature (HTTP 400): the {signature} path segment fails canonical format validation.
  • limit_reached (HTTP 507, action=ack only): MAX_ACTIVE_ACKS (10 000) reached, refusing to accept a new entry.
  • file_too_large (HTTP 507, action=ack only): append would push the JSONL above 64 MiB. Per-daemon saturation, indicates compaction is needed at next restart or the cap should be raised. The unack path surfaces this under internal_error (HTTP 500) since the ack endpoints do not currently differentiate the cap on the unack write.
  • entry_too_large (HTTP 507, action=ack only): a single record exceeds 4 KiB after serialization, typically because the caller-supplied by or reason field is oversized. Per-request misuse, indicates client-side validation should be tightened. Same unack-path caveat as file_too_large.
  • internal_error (HTTP 500): IO failure, serialization error, symlink refused, insecure permissions, or no default storage location at write time.

Pre-warming. Both counters emit zero for documented reachable combinations before the first request, so dashboards build with rate() queries without absent() guards. The pre-warmed set is 2 success series (action=ack and action=unack) plus 13 failure series (8 reasons on action=ack, 5 reasons on action=unack). Impossible combinations (such as action=ack,reason=not_acked or action=unack,reason=already_acked) are intentionally not pre-warmed to avoid misleading series.
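The 13 failure series follow directly from the reason list above. The sketch below enumerates them from a table of which actions can produce each reason; it restates the documentation rather than the daemon's source, and the names REASON_ACTIONS and prewarmed_failure_series are illustrative.

```python
# Which actions can produce each failure reason, per the reason list above.
REASON_ACTIONS = {
    "already_acked": ("ack",),
    "not_acked": ("unack",),
    "unauthorized": ("ack", "unack"),
    "no_store": ("ack", "unack"),
    "invalid_signature": ("ack", "unack"),
    "limit_reached": ("ack",),
    "file_too_large": ("ack",),
    "entry_too_large": ("ack",),
    "internal_error": ("ack", "unack"),
}


def prewarmed_failure_series() -> list:
    """Every reachable (action, reason) pair, sorted for stable output."""
    return sorted(
        (action, reason)
        for reason, actions in REASON_ACTIONS.items()
        for action in actions
    )


series = prewarmed_failure_series()
ack_count = sum(1 for action, _ in series if action == "ack")
unack_count = sum(1 for action, _ in series if action == "unack")
print(len(series), ack_count, unack_count)
```

Impossible pairs such as (ack, not_acked) never appear in the result, which matches the intent of not pre-warming them.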

Sample queries.

  • rate(perf_sentinel_ack_operations_total[5m]): ack and unack operations per second, useful for trend lines.
  • sum by (reason) (rate(perf_sentinel_ack_operations_failed_total{action="ack"}[5m])): ack failures by reason. Spikes on unauthorized indicate auth misconfiguration, spikes on entry_too_large indicate a misbehaving client (oversized by / reason payloads), spikes on limit_reached or file_too_large indicate store saturation.

Scaphandre scrape counters (since 0.5.25)

Per-tick outcome of the daemon-side Scaphandre scraper (the task that fetches scaph_process_power_consumption_microwatts from the configured [green.scaphandre] endpoint every scrape_interval_secs). Only registered when the daemon is built with the daemon feature.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_scaphandre_scrape_total | counter | status | Total Scaphandre scrape attempts since daemon start, partitioned by outcome. |
| perf_sentinel_scaphandre_scrape_failed_total | counter | reason | Total failed Scaphandre scrapes since daemon start, partitioned by failure reason. |
| perf_sentinel_scaphandre_last_scrape_age_seconds | gauge | (none) | Seconds since the last successful scrape (resets to 0 on success). Hung-scraper canary. |

status label values: success, failed. Pre-warmed at zero so dashboards plot rate-zero before the first scrape.

reason label values:

  • unreachable: low-level transport failure (connection refused, DNS failure, TLS handshake error, host down). The endpoint is not reachable from the daemon pod.
  • timeout: the 3-second hard deadline on the per-scrape HTTP call elapsed before a response landed.
  • http_error: the endpoint replied with a non-2xx status code.
  • body_read_error: transport error while streaming the response body after a successful status read.
  • request_error: hyper failed to build the HTTP request from the (post-validation) URI. Rare, indicates a configuration edge case the URI parser missed.
  • invalid_utf8: the response body was not valid UTF-8. Scaphandre always emits ASCII-safe Prometheus text, so this almost always means the endpoint is not Scaphandre.

Pre-warming. Both counters emit zero for every documented label value before the first scrape, so rate() queries do not need absent() guards. The pre-warmed set is 2 status series plus 6 reason series. Configuration parsing failures (invalid endpoint URI) abort the scraper task at startup before the counter is touched, they are visible only in daemon logs at error level.

Sample queries.

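  • sum(rate(perf_sentinel_scaphandre_scrape_total{status="success"}[5m])) / sum(rate(perf_sentinel_scaphandre_scrape_total[5m])): scrape success ratio over 5 minutes. The sum() on both sides matters: without it, PromQL matches the two operands label by label, so the success series is divided by itself and the result is always 1. Useful for an SLO panel or alert (< 0.95 over 15 minutes flags a degraded scraper).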
  • topk(1, increase(perf_sentinel_scaphandre_scrape_failed_total[1h])): dominant failure reason over the past hour. Persistent unreachable typically points at a missing Scaphandre exporter on the host, persistent http_error at an exporter behind a reverse proxy returning the wrong status, persistent invalid_utf8 at an endpoint that is not Scaphandre at all.
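The arithmetic behind the two queries can be sketched from counter readings taken at both ends of a window. This assumes no daemon restart inside the window (no counter reset); the function names and the dictionary shape are illustrative.

```python
from typing import Dict, Optional


def scrape_success_ratio(before: Dict[str, float], after: Dict[str, float]) -> Optional[float]:
    """before/after: scrape_total readings keyed by status ("success", "failed")."""
    ok = after["success"] - before["success"]
    failed = after["failed"] - before["failed"]
    attempts = ok + failed
    return ok / attempts if attempts else None


def dominant_failure(before: Dict[str, float], after: Dict[str, float]) -> Optional[str]:
    """Reason with the largest increase in scrape_failed_total, or None if nothing failed."""
    deltas = {reason: after[reason] - before.get(reason, 0) for reason in after}
    if not deltas:
        return None
    reason = max(deltas, key=deltas.get)
    return reason if deltas[reason] > 0 else None
```

A ratio of 18 successes out of 20 attempts gives 0.9, which falls below the 0.95 threshold suggested above.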

Kepler scrape counters (since 0.7.4)

Per-tick outcome of the daemon-side Kepler scraper (fetches the configured kepler_*_cpu_joules_total series from [green.kepler] endpoint). Only registered when the daemon is built with the daemon feature. The label set mirrors Scaphandre because both sources hit the same six HTTP failure modes verbatim.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_kepler_scrape_total | counter | status | Total Kepler scrape attempts since daemon start, partitioned by outcome. |
| perf_sentinel_kepler_scrape_failed_total | counter | reason | Total failed Kepler scrapes since daemon start, partitioned by failure reason. |
| perf_sentinel_kepler_last_scrape_age_seconds | gauge | (none) | Seconds since the last HTTP-200 (resets to 0 on any HTTP-200, see staleness note below). |

The status and reason labels carry the same values as the Scaphandre counters above (success / failed for status, and the same six HTTP failure reasons for reason), pre-warmed at zero before the first scrape.

Zero-sample staleness pitfall. perf_sentinel_kepler_last_scrape_age_seconds resets to 0 on every HTTP-200 response, including an HTTP-200 whose body contains no matching Kepler-v2 series (the classic v0.7.4-to-v0.7.5 upgrade case where the cluster still runs Kepler < 0.10 with the legacy metric names). Alerts driven only by the gauge will not catch this scenario. After three consecutive HTTP-200 ticks with zero matching samples, the daemon emits a tracing::warn! line containing metric and label fields; alert on the log instead, or pair the gauge with rate(perf_sentinel_kepler_scrape_total{status="success"}[5m]) and the daemon-side kepler_ebpf co2.model tag presence.
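The "three consecutive HTTP-200 ticks with zero matching samples" rule can be sketched as a simple streak counter. This is a reading of the behaviour described above, not the daemon's code: in particular, the assumption that any other tick (a failure, or a 200 with samples) resets the streak is ours, and the function name is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_zero_sample_warning(ticks: Iterable[Tuple[int, int]], threshold: int = 3) -> Optional[int]:
    """ticks: (http_status, matching_sample_count) per scrape tick.

    Returns the index of the tick at which the warning would first fire, else None.
    """
    streak = 0
    for index, (status, matched) in enumerate(ticks):
        if status == 200 and matched == 0:
            streak += 1
            if streak == threshold:
                return index
        else:
            # Assumption: anything other than an empty 200 breaks the streak.
            streak = 0
    return None
```

Throughout such a streak the age gauge keeps resetting to 0, which is why an alert on the gauge alone stays silent.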

Redfish scrape counters (since 0.7.4)

Same shape as the Kepler block above with kepler -> redfish in the metric names. The reason label set adds three Redfish-specific values on top of the shared HTTP set: invalid_json, path_missing, invalid_value for vendor-variance failure modes on the BMC /Power response.

GreenOps metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| perf_sentinel_io_waste_ratio | gauge | (none) | Cumulative I/O waste ratio (avoidable / total) since daemon start. Use rate() on the underlying counters for windowed values. |
| perf_sentinel_energy_kwh | gauge | (none) | Workload energy of the most recent scoring window, kWh (since 0.8.8). Scalar total only: the per-service and per-region breakdown stays off /metrics (cardinality) and lives on the query monitor Energy/Trends tabs. |
| perf_sentinel_carbon_gco2 | gauge | (none) | Operational carbon of the most recent scoring window, grams CO2e, summed across regions (since 0.8.8). Same scalar-only rationale as perf_sentinel_energy_kwh. |
| perf_sentinel_total_io_ops | counter | (none) | Cumulative total I/O ops processed. |
| perf_sentinel_avoidable_io_ops | counter | (none) | Cumulative avoidable I/O ops detected. |
| perf_sentinel_service_io_ops_total | counter | service | Per-service cumulative I/O ops (read by every measured-energy scraper for per-service energy attribution). Label cardinality is capped at 1024 distinct services per daemon run, new services beyond the cap are not attributed. |
| perf_sentinel_service_io_ops_overflow_total | counter | (none) | I/O ops not attributed to a per-service counter because the 1024-service cardinality cap was reached (since 0.8.7). An ongoing increase means per-service throughput and measured-energy attribution undercount newly seen services. |
| perf_sentinel_scaphandre_last_scrape_age_seconds | gauge | (none) | Seconds since the last successful Scaphandre scrape. Stays at 0 when Scaphandre is not configured. Useful for hung-scraper alerts. |
| perf_sentinel_cloud_energy_last_scrape_age_seconds | gauge | (none) | Same pattern for the cloud SPECpower scraper. |
| perf_sentinel_kepler_last_scrape_age_seconds | gauge | (none) | Same pattern for the Kepler scraper. See the zero-sample staleness pitfall above. |
| perf_sentinel_redfish_last_scrape_age_seconds | gauge | (none) | Same pattern for the Redfish BMC scraper. |
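The difference between the cumulative perf_sentinel_io_waste_ratio gauge and a windowed value derived from the two underlying counters can be sketched as follows. The function names are illustrative, and returning 0.0 when no I/O was observed is an assumption of this example, not documented daemon behaviour.

```python
from typing import Dict


def io_waste_ratio(avoidable: float, total: float) -> float:
    """avoidable / total, with 0.0 when no I/O ops were observed (assumed convention)."""
    return avoidable / total if total else 0.0


def windowed_waste_ratio(start: Dict[str, float], end: Dict[str, float]) -> float:
    """Ratio over a window, from counter readings at its start and end (no reset assumed)."""
    delta_avoidable = end["avoidable"] - start["avoidable"]
    delta_total = end["total"] - start["total"]
    return io_waste_ratio(delta_avoidable, delta_total)
```

With 250 avoidable out of 1000 ops at the start of a window and 260 out of 1200 at the end, the lifetime ratio is still about 0.22 while the window itself sits at 0.05, so a recent improvement only shows in the windowed value.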

Warning kinds: transient vs sticky

Report.warning_details (since 0.5.19) has three stable kinds today, each with a different lifecycle. The distinction matters for monitoring strategies: a transient warning self-resolves, a sticky one persists until the daemon restarts.

| Kind | Lifecycle | Emitted when | Cleared by |
| --- | --- | --- | --- |
| cold_start | Transient | events_processed_total == 0 or traces_analyzed_total == 0 on the daemon | First successful batch (both counters strictly positive) |
| ingestion_drops | Sticky | perf_sentinel_otlp_rejected_total{reason="channel_full"} > 0 since daemon start | Daemon restart (counter reset) |
| tuning | Mixed | A lifetime counter shows a config knob undersized for the observed load (see below) | Daemon restart for counter-driven rules, load drop for the trace-window rule |

cold_start is a state warning: "the snapshot is not meaningful right now". ingestion_drops is an audit warning: "at some point since daemon start the channel saturated, here is the count for the post-mortem". Acknowledging findings via the daemon ack API does not clear any warning kind, because warnings reflect daemon state rather than detection output.
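The two conditions for cold_start and ingestion_drops can be restated as a small predicate over counter values. This mirrors the documented conditions only; the function name and argument names are illustrative, and the tuning kind is left out.

```python
from typing import List


def snapshot_warning_kinds(
    events_processed: float,
    traces_analyzed: float,
    channel_full_rejections: float,
) -> List[str]:
    """Warning kinds implied by the documented conditions (tuning not covered)."""
    kinds = []
    # Transient: clears as soon as both counters are strictly positive.
    if events_processed == 0 or traces_analyzed == 0:
        kinds.append("cold_start")
    # Sticky: the counter only returns to zero on daemon restart.
    if channel_full_rejections > 0:
        kinds.append("ingestion_drops")
    return kinds
```

A daemon that saturated once and then recovered keeps reporting ingestion_drops, while cold_start disappears after the first processed batch.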

The tuning advisor (since 0.8.7)

tuning entries are configuration advice: each message names the config knob, its current value, and the suggested adjustment. Six rules run on every /api/export/report call:

| Trigger | Suggested knob |
| --- | --- |
| perf_sentinel_otlp_rejected_total{reason="channel_full"} > 0 | [daemon] ingest_queue_capacity |
| perf_sentinel_analysis_shed_batches_total > 0 | [daemon] analysis_queue_capacity or more CPU |
| perf_sentinel_active_traces at 90% or more of max_active_traces | [daemon] max_active_traces or a lower trace_ttl_ms |
| perf_sentinel_service_io_ops_overflow_total > 0 | Aggregate or reduce service names (the 1024-series cap is fixed) |
| perf_sentinel_correlator_pairs_evicted_total > 0 with correlation enabled | [daemon.correlation] max_tracked_pairs |
| Every received OTLP span filtered as non-analyzable (after 1000 spans) | Fix span attributes or point instrumented services at this endpoint |

Counter-driven rules are sticky (lifetime counters only reset on restart). The trace-window rule reads a gauge, so it appears and disappears with the load. The advisor reads the config snapshot taken at daemon startup, so a hint always reflects the values the running process actually uses.
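The six triggers can be approximated outside the daemon, for example in a lab script that checks a metrics snapshot. The sketch below is an illustrative re-statement of the trigger table, not the advisor's implementation. The dictionary keys are shortened names chosen for the example, and reading "after 1000 spans" as "at least 1000 received" is an assumption.

```python
from typing import Dict, List


def tuning_hints(m: Dict[str, float], correlation_enabled: bool = True) -> List[str]:
    """Return the suggested knobs for a snapshot of metric values (hypothetical keys)."""
    hints = []
    if m.get("otlp_rejected_channel_full", 0) > 0:
        hints.append("[daemon] ingest_queue_capacity")
    if m.get("analysis_shed_batches_total", 0) > 0:
        hints.append("[daemon] analysis_queue_capacity")
    cap = m.get("max_active_traces", 0)
    # 90% or more of the cap, in integer arithmetic to avoid float rounding.
    if cap > 0 and m.get("active_traces", 0) * 10 >= cap * 9:
        hints.append("[daemon] max_active_traces")
    if m.get("service_io_ops_overflow_total", 0) > 0:
        hints.append("aggregate or reduce service names")
    if correlation_enabled and m.get("correlator_pairs_evicted_total", 0) > 0:
        hints.append("[daemon.correlation] max_tracked_pairs")
    received = m.get("otlp_spans_received_total", 0)
    # Assumption: the rule arms once at least 1000 spans have been received.
    if received >= 1000 and m.get("otlp_spans_filtered_total", 0) >= received:
        hints.append("fix span attributes")
    return hints
```

Only the active-traces check reads gauges, so it is the only hint in this sketch that can disappear again without a restart.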

Lab tooling that asserts on warning_details[].kind == "cold_start" should account for the transient nature: any background traffic, even synthetic seed traces or health probes, can close the cold-start window in well under 60 seconds.

Cross-references

  • Shipped alerts: the Helm chart packages these alert hints as a PrometheusRule (prometheusRule.enabled), see Helm deployment.
  • Report.warning_details field (operator-facing snapshot warnings): see Runbook section "Reading Report warnings".
  • Acknowledgments workflow (cross-format finding suppression): see Acknowledgments.
  • SARIF emitter for CI integration: see SARIF.