Comparing OpenTelemetry vs Custom Metrics for GIS

When a spatial ETL pipeline stalls or a vector tile service degrades, the gap between a five-minute MTTR and a multi-hour outage often comes down to one narrow decision: which signals flow through OpenTelemetry’s vendor-neutral pipeline, and which are emitted as hand-authored custom metrics bound to spatial service-level objectives. This page sits under Fallback Chains for Spatial API Failures and within the broader Geospatial Observability Architecture & Fundamentals discipline, and it answers a question every GIS platform team eventually hits: do you instrument coordinate transformation failures, GiST index regressions, and tile render timeouts with OpenTelemetry semantic conventions, or with bespoke counters and histograms? The durable answer is hybrid: OTel owns transport, context propagation, and distributed trace correlation, while custom metrics enforce strict, domain-aware thresholds. Getting the split wrong means OTel reports a healthy HTTP server duration histogram (http.server.duration, or http.server.request.duration in newer semantic conventions) while a PostGIS planner quietly drops to a sequential scan because of stale statistics.

Problem Framing

The decision arises because OpenTelemetry’s semantic conventions were designed for generic web and RPC workloads. They model latency, status codes, and span hierarchy beautifully, but they carry no native attributes for geometry vertex counts, spatial join cardinality, CRS pairs, or tile cache hit ratios. A pure-OTel deployment therefore tells you that a request was slow without telling you why — the bottleneck (a stale ANALYZE forcing a Seq Scan over a GiST index, or a datum-shift error in a reprojection) is invisible at the HTTP layer. Conversely, a pure custom-metrics deployment gives you sharp spatial thresholds but loses the cross-service trace context needed to follow a single feature from ingestion through enrichment to tile publication.

The signals that tell you the split is wrong are concrete. If your dashboards show stable p95 HTTP latency while users report misaligned features, your telemetry is blind to the domain. If a crs.transform.error is buried as a generic span event with no dedicated counter, it will never page anyone. The affected pipeline stage is almost always the enrichment-and-transform stage — the same stage that triggers external lookups governed by the parent Fallback Chains for Spatial API Failures router. Standardizing what each layer emits is the job of the Geospatial Metric Taxonomy for ETL, and the instrumentation wiring follows the patterns in OpenTelemetry Integration for GIS Pipelines.

Implementation

The hybrid model is implemented in three layers: enrich OTel spans with spatial attributes, route the accompanying domain metrics through a contrib Collector that isolates them from infrastructure noise, then define Prometheus alerting rules that own the hard thresholds. Start by attaching domain attributes to every execution span so the vendor-neutral trace still carries triage metadata. Each attribute below maps to a question an on-call engineer asks first.

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("gis-pipeline-instrumentation")

def execute_spatial_operation(operation_type: str, geom_metadata: dict, src_crs: str, target_crs: str):
    with tracer.start_as_current_span(
        "geo.operation.execute",
        kind=SpanKind.INTERNAL,
        attributes={
            "geo.operation.type": operation_type,                         # spatial_join | reproject | tile_render
            "geo.geometry.complexity": geom_metadata.get("vertex_count", 0),  # drives planner cost; high = GiST risk
            "geo.geometry.bbox_area_sqkm": geom_metadata.get("bbox_area", 0.0),  # detects runaway extents
            "geo.crs.source": src_crs,                                    # e.g. EPSG:4326 — pairs with target for drift audit
            "geo.crs.target": target_crs,                                 # e.g. EPSG:3857
            "geo.index.strategy": "gist" if operation_type == "spatial_join" else "brin"
        }
    ) as span:
        # Execute spatial transformation or query logic
        pass

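Spans and metrics travel separately from here. Emit the geospatial metrics over OTLP alongside the enriched spans and route them through an OpenTelemetry contrib Collector whose filter processor keeps only the geospatial metric families before export. This is the boundary that separates domain signal from infrastructure noise: every metric whose name starts with geo_, tile_, or crs_ survives, while generic http_ series and the traces themselves are exported on separate pipelines (not shown). The filter sees the metric name as it arrives over OTLP, before the Prometheus exporter normalizes it, so name the instruments with these prefixes at the source.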

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 2000
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - "geo_.*"     # geo_query_duration_seconds, geo_join_cardinality
          - "tile_.*"    # tile_render_timeout_count
          - "crs_.*"     # crs_transform_error_count

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true      # promote resource attributes (region, service) to metric labels

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter, batch]
      exporters: [prometheus]

Custom metrics earn their place where OTel histograms cannot express operational reality. Enforce geo.query latency at p95 ≤ 150 ms and p99 ≤ 400 ms; cap 256×256 vector tiles at 2 s and 512×512 at 5 s via a tile_render_timeout_count counter; hold ETL throughput above 10,000 features/s/worker; and treat any crs_transform_error_count increment as a zero-tolerance P1. These boundaries are encoded as Prometheus alerting rules — the custom-metrics half of the hybrid:

groups:
  - name: spatial_slo_enforcement
    rules:
      - alert: SpatialQueryLatencyDegraded
        expr: histogram_quantile(0.95, rate(geo_query_duration_seconds_bucket[5m])) > 0.15
        for: 2m
        labels:
          severity: warning
          team: data-engineering
        annotations:
          summary: "P95 spatial query latency exceeds 150ms threshold"
          description: "Investigate GiST index fragmentation or stale planner statistics. Verify table ANALYZE schedules."

      - alert: TileRenderTimeoutCritical
        expr: increase(tile_render_timeout_count[5m]) > 0
        for: 0m
        labels:
          severity: critical
          team: platform-ops
        annotations:
          summary: "Vector tile generation timeout detected"
          description: "Check rendering worker memory limits, cache eviction policies, and geometry simplification thresholds."

      - alert: CRSTransformationFailure
        expr: increase(crs_transform_error_count[1m]) > 0
        for: 0m
        labels:
          severity: critical
          team: compliance-ops
          incident_priority: P1
        annotations:
          summary: "Coordinate transformation failure detected"
          description: "Zero-tolerance breach. Validate EPSG registry alignment, projection library versions, and datum shift parameters immediately."
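
The rules above cover the p95, tile timeout, and CRS boundaries. As a rough sketch of how the full set of thresholds from this section, including the p99 and throughput floors, might be kept as data and checked against recorded samples in a staging test, the Python below uses a nearest-rank percentile. The names SpatialSlo, percentile, and check_slo are hypothetical and not part of any library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialSlo:
    """Thresholds quoted in this section; the class itself is illustrative."""
    p95_latency_s: float = 0.150
    p99_latency_s: float = 0.400
    min_features_per_s_per_worker: float = 10_000.0


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def check_slo(latencies_s: list[float], features_per_s_per_worker: float,
              slo: SpatialSlo = SpatialSlo()) -> list[str]:
    """Return the names of breached boundaries; an empty list means healthy."""
    breaches = []
    if percentile(latencies_s, 0.95) > slo.p95_latency_s:
        breaches.append("p95_latency")
    if percentile(latencies_s, 0.99) > slo.p99_latency_s:
        breaches.append("p99_latency")
    if features_per_s_per_worker < slo.min_features_per_s_per_worker:
        breaches.append("throughput")
    return breaches
```

Keeping the numbers in one structure makes it harder for the alert rules and the test fixtures to drift apart, though the Prometheus rules remain the source of truth for paging.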

Verification & Testing

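Never trust a threshold you have not seen fire. Start with promtool: check rules validates the rule file’s syntax and expressions, and test rules runs the alerts against synthetic series in a unit-test file, so the trigger logic is exercised deterministically before you rely on it in production: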

promtool check rules spatial_slo_enforcement.yml

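To confirm the custom counters page correctly, inject a synthetic error and watch the alert boundary. Two details decide whether the test means anything. The series must already exist at zero, because increase() needs an earlier sample to measure against and sees no change in a counter that first appears at 1. And the rule must reference the name Prometheus actually scrapes: depending on the exporter’s suffix settings, a monotonic counter may be exposed with a _total suffix, so check the exporter endpoint first. Then push a single increment to crs_transform_error_count through the OTLP endpoint (or via the Pushgateway in a staging cluster) and assert the alert state after one evaluation interval: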

# Should return a non-empty vector within ~60s of the synthetic increment
ALERTS{alertname="CRSTransformationFailure", alertstate="firing"}
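
To automate that assertion in a staging check, the query can be sent to the Prometheus HTTP API. The sketch below is one possible shape, using only the standard library; the server address and the function names are assumptions, and the parsing is split out so it can be tested without a live server.

```python
import json
import urllib.parse
import urllib.request

# Assumed staging address; replace with your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def firing_alerts(payload: dict, alertname: str) -> list[dict]:
    """Extract the label sets of firing alerts from an /api/v1/query response body."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        sample["metric"]
        for sample in payload["data"]["result"]
        if sample["metric"].get("alertname") == alertname
        and sample["metric"].get("alertstate") == "firing"
    ]


def query_alerts(alertname: str, base_url: str = PROMETHEUS_URL) -> list[dict]:
    """Run the ALERTS selector through the Prometheus instant-query endpoint."""
    expr = f'ALERTS{{alertname="{alertname}", alertstate="firing"}}'
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return firing_alerts(json.load(resp), alertname)
```

A staging test would call query_alerts("CRSTransformationFailure") after the synthetic increment and fail if the returned list is empty.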

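For the OTel side, verify that the geospatial metrics survived the Collector’s filter processor and arrived with labels intact. Query the exporter directly and confirm two things: the spatial attributes recorded on the metric data points (operation type, CRS pair) appear as labels, and the resource attributes (region, service) were promoted to labels by resource_to_telemetry_conversion: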

geo_query_duration_seconds_bucket{geo_operation_type="spatial_join", geo_crs_source="EPSG:4326"}

If that series is empty while traces are visible in your backend, the metric never crossed the filter boundary, which usually means the metric name as emitted over OTLP does not match the geo_.* / tile_.* / crs_.* allowlist. If the series exists but the geo_ labels are missing, the attributes were set on the span only and never recorded on the metric. Finally, replay a recorded high-vertex polygon batch through the staging pipeline and confirm SpatialQueryLatencyDegraded resolves on its own once ANALYZE restores the planner statistics; a threshold that fires but never clears is a baseline problem, not an incident.
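
A quick way to catch allowlist mismatches before deployment is to run the instrument names through the same patterns in a unit test. The sketch below is an approximation: it anchors each pattern at the start of the name with re.match, which is an assumption about how the Collector applies them, and survives_filter is a hypothetical helper.

```python
import re

# Same patterns as the Collector filter processor config above.
ALLOWLIST = [re.compile(p) for p in (r"geo_.*", r"tile_.*", r"crs_.*")]


def survives_filter(metric_name: str) -> bool:
    """True if the name matches any allowlist pattern.

    Assumption: patterns are anchored at the start of the name (re.match).
    Confirm the Collector's own matching behaviour before relying on edge cases.
    """
    return any(p.match(metric_name) for p in ALLOWLIST)
```

Note that a dotted instrument name such as geo.query.duration does not match geo_.*, which is one common way a series goes missing at the filter.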

Gotchas & Failure Modes

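Histogram buckets that hide the spatial tail. The OTel SDK’s default explicit bucket boundaries (0, 5, 10, 25, and so on up to 10,000) were chosen with millisecond values in mind. Record durations in milliseconds and the top finite bucket is 10 s, so a spatial_join over a corrupt GiST index that runs for 8–30 s can land in the overflow bucket; histogram_quantile then saturates at the highest finite boundary and your p99 silently understates reality. Record them in seconds, as the name geo_query_duration_seconds implies, and every query under 5 s shares a single bucket, so the 150 ms and 400 ms thresholds cannot be resolved at all. Define explicit geo_query_duration_seconds bucket boundaries that bracket 0.15 s and 0.4 s and extend to 60 s, or the latency SLO becomes decorative.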
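
The saturation effect is easy to reproduce offline. The sketch below is a simplified re-implementation of the linear interpolation that Prometheus applies to classic histograms, not the Prometheus source itself, and the bucket boundaries and sample mix are illustrative assumptions.

```python
import math


def cumulative_buckets(samples: list[float], bounds: list[float]) -> list[tuple[float, float]]:
    """Count samples into cumulative `le` buckets and append the +Inf bucket."""
    edges = list(bounds) + [math.inf]
    return [(edge, float(sum(1 for s in samples if s <= edge))) for edge in edges]


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs sorted by bound."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the overflow bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    return lower + (upper - lower) * (rank - below) / (count - below)


# 95 healthy queries at 100 ms and 5 degraded joins at 20 s (made-up mix).
samples = [0.1] * 95 + [20.0] * 5
short_bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
long_bounds = short_bounds + [30.0, 60.0]

p99_short = histogram_quantile(0.99, cumulative_buckets(samples, short_bounds))
p99_long = histogram_quantile(0.99, cumulative_buckets(samples, long_bounds))
```

With the shorter boundaries the estimated p99 is pinned at 10.0 s even though the slow joins took 20 s; with boundaries extended to 60 s the estimate becomes 26.0 s, an interpolation within the 10 to 30 s bucket that at least reflects the real tail.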

Adaptive sampling dropping the spans that matter. Head-based probabilistic sampling configured for cost control decides before the outcome is known, so it will happily discard the rare crs.transform.error span along with everything else. Use a tail-based policy that always keeps traces containing a CRS transformation failure or any non-OK status, and sample only the healthy remainder; the tail-sampling guidance in OpenTelemetry Integration for GIS Pipelines covers the contrib processor config. Without it, your most important failures are the first to be sampled away.
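
The decision logic of such a policy is small. The sketch below models it in plain Python to show the intent; it is not the Collector's tail-sampling processor, and SpanSummary, keep_trace, and the 5% baseline ratio are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanSummary:
    """Hypothetical, minimal view of a finished span."""
    name: str
    status_ok: bool = True
    events: list[str] = field(default_factory=list)


def keep_trace(spans: list[SpanSummary], baseline_ratio: float = 0.05,
               rng: Optional[random.Random] = None) -> bool:
    """Always keep traces that contain a failure; sample the rest at baseline_ratio."""
    for span in spans:
        if not span.status_ok or "crs.transform.error" in span.events:
            return True
    source = rng if rng is not None else random
    return source.random() < baseline_ratio
```

Because the failure check runs before the random draw, lowering baseline_ratio to save cost never affects whether an error trace is retained.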

Label cardinality explosion from raw coordinates. It is tempting to attach a feature ID or a full bounding box as a span attribute. That is fine on traces, but copying those attributes onto metric data points (or placing them on the resource, where resource_to_telemetry_conversion promotes them to labels) detonates Prometheus cardinality. Keep geometry-level identifiers on traces only; let metrics carry bounded dimensions (operation type, CRS pair, region) so the multi-region rollout described in Monitoring Topology for Multi-Region GIS stays queryable.
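
One way to enforce the bounded-dimension rule is to pass span attributes through an allowlist before they reach any metric instrument. The sketch below assumes the attribute keys used earlier on this page plus a region key; the helper names and the distinct-value counts are made up for illustration.

```python
import math

# Bounded dimensions considered safe as metric labels (assumed key names).
METRIC_LABEL_ALLOWLIST = ("geo.operation.type", "geo.crs.source", "geo.crs.target", "region")


def metric_labels(span_attributes: dict) -> dict:
    """Keep only bounded attributes; identifiers and extents stay on the span."""
    return {k: v for k, v in span_attributes.items() if k in METRIC_LABEL_ALLOWLIST}


def worst_case_series(distinct_values: dict) -> int:
    """Upper bound on series per metric: the product of distinct values per label."""
    return math.prod(distinct_values.values())


bounded = {"geo.operation.type": 3, "geo.crs.source": 4, "geo.crs.target": 4, "region": 5}
with_feature_id = {**bounded, "feature.id": 1_000_000}
```

With these illustrative counts the bounded label set tops out at 240 series per metric, while adding a feature ID with a million distinct values raises the ceiling to 240,000,000, and a histogram multiplies that again by its bucket count.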

Frequently Asked Questions

Should I replace Prometheus with OpenTelemetry metrics entirely?

No. OTel is the strongest choice for transport, context propagation, and distributed traces, but Prometheus-native recording and alerting rules remain the cleaner place to encode hard spatial SLOs. Emit metrics over OTLP, let the Collector export to Prometheus, and keep your threshold logic as Prometheus rules. The two are layers of one hybrid pipeline, not competitors.

Where do I draw the line between an OTel span attribute and a custom metric?

Anything you need to aggregate and threshold (latency quantiles, error counts, throughput floors) belongs in a custom metric with bounded labels. Anything you need to correlate across services for a single request (vertex count, source/target CRS, index strategy) belongs on the span. High-cardinality identifiers stay on spans only — never promote them to metric labels.

How do I stop coordinate transformation failures from being swallowed as generic errors?

Emit a dedicated crs_transform_error_count counter at the moment the reprojection raises, in addition to recording the exception on the span. The counter drives the zero-tolerance P1 alert; the span gives the on-call engineer the trace context to find the offending feature and datum-shift parameters. Relying on span events alone means nothing pages.

Does the contrib Collector add meaningful overhead for spatial workloads?

The filter and batch processors are cheap relative to the spatial operations themselves. The real cost lever is sampling and batch size, not the filter regex. For high-vertex polygon pipelines, the dominant cost is geometry serialization upstream of the Collector, so tune send_batch_size and keep heavy attributes off the hot path rather than worrying about filter throughput.

How does this interact with the fallback router?

The custom-metric thresholds are the same signals that drive degradation decisions in Fallback Chains for Spatial API Failures. When SpatialQueryLatencyDegraded or a tile timeout fires, the router shifts traffic to a secondary mirror or a simplified-geometry cache — so the telemetry split and the resilience policy share one source of truth and cannot drift apart.
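
As a rough illustration of that shared source of truth, a router could map firing alert names directly to a degraded route. The pairing of each alert with a specific target, the route names, and choose_route itself are assumptions made for this sketch; the real policy lives in the fallback router.

```python
# Hypothetical mapping from alert name to degraded route. Dict order sets
# precedence when several alerts fire at once.
DEGRADED_ROUTES = {
    "SpatialQueryLatencyDegraded": "secondary_mirror",
    "TileRenderTimeoutCritical": "simplified_geometry_cache",
}


def choose_route(firing: set[str], default: str = "primary") -> str:
    """Pick a route from the set of currently firing alert names."""
    for alertname, route in DEGRADED_ROUTES.items():
        if alertname in firing:
            return route
    return default
```

Keying the router on alert names rather than on its own latency thresholds means a change to the Prometheus rule changes the routing behaviour too, with nothing to update in a second place.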

Fallback Chains for Spatial API Failures — the parent topic: tiered routing that consumes these thresholds during degradation.
OpenTelemetry Integration for GIS Pipelines — Collector wiring, span enrichment, and sampling policy for spatial spans.
Geospatial Metric Taxonomy for ETL — the canonical metric names and dimensions this page emits.
Monitoring Topology for Multi-Region GIS — keeping label cardinality and replication drift in check across regions.
Geospatial Observability Architecture & Fundamentals — the foundational ingestion and routing model these signals plug into.