OpenTelemetry Integration for GIS Pipelines

OpenTelemetry integration is the transport and instrumentation layer that carries every spatial signal — from a self-intersecting polygon caught at ingest to a stalled tile-publish queue — out of a geospatial pipeline and into the place where alerts and SLOs live. The hard part is not wiring up a collector; it is making span context and metric attributes survive the operations that are unique to spatial ETL/ELT: coordinate reference system (CRS) reprojection, topology validation, spatial index rebuilds, and tile generation. Each of those stages re-shards work in ways that ordinary HTTP-centric tracing loses track of. This page is for the data engineers, GIS platform administrators, and SREs who run those pipelines. It sits under Geospatial Observability Architecture & Fundamentals and covers how to lay out the OpenTelemetry collector topology, attach the canonical spatial attributes that the rest of the stack expects, set sampling rules that never drop a failed-validation span, and route alerts when geometry integrity — not just request latency — degrades.

The diagram traces telemetry from the spatial operators that produce it (GDAL/OGR transform workers, PostGIS validation queries, and tile servers) through a contrib-build collector. An OTLP receiver accepts spans and metrics, a memory_limiter protects the collector under raster-tiling bursts, a filter processor keeps only the gis.spatial.* and gis.etl.* metric series, a batch processor amortizes export cost, and an OTLP exporter ships the result to the backend where composite scores and alert rules live. Every box is a place where a spatially-tagged span can be dropped, so the rest of this page is about keeping that path lossless for the signals that matter.

Architecture

The telemetry plane should mirror the physical and logical flow of the pipeline rather than sit beside it. Collectors are deployed at the ingress points where raw vector data, LiDAR point clouds, and streaming feature updates first enter the transformation layer, so that span propagation begins before the first reprojection rather than after it. Anchoring resource attributes to the foundations laid out in the parent Geospatial Observability Architecture & Fundamentals lets you map OTel resource attributes directly to dataset lineage, CRS identifiers, and regional deployment zones — the same envelope used everywhere downstream, so a metric named once is recognizable across the whole stack.

The attribute conventions are not invented here; they are inherited. The gis.spatial.* and gis.etl.* metric names this page emits are defined by the Geospatial Metric Taxonomy for ETL, and which features get full instrumentation versus sampled checks is governed by the observability scoping rules for vector data — point, line, and polygon datasets carry distinct baselines, so a linework pipeline traces segment-snapping spans while a polygon pipeline traces ring-validation spans. Where each span is allowed to originate is set by the spatial data trust boundaries that segment the pipeline into integrity zones; raw WKB ingestion owns SRID-validation spans, while post-join stages own topology-preservation spans.

Three architectural constraints separate a spatial collector topology from a generic one:

Span continuity across re-sharding stages. A spatial join or a tile-pyramid build fans one input feature into many work units. Without explicit context propagation, the child spans orphan and the trace fragments. The collector must receive a coherent traceparent from every worker, and the SDK must create child spans inside the reprojection and validation calls rather than around the whole batch.
Backpressure isolation from the data plane. Geometry validation workers must never block on the observability path. The memory_limiter and asynchronous batched export exist so that a slow backend sheds telemetry instead of stalling feature throughput — observability degradation must be a soft failure.
Region- and zone-aware routing. Multi-region pipelines incur cross-AZ latency that distorts freshness signals. Resource attributes (deployment.zone, geospatial.crs) travel on every emission so an SRE can correlate a latency spike with a specific geometry class in a specific zone, not an unlabeled aggregate.
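
The span-continuity constraint can be made concrete with a small sketch of what has to travel to each fanned-out work unit. It uses only the standard library and the W3C traceparent layout (version, trace id, parent span id, flags); in a real worker the OpenTelemetry SDK propagators do this for you. The function name start_child and the partition names are hypothetical.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_child(parent_header: str) -> dict:
    """Start one work unit's span from the header the dispatcher sent it."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent_header!r}")
    trace_id, parent_span_id, flags = match.groups()
    span_id = secrets.token_hex(8)  # 16 hex chars, unique per work unit
    return {
        "trace_id": trace_id,              # inherited, so the trace stays whole
        "parent_span_id": parent_span_id,  # points back at the dispatching span
        "span_id": span_id,
        # Header this unit would forward to anything it calls in turn.
        "traceparent": f"00-{trace_id}-{span_id}-{flags}",
    }


# One join input fanned into three partitions (hypothetical unit names).
dispatch_header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
units = {name: start_child(dispatch_header) for name in ("part-0", "part-1", "part-2")}
```

Every partition shares the dispatcher's trace id and names the dispatching span as its parent. A worker that drops the header would instead start a fresh trace id, which is exactly the fragmentation described above.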

In Kubernetes this stack is deployed as a dedicated collector Deployment (or a node-local DaemonSet where spatial workers aggregate locally); the resource sizing, namespace isolation, and scrape topology are detailed in configuring spatial metric collection in Kubernetes.

Metric Specification

OpenTelemetry instruments for spatial pipelines split into two families: histograms and counters that measure the cost and volume of spatial operations, and counters and histograms that measure their correctness. Standard infrastructure metrics cannot see either: a CRS transform that silently loses sub-meter precision raises no exception and burns no extra CPU. Every instrument below carries the canonical spatial dimensions (spatial.srid, spatial.geom_type, pipeline.stage, deployment.zone) so the backend can slice by geometry class and zone.

Signal	Metric Key	Instrument	Description	Unit	Warning	Critical
Cost	`gis.etl.crs_transform.duration`	Histogram	Wall time of a single reprojection call	ms	p95 > 400	p95 > 1200
Cost	`gis.etl.index_build_lag`	Histogram	Delta between feature commit and spatial index availability	ms	> 1500	> 5000
Cost	`gis.etl.tile_generation.duration`	Histogram	Time to render one tile in the publish pyramid	ms	p95 > 250	p95 > 800
Throughput	`gis.etl.feature_ingest_total`	Counter	Features accepted at the ingress boundary	count	—	—
Correctness	`gis.spatial.topology_invalid_total`	Counter	Invalid geometries detected post-validation	count	rate > 1%	rate > 5%
Correctness	`gis.spatial.crs_mismatch_count`	Counter	Features with an unexpected or undefined SRID	count	> 50 / batch	> 200 / batch
Correctness	`gis.spatial.coordinate_precision_loss`	Histogram	RMS error introduced by a projection shift	meters	> 0.5	> 2.0
Collector	`gis.otel.span_drop_total`	Counter	Spans shed by the collector under memory pressure	count	> 0	sustained > 0

The collector path itself needs a budget, because dropping spans silently is how a green dashboard hides a corrupted pipeline. Define the trace fidelity ratio $F$ as the fraction of emitted spans that survive sampling and export. For a window with $S_{\text{emit}}$ emitted spans, $S_{\text{drop}}$ dropped by memory_limiter, and $S_{\text{sampled}}$ removed by the sampling policy:

$$F = \frac{S_{\text{emit}} - S_{\text{drop}} - S_{\text{sampled}}}{S_{\text{emit}}}$$

Fidelity is only meaningful when it excludes the spans you must never drop. Spans tagged spatial.validation=failed are exempted from the sampling term, so the sampler may legitimately discard 90% of successful tile-render spans (driving $F$ low) while still guaranteeing every topology error is retained. An SLO is written against the failed-span keep rate (target $1.0$), not against raw $F$.
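
A short worked example shows why raw $F$ and the failed-span keep rate have to be tracked separately. The counts are invented for illustration, and the function names are hypothetical.

```python
def trace_fidelity(emitted: int, dropped: int, sampled_out: int) -> float:
    """F = (S_emit - S_drop - S_sampled) / S_emit for one window."""
    if emitted <= 0:
        raise ValueError("emitted must be positive")
    return (emitted - dropped - sampled_out) / emitted


def failed_keep_rate(failed_emitted: int, failed_exported: int) -> float:
    """Share of spatial.validation=failed spans that reached the backend."""
    if failed_emitted == 0:
        return 1.0  # nothing to lose in this window
    return failed_exported / failed_emitted


# Illustrative window: 100,000 spans, 1,000 of them failed validation.
# The sampler discards 90% of the 99,000 successes; nothing is shed.
f_raw = trace_fidelity(emitted=100_000, dropped=0, sampled_out=89_100)
keep = failed_keep_rate(failed_emitted=1_000, failed_exported=1_000)
```

Here raw fidelity is roughly 0.11 while the keep rate is 1.0. The pipeline is healthy, and only the second number belongs in the SLO.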

Pipeline Integration & Configuration

The collector configuration below is a production shape for a spatial pipeline: it bounds memory before a raster-tiling burst can OOM the pod, keeps only the spatial metric families, and exports with lineage headers intact. The inline comments mark the choices that are specific to geospatial workloads.

# otel-collector-config.yaml — contrib build (filter + tail_sampling required)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    # Trips before the pod's cgroup limit so spatial workers never OOM with it.
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    # Larger batches amortize export cost during high-feature-rate ingest.
    timeout: 5s
    send_batch_size: 1024
  filter/spatial:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - "gis\\.spatial\\..*"
          - "gis\\.etl\\..*"
  tail_sampling:
    decision_wait: 10s
    policies:
      # Never sample away a failed-validation span — topology errors are rare and load-bearing.
      - name: keep-spatial-failures
        type: string_attribute
        string_attribute:
          key: spatial.validation
          values: ["failed"]
      # Down-sample the high-volume success path (tile renders, valid reprojections).
      - name: sample-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlphttp:
    endpoint: "https://telemetry-collector.internal:4318"
    headers:
      # Lineage headers must survive every hop or downstream provenance queries break.
      X-Geo-Zone: "${REGION}"
      X-Dataset-Lineage: "${DATASET_HASH}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter/spatial, batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp]
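
The two tail_sampling policies combine as an OR: a trace is kept when either policy votes to keep it. The sketch below imitates that decision in plain Python to show that a failed-validation span protects its whole trace regardless of the sampling percentage. The hash-based stand-in for the probabilistic policy is an assumption made for illustration and is not the collector's implementation.

```python
import hashlib


def probabilistic_keep(trace_id: str, percentage: float) -> bool:
    """Deterministic stand-in for a probabilistic policy: hash into [0, 100)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100.0
    return bucket < percentage


def keep_trace(spans: list[dict], trace_id: str, percentage: float = 10.0) -> bool:
    """Keep the trace if any span failed validation, else sample it."""
    has_failure = any(s.get("spatial.validation") == "failed" for s in spans)
    return has_failure or probabilistic_keep(trace_id, percentage)


failed_trace = [{"name": "ring_validation", "spatial.validation": "failed"}]
ok_trace = [{"name": "tile_render"}]

always_kept = keep_trace(failed_trace, "trace-a", percentage=0.0)
never_kept = keep_trace(ok_trace, "trace-a", percentage=0.0)
```

Even with the percentage forced to zero, the failed trace survives and the successful one does not, which is the behavior the keep-spatial-failures policy is there to guarantee.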

On the producer side, the SDK must attach the canonical dimensions at emission — re-deriving them in the backend loses the per-feature context. The snippet below instruments a reprojection call so the histogram and any child span carry the same source and target SRID:

import time

from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# insecure=True assumes a plaintext in-cluster hop to the collector; drop it if the receiver terminates TLS.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("gis.etl")
tracer = trace.get_tracer("gis.etl")

crs_transform_ms = meter.create_histogram(
    "gis.etl.crs_transform.duration",
    description="Wall time of a single reprojection call",
    unit="ms",
)

def reproject(feature, src_srid: int, dst_srid: int):
    attrs = {
        "spatial.srid": str(src_srid),          # canonical dimension — never drop
        "spatial.target_srid": str(dst_srid),
        "spatial.geom_type": feature.geom_type,  # distinct baseline per geometry class
        "pipeline.stage": "transform",
    }
    # Child span inside the operation keeps the trace whole across re-sharding stages.
    with tracer.start_as_current_span("crs_transform", attributes=attrs) as span:
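        start = time.monotonic()
        # gdal_reproject is the pipeline's own GDAL/OGR wrapper (not shown); it is assumed
        # to return an object exposing precision_loss_m in meters.
        result = gdal_reproject(feature, src_srid, dst_srid)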
        crs_transform_ms.record((time.monotonic() - start) * 1000, attrs)
        if result.precision_loss_m > 0.5:
            span.set_attribute("spatial.validation", "failed")  # exempt from tail sampling
    return result
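
The snippet above reads result.precision_loss_m without saying where it comes from. One plausible way to obtain it, offered here as an assumption rather than the pipeline's actual method, is a round-trip check: reproject to the target CRS and back, then take the RMS distance between the original and round-tripped vertices. The sketch assumes both coordinate lists are already in a projected CRS with meter units.

```python
import math

Point = tuple[float, float]


def rms_error_m(original: list[Point], round_tripped: list[Point]) -> float:
    """RMS distance in meters between matching vertices of two coordinate lists."""
    if not original or len(original) != len(round_tripped):
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = [
        (x1 - x0) ** 2 + (y1 - y0) ** 2
        for (x0, y0), (x1, y1) in zip(original, round_tripped)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative vertices: the first drifts by 0.5 m, the second is unchanged.
before = [(0.0, 0.0), (10.0, 10.0)]
after = [(0.3, 0.4), (10.0, 10.0)]
loss = rms_error_m(before, after)
```

For these two vertices the RMS error is about 0.35 m, below the 0.5 m warning bound, so the span would not be tagged spatial.validation=failed.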

Correctness counts should be sourced from the database boundary, not re-derived in application code, so they reflect what PostGIS actually stored. The query feeding gis.spatial.topology_invalid_total reads validity straight from the engine:

-- Feed gis.spatial.topology_invalid_total + crs_mismatch_count from PostGIS itself.
SELECT
  ST_SRID(geom)                                AS srid,
  GeometryType(geom)                           AS geom_type,
  count(*)                                      AS feature_total,
  count(*) FILTER (WHERE NOT ST_IsValid(geom))  AS invalid_total,
  count(*) FILTER (WHERE ST_SRID(geom) <> 4326) AS srid_mismatch_total
FROM staging.parcel_ingest
WHERE batch_id = %(batch_id)s
GROUP BY ST_SRID(geom), GeometryType(geom);
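
Once the query returns, the per-group rows still have to be reduced to the rate that the Metric Specification thresholds are written against (warning above 1%, critical above 5%). A minimal sketch of that reduction, assuming rows arrive as dictionaries keyed by the column aliases above; the function name is hypothetical.

```python
def invalid_rate_severity(rows: list[dict]) -> tuple[float, str]:
    """Collapse per-(srid, geom_type) rows into one batch rate and a severity."""
    total = sum(r["feature_total"] for r in rows)
    invalid = sum(r["invalid_total"] for r in rows)
    rate = invalid / total if total else 0.0
    if rate > 0.05:
        return rate, "critical"
    if rate > 0.01:
        return rate, "warning"
    return rate, "ok"


# Illustrative batch: 1,000 features, 20 of them invalid.
batch_rows = [
    {"srid": 4326, "geom_type": "POLYGON", "feature_total": 900, "invalid_total": 18},
    {"srid": 4326, "geom_type": "MULTIPOLYGON", "feature_total": 100, "invalid_total": 2},
]
rate, severity = invalid_rate_severity(batch_rows)
```

The raw invalid_total values would still be added to the gis.spatial.topology_invalid_total counter per group, so the spatial.srid and spatial.geom_type dimensions are preserved; the rate is only for the threshold check.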

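Set OTEL_RESOURCE_ATTRIBUTES on every pipeline pod, for example OTEL_RESOURCE_ATTRIBUTES=geospatial.crs=EPSG:4326,geospatial.pipeline_stage=transform,deployment.zone=us-east-1 on a transform worker. Keep the keys identical across ingress, transform, and publish pods and change only the values that genuinely differ (geospatial.pipeline_stage per stage, deployment.zone per zone), so resource attributes stay consistent across the pipeline.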
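
Generating the variable from one helper is a simple way to keep the keys identical while the stage and zone values vary per pod. This is a sketch with a hypothetical function name; it percent-encodes commas and equals signs in values, which the resource-attribute format requires.

```python
from urllib.parse import quote


def resource_attributes(crs: str, stage: str, zone: str) -> str:
    """Build an OTEL_RESOURCE_ATTRIBUTES value with a fixed key set."""
    attrs = {
        "geospatial.crs": crs,
        "geospatial.pipeline_stage": stage,
        "deployment.zone": zone,
    }
    # ',' and '=' inside a value would break the key=value,key=value format.
    return ",".join(f"{key}={quote(value, safe=':')}" for key, value in attrs.items())


publish_env = resource_attributes("EPSG:4326", "publish", "us-east-1")
```

The result for a publish pod is geospatial.crs=EPSG:4326,geospatial.pipeline_stage=publish,deployment.zone=us-east-1.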

Threshold Design & Alerting Logic

Thresholds must track the spatial resolution of the dataset — a cadastral pipeline with a sub-meter accuracy contract needs far tighter bounds than a continental basemap. Alerting is tiered by severity and routed by who can act on it: geometry-integrity failures go to GIS platform admins and data engineering, collector-health failures page the SRE on call.

CRITICAL (page): gis.spatial.crs_mismatch_count over its critical threshold, or the gis.spatial.coordinate_precision_loss p95 above the dataset’s accuracy contract — silent corruption that must halt the pipeline and quarantine the batch.
WARNING (ticket): gis.etl.crs_transform.duration p95 above 400 ms across consecutive 5-minute windows, or gis.etl.index_build_lag above 1500 ms. Either one calls for a transformation-worker or index-rebuild review.
CRITICAL (page, SRE on call): gis.otel.span_drop_total rising above zero, or the keep rate for failed-validation spans dropping below 1.0. The observability path itself is lossy and every other signal is now suspect.
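
Tying the precision-loss threshold to each dataset's accuracy contract, rather than to one global number, might look like the sketch below. The critical bound follows the rule above (p95 loss beyond the contract); the warning bound at half the contract is an assumption chosen for illustration, as are the function name and the example contracts.

```python
def precision_loss_severity(
    p95_loss_m: float, accuracy_contract_m: float, warn_fraction: float = 0.5
) -> str:
    """Classify p95 precision loss relative to the dataset's accuracy contract."""
    if accuracy_contract_m <= 0:
        raise ValueError("accuracy contract must be positive")
    if p95_loss_m > accuracy_contract_m:
        return "critical"
    if p95_loss_m > warn_fraction * accuracy_contract_m:
        return "warning"  # assumed early-warning band, not from the spec table
    return "ok"


# The same 0.6 m loss reads very differently per dataset.
cadastral = precision_loss_severity(0.6, accuracy_contract_m=0.5)
basemap = precision_loss_severity(0.6, accuracy_contract_m=10.0)
```

A 0.6 m shift breaches a 0.5 m cadastral contract and should quarantine the batch, while the same shift is noise for a basemap that tolerates 10 m.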

Pre-aggregate the high-cardinality spatial counters with recording rules so incident-time queries stay cheap, then alert on the recorded series:

# prometheus-rules.yaml
groups:
  - name: gis_otel_recording
    rules:
      - record: gis_etl:crs_transform_p95_ms:5m
        expr: >
          histogram_quantile(0.95,
            sum(rate(gis_etl_crs_transform_duration_bucket[5m])) by (le, pipeline, deployment_zone))

  - name: gis_otel_alerts
    rules:
      - alert: SpatialSpanLoss
        # Any dropped span means the failed-validation keep guarantee may be broken.
        expr: increase(gis_otel_span_drop_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Collector dropping spans in {{ $labels.deployment_zone }}"
          description: "memory_limiter is shedding load — topology-error spans may be lost. Scale the collector or raise limit_mib."

      - alert: CrsTransformLatencyRegression
        expr: gis_etl:crs_transform_p95_ms:5m > 400
        for: 10m
        labels:
          severity: warning
          team: gis-platform
        annotations:
          summary: "CRS transform p95 above the 400 ms warning threshold for {{ $labels.pipeline }}"
          description: "Profile GDAL/OGR workers for projection-library lock contention before scaling out."
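
For readers unfamiliar with how the recording rule turns cumulative buckets into a p95, the sketch below re-implements the idea in simplified form: find the bucket that contains the target rank, then interpolate linearly inside it. It is an illustration of the technique for classic histograms, not Prometheus code, and the bucket bounds and counts are invented.

```python
import math


def bucket_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: report the last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative crs_transform buckets in milliseconds.
observed = [(100.0, 50.0), (400.0, 90.0), (1200.0, 98.0), (math.inf, 100.0)]
p95_ms = bucket_quantile(0.95, observed)
```

The estimate can never be finer than the bucket layout, so bucket boundaries should sit at the thresholds being alerted on (here 400 ms and 1200 ms).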

When a primary spatial API degrades and traffic reroutes, the alerting layer must keep watching the fallback chains for spatial API failures: fallback payloads have to carry identical SRID and precision attributes through the collector, or a quality regression hides behind a green dashboard once the cache takes over.
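
A parity check between a primary payload's attributes and the fallback's can catch that regression before it reaches a dashboard. The sketch below is a minimal version; the function name is hypothetical, and spatial.precision_m is a placeholder key for whatever precision attribute the pipeline actually emits.

```python
REQUIRED_KEYS = ("spatial.srid", "spatial.geom_type", "spatial.precision_m")


def attribute_drift(primary: dict, fallback: dict, keys=REQUIRED_KEYS) -> dict:
    """Return {key: (primary_value, fallback_value)} for every mismatch."""
    return {
        key: (primary.get(key), fallback.get(key))
        for key in keys
        if primary.get(key) != fallback.get(key)
    }


primary_attrs = {"spatial.srid": "4326", "spatial.geom_type": "Polygon", "spatial.precision_m": "0.1"}
cached_attrs = {"spatial.srid": "4326", "spatial.geom_type": "Polygon"}  # precision lost in cache
drift = attribute_drift(primary_attrs, cached_attrs)
```

An empty result means the fallback path preserves the canonical attributes; any entry is a reason to distrust quality metrics computed while the fallback is serving.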

Failure Modes & Edge Cases

The integration earns its keep where one signal masks another. Watch for these concrete patterns:

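Tail sampling discarding topology-error spans. A probabilistic policy that treats all traces equally will drop most of the rare failed-validation spans (90% of them at a 10% sampling rate), so the error rate reads artificially low. Diagnose by checking gis_otel_span_drop_total to rule out memory shedding, then confirm the tail_sampling processor carries a string_attribute keep policy on spatial.validation=failed alongside the probabilistic policy. The processor keeps a trace when any policy votes to sample it, so what protects the span is that the keep policy exists, not where it sits in the list.
Trace fragmentation across spatial joins. A spatial join fans one feature into many, and if the worker does not propagate traceparent the child spans orphan: the trace looks complete but covers only the parent. Look for spans whose parent is missing from the trace, either in the backend or by temporarily adding the collector's debug exporter to the traces pipeline (the zpages /debug/tracez page reports the collector's own internal spans, not the spans passing through it), and confirm the SDK creates child spans inside the join call.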
CRS mismatch masking a freshness lag. A feed silently shipping EPSG:3857 where 4326 is expected passes ingest-rate and freshness checks while gis.spatial.coordinate_precision_loss climbs. Assert ST_SRID(geom) against the trust boundary’s declared authority before trusting any latency or freshness metric.
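memory_limiter shedding under raster bursts. A tile-pyramid build spikes memory and the memory_limiter starts refusing data. Because memory_limiter runs first in the pipeline, ahead of tail_sampling, it cannot tell a failed-validation span from a routine tile render, so the keep policy offers no protection once shedding starts. Size limit_mib against the worst-case tiling burst rather than the average, scale the collector out before the limit is reached, and treat any non-zero gis.otel.span_drop_total as a possible loss of failed-validation spans.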
Cardinality blowup from unbounded bbox attributes. Attaching a serialized envelope as a high-precision string explodes the metric series and silently throttles the collector. Bucket the bounding box to a coarse grid (or drop it from metrics and keep it on spans only) so spatial.bbox never becomes a unique label per feature.
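
Bucketing a bounding box to a coarse grid, as suggested for the cardinality problem, can be as small as the sketch below. The one-degree cell size and the function name are assumptions; the right cell size depends on how many distinct series the backend can afford.

```python
import math


def bucket_bbox(minx: float, miny: float, maxx: float, maxy: float,
                cell_deg: float = 1.0) -> str:
    """Snap a bbox outward to a coarse grid so nearby features share one label."""
    x0 = math.floor(minx / cell_deg) * cell_deg
    y0 = math.floor(miny / cell_deg) * cell_deg
    x1 = math.ceil(maxx / cell_deg) * cell_deg
    y1 = math.ceil(maxy / cell_deg) * cell_deg
    return f"{x0:g},{y0:g},{x1:g},{y1:g}"


# Two different parcels in the same area collapse to one metric label.
label_a = bucket_bbox(-73.98765, 40.74812, -73.97001, 40.76123)
label_b = bucket_bbox(-73.95512, 40.70090, -73.94100, 40.71234)
```

Both envelopes map to the same grid cell, so the metric gains one series for the area instead of one per feature. The exact envelope can still be recorded on the span, where cardinality is not a concern.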

Troubleshooting Checklist

When spatial telemetry lags or fragments, isolate whether the fault is in collection, sampling, or spatial processing — in order:

Confirm the collector is not shedding. Check gis.otel.span_drop_total and the memory_limiter logs first; a non-zero drop count invalidates every downstream rate, so raise limit_mib or scale the collector before chasing anything else.
Verify trace context survives service boundaries. Query the backend (or temporarily add the collector's debug exporter to the traces pipeline) for spans whose parent is missing from the trace; confirm W3C traceparent headers propagate across the reprojection, validation, and tiling calls.
Audit the sampling policy. Verify the keep-spatial-failures policy is present in the tail_sampling processor next to the probabilistic policy, so failed-validation traces are always sampled whatever the probabilistic policy decides. A 100% keep rate on spatial.validation=failed is non-negotiable.
Profile CRS-transform latency at the worker. A rising gis.etl.crs_transform.duration p95 with flat CPU points to lock contention in the projection library, not load; profile GDAL/OGR with perf record -g and set PROJ_NETWORK=OFF in air-gapped environments to stop HTTP-timeout stalls.
Assert canonical attributes survive the filter. A misconfigured filter or attributes processor that drops spatial.srid or spatial.geom_type causes aggregation collisions and phantom latency — confirm the dimensions are present on the exported series.
Check lineage headers across fallback paths. When telemetry routes through a regional buffer or a degraded-mode cache, verify X-Geo-Zone and X-Dataset-Lineage remain intact so provenance queries on the backend still resolve.
Correlate index lag with the database. Cross-check gis.etl.index_build_lag against pg_stat_activity to confirm a spatial index build is not blocked by a long-running transaction snapshot before adding collector capacity.
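
The orphaned-span check in the list above reduces to a set lookup once span records are exported. A minimal sketch, assuming each record is a dictionary with span_id and parent_span_id fields (root spans have no parent); the field names and sample spans are illustrative.

```python
def find_orphans(spans: list[dict]) -> list[str]:
    """Return span ids whose parent is set but absent from the same trace."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span["span_id"]
        for span in spans
        if span.get("parent_span_id") and span["parent_span_id"] not in known_ids
    ]


# One trace: a root, a healthy child, and a tile span whose parent never arrived.
trace_spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "spatial_join"},
    {"span_id": "b2", "parent_span_id": "a1", "name": "crs_transform"},
    {"span_id": "c3", "parent_span_id": "ff", "name": "tile_render"},
]
orphans = find_orphans(trace_spans)
```

An orphan can mean either broken propagation or a parent that was shed or sampled away, so run this check only after the first step has confirmed the collector is not dropping spans.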

Geospatial Observability Architecture & Fundamentals — the parent guide to instrumenting spatial pipelines end to end.
Geospatial Metric Taxonomy for ETL — the gis.spatial.* / gis.etl.* names and attributes these spans and metrics ride on.
Defining Spatial Data Trust Boundaries — the integrity zones that decide which stage may emit which span.
Observability Scoping Rules for Vector Data — per-geometry baselines that set instrumentation depth and sampling.
Fallback Chains for Spatial API Failures — keeping attributes intact when traffic reroutes to a degraded path.
Configuring Spatial Metric Collection in Kubernetes — deploying this collector topology with bounded resources and scrape isolation.