Geospatial Metric Taxonomy for ETL

Architecture

Geospatial ETL pipelines operate across distributed compute clusters, cloud storage tiers, and heterogeneous coordinate reference systems, demanding an observability baseline that captures both infrastructure state and geometric integrity. The foundation begins with establishing telemetry collection boundaries between ingestion nodes, transformation workers, and spatial indexers. In production, engineers must decouple metric emission from the data plane to prevent backpressure. Lightweight telemetry sidecars run alongside spatial operators (e.g., GDAL/OGR workers, PostGIS transform pods), routing structured logs, custom counters, and distributed traces to a centralized sink via asynchronous batched gRPC streams.
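One way to keep metric emission off the data plane is a bounded in-process buffer that never blocks the spatial operator and is drained in batches by a separate flusher. The sketch below is illustrative only: `NonBlockingEmitter`, its parameters, and the `export` callback are hypothetical names, and a real sidecar would hand batches to a gRPC exporter rather than a Python callable.

```python
import queue
import threading
from typing import Callable, Dict, List

Metric = Dict[str, object]


class NonBlockingEmitter:
    """Buffers metrics in a bounded queue and exports them in batches."""

    def __init__(self, export: Callable[[List[Metric]], None],
                 max_buffer: int = 1000, batch_size: int = 100) -> None:
        self._export = export  # hypothetical sink, e.g. a gRPC batch sender
        self._queue: "queue.Queue[Metric]" = queue.Queue(maxsize=max_buffer)
        self._batch_size = batch_size
        self.dropped = 0  # metrics discarded because the buffer was full

    def emit(self, metric: Metric) -> bool:
        """Never blocks the caller; returns False if the metric was dropped."""
        try:
            self._queue.put_nowait(metric)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def flush(self) -> int:
        """Drain the buffer in batches; returns the number of metrics exported."""
        exported = 0
        while True:
            batch: List[Metric] = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return exported
            self._export(batch)
            exported += len(batch)

    def start(self, interval_s: float = 1.0) -> threading.Event:
        """Flush periodically on a daemon thread; set the returned event to stop."""
        stop = threading.Event()

        def loop() -> None:
            while not stop.wait(interval_s):
                self.flush()
            self.flush()  # final drain on shutdown

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

Dropping on overflow trades telemetry completeness for data-plane throughput; tracking `dropped` keeps that loss visible instead of silent.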

When designing this layer, pipeline architects must map data flow against the trust boundaries described in "Defining Spatial Data Trust Boundaries" to isolate exactly where coordinate transformations, topology validations, and schema mutations occur. Trust boundaries dictate metric authority: raw WKB ingestion emits baseline geometry counts and SRID validation flags, while post-join stages track topology preservation ratios and ring orientation compliance. In distributed deployments, regional edge caches, replicated tile servers, and cross-availability-zone replication introduce latency asymmetries that distort spatial freshness metrics. The architecture must standardize telemetry envelopes so that they preserve SRID metadata, bounding box extents, and feature density signals, allowing downstream alerting to correlate infrastructure health with geometric fidelity.
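A standardized telemetry envelope could look roughly like the following. This is a minimal sketch, not a prescribed schema: the class name `SpatialEnvelope`, its fields, and the attribute keys are assumptions chosen to carry the SRID, bounding box, and density signals named above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SpatialEnvelope:
    """Hypothetical envelope attached to every metric a stage emits."""
    stage: str
    srid: int
    bbox: Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y) in SRID units
    feature_count: int

    def __post_init__(self) -> None:
        min_x, min_y, max_x, max_y = self.bbox
        if min_x > max_x or min_y > max_y:
            raise ValueError(f"inverted bounding box: {self.bbox}")
        if self.srid <= 0:
            raise ValueError(f"undefined SRID: {self.srid}")
        if self.feature_count < 0:
            raise ValueError("feature_count must be non-negative")

    @property
    def feature_density(self) -> float:
        """Features per square unit of the bounding box (0.0 for a degenerate box)."""
        min_x, min_y, max_x, max_y = self.bbox
        area = (max_x - min_x) * (max_y - min_y)
        return self.feature_count / area if area > 0 else 0.0

    def to_attributes(self) -> Dict[str, str]:
        """Flatten to string attributes suitable for metric labels or log fields."""
        return {
            "pipeline.stage": self.stage,
            "spatial.srid": str(self.srid),
            "spatial.bbox": ",".join(f"{v:.6f}" for v in self.bbox),
            "spatial.feature_density": f"{self.feature_density:.6f}",
        }
```

Validating the envelope at construction time means a stage with an undefined SRID or an inverted extent fails loudly at its own trust boundary rather than polluting downstream aggregates.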

Metric Taxonomy

A geospatial metric taxonomy must extend beyond traditional row-count and latency measurements to capture spatial state transitions. The taxonomy is organized into four operational dimensions: structural, volumetric, temporal, and quality. Each metric has a standardized name, a unit, and warning and critical detection thresholds.

flowchart LR
  T["Geospatial metric taxonomy"] --> S["Structural"]
  T --> V["Volumetric"]
  T --> M["Temporal"]
  T --> Q["Quality"]
  S --> S1["crs_mismatch_count"]
  S --> S2["geom_type_drift_ratio"]
  S --> S3["topology_error_rate"]
  V --> V1["feature_density_variance"]
  V --> V2["bbox_drift_degrees"]
  V --> V3["null_geom_ratio"]
  M --> M1["index_build_lag_ms"]
  M --> M2["transform_queue_backlog"]
  M --> M3["source_sync_delta_hours"]
  Q --> Q1["coordinate_precision_loss_meters"]
  Q --> Q2["self_intersection_count"]

| Dimension | Metric Key | Description | Unit | Warning Threshold | Critical Threshold |
|---|---|---|---|---|---|
| Structural | spatial.crs_mismatch_count | Features ingested with unexpected or undefined SRIDs | Count | > 50 per batch | > 200 per batch |
| Structural | spatial.geom_type_drift_ratio | Ratio of unexpected geometry types (e.g., MultiPolygon vs Polygon) | % | > 2% | > 8% |
| Structural | spatial.topology_error_rate | Invalid geometries (self-intersections, unclosed rings) post-validation | % | > 1% | > 5% |
| Volumetric | spatial.feature_density_variance | Standard deviation of feature count per spatial partition/tile | σ | > 3.0 | > 7.5 |
| Volumetric | spatial.bbox_drift_degrees | Bounding box expansion after spatial joins or buffering | Degrees | > 0.001° | > 0.01° |
| Volumetric | spatial.null_geom_ratio | Percentage of records with NULL or empty geometry payloads | % | > 0.5% | > 3% |
| Temporal | spatial.index_build_lag_ms | Time delta between feature commit and spatial index availability | ms | > 1500 | > 5000 |
| Temporal | spatial.transform_queue_backlog | Pending CRS conversion or topology validation tasks | Count | > 500 | > 2000 |
| Temporal | spatial.source_sync_delta_hours | Staleness relative to authoritative upstream feeds | Hours | > 2.0 | > 6.0 |
| Quality | spatial.coordinate_precision_loss_meters | RMS error introduced during projection shifts (e.g., WGS84 → UTM) | Meters | > 0.5 m | > 2.0 m |
| Quality | spatial.self_intersection_count | Detected self-intersecting polygons/lines after snapping | Count | > 100 | > 1000 |

These metrics must align with the rules in "Observability Scoping Rules for Vector Data", which require telemetry collection to respect feature hierarchy boundaries. Point, line, and polygon datasets require distinct baselines: linework pipelines track segment snapping tolerance violations, while polygon pipelines monitor ring orientation and hole containment ratios.
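The warning and critical thresholds from the taxonomy can be encoded as a lookup so that every stage classifies readings the same way. The following is a sketch under the assumption that percentage metrics are reported as percent values (for example `1.5` for 1.5%); the names `THRESHOLDS` and `classify` are illustrative.

```python
from typing import Dict, Tuple

# (warning, critical) pairs copied from the taxonomy table.
# Percentage metrics are assumed to be reported in percent, not as fractions.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "spatial.crs_mismatch_count": (50, 200),
    "spatial.geom_type_drift_ratio": (2.0, 8.0),
    "spatial.topology_error_rate": (1.0, 5.0),
    "spatial.feature_density_variance": (3.0, 7.5),
    "spatial.bbox_drift_degrees": (0.001, 0.01),
    "spatial.null_geom_ratio": (0.5, 3.0),
    "spatial.index_build_lag_ms": (1500, 5000),
    "spatial.transform_queue_backlog": (500, 2000),
    "spatial.source_sync_delta_hours": (2.0, 6.0),
    "spatial.coordinate_precision_loss_meters": (0.5, 2.0),
    "spatial.self_intersection_count": (100, 1000),
}


def classify(metric_key: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' using strict '>' comparisons."""
    warning, critical = THRESHOLDS[metric_key]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, `classify("spatial.topology_error_rate", 1.5)` yields `"warning"`, while a reading of exactly `1.0` stays `"ok"` because the table thresholds are strict. Per-geometry-type baselines would need separate tables keyed by dataset type.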

Pipeline Integration & Telemetry Routing

Integrating this taxonomy into existing ETL frameworks requires explicit OpenTelemetry instrumentation and deterministic metric routing. The following Python snippet demonstrates how to attach spatial attributes to OTel counters using the official SDK:

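from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Pull-based reader: Prometheus scrapes these metrics. Serving the scrape
# endpoint (e.g. prometheus_client.start_http_server) is not shown here.
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("geospatial.etl")

# Emit cumulative counters and derive the error ratio at query time
# (use Counters, not a ratio gauge).
invalid_geom_counter = meter.create_counter(
    "spatial.topology_invalid_total",
    description="Cumulative count of invalid geometries",
    unit="1",
)
features_counter = meter.create_counter(
    "spatial.features_processed_total",
    description="Cumulative count of features checked for validity",
    unit="1",
)

def emit_topology_metrics(batch_id: str, srid: int, error_count: int,
                          total_features: int, geom_type: str = "polygon") -> None:
    # batch.id is high-cardinality: every batch creates a new time series.
    attributes = {
        "batch.id": batch_id,
        "spatial.srid": str(srid),
        "spatial.geom_type": geom_type,
        "pipeline.stage": "post_join_validation",
    }
    invalid_geom_counter.add(error_count, attributes=attributes)
    # The denominator lets queries compute invalid / processed as a ratio.
    features_counter.add(total_features, attributes=attributes)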

For orchestration, pipeline health checks must complete before downstream consumers read indexed tiles. The approach described in "Setting up spatial pipeline health checks in Airflow" makes DAGs pause or reroute when spatial quality gates fail. A typical Airflow sensor configuration validates index freshness and topology thresholds before triggering tile generation:

# airflow_dag_spatial_gate.yaml (pseudo-config for documentation; wire via Python DAG)
spatial_quality_gate:
  task_type: PythonSensor
  task_id: validate_spatial_integrity
  timeout: 1800
  mode: poke
  retries: 2
  retry_delay: 120
  op_kwargs:
    prometheus_url: "http://prometheus:9090/api/v1/query"
    # Match when the 5-minute topology error ratio exceeds 5% for this pipeline
    query: "spatial:topology_error_rate:avg5m{pipeline='etl_national_boundaries'} > 0.05"
    fail_on_match: true
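The pseudo-config above needs a Python callable behind it. A minimal sketch of such a gate follows, assuming the Prometheus instant-query HTTP API (which returns an empty `result` list when a comparison query matches no series). The function name `spatial_gate_passes` and the injectable `fetch` parameter are hypothetical, and `RuntimeError` stands in for whatever exception the DAG should fail with.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable


def _http_get_json(url: str) -> dict:
    """Default fetcher: plain HTTP GET returning parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def spatial_gate_passes(prometheus_url: str, query: str,
                        fail_on_match: bool = True,
                        fetch: Callable[[str], dict] = _http_get_json) -> bool:
    """Return True when the threshold query matches no series.

    With fail_on_match=True a breach raises instead of returning False,
    so the task fails rather than poking until the sensor times out.
    """
    url = f"{prometheus_url}?{urllib.parse.urlencode({'query': query})}"
    payload = fetch(url)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    breached = bool(payload["data"]["result"])
    if breached and fail_on_match:
        raise RuntimeError(f"Spatial quality gate failed for query: {query}")
    return not breached
```

In a DAG this would typically be passed as `python_callable` to Airflow's `PythonSensor`, with `prometheus_url`, `query`, and `fail_on_match` supplied through `op_kwargs`. The `fetch` parameter exists only to make the gate testable without a live Prometheus.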

Detection Thresholds & Alerting

Thresholds must be calibrated to the spatial resolution and use-case of the dataset. High-precision cadastral pipelines require tighter bounds than continental-scale rasterized vector layers. Alerting rules should be tiered:

  1. Page (P1): spatial.crs_mismatch_count exceeds critical threshold, or spatial.source_sync_delta_hours > 6.0. Indicates broken ingestion or upstream feed failure.
  2. Ticket (P2): spatial.topology_error_rate > 5% or spatial.index_build_lag_ms > 5000. Requires engineering review of transformation workers or index rebuild jobs.
  3. Dashboard Warning (P3): spatial.feature_density_variance > 3.0 or spatial.bbox_drift_degrees > 0.001°. Signals potential data skew or inefficient spatial join predicates.
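The three tiers can be expressed as an ordered rule list evaluated from most to least severe. This is only a sketch: `TIER_RULES` and `highest_tier` are illustrative names, the `200` limit for `spatial.crs_mismatch_count` is taken from the critical threshold in the taxonomy table, and the topology error rate is assumed to be reported in percent.

```python
from typing import Dict, List, Optional, Tuple

# (tier, metric key, limit), ordered from most to least severe.
TIER_RULES: List[Tuple[str, str, float]] = [
    ("P1", "spatial.crs_mismatch_count", 200),
    ("P1", "spatial.source_sync_delta_hours", 6.0),
    ("P2", "spatial.topology_error_rate", 5.0),      # percent
    ("P2", "spatial.index_build_lag_ms", 5000),
    ("P3", "spatial.feature_density_variance", 3.0),
    ("P3", "spatial.bbox_drift_degrees", 0.001),
]


def highest_tier(readings: Dict[str, float]) -> Optional[str]:
    """Return the most severe tier breached by the readings, or None."""
    for tier, key, limit in TIER_RULES:
        value = readings.get(key)
        if value is not None and value > limit:
            return tier
    return None
```

Because the rules are ordered, a batch that breaches both a P1 and a P3 condition pages once rather than also raising a dashboard warning.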

Prometheus recording rules should pre-aggregate spatial metrics to reduce query load during incident response:

# prometheus-rules.yaml
groups:
  - name: spatial_etl_recording
    rules:
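      # Ratio of invalid to processed features over 5 minutes. Averaging a raw
      # counter with avg_over_time does not yield an error rate.
      # Assumes a companion counter spatial_features_processed_total and a
      # pipeline label on both series.
      - record: spatial:topology_error_rate:avg5m
        expr: |
          sum by (pipeline) (rate(spatial_topology_invalid_total[5m]))
            /
          sum by (pipeline) (rate(spatial_features_processed_total[5m]))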

  - name: spatial_etl_alerts
    rules:
      - alert: SpatialTopologyDegradation
        expr: spatial:topology_error_rate:avg5m > 0.05
        for: 10m
        labels:
          severity: critical
          team: gis-platform
        annotations:
          summary: "Spatial topology error rate exceeds 5% for {{ $labels.pipeline }}"
          description: "Check CRS alignment and snapping tolerance in the transformation worker logs."
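An error rate derived from cumulative counters is a ratio of increases over the window, not an average of raw counter values. The sketch below shows that arithmetic in simplified form; it treats any drop in a counter as a reset to zero, as Prometheus does, but omits the extrapolation that `rate()` applies at window edges. The function names are illustrative.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Total increase across ordered samples, treating a drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value
        # itself is the increase since the reset.
        total += cur - prev if cur >= prev else cur
    return total


def error_ratio(invalid_samples: Sequence[float],
                total_samples: Sequence[float]) -> float:
    """Invalid / processed over the window; 0.0 when nothing was processed."""
    processed = increase(total_samples)
    return increase(invalid_samples) / processed if processed > 0 else 0.0
```

With invalid-geometry samples `[0, 5, 10]` and processed-feature samples `[0, 100, 200]`, the ratio is `0.05`, which is exactly the 5% boundary used by the alert.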

Troubleshooting Spatial Metric Lag

When metrics drift from actual pipeline state, engineers must isolate whether the lag originates in telemetry collection, metric aggregation, or spatial processing bottlenecks. Follow this diagnostic sequence:

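  1. Verify OTel Exporter Flush Intervals: The OTEL_BSP_* variables configure the batch span processor and affect traces only (defaults: OTEL_BSP_SCHEDULE_DELAY=5000 ms, OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512). Reducing them to 2000 and 50 gives near-real-time trace visibility. Spatial counters are governed separately: push-based periodic readers export every OTEL_METRIC_EXPORT_INTERVAL (default 60000 ms), while with a pull-based Prometheus reader, freshness is bounded by the scrape interval.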
  2. Cross-Region Sync Validation: In federated deployments, cross-region and cross-AZ replication queues can delay metric propagation. Ensure the global Prometheus scrapes the regional aggregators' federation endpoints with honor_timestamps: true (the default) so that original emission times are preserved.
  3. CRS Conversion Bottlenecks: High spatial.transform_queue_backlog paired with stable CPU metrics often indicates thread contention in projection libraries. Profile GDAL/OGR workers with perf record -g and verify that PROJ_NETWORK is disabled in air-gapped environments to prevent HTTP timeout stalls.
  4. Fallback Chain Activation: When primary spatial APIs degrade, pipelines should route to cached geometry stores. Monitor spatial.fallback_activation_ratio and ensure fallback payloads carry identical SRID and precision metadata to prevent silent quality degradation.
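To decide which of the three sources dominates, it can help to timestamp one feature batch at each hand-off and compare the gaps. The sketch below assumes four such timestamps are available; the `LagSample` fields and stage names are hypothetical and would need to be mapped onto whatever the pipeline actually records.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LagSample:
    """Hypothetical timestamps (epoch seconds) for one feature batch."""
    committed_at: float  # feature commit in the database
    indexed_at: float    # spatial index available
    emitted_at: float    # metric emitted by the worker
    scraped_at: float    # metric visible in the aggregator


def attribute_lag(sample: LagSample) -> Dict[str, float]:
    """Split end-to-end lag into the three candidate sources."""
    return {
        "spatial_processing": sample.indexed_at - sample.committed_at,
        "telemetry_collection": sample.emitted_at - sample.indexed_at,
        "metric_aggregation": sample.scraped_at - sample.emitted_at,
    }


def dominant_stage(sample: LagSample) -> str:
    """Name of the stage contributing the largest share of lag."""
    lags = attribute_lag(sample)
    return max(lags, key=lambda stage: lags[stage])
```

A batch committed at `t=0`, indexed at `t=1`, emitted at `t=2`, and scraped at `t=62` points at metric aggregation, which suggests checking scrape and federation settings before profiling the projection workers.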

For persistent lag, validate that vector scoping rules are enforced at the collector level. Misconfigured attribute filters can drop high-cardinality spatial labels, which merges distinct series during aggregation and produces artificial latency spikes. Always correlate spatial.index_build_lag_ms with PostgreSQL's pg_stat_activity view to confirm that index creation isn't blocked by long-running transaction snapshots.