Skip to content

Observability System

Overview

Comprehensive observability with distributed tracing, metrics collection, and structured context logging. The system is implemented as a set of middleware components that plug into the apcore middleware pipeline, providing automatic per-module instrumentation. It includes an OpenTelemetry bridge for production tracing, Prometheus-format metrics export, and a standalone structured logger with trace context injection and sensitive field redaction.

Requirements

Tracing

  • Provide a Span dataclass capturing trace ID, span ID, parent span ID, name, timing, status, attributes, and events.
  • Implement TracingMiddleware using stack-based span management in context.data to correctly handle nested module-to-module calls.
  • Support four sampling strategies: full (always export), proportional (random sampling at configurable rate), error_first (always export errors, proportional for successes), and off (never export).
  • Inherit sampling decisions from parent spans in nested calls.
  • Define a SpanExporter protocol with three implementations: StdoutExporter (JSON lines to stdout), InMemoryExporter (bounded in-memory collection for testing), and OTLPExporter (OpenTelemetry bridge).
  • InMemoryExporter must be bounded (deque with configurable maxlen, default 10,000) to prevent unbounded memory growth.

Metrics

  • Implement a MetricsCollector with thread-safe counters and histograms (with configurable bucket boundaries).
  • Provide convenience methods for standard apcore metrics: increment_calls(), increment_errors(), observe_duration().
  • Support Prometheus text exposition format export via export_prometheus().
  • Implement MetricsMiddleware that automatically records call counts (success/error), error codes, and execution duration for each module call.
  • Use stack-based timing in context.data for correct nested call support.

Structured Logging

  • Implement ContextLogger as a standalone structured logger with JSON and text output formats.
  • Support log levels: trace, debug, info, warn, error, fatal.
  • Inject trace context (trace_id, module_id, caller_id) into every log entry.
  • Automatically redact fields with _secret_ prefix when redact_sensitive=True.
  • Provide ContextLogger.from_context() factory for automatic context extraction.
  • Implement ObsLoggingMiddleware using ContextLogger with stack-based timing and configurable input/output logging.

Technical Design

Tracing Architecture

The tracing system uses a stack-based approach stored in context.data["_tracing_spans"]. This correctly handles nested module calls within the same trace:

TracingMiddleware.before("mod.a"):
  Stack: [Span(mod.a)]

  TracingMiddleware.before("mod.b"):
    Stack: [Span(mod.a), Span(mod.b)]
    Span(mod.b).parent_span_id = Span(mod.a).span_id

  TracingMiddleware.after("mod.b"):
    Pop Span(mod.b), export if sampled
    Stack: [Span(mod.a)]

TracingMiddleware.after("mod.a"):
  Pop Span(mod.a), export if sampled
  Stack: []

Sampling Decision Flow

_should_sample(context):
  1. Check context.data["_tracing_sampled"] -- if exists, inherit decision
  2. If "full" strategy -> always True
  3. If "off" strategy -> always False
  4. If "proportional" or "error_first" -> random.random() < sampling_rate
  5. Store decision in context.data for child spans to inherit

For error_first, the sampling decision only affects success spans. Error spans in on_error() are always exported regardless of the stored decision.

Span Exporters

  • StdoutExporter: Converts span to dict via dataclasses.asdict() and writes as a single JSON line to stdout.
  • InMemoryExporter: Thread-safe collection using collections.deque(maxlen=max_spans) with lock protection. Provides get_spans(), clear() methods.
  • OTLPExporter: Bridges apcore spans to OpenTelemetry. Creates an OTel TracerProvider with an OTLP HTTP exporter, converts apcore span attributes (including apcore.trace_id, apcore.span_id, apcore.parent_span_id for correlation), replays events, and maps status codes. Non-primitive attributes are stringified for OTel compatibility.

Metrics Architecture

MetricsCollector maintains three internal dictionaries protected by a single lock: - _counters: Maps (name, labels_tuple) to integer counts. - _histogram_sums/_histogram_counts: Maps (name, labels_tuple) to sum/count values. - _histogram_buckets: Maps (name, labels_tuple, bucket_boundary) to bucket counts, including a +Inf bucket that is always incremented.

MetricsMiddleware uses a stack (context.data["_metrics_starts"]) to track start times for nested calls. In after(), it pops the start time, computes duration, and records success metrics. In on_error(), it additionally extracts the error code (from ModuleError.code or type(error).__name__).

Logging Architecture

ContextLogger supports two output formats: - JSON: Emits a single JSON object per line with fields: timestamp, level, message, trace_id, module_id, caller_id, logger, extra. - Text: Emits formatted lines: {timestamp} [{LEVEL}] [trace={trace_id}] [module={module_id}] {message} {extras}.

Redaction applies to any key in extra that starts with _secret_, replacing the value with ***REDACTED***.

ObsLoggingMiddleware wraps ContextLogger and uses the same stack-based timing pattern as MetricsMiddleware via context.data["_obs_logging_starts"].

As documented in the package __init__.py: 1. TracingMiddleware -- Captures total wall-clock time (outermost). 2. MetricsMiddleware -- Captures execution timing. 3. ObsLoggingMiddleware -- Logs with timing already set up (innermost).

Key Files

File Lines Purpose
src/apcore/observability/__init__.py 37 Package re-exports and recommended middleware ordering
src/apcore/observability/tracing.py 293 Span, SpanExporter, StdoutExporter, InMemoryExporter, OTLPExporter, TracingMiddleware
src/apcore/observability/metrics.py 195 MetricsCollector, MetricsMiddleware, Prometheus export
src/apcore/observability/context_logger.py 170 ContextLogger, ObsLoggingMiddleware

Dependencies

Internal

  • apcore.middleware.Middleware -- Base class for all three observability middlewares.
  • apcore.context.Context -- Provides trace_id, caller_id, call_chain, and data dict for per-call state.
  • apcore.errors.ModuleError -- Used by MetricsMiddleware to extract structured error codes.

External

  • collections (stdlib) -- deque for bounded InMemoryExporter.
  • dataclasses (stdlib) -- asdict() for span serialization in StdoutExporter.
  • threading (stdlib) -- Locks for thread-safe InMemoryExporter and MetricsCollector.
  • time (stdlib) -- Wall-clock timing for span and middleware duration measurements.
  • json (stdlib) -- JSON serialization for StdoutExporter and ContextLogger.
  • random (stdlib) -- Proportional sampling decision in TracingMiddleware.
  • opentelemetry-sdk / opentelemetry-exporter-otlp-proto-http (optional) -- Required only for OTLPExporter. Lazy-imported at instantiation time with a clear ImportError message.

Testing Strategy

Tracing Tests (tests/observability/test_tracing.py)

  • Span dataclass: Required field creation, 16-char hex span_id generation, defaults (end_time=None, status="ok", empty attributes/events/parent_span_id), and mutability of end_time/status.
  • StdoutExporter: Validates JSON line output and presence of all required fields (trace_id, span_id, name, attributes, timing).
  • InMemoryExporter: Tests export/get_spans/clear lifecycle, thread-safe concurrent export (10 threads x 100 spans), bounded deque behavior (oldest spans dropped at capacity), and default maxlen of 10,000.
  • SpanExporter protocol: Verifies that both StdoutExporter and InMemoryExporter satisfy the runtime_checkable SpanExporter protocol.
  • OTLPExporter: Tests ImportError when OpenTelemetry packages are missing, span-to-OTel conversion (timestamps, attributes, correlation IDs), error status mapping, event replay, None end_time handling, non-primitive attribute stringification, None parent_span_id skipping, and shutdown() delegation.
  • Sampling strategies: Full (always), off (never), proportional (statistical test over 1000 iterations), error_first (always exports errors, proportional for successes), and sampling decision inheritance from parent context.
  • TracingMiddleware lifecycle: before() creates span and pushes to stack, after() pops/finalizes/exports, on_error() pops/sets error status/exports, stack-based nested calls with parent-child relationships, span name convention, attribute inclusion, duration computation, and empty-stack guard (logs warning, returns None).

Metrics Tests (tests/observability/test_metrics.py)

  • MetricsCollector.increment(): Counter creation, same-label accumulation, different-label separation.
  • MetricsCollector.observe(): Histogram recording (_sum, _count), correct bucket increments (only buckets >= value), always-increment +Inf bucket.
  • Snapshot and reset: Snapshot returns dict with counters and histograms, reset clears all state.
  • Prometheus export: Text format conventions, HELP/TYPE comment lines, +Inf bucket, _sum/_count suffixed lines.
  • Configuration: Custom bucket boundaries respected.
  • Thread safety: 100 threads x 100 increments producing correct total of 10,000.
  • Convenience methods: increment_calls(), increment_errors(), observe_duration() map to correct metric names and labels.
  • MetricsMiddleware: before() pushes start time, after() records success and duration, on_error() records error calls and error counts (with ModuleError.code and generic type(error).__name__), returns None from on_error(), and nested calls produce isolated independent metrics.

Context Logger Tests (tests/observability/test_context_logger.py)

  • Creation: Default settings, from_context() extraction (trace_id, module_id from call_chain[-1], caller_id), empty call_chain handling.
  • Level filtering: Each level emits correctly, lower levels suppressed, full matrix test across all 6 levels.
  • JSON format: Valid JSON output, all fields present (timestamp, level, message, trace_id, module_id, caller_id, logger, extra), non-serializable extras handled via default=str.
  • Text format: Pattern matching for [LEVEL], [trace=...], [module=...], message, and key=val extras.
  • Redaction: _secret_ prefix keys redacted to ***REDACTED***, no redaction when disabled.
  • Custom output: Writing to custom io.StringIO target.
  • ObsLoggingMiddleware: Is Middleware subclass, before() pushes start and logs, after() pops and logs completion with duration, on_error() pops and logs failure with error type, input/output logging toggles, stack-based nested calls (4 log entries for 2 nested calls), and auto-creates ContextLogger when None.