Observability System¶
Overview¶
Comprehensive observability with distributed tracing, metrics collection, and structured context logging. The system is implemented as a set of middleware components that plug into the apcore middleware pipeline, providing automatic per-module instrumentation. It includes an OpenTelemetry bridge for production tracing, Prometheus-format metrics export, and a standalone structured logger with trace context injection and sensitive field redaction.
Requirements¶
Tracing¶
- Provide a
Spandataclass capturing trace ID, span ID, parent span ID, name, timing, status, attributes, and events. - Implement
TracingMiddlewareusing stack-based span management incontext.datato correctly handle nested module-to-module calls. - Support four sampling strategies:
full(always export),proportional(random sampling at configurable rate),error_first(always export errors, proportional for successes), andoff(never export). - Inherit sampling decisions from parent spans in nested calls.
- Define a
SpanExporterprotocol with three implementations:StdoutExporter(JSON lines to stdout),InMemoryExporter(bounded in-memory collection for testing), andOTLPExporter(OpenTelemetry bridge). InMemoryExportermust be bounded (deque with configurable maxlen, default 10,000) to prevent unbounded memory growth.
Metrics¶
- Implement a
MetricsCollectorwith thread-safe counters and histograms (with configurable bucket boundaries). - Provide convenience methods for standard apcore metrics:
increment_calls(),increment_errors(),observe_duration(). - Support Prometheus text exposition format export via
export_prometheus(). - Implement
MetricsMiddlewarethat automatically records call counts (success/error), error codes, and execution duration for each module call. - Use stack-based timing in
context.datafor correct nested call support.
Structured Logging¶
- Implement
ContextLoggeras a standalone structured logger with JSON and text output formats. - Support log levels: trace, debug, info, warn, error, fatal.
- Inject trace context (trace_id, module_id, caller_id) into every log entry.
- Automatically redact fields with
_secret_prefix whenredact_sensitive=True. - Provide
ContextLogger.from_context()factory for automatic context extraction. - Implement
ObsLoggingMiddlewareusingContextLoggerwith stack-based timing and configurable input/output logging.
Technical Design¶
Tracing Architecture¶
The tracing system uses a stack-based approach stored in context.data["_tracing_spans"]. This correctly handles nested module calls within the same trace:
TracingMiddleware.before("mod.a"):
Stack: [Span(mod.a)]
TracingMiddleware.before("mod.b"):
Stack: [Span(mod.a), Span(mod.b)]
Span(mod.b).parent_span_id = Span(mod.a).span_id
TracingMiddleware.after("mod.b"):
Pop Span(mod.b), export if sampled
Stack: [Span(mod.a)]
TracingMiddleware.after("mod.a"):
Pop Span(mod.a), export if sampled
Stack: []
Sampling Decision Flow¶
_should_sample(context):
1. Check context.data["_tracing_sampled"] -- if exists, inherit decision
2. If "full" strategy -> always True
3. If "off" strategy -> always False
4. If "proportional" or "error_first" -> random.random() < sampling_rate
5. Store decision in context.data for child spans to inherit
For error_first, the sampling decision only affects success spans. Error spans in on_error() are always exported regardless of the stored decision.
Span Exporters¶
StdoutExporter: Converts span to dict viadataclasses.asdict()and writes as a single JSON line to stdout.InMemoryExporter: Thread-safe collection usingcollections.deque(maxlen=max_spans)with lock protection. Providesget_spans(),clear()methods.OTLPExporter: Bridges apcore spans to OpenTelemetry. Creates an OTelTracerProviderwith an OTLP HTTP exporter, converts apcore span attributes (includingapcore.trace_id,apcore.span_id,apcore.parent_span_idfor correlation), replays events, and maps status codes. Non-primitive attributes are stringified for OTel compatibility.
Metrics Architecture¶
MetricsCollector maintains three internal dictionaries protected by a single lock:
- _counters: Maps (name, labels_tuple) to integer counts.
- _histogram_sums/_histogram_counts: Maps (name, labels_tuple) to sum/count values.
- _histogram_buckets: Maps (name, labels_tuple, bucket_boundary) to bucket counts, including a +Inf bucket that is always incremented.
MetricsMiddleware uses a stack (context.data["_metrics_starts"]) to track start times for nested calls. In after(), it pops the start time, computes duration, and records success metrics. In on_error(), it additionally extracts the error code (from ModuleError.code or type(error).__name__).
Logging Architecture¶
ContextLogger supports two output formats:
- JSON: Emits a single JSON object per line with fields: timestamp, level, message, trace_id, module_id, caller_id, logger, extra.
- Text: Emits formatted lines: {timestamp} [{LEVEL}] [trace={trace_id}] [module={module_id}] {message} {extras}.
Redaction applies to any key in extra that starts with _secret_, replacing the value with ***REDACTED***.
ObsLoggingMiddleware wraps ContextLogger and uses the same stack-based timing pattern as MetricsMiddleware via context.data["_obs_logging_starts"].
Recommended Registration Order¶
As documented in the package __init__.py:
1. TracingMiddleware -- Captures total wall-clock time (outermost).
2. MetricsMiddleware -- Captures execution timing.
3. ObsLoggingMiddleware -- Logs with timing already set up (innermost).
Key Files¶
| File | Lines | Purpose |
|---|---|---|
src/apcore/observability/__init__.py |
37 | Package re-exports and recommended middleware ordering |
src/apcore/observability/tracing.py |
293 | Span, SpanExporter, StdoutExporter, InMemoryExporter, OTLPExporter, TracingMiddleware |
src/apcore/observability/metrics.py |
195 | MetricsCollector, MetricsMiddleware, Prometheus export |
src/apcore/observability/context_logger.py |
170 | ContextLogger, ObsLoggingMiddleware |
Dependencies¶
Internal¶
apcore.middleware.Middleware-- Base class for all three observability middlewares.apcore.context.Context-- Providestrace_id,caller_id,call_chain, anddatadict for per-call state.apcore.errors.ModuleError-- Used byMetricsMiddlewareto extract structured error codes.
External¶
collections(stdlib) --dequefor boundedInMemoryExporter.dataclasses(stdlib) --asdict()for span serialization inStdoutExporter.threading(stdlib) -- Locks for thread-safeInMemoryExporterandMetricsCollector.time(stdlib) -- Wall-clock timing for span and middleware duration measurements.json(stdlib) -- JSON serialization forStdoutExporterandContextLogger.random(stdlib) -- Proportional sampling decision inTracingMiddleware.opentelemetry-sdk/opentelemetry-exporter-otlp-proto-http(optional) -- Required only forOTLPExporter. Lazy-imported at instantiation time with a clearImportErrormessage.
Testing Strategy¶
Tracing Tests (tests/observability/test_tracing.py)¶
- Span dataclass: Required field creation, 16-char hex span_id generation, defaults (end_time=None, status="ok", empty attributes/events/parent_span_id), and mutability of end_time/status.
- StdoutExporter: Validates JSON line output and presence of all required fields (trace_id, span_id, name, attributes, timing).
- InMemoryExporter: Tests export/get_spans/clear lifecycle, thread-safe concurrent export (10 threads x 100 spans), bounded deque behavior (oldest spans dropped at capacity), and default maxlen of 10,000.
- SpanExporter protocol: Verifies that both
StdoutExporterandInMemoryExportersatisfy theruntime_checkableSpanExporterprotocol. - OTLPExporter: Tests
ImportErrorwhen OpenTelemetry packages are missing, span-to-OTel conversion (timestamps, attributes, correlation IDs), error status mapping, event replay, None end_time handling, non-primitive attribute stringification, None parent_span_id skipping, andshutdown()delegation. - Sampling strategies: Full (always), off (never), proportional (statistical test over 1000 iterations), error_first (always exports errors, proportional for successes), and sampling decision inheritance from parent context.
- TracingMiddleware lifecycle: before() creates span and pushes to stack, after() pops/finalizes/exports, on_error() pops/sets error status/exports, stack-based nested calls with parent-child relationships, span name convention, attribute inclusion, duration computation, and empty-stack guard (logs warning, returns None).
Metrics Tests (tests/observability/test_metrics.py)¶
- MetricsCollector.increment(): Counter creation, same-label accumulation, different-label separation.
- MetricsCollector.observe(): Histogram recording (_sum, _count), correct bucket increments (only buckets >= value), always-increment +Inf bucket.
- Snapshot and reset: Snapshot returns dict with counters and histograms, reset clears all state.
- Prometheus export: Text format conventions, HELP/TYPE comment lines, +Inf bucket, _sum/_count suffixed lines.
- Configuration: Custom bucket boundaries respected.
- Thread safety: 100 threads x 100 increments producing correct total of 10,000.
- Convenience methods:
increment_calls(),increment_errors(),observe_duration()map to correct metric names and labels. - MetricsMiddleware: before() pushes start time, after() records success and duration, on_error() records error calls and error counts (with
ModuleError.codeand generictype(error).__name__), returns None from on_error(), and nested calls produce isolated independent metrics.
Context Logger Tests (tests/observability/test_context_logger.py)¶
- Creation: Default settings,
from_context()extraction (trace_id, module_id from call_chain[-1], caller_id), empty call_chain handling. - Level filtering: Each level emits correctly, lower levels suppressed, full matrix test across all 6 levels.
- JSON format: Valid JSON output, all fields present (timestamp, level, message, trace_id, module_id, caller_id, logger, extra), non-serializable extras handled via
default=str. - Text format: Pattern matching for [LEVEL], [trace=...], [module=...], message, and key=val extras.
- Redaction:
_secret_prefix keys redacted to***REDACTED***, no redaction when disabled. - Custom output: Writing to custom
io.StringIOtarget. - ObsLoggingMiddleware: Is Middleware subclass, before() pushes start and logs, after() pops and logs completion with duration, on_error() pops and logs failure with error type, input/output logging toggles, stack-based nested calls (4 log entries for 2 nested calls), and auto-creates ContextLogger when None.