The diagnostic problem
A player's deposit is slow. Support gets the ticket. The first question is obvious: where did the time go? Was it the payment provider? The bonus evaluation that fires after a successful deposit? A database query that hit a lock? The SignalR push that notifies the player's browser?
Without instrumentation, the answer is "check the logs." And the logs are in different files, on different servers, written by different services, with no shared identifier connecting them. The developer opens four log files, eyeballs timestamps, and tries to reconstruct what happened by matching times that are close enough. This is log file archaeology — and it's the default diagnostic method for most platforms.
It doesn't scale. When you have one service, you can grep a log file. When you have an API layer, a business logic layer, a database, a message queue, three external integrations, and a real-time push layer, grepping log files is not diagnostics. It's guesswork with extra steps.
PAM was built with the assumption that every production issue would eventually need to be traced across multiple services. That assumption shaped the architecture. Observability isn't a feature we added — it's infrastructure that was there from the beginning.
OpenTelemetry in PAM
Every API request that enters PAM carries a trace ID. That trace ID is not a custom header we invented; it is standard W3C Trace Context, the propagation format OpenTelemetry uses. It propagates automatically through every layer of the stack: from the API controller, through business logic services, into database queries, across RabbitMQ messages, and out to external integration calls.
The instrumentation covers everything that matters:
- HTTP requests — every inbound API call and every outbound call to payment providers, game aggregators, KYC services, and other integrations. Request duration, status codes, and payload sizes are captured automatically.
- Database queries — every EF Core query is traced with execution time. Slow queries surface immediately in the trace without anyone having to enable SQL profiling.
- RabbitMQ messages — when a deposit event is published to the queue, the trace context travels with the message. The consumer — whether it's BeAware processing behavior rules or the event queue handling notifications — continues the same trace.
- Redis operations — cache hits, cache misses, and cache write times are all part of the trace. When a request is slow because of a cache miss that fell through to a database query, the trace shows exactly that sequence.
- SignalR pushes — the final step in many flows is a real-time notification to the player's browser. That push is the last span in the trace, closing the loop from request to response.
The result is end-to-end visibility. A single trace ID reveals the complete lifecycle of any operation — from the moment the request hits the API to the moment the player sees the result in their browser. No log file archaeology required.
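As a rough illustration of how this kind of coverage is wired up, here is a minimal sketch of OpenTelemetry tracing in an ASP.NET Core service, assuming the standard instrumentation packages (AspNetCore, Http, EntityFrameworkCore, StackExchangeRedis) and an OTLP exporter. The service name and the PAM.Banking activity source are illustrative, not taken from the PAM codebase:

```csharp
// Sketch: OpenTelemetry tracing setup for one service. Names are illustrative.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("PAM.Web.Api"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()           // inbound HTTP requests
        .AddHttpClientInstrumentation()           // outbound calls to payment, KYC, and game providers
        .AddEntityFrameworkCoreInstrumentation()  // EF Core queries with execution time
        .AddRedisInstrumentation()                // cache hits, misses, and write times
        .AddSource("PAM.Banking")                 // custom spans emitted by business logic
        .AddOtlpExporter());                      // ships spans to the collector or Aspire dashboard

var app = builder.Build();
app.Run();
```

Each service repeats this wiring (typically via a shared extension method), so every span lands in the same backend regardless of which process emitted it.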
Correlation IDs
A trace ID tells you what happened during a single operation. A correlation ID tells you what happened across a player's entire session, or across all operations related to a specific transaction.
In PAM, correlation IDs flow from the player's browser through every layer of the stack. When a player initiates a deposit, the correlation ID is generated at the API boundary and attached to every log entry, every database operation, every message queue event, and every external integration call that results from that deposit.
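A sketch of what that boundary can look like as ASP.NET Core middleware follows; the X-Correlation-ID header name and the CorrelationId property key are assumptions for illustration, not PAM's actual conventions:

```csharp
// Sketch: correlation ID middleware at the API boundary (Program.cs fragment).
app.Use(async (context, next) =>
{
    // Reuse the caller's correlation ID if one was sent, otherwise mint a new one.
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
                        ?? Guid.NewGuid().ToString("N");

    // Expose it downstream: on the response, in trace baggage, and in every log line.
    context.Response.Headers["X-Correlation-ID"] = correlationId;
    System.Diagnostics.Activity.Current?.AddBaggage("correlation.id", correlationId);
    log4net.LogicalThreadContext.Properties["CorrelationId"] = correlationId;

    await next();
});
```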
This matters most for support. When a support agent asks "what happened to this transaction?", the answer is a correlation ID lookup — not a manual search across multiple systems. The agent enters the transaction ID in the back office. The system returns a complete timeline: the API request, the payment provider call, the wallet update, the bonus evaluation, the notification push. Every step, in order, with timing.
Correlation IDs also solve the cross-service problem that trace IDs alone cannot. When a deposit triggers an asynchronous bonus evaluation via RabbitMQ, the trace ID from the original deposit request is different from the trace ID of the bonus processing. But the correlation ID is the same. It ties the entire business operation together, even when the technical operations span multiple traces.
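On the publish side, that can look roughly like the sketch below, assuming the RabbitMQ.Client 6.x API and OpenTelemetry's propagation helpers; the exchange and routing key names are made up for illustration. The consumer does the mirror-image Extract and starts its own activity with the incoming context as its parent or link, while the correlation ID rides along unchanged:

```csharp
// Sketch: carrying trace context and a correlation ID in RabbitMQ message headers.
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using OpenTelemetry;
using OpenTelemetry.Context.Propagation;
using RabbitMQ.Client;

public static class DepositEventPublisher
{
    public static void PublishDepositCompleted(IModel channel, string correlationId, byte[] body)
    {
        var props = channel.CreateBasicProperties();
        props.Headers = new Dictionary<string, object>();
        props.CorrelationId = correlationId; // the business-level ID, stable across traces

        // Inject the current W3C trace context (traceparent) into the message headers.
        var context = new PropagationContext(Activity.Current?.Context ?? default, Baggage.Current);
        Propagators.DefaultTextMapPropagator.Inject(context, props.Headers,
            (headers, key, value) => headers[key] = Encoding.UTF8.GetBytes(value));

        channel.BasicPublish("pam.events", "deposit.completed", props, body);
    }
}
```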
Every log entry in PAM includes the correlation ID. Log4Net is configured to write it as a structured field, which means log aggregation tools can filter by correlation ID across all services simultaneously. The days of searching five log files with grep and hoping the timestamps align are over.
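A minimal sketch of that log4net configuration, assuming the CorrelationId property set by the middleware above; appender and file names are illustrative:

```xml
<!-- Sketch: a layout that emits LogicalThreadContext.Properties["CorrelationId"]. -->
<appender name="RollingFile" type="log4net.Appender.RollingFileAppender">
  <file value="logs/pam.web.api.log" />
  <layout type="log4net.Layout.PatternLayout">
    <conversionPattern value="%date [%thread] %-5level %logger correlationId=%property{CorrelationId} - %message%newline" />
  </layout>
</appender>
```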
What a trace looks like
Consider a concrete example: a player deposits 100 EUR. Here's what the trace reveals:
- PAM.Web.Api receives the request — the deposit endpoint is hit. The trace starts. Authentication is validated against the cookie and JWT. Time: 3ms.
- Input validation runs — the deposit amount is checked against the player's deposit limits (daily, weekly, monthly). The wallet layer confirms the player hasn't exceeded their responsible gaming thresholds. Time: 8ms.
- PAM.BL.Banking processes the deposit — the business logic layer calls the configured payment provider (PaymentIQ, Hexopay, or another integration). The outbound HTTP call to the provider is a child span in the trace. Time: 1,247ms — and now we know where the latency lives.
- Wallet update executes — the player's balance is updated in a database transaction. EF Core logs the query. The wallet concurrency mechanism ensures no race condition. Time: 12ms.
- RabbitMQ event published — a "deposit completed" event is published to the message queue. The trace context is embedded in the message headers. Time: 2ms.
- BeAware processes the event — the behavior engine evaluates rules: does this deposit trigger a welcome bonus? A VIP tier upgrade? A responsible gaming review? Each rule evaluation is a span. Time: 34ms.
- SignalR pushes balance update — the player's browser receives the updated balance via the /playerHub SignalR connection. The BalanceChange event fires. Time: 5ms.
Total time: 1,311ms. The payment provider took 1,247ms of that. Without the trace, a developer would have spent an hour narrowing down whether the slowness was in the API, the database, or the provider. With the trace, the answer is visible in seconds.
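To make a step like the provider call show up as its own labelled span, the business logic opens an activity around it. A sketch follows, assuming a hypothetical DepositService and the PAM.Banking activity source registered in the tracing configuration earlier; none of these names come from the PAM codebase:

```csharp
// Sketch: wrapping the outbound payment provider call in a custom span.
using System.Diagnostics;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public class DepositService
{
    // Must match the name passed to AddSource(...) in the tracing setup.
    private static readonly ActivitySource Source = new("PAM.Banking");
    private readonly HttpClient _providerClient;

    public DepositService(HttpClient providerClient) => _providerClient = providerClient;

    public async Task<HttpResponseMessage> SendToProviderAsync(string playerId, decimal amount)
    {
        using var activity = Source.StartActivity("payment-provider.deposit");
        activity?.SetTag("pam.player_id", playerId);
        activity?.SetTag("pam.deposit.amount", amount);

        // The HttpClient instrumentation adds its own child span for the network call,
        // so the trace separates the business step from raw provider latency.
        return await _providerClient.PostAsJsonAsync("/deposits", new { playerId, amount });
    }
}
```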
.NET Aspire
Production observability is essential. But the fastest way to reduce production incidents is to catch problems before they get to production. That's where .NET Aspire comes in.
PAM uses Aspire as the local development orchestration host. When a developer starts the solution locally, Aspire spins up the full service graph — the API, the back office, the behavior engine, the cron service, the event queue — along with their dependencies: SQL Server, Redis, RabbitMQ. Everything runs locally, everything is connected, and everything is observable.
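A sketch of what that orchestration looks like in an Aspire AppHost; the Projects.* references and resource names here are assumptions for illustration, since the generated identifiers depend on the actual solution layout:

```csharp
// Sketch: an Aspire AppHost modelling the local service graph. Names are illustrative.
var builder = DistributedApplication.CreateBuilder(args);

var sql = builder.AddSqlServer("sql");
var redis = builder.AddRedis("redis");
var rabbit = builder.AddRabbitMQ("rabbitmq");

var api = builder.AddProject<Projects.PAM_Web_Api>("pam-api")
    .WithReference(sql)
    .WithReference(redis)
    .WithReference(rabbit);

builder.AddProject<Projects.PAM_BackOffice>("pam-backoffice")
    .WithReference(api);

builder.AddProject<Projects.PAM_BeAware>("pam-beaware")
    .WithReference(sql)
    .WithReference(rabbit);

builder.Build().Run();
```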
The Aspire dashboard provides the same observability locally that OpenTelemetry provides in production:
- Live service health — every service shows its status, resource consumption, and error rate in real time. If the behavior engine is throwing exceptions, you see it immediately.
- Dependency graphs — the visual map of service dependencies shows which services talk to which. When a developer adds a new integration, the dependency graph updates automatically.
- Distributed traces — the same trace format as production, viewable in the browser. A developer can trigger a deposit, then inspect the complete trace across all local services without leaving the dashboard.
- Structured log streams — logs from all services aggregated in one view, filterable by service, severity, and correlation ID. No more switching between terminal windows.
Because the dashboard mirrors production observability, bugs that would previously surface as production incidents, such as race conditions between services, unexpected message ordering, or slow queries under realistic load, are identified and fixed during development. The gap between "works on my machine" and "works in production" shrinks because the local service graph matches the production topology and the observability tooling is the same.
Why this matters operationally
Observability changes incident response fundamentally. Without it, the incident response process is: get alerted, check dashboards, check logs, form a hypothesis, test the hypothesis, repeat. The time from alert to diagnosis is measured in minutes or hours, and it depends heavily on whether the right person — the one who knows which log file to check — is available.
With full instrumentation, the incident response process is: get alerted, open the trace, see the problem. The diagnostic path is the trace itself. A slow payment provider call is visible as a long span. A database deadlock is visible as a failed span with the exception attached. A message queue backup is visible as a growing gap between publish time and consume time.
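The failure case works the same way as the latency case: the span carries the error. A fragment sketch, reusing the hypothetical PAM.Banking ActivitySource and a made-up wallet call from the earlier examples:

```csharp
// Sketch: marking a span as failed and attaching the exception so a deadlock
// or provider error is visible directly in the trace. Names are illustrative.
using var activity = Source.StartActivity("wallet.update");
try
{
    await wallet.ApplyDepositAsync(playerId, amount); // hypothetical wallet update
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    activity?.AddException(ex); // .NET 9+; older targets can use OpenTelemetry's RecordException extension
    throw;
}
```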
This changes who can diagnose problems. When the diagnostic path is a trace, any engineer can follow it — not just the one who wrote the code. Knowledge isn't locked in someone's head; it's in the instrumentation.
It also changes how we think about system health. When monitoring is quiet, it means something. It means every trace is completing within expected bounds, every dependency is responding normally, and every message is being consumed on time. Silence isn't the absence of information — it's a positive signal that the system is healthy.
PAM ships with OpenTelemetry instrumentation across every service, correlation IDs in every log entry, and .NET Aspire orchestration for local development. Traces propagate from API requests through business logic, database queries, message queues, external integrations, and real-time pushes. The median time from incident alert to root cause identification has dropped from hours of log searching to minutes of trace inspection. Every new service and integration is instrumented from day one — observability is not optional, it's part of the definition of done.
The principle
You can't fix what you can't see. That sounds like a motivational poster, but it's an engineering constraint. Systems without observability aren't systems you understand — they're systems you hope are working. Hope is not a monitoring strategy.
The investment in instrumentation pays for itself the first time a production issue is diagnosed in five minutes instead of five hours. It pays for itself again every time a developer catches a performance regression locally before it ships. And it pays for itself continuously in the confidence it gives operators: when the dashboards are green, the system is genuinely healthy — not just quietly failing where nobody is looking.