Observability¶
Quoting an external resource on what observability means:
In the world of software products and services, observability means you can answer any questions about what’s happening on the inside of the system just by observing the outside of the system, without having to ship new code to answer new questions. Observability is what we need our tools to deliver now that system complexity is outpacing our ability to predict what’s going to break.
In this section we'll provide an overview of some concepts related to observability, and later on we'll dig deeper into how we plan to implement them. As usual, the contents of this document are a WIP.
Pillars of Observability¶
Metrics¶
A metric is a numeric value measured over time. Some common types of metrics include:
- Gauge: a point-in-time value that can go up or down, like the speedometer in your car
- Counter: a cumulative value that only ever increases (or resets to zero), like your car's odometer
Some metric examples are the average CPU load over the last n seconds, the number of requests returning a 404 status code, etc. Metrics are often aggregated to derive new metrics and to fire alerts when a pre-established threshold is reached or exceeded.
Even though metrics can notify us that something is not behaving as we expect, they only provide an overview/summary level of information. To fully understand what is going on within the boundaries of a given service we need to move to the next pillar and perform a more localized analysis.
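To make the distinction between the metric types concrete, below is a minimal sketch using the App.Metrics library listed under Tooling; the metric names and values are invented for illustration.

```csharp
using App.Metrics;
using App.Metrics.Counter;
using App.Metrics.Gauge;

var metrics = new MetricsBuilder().Build();

// Counter: only ever goes up (e.g. total requests that returned 404).
var notFoundCounter = new CounterOptions { Name = "http_404_responses_total" };
metrics.Measure.Counter.Increment(notFoundCounter);

// Gauge: a point-in-time value that can go up or down (e.g. pending jobs).
var pendingJobsGauge = new GaugeOptions { Name = "pending_jobs" };
metrics.Measure.Gauge.SetValue(pendingJobsGauge, 42);

// A snapshot like this is what a collector would later scrape, aggregate
// and evaluate against alerting thresholds.
var snapshot = metrics.Snapshot.Get();
```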
Logs¶
Logs are what we should look at when things go wrong. A common position is that we should only log actionable messages. Each service can log data to its own storage media, which are varied and can include destinations such as Telegram bots, MS Teams channels, files, databases, 3rd party services, etc. Logs can be simple text strings, or they can be structured. The advantages of structured logs are many, including but not limited to querying logs based on their properties, improved tracing, etc.
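As a rough illustration of the difference, here is a minimal Serilog sketch (see Tooling); the property names are invented:

```csharp
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.Console() // any other sink (files, databases, 3rd party services) would work too
    .CreateLogger();

// Plain text: hard to query later.
Log.Information("Order 1042 failed for customer 77");

// Structured: OrderId and CustomerId become first-class properties that a
// log store can filter and aggregate on.
Log.Information("Order {OrderId} failed for customer {CustomerId}", 1042, 77);

Log.CloseAndFlush();
```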
Generally, log messages belong to a certain category, more commonly referred to as a level:
- Trace: the most verbose level of information. It should only be turned on in production environments when troubleshooting the most urgent of issues.
- Debug: mostly used during development; should be turned off in production except when troubleshooting some incident.
- Info: used to log things that could have some business meaning (more on this below).
- Warn: used when the application runs into some error or unexpected situation, yet it can recover or keep going despite the error. In other words, "an error that can be ignored right now but will come back to haunt you later".
- Error: used when the application runs into an error from which it can't recover, and thus the current request being processed ends in a 400/500 series error.
- Fatal: the application cannot be started, e.g. ASP.NET failed to bind to an open port or similar.
We should never assume logs are retained indefinitely. As such, any information that holds business meaning should be persisted by the application to a more suitable storage medium. Logging/instrumentation infrastructure MUST NEVER BE USED to solve a particular use case, no matter how tempting and within reach it might seem.
Finally, it's important to keep in mind that logs, even when persisted to a centralized storage medium, are by nature completely local to the service that produced them. Yes, we can do our best to log the results and errors we run into when interacting with other services, be they 1st or 3rd party, but as far as our service's logging is concerned those services are a black box. In a microservices oriented architecture (and world), this is not enough to fully retrace any given request for troubleshooting. Enter the last pillar...
Traces¶
A trace is made up of spans, each representing the execution of a piece of code; each span has a name, an ID, and timing information. When you combine the spans produced across a distributed system, you can see the end-to-end flow of an execution path.
Traces are the most underrated and easiest to underestimate pillar, yet the one that can provide the most value to a microservices architecture. One of the simplest implementations of tracing involves generating a unique identifier (commonly referred to as a Correlation ID, Trace ID or Request ID) which is appended to all logs produced by the service that started handling the request (e.g. via a structured log property) and then forwarded to any service (1st or 3rd party) outside its boundary. And we can already imagine your faces when reading this: yes, all services that take part in handling a request have to play along and correctly read/append/forward this unique identifier, or else our traces will contain blind spots.
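As a rough sketch of that flow for an ASP.NET Core service using Serilog, the middleware below reads or generates the identifier and pushes it onto the log context; the `X-Correlation-ID` header name and the `CorrelationId` property are assumptions, not an established convention here.

```csharp
using Serilog;
using Serilog.Context;

var builder = WebApplication.CreateBuilder(args);
builder.Host.UseSerilog((ctx, cfg) => cfg
    .Enrich.FromLogContext() // required so pushed properties reach the sinks
    .WriteTo.Console());

var app = builder.Build();

app.Use(async (context, next) =>
{
    // Reuse the caller's identifier if present, otherwise start a new one.
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
                        ?? Guid.NewGuid().ToString();

    // Every log written while handling this request now carries CorrelationId.
    using (LogContext.PushProperty("CorrelationId", correlationId))
    {
        // Echo it back; outgoing calls to other services would forward the same header.
        context.Response.Headers["X-Correlation-ID"] = correlationId;
        await next();
    }
});

app.MapGet("/", () => "ok");
app.Run();
```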
We can always derive logs from traces, and in turn derive metrics from logs. This doesn't mean we should only keep traces and ignore metrics.
Implementation¶
Logging¶
Check the Runtime Logging article.
Application metrics¶
Notice
Under construction.
Service tracing¶
Notice
Under construction.
Ideally, service tracing (both instrumentation and collection) should be implemented in such a way that we could issue a query in some tool-specific language and reconstruct the path a request that originated from the iInspector Web UI took, including its own API server, AIMS, Core, Hospitality services, etc., without needing to manually aggregate and preprocess multiple sources and formats of logs.
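While this section is under construction, the sketch below shows how `System.Diagnostics.Activity` (the API behind the DiagnosticSource package listed under Tooling) models spans and trace IDs; the source and operation names are invented and this is not necessarily the implementation we will pick.

```csharp
using System.Diagnostics;

var source = new ActivitySource("Suite.Inspections");

// A listener (e.g. an OpenTelemetry exporter) decides which activities get sampled.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Suite.Inspections",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(listener);

using (var activity = source.StartActivity("ProcessInspection"))
{
    // TraceId stays constant across services when the W3C traceparent header is
    // propagated; SpanId identifies this particular unit of work.
    Console.WriteLine($"traceId={activity?.TraceId} spanId={activity?.SpanId}");
}
```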
Error handling¶
Errors can surface in our Suite Modules from different places: domain invariants that weren't satisfied, unexpected errors, availability issues, etc.
Our error handling strategy consists of:
- Translating application and domain layer exceptions into HTTP error codes and returning RFC 7807 problem details responses.
- Making sure Correlation IDs / Trace Identifiers are propagated accordingly.
- Automatically logging all uncaught exceptions.
Read more at Exception Handling Fundamentals.
For implementation details, read the Exception Handling Module article.
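As a hedged sketch of the first and third points above (not the actual Exception Handling Module implementation), the middleware below maps a hypothetical `DomainValidationException` to a 400 and everything else to a 500, returning an RFC 7807 body and logging the uncaught exception:

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;

// Hypothetical domain exception used only for this sketch.
public class DomainValidationException : Exception
{
    public DomainValidationException(string message) : base(message) { }
}

public class ProblemDetailsExceptionMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<ProblemDetailsExceptionMiddleware> _logger;

    public ProblemDetailsExceptionMiddleware(RequestDelegate next,
        ILogger<ProblemDetailsExceptionMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception ex)
        {
            // Uncaught exceptions are always logged (third point of the strategy above).
            _logger.LogError(ex, "Unhandled exception for {TraceIdentifier}", context.TraceIdentifier);

            var problem = new ProblemDetails
            {
                // Domain/validation failures become 4xx, everything else 500.
                Status = ex is DomainValidationException
                    ? StatusCodes.Status400BadRequest
                    : StatusCodes.Status500InternalServerError,
                Title = ex is DomainValidationException ? ex.Message : "An unexpected error occurred.",
                Instance = context.Request.Path
            };

            context.Response.StatusCode = problem.Status.Value;
            await context.Response.WriteAsJsonAsync(problem);
        }
    }
}
```

In a sketch like this one, the middleware would be registered early in the pipeline with `app.UseMiddleware<ProblemDetailsExceptionMiddleware>()`.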
Guidelines¶
Error level¶
Use this level when:
- Users are being affected without a workaround.
- A valuable use case cannot be completed.
Some kind of alerting system should/may be up and running for ERROR level logs on production environments. As such, it's important not to overuse this level, and to have some filtering/batching in place (see here).
Warn level¶
Use this level when:
- Something bad/unexpected happened, but the application can still heal itself and/or keep going
- Connection to an external resource failed but it will be automatically retried.
- If the retry mechanism fails too, that becomes an Error level log.
WARN is the level that should be enabled most of the time on production environments, so that both ERROR and WARN level logs actually get logged.
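For example, a minimal Serilog configuration along those lines could look like the sketch below (not the actual production configuration; the overridden namespace is invented):

```csharp
using Serilog;
using Serilog.Events;

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Warning()                                           // WARN and ERROR get logged by default
    .MinimumLevel.Override("Suite.Inspections", LogEventLevel.Debug)  // temporary override while troubleshooting
    .Enrich.FromLogContext()
    .WriteTo.Console()
    .CreateLogger();
```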
Info level¶
Use this level for example when:
- Some state or entity changed in the application.
- You want to log application parameters when starting it.
- You want to log state changes or some business process result.
- You want to log some results from a scheduled job.
INFO level logs are helpful during development. Most of the INFO level logs would end up being Trace entries.
Notice
TBD about whether we can redirect or instrument dotnet core Info logging as traces.
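A short sketch of an INFO level entry recording a business process result (the event and property names are invented):

```csharp
using Serilog;

Log.Information("Inspection {InspectionId} transitioned from {OldState} to {NewState}",
    1042, "Draft", "Submitted");
```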
Debug level¶
Debug level should be used in such a way that we can turn it on in production environments during troubleshooting and gain some insight into the symptoms and root cause of an issue. However, it's also important not to litter the code with DEBUG level statements, because that ends up hurting the readability of the code we want to take care of.
Notice
Pending some more specific debug level guidelines.
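One way to keep DEBUG statements both cheap and unobtrusive is to guard expensive payload construction behind a level check, as in this Serilog sketch (the payload is invented):

```csharp
using Serilog;
using Serilog.Events;

var request = new { InspectionId = 1042, Items = new[] { "door", "window" } }; // sample payload

if (Log.IsEnabled(LogEventLevel.Debug))
{
    // Only serialize the full payload when DEBUG is actually on.
    var payloadDump = System.Text.Json.JsonSerializer.Serialize(request);
    Log.Debug("Incoming payload: {Payload}", payloadDump);
}
```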
Trace level¶
TRACE level logs, not to be confused with Tracing (even though the former can help instrument the latter), are used to help us trace the path a given request took during processing, for example:
- Start/end of a method, along with its duration and parameters.
- URLs our application got requests for.
- Start/end of request or scheduled job processing.
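As an example of the first bullet, here is a sketch using Serilog's Verbose level (Serilog's name for TRACE); the method and property names are invented:

```csharp
using System.Diagnostics;
using Serilog;

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Verbose()
    .WriteTo.Console()
    .CreateLogger();

Log.Verbose("Starting {Method} for request {RequestId}", "SyncInspections", "abc-123");
var stopwatch = Stopwatch.StartNew();

// ... the actual work would happen here ...

stopwatch.Stop();
Log.Verbose("Finished {Method} in {ElapsedMs} ms", "SyncInspections", stopwatch.ElapsedMilliseconds);
```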
Tooling¶
- Serilog: Structured logging for .NET.
- DiagnosticSource: Observability-based approach to interface with framework events without enforcing strongly typed or serialized payloads.
- AppMetrics: open-source and cross-platform .NET library used to record metrics within an application.
- Grafana: Used by thousands of companies to monitor everything from infrastructure, applications, and power plants to beehives. It provides a nice integration with Prometheus to power dashboards with its data.
- Prometheus: An open-source monitoring system with a dimensional data model, a flexible query language, an efficient time series database and a modern alerting approach. It acts as a centralized repository that employs a pull-based approach to collect application metrics. In the case of our Suite Modules, that data could be exposed through AppMetrics, for example.