Observability
In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs’ execution, modules’ internal states, and the communication among components. To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.
Telemetry types
Observability relies on three main types of telemetry data: metrics, logs and traces.
Metrics
A metric is a point in time measurement (scalar) that represents some system state. Examples of common metrics include:
- number of HTTP requests per second;
- total number of query failures;
- database size in bytes;
- time in seconds since last garbage collection.
Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience.
Metrics are typically tagged to facilitate grouping and searchability.
Application developers choose what kind of metrics to instrument their software with, before it is released. As a result, when a previously unknown issue is encountered, it is impossible to add new metrics without shipping new code. Furthermore, their cardinality can quickly make the storage size of telemetry data prohibitively expensive. Since metrics are cardinality-limited, they are often used to represent aggregate values (for example: average page load time, or 5-second average of the request rate). Without external context, it is impossible to correlate between events (such as user requests) and distinct metric values.
Logs
Logs, or log lines, are generally free-form, unstructured text blobs[clarification needed] that are intended to be human readable. Modern logging is structured to enable machine parsability. As with metrics, an application developer must instrument the application upfront and ship new code if different logging information is required.
Logs typically include a timestamp and severity level. An event (such as a user request) may be fragmented across multiple log lines and interweave with logs from concurrent events.
Traces
Distributed traces
A cloud native application is typically made up of distributed services which together fulfill a single request. A distributed trace is an interrelated series of discrete events (also called spans) that track the progression of a single user request. A trace shows the causal and temporal relationships between the services that interoperate to fulfill a request.
Instrumenting an application with traces means sending span information to a tracing backend. The tracing backend correlates the received spans to generate presentable traces. To be able to follow a request as it traverses multiple services, spans are labeled with unique identifiers that enable constructing a parent-child relationship between spans. Span information is typically shared in the HTTP headers of outbound requests.
Ref https://swsmile.info/post/distributed-tracing/ for more details.
Reference
- https://en.wikipedia.org/wiki/Observability_(software)
- https://www.elastic.co/what-is/observability
- https://aws.amazon.com/compare/the-difference-between-monitoring-and-observability/