Photo by Kai Dahms on Unsplash
“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.”
― William Thomson (Lord Kelvin)
Microservice applications have an even greater need for monitoring than their monolithic peers due to their distributed nature. Every call between services is a potential point of failure within the application.
While the complexity of deployment can be mitigated using containers and container orchestration, we need to be aware of all operational service failures as well as individual service performance.
Service Failures
Service failures can occur for many reasons, from crashes (of the service itself, its container, the pod the container runs in, or the physical hardware) to networking failures that leave the service unreachable. Whatever the cause, a service failure is an important event. Even if the application's container orchestration platform catches the failure and deploys a replacement, it is essential to monitor, log, and analyze these failures.
Performance
Beyond monitoring the binary question of whether a service is available, it is also critically important to monitor a service's performance. When a service fails, it impacts the performance of each service that depends on it. While a well-engineered application can recover from a failure, its performance may degrade beyond an acceptable threshold. For applications contractually obliged to meet specific Service-Level Agreement requirements, a performance slow-down can have serious ramifications for both revenue and customer satisfaction.
Monitoring service failures and degraded performance allows for early detection and remediation of these problems, preventing systemic degradation and failure. Failure and performance metric monitoring is critical in providing operations with a real-time view of the health of the application. This situational awareness gives application administrators critical service health information before a situation becomes an emergency. By capturing both failure and performance metrics over time (as time-series data), we have an opportunity to analyze, enhance, and fine-tune the application's performance.
With a sufficiently large time-series data set, the performance metrics can help identify operational hotspots, which administrators can use to fine-tune service scaling parameters. The same hotspot data can also guide developers to the services that would benefit most from performance refactoring.
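As a rough illustration of hotspot identification, the sketch below ranks services by their 95th-percentile latency. The function name `rank_hotspots`, the `(service, latency_ms)` sample shape, and the choice of the 95th percentile are all assumptions made for this example, not something a particular monitoring product prescribes.

```python
from collections import defaultdict
from statistics import quantiles


def rank_hotspots(samples, top_n=3):
    """Rank services by 95th-percentile latency, slowest first.

    samples: iterable of (service_name, latency_ms) pairs (hypothetical shape).
    """
    by_service = defaultdict(list)
    for service, latency_ms in samples:
        by_service[service].append(latency_ms)

    p95 = {}
    for service, latencies in by_service.items():
        if len(latencies) < 2:
            # quantiles() needs at least two points; fall back to the lone sample.
            p95[service] = latencies[0]
        else:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            p95[service] = quantiles(latencies, n=20, method="inclusive")[-1]

    ranked = sorted(p95.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

A percentile is used here rather than the mean because a handful of very slow requests can hide behind a healthy-looking average.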
What do we measure?
When instrumenting the application, it is essential to consider what data is useful. Even a small microservice application can generate a considerable volume of operational data that, when monitored, must be captured, transmitted, analyzed, and stored. The size of this data stream has a cost in infrastructure, in analysis time, and in usefulness. The more data we have, the harder it may be to separate the signal from the noise.
For microservice applications, we need to monitor at both the infrastructure level and the application level. At the infrastructure level, we want to instrument our containers and the orchestration platform to provide insight into resource and networking utilization. At the application level, we want to focus on instrumenting the operational behavior of the services as well as Key Performance Indicators (KPIs). Our focus should be on System Events, Platform Metrics, and Application Metrics.
System Events
System events are external actions that are applied to the application. These include service deployments, configuration updates, scaling events, and any other operation performed on the application. These events are captured and stored as time-series data, allowing them to be used to correlate application behavior with these externally applied operations.
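One way to picture this correlation is to look up which system events fall in a window just before an observed anomaly. The sketch below assumes a hypothetical `SystemEvent` record and a 15-minute default window; both are illustrative choices rather than a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SystemEvent:
    # Hypothetical record shape for an externally applied operation.
    timestamp: datetime
    kind: str    # e.g. "deployment", "config_update", "scaling"
    target: str  # service the operation was applied to


def events_preceding(events, anomaly_time, window=timedelta(minutes=15)):
    """Return the events inside the window leading up to an anomaly, newest first."""
    start = anomaly_time - window
    recent = [e for e in events if start <= e.timestamp <= anomaly_time]
    return sorted(recent, key=lambda e: e.timestamp, reverse=True)
```

The returned list is only a set of candidates: an event that precedes an anomaly is a lead worth investigating, not proof of cause.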
Platform Metrics
Platform metrics capture the operation and performance of the application's infrastructure. We monitor database execution time, service execution time, incoming request counts, and response times as well as service success/failure rates for each service. By monitoring this data, we can set threshold levels for each metric and generate alert notifications when significant events occur.
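A threshold check of this kind can be sketched in a few lines. The metric names and limits below are invented for illustration; real values would come from the application's own baselines and SLA targets.

```python
# Hypothetical limits; tune these to the application's own baselines.
THRESHOLDS = {
    "response_time_ms": 500.0,
    "failure_rate": 0.05,
}


def check_thresholds(service, metrics, thresholds=THRESHOLDS):
    """Return one alert message for each metric that exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not reported are skipped rather than treated as zero.
        if value is not None and value > limit:
            alerts.append(f"{service}: {name}={value} exceeds {limit}")
    return alerts
```

In practice the returned messages would be handed to whatever notification channel the operations team uses; that part is left out here.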
Application Metrics
Application metrics include the operational data that is both specific to the application and meaningful to operations. These metrics are crucial to both the application stakeholders and the DevOps team in understanding the application's performance behavior. They often include common Key Performance Indicators (KPIs) such as User Satisfaction/Apdex Score, Daily Active Users (DAU), Daily Sessions per DAU, Application Error Rate, Request Rate, Average Response Time, Cost per Acquisition (CPA), and Average Revenue per User (ARPU).
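To make one of these KPIs concrete, here is a small sketch of the Apdex score. Apdex counts responses at or below a target time T as satisfied, responses between T and 4T as tolerating (worth half), and anything slower as frustrated. The function name and the sample values are assumptions for this example.

```python
def apdex(response_times, threshold):
    """Compute an Apdex score between 0.0 and 1.0.

    response_times: response durations, in the same unit as threshold.
    threshold: the target time T below which a user is considered satisfied.
    """
    if not response_times:
        raise ValueError("at least one response time is required")

    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated responses (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times)


# Two satisfied, one tolerating, one frustrated with T = 0.5 seconds.
score = apdex([0.2, 0.4, 1.0, 3.0], threshold=0.5)  # 0.625
```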
Capturing The Data
There are two primary mechanisms through which we will be capturing data: system logs and health checks.
System logs include log data generated by any technology used within the application, including service logs, container logs, and container orchestration logs. We follow the Twelve-Factor methodology and transmit each log entry as an application event. Logging data is a rich source of passive operational data.
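The Twelve-Factor methodology treats logs as a stream of events written to standard output, leaving routing and storage to the execution environment. A minimal sketch of that idea with Python's standard `logging` module follows; the `JsonEventFormatter` and `build_logger` names and the JSON field names are assumptions for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonEventFormatter(logging.Formatter):
    """Render each log record as a single JSON event (field names are illustrative)."""

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)


def build_logger(service_name, stream=None):
    """Create a logger that writes one JSON event per line, to stdout by default."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonEventFormatter())
    logger = logging.getLogger(service_name)
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False     # avoid duplicate output via the root logger
    return logger
```

Calling `build_logger("orders").info("order %s created", 42)` would emit a single JSON line on standard output, which a log collector can then pick up without the service knowing where the logs end up.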
Health checks involve an external system periodically invoking the service to retrieve its current health. A health check can be simple or sophisticated, depending on its scope. A simple health check may verify that the service is alive and active, or return a single value indicating some operational state. A sophisticated health check may return a complex data structure containing multiple values that describe the state of the service in greater detail. It may also invoke the services it depends on and return those metrics to provide a comprehensive view of the end-to-end health of the service.
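The difference between the two styles can be sketched as two plain functions that a service might expose behind a health endpoint. The status strings (`UP`, `DOWN`, `DEGRADED`), the report layout, and the idea of passing dependency probes in as callables are all assumptions made for this example.

```python
import time


def simple_health():
    """Liveness only: if this code runs, the service is up."""
    return {"status": "UP"}


def detailed_health(dependency_checks):
    """Probe each dependency and report its state and latency.

    dependency_checks: mapping of dependency name to a zero-argument
    callable that raises an exception on failure (hypothetical contract).
    """
    report = {"status": "UP", "dependencies": {}}
    for name, check in dependency_checks.items():
        started = time.perf_counter()
        try:
            check()
            state = "UP"
        except Exception:
            state = "DOWN"
            # The service itself still answers, but not at full health.
            report["status"] = "DEGRADED"
        elapsed_ms = (time.perf_counter() - started) * 1000
        report["dependencies"][name] = {
            "status": state,
            "latency_ms": round(elapsed_ms, 2),
        }
    return report
```

Because the detailed check calls other services, it costs more to run, so it is usually polled less often than the simple one.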