Debugging, Monitoring and Tuning Performance

Google Stackdriver

Google Stackdriver is a multi-cloud service.

Stackdriver combines metrics, logs, and metadata whether you're running on GCP, AWS, on-premises infrastructure, or a hybrid cloud. You can quickly understand service behavior and issues from a single comprehensive view of your environment and take action if needed.

Stackdriver includes features such as

  • logging
  • monitoring
  • trace
  • error reporting
  • debug

These diagnostic features are well integrated with each other. This helps you connect and correlate diagnostic data easily.

Stackdriver Profiler uses statistical techniques and low-impact instrumentation that runs across all production application instances to provide a complete picture of an application's performance without slowing it down. It helps you identify and eliminate potential performance issues.

Stackdriver helps increase reliability by giving you the ability to monitor GCP and multi-cloud environments to identify trends and prevent issues. It also reduces monitoring overhead and noise so that you can fix problems faster.

Managing Application Performance and Debugging

Stackdriver APM (Application Performance Management) Tools

The three products that make up Stackdriver APM tools are

  • Stackdriver Trace
  • Stackdriver Profiler
  • Stackdriver Debugger

Stackdriver APM includes advanced tools that help developers reduce latency and cost for every app by understanding in detail how it behaves in production. They include some of the same tools that Google's own SRE (Site Reliability Engineering) teams use, giving you insight into how your code runs. This lets you take action to optimize code and fix problems, whatever cloud you're using. All of the Stackdriver APM tools work with code and apps running on any cloud or even on-premises infrastructure. Google charges for Stackdriver APM tools based on the amount of data collected.

Stackdriver Trace - Distributed tracing for everyone

Stackdriver Trace is a distributed tracing system that collects latency data from your apps and displays it in the GCP console. It shows you how requests propagate through your app: you can inspect detailed latency information for a single request or view aggregate latency for your entire app, and you receive detailed near real-time performance insights. Using the various tools and filters provided, you can quickly find where bottlenecks are occurring and more quickly identify the root cause.

Stackdriver Trace automatically analyzes all of your app's traces to generate in-depth latency reports that surface performance degradations, and it can capture traces from all of your VMs, containers, or App Engine projects.

Stackdriver Trace is based on the tools Google uses to keep its own services running at extreme scale. Trace continuously gathers and analyzes trace data from your project to automatically identify recent changes to your app's performance. The latency distributions available through the analysis reports feature can be compared over time or across versions, and Stackdriver Trace will automatically alert you if it detects a significant shift in your app's latency profile.

Stackdriver Trace's language-specific SDKs can analyze projects running on VMs, even those not managed by GCP. The Trace SDK can be used to submit and retrieve trace data from any source, and a Zipkin collector is also available, which allows Zipkin tracers to submit data to Stackdriver Trace. Traces for projects running on App Engine are captured automatically. Because Stackdriver Trace collects latency data from App Engine, HTTPS load balancers, and apps instrumented with the Stackdriver Trace API, it can help you answer the following questions (a minimal instrumentation sketch follows the list):

  • How long does it take my app to handle a given request?
  • Why is my app taking so long to handle a request?
  • Why do some of my requests take longer than others?
  • What is the overall latency of requests to my app?
  • Has latency for my app increased or decreased over time?
  • What can I do to reduce application latency?
  • What are my app dependencies?
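
Here is a minimal sketch of instrumenting a Python service with OpenCensus and its Stackdriver exporter; the project ID, span names, and the simulated work are all hypothetical.

    # Minimal tracing sketch using OpenCensus with the Stackdriver
    # exporter (pip install opencensus-ext-stackdriver).
    import time

    from opencensus.ext.stackdriver import trace_exporter
    from opencensus.trace import tracer as tracer_module

    exporter = trace_exporter.StackdriverExporter(project_id="my-gcp-project")
    tracer = tracer_module.Tracer(exporter=exporter)

    # Each "with" block becomes a span; nested blocks become child spans,
    # so the waterfall view in the Trace UI mirrors this call structure.
    with tracer.span(name="handle_request"):
        with tracer.span(name="query_database"):
            time.sleep(0.05)  # stand-in for real work
        with tracer.span(name="render_response"):
            time.sleep(0.01)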

View and analyze trace data in Stackdriver’s interface

After the Stackdriver Trace agent has collected trace data, you can view and analyze that data in near real time in its interface. This interface contains three windows

  • overview
  • trace list
  • analysis report

The insights pane displays a list of performance insights for your app. This pane highlights common problems in apps, such as consecutive calls to a function that might be more efficient if batched. The recent traces pane displays the most recent traces; for each one, the latency, URI, and time are displayed. You can use this summary to understand the current activity in your app.

The most frequent URIs and most frequent RPCs panes list the most frequent URIs and RPCs from the previous day, along with their average latency. If you click a link in either of these tables, you open the trace list window, where you can view latency as a function of time and investigate the details of any individual trace.

The chargeable trace spans pane displays the number of spans ingested in the current calendar month and the total for the previous month.

You can use this information to monitor your costs for using Stackdriver Trace. The daily analysis reports pane displays latency data for the previous day and compares it to the latency data from seven days prior. You can also create your own analysis reports to select which traces you want included in the report.

Stackdriver Profiler - Continuous profiling to improve performance and reduce costs

Stackdriver Profiler monitors CPU and heap usage to help you identify latency and inefficiency using interactive graphical tools, so you can remove application bottlenecks and reduce resource consumption.

Google uses the same technology every day to identify inefficiently written code in its services. Poorly performing code increases the latency and cost of apps and web services every day, without anyone knowing or doing anything about it. Stackdriver Profiler changes this by continuously analyzing the performance of CPU- and memory-intensive functions executed across an app.

Stackdriver Profiler presents the call hierarchy and resource consumption of the relevant functions in an interactive flame graph that helps developers understand which paths consume the most resources and the different ways in which their code is actually called. While it's possible to measure code performance in development environments, the results generally don't map well to what's happening in production. Many production profiling techniques either slow down code execution or can only inspect a small subset of a codebase.

Stackdriver Profiler uses statistical techniques and extremely low-impact instrumentation that runs across all production app instances to provide a complete picture of an app's performance without slowing it down. Stackdriver Profiler allows developers to analyze apps running anywhere, including GCP, other cloud platforms, or on-premises.

The Stackdriver Profiler UI provides flame charts to correlate statistics with app areas and components

To use Stackdriver Profiler, you install the profiling agent on the VMs where your app runs. The agent typically comes as a library that you attach to your application when you run it. The agent collects profiling data as the app runs. Stackdriver Profiler is a statistical profiler, so the agent is not always active.
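
In Python, for example, attaching the agent can be a sketch as small as the following; the service name and version labels are hypothetical, and the agent ships as the google-cloud-profiler package.

    # Minimal sketch: start the profiling agent at process startup
    # (pip install google-cloud-profiler). On GCP the project is
    # usually inferred from the environment.
    import googlecloudprofiler

    try:
        googlecloudprofiler.start(
            service="shop-backend",    # hypothetical service name
            service_version="1.0.0",   # used to compare versions in the UI
            verbose=3,                 # 0-3; 3 logs the most detail
        )
    except (ValueError, NotImplementedError) as exc:
        # Profiling is best-effort; never let it take the app down.
        print(f"Profiler agent failed to start: {exc}")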

Stackdriver Profiler creates a single profile by collecting profiling data, usually for 10 seconds in every one-minute interval, from a single instance of the configured service in a single Compute Engine zone. After each instance of the agent starts, it notifies the Profiler backend that it's ready to capture data, and then the agent idles until it receives a reply from the backend that specifies the type of profile to capture. If you have 10 instances of a service running in the same deployment, then you create 10 profiling agents, but most of the time these agents are idle: over a 10-minute period you can still expect only about 10 profiles, so each agent receives, on average, one reply per profile type.

The overhead of CPU and heap allocation profiling at the time of data collection is less than five percent. Amortized over the execution time and across multiple replicas of a service, the overhead is commonly less than 0.5%, making it an affordable option for always-on profiling in production systems.

After the agent has collected some profiling data, you can use the Profiler interface to see how the statistics for CPU and memory usage correlate with areas of your app.

Debug your app in development and production

With Stackdriver Error Reporting and Debugger, you can debug your app in development and also troubleshoot errors in production. Error Reporting displays errors that have occurred in your apps. You can view the stack traces to determine where each error occurred. Clicking the source code file in a stack trace takes you to Stackdriver Debugger and the line of code that has the problem.

Debugger automatically creates debug snapshots

When the application next executes the line of code where you set a snapshot location, Stackdriver Debugger creates a snapshot of the app state, including the values of local variables.
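
In Python, for example, enabling the Debugger agent at startup can look like the following sketch; the module and version labels are hypothetical, and the agent ships as the google-python-cloud-debugger package.

    # Minimal sketch: enable the Stackdriver Debugger agent at startup
    # (pip install google-python-cloud-debugger). The module/version
    # labels identify the deployment that snapshots belong to.
    try:
        import googleclouddebugger
        googleclouddebugger.enable(
            module="shop-backend",  # hypothetical module label
            version="1.0.0",
        )
    except ImportError:
        # The agent is optional; the app still runs without it.
        pass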


Stackdriver Logging

Stackdriver Logging is preconfigured in other compute environments

A robust system of logging is crucial for developer productivity and helps you understand the state of your app. You can install the Stackdriver Logging agent on Compute Engine and Amazon EC2 instances to stream logs from third-party apps into Stackdriver Logging. The logging agent is an application based on Fluentd. When you write your logs to existing log files, such as syslog, on your VMs, the logging agent sends the logs to Stackdriver.

Cloud Dataflow, Cloud Functions, and App Engine have built-in support for logging. You can enable logging on GKE by simply selecting a checkbox in the GCP console when you set up a container cluster.
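
From application code you can also write to Stackdriver Logging directly with the client library, as in this minimal sketch; the log name and payloads are hypothetical.

    # Minimal sketch using the google-cloud-logging client library
    # (pip install google-cloud-logging).
    from google.cloud import logging

    client = logging.Client()
    client.setup_logging()  # route the standard logging module to Stackdriver

    logger = client.logger("app-events")  # hypothetical log name
    logger.log_text("Order 12345 processed", severity="INFO")
    logger.log_struct({"event": "order_processed", "order_id": 12345})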

Set up logs-based metrics and alerts

In Stackdriver Logging, you can view your logs and search for particular types of messages. You can create custom logs-based metrics and define alerts based on those metrics. Logs-based metrics are a powerful feature: they alert you to a problem so that you can react before it becomes a major issue.
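
As a sketch, a logs-based metric can be created from code as well as from the console; the metric name and filter below are hypothetical. You can then build a Stackdriver Monitoring alerting policy that fires when the metric's rate crosses a threshold.

    # Minimal sketch: define a logs-based metric that counts HTTP 500
    # log entries (pip install google-cloud-logging).
    from google.cloud import logging

    client = logging.Client()
    metric = client.metric(
        "http-500-count",  # hypothetical metric name
        filter_="severity>=ERROR AND httpRequest.status=500",
        description="Count of HTTP 500 responses",
    )
    if not metric.exists():
        metric.create()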


Monitoring and Tuning Performance

When you monitor your app and gather metrics, you can make improvements to the design of your application, increase reliability, detect and fix security issues, reduce costs based on usage patterns, and answer questions such as: How fast is my database growing? How many users have I added in the last four quarters?

Monitor to compare results over time or between experimental configurations

Is the site slower than it was last week? Are requests faster with the Apache or the NGINX web server?

Monitor to raise alerts when something is broken or about to break

Raise alerts when something is broken and should be fixed urgently, or when something is about to break and can be addressed preemptively.

Monitor to perform ad hoc retrospective analysis

For example, if the latency of your app just increased sharply, ad hoc retrospective analysis helps you find out what else happened around the same time.

Identify APIs and resources that you want to monitor

For example, there might be public and private endpoints that you want to monitor, or multi-cloud resources such as Compute Engine VM instances, Cloud Storage buckets, Amazon EC2 instances, and databases that you want to keep an eye on.

Identify service-level indicators and objectives

(check out Google’s SRE (Site Reliability Engineering) e-book)

An SLI (service level indicator) is a quantitative measure of some aspect of a service. An SLI might be a predefined metric or a custom metric, such as a logs-based metric. An SLO (service level objective) is a target value or range of values for a service level measured by an SLI. SLOs should include tolerance for small variations; absolute limits will result in noisy pagers and require you to constantly tweak thresholds.

For example, latency is an SLI. You can set an SLO stating that 99% of requests over 30 days have latency of less than 100 ms. This is a good example of an SLO that allows a range of values for a service level rather than an absolute limit.
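
As a plain-Python sketch of what that SLO means, you can compute the fraction of requests under the threshold from a window of recorded latencies; the sample data below is made up.

    # Minimal sketch: check a latency SLO ("99% of requests under
    # 100 ms") against observed latencies from a rolling window.
    def slo_met(latencies_ms, threshold_ms=100.0, target=0.99):
        """True if the fraction of requests faster than threshold_ms
        meets or exceeds the target fraction."""
        if not latencies_ms:
            return True  # no traffic, nothing violated
        good = sum(1 for l in latencies_ms if l < threshold_ms)
        return good / len(latencies_ms) >= target

    observed = [12.0, 48.5, 95.0, 101.2, 77.3, 63.8, 88.9, 42.0, 55.1, 99.9]
    print(slo_met(observed))  # False: only 9/10 = 90% are under 100 ms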

Create dashboards that include four golden signals

After you identify the resources to monitor and define service-level indicators and objectives, you can create dashboards to view metrics for your application.

Create dashboards that include the four golden signals: latency, traffic, errors and saturation.

Latency

This is the amount of time it takes to serve a request. Make sure to distinguish between the latency of successful and unsuccessful requests. For example, an HTTP 500 error that occurs due to a loss of connection to a database or another backend service might be served very quickly. However, because an HTTP 500 error indicates a failed request, including 500 errors in your overall latency might result in misleading metrics.

Traffic

Traffic is a measure of how much demand is placed on your system. It's measured as a system-specific metric. For example, web server traffic is measured as the number of HTTP or HTTPS requests per second, while traffic to a NoSQL database is measured as the number of read or write operations per second.

Errors

Errors indicate the number of failed requests. The criteria for failure might be an explicit error, such as an HTTP 500, or a successful HTTP 200 response with incorrect content. It might also be a policy error: for example, your application promises a response time of one second, but some requests take longer.

Saturation

Saturation indicates how full your application is, or which resources are being stretched and reaching their target capacity. Systems can degrade in performance before they reach 100% utilization, so make sure to set utilization targets carefully.
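
As a small illustration, the sketch below derives latency, traffic, and error-rate figures from a window of request records; the record format is hypothetical, and saturation would come from resource metrics such as CPU or memory utilization instead.

    # Minimal sketch: compute three of the four golden signals from a
    # one-minute window of (latency_ms, http_status) request records.
    import statistics

    requests = [  # hypothetical sample window
        (35.0, 200), (41.2, 200), (38.7, 500), (52.3, 200), (44.9, 429),
    ]

    ok = [lat for lat, status in requests if status < 400]
    traffic_qps = len(requests) / 60.0        # traffic: requests per second
    error_rate = 1 - len(ok) / len(requests)  # errors: fraction failed
    latency_ms = statistics.median(ok)        # latency: successful requests only

    print(f"traffic={traffic_qps:.2f} qps  errors={error_rate:.0%}  "
          f"median latency={latency_ms:.1f} ms")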


Identifying and Troubleshooting Performance Issues

It’s important to monitor performance in the development phase and in production. In development, add performance test to your test suite to ensure that the performance of your application doesn’t degrade when you fix bugs. Add new features or change underlying software. Response times and resource requirements can change significantly when you make changes to your app.

With performance tests, you’ll be able to detect and address performance issues early in the development process.

A watchpoint is a potential area of configuration or application code that could indicate a performance issue; performance issues may be the result of multiple watchpoints. Review metrics related to incoming requests, and review the design and implementation of your web pages. You can use PageSpeed Insights to view information about missing caching headers, missing compression, too many HTTP browser requests, slow DNS responses, and missing minification.

If the application is not public-facing, you can use Chrome DevTools with PageSpeed Insights to manually analyze your web pages. Check for self-inflicted load: load caused by the application itself, such as service-to-service or browser-to-service calls. For example, check for polling cron jobs, batch requests, or multiple AJAX requests from the browser.

You can use client-side tools such as Chrome DevTools and server-side load analysis tools such as Stackdriver Trace to find the source of the problem.

In Development: Review application code and logs

Review your application code and logs to check for performance issues. Check logs for application errors such as HTTP errors and exceptions. Identify the root cause of the log messages and confirm that they're not related to periodic load or performance issues. Prioritize the investigation by the frequency of errors; because this data is historical, some errors might have been intermittent or already resolved.

If a log message is unreproducible, it might be better to defer the investigation.

Check for runtime code generation. Aspect-oriented programming practices can sometimes reduce application performance. Consider compile-time code generation instead.

Don’t serve static resources from your application. Instead, use a content delivery network such as Google Cloud CDN with Google Cloud Storage.

Consider caching frequently accessed values that are retrieved from a database or that require significant compute resources to recalculate. You can also cache generated HTML fragments for later use.
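
For example, here is a minimal in-process caching sketch using functools; the lookup function is hypothetical, and for multi-instance services a shared cache such as Memorystore or Redis is usually the better choice.

    # Minimal sketch: cache an expensive lookup in-process.
    import functools

    @functools.lru_cache(maxsize=1024)
    def get_product(product_id):
        # Stand-in for a database query or an expensive computation.
        return {"id": product_id, "name": f"product-{product_id}"}

    get_product(42)  # miss: runs the lookup
    get_product(42)  # hit: served from the cache
    print(get_product.cache_info())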

Look for areas where data is retrieved from a database or service with multiple requests. Replace these individual requests with a single batch request, or send the individual requests in parallel.
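
As a sketch, the difference looks like this; fetch_one and fetch_many are hypothetical stand-ins for real client calls.

    # Minimal sketch: replace N sequential round trips with one batch
    # call, or issue the individual calls concurrently.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_one(item_id):    # hypothetical: one round trip per call
        return {"id": item_id}

    def fetch_many(item_ids):  # hypothetical: one round trip for all
        return [{"id": i} for i in item_ids]

    ids = [1, 2, 3, 4]

    results = [fetch_one(i) for i in ids]  # before: N sequential round trips

    results = fetch_many(ids)              # better: one batch request

    with ThreadPoolExecutor(max_workers=4) as pool:  # or: in parallel
        results = list(pool.map(fetch_one, ids))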

Don’t retry constantly on errors. Instead, retry with exponential backoff. Implement a circuit breaker to stop retries after a certain number of failures. Note that you should only retry in case of errors such as connection timeouts or too many requests. Don’t retry in case of errors such as 5xx and malformed URL errors and so on.

Check for the following areas:

External user load. Analyze the most frequent requests and the slowest requests. Confirm that the requests are expected, and determine the cause of the slowest requests.

Periodic load. Analyze traffic over an extended period of time to determine which periods have higher levels of usage. Confirm that there's a business reason for the load.

Malicious load. Confirm that all load is expected and legitimate. If you have a web application, make sure that all load is coming from a web or mobile client. You can further segment the load by user to understand whether the majority of requests are coming from a small number of users.

In Production: Review deployment settings

Scaling. Make sure that you have set up load balancing and autoscaling policies as appropriate for the traffic volumes of your app.

Set target utilization levels conservatively to ensure that your app continues to handle traffic while new VM instances come online.

Region. Determine where the bulk of your traffic is coming from, and deploy resources in the appropriate regions to reduce latency.

Cron jobs. Make sure that they're scheduled appropriately.

Review and implement best practices for each of the services that you’re using in your app.