Observability

The system generates detailed telemetry for all of the features that can be consumed across F5® Distributed Cloud Services. This telemetry provides observability of infrastructure, applications, connectivity, and security services across a distributed environment, and allows NetOps, DevOps, and application teams to troubleshoot and optimize their applications without placing an additional burden on application developers. Four types of telemetry data are collected from the distributed system: metrics, logs, alerts, and events. Some of these logs, metrics, and events are also post-processed to detect anomalies, analyze application APIs, identify security issues, create graph visualizations, and so on.

This telemetry data provides different outcomes to different types of users:

  1. F5 Distributed Cloud Services SRE - our site reliability engineers' goal is to ensure that customer services and our global infrastructure are operational and meeting their service level objectives.

  2. Customer Operations - depending on the RBAC and policy configured, central operations teams can consume a significant amount of data for observability of their infrastructure, network, applications, and the end users of their applications. The F5® Distributed Cloud Console (Console) provides instant visualization of this data, and APIs are available to integrate with other tools.

  3. Customer Application Teams - depending on the RBAC and policy configured, the application team can get observability of the application and network services that relate to their specific applications.

  4. Third Party Integrations - in many cases, certain logs and metrics need to be sent to external systems for compliance, end-to-end visibility, alerting, and so on. Examples include ServiceNow, PagerDuty, Splunk, New Relic, AppDynamics, and Datadog. Our APIs can be used to integrate with most of the external systems that are commonly used today.

If you are interested in further details of how the features described in this guide work, you can find out more about the observability architecture in the Concepts section.

Introduction to Observability

A complex, distributed system collects logs, metrics, alerts, and traces from our global infrastructure as well as from each of the F5 Distributed Cloud Nodes deployed across users' cloud and edge locations.

Figure: High-level View of Observability System

From a user's point of view, there are two ways to get observability into applications and services deployed across multi-cloud, network, and edge sites: use the Console for centralized dashboards, or use the F5 Distributed Cloud Services APIs to integrate with third-party tools. Four types of telemetry and observability data are collected from distributed sources and aggregated by the system:

  1. Metrics - the system collects many time-series metrics for the infrastructure (CPU, memory, disk, interfaces, connectivity, and latency) and for applications and application services (deployment status, application health, request rate, errors, duration, latency, and throughput).

  2. Logs - three types of logs are aggregated across the system: system logs, application logs, and access logs (request and response). Application logs are currently not stored automatically by the system, and the user needs to decide how to handle their storage.

  3. Alerts - alerts can relate to user services (e.g. application restart, site connectivity lost, out of memory) or to infrastructure services (e.g. ver service restarted, connectivity errors). All of these alerts are available in the dashboard and, through the APIs, can be integrated with external systems such as PagerDuty. Some of the alerts relating to infrastructure services are handled and mitigated automatically by the SRE team and do not require any action from the customer.

  4. Events (Audit Logs) - these logs record events relating to access to, and changes of, configuration resources. They are security-related chronological records that can be used to identify who changed the configuration of an object, when the change was made, and what was changed.

Many of these logs and metrics are post-processed to detect anomalies, analyze application APIs, identify security issues, create graph visualizations, and so on. For example, the metrics are also used to generate a health score for sites and for applications, which is determined by statistical analysis of the metrics.
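The source does not describe how the health score is actually computed. Purely as an illustration of what "statistical analysis of the metrics" could look like, the sketch below scores the latest sample of a metric by how many standard deviations it lies from a recent baseline. The function name, the 0-100 scale, and the 25-points-per-deviation penalty are all assumptions made for this example and are not taken from F5 Distributed Cloud.

```python
from statistics import mean, pstdev


def health_score(baseline, latest):
    """Hypothetical 0-100 score: 100 means `latest` matches the baseline mean.

    This is NOT the algorithm used by F5 Distributed Cloud; it only
    illustrates scoring a metric by its deviation from recent history.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is treated as unhealthy.
        return 100.0 if latest == mu else 0.0
    z = abs(latest - mu) / sigma
    # Assumed penalty: lose 25 points per standard deviation, clamped at 0.
    return max(0.0, 100.0 - 25.0 * z)


# Example: request latency samples in milliseconds (made-up values).
latency_baseline = [100, 102, 98, 101, 99]
print(health_score(latency_baseline, 100))
print(health_score(latency_baseline, 200))
```

A real scoring scheme would likely combine several metrics and account for trends over time; this sketch considers a single metric and a single sample.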

Metrics, logs, alerts, and events are automatically stored by F5 Distributed Cloud for each tenant. The supported retention periods are:

  • Security Events: 30 days
  • Audit Logs: 30 days
  • Request Logs: 7 days

If you need to fetch logs older than the supported retention period, open a ticket with F5 Distributed Cloud Support to request up to 1 year of logs. It is also possible to use the Global Log Receiver feature to send logs to a SIEM system such as Splunk or Datadog.
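As a small illustration of the retention periods listed above, the following sketch checks whether a record of a given type would still fall inside its retention window. The dictionary keys and the function name are hypothetical labels chosen for this example, not identifiers from any F5 API; only the day counts come from the list above.

```python
from datetime import datetime, timedelta, timezone

# Day counts are from the retention list above; the keys are illustrative names.
RETENTION = {
    "security_events": timedelta(days=30),
    "audit_logs": timedelta(days=30),
    "request_logs": timedelta(days=7),
}


def within_retention(log_type, record_time, now=None):
    """Return True if a record created at `record_time` is within its retention period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - record_time <= RETENTION[log_type]


# Example with a fixed reference time so the result is reproducible.
reference = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(within_retention("request_logs", reference - timedelta(days=6), reference))
print(within_retention("request_logs", reference - timedelta(days=8), reference))
```

A check like this can help decide whether a query can be served from the Console or APIs directly, or whether older data must come from a support request or an external SIEM.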

The above observability data is available to the user through two mechanisms:

  1. Console - using a web browser and credentials, the user can access various dashboards and graphs relating to their infrastructure and applications.

    1. In the Infrastructure (system) namespace, you can get visualizations like Site Map, Site Connectivity, Site Dashboard, etc.
    2. In the respective Application namespaces, you can get visualizations like Application Sites, Application Deployments, Virtual Host Dashboard, Service Mesh Graph, Security Dashboard, Application Traffic Graph, etc.
  2. APIs - there are APIs to collect infrastructure and application metrics, logs, events, and alerts. In addition, a graph query API provides metrics for interactions across services. These APIs can be used to interface with external systems, such as Splunk or Datadog, that may be in use within the enterprise.
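To make the API-based integration path more concrete, here is a minimal sketch of the translation step that a forwarding script might perform between an alert and an external system such as PagerDuty. The shape of the input alert (the `name`, `description`, `severity`, and `site` fields) is an assumption made for this example and does not reflect the actual F5 Distributed Cloud alert schema. The output follows the general structure of a PagerDuty Events API v2 trigger event. No network calls are made; fetching alerts and posting the event are left out.

```python
import json

# Severity values accepted by the PagerDuty Events API v2.
PD_SEVERITIES = {"critical", "error", "warning", "info"}


def to_pagerduty_event(alert, routing_key):
    """Build a PagerDuty Events API v2 trigger event from a hypothetical alert dict."""
    severity = str(alert.get("severity", "warning")).lower()
    if severity not in PD_SEVERITIES:
        # Fall back to a safe default when the source severity is unknown.
        severity = "warning"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": alert.get("description") or alert["name"],
            "source": alert.get("site", "unknown"),
            "severity": severity,
        },
    }


# Hypothetical alert record; the field names are assumptions for illustration.
example_alert = {
    "name": "SiteConnectivityLost",
    "description": "Site connectivity lost",
    "severity": "CRITICAL",
    "site": "example-edge-site",
}
event = to_pagerduty_event(example_alert, "EXAMPLE_ROUTING_KEY")
print(json.dumps(event, indent=2))
```

Keeping the translation in a pure function makes it easy to test separately from the code that retrieves alerts and delivers events.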


Concepts

The following concepts are used for the observability features. Click on each one to learn more: