Alert Reference
Objective
This document provides reference information for the various types of alerts supported by F5® Distributed Cloud Services. Use this information to understand the details of each alert and the action to take when it fires.
Key Points
The following apply to the alerts:
- There is no separate alert for health score, because the health score is composed of multiple components. For example, the health score of a site is computed from the data-plane connection status to the Regional Edge (RE) sites, the control-plane connection status, and the K8s API server status on the site. Individual alerts are defined for each of these conditions, but no alert is available for the health score itself.
Note: You can obtain the health score of a site in F5® Distributed Cloud Console (Console). You can also obtain it using the NodeQuery API (https://www.volterra.io/docs/api/graph-connectivity#operation/ves.io.schema.graph.connectivity.CustomAPI.NodeQuery) with `"field_selector":{"healthscore":{"types":["HEALTHSCORE_OVERALL"]}}`, as shown in the sketch after this list.
- The amount of time before alert generation is not the same for all alerts. The duration is determined by the severity of the alert. For example, an alert is raised as soon as the tunnel connection to an RE goes down, whereas the health check alert for a service is raised only if the condition persists for 10 minutes. This keeps the alert volume at a manageable level and avoids generating alerts for temporary or transient failure conditions.
- Changing the thresholds of alerts is not supported.
- Defining new alerts through an API is not supported. However, if the existing alerts do not satisfy your requirements, you can create a support request for a new alert in Console.
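The following is a minimal Python sketch of the NodeQuery call described in the note above. The tenant URL, API token, and exact request path are placeholders; see the linked API reference for the real path and any additional request fields.

```python
import requests

# Placeholders -- substitute your tenant URL, an API token, and the NodeQuery
# path from the API reference linked in the note above.
TENANT_URL = "https://<tenant>.console.ves.volterra.io"
NODE_QUERY_PATH = "<NodeQuery path from the API reference>"
API_TOKEN = "<api-token>"

# The field_selector below is the one quoted in this document; it requests
# the overall health score of each node in the connectivity graph.
body = {"field_selector": {"healthscore": {"types": ["HEALTHSCORE_OVERALL"]}}}

resp = requests.post(
    TENANT_URL + NODE_QUERY_PATH,
    headers={"Authorization": "APIToken " + API_TOKEN},
    json=body,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```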
Alerts and Descriptions
The following table presents alerts and associated details such as group, type, severity, and associated actions.
Alert | Name | Type | Group | Severity | Description | Action |
---|---|---|---|---|---|---|
APISecurityTooManyAttacks | API Security Events | metric | Security | major | More than 18 API security events were detected on the Virtual Host in 5 minutes. | Review the API security events to investigate further and take the necessary next steps (for example, create a trusted client rule or a client blocking rule). |
ErrorRateAnomaly | Error Rate Anomaly | custom | Timeseries-Anomaly | minor | Error rate anomaly detected. | Metric looks abnormal and needs attention. |
FluentbitOutputErrors | Log Collection Error | metric | Infrastructure | major | Fluentbit has output errors. | Collect info and open issue. Monitor Grafana fluent dashboard. Let L2 fix during working hours. |
KubeAPIErrorsHigh | K8S API Error | metric | Infrastructure | major | API server is returning errors for some requests. | Check kube-apiserver log to see the detail. Contact support if the issue persists. |
KubeAPILatencyHigh | K8S API Error | metric | IaaS-CaaS | minor | Kubernetes API latency at the 99th percentile is too high (more than 2 seconds). Possible intermittent problem which may occur during parallel application updates. | Check HW utilization of the CE site. If it persists for longer than an hour, contact support. |
KubeCronJobRunning | K8S Job Too Long | metric | IaaS-CaaS | minor | Kubernetes CronJob has been running for more than an hour. The job can be stuck, or it is expected to run longer. | Check logs from the Kubernetes Pod. Contact support in case of a non-customer vK8s workload. |
KubeDaemonSetMisScheduled | K8S Daemonset Error | metric | IaaS-CaaS | minor | DaemonSet member Pod cannot be scheduled on all machines where it is required to be present. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet. |
KubeDaemonSetNotScheduled | K8S Daemonset Error | metric | IaaS-CaaS | minor | Some pods of DaemonSet are not scheduled. After 10 minutes, the alert is not sent again. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet. |
KubeDaemonSetRolloutStuck | K8S Daemonset Error | metric | IaaS-CaaS | minor | Kubernetes DaemonSet desired Pods are not scheduled or ready. After 15 minutes, the alert is reported again. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet. |
KubeDeploymentGenerationMismatch | K8S Deployment Error | metric | IaaS-CaaS | minor | Deployment generation does not match, which indicates that the Deployment has failed but has not been rolled back. After 15 minutes, the alert is not sent again. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubeDeploymentReplicasMismatch | K8S Deployment Error | metric | IaaS-CaaS | minor | Kubernetes Deployment has not matched the expected number of Pod replicas for more than 1 hour. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubeJobFailed | K8S Job Failed | metric | IaaS-CaaS | minor | Alert is fired if any Kubernetes job is in a failed state for the past 2 hours. | Check Kubernetes Job and Pod status, events and logs in vK8s cluster. Contact support in case of etcd job. |
KubeMetricsMissing | Kubernetes Metrics Missing | metric | Infrastructure | critical | Essential Kubernetes metrics are missing. All Kubernetes alerts are affected as well. | Check if the kube-state-metrics workload is running and check its logs. Restart this service on the cluster where this alert appeared. |
KubeNodeEvictedPods | Node has evicted pods | metric | Infrastructure | critical | Node has evicted pods. | Describe the evicted pods and check for the eviction reason and escalate to L2 immediately. |
KubeNodeUnschedulable | K8S Node Scheduling Disabled | metric | IaaS-CaaS | minor | Node has Scheduling Disabled. | Describe the node and determine the reason for the disabled scheduling. |
KubePersistentVolumeFullInFourDays | K8S PVC Error | metric | IaaS-CaaS | major | The alert is triggered when the PersistentVolumeClaim is more than 85% full and will be 100% full in less than 4 days. | Resize PVC or clean disk. |
KubePersistentVolumeSpaceLow | K8S PVC Error | metric | IaaS-CaaS | major | Kubernetes PersistentVolumeClaim is getting out of space. | Resize PVC or clean disk. |
KubePodCPUThrottlingHigh | K8S Pod CPU Throttled | metric | IaaS-CaaS | major | Kubernetes Pod container is being throttled at its CPU limits. | Increase flavor for vk8s Deployment or StatefulSet definition. Contact support in case of non vk8s Pod. |
KubePodContainerTooMuchMemory | K8S Pod Too Much Memory | metric | IaaS-CaaS | critical | More than 90% of allowed memory is being used by container. | Add more replicas. |
KubePodCrashLooping | K8S Pod Crashing | metric | IaaS-CaaS | minor | Kubernetes Pod container restarting often. Possible causes can be out of memory limit (OOM), liveness probe or container entrypoint failure. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubePodNotReady | K8S Pod Not Ready | metric | IaaS-CaaS | minor | Pod has been in a non-ready state for more than 10 min. The reason might be readiness probe failures, scheduling failures due to exhausted quotas, or a broken node. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubeStatefulSetGenerationMismatch | K8S StatefulSet Error | metric | IaaS-CaaS | minor | StatefulSet generation does not match, which indicates that the StatefulSet has failed but has not been rolled back. After 15 minutes, the alert is not sent again. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s StatefulSet. |
KubeStatefulSetReplicasMismatch | K8S StatefulSet Error | metric | IaaS-CaaS | minor | Kubernetes StatefulSet has not matched the expected number of Pod replicas for longer than 15 minutes. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubeStatefulSetUpdateNotRolledOut | K8S StatefulSet Error | metric | IaaS-CaaS | minor | StatefulSet update has not been rolled out. After 15 minutes, the alert is reported again. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment. |
KubeVersionMismatch | K8S Internal Error | metric | Infrastructure | minor | There are different versions of Kubernetes components running. This can be caused by failure during Volterra Software Upgrade. | Check Volterra Software Upgrade status. Ignore if upgrade is in progress. |
LoggingForwardFailed | Log Collection Error | metric | Infrastructure | critical | Log collection has failed to forward logs for more than 15 minutes. | Node is not sending logs. Check Fluentd status and health. Inspect Fluentbit logs for errors. If no Fluentbit instance can reach Fluentd, restart the Fluentd instances. If the issue persists for more than 2 hours, escalate to L2. |
LoggingOutputQueueStucked | Log Collection Error | metric | Infrastructure | major | Fluentbit output queue is stuck. | Restart Fluentbit. Escalate to L2 if the issue persists for more than 2 hours. |
LoggingRetriesFailed | Log Collection Error | metric | Infrastructure | critical | Log collector has tried too many times to forward logs in last 15 minutes. | Check network connectivity between CE and RE site. |
MaliciousUserDetected | Malicious User Detected | event | Security | major | Malicious user {{ user }} detected. {{ summary_message }} | Review the malicious user details and take the necessary next steps (for example, enable malicious user mitigation or create a client blocking rule). |
NFVServiceAllInstancesUnavailable | NFV Service unavailable | metric | Nfv-Service | critical | All service instances are down. | Access the service instance and check its configuration. |
NFVServiceInstanceUnavailable | NFV service instances unavailable | metric | Nfv-Service | major | Some of the external service instances are down. | Access the service instance and check its configuration. |
NodeAideFilesAddedRemoved | Node Error | event | Infrastructure | major | Monitored files on filesystem were unexpectedly modified. | Use logs to verify which files were modified and why. |
NodeAideFilesChanged | Node Error | event | Infrastructure | critical | Monitored files on filesystem were unexpectedly modified. | Use logs to verify which files were modified. Create an issue in GitLab for tracking. Immediately escalate to the Security Team (L2). |
NodeAideNotRunning | Node Error | event | Infrastructure | critical | Aide check did not run in past 24 hours. | Check the service status and notify the security team. |
NodeCpuUsageHigh | Node CPU Usage High | metric | Infrastructure | major | Node has run at more than 90% CPU for more than 10 mins. | Add a new node into the site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately. |
NodeFilesystemFilesFillingUp | Node Filesystem Error | metric | Infrastructure | minor | The alert is triggered when the filesystem is more than 85% full and will be 100% full in less than 4 days. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. |
NodeFilesystemOutOfFiles | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node has only a few percent of available inodes left. The alert is sent when less than 15% of inodes are left (warning) and less than 10% are left (critical). | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. Contact support in case the problem persists. |
NodeFilesystemOutOfSpace | Node Filesystem Error | metric | Infrastructure | major | Filesystem at node has only a few percent of available space left. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. |
NodeFilesystemSpaceFillingUp | Node Filesystem Error | metric | Infrastructure | minor | The alert is triggered when the filesystem is more than 85% full and will be 100% full in less than 4 days. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. |
NodeLoadHigh | Node Load High | metric | Infrastructure | minor | Node has had a load higher than 1 per CPU for more than 10 mins. | Add a new node into the site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately. |
NodeNicMgmtDegraded | Node NIC Error | event | Infrastructure | critical | Management NIC configuration issues detected on node. This log-based alert fires when systemd-networkd has logged any error message for selected management interfaces. | Check the network connectivity. |
NodeNicTxTimeout | Node NIC Error | event | Infrastructure | critical | Node network TX timeouts detected. This log-based alert fires when "transmission timeout detected" is detected from "mlx5_core" kernel module. | Check the network connectivity. |
NodeNotReady | K8S Node Error | metric | Infrastructure | critical | Site node is down. Pods cannot be scheduled or deprovisioned since the node is not responding. | Check Node and HW status in the Console UI. Reboot the node. If the problem persists for longer than 1 hour, contact support. |
NodeTooManyPods | K8S Node Error | metric | Infrastructure | minor | Number of running pods is near maximum. | Add a new node to the affected site or deprovision some workload. |
NodeUSBDeviceConnected | USB Device Detected | event | Infrastructure | major | New USB device connected to the node. | No action required. |
NodeUSBDeviceDisconnected | USB Device Disconnected | event | Infrastructure | major | USB device disconnected from the node. | No action required. |
RequestRateAnomaly | Request Rate Anomaly | custom | Timeseries-Anomaly | minor | Request rate anomaly detected. | Metric looks abnormal and needs attention. |
RequestThroughputAnomaly | Request Throughput Anomaly | custom | Timeseries-Anomaly | minor | Request throughput anomaly detected. | Metric looks abnormal and needs attention. |
ResponseLatencyAnomaly | Response Latency Anomaly | custom | Timeseries-Anomaly | minor | Response latency anomaly detected | Metric looks abnormal and needs attention. |
ResponseThroughputAnomaly | Response Throughput Anomaly | custom | Timeseries-Anomaly | major | Response throughput anomaly detected. | Metric looks abnormal and needs attention. |
SSOCreated | SSO Provider Created | event | UAM | major | New UAM SSO provider was created. | No action required. |
SSODeleted | SSO Provider Deleted | event | UAM | major | Existing UAM SSO provider was deleted. | No action required. |
ServiceClientErrorPerSourceSite | Virtual Host Client Error | metric | Virtual-Host | major | More than 10% of the requests from site to service failed due to client error. | Some clients are sending invalid requests to the virtual-host. Consider blocking the relevant users/IPs using Volterra Policy features. |
ServiceEndpointHealthcheckFailure | Endpoint healthcheck failure | metric | Virtual-Host | minor | Healthcheck failed for virtual-host endpoint. | Check the health of the origin servers. Check connectivity of origin servers to Volterra. |
ServicePolicyTooManyAttacks | Service Policy Security Events | metric | Security | major | More than 18 Service Policy security events were detected on the Virtual Host in 5 minutes. | Review the Service Policy security events to investigate further and take the necessary next steps (for example, create a trusted client rule or a client blocking rule). |
ServiceServerErrorPerSourceSite | Virtual Host Server Error | metric | Virtual-Host | major | Proxy is seeing excessive errors from upstream origin servers. | Check the health of the origin servers. Check connectivity of origin servers to Volterra. |
SiteBgpToTGWDown | Site BGP to TGW Down | metric | Ves-Software | critical | Site's BGP peering to TGW is down. | Verify network connectivity on given site and status of AWS VM. |
SiteCertificateExpiration | K8S Client Certificate Error | metric | Infrastructure | minor | Kubernetes certificates are expiring for your Volterra Site. To avoid interruption, upgrade to the latest available Volterra Software Version. | Upgrade Volterra Software Version to the latest available. |
SiteCustomerTunnelInterfaceDown | Customer Tunnel Interface Down | metric | Infrastructure | major | Connection from CE to a single RE is down. Some functionality will be limited. | Check physical and network connectivity of the CE. |
SiteDeleted | Site Deleted | event | Infrastructure | critical | Entire site was deleted. | No action required. |
SiteHardwareChanged | Site Hardware Changed | metric | Infrastructure | minor | Customer Edge node changed certified hardware. | No action required. |
SiteHttpProbeDown | RE to Customer Site Tunnel Down | metric | Infrastructure | major | HTTP check from connected Regional Edge to Customer Edge has failed. | Check the network connectivity. |
SiteHttpUnhealthy | Remote HTTP check failed | metric | IaaS-CaaS | major | Communication with Volterra services at site is failing. | Check the network connectivity. |
SiteNodeHeartbeatMissed | Site Heartbeat Down | metric | Infrastructure | major | Node at site has not sent a heartbeat for more than 20 minutes. | Check network connectivity and power status of the node in the Site. If it is running, try rebooting the node. |
SiteNonconformingVersion | Site Running Nonconforming Version | metric | Infrastructure | major | Site is running unsupported software version. | Update the site's software version. |
SitePhysicalInterfaceDown | Physical Interface Down | metric | Infrastructure | critical | One of the physical interfaces of CE went down. | Check physical and network connectivity of the CE. |
SitePhysicalInterfaceDown | Physical Interface Down | event | Infrastructure | critical | Physical interface on node is down. | Check the network connectivity. |
SiteRegistrationApproved | Site Registration Approved | event | Infrastructure | major | Site registration was approved and is waiting for configuration. | Check registration object for failure. |
SiteRegistrationDeleted | Site Registration Deleted | event | Infrastructure | major | The site node registration was deleted. | No action required. |
SiteRegistrationDuplicateName | Site Registration Duplicate Name Error | event | Infrastructure | major | Cannot register node with given name, the same name is already registered. | Choose different node name. |
SiteRegistrationPending | Site Registration Pending | event | Infrastructure | major | Site registration is in pending state. | Check registration object for failure. |
SiteSSHFailedLogin | SSH Failed Login | event | UAM | major | Failed SSH login to node detected. | Validate access with respect to your internal security policies. |
SiteSSHLoginWithLockOutCert | SSH Login with Lock out Cert | event | UAM | critical | SSH login to node with lock out cert detected. | Validate access with respect to your internal security policies. |
SiteSSHLoginWithOfflineCert | SSH Login with OFFLINE certificate | event | UAM | critical | SSH login to node with OFFLINE ssh-cert cert detected. | Security incident on PRODUCTION. No action needed on test/crt/staging environments. |
SiteSSHPasswordLogin | SSH Password Login | event | UAM | critical | SSH login to node using password authentication detected. | Validate access with respect to your internal security policies. |
SiteSSHPubkeyLogin | SSH Pubkey Login | event | UAM | major | SSH login using key to node detected. | Validate access with respect to your internal security policies. |
SiteSudoExecuted | Sudo Command Executed | event | UAM | major | Privileged command execution at node detected. | Validate command with respect to your internal security policies. |
SiteTGWTunnelDown | Site TGW Tunnel Down | metric | Ves-Software | critical | Site's tunnel interfaces to TGW are down. | Verify network connectivity on given site and status of AWS VM. |
SiteTunnelConnectionDown | IPSec/SSL Tunnel Connection Down | event | Infrastructure | critical | IPSec/SSL tunnel connection to the site is down. | Check the network connectivity. |
SiteTunnelInterfaceDown | Tunnel Interface Down | metric | Infrastructure | critical | Connections from both REs to the CE are down. The majority of functionality will be impacted. | Check physical and network connectivity of the CE. |
SiteUpgradeFailing | Site Upgrade Failing | metric | Infrastructure | critical | Volterra software upgrade is failing at Site. It retries every 10 minutes and keeps updating the status. | Check the Volterra Software status message info. Contact support if the problem persists for more than 30 minutes. |
Threat Campaigns | Threat Campaigns | metric | Security | major | More than 18 WAF security events with threat campaign signature triggers were detected on the Virtual Host in 5 minutes. | Review the WAF security events to investigate further and take the necessary next steps (for example, change the app firewall mode to blocking, create a trusted client rule, or create a client blocking rule). |
TLSAutomaticCertificateRenewalFailure | TLS Automatic Certificate Renewal Failure | event | TLS | major | TLS Automatic Certificate renewal is failing for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }}. Certificate expires in {{ $value }} days. | Navigate to https://docs.cloud.f5.com/docs/reference/tls-reference and follow the instructions so that F5 Distributed Cloud can auto-renew your certificate. |
TLSAutomaticCertificateRenewalStillFailing | TLS Automatic Certificate Renewal Still Failing | event | TLS | critical | TLS Automatic Certificate renewal is still failing for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} after multiple retries. Certificate expires in {{ $value }} days. | Navigate to https://docs.cloud.f5.com/docs/reference/tls-reference and follow the instructions so that F5 Distributed Cloud can auto-renew your certificate. |
TLSAutomaticCertificateExpired | TLS Automatic Certificate Expired | event | TLS | critical | TLS Automatic Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} has expired. | |
TLSCustomCertificateExpiring | TLS Custom Certificate Expiring | event | TLS | major | TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} expires in {{ $value }} days. | Update the certificate via the UI or API to avoid any downtime. |
TLSCustomCertificateExpiringSoon | TLS Custom Certificate Expiring Soon | event | TLS | critical | TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} expires in {{ $value }} days. | Update the certificate via the UI or API to avoid any downtime. |
TLSCustomCertificateExpired | TLS Custom Certificate Expired | event | TLS | critical | TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} has expired. | |
UserCreated | User Created | event | UAM | major | New UAM user was created. | No action required. |
UserDeleted | User Deleted | event | UAM | major | Existing UAM user was deleted. | No action required. |
UserUpdated | User Updated | event | UAM | major | Existing UAM user was updated. | No action required. |
VegaNumericAllocatorAlmostFull | Numeric allocator {{ $labels.name }} of Vega service almost full | metric | IaaS-CaaS | critical | Service Vega numeric allocator is almost full. | Escalate to L3 immediately. |
VerHighHugepagesMemoryUsage | VER High Hugepages Memory Usage | metric | Ves-Software | critical | Hugepages Memory reached critical level. | Escalate to L2. |
VesArgoLowCountersAvailable | Argo Low Available Counters | metric | Ves-Software | critical | Argo has low available counters on node. This may lead to service crash. Argo has various internal object counters, which have limits. This indicates if any of those counters are filling up. | Escalate to L2. |
VesArgoMemoryLow | Argo Memory Low | metric | Ves-Software | major | Argo is low on free memory. | Increase Argo memory size. |
VesArgoTooManySynPackets | Argo Too Many Syn Packets | metric | Ves-Software | critical | Argo has too many SYN packets on a VIP. | Check for the traffic source and/or escalate to L2. |
VesClientSideDefenseSuspiciousDomain | Suspicious Domain Found | event | Security-CSD | critical | Suspicious domain identified by Client-Side Defense service. | Add suspicious domain to allowed list or mitigated list. |
VesKubeCronJobRunning | K8S CronJob Runs Too Long | metric | Ves-Software | minor | CronJob is taking more than 1h to complete. | Create issue and let L2 solve it during working hours. |
VesKubeDaemonSetMisScheduled | K8S Daemonset Error | metric | Ves-Software | minor | DaemonSet member Pod cannot be scheduled on all machines where it is required to be present. | If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately. |
VesKubeDaemonSetNotScheduled | K8S Daemonset Error | metric | Ves-Software | minor | Some Pods of DaemonSet are not scheduled. After 10 minutes, the alert is not sent again. | If the problem persists for more than 30 minutes, describe the DaemonSet resource and escalate to L2 immediately. |
VesKubeDaemonSetRolloutStuck | K8S Daemonset Error | metric | Ves-Software | major | Only part of the desired Pods of the DaemonSet are scheduled and ready. After 15 minutes, the alert is reported again. | If the problem persists for more than 30 minutes, describe the DaemonSet resource and escalate to L2 immediately. |
VesKubeDeploymentGenerationMismatch | K8S Deployment Error | metric | Ves-Software | major | Deployment generation does not match, which indicates that the Deployment has failed but has not been rolled back. After 15 minutes, the alert is not sent again. | If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately. |
VesKubeDeploymentReplicasMismatch | K8S Deployment Error | metric | Ves-Software | major | Deployment has not matched the expected number of replicas for longer than an hour. | If the problem persists for more than 30 minutes, describe the Deployment resource and escalate to L2 immediately. |
VesKubeJobFailed | K8S Job Failed | metric | Ves-Software | critical | Alert is fired if any Kubernetes job is in a failed state for the past 2 hours. | Create issue and let L2 solve it during working hours. |
VesKubeLongCronJobRunning | K8S Long CronJob Runs Too Long | metric | Ves-Software | minor | CronJob is taking more than 2 hours to complete. This alert covers jobs that are expected to run for more than 1 hour. | Create an issue and let L2 solve it during working hours. |
VesKubePersistentVolumeFullInFourDays | K8S PVC Error | metric | IaaS-CaaS | major | The alert is triggered when the PersistentVolumeClaim is more than 85% full and will be 100% full in less than 4 days. | Resize PVC or clean disk. |
VesKubePersistentVolumeSpaceLow | K8S PVC Error | metric | IaaS-CaaS | major | Kubernetes PersistentVolumeClaim is getting out of space. | Resize PVC or clean disk. |
VesKubePodCPUThrottlingHigh | K8S Pod CPU Throttled | metric | Infrastructure | major | Kubernetes Pod container is being throttled at its CPU limits. After 1 hour, the alert is sent out. | Increase limits in the Deployment or StatefulSet definition. File an issue with a permanent limit increase proposal. |
VesKubePodCPUThrottlingHigh | K8S Pod CPU Throttled | metric | Infrastructure | critical | Pod container is being throttled at its CPU limits. After 15 minutes, the alert is sent out. | Increase limits in the Deployment or StatefulSet definition. File an issue with a permanent limit increase proposal. |
VesKubePodCPUThrottlingLongTime | K8S Pod CPU Throttled for Long Time | metric | Infrastructure | major | Kubernetes Pod container has been throttled at its CPU limits for a long time. The condition for this alert: more than 50% CPU throttling for more than 1 hour. Lower threshold than the VesKubePodCPUThrottlingHigh alert, but over a longer period. Only for internal services. | Increase limits in the Deployment or StatefulSet definition. File an issue with a permanent CPU limit increase proposal. |
VesKubePodContainerTooMuchMemory | K8S Pod Too Much Memory | metric | Ves-Software | major | More than 90% of allowed memory is being used by container. | Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal. |
VesKubePodCrashLooping | K8S Pod Crashing | metric | Ves-Software | critical | Pod container is crashing. Check the reason for the Pod crash. | Service is crashing; a very common reason is low memory limits. Check why the Pod is crashing, and if the problem persists for more than 15 minutes, or the Pod is continuously crashing and affects other services, escalate to L2. If the reason is an OOM kill, raise the resource limits. If it is an error, contact the service's team. If it is a scheduling problem, contact the SRE team. |
VesKubePodEtcdBackupCrashLooping | K8S EtcdBackup Crashing | metric | Ves-Software | major | ETCD backup job is crashing. Pods with names starting with "etcd-backup-" in Distributed Cloud namespaces are checked. | If the problem persists for more than 30 minutes, escalate to L2 immediately. |
VesKubePodFluentbitCrashLooping | K8S Fluentbit Crashing | metric | Ves-Software | major | Fluentbit service is crashing. | If the problem persists for more than 30 minutes, escalate to L2 immediately. |
VesKubePodNotReady | K8S Pod Not Ready | metric | Ves-Software | critical | Pod has been in a non-ready state for longer than 10 minutes. | If the problem persists for more than 30 minutes, escalate to L2 immediately. |
VesKubeStatefulSetGenerationMismatch | K8S StatefulSet Error | metric | Ves-Software | major | StatefulSet generation does not match, which indicates that the StatefulSet has failed but has not been rolled back. After 15 minutes, the alert is not sent again. | If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately. |
VesKubeStatefulSetReplicasMismatch | K8S StatefulSet Error | metric | Ves-Software | major | StatefulSet has not matched the expected number of replicas for longer than 15 minutes. | If the problem persists for more than 30 minutes, describe the StatefulSet resource and escalate to L2 immediately. |
VesKubeStatefulSetUpdateNotRolledOut | K8S StatefulSet Error | metric | Ves-Software | major | StatefulSet update has not been rolled out. After 15 minutes, the alert is reported again. | If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately. |
VesKubeVk8sNodeCapacityFull | vK8S Node Out Of Capacity | metric | Infrastructure | critical | vK8s nodegroup is out of capacity. New virtual Kubernetes Pods will remain in Pending status. | Increase the vK8s nodegroup in EKS. |
VesNodeCpuUsageHigh | Node CPU Usage High | metric | Infrastructure | major | Node has run at more than 90% CPU for more than 10 mins. | Add a new node into the site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately. |
VesNodeFilesystemFilesFillingUp | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node is predicted to run out of files within the next 24 hours. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. |
VesNodeFilesystemOutOfFiles | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node has only a few percent of available inodes left. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. Contact support in case the problem persists. |
VesNodeFilesystemOutOfSpace | Node Filesystem Error | metric | Infrastructure | major | Filesystem at node has only a few percent of available space left. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. |
VesNodeFilesystemSpaceFillingUp | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node is predicted to run out of space within the next 24 hrs. | Check disk usage at the Site dashboard. Deprovision workload or add a new node into the site. Resize the disk in case of a cloud CE. |
VesNodeLoadHigh | Node Load High | metric | Infrastructure | minor | Node has had a load higher than 1 per CPU for more than 10 mins. | Add a new node into the site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately. |
VesSvcLogAlert | Alert generated from log | event | Ves-Software | critical | Custom log alert. | Custom log alert. |
VesSvcRecoveredPanic | Service recovered from panic | event | Ves-Software | critical | A panic was encountered and recovered in execution of service. This is a svcfw service alert. Any svcfw service can raise this alert. | Check the service logs for details. |
VesVegaLagMemberLinkDown | VER detected link down for a bond interface | metric | Ves-Software | critical | VER detected link down for a bond interface. | Infrastructure team must check physical link status and resolve the link state problem. |
ViewActionError | View Action Error | event | IaaS-CaaS | major | View action finished with error. | Check the validity of your view variables. |
VoltShareDecryptionError | VoltShare Decryption Error | metric | VoltShare | major | Decrypt operation has failures. | Check secret policy or admin policy. |
VoltShareEncryptionError | VoltShare Encryption Error | metric | VoltShare | major | Encrypt operation has failures. | Check secret policy or admin policy. |
WAFTooManyAttacks | WAF Security Events | metric | Security | major | More than 18 WAF security events with attack signature or violation triggers were detected on the Virtual Host in 5 minutes. | Review the WAF security events to investigate further and take the necessary next steps (for example, change the app firewall mode to blocking, or create a trusted client rule, a client blocking rule, or a WAF exclusion rule). |
WAFTooManyMaliciousbots | Malicious Bots Detected | metric | Security | major | More than 18 WAF security events with malicious bot signature triggers were detected on the Virtual Host in 5 minutes. | Review the WAF security events to investigate further and take the necessary next steps (for example, change the app firewall mode to blocking, or create a trusted client rule, a client blocking rule, or a WAF exclusion rule). |
L7 DDOS | L7 DDoS Detected | event | Security | major | DDoS security event was detected. {{ summary_message }} | Review the DDoS security event to investigate further and take the necessary next steps (for example, create DDoS mitigation rules). |
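If you forward these alerts to your own tooling (for example, through an alert receiver webhook), you can triage incoming notifications by the severities listed in the table above. The sketch below assumes an Alertmanager-style payload in which each alert carries alertname, severity, and group labels; that shape is an assumption, so adapt it to whatever your receiver actually delivers.

```python
from collections import defaultdict

# Severity ranking matching the severities used in the table above.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def triage(alerts):
    """Group alert names by severity and print the most severe group first."""
    by_severity = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        by_severity[labels.get("severity", "minor")].append(labels.get("alertname", "?"))
    for sev in sorted(by_severity, key=lambda s: SEVERITY_ORDER.get(s, 3)):
        print(f"{sev}: {', '.join(sorted(by_severity[sev]))}")

# Hypothetical payload using alert names from the table above.
triage([
    {"labels": {"alertname": "WAFTooManyAttacks", "severity": "major", "group": "Security"}},
    {"labels": {"alertname": "SiteTunnelConnectionDown", "severity": "critical", "group": "Infrastructure"}},
])
```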
TSA Severity vs Anomaly Scores
The following table maps Time-Series Anomaly (TSA) scores to the severity of the alerts raised for various metrics. It also shows the absolute threshold for each metric.
Metric | Severity | Score | Absolute Threshold |
---|---|---|---|
Request Rate | minor | 0.6 | NA |
Request Rate | major | 1.5 | 50 rps |
Request Rate | critical | 3.0 | 100 rps |
Request Throughput | minor | 0.6 | NA |
Request Throughput | major | 1.5 | 2500 kbps |
Request Throughput | critical | 3.0 | 5000 kbps |
Response Throughput | minor | 0.6 | NA |
Response Throughput | major | 1.5 | 25000 kbps |
Response Throughput | critical | 3.0 | 50000 kbps |
Response Latency | minor | 0.6 | NA |
Response Latency | major | 1.5 | 250 ms |
Response Latency | critical | 3.0 | 500 ms |
Error Rate | minor | 0.6 | NA |
Error Rate | major | 1.5 | 5 erps |
Error Rate | critical | 3.0 | 10 erps |
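As an illustration, the table can be read as the following severity check. This sketch assumes that, for major and critical alerts, both the anomaly score and the absolute threshold must be exceeded; it is an interpretation of the table, not the service's actual implementation.

```python
# Score and absolute thresholds taken directly from the table above.
THRESHOLDS = {
    # metric: (major absolute threshold, critical absolute threshold)
    "request_rate_rps": (50, 100),
    "request_throughput_kbps": (2500, 5000),
    "response_throughput_kbps": (25000, 50000),
    "response_latency_ms": (250, 500),
    "error_rate_erps": (5, 10),
}

def tsa_severity(metric: str, score: float, value: float):
    """Return the alert severity implied by the table, or None for no alert."""
    major_abs, critical_abs = THRESHOLDS[metric]
    if score >= 3.0 and value >= critical_abs:
        return "critical"
    if score >= 1.5 and value >= major_abs:
        return "major"
    if score >= 0.6:
        return "minor"
    return None

# Example: a request rate anomaly with score 1.8 at 75 rps maps to "major".
print(tsa_severity("request_rate_rps", 1.8, 75))
```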
Note: For more information on the TSA, see the Time-Series Anomaly Detection guide.