Alert Reference

Objective

This document provides reference information on the types of alerts supported by F5® Distributed Cloud Services. Use it to understand the details of each alert and the action required in response.

Key Points

The following apply to the alerts:

  • There is no separate alert for the health score, because the health score is composed of multiple components. For example, the health score of a site is computed from the data-plane connection status to the Regional Edge (RE) sites, the control-plane connection status, and the K8s API server status in the site. Individual alerts are defined for each of these conditions, but no alert is available for the health score itself.

Note: You can obtain the health score of a site in F5® Distributed Cloud Console (Console). You can also obtain it using the API https://www.volterra.io/docs/api/graph-connectivity#operation/ves.io.schema.graph.connectivity.CustomAPI.NodeQuery with "field_selector":{"healthscore":{"types":["HEALTHSCORE_OVERALL"]}}.

  • The time before an alert is generated is not the same for all alerts. This duration is determined by the severity of the alert. For example, an alert is raised as soon as the tunnel connection to an RE goes down, whereas a health check alert for a service is raised only if the condition persists for 10 minutes. This keeps the alert volume manageable and avoids generating alerts on temporary or transient failure conditions.

  • Changing the threshold for alerts is not supported.

  • Defining new alerts using an API is not supported. However, if the existing alerts do not satisfy your requirements, you can create a support request for a new alert in Console.
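
The health score query mentioned in the note above can be sketched as follows. This is a minimal illustration that only builds the request body containing the field selector quoted in the note. The function name build_healthscore_query is hypothetical, and the request URL, HTTP method, and authentication are deliberately left out because this document does not specify them; consult the linked API reference for those details.

```python
import json


def build_healthscore_query(types=("HEALTHSCORE_OVERALL",)):
    """Build a request body that selects health score fields.

    Hypothetical helper: only the field_selector structure is taken from
    this document. Other request fields that the API may require are not
    shown here.
    """
    return {"field_selector": {"healthscore": {"types": list(types)}}}


body = build_healthscore_query()
# Serialize the body as it would be sent in a JSON request.
print(json.dumps(body, indent=2))
```

The resulting JSON can be used as the body of the node query request once the endpoint and credentials for your tenant are known.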

Alerts and Descriptions

The following table lists the alerts and their details. Each row gives, in order, the alert, its name, type, group, severity, description, and the associated action.

Alert Name Type Group Severity Description Action
APISecurityTooManyAttacks API Security Events metric Security major More than 18 API security events were detected on the Virtual Host in 5 minutes. Review the API security events, to investigate further and take necessary next steps (in other words, create trusted client rule or client blocking rule).
ErrorRateAnomaly Error Rate Anomaly custom Timeseries-Anomaly minor Error rate anomaly detected. Metric looks abnormal and needs attention.
FluentbitOutputErrors Log Collection Error metric Infrastructure major Fluentbit has output errors. Collect info and open issue. Monitor Grafana fluent dashboard. Let L2 fix during working hours.
KubeAPIErrorsHigh K8S API Error metric Infrastructure major API server is returning errors for some requests. Check kube-apiserver log to see the detail. Contact support if the issue persists.
KubeAPILatencyHigh K8S API Error metric IaaS-CaaS minor Kubernetes API latency at 99th percentile is too high for more than 2 seconds. Possible intermittent problem which may occur during parallel application updates. Check HW utilization of CE site. If the problem persists for longer than an hour, contact support.
KubeCronJobRunning K8S Job Too Long metric IaaS-CaaS minor Kubernetes CronJob running for more than an hour. Job can be stuck or it is expected to run longer. Check logs from Kubernetes Pod. Contact support in case of non-customer vK8s workload.
KubeDaemonSetMisScheduled K8S Daemonset Error metric IaaS-CaaS minor DaemonSet member pod cannot be scheduled on all machines where it is required to be present. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDaemonSetNotScheduled K8S Daemonset Error metric IaaS-CaaS minor Some pods of DaemonSet are not scheduled. After 10 minutes, the alert is not sent again. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDaemonSetRolloutStuck K8S Daemonset Error metric IaaS-CaaS minor Kubernetes DaemonSet desired Pods are not scheduled or ready. After 15 minutes, the alert is reported again. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDeploymentGenerationMismatch K8S Deployment Error metric IaaS-CaaS minor Deployment generation does not match. This indicates that the Deployment has failed but has not been rolled back. After 15 minutes, the alert is not sent again. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeDeploymentReplicasMismatch K8S Deployment Error metric IaaS-CaaS minor Kubernetes Deployment has not matched the expected number of Pod replicas for more than 1hr. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeJobFailed K8S Job Failed metric IaaS-CaaS minor Alert is fired if any Kubernetes job is in a failed state for the past 2 hours. Check Kubernetes Job and Pod status, events and logs in vK8s cluster. Contact support in case of etcd job.
KubeMetricsMissing Kubernetes Metrics Missing metric Infrastructure critical Essential Kubernetes metrics are missing. All Kubernetes alerts are affected as well. Check whether the kube-state-metrics workload is running and inspect its logs. Restart this service on the cluster where this alert appeared.
KubeNodeEvictedPods Node has evicted pods metric Infrastructure critical Node has evicted pods. Describe the evicted pods and check for the eviction reason and escalate to L2 immediately.
KubeNodeUnschedulable K8S Node Scheduling Disabled metric IaaS-CaaS minor Node has Scheduling Disabled. Describe the node and determine the reason for the disabled scheduling.
KubePersistentVolumeFullInFourDays K8S PVC Error metric IaaS-CaaS major The alert is triggered when the PersistentVolumeClaim is more than 85% full and will be 100% full in less than 4 days. Resize PVC or clean disk.
KubePersistentVolumeSpaceLow K8S PVC Error metric IaaS-CaaS major Kubernetes PersistentVolumeClaim is getting out of space. Resize PVC or clean disk.
KubePodCPUThrottlingHigh K8S Pod CPU Throttled metric IaaS-CaaS major Kubernetes Pod container is throttling its CPU limits. Increase flavor for vk8s Deployment or StatefulSet definition. Contact support in case of non vk8s Pod.
KubePodContainerTooMuchMemory metric IaaS-CaaS critical More than 90% of allowed memory is being used by container. Add more replicas.
KubePodCrashLooping K8S Pod Crashing metric IaaS-CaaS minor Kubernetes Pod container restarting often. Possible causes can be out of memory limit (OOM), liveness probe or container entrypoint failure. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubePodNotReady K8S Pod Not Ready metric IaaS-CaaS minor Pod has been in a non-ready state for more than 10 min. The reason might be readiness probe failures, scheduling failures due to exhausted quotas, or a broken node. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeStatefulSetGenerationMismatch K8S StatefulSet Error metric IaaS-CaaS minor StatefulSet generation does not match. After 15 minutes, the alert is not sent again. This indicates that the StatefulSet has failed but has not been rolled back.
KubeStatefulSetReplicasMismatch K8S StatefulSet Error metric IaaS-CaaS minor Kubernetes StatefulSet has not matched the expected number of Pod replicas for longer than 15 minutes. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeStatefulSetUpdateNotRolledOut K8S StatefulSet Error metric IaaS-CaaS minor StatefulSet update has not been rolled out. After 15 minutes, the alert is reported again. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeVersionMismatch K8S Internal Error metric Infrastructure minor There are different versions of Kubernetes components running. This can be caused by failure during Volterra Software Upgrade. Check Volterra Software Upgrade status. Ignore if upgrade is in progress.
LoggingForwardFailed Log Collection Error metric Infrastructure critical Log collection has failed to forward logs for more than 15 minutes. Node is not sending logs. Check Fluentd status and health. Inspect fluentbit logs for errors. If no fluentbit instance can reach fluentd, restart the fluentd instances. If the problem persists for more than 2 hours, escalate to L2.
LoggingOutputQueueStucked Log Collection Error metric Infrastructure major Fluentbit output queue is stuck. Restart fluentbit. Escalate to L2 if the problem persists for more than 2 hours.
LoggingRetriesFailed Log Collection Error metric Infrastructure critical Log collector has tried too many times to forward logs in last 15 minutes. Check network connectivity between CE and RE site.
MaliciousUserDetected Malicious User Detected event Security major Malicious user {{ user }} detected. {{ summary_message }} Review the malicious user details and take the necessary next steps (in other words, enable malicious user mitigation or create a client blocking rule).
NFVServiceAllInstancesUnavailable NFV Service unavailable metric Nfv-Service critical All service instances are down. Access the service instance and check its configuration.
NFVServiceInstanceUnavailable NFV service instances unavailable metric Nfv-Service major Some of the external service instances are down. Access the service instance and check its configuration.
NodeAideFilesAddedRemoved Node Error event Infrastructure major Monitored files on filesystem were unexpectedly modified. Use logs to verify which files were modified and why.
NodeAideFilesChanged Node Error event Infrastructure critical Monitored files on filesystem were unexpectedly modified. Use logs to verify which files were modified. Create an issue in GitLab for tracking. Immediately escalate to the Security Team (L2).
NodeAideNotRunning Node Error event Infrastructure critical Aide check did not run in past 24 hours. Check the service status and notify the security team.
NodeCpuUsageHigh Node CPU Usage High metric Infrastructure major Node runs more than 90% CPU for more than 10 mins. Add new node into site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately.
NodeFilesystemFilesFillingUp Node Filesystem Error metric Infrastructure minor The alert is triggered when more than 85% is full and will be 100% full in less than 4 days. Check disk usage at Site dashboard. Deprovision workload or add new node into site.
NodeFilesystemOutOfFiles Node Filesystem Error metric Infrastructure minor Filesystem at node has only a few percent available inodes left. Sent when less than 15% of inodes left for warning alert and less than 10% of inodes left for critical alert. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE. Contact support if the problem persists.
NodeFilesystemOutOfSpace Node Filesystem Error metric Infrastructure major Filesystem at node has only a few percent available space left. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
NodeFilesystemSpaceFillingUp Node Filesystem Error metric Infrastructure minor The alert is triggered when more than 85% is full and will be 100% full in less than 4 days. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
NodeLoadHigh Node Load High metric Infrastructure minor Node has higher load than 1 per CPU for more than 10 mins. Add new node into site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately.
NodeNicMgmtDegraded Node NIC Error event Infrastructure critical Management NIC configuration issues detected on node. This log-based alert fires when systemd-networkd has logged any error message for selected management interfaces. Check the network connectivity.
NodeNicTxTimeout Node NIC Error event Infrastructure critical Node network TX timeouts detected. This log-based alert fires when "transmission timeout detected" is detected from "mlx5_core" kernel module. Check the network connectivity.
NodeNotReady K8S Node Error metric Infrastructure critical Site node is down. Pods cannot be scheduled or deprovisioned since node is not responding. Check Node and HW status in console UI. Reboot node. If the problem persists for longer than 1 hour, contact support.
NodeTooManyPods K8S Node Error metric Infrastructure minor Number of running pods is near maximum. Add a new node to the affected site or deprovision some workload.
NodeUSBDeviceConnected USB Device Detected event Infrastructure major New USB device connected to the node. No action required.
NodeUSBDeviceDisconnected USB Device Disconnected event Infrastructure major USB device disconnected from the node. No action required.
RequestRateAnomaly Request Rate Anomaly custom Timeseries-Anomaly minor Request rate anomaly detected. Metric looks abnormal and needs attention.
RequestThroughputAnomaly Request Throughput Anomaly custom Timeseries-Anomaly minor Request throughput anomaly detected. Metric looks abnormal and needs attention.
ResponseLatencyAnomaly Response Latency Anomaly custom Timeseries-Anomaly minor Response latency anomaly detected. Metric looks abnormal and needs attention.
ResponseThroughputAnomaly Response Throughput Anomaly custom Timeseries-Anomaly major Response throughput anomaly detected. Metric looks abnormal and needs attention.
SSOCreated SSO Provider Created event UAM major New UAM SSO provider was created. No action required.
SSODeleted SSO Provider Deleted event UAM major Existing UAM SSO provider was deleted. No action required.
ServiceClientErrorPerSourceSite Virtual Host Client Error metric Virtual-Host major More than 10% of the requests from site to service failed due to client error. Some clients are sending invalid requests to the virtual-host. Consider blocking the relevant users/IPs using Volterra Policy features.
ServiceEndpointHealthcheckFailure Endpoint healthcheck failure metric Virtual-Host minor Healthcheck failed for virtual-host endpoint. Check the health of the origin servers. Check connectivity of origin servers to Volterra.
ServicePolicyTooManyAttacks Service Policy Security Events metric Security major More than 18 Service Policy security events were detected on the Virtual Host in 5 minutes. Review the Service Policy security events, to investigate further and take necessary next steps (in other words, create trusted client rule or client blocking rule).
ServiceServerErrorPerSourceSite Virtual Host Server Error metric Virtual-Host major ServiceServerErrorPerSourceSite Proxy is seeing excessive errors from upstream origin servers. Check the health of the origin servers. Check connectivity of origin servers to Volterra.
SiteBgpToTGWDown Site BGP to TGW Down metric Ves-Software critical Site's BGP peering to TGW is down. Verify network connectivity on given site and status of AWS VM.
SiteCertificateExpiration K8S Client Certificate Error metric Infrastructure minor Kubernetes certificate is expiring for your Volterra Site. In order to avoid interruption, upgrade to latest available Volterra Software Version. Upgrade Volterra Software Version to latest available.
SiteCustomerTunnelInterfaceDown Customer Tunnel Interface Down metric Infrastructure major Connection from CE to a single RE is down. Some functionality will be limited. Check physical and network connectivity of the CE.
SiteDeleted Site Deleted event Infrastructure critical Entire site was deleted. No action required.
SiteHardwareChanged Site Hardware Changed metric Infrastructure minor Customer Edge node changed certified hardware. No action required.
SiteHttpProbeDown RE to Customer Site Tunnel Down metric Infrastructure major HTTP check from connected Regional Edge to Customer Edge has failed. Check the network connectivity.
SiteHttpUnhealthy Remote HTTP check failed metric IaaS-CaaS major Communication with Volterra services at site is failing. Check the network connectivity.
SiteNodeHeartbeatMissed Site Heartbeat Down metric Infrastructure major Node at site did not send heartbeat for more than 20 minutes. Check network connectivity and power status of node in Site. If running, try rebooting the node.
SiteNonconformingVersion Site Running Nonconforming Version metric Infrastructure major Site is running unsupported software version. Update the site's software version.
SitePhysicalInterfaceDown Physical Interface Down metric Infrastructure critical One of the physical interfaces of CE went down. Check physical and network connectivity of the CE.
SitePhysicalInterfaceDown Physical Interface Down event Infrastructure critical Physical interface on node is down. Check the network connectivity.
SiteRegistrationApproved Site Registration Approved event Infrastructure major Site registration was approved and waiting for configuration. Check registration object for failure.
SiteRegistrationDeleted Site Registration Deleted event Infrastructure major The site node registration was deleted. No action required.
SiteRegistrationDuplicateName Site Registration Duplicate Name Error event Infrastructure major Cannot register node with given name, the same name is already registered. Choose different node name.
SiteRegistrationPending Site Registration Pending event Infrastructure major Site registration is in pending state. Check registration object for failure.
SiteSSHFailedLogin SSH Failed Login event UAM major Failed SSH login to node detected. Validate access with respect to your internal security policies.
SiteSSHLoginWithLockOutCert SSH Login with Lock out Cert event UAM critical SSH login to node with lock out cert detected. Validate access with respect to your internal security policies.
SiteSSHLoginWithOfflineCert SSH Login with OFFLINE certificate event UAM critical SSH login to node with OFFLINE ssh-cert detected. Security incident on PRODUCTION. No action needed on test/crt/staging environments.
SiteSSHPasswordLogin SSH Password Login event UAM critical SSH login to node using password authentication detected. Validate access with respect to your internal security policies.
SiteSSHPubkeyLogin SSH Pubkey Login event UAM major SSH login using key to node detected. Validate access with respect to your internal security policies.
SiteSudoExecuted Sudo Command Executed event UAM major Privileged command execution at node detected. Validate command with respect to your internal security policies.
SiteTGWTunnelDown Site's tunnel interfaces to TGW are down metric Ves-Software critical Site's tunnel interfaces to TGW are down. Verify network connectivity on given site and status of AWS VM.
SiteTGWTunnelDown Site TGW Tunnel Down metric Ves-Software critical Site's tunnel interfaces to TGW are down. Verify network connectivity on given site and status of AWS VM.
SiteTunnelConnectionDown IPSec/SSL Tunnel Connection Down event Infrastructure critical IPSec/SSL tunnel connection to the site is down. Check the network connectivity.
SiteTunnelInterfaceDown Tunnel Interface Down metric Infrastructure critical Connections from both REs to the CE are down. Majority of functionality will be impacted. Check physical and network connectivity of the CE.
SiteUpgradeFailing Site Upgrade Failing metric Infrastructure critical Volterra software upgrade is failing at Site. It retries every 10 minutes and keeps updating the status. Check Volterra Software status message info. Contact support if the problem persists for more than 30 minutes.
Threat Campaigns Threat Campaigns metric Security major More than 18 WAF security events with threat campaign signatures triggers were detected on the Virtual Host in 5 minutes. Review the WAF security events, to investigate further and take necessary next steps (in other words, change app firewall mode to blocking, create trusted client rule or client blocking rule).
TLSAutomaticCertificateRenewalFailure TLS Automatic Certificate Renewal Failure event TLS Major TLS Automatic Certificate renewal is failing for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }}. Certificate expires in {{ $value }} days. Navigate to https://docs.cloud.f5.com/docs/reference/tls-reference and follow the instructions, so that F5 Distributed Cloud can auto-renew your certificate.
TLSAutomaticCertificateRenewalStillFailing TLS Automatic Certificate renewal is still failing after multiple retries event TLS Critical TLS Automatic Certificate renewal is still failing for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }}. Certificate expires in {{ $value }} days. Navigate to https://docs.cloud.f5.com/docs/reference/tls-reference and follow the instructions, so that F5 Distributed Cloud can auto-renew your certificate.
TLSAutomaticCertificateExpired TLS Automatic Certificate Expired event TLS Critical TLS Automatic Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} has expired.
TLSCustomCertificateExpiring TLS Custom Certificate Expiring event TLS Major TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} expires in {{ $value }} days. Please update the certificate via the UI or API, to avoid any downtime.
TLSCustomCertificateExpiringSoon TLS Custom Certificate Expiring Soon event TLS Critical TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} expires in {{ $value }} days. Please update the certificate via the UI or API, to avoid any downtime.
TLSCustomCertificateExpired TLS Custom Certificate Expired event TLS Critical TLS Custom Certificate for virtual host {{ $labels.namespace }} / {{ $labels.vh_name }} has expired.
UserCreated User Created event UAM major New UAM user was created. No action required.
UserDeleted User Deleted event UAM major Existing UAM user was deleted. No action required.
UserUpdated User Updated event UAM major Existing UAM user was updated. No action required.
VegaNumericAllocatorAlmostFull Numeric allocator {{ $labels.name }} of Vega service almost full metric IaaS-CaaS critical Service Vega numeric allocator is almost full. Escalate to L3 immediately.
VerHighHugepagesMemoryUsage VER High Hugepages Memory Usage metric Ves-Software critical Hugepages Memory reached critical level. Escalate to L2.
VesArgoLowCountersAvailable Argo Low Available Counters metric Ves-Software critical Argo has low available counters on node. This may lead to service crash. Argo has various internal object counters, which have limits. This indicates if any of those counters are filling up. Escalate to L2.
VesArgoMemoryLow Argo Memory Low metric Ves-Software major Argo is low on free memory. Increase Argo memory size.
VesArgoTooManySynPackets Argo Too Many Syn Packets metric Ves-Software critical Argo has too many syn packets on VIP. Check for traffic source and/or escalate to L2.
VesClientSideDefenseSuspiciousDomain Suspicious Domain Found event Security-CSD critical Suspicious domain identified by Client-Side Defense service. Add suspicious domain to allowed list or mitigated list.
VesKubeCronJobRunning K8S CronJob Runs Too Long metric Ves-Software minor CronJob is taking more than 1h to complete. Create issue and let L2 solve it during working hours.
VesKubeDaemonSetMisScheduled K8S Daemonset Error metric Ves-Software minor DaemonSet member pod cannot be scheduled on all machines where it is required to be present. If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately.
VesKubeDaemonSetNotScheduled K8S Daemonset Error metric Ves-Software minor Some Pods of DaemonSet are not scheduled. After 10 minutes, the alert is not sent again. If the problem persists for more than 30 minutes, describe DaemonSet resource and escalate to L2 immediately.
VesKubeDaemonSetRolloutStuck K8S Daemonset Error metric Ves-Software major Only part of the desired Pods of DaemonSet are scheduled and ready. After 15 minutes, the alert is reported again. If the problem persists for more than 30 minutes, describe DaemonSet resource and escalate to L2 immediately.
VesKubeDeploymentGenerationMismatch K8S Deployment Error metric Ves-Software major Deployment generation does not match. This indicates that the Deployment has failed but has not been rolled back. After 15 minutes, the alert is not sent again. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesKubeDeploymentReplicasMismatch K8S Deployment Error metric Ves-Software major Deployment has not matched the expected number of replicas for longer than an hour. If the problem persists for more than 30 minutes, describe Deployment resource and escalate to L2 immediately.
VesKubeJobFailed K8S Job Failed metric Ves-Software critical Alert is fired if any Kubernetes job is in a failed state for the past 2 hours. Create issue and let L2 solve it during working hours.
VesKubeLongCronJobRunning K8S Long CronJob Runs Too Long metric Ves-Software minor CronJob is taking more than 2 hours to complete. This alerts for jobs running for more than 1 hour. Create issue and let L2 solve it during working hours.
VesKubePersistentVolumeFullInFourDays K8S PVC Error metric IaaS-CaaS major The alert is triggered when the PersistentVolumeClaim is more than 85% full and will be 100% full in less than 4 days. TODO
VesKubePersistentVolumeSpaceLow K8S PVC Error metric IaaS-CaaS major Kubernetes PersistentVolumeClaim is getting out of space. Resize PVC or clean disk.
VesKubePodCPUThrottlingHigh K8S Pod CPU Throttled metric Infrastructure major Kubernetes Pod container is throttling its CPU limits. After 1 hour, the alert is sent out. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCPUThrottlingHigh K8S Pod CPU Throttled metric Infrastructure critical Pod container is throttling its CPU limits. After 15 minutes, the alert is sent out. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCPUThrottlingLongTime K8S Pod CPU Throttled for Long Time metric Infrastructure major Kubernetes Pod container is throttling its CPU limits for a long time. The condition for this alert to be issued is: throttling 50% plus CPU for more than 1 hour. Lower threshold than the VesKubePodCPUThrottlingHigh alert but longer period. Only for internal services. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent CPU limit increase proposal.
VesKubePodContainerTooMuchMemory K8S Pod Too Much Memory metric Ves-Software major More than 90% of allowed memory is being used by container. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCrashLooping K8S Pod Crashing metric Ves-Software critical Pod container is crashing. Check the reason for pod crash. Service is crashing. Very often the reason can be low memory limits. Check why the pod is crashing. If the problem persists for more than 15 minutes, or the pod is continuously crashing and it affects other services, escalate to L2. If reason is oomkill, raise resource limits. If error, contact service's team. If it's a scheduling problem, contact SRE team.
VesKubePodEtcdBackupCrashLooping K8S EtcdBackup Crashing metric Ves-Software major ETCD backup job is crashing. Pods with name starting "etcd-backup-" in Distributed Cloud namespaces are checked. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubePodFluentbitCrashLooping K8S Fluentbit Crashing metric Ves-Software major Fluentbit service is crashing. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubePodNotReady K8S Pod Not Ready metric Ves-Software critical Pod has been in a non-ready state for longer than 10 minutes. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubeStatefulSetGenerationMismatch K8S StatefulSet Error metric Ves-Software major StatefulSet generation does not match. This indicates that the StatefulSet has failed but has not been rolled back. After 15 minutes, the alert is not sent again. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesKubeStatefulSetReplicasMismatch K8S StatefulSet Error metric Ves-Software major StatefulSet has not matched the expected number of replicas for longer than 15 minutes. If the problem persists for more than 30 minutes, describe StatefulSet resource and escalate to L2 immediately.
VesKubeStatefulSetUpdateNotRolledOut K8S StatefulSet Error metric Ves-Software major StatefulSet update has not been rolled out. After 15 minutes, the alert is reported again. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesKubeVk8sNodeCapacityFull vK8S Node Out Of Capacity metric Infrastructure critical vk8s nodegroup is out of capacity. New virtual kubernetes pod will be in Pending status. Increase vk8s nodegroup in EKS.
VesNodeCpuUsageHigh Node CPU Usage High metric Infrastructure major Node runs more than 90% CPU for more than 10 mins. Add new node into site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesNodeFilesystemFilesFillingUp Node Filesystem Error metric Infrastructure minor Filesystem at node is predicted to run out of files within the next 24 hours. Check disk usage at Site dashboard. Deprovision workload or add new node into site.
VesNodeFilesystemOutOfFiles Node Filesystem Error metric Infrastructure minor Filesystem at node has only a few percent available inodes left. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE. Contact support if the problem persists.
VesNodeFilesystemOutOfSpace Node Filesystem Error metric Infrastructure major Filesystem at node has only a few percent available space left. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
VesNodeFilesystemSpaceFillingUp Node Filesystem Error metric Infrastructure minor Filesystem at node is predicted to run out of space within the next 24 hrs. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
VesNodeLoadHigh Node Load High metric Infrastructure minor Node has higher load than 1 per CPU for more than 10 mins. Add new node into site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesSvcLogAlert Alert generated from log event Ves-Software critical Custom log alert. Custom log alert.
VesSvcRecoveredPanic Service recovered from panic event Ves-Software critical A panic was encountered and recovered in execution of service. This is a svcfw service alert. Any svcfw service can raise this alert. Check the service logs for details.
VesVegaLagMemberLinkDown VER detected link down for a bond interface metric Ves-Software critical VER detected link down for a bond interface. Infrastructure team must check physical link status and resolve the link state problem.
ViewActionError View Action Error event IaaS-CaaS major View action finished with error. Check the validity of your view variables.
VoltShareDecryptionError VoltShare Decryption Error metric VoltShare major Decrypt operation has failures. Check secret policy or admin policy.
VoltShareEncryptionError VoltShare Encryption Error metric VoltShare major Encrypt operation has failures. Check secret policy or admin policy.
WAFTooManyAttacks WAF Security Events metric Security major More than 18 WAF security events with attack signatures or violations triggers were detected on the Virtual Host in 5 minutes. Review the WAF security events, to investigate further and take necessary next steps (in other words, change app firewall mode to blocking, create trusted client rule or client blocking rule or WAF exclusion rule).
WAFTooManyMaliciousbots Malicious Bots detected metric Security major More than 18 WAF security events with malicious bots signatures triggers were detected on the Virtual Host in 5 minutes. Review the WAF security events, to investigate further and take necessary next steps (in other words, change app firewall mode to blocking, create trusted client rule or client blocking rule or WAF exclusion rule).
L7 DDOS L7 DDOS detected Event Security major DDoS security event was detected. {{ summary_message }} Review the DDOS security event, to investigate further and take necessary next steps (in other words, create DDOS mitigation rules).
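
As noted under Key Points, the time before an alert is generated differs by alert, and several rows above fire only after a condition has persisted (for example, 10 minutes for KubePodNotReady, while a tunnel-down alert is raised immediately). The sketch below illustrates that general idea of a hold period. It is not the platform's implementation; the HoldTimer class and its parameters are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HoldTimer:
    """Report an alert only after a condition has held for hold_seconds.

    Hypothetical illustration of a hold period; not taken from the product.
    """
    hold_seconds: float
    _since: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, condition_active: bool, now: float) -> bool:
        """Record one observation at time `now` (seconds); return True if the alert should fire."""
        if not condition_active:
            # The condition cleared, so a transient failure never raises an alert.
            self._since = None
            return False
        if self._since is None:
            # First observation of the failing condition starts the clock.
            self._since = now
        return now - self._since >= self.hold_seconds


# A zero hold period fires immediately; a 10-minute hold suppresses short blips.
tunnel_down = HoldTimer(hold_seconds=0)
health_check = HoldTimer(hold_seconds=10 * 60)
print(tunnel_down.update(True, now=0))      # True
print(health_check.update(True, now=0))     # False
print(health_check.update(True, now=600))   # True
```

Resetting the timer whenever the condition clears is what keeps temporary or transient failures from generating alerts.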

TSA Severity vs Anomaly Scores

The following table lists the Time-Series Anomaly (TSA) scores and the associated severity of the alerts for various metrics. The table also shows the absolute threshold for each metric.

Metric Severity Score Absolute Threshold
Request Rate minor 0.6 NA
Request Rate major 1.5 50 rps
Request Rate critical 3.0 100 rps
Request Throughput minor 0.6 NA
Request Throughput major 1.5 2500 kbps
Request Throughput critical 3.0 5000 kbps
Response Throughput minor 0.6 NA
Response Throughput major 1.5 25000 kbps
Response Throughput critical 3.0 50000 kbps
Response Latency minor 0.6 NA
Response Latency major 1.5 250 ms
Response Latency critical 3.0 500 ms
Error Rate minor 0.6 NA
Error Rate major 1.5 5 erps
Error Rate critical 3.0 10 erps

Note: For more information on the TSA, see the Time-Series Anomaly Detection guide.
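
The TSA table above can be read as a lookup from anomaly score and metric value to severity. The sketch below encodes the table rows as data. How the score and the absolute threshold are combined is not stated in this document, so the code assumes that both must be met (using greater-than-or-equal comparisons) and that the highest matching severity wins. Treat that rule, and the names TSA_LEVELS and tsa_severity, as assumptions made for illustration only.

```python
from typing import Optional

# (severity, minimum score, absolute threshold or None for "NA"), highest severity first.
# Values are copied from the table; units are rps, kbps, kbps, ms, and erps respectively.
TSA_LEVELS = {
    "Request Rate": [("critical", 3.0, 100), ("major", 1.5, 50), ("minor", 0.6, None)],
    "Request Throughput": [("critical", 3.0, 5000), ("major", 1.5, 2500), ("minor", 0.6, None)],
    "Response Throughput": [("critical", 3.0, 50000), ("major", 1.5, 25000), ("minor", 0.6, None)],
    "Response Latency": [("critical", 3.0, 500), ("major", 1.5, 250), ("minor", 0.6, None)],
    "Error Rate": [("critical", 3.0, 10), ("major", 1.5, 5), ("minor", 0.6, None)],
}


def tsa_severity(metric: str, score: float, value: float) -> Optional[str]:
    """Return the assumed severity for a metric, or None if no row matches.

    Assumption: a row matches when score >= its minimum score AND, if the row
    has an absolute threshold, value >= that threshold.
    """
    for severity, min_score, abs_threshold in TSA_LEVELS[metric]:
        if score >= min_score and (abs_threshold is None or value >= abs_threshold):
            return severity
    return None


print(tsa_severity("Request Rate", score=3.2, value=120))     # critical
print(tsa_severity("Request Rate", score=3.2, value=60))      # major
print(tsa_severity("Response Latency", score=0.5, value=900)) # None
```

Under this assumed rule, a high anomaly score on a low-volume metric is reported at a lower severity, because the absolute threshold for the higher severity is not reached.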