Stock and Predefined Alerts
This topic lists the stock and predefined alerts that are shipped with Acceldata.
Airflow Alerts
Alert Name | Description | Configuration |
---|---|---|
Airflow_Endpoint_Check | This alert checks whether the system hosting Airflow is active or not. | Severity: "Critical", Execution Interval: "60" |
MemSQL Alerts
Alert Name | Description | Configuration |
---|---|---|
MEMSQL_AGGREGATOR_ROUNDTRIP_LATENCY_GT | Checks whether the MemSQL average roundtrip latency for a query is greater than 30 seconds. | Severity: "High", Execution Interval: "30" |
MEMSQL_QUERY_FAILED | Checks whether the MemSQL query failed. If yes, check the error code and error message in Acceldata query view. You can search for the query using the 'id' field. | Severity: "Medium", Execution Interval: "30" |
MEMSQL_QUERY_MEMUSAGE | Checks whether the MemSQL query is holding the memory for more than 5 minutes. | Severity: "Medium", Execution Interval: "30" |
MEMSQL_QUERY_NETWORK_BYTES | Checks whether the MemSQL query transferred more than 512 MB on the network. | Severity: "Low", Execution Interval: "30" |
MEMSQL_QUERY_EXEC_TIME | Checks whether the MemSQL Query is taking more than 15 minutes. | Severity: "Medium", Execution Interval: "30" |
MEMSQL_PIPELINE_BATCH_TIME | Checks whether the MemSQL Pipeline batch time is greater than 15 minutes. | Severity: "High", Execution Interval: "30" |
MEMSQL_AGGREGATOR_OPEN_CONNECTIONS | Checks whether the number of open MemSQL connections is greater than 30. | Severity: "Critical", Execution Interval: "30" |
MEMSQL_QUERY_NETWORK_TIME | Checks whether the MemSQL query took more than 3 minutes on network data transfer. | Severity: "Medium", Execution Interval: "30" |
MEMSQL_QUERY_READ_DISK_BYTES | Checks whether the MemSQL query read more than 512 MB of data from the disk. | Severity: "Low", Execution Interval: "30" |
MEMSQL_USER_TOO_MANY_QUERIES | Checks whether the MemSQL user fired more than 25 queries in the last 1 minute. | Severity: "Medium", Execution Interval: "60" |
MEMSQL_MAX_MEMORY_USED | Checks whether MemSQL memory_used_mb is close to max_memory_mb. If memory usage increases further, query execution stops and the server may be terminated by query allocations that exceed this limit. | Severity: "Critical", Execution Interval: "30" |
MEMSQL_LEAVES_OPEN_CONNECTIONS | Checks whether the MemSQL leaves have too many open connections. Each connection to the master aggregator opens as many connections towards the leaf as there are partitions. This depends on the NOFILE ulimit. | Severity: "High", Execution Interval: "30" |
MEMSQL_QUERY_LOCK_TIME | Checks whether the MemSQL query locked rows for more than 30 seconds. | Severity: "Medium", Execution Interval: "30" |
MEMSQL_MAX_TABLE_MEM_USED | Checks whether the MemSQL table_memory_used is close to maximum_table_memory. If yes, MemSQL becomes read-only. You can execute only SELECT and DELETE queries once the limit is reached. | Severity: "High", Execution Interval: "30" |
MEMSQL_NODE_STATUS | Checks the status of MemSQL nodes. | Severity: "Critical", Execution Interval: "60" |
MEMSQL_PIPELINE_BATCH_FAILED_ALERT | Checks whether a MemSQL Pipeline batch has failed while being loaded into the database. | Severity: "High", Execution Interval: "60" |
MEMSQL_DAEMON_ENDPOINT_CHECK | Checks whether the MemSQL daemon(s) are alive or not. | Severity: "Critical", Execution Interval: "10" |
Impala Alerts
Alert Name | Description | Configuration |
---|---|---|
IMPALA_QUERIES_DURATION_GT_3MIN | Checks whether the query duration is greater than 3 minutes. | Severity: "Critical", Execution Interval: "120" |
IMPALA_FAILED_QUERIES | Checks for Impala queries with a failed state. | Severity: "High", Execution Interval: "120" |
IMPALA_DAEMON_ENDPOINT_CHECK | Checks whether the Impala Daemon is alive or not. | Severity: "Critical", Execution Interval: "60" |
IMPALA_NOT_ALLOWED_QUERIES | Checks specific string patterns in Impala queries, triggering alerts when matched. | Severity: "Critical", Execution Interval: "60" |
IMPALA_HIGH_IN_FLIGHT_FRAGMENTS | The number of Impala query fragment instances that are currently running is high. | Severity: "Medium", Execution Interval: "300" |
IMPALA_DDL_QUERIES_GT_100 | When the count of Impala DDL queries is greater than 100 in the last 10 minutes. | Severity: "Medium", Execution Interval: "120" |
IMPALAD_EXPIRED_QUERIES_GT_100 | When Impala expired queries are greater than 100 in the Impala Daemon. | Severity: "High", Execution Interval: "120" |
IMPALAD_JVM_CURRENT_USAGE_BYTES_GT_2GB | When the JVM's current usage is greater than 2 GB in the Impala Daemon. | Severity: "Critical", Execution Interval: "120" |
IMPALAD_SPILLED_QUERIES | When there are spilled queries in Impala Daemons. | Severity: "High", Execution Interval: "120" |
IMPALAD_TOTAL_BYTES_WRITTEN_GT_2GB | When the total bytes written is greater than 2 GB in the Impala Daemon. | Severity: "High", Execution Interval: "120" |
IMPALAD_TOTAL_USED_MEMORY_GT_2GB | When the total used memory is greater than 2 GB in the Impala Daemon. | Severity: "High", Execution Interval: "120" |
IMPALA_NO_DATA_ALERT | This alert is raised when Impala data is not pushed. | Severity: "Medium", Execution Interval: "120" |
ZooKeeper Alerts
Alert Name | Description | Configuration |
---|---|---|
ZOOKEEPER_SERVER_ENDPOINT_CHECK | Checks whether the zookeeper server is alive or not. | Severity: "Critical", Execution Interval: "10" |
Spark Alerts
Alert Name | Description | Configuration |
---|---|---|
SPARK_HIVETHRIFT_SERVER_ENDPOINT_CHECK | Checks whether the 'spark2 hivethriftserver' is alive or not. | Severity: "Critical", Execution Interval: "10" |
SPARK2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the 'spark2 jobhistoryserver' is alive or not. | Severity: "Critical", Execution Interval: "10" |
Kafka Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_STALL_TOPICS | Checks for all the topics with no data in them. | Severity: "Critical", Execution Interval: "60" |
KAFKA_BROKER_ENDPOINT_CHECK | Checks whether the kafka broker is alive or not. | Severity: "Critical", Execution Interval: "10" |
KAFKA_UNCLEAN_LEADER_ELECTION | When the leader for a partition is no longer available and no in-sync replica exists, the election of a new leader is called unclean. In most cases, this results in data loss. | Severity: "Medium", Execution Interval: "30" |
KAFKA_REQUEST_HANDLER_IDLE_LOW | The idle ratio is between 0 and 1. The lower this number, the more loaded the broker is. In practice, idle ratios lower than 20% indicate a potential problem, and lower than 10% is usually an active performance problem (see the sketch after this table). | Severity: "Medium", Execution Interval: "30" |
KAFKA_BROKER_SKEWED | If a Kafka broker is processing more records across all topics compared to any other broker, the broker is identified as skewed. | Severity: "Medium", Execution Interval: "30" |
KAFKA_TOPIC_HIGH_DATA_THRESHOLD | Checks whether the Kafka topic is receiving an unusually high number of messages. | Severity: "High", Execution Interval: "30" |
KAFKA_NO_DATA_ON_TOPIC | Checks whether a Kafka topic doesn't receive data for a configured interval of time. | Severity: "High", Execution Interval: "30" |
KAFKA_ACTIVE_CONTROLLER | Only one broker must always be the controller in a cluster. Any value other than 1 means that administrative tasks, such as partition moves, cannot be executed. | Severity: "High", Execution Interval: "30" |
KAFKA_OFFLINE_PARTITIONS | If, after successful leader election, the leader for the partition dies, then the partition moves to an offline partition state. | Severity: "High", Execution Interval: "30" |
KAFKA_UNDER_REPLICATED_PARTITIONS | If a broker has a topic that is not replicated enough times, the probability of data loss increases when replicas fail or die. | Severity: "High", Execution Interval: "30" |
KAFKA_ZOOKEEPER_REQUEST_LATENCY_MS | Checks whether the ZooKeeper request latency exceeds the specified value (in ms). | Severity: "High", Execution Interval: "120" |
KAFKA_READ_WRITE_SKEWNESS | This alert is triggered when a broker experiences disproportionate read and write activity compared to other brokers. | Severity: "Critical", Execution Interval: "120" |
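As a small illustration of the rule of thumb stated in the KAFKA_REQUEST_HANDLER_IDLE_LOW row above, the following sketch classifies an idle-ratio value using the 20% and 10% thresholds mentioned there. It is not Pulse's implementation; the function name and return labels are assumptions for illustration only.

```python
def classify_idle_ratio(idle_ratio: float) -> str:
    """Classify a Kafka request handler idle ratio (0-1) per the thresholds above."""
    if not 0.0 <= idle_ratio <= 1.0:
        raise ValueError("idle ratio must be between 0 and 1")
    if idle_ratio < 0.10:
        return "active performance problem"
    if idle_ratio < 0.20:
        return "potential problem"
    return "broker load looks acceptable"

print(classify_idle_ratio(0.15))  # prints "potential problem"
```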
Kafka Connect Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_CONNECT_WORKER_ENDPOINT_CHECK | This alert checks whether the Kafka Connect worker is active or not. | Severity: "Critical", Execution Interval: "60" |
KAFKA_CONNECT_OFFSET_COMMIT_FAILURE | This alert checks for commit failures to the Kafka topic during data ingestion from an external system (in the case of a Source connector), or commit failures while reading data from Kafka and writing to an external system (in the case of a Sink connector). | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_CONNECTOR_STARTUP_FAILURE | This alert checks if the worker's connectors are failing to start. | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_TASK_STARTUP_FAILURE | This alert checks if the worker's tasks are failing to start. | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_NO_DATA_ALERT | This alert is raised when the Kafka Connect data is not pushed. | Severity: "Medium", Execution Interval: "120" |
Kafka Cruise Control Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_CRUISE_CONTROL_ENDPOINT_CHECK | The alert checks whether the Kafka Cruise Control node is active or not. | |
KAFKA_CRUISE_CONTROL_FETCH_METRIC_FAILURE | This alert checks for failures in fetching partition-level metrics from the Kafka topic by Kafka Cruise Control's MetricFetcherManager Partition Samples Fetcher. | |
KAFKA_CRUISE_CONTROL_NO_DATA_ALERT | This alert gets triggered when Kafka Cruise Control data is not pushed. | |
Kafka 3 Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_STALL_TOPICS | Checks for all the topics with no data in them. | Severity: "Critical", Execution Interval: "60" |
KAFKA_BROKER_ENDPOINT_CHECK | Checks whether the kafka broker is alive or not. | Severity: "Critical", Execution Interval: "10" |
KAFKA_UNCLEAN_LEADER_ELECTION | When the leader for a partition is no longer available and no in-sync replica exists, the election of a new leader is called unclean. In most cases, this results in data loss. | Severity: "Medium", Execution Interval: "30" |
KAFKA_REQUEST_HANDLER_IDLE_LOW | The idle ratio is between 0 and 1. The lower this number, the more loaded the broker is. In practice, idle ratios lower than 20% indicate a potential problem, and lower than 10% is usually an active performance problem. | Severity: "Medium", Execution Interval: "30" |
KAFKA_BROKER_SKEWED | If a Kafka broker is processing more records across all topics compared to any other broker, the broker is identified as skewed. | Severity: "Medium", Execution Interval: "30" |
KAFKA_TOPIC_HIGH_DATA_THRESHOLD | Checks whether the Kafka topic is receiving an unusually high number of messages. | Severity: "High", Execution Interval: "30" |
KAFKA_NO_DATA_ON_TOPIC | Checks whether a Kafka topic doesn't receive data for a configured interval of time. | Severity: "High", Execution Interval: "30" |
KAFKA_ACTIVE_CONTROLLER | Only one broker must always be the controller in a cluster. Any value other than 1 means that administrative tasks, such as partition moves, cannot be executed. | Severity: "High", Execution Interval: "30" |
KAFKA_OFFLINE_PARTITIONS | If, after successful leader election, the leader for the partition dies, then the partition moves to an offline partition state. | Severity: "High", Execution Interval: "30" |
KAFKA_UNDER_REPLICATED_PARTITIONS | If a broker has a topic that is not replicated enough times, the probability of data loss increases when replicas fail or die. | Severity: "High", Execution Interval: "30" |
KAFKA_ZOOKEEPER_REQUEST_LATENCY_MS | Checks whether the ZooKeeper request latency exceeds the specified value (in ms). | Severity: "High", Execution Interval: "120" |
KAFKA_READ_WRITE_SKEWNESS | This alert is triggered when a broker experiences disproportionate read and write activity compared to other brokers. | Severity: "Critical", Execution Interval: "120" |
Kafka 3 Connect Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_CONNECT_WORKER_ENDPOINT_CHECK | This alert checks whether the Kafka Connect worker is active or not. | Severity: "Critical", Execution Interval: "60" |
KAFKA_CONNECT_OFFSET_COMMIT_FAILURE | This alert checks for commit failures to the Kafka topic during data ingestion from an external system (in the case of a Source connector), or commit failures while reading data from Kafka and writing to an external system (in the case of a Sink connector). | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_CONNECTOR_STARTUP_FAILURE | This alert checks if the worker's connectors are failing to start. | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_TASK_STARTUP_FAILURE | This alert checks if the worker's tasks are failing to start. | Severity: "Critical", Execution Interval: "30" |
KAFKA_CONNECT_NO_DATA_ALERT | This alert is raised when the Kafka Connect data is not pushed. | Severity: "Medium", Execution Interval: "120" |
Kafka 3 Cruise Control Alerts
Alert Name | Description | Configuration |
---|---|---|
KAFKA_CRUISE_CONTROL_ENDPOINT_CHECK | The alert checks whether the Kafka Cruise Control node is active or not. | |
KAFKA_CRUISE_CONTROL_FETCH_METRIC_FAILURE | This alert checks for failures in fetching partition-level metrics from the Kafka topic by Kafka Cruise Control's MetricFetcherManager Partition Samples Fetcher. | |
KAFKA_CRUISE_CONTROL_NO_DATA_ALERT | This alert gets triggered when Kafka Cruise Control data is not pushed. | |
Kudu Alerts
Alert Name | Description | Configuration |
---|---|---|
Kudu_Master_Endpoint_Check | Triggers when the Kudu Master service is unreachable. | Severity: "Critical", Execution Interval: "60" |
Kudu_Tablet_Server_Endpoint_Check | Triggers when the Kudu Tablet Server service is unreachable. | Severity: "Critical", Execution Interval: "60" |
Schema Registry Alerts
Alert Name | Description | Configuration |
---|---|---|
SCHEMA_REGISTRY_ADMIN_ENDPOINT_CHECK | This alert checks whether the registry admin is active or not. | Severity: "High", Execution Interval: "30" |
SCHEMA_REGISTRY_SERVER_ENDPOINT_CHECK | This alert checks whether the registry server is active or not. | Severity: "Critical", Execution Interval: "60" |
SCHEMA_REGISTRY_HTTP_CLIENT_ERROR | This alert checks the schema registry client errors (4xx HTTP responses). | Severity: "Medium", Execution Interval: "30" |
SCHEMA_REGISTRY_HTTP_SERVER_ERROR | This alert checks the schema registry internal server errors (5xx HTTP responses). | Severity: "Medium", Execution Interval: "30" |
SCHEMA_REGISTRY_ERRORS | This alert checks the schema registry server errors. | Severity: "Medium", Execution Interval: "30" |
SCHEMA_REGISTRY_NO_DATA_ALERT | This alert gets raised when the Schema Registry data is not pushed. | Severity: "Medium", Execution Interval: "120" |
HBase Alerts
Alert Name | Description | Configuration |
---|---|---|
HBASE_MASTER_ENDPOINT_CHECK | Checks whether the HBase master is alive or not. | Severity: "Critical", Execution Interval: "10" |
HBASE_REGIONSERVER_ENDPOINT_CHECK | Checks whether the HBase region server is alive or not. | Severity: "Critical", Execution Interval: "10" |
HBASE_REGIONSERVER_TABLES_COMPACTION_TIME_ALERT | This alert is raised when the 95th percentile compaction time is more than 60 seconds over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60" |
HBASE_REGIONSERVER_PERCENT_LOCAL_FILE_ALERT | This alert is raised when the local file percentage is less than 80 percent per host over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60" |
HBASE_REGIONSERVER_GC_ALERT | This alert is raised when the GC time is greater than equal to 60 seconds over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60" |
HBASE_CALL_TIME_95TH_PERCENTILE_ALERT | This alert is raised when the 95th percentile call time is more than 60 seconds. | Severity: "Critical", Execution Interval: "60" |
HBASE_ZERO_ACTIVE_MASTER | This alert is raised when zero active HBase masters are detected. | Severity: "Critical", Execution Interval: "60" |
REGION_SERVER_DEAD_ALERT | This alert is raised if any region server goes down. | Severity: "Critical", Execution Interval: "10" |
YARN Alerts
Alert Name | Description | Configuration |
---|---|---|
YARN_KILLED_APPLICATION_ALERT | Checks whether the last YARN application id status is killed or not. | Severity: "High", Execution Interval: "10" |
YARN_APPTIMELINE_SERVER_ENDPOINT_CHECK | Checks whether the YARN apptimeline_server is alive or not. | Severity: "Critical", Execution Interval: "60" |
YARN_NODEMANAGER_ENDPOINT_CHECK | Checks whether the YARN nodemanager is alive or not. | Severity: "Critical", Execution Interval: "10" |
YARN_RESOURCEMANAGER_ENDPOINT_CHECK | Checks whether the YARN resourcemanager is alive or not. | Severity: "Critical", Execution Interval: "10" |
YARN_QUEUE_CAPACITY_USAGE | Checks if the absolute capacity of the queue is higher than the defined threshold during the specified time period. | Severity: "Critical" |
YARN_LONG_RUNNING_JOB_WITH_FILTERS | Checks for YARN jobs running for long durations. | |
Hive Alerts
Alert Name | Description | Configuration |
---|---|---|
HIVE_WEBHCATSERVER_ENDPOINT_CHECK | Checks whether the Hive webhcat_server is alive or not. | Severity: "Critical", Execution Interval: "10" |
HIVE_METASTORE_ENDPOINT_CHECK | Checks whether the Hive metastore is alive or not. | Severity: "Critical", Execution Interval: "10" |
HIVE_HIVESERVER2_ENDPOINT_CHECK | Checks whether the hiveserver2 is alive or not. | Severity: "Critical", Execution Interval: "10" |
HIVE_USER_EXECUTING_TOO_MANY_LLAP_QUERIES | Checks whether the Hive user has executed more than 50 LLAP queries in the last 15 minutes including running, failed, and completed queries. | Severity: "High", Execution Interval: "30" |
HIVE_USER_TOO_MANY_RUNNING_LLAP_QUERIES | Checks whether the Hive user has more than 20 LLAP queries in RUNNING state in the last 15 minutes. | Severity: "High", Execution Interval: "30" |
HIVE_LLAP_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive LLAP query has spilled more than 10 thousand records to the disk. This means the memory used exceeds the limit that is defined and reserved for the map output buffer. Spilled records should be equal to zero, which is good for memory and IO performance. | Severity: "Medium", Execution Interval: "30" |
HIVE_LLAP_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive LLAP query has a shuffle greater than 1 GB. Shuffles cannot be avoided, but they can cause the query to slow down. | Severity: "Medium", Execution Interval: "30" |
HIVE_LLAP_QUERY_RUNNING_GT_15MIN | Checks whether the Hive query is in a running state for more than 15 minutes. | Severity: "High", Execution Interval: "30" |
HIVE_LLAP_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive LLAP query ran for more than 4 hours. | Severity: "Medium", Execution Interval: "30" |
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30" |
HIVE_LLAP_QUERY_INPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million input records. | Severity: "Medium", Execution Interval: "30" |
HIVE_LLAP_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive LLAP query has written more than 1GB of data. | Severity: "Medium", Execution Interval: "30" |
HIVE_LLAP_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive LLAP query read more than 1GB of data. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive query has written more than 1GB of data. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive query read more than 1GB of data. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive query contains shuffles greater than 1GB. Shuffles cannot be avoided but can cause queries to slow down. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive query has spilled more than 10 thousand records to the disk. This means the memory used exceeds the limit that is defined and reserved for the map output buffer. Spilled records should be equal to zero, which is good for memory and IO performance. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive query ran for more than 4 hours. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_OUTPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERY_INPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million input records. | Severity: "Medium", Execution Interval: "30" |
HIVE_QUERIES_FAILING | Checks whether the number of Hive queries failing in the last one hour is greater than 10. | Severity: "Medium", Execution Interval: "30" |
HIVE_HIGH_QUERY_NUMBER | Checks whether Hive is experiencing a high query count of more than 50 queries. | Severity: "Medium", Execution Interval: "30" |
HIVE_USER_TOO_MANY_RUNNING_QUERIES | Checks whether the Hive user has more than 20 queries in the RUNNING state in the last 15 minutes. | Severity: "High", Execution Interval: "30" |
HIVE_USER_EXECUTING_TOO_MANY_QUERIES | Checks whether the Hive user has executed more than 50 queries in the last 15 minutes, including running, failed, and completed queries. | Severity: "High", Execution Interval: "30" |
HIVE_QUERY_RUNNING_GT_15MIN | Checks whether the Hive query is in a running state for more than 15 minutes. | Severity: "Medium", Execution Interval: "30" |
HIVE_METASTORE_JVM_GC | Monitors the time spent by metastore in Java Garbage Collection. | Severity: "Medium", Execution Interval: "120" |
HIVE_METASTORE_JVM_MEMORY | Monitors the JVM memory used by the metastore. | Severity: "Medium", Execution Interval: "120" |
HIVE_METASTORE_PROCESS_CPU | Monitors the CPU used by the metastore process. | Severity: "Medium", Execution Interval: "120" |
HIVE_SERVER_JVM_GC | Monitors the time spent by the server in Java Garbage Collection. | Severity: "Medium", Execution Interval: "120" |
HIVE_SERVER_JVM_MEMORY | Monitors the JVM memory used by the server. | Severity: "Medium", Execution Interval: "120" |
HIVE_SERVER_PROCESS_CPU | Monitors the CPU used by the server process. | Severity: "Medium", Execution Interval: "120" |
HIVE_SERVER_INTERACTIVE_JAVA_OS | Monitors the time spent by the server in running Java. | Severity: "Medium", Execution Interval: "120" |
HIVE_SERVER_INTERACTIVE_JVM_GC | Monitors the time spent by the server in Garbage Collection. | Severity: "Medium", Execution Interval: "120" |
HIVE_INTERACTIVE_ENDPOINT_CHECK | Monitors whether Hive Server2 Interactive is working. | Severity: "Critical", Execution Interval: "10" |
HIVE_QUERIES_HDFS_BYTES_READ | Monitors if the HDFS bytes read is higher than the limit. You can set the HDFS bytes read limit. | Severity: "Medium", Execution Interval: "120" |
HIVE_QUERIES_HDFS_BYTES_WRITTEN | Monitors if the HDFS bytes written is higher than the limit. You can set the HDFS bytes written limit. | Severity: "Medium", Execution Interval: "120" |
MapReduce Alerts
Alert Name | Description | Configuration |
---|---|---|
MAPREDUCE2_JOBHISTORY_UI_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistory user interface is alive or not. | Severity: "Critical", Execution Interval: "60" |
MAPREDUCE2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistoryserver is alive or not. | Severity: "Critical", Execution Interval: "60" |
NiFi Alerts
Alert Name | Description | Configuration |
---|---|---|
PROCESSOR_STATUS_CHECK | This alert gets triggered when a processor stops or goes into an invalid state by sending the Processor name and NiFi node in the alert body. This relies on ProcessorStatus as a metric that has been added to the new NAR file. | Severity: "High", Execution Interval: "120" |
NIFI_CLUSTER_CONNECTION_STATUS_CHECK | This alert gets triggered when a NiFi node gets disconnected from the NiFi cluster. It is important to note that this is different from the endpoint check, as the NiFi process might still be up and running but might have stopped sending heartbeats to the other nodes, marking it as disconnected. This relies on the | Severity: "High", Execution Interval: "120" |
NiFi Registry Alerts
Alert Name | Description | Configuration |
---|---|---|
Nifi_Registry_Endpoint_Check | Checks whether the system hosting the NiFi Registry service is active or not. | Severity: "Critical", Execution Interval: "60" |
Pinot Alerts
Alert Name | Description | Configuration |
---|---|---|
Pinot Broker Endpoint Check | Triggers an alert when the Pinot Broker service is unreachable, indicating potential service downtime or network issues. | Severity: "Critical", Execution Interval: "60" |
Pinot Controller Endpoint Check | Triggers an alert when the Pinot Controller service is unreachable, affecting coordination and metadata operations. | Severity: "Critical", Execution Interval: "60" |
Pinot Server Endpoint Check | Triggers an alert when the Pinot Server service is unreachable, which may disrupt data queries and ingestion. | Severity: "Critical", Execution Interval: "60" |
Ranger and Ranger KMS Alerts
Alert Name | Description | Configuration |
---|---|---|
Ranger Admin Endpoint Check | Triggered when the Ranger Admin service is unavailable, indicating potential service downtime or network issues. | Severity: "Critical", Execution Interval: "60" |
Ranger KMS Endpoint Check | Triggered when the Ranger KMS service is unavailable, affecting coordination and metadata operations. | Severity: "Critical", Execution Interval: "60" |
Trino Alerts
Alert Name | Description | Configuration |
---|---|---|
Trino Coordinator Endpoint Check | Triggered when the Trino Coordinator service becomes unreachable, indicating potential downtime or network issues. | Severity: "Medium", Execution Interval: "30" |
Trino Worker Endpoint Check | Triggered when the Trino Worker service becomes unreachable, which may impact query processing and workload distribution. | Severity: "Medium", Execution Interval: "30" |
HDFS Alerts
Alert Name | Description | Configuration |
---|---|---|
HDFS_SECONDARYNAMENODE_ENDPOINT_CHECK | Checks whether the HDFS secondary namenode is alive or not. | Severity: "Critical", Execution Interval: "60" |
HDFS_DATANODES_ENDPOINT_CHECK | Checks whether the HDFS datanode is alive or not. | Severity: "Critical", Execution Interval: "60" |
HDFS_NAMENODE_ENDPOINT_CHECK | Checks whether the HDFS namenode is alive or not. | Severity: "Critical", Execution Interval: "60" |
HDFS_JOURNALNODE_OPEN_FILE_DESCRIPTORS_COUNT_ALERT | Monitors the number of file descriptors used. | Severity: "Critical", Execution Interval: "120" |
HDFS_JOURNALNODE_USED_SWAP_SPACE_SIZE | Monitors the amount of swap memory used. | Severity: "Low", Execution Interval: "120" |
HDFS_JOURNALNODE_GC_MILLIS_ALERT | Monitors the time spent in Java Garbage Collection. | Severity: "Low", Execution Interval: "120" |
HDFS_JOURNALNODE_SYNC_LATENCY_ALERT | Monitors the fsync latency of the JournalNode. | Severity: "Low", Execution Interval: "120" |
HDFS_JOURNALNODE_ENDPOINT_CHECK | Checks whether the JournalNode is alive or not. | Severity: "Critical", Execution Interval: "60" |
FS_LAST_MODIFIED_FILE_COUNT | Checks various given paths to find the number of files modified prior to the given time period. | Severity: "Critical", Execution Interval: "120" |
HDFS_PATH_USAGE | Checks file sizes in a specified path and triggers an alert if they exceed the defined threshold fraction of the total HDFS storage (see the worked example after this table). The pathanalysis.txt file is present under $AcceloHome/data/fsanalytics/cluster/. | Severity: "Critical", Execution Interval: "120" |
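The HDFS_PATH_USAGE check above compares a path's size against a fraction of total HDFS storage. As a hypothetical worked example (the 0.10 fraction and the sizes are assumptions for illustration, not Pulse defaults):

```python
# Hypothetical numbers: 100 TiB of total HDFS storage, a 0.10 threshold
# fraction, and a monitored path currently holding 12 TiB.
TIB = 1024**4
total_hdfs_bytes = 100 * TIB
threshold_fraction = 0.10
path_bytes = 12 * TIB

# The alert fires when the path's usage exceeds the threshold fraction of total storage.
if path_bytes > threshold_fraction * total_hdfs_bytes:
    print("HDFS_PATH_USAGE would trigger: 12 TiB > 10 TiB")
```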
LLAP Alerts
Alert Name | Description | Configuration |
---|---|---|
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30" |
LLAP_HIGH_QUERY_NUMBER | Checks whether Hive LLAP is experiencing a high query count of more than 50 queries. | Severity: "High", Execution Interval: "30" |
LLAP_QUERIES_FAILING | Checks whether the number of Hive LLAP queries failing in the last one hour is greater than 10. | Severity: "High", Execution Interval: "30" |
Host Alerts
Alert Name | Description | Configuration |
---|---|---|
AVAILABLE_MEMORY_ALERT | This alert is raised if the available memory on a host is less than 10 percent for the last 60 seconds. | Severity: "Critical", Execution Interval: "60" |
NETWORK_USAGE_ALERT | Checks if the average of total bytes received and sent is greater than 9.0 GB over 60 seconds. | Severity: "Critical", Execution Interval: "60" |
DISK_USAGE_ALERT | This alert is raised if the percentage of disk usage in the system for the last 60 minutes per host per mount path is more than 70 percent. | Severity: "Critical", Execution Interval: "60" |
CPU_USAGE_ALERT | This alert is raised when the CPU usage is higher than 50 percent on any host in the last 60 seconds. | Severity: "Critical", Execution Interval: "60" |
Ozone Alerts
Alert Name | Description | Configuration |
---|---|---|
OZONE_DATANODE_ENDPOINT_CHECK | This alert checks whether the Ozone Datanode is alive or not. | Severity: "Critical", Execution Interval: "60" |
OZONE_MANAGER_ENDPOINT_CHECK | This alert checks whether the Ozone Manager is alive or not. | Severity: "Critical", Execution Interval: "60" |
OZONE_RECON_ENDPOINT_CHECK | This alert checks whether the Ozone Recon web UI is alive or not. | Severity: "Critical", Execution Interval: "60" |
OZONE_S3GATEWAY_ENDPOINT_CHECK | This alert checks whether the Ozone S3 gateway is alive or not. | Severity: "Critical", Execution Interval: "60" |
OZONE_SCM_ENDPOINT_CHECK | This alert checks whether the Ozone Storage Container Manager is alive or not. | Severity: "Critical", Execution Interval: "60" |
To create an alert, see Alerts.
Predefined Alerts
Stock alerts and custom alerts are limited to a single query condition. To handle more complex query conditions, Pulse provides a built-in library of predefined alerts. Each alert is defined by an alert definition, which specifies the alert type and is evaluated periodically at the defined execution interval against the defined thresholds. Predefined alerts are provided to the user as stock alerts.
The user can modify the parameters of a predefined alert, but not its function.
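To make the relationship between alert type, severity, execution interval, and threshold concrete, here is a minimal, hypothetical sketch in Python of how such an alert definition could be modeled and evaluated. It is illustrative only and does not reflect Pulse's actual alert definition schema or API; the class and field names are assumptions, and the sample values are borrowed from the YARN_PENDING_JOBS_COUNT entry in the table below.

```python
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    # Hypothetical model of an alert definition; not Pulse's schema.
    name: str
    severity: str
    execution_interval_s: int  # seconds between evaluations of the condition
    threshold: float           # value the monitored metric is compared against

def breaches(alert: AlertDefinition, metric_value: float) -> bool:
    """Return True when the monitored metric crosses the alert's threshold."""
    return metric_value > alert.threshold

# Sample values borrowed from the YARN_PENDING_JOBS_COUNT row below:
# severity "High", execution interval 120, threshold of 15 pending jobs.
pending_jobs_alert = AlertDefinition(
    name="YARN_PENDING_JOBS_COUNT",
    severity="High",
    execution_interval_s=120,
    threshold=15,
)

if __name__ == "__main__":
    current_pending_jobs = 18  # stand-in sample; a real check would query YARN
    if breaches(pending_jobs_alert, current_pending_jobs):
        print(f"[{pending_jobs_alert.severity}] {pending_jobs_alert.name} triggered")
```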
The following table lists and describes the available predefined alerts:
Alert | Description | Configuration |
---|---|---|
HDFS_NAMENODE_FAILOVER | Alert when the active namenode transitions to a standby namenode. | Severity: "Critical", Execution Interval: "120" |
YARN_DEAD_NODEMANAGERS_PERCENTAGE | Alert when the percentage of dead nodemanagers is beyond a threshold. | Severity: "High", Execution Interval: "120" |
YARN_RESOURCE_MANAGER_FAILOVER_ALERT | Alert when the active resource manager transitions to a standby resource manager. | Severity: "High", Execution Interval: "120" |
HDFS_STALE_DATANODE_PERCENTAGE | Alert when the percentage of stale datanodes is more than 5 percent. | Severity: "Critical", Execution Interval: "120" |
YARN_APP_FAILED_PERCENTAGE | Alert when the percentage of failed applications is beyond a threshold. | Severity: "High", Execution Interval: "120" |
HBASE_READ_SCAN_LATENCY | Alert when the 99th percentile scan time is more than the threshold. | Severity: "Critical", Execution Interval: "120" |
HDFS_DEAD_DATANODE_PERCENTAGE | Alert when the percentage of dead datanodes is more than 5 percent. | Severity: "Critical", Execution Interval: "120" |
HBASE_READ_GET_LATENCY | Alert when the 99th percentile read (get) latency is more than the threshold. | Severity: "Critical", Execution Interval: "120" |
KAFKA_PARTITION_OFFLINE_LEADER | Alert when a topic partition leader goes offline. | Severity: "Critical", Execution Interval: "120" |
KAFKA_CONSUMER_GROUP_NO_CONSUMPTION | Alert when the Kafka Consumer group is not consuming data from the topic. | Severity: "Critical", Execution Interval: "60" |
HBASE_REGION_STATE_CHANGE | Alert compares the HBase Region State between the current and previous durations (as defined by you). Every time the HBase Region State changes, an alert is triggered. | Severity: "Critical", Execution Interval: "120" |
YARN_LONG_RUNNING_JOB | Alert when long-running jobs take longer than the threshold time. | Severity: "Critical", Execution Interval: "120" |
YARN_PENDING_JOBS_COUNT | Alert when there are 15 YARN jobs in pending status. | Severity: "High", Execution Interval: "120" |
YARN_PENDING_APPS_COMPARISON | Alert compares the number of YARN jobs in the pending state between a custom time period (defined by you) and the current time. You can also set a custom percentage threshold. When the number of pending YARN jobs in the custom time frame exceeds that of the current time by the defined percentage threshold, the alert is raised. | Severity: "High", Execution Interval: "120" |
HIVE_SERVER_INTERACTIVE_HEAP_USAGE | Monitors if the Hive server interactive JVM heap usage is more than a predefined threshold. | Severity: "Critical", Execution Interval: "120" |
HBASE_MASTER_FAILOVER | When the active HBase master is switched to another standby HBase master, an alert is generated. | Severity: "Critical", Execution Interval: "120" |
HIVE_LLAP_ZOMBIE_DAEMON_CHECK | When an LLAP zombie daemon process is still active, an alert is generated. | Severity: "Critical", Execution Interval: "30" |
SPARK_EXECUTOR_NODE_BLACKLISTED | Checks if any node is blocked by Spark and raises an alert if a node is blocked. | Severity: "High", Execution Interval: "300" |
YARN_APP_RESOURCE_USAGE | Checks if the vcore and memory exceed the threshold value for all applications. You can set the threshold value. | Severity: "High", Execution Interval: "60" |
YARN_LONG_RUNNING_JOB_QUEUE | You can specify multiple queue names (comma separated) and a threshold duration (in seconds). This alert checks if any of the queues exceeds the set threshold. | Severity: "High", Execution Interval: "120" |
FS_SNAPSHOT_ANALYSIS_REPORT | Allows you to specify a time period (in hours) in the Threshold Time field. If the last snapshot time is greater than the value specified in Threshold Time, an alert is raised. | Severity: "Critical", Execution Interval: "300" |
Pinot Alerts
Alert Name | Description | Configuration |
---|---|---|
Pinot_Controller_Interactive_Heap_Usage | Triggers an alert when the Controller's heap usage exceeds the defined threshold. Ensure the 'JVM Heap Maximum' value is configured. | Severity: "Critical", Execution Interval: "120" |
Pinot_Broker_Interactive_Heap_Usage | Triggers an alert when the Broker's heap usage crosses the threshold. Confirm that the 'JVM Heap Maximum' value is set. | Severity: "Critical", Execution Interval: "120" |
Pinot_Server_Interactive_Heap_Usage | Triggers an alert when the Server's heap usage surpasses the set threshold. Make sure the 'JVM Heap Maximum' value is configured. | Severity: "Critical", Execution Interval: "120" |
Ranger and Ranger KMS Alerts
Alert Name | Description | Configuration |
---|---|---|
Ranger Admin Interactive Heap Usage | Triggered when Ranger Admin heap usage exceeds the defined threshold. Ensure the JVM heap maximum value is configured to enable this alert. | Severity: "Critical", Execution Interval: "120" |
Ranger KMS Server Interactive Heap Usage | Triggered when Ranger KMS heap usage exceeds the defined threshold. Ensure the JVM heap maximum value is configured to enable this alert. | Severity: "Critical", Execution Interval: "120" |