Stock and Predefined Alerts

This topic lists the stock alerts that are shipped along with Acceldata.

Airflow Alerts

Alert Name | Description | Configuration
Airflow_Endpoint_Check | This alert checks whether the system hosting Airflow is active or not. | Severity: "Critical", Execution Interval: "60"

MemSQL Alerts

Alert Name | Description | Configuration
MEMSQL_AGGREGATOR_ROUNDTRIP_LATENCY_GT | Checks whether the MemSQL average roundtrip latency for a query is greater than 30 seconds. | Severity: "High", Execution Interval: "30"
MEMSQL_QUERY_FAILED | Checks whether the MemSQL query failed. If yes, check the error code and error message in the Acceldata query view. You can search for the query using the 'id' field. | Severity: "Medium", Execution Interval: "30"
MEMSQL_QUERY_MEMUSAGE | Checks whether the MemSQL query is holding memory for more than 5 minutes. | Severity: "Medium", Execution Interval: "30"
MEMSQL_QUERY_NETWORK_BYTES | Checks whether the MemSQL query transferred more than 512 MB on the network. | Severity: "Low", Execution Interval: "30"
MEMSQL_QUERY_EXEC_TIME | Checks whether the MemSQL query is taking more than 15 minutes. | Severity: "Medium", Execution Interval: "30"
MEMSQL_PIPELINE_BATCH_TIME | Checks whether the MemSQL Pipeline batch time is greater than 15 minutes. | Severity: "High", Execution Interval: "30"
MEMSQL_AGGREGATOR_OPEN_CONNECTIONS | Checks whether the number of open MemSQL connections is greater than 30. | Severity: "Critical", Execution Interval: "30"
MEMSQL_QUERY_NETWORK_TIME | Checks whether the MemSQL query took more than 3 minutes on network data transfer. | Severity: "Medium", Execution Interval: "30"
MEMSQL_QUERY_READ_DISK_BYTES | Checks whether the MemSQL query read more than 512 MB of data from the disk. | Severity: "Low", Execution Interval: "30"
MEMSQL_USER_TOO_MANY_QUERIES | Checks whether the MemSQL user fired more than 25 queries in the last 1 minute. | Severity: "Medium", Execution Interval: "60"
MEMSQL_MAX_MEMORY_USED | Checks whether the MemSQL memory_used_mb is close to max_memory_mb. If the memory usage increases, the query execution stops and the server is terminated by the query allocations that exceed this limit. | Severity: "Critical", Execution Interval: "30"
MEMSQL_LEAVES_OPEN_CONNECTIONS | Checks whether the MemSQL leaves have too many open connections. Each connection to the master aggregator opens as many connections towards the leaf as you have partitions. This depends on the NOFILE ulimit. | Severity: "High", Execution Interval: "30"
MEMSQL_QUERY_LOCK_TIME | Checks whether the MemSQL query locked rows for more than 30 seconds. | Severity: "Medium", Execution Interval: "30"
MEMSQL_MAX_TABLE_MEM_USED | Checks whether the MemSQL table_memory_used is close to maximum_table_memory. If yes, MemSQL becomes read-only. You can execute only SELECT and DELETE queries once the limit is reached. | Severity: "High", Execution Interval: "30"
MEMSQL_NODE_STATUS | Checks the status of MemSQL nodes. | Severity: "Critical", Execution Interval: "60"
MEMSQL_PIPELINE_BATCH_FAILED_ALERT | Checks whether MemSQL Pipeline batches have failed while being loaded into the database. | Severity: "High", Execution Interval: "60"
MEMSQL_DAEMON_ENDPOINT_CHECK | Checks whether the MemSQL daemon(s) are alive or not. | Severity: "Critical", Execution Interval: "10"

Impala Alerts

Alert Name | Description | Configuration
IMPALA_QUERIES_DURATION_GT_3MIN | Checks whether the query duration is greater than 3 minutes. | Severity: "Critical", Execution Interval: "120"
IMPALA_FAILED_QUERIES | Checks for Impala queries with a failed state. | Severity: "High", Execution Interval: "120"
IMPALA_DAEMON_ENDPOINT_CHECK | Checks whether the Impala daemons are alive or not. | Severity: "Critical", Execution Interval: "60"
IMPALA_NOT_ALLOWED_QUERIES | Checks for specific string patterns in Impala queries and triggers an alert when they match. | Severity: "Critical", Execution Interval: "60"
IMPALA_HIGH_IN_FLIGHT_FRAGMENTS | Raised when the number of Impala query fragment instances that are currently running is high. | Severity: "Medium", Execution Interval: "300"
IMPALA_DDL_QUERIES_GT_100 | Raised when the count of Impala DDL queries is greater than 100 in the last 10 minutes. | Severity: "Medium", Execution Interval: "120"
IMPALAD_EXPIRED_QUERIES_GT_100 | Raised when the number of expired queries in the Impala Daemon is greater than 100. | Severity: "High", Execution Interval: "120"
IMPALAD_JVM_CURRENT_USAGE_BYTES_GT_2GB | Raised when the JVM's current usage is greater than 2 GB in the Impala Daemon. | Severity: "Critical", Execution Interval: "120"
IMPALAD_SPILLED_QUERIES | Raised when there are spilled queries in the Impala Daemons. | Severity: "High", Execution Interval: "120"
IMPALAD_TOTAL_BYTES_WRITTEN_GT_2GB | Raised when the total bytes written is greater than 2 GB in the Impala Daemon. | Severity: "High", Execution Interval: "120"
IMPALAD_TOTAL_USED_MEMORY_GT_2GB | Raised when the total used memory is greater than 2 GB in the Impala Daemon. | Severity: "High", Execution Interval: "120"
IMPALA_NO_DATA_ALERT | This alert is raised when Impala data is not pushed. | Severity: "Medium", Execution Interval: "120"

ZooKeeper Alerts

Alert Name | Description | Configuration
ZOOKEEPER_SERVER_ENDPOINT_CHECK | Checks whether the ZooKeeper server is alive or not. | Severity: "Critical", Execution Interval: "10"

Spark Alerts

Alert Name | Description | Configuration
SPARK_HIVETHRIFT_SERVER_ENDPOINT_CHECK | Checks whether the 'spark2 hivethriftserver' is alive or not. | Severity: "Critical", Execution Interval: "10"
SPARK2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the 'spark2 jobhistoryserver' is alive or not. | Severity: "Critical", Execution Interval: "10"

Kafka Alerts

Alert Name | Description | Configuration
KAFKA_STALL_TOPICS | Checks for all topics with no data in them. | Severity: "Critical", Execution Interval: "60"
KAFKA_BROKER_ENDPOINT_CHECK | Checks whether the Kafka broker is alive or not. | Severity: "Critical", Execution Interval: "10"
KAFKA_UNCLEAN_LEADER_ELECTION | When the leader for a partition is no longer available and no in-sync replica exists, the election of a new leader is called unclean. In most cases, this results in data loss. | Severity: "Medium", Execution Interval: "30"
KAFKA_REQUEST_HANDLER_IDLE_LOW | The idle ratio is between 0 and 1. The lower this number, the more loaded the broker is. In practice, idle ratios lower than 20% indicate a potential problem, and ratios lower than 10% usually indicate an active performance problem. | Severity: "Medium", Execution Interval: "30"
KAFKA_BROKER_SKEWED | If a Kafka broker is processing more records across all topics compared to any other broker, the broker is identified as skewed. | Severity: "Medium", Execution Interval: "30"
KAFKA_TOPIC_HIGH_DATA_THRESHOLD | Checks whether the Kafka topic is receiving an unusually high number of messages. | Severity: "High", Execution Interval: "30"
KAFKA_NO_DATA_ON_TOPIC | Checks whether a Kafka topic has received no data for a configured interval of time. | Severity: "High", Execution Interval: "30"
KAFKA_ACTIVE_CONTROLLER | Exactly one broker in a cluster must be the controller at all times. Any value other than 1 means that you cannot execute administrative tasks, such as partition moves. | Severity: "High", Execution Interval: "30"
KAFKA_OFFLINE_PARTITIONS | If, after a successful leader election, the leader for the partition dies, the partition moves to an offline partition state. | Severity: "High", Execution Interval: "30"
KAFKA_UNDER_REPLICATED_PARTITIONS | If a broker has a topic that is not replicated enough times, the probability of data loss from failing or dying replicas increases. | Severity: "High", Execution Interval: "30"
KAFKA_ZOOKEEPER_REQUEST_LATENCY_MS | Checks whether the ZooKeeper request latency exceeds the specified value (in ms). | Severity: "High", Execution Interval: "120"
KAFKS_READ_WRITE_SKEWNESS | This alert is triggered when a broker experiences disproportionate read and write activity compared to other brokers. | Severity: "Critical", Execution Interval: "120"
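
The idle-ratio guidance given for KAFKA_REQUEST_HANDLER_IDLE_LOW can be read as a two-level classification. The following is a minimal sketch of that reading only. The function name and the returned labels are hypothetical and are not part of the Acceldata alert definition; only the 20% and 10% cut-offs come from the alert description.

```python
def classify_idle_ratio(idle_ratio: float) -> str:
    """Classify a Kafka request handler idle ratio (a value from 0.0 to 1.0).

    Hypothetical helper: the labels are illustrative, the cut-offs
    (below 20% and below 10%) come from the alert description.
    """
    if not 0.0 <= idle_ratio <= 1.0:
        raise ValueError("idle ratio must be between 0 and 1")
    if idle_ratio < 0.10:
        return "active performance problem"
    if idle_ratio < 0.20:
        return "potential problem"
    return "ok"
```

A lower ratio means the request handler threads are busy more of the time, which is why the stricter label applies to the smaller value.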

Kafka Connect

Alert Name | Description | Configuration
KAFKA_CONNECT_WORKER_ENDPOINT_CHECK | This alert checks whether the Kafka Connect worker is active or not. | Severity: "Critical", Execution Interval: "60"
KAFKA_CONNECT_OFFSET_COMMIT_FAILURE | This alert checks for commit failures to the Kafka topic while ingesting data from an external system (for a Source connector), or commit failures while reading data from Kafka and writing to an external system (for a Sink connector). | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_CONNECTOR_STARTUP_FAILURE | This alert checks if the worker's connectors are failing to start. | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_TASK_STARTUP_FAILURE | This alert checks if the worker's tasks are failing to start. | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_NO_DATA_ALERT | This alert is raised when the Kafka Connect data is not pushed. | Severity: "Medium", Execution Interval: "120"

Kafka Cruise Control

Alert Name | Description | Configuration
KAFKA_CRUISE_CONTROL_ENDPOINT_CHECK | This alert checks whether the Kafka Cruise Control node is active or not. |
KAFKA_CRUISE_CONTROL_FETCH_METRIC_FAILURE | This alert checks for failures in fetching partition-level metrics from the Kafka topic by Kafka Cruise Control's MetricFetcherManager Partition Samples Fetcher. |
KAFKA_CRUISE_CONTROL_NO_DATA_ALERT | This alert is triggered when Kafka Cruise Control data is not pushed. |

Kafka 3 Alerts

Alert Name | Description | Configuration
KAFKA_STALL_TOPICS | Checks for all topics with no data in them. | Severity: "Critical", Execution Interval: "60"
KAFKA_BROKER_ENDPOINT_CHECK | Checks whether the Kafka broker is alive or not. | Severity: "Critical", Execution Interval: "10"
KAFKA_UNCLEAN_LEADER_ELECTION | When the leader for a partition is no longer available and no in-sync replica exists, the election of a new leader is called unclean. In most cases, this results in data loss. | Severity: "Medium", Execution Interval: "30"
KAFKA_REQUEST_HANDLER_IDLE_LOW | The idle ratio is between 0 and 1. The lower this number, the more loaded the broker is. In practice, idle ratios lower than 20% indicate a potential problem, and ratios lower than 10% usually indicate an active performance problem. | Severity: "Medium", Execution Interval: "30"
KAFKA_BROKER_SKEWED | If a Kafka broker is processing more records across all topics compared to any other broker, the broker is identified as skewed. | Severity: "Medium", Execution Interval: "30"
KAFKA_TOPIC_HIGH_DATA_THRESHOLD | Checks whether the Kafka topic is receiving an unusually high number of messages. | Severity: "High", Execution Interval: "30"
KAFKA_NO_DATA_ON_TOPIC | Checks whether a Kafka topic has received no data for a configured interval of time. | Severity: "High", Execution Interval: "30"
KAFKA_ACTIVE_CONTROLLER | Exactly one broker in a cluster must be the controller at all times. Any value other than 1 means that you cannot execute administrative tasks, such as partition moves. | Severity: "High", Execution Interval: "30"
KAFKA_OFFLINE_PARTITIONS | If, after a successful leader election, the leader for the partition dies, the partition moves to an offline partition state. | Severity: "High", Execution Interval: "30"
KAFKA_UNDER_REPLICATED_PARTITIONS | If a broker has a topic that is not replicated enough times, the probability of data loss from failing or dying replicas increases. | Severity: "High", Execution Interval: "30"
KAFKA_ZOOKEEPER_REQUEST_LATENCY_MS | Checks whether the ZooKeeper request latency exceeds the specified value (in ms). | Severity: "High", Execution Interval: "120"
KAFKS_READ_WRITE_SKEWNESS | This alert is triggered when a broker experiences disproportionate read and write activity compared to other brokers. | Severity: "Critical", Execution Interval: "120"

Kafka 3 Connect

Alert Name | Description | Configuration
KAFKA_CONNECT_WORKER_ENDPOINT_CHECK | This alert checks whether the Kafka Connect worker is active or not. | Severity: "Critical", Execution Interval: "60"
KAFKA_CONNECT_OFFSET_COMMIT_FAILURE | This alert checks for commit failures to the Kafka topic while ingesting data from an external system (for a Source connector), or commit failures while reading data from Kafka and writing to an external system (for a Sink connector). | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_CONNECTOR_STARTUP_FAILURE | This alert checks if the worker's connectors are failing to start. | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_TASK_STARTUP_FAILURE | This alert checks if the worker's tasks are failing to start. | Severity: "Critical", Execution Interval: "30"
KAFKA_CONNECT_NO_DATA_ALERT | This alert is raised when the Kafka Connect data is not pushed. | Severity: "Medium", Execution Interval: "120"

Kafka 3 Cruise Control

Alert Name | Description | Configuration
KAFKA_CRUISE_CONTROL_ENDPOINT_CHECK | This alert checks whether the Kafka Cruise Control node is active or not. |
KAFKA_CRUISE_CONTROL_FETCH_METRIC_FAILURE | This alert checks for failures in fetching partition-level metrics from the Kafka topic by Kafka Cruise Control's MetricFetcherManager Partition Samples Fetcher. |
KAFKA_CRUISE_CONTROL_NO_DATA_ALERT | This alert is triggered when Kafka Cruise Control data is not pushed. |

Kudu Alerts

Alert Name | Description | Configuration
Kudu_Master_Endpoint_Check | Triggers when the Kudu Master service is unreachable. | Severity: "Critical", Execution Interval: "60"
Kudu_Tablet_Server_Endpoint_Check | Triggers when the Kudu Tablet Server service is unreachable. | Severity: "Critical", Execution Interval: "60"

Schema Registry

Alert Name | Description | Configuration
SCHEMA_REGISTRY_ADMIN_ENDPOINT_CHECK | This alert checks whether the registry admin is active or not. | Severity: "High", Execution Interval: "30"
SCHEMA_REGISTRY_SERVER_ENDPOINT_CHECK | This alert checks whether the registry server is active or not. | Severity: "Critical", Execution Interval: "60"
SCHEMA_REGISTRY_HTTP_CLIENT_ERROR | This alert checks the schema registry client errors (4xx HTTP responses). | Severity: "Medium", Execution Interval: "30"
SCHEMA_REGISTRY_HTTP_SERVER_ERROR | This alert checks the schema registry internal server errors (5xx HTTP responses). | Severity: "Medium", Execution Interval: "30"
SCHEMA_REGISTRY_ERRORS | This alert checks the schema registry server errors. | Severity: "Medium", Execution Interval: "30"
SCHEMA_REGISTRY_NO_DATA_ALERT | This alert is raised when the Schema Registry data is not pushed. | Severity: "Medium", Execution Interval: "120"

HBase Alerts

Alert Name | Description | Configuration
HBASE_MASTER_ENDPOINT_CHECK | Checks whether the HBase master is alive or not. | Severity: "Critical", Execution Interval: "10"
HBASE_REGIONSERVER_ENDPOINT_CHECK | Checks whether the HBase region server is alive or not. | Severity: "Critical", Execution Interval: "10"
HBASE_REGIONSERVER_TABLES_COMPACTION_TIME_ALERT | This alert is raised when the 95th percentile compaction time is more than 60 seconds over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60"
HBASE_REGIONSERVER_PERCENT_LOCAL_FILE_ALERT | This alert is raised when the local file percentage is less than 80 percent per host over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60"
HBASE_REGIONSERVER_GC_ALERT | This alert is raised when the GC time is greater than or equal to 60 seconds over a period of 60 seconds. | Severity: "Critical", Execution Interval: "60"
HBASE_CALL_TIME_95TH_PERCENTILE_ALERT | This alert is raised when the 95th percentile call time is more than 60 seconds. | Severity: "Critical", Execution Interval: "60"
HBASE_ZERO_ACTIVE_MASTER | This alert is raised when no active HBase master is detected. | Severity: "Critical", Execution Interval: "60"
REGION_SERVER_DEAD_ALERT | This alert is raised if any region server goes down. | Severity: "Critical", Execution Interval: "10"

YARN Alerts

Alert Name | Description | Configuration
YARN_KILLED_APPLICATION_ALERT | Checks whether the last YARN application id status is killed or not. | Severity: "High", Execution Interval: "10"
YARN_APPTIMELINE_SERVER_ENDPOINT_CHECK | Checks whether the YARN apptimeline_server is alive or not. | Severity: "Critical", Execution Interval: "60"
YARN_NODEMANAGER_ENDPOINT_CHECK | Checks whether the YARN nodemanager is alive or not. | Severity: "Critical", Execution Interval: "10"
YARN_RESOURCEMANAGER_ENDPOINT_CHECK | Checks whether the YARN resourcemanager is alive or not. | Severity: "Critical", Execution Interval: "10"
YARN_QUEUE_CAPACITY_USAGE | Checks if the absolute capacity of the queue is higher than the defined threshold during the specified time period. | Severity: "Critical"
YARN_LONG_RUNNING_JOB_WITH_FILTERS | Checks for YARN jobs running for long durations. |

Hive Alerts

Alert Name | Description | Configuration
HIVE_WEBHCATSERVER_ENDPOINT_CHECK | Checks whether the Hive webhcat_server is alive or not. | Severity: "Critical", Execution Interval: "10"
HIVE_METASTORE_ENDPOINT_CHECK | Checks whether the Hive metastore is alive or not. | Severity: "Critical", Execution Interval: "10"
HIVE_HIVESERVER2_ENDPOINT_CHECK | Checks whether the hiveserver2 is alive or not. | Severity: "Critical", Execution Interval: "10"
HIVE_USER_EXECUTING_TOO_MANY_LLAP_QUERIES | Checks whether the Hive user has executed more than 50 LLAP queries in the last 15 minutes, including running, failed, and completed queries. | Severity: "High", Execution Interval: "30"
HIVE_USER_TOO_MANY_RUNNING_LLAP_QUERIES | Checks whether the Hive user has more than 20 LLAP queries in RUNNING state in the last 15 minutes. | Severity: "High", Execution Interval: "30"
HIVE_LLAP_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive LLAP query has spilled more than 10 thousand records to the disk. Spilling means that memory use exceeded the limit defined and reserved for the map output buffer. Ideally, spilled records are zero, which is good for memory and I/O performance. | Severity: "Medium", Execution Interval: "30"
HIVE_LLAP_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive LLAP query has a shuffle greater than 1 GB. Shuffles cannot be avoided, but they can slow the query down. | Severity: "Medium", Execution Interval: "30"
HIVE_LLAP_QUERY_RUNNING_GT_15MIN | Checks whether the Hive query is in a running state for more than 15 minutes. | Severity: "High", Execution Interval: "30"
HIVE_LLAP_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive LLAP query ran for more than 4 hours. | Severity: "Medium", Execution Interval: "30"
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30"
HIVE_LLAP_QUERY_INPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million input records. | Severity: "Medium", Execution Interval: "30"
HIVE_LLAP_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive LLAP query has written more than 1 GB of data. | Severity: "Medium", Execution Interval: "30"
HIVE_LLAP_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive LLAP query read more than 1 GB of data. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive query has written more than 1 GB of data. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive query read more than 1 GB of data. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive query has a shuffle greater than 1 GB. Shuffles cannot be avoided, but they can slow queries down. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive query has spilled more than 10 thousand records to the disk. Spilling means that memory use exceeded the limit defined and reserved for the map output buffer. Ideally, spilled records are zero, which is good for memory and I/O performance. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive query ran for more than 4 hours. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_OUTPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERY_INPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million input records. | Severity: "Medium", Execution Interval: "30"
HIVE_QUERIES_FAILING | Checks whether more than 10 Hive queries failed in the last hour. | Severity: "Medium", Execution Interval: "30"
HIVE_HIGH_QUERY_NUMBER | Checks whether Hive is experiencing a high query count of more than 50 queries. | Severity: "Medium", Execution Interval: "30"
HIVE_USER_TOO_MANY_RUNNING_QUERIES | Checks whether the Hive user has more than 20 queries in RUNNING state in the last 15 minutes. | Severity: "High", Execution Interval: "30"
HIVE_USER_EXECUTING_TOO_MANY_QUERIES | Checks whether the Hive user has executed more than 50 queries in the last 15 minutes, including running, failed, and completed queries. | Severity: "High", Execution Interval: "30"
HIVE_QUERY_RUNNING_GT_15MIN | Checks whether the Hive query is in a running state for more than 15 minutes. | Severity: "Medium", Execution Interval: "30"
HIVE_METASTORE_JVM_GC | Monitors the time spent by the metastore in Java Garbage Collection. | Severity: "Medium", Execution Interval: "120"
HIVE_METASTORE_JVM_MEMORY | Monitors the Java memory used by the metastore. | Severity: "Medium", Execution Interval: "120"
HIVE_METASTORE_PROCESS_CPU | Monitors the CPU usage of the metastore process. | Severity: "Medium", Execution Interval: "120"
HIVE_SERVER_JVM_GC | Monitors the time spent by the server in Java Garbage Collection. | Severity: "Medium", Execution Interval: "120"
HIVE_SERVER_JVM_MEMORY | Monitors the amount of Java memory used by the server. | Severity: "Medium", Execution Interval: "120"
HIVE_SERVER_PROCESS_CPU | Monitors the CPU usage of the server process. | Severity: "Medium", Execution Interval: "120"
HIVE_SERVER_INTERACTIVE_JAVA_OS | Monitors the time spent by the server in running Java. | Severity: "Medium", Execution Interval: "120"
HIVE_SERVER_INTERACTIVE_JVM_GC | Monitors the time spent by the server in Garbage Collection. | Severity: "Medium", Execution Interval: "120"
HIVE_INTERACTIVE_ENDPOINT_CHECK | Monitors whether the Hive Server2 interactive is working. | Severity: "Critical", Execution Interval: "10"
HIVE_QUERIES_HDFS_BYTES_READ | Monitors whether the HDFS bytes read are higher than the limit. You can set the HDFS bytes read limit. | Severity: "Medium", Execution Interval: "120"
HIVE_QUERIES_HDFS_BYTES_WRITTEN | Monitors whether the HDFS bytes written are higher than the limit. You can set the HDFS bytes written limit. | Severity: "Medium", Execution Interval: "120"

MapReduce Alerts

Alert Name | Description | Configuration
MAPREDUCE2_JOBHISTORY_UI_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistory user interface is alive or not. | Severity: "Critical", Execution Interval: "60"
MAPREDUCE2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistoryserver is alive or not. | Severity: "Critical", Execution Interval: "60"

NiFi Alerts

Alert Name | Description | Configuration
PROCESSOR_STATUS_CHECK | This alert is triggered when a processor stops or goes into an invalid state. The processor name and NiFi node are sent in the alert body. This relies on ProcessorStatus, a metric that has been added to the new NAR file. | Severity: "High", Execution Interval: "120"
NIFI_CLUSTER_CONNECTION_STATUS_CHECK | This alert is triggered when a NiFi node is disconnected from the NiFi cluster. This is different from the endpoint check: the NiFi process might still be up and running but might have stopped sending heartbeats to the other nodes, which marks it as disconnected. This relies on the nifi_cluster_connection_status_connected metric that is sent by the newly introduced nifi_agent. | Severity: "High", Execution Interval: "120"

NiFi Registry Alerts

Alert Name | Description | Configuration
Nifi_Registry_Endpoint_Check | Checks whether the system hosting the NiFi Registry service is active or not. | Severity: "Critical", Execution Interval: "60"

Pinot Alerts

Alert Name | Description | Configuration
Pinot Broker Endpoint Check | Triggers an alert when the Pinot Broker service is unreachable, indicating potential service downtime or network issues. | Severity: "Critical", Execution Interval: "60"
Pinot Controller Endpoint Check | Triggers an alert when the Pinot Controller service is unreachable, affecting coordination and metadata operations. | Severity: "Critical", Execution Interval: "60"
Pinot Server Endpoint Check | Triggers an alert when the Pinot Server service is unreachable, which may disrupt data queries and ingestion. | Severity: "Critical", Execution Interval: "60"

Ranger and Ranger KMS Alerts

Alert Name | Description | Configuration
Ranger Admin Endpoint Check | Triggered when the Ranger Admin service is unavailable, indicating potential service downtime or network issues. | Severity: "Critical", Execution Interval: "60"
Ranger KMS Endpoint Check | Triggered when the Ranger KMS service is unavailable, affecting coordination and metadata operations. | Severity: "Critical", Execution Interval: "60"

Trino Alerts

Alert Name | Description | Configuration
Trino Coordinator Endpoint Check | Triggered when the Trino Coordinator service becomes unreachable, indicating potential downtime or network issues. | Severity: "Medium", Execution Interval: "30"
Trino Worker Endpoint Check | Triggered when the Trino Worker service becomes unreachable, which may impact query processing and workload distribution. | Severity: "Medium", Execution Interval: "30"

HDFS Alerts

Alert Name | Description | Configuration
HDFS_SECONDARYNAMENODE_ENDPOINT_CHECK | Checks whether the HDFS secondary namenode is alive or not. | Severity: "Critical", Execution Interval: "60"
HDFS_DATANODES_ENDPOINT_CHECK | Checks whether the HDFS datanode is alive or not. | Severity: "Critical", Execution Interval: "60"
HDFS_NAMENODE_ENDPOINT_CHECK | Checks whether the HDFS namenode is alive or not. | Severity: "Critical", Execution Interval: "60"
HDFS_JOURNALNODE_OPEN_FILE_DESCIPTORS_COUNT_ALERT | Monitors the number of file descriptors used. | Severity: "Critical", Execution Interval: "120"
HDFS_JOURNALNODE_USED_SWAP_SPACE_SIZE | Monitors the amount of swap memory used. | Severity: "Low", Execution Interval: "120"
HDFS_JOURNALNODE_GC_MILLIS_ALERT | Monitors the time spent in Java Garbage Collection. | Severity: "Low", Execution Interval: "120"
HDFS_JOURNALNODE_SYNC_LATENCY_ALERT | Monitors the fsync latency of the JournalNode. | Severity: "Low", Execution Interval: "120"
HDFS_JOURNALNODE_ENDPOINT_CHECK | Checks whether the JournalNode is alive or not. | Severity: "Critical", Execution Interval: "60"
FS_LAST_MODIFIED_FILE_COUNT | Checks various given paths to find the number of files modified prior to the given time period. | Severity: "Critical", Execution Interval: "120"
HDFS_PATH_USAGE | Checks file sizes in a specified path and triggers an alert if they exceed the defined threshold fraction of the total HDFS storage. Note: The HDFS paths must be present in the pathanalysis.txt file under $AcceloHome/data/fsanalytics/cluster/. | Severity: "Critical", Execution Interval: "120"
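
The HDFS_PATH_USAGE condition compares the size of a path against a fraction of total HDFS storage. The following sketch shows that comparison under stated assumptions. The function and parameter names are hypothetical, sizes are assumed to be in bytes, and how Acceldata measures the sizes is not described in this topic.

```python
def path_usage_exceeds(path_bytes: int, total_hdfs_bytes: int,
                       threshold_fraction: float) -> bool:
    """Return True when a path uses more than the given fraction of HDFS.

    Hypothetical helper: names and units are assumptions for illustration.
    """
    if total_hdfs_bytes <= 0:
        raise ValueError("total HDFS storage must be positive")
    if not 0.0 <= threshold_fraction <= 1.0:
        raise ValueError("threshold fraction must be between 0 and 1")
    return path_bytes / total_hdfs_bytes > threshold_fraction
```

For example, with a threshold fraction of 0.25, a path holding 300 units out of 1000 would raise the alert, and a path holding 200 units would not.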

LLAP Alerts

Alert Name | Description | Configuration
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records. | Severity: "Medium", Execution Interval: "30"
LLAP_HIGH_QUERY_NUMBER | Checks whether Hive LLAP is experiencing a high query count of more than 50 queries. | Severity: "High", Execution Interval: "30"
LLAP_QUERIES_FAILING | Checks whether more than 10 Hive LLAP queries failed in the last hour. | Severity: "High", Execution Interval: "30"

Host Alerts

Alert Name | Description | Configuration
AVAILABLE_MEMORY_ALERT | This alert is raised if the available memory in the system for the last 60 seconds per host per mount path is more than 10 percent. | Severity: "Critical", Execution Interval: "60"
NETWORK_USAGE_ALERT | Checks if the average of total bytes received and sent is greater than 9.0 GB over 60 seconds. | Severity: "Critical", Execution Interval: "60"
DISK_USAGE_ALERT | This alert is raised if the percentage of disk usage in the system for the last 60 minutes per host per mount path is more than 70 percent. | Severity: "Critical", Execution Interval: "60"
CPU_USAGE_ALERT | This alert is raised when the CPU usage is higher than 50 percent on any host in the last 60 seconds. | Severity: "Critical", Execution Interval: "60"

Ozone Alerts

Alert Name | Description | Configuration
OZONE_DATANODE_ENDPOINT_CHECK | This alert checks whether the Ozone Datanode is alive or not. | Severity: "Critical", Execution Interval: "60"
OZONE_MANAGER_ENDPOINT_CHECK | This alert checks whether the Ozone Manager is alive or not. | Severity: "Critical", Execution Interval: "60"
OZONE_RECON_ENDPOINT_CHECK | This alert checks whether the Ozone Recon web UI is alive or not. | Severity: "Critical", Execution Interval: "60"
OZONE_S3GATEWAY_ENDPOINT_CHECK | This alert checks whether the Ozone S3 gateway is alive or not. | Severity: "Critical", Execution Interval: "60"
OZONE_SCM_ENDPOINT_CHECK | This alert checks whether the Ozone Storage Container Manager is alive or not. | Severity: "Critical", Execution Interval: "60"

To create an alert, see Alerts.

Predefined Alerts

Stock alerts and custom alerts are limited to one query condition. To handle complex query conditions, Pulse provides a built-in library of predefined alerts. Each alert is defined by an alert definition, which specifies the alert type, the execution interval at which the alert is evaluated, and its thresholds. Predefined alerts are provided to the user as stock alerts.

You can modify the parameters of a predefined alert, but not its function.
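
One way to picture the split between modifiable parameters and fixed function is an immutable record whose parameters can be replaced while its name and type cannot. The sketch below is purely illustrative: the class, its field names, and the `with_parameters` method are hypothetical and do not reflect how Pulse stores alert definitions. The example values are taken from the HDFS_STALE_DATANODE_PERCENTAGE entry in this topic.

```python
from dataclasses import dataclass, field, replace
from typing import Mapping


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical model of an alert definition (not the Pulse schema)."""
    name: str
    alert_type: str
    severity: str
    execution_interval: int
    thresholds: Mapping[str, float] = field(default_factory=dict)

    def with_parameters(self, **params) -> "AlertDefinition":
        # Only parameters may change; the name and type stay fixed.
        allowed = {"severity", "execution_interval", "thresholds"}
        rejected = set(params) - allowed
        if rejected:
            raise ValueError(f"cannot modify: {sorted(rejected)}")
        return replace(self, **params)


stale = AlertDefinition(
    name="HDFS_STALE_DATANODE_PERCENTAGE",
    alert_type="predefined",
    severity="Critical",
    execution_interval=120,
    thresholds={"percent": 5.0},
)
# Raise the threshold without touching what the alert does.
tuned = stale.with_parameters(thresholds={"percent": 10.0})
```

Here `tuned` keeps the same name and alert type as `stale`, and an attempt to pass `name=...` to `with_parameters` raises a `ValueError`.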

The following table lists and describes the available predefined alerts:

Alert | Description | Configuration
HDFS_NAMENODE_FAILOVER | Alerts when the active namenode is transitioned to another standby namenode. | Severity: "Critical", Execution Interval: "120"
YARN_DEAD_NODEMANAGERS_PERCENTAGE | Alerts when the percentage of dead nodemanagers is beyond a threshold. | Severity: "High", Execution Interval: "120"
YARN_RESOURCE_MANAGER_FAILOVER_ALERT | Alerts when the active resource manager is transitioned to another standby resource manager. | Severity: "High", Execution Interval: "120"
HDFS_STALE_DATANODE_PERCENTAGE | Alerts when the percentage of stale datanodes is more than 5 percent. | Severity: "Critical", Execution Interval: "120"
YARN_APP_FAILED_PERCENTAGE | Alerts when the percentage of failed apps is beyond a threshold. | Severity: "High", Execution Interval: "120"
HBASE_READ_SCAN_LATENCY | Alerts when the 99th percentile scan time is more than the threshold. | Severity: "Critical", Execution Interval: "120"
HDFS_DEAD_DATANODE_PERCENTAGE | Alerts when the percentage of dead datanodes is more than 5 percent. | Severity: "Critical", Execution Interval: "120"
HBASE_READ_GET_LATENCY | Alerts when the 99th percentile read latency is more than the threshold. | Severity: "Critical", Execution Interval: "120"
KAFKA_PARTITION_OFFLINE_LEADER | Alerts when a topic partition leader goes offline. | Severity: "Critical", Execution Interval: "120"
KAFKA_CONSUMER_GROUP_NO_CONSUMPTION | Alerts when the Kafka consumer group is not consuming data from the topic. | Severity: "Critical", Execution Interval: "60"
HBASE_REGION_STATE_CHANGE | Compares the HBase Region State between the current and previous durations (as defined by you). Every time the HBase Region State changes, an alert is triggered. | Severity: "Critical", Execution Interval: "120"
YARN_LONG_RUNNING_JOB | Alerts when long-running jobs are taking longer than the threshold time. | Severity: "Critical", Execution Interval: "120"
YARN_PENDING_JOBS_COUNT | Alerts when there are 15 YARN jobs in pending status. | Severity: "High", Execution Interval: "120"
YARN_PENDING_APPS_COMPARISON | Compares the number of YARN jobs in pending state between a custom time period (defined by you) and the current time. You can also set a custom percentage threshold. When the pending YARN jobs in the custom time frame exceed those at the current time by the defined percentage threshold, the alert is raised. | Severity: "High", Execution Interval: "120"
HIVE_SERVER_INTERACTIVE_HEAP_USAGE | Monitors whether the Hive server interactive JVM heap usage is more than a predefined threshold. Note: Set the JVM Heap Maximum value to use this alert. | Severity: "Critical", Execution Interval: "120"
HBASE_MASTER_FAILOVER | An alert is generated when the active HBase master is switched to another standby HBase master. Note: The alert is not cleared automatically. The user must move it to the clear state. | Severity: "Critical", Execution Interval: "120"
HIVE_LLAP_ZOMBIE_DAEMON_CHECK | An alert is generated when an LLAP zombie daemon process is still active. | Severity: "Critical", Execution Interval: "30"
SPARK_EXECUTOR_NODE_BLACKLISTED | Checks if any node is blocked by Spark and raises an alert if a node is blocked. | Severity: "High", Execution Interval: "300"
YARN_APP_RESOURCE_USAGE | Checks if the Vcore and memory exceed the threshold value for all applications. You can set the threshold value. | Severity: "High", Execution Interval: "60"
YARN_LONG_RUNNING_JOB_QUEUE | You can specify multiple queue names (comma separated) and the threshold duration (in seconds). This alert checks if any of the queues exceeds the set threshold level. | Severity: "High", Execution Interval: "120"
FS_SNAPSHOT_ANALYSIS_REPORT | Allows you to specify a time period (in hours) in the ThresholdTime field. If the last snapshot time is greater than the value specified in Threshold Time, an alert is raised. | Severity: "Critical", Execution Interval: "300"
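
The FS_SNAPSHOT_ANALYSIS_REPORT condition can be read as an age check on the most recent snapshot. The sketch below assumes that "last snapshot time greater than the threshold" means the time elapsed since the last snapshot exceeds the configured number of hours. That interpretation, the function name, and the parameter names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def snapshot_is_overdue(last_snapshot: datetime, now: datetime,
                        threshold_hours: float) -> bool:
    """Return True when the last snapshot is older than the threshold.

    Hypothetical helper: assumes the threshold is an age limit in hours.
    """
    return now - last_snapshot > timedelta(hours=threshold_hours)
```

With a threshold of 24 hours, a snapshot taken 36 hours before the check would raise the alert, and one taken 12 hours before would not.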

Pinot Alerts

Alert Name | Description | Configuration
Pinot_Controller_Interactive_Heap_Usage | Triggers an alert when the Controller's heap usage exceeds the defined threshold. Ensure the 'JVM Heap Maximum' value is configured. | Severity: "Critical", Execution Interval: "120"
Pinot_Controller_Interactive_Heap_Usage | Triggers an alert when the Broker's heap usage crosses the threshold. Confirm that the 'JVM Heap Maximum' value is set. | Severity: "Critical", Execution Interval: "120"
Pinot_Server_Interactive_Heap_Usage | Triggers an alert when the Server's heap usage surpasses the set threshold. Make sure the 'JVM Heap Maximum' value is configured. | Severity: "Critical", Execution Interval: "120"

Ranger and Ranger KMS Alerts

Alert Name | Description | Configuration
Ranger Admin Interactive Heap Usage | Triggered when Ranger Admin heap usage exceeds the defined threshold. Ensure the JVM heap maximum value is configured to enable this alert. | Severity: "Critical", Execution Interval: "120"
Ranger KMS Server Interactive Heap Usage | Triggered when Ranger KMS heap usage exceeds the defined threshold. Ensure the JVM heap maximum value is configured to enable this alert. | Severity: "Critical", Execution Interval: "120"