Cruise Control
Cruise Control helps manage Kafka clusters at scale by automating common Kafka operations: monitoring cluster workload, rebalancing clusters according to preset constraints, and detecting and resolving anomalies. It comprises four main components (Load Monitor, Analyzer, Anomaly Detector, and Executor) alongside a REST API.
Prerequisites
- Cruise Control version 2.5.137.
- By default, Cruise Control runs on JDK 11.
Ambari Installation Steps for Cruise Control
To install Cruise Control through Ambari, perform the following steps:
- Add Cruise Control Component to Kafka Cluster Nodes via Ambari: Cruise Control is included as part of the Kafka components and can be installed through Ambari on one of the nodes of the Kafka cluster that it will monitor.
- To add Cruise Control, select Kafka as a service under the Choose Services section.

- Assign the nodes where you want to install the Kafka broker.

- Select the node where you wish to add Cruise Control as a component.
Note: Cruise Control cannot be installed on more than one node.

- Configuration Changes for Cruise Control: After selecting the node, make the following configuration changes for Cruise Control:
- Under Advanced cruise-control-env and Advanced kafka-env, specify the Java implementation to use.

By default, Cruise Control runs on port 9095. To run Cruise Control on a different port, make the following changes:
- Under Advanced cruise-control, change webserver.http.port to the desired port.
- Change the port number in Advanced cruise-control-env and Advanced cruise-control-ui-config via Ambari.
Under Advanced cruise-control-ui-config, change localhost to the hostname on which Cruise Control is installed. This configuration is used by the Cruise Control UI. If SSL is enabled, you must also change the protocol from http to https in Advanced cruise-control-ui-config.
Format of config.csv:
config.csv is a CSV file with three columns separated by commas.
- First column: Logical Group Name (Examples: dev, production, trade)
- Second column: Cruise Control Instance Name (Examples: finance, logging)
- Third column: URL path to the Cruise Control instance (Examples: /kafkacruisecontrol/, http://dev.acceldata.ce:9095/kafkacruisecontrol)
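As a rough illustration of the three-column layout described above, the following Python sketch parses config.csv content. The sample rows and the CruiseControlInstance type name are assumptions made for the example; only the column order comes from the description.

```python
import csv
import io
from typing import List, NamedTuple


class CruiseControlInstance(NamedTuple):
    """One row of config.csv (the type name is hypothetical)."""
    group: str  # first column: logical group name
    name: str   # second column: Cruise Control instance name
    url: str    # third column: URL path to the instance


def parse_config_csv(text: str) -> List[CruiseControlInstance]:
    """Parse config.csv content, skipping blank or malformed rows."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        fields = [field.strip() for field in record]
        if len(fields) != 3 or not all(fields):
            continue  # not a complete three-column row; ignore it
        rows.append(CruiseControlInstance(*fields))
    return rows


# Sample content made up for illustration.
SAMPLE = """dev,finance,/kafkacruisecontrol/
production,logging,http://dev.acceldata.ce:9095/kafkacruisecontrol
"""

instances = parse_config_csv(SAMPLE)
```

Skipping incomplete rows is a design choice for this sketch; a stricter loader could raise an error instead.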

The following cruise-control configurations under Advanced cruise-control and Advanced kafka-broker will be managed by Ambari if left blank.
- Advanced cruise-control :
webserver.http.address
bootstrap.servers
security.inter.broker.protocol
security.protocol
zookeeper.security.enabled
- Advanced kafka-broker :
metric.reporters
cruise.control.metrics.reporter.bootstrap.servers
cruise.control.metrics.reporter.listeners
cruise.control.metrics.reporter.security.protocol
cruise.control.metrics.reporter.security.inter.broker.protocol
cruise.control.metrics.reporter.authorizer.class.name
cruise.control.metrics.reporter.sasl.kerberos.principal.to.local.rules
cruise.control.metrics.reporter.sasl.kerberos.service.name
cruise.control.metrics.reporter.ssl.client.auth
- Deployment of Kafka Brokers and Cruise Control: Click on DEPLOY to deploy the Kafka brokers and Cruise Control on the selected nodes.

- Completion of Installation: Once the deployment is completed, Cruise Control and Kafka broker will be installed on the specified nodes.

- Cruise Control requires some time to read the raw Kafka metrics from the cluster.
- The metrics of a newly started broker may take a few minutes to stabilize. Cruise Control drops inconsistent metrics, for example when topic bytes-in is higher than broker bytes-in, so the first few windows may not have enough valid partitions.
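The consistency rule mentioned above can be pictured with a small Python sketch. This is only an illustration of the idea (a topic-level rate should not exceed the broker-level rate); the function names and the sample data are hypothetical and do not reflect Cruise Control's internal implementation.

```python
from typing import Dict, List


def is_consistent(broker_bytes_in: float, topic_bytes_in: Dict[str, float]) -> bool:
    """Return False when any topic reports more bytes-in than its broker."""
    return all(rate <= broker_bytes_in for rate in topic_bytes_in.values())


def valid_samples(samples: List[dict]) -> List[dict]:
    """Keep only samples whose topic-level and broker-level rates agree."""
    return [
        s for s in samples
        if is_consistent(s["broker_bytes_in"], s["topic_bytes_in"])
    ]


# Made-up samples: the second one is inconsistent and gets dropped.
samples = [
    {"broker": 1, "broker_bytes_in": 100.0, "topic_bytes_in": {"a": 40.0, "b": 50.0}},
    {"broker": 2, "broker_bytes_in": 10.0, "topic_bytes_in": {"a": 40.0}},
]
kept = valid_samples(samples)
```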
Security Configuration
- SSL Enablement: Update the following properties based on your SSL configurations:

Advanced cruise-control

Advanced kafka-broker

Advanced cruise-control-ui-config
- Kerberos Configuration: Cruise Control utilizes the Kafka user as its default user. Therefore, no specific Kerberos configuration changes are required for Cruise Control.
- Service principal and keytabs will be configured using Ambari automation.
- The Advanced cruise-control-jaas-conf file will also be managed by Ambari.
Configuring JWT Authentication with Knox
To configure JWT authentication with Apache Knox, follow these steps:
- Create a credentials file (e.g., /tmp/test-roles.credentials) with the following content:
kafka: ,ADMIN
sam: ,VIEWER
By default, Cruise Control defines three roles: VIEWER, USER, and ADMIN. VIEWER has access to lightweight endpoints, USER has access to most GET endpoints, and ADMIN has access to all endpoints.
- In Advanced cruise-control via Ambari, make the following changes:
jwt.auth.certificate.location=/path/to/gateway-identity.pem
jwt.authentication.provider.url=https://cent1.acceldata.ce:8443/gateway/knoxsso/api/v1/websso?originalUrl=http://cent1.acceldata.ce:9095
jwt.cookie.name=hadoop-jwt
webserver.auth.credentials.file=/path/to/test-roles.credentials
webserver.security.enable=true
webserver.security.provider=com.linkedin.kafka.cruisecontrol.servlet.security.jwt.JwtSecurityProvider

- Edit the value of knoxsso.token.ttl in the Advanced knoxsso-topology template to set the Time To Live (TTL) for tokens used in Knox Single Sign-On (SSO) authentication. By default, this value is set to 30000 ms (30 seconds).

- Open the URL mentioned in Advanced cruise-control-ui-config. This should redirect you to the Knox auth page, where you can log in with the admin/admin-password username/password pair. After logging in, you should be redirected to the Cruise Control page.
Pluggable Components
Metric Sampler: The Metric Sampler stands out as one of the core pluggable elements in Kafka Cruise Control. It offers users the flexibility to deploy Cruise Control across various environments and seamlessly integrate with existing metric systems.
By default, the Metric Sampler reads broker metrics generated by CruiseControlMetricsReporter on the broker. This assumes that the metric.reporters property on the Kafka brokers is set to com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.
Metric Sampler Partition Assignor: When users employ multiple metric sampler threads, the Metric Sampler Partition Assignor plays a pivotal role in assigning partitions to the metric samplers. This becomes particularly useful when users have an existing metric system. The default implementation allocates all partitions of the same topic to the same metric sampler.
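The property described above (all partitions of a topic go to the same sampler) can be sketched in Python as follows. This is not the actual Java implementation; the round-robin spreading of topics and all names here are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def assign_partitions(partitions: List[TopicPartition],
                      num_samplers: int) -> List[List[TopicPartition]]:
    """Assign partitions so that each topic lands on exactly one sampler."""
    by_topic: Dict[str, List[TopicPartition]] = defaultdict(list)
    for topic, partition in partitions:
        by_topic[topic].append((topic, partition))

    assignments: List[List[TopicPartition]] = [[] for _ in range(num_samplers)]
    # Spread whole topics over samplers round-robin, sorted for determinism.
    for index, topic in enumerate(sorted(by_topic)):
        assignments[index % num_samplers].extend(sorted(by_topic[topic]))
    return assignments


partitions = [("orders", 0), ("orders", 1), ("logs", 0), ("clicks", 0), ("logs", 1)]
assignment = assign_partitions(partitions, num_samplers=2)
```

Keeping a topic on one sampler means topic-level aggregates can be computed without combining results from several samplers.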
Sample Store: The Sample Store serves as a repository for storing collected metric samples and training samples in external storage. A key challenge in metric sampling is the utilization of derived data from raw metrics. The accuracy of this derived data relies on the metadata of the cluster at the time the metric was collected. The Sample Store resolves this issue by directly storing derived data to external storage for future retrieval.
The default implementation of the Sample Store produces the samples back to a Kafka topic.
Broker Capacity Config Resolver: The Broker Capacity Config Resolver is instrumental in enabling Cruise Control to access the capacity of each broker for various resources. The default implementation relies on file-based properties. Users also have the option to implement custom solutions to retrieve broker capacity from hardware resource management systems.
Goals
Goals in Kafka Cruise Control are pluggable and come with different priorities.
RackAwareGoal: Ensures that all replicas of each partition are assigned in a rack-aware manner.
- RackAwareDistributionGoal: Unlike RackAwareGoal, this goal permits placing multiple replicas of a partition into a single rack as long as replicas of each partition can achieve a perfectly even distribution across the racks.
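The difference between the two rack goals can be illustrated with a short Python sketch. It assumes that strict rack awareness means no two replicas of a partition share a rack, and it interprets "even distribution" as per-rack replica counts differing by at most one; both function names and the sample rack map are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def satisfies_rack_aware(replica_brokers: List[int], rack_of: Dict[int, str]) -> bool:
    """Strict check: no two replicas of the partition share a rack."""
    racks = [rack_of[b] for b in replica_brokers]
    return len(set(racks)) == len(racks)


def satisfies_rack_aware_distribution(replica_brokers: List[int],
                                      rack_of: Dict[int, str]) -> bool:
    """Relaxed check: replica counts per rack differ by at most one."""
    counts = Counter(rack_of[b] for b in replica_brokers)
    per_rack = [counts.get(rack, 0) for rack in set(rack_of.values())]
    return max(per_rack) - min(per_rack) <= 1


# Four brokers on two racks (made-up topology).
rack_of = {1: "r1", 2: "r1", 3: "r2", 4: "r2"}
```

With four replicas on two racks, the strict check cannot pass, while the distribution check accepts two replicas per rack.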
ReplicaCapacityGoal: Aims to limit the number of replicas on all brokers in a cluster.
CapacityGoals: Goals ensuring that broker resource utilization remains below specified thresholds for corresponding resources.
- DiskCapacityGoal
- NetworkInboundCapacityGoal
- NetworkOutboundCapacityGoal
- CpuCapacityGoal
ReplicaDistributionGoal: Strives to distribute replicas evenly across all brokers in a cluster.
PotentialNwOutGoal: Ensures that potential network output on each broker does not exceed the broker’s network outbound bandwidth capacity when all replicas become leaders.
ResourceDistributionGoals: Aim to keep the resource utilization variance among all brokers within a certain range for each resource.
- DiskUtilDistributionGoal
- NetworkInboundUtilDistributionGoal
- NetworkOutboundUtilDistributionGoal
- CpuUtilDistributionGoal
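One way to picture "within a certain range" is a band around the mean utilization. The Python sketch below flags brokers that fall outside such a band; the symmetric band and the 1.10 threshold are assumptions for illustration and not necessarily how Cruise Control computes its balance limits.

```python
from typing import Dict, List


def brokers_out_of_balance(utilization: Dict[int, float],
                           balance_threshold: float = 1.10) -> List[int]:
    """Return ids of brokers whose utilization is outside the band around the mean."""
    mean = sum(utilization.values()) / len(utilization)
    upper = mean * balance_threshold
    lower = mean * (2 - balance_threshold)  # symmetric band, roughly 0.90 * mean
    return sorted(b for b, u in utilization.items() if u > upper or u < lower)


# Made-up disk utilization per broker id; broker 5 is well above the mean of 64.
disk_util = {1: 60.0, 2: 62.0, 3: 58.0, 4: 60.0, 5: 80.0}
unbalanced = brokers_out_of_balance(disk_util)
```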
TopicReplicaDistributionGoal: Aims to evenly distribute replicas of the same topic across the entire cluster.
LeaderReplicaDistributionGoal: Strives to ensure that all brokers in a cluster have a similar number of leader replicas.
LeaderBytesInDistributionGoal: Aims to balance the leader bytes-in rate on each host.
PreferredLeaderElectionGoal: Ensures that the first replica in the replica list becomes the leader replica of the partition for all topic partitions.
KafkaAssignerGoals: These goals are selected if the kafka_assigner parameter is set to true in the corresponding request.
- KafkaAssignerDiskUsageDistributionGoal
- KafkaAssignerEvenRackAwareGoal
IntraBrokerDiskCapacityGoal: Ensures that disk resource utilization remains below a specified threshold. This goal is selected if the rebalance_disk parameter is set to true in the rebalance request.
- IntraBrokerDiskUsageDistributionGoal: Aims to maintain utilization variance among all disks within the same broker within a certain range. This goal is selected if the rebalance_disk parameter is set to true in the rebalance request.
BrokerSetAwareGoal: A goal that confines replica movements within the boundary of a BrokerSet, where a BrokerSet is defined as a subset of brokers in the cluster.
Cruise Control REST API Endpoints
You can manage and monitor Cruise Control through its REST API, which offers GET and POST endpoints. Knowing the specific action or query you want to perform is essential when crafting a curl command to interact with the Cruise Control REST API.
Cruise Control's REST API includes two main types of endpoints:
- GET: The GET endpoints allow you to retrieve details concerning the rebalancing operations, the current state of Cruise Control, and the status of Kafka brokers. These endpoints perform read-only functions, meaning they do not affect Cruise Control, Kafka, or the ongoing rebalancing activities.
- POST: With POST endpoints, you have the capability to alter the rebalancing operations, adjust the count of Kafka brokers, set configurations for Cruise Control, and tailor specific tasks related to sampling and proposals.
By utilizing these endpoints appropriately, you can effectively control and check the status of Cruise Control using simple command line tools.
Using Cruise Control REST API Endpoints
The following provides an overview of the operations available through the GET and POST endpoints of the Cruise Control REST API. This summary is intended to give a quick understanding of what each endpoint does. For detailed information, see the official REST API documentation.
GET Requests
GET requests in the Kafka Cruise Control REST API are for read-only operations, meaning they do not have any external impacts. These requests include:
- Querying the State of Cruise Control:
- Endpoint: /kafkacruisecontrol/state
- Description: Retrieve the current state of Kafka Cruise Control, including the status of its monitor, executor, analyzer, and anomaly detector components.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/state'
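For readers who prefer a script over curl, the GET request above might look like this in Python. It is a minimal sketch using only the standard library; the helper names are hypothetical, and the json=true query parameter is an assumption about how to request a JSON body.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_PATH = "/kafkacruisecontrol"


def endpoint_url(host: str, endpoint: str, port: int = 9095, **params) -> str:
    """Build a URL such as http://host:9095/kafkacruisecontrol/state."""
    query = f"?{urlencode(params)}" if params else ""
    return f"http://{host}:{port}{BASE_PATH}/{endpoint}{query}"


def get_state(host: str, port: int = 9095) -> dict:
    """Fetch the state endpoint and decode the body as JSON (assumes json=true)."""
    url = endpoint_url(host, "state", port, json="true")
    with urlopen(url, timeout=10) as response:
        return json.load(response)
```

Calling get_state("hostname") requires a reachable Cruise Control instance; endpoint_url can be used on its own to inspect the URL that would be requested.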
- Querying the Current Cluster Load:
- Endpoint: /kafkacruisecontrol/load
- Description: Retrieve information on the current cluster load, including load per broker and load per host, provided the Load Monitor is in the RUNNING state.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/load'
- Querying the Partition Resource Utilization:
- Endpoint: /kafkacruisecontrol/partition_load
- Description: Query the resource utilization of partitions in Kafka Cruise Control. The result is a list of partitions sorted by their utilization of the specified resource over the time range given by start and end. The resource can be CPU, NW_IN, NW_OUT, or DISK. By default, start is the earliest monitored time, end is the current wall clock time, resource is DISK, and entries covers all partitions in the cluster.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/partition_load'
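The partition_load parameters described above (resource, start, end, entries) can be assembled into a query string as sketched below. The helper name is hypothetical, and the values are passed through unchanged, so their exact format is left to the official API documentation.

```python
from urllib.parse import urlencode

VALID_RESOURCES = {"CPU", "NW_IN", "NW_OUT", "DISK"}


def partition_load_url(base_url, resource="DISK", start=None, end=None, entries=None):
    """Build a partition_load URL; omitted arguments fall back to server defaults."""
    if resource not in VALID_RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    params = {"resource": resource}
    for key, value in (("start", start), ("end", end), ("entries", entries)):
        if value is not None:
            params[key] = value
    return f"{base_url}/partition_load?{urlencode(params)}"


url = partition_load_url(
    "http://hostname:9095/kafkacruisecontrol", resource="NW_IN", entries=10
)
```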
- Querying Partition and Replica State:
- Endpoint: /kafkacruisecontrol/kafka_cluster_state
- Description: Returns the partition and replica state for each broker and partition, which helps you understand the cluster's current operational status.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/kafka_cluster_state'
The returned result contains the following information:
For each broker:
- Distribution of leader/follower/out-of-sync/offline replica information
- Online/offline disks
For each partition:
- Distribution of leader/follower/in-sync/out-of-sync/offline replica information
- Get Optimization Proposals:
- Endpoint: /kafkacruisecontrol/proposals
- Description: Returns the optimization proposals generated from the workload model at the given timestamp, along with a workload summary before and after the optimization.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/proposals'
Proposals can be generated based on either valid_windows or valid_partitions:
- valid_windows: rebalance the cluster based on the information in the available valid snapshot windows. A valid snapshot window is a window whose valid monitored partition coverage meets the requirements of all the goals. (This is the default behavior.)
- valid_partitions: rebalance the cluster based on all the available valid partitions. All the snapshot windows will be included in this case.
- You can specify either valid_windows or valid_partitions, but not both.
- If verbose is turned on, Cruise Control returns all the generated proposals. Otherwise, a summary of the proposals is returned.
- You can specify excluded_topics to prevent the replicas of certain topics from moving in the generated proposals.
- If use_ready_default_goals is turned on, Cruise Control uses any ready goals (based on available metric data) to calculate the proposals.
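The options above can be combined into a single query, as in the following Python sketch. The data_from parameter name, used here to carry the valid_windows or valid_partitions choice, is an assumption, as is the helper name; check the official REST API documentation for the exact spelling before relying on it.

```python
from urllib.parse import urlencode


def proposals_query(data_from="valid_windows", verbose=False,
                    excluded_topics=None, use_ready_default_goals=False):
    """Build a proposals request path; only one data source can be chosen."""
    if data_from not in ("valid_windows", "valid_partitions"):
        raise ValueError("data_from must be valid_windows or valid_partitions")
    params = {
        "data_from": data_from,  # assumed parameter name
        "verbose": str(verbose).lower(),
        "use_ready_default_goals": str(use_ready_default_goals).lower(),
    }
    if excluded_topics:
        params["excluded_topics"] = excluded_topics  # passed through as given
    return "/kafkacruisecontrol/proposals?" + urlencode(params)


path = proposals_query(verbose=True, excluded_topics="audit")
```

Modelling the choice as one argument makes it impossible to request both data sources at once, which mirrors the "but not both" rule.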
- Querying User Request Result:
- Endpoint: /kafkacruisecontrol/user_tasks
- Description: Retrieve the results of user requests made to Kafka Cruise Control. The response lists both ongoing and completed tasks, giving insight into current and past operations.
Example:
GET 'http://hostname:9095/kafkacruisecontrol/user_tasks'
You can specify user_task_ids, client_ids, endpoints, or types to filter requests of interest. By default, all requests are returned.
Setting fetch_completed_task to true retrieves the original response of each request. If a task encounters errors during completion, the response will indicate CompletedWithError.
POST Requests
POST requests in the Kafka Cruise Control REST API are operations that impact the Kafka cluster. These operations include:
- Triggering a Workload Balance:
- Endpoint: /kafkacruisecontrol/rebalance
- Description: Initiate a workload balance for a Kafka cluster based on specified optimization goals. By default, this operation runs in dry-run mode, simulating the process without actual execution.
Example:
POST /kafkacruisecontrol/rebalance
To execute the rebalance instead of simulating it, set dryrun=false.
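A rebalance request with an explicit dryrun flag could be prepared in Python as sketched below. The helper name and the hostname are placeholders; the request object is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def rebalance_request(base_url: str, dryrun: bool = True) -> Request:
    """Prepare (but do not send) a rebalance POST; dryrun defaults to True."""
    query = urlencode({"dryrun": str(dryrun).lower()})
    return Request(f"{base_url}/rebalance?{query}", method="POST")


BASE_URL = "http://hostname:9095/kafkacruisecontrol"
preview = rebalance_request(BASE_URL)                # simulation only
execute = rebalance_request(BASE_URL, dryrun=False)  # would move replicas
# Sending is left to the caller, for example urllib.request.urlopen(execute).
```

Defaulting to a dry run in the helper keeps the safe behavior even if the caller forgets the argument.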
- Adding a List of New Brokers to the Kafka Cluster:
- Endpoint: /kafkacruisecontrol/add_broker?brokerid=[id1,id2...]
- Description: Add a group of brokers to a Kafka cluster by moving replicas from the current brokers to the newly introduced ones. You can optionally regulate the pace of replica movement to the new brokers, with the throttling mechanism applied to existing brokers.
Example:
POST /kafkacruisecontrol/add_broker?brokerid=[id1,id2...]
To execute the operation instead of simulating it, set dryrun=false.
When new brokers are added to a Kafka cluster, Cruise Control moves replicas only from existing brokers to the specified new brokers, never among existing brokers.
- Decommissioning a List of Brokers from the Kafka Cluster:
- Endpoint: /kafkacruisecontrol/remove_broker?brokerid=[id1,id2...]
- Description: Remove a list of brokers from a Kafka cluster. Partitions are moved from the brokers slated for removal to other existing brokers. You can specify the destination brokers for these partitions and throttle the removed brokers during the partition movement.
Example:
POST /kafkacruisecontrol/remove_broker?brokerid=[id1,id2...]
To execute the operation instead of simulating it, set dryrun=false.
If the topics specified in excluded_topics have replicas on a removed broker, those replicas will still be moved off the broker.
- Fixing Offline Replicas in the Kafka Cluster:
- Endpoint: /kafkacruisecontrol/fix_offline_replicas
- Description: Resolve offline replicas in a Kafka cluster. If a topic has offline replicas, Cruise Control moves them to healthy brokers.
Example:
POST /kafkacruisecontrol/fix_offline_replicas
To execute the operation instead of simulating it, set dryrun=false.
If the topics specified in excluded_topics have offline replicas, those replicas will still be moved to healthy brokers.
- Demoting List of Brokers from the Kafka Cluster:
- Endpoint: /kafkacruisecontrol/demote_broker?brokerid=[id1, id2...]
- Description: Move all leader replicas away from a designated set of brokers. All replicas on the specified brokers are made the least preferred, and a preferred leader replica election is then triggered to move the leader replicas away from those brokers.
Example:
POST /kafkacruisecontrol/demote_broker?brokerid=[id1, id2...]
To execute the operation instead of simulating it, set dryrun=false.
- Moving All the Leader Replicas Away from a List of Disks:
- Endpoint: /kafkacruisecontrol/demote_broker?brokerid_and_logdirs=[id1-logdir1, id2-logdir2...]
- Description: Move all leader replicas away from a specified list of disks. All replicas on the specified disks are made the least preferred, and a preferred leader replica election is then triggered to move the leader replicas away from those disks.
Example:
POST /kafkacruisecontrol/demote_broker?brokerid_and_logdirs=[id1-logdir1, id2-logdir2...]
The process of demoting a broker or disk involves two key steps:
- Designating all replicas on the specified broker or disk as the least preferred for leadership election within their respective partitions.
- Initiating a preferred leader election on the partitions to shift the leader replicas away from the broker or disk.
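The two demotion steps above can be simulated on plain replica lists, as in this Python sketch. It is a simplified model with made-up partition names: it ignores in-sync replica state and only shows how reordering the replica list changes the outcome of a preferred leader election.

```python
from typing import Dict, List, Set


def demote(replica_lists: Dict[str, List[int]], demoted: Set[int]) -> Dict[str, List[int]]:
    """Step 1: move replicas on demoted brokers to the end of each replica list."""
    reordered = {}
    for partition, replicas in replica_lists.items():
        preferred = [b for b in replicas if b not in demoted]
        least_preferred = [b for b in replicas if b in demoted]
        reordered[partition] = preferred + least_preferred
    return reordered


def elect_preferred_leaders(replica_lists: Dict[str, List[int]]) -> Dict[str, int]:
    """Step 2: the first replica in each list becomes the leader."""
    return {partition: replicas[0] for partition, replicas in replica_lists.items()}


replica_lists = {"orders-0": [1, 2, 3], "orders-1": [2, 3, 1], "logs-0": [1, 3, 2]}
after_demotion = demote(replica_lists, demoted={1})
leaders = elect_preferred_leaders(after_demotion)
```

After demoting broker 1, it stays in every replica list but no longer leads any partition in this model.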
- Stopping Current Proposal Execution Tasks:
- Endpoint: /kafkacruisecontrol/stop_proposal_execution
- Description: Halt a proposal execution task while it is in progress.
Example:
POST /kafkacruisecontrol/stop_proposal_execution
- Pausing Metrics Load Sampling:
- Endpoint: /kafkacruisecontrol/pause_sampling
- Description: Pause the ongoing metrics sampling process. The reason for the pause is recorded and shows up in the state endpoint (under the LoadMonitor sub-state).
Example:
POST /kafkacruisecontrol/pause_sampling
- Resuming Metrics Load Sampling:
- Endpoint: /kafkacruisecontrol/resume_sampling
- Description: Resume a paused metrics load sampling process. The reason for the resumption is recorded and can be viewed when querying the state of Cruise Control.
Example:
POST /kafkacruisecontrol/resume_sampling
- Changing Kafka Topic Configuration:
- Endpoint: /kafkacruisecontrol/topic_configuration?topic=[topic_regex]&replication_factor=[target_replication_factor]
- Description: Currently, Cruise Control only supports changing a topic's replication factor via the topic_configuration endpoint. The long-term aim is to make Cruise Control the central place to change any topic configuration (partition count, retention time, and so on).
Example:
POST /kafkacruisecontrol/topic_configuration?topic=[topic_regex]&replication_factor=[target_replication_factor]
Changing a topic's replication factor does not move any existing replicas. The goals are used to determine which replica to delete (when decreasing the replication factor) and which broker receives the new replica (when increasing it).
- Changing Cruise Control Configurations:
Adjusting Cruise Control configuration options can be achieved dynamically through the admin endpoint, facilitating the following capabilities:
- Modifying partition and leadership concurrency, along with the interval for checking and updating ongoing execution progress.
- Enabling or disabling self-healing for specific anomaly types.
- Dropping selected brokers from the list of recently removed or demoted brokers.
- Activating or deactivating specified concurrency adjusters.
- Enabling or disabling (At/Under)MinISR-based concurrency adjustment.
To enable or disable self-healing, utilize a POST request similar to:
POST /kafkacruisecontrol/admin?disable_self_healing_for=[anomaly_type]
For adjusting execution concurrency, submit a POST request like:
POST /kafkacruisecontrol/admin?concurrent_partition_movements_per_broker=[integer]
To drop recently removed or demoted brokers, issue a POST request as follows:
POST /kafkacruisecontrol/admin?drop_recently_removed_brokers=[broker_ids]
2-Step Verification for POST Requests
To prevent unintended executions of POST requests, 2-step verification can be enabled for certain endpoints. This feature requires you to explicitly review and approve requests before execution. Endpoints requiring 2-step verification include:
add_broker
remove_broker
fix_offline_replicas
rebalance
stop_proposal_execution
pause_sampling
resume_sampling
demote_broker
admin
To enable 2-step verification, set the two.step.verification.enabled config to true in Advanced cruise-control via Ambari.

Cruise Control Frontend (CCFE)
The Cruise Control Frontend provides a centralized dashboard for Kafka Cruise Control, offering various features:
Key Features:
- Kafka Cluster Status: Real-time status of the Kafka cluster, reporting any under-replicated partitions (URP), offline partitions, offline disks, and so on.

- Kafka Cluster Load: Displays the load on all brokers as calculated by Cruise Control.

- Cruise Control State: Shows the internal status of Cruise Control, including the Monitor, Executor, Analyzer, and Anomaly Detector.

- Cruise Control Tasks: Provides information on past and current tasks managed by Cruise Control.

- Kafka Cluster Administration: Lists all brokers in the system and offers administrative actions such as adding, removing, or demoting brokers, as well as rebalancing.


Kafka Cluster Rebalance with Advanced Options
- Cruise Control Proposals: Displays proposed optimizations and actions generated by Cruise Control.
- Peer Review: Facilitates peer review of proposed actions.
Additional Features:
- Safe Mode Execution: All actions run in Dry Run (Safe Mode) by default.
- Customizable Parameters: Exposes all Cruise Control REST API parameters as input controls.
- Fast Execution: Provides safe default values to execute actions as quickly as possible.
- Async Response Handling: Understands and renders progress from Cruise Control's asynchronous responses.
- Endpoint URLs: Displays URLs of endpoints for every action.
- Detailed Error Responses: Shows full stack trace responses from Cruise Control instead of abbreviated ones.

If you experience an error after refreshing the pages for Kafka Cluster Load, Cruise Control Proposals, or Peer Review, navigate to the Kafka Cluster State page and then return to the problematic page. This should resolve the issue.
