Deploying ODP In Production Data Centers With Firewalls
A typical Open Source Data Platform (ODP) install requires access to the internet in order to fetch software packages from a remote repository. Because corporate networks typically have various levels of firewalls, internet access may be limited or restricted, making it impossible for your cluster nodes to access the ODP repository during installation.
The solution for this is to either:
- Create a local mirror repository inside your firewall hosted on a local mirror server; or
- Provide a trusted proxy server inside your firewall that can access the hosted repositories.
This document covers these two options in detail, discusses the trade-offs, and provides configuration guidelines and recommendations for your deployment strategy.
In general, before installing Open Source Data Platform in a production data center, it is best to ensure that both the Data Center Security team and the Data Center Networking team are informed and engaged to assist with these aspects of the deployment.
Terminology
The table below lists the various terms used throughout this section.
Table 6.1. Terminology
Item | Description |
---|---|
Yum Package Manager (yum) | A package management tool that fetches and installs software packages and performs automatic dependency resolution. |
Local Mirror Repository | The yum repository hosted on your Local Mirror Server that serves the ODP software. |
Local Mirror Server | The server in your network that will host the Local Mirror Repository. This server must be accessible from all hosts in your cluster where you install ODP. |
ODP Repositories | A set of repositories hosted by Acceldata that contain the ODP software packages. The ODP Repositories include the ODP Repository and the ODP-UTILS Repository. |
ODP Repository Tarball | A tarball image that contains the complete contents of the ODP Repositories. |
Mirroring or Proxying
ODP uses yum to install the software, and this software is obtained from the ODP Repositories. If your firewall prevents internet access, you must mirror or proxy the ODP Repositories in your Data Center.
Mirroring a repository involves copying the entire repository and all its contents onto a local server and enabling an HTTPD service on that server to serve the repository locally. Once the local mirror server setup is complete, the *.repo configuration files on every cluster node must be updated, so that the given package names are associated with the local mirror server instead of the remote repository server.
Two methods exist for setting up a local mirror server, with detailed explanations provided in subsequent sections of this document.
- Mirror server has no access to internet at all: Use a web browser on your workstation to download the ODP Repository Tarball, move the tarball to the selected mirror server using scp or a USB drive, and extract it to create the repository on the local mirror server.
- Mirror server has temporary access to internet: Temporarily configure a server to have internet access, download a copy of the ODP Repository to this server using the reposync command, then reconfigure the server so that it is back behind the firewall.
- Option I is probably the least effort, and in some respects, is the most secure deployment option.
- Option III is best if you want to be able to update your Hadoop installation periodically from the Acceldata Repositories.
Trusted proxy server: Proxying a repository involves setting up a standard HTTP proxy on a local server to forward repository access requests to the remote repository server and route responses back to the original requestor. Effectively, the proxy server makes the repository server accessible to all clients, by acting as an intermediary.
Once the proxy is configured, change the /etc/yum.conf file on every cluster node, so that when the client attempts to access the repository during installation, the request goes through the local proxy server instead of going directly to the remote repository server.
Considerations for Choosing a Mirror or Proxy Solution
The following table lists some benefits provided by these alternative deployment strategies:
However, each of the above approaches is also known to have the following disadvantages:
- Mirrors have to be managed for updates, upgrades, new versions, and bug fixes.
- Proxy servers rely on the repository provider to not change the underlying files without notice.
- Caching proxies are necessary, because non-caching proxies do not decrease WAN traffic and do not speed up the install process.
Recommendations for Deploying ODP
This section provides information on the various components of the Apache Hadoop ecosystem.
In many data centers, using a mirror for the ODP Repositories can be the best deployment strategy. The ODP repositories are small and easily mirrored, allowing you secure control over the contents of the Hadoop packages accepted for use in your data center.
Detailed Instructions for Creating Mirrors and Proxies
Option I - Mirror server has no access to the internet
Complete the following instructions to set up a mirror server that has no access to the internet:
- Check Your Prerequisites.
Select a mirror server host with the following characteristics:
- The server OS is CentOS 7, RHEL 7, RHEL 8, RL 8, or Ubuntu 20/22, and the server has several GB of storage available.
- This server and the cluster nodes must all run the same OS.
- The firewall should let all cluster nodes (the servers on which you want to install ODP) access this server.
- Install the Repos.
a. Use a workstation with access to the internet and download the tarball image of the appropriate Acceldata ODP repository.
Table 6.2. Acceldata ODP Repositories
Cluster OS | ODP Repository Tarballs |
---|---|
RHEL/CentOS 7 | wget [INSERT_URL] |
RHEL 8/RL 8 | wget [INSERT_URL] |
Ubuntu 20/22 | wget [INSERT_URL] wget [INSERT_URL] |
b. Create an HTTP server.
• On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
• Activate this web server.
• Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.
- If you are using EC2, make sure that SELinux is disabled.
c. On your mirror server, create a directory for your web server.
For example, from a shell window, type:
- For RHEL/CentOS 7:
mkdir -p /var/www/html/odp/
- For Ubuntu 20/22:
mkdir -p /var/www/html/odp/
If you are using a symlink, enable the FollowSymLinks option on your web server.
d. Copy the ODP Repository Tarball to the directory created in step c, and untar it.
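The copy-and-untar step can be sketched as a small shell function. This is a minimal sketch, assuming the tarball is an archive that tar can read directly; the function name extract_repo_tarball and the example tarball path are hypothetical and not part of ODP.

```shell
# Hypothetical helper: unpack a repository tarball into the web server directory.
# Usage: extract_repo_tarball <tarball> <web_dir>
extract_repo_tarball() {
    tarball="$1"
    web_dir="$2"

    # Fail early if the tarball was not copied over correctly.
    if [ ! -f "$tarball" ]; then
        echo "tarball not found: $tarball" >&2
        return 1
    fi

    # Create the web directory (and any missing parents), then unpack into it.
    mkdir -p "$web_dir" || return 1
    tar -xf "$tarball" -C "$web_dir"
}

# Example (the tarball path is an assumption):
#   extract_repo_tarball /tmp/odp-repo.tar.gz /var/www/html/odp
```

Checking for the tarball first gives a clearer error than a raw tar failure when the transfer to the mirror server did not complete.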
e. Verify the configuration.
- The configuration is successful if you can access the above directory through your web browser.
To test this, browse to the following location: http://$yourwebserver/odp/$os/ODP-3.2.3.3-2/.
You should see a directory listing for all the ODP components along with the RPMs at: $os/ODP-3.2.3.3-2.
$os identifies your cluster OS. Use the following table for the $os parameter.
Table 6.3. ODP Component Options
Operating System | Value |
---|---|
RHEL/CentOS 7 | centos7 |
RHEL 8/RL 8 | rhel8 |
Ubuntu 20 | ubuntu20 |
Ubuntu 22 | ubuntu22 |
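The browser check can also be done from the command line. The sketch below builds the test URL from the pattern http://$yourwebserver/odp/$os/ODP-3.2.3.3-2/ and probes it with curl; the helper names odp_repo_url and check_odp_repo, and the host name in the example, are illustrative assumptions.

```shell
# Build the URL used to verify the mirror, following the pattern
# http://$yourwebserver/odp/$os/<version>/ from this section.
odp_repo_url() {
    server="$1"   # FQDN of the local mirror server
    os="$2"       # $os value from the table above, for example centos7
    version="$3"  # repository directory, for example ODP-3.2.3.3-2
    printf 'http://%s/odp/%s/%s/\n' "$server" "$os" "$version"
}

# Return success only if the web server answers the URL without an HTTP error.
check_odp_repo() {
    curl --fail --silent --show-error --head "$(odp_repo_url "$@")" >/dev/null
}

# Example (mirror.example.com is a placeholder host name):
#   check_odp_repo mirror.example.com centos7 ODP-3.2.3.3-2 && echo "mirror reachable"
```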
f. Configure the yum clients on all the nodes in your cluster.
- Fetch the yum configuration file from your mirror server.
- Store the odp.repo file to a temporary location.
- Edit the odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.
In the baseurl value, $yourwebserver is the FQDN of your local mirror server and $os identifies your cluster OS. Use the following table for the $os parameter:
Table 6.4. Yum Client Options
Operating System | Value |
---|---|
RHEL/CentOS 7 | centos7 |
RHEL 8/RL 8 | rhel8 |
Ubuntu 20 | ubuntu20 |
Ubuntu 22 | ubuntu22 |
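Editing the baseurl property can be scripted. The following is a sketch only: set_repo_baseurl is a hypothetical helper, and it assumes that the repo file already contains a line starting with baseurl and that the new URL contains no | or & characters, both of which are special in the sed expression used here.

```shell
# Hypothetical helper: point every baseurl line in a .repo file at a new URL.
# Usage: set_repo_baseurl <repo_file> <new_url>
set_repo_baseurl() {
    repo_file="$1"
    new_url="$2"

    # Write to a temporary file first so a failed sed run cannot truncate the original.
    sed "s|^baseurl[[:space:]]*=.*|baseurl=${new_url}|" "$repo_file" > "${repo_file}.tmp" &&
        mv "${repo_file}.tmp" "$repo_file"
}

# Example (file location and URL are assumptions):
#   set_repo_baseurl /tmp/odp.repo "http://mirror.example.com/odp/centos7/ODP-3.2.3.3-2/"
```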
For RHEL/CentOS 7 and RHEL 8/RL 8:
- Add the following file on every node in the cluster.
vi /etc/yum.repos.d/ambari.repo
async = 1
baseurl = http://<internal-server>
gpgcheck = 0
name = ambari Version - ambari-2.7.8.2-2
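Note that yum expects each repository definition in a .repo file to sit under a [repository-id] section header. A minimal sketch of generating such a file follows; the function name write_ambari_repo and the repository id ambari are assumptions for illustration, while the property values repeat the listing above.

```shell
# Hypothetical helper: write a yum repo definition for the local mirror.
# Usage: write_ambari_repo <repo_file> <baseurl>
write_ambari_repo() {
    repo_file="$1"
    baseurl="$2"

    # The [ambari] id is an assumed name; any unique repository id works.
    cat > "$repo_file" <<EOF
[ambari]
async = 1
baseurl = ${baseurl}
gpgcheck = 0
name = ambari Version - ambari-2.7.8.2-2
EOF
}

# Example (run as root on each cluster node; the URL is a placeholder):
#   write_ambari_repo /etc/yum.repos.d/ambari.repo "http://<internal-server>"
```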
For Ubuntu 20/22:
- Add the following file on every node in the cluster.
sudo vi /etc/apt/sources.list.d/ambari.list
deb http://<internal-server-hostname>/ODP-3.2.3.3-2-deb/3.2.2.0-2/ ODP main
deb http://<internal-server-hostname>/ODP-3.2.3.3-2-deb/3.2.2.0-2/ ODP-UTILS main
Option II - Mirror server has temporary or continuous access to the internet
Complete the following instructions to set up a mirror server that has temporary access to the internet:
- Check Your Prerequisites.
Select a local mirror server host with the following characteristics:
- The server OS is CentOS 7, RHEL 7, RHEL 8, RL 8, or Ubuntu 20/22, and the server has several GB of storage available.
- The local mirror server and the cluster nodes must have the same OS. If they are not running CentOS or RHEL, the mirror server must not be a member of the Hadoop cluster.
- The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server.
- Ensure that the mirror server has yum installed.
- Add the yum-utils and createrepo packages on the mirror server.
yum install yum-utils createrepo
- Install the Repos.
- Temporarily reconfigure your firewall to allow internet access from your mirror server host.
- Execute the following command to download the appropriate Acceldata yum client configuration file and save it in /etc/yum.repos.d/ directory on the mirror server host.
Table 6.5. ODP Client Configuration Commands
Cluster OS | Client Configuration Command |
---|---|
RHEL/CentOS 7 | wget [INSERT_URL] |
RHEL 8/RL 8 | wget [INSERT_URL] |
Ubuntu 20 | wget [INSERT_URL] |
Ubuntu 22 | wget [INSERT_URL] |
- Create an HTTP server.
- On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
- Activate this web server.
- Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.
To enable directory listing on the web server, update the Apache configuration:
sed -e "s/Options None/Options Indexes MultiViews/ig" /etc/apache2/default-server.conf > /tmp/tempfile.tmp
mv /tmp/tempfile.tmp /etc/apache2/default-server.conf
On your mirror server, create a directory for your web server.
• For example, from a shell window, type:
• For RHEL/CentOS 7:
mkdir -p /var/www/html/odp/
• For Ubuntu 20/22:
mkdir -p /var/www/html/odp/
If you are using a symlink, enable the FollowSymLinks option on your web server.
• Copy the contents of the entire ODP repository for your desired OS from the remote repository to your local mirror server.
- Continuing the previous example, from a shell window, type:
- For RHEL/CentOS 7/Ubuntu 20/22:
cd /var/www/html/odp
Then for all hosts, type:
- ODP Repository
reposync -r ODP
reposync -r ODP-3.2.3.3-2
reposync -r ODP-UTILS-1.1.0.21
You should see both an ODP-3.2.3.3-2 directory and an ODP-UTILS-1.1.0.21 directory, each with several subdirectories.
- Generate appropriate metadata.
This step defines each directory as a yum repository. From a shell window, type:
- For RHEL/CentOS 7:
- ODP Repository:
createrepo /var/www/html/odp/ODP-3.2.3.3-2
createrepo /var/www/html/odp/ODP-UTILS-1.1.0.21
You should see a new folder called repodata inside both ODP directories.
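A quick way to confirm that createrepo produced metadata is to look for the repodata/repomd.xml index file in each repository directory. The sketch below does that; has_repodata is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if every given directory contains yum metadata.
# Usage: has_repodata <repo_dir> [<repo_dir> ...]
has_repodata() {
    for repo_dir in "$@"; do
        # createrepo writes its index to repodata/repomd.xml.
        if [ ! -f "${repo_dir}/repodata/repomd.xml" ]; then
            echo "missing metadata in: ${repo_dir}" >&2
            return 1
        fi
    done
    return 0
}

# Example, using the directories from this section:
#   has_repodata /var/www/html/odp/ODP-3.2.3.3-2 /var/www/html/odp/ODP-UTILS-1.1.0.21
```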
- Verify the configuration.
- The configuration is successful if you can access the above directory through your web browser.
To test this, browse to the following location:
ODP: http://$yourwebserver/odp/ODP-3.2.3.3-2/
- You should now see a directory listing for all the ODP components.
- At this point, you can disable external internet access for the mirror server, so that the mirror server is again entirely within your data center firewall.
- Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.
- Edit the repo files, changing the value of the baseurl property to the local mirror URL.
- Edit the /etc/yum.repos.d/odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.
In the baseurl value, $yourwebserver is the FQDN of your local mirror server and $os identifies your cluster OS. Use the following table for the $os parameter:
Table 6.6. $OS Parameter Values
Operating System | Value |
---|---|
RHEL 7/CentOS 7 | centos7 |
RHEL 8/RL 8 | rhel8 |
Ubuntu 20 | ubuntu20 |
Ubuntu 22 | ubuntu22 |
For RHEL/CentOS 7 and RHEL 8/RL 8:
- Add the following file on every node in the cluster.
vi /etc/yum.repos.d/ambari.repo
async = 1
baseurl = http://<internal-server>
gpgcheck = 0
name = ambari Version - ambari-2.7.8.2-2
- If using Ambari, verify the configuration by deploying an Ambari server on one of the cluster nodes.
yum update
yum install ambari-server
For Ubuntu 20/22:
Add the following file on every node in the cluster.
sudo vi /etc/apt/sources.list.d/ambari.list
deb http://<internal-server-hostname>/ODP-3.2.3.3-2-deb/3.2.2.0-2/ ODP main
deb http://<internal-server-hostname>/ODP-3.2.3.3-2-deb/3.2.2.0-2/ ODP-UTILS main
If using Ambari, verify the configuration by deploying an Ambari server on one of the cluster nodes.
wget -qO - <internal-server-key> | sudo apt-key add -
sudo apt update
sudo apt-get install ambari-server
apt list 'ambari-server*'   # confirms that ambari-server is installed
Option III - Set up a Trusted Proxy Server
Complete the following instructions to set up a trusted proxy server:
- Check Your Prerequisites.
Select a mirror server host with the following characteristics:
- This server runs on either RHEL/CentOS 7, RHEL 8/RL 8, or Ubuntu 20/22, and has several GB of storage available.
- The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server, and allows this server to access the internet (at least those internet servers for the repositories to be proxied).
- Install the Repos.
- Create a caching HTTP Proxy server on the selected host.
• It is beyond the scope of this document to show how to set up an HTTP proxy server, given the many variations that may be required, depending on your data center’s network security policy. If you choose to use the Apache HTTPD server, start by installing httpd, using the instructions provided here, and then add the mod_proxy and mod_cache modules, as stated here. Please engage your network security specialists to correctly set up the proxy server.
- Activate this proxy server and configure its cache storage location.
- Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server, and outbound access to the desired repo sites, including: public-repo-1.acceldata.com.
If you are using EC2, make sure that SELinux is disabled.
- Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.
The following description is taken from the CentOS documentation. On each cluster node, add the following lines to the /etc/yum.conf file. (As an example, the settings below enable yum to use the proxy server mycache.mydomain.com, connecting to port 3128, with the following credentials: yum-user/qwerty.)
# proxy server:port number
proxy=http://mycache.mydomain.com:3128
# account details for secure yum proxy connections
proxy_username=yum-user
proxy_password=qwerty
- Once all nodes have their /etc/yum.conf file updated with appropriate configuration info, you can proceed with the ODP installation just as though the nodes had direct access to the internet repositories.
- If this proxy configuration does not seem to work, try adding a / at the end of the proxy URL. For example:
proxy=http://mycache.mydomain.com:3128/
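Adding the proxy settings to every node can be scripted. The sketch below appends the three settings to a yum configuration file only if no proxy line is present yet; add_yum_proxy is a hypothetical helper, and the values in the example are the placeholder values used above.

```shell
# Hypothetical helper: append proxy settings to a yum configuration file.
# Usage: add_yum_proxy <conf_file> <proxy_url> <username> <password>
add_yum_proxy() {
    conf_file="$1"

    # Leave the file alone if a proxy is already configured.
    if grep -q '^proxy=' "$conf_file"; then
        echo "proxy already set in: $conf_file" >&2
        return 0
    fi

    {
        echo "# proxy server:port number"
        echo "proxy=$2"
        echo "# account details for secure yum proxy connections"
        echo "proxy_username=$3"
        echo "proxy_password=$4"
    } >> "$conf_file"
}

# Example (run as root on each cluster node):
#   add_yum_proxy /etc/yum.conf http://mycache.mydomain.com:3128 yum-user qwerty
```

Because the function skips files that already contain a proxy line, it can be re-run across the cluster without duplicating settings.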