Configuring Hadoop and starting the cluster using Ansible

Harshil Shah
Mar 21, 2021

A step-by-step guide on how to configure and start a Hadoop cluster with the help of Ansible.

NOTE: In this article I am not using Ansible roles, and this practical is performed on Red Hat Enterprise Linux 8.

Github repo : https://github.com/iamShahHarshil/hadoop_with_ansible

In this article I will only explain the tasks used to configure Hadoop on the master and slave nodes.

Let's start with the master node.

To set up Hadoop on any system, we first need the Hadoop and JDK RPM files (since this practical is performed on Red Hat Linux; RPM stands for Red Hat Package Manager). Using the copy module, copy both packages to the master node. The IP of the master node is already mentioned in the inventory file.
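A minimal sketch of those copy tasks; the RPM file names and the /root destination are placeholders for whatever versions and paths you actually use:

```yaml
# Copy the installers to the master node (file names are placeholders).
- name: Copy the JDK rpm to the master node
  copy:
    src: jdk-8u171-linux-x64.rpm          # placeholder file name
    dest: /root/jdk-8u171-linux-x64.rpm

- name: Copy the Hadoop rpm to the master node
  copy:
    src: hadoop-1.2.1-1.x86_64.rpm        # placeholder file name
    dest: /root/hadoop-1.2.1-1.x86_64.rpm
```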

Now we need to install the Java and Hadoop software as well.

Using the shell module (the command module would also work), the entire rpm command that we usually run manually is executed by Ansible over SSH. We register the information about the JDK installation in a variable called java_install. We use this variable to check whether the JDK installed properly by looking at the return code (rc): if rc = 0 it succeeded, otherwise it failed.

So the Hadoop software is installed only when the rc of the Java installation is 0, i.e. successful.
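A sketch of those two tasks, reusing the placeholder file names from above; the --force flag is shown because the Hadoop 1.x rpm often fails its dependency check against a manually installed JDK, but treat that as an assumption about your packages:

```yaml
- name: Install the JDK
  shell: rpm -ivh /root/jdk-8u171-linux-x64.rpm
  register: java_install       # stores rc, stdout and stderr of the command
  ignore_errors: yes           # keep going so we can inspect rc ourselves

- name: Install Hadoop only if the JDK installed successfully
  shell: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
  when: java_install.rc == 0
```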

You have to do a similar process for the slave node as well. You can write the slave node's code in the same playbook by mentioning a new host, or you can create a new playbook for it. I have made a separate playbook for each node.

Obviously we need to give the cluster a volume to share, so create an empty folder anywhere you wish. Then we have to copy and replace the hdfs-site.xml and core-site.xml files on both the master and the slave. These are Hadoop configuration files that hold the cluster's details, such as the IP address and port number of the Hadoop cluster. We copy them using the copy module.

Note: I have created both files, hdfs-site.xml and core-site.xml, in the proper format and with my Hadoop cluster's details, and stored them in the same folder as my playbook.

The destination for both files is fixed on the master and the slave (the Hadoop configuration directory), and the file names are fixed as well.
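A sketch of these tasks for the master node, assuming /nn as the placeholder shared folder and /etc/hadoop as the configuration directory used by the Hadoop rpm:

```yaml
- name: Create the namenode directory
  file:
    path: /nn                  # placeholder shared-folder path
    state: directory

- name: Copy core-site.xml to the fixed destination
  copy:
    src: core-site.xml
    dest: /etc/hadoop/core-site.xml

- name: Copy hdfs-site.xml to the fixed destination
  copy:
    src: hdfs-site.xml
    dest: /etc/hadoop/hdfs-site.xml
```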

Finally, we format the namenode and start the namenode service.
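A sketch using the Hadoop 1.x commands; the echo Y answers the confirmation prompt that hadoop namenode -format shows when a previous format already exists:

```yaml
- name: Format the namenode (destructive, master node only)
  shell: echo Y | hadoop namenode -format

- name: Start the namenode service
  shell: hadoop-daemon.sh start namenode
```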

Note: We have to perform similar steps for the slave node. The only differences are that we do not format anything before starting the datanode service, the Hadoop configuration file (hdfs-site.xml) is different, and on the slave node we also create an empty folder that links the slave's storage to the Hadoop cluster.
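The slave-side tasks would then look roughly like this, with /dn as the placeholder storage folder:

```yaml
- name: Create the datanode directory
  file:
    path: /dn                  # placeholder storage folder on the slave
    state: directory

- name: Start the datanode service (no formatting on the slave)
  shell: hadoop-daemon.sh start datanode
```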

Remember, the namenode, i.e. the master node, manages all the slave nodes. Whatever data you upload to the Hadoop cluster, the master splits the files into blocks of the same size and distributes them over all the slave nodes present in the cluster.

core-site.xml and hdfs-site.xml files for the datanode:
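A minimal sketch of what these two files contain for a Hadoop 1.x datanode; the namenode IP, port number and the /dn directory are placeholders for your own cluster details:

```xml
<!-- core-site.xml (datanode): points the node at the namenode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value> <!-- placeholder namenode IP:port -->
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml (datanode): where the slave stores its blocks -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value> <!-- placeholder directory -->
  </property>
</configuration>
```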

core-site.xml and hdfs-site.xml files for the namenode:
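And the corresponding sketch for the namenode; the port and the /nn directory are again placeholders:

```xml
<!-- core-site.xml (namenode): the address datanodes and clients connect to -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value> <!-- placeholder port -->
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml (namenode): where HDFS metadata is stored -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value> <!-- placeholder directory -->
  </property>
</configuration>
```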

A hadoop-vars.yml file keeps all the variables used in the playbooks.
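A hypothetical example of such a file; every variable name here is an assumption, so adjust them to whatever your playbooks actually reference (and load the file in a play with vars_files):

```yaml
# hadoop-vars.yml (hypothetical names)
master_ip: 192.168.1.10    # namenode IP (placeholder)
hadoop_port: 9001          # namenode port (placeholder)
nn_dir: /nn                # namenode storage directory
dn_dir: /dn                # datanode storage directory
```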

Finally, run both playbooks one after another using these commands:

ansible-playbook masterconf.yml

ansible-playbook slaveconf.yml

How to make the httpd service idempotent using Ansible

What does idempotence mean here?

If a change is made to a service's configuration file, we have to restart the service for the change to take effect. Restarting the service only when the configuration has actually changed, and not on every playbook run, is what idempotence means here.

We can bring this kind of behaviour to any service in Ansible with the help of handlers and notify.

Here I have used a handler with notify to make the httpd service idempotent.

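A minimal sketch of the pattern, assuming httpd.conf sits next to the playbook:

```yaml
- hosts: webservers
  tasks:
    - name: Install httpd
      package:
        name: httpd
        state: present

    - name: Copy the httpd configuration
      copy:
        src: httpd.conf                    # placeholder source file
        dest: /etc/httpd/conf/httpd.conf
      notify: restart httpd                # fires only when the file changes

    - name: Ensure httpd is running
      service:
        name: httpd
        state: started

  handlers:
    - name: restart httpd
      service:
        name: httpd
        state: restarted
```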

Here, notify calls the handler only when it sees that changes were made to the configuration file, and the handler restarts the service as shown above. If the copied file is identical to what is already on the node, the task reports ok instead of changed and the service is left alone.

So that is it. The GitHub link for the repo is mentioned at the beginning of the article.

Thank you ✌🏼
