The Basics of Administering a Hadoop Cluster

So assuming you followed and completed my first post, Getting Started with Hortonworks Data Platform 2.3, you should now have your very own Hadoop cluster (albeit one that pales in comparison to Yahoo!’s reported 4,500-node cluster).


Still, you’ve taken a very big step towards learning about Hadoop and how to use it effectively. Well done!

We left off in the previous post looking at a webpage in your browser: the front end of Ambari.

Hopefully it looks something like this:

[Screenshot: the Ambari dashboard]

Ambari is a very important source of information for everything about your cluster. The center tiles tell you at a glance many of the most important aspects of your cluster, including available disk space, nodes available, and cluster uptime. On the left is a list of the Services installed on your cluster. Services are individual components in the Hadoop ecosystem that provide a great number of capabilities beyond the distributed file system Hadoop is best known for.

Take a moment to click around and investigate what these are. Some of the essentials are HDFS, MapReduce2, YARN, Hive, and Pig. Of course, there are many more listed (and even more available out in the wilds of the Internet), but the ones listed here are those included with the Hortonworks Data Platform.
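
Everything the dashboard shows is also exposed through Ambari’s REST API, which becomes handy once you start scripting against the cluster. Below is a minimal sketch in Python (using the third-party requests library, which you’d need to install) that lists each installed service and its current state. It assumes the sandbox defaults: Ambari at http://127.0.0.1:8080, admin/admin credentials, and a cluster named Sandbox; a GET on /api/v1/clusters will tell you the actual name if yours differs.

```python
# Minimal sketch: list installed services and their states via Ambari's REST API.
# Assumptions: Ambari at 127.0.0.1:8080, credentials admin/admin, and a cluster
# named "Sandbox" (the HDP sandbox default; check GET /api/v1/clusters if unsure).
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"
AUTH = ("admin", "admin")

resp = requests.get(
    f"{AMBARI}/clusters/Sandbox/services",
    params={"fields": "ServiceInfo/state"},  # ask only for each service's state
    auth=AUTH,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    info = item["ServiceInfo"]
    print(f"{info['service_name']}: {info['state']}")
```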

Service health is indicated by the color next to the name: green indicates good health, yellow indicates warnings, and red indicates errors or alerts. A first aid kit icon indicates that the service is stopped and in maintenance mode.

To start, stop, restart, or enter maintenance mode for any service, select it on the left, then click the ‘Service Actions’ drop-down menu on the far right of the service’s page, circled below.

[Screenshot: the Service Actions drop-down menu]
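
If you’d rather script these actions than click through the UI, the same start/stop operations can be driven through the REST API. Here’s a sketch under the same assumptions as above (sandbox defaults, cluster named Sandbox). One caveat: Ambari handles these requests asynchronously, so a successful call means the request was accepted, not that the service has finished starting or stopping.

```python
# Sketch: start or stop a service via Ambari's REST API instead of the
# 'Service Actions' menu. Same assumptions as the earlier sketch (sandbox
# defaults: 127.0.0.1:8080, admin/admin, cluster "Sandbox").
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}  # Ambari requires this on state-changing calls

def set_service_state(service, state):
    """state is 'STARTED' to start a service, 'INSTALLED' to stop it."""
    body = {
        "RequestInfo": {"context": f"Set {service} to {state} via REST"},
        "Body": {"ServiceInfo": {"state": state}},
    }
    resp = requests.put(
        f"{AMBARI}/clusters/Sandbox/services/{service}",
        json=body, auth=AUTH, headers=HEADERS,
    )
    resp.raise_for_status()
    # Ambari accepts the change asynchronously (HTTP 202) and returns a
    # request resource you can poll for progress; 200 with an empty body
    # means the service was already in the requested state.
    return resp.json() if resp.content else None

print(set_service_state("PIG", "INSTALLED"))  # stop Pig; use 'STARTED' to start it
```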

Installing a new service is as easy as selecting the ‘Actions’ drop-down menu below the listed Services and following the installation wizard. I’ll go through that in a later post, however.


Next we’ll look at the other tabs available in Ambari: Hosts, Alerts, and Admin.

The Hosts tab shows all the nodes connected to your cluster, both worker and master, along with their health (same indicators as before), name, IP address, server rack, cores, RAM, disk usage, load averages, installed Hadoop version, and finally all the components installed on that particular node. In our cluster, there is obviously only one node to see here:

[Screenshot: the Hosts tab]
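
Those same host details are available programmatically. Another short sketch, same sandbox assumptions as before; the field names below (Hosts/ip, Hosts/cpu_count, and so on) come from Ambari’s v1 API:

```python
# Sketch: list every host in the cluster with a few of the details the
# Hosts tab shows. Same sandbox assumptions as the earlier sketches.
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"
AUTH = ("admin", "admin")

resp = requests.get(
    f"{AMBARI}/clusters/Sandbox/hosts",
    params={"fields": "Hosts/ip,Hosts/cpu_count,Hosts/total_mem,Hosts/host_status"},
    auth=AUTH,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    h = item["Hosts"]
    # total_mem is reported in kilobytes
    print(h["host_name"], h["ip"], h["cpu_count"], h["total_mem"], h["host_status"])
```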

The Alerts tab gives you a play-by-play of the things that go wrong in your cluster, like process failures, node failures, or critically low disk space. It’ll tell you what went wrong, when it went wrong, what service it’s associated with, and what its current state is. We get some alerts left over from when Hortonworks first built this virtual machine image; in my case, that was 15 days ago. You can also see that those alerts have since been resolved. Thank you, Hortonworks.

[Screenshot: the Alerts tab]
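
Alerts can be queried the same way. The endpoint below exists in Ambari 2.0 and later (which covers the Ambari that ships with the HDP 2.3 sandbox); as always, the cluster name Sandbox is an assumption:

```python
# Sketch: fetch current alerts with their state, service, and description.
# Same sandbox assumptions; the alerts endpoint requires Ambari 2.0+.
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"
AUTH = ("admin", "admin")

resp = requests.get(
    f"{AMBARI}/clusters/Sandbox/alerts",
    params={"fields": "Alert/label,Alert/state,Alert/service_name,Alert/text"},
    auth=AUTH,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    a = item["Alert"]
    print(f"[{a['state']}] {a['service_name']}: {a['label']} - {a['text']}")
```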

The Admin tab doesn’t let you do much, but it’s still full of information that’s very important in administering a Hadoop cluster, particularly if you have other users running about inside of it (we’ll simulate this later). It tells you which services you have installed and which versions of those services you’re running (nice to know when someone asks, “Hey, what version of Hive are we running?” because they need to know whether a bug fix was implemented or not).

[Screenshot: the Admin tab]
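
If you want to answer that “what version of Hive are we running?” question from a script, the stack definition records each service’s version. A sketch, same sandbox assumptions, using the HDP 2.3 stack path that matches this post:

```python
# Sketch: print the version of each service defined in the HDP 2.3 stack.
# Same sandbox assumptions (127.0.0.1:8080, admin/admin).
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"
AUTH = ("admin", "admin")

resp = requests.get(
    f"{AMBARI}/stacks/HDP/versions/2.3/services",
    params={"fields": "StackServices/service_version"},
    auth=AUTH,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    s = item["StackServices"]
    print(f"{s['service_name']}: {s['service_version']}")
```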

Also on this tab is a thing called Kerberos. Kerberos is a useful security feature in Hadoop; it was essentially the first and only security available to a Hadoop cluster beyond basic user authentication. Unfortunately, implementing it is rather complicated and outside the scope of this post. If done incorrectly, your entire cluster will be locked down, and your only options are to pray to the Hadoop gods for mercy or to start completely over from scratch. The latter is not a good idea in a production environment.

This post gave a good overview of what Ambari is and how it helps you administer a cluster on a basic level. The next post will involve us actually getting our hands dirty and typing things on the command line. Exciting!

Until then, please leave questions or comments below.

-James

Getting Started with Hortonworks Data Platform 2.3

As my first post, I’m going to walk through setting up Hortonworks Data Platform (HDP) 2.3. HDP is very nice because it is free to use at any level for any size of cluster, from curious developers with virtual environments to Fortune 50 companies with 100+ node clusters. The cost only comes in if you want support for Hortonworks’ software (more on that later).

There are a few reasons for choosing this as a first post:

  1. I’m familiar with developing on and ‘administering’ an HDP 2.2 VM and want to see the differences.
  2. I want to do something I know I can accomplish on my first post.
  3. Finally, getting started with a new technology is often, I’ve found, the most difficult part. Hopefully this will help someone stuck on their first steps into the world of Big Data.

So here we go:

  1. Get yourself a hot new copy of VirtualBox (or whatever VM hosting environment you please; this guide will use VirtualBox, however) at Oracle’s website: https://www.virtualbox.org/wiki/Downloads
    • I’m using VirtualBox 5.0 on Windows 8.1 at the time of this post.
    • Install it using the wizard
  2. Make your way over to Hortonworks’ website and download the .ova file for their VirtualBox environment: http://hortonworks.com/hdp/downloads/
    • I got HDP 2.3, again for Windows 8.1
    • It’s a large file (~7GB), so make sure you have enough space in your Downloads area
  3. Once that’s downloaded, start up VirtualBox. Navigate your way through any “Welcome to VirtualBox” splash screens until you find yourself looking at something like this:
    [Screenshot: the VirtualBox welcome screen]

    • To import your HDP 2.3 image, do the following:
      1. File > Import Appliance…
      2. Navigate to where your HDP 2.3 .ova file was saved and select it
      3. Select Next > Import
    • It might take some time to import it; that’s normal, don’t worry.
    • Once it’s imported, your VirtualBox window should look something like this: [Screenshot: VirtualBox Manager with the HDP 2.3 VM listed]
    • Double-click the icon in the VirtualBox tray and it should bring up a command prompt window and boot the machine. After it boots, the window should look like this:
      [Screenshot: the VM console after booting]
    • Do as it says: open a browser and navigate to http://127.0.0.1:8888/
  4. By logging into your cluster’s Ambari interface (http://127.0.0.1:8080, Username: admin; Password: admin), you can now begin to administer your cluster. (If you want to confirm everything is up from a script, see the sketch below.)
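
As a quick sanity check that Ambari is up and your cluster is registered, a few lines of Python (using the third-party requests library) against Ambari’s REST API will do. This assumes the URL and admin/admin credentials from step 4:

```python
# Sketch: confirm the sandbox's Ambari API answers and list registered clusters.
# Assumes the URL and credentials from step 4 (127.0.0.1:8080, admin/admin).
import requests

resp = requests.get("http://127.0.0.1:8080/api/v1/clusters", auth=("admin", "admin"))
resp.raise_for_status()

names = [item["Clusters"]["cluster_name"] for item in resp.json()["items"]]
print("Clusters registered with Ambari:", names)
```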

Congratulations, you now have your very own Hadoop cluster set up in a VirtualBox environment courtesy of Hortonworks and their robust Hortonworks Data Platform.

Please leave a comment if you have any questions or concerns!

-James