Installing and Running Apache NiFi on your HDP Cluster


Hey everyone,

I learned today about a cool ETL/data pipeline/make-your-life-easier tool that was recently released by the NSA (not kidding) as a way to manage the flow of data in and out of systems: Apache NiFi. To me, that functionality seems to match PERFECTLY with what people like to do with Hadoop. This guide will just set up NiFi, not do anything with it (that’ll come later!).

Things you’ll need:

  • HDP 2.3
  • Maven > 3.1

You don’t even need root access!

Here’s how to get it running on your HDP 2.3 cluster:

First we need to actually get the source code. One easy way (assuming you grab the master branch as a zip from GitHub; a git clone of https://github.com/apache/nifi.git works too, but it creates a directory named nifi instead of nifi-master) is:

wget https://github.com/apache/nifi/archive/master.zip

unzip master.zip
cd nifi-master

export MAVEN_OPTS="-Xms1024m -Xmx3076m -XX:MaxPermSize=256m"

mvn -T 2.0C clean install

(Takes about 8 minutes to run all the tests)

After that’s done,

cd nifi-assembly/target

tar -zxvf nifi-0.3.1-SNAPSHOT-bin.tar.gz

cd nifi-0.3.1-SNAPSHOT

vi conf/nifi.properties

On line 106 ( :106 in vim), set the web UI port property, nifi.web.http.port, to 9000 (the default is 8080) so it matches the URL we’ll open below.


bin/nifi.sh install

service nifi start

Then navigate to http://localhost:9000/nifi

There you have it! With any luck, you now have NiFi installed!

Loading Data into Hive Using a Custom SerDe

Welcome back! If you read my previous post, you know that we’ve run into an issue with the Chicago crime data we just loaded into Hive. Specifically, one of the columns contains commas as part of its data.

Here is one example:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,SCHOOL,NULL,NULL,false,0724,007,17,68,14,2015,NULL,NULL

The data in its raw form looks like this:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,"SCHOOL, PUBLIC, BUILDING",false,false,0724,007,17,68,14,,,2015,08/24/2015 03:30:37 PM,,,

The problem column is LOCATIONDESCRIPTION; in the raw record it’s the quoted value "SCHOOL, PUBLIC, BUILDING".

So what happened here? The data in that field contains commas, and we told Hive to split each row into fields on commas. As a result, the record was split prematurely at the commas inside the field. Hive didn’t do anything wrong; we just didn’t do enough research about the data we ingested.
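To see the failure mode concretely, here’s a quick sketch you can run in any shell. Splitting naively on every comma (which is all a plain comma delimiter does) turns the quoted field into several extra fields. The shortened record below is made up for illustration:

```shell
# A made-up, shortened record with a quoted field that contains commas.
line='10200202,HY386892,"SCHOOL, PUBLIC, BUILDING",false,2015'

# A naive comma split (awk -F',') ignores the quotes,
# so the 5 logical fields become 7.
echo "$line" | awk -F',' '{print NF}'   # prints 7, not 5
```

A CSV-aware parser like the SerDe we’re about to install treats the quoted value as a single field, which is exactly the fix we need.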

This is one of the trickiest parts about big data: making data usable. In my experience, you’ll be handed an assignment to get data from point A to Hadoop. That’s no problem: the command is hadoop fs -put <filename>. But getting the data to a usable state takes a lot more brainpower.

Luckily, with Hadoop and other open-source software, it’s more than acceptable to use someone else’s brainpower (make sure to give them credit and respect licenses!) to solve whatever problem you’re having. In this case, we’ll be using a CSV SerDe created by Larry Ogrodnek (thanks, Larry!). Writing a custom SerDe is a little out of scope at this point; I’ll touch on it later though!

First off, we need to install Maven, a build tool. Switch to user root (if you’re not root, don’t worry! You can still follow these steps to install Maven; just change the paths to somewhere your user can write to).

su root

Download it here

Select a tar.gz link to download and either download it locally and then move it into your cluster, or use wget straight from the Apache archive (any mirror’s apache-maven-3.3.3-bin.tar.gz will do):

wget https://archive.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
tar xzf apache-maven-3.3.3-bin.tar.gz -C /usr/local

cd /usr/local

ln -s apache-maven-3.3.3 maven

vi /etc/profile.d/maven.sh

(maven.sh is just a name I picked; any .sh file in /etc/profile.d/ gets sourced at login.) Add the following lines:

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Then log out and back in, or run source /etc/profile.d/maven.sh, so the new PATH takes effect.
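If you’re curious what those two exports actually do: M2_HOME just records where Maven lives, and the PATH line prepends Maven’s bin directory so the shell finds mvn there before anywhere else. A tiny sketch of the effect, using the paths from this guide:

```shell
# Prepend Maven's bin directory to PATH, exactly as the profile.d snippet does.
M2_HOME=/usr/local/maven
PATH=${M2_HOME}/bin:${PATH}

# The first PATH entry is now Maven's bin directory,
# so `mvn` resolves there first.
echo "$PATH" | cut -d: -f1   # prints /usr/local/maven/bin
```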

Next we get Mr. Ogrodnek’s SerDe:

su sherlock

The SerDe lives on GitHub at https://github.com/ogrodnek/csv-serde; grabbing the master branch as a zip gives you the csv-serde-master directory used below:

wget https://github.com/ogrodnek/csv-serde/archive/master.zip

unzip master.zip
cd csv-serde-master

mvn package

Now you wait for your project to be built!

Once that’s done:


add jar /home/sherlock/csv-serde-master/target/csv-serde-1.1.2-0.11.0-all.jar;

create table fixed_all_crimes(
id string,
casenumber string,
caldate string,
block string,
iucr string,
primarytype string,
description string,
locationdescription string,
arrest boolean,
domestic boolean,
beat string,
district string,
ward string,
communityarea string,
fbicode string,
xcoordinate string,
ycoordinate string,
year string,
updatedon string,
latitude decimal(10,0),
longitude decimal(10,0),
location string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
location '/user/sherlock/chicago_crimes';

Once that completes,

select * from chicago_crimes.fixed_all_crimes where id == '10200202';

I delimited the results with pipes ( | ) just to show the difference the serde made:

10200202|HY386892|08/17/2015 09:42:00 PM|011XX W 66TH ST|1345|CRIMINAL DAMAGE|TO CITY OF CHICAGO PROPERTY|SCHOOL, PUBLIC, BUILDING|false|false|0724|007|17|68|14|||2015|08/24/2015 03:30:37 PM|||

And just like that, we’ve repaired around 6,000 rows of data from our original set! We’ve also fixed the location data: the record above happens to have no GPS data associated with it, but the SerDe handles the quoted location column correctly too.

Takeaway points:

  1. Getting data usable in Hadoop is almost never as easy as getting it into Hadoop in the first place.
  2. Always know what type of data you’ll be dealing with and what your best approach should be.
  3. If you’re ever stumped, look around the internet. I guarantee you aren’t the first person with this problem (and if you still can’t find anything, ask a question somewhere!)

Until next time, happy Hadoopin!

Analyzing Chicago Crime Data with Apache Hive on HDP 2.3

After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on the Hadoop cluster we set up together in previous blog posts. Specifically, we’ll be looking at crime data from the City of Chicago from 2001 to the day this was first written, 8/26/2015. There are a couple of things we need to take care of before we get started though, Sherlock.

So if you shut down your cluster last time, please reboot it by starting up VirtualBox and double clicking your Virtual Machine (VM). Next, let’s make sure everything is up and running on the cluster. Do this by navigating to Ambari (user: admin, password: admin) again and verifying that there aren’t any errors or stopped services. If there are any, just try restarting them by selecting the option on each service. Once it’s all green, continue below!

You’ll notice that every time you start up your VM, you’re logged in as a user, root. This user is the superuser in the Linux VM that we have our Hadoop ecosystem running on. If you want more information, read here, but for now just know that this user has the ability to read, write, and delete any file on the file system. As a result, doing things as root can be dangerous if you don’t know what you’re doing (you could accidentally delete key system files). It’s also kind of cheating, since you’re an all-powerful user in a system. You will basically never be able to log in as root in any real, production system. So let’s make a new user account for us to play around with as we begin: sherlock.

To create this user, ensure you are root and execute the following commands:

useradd sherlock

passwd sherlock
The second command will allow you to set a password for your new user. Since this is a closed VM, I just set mine to nothing to save time later by hitting ‘Enter’ two times, despite Linux’s warnings of a bad password.


You shouldn’t get any errors and there will now be a new user in your Linux VM. Next issue the following command to create a space for this user in HDFS:

hadoop fs -mkdir /user/sherlock

hadoop fs -chown sherlock /user/sherlock

Note that this directory is NOT created automatically for you upon creating a new user.

sherlock DOES however have a home directory created automatically in the Linux environment: /home/sherlock/. You can switch users to sherlock from root by typing:

su sherlock

Once you’re sherlock, we need to get us some data. For this specific tutorial, we’re going to look at the crime data freely available from the City of Chicago. Download it by clicking the following link:

Once that’s done, you should have a csv file somewhere in your downloads with the name Crime_-_2001_to_present.csv. How do we move this to our cluster though?

There are a number of ways to do this: through WinSCP (on Windows) or through a nice service that comes with your cluster, Hue. Access it by navigating to the Hue link; you should be logged in automatically as the user hue. Hue is a great way to interact with your cluster as a developer. It’s a nice GUI that lets you browse the file system, move files around in HDFS, schedule jobs to crunch data, and interact with data on the fly. It’s also how we’ll get our data onto the cluster in this example.

Click on the brown file browser icon at the top of the webpage, and select ‘Upload>Files’ and select Crime_-_2001_to_present.csv.


That should finish relatively quickly. Congratulations, you’ve just uploaded your first data file to your Hadoop cluster!

Let’s go back to the command line and our super-sleuth, sherlock.

Make a copy of the data file:

hadoop fs -cp /user/hue/Crime_-_2001_to_present.csv /user/sherlock/chicago_crimes
Next, let’s start the Hive Command Line Interface by typing:

hive
And create a new database:

CREATE DATABASE chicago_crimes;

We’ll now create a new table:
CREATE external TABLE chicago_crimes.all_crimes(
id string,
casenumber string,
caldate string,
block string,
iucr string,
primarytype string,
description string,
locationdescription string,
arrest boolean,
domestic boolean,
beat string,
district string,
ward string,
communityarea string,
fbicode string,
xcoordinate string,
ycoordinate string,
year string,
updatedon string,
latitude decimal(10,0),
longitude decimal(10,0),
location string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sherlock/chicago_crimes';

If you’re familiar with RDBMSs, then you’ll be able to figure out what’s going on here. The only really different thing to know about is the ‘external’ keyword used before the table name. That tells Hive to leave the source data alone in its current location. If you omit the ‘external’ keyword, your data will be moved from its current location into Hive’s warehouse directory.

Anyways, once you submit that query, let’s make sure the data is all there:

select count(*) from chicago_crimes.all_crimes;

I got 288,126 reported crimes. Due to the dynamic nature of that file you downloaded earlier, you might get more than I did.

There were around 2.7 million people living in Chicago in 2015, according to US Census data. That’s about one reported crime for every ten people in Chicago.
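A quick back-of-the-envelope check on that ratio, using the count I got (yours may differ):

```shell
# People per reported crime: ~2.7 million residents / 288,126 crimes.
awk 'BEGIN { printf "%.1f\n", 2700000 / 288126 }'   # prints 9.4
```

So roughly one reported crime per nine or ten residents over the whole 2001–2015 span.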

This is obviously just the beginning of our adventure with this data. If you want to see what we’ll be tackling next, stick this query into Hive and try to see what is “wrong” with our data in this table: select * from chicago_crimes.all_crimes where locationdescription == "SCHOOL";

HINT: Loading data is NEVER as easy as you want it to be.


The Basics of Administrating a Hadoop Cluster

So assuming you followed and completed my first post, Getting Started with Hortonworks Data Platform 2.3, you should now have your very own Hadoop cluster (albeit one that pales a bit next to Yahoo!’s reported 4,500-node cluster).

Still, you’ve taken a very big step towards learning about Hadoop and how to use it effectively. Well done!

We left off at the previous post looking at a webpage in your browser: the front end of Ambari.

Hopefully it looks something like this:


Ambari is a very important source of information for everything about your cluster. The center tiles tell you at a glance many of the most important aspects of your cluster including available disk space, nodes available, and cluster uptime. To the left, you have listed a number of Services installed on your cluster. Services are individual components that live in the Hadoop ecosystem and provide a great number of capabilities that do not come with the distributed file system that Hadoop is most known for.

Take a moment to click around and investigate what these are. Some of the essentials are HDFS, MapReduce2, Yarn, Hive, and Pig. Of course there are many more listed (and even more available out in the wilds of the Internet) but these are the ones that are included with the Hortonworks Data Platform.

Service health is indicated by the color next to the name: green indicates good health, yellow indicates warnings, and red indicates errors or alerts. The first aid kit icon indicates that the service is stopped and in maintenance mode.

To start, stop, restart or enter maintenance mode for any service, select it on the left then click the ‘Service Actions’ drop down menu on the far right of the service’s page, circled below.


Installing a new service is as easy as selecting the ‘Actions’ drop down menu below all the listed Services and following the wizard to install. I’ll go through that in a later post however.

Next we’ll look at the other tabs available in Ambari: Hosts, Alerts, Admin.

The Hosts tab shows all the nodes, both worker and master, that are connected to your cluster along with their health (same indicators as before), name, IP address, server rack, cores, RAM, disk usage, load averages, Hadoop version installed, and finally: all the components that are installed on that particular node. In our cluster, we obviously only have one that we can see here:


The Alerts tab gives you a play-by-play of the things that go wrong in your cluster like process failures, node failure, or critical disk space usage. It’ll tell you what went wrong, when it went wrong, what service it’s associated with and what its current state is. We get some alerts from when Hortonworks first established this virtual machine image. In my case, this happened 15 days ago. You can also see that those alerts have been resolved. Thank you Hortonworks.


The Admin tab doesn’t let you do much, but it’s still full of information that’s very important in administrating a Hadoop cluster, particularly if you have other users running about inside of it (we’ll simulate this later). It tells you which services you have installed and which versions of those services you’re running (nice to know when someone asks, “Hey, what version of Hive are we running?” because they need to know whether a bug fix was implemented).


Also on this tab is a thing called Kerberos. Kerberos is a useful security feature in Hadoop; it was essentially the first and only security option available to a Hadoop cluster beyond basic user authentication. Unfortunately, implementing it is rather complicated and outside the scope of this post. If done incorrectly, your entire cluster will be locked down, and then your only options are to pray to the Hadoop gods for mercy or to start completely over from scratch. The latter is not a good idea in a production environment.

This post gave a good overview of what Ambari is and how it helps you administer a cluster on a basic level. The next post will involve us actually getting our hands dirty and typing things on the command line. Exciting!

Until then, questions or comments below please.


Getting Started with Hortonworks Data Platform 2.3

As my first post, I’m going to walk through setting up Hortonworks Data Platform (HDP) 2.3. HDP is very nice because it is free to use at any level for any sized cluster, from curious developers with virtual environments to Fortune 50 companies with 100+ node clusters. The cost comes from requiring support on Hortonworks’ software (more on that later).

There’s a couple of reasons for this:

  1. I’m familiar with developing on and ‘administrating’ an HDP 2.2 VM and want to see the differences.
  2. I want to do something I know I can accomplish on my first post.
  3. Finally, I’ve found that getting started with a new technology is often the most difficult part. Hopefully this will help someone stuck on their first steps into the world of Big Data.

So here we go:

  1. Get yourself a hot new copy of VirtualBox (or whatever VM hosting environment you please, this guide will be through VirtualBox, however) at Oracle’s website:
    • I’m using VirtualBox 5.0 on Windows 8.1 at the time of this post.
    • Install it using the wizard
  2. Make your way over to Hortonworks’ website and download the .ova file for their VirtualBox environment:
    • I got HDP 2.3, again for Windows 8.1
    • It’s a large file (~7GB), so make sure you have enough space in your Downloads area
  3. Once that’s downloaded, start up VirtualBox. Navigate your way through any “Welcome to VirtualBox” splash screens until you find yourself looking at something like this:

    • To import your HDP 2.3 image, do the following:
      1. File > Import Appliance…
      2. Navigate to where your HDP 2.3 .ova file was saved and select it
      3. Select Next > Import
    • It might take some time to import it; that’s normal, don’t worry.
    • Once it’s imported, your VirtualBox window should look something like this:
    • Double click the icon in the VirtualBox tray and it should bring up a command prompt window and boot up the machine. After it boots, the window should look like this:
    • Do as it says: open a browser and navigate to the address shown on that screen.
  4. By logging into your cluster’s Ambari interface (Username: admin; Password: admin), you can now begin to administrate your cluster.

Congratulations, you now have your very own Hadoop cluster set up in a VirtualBox environment courtesy of Hortonworks and their robust Hortonworks Data Platform.

Please leave a comment if you have any questions or concerns!