Installing and Running Apache NiFi on your HDP Cluster


Hey everyone,

I learned today about a cool ETL/data pipeline/make-your-life-easier tool that was recently released by the NSA (not kidding) as a way to manage the flow of data in and out of systems: Apache NiFi. To me, that functionality seems to match PERFECTLY with what people like to do with Hadoop. This guide will just set up NiFi, not do anything with it (that’ll come later!).

Things you’ll need:

  • HDP2.3
  • Maven > 3.1

You don’t even need root access to build it!


Here’s how to get it running on your HDP2.3 cluster:

First we need to actually get the source code:

wget https://github.com/apache/nifi/archive/master.zip

unzip master.zip

cd nifi-master

export MAVEN_OPTS="-Xms1024m -Xmx3076m -XX:MaxPermSize=256m"

mvn -T 2.0C clean install

(Takes about 8 minutes to run all the tests)
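
If you just want the binaries and don’t care about running the full test suite, Maven’s standard -DskipTests flag works here too (a shortcut, not part of the original build command):

mvn -T 2.0C clean install -DskipTests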

After that’s done,

cd nifi-assembly/target

tar -zxvf nifi-0.3.1-SNAPSHOT-bin.tar.gz

cd nifi-0.3.1-SNAPSHOT

vi conf/nifi.properties

On line 106 (:106 in vim), set the web UI port:

nifi.web.http.port=9000
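
If you don’t feel like counting lines, a quick grep confirms the property ended up the way you want (purely a convenience check):

grep nifi.web.http.port conf/nifi.properties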

bin/nifi.sh install

service nifi start

Then navigate to http://localhost:9000/nifi
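
If the page doesn’t load right away, NiFi may still be starting up. You can keep an eye on it with the service status command and the application log (log path relative to the nifi-0.3.1-SNAPSHOT directory we extracted):

service nifi status

tail -f logs/nifi-app.log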

There you have it! With any luck, you now have NiFi installed!

Loading Data into Hive Using a Custom SerDe

Welcome back! If you read my previous post, you know that we’ve run into an issue with the Chicago crime data we just loaded into Hive. Specifically, one of the columns has commas embedded in its data.

Here is one example:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,SCHOOL,NULL,NULL,false,0724,007,17,68,14,2015,NULL,NULL

The data in its raw form looks like this:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,“SCHOOL, PUBLIC, BUILDING”,false,false,0724,007,17,68,14,,,2015,08/24/2015 03:30:37 PM,,,

The problem column is locationdescription; in the raw row its value is “SCHOOL, PUBLIC, BUILDING”.

So what happened here? The data in that field contains commas, and we told Hive to split each row into fields on commas. As a result, the record got split prematurely at the commas inside the field. Hive didn’t do anything wrong; we just didn’t do enough research about the data we ingested.
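
To see the problem concretely, you can count fields the same naive way Hive’s default delimiter does. Assuming you save the raw line above into a file called sample_row.csv (a name I’m making up for this illustration), a plain comma split reports more fields than the 22 columns in our table, because the commas inside the quoted location description get counted as delimiters:

awk -F',' '{ print NF }' sample_row.csv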

This is one of the trickiest parts about big data: making data usable. In my experience, you’ll be handed an assignment to get data from point A to Hadoop. That’s no problem, the command is hadoop fs -put <filename>. But getting the data to a usable state takes a lot more brainpower.

Luckily, with Hadoop and other open-source software, it’s more than acceptable to use someone else’s brainpower (make sure to give them credit and respect licenses!) to solve whatever problem you’re having. In this case, we’ll be using the csv-serde created by Larry Ogrodnek (https://github.com/ogrodnek/csv-serde/blob/master/readme.md). Thanks Larry! Writing a custom SerDe ourselves is a little out of scope at this point; I’ll touch on it later though!

First off, we need to install Maven, a build and packaging tool. Switch to user root (if you’re not root, don’t worry! You can still follow these steps to install Maven, just change the paths to somewhere your user can write to.)

su root

Download it here https://maven.apache.org/download.cgi

Select a tar.gz link to download and either download it locally and then move it into your cluster or use wget:

wget http://mirrors.koehn.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

tar xzf apache-maven-3.3.3-bin.tar.gz -C /usr/local

cd /usr/local

ln -s apache-maven-3.3.3 maven

vi /etc/profile.d/maven.sh
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}
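
For the new PATH to take effect in your current shell, source the profile script (or log out and back in) and confirm Maven is picked up:

source /etc/profile.d/maven.sh

mvn -version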

Next we get Mr. Ogrodnek’s Serde:

su sherlock

wget https://github.com/ogrodnek/csv-serde/archive/master.zip

unzip master.zip

cd csv-serde-master

mvn package

Now you wait for your project to be built!
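
When the build finishes, the assembled jar lands in the target directory; a quick ls confirms it’s there (the exact version in the filename may differ from the one I use below):

ls target/csv-serde-*-all.jar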

Once that’s done:

hive

add jar /home/sherlock/csv-serde-master/target/csv-serde-1.1.2-0.11.0-all.jar;

create external table chicago_crimes.fixed_all_crimes(
id string,
casenumber string,
caldate string,
block string,
iucr string,
primarytype string,
description string,
locationdescription string,
arrest boolean,
domestic boolean,
beat string,
district string,
ward string,
communityarea string,
fbicode string,
xcoordinate string,
ycoordinate string,
year string,
updatedon string,
latitude decimal(10,0),
longitude decimal(10,0),
location string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
location '/user/sherlock/chicago_crimes';
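
If you want to double-check that the new table really picked up the CSVSerde before querying it, DESCRIBE FORMATTED lists the SerDe library in its output (just a sanity check):

describe formatted chicago_crimes.fixed_all_crimes;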

Once the table is created, query our problem record again:

select * from chicago_crimes.fixed_all_crimes where id == '10200202';

I delimited the results with pipes ( | ) just to show the difference the serde made:

10200202|HY386892|08/17/2015 09:42:00 PM|011XX W 66TH ST|1345|CRIMINAL DAMAGE|TO CITY OF CHICAGO PROPERTY|SCHOOL, PUBLIC, BUILDING|false|false|0724|007|17|68|14|||2015|08/24/2015 03:30:37 PM|||

And just like that, we’ve repaired around 6000 rows of data from our original set! The SerDe also fixed the location data: the record above doesn’t have any GPS coordinates associated with it, but for records that do, the quoted location field is now parsed correctly too.

Take away points:

  1. Getting data to be usable in Hadoop is almost never going to be as easy as getting it first into Hadoop.
  2. Always know what type of data you’ll be dealing with and what your best approach should be.
  3. If you’re ever stumped, look around the internet. I guarantee you aren’t the first person with this problem (and if you still can’t find anything, ask a question somewhere!)

Until next time, happy Hadoopin!

Analyzing Chicago Crime Data with Apache Hive on HDP 2.3

After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on the Hadoop cluster we set up together in previous blog posts. Specifically, we’ll be looking at crime data from the City of Chicago from 2001 to the day this was first written, 8/26/2015. There are a couple of things we need to take care of before we get started though, Sherlock.


So if you shut down your cluster last time, please reboot it by starting up VirtualBox and double clicking your Virtual Machine (VM). Next, let’s make sure everything is up and running on the cluster. Do this by navigating to Ambari (user: admin, password: admin) again and verifying that there aren’t any errors or stopped services. If there are any, just try restarting them by selecting the option on each service. Once it’s all green, continue below!

You’ll notice that every time you start up your VM, you’re logged in as the user root. This user is the superuser in the Linux VM that our Hadoop ecosystem runs on. If you want more information, read here, but for now just know that this user has the ability to read, write, and delete any file on the file system. As a result, doing things as root can be dangerous if you don’t know what you’re doing (you could accidentally delete key system files). It’s also kind of cheating, since you’re an all-powerful user in the system. You will basically never be able to log in as root on any real, production system. So let’s make a new user account for us to play around with as we begin: sherlock.

To create this user, ensure you are root and execute the following commands:

useradd sherlock

passwd sherlock
The second command will allow you to set a password for your new user. Since this is a closed VM, I just set mine to nothing to save time later by hitting ‘Enter’ two times, despite Linux’s warnings of a bad password.


You shouldn’t get any errors and there will now be a new user in your Linux VM. Next issue the following command to create a space for this user in HDFS:

hadoop fs -mkdir /user/sherlock/

****Note that this directory is NOT created automatically for you upon creating a new user****
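
A quick side note: depending on how HDFS permissions are set up in your sandbox, root may not be allowed to create directories under /user. If the mkdir above gets rejected, a common workaround is to run it as the hdfs superuser and then hand ownership to sherlock (adjust to your own setup):

sudo -u hdfs hadoop fs -mkdir /user/sherlock

sudo -u hdfs hadoop fs -chown sherlock:sherlock /user/sherlock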

sherlock DOES however have a home directory created automatically in the Linux environment: /home/sherlock/. You can switch users to sherlock from root by typing:

su sherlock

Once you’re sherlock, we need to get us some data. For this specific tutorial, we’re going to look at the crime data freely available from the City of Chicago. Download it by clicking the following link:

https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD

Once that’s done, you should have a csv file somewhere in your downloads with the name Crime_-_2001_to_present.csv. How do we move this to our cluster though?

There are a number of ways to do this, either through WinSCP (on Windows) or through a nice service that comes with your cluster, Hue. Access it by navigating to this link: http://127.0.0.1:8000. You should be logged in automatically as the user, hue. Hue is a great way to interact with your cluster as a developer. It’s a nice GUI that allows you to look at the file system, move files around in HDFS, schedule jobs to crunch data, and interact with data on the fly. It’s also how we get our data onto our cluster in this example.

Click on the brown file browser icon at the top of the webpage, and select ‘Upload>Files’ and select Crime_-_2001_to_present.csv.


That should finish relatively quickly. Congratulations, you’ve just uploaded your first data file to your hadoop cluster!
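
If you’d rather stay on the command line instead of using Hue, scp plus hadoop fs -put does the same job. The example below assumes the sandbox’s usual forwarded SSH port of 2222 and the paths we’ve been using; run the scp from your host machine and the put from inside the VM, and adjust both to your own setup:

scp -P 2222 Crime_-_2001_to_present.csv root@127.0.0.1:/home/sherlock/

hadoop fs -put /home/sherlock/Crime_-_2001_to_present.csv /user/hue/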


Let’s go back to the command line and our super-sleuth, sherlock.

Make a directory for the crime data in sherlock’s HDFS home and copy the uploaded file into it:

hadoop fs -mkdir /user/sherlock/chicago_crimes

hadoop fs -cp /user/hue/Crime_-_2001_to_present.csv /user/sherlock/chicago_crimes/

Next, let’s start the Hive Command Line Interface:

hive


And create a new database:

CREATE DATABASE chicago_crimes;

We’ll now create a new table:

CREATE EXTERNAL TABLE chicago_crimes.all_crimes(
id string,
casenumber string,
caldate string,
block string,
iucr string,
primarytype string,
description string,
locationdescription string,
arrest boolean,
domestic boolean,
beat string,
district string,
ward string,
communityarea string,
fbicode string,
xcoordinate string,
ycoordinate string,
year string,
updatedon string,
latitude decimal(10,0),
longitude decimal(10,0),
location string
)
row format delimited fields terminated by ','
stored as textfile
location '/user/sherlock/chicago_crimes';

If you’re familiar with relational databases, you’ll be able to figure out what’s going on here. The one thing that’s really different is the ‘external’ keyword before the table name. That tells Hive to leave the source data alone in its current location. If you omit the ‘external’ keyword, your data will be moved from its current location into Hive’s warehouse directory.
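
Because the table is external, Hive simply reads whatever files are already sitting in /user/sherlock/chicago_crimes. A quick limit query is an easy spot check that rows are coming back at all (completely optional):

select id, primarytype, locationdescription from chicago_crimes.all_crimes limit 5;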

Anyway, once the table is created, let’s make sure the data is all there:

select count(*) from chicago_crimes.all_crimes;

I got 288,126 reported crimes. Since the file you downloaded earlier is updated regularly, you might get more than I did.

There are around 2.7 million people living in Chicago as of 2015, according to US Census data. That’s about 1 crime for every 10 people in Chicago.
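
If you want to slice things a little further while you’re in the CLI, a simple GROUP BY gives a rough per-year breakdown. Rough, because as we’ll see next time, some rows are misparsed and their columns end up shifted:

select year, count(*) from chicago_crimes.all_crimes group by year;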


This is obviously just the beginning of our adventure with this data. If you want to see what we’ll be tackling next, stick this query into Hive and try to see what is “wrong” with our data in this table: select * from chicago_crimes.all_crimes where locationdescription == "SCHOOL";

HINT: Loading data is NEVER as easy as you want it to be.

Later!