Loading Data into Hive Using a Custom SerDe

Welcome back! If you read my previous post, you know that we’ve run into an issue with the Chicago crime data we just loaded into Hive. Specifically, one of the columns contains commas embedded in the field data itself.

Here is one example:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,SCHOOL,NULL,NULL,false,0724,007,17,68,14,2015,NULL,NULL

The data in its raw form looks like this:
10200202,HY386892,08/17/2015 09:42:00 PM,011XX W 66TH ST,1345,CRIMINAL DAMAGE,TO CITY OF CHICAGO PROPERTY,"SCHOOL, PUBLIC, BUILDING",false,false,0724,007,17,68,14,,,2015,08/24/2015 03:30:37 PM,,,

The problem column is LOCATIONDESCRIPTION: the quoted value "SCHOOL, PUBLIC, BUILDING" in the raw data.

So what happened here? The field data contains commas, and we told Hive to split each row into fields on commas. As a result, the record was split prematurely at the commas inside the field. Hive didn’t do anything wrong; we just didn’t do enough research on the data we ingested.
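The failure mode is easy to reproduce outside Hive with a few lines of Python (a sketch with a shortened record; Python’s csv module does the same quote-aware parsing that a CSV SerDe gives us):

```python
import csv
import io

# A simplified raw record: one quoted field contains commas
line = '10200202,HY386892,"SCHOOL, PUBLIC, BUILDING",false'

# Naive comma split (what our original Hive table definition does):
naive = line.split(',')
print(len(naive))   # 6 fields -- the quoted field was torn into three pieces

# Quote-aware parsing (what the CSV SerDe does):
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))  # 4 fields -- "SCHOOL, PUBLIC, BUILDING" stays intact
```

The naive split treats every comma as a delimiter, so a 4-field record becomes 6 mangled fields and everything after LOCATIONDESCRIPTION shifts right.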

This is one of the trickiest parts of big data: making data usable. In my experience, you’ll be handed an assignment to get data from point A to Hadoop. That’s no problem; the command is hadoop fs -put <filename>. But getting the data into a usable state takes a lot more brainpower.

Luckily, with Hadoop and other open-source software, it’s more than acceptable to use someone else’s brainpower (make sure to give them credit and respect licenses!) to solve whatever problem you’re having. In this case, we’ll be using a CSV SerDe created by Larry Ogrodnek (https://github.com/ogrodnek/csv-serde/blob/master/readme.md). Thanks, Larry! Writing a custom SerDe is a little out of scope at this point; I’ll touch on it later though!

First off, we need to install Maven, the build tool we’ll use to package the SerDe. Switch to the root user (if you can’t become root, don’t worry! You can still follow these steps to install Maven; just change the paths to somewhere your user can write to.)

su root

Download it here https://maven.apache.org/download.cgi

Select a tar.gz link to download and either download it locally and then move it into your cluster or use wget:

wget http://mirrors.koehn.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

tar xzf apache-maven-3.3.3-bin.tar.gz -C /usr/local

cd /usr/local

ln -s apache-maven-3.3.3 maven

vi /etc/profile.d/maven.sh
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Then load the new environment and confirm Maven is on your path:

source /etc/profile.d/maven.sh
mvn -version

Next we get Mr. Ogrodnek’s SerDe:

su sherlock

wget https://github.com/ogrodnek/csv-serde/archive/master.zip

unzip master.zip

cd csv-serde-master

mvn package

Now you wait for your project to be built!

Once that’s done, open the Hive shell and register the jar:

add jar /home/sherlock/csv-serde-master/target/csv-serde-1.1.2-0.11.0-all.jar;

create table fixed_all_crimes(
id string,
casenumber string,
caldate string,
block string,
iucr string,
primarytype string,
description string,
locationdescription string,
arrest boolean,
domestic boolean,
beat string,
district string,
ward string,
communityarea string,
fbicode string,
xcoordinate string,
ycoordinate string,
year string,
updatedon string,
latitude decimal(10,0),
longitude decimal(10,0),
location string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
location '/user/sherlock/chicago_crimes';

Once that completes,

select * from chicago_crimes.fixed_all_crimes where id = '10200202';

I delimited the results with pipes ( | ) just to show the difference the serde made:

10200202|HY386892|08/17/2015 09:42:00 PM|011XX W 66TH ST|1345|CRIMINAL DAMAGE|TO CITY OF CHICAGO PROPERTY|SCHOOL, PUBLIC, BUILDING|false|false|0724|007|17|68|14|||2015|08/24/2015 03:30:37 PM|||

And just like that, we’ve repaired around 6000 rows of data from our original set! The serde also fixed the location data: the record above doesn’t have any GPS data associated with it, but for records that do, the quoted location field (which contains commas too) is now parsed correctly.
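If you’re curious how many rows in your own extract were affected, a quick local check (a sketch; the filename here is hypothetical, so point it at your copy of the data) is to count the records where a quoted field contains at least one embedded comma:

```python
import csv

# Hypothetical local copy of the extract; adjust the path to your file.
FILENAME = 'chicago_crimes.csv'

def count_embedded_commas(path):
    """Count records with a comma inside a field -- exactly the rows a
    naive comma split would have torn apart."""
    affected = 0
    with open(path, newline='') as f:
        for fields in csv.reader(f):
            if any(',' in field for field in fields):
                affected += 1
    return affected

if __name__ == '__main__':
    print(count_embedded_commas(FILENAME))
```

Running this against the full Chicago crimes extract is where the "around 6000" figure for a data set like ours comes from.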

Takeaway points:

  1. Getting data into a usable state is almost never as easy as getting it into Hadoop in the first place.
  2. Always know what type of data you’ll be dealing with and what your best approach should be.
  3. If you’re ever stumped, look around the internet. I guarantee you aren’t the first person with this problem (and if you still can’t find anything, ask a question somewhere!)

Until next time, happy Hadoopin’!