Category Archives: Big Data

Machine Learning 101

A survey of the landscape

Machine Learning is the practice of using computer algorithms to extract information from datasets; in many cases it is synonymous with Predictive Modeling. Rather than have a human sift through the proverbial haystack, these advanced algorithms, often powered by very large datasets and computing clusters, perform a variety of complex tasks. Machine Learning has a rich and complex mathematical underpinning that can be daunting to newcomers to the field. However, most Machine Learning algorithms fall into one of two classifications, based on the problem they attempt to solve: Supervised Learning and Unsupervised Learning.

How to Train Your Computer

The first step in any Machine Learning task is to create what are known as a training dataset and a test dataset. The training dataset is used to teach the algorithm, and the test dataset is used only to evaluate how well the algorithm has learned. It is not always the case, but generally in Supervised Learning we are attempting to make a prediction. A trivial example: given a person's age, weight, height, and number of cigarettes smoked a day, can we predict whether that person has cancer? This process is known as Supervised Learning because when we train the algorithm, we provide all the data needed for the prediction (age, weight, height, number of cigarettes) AND whether each person has cancer. There are various algorithms in this space; a few common ones are Logistic Regression, Decision Trees, Random Forests, and Gradient Boosted Decision Trees. Esoteric names aside, at a high level each of these algorithms solves the problem by learning the relationship between our predictor variables (age, weight, height, number of cigarettes) and the response: whether the person has cancer.

Once we have trained the algorithm, we can test how well it is able to predict by using our test dataset. Here we actually know the result (whether a person has cancer), but we show the algorithm only the data, not the answer, and then see how many predictions are correct. This is a very simplified example, yet most people have undergone this exact process without knowing it: if you have ever had your blood drawn, various measurements of your blood were fed through a predictive model.
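To make that concrete, here is a minimal sketch of the train/test workflow in Python with scikit-learn; the feature values and labels are invented purely for illustration:

# a toy supervised learning example -- all numbers are made up
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# predictors: [age, weight, height, cigarettes per day]
X = [[52, 180, 70, 20], [30, 150, 65, 0], [61, 200, 72, 30], [25, 140, 68, 0],
     [58, 210, 69, 25], [33, 160, 66, 0], [64, 190, 71, 15], [28, 145, 64, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # response: 1 = has cancer, 0 = does not

# hold some rows out so we can grade the model on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)          # training: the data AND the answers
print(model.score(X_test, y_test))   # testing: fraction predicted correctly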

The Big (Data) Picture

While we have seen a simple example of what Machine Learning looks like, in reality the datasets and systems are much more complex. If you have ever received a promotional email from your favorite store, chances are it was generated by a Machine Learning algorithm, only instead of using 4 variables about you, it used 2,000. Every click you make on a website, every item you view, and how long you view it is cataloged and stored. This is Big Data.


Clustering Users in a Big Data World

Machine Learning to the Rescue

Clustering, or Cluster Analysis, is one of the primary branches of Machine Learning: a technique for uncovering possibly hidden groups within a dataset. It is known as an unsupervised learning technique, which means we do not need to give the model a response variable to predict; the model simply learns the innate structure of our dataset. Traditionally, these techniques could only be applied at a small scale. With the rise of various Big Data technologies, companies are able to store vast amounts of user behavior: every in-store purchase, website click, and app swipe is stored, cataloged, and used in Machine Learning algorithms.
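As a quick sketch of what that looks like in code (assuming scikit-learn, with invented per-user behavior counts):

# a toy clustering example -- the behavior counts are made up
from sklearn.cluster import KMeans

# one row per user: [site visits, items viewed, promo emails clicked]
users = [[50, 200, 1], [45, 180, 0],
         [2, 5, 9], [3, 8, 11],
         [20, 60, 4], [25, 70, 5]]

# note there is no response variable; KMeans just looks for structure
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(users)
print(labels)  # a cluster assignment for each user, e.g. [0 0 1 1 2 2]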

Marketing Departments love Data Scientists

Customers visiting an ecommerce website exhibit a wide range of digital behavior. Some customers visit your site and purchase an item within the first 10 minutes; some only come after you send a promotional email. In an ecommerce world, we can use a clustering algorithm to segment our customers into various groups based on their digital behavior. One great use case for this technique is sending different promotional offers to different groups: you can send a targeted email offer to the group that you know responds well to email offers, and a push notification to your app users. Big Data and Machine Learning enable companies to communicate their brand and message more effectively, provide better offers, and tailor the experience to fit each customer's needs.
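Once each customer carries a cluster label, acting on the segments can be as simple as a lookup. A hypothetical sketch (the segment-to-channel mapping here is made up):

# hypothetical routing of offers by cluster label -- in practice the
# mapping from segment to channel would come from your own analysis
channel_by_segment = {0: "email", 1: "push_notification", 2: "on_site_banner"}

for customer_id, segment in [("c1001", 0), ("c1002", 1), ("c1003", 2)]:
    print(customer_id, "->", channel_by_segment[segment])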

Web Site Optimization

Not all customers are created equal, or better put, it is best to listen to the voice of your best customers. Your best customers' digital behavior is telling you what they like and do not like about your web site. When clustering users based on web site behavior, it is important to identify the key behaviors that top customers exhibit and tailor their experiences accordingly. To the female customer who only buys high heels, should you show a pop-up for a new drill? When designing a new page, it can be important to look at the interaction behavior of your best segments, try to understand how these customers are interacting, and design for them.

Install R Package from source (not on CRAN)

I recently needed to install a graph mining package in R that was not on CRAN. To install, untar/gunzip the package, then run:

install.packages("C:/Users/me/Desktop/subgraphMining", repos = NULL, type = "source")


Pig Filtering on NOT match

A quick example of Pig code that returns the records that do NOT match some criteria.

If we have a relation called A with the following data:

ip       product_list
1        akdfjfd_sugg='apple'
2        ;liiirgij_view='real'
3        adfd_sugg='books'


We can remove records containing the word "sugg" from our relation (real-world use case: I want to remove hits from Omniture weblogs that are suggestions made by a recommender system, as these are not items clicked by our users):

A = load 'hits' as (ip:chararray, product_list:chararray);
filtered_A = filter A by NOT (product_list matches '.*sugg.*');
dump filtered_A;
2         ;liiirgij_view='real'
-- easy

Using Jython and Pig: Parsing variable length / nested data

The problem 

In Hadoop I have a year of weblogs. One of the fields that we capture from Omniture is the 'product_list'.

This string captures all the products and their quantities/amounts at checkout. Since one customer can check out with one or more items, this string cannot be parsed into a fixed-length data structure.
Read more here: http://blogs.adobe.com/digitalmarketing/analytics/products-variable-inside-omniture-sitecatalyst/

The string

;BGF14_C1DS4;1;120.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_Y1N8C;1;101.0;278=145.00;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C1J5Y;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T7BPM;1;495.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T6WXA;1;325.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T71C8;1;198.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T71E0;1;275.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C0X4R;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C15D1;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;Tax & Shipping;;;202=0.0|203=15.0,;Gift Wrap;;;211=0.0

For a single product, the list contains fields delimited by a semicolon; we will refer to this as a product:

productid;qty;cost;evars;

example:

BGF14_C1DS4;1;120.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;

But a customer can purchase multiple products, so we end up getting

product,product,product,product, where each product's fields are separated by ";" and the products themselves are separated by ",". This is how we end up with the aforementioned raw string.

The Goal

Find the highest-purchased product (or really, just be able to aggregate on the product id) on a given day.

The Solution

In order to accomplish this, we need two things: a way to handle the parsing, and an appropriate data structure to work with in Pig.

Note: Pig's complex data types can bend your brain in weird ways.

Also, I only care about the productid, qty, and cost for now.

The Python UDF:

In a file called parser.py:

@outputSchema("record:bag{t:(productid:chararray,qty:double, cost:double)}")
def parse_list(product_list):
    product_dictionary = {"productid": [], "qty": [], "amount": []}
    outbag = []
    if product_list is None:
        return outbag
    for product in product_list.split(','):
        fields = product.split(";")
        # every product begins with a ';', so split() puts an empty string
        # at index 0 -- remove it
        del fields[0]
        if len(fields) > 1:
            productid = fields[0].strip()
            qty = fields[1].strip() if len(fields) > 2 else None
            amount = fields[2].strip() if len(fields) > 3 else None
            product_dictionary["productid"].append(productid)
            product_dictionary["qty"].append(qty)
            product_dictionary["amount"].append(amount)

    # the ith index of each list lines up across the keys, so zip the
    # three lists back together into one tuple per product
    for i in range(len(product_dictionary["productid"])):
        tuples = (product_dictionary["productid"][i],
                  product_dictionary["qty"][i],
                  product_dictionary["amount"][i])
        outbag.append(tuples)
    return outbag

Breaking some of this down: most of the logic implemented here was already discussed above (the delimited string format). Two things are very important. First, because we are using Jython, and because of how Jython is integrated with Pig, Python data types have a matching counterpart in Pig. Second, while the behavior of each construct is vastly different (a Pig bag does not behave like a Python list), it is helpful to know what we are returning. See http://pig.apache.org/docs/r0.12.0/udf.html for more information.

Data goes in the function as:

;BGF14_C1DS4;1;120.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_Y1N8C;1;101.0;278=145.00;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C1J5Y;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T7BPM;1;495.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T6WXA;1;325.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T71C8;1;198.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_T71E0;1;275.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C0X4R;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;BGF14_C15D1;1;0.0;;eVar10=KS|eVar11=66208|eVar13=MISSION HILLS|eVar64=Second Business Day,;Tax & Shipping;;;202=0.0|203=15.0,;Gift Wrap;;;211=0.0

is transformed inside the function to:

{'amount': ['120.0', '101.0', '0.0', '495.0', '325.0', '198.0', '275.0', '0.0', '0.0', '', ''],
 'qty': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '', ''],
 'productid': ['BGF14_C1DS4', 'BGF14_Y1N8C', 'BGF14_C1J5Y', 'BGF14_T7BPM', 'BGF14_T6WXA', 'BGF14_T71C8', 'BGF14_T71E0', 'BGF14_C0X4R', 'BGF14_C15D1', 'Tax & Shipping', 'Gift Wrap']}

The ith index of each dictionary value matches across the keys, so the final loop outputs the following data structure:

   #productid      #qty  #amount
[('BGF14_C1DS4', '1', '120.0'),
 ('BGF14_Y1N8C', '1', '101.0'),
 ('BGF14_C1J5Y', '1', '0.0'),
 ('BGF14_T7BPM', '1', '495.0'),
 ('BGF14_T6WXA', '1', '325.0'),
 ('BGF14_T71C8', '1', '198.0'),
 ('BGF14_T71E0', '1', '275.0'),
 ('BGF14_C0X4R', '1', '0.0'),
 ('BGF14_C15D1', '1', '0.0'),
 ('Tax & Shipping', '', ''),
 ('Gift Wrap', '', '')]

pretty neato

Some quick notes:
Python dictionary = Pig map
Python list = Pig bag
Python tuple = Pig tuple

The final Pig data structure we are aiming for is a bag of tuples with the fields (productid, qty, cost). A bag can be of variable length, which is important because one product string can produce one to many products. This is specified with the @outputSchema decorator:

@outputSchema("record:bag{t:(productid:chararray,qty:double, cost:double)}")

Now the Pig part:

register 'parser.py' using jython as udf;
A = load 'productlist.txt' as (p:chararray);
B = foreach A generate udf.parse_list(p);
   -- we now have a bag of tuples, which we need to flatten
   -- ({(BGF14_C1DS4,1,120.0),(BGF14_Y1N8C,1,101.0),(BGF14_C1J5Y,1,0.0),(BGF14_T7BPM,1,495.0),(BGF14_T6WXA,1,325.0),
   -- (BGF14_T71C8,1,198.0),(BGF14_T71E0,1,275.0),(BGF14_C0X4R,1,0.0),(BGF14_C15D1,1,0.0),(Tax & Shipping,,),(Gift Wrap,,)})
   --
   -- from the book Programming Pig: "flatten can also be applied to a tuple. In this case,
   -- it does not produce a cross product; instead, it elevates each field in the tuple to a top-level field."
C = foreach B generate FLATTEN($0);
 -- This is just what we want
 -- (BGF14_C1DS4,1,120.0)
 -- (BGF14_Y1N8C,1,101.0)
 -- (BGF14_C1J5Y,1,0.0)
 -- (BGF14_T7BPM,1,495.0)
 -- (BGF14_T6WXA,1,325.0)
 -- (BGF14_T71C8,1,198.0)
 -- (BGF14_T71E0,1,275.0)
 -- (BGF14_C0X4R,1,0.0)
 -- (BGF14_C15D1,1,0.0)

-- we can now group by product id and find the top-selling products / do all sorts of aggregations
D = group C by productid;
E = foreach D generate group as productid, SUM(C.cost);
-- super cool

Windowing Functions in PIG

PiggyBank

A simple example of the following SQL in Pig. The table is composed of web hits:

select *, count(*) over (partition by visid_high order by date_time) from nmdata

Launch Pig with awareness of the HCatalog metastore:

pig -useHCatalog

register piggybank.jar;
define Stitch org.apache.pig.piggybank.evaluation.Stitch;
define Over org.apache.pig.piggybank.evaluation.Over;
A = load 'nmdata' using org.apache.hcatalog.pig.HCatLoader(); 
B = group A by visid_high; 
C = foreach B { 
     C1 = order A by date_time; 
     generate FLATTEN(Stitch(C1, Over(C1.$3, 'count'))); 
}; 
DUMP C;

Hive split on special characters (escape in split())

Hive's split() function is more of a regex extractor than a purely SQL-like split() function, so sometimes (as is common in Hive) something happens that is not completely obvious, like splitting on a period (or another special character).

Example:

I have a table called weblogs, with a column of IP addresses that look like this:

-- 192.168.1.1
select split(client_ip, '.') from weblogs

returns: NULL

Weird.

But not too weird: under the hood, split() is implemented in Java and runs a regular expression, and the '.' character has special meaning there. What's the solution?

Escape it:

select split(client_ip, '\\.') from weblogs limit 1

returns the following array

[192,168,1,1]
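The same pitfall is easy to reproduce outside of Hive. Python's re.split(), for example, also treats the delimiter as a regular expression (shown here only as an illustration of the regex behavior, not of Hive's internals):

import re

ip = "192.168.1.1"
print(re.split('.', ip))    # '.' matches every character -> all empty strings
print(re.split('\\.', ip))  # escaped -> ['192', '168', '1', '1']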

Run hive from bash script & loop through file

You're given the task of running counts on all of the tables in Hive.

Not a completely real-world example, but we are currently in UAT with our Hadoop platform, and I need to make sure that aggregates of our clickstream data (in Hive) match Omniture (our web analytics tool).

We need to prepare the query we are going to run and save it to a file called 'counts.hql'. Notice the formatting of the parameter:

select count(*) from ${hiveconf:tablename}

We can run this query from the command line, manually passing in a parameter, with the following command:

hive -hiveconf tablename=om -f counts.hql

With a little bash we can loop through all our tables, appending the results to a file. First, get a list of the tables:

hive -e "show tables;" > hivetables.txt

Now, the cool stuff:

for line in $(cat hivetables.txt) ;
do
     results=$(hive -hiveconf tablename=$line -f counts.hql)
     echo $results >> tablecounts.txt
done

Hive Explode Multiple Arrays

I recently ran into a problem where I needed to explode a table with multiple arrays, specifically where each array's index positions matched.

Original question on StackOverflow:

I have a Hive table with the following schema:

COOKIE  | PRODUCT_ID | CAT_ID     | QTY
1234123 | [1,2,3]    | [r,t,null] | [2,1,null]

How can I normalize the arrays so I get the following result?

COOKIE  | PRODUCT_ID | CAT_ID | QTY
1234123 | [1]        | [r]    | [2]
1234123 | [2]        | [t]    | [1]
1234123 | [3]        | null   | null
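Conceptually, the fix is to walk all of the arrays with one shared index, padding the shorter ones with nulls. Here is that idea in plain Python, just to show the logic (this is not the Hive UDF):

from itertools import zip_longest

cookie = "1234123"
product_id = [1, 2, 3]
cat_id = ["r", "t"]  # shorter on purpose, like the nulls above
qty = [2, 1]

# one output row per array index; zip_longest pads missing values with None
for pid, cat, q in zip_longest(product_id, cat_id, qty):
    print(cookie, pid, cat, q)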

I was pointed to the following blog: http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

And it perfectly solved my problem.

Check out their entire UDF package, called Brickhouse, here: https://github.com/klout/brickhouse