Monday, April 21, 2014

How to Get a 35% Discount on Airfare (Thank you, Hadoop!)

As a data geek and travel addict, I quite often find myself checking ticket prices and running MapReduce jobs for fun. (Hehehe)

Here is one very interesting discovery!

Here is the availability for afternoon flights out of San Jose (SJC) to Los Angeles (LAX):
AA 2580 is available for $201.00 USD, 3:45 pm departure and 5:00 pm arrival.
Looks normal, right? It is a reasonable price for a last-minute ticket on a busy route.

Let's try to book a ticket a bit farther, to San Diego for example.
Here is what American Airlines can offer me.

AA 2580: 3:35 pm departure, 5:00 pm arrival into LAX
AA 2786: 6:35 pm departure, 7:20 pm arrival into SAN

Looks normal, right? A reasonable price for a last-minute ticket to a smaller city.
Except AA 2580 is the same flight on both reservations, and the second one is $50 (40%) cheaper.

I could not help but wonder why I would not book the cheaper itinerary and just walk out in Los Angeles.


Cheers!  



Wednesday, May 16, 2012

NoSQL and Big Data Vlab event at Stanford

 

The Vlab at Stanford put together another great event on NoSQL and Big Data.

Here is the description from their website.

Event Description 
  
0 to 50 million users in 50 days? Disruptive scaling is painless with NoSQL.
Even enterprises have taken notice. They are using the same technology that propelled Google, Facebook and Amazon to massive success by analyzing petabytes of data from sensors, social media and commerce demographics.
NoSQL ("Not only SQL") enables queries and analytics against "unstructured" data, which allows for rapid prototyping and agile development of business intelligence applications. This is in stark contrast to the case where IT has to approve, fund, and develop every change to their "structured" database applications, and where scaling up requires a lengthy budgeting process.
Imagine: If power companies had real-time analytics from all the log files in the grid to improve response time in emergencies. Or if the sales team had insightful analytics about trouble tickets or other call center issues... before they got an escalation from their customer's executives.
This ‘Big Data’ market is projected to reach $53 billion in five years, and NoSQL is open source. How can startups cash in, and how will incumbents respond? Big Data and Open Source Software powered the massive disruption we call Web 2.0, and they continue to power most of the big-name IPOs of 2012. We are just getting started.
  • James Phillips - Director, Co-Founder & SVP of Products at Couchbase
  • Max Schireson - President at 10gen, the company that develops and supports MongoDB
  • Doug Cutting - Chairman of the Board of Directors at Apache Software Foundation, Creator of Hadoop
  • Andrew Mendelsohn, Senior Vice President of Database Server Technologies at Oracle
  • Ravi Mohan, Managing Director of Shasta Ventures
With Oracle on the panel, the event got quite controversial.



To learn more about Vlab, go to http://www.vlab.org/


Also, we are hiring Big Data developers!


Let me know if you want to join us at any of the big data events.
 



Friday, April 13, 2012

LADY GAGA HADOOP NODE ISSUE

The Hadoop User Group in San Francisco is always a great place to learn about new technologies and meet super interesting people. The last meetup was held at the Twitter office. Alex, Dave, and Egor joined me this time. The Twitter break room is very impressive; it feels like an upscale grocery store with organic soda and healthy snacks. Even though I am a "foodie", I had not seen some of the soda brands they had there. Yay! Go Twitter!

When the sessions started, I joined the group about Mahout and its application to product recommendations. Cloudera and WibiData are looking into Mahout implementations. This is not the first time I have heard interest from customers about Mahout; it definitely gets a lot of attention.

The second session, from Richard at Salesforce, was about the new HBase release. They have implemented quite a few improvements to performance and overall deployment. Another big project to look into.

Alex was telling me about his session on computing Twitter "reach". The basic problem at Twitter is "Lady Gaga": every time she tweets, some of the Twitter servers go down because she has a billion followers!
The guys call it the "Lady Gaga problem". I laughed for 20 minutes!!!!




We had a little celebration afterwards!!!



Wednesday, April 4, 2012

Common Hadoop Troubleshooting Tips


1. One of the common problems with a new installation is a datanode connecting to a different namenode (or to one that has since been reformatted). The datanode shuts down with an Incompatible namespaceIDs error:

2012-04-04 18:50:38,863 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /var/lib/hadoop-0.20/cache/hdfs/dfs/data: namenode namespaceID = 1635219806; datanode namespaceID = 976537351
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:238)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:153)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:410)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:305)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1627)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1567)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1585)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1711)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1728)

2012-04-04 18:50:38,864 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:


Solution:
Stop the datanode, delete its old data directory, and start it again so it registers with the namenode using the current namespaceID.
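
For example, with the CDH3 layout from the log above, the cleanup looks roughly like this (a sketch: the data directory path and init script names may differ on your installation, and rm -rf throws away any blocks stored on that datanode):

sudo /etc/init.d/hadoop-0.20-datanode stop
sudo rm -rf /var/lib/hadoop-0.20/cache/hdfs/dfs/data
sudo /etc/init.d/hadoop-0.20-datanode start

On the next start the datanode recreates an empty data directory and picks up the namenode's namespaceID during the handshake.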



Wednesday, March 7, 2012

Hadoop Apache Avro Data Serialization


Merced Systems, Inc., the company I work at, hosted a great meetup with Scott Carey as a guest speaker.

http://www.meetup.com/MRDesignPattern/

We went through the general concepts and some tips and tricks about Avro.

Here is the general outline of the meeting:

What is Avro?




Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes.
Avro provides a good way to bring unstructured and semi-structured data into a structured form using schemas.




The Avro schema used to write data is required to be available when reading it.
● Fields are not tagged
  ○ More compact
  ○ Potentially faster
● Code generation is optional
  ○ Simple implementations can read and write data
  ○ Dynamic, discoverable RPC is also possible (but not implemented)
● Schema storage, explicitly or by reference, is required

The compression is awesome!!!


    class Card {
      int number; // ace = 1, king = 13
      Suit suit;
    }

    enum Suit {
      SPADE, HEART, DIAMOND, CLUB;
    }
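
    For reference, the Avro schema for this record would be written in JSON along these lines (a sketch; the names simply mirror the Java class above):

    {
      "type": "record",
      "name": "Card",
      "fields": [
        {"name": "number", "type": "int"},
        {"name": "suit", "type": {"type": "enum",
                                  "name": "Suit",
                                  "symbols": ["SPADE", "HEART", "DIAMOND", "CLUB"]}}
      ]
    }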
    
    Java heap: 24 bytes (32-bit JVM) to 32 bytes
    Avro binary: 2 bytes

    Card card = new Card();
    card.number = 1;
    card.suit = Suit.SPADE;
    
    Avro binary: 0x02 0x00
    First byte: the integer 1, encoded
    Second byte: the ordinal of the Suit enum (0), encoded
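
    If you want to verify those two bytes yourself, here is a minimal sketch using Avro's reflect API against the Card class above (assuming an Avro 1.6-era classpath; the exact bytes also assume the field order shown in the class):

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.avro.reflect.ReflectDatumWriter;

    public class CardEncodingDemo {
      public static void main(String[] args) throws Exception {
        // Derive the Avro schema for Card via reflection
        Schema schema = ReflectData.get().getSchema(Card.class);

        Card card = new Card();
        card.number = 1;
        card.suit = Suit.SPADE;

        // Serialize the object with the binary encoder
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new ReflectDatumWriter<Card>(schema).write(card, encoder);
        encoder.flush();

        // Expect two bytes: 0x02 (zig-zag encoded int 1) and 0x00 (enum ordinal 0)
        for (byte b : out.toByteArray()) {
          System.out.printf("0x%02x ", b);
        }
        System.out.println();
      }
    }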


    Go Avro!!!!
     

Wednesday, November 23, 2011

Hadoop Installation

Hadoop has traditionally been a royal pain to set up and configure properly. With Cloudera's recent distribution releases this process has gotten simpler, but it is still a far cry from straightforward. We'll try to, if not simplify it, at least document it thoroughly, so you can follow clear, step-by-step instructions to get your first Hadoop cluster up and running locally. Let's dive in!

Prerequisites

This tutorial requires the following two hefty installers downloaded to your workstation:
  1. Oracle VirtualBox, in order to run virtual machine images (VMs) on your machine. You can get it from the VirtualBox download page.
  2. An Ubuntu 10 image that will house our Hadoop installation, available from the Ubuntu downloads page. NOTE: as of this writing, Cloudera's Hadoop distribution was not compatible with Ubuntu 11. Just pick version 10.04 LTS from the downloads drop-down menu to avoid any issues with your installation.

Install VirtualBox

  1. Download the installation package for your operating system (Windows or Mac OS X recommended).
  2. Close all applications and run the installation package, following the on-screen instructions.
    NOTE: The current tested version is 4.0.8 (08/05/2011).

Install Ubuntu 10 Image

  1. Download Ubuntu OS Version 10.04 LTS.
  2. Start VirtualBox from the applications menu.
  3. Click on the New button to create a new virtual machine and click Continue.
  4. Provide a name for your VM and select Linux and Ubuntu in the OS options.
  5. Keep the rest of the settings as defaults and continue with the instructions.
  6. Start the VM after it has been created by selecting it in the left pane and clicking the Start button.
  7. Select the downloaded Ubuntu installation package as the installation media.
  8. Proceed with default settings during the installation.
     NOTE: the user hadoop is reserved and should not be selected as your user.
  9. Restart your VM OS after the installation has completed.

Install Java JDK and Hadoop

1. Open a new terminal by going to Applications => Accessories => Terminal.
2. Check the release version of Ubuntu by running the following command:

lsb_release -c






The expected output should be "lucid".
3. Inside the terminal, create an empty file /etc/apt/sources.list.d/cloudera.list by running the following command:

sudo vi /etc/apt/sources.list.d/cloudera.list


4. Paste the following two lines into the file:

deb http://archive.cloudera.com/debian lucid-cdh3 contrib 
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib

 5. Run the following commands in the terminal window:

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo apt-get install hadoop-0.20
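
If everything installed cleanly, a quick sanity check (assuming the Sun JDK registered itself as the default java) is:

java -version
hadoop version

Both should print version information without errors.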
 
6. Install Hadoop components:

sudo apt-get install hadoop-0.20-namenode
sudo apt-get install hadoop-0.20-datanode
sudo apt-get install hadoop-0.20-jobtracker
sudo apt-get install hadoop-0.20-tasktracker 
 
7. Install the configuration for a pseudo-distributed cluster:
 
sudo apt-get install hadoop-0.20-conf-pseudo

8. Start services by running the following command in the terminal window:
 
for x in /etc/init.d/hadoop-* ; do sudo $x start; done 
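
To double-check that the daemons you installed (namenode, datanode, jobtracker and tasktracker) actually came up, you can list the running Java processes. jps ships with the Sun JDK, and sudo is needed because the daemons run as the hdfs and mapred users:

sudo jps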
 
9. Check your installation by opening the following links in your internet browser:
 
http://localhost:50070 (NameNode web UI)
http://localhost:50030 (JobTracker web UI)
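
Finally, as a quick smoke test, list the HDFS root and run one of the bundled example jobs from the terminal (the examples jar name below assumes the CDH3 symlink; adjust it if your release ships a differently named jar):

hadoop fs -ls /
hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 1000

If the pi job finishes and prints an estimate of Pi, your pseudo-distributed cluster is working end to end.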