Wednesday, May 16, 2012

NoSQL and Big Data Vlab event at Stanford

 

The Vlab at Stanford put together another great event on NoSQL and Big Data.

Here is the description from their website.

Event Description 
  
0 to 50 million users in 50 days? Disruptive scaling is painless with NoSQL.
Even enterprises have taken notice. They are using the same technology that propelled Google, Facebook and Amazon to massive success by analyzing petabytes of data from sensors, social media and commerce demographics.
NoSQL ("Not only SQL") enables queries and analytics against "unstructured" data, which allows for rapid prototyping and agile development of business intelligence applications. This is in stark contrast to the case where IT has to approve, fund, and develop every change to their "structured" database applications, and where scaling up requires a lengthy budgeting process.
Imagine: If power companies had real-time analytics from all the log files in the grid to improve response time in emergencies. Or if the sales team had insightful analytics about trouble tickets or other call center issues... before they got an escalation from their customer's executives.
This ‘Big Data’ market is projected to be worth $53 billion in five years, and NoSQL is open source. How can startups cash in, and how will incumbents respond? Big Data and open-source software powered the massive disruption we call Web 2.0, and they continue to power most of the big-name IPOs of 2012. We are just getting started.
  • James Phillips - Director, Co-Founder & SVP of Products at Couchbase
  • Max Schireson - President at 10gen, the company that develops and supports MongoDB
  • Doug Cutting - Chairman of the Board of Directors at Apache Software Foundation, Creator of Hadoop
  • Andrew Mendelsohn - Senior Vice President of Database Server Technologies at Oracle
  • Ravi Mohan - Managing Director of Shasta Ventures
With Oracle on the panel, the event was quite controversial.



To learn more about vlab, go to http://www.vlab.org/


Also, we are hiring Big Data developers!


Let me know if you want to join us at any of the big data events.
 



Friday, April 13, 2012

LADY GAGA HADOOP NODE ISSUE

The Hadoop User Group in San Francisco is always a great place to learn about new technologies and meet super interesting people. The last meetup was held at the Twitter office. Alex, Dave, and Egor joined me this time. Twitter's break room is very impressive: it feels like an upscale grocery store with organic soda and healthy snacks. Even though I am a "foodie", I had not seen some of the soda brands they had there. Yay! Go Twitter!

When the sessions started, I joined the group discussing Mahout and its implications for product recommendations. Cloudera and WibiData are looking into Mahout implementations. In general, this is not the first time I have heard interest from customers in Mahout. Mahout definitely gets a lot of attention.

The second session was about the new HBase release, presented by Richard from Salesforce. They have implemented quite a few improvements to performance and overall deployment. Another big project to look into.

Alex was telling me about the session on Twitter "reach". Twitter's basic problem is "Lady Gaga": every time she tweets, some of the Twitter servers go down because she has a billion followers!
The guys call it the "Lady Gaga problem". I laughed for 20 minutes!!!!




We had a little celebration afterwards!!!



Wednesday, April 4, 2012

Common Hadoop Troubleshooting Tips


1. One of the most common problems with a new installation is connecting a datanode to a different namenode than the one it was originally formatted against, which shows up as mismatched namespace IDs:

2012-04-04 18:50:38,863 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /var/lib/hadoop-0.20/cache/hdfs/dfs/data: namenode namespaceID = 1635219806; datanode namespaceID = 976537351
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:238)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:153)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:410)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:305)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1627)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1567)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1585)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1711)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1728)

2012-04-04 18:50:38,864 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:


Solution:
Stop the datanode, delete the old data directory it complains about (here /var/lib/hadoop-0.20/cache/hdfs/dfs/data), and start the datanode again so it re-registers with the namenode and picks up the current namespaceID. Keep in mind that this wipes the blocks stored on that datanode.



Wednesday, March 7, 2012

Hadoop Apache Avro Data Serialization


Merced Systems, Inc., the company I work at, hosted a great meetup with Scott Carey as a guest speaker.

http://www.meetup.com/MRDesignPattern/

We went through the general concepts and tips/tricks about Avro.

Here is the general outline of the meeting:

What is Avro?




Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes.
Avro provides a good way to turn unstructured and semi-structured data into structured data using schemas.
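
As a tiny illustration (my own sketch, not from the meetup slides), here is how such a schema can be declared and parsed in Java, assuming the Avro library is on the classpath; the nullable "email" field shows one way to accommodate semi-structured records where a value may be missing:

    import org.apache.avro.Schema;

    public class UserSchemaDemo {
      public static void main(String[] args) {
        // A record schema with an optional field: "email" is a union of null and
        // string, so records that lack an email still conform to the schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // The parsed schema is a runtime object you can inspect or ship with the data.
        System.out.println(schema.toString(true));
      }
    }

The parsed Schema object is what gets handed to Avro's readers and writers.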




The Avro schema used to write data is required to be available when reading it.
● Fields are not tagged
  ○ More compact
  ○ Potentially faster
● Code generation is optional
  ○ Simple implementations can read and write data
  ○ Dynamic, discoverable RPC is also possible (but not implemented)
● Schema storage, explicit or by reference, is required


    The compression is awesome!!!


    class Card {
      int number; // ace = 1, king = 13
      Suit suit;
    }

    enum Suit {
      SPADE, HEART, DIAMOND, CLUB;
    }

    Java heap: 24 bytes (32-bit JVM) to 32 bytes
    Avro binary: 2 bytes

    Card card = new Card();
    card.number = 1;
    card.suit = Suit.SPADE;
    
    Avro binary: 0x02 0x00
    First byte: the integer 1, zig-zag varint encoded as 0x02
    Second byte: the ordinal of the Suit enum (SPADE = 0), encoded as 0x00
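
For anyone who wants to reproduce those two bytes, here is a minimal sketch (my own, not from the talk) using Avro's generic API, assuming Avro 1.7.x on the classpath; the inline schema mirrors the Card/Suit classes above:

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class CardAvroDemo {
      public static void main(String[] args) throws Exception {
        // Schema mirroring the Card/Suit classes: an int plus a four-symbol enum.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Card\",\"fields\":["
          + "{\"name\":\"number\",\"type\":\"int\"},"
          + "{\"name\":\"suit\",\"type\":{\"type\":\"enum\",\"name\":\"Suit\","
          + "\"symbols\":[\"SPADE\",\"HEART\",\"DIAMOND\",\"CLUB\"]}}]}");

        // Build the ace of spades with the generic API - no generated classes needed.
        GenericRecord card = new GenericData.Record(schema);
        card.put("number", 1);
        card.put("suit", new GenericData.EnumSymbol(schema.getField("suit").schema(), "SPADE"));

        // Write it with the binary encoder: values only, no field tags.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(card, encoder);
        encoder.flush();

        // Prints "02 00": zig-zag encoded 1, then the SPADE ordinal 0.
        for (byte b : out.toByteArray()) {
          System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
      }
    }

Because no field names or tags are written, the payload is just the two encoded values; the schema itself has to be stored explicitly or referenced separately, as noted above.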


    Go Avro!!!!