Wednesday, March 7, 2012

Hadoop Apache Avro Data Serialization


Merced Systems, Inc., the company I work at, hosted a great meetup with Scott Carey as guest speaker.

http://www.meetup.com/MRDesignPattern/

We went through the general concepts and some tips and tricks for Avro.

Here is the general outline of the meeting:

What is Avro?




Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide a serialization format for persistent data.
Avro provides a good way to convert unstructured and semi-structured data into structured data using schemas.




The Avro schema used to write data is required to be available when reading it.
● Fields are not tagged
  ○ More compact
  ○ Potentially faster
● Code generation is optional
  ○ Simple implementations can read and write data
  ○ Dynamic, discoverable RPC is also possible (but not implemented)
● Schema storage is required, either explicitly or by reference
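
To see why untagged fields mean the reader needs the writer's schema, here is a toy sketch in Java. It is not Avro's actual encoding or API: the class `UntaggedRecordSketch`, the field list `CARD_FIELDS`, and the one-byte-per-value format are all made up for illustration, and values are assumed to fit in a single byte.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UntaggedRecordSketch {
    // Hypothetical stand-in for a schema: only the field names, in writer order.
    static final List<String> CARD_FIELDS = Arrays.asList("number", "suit");

    // Writes the values in schema order. No field names or tags go into the output,
    // so the result is exactly one byte per field in this toy format.
    static byte[] write(List<String> schema, Map<String, Integer> record) {
        byte[] out = new byte[schema.size()];
        for (int i = 0; i < schema.size(); i++) {
            out[i] = record.get(schema.get(i)).byteValue();
        }
        return out;
    }

    // Reads the values back. The bytes carry no names, so the reader relies on
    // the same field list to know which value belongs to which field.
    static Map<String, Integer> read(List<String> schema, byte[] data) {
        Map<String, Integer> record = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            record.put(schema.get(i), (int) data[i]);
        }
        return record;
    }
}
```

Writing a card with `number = 1` and `suit = 0` against `CARD_FIELDS` yields two bytes, and `read` recovers the same map only because it is handed the same field list. Without that list the two bytes are just numbers with no meaning attached.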


The compact encoding is awesome!!!


    class Card {
      int number; // ace = 1, king = 13
      Suit suit;
    }

    enum Suit {
      SPADE, HEART, DIAMOND, CLUB;
    }
    
Java heap: 24 bytes (32 bit JVM) to 32 bytes
Avro binary: 2 bytes

    Card card = new Card();
    card.number = 1;
    card.suit = Suit.SPADE;
    
Avro binary: 0x02 0x00
First byte: the integer 1, zigzag-encoded as 2
Second byte: the ordinal of the Suit enum (0), zigzag-encoded as 0
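
The two bytes above can be reproduced by hand. The sketch below is a minimal, standard-library-only take on how Avro's binary encoding writes an int (zigzag, then variable-length bytes) and an enum (its zero-based position written as an int). The class `AvroCardEncodingSketch` and its method names are made up for this post; in real code you would use the Avro library's encoders rather than rolling your own.

```java
import java.io.ByteArrayOutputStream;

class AvroCardEncodingSketch {
    enum Suit { SPADE, HEART, DIAMOND, CLUB }

    // Zigzag maps signed ints to unsigned ones so that small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Variable-length encoding: 7 bits per byte, lowest bits first.
    // A set high bit means another byte follows.
    static void writeInt(int n, ByteArrayOutputStream out) {
        int v = zigZag(n);
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // A record is its fields written back to back in schema order, with no tags.
    static byte[] encodeCard(int number, Suit suit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(number, out);          // the card number as an int
        writeInt(suit.ordinal(), out);  // the enum as its zero-based position
        return out.toByteArray();
    }
}
```

Calling `encodeCard(1, Suit.SPADE)` gives the bytes 0x02 0x00 shown above, and the king of clubs, `encodeCard(13, Suit.CLUB)`, comes out as 0x1A 0x06. Values only need a second byte once the zigzag result exceeds 127.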


    Go Avro!!!!