Wednesday, March 7, 2012

Hadoop Apache Avro Data Serialization

Merced Systems, Inc., the company I work at, hosted a great meetup with Scott Carey as guest speaker.

We went through the general concepts of Avro, along with some tips and tricks.

Here is the general outline of the meeting:

What is Avro?

Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes.
Avro also provides a good way to convert unstructured and semi-structured data into structured data using schemas.

The Avro schema used to write data is required to be available when reading it.
● Fields are not tagged
  ○ More compact
  ○ Potentially faster
● Code generation is optional
  ○ Simple implementations can read and write data
  ○ Dynamic, discoverable RPC is also possible (but not implemented)
● Schema storage, explicitly or by reference, is required
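
To make the "fields are not tagged" point concrete, here is a minimal sketch in plain Java of why the writer's schema has to be on hand when reading. It does not use the Avro library: the class UntaggedReadSketch and its methods are hypothetical names, and the "schema" is reduced to an ordered list of field names whose values are all read as Avro ints (zigzag, variable-length). The bytes carry no field names or tags, so the reader can only make sense of them by walking the fields in the order the writer declared them.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; this is not the Avro API.
class UntaggedReadSketch {

    // Reads one int in Avro's binary form: a variable-length integer
    // (7 bits per byte, high bit = "more bytes follow"), zigzag-mapped.
    static int readInt(ByteBuffer in) {
        int z = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            z |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1); // undo the zigzag mapping
    }

    // The data holds values only, so the field names come entirely
    // from the (simplified) writer schema passed in here.
    static Map<String, Integer> readRecord(byte[] data, List<String> writerFields) {
        ByteBuffer in = ByteBuffer.wrap(data);
        Map<String, Integer> record = new LinkedHashMap<>();
        for (String field : writerFields) {
            record.put(field, readInt(in));
        }
        return record;
    }

    public static void main(String[] args) {
        byte[] data = {0x02, 0x00};
        // Read with the field order the writer used.
        System.out.println(readRecord(data, Arrays.asList("number", "suit")));
        // Read with a wrong field order: same bytes, different meaning.
        System.out.println(readRecord(data, Arrays.asList("suit", "number")));
    }
}
```

The second call in main shows the trade-off: nothing in the bytes says which value belongs to which field, so a reader with the wrong schema silently misreads the data.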

The compact encoding is awesome!!!

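    class Card {
      int number; // ace = 1, king = 13
      Suit suit;
    }

    // SPADE comes first so its ordinal is 0; the order of the other suits is assumed.
    enum Suit { SPADE, HEART, DIAMOND, CLUB }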

Java heap: 24 bytes (32-bit JVM) to 32 bytes
Avro binary: 2 bytes

    Card card = new Card();
    card.number = 1;
    card.suit = Suit.SPADE;
Avro binary: 0x02 0x00
First byte: the integer 1, zigzag-encoded as 0x02
Second byte: the ordinal of the Suit enum (0), encoded as 0x00
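
As a rough check of those two bytes, here is a small sketch in plain Java that reproduces the encoding by hand. It does not use the Avro library: CardEncodingSketch, writeInt and encodeCard are hypothetical names, and the order of the suits after SPADE is an assumption. It relies on three rules from the Avro binary encoding: an int is written as a zigzag-mapped variable-length integer, an enum is written as the int position of its symbol, and a record is just its fields written one after another in declared order.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; this is not the Avro API.
class CardEncodingSketch {

    // SPADE is first so that its ordinal is 0; the rest of the order is assumed.
    enum Suit { SPADE, HEART, DIAMOND, CLUB }

    // Writes an int the way Avro's binary encoding does:
    // zigzag-map it, then emit 7 bits per byte, low bits first.
    static void writeInt(int n, ByteArrayOutputStream out) {
        int z = (n << 1) ^ (n >> 31); // zigzag: 0, -1, 1, -2 -> 0, 1, 2, 3
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write(z);
    }

    // A record is its fields in declared order, with no names or tags.
    static byte[] encodeCard(int number, Suit suit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(number, out);         // field 1: number
        writeInt(suit.ordinal(), out); // field 2: enum position
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeCard(1, Suit.SPADE)) {
            System.out.printf("0x%02X ", b); // ace of spades: 0x02 0x00
        }
        System.out.println();
    }
}
```

Every card in the deck fits in two bytes under this scheme, since both the number (1 to 13) and the suit position (0 to 3) zigzag to values below 128.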

Go Avro!!!!

