All Things Data: March 2012

Merced Systems, Inc., the company I work at, hosted a great meetup with Scott Carey as the guest speaker.

http://www.meetup.com/MRDesignPattern/

We went through the general concepts of Avro, along with some tips and tricks.

Here is the general outline of the meeting:

What is Avro?

•Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes.

•Avro provides a good way to convert unstructured and semi-structured data into structured data using schemas.

The Avro schema used to write data is required to be available when reading it.
● Fields are not tagged

○ More compact
○ Potentially faster
● Code generation is optional.

○ Simple implementations can read and write data
○ Dynamic, discoverable RPC is also possible (but not implemented)
● The schema must be stored alongside the data, either explicitly or by reference.
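
To make the "fields are not tagged" point concrete, here is a minimal sketch in plain Java (no Avro library). It assumes a hypothetical record with two int fields, `x` and `y`, and uses an ordered array of field names as a stand-in for the writer's schema. The class and method names are made up for illustration. Avro writes ints as zigzag varints, so the reader below decodes that format by hand. The bytes carry no field names or tags, so the same two bytes mean different things depending on the field order the reader is given.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a hand-rolled reader, not the Avro library's API.
class UntaggedRecordSketch {

    // Reads one zigzag varint: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static int readInt(ByteArrayInputStream in) {
        int raw = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new IllegalStateException("unexpected end of data");
            }
            raw |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        // Undo zigzag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Decodes a record of int fields. The field order comes only from
    // fieldNames (our stand-in for the writer schema), never from the bytes.
    static Map<String, Integer> decode(byte[] data, String[] fieldNames) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        Map<String, Integer> record = new LinkedHashMap<>();
        for (String name : fieldNames) {
            record.put(name, readInt(in));
        }
        return record;
    }

    public static void main(String[] args) {
        byte[] data = {0x02, 0x00};
        // Same bytes, two different assumed field orders.
        System.out.println(decode(data, new String[] {"x", "y"})); // {x=1, y=0}
        System.out.println(decode(data, new String[] {"y", "x"})); // {y=1, x=0}
    }
}
```

With the wrong field order the data still decodes without error, it just decodes to the wrong record. That is why the writer's schema has to travel with the data or be retrievable by reference.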

The binary encoding is awesomely compact!!!
```
class Card {
  int number; // ace = 1, king = 13
  Suit suit;
}

enum Suit {
  SPADE, HEART, DIAMOND, CLUB;
}
```
Java heap: 24 bytes (32-bit JVM) to 32 bytes
Avro binary: 2 bytes
```
Card card = new Card();
card.number = 1;
card.suit = Suit.SPADE;
```
Avro binary: 0x02 0x00
First byte: the integer 1, zigzag-encoded as 2 (0x02)
Second byte: the ordinal of the Suit enum (0), encoded as 0x00
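
For anyone who wants to check those two bytes by hand, here is a small sketch in plain Java that mimics how Avro's binary format writes an int (zigzag, then a variable-length encoding) and an enum (its zero-based position, written as an int). It does not use the Avro library, and the class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

// Illustration only: a hand-rolled encoder, not the Avro library's API.
class AvroCardEncodingSketch {

    enum Suit { SPADE, HEART, DIAMOND, CLUB }

    // Zigzag maps small signed values to small unsigned ones:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Writes an int as a zigzag varint: 7 data bits per byte, low-order
    // group first, high bit set on every byte except the last.
    static void writeInt(ByteArrayOutputStream out, int n) {
        int z = zigZag(n);
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80);
            z >>>= 7;
        }
        out.write(z);
    }

    // A record is just its fields in schema order, with no tags or names.
    static byte[] encodeCard(int number, Suit suit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(out, number);
        writeInt(out, suit.ordinal());
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeCard(1, Suit.SPADE))); // 0x02 0x00
        System.out.println(toHex(encodeCard(13, Suit.CLUB))); // 0x1A 0x06
    }
}
```

Every card in the deck fits in two bytes this way, since both values stay well below the 64 that a single zigzag varint byte can hold.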
Go Avro!!!!

Wednesday, March 7, 2012

Hadoop Apache Avro Data Serialization