Part of the Map of Streaming Data Systems series

What is Apache Avro?

Apache Avro is a compact, efficient data serialization format. It is used wherever structured data has to be stored or transmitted between systems.

Avro is an Apache Software Foundation project. It is used by many other projects, such as Apache Hadoop, and by tools like Hive and Pig.

Avro is probably most widely used together with Apache Kafka. It was the first data format supported by the Confluent Schema Registry, which now also supports JSON Schema and Protobuf.

Before we dive into Avro, let’s quickly review the evolution of data formats.

CSV - Comma-Separated Values

CSV is a common data format used in many applications. It is human-readable, easy to write, and relatively easy to parse. However, it is not very efficient.

Here is a simple example of a CSV file:

id,name,age,email,phone,is_active
1,Jane,25,jane@example.com,555-555-5555,true
2,John,30,john@example.com,,false
3,Doe,twenty,,111-111-1111,
4,Smith,40,,222-222-2222,'true'

As you can see, we have six columns: id, name, age, email, phone, and is_active.

However, nothing restricts the data types of the columns, and nothing prevents empty values. Row 3 has the age "twenty" and no is_active value, and row 4 wraps its is_active value in quotes. In other words, a column can contain anything, and the file itself does not tell you what type each column is supposed to have.
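To make this concrete, here is a small sketch that reads the sample above with Python's standard csv module. It only illustrates the point about missing types; the variable names are made up for this example.

```python
import csv
import io

# The CSV sample from above, held in memory so the sketch is self-contained.
CSV_TEXT = """id,name,age,email,phone,is_active
1,Jane,25,jane@example.com,555-555-5555,true
2,John,30,john@example.com,,false
3,Doe,twenty,,111-111-1111,
4,Smith,40,,222-222-2222,'true'
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Every value comes back as a string; CSV itself carries no type information.
all_strings = all(isinstance(v, str) for row in rows for v in row.values())

# Converting "age" to an integer only fails at read time, row by row.
bad_ages = []
for row in rows:
    try:
        int(row["age"])
    except ValueError:
        bad_ages.append((row["id"], row["age"]))

print(all_strings)
print(bad_ages)
```

The reader, not the file, has to decide what each column means, and bad values such as "twenty" only surface when someone tries to convert them.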

CSV - Advantages

The main advantages of CSV are:

  • It is human-readable.
  • It is easy to read and write.
  • It is also relatively easy to parse and analyze.

CSV - Disadvantages

The main disadvantages of CSV are:

  • It is not efficient: it is optimized for human readability rather than for compactness or performance.
  • The data types of the columns are not recorded anywhere, so the reader has to guess or be told what type each column holds.
  • Parsing CSV is a bit more complicated than parsing JSON, especially when your data contains commas and fields have to be quoted.
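The comma problem can be sketched in a few lines. The row below is hypothetical (it is not part of the sample file) and has a name that contains a comma, quoted the way CSV writers usually do it.

```python
import csv
import io

# A hypothetical row whose "name" field contains a comma, quoted as CSV expects.
line = '5,"Doe, Jane",35,jd@example.com,,true'

# Naive splitting breaks the quoted field into two pieces.
naive = line.split(",")

# The csv module understands the quoting rules and keeps the field intact.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive[1])
print(len(parsed), parsed[1])
```

Splitting on commas yields seven pieces instead of six columns, which is why a real CSV parser is needed even for simple files.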

Relational Tables

Unlike CSV, relational tables are not human-readable but they have data types and are optimized for performance.

When creating a relational table, you can specify the data types of the columns. Here is an example of a relational table create statement:

CREATE TABLE users (
  id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  age INTEGER NOT NULL,
  email VARCHAR(255) NOT NULL,
  phone VARCHAR(255) NOT NULL,
  is_active BOOLEAN NOT NULL,
  PRIMARY KEY (id)
);

Again, we have the same six columns, but unlike CSV, we have specified the data type of each column. The database would reject any data that does not match the data type of the column or that violates a constraint such as NOT NULL.

Unlike CSV files, relational tables have named columns. This means that you do not need to rely on the order of the columns; you can refer to them by name instead.
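As a rough illustration, the sketch below creates the table in an in-memory SQLite database using Python's standard sqlite3 module. One assumption to note: SQLite is lenient about column types by default, so the sketch adds a CHECK constraint on age (not part of the statement above) to stand in for the strict typing that other engines apply.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
      id INTEGER NOT NULL,
      name VARCHAR(255) NOT NULL,
      -- Assumption: this CHECK is added for SQLite only, which would
      -- otherwise store the text 'twenty' in an INTEGER column.
      age INTEGER NOT NULL CHECK (typeof(age) = 'integer'),
      email VARCHAR(255) NOT NULL,
      phone VARCHAR(255) NOT NULL,
      is_active BOOLEAN NOT NULL,
      PRIMARY KEY (id)
    )
    """
)

insert = (
    "INSERT INTO users (id, name, age, email, phone, is_active) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)
conn.execute(insert, (1, "Jane", 25, "jane@example.com", "555-555-5555", True))

rejected = []
bad_rows = [
    (3, "Doe", "twenty", "", "111-111-1111", False),  # wrong type for age
    (4, "Smith", 40, None, "222-222-2222", True),     # NULL email
]
for row in bad_rows:
    try:
        conn.execute(insert, row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])

# Columns are addressed by name, so their order in the table does not matter.
name, age = conn.execute("SELECT name, age FROM users WHERE id = 1").fetchone()
print(rejected, name, age)
```

Both bad rows are refused at write time, which is the opposite of CSV, where the problem only shows up when the data is read.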

Relational Tables - Advantages

The main advantages of relational tables are:

  • The data is fully typed, and the database can reject any data that does not match the data type of the column.
  • They are optimized for performance.

Relational Tables - Disadvantages

The main disadvantages of relational tables are:

  • The data has to be flat: each row is a fixed set of columns, so nested structures typically have to be split across several tables.
  • As there are many database engines, you need the driver for the specific database engine in order to access the data.

JSON - JavaScript Object Notation

JSON is a human-readable data format that is easy to read, write, and parse. It is widely used in many applications and is probably the most popular data format.

JSON can be shared across the network over HTTP and other developers don’t need a specific database engine or driver to access the data as most programming languages have a built-in JSON parser.

An example of a JSON file:

[
    {
        "id": 1,
        "name": "Jane",
        "age": 25,
        "email": "jane@example.com",
        "phone": "555-555-5555",
        "is_active": true
    },
    {
        "id": 2,
        "name": "John",
        "age": 30,
        "email": "john@example.com",
        "phone": "",
        "is_active": false
    }
]
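Reading JSON really does take only the standard library in most languages. The Python sketch below parses two hypothetical records (shortened, and with deliberately inconsistent types in the second one) to show that the parser accepts both.

```python
import json

# Two hypothetical records; the second deliberately uses different types.
text = """
[
  {"id": 1, "name": "Jane", "age": 25, "is_active": true},
  {"id": "2", "name": "John", "age": "thirty", "is_active": "false"}
]
"""

users = json.loads(text)

# The parser accepts both records: JSON has basic types but no schema,
# so nothing ties the "age" key to a single type.
age_types = [type(user["age"]).__name__ for user in users]
print(age_types)
```

The document is valid JSON either way; whether "age" is a number or a string is left to whoever consumes it.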

JSON - Advantages

The main advantages of JSON are:

  • It is all text-based and therefore easy to read and write.
  • It is easy to parse and analyze as most languages have a built-in JSON parser.
  • It is widely used in many applications and can be shared over the network.

JSON - Disadvantages

The main disadvantages of JSON are:

  • JSON has only a few basic types (string, number, boolean, null) and no schema, so nothing guarantees what type a given field holds.
  • JSON is not optimized for size or performance: documents can get very large because the keys are repeated in every object.
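The cost of repeated keys is easy to estimate. The sketch below builds a hypothetical data set of 1,000 users with the same six fields and measures how much of the serialized output consists of key names.

```python
import json

# Hypothetical data set: 1,000 users with the same six fields as above.
users = [
    {
        "id": i,
        "name": "Jane",
        "age": 25,
        "email": "jane@example.com",
        "phone": "555-555-5555",
        "is_active": True,
    }
    for i in range(1000)
]

total_bytes = len(json.dumps(users).encode("utf-8"))

# Bytes spent on the quoted key names alone, repeated once per record.
key_bytes = sum(len(json.dumps(key)) for user in users for key in user)

print(f"{key_bytes / total_bytes:.0%} of the payload is key names")
```

With records this small, roughly a third of the payload is spent on restating the field names, and that share grows as the values get shorter.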

Avro - Apache Avro

Having reviewed the data formats, we can now dive into Avro.

Avro data is always described by a schema. The schema itself is a JSON object.

Conceptually, you can think of an Avro record as a JSON object with a schema attached to it, although the data itself is usually stored in a compact binary encoding.

An example of an Avro schema would be:

{
    "namespace": "com.example",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": "string"},
        {"name": "phone", "type": "string"},
        {"name": "is_active", "type": "boolean"}
    ]
}
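To show what the schema buys you, here is a minimal sketch that checks Python dictionaries against it. This is a hand-rolled validator for illustration only, not the Avro library's own validation; it handles just the three primitive types used in this schema, and the names PRIMITIVES and validate are made up for the example.

```python
import json

# The User schema from above, parsed as ordinary JSON.
SCHEMA = json.loads("""
{
    "namespace": "com.example",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": "string"},
        {"name": "phone", "type": "string"},
        {"name": "is_active", "type": "boolean"}
    ]
}
""")

# Python types accepted for the three primitive Avro types used here.
PRIMITIVES = {"int": int, "string": str, "boolean": bool}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so exclude it for "int".
        ok = isinstance(value, PRIMITIVES[avro_type]) and not (
            avro_type == "int" and isinstance(value, bool)
        )
        if not ok:
            problems.append(
                f"{name}: expected {avro_type}, got {type(value).__name__}"
            )
    return problems


good = {"id": 1, "name": "Jane", "age": 25, "email": "jane@example.com",
        "phone": "555-555-5555", "is_active": True}
bad = {"id": 3, "name": "Doe", "age": "twenty", "email": "",
       "phone": "111-111-1111", "is_active": None}

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

The first record passes, while the second is reported for its age and is_active values, the same rows that slipped through unnoticed in the CSV file. A real Avro library performs this kind of check when it serializes a record.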

Avro - Advantages

The main advantages of Avro are:

  • The data is fully typed and the data can be rejected if it does not match the data type of the field.
  • It is compact: the binary encoding does not repeat field names in every record, and Avro data files can additionally be compressed. This means that there is less data to store and to transfer over the network.
  • The schema travels with the data (an Avro data file stores it in its header), and therefore it is easy to share the data with other developers.
  • The schema is human-readable JSON, so it is easy to read and write. You can also embed documentation in the schema so that other developers can easily understand it.
  • The data can be read by a wide variety of programming languages.
  • The schema can change over time in a controlled way, for example by adding a new field with a default value.
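The last point, schema evolution, can be sketched as follows. This is a simplified, hand-written take on how a reader with a newer schema handles a record written with an older one; it is not the Avro library's API. READER_SCHEMA (trimmed to three fields) and resolve are names made up for this example.

```python
# Version 2 of a trimmed-down User schema. Compared with version 1,
# it adds a documented "is_active" field with a default value.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {
            "name": "is_active",
            "type": "boolean",
            "default": True,
            "doc": "Whether the account is enabled.",
        },
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field unknown to the writer: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields the reader does not know about are simply dropped.
    return resolved


old_record = {"id": 1, "name": "Jane"}  # written with version 1
print(resolve(old_record, READER_SCHEMA))
```

Old data stays readable because the new field has a default; adding a field without a default would make the old record fail to resolve.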

Avro - Disadvantages

As with anything, there are some disadvantages to using Avro:

  • Some programming languages do not have built-in Avro support, so you need to add a library.
  • Unlike a JSON document, which you can read directly, Avro data cannot be printed out without an Avro library, because it is serialized to binary and possibly compressed.
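To see why the bytes are both compact and unreadable, the sketch below hand-encodes one User record following the binary encoding rules in the Avro specification for the three types used here: integers as zig-zag variable-length values, strings as a length followed by UTF-8 bytes, and booleans as a single byte. It is an illustration rather than a replacement for an Avro library, and it leaves out the file header that a real Avro data file adds.

```python
import json


def encode_long(n):
    """Zig-zag, then variable-length encode an integer (Avro int/long)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length (encoded as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_boolean(b):
    """A single byte: 1 for true, 0 for false."""
    return b"\x01" if b else b"\x00"


ENCODERS = {"int": encode_long, "string": encode_string, "boolean": encode_boolean}

# Field order comes from the schema; names are NOT written to the output.
FIELDS = [("id", "int"), ("name", "string"), ("age", "int"),
          ("email", "string"), ("phone", "string"), ("is_active", "boolean")]


def encode_record(record):
    """Concatenate the encoded fields in schema order."""
    return b"".join(ENCODERS[t](record[name]) for name, t in FIELDS)


user = {"id": 1, "name": "Jane", "age": 25, "email": "jane@example.com",
        "phone": "555-555-5555", "is_active": True}

avro_bytes = encode_record(user)
json_bytes = json.dumps(user).encode("utf-8")

print(avro_bytes)
print(len(avro_bytes), "bytes as Avro vs", len(json_bytes), "bytes as JSON")
```

Without the schema there is no way to tell where one field ends and the next begins, which is exactly why a library (and the schema) is needed to read the data back.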

Conclusion

For more information about Avro, please visit the Avro website and the Avro documentation.