AvroSchema (avro_schema v0.4.0)

This is a library for working with Avro schemas and the Confluent® Schema Registry, primarily focused on working with Kafka streams. It relies on erlavro for encoding and decoding data and confluent_schema_registry to look up schemas using the Schema Registry API.

Its primary value is that it caches schemas for performance and to allow programs to work independently of the Schema Registry being available. It also has a consistent set of functions to manage schema tags, look up schemas from the Schema Registry or files, and encode/decode data.

Many thanks to Klarna for Avlizer, which provides similar functionality to this library in Erlang; for erlavro, which handles Avro; and for brod, which deals with Kafka.

Installation

Add the package to your list of dependencies in mix.exs:

def deps do
  [{:avro_schema, "~> 0.4.0"}]
end

Then run mix deps.get to fetch the new dependency.

Documentation is on HexDocs. To generate a local copy, run mix docs.

Starting

Add the cache GenServer to your application's supervision tree:

def start(_type, _args) do
  # Directory used for the optional on-disk (DETS) cache
  cache_dir = Application.get_env(:yourapp, :cache_dir, "/tmp")

  children = [
    {AvroSchema, [cache_dir: cache_dir]}
  ]

  opts = [strategy: :one_for_one, name: YourApp.Supervisor]
  Supervisor.start_link(children, opts)
end

Overview

When using Kafka, producers and consumers are separated, and schemas may evolve over time. It is common for producers to tag data to indicate the schema that was used to encode it. Consumers can then look up the corresponding schema version and use it to decode the data.

This library supports two tagging formats: the Confluent wire format and Avro single-object encoding.

Confluent wire format

With the Confluent wire format, Avro binary encoded objects are prefixed with a five-byte tag.

The first byte indicates the Confluent serialization format version number, currently always 0. The following four bytes encode the integer schema ID as returned from the Schema Registry in network byte order.
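The layout described above maps directly onto Elixir binary pattern matching. The following is a minimal sketch, not this library's implementation; the module name ConfluentTagSketch is hypothetical and exists only for illustration.

```elixir
defmodule ConfluentTagSketch do
  # Hypothetical module illustrating the five-byte Confluent tag.

  # Magic byte 0, then the schema ID as a 32-bit big-endian (network order) integer.
  # Returns iodata so the payload is not copied.
  def tag(payload, schema_id) when is_integer(schema_id) and schema_id >= 0 do
    [<<0, schema_id::unsigned-big-32>>, payload]
  end

  # Split the tag from the payload, rejecting anything without the magic byte.
  def untag(<<0, schema_id::unsigned-big-32, payload::binary>>) do
    {:ok, {schema_id, payload}}
  end

  def untag(_other), do: {:error, :unknown_tag}
end

tagged = IO.iodata_to_binary(ConfluentTagSketch.tag("hello", 21))
IO.inspect(tagged)
# <<0, 0, 0, 0, 21, 104, 101, 108, 108, 111>>
```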

Avro single-object encoding

When data is exchanged without a schema registry, it's common to prefix the binary data with a hash of the schema that was used to encode it. In the past, that hash might have been something like MD5.

The Avro "Single-object encoding" formalizes this, prefixing Avro binary data with a two-byte marker, C3 01, to show that the message is Avro and uses this single-record format (version 1). That is followed by the 8-byte little-endian CRC-64-AVRO fingerprint of the object's schema.

The CRC-64-AVRO algorithm is uncommon, but it is used because the fingerprint is relatively short while still making collisions between different schemas unlikely. The fingerprint function is implemented in fingerprint_schema/1.
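To make the format concrete, here is a sketch of the CRC-64-AVRO algorithm as given in the Avro specification, together with the single-object prefix. The module Crc64AvroSketch is hypothetical and is not how this library computes fingerprints. It hashes whatever bytes it is given, whereas fingerprint_schema/1 first puts the schema into standard form, so results may differ for schemas that are not already in canonical form.

```elixir
defmodule Crc64AvroSketch do
  # Hypothetical module; follows the reference algorithm in the Avro specification.
  import Bitwise

  # Fingerprint of the empty input, also used as the CRC polynomial constant.
  @empty 0xC15D213AA4D7A795

  # Build the 256-entry lookup table at compile time.
  table =
    for i <- 0..255 do
      Enum.reduce(1..8, i, fn _, fp ->
        mask = if (fp &&& 1) == 1, do: @empty, else: 0
        bxor(fp >>> 1, mask)
      end)
    end

  @table List.to_tuple(table)

  # Returns the fingerprint as an 8-byte little-endian binary.
  def fingerprint(data) when is_binary(data) do
    fp =
      for <<byte <- data>>, reduce: @empty do
        acc -> bxor(acc >>> 8, elem(@table, bxor(acc, byte) &&& 0xFF))
      end

    <<fp::unsigned-little-64>>
  end

  # Single-object encoding: marker C3 01, then the fingerprint, then the Avro payload.
  def tag(payload, schema_json) do
    [<<0xC3, 0x01>>, fingerprint(schema_json), payload]
  end
end
```

Because the fingerprint is derived from the schema text alone, a producer and a consumer that hold the same canonical schema compute the same tag without contacting any service.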

Schema Registry

In a relatively static system, it's not too hard to exchange schema files between producers and consumers. When things are changing more frequently, it can be difficult to keep files up to date. It's also easy for insignificant differences such as whitespace to result in different schema hashes.

The Schema Registry solves this by providing a centralized service which producers and consumers can call to get a unique identifier for a schema version. Producers register a schema with the service and get an id. Consumers look up the id to get the schema.

The Schema Registry also does validation on new schemas to ensure that they meet a backwards compatibility policy for the organization. This helps you to evolve schemas over time and deploy them without breaking running applications.

One disadvantage of the Schema Registry is that it can be a single point of failure. Another is that different Schema Registry instances will, in general, assign different numeric ids to the same schema.

This library provides functions to register schemas with the Schema Registry and look them up by id. It caches the results in RAM (ETS) for performance, and optionally also on disk (DETS). This gives good performance and allows programs to work without needing to communicate with the Schema Registry. Once read, the numeric IDs never change, so it's safe to cache them indefinitely.

The library also has support for managing schemas from files. It can add files to the cache by fingerprint, registering the same schema under multiple fingerprints: one for the raw JSON, one for the Parsing Canonical Form, and one for the JSON with whitespace stripped out. You can also manually register aliases for the name and fingerprint to handle legacy data.

Kafka producer example

A Kafka producer program needs to be able to encode the data with an Avro schema and tag it with the schema ID or fingerprint. It may store the schema in the code or read it from a file, or it may look it up from the Schema Registry using the subject.

The subject is a registered name which identifies the type of data. There are several standard strategies used by Confluent in their Kafka libraries.

  • TopicNameStrategy, the default, registers the schema based on the Kafka topic name, implicitly requiring that all messages use the same schema.

  • RecordNameStrategy names the schema using the record type, allowing a single topic to have multiple different types of data or multiple topics to have the same type of data.

    In an Avro schema, the "full name" is the namespace-qualified name of the record, e.g. com.example.X. It comes from the schema's name field, combined with the namespace field if one is present.

  • TopicRecordNameStrategy names the schema using a combination of topic and record.
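As a rough illustration of the three strategies, the sketch below builds subject names the way Confluent's own clients do, where TopicNameStrategy appends -key or -value to the topic. The module SubjectSketch and its argument order are hypothetical; this library simply takes whatever subject string you pass it.

```elixir
defmodule SubjectSketch do
  # Hypothetical helper; `part` says whether the schema is for the message key or value.

  # TopicNameStrategy: one schema per topic and part.
  def subject(:topic_name, topic, _record_full_name, part) when part in [:key, :value],
    do: "#{topic}-#{part}"

  # RecordNameStrategy: the Avro full name, independent of the topic.
  def subject(:record_name, _topic, record_full_name, _part),
    do: record_full_name

  # TopicRecordNameStrategy: topic and Avro full name combined.
  def subject(:topic_record_name, topic, record_full_name, _part),
    do: "#{topic}-#{record_full_name}"
end

IO.inspect(SubjectSketch.subject(:topic_name, "orders", "com.example.Order", :value))
# "orders-value"
IO.inspect(SubjectSketch.subject(:topic_record_name, "orders", "com.example.Order", :value))
# "orders-com.example.Order"
```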

With the subject, the producer can call the Schema Registry to get the ID matching the Avro schema:

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> subject = "test"
iex> {:ok, regid} = AvroSchema.register_schema(subject, schema_json)
{:ok, 21}

If the schema has already been registered, then the Schema Registry will return the current id. If you are registering a new version of the schema, then the Schema Registry will first check if it is compatible with the old one. Depending on the compatibility rules, it may reject the schema.

The producer next needs to get an encoder for the schema.

The encoder is a function that takes Avro key/value data and encodes it to a binary.

iex> {:ok, encoder} = AvroSchema.make_encoder(schema_json)
{:ok, #Function<2.110795165/1 in :avro.make_simple_encoder/2>}

Next we encode some data:

iex> data = %{field1: "hello", field2: 100}
iex> encoded = AvroSchema.encode!(data, encoder)
[['\n', "hello"], [200, 1]]

Finally, we tag the data:

iex> tagged_confluent = AvroSchema.tag(encoded, 21)
[<<0, 0, 0, 0, 21>>, [['\n', "hello"], [200, 1]]]

If you are using files, the process is similar. First create a fingerprint for the schema:

iex> fp = AvroSchema.fingerprint_schema(schema_json)
<<172, 194, 58, 14, 16, 237, 158, 12>>

Next tag the data:

iex> tagged_avro = AvroSchema.tag(encoded, fp)
[
  <<195, 1>>,
  <<172, 194, 58, 14, 16, 237, 158, 12>>,
  [['\n', "hello"], [200, 1]]
]

Now you can send the data to Kafka.

Kafka consumer example

The process for a consumer is similar.

Receive the data and get the registration id in Confluent format:

iex> tagged_confluent = IO.iodata_to_binary(AvroSchema.tag(encoded, 21))
<<0, 0, 0, 0, 21, 10, 104, 101, 108, 108, 111, 200, 1>>

iex> {:ok, {{:confluent, regid}, bin}} = AvroSchema.untag(tagged_confluent)
{:ok, {{:confluent, 21}, <<10, 104, 101, 108, 108, 111, 200, 1>>}}

Get the schema from the Schema Registry:

iex> {:ok, schema} = AvroSchema.get_schema(regid)
{:ok,
 {:avro_record_type, "test", "", "", [],
  [
    {:avro_record_field, "field1", "", {:avro_primitive_type, "string", []},
     :undefined, :ascending, []},
    {:avro_record_field, "field2", "", {:avro_primitive_type, "int", []},
     :undefined, :ascending, []}
  ], "test", []}}

Create a decoder and decode the data:

iex> {:ok, decoder} = AvroSchema.make_decoder(schema)
{:ok, #Function<4.110795165/1 in :avro.make_simple_decoder/2>}

iex> decoded = AvroSchema.decode!(bin, decoder)
%{"field1" => "hello", "field2" => 100}

The process is similar with a fingerprint. In this case, we get the schema from files and register it in the cache using the schema name and fingerprints. There is more than one fingerprint because we register it with the raw schema from the file and the normalized JSON for better interop.

iex> {:ok, files} = AvroSchema.get_schema_files("test/schemas")
{:ok, ["test/schemas/test.avsc"]}

iex> for file <- files, do: AvroSchema.cache_schema_file(file)
[
  ok: [
    {"test", <<172, 194, 58, 14, 16, 237, 158, 12>>},
    {"test", <<194, 132, 80, 199, 36, 146, 103, 147>>}
  ]
]

To decode, separate the fingerprint from the data:

iex> tagged_avro = IO.iodata_to_binary(AvroSchema.tag(encoded, fp))
<<195, 1, 172, 194, 58, 14, 16, 237, 158, 12, 10, 104, 101, 108, 108, 111, 200,
  1>>

iex> {:ok, {{:avro, fp}, bin}} = AvroSchema.untag(tagged_avro)
{:ok, {{:avro, <<172, 194, 58, 14, 16, 237, 158, 12>>},
  <<10, 104, 101, 108, 108, 111, 200, 1>>}}

Get the decoder and decode the data:

iex> {:ok, schema} = AvroSchema.get_schema({"test", fp})
{:ok,
 {:avro_record_type, "test", "", "", [],
  [
    {:avro_record_field, "field1", "", {:avro_primitive_type, "string", []},
     :undefined, :ascending, []},
    {:avro_record_field, "field2", "", {:avro_primitive_type, "int", []},
     :undefined, :ascending, []}
  ], "test", []}}

Decoding works the same as with the Schema Registry:

iex> {:ok, decoder} = AvroSchema.make_decoder(schema)
{:ok, #Function<4.110795165/1 in :avro.make_simple_decoder/2>}

iex> decoded = AvroSchema.decode!(bin, decoder)
%{"field1" => "hello", "field2" => 100}

Performance

For best performance, save the encoder or decoder in your process state to avoid the overhead of looking it up for each message.

An in-memory ETS cache maps the integer registry ID or name + fingerprint to the corresponding schema and decoder. It also allows lookups using name and fingerprint as a key.

The fingerprint is CRC64 by default. You can also register a name with your own fingerprint.

This library also allows consumers to look up the schema on demand from the Schema Registry using the name + fingerprint as the registry subject name.

This library can optionally persist the cache data on disk using DETS, allowing programs to work without continuous access to the Schema Registry.
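The idea behind the on-disk cache can be sketched with the standard :dets and :ets modules: entries are written to a DETS file and later loaded into a fresh ETS table on startup. This is only an assumption about the general shape of such a cache, not this library's code; the module DetsSketch and the table names are hypothetical.

```elixir
defmodule DetsSketch do
  # Hypothetical module showing a DETS file backing an in-memory ETS table.

  # Persist a list of {key, value} entries to the DETS file at `path`.
  def save(path, entries) do
    {:ok, dets} = :dets.open_file(:dets_sketch, file: String.to_charlist(path), type: :set)
    :ok = :dets.insert(dets, entries)
    :dets.close(dets)
  end

  # Load all persisted entries into a new ETS table and return the table.
  def load(path) do
    {:ok, dets} = :dets.open_file(:dets_sketch, file: String.to_charlist(path), type: :set)
    ets = :ets.new(:dets_sketch_mem, [:set, :public, read_concurrency: true])
    ^ets = :dets.to_ets(dets, ets)
    :ok = :dets.close(dets)
    ets
  end
end
```

After calling DetsSketch.load/1, lookups go through :ets.lookup/2 only, so a restarted program can decode messages for already-seen schema ids even when the Schema Registry is unreachable.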

Programs which use Kafka may process high message volumes, so efficiency is important. They generally use multiple processes, typically one per topic partition or more. On startup, each process may simultaneously attempt to look up schemas.

The cache lookup runs in the caller's process, so it can run in parallel. If there is a cache miss, then it calls the GenServer to update the cache. This has the effect of serializing requests, ensuring that only one runs at a time. See https://www.cogini.com/blog/avoiding-genserver-bottlenecks/ for discussion.
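The pattern described above can be sketched as follows: reads hit ETS directly from the calling process, and only misses go through the GenServer, which re-checks the table before fetching. The module CacheSketch and the fetch function passed to start_link/1 are hypothetical stand-ins; in this library the fetch would be an HTTP call to the Schema Registry.

```elixir
defmodule CacheSketch do
  # Hypothetical module illustrating "read in the caller, write in the GenServer".
  use GenServer

  @table :cache_sketch

  # `fetch_fun` takes a key and returns {:ok, value} or {:error, reason}.
  def start_link(fetch_fun) when is_function(fetch_fun, 1) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  # Runs in the caller's process: a concurrent ETS read, no message passing on a hit.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> GenServer.call(__MODULE__, {:fetch, key})
    end
  end

  @impl true
  def init(fetch_fun) do
    # Protected table: only this process writes, any process may read.
    :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, fetch_fun}
  end

  @impl true
  def handle_call({:fetch, key}, _from, fetch_fun) do
    # Re-check: another caller may have filled the entry while this request was queued.
    reply =
      case :ets.lookup(@table, key) do
        [{^key, value}] ->
          {:ok, value}

        [] ->
          with {:ok, value} <- fetch_fun.(key) do
            :ets.insert(@table, {key, value})
            {:ok, value}
          end
      end

    {:reply, reply, fetch_fun}
  end
end
```

If many partition workers start at once and all miss on the same key, their requests queue up in the GenServer, the first one performs the fetch, and the rest are answered from the table by the re-check.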

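To improve interoperability, put the schema into standard form (Parsing Canonical Form, see canonicalize_schema/1) before fingerprinting or registering it.

A program can also call the Schema Registry directly to get the latest schema registered for a given subject. Here client is a Tesla client for the Schema Registry (see the client() type):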

iex> ConfluentSchemaRegistry.get_schema(client, "test")
{:ok,
%{
 "id" => 21,
 "schema" => "{\"type\":\"record\",\"name\":\"test\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}",
 "subject" => "test",
 "version" => 13
}}

Timestamps

Avro timestamps are in Unix format with microsecond precision:

iex> datetime = DateTime.utc_now()
~U[2019-11-08 09:09:01.055742Z]

iex> timestamp = AvroSchema.to_timestamp(datetime)
1573204141055742

iex> datetime = AvroSchema.to_datetime(timestamp)
~U[2019-11-08 09:09:01.055742Z]
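The same conversion can be expressed with the Elixir standard library alone, which may help when checking values by hand. This is a sketch that assumes the conversion is equivalent to DateTime.to_unix/2 and DateTime.from_unix!/2 with the :microsecond unit (matching Avro's timestamp-micros logical type); the module TimestampSketch is hypothetical.

```elixir
defmodule TimestampSketch do
  # Hypothetical module; microseconds since the Unix epoch, in UTC.
  def to_timestamp(%DateTime{} = datetime), do: DateTime.to_unix(datetime, :microsecond)

  def to_datetime(timestamp) when is_integer(timestamp),
    do: DateTime.from_unix!(timestamp, :microsecond)
end

timestamp = TimestampSketch.to_timestamp(~U[2019-11-08 09:09:01.055742Z])
IO.inspect(timestamp)
# 1573204141055742
IO.inspect(TimestampSketch.to_datetime(timestamp))
# ~U[2019-11-08 09:09:01.055742Z]
```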

Contacts

I am jakemorrison on the Elixir Slack and Discord, and reachfh on the Freenode #elixir-lang IRC channel. Happy to chat or help with your projects.

Summary

Types

  • cache_value(): Cache value
  • client(): Tesla client
  • fp(): Fingerprint, normally CRC-64-AVRO but could be e.g. MD5
  • ref(): Cache key
  • regid(): Integer ID returned by Schema Registry
  • subject(): Subject in Schema Registry / Avro name

Functions

  • cache_schema_file/2: Cache schema files with fingerprints.
  • canonicalize_schema/1: Ensure Avro schema is in Parsing Canonical Form.
  • child_spec/1: Returns a specification to start this module under a supervisor.
  • create_fingerprint/1: Create CRC-64-AVRO fingerprint hash for Avro schema JSON.
  • decode/2: Decode binary Avro data.
  • decode!/2: Decode binary Avro data, raises if there is a decoding error.
  • encode/2: Encode Avro data to binary.
  • encode!/2: Encode Avro data to binary, raises if there is an encoding error.
  • encode_schema/2: Encode parsed schema as JSON.
  • fingerprint_schema/1: Create fingerprint of schema JSON.
  • full_name/1: Get full name field from schema.
  • get_encoder/2: Get encoder function for registration id.
  • get_schema/1: Get schema for schema reference.
  • get_schema_files/2: List schema files in directory.
  • make_encoder/2: Make Avro encoder for schema.
  • make_fp_subject/1: Make registration subject from name + fingerprint.
  • normalize_json/1: Normalize JSON by decoding and re-encoding it.
  • parse_schema/1: Parse schema into Avro library internal form.
  • register_schema/2: Register schema in Confluent Schema Registry.
  • start_link/2: Start cache GenServer.
  • stop/0: Stop cache GenServer.
  • tag/2: Tag Avro binary data with schema that created it.
  • to_datetime/1: Convert Avro integer timestamp to DateTime.
  • to_hex/1: Convert binary fingerprint to hex.
  • to_timestamp/1: Convert DateTime to Avro integer timestamp with microsecond precision.
  • untag/1: Split schema tag from tagged Avro binary data.

Types

@type cache_value() ::
  [{ref(), :avro.avro_type()}] | {{binary(), binary()}, integer()}

Cache value

@type client() :: Tesla.Client.t()

Tesla client

@type decoded() :: map() | [{binary(), term()}]
@type fp() :: binary()

Fingerprint, normally CRC-64-AVRO but could be e.g. MD5

@type ref() :: regid() | {subject(), fp()}

Cache key

@type regid() :: pos_integer()

Integer ID returned by Schema Registry

@type subject() :: binary()

Subject in Schema Registry / Avro name

Functions

cache_registration(subject, schema, regid, persistent \\ false)

@spec cache_registration(binary(), binary(), integer(), boolean()) :: :ok | :error

Cache registration.

Inserts the schema in the local cache.

register_schema/2 will then return the id without needing to communicate with the Schema Registry.

cache_schema(refs, schema, persistent \\ false)

@spec cache_schema(ref() | [ref()], binary() | :avro.avro_type(), boolean()) ::
  :ok | {:error, term()}

Cache schema locally.

Inserts the schema in the local cache under one or more references.

get_schema/1 will then return the schema without needing to communicate with the Schema Registry.

Examples

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> {:ok, schema} = AvroSchema.parse_schema(schema_json)
iex> full_name = AvroSchema.full_name(schema_json)
iex> fp = AvroSchema.fingerprint_schema(schema_json)
iex> ref = {full_name, fp}
iex> :ok = AvroSchema.cache_schema(ref, schema)
:ok

cache_schema_file(path, subject_aliases \\ %{})

@spec cache_schema_file(Path.t(), map()) ::
  {:ok, [{binary(), binary()}]} | {:error, term()}

Cache schema files with fingerprints.

Loads a schema file from disk, and parses it to get the full name.

Generates fingerprints for the schema using the raw file bytes, Parsing Canonical Form, and normalized JSON for the canonical form (whitespace stripped). This improves interop with fingerprints generated by other programs, avoiding insignificant differences.

Also accepts a map with aliases that the schema should be registered under, e.g. with a different subject name or legacy fingerprint. Map key is the full name, and value is a list of fingerprints or {name, fingerprint} tuples.

Deduplicates the fingerprints and aliases, then calls cache_schema/3.

canonicalize_schema(schema)

@spec canonicalize_schema(binary()) :: binary()

Ensure Avro schema is in Parsing Canonical Form.

Converts schema into Parsing Canonical Form by decoding it and re-encoding it.

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

create_fingerprint(binary)

@spec create_fingerprint(binary()) :: fp()

Create CRC-64-AVRO fingerprint hash for Avro schema JSON.

See CRC-64-AVRO

Examples

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> AvroSchema.fingerprint_schema(schema_json)
<<172, 194, 58, 14, 16, 237, 158, 12>>

@spec decode(binary(), (... -> any())) :: {:ok, decoded()} | {:error, term()}

Decode binary Avro data.

@spec decode!(binary(), (... -> any())) :: decoded()

Decode binary Avro data, raises if there is a decoding error

@spec encode(map() | [{binary(), term()}], (... -> any())) ::
  {:ok, binary()} | {:error, term()}

Encode Avro data to binary.

@spec encode!(map() | [{binary(), term()}], (... -> any())) :: binary()

Encode Avro data to binary, raises if there is an encoding error

encode_schema(schema, opts \\ [])

@spec encode_schema(:avro.avro_type(), Keyword.t()) :: binary()

Encode parsed schema as JSON.

fingerprint_schema(schema)

@spec fingerprint_schema(binary()) :: fp()

Create fingerprint of schema JSON.

Ensures that the schema is in standard form, then generates a CRC-64-AVRO fingerprint for it using create_fingerprint/1.

Examples

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> AvroSchema.fingerprint_schema(schema_json)
<<172, 194, 58, 14, 16, 237, 158, 12>>

@spec full_name(:avro.avro_type() | binary()) :: binary()

Get full name field from schema.

This can be used as the Schema Registry subject.

get_decoder(ref, decoder_opts \\ [record_type: :map, map_type: :map])

@spec get_decoder(ref() | {:confluent, regid()}, Keyword.t()) ::
  {:ok, (... -> any())} | {:error, term()}

Get decoder function for registration id.

Convenience function, calls get_schema/1 on the id, then make_decoder/2.

get_encoder(ref, encoder_opts \\ [])

@spec get_encoder(ref(), Keyword.t()) :: {:ok, (... -> any())} | {:error, term()}

Get encoder function for registration id.

Convenience function, calls get_schema/1 on the id, then make_encoder/2.

@spec get_schema(ref()) :: {:ok, :avro.avro_type()} | {:error, term()}

Get schema for schema reference.

This tries to read the schema from the cache. If not found, it makes a call to the Schema Registry.

This is typically called by a Kafka consumer to find the schema which was used to encode data based on the tag.

This call has the overhead of an ETS lookup and potentially a GenServer call to fetch the Avro schema via HTTP. If you need maximum performance, keep the result and reuse it for future requests with the same reference.

get_schema_files(dir, ext \\ "avsc")

List schema files in directory.

Lists files in a directory matching an extension, default avsc.

make_decoder(schema, decoder_opts \\ [record_type: :map, map_type: :map])

@spec make_decoder(binary() | :avro.avro_type(), Keyword.t()) ::
  {:ok, (... -> any())} | {:error, term()}

Make Avro decoder for schema.

Creates a function which decodes Avro-encoded binary data to a map. By default, a :hook option is provided that converts all :null values to nil.

make_encoder(schema_json, encoder_opts \\ [])

@spec make_encoder(binary() | :avro.avro_type(), Keyword.t()) ::
  {:ok, (... -> any())} | {:error, term()}

Make Avro encoder for schema.

Creates a function which encodes Avro terms to binary.

@spec make_fp_subject({binary(), fp()}) :: binary()

Make registration subject from name + fingerprint.

make_fp_subject(name, fp)

@spec make_fp_subject(binary(), fp()) :: binary()

@spec normalize_json(binary()) :: binary()

Normalize JSON by decoding and re-encoding it.

This reduces irrelevant differences such as whitespace which may affect fingerprinting.

@spec parse_schema(binary()) :: {:ok, :avro.avro_type()} | {:error, term()}

Parse schema into Avro library internal form.

Examples

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> {:ok, schema} = AvroSchema.parse_schema(schema_json)
{:ok,
 {:avro_record_type, "test", "", "", [],
  [
    {:avro_record_field, "field1", "", {:avro_primitive_type, "string", []},
     :undefined, :ascending, []},
    {:avro_record_field, "field2", "", {:avro_primitive_type, "int", []},
     :undefined, :ascending, []}
  ], "test", []}}

register_schema(subject, schema)

@spec register_schema(subject(), binary()) :: {:ok, regid()} | {:error, term()}

Register schema in Confluent Schema Registry.

The subject is a unique name to register the schema, often the full name from the Avro schema. See the standard strategies used by Confluent in their Kafka libraries.

It is safe to register the same schema multiple times; the Schema Registry will always return the same ID.

start_link(args, opts \\ [])

@spec start_link(list(), list()) :: {:ok, pid()} | {:error, any()}

Start cache GenServer.

@spec stop() :: :ok

Stop cache GenServer

@spec tag(iodata(), regid() | fp()) :: iolist()

Tag Avro binary data with schema that created it.

Adds a tag to the front of data indicating the schema that was used to encode it.

Uses Confluent wire format for integer registry IDs and Avro single object encoding for fingerprints.

This function matches schema IDs as integers and encodes them using Confluent format, and fingerprints as binary and encodes them as Avro.

Strictly speaking, however, fingerprints are integers, so make sure that you convert them to binary before calling this function.

Note that this function returns an iolist for efficiency, not a binary.

Examples

iex> schema_json = "{\"name\":\"test\",\"type\":\"record\",\"fields\":[{\"name\":\"field1\",\"type\":\"string\"},{\"name\":\"field2\",\"type\":\"int\"}]}"
iex> fp = AvroSchema.fingerprint_schema(schema_json)
iex> AvroSchema.tag("hello", fp)
[<<195, 1>>, <<172, 194, 58, 14, 16, 237, 158, 12>>, "hello"]

to_datetime(timestamp)

Convert Avro integer timestamp to DateTime.

iex> timestamp = 1573204141055742
iex> datetime = AvroSchema.to_datetime(timestamp)
~U[2019-11-08 09:09:01.055742Z]

@spec to_hex(fp()) :: binary()

Convert binary fingerprint to hex

@spec to_timestamp(DateTime.t()) :: non_neg_integer()

Convert DateTime to Avro integer timestamp with microsecond precision.

Examples

iex> datetime = DateTime.utc_now()
~U[2019-11-08 09:09:01.055742Z]

iex> timestamp = AvroSchema.to_timestamp(datetime)
1573204141055742

@spec untag(iodata()) ::
  {:ok, {{:confluent, regid()}, binary()}}
  | {:ok, {{:avro, fp()}, binary()}}
  | {:error, :unknown_tag}

Split schema tag from tagged Avro binary data.

Supports Confluent wire format for integer registry IDs and Avro single object encoding for fingerprints.

Examples

iex> tagged_avro = IO.iodata_to_binary([<<195, 1>>, <<172, 194, 58, 14, 16, 237, 158, 12>>, "hello"])
iex> AvroSchema.untag(tagged_avro)
{:ok, {{:avro, <<172, 194, 58, 14, 16, 237, 158, 12>>}, "hello"}}

iex> tagged_confluent = IO.iodata_to_binary([<<0, 0, 0, 0, 7>>, "hello"])
iex> AvroSchema.untag(tagged_confluent)
{:ok, {{:confluent, 7}, "hello"}}