glazer_csv (glazer v0.5.4)

Fast CSV encoding and decoding using the glaze C++ library.

By default, null values (for example, those produced by on_failure => null) are represented as the atom null. To change this application-wide, set the null key in the glazer application environment:

{glazer, [{null, nil}]}.
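
In a release, this entry would typically live in sys.config. The fragment below is a sketch that assumes glazer reads the key through the standard OTP application environment; the file name and the enclosing list follow the usual OTP convention and are not specified on this page.

```erlang
%% sys.config (sketch): application environment for the glazer app.
%% Represents CSV nulls as the atom nil instead of the default null.
[
  {glazer, [{null, nil}]}
].
```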

Features

  • RFC 4180 CSV encoding/decoding via decode/1,2 and encode/1,2, with optional header-row support
  • Per-column field type conversion ({fields, Specs}), including integers, floats, booleans, datetimes, atoms, and strings (binaries)
  • Incremental/streaming CSV decoding via stream_decoder/0,1, stream_feed/2, stream_eof/1
  • Configurable representation of CSV null values
  • read_file/1,2 and write_file/2,3 helpers for decoding/encoding directly to/from a file

See also: https://github.com/stephenberry/glaze

Summary

Types

csv_result()
The result of a successful CSV decode: a map with two keys.

decode_error()
Error reasons returned by try_decode/1,2 or raised by decode/1,2

decode_opt()
A single CSV decode option. See decode_opts/0 for the full reference table of all available options and their effects.

decode_opts()
CSV decode options

encode_opt()
A single CSV encode option. See encode_opts/0 for descriptions of all available options.

encode_opts()
CSV encode options

field_on_failure()
Controls what happens when a non-empty field fails to convert to the requested field_type() (default binary)

field_spec()
A single element of the {fields, Specs} CSV decode option: either a field_type() directly, or a map for more control

field_type()
A single column's target type for the {fields, Specs} CSV decode option

headers_type()
How the header row should be represented when using {headers, Type}

scan_state()
Resumable state of the incremental row-boundary scanner used inside a stream_decoder/0. Carries the current byte offset and a flag indicating whether the scanner is currently inside a quoted field. Exposed so the state can be serialised or inspected; normal usage does not require direct access to this type.

stream_decoder()
Opaque handle for incremental CSV decoding. Created by stream_decoder/0,1 and threaded through successive stream_feed/2 calls; call stream_eof/1 to flush any remaining buffered bytes at the end of the input.

Functions

decode(Input)
Decode a CSV binary or iolist.

decode(Input, Opts)
Decode a CSV binary or iolist with options (see decode_opts/0). Returns a csv_result/0. Raises a decode_error/0 reason on invalid input.

encode(Data)
Encode a list of rows to a CSV binary.

encode(Data, Opts)
Encode a list of rows to a CSV binary, with options.

read_file(Filename)
Read Filename and decode its contents as CSV.

read_file(Filename, Opts)
Read Filename and decode its contents as CSV, with decode options (see decode/2).

stream_decoder()
Create a new incremental decoder for feeding CSV in chunks (e.g. from a socket or file), useful when the whole input isn't available up front.

stream_decoder(Opts)
Create a new incremental CSV decoder, passing Opts through to every internal decode/2 call.

stream_eof/1
Signal end-of-stream: decode any remaining buffered bytes as a final row (useful when the input doesn't end with a trailing line break).

stream_feed/2
Feed a chunk of bytes into the decoder, returning any complete CSV rows found so far (in order) along with the updated decoder.

try_decode(Input)
Decode a CSV binary or iolist, returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

try_decode(Input, Opts)
Decode a CSV binary or iolist with options (see decode_opts/0), returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

write_file(Filename, Data)
Encode Data to CSV and write it to Filename, overwriting any existing file.

write_file(Filename, Data, Opts)
Encode Data to CSV with encode options (see encode/2) and write it to Filename, overwriting any existing file.

Types

csv_result()

-type csv_result() :: #{headers := nil | [binary() | atom()], data := [[term()]] | [tuple()] | [map()]}.

The result of a successful CSV decode: a map with two keys.

  • headers - nil when the headers option was not given; otherwise a list of column names (binaries by default, atoms with {headers, atom} or {headers, existing_atom})
  • data - list of data rows; each row is a list of field values by default, a tuple of field values with {return, tuple}, or a map keyed by the column names when both headers and {return, map} are given
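
To make the relationship between headers and data concrete, the sketch below pairs each list row with the column names, which is roughly the shape {return, map} is described as producing. The module and function names are hypothetical, and it assumes every row has exactly one field per header.

```erlang
-module(csv_result_sketch).
-export([rows_to_maps/1]).

%% Pair each list row with the column names, producing one map per row.
%% Only meaningful for results decoded with headers and the default
%% {return, list}. lists:zip/2 fails if a row and the header list differ
%% in length, so ragged rows are not handled here.
rows_to_maps(#{headers := Headers, data := Rows}) when is_list(Headers) ->
    [maps:from_list(lists:zip(Headers, Row)) || Row <- Rows].
```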

decode_error()

-type decode_error() ::
          unterminated_quoted_field | duplicate_header |
          {invalid_field_value, Row :: pos_integer(), Column :: pos_integer()}.

Error reasons returned by try_decode/1,2 or raised by decode/1,2:

  • unterminated_quoted_field - input ended inside a "..." field with no closing quote
  • duplicate_header - two columns share the same name and {return, map} was requested (map keys must be unique)
  • {invalid_field_value, Row, Column} - a field at the given 1-based row/column position failed to convert to the type requested by {fields, Specs} with on_failure => raise
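
Because the set of reasons is small and closed, a caller can map each one to a message. This is a minimal sketch; the module name, function name, and message wording are all hypothetical.

```erlang
-module(csv_error_sketch).
-export([describe/1]).

%% Turn a decode_error() reason into a human-readable binary.
describe(unterminated_quoted_field) ->
    <<"input ended inside a quoted field">>;
describe(duplicate_header) ->
    <<"duplicate column name with {return, map}">>;
describe({invalid_field_value, Row, Column}) ->
    %% Row and Column are 1-based, as documented above.
    iolist_to_binary(
      io_lib:format("row ~B, column ~B: field could not be converted",
                    [Row, Column])).
```

It could be applied to the {error, Reason} result of try_decode/1,2, or to the reason caught from decode/1,2.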

decode_opt()

-type decode_opt() ::
          {delimiter, char()} |
          headers |
          {headers, [atom() | binary()] | headers_type()} |
          {fields, [field_spec()]} |
          {null_term, atom()} |
          {return, list | map | tuple} |
          {skip, non_neg_integer() | {pos_integer(), pos_integer()}} |
          {limit, pos_integer()}.

A single CSV decode option. See decode_opts/0 for the full reference table of all available options and their effects.

decode_opts()

-type decode_opts() :: [decode_opt()].

CSV decode options:

| Option | Description |
|---|---|
| {delimiter, Char} | Field delimiter character (default $,) |
| headers | Treat the first row as column names (shorthand for {headers, binary}) |
| {headers, [Name, ...]} | Use the given list of atoms or binaries as column names; the first data row is not consumed as a header |
| {headers, binary} | First row → binary column names (same as bare headers) |
| {headers, string} | Alias for {headers, binary} |
| {headers, atom} | First row → atom column names (equivalent to binary_to_atom/2) |
| {headers, existing_atom} | First row → existing-atom column names (falls back to a binary for unknown atoms) |
| {headers, charlist} | First row → column names as lists of Unicode codepoints |
| {return, list} | Data rows are lists of field values (default) |
| {return, tuple} | Data rows are tuples of field values |
| {return, map} | Data rows are maps keyed by column names; requires headers or {headers, ...}. Raises duplicate_header on duplicate column names |
| {fields, Specs} | Per-column type conversion, applied positionally; see field_spec/0 |
| {skip, N} | Skip the first N data rows (after any header row) |
| {skip, {From, To}} | Process only data rows From..To (1-based inclusive); equivalent to {skip, From-1} plus {limit, To-From+1} |
| {limit, N} | Process at most N data rows (after skipping) |
| {null_term, Atom} | Atom to use for on_failure => null; overrides the library-wide null application env key |
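
The stated equivalence between {skip, {From, To}} and a skip/limit pair can be sketched in plain Erlang. The module and function names are hypothetical; select/2 only illustrates the row arithmetic on an already-decoded list and is not how the library applies the option.

```erlang
-module(skip_range_sketch).
-export([to_skip_limit/1, select/2]).

%% Rewrite a 1-based inclusive range as the equivalent skip/limit options.
to_skip_limit({From, To}) when From >= 1, To >= From ->
    [{skip, From - 1}, {limit, To - From + 1}].

%% Apply the same selection to a list of data rows.
%% lists:sublist/3 takes a 1-based start position and a maximum length.
select({From, To}, Rows) when From >= 1, To >= From ->
    lists:sublist(Rows, From, To - From + 1).
```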

encode_opt()

-type encode_opt() :: {delimiter, char()} | headers | {line_ending, lf | crlf}.

A single CSV encode option. See encode_opts/0 for descriptions of all available options.

encode_opts()

-type encode_opts() :: [encode_opt()].

CSV encode options:

  • {delimiter, Char} - field delimiter (default $,)
  • headers - input is a list of maps; the first map's keys become the header row, and subsequent maps are encoded as rows in that column order (missing keys produce empty fields)
  • {line_ending, lf | crlf} - line terminator (default crlf, per RFC 4180)

field_on_failure()

-type field_on_failure() :: binary | raise | default | null.

Controls what happens when a non-empty field fails to convert to the requested field_type() (default binary):

  • binary - leave the field as the original binary (default)
  • raise - raise (or return {error, Reason} from try_decode/2) {invalid_field_value, Row, Column} (1-based)
  • default - use the spec's default value (falls back to binary if no default is given)
  • null - use the configured null term: {null_term, Atom} if given, otherwise the library-wide null term (set via the null key in the glazer application environment)
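
The four policies can be sketched for an integer column in plain Erlang. This is an illustration of the documented semantics, not the library's code: the module name, function name, and argument order are hypothetical, and the handling of an empty field without a default is an assumption.

```erlang
-module(on_failure_sketch).
-export([to_integer/4]).

%% to_integer(Field, Spec, NullTerm, {Row, Column}) -> term()
%% Field is the raw binary; Spec is a field_spec()-style map.
to_integer(<<>>, #{default := Default}, _Null, _Pos) ->
    Default;                      % empty field: use the spec's default
to_integer(<<>>, _Spec, _Null, _Pos) ->
    <<>>;                         % ASSUMPTION: empty field, no default, left as-is
to_integer(Field, Spec, Null, {Row, Column}) ->
    try
        binary_to_integer(Field)
    catch
        error:badarg ->
            %% Conversion failed on a non-empty field: apply on_failure.
            case maps:get(on_failure, Spec, binary) of
                binary  -> Field;
                raise   -> error({invalid_field_value, Row, Column});
                null    -> Null;
                default -> maps:get(default, Spec, Field)
            end
    end.
```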

field_spec()

-type field_spec() ::
          field_type() | #{type := field_type(), default => term(), on_failure => field_on_failure()}.

A single element of the {fields, Specs} CSV decode option: either a field_type() directly, or a map for more control:

  • type - the field_type() to convert the field to
  • default - used in place of the converted value whenever the raw CSV field is empty; also used on conversion failure when on_failure is default
  • on_failure - see field_on_failure/0 (default binary)

field_type()

-type field_type() ::
          integer |
          {float, non_neg_integer()} |
          boolean |
          {datetime, binary()} |
          binary | charlist | existing_atom |
          {atom, ExistingAtoms :: [atom()]}.

A single column's target type for the {fields, Specs} CSV decode option:

  • integer - parse as an integer
  • {float, Precision} - parse as a float, rounded to Precision decimal digits
  • boolean - parse "true"/"false" (any case) as true/false
  • {datetime, InputFormat} - parse using a strptime-like format string (%Y %m %d %H %M %S %f %z and literals; %z accepts Z, +HHMM, or +HH:MM), converting the result to Unix epoch seconds (UTC)
  • binary - leave as a binary (default)
  • charlist - convert to a list of Unicode code points
  • existing_atom - convert to an existing atom, falling back to a binary if no such atom exists
  • {atom, ExistingAtoms} - convert to an atom only if the field's text matches (and exists as) one of ExistingAtoms, falling back to a binary otherwise
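
Since a {datetime, InputFormat} column is described as decoding to Unix epoch seconds, downstream code may want a calendar datetime back. A minimal sketch using only the standard calendar module follows; the module and function names are hypothetical.

```erlang
-module(epoch_sketch).
-export([to_datetime/1, from_datetime/1]).

%% Gregorian seconds at 1970-01-01T00:00:00 UTC.
-define(EPOCH, 62167219200).

%% Unix epoch seconds -> {{Year, Month, Day}, {Hour, Minute, Second}} in UTC.
to_datetime(Seconds) when is_integer(Seconds) ->
    calendar:gregorian_seconds_to_datetime(Seconds + ?EPOCH).

%% UTC calendar datetime -> Unix epoch seconds.
from_datetime({{_, _, _}, {_, _, _}} = DateTime) ->
    calendar:datetime_to_gregorian_seconds(DateTime) - ?EPOCH.
```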

headers_type()

-type headers_type() :: atom | existing_atom | binary | string | charlist.

How the header row should be represented when using {headers, Type}:

  • atom - column names are converted to atoms (equivalent to binary_to_atom/2)
  • existing_atom - column names are converted to existing atoms (binaries if not found)
  • binary - column names are kept as binaries (default)
  • string - alias for binary
  • charlist - column names are converted to lists of Unicode codepoints

scan_state()

-type scan_state() :: {non_neg_integer(), boolean()}.

Resumable state of the incremental row-boundary scanner used inside a stream_decoder/0. Carries the current byte offset and a flag indicating whether the scanner is currently inside a quoted field. Exposed so the state can be serialised or inspected; normal usage does not require direct access to this type.

stream_decoder()

-opaque stream_decoder()

Opaque handle for incremental CSV decoding. Created by stream_decoder/0,1 and threaded through successive stream_feed/2 calls; call stream_eof/1 to flush any remaining buffered bytes at the end of the input.

Functions

decode(Input)

-spec decode(binary() | iolist()) -> csv_result().

Decode a CSV binary or iolist.

Returns a csv_result/0 map #{headers => nil, data => Rows}, where Rows is a list of rows and each row is a list of binary fields. To capture the first row as column names in headers instead of data, pass the headers option to decode/2. Raises a decode_error/0 reason on invalid input.

Examples

1> glazer_csv:decode(<<"a,b\n1,2\n3,4\n">>).
#{headers => nil, data => [[<<"a">>,<<"b">>],[<<"1">>,<<"2">>],[<<"3">>,<<"4">>]]}

2> glazer_csv:decode(<<>>).
#{headers => nil, data => []}

3> glazer_csv:decode(<<"\"hello, world\",42\n">>).
#{headers => nil, data => [[<<"hello, world">>,<<"42">>]]}

decode(Input, Opts)

-spec decode(binary() | iolist(), decode_opts()) -> csv_result().

Decode a CSV binary or iolist with options (see decode_opts/0). Returns a csv_result/0. Raises a decode_error/0 reason on invalid input.

Examples

%% First row as binary column names
1> glazer_csv:decode(<<"name,age\nAlice,30\nBob,25\n">>, [headers]).
#{headers => [<<"name">>,<<"age">>],
  data    => [[<<"Alice">>,<<"30">>],[<<"Bob">>,<<"25">>]]}

%% Explicit column names — no header row expected in the data
2> glazer_csv:decode(<<"Alice,30\n">>, [{headers, [name, age]}, {return, map}]).
#{headers => [name,age], data => [#{age => <<"30">>, name => <<"Alice">>}]}

%% Per-column type conversion
3> glazer_csv:decode(<<"Alice,30\n">>, [{fields, [binary, integer]}]).
#{headers => nil, data => [[<<"Alice">>,30]]}

%% Semicolon delimiter, skip the first data row, limit to 2 rows
4> glazer_csv:decode(<<"h1;h2\nr1a;r1b\nr2a;r2b\nr3a;r3b\nr4a;r4b\n">>,
                     [{delimiter, $;}, headers, {skip, 1}, {limit, 2}]).
#{headers => [<<"h1">>,<<"h2">>],
  data    => [[<<"r2a">>,<<"r2b">>],[<<"r3a">>,<<"r3b">>]]}

%% Rows as maps with atom keys
5> glazer_csv:decode(<<"a,b\n1,2\n">>, [{headers, existing_atom}, {return, map}]).
#{headers => [a,b], data => [#{a => <<"1">>, b => <<"2">>}]}

%% Rows as tuples
6> glazer_csv:decode(<<"a,b\n1,2\n">>, [{return, tuple}]).
#{headers => nil, data => [{<<"a">>,<<"b">>},{<<"1">>,<<"2">>}]}

encode(Data)

-spec encode([[term()]] | [map()]) -> binary().

Encode a list of rows to a CSV binary.

Each row is a list of fields (binaries, atoms, integers, or floats). Fields containing the delimiter, a double quote, or a line break are quoted per RFC 4180, with embedded quotes doubled.
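
The quoting rule just described can be sketched for a single binary field as follows. This illustrates the RFC 4180 rule rather than the library's encoder; the module and function names are hypothetical, and the delimiter is assumed to be a single byte.

```erlang
-module(csv_quote_sketch).
-export([quote_field/2]).

%% Quote Field if it contains the delimiter, a double quote, CR, or LF.
%% Embedded double quotes are doubled; other fields pass through unchanged.
quote_field(Field, Delimiter) when is_binary(Field), Delimiter < 256 ->
    Special = [<<Delimiter>>, <<"\"">>, <<"\r">>, <<"\n">>],
    case binary:match(Field, Special) of
        nomatch ->
            Field;
        _Found ->
            Escaped = binary:replace(Field, <<"\"">>, <<"\"\"">>, [global]),
            <<"\"", Escaped/binary, "\"">>
    end.
```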

Examples

1> glazer_csv:encode([[<<"a">>, <<"b">>], [1, 2]]).
<<"a,b\r\n1,2\r\n">>

2> glazer_csv:encode([[<<"hello, world">>, <<"say \"hi\"">>]]).
<<"\"hello, world\",\"say \"\"hi\"\"\"\r\n">>

3> glazer_csv:encode([]).
<<>>

encode(Data, Opts)

-spec encode([[term()]] | [map()], encode_opts()) -> binary().

Encode a list of rows to a CSV binary, with options.

With the headers option, Data is a list of maps: the first map's keys become the header row (in iteration order), and each map is encoded as a row in that column order.

Examples

%% Maps to CSV with a header row
1> glazer_csv:encode([#{<<"name">> => <<"Alice">>, <<"age">> => 30}], [headers]).
<<"age,name\r\n30,Alice\r\n">>

%% Semicolon delimiter with LF line endings
2> glazer_csv:encode([[<<"a">>, <<"b">>], [1, 2]],
                     [{delimiter, $;}, {line_ending, lf}]).
<<"a;b\n1;2\n">>

read_file(Filename)

-spec read_file(file:name_all()) -> csv_result().

Read Filename and decode its contents as CSV.

Raises a decode_error/0 reason if the file's contents aren't valid CSV, or a binary "Filename: Reason" message (see file:format_error/1) if the file can't be read.
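
A caller who prefers tagged tuples could wrap the call as sketched below. The module and function names are hypothetical. safe_read/1 assumes the binary message is raised with class error, as the shell output in the examples suggests, and message/2 only shows how such a "Filename: Reason" binary can be assembled for filenames that form a valid iolist.

```erlang
-module(file_error_sketch).
-export([message/2, safe_read/1]).

%% Build a "Filename: Reason" binary from a file:posix() reason.
message(Filename, Reason) ->
    iolist_to_binary([Filename, ": ", file:format_error(Reason)]).

%% Convert the binary file error into {error, Msg}; decode errors
%% (which are not binaries) still propagate as exceptions.
safe_read(Filename) ->
    try
        {ok, glazer_csv:read_file(Filename)}
    catch
        error:Msg when is_binary(Msg) -> {error, Msg}
    end.
```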

Examples

%% File contains: name,age\nAlice,30\n
1> glazer_csv:read_file("data.csv").
#{headers => nil, data => [[<<"name">>,<<"age">>],[<<"Alice">>,<<"30">>]]}

2> glazer_csv:read_file("missing.csv").
** exception error: <<"missing.csv: no such file or directory">>

read_file(Filename, Opts)

-spec read_file(file:name_all(), decode_opts()) -> csv_result().

Read Filename and decode its contents as CSV, with decode options (see decode/2).

Examples

%% File contains: name,age\nAlice,30\nBob,25\n
1> glazer_csv:read_file("data.csv", [headers, {return, map}]).
#{headers => [<<"name">>,<<"age">>],
  data    => [#{<<"age">> => <<"30">>,  <<"name">> => <<"Alice">>},
              #{<<"age">> => <<"25">>,  <<"name">> => <<"Bob">>}]}

2> glazer_csv:read_file("data.csv", [headers, {fields, [binary, integer]}]).
#{headers => [<<"name">>,<<"age">>], data => [[<<"Alice">>,30],[<<"Bob">>,25]]}

stream_decoder()

-spec stream_decoder() -> stream_decoder().

Create a new incremental decoder for feeding CSV in chunks (e.g. from a socket or file), useful when the whole input isn't available up front.

Each complete row is decoded as soon as its terminating line break is seen, by calling decode/2 on that single row. Only row-boundary detection is incremental: a small byte scanner tracks across chunks whether the cursor is inside a quoted field, so that a \n or \r\n inside a quoted field does not end a row.

stream_decoder/0 uses the default decode options; use stream_decoder/1 to supply options, which are passed to every internal row decode. With the headers option, the first complete row is captured as the header and no row is emitted for it.
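
The idea of a resumable row-boundary scanner can be sketched in plain Erlang. This is not the library's implementation: the module name is hypothetical, and the {Offset, InQuotes} state only mirrors the shape of scan_state(). Each double quote toggles the flag, so an escaped "" pair toggles twice and leaves it unchanged.

```erlang
-module(scan_sketch).
-export([scan/2]).

%% scan(Chunk, {Offset, InQuotes}) -> {RowEnds, {NewOffset, NewInQuotes}}
%% RowEnds lists the absolute byte offsets just past each row-ending \n.
scan(Chunk, {Offset, InQuotes}) when is_binary(Chunk) ->
    scan(Chunk, Offset, InQuotes, []).

scan(<<>>, Offset, InQuotes, Acc) ->
    {lists:reverse(Acc), {Offset, InQuotes}};
scan(<<$", Rest/binary>>, Offset, InQuotes, Acc) ->
    %% Toggle quoted state on every double quote.
    scan(Rest, Offset + 1, not InQuotes, Acc);
scan(<<$\n, Rest/binary>>, Offset, false, Acc) ->
    %% A line feed outside quotes ends a row (this also covers \r\n).
    scan(Rest, Offset + 1, false, [Offset + 1 | Acc]);
scan(<<_, Rest/binary>>, Offset, InQuotes, Acc) ->
    %% Any other byte, including \n inside quotes, is row content.
    scan(Rest, Offset + 1, InQuotes, Acc).
```

Feeding the returned state into the next call lets a quoted field span chunk boundaries without ending the row early.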

Examples

1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2\n3,">>),
   Rows1.
[[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]

2> {Rows2, D2} = glazer_csv:stream_feed(D1, <<"4\n">>),
   Rows2.
[[<<"3">>,<<"4">>]]

3> glazer_csv:stream_eof(D2).
{ok, []}

stream_decoder(Opts)

-spec stream_decoder(decode_opts()) -> stream_decoder().

Create a new incremental CSV decoder, passing Opts through to every internal decode/2 call.

All options from decode/2 are accepted except {skip, ...} and {limit, ...}, which are ignored in streaming mode (the caller controls which rows to process by consuming the output of stream_feed/2).

When {headers, [Name, ...]} is given, the explicit header names are pre-populated and no header row is consumed from the stream.

Examples

%% Headers option: first row captured, data rows returned as field lists
1> D0 = glazer_csv:stream_decoder([headers]),
   {Rows, D1} = glazer_csv:stream_feed(D0, <<"name,age\nAlice,30\n">>),
   Rows.
[[<<"Alice">>,<<"30">>]]

%% Explicit headers + map output
%% (fresh variable names: D0 and Rows are already bound in this shell session)
2> D2 = glazer_csv:stream_decoder([{headers, [name, age]}, {return, map}]),
   {Rows2, _} = glazer_csv:stream_feed(D2, <<"Alice,30\n">>),
   Rows2.
[#{age => <<"30">>, name => <<"Alice">>}]

%% Semicolon delimiter
3> D3 = glazer_csv:stream_decoder([{delimiter, $;}]),
   {Rows3, _} = glazer_csv:stream_feed(D3, <<"a;b\n1;2\n">>),
   Rows3.
[[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]

stream_eof/1

-spec stream_eof(stream_decoder()) -> {ok, [[term()]] | [tuple()] | [map()]} | {error, term()}.

Signal end-of-stream: decode any remaining buffered bytes as a final row (useful when the input doesn't end with a trailing line break).

Returns {ok, Rows} with zero or one trailing row, or {error, Reason} if the remaining bytes don't form a valid row.

Examples

%% Input without a trailing newline
1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2">>),
   Rows1.
[[<<"a">>,<<"b">>]]

2> glazer_csv:stream_eof(D1).
{ok, [[<<"1">>,<<"2">>]]}

%% Input ending with a newline: nothing left at EOF
%% (fresh variable names: D0 and D1 are already bound in this shell session)
3> D2 = glazer_csv:stream_decoder(),
   {_Rows, D3} = glazer_csv:stream_feed(D2, <<"a,b\n">>),
   glazer_csv:stream_eof(D3).
{ok, []}

%% Unterminated quoted field surfaces here
4> D4 = glazer_csv:stream_decoder(),
   {[], D5} = glazer_csv:stream_feed(D4, <<"\"unterminated">>),
   glazer_csv:stream_eof(D5).
{error, unterminated_quoted_field}

stream_feed/2

-spec stream_feed(stream_decoder(), binary() | iolist()) ->
                     {[[term()]] | [tuple()] | [map()], stream_decoder()}.

Feed a chunk of bytes into the decoder, returning any complete CSV rows found so far (in order) along with the updated decoder.

Raises the same exceptions as decode/2 if a row that the scanner deemed complete fails to decode.

Examples

%% Rows split across two feed calls
1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,">>),
   Rows1.
[[<<"a">>,<<"b">>]]

2> {Rows2, D2} = glazer_csv:stream_feed(D1, <<"2\n">>),
   Rows2.
[[<<"1">>,<<"2">>]]

3> glazer_csv:stream_eof(D2).
{ok, []}

%% Typical socket-reading loop.
%% handle_rows/1 and handle_truncated_stream/1 are application-defined callbacks.
loop(Socket, D0) ->
  case gen_tcp:recv(Socket, 0) of
    {ok, Chunk} ->
      {Rows, D1} = glazer_csv:stream_feed(D0, Chunk),
      handle_rows(Rows),
      loop(Socket, D1);
    {error, closed} ->
      case glazer_csv:stream_eof(D0) of
        {ok, Trailing}  -> handle_rows(Trailing);
        {error, Reason} -> handle_truncated_stream(Reason)
      end
  end.
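
The same feed-then-flush pattern applies to any chunked source. The sketch below splits an in-memory binary into fixed-size chunks and folds them through a decoder using only stream_decoder/0, stream_feed/2 and stream_eof/1 as documented here; the module name and both function names are hypothetical.

```erlang
-module(chunk_feed_sketch).
-export([chunks/2, decode_in_chunks/2]).

%% Split a binary into pieces of at most Size bytes.
chunks(<<>>, _Size) ->
    [];
chunks(Bin, Size) when is_integer(Size), Size > 0 ->
    case Bin of
        <<Chunk:Size/binary, Rest/binary>> -> [Chunk | chunks(Rest, Size)];
        _ -> [Bin]
    end.

%% Feed each chunk to a stream decoder, then flush at end of input.
decode_in_chunks(Bin, Size) ->
    Feed = fun(Chunk, {Acc0, Dec0}) ->
                   {Rows, Dec1} = glazer_csv:stream_feed(Dec0, Chunk),
                   {[Rows | Acc0], Dec1}
           end,
    Init = {[], glazer_csv:stream_decoder()},
    {Acc, Dec} = lists:foldl(Feed, Init, chunks(Bin, Size)),
    case glazer_csv:stream_eof(Dec) of
        {ok, Trailing}  -> {ok, lists:append(lists:reverse([Trailing | Acc]))};
        {error, Reason} -> {error, Reason}
    end.
```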

try_decode(Input)

-spec try_decode(binary() | iolist()) -> {ok, csv_result()} | {error, decode_error()}.

Decode a CSV binary or iolist, returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Examples

1> glazer_csv:try_decode(<<"a,b\n1,2\n">>).
{ok, #{headers => nil, data => [[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]}}

2> glazer_csv:try_decode(<<"\"unterminated">>).
{error, unterminated_quoted_field}

try_decode(Input, Opts)

-spec try_decode(binary() | iolist(), decode_opts()) -> {ok, csv_result()} | {error, decode_error()}.

Decode a CSV binary or iolist with options (see decode_opts/0), returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Examples

1> glazer_csv:try_decode(<<"name,age\nAlice,30\n">>, [headers]).
{ok, #{headers => [<<"name">>,<<"age">>], data => [[<<"Alice">>,<<"30">>]]}}

2> glazer_csv:try_decode(<<"x">>,
                         [{fields, [#{type => integer, on_failure => raise}]}]).
{error, {invalid_field_value, 1, 1}}

write_file(Filename, Data)

-spec write_file(file:name_all(), [[term()]] | [map()]) -> ok.

Encode Data to CSV and write it to Filename, overwriting any existing file.

Raises a binary "Filename: Reason" message (see file:format_error/1) if the file can't be written.

Examples

1> glazer_csv:write_file("out.csv", [[<<"name">>,<<"age">>],[<<"Alice">>,30]]).
ok

2> glazer_csv:write_file("/read-only/out.csv", []).
** exception error: <<"/read-only/out.csv: permission denied">>

write_file(Filename, Data, Opts)

-spec write_file(file:name_all(), [[term()]] | [map()], encode_opts()) -> ok.

Encode Data to CSV with encode options (see encode/2) and write it to Filename, overwriting any existing file.

Examples

%% Write maps as CSV with a header row and LF line endings
1> glazer_csv:write_file("out.csv",
                         [#{<<"name">> => <<"Alice">>, <<"score">> => 99}],
                         [headers, {line_ending, lf}]).
ok

%% Write with a semicolon delimiter
2> glazer_csv:write_file("out.csv",
                         [[<<"a">>, <<"b">>], [1, 2]],
                         [{delimiter, $;}]).
ok