glazer_csv (glazer v0.5.13)

View Source

Fast CSV encoding and decoding using the glaze C++ library.

By default nulls (e.g. produced by on_failure => null) are represented as the atom null. To change it application-wide, set the null env key in your config:

{glazer, [{null, nil}]}.

Features

  • RFC 4180 CSV encoding/decoding via decode/1,2 and encode/1,2, with optional header-row support
  • Per-column field type conversion ({fields, Specs}), including integers, floats, booleans, datetimes, atoms, and strings (binaries)
  • Incremental/streaming CSV decoding via stream_decoder/0,1, stream_feed/2, stream_eof/1
  • Configurable representation of CSV null values
  • read_file/1,2 and write_file/2,3 helpers for decoding/encoding directly to/from a file

See also [https://github.com/stephenberry/glaze]

Summary

Types

The result of a successful CSV decode: a map with two keys.

Error reasons returned by try_decode/1,2 or raised by decode/1,2

A single CSV decode option. See decode_opts/0 for the full reference table of all available options and their effects.

CSV decode options

A single CSV encode option. See encode_opts/0 for descriptions of all available options.

CSV encode options

Controls what happens when a non-empty field fails to convert to the requested field_type() (default binary)

A single element of the {fields, Specs} CSV decode option: either a field_type() directly, or a map for more control

A single column's target type for the {fields, Specs} CSV decode option

How the header row should be represented when using {headers, Type}

Resumable state of the incremental row-boundary scanner used inside a stream_decoder/0. Carries the current byte offset and a flag indicating whether the scanner is currently inside a quoted field. Exposed so the state can be serialised or inspected; normal usage does not require direct access to this type.

Opaque handle for incremental CSV decoding. Created by stream_decoder/0,1 and threaded through successive stream_feed/2 calls; call stream_eof/1 to flush any remaining buffered bytes at the end of the input.

Functions

Decode a CSV binary or iolist.

Decode a CSV binary or iolist with options (see decode_opts/0). Returns a csv_result/0. Raises Reason :: t:decode_error/0 on invalid input.

Encode a list of rows to a CSV binary.

Encode a list of rows to a CSV binary, with options.

Read Filename and decode its contents as CSV.

Read Filename and decode its contents as CSV, with decode options (see decode/2).

Create a new incremental decoder for feeding CSV in chunks (e.g. from a socket or file), useful when the whole input isn't available up front.

Create a new incremental CSV decoder, passing Opts through to every internal decode/2 call.

Signal end-of-stream: decode any remaining buffered bytes as a final row (useful when the input doesn't end with a trailing line break).

Feed a chunk of bytes into the decoder, returning any complete CSV rows found so far (in order) along with the updated decoder.

Decode a CSV binary or iolist, returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Decode a CSV binary or iolist with options (see decode_opts/0), returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Encode Data to CSV and write it to Filename, overwriting any existing file.

Encode Data to CSV with encode options (see encode/2) and write it to Filename, overwriting any existing file.

Types

csv_result()

-type csv_result() :: #{headers := nil | [binary() | atom()], data := [[term()]] | [tuple()] | [map()]}.

The result of a successful CSV decode: a map with two keys.

  • headers - nil when the headers option was not given; otherwise a list of column names (binaries by default, atoms with {headers, atom} or {headers, existing_atom})
  • data - list of data rows; each row is a list of field values by default, a tuple of field values with {return, tuple}, or a map keyed by the column names when both headers and {return, map} are given

decode_error()

-type decode_error() ::
          unterminated_quoted_field | duplicate_header |
          {invalid_field_value, Row :: pos_integer(), Column :: pos_integer()}.

Error reasons returned by try_decode/1,2 or raised by decode/1,2:

  • unterminated_quoted_field — input ended inside a "..." field with no closing quote
  • duplicate_header — two columns share the same name and {return, map} was requested (map keys must be unique)
  • {invalid_field_value, Row, Column} — a field at the given 1-based row/column position failed to convert to the type requested by {fields, Specs} with on_failure => raise

decode_opt()

-type decode_opt() ::
          {delimiter, char()} |
          headers |
          {headers, [atom() | binary()] | headers_type()} |
          {fields, [field_spec()]} |
          {null_term, atom()} |
          {return, list | map | tuple} |
          {skip, non_neg_integer() | {pos_integer(), pos_integer()}} |
          {limit, pos_integer()} |
          copy_strings.

A single CSV decode option. See decode_opts/0 for the full reference table of all available options and their effects.

decode_opts()

-type decode_opts() :: [decode_opt()].

CSV decode options:

OptionDescription
{delimiter, Char}Field delimiter character (default $,)
headersTreat the first row as column names (shorthand for {headers, binary})
{headers, [Name, ...]}Use the given list of atoms or binaries as column names; the first data row is not consumed as a header
{headers, binary}First row → binary column names (same as bare headers)
{headers, string}Alias for {headers, binary}
{headers, atom}First row → atom column names (via binary_to_atom/2-equivalent)
{headers, existing_atom}First row → existing-atom column names (fall back to binary for unknown atoms)
{headers, charlist}First row → column names as lists of Unicode codepoints
{return, list}Data rows are lists of field values (default)
{return, tuple}Data rows are tuples of field values
{return, map}Data rows are maps keyed by column names; requires headers or {headers, ...}. Raises duplicate_header on duplicate column names
{fields, Specs}Per-column type conversion, applied positionally; see field_spec/0
{skip, N}Skip the first N data rows (after any header row)
{skip, {From, To}}Process only data rows From..To (1-based inclusive); equivalent to {skip, From-1} plus {limit, To-From+1}
{limit, N}Process at most N data rows (after skipping)
{null_term, Atom}Atom to use for on_failure => null; overrides the library-wide null env var
copy_stringsAlways allocate a fresh binary for each decoded field, rather than returning a sub-binary that references the original input. By default (without this option) fields are zero-copy sub-binaries of the input, which is faster but keeps the entire input binary alive in memory as long as any decoded field referencing it is reachable. Use copy_strings when decoded fields are long-lived and the input is large, to allow the GC to reclaim the input buffer independently.

encode_opt()

-type encode_opt() ::
          {delimiter, char()} | headers | {headers, [atom() | binary()]} | {line_ending, lf | crlf}.

A single CSV encode option. See encode_opts/0 for descriptions of all available options.

encode_opts()

-type encode_opts() :: [encode_opt()].

CSV encode options:

  • {delimiter, Char} - field delimiter (default $,)
  • headers - input is a list of maps; the first map's keys become the header row, and subsequent maps are encoded as rows in that column order (missing keys produce empty fields)
  • {headers, [Name, ...]} - input is a list of maps; uses the given list of atoms or binaries (matching the maps' key type) as the column order and header row, instead of deriving it from the first map's keys (missing keys produce empty fields)
  • {line_ending, lf | crlf} - line terminator (default crlf, per RFC 4180)

field_on_failure()

-type field_on_failure() :: binary | raise | default | null.

Controls what happens when a non-empty field fails to convert to the requested field_type() (default binary):

  • binary - leave the field as the original binary (default)
  • raise - raise (or return {error, Reason} from try_decode/2) {invalid_field_value, Row, Column} (1-based)
  • default - use the spec's default value (falls back to binary if no default is given)
  • null - use the configured null term: {null_term, Atom} if given, otherwise the library-wide null term (see the null application env var, Null term configuration)

field_spec()

-type field_spec() ::
          field_type() | #{type := field_type(), default => term(), on_failure => field_on_failure()}.

A single element of the {fields, Specs} CSV decode option: either a field_type() directly, or a map for more control:

  • type - the field_type() to convert the field to
  • default - used in place of the converted value whenever the raw CSV field is empty
  • on_failure - see field_on_failure/0 (default binary)

field_type()

-type field_type() ::
          integer |
          {float, non_neg_integer()} |
          boolean |
          {datetime, binary()} |
          binary | charlist | existing_atom |
          {atom, ExistingAtoms :: [atom()]}.

A single column's target type for the {fields, Specs} CSV decode option:

  • integer - parse as an integer
  • {float, Precision} - parse as a float, rounded to Precision decimal digits
  • boolean - parse "true"/"false" (any case) as true/ false
  • {datetime, InputFormat} - parse using a strptime-like format string (%Y %m %d %H %M %S %f %z and literals; %z accepts Z, +HHMM, or +HH:MM), converting the result to Unix epoch seconds (UTC)
  • binary - leave as a binary (default)
  • charlist - convert to a list of Unicode code points
  • existing_atom - convert to an existing atom, falling back to a binary if no such atom exists
  • {atom, ExistingAtoms} - convert to an atom only if the field's text matches (and exists as) one of ExistingAtoms, falling back to a binary otherwise

headers_type()

-type headers_type() :: atom | existing_atom | binary | string | charlist.

How the header row should be represented when using {headers, Type}:

  • atom - column names are converted to atoms (via binary_to_atom/2-equivalent)
  • existing_atom - column names are converted to existing atoms (binaries if not found)
  • binary - column names are kept as binaries (default)
  • string - alias for binary
  • charlist - column names are converted to lists of Unicode codepoints

scan_state()

-type scan_state() :: {non_neg_integer(), boolean()}.

Resumable state of the incremental row-boundary scanner used inside a stream_decoder/0. Carries the current byte offset and a flag indicating whether the scanner is currently inside a quoted field. Exposed so the state can be serialised or inspected; normal usage does not require direct access to this type.

stream_decoder()

-opaque stream_decoder()

Opaque handle for incremental CSV decoding. Created by stream_decoder/0,1 and threaded through successive stream_feed/2 calls; call stream_eof/1 to flush any remaining buffered bytes at the end of the input.

Functions

decode(Input)

-spec decode(binary() | iolist()) -> csv_result().

Decode a CSV binary or iolist.

Returns a csv_result/0 map #{headers => nil, data => Rows} where Rows is a list of rows, each row a list of binary fields. With the headers option the first row is captured as column names in headers instead of appearing in data. Raises Reason :: t:decode_error/0 on invalid input.

Examples

1> glazer_csv:decode(<<"a,b\n1,2\n3,4\n">>).
#{headers => nil, data => [[<<"a">>,<<"b">>],[<<"1">>,<<"2">>],[<<"3">>,<<"4">>]]}

2> glazer_csv:decode(<<>>).
#{headers => nil, data => []}

3> glazer_csv:decode(<<"\"hello, world\",42\n">>).
#{headers => nil, data => [[<<"hello, world">>,<<"42">>]]}

decode(Input, Opts)

-spec decode(binary() | iolist(), decode_opts()) -> csv_result().

Decode a CSV binary or iolist with options (see decode_opts/0). Returns a csv_result/0. Raises Reason :: t:decode_error/0 on invalid input.

Examples

%% First row as binary column names
1> glazer_csv:decode(<<"name,age\nAlice,30\nBob,25\n">>, [headers]).
#{headers => [<<"name">>,<<"age">>],
  data    => [[<<"Alice">>,<<"30">>],[<<"Bob">>,<<"25">>]]}

%% Explicit column names — no header row expected in the data
2> glazer_csv:decode(<<"Alice,30\n">>, [{headers, [name, age]}, {return, map}]).
#{headers => [name,age], data => [#{age => <<"30">>, name => <<"Alice">>}]}

%% Per-column type conversion
3> glazer_csv:decode(<<"Alice,30\n">>, [{fields, [binary, integer]}]).
#{headers => nil, data => [[<<"Alice">>,30]]}

%% Semi-colon delimiter, skip first 2 rows, limit to 3
4> glazer_csv:decode(<<"h1;h2\nr1a;r1b\nr2a;r2b\nr3a;r3b\nr4a;r4b\n">>,
                     [{delimiter, $;}, headers, {skip, 1}, {limit, 2}]).
#{headers => [<<"h1">>,<<"h2">>],
  data    => [[<<"r2a">>,<<"r2b">>],[<<"r3a">>,<<"r3b">>]]}

%% Rows as maps with atom keys
5> glazer_csv:decode(<<"a,b\n1,2\n">>, [{headers, existing_atom}, {return, map}]).
#{headers => [a,b], data => [#{a => <<"1">>, b => <<"2">>}]}

%% Rows as tuples
6> glazer_csv:decode(<<"a,b\n1,2\n">>, [{return, tuple}]).
#{headers => nil, data => [{<<"a">>,<<"b">>},{<<"1">>,<<"2">>}]}

encode(Data)

-spec encode([[term()]] | [map()]) -> binary().

Encode a list of rows to a CSV binary.

Each row is a list of fields (binaries, atoms, integers, or floats). Fields containing the delimiter, a double quote, or a line break are quoted per RFC 4180, with embedded quotes doubled.

Raises {encode_error, {Msg, Term}} if any row or field cannot be encoded (e.g. an improper list, a map field, or a tuple field).

Examples

1> glazer_csv:encode([[<<"a">>, <<"b">>], [1, 2]]).
<<"a,b\r\n1,2\r\n">>

2> glazer_csv:encode([[<<"hello, world">>, <<"say \"hi\"">>]]).
<<"\"hello, world\",\"say \"\"hi\"\"\"\r\n">>

3> glazer_csv:encode([]).
<<>>

4> glazer_csv:encode([[<<"a">>|<<"b">>]]).
** exception error: {encode_error,{<<"cannot encode improper list as CSV row">>,<<"b">>}}

encode(Data, Opts)

-spec encode([[term()]] | [map()], encode_opts()) -> binary().

Encode a list of rows to a CSV binary, with options.

With the headers option, Data is a list of maps: the first map's keys become the header row (in iteration order), and each map is encoded as a row in that column order.

Raises {encode_error, {Msg, Term}} if any row or field cannot be encoded.

Examples

%% Maps to CSV with a header row
1> glazer_csv:encode([#{<<"name">> => <<"Alice">>, <<"age">> => 30}], [headers]).
<<"age,name\r\n30,Alice\r\n">>

%% Maps to CSV with an explicit column order
2> glazer_csv:encode([#{<<"name">> => <<"Alice">>, <<"age">> => 30}],
                     [{headers, [<<"name">>, <<"age">>]}]).
<<"name,age\r\nAlice,30\r\n">>

%% Semicolon delimiter with LF line endings
3> glazer_csv:encode([[<<"a">>, <<"b">>], [1, 2]],
                     [{delimiter, $;}, {line_ending, lf}]).
<<"a;b\n1;2\n">>

read_file(Filename)

-spec read_file(file:name_all()) -> csv_result().

Read Filename and decode its contents as CSV.

Raises Reason::decode_error() if the file's contents aren't valid CSV, or a binary "Filename: Reason" message (see file:format_error/1) if the file can't be read.

Examples

%% File contains: name,age\nAlice,30\n
1> glazer_csv:read_file("data.csv").
#{headers => nil, data => [[<<"name">>,<<"age">>],[<<"Alice">>,<<"30">>]]}

2> glazer_csv:read_file("missing.csv").
** exception error: <<"missing.csv: no such file or directory">>

read_file(Filename, Opts)

-spec read_file(file:name_all(), decode_opts()) -> csv_result().

Read Filename and decode its contents as CSV, with decode options (see decode/2).

Examples

%% File contains: name,age\nAlice,30\nBob,25\n
1> glazer_csv:read_file("data.csv", [headers, {return, map}]).
#{headers => [<<"name">>,<<"age">>],
  data    => [#{<<"age">> => <<"30">>,  <<"name">> => <<"Alice">>},
              #{<<"age">> => <<"25">>,  <<"name">> => <<"Bob">>}]}

2> glazer_csv:read_file("data.csv", [headers, {fields, [binary, integer]}]).
#{headers => [<<"name">>,<<"age">>], data => [[<<"Alice">>,30],[<<"Bob">>,25]]}

stream_decoder()

-spec stream_decoder() -> stream_decoder().

Create a new incremental decoder for feeding CSV in chunks (e.g. from a socket or file), useful when the whole input isn't available up front.

Each complete row is decoded as soon as its terminating line break is seen, via decode/2 on that single row. Only the row boundary detection is incremental — a small byte-scanner tracks whether the cursor is inside a quoted field across chunks, so that \n/\r\n inside quoted fields doesn't end a row.

With the headers option, the first complete row is captured as the header; no row is emitted for it. Passes the same options as decode/2 to every row decode internally (see stream_decoder/1 to supply options).

Examples

1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2\n3,">>),
   Rows1.
[[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]

2> {Rows2, D2} = glazer_csv:stream_feed(D1, <<"4\n">>),
   Rows2.
[[<<"3">>,<<"4">>]]

3> glazer_csv:stream_eof(D2).
{ok, []}

stream_decoder(Opts)

-spec stream_decoder(decode_opts()) -> stream_decoder().

Create a new incremental CSV decoder, passing Opts through to every internal decode/2 call.

All options from decode/2 are accepted except {skip, ...} and {limit, ...}, which are ignored in streaming mode (the caller controls which rows to process by consuming the output of stream_feed/2).

When {headers, [List]} is given, the explicit header names are pre-populated and no header row is consumed from the stream.

Examples

%% Headers option: first row captured, data rows returned as field lists
1> D0 = glazer_csv:stream_decoder([headers]),
   {Rows, D1} = glazer_csv:stream_feed(D0, <<"name,age\nAlice,30\n">>),
   Rows.
[[<<"Alice">>,<<"30">>]]

%% Explicit headers + map output
2> D0 = glazer_csv:stream_decoder([{headers, [name, age]}, {return, map}]),
   {Rows, _D1} = glazer_csv:stream_feed(D0, <<"Alice,30\n">>),
   Rows.
[#{age => <<"30">>, name => <<"Alice">>}]

%% Semicolon delimiter
3> D0 = glazer_csv:stream_decoder([{delimiter, $;}]),
   {Rows, _D1} = glazer_csv:stream_feed(D0, <<"a;b\n1;2\n">>),
   Rows.
[[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]

stream_eof/1

-spec stream_eof(stream_decoder()) -> {ok, [[term()]] | [tuple()] | [map()]} | {error, term()}.

Signal end-of-stream: decode any remaining buffered bytes as a final row (useful when the input doesn't end with a trailing line break).

Returns {ok, Rows} with zero or one trailing row, or {error, Reason} if the remaining bytes don't form a valid row.

Examples

%% Input without a trailing newline
1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2">>),
   Rows1.
[[<<"a">>,<<"b">>]]

2> glazer_csv:stream_eof(D1).
{ok, [[<<"1">>,<<"2">>]]}

%% Input ending with a newline — nothing left at EOF
3> D0 = glazer_csv:stream_decoder(),
   {_Rows, D1} = glazer_csv:stream_feed(D0, <<"a,b\n">>),
   glazer_csv:stream_eof(D1).
{ok, []}

%% Unterminated quoted field surfaces here
4> D0 = glazer_csv:stream_decoder(),
   {[], D1} = glazer_csv:stream_feed(D0, <<"\"unterminated">>),
   glazer_csv:stream_eof(D1).
{error, unterminated_quoted_field}

stream_feed/2

-spec stream_feed(stream_decoder(), binary() | iolist()) ->
                     {[[term()]] | [tuple()] | [map()], stream_decoder()}.

Feed a chunk of bytes into the decoder, returning any complete CSV rows found so far (in order) along with the updated decoder.

Raises the same exceptions as decode/2 if a row that the scanner deemed complete fails to decode.

Examples

%% Rows split across two feed calls
1> D0 = glazer_csv:stream_decoder(),
   {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,">>),
   Rows1.
[[<<"a">>,<<"b">>]]

2> {Rows2, D2} = glazer_csv:stream_feed(D1, <<"2\n">>),
   Rows2.
[[<<"1">>,<<"2">>]]

3> glazer_csv:stream_eof(D2).
{ok, []}

%% Typical socket-reading loop
loop(Socket, D0) ->
  case gen_tcp:recv(Socket, 0) of
    {ok, Chunk} ->
      {Rows, D1} = glazer_csv:stream_feed(D0, Chunk),
      handle_rows(Rows),
      loop(Socket, D1);
    {error, closed} ->
      case glazer_csv:stream_eof(D0) of
        {ok, Trailing}  -> handle_rows(Trailing);
        {error, Reason} -> handle_truncated_stream(Reason)
      end
  end.

try_decode(Input)

-spec try_decode(binary() | iolist()) -> {ok, csv_result()} | {error, decode_error()}.

Decode a CSV binary or iolist, returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Examples

1> glazer_csv:try_decode(<<"a,b\n1,2\n">>).
{ok, #{headers => nil, data => [[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]}}

2> glazer_csv:try_decode(<<"\"unterminated">>).
{error, unterminated_quoted_field}

try_decode(Input, Opts)

-spec try_decode(binary() | iolist(), decode_opts()) -> {ok, csv_result()} | {error, decode_error()}.

Decode a CSV binary or iolist with options (see decode_opts/0), returning {ok, Result} or {error, Reason} instead of raising. Result is a csv_result/0; Reason is a decode_error/0.

Examples

1> glazer_csv:try_decode(<<"name,age\nAlice,30\n">>, [headers]).
{ok, #{headers => [<<"name">>,<<"age">>], data => [[<<"Alice">>,<<"30">>]]}}

2> glazer_csv:try_decode(<<"x">>,
                         [{fields, [#{type => integer, on_failure => raise}]}]).
{error, {invalid_field_value, 1, 1}}

write_file(Filename, Data)

-spec write_file(file:name_all(), [[term()]] | [map()]) -> ok.

Encode Data to CSV and write it to Filename, overwriting any existing file.

Raises a binary "Filename: Reason" message (see file:format_error/1) if the file can't be written.

Examples

1> glazer_csv:write_file("out.csv", [[<<"name">>,<<"age">>],[<<"Alice">>,30]]).
ok

2> glazer_csv:write_file("/read-only/out.csv", []).
** exception error: <<"/read-only/out.csv: permission denied">>

write_file(Filename, Data, Opts)

-spec write_file(file:name_all(), [[term()]] | [map()], encode_opts()) -> ok.

Encode Data to CSV with encode options (see encode/2) and write it to Filename, overwriting any existing file.

Examples

%% Write maps as CSV with a header row and LF line endings
1> glazer_csv:write_file("out.csv",
                         [#{<<"name">> => <<"Alice">>, <<"score">> => 99}],
                         [headers, {line_ending, lf}]).
ok

%% Write with a semicolon delimiter
2> glazer_csv:write_file("out.csv",
                         [[<<"a">>, <<"b">>], [1, 2]],
                         [{delimiter, $;}]).
ok