csv (util v1.3.5)

View Source

CSV file parsing functions

Author: Serge Aleynikov saleyn@gmail.com Copyright: 2021 Serge Aleynikov

Summary

Types

Options for loading data to a database.

CSV Parsing Options.

Functions

Guess the type of data by its value

Guess data types of fields in the given CSV list of rows obtained by parsing a CSV file with parse(File,[fix_lengths]). The function returns a list of tuples {Type, MaxFieldLen, NumOfNulls}, where the Type is a field type, MaxFieldLen is the max length of data in this column, and NumOfNulls is the number of rows with empty values in this column.

Load CSV data from a File to a MySQL database. Tab is the name of a table where to load data. MySqlPid is the pid of a MySQL database connection returned by mysql:start_link/1. The data in the table is replaced according to {import_type, Type}

Get max field lengths for a list obtained by parsing a CSV file with parse_csv_file(File,[fix_lengths]).

Parse a CSV file using default options.

Parse a given CSV file.

Parse a CSV line

Types

load_options()

-type load_options() ::
          [{load_type, recreate | replace | ignore_dups | update_dups} |
           {col_types, #{binary() => ColType :: atom() | {ColType :: atom(), ColLen :: integer()}}} |
           {batch_size, integer()} |
           {blob_size, integer()} |
           {create_table, boolean()} |
           {save_create_sql_to_file, string()} |
           {transforms, #{binary() => fun((term()) -> term())}} |
           {guess_types, boolean()} |
           {guess_limit_rows, integer()} |
           {max_nulls_pcnt, float()} |
           {primary_key, PKColumns :: binary() | [binary() | list()]} |
           {indexes,
            [{IdxName :: string() | binary(), [Col :: string() | binary()]} |
             {IdxName :: string() | binary(), [Col :: string() | binary()], unique | undefined}]} |
           {drop_temp_table, boolean()} |
           {encoding, string() | atom()} |
           {verbose, boolean() | integer()}].

Options for loading data to a database.

  • {load_type, Type} — Type of loading to perform. recreate will replace the table by atomically dropping the old one, creating/loading the new one, and replacing the table. replace will do an insert by using REPLACE INTO statement. ignore_dups will use INSERT IGNORE INTO statement to ignore records with duplicate keys. update_dups will do an INSERT INTO and ON DUPLICATE KEY UPDATE, so that the old records are updated and the new ones are inserted.
  • {create_table, Allow} — Allow to create a table if it doesn't exist
  • {col_types, Map} — Types of data for all or some columens. The Map is in the format: ColName::binary() => ColInfo, where ColInfo is ColType | {ColType, ColLen::integer()}, and ColType is: date | datetime | integer | float | blob | number.
  • {transforms, Map} — Value transformation function for columns. The Map is in the format: ColName::binary() => fun((Value::term()) -> term()).
  • {batch_size, Size} — Number of records per SQL insert/update/replace call
  • {blob_size, Size} — Threshold in number of bytes at which a VARCHAR field is defined as BLOB
  • {save_create_sql_to_file, Filename::string()} — Save CREATE TABLE sql statement to a file
  • guess_types — When specified, the function will try to guess the type of data in columns instead of treating all data as string fields. The possible data typed guessed: integer, float, date, datetime, number, string
  • {guess_limit_rows, Limit} — Limit the number of rows for guessing the column data types
  • {max_nulls_pcnt, Percent} — A percentage threshold of permissible NULLs in a column (0-100), above which the column data type is forced to be treated as string
  • {primary_key, Fields} — Names of primary key fields in the created table
  • {indexes, Indexes} — Indexes to create
  • {drop_temp_table, boolean()} — When true (default), temp table is dropped.
  • {encoding, Encoding} — The name of the encoding to use for storing data. For the list of permissible values see this link
  • verbose — Print additional details to stdout

parse_options()

-type parse_options() ::
          [fix_lengths | binary | list |
           {open, list()} |
           {columns, [binary() | string()]} |
           {converters,
            [{binary() | string() | all,
              fun((binary(), binary()) -> binary()) | {rex, binary(), binary()}}]}].

CSV Parsing Options.

  • fix_lengths — if a record has a column count greater than what's found in the header row, those extra columns will be dropped, and if a row has fewer columns, empty columns will be added.
  • binary — Return fields as binaries (default)
  • list — Return fields as lists
  • {open, list()} — Options given to file:open/2
  • {columns, Names} — Return data only in given columns
  • {converters, [{Col, fun((ColName, Value) -> NewValue)| {rex, RegEx, Replace}]} — Data format converter. If Col is all, the same converting function is used for all columns. If the converter is a {rex, RegEx, Replace} then the regular expression replacement will be run on a value in the requested column.

Functions

guess_data_type(S)

-spec guess_data_type(binary()) -> {null | date | datetime | integer | float | string, term(), string()}.

Guess the type of data by its value

guess_data_types(HasHeaderRow, Rows)

-spec guess_data_types(HasHeaderRow :: boolean(), Rows :: [Fields :: [binary()]]) ->
                          {Type :: string | integer | number | float | date | datetime,
                           MaxFieldLen :: integer(),
                           NumOfNulls :: integer()}.

Guess data types of fields in the given CSV list of rows obtained by parsing a CSV file with parse(File,[fix_lengths]). The function returns a list of tuples {Type, MaxFieldLen, NumOfNulls}, where the Type is a field type, MaxFieldLen is the max length of data in this column, and NumOfNulls is the number of rows with empty values in this column.

load_to_mysql(File, Tab, MySqlPid, Opts)

-spec load_to_mysql(File :: string(), Tab :: string(), MySqlPid :: pid(), Opts :: load_options()) ->
                       {Columns :: list(), AffectedCount :: integer(), RecCount :: integer()}.

Load CSV data from a File to a MySQL database. Tab is the name of a table where to load data. MySqlPid is the pid of a MySQL database connection returned by mysql:start_link/1. The data in the table is replaced according to {import_type, Type}:

  • recreate - The table is entirely replaced with the data from file.
  • replace - Use "REPLACE INTO" instead of "INSERT INTO" existing table
  • ignore_dups - The insert in the existing table is performed and the
  • update_dups - The insert in the existing table is performed and the
  • upsert - The insert/update in the existing table is performed

NOTE: this function requires [https://github.com/mysql-otp/mysql-otp.git]

max_field_lengths(HasHeaderRow, Rows)

-spec max_field_lengths(HasHeaderRow :: boolean(), Rows :: [Fields :: list()]) -> [Len :: integer()].

Get max field lengths for a list obtained by parsing a CSV file with parse_csv_file(File,[fix_lengths]).

parse(File)

-spec parse(string()) -> [[binary()]].

Parse a CSV file using default options.

parse/2

-spec parse(binary() | string(), parse_options()) -> [[string()]].

Parse a given CSV file.

parse_line(Line)

-spec parse_line(binary()) -> list().

Parse a CSV line