View Source Explorer.DataFrame (Explorer v0.2.0)
The DataFrame struct and API.
Dataframes are two-dimensional tabular data structures similar to a spreadsheet. For example, the Iris dataset:
iex> Explorer.Datasets.iris()
#Explorer.DataFrame<
Polars[150 x 5]
sepal_length float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
This dataframe has 150 rows and five columns. Each column is an Explorer.Series
of the same size (150):
iex> df = Explorer.Datasets.iris()
iex> df["sepal_length"]
#Explorer.Series<
float[150]
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, ...]
>
creating-dataframes
Creating dataframes
Dataframes can be created from normal Elixir objects. The main ways you might do this are
from_columns/1
and from_rows/1
. For example:
iex> Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a string ["a", "b"]
b integer [1, 2]
>
verbs
Verbs
Explorer uses the idea of a consistent set of SQL-like verbs
like dplyr
which can help solve common data manipulation challenges. These are split into single table verbs and multiple table verbs.
single-table-verbs
Single table verbs
Single table verbs are (unsurprisingly) used for manipulating a single dataframe. These are:
select/3
for picking variablesfilter/2
for picking rows based on predicatesmutate/2
for adding or replacing columns that are functions of existing columnsarrange/2
for changing the ordering of rowsdistinct/2
for picking unique rowssummarise/2
for reducing multiple rows down to a single summarypivot_longer/3
andpivot_wider/4
for massaging dataframes into longer or wider forms, respectively
Each of these combine with Explorer.DataFrame.group_by/2
for operating by group.
multiple-table-verbs
Multiple table verbs
Multiple table verbs are used for combining tables. These are:
join/3
for performing SQL-like joinsconcat_rows/1
for vertically "stacking" dataframes
io
IO
Explorer supports reading and writing of:
- delimited files (such as CSV)
- Parquet
- Arrow IPC
- Newline Delimited JSON
The convention Explorer uses is to have from_*
and to_*
functions to read and write
to files in the formats above. load_*
and dump_*
versions are also available to read
and write those formats directly in memory.
access
Access
In addition to this "grammar" of data manipulation, you'll find useful functions for
slicing and dicing dataframes such as pull/2
, head/2
, sample/3
, slice/3
, and
take/2
.
Explorer.DataFrame
also implements the Access
behaviour (also known as the brackets
syntax). This should be familiar for users coming from other language with dataframes
such as R or Python. For example:
iex> df = Explorer.Datasets.wine()
iex> df["class"]
#Explorer.Series<
integer[178]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
Link to this section Summary
Functions: Single-table
Arranges/sorts rows by columns.
This collects the lazy data frame into an eager one, computing the query.
Takes distinct rows by a selection of columns.
Drop nil values.
Turns a set of columns to dummy variables.
Subset rows using column values.
Group the dataframe by one or more variables.
Returns the first n rows of the dataframe.
Creates and modifies columns.
Creates a new dataframe.
Pivot data from wide to long.
Pivot data from long to wide.
Extracts a single column as a series.
Renames columns.
Renames columns with a function.
Sample rows from a dataframe.
Selects a subset of columns by name.
Subset a continuous set of rows.
Summarise each group to a single row.
Display the DataFrame in a tabular fashion.
Returns the last n rows of the dataframe.
Subset rows with a list of indices.
Converts a dataframe to a list of columns with lists as values.
Converts the dataframe to the lazy version of the current backend.
Converts a dataframe to a list of maps (rows).
Converts a dataframe to a list of columns with series as values.
Removes grouping variables.
Functions: Multi-table
Combine two or more dataframes row-wise (stack).
Combine two dataframes row-wise.
Join two tables.
Functions: Introspection
Gets the dtypes of the dataframe columns.
Returns the groups of a dataframe.
Returns the number of columns in the dataframe.
Returns the number of rows in the dataframe.
Gets the names of the dataframe columns.
Gets the shape of the dataframe as a {height, width}
tuple.
Functions: IO
Writes a dataframe to a binary representation of a delimited file.
Similar to from_csv/2
but raises if there is a problem reading the CSV.
Reads a delimited file into a dataframe.
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC file into a dataframe.
Read a file of JSON objects or lists separated by new lines
Reads a parquet file into a dataframe.
Similar to to_csv/3
but raises if there is a problem reading the CSV.
Writes a dataframe to a delimited file.
Writes a dataframe to a IPC file.
Writes a dataframe to a ndjson file.
Writes a dataframe to a parquet file.
Link to this section Types
@type column() :: column_name() | non_neg_integer()
@type column_names() :: [column_name()]
@type data() :: Explorer.Backend.DataFrame.t()
@type t() :: %Explorer.DataFrame{ data: data(), dtypes: %{ required(Explorer.Backend.DataFrame.column_name()) => Explorer.Backend.Series.dtype() }, groups: [String.t()], names: [Explorer.Backend.DataFrame.column_name()] }
Link to this section Functions: Single-table
Arranges/sorts rows by columns.
examples
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange(df, "a")
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange(df, desc: "a")
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b integer [2, 1, 3]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.arrange(df, asc: "total", desc: "country")
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2011, 2011, 2012, ...]
country string ["NIUE", "TUVALU", "TUVALU", "NIUE", "NIUE", ...]
total integer [1, 2, 2, 2, 2, ...]
solid_fuel integer [0, 0, 0, 0, 0, ...]
liquid_fuel integer [1, 2, 2, 2, 2, ...]
gas_fuel integer [0, 0, 0, 0, 0, ...]
cement integer [0, 0, 0, 0, 0, ...]
gas_flaring integer [0, 0, 0, 0, 0, ...]
per_capita float [0.52, 0.0, 0.0, 1.04, 1.04, ...]
bunker_fuels integer [0, 0, 0, 0, 0, ...]
>
This collects the lazy data frame into an eager one, computing the query.
If already eager, this is a noop.
Takes distinct rows by a selection of columns.
examples
Examples
By default will return unique values of the requested columns:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, columns: ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
If keep_all?
is set to true
, then the first value of each column not in the requested
columns will be returned:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, columns: ["year", "country"], keep_all?: true)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
A callback on the dataframe's names can be passed instead of a list (like select/3
):
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> Explorer.DataFrame.distinct(df, columns: &String.starts_with?(&1, "x"))
#Explorer.DataFrame<
Polars[2 x 2]
x1 integer [1, 3]
x2 string ["a", "c"]
>
If the dataframe has groups, then the columns of each group will be added to the distinct columns:
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> df = Explorer.DataFrame.group_by(df, "x1")
iex> Explorer.DataFrame.distinct(df, columns: ["x2"])
#Explorer.DataFrame<
Polars[2 x 2]
Groups: ["x1"]
x1 integer [1, 3]
x2 string ["a", "c"]
>
Drop nil values.
Optionally accepts a subset of columns.
examples
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df)
#Explorer.DataFrame<
Polars[1 x 2]
a integer [1]
b integer [1]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, [:a, :c])
#Explorer.DataFrame<
Polars[1 x 3]
a integer [2]
b integer [nil]
c integer [5]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, 0..1)
#Explorer.DataFrame<
Polars[1 x 3]
a integer [1]
b integer [1]
c integer [nil]
>
Turns a set of columns to dummy variables.
examples
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a", "c"], b: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["a"])
#Explorer.DataFrame<
Polars[4 x 3]
a_a integer [1, 0, 1, 0]
a_b integer [0, 1, 0, 0]
a_c integer [0, 0, 0, 1]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a", "c"], b: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["a", "b"])
#Explorer.DataFrame<
Polars[4 x 6]
a_a integer [1, 0, 1, 0]
a_b integer [0, 1, 0, 0]
a_c integer [0, 0, 0, 1]
b_a integer [0, 1, 0, 0]
b_b integer [1, 0, 1, 0]
b_d integer [0, 0, 0, 1]
>
@spec filter(df :: t(), mask :: Explorer.Series.t() | [boolean()]) :: t()
@spec filter(df :: t(), callback :: function()) :: t()
Subset rows using column values.
examples
Examples
You can pass a mask directly:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, Explorer.Series.greater(df["b"], 1))
#Explorer.DataFrame<
Polars[2 x 2]
a string ["b", "c"]
b integer [2, 3]
>
You can combine masks using Explorer.Series.and/2
or Explorer.Series.or/2
:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> b_gt = Explorer.Series.greater(df["b"], 1)
iex> a_eq = Explorer.Series.equal(df["a"], "b")
iex> Explorer.DataFrame.filter(df, Explorer.Series.and(a_eq, b_gt))
#Explorer.DataFrame<
Polars[1 x 2]
a string ["b"]
b integer [2]
>
Including a list:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, [false, true, false])
#Explorer.DataFrame<
Polars[1 x 2]
a string ["b"]
b integer [2]
>
Or you can invoke a callback on the dataframe:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, &Explorer.Series.greater(&1["b"], 1))
#Explorer.DataFrame<
Polars[2 x 2]
a string ["b", "c"]
b integer [2, 3]
>
@spec group_by(df :: t(), groups_or_group :: column_names() | column_name()) :: t()
Group the dataframe by one or more variables.
When the dataframe has grouping variables, operations are performed per group.
Explorer.DataFrame.ungroup/2
removes grouping.
examples
Examples
You can group by a single variable:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Or you can group by multiple:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, ["country", "year"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country", "year"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Returns the first n rows of the dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year integer [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total integer [2308, 1254, 32500, 141, 7924]
solid_fuel integer [627, 117, 332, 0, 0]
liquid_fuel integer [1601, 953, 12381, 141, 3649]
gas_fuel integer [74, 7, 14565, 0, 374]
cement integer [5, 177, 2598, 0, 204]
gas_flaring integer [0, 0, 2623, 0, 3697]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels integer [9, 7, 663, 0, 321]
>
@spec mutate(df :: t(), columns :: column_pairs(any())) :: t()
Creates and modifies columns.
Columns are added with keyword list or maps. New variables overwrite existing variables of the same name. Column names are coerced from atoms to strings.
examples
Examples
You can pass in a list directly as a new column:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: [4, 5, 6])
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b integer [1, 2, 3]
c integer [4, 5, 6]
>
Or you can pass in a series:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> s = Explorer.Series.from_list([4, 5, 6])
iex> Explorer.DataFrame.mutate(df, c: s)
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b integer [1, 2, 3]
c integer [4, 5, 6]
>
Or you can invoke a callback on the dataframe:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: &Explorer.Series.add(&1["a"], &1["b"]))
#Explorer.DataFrame<
Polars[3 x 3]
a integer [4, 5, 6]
b integer [1, 2, 3]
c integer [5, 7, 9]
>
You can overwrite existing columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: [4, 5, 6])
#Explorer.DataFrame<
Polars[3 x 2]
a integer [4, 5, 6]
b integer [1, 2, 3]
>
Scalar values are repeated to fill the series:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: 4)
#Explorer.DataFrame<
Polars[3 x 2]
a integer [4, 4, 4]
b integer [1, 2, 3]
>
Including when a callback returns a scalar:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: &Explorer.Series.max(&1["b"]))
#Explorer.DataFrame<
Polars[3 x 2]
a integer [3, 3, 3]
b integer [1, 2, 3]
>
Alternatively, all of the above works with a map instead of a keyword list:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, %{"c" => [4, 5, 6]})
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b integer [1, 2, 3]
c integer [4, 5, 6]
>
@spec new( Table.Reader.t() | series_pairs, opts :: Keyword.t() ) :: t() when series_pairs: %{required(column_name()) => Explorer.Series.t()} | [{column_name(), Explorer.Series.t()}]
Creates a new dataframe.
Accepts any tabular data adhering to the Table.Reader
protocol, as well as a map or a keyword list with series.
options
Options
backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.
examples
Examples
From series:
iex> Explorer.DataFrame.new(%{floats: Explorer.Series.from_list([1.0, 2.0]), ints: Explorer.Series.from_list([1, nil])})
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
From columnar data:
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, nil]})
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
iex> Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, "wrong"]})
** (ArgumentError) cannot create series "ints": the value "wrong" does not match the inferred series dtype :integer
From row data:
iex> rows = [%{id: 1, name: "José"}, %{id: 2, name: "Christopher"}, %{id: 3, name: "Cristine"}]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id integer [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
iex> rows = [[id: 1, name: "José"], [id: 2, name: "Christopher"], [id: 3, name: "Cristine"]]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id integer [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
@spec pivot_longer( df :: t(), columns_to_pivot :: columns() | function(), opts :: Keyword.t() ) :: t()
Pivot data from wide to long.
Explorer.DataFrame.pivot_longer/3
"lengthens" data, increasing the number of rows and
decreasing the number of columns. The inverse transformation is
Explorer.DataFrame.pivot_wider/4
.
The second argument (columns_to_pivot
) can be either an array of column names to pivot
or a filter callback on the dataframe's names. These columns must always have the same
data type.
options
Options
keep
- Columns that are not in the list of pivot and should be kept in the dataframe. May be a filter callback on the dataframe's column names. Defaults to all columns except the ones to pivot.drop
- Columns that are not in the list of pivot and should be dropped from the dataframe. May be a filter callback on the dataframe's column names. This list of columns is going to be subtracted from the list ofkeep
. Defaults to an empty list.names_to
- A string specifying the name of the column to create from the data stored in the column names of the dataframe. Defaults to"variable"
.values_to
- A string specifying the name of the column to create from the data stored in series element values. Defaults to"value"
.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[3282 x 9]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value integer [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"), keep: ["year", "country"])
#Explorer.DataFrame<
Polars[3282 x 4]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value integer [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], keep: ["year", "country"], drop: ["country"])
#Explorer.DataFrame<
Polars[1094 x 3]
year integer [2010, 2010, 2010, 2010, 2010, ...]
variable string ["total", "total", "total", "total", "total", ...]
value integer [2308, 1254, 32500, 141, 7924, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], keep: [], names_to: "my_var", values_to: "my_value")
#Explorer.DataFrame<
Polars[1094 x 2]
my_var string ["total", "total", "total", "total", "total", ...]
my_value integer [2308, 1254, 32500, 141, 7924, ...]
>
@spec pivot_wider( df :: t(), names_from :: column(), values_from :: column(), opts :: Keyword.t() ) :: t()
Pivot data from long to wide.
Explorer.DataFrame.pivot_wider/4
"widens" data, increasing the number of columns and
decreasing the number of rows. The inverse transformation is
Explorer.DataFrame.pivot_longer/3
.
Due to a restriction upstream, values_from
must be a numeric type.
options
Options
id_columns
- A set of columns that uniquely identifies each observation. Defaults to all columns in data except for the columns specified innames_from
andvalues_from
. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables. May accept a filter callback, a list or a range of column names. Default value is0..-1
. If an empty list is passed, or a range that results in a empty list of column names, it raises an error.names_prefix
- String added to the start of every variable name. This is particularly useful ifnames_from
is a numeric vector and you want to create syntactic variable names.
examples
Examples
iex> df = Explorer.DataFrame.new(id: [1, 1], variable: ["a", "b"], value: [1, 2])
iex> Explorer.DataFrame.pivot_wider(df, "variable", "value")
#Explorer.DataFrame<
Polars[1 x 3]
id integer [1]
a integer [1]
b integer [2]
>
@spec pull(df :: t(), column :: column()) :: Explorer.Series.t()
Extracts a single column as a series.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, "total")
#Explorer.Series<
integer[1094]
[2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, 2)
#Explorer.Series<
integer[1094]
[2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
@spec rename( df :: t(), names :: column_names() | column_pairs(column_name()) ) :: t()
Renames columns.
To apply a function to a subset of columns, see rename_with/3
.
examples
Examples
You can pass in a list of new names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, ["c", "d"])
#Explorer.DataFrame<
Polars[3 x 2]
c string ["a", "b", "a"]
d integer [1, 3, 1]
>
Or you can rename individual columns using keyword args:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, a: "first")
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b integer [1, 3, 1]
>
Or you can rename individual columns using a map:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, %{"a" => "first"})
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b integer [1, 3, 1]
>
Renames columns with a function.
examples
Examples
If no columns are specified, it will apply the function to all column names:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
YEAR integer [2010, 2010, 2010, 2010, 2010, ...]
COUNTRY string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL integer [2308, 1254, 32500, 141, 7924, ...]
SOLID_FUEL integer [627, 117, 332, 0, 0, ...]
LIQUID_FUEL integer [1601, 953, 12381, 141, 3649, ...]
GAS_FUEL integer [74, 7, 14565, 0, 374, ...]
CEMENT integer [5, 177, 2598, 0, 204, ...]
GAS_FLARING integer [0, 0, 2623, 0, 3697, ...]
PER_CAPITA float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
BUNKER_FUELS integer [9, 7, 663, 0, 321, ...]
>
A callback can be used to filter the column names that will be renamed, similarly to select/3
:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.ends_with?(&1, "_fuel"), &String.trim_trailing(&1, "_fuel"))
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid integer [627, 117, 332, 0, 0, ...]
liquid integer [1601, 953, 12381, 141, 3649, ...]
gas integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Or you can just pass in the list of column names you'd like to apply the function to:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, ["total", "cement"], &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
CEMENT integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Sample rows from a dataframe.
If given an integer as the second argument, it will return N samples. If given a float, it will return that proportion of the series.
Can sample with or without replacement.
options
Options
replacement
- If set totrue
, each sample will be independent and therefore values may repeat. Required to betrue
forn
greater then the number of rows in the dataframe orfrac
> 1.0. (default:false
)seed
- An integer to be used as a random seed. If nil, a random value between 1 and 1e12 will be used. (default: nil)
examples
Examples
You can sample N rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 3, seed: 100)
#Explorer.DataFrame<
Polars[3 x 10]
year integer [2012, 2012, 2013]
country string ["ZIMBABWE", "NICARAGUA", "NIGER"]
total integer [2125, 1260, 529]
solid_fuel integer [917, 0, 93]
liquid_fuel integer [1006, 1176, 432]
gas_fuel integer [0, 0, 0]
cement integer [201, 84, 4]
gas_flaring integer [0, 0, 0]
per_capita float [0.15, 0.21, 0.03]
bunker_fuels integer [9, 18, 19]
>
Or you can sample a proportion of rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 0.03, seed: 100)
#Explorer.DataFrame<
Polars[33 x 10]
year integer [2013, 2012, 2013, 2012, 2010, ...]
country string ["BAHAMAS", "POLAND", "SLOVAKIA", "MOZAMBIQUE", "OMAN", ...]
total integer [764, 81792, 9024, 851, 12931, ...]
solid_fuel integer [1, 53724, 3657, 11, 0, ...]
liquid_fuel integer [763, 17353, 2090, 632, 2331, ...]
gas_fuel integer [0, 8544, 2847, 47, 9309, ...]
cement integer [0, 2165, 424, 161, 612, ...]
gas_flaring integer [0, 6, 7, 0, 679, ...]
per_capita float [2.02, 2.12, 1.67, 0.03, 4.39, ...]
bunker_fuels integer [167, 573, 34, 56, 1342, ...]
>
@spec select( df :: t(), columns_or_callback :: columns() | function(), keep_or_drop :: :keep | :drop ) :: t()
Selects a subset of columns by name.
Can optionally return all but the named columns if :drop
is passed as the last argument.
examples
Examples
You can select columns with a list of names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["a"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
You can also use a range or a list of integers:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, [0, 1])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [1, 2, 3]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, 0..1)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [1, 2, 3]
>
Or you can use a callback function that takes the dataframe's names as its first argument:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, &String.starts_with?(&1, "b"))
#Explorer.DataFrame<
Polars[3 x 1]
b integer [1, 2, 3]
>
If you pass :drop
as the third argument, it will return all but the named columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["b"], :drop)
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, ["a", "b"], :drop)
#Explorer.DataFrame<
Polars[3 x 1]
c integer [4, 5, 6]
>
Subset a continuous set of rows.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, 1, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2010, 2010]
country string ["ALBANIA", "ALGERIA"]
total integer [1254, 32500]
solid_fuel integer [117, 332]
liquid_fuel integer [953, 12381]
gas_fuel integer [7, 14565]
cement integer [177, 2598]
gas_flaring integer [0, 2623]
per_capita float [0.43, 0.9]
bunker_fuels integer [7, 663]
>
Negative offsets count from the end of the series:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY"]
total integer [1432855, 1840]
solid_fuel integer [450047, 2]
liquid_fuel integer [576531, 1700]
gas_fuel integer [390719, 25]
cement integer [11314, 112]
gas_flaring integer [4244, 0]
per_capita float [4.43, 0.54]
bunker_fuels integer [30722, 251]
>
If the length would run past the end of the dataframe, the result may be shorter than the length:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 20)
#Explorer.DataFrame<
Polars[10 x 10]
year integer [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
total integer [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel integer [450047, 2, 1677, 0, 204, ...]
liquid_fuel integer [576531, 1700, 2086, 42, 28445, ...]
gas_fuel integer [390719, 25, 23929, 0, 12731, ...]
cement integer [11314, 112, 1000, 0, 1088, ...]
gas_flaring integer [4244, 0, 0, 0, 8042, ...]
per_capita float [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels integer [30722, 251, 0, 10, 1256, ...]
>
Summarise each group to a single row.
Implicitly ungroups.
supported-operations
Supported operations
The following aggregations may be performed:
:min
- Take the minimum value within the group. SeeExplorer.Series.min/1
.:max
- Take the maximum value within the group. SeeExplorer.Series.max/1
.:sum
- Take the sum of the series within the group. SeeExplorer.Series.sum/1
.:mean
- Take the mean of the series within the group. SeeExplorer.Series.mean/1
.:median
- Take the median of the series within the group. SeeExplorer.Series.median/1
.:first
- Take the first value within the group. SeeExplorer.Series.first/1
.:last
- Take the last value within the group. SeeExplorer.Series.last/1
.:count
- Count the number of rows per group.:n_unique
- Count the number of unique rows per group.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df |> Explorer.DataFrame.group_by("year") |> Explorer.DataFrame.summarise(total: [:max, :min], country: [:n_unique])
#Explorer.DataFrame<
Polars[5 x 4]
year integer [2010, 2011, 2012, 2013, 2014]
total_max integer [2393248, 2654360, 2734817, 2797384, 2806634]
total_min integer [1, 2, 2, 2, 3]
country_n_unique integer [217, 217, 220, 220, 220]
>
Display the DataFrame in a tabular fashion.
examples
Examples
df = Explorer.Datasets.iris() Explorer.DataFrame.table(df) Explorer.DataFrame.table(df, limit: 1) Explorer.DataFrame.table(df, limit: :infinity)
Returns the last n rows of the dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df)
#Explorer.DataFrame<
Polars[5 x 10]
year integer [2014, 2014, 2014, 2014, 2014]
country string ["VIET NAM", "WALLIS AND FUTUNA ISLANDS", "YEMEN", "ZAMBIA", "ZIMBABWE"]
total integer [45517, 6, 6190, 1228, 3278]
solid_fuel integer [19246, 0, 137, 132, 2097]
liquid_fuel integer [12694, 6, 5090, 797, 1005]
gas_fuel integer [5349, 0, 581, 0, 0]
cement integer [8229, 0, 381, 299, 177]
gas_flaring integer [0, 0, 0, 0, 0]
per_capita float [0.49, 0.44, 0.24, 0.08, 0.22]
bunker_fuels integer [761, 1, 153, 33, 9]
>
Subset rows with a list of indices.
examples
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.take(df, [0, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a integer [1, 3]
b string ["a", "c"]
>
Converts a dataframe to a list of columns with lists as values.
See to_series/2
if you want a list of columns with series as values.
options
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> Explorer.DataFrame.to_columns(df)
%{"floats" => [1.0, 2.0], "ints" => [1, nil]}
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_columns(df, atom_keys: true)
%{floats: [1.0, 2.0], ints: [1, nil]}
Converts the dataframe to the lazy version of the current backend.
If already lazy, this is a noop.
Converts a dataframe to a list of maps (rows).
Warning
This may be an expensive operation because
polars
stores data in columnar format.
options
Options
:atom_keys
- Configure if the resultant maps should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df, atom_keys: true)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
Converts a dataframe to a list of columns with series as values.
See to_columns/2
if you want a list of columns with lists as values.
options
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> map = Explorer.DataFrame.to_series(df)
iex> Explorer.Series.to_list(map["floats"])
[1.0, 2.0]
iex> Explorer.Series.to_list(map["ints"])
[1, nil]
@spec ungroup(df :: t(), groups_or_group :: column_names() | column_name() | :all) :: t()
Removes grouping variables.
Accepts a list of group names. If groups is not specified, then all groups are removed.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df, ["country"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Link to this section Functions: Multi-table
Combine two or more dataframes row-wise (stack).
Column names and dtypes must match. The only exception is for numeric columns that can be mixed together, and casted automatically to float columns.
examples
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x integer [1, 2, 3, 4, 5, ...]
y string ["a", "b", "c", "d", "e", ...]
>
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4.2, 5.3, 6.4], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x float [1.0, 2.0, 3.0, 4.2, 5.3, ...]
y string ["a", "b", "c", "d", "e", ...]
>
Combine two dataframes row-wise.
concat_rows(df1, df2)
is equivalent to concat_rows([df1, df2])
.
Join two tables.
join-types
Join types
inner
- Returns all rows fromleft
where there are matching values inright
, and all columns fromleft
andright
.left
- Returns all rows fromleft
and all columns fromleft
andright
. Rows inleft
with no match inright
will havenil
values in the new columns.right
- Returns all rows fromright
and all columns fromleft
andright
. Rows inright
with no match inleft
will havenil
values in the new columns.outer
- Returns all rows and all columns from bothleft
andright
. Where there are not matching values, returnsnil
for the one missing.cross
- Also known as a cartesian join. Returns all combinations ofleft
andright
. Can be very computationally expensive.
options
Options
on
- The columns to join on. Defaults to overlapping columns. Does not apply to cross join.how
- One of the join types (as an atom) described above. Defaults to:inner
.
examples
Examples
Inner join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right)
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Left join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
a integer [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
Right join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
Outer join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 3]
a integer [1, 2, 4, 3]
b string ["a", "b", nil, "c"]
c string ["d", "e", "f", nil]
>
Cross join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
a integer [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right integer [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Inner join with different names:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(d: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, on: [{"a", "d"}])
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Link to this section Functions: Introspection
Gets the dtypes of the dataframe columns.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.dtypes(df)
%{"floats" => :float, "ints" => :integer}
Returns the groups of a dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, "country")
iex> Explorer.DataFrame.groups(df)
["country"]
Returns the number of columns in the dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_columns(df)
10
Returns the number of rows in the dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_rows(df)
1094
Gets the names of the dataframe columns.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.names(df)
["floats", "ints"]
Gets the shape of the dataframe as a {height, width}
tuple.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
iex> Explorer.DataFrame.shape(df)
{3, 2}
Link to this section Functions: IO
Writes a dataframe to a binary representation of a delimited file.
options
Options
header
- Should the column names be written as the first line of the file? (default:true
)delimiter
- A single character used to separate fields within a record. (default:","
)
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df |> Explorer.DataFrame.head(2) |> Explorer.DataFrame.dump_csv()
"year,country,total,solid_fuel,liquid_fuel,gas_fuel,cement,gas_flaring,per_capita,bunker_fuels\n2010,AFGHANISTAN,2308,627,1601,74,5,0,0.08,9\n2010,ALBANIA,1254,117,953,7,177,0,0.43,7\n"
Similar to from_csv/2
but raises if there is a problem reading the CSV.
Reads a delimited file into a dataframe.
If the CSV is compressed, it is automatically decompressed.
options
Options
delimiter
- A single character used to separate fields within a record. (default:","
)dtypes
- A list/map of{"column_name", dtype}
tuples. Any non-specified column has its type imputed from the first 1000 rows. (default:[]
)header
- Does the file have a header of column names as the first row or not? (default:true
)max_rows
- Maximum number of lines to read. (default:nil
)null_character
- The string that should be interpreted as a nil value. (default:"NA"
)skip_rows
- The number of lines to skip at the beginning of the file. (default:0
)columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
)infer_schema_length
Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).parse_dates
- Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtypestring
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC file into a dataframe.
options
Options
columns
- List with the name or index of columns to be selected. Defaults to all columns.
Read a file of JSON objects or lists separated by new lines
options
Options
batch_size
- Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default:1000
).infer_schema_length
- Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).
Reads a parquet file into a dataframe.
Similar to to_csv/3
but raises if there is a problem reading the CSV.
@spec to_csv(df :: t(), filename :: String.t(), opts :: Keyword.t()) :: {:ok, String.t()} | {:error, term()}
Writes a dataframe to a delimited file.
options
Options
header
- Should the column names be written as the first line of the file? (default:true
)delimiter
- A single character used to separate fields within a record. (default:","
)
Writes a dataframe to a IPC file.
Apache IPC is a language-agnostic columnar data structure that can be used to store data frames. It excels as a format for quickly exchange data between different programming languages.
options
Options
compression
- Sets the algorithm used to compress the IPC file. It accepts"ZSTD"
or"LZ4"
compression. (default:nil
)
Writes a dataframe to a ndjson file.
Writes a dataframe to a parquet file.