colnade.dataframe

DataFrame, LazyFrame, GroupBy, LazyGroupBy, JoinedDataFrame, and JoinedLazyFrame, plus the concat function.

DataFrame

DataFrame(*, _data=None, _schema=None, _backend=None)

Bases: Generic[S]

A typed, materialized DataFrame parameterized by a Schema.

Schema-preserving operations (filter, sort, limit, etc.) return DataFrame[S]. Schema-transforming operations (select, group_by+agg) return DataFrame[Any] and require cast_schema() to bind to a named output schema.
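The distinction between schema-preserving and schema-transforming operations can be illustrated with a small stand-alone sketch. It does not import colnade: `Frame`, `Users`, and `UserIds` are hypothetical stand-ins, rows are plain dicts, and an erased schema is modelled as `None`. The real classes may differ in every detail except the typing pattern described above.

```python
from typing import Any, Callable, Dict, Generic, List, Optional, Type, TypeVar

S = TypeVar("S")
T = TypeVar("T")

Row = Dict[str, Any]


class Users:
    """Hypothetical schema class with columns id and name."""


class UserIds:
    """Hypothetical output schema with the single column id."""


class Frame(Generic[S]):
    """Toy stand-in for a typed frame; NOT the colnade implementation."""

    def __init__(self, rows: List[Row], schema: Optional[type]) -> None:
        self.rows = rows
        self.schema = schema  # None models an erased schema, i.e. Frame[Any]

    def filter(self, keep: Callable[[Row], bool]) -> "Frame[S]":
        # Schema-preserving: the column set is unchanged, so S carries over.
        return Frame([r for r in self.rows if keep(r)], self.schema)

    def select(self, *names: str) -> "Frame[Any]":
        # Schema-transforming: the column set changes, so the schema is erased.
        return Frame([{n: r[n] for n in names} for r in self.rows], None)

    def cast_schema(self, schema: Type[T]) -> "Frame[T]":
        # Re-binds the rows to a named schema (this toy performs no column checks).
        return Frame(self.rows, schema)


users = Frame([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}], Users)
filtered = users.filter(lambda r: r["id"] > 1)  # still bound to Users
ids = users.select("id")                        # schema erased
bound = ids.cast_schema(UserIds)                # bound to UserIds
```

A static type checker would see `filtered` as `Frame[Users]`, `ids` as `Frame[Any]`, and `bound` as `Frame[UserIds]`, which mirrors the return types listed for the methods on this page.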

height property

Return the number of rows.

width property

Return the number of columns.

Raises TypeError on DataFrame[Any] (schema erased). Use cast_schema() first to bind to a named schema.

shape property

Return (rows, columns).

to_native()

Return the underlying backend-native data object (e.g. pl.DataFrame).

__len__()

Return the number of rows.

is_empty()

Return True if the DataFrame has zero rows.

iter_rows_as(row_type)

Iterate rows, constructing row_type instances via row_type(**row_dict).

Works with Schema.Row (frozen dataclass), dict, plain dataclasses, NamedTuple, Pydantic models, or any callable accepting **kwargs.
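The `row_type(**row_dict)` construction rule is what makes all of these row types interchangeable. The following sketch reproduces only that rule in plain Python. The free function `iter_rows_as`, the list-of-dicts input, and the `UserRow` and `UserTuple` types are assumptions for illustration, not the library's method.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, NamedTuple, TypeVar

R = TypeVar("R")


def iter_rows_as(rows: List[Dict[str, Any]], row_type: Callable[..., R]) -> Iterator[R]:
    """Yield one row_type instance per row, built as row_type(**row_dict)."""
    for row_dict in rows:
        yield row_type(**row_dict)


@dataclass(frozen=True)
class UserRow:
    """Hypothetical frozen dataclass, similar in spirit to Schema.Row."""
    id: int
    name: str


class UserTuple(NamedTuple):
    """Hypothetical NamedTuple with the same fields."""
    id: int
    name: str


rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

as_dataclasses = list(iter_rows_as(rows, UserRow))
as_tuples = list(iter_rows_as(rows, UserTuple))
as_dicts = list(iter_rows_as(rows, dict))  # dict(**row_dict) copies the row
```

Any callable works as long as its keyword parameters match the column names, so a missing or extra column surfaces as an ordinary `TypeError` from the call.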

item(column=None)

item(column: _IntCol) -> int
item(column: _IntColN) -> int | None
item(column: _FloatCol) -> float
item(column: _FloatColN) -> float | None
item(column: Column[Utf8]) -> str
item(column: Column[Utf8 | None]) -> str | None
item(column: Column[Bool]) -> bool
item(column: Column[Bool | None]) -> bool | None
item(column: Column[Binary]) -> bytes
item(column: Column[Binary | None]) -> bytes | None
item(column: Column[Date]) -> date
item(column: Column[Date | None]) -> date | None
item(column: Column[Datetime]) -> datetime
item(column: Column[Datetime | None]) -> datetime | None
item(column: Column[Duration]) -> timedelta
item(column: Column[Duration | None]) -> timedelta | None
item(column: Column[Time]) -> time
item(column: Column[Time | None]) -> time | None
item(column: Column[Any]) -> Any
item() -> Any

Extract a scalar value from a single-row DataFrame.

Parameters:

column (Column[Any] | None, default None): Column to extract. If None, the DataFrame must have exactly 1 row and 1 column.

Returns:

Any: A plain Python scalar whose type corresponds to the column dtype (e.g. int for integer columns, str for Utf8).

Raises:

ValueError: If the shape constraint is not met (1×1 when column is None, or 1 row when column is given).
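The shape constraint can be spelled out in a few lines. The sketch below is a stand-alone approximation, not the library's code: it takes rows as a list of dicts and a column name as a plain string, whereas the real method takes a Column object.

```python
from typing import Any, Dict, List, Optional


def item(rows: List[Dict[str, Any]], column: Optional[str] = None) -> Any:
    """Return a single scalar, enforcing the shape rules described above."""
    if len(rows) != 1:
        raise ValueError(f"expected exactly 1 row, got {len(rows)}")
    row = rows[0]
    if column is None:
        # Without a column, the frame must be exactly 1x1.
        if len(row) != 1:
            raise ValueError(f"expected exactly 1 column, got {len(row)}")
        return next(iter(row.values()))
    # With a column, only the 1-row requirement applies.
    return row[column]


total = item([{"total": 42}])                 # 1x1 frame, no column needed
name = item([{"id": 7, "name": "Ada"}], "name")  # 1 row, column chosen explicitly
```

This is why `item()` pairs naturally with `agg()`, which reduces all rows to a single row.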

filter(predicate)

Filter rows by a boolean expression.

sort(*columns, descending=False)

Sort rows by columns or sort expressions.

limit(n)

Limit to the first n rows.

head(n=5)

Return the first n rows (materialized only).

tail(n=5)

Return the last n rows (materialized only).

sample(n)

Return a random sample of n rows (materialized only).

unique(*columns)

Remove duplicate rows based on the given columns.

drop_nulls(*columns)

Drop rows with null values in the given columns.

with_columns(*exprs)

Add or overwrite columns. The return type is optimistically DataFrame[S]: the result is assumed to still conform to S.

select(*columns)

select(c1: Column[Any]) -> DataFrame[Any]
select(c1: Column[Any], c2: Column[Any]) -> DataFrame[Any]
select(
    c1: Column[Any], c2: Column[Any], c3: Column[Any]
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
    c10: Column[Any],
) -> DataFrame[Any]

Select columns. Returns DataFrame[Any] — use cast_schema() to bind.

agg(*exprs)

Aggregate all rows into a single row.

group_by(*keys)

Group by columns for aggregation.

join(other, on, how='inner')

Join with another DataFrame on a JoinCondition.

cast_schema(schema, mapping=None, extra='drop')

Bind to a new schema via mapping resolution.

lazy()

Convert to a lazy query plan.

with_raw(fn)

Apply a function to the raw engine DataFrame and re-wrap.

The function receives the underlying engine DataFrame (e.g. pl.DataFrame, pd.DataFrame) and must return the same type. The result is wrapped back into DataFrame[S] with the same schema and backend. If validation is enabled, the result is validated before returning.

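This is a bounded escape hatch, comparable to Rust's unsafe block: the function body is unchecked, but the result is re-wrapped (and validated, if enabled) at the boundary.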
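The apply-and-re-wrap pattern can be sketched without any engine installed. In the sketch below, `TypedFrame` is a hypothetical stand-in, a list of dicts plays the role of the engine DataFrame, and the validation step only compares column names. The explicit same-type check is an assumption about how the "must return the same type" rule could be enforced.

```python
from typing import Any, Callable, Dict, FrozenSet, List

RawFrame = List[Dict[str, Any]]  # stand-in for an engine frame such as pl.DataFrame


class TypedFrame:
    """Toy wrapper around raw data; NOT the colnade implementation."""

    def __init__(self, data: RawFrame, columns: FrozenSet[str], validate: bool = True) -> None:
        self._data = data
        self._columns = columns
        self._validate = validate

    def to_native(self) -> RawFrame:
        return self._data

    def _check(self, data: RawFrame) -> None:
        # Simplified structural check: every row has exactly the schema's columns.
        for row in data:
            if set(row) != self._columns:
                raise ValueError(f"columns {sorted(row)} do not match {sorted(self._columns)}")

    def with_raw(self, fn: Callable[[RawFrame], RawFrame]) -> "TypedFrame":
        result = fn(self._data)  # unchecked region: fn can do anything
        if type(result) is not type(self._data):
            # Assumption: a type mismatch is rejected eagerly.
            raise TypeError("fn must return the same raw type it received")
        if self._validate:
            self._check(result)  # boundary check before re-wrapping
        return TypedFrame(result, self._columns, self._validate)


frame = TypedFrame([{"id": 1, "qty": 5}, {"id": 2, "qty": 0}], frozenset({"id", "qty"}))
in_stock = frame.with_raw(lambda rows: [r for r in rows if r["qty"] > 0])
```

A function that drops the `qty` column would pass through the unchecked region but fail at the boundary check, which is what keeps the escape hatch bounded.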

validate()

Validate that the data conforms to the schema.

Always runs structural checks (columns, types, nullability) and value-level constraint checks (Field() constraints, @schema_check) regardless of the validation level toggle.

to_batches(batch_size=None)

Convert to an iterator of typed Arrow batches.

Delegates to the backend's to_arrow_batches() method, wrapping each raw pa.RecordBatch in an ArrowBatch[S] to preserve schema type information across the boundary.

from_dict(data, schema, backend) classmethod

Create a DataFrame from a columnar dict.

The backend reads column dtypes from schema and coerces values to the correct native types. Validates if validation is enabled.

from_batches(batches, schema, backend) classmethod

Create a DataFrame from an iterator of typed Arrow batches.

Unwraps each ArrowBatch[S] to its raw pa.RecordBatch and delegates to the backend's from_arrow_batches() method.

LazyFrame

LazyFrame(*, _data=None, _schema=None, _backend=None)

Bases: Generic[S]

A typed, lazy query plan parameterized by a Schema.

Supports the same operations as DataFrame. Use collect() to materialize.

width property

Return the number of columns.

Derivable from the schema without materializing. Raises TypeError on LazyFrame[Any] (schema erased).

height property

Return the number of rows.

This triggers computation on lazy backends (e.g. Dask).

to_native()

Return the underlying backend-native data object (e.g. pl.LazyFrame).

__len__()

Return the number of rows.

to_batches(batch_size=None)

Convert to an iterator of typed Arrow batches.

This triggers computation on lazy backends (e.g. Dask).

item(column=None)

item(column: _IntCol) -> int
item(column: _IntColN) -> int | None
item(column: _FloatCol) -> float
item(column: _FloatColN) -> float | None
item(column: Column[Utf8]) -> str
item(column: Column[Utf8 | None]) -> str | None
item(column: Column[Bool]) -> bool
item(column: Column[Bool | None]) -> bool | None
item(column: Column[Binary]) -> bytes
item(column: Column[Binary | None]) -> bytes | None
item(column: Column[Date]) -> date
item(column: Column[Date | None]) -> date | None
item(column: Column[Datetime]) -> datetime
item(column: Column[Datetime | None]) -> datetime | None
item(column: Column[Duration]) -> timedelta
item(column: Column[Duration | None]) -> timedelta | None
item(column: Column[Time]) -> time
item(column: Column[Time | None]) -> time | None
item(column: Column[Any]) -> Any
item() -> Any

Extract a scalar value from a single-row LazyFrame.

This triggers computation on lazy backends.

Parameters:

column (Column[Any] | None, default None): Column to extract. If None, the LazyFrame must have exactly 1 row and 1 column.

Returns:

Any: A plain Python scalar whose type corresponds to the column dtype (e.g. int for integer columns, str for Utf8).

Raises:

ValueError: If the shape constraint is not met (1×1 when column is None, or 1 row when column is given).

filter(predicate)

Filter rows by a boolean expression.

sort(*columns, descending=False)

Sort rows by columns or sort expressions.

limit(n)

Limit to the first n rows.

head(n=5)

Return the first n rows (alias for limit).

tail(n=5)

Return the last n rows.

unique(*columns)

Remove duplicate rows based on the given columns.

drop_nulls(*columns)

Drop rows with null values in the given columns.

with_columns(*exprs)

Add or overwrite columns. The return type is optimistically LazyFrame[S]: the result is assumed to still conform to S.

select(*columns)

select(c1: Column[Any]) -> LazyFrame[Any]
select(c1: Column[Any], c2: Column[Any]) -> LazyFrame[Any]
select(
    c1: Column[Any], c2: Column[Any], c3: Column[Any]
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
    c10: Column[Any],
) -> LazyFrame[Any]

Select columns. Returns LazyFrame[Any] — use cast_schema() to bind.

agg(*exprs)

Aggregate all rows into a single row.

group_by(*keys)

Group by columns for aggregation.

join(other, on, how='inner')

Join with another LazyFrame on a JoinCondition.

cast_schema(schema, mapping=None, extra='drop')

Bind to a new schema via mapping resolution.

collect()

Materialize the lazy query plan into a DataFrame.

with_raw(fn)

Apply a function to the raw engine LazyFrame and re-wrap.

The function receives the underlying engine LazyFrame and must return the same type. The result is wrapped back into LazyFrame[S] with the same schema and backend.

Validation is deferred — it runs at collect() time if enabled, not at with_raw() time.
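Deferred validation means that a faulty raw function is only detected when the plan is materialized. The sketch below models this with a hypothetical `LazyPlan` class that records steps and checks column names in `collect()`. It is a stand-alone illustration and not the library's query planner.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Tuple

RawFrame = List[Dict[str, Any]]  # stand-in for an engine LazyFrame's eventual data
Step = Callable[[RawFrame], RawFrame]


class LazyPlan:
    """Toy lazy plan: with_raw() records a step, collect() runs and validates."""

    def __init__(self, source: RawFrame, columns: FrozenSet[str], steps: Tuple[Step, ...] = ()) -> None:
        self._source = source
        self._columns = columns
        self._steps = steps

    def with_raw(self, fn: Step) -> "LazyPlan":
        # Nothing runs here, so nothing can be validated yet.
        return LazyPlan(self._source, self._columns, self._steps + (fn,))

    def collect(self) -> RawFrame:
        data = self._source
        for step in self._steps:
            data = step(data)
        # Validation happens only now, on the materialized result.
        for row in data:
            if set(row) != self._columns:
                raise ValueError(f"columns {sorted(row)} do not match {sorted(self._columns)}")
        return data


source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}]
good_plan = LazyPlan(source, frozenset({"id", "qty"})).with_raw(
    lambda rows: [r for r in rows if r["qty"] > 0]
)
bad_plan = LazyPlan(source, frozenset({"id", "qty"})).with_raw(
    lambda rows: [{"id": r["id"]} for r in rows]  # drops qty, but no error yet
)
collected = good_plan.collect()
```

Building `bad_plan` succeeds silently. The error appears only when `bad_plan.collect()` is called, so failures may surface far from the `with_raw()` call that caused them.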

validate()

Validate that the data conforms to the schema.

Always runs structural checks and value-level constraint checks regardless of the validation level toggle.

GroupBy

GroupBy(*, _data=None, _schema=None, _keys=(), _backend=None)

Bases: Generic[S]

GroupBy on a materialized DataFrame.

agg(*exprs)

Aggregate grouped data. Returns DataFrame[Any]; use cast_schema() to bind the result to a named schema.

LazyGroupBy

LazyGroupBy(*, _data=None, _schema=None, _keys=(), _backend=None)

Bases: Generic[S]

GroupBy on a lazy query plan.

agg(*exprs)

Aggregate grouped data. Returns LazyFrame[Any]; use cast_schema() to bind the result to a named schema.

JoinedDataFrame

JoinedDataFrame(*, _data=None, _schema_left=None, _schema_right=None, _backend=None)

Bases: Generic[S, S2]

A transitional typed DataFrame resulting from a join of two schemas.

Operations accept columns from either schema S or S2. Available operations are limited to filtering, sorting, and other row-level transforms. Use cast_schema() to flatten into a DataFrame[S3] before group_by, head/tail/sample, or passing to functions that expect a single schema.

to_native()

Return the underlying backend-native data object (e.g. pl.DataFrame).

filter(predicate)

Filter rows by a boolean expression.

sort(*columns, descending=False)

Sort rows by columns or sort expressions.

limit(n)

Limit to the first n rows.

unique(*columns)

Remove duplicate rows based on the given columns.

drop_nulls(*columns)

Drop rows with null values in the given columns.

with_columns(*exprs)

Add or overwrite columns.

select(*columns)

select(c1: Column[Any]) -> DataFrame[Any]
select(c1: Column[Any], c2: Column[Any]) -> DataFrame[Any]
select(
    c1: Column[Any], c2: Column[Any], c3: Column[Any]
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
) -> DataFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
    c10: Column[Any],
) -> DataFrame[Any]

Select columns. Returns DataFrame[Any] — use cast_schema() to bind.

cast_schema(schema, mapping=None, extra='drop')

Flatten join result into a single-schema DataFrame.

lazy()

Convert to a lazy query plan.

JoinedLazyFrame

JoinedLazyFrame(*, _data=None, _schema_left=None, _schema_right=None, _backend=None)

Bases: Generic[S, S2]

A transitional typed lazy query plan resulting from a join of two schemas.

Available operations are limited to filtering, sorting, and other row-level transforms. Use cast_schema() to flatten into a LazyFrame[S3] before group_by or passing to functions that expect a single schema.

to_native()

Return the underlying backend-native data object (e.g. pl.LazyFrame).

filter(predicate)

Filter rows by a boolean expression.

sort(*columns, descending=False)

Sort rows by columns or sort expressions.

limit(n)

Limit to the first n rows.

unique(*columns)

Remove duplicate rows based on the given columns.

drop_nulls(*columns)

Drop rows with null values in the given columns.

with_columns(*exprs)

Add or overwrite columns.

select(*columns)

select(c1: Column[Any]) -> LazyFrame[Any]
select(c1: Column[Any], c2: Column[Any]) -> LazyFrame[Any]
select(
    c1: Column[Any], c2: Column[Any], c3: Column[Any]
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
) -> LazyFrame[Any]
select(
    c1: Column[Any],
    c2: Column[Any],
    c3: Column[Any],
    c4: Column[Any],
    c5: Column[Any],
    c6: Column[Any],
    c7: Column[Any],
    c8: Column[Any],
    c9: Column[Any],
    c10: Column[Any],
) -> LazyFrame[Any]

Select columns. Returns LazyFrame[Any] — use cast_schema() to bind.

cast_schema(schema, mapping=None, extra='drop')

Flatten join result into a single-schema LazyFrame.

collect()

Materialize the lazy query plan into a JoinedDataFrame.

concat

concat(*frames)

concat(*frames: DataFrame[S]) -> DataFrame[S]
concat(*frames: LazyFrame[S]) -> LazyFrame[S]

Concatenate DataFrames or LazyFrames vertically (stack rows).

All inputs must share the same schema class (identity check, not structural equality) and the same frame type (all DataFrame or all LazyFrame). The backend is taken from the first frame.

Parameters:

*frames (DataFrame[S] | LazyFrame[S], default ()): Two or more frames to stack. All must be parameterised by the same Schema subclass and be the same frame type.

Returns:

DataFrame[S] | LazyFrame[S]: A new DataFrame[S] or LazyFrame[S] containing all rows from the input frames, in order.

Raises:

ValueError: If fewer than 2 frames are provided, or if any frame's schema does not match the first frame's schema.

TypeError: If frames mix DataFrame and LazyFrame.

RuntimeError: If the first frame has no backend attached.

Usage:

combined = concat(df_jan, df_feb, df_mar)  # DataFrame[Sales]
combined = concat(lazy_jan, lazy_feb)       # LazyFrame[Sales]
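The difference between an identity check and structural equality is easy to miss: two schema classes with identical fields still count as different schemas. The sketch below reimplements the documented rules on toy frame types. `ToyDataFrame`, `ToyLazyFrame`, `Sales`, and `SalesCopy` are hypothetical, and the order in which the checks run is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


class Sales:
    amount: float


class SalesCopy:  # same fields as Sales, but a distinct class object
    amount: float


@dataclass
class ToyDataFrame:
    rows: List[Any]
    schema: type
    backend: Optional[str] = "toy"


@dataclass
class ToyLazyFrame:
    rows: List[Any]
    schema: type
    backend: Optional[str] = "toy"


def concat(*frames):
    """Stack rows vertically, following the rules documented above."""
    if len(frames) < 2:
        raise ValueError("concat needs at least 2 frames")
    first = frames[0]
    if any(type(f) is not type(first) for f in frames):
        raise TypeError("cannot mix eager and lazy frames")
    # Identity check: `is`, not a field-by-field comparison.
    if any(f.schema is not first.schema for f in frames):
        raise ValueError("all frames must share the same schema class")
    if first.backend is None:
        raise RuntimeError("first frame has no backend attached")
    rows = [row for f in frames for row in f.rows]
    # The backend is taken from the first frame.
    return type(first)(rows, first.schema, first.backend)


jan = ToyDataFrame([10.0, 20.0], Sales)
feb = ToyDataFrame([30.0], Sales)
combined = concat(jan, feb)
```

Passing a frame bound to `SalesCopy` alongside `jan` raises `ValueError` even though the two classes declare the same field, so shared schemas should be imported from one place rather than redefined.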