DataFrames

DataFrame[S] and LazyFrame[S] are the primary interfaces for working with typed data.

Constructing DataFrames

Every backend provides from_rows() and from_dict() for creating typed DataFrames from Python data. The schema drives dtype coercion — you never need to specify backend-specific types.

From rows

from colnade_polars import from_rows

df = from_rows(Users, [
    Users.Row(id=1, name="Alice", age=30, score=85.0),
    Users.Row(id=2, name="Bob", age=25, score=92.5),
])
# df is DataFrame[Users] with correct dtypes

from_rows accepts Row[S] instances — the type checker verifies that rows match the schema, so passing Orders.Row where Users.Row is expected is a static error. For row-oriented dicts, construct Row instances first: Users.Row(**d).
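
For instance, a minimal sketch of that pattern, assuming the raw dicts use the same field names as the Users schema:

raw = [
    {"id": 1, "name": "Alice", "age": 30, "score": 85.0},
    {"id": 2, "name": "Bob", "age": 25, "score": 92.5},
]
# Each dict becomes a typed Users.Row before construction
df = from_rows(Users, [Users.Row(**d) for d in raw])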

From columnar dict

from colnade_polars import from_dict

df = from_dict(Users, {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "score": [85.0, 92.5, 78.0],
})

Both functions validate the data if validation is enabled (see Validation).

From files

Use read_parquet(), read_csv(), or their lazy equivalents (scan_parquet(), scan_csv()):

from colnade_polars import read_parquet

df = read_parquet("users.parquet", Users)

From Arrow batches

DataFrame.from_batches() creates a DataFrame from an iterator of typed ArrowBatch[S] objects. This is useful for streaming or inter-process data transfer:

batches = df.to_batches()  # Iterator[ArrowBatch[Users]]
restored = cn.DataFrame.from_batches(batches, Users, backend)

to_batches() and from_batches() form a round-trip through Arrow record batches. Validation is applied on from_batches() when enabled.

DataFrame vs LazyFrame

|                | DataFrame                    | LazyFrame                        |
| -------------- | ---------------------------- | -------------------------------- |
| Execution      | Immediate (eager)            | Deferred (lazy)                  |
| head(), tail() | Available                    | Available                        |
| height, len()  | Available                    | Available (triggers computation) |
| to_batches()   | Available                    | Available (triggers computation) |
| collect()      | N/A                          | Materializes to DataFrame        |
| Best for       | Interactive work, small data | Optimized pipelines, large data  |

Convert between them:

lazy = df.lazy()          # DataFrame → LazyFrame
eager = lazy.collect()    # LazyFrame → DataFrame

Schema-preserving operations

These operations return the same schema type (DataFrame[S] → DataFrame[S]):

df.filter(Users.age > 25)                    # filter rows
df.sort(Users.score.desc())                  # sort rows
df.sort(Users.name, Users.age)               # sort by multiple columns
df.limit(100)                                # first n rows
df.head(10)                                  # first n rows
df.tail(10)                                  # last n rows
df.sample(50)                                # random sample (eager only)
df.unique(Users.name)                        # deduplicate by columns
df.drop_nulls(Users.age, Users.score)        # drop null rows
df.with_columns(                             # add/overwrite columns
    (Users.score * 2).alias(Users.score)
)

Concatenation

Stack DataFrames vertically with concat():

combined = cn.concat(df_jan, df_feb, df_mar)  # DataFrame[Sales]

All inputs must share the same schema class — this is an identity check (is), not structural equality. Two different schema classes with identical fields will be rejected. Works with both DataFrame and LazyFrame (but you cannot mix them in one call):

combined = cn.concat(lazy_jan, lazy_feb)  # LazyFrame[Sales]

Rows appear in input order: all rows from the first frame, then all rows from the second, and so on. At least 2 frames are required.
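
Because the check uses class identity, two schema classes with the same fields are still different schemas. A sketch, using hypothetical JanSales/FebSales classes and frames:

class JanSales(cn.Schema):
    amount: cn.Column[cn.Float64]

class FebSales(cn.Schema):
    amount: cn.Column[cn.Float64]

# jan_df is DataFrame[JanSales], feb_df is DataFrame[FebSales]
cn.concat(jan_df, feb_df)  # rejected: JanSales is not FebSales, even though the fields match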

Schema-transforming operations

These change the column set and return DataFrame[Any]:

# select — choose columns
selected = df.select(Users.name, Users.score)  # DataFrame[Any]

# agg — aggregate all rows into a single row
summary = df.agg(
    Users.score.mean().alias(Stats.avg_score),
    Users.id.count().alias(Stats.user_count),
)  # DataFrame[Any]

# group_by + agg — grouped aggregation
grouped = df.group_by(Users.name).agg(
    Users.score.mean().alias(Users.score)
)  # DataFrame[Any]

After a schema-transforming operation, use cast_schema() to bind to a named output schema.

cast_schema

cast_schema binds data to a new schema by resolving column mappings:

class UserSummary(cn.Schema):
    name: cn.Column[cn.Utf8]
    score: cn.Column[cn.Float64]

summary = df.select(Users.name, Users.score).cast_schema(UserSummary)
# summary is DataFrame[UserSummary]

Resolution precedence per target column:

  1. Explicit mapping argument: mapping={Target.col: Source.col}
  2. mapped_from on the target field: col: Column[T] = mapped_from(Source.col)
  3. Name matching: the target column name matches a source column name
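
As an illustration of the first rule, a hedged sketch, assuming mapping= is passed to cast_schema and using a hypothetical FullNames target schema:

class FullNames(cn.Schema):
    full_name: cn.Column[cn.Utf8]

renamed = df.select(Users.name).cast_schema(
    FullNames,
    mapping={FullNames.full_name: Users.name},  # explicit target-to-source mapping
)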

The extra parameter controls extra columns in the source:

  • extra="drop" (default) — silently drop extra columns
  • extra="forbid" — raise SchemaError if extra columns exist
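
For example, a sketch of extra="forbid" catching a too-wide select, reusing the UserSummary schema above:

too_wide = df.select(Users.name, Users.score, Users.age)
too_wide.cast_schema(UserSummary, extra="forbid")  # raises SchemaError: age is not in UserSummary
too_wide.cast_schema(UserSummary)                  # default extra="drop": age is silently dropped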

cast_schema is a trust boundary

cast_schema is analogous to a type cast in Go or Rust — it asserts that the data conforms to the target schema. The type checker verifies that expressions reference valid columns on the input schema, but cast_schema is a promise about the output. If you .select() the wrong columns, the type checker won't catch it.

Mitigations:

  • Use mapped_from on output schema fields to create static links between input and output columns. The more fields that declare their provenance, the narrower the trust gap:

    class UserRevenue(cn.Schema):
        user_name: cn.Column[cn.Utf8] = cn.mapped_from(Users.name)
        user_id: cn.Column[cn.UInt64] = cn.mapped_from(Users.id)
        total_amount: cn.Column[cn.Float64]  # only this field is "trust me"
    
  • Use extra="forbid" to catch unexpected columns that might indicate a wrong select.

  • Enable validation — with validation on, df.validate() after cast_schema verifies structural conformance at runtime. Consider calling .cast_schema(Target).validate() at critical pipeline boundaries.

Group by

group_by() is available on DataFrame[S] and LazyFrame[S] — but not on JoinedDataFrame or JoinedLazyFrame. If you need to aggregate joined data, first cast_schema() to flatten to a single schema:

# Join → cast_schema → group_by → cast_schema
totals = (
    users.join(orders, on=Users.id == Orders.user_id)
    .cast_schema(UserOrders)
    .group_by(UserOrders.user_name)
    .agg(UserOrders.amount.sum().alias(UserOrders.amount))
    .cast_schema(UserTotals)
)

Introspection

DataFrame provides properties for inspecting dimensions:

df.height      # number of rows (int)
len(df)        # same as height
df.width       # number of columns (int)
df.shape       # (rows, columns) tuple
df.is_empty()  # True if zero rows

| Property/Method | DataFrame | LazyFrame                        | JoinedDataFrame        |
| --------------- | --------- | -------------------------------- | ---------------------- |
| height          | Yes       | Available (triggers computation) | No (cast_schema first) |
| len()           | Yes       | Available (triggers computation) | No                     |
| width           | Yes       | Yes (from schema)                | No                     |
| shape           | Yes       | Available (triggers computation) | No                     |
| is_empty()      | Yes       | Available (triggers computation) | No                     |

width raises TypeError on DataFrame[Any] (schema erased) — use cast_schema() first.
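
A small sketch of that distinction, reusing the UserSummary schema from above:

selected = df.select(Users.name, Users.score)    # DataFrame[Any]
selected.width                                   # raises TypeError: schema was erased by select()
selected.cast_schema(UserSummary).width          # 2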

Scalar extraction

Extract a single Python value from a DataFrame with item():

# No-arg form: 1×1 DataFrame (e.g. from agg)
mean_score = df.agg(Users.score.mean().alias(Stats.avg_score)).item()

# Column form: 1-row DataFrame, pick column
name = df.head(1).item(Users.name)  # → str

The return type is inferred from the column dtype — item(Column[UInt64]) returns int, item(Column[Utf8]) returns str, etc. The no-arg form returns Any.

Raises ValueError if the shape constraint is not met (1×1 for no-arg, 1 row for column form). Available on both DataFrame and LazyFrame (triggers computation on lazy backends).

Typed row iteration

iter_rows_as(row_type) iterates rows as typed Python objects:

# Using Schema.Row (frozen dataclass)
for row in df.iter_rows_as(Users.Row):
    print(row.name, row.age)  # typed attribute access

# Using dict
for row in df.iter_rows_as(dict):
    print(row["name"], row["age"])

iter_rows_as accepts any callable that takes **kwargs:

  • Schema.Row — frozen dataclass with typed attributes (recommended)
  • dict — plain dictionary
  • Custom dataclasses, NamedTuple, Pydantic models, etc.
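
For example, a sketch with a custom NamedTuple, assuming every schema column is passed as a keyword argument:

from typing import NamedTuple

class UserRecord(NamedTuple):
    id: int
    name: str
    age: int
    score: float

for row in df.iter_rows_as(UserRecord):
    print(row.name, row.score)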

iter_rows_as is only available on DataFrame — not LazyFrame (would require materialization) and not JoinedDataFrame (use cast_schema() first).

Validation

Validate that data conforms to the schema:

df.validate()  # raises SchemaError on mismatch

Checks column existence, data types, and nullability constraints.

Validation levels

| Level                      | Behavior                                                                                                |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| ValidationLevel.OFF        | No runtime checks. Trust the type checker. Zero overhead. (default)                                     |
| ValidationLevel.STRUCTURAL | Check columns exist, dtypes match, nullability. Also checks literal type compatibility in expressions.  |
| ValidationLevel.FULL       | Structural checks plus value-level constraints from Field() metadata.                                   |

Enable auto-validation at data boundaries:

import colnade as cn

cn.set_validation(cn.ValidationLevel.STRUCTURAL)  # or FULL
# Strings and booleans still work for convenience:
cn.set_validation("structural")
cn.set_validation(True)  # → STRUCTURAL

Or via environment variable:

COLNADE_VALIDATE=structural pytest tests/
COLNADE_VALIDATE=full pytest tests/
# Legacy: COLNADE_VALIDATE=1 → STRUCTURAL

df.validate() always runs the full level of checks regardless of the toggle — calling it explicitly signals intent.

What Colnade validates

Colnade catches errors at three levels (see also Core Concepts: Safety Model):

Level 1: In your editor (static analysis)

Your type checker (ty, pyright, mypy) catches errors before code runs:

  • Schema-aware return types: df.filter(...) returns DataFrame[Users], not just DataFrame
  • Type boundary enforcement: JoinedDataFrame[S, S2] is a distinct type from DataFrame[S]. You cannot pass a joined frame where a DataFrame is expected; you must cast_schema() first
  • Schema-transforming operations: select() and group_by().agg() return DataFrame[Any], requiring cast_schema() to regain a named schema
  • Join conditions: cross-schema == returns JoinCondition, same-schema == returns BinOp[Bool]
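
As an illustration of the type-boundary point, a sketch in which the top_users function is hypothetical:

def top_users(frame: cn.DataFrame[Users]) -> cn.DataFrame[Users]:
    return frame.filter(Users.score > 90)

joined = users.join(orders, on=Users.id == Orders.user_id)
top_users(joined)                     # static error: JoinedDataFrame[Users, Orders] is not DataFrame[Users]
top_users(joined.cast_schema(Users))  # OK once flattened back to a single schema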

Level 2: At data boundaries (runtime structural validation)

When validation is enabled, data boundaries and df.validate() check:

  • Column existence — missing columns raise SchemaError
  • Data types — type mismatches raise SchemaError
  • Null violations — non-nullable columns with null values raise SchemaError
  • Extra columns — optionally flagged via extra="forbid" on cast_schema()
  • Expression column membership — operations like filter, sort, select verify that all column references in expressions belong to the frame's schema (e.g., using Orders.amount on a DataFrame[Users] raises SchemaError). On JoinedDataFrame, columns from either schema are accepted.
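
A sketch of the last point, assuming an Orders schema and a DataFrame[Users] named df:

cn.set_validation("structural")

df.filter(Orders.amount > 5)   # raises SchemaError: Orders.amount is not a column of Users
df.filter(Users.age > 25)      # OK: the column belongs to the frame's schema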

Level 3: On your data values (value-level constraints)

Value-level constraints validate domain invariants using Field() metadata:

import colnade as cn

class Users(cn.Schema):
    id: cn.Column[cn.UInt64] = cn.Field(unique=True)
    age: cn.Column[cn.UInt64] = cn.Field(ge=0, le=150)
    name: cn.Column[cn.Utf8] = cn.Field(min_length=1)
    email: cn.Column[cn.Utf8] = cn.Field(pattern=r"^[^@]+@[^@]+\.[^@]+$")
    score: cn.Column[cn.Float64] = cn.Field(ge=0.0, le=100.0)
    status: cn.Column[cn.Utf8] = cn.Field(isin=["active", "inactive"])

Available constraints:

| Constraint | Types             | Meaning               |
| ---------- | ----------------- | --------------------- |
| ge         | Numeric, temporal | Value >= bound        |
| gt         | Numeric, temporal | Value > bound         |
| le         | Numeric, temporal | Value <= bound        |
| lt         | Numeric, temporal | Value < bound         |
| min_length | String            | String length >= n    |
| max_length | String            | String length <= n    |
| pattern    | String            | Matches regex         |
| unique     | Any               | No duplicate values   |
| isin       | Any               | Value in allowed set  |

Field() is a superset of mapped_from() — use Field(ge=0, mapped_from=Source.age) to combine constraints with column mapping.
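
For instance, a brief sketch combining both on a hypothetical derived schema:

class CleanUsers(cn.Schema):
    # value constraints and column provenance declared on the same field
    age: cn.Column[cn.UInt64] = cn.Field(ge=0, le=150, mapped_from=Users.age)
    name: cn.Column[cn.Utf8] = cn.Field(min_length=1, mapped_from=Users.name)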

Cross-column constraints use @schema_check:

class Events(cn.Schema):
    start: cn.Column[cn.UInt64]
    end: cn.Column[cn.UInt64]

    @cn.schema_check
    def start_before_end(cls):
        return Events.start <= Events.end

Value constraints are checked by df.validate() (always) and by auto-validation when the level is "full". Structural-level auto-validation skips value checks for performance.

Current limitations

Column type parameters carry the data type (Column[UInt64]) but not the schema they belong to. This means the type checker cannot statically verify that df.filter(Orders.amount > 5) is invalid when df is a DataFrame[Users]. This limitation exists because Python 3.10 lacks TypeVar defaults (PEP 696). Schema enforcement at the column level would require Column[DType, Schema], which is planned for future versions.

Runtime mitigation: When validation is enabled (STRUCTURAL or FULL), all DataFrame/LazyFrame operations validate expression column membership at runtime. See Type Checker Integration: Wrong-schema columns for details.

Adding computed columns

Use with_columns to add a computed column, then cast_schema to transition to a richer child schema:

class EnrichedUsers(Users):
    risk_score: cn.Column[cn.Float64]

result = df.with_columns(
    (Users.age * 0.1 + Users.score * 0.9).alias(EnrichedUsers.risk_score)
).cast_schema(EnrichedUsers)
# result is DataFrame[EnrichedUsers]

This works because cast_schema recognizes schema inheritance — columns declared on the child schema (risk_score) that aren't in the parent (Users) are resolved by identity (the column name matches itself in the data). Columns inherited from the parent resolve by normal name matching.

Escape hatches

When you need to use engine-native operations not exposed by Colnade, with_raw lets you operate on the raw DataFrame within a bounded scope — like Rust's unsafe block:

# Apply a Polars-native operation, then re-enter the typed world
result = df.with_raw(
    lambda raw: raw.with_columns(
        pl.col("age").map_batches(some_custom_fn)
    )
)
# result is still DataFrame[Users]
# validated automatically if validation is enabled

For complex multi-step logic, use a named function:

def custom_transform(raw_df: pl.DataFrame) -> pl.DataFrame:
    # complex engine-native logic here
    return raw_df.with_columns(...)

result = df.with_raw(custom_transform)

with_raw is available on DataFrame and LazyFrame, but not on JoinedDataFrame — use cast_schema() first.