Joins¶
Colnade supports typed joins between DataFrames with different schemas.
Join condition¶
When you compare columns from different schemas with ==, Colnade creates a JoinCondition instead of a regular BinOp:
# Same schema → BinOp[Bool] (filter expression)
Users.age == 30
# Different schemas → JoinCondition
Users.id == Orders.user_id
Performing a join¶
joined = users.join(orders, on=Users.id == Orders.user_id)
# joined is JoinedDataFrame[Users, Orders]
The how parameter controls the join type:
users.join(orders, on=Users.id == Orders.user_id, how="inner") # default
users.join(orders, on=Users.id == Orders.user_id, how="left")
users.join(orders, on=Users.id == Orders.user_id, how="outer")
JoinedDataFrame — a transitional type¶
JoinedDataFrame[S, S2] is a transitional type. It holds data from two schemas but does not itself satisfy a single schema contract. Before you can use schema-dependent operations like group_by(), head(), tail(), or sample(), you must call cast_schema() to flatten the join result into a DataFrame[S3].
Available operations on JoinedDataFrame:
| Operation | Returns | Description |
|---|---|---|
filter() |
JoinedDataFrame[S, S2] |
Filter rows using columns from either schema |
sort() |
JoinedDataFrame[S, S2] |
Sort rows |
limit() |
JoinedDataFrame[S, S2] |
Limit to first n rows |
unique() |
JoinedDataFrame[S, S2] |
Deduplicate by columns |
drop_nulls() |
JoinedDataFrame[S, S2] |
Drop rows with nulls |
with_columns() |
JoinedDataFrame[S, S2] |
Add or overwrite columns |
select() |
DataFrame[Any] |
Select columns (schema erased) |
cast_schema() |
DataFrame[S3] |
Flatten to a single schema |
lazy() |
JoinedLazyFrame[S, S2] |
Convert to lazy |
Operations not available on joined frames (use cast_schema() first):
group_by()— requires a single schemahead(),tail(),sample()— materialized-only ops onDataFrame
# Typical workflow: join → filter → cast_schema → group_by
result = (
users.join(orders, on=Users.id == Orders.user_id)
.filter(Orders.amount >= 100)
.cast_schema(UserOrders)
.group_by(UserOrders.user_name)
.agg(UserOrders.amount.sum().alias(UserOrders.amount))
.cast_schema(UserTotals)
)
Flattening with cast_schema¶
Use cast_schema to flatten a JoinedDataFrame into a single-schema DataFrame:
class UserOrders(cn.Schema):
user_name: cn.Column[cn.Utf8] = cn.mapped_from(Users.name)
user_id: cn.Column[cn.UInt64] = cn.mapped_from(Users.id)
amount: cn.Column[cn.Float64]
result = joined.cast_schema(UserOrders)
# result is DataFrame[UserOrders]
Disambiguating shared column names¶
When both schemas have a column with the same name (e.g., both Users and Orders have id), use mapped_from to specify which source you want:
class JoinOutput(cn.Schema):
user_id: cn.Column[cn.UInt64] = cn.mapped_from(Users.id) # from Users
order_id: cn.Column[cn.UInt64] = cn.mapped_from(Orders.id) # from Orders
amount: cn.Column[cn.Float64] # unambiguous name match
Without mapped_from, ambiguous column names (present in both schemas) are skipped during name matching and will produce a SchemaError for missing columns.
Explicit mapping¶
For full control, pass an explicit mapping dict:
result = joined.cast_schema(Output, mapping={
Output.person_name: Users.name,
Output.total: Orders.amount,
})
Explicit mapping takes the highest precedence, overriding both mapped_from and name matching.
Lazy joins¶
LazyFrame joins work the same way but produce JoinedLazyFrame[S, S2]:
joined_lazy = users_lazy.join(orders_lazy, on=Users.id == Orders.user_id)
# JoinedLazyFrame[Users, Orders]
result = joined_lazy.cast_schema(UserOrders).collect()
# DataFrame[UserOrders]
JoinedLazyFrame has the same restrictions as JoinedDataFrame — no group_by(), head(), tail(), or sample(). Use cast_schema() first.