提升效率，必看这个Python数据处理神器！

来源：投稿作者：sunny
编辑：学姐

unsetunsetPolars简介unsetunset

Polars 是一个速度极快的 DataFrame 库，用于处理结构化数据。核心是用 Rust 编写的，可用于 Python、R 和 NodeJS。

对于Polar而言，它有着如下特点：

快速：用 Rust 从头开始编写，设计接近机器，没有外部依赖。
I/O：对所有常见数据存储层的一流支持：本地、云存储和数据库。
直观的 API：按预期方式编写查询。Polars 将在内部确定使用其查询优化器执行的最有效方式。
Out of Core：流式处理 API 允许您处理结果，而无需所有数据同时在内存中。
并行：通过在可用的 CPU 内核之间分配工作负载来利用计算机的强大功能，而无需任何额外的配置。
矢量化查询引擎：使用 Apache Arrow（一种列式数据格式）以矢量化方式处理查询，并使用 SIMD 优化 CPU 使用率。

Polars 的目标是提供一个闪电般快速的 DataFrame 库，该库：

利用计算机上所有可用的内核。
优化查询以减少不必要的工作/内存分配。
处理比可用 RAM 大得多的数据集。
一致且可预测的 API。
遵循严格的架构（在运行查询之前应知道数据类型）。

Polars 是用 Rust 编写的，这赋予了它 C/C++ 性能，并允许它完全控制查询引擎中的性能关键部分。

unsetunsetPolars安装和使用unsetunset

在正式讲述这个库之前，我们先来安装该库。

pip install polars

读取/写入文件

Polars 支持常见文件格式（e.g. csv、json、parquet）、云存储（S3、Azure Blob、BigQuery）和数据库（如 postgres、mysql）的读写。

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            datetime(2025, 1, 1),
            datetime(2025, 1, 2),
            datetime(2025, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0],
        "string": ["a", "b", "c"],
    }
)

print(df)

# 结果如下：
shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2025-01-02 00


    
:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ c      │
└─────────┴─────────────────────┴───────┴────────┘

在下面的示例中，我们将 DataFrame 写入一个名为 output.csv 的 csv 文件。之后，我们用 read_csv 回读它，然后 print 用结果进行检查。

df.write_csv("docs/data/output.csv")
df_csv = pl.read_csv("docs/data/output.csv")
print(df_csv)

# 结果如下：
shape: (3, 4)
┌─────────┬────────────────────────────┬───────┬────────┐
│ integer ┆ date                       ┆ float ┆ string │
│ ---     ┆ ---                        ┆ ---   ┆ ---    │
│ i64     ┆ str                        ┆ f64   ┆ str    │
╞═════════╪════════════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01T00:00:00.000000 ┆ 4.0   ┆ a      │
│ 2       ┆ 2025-01-02T00:00:00.000000 ┆ 5.0   ┆ b      │
│ 3       ┆ 2025-01-03T00:00:00.000000 ┆ 6.0   ┆ c      │
└─────────┴────────────────────────────┴───────┴────────┘

表达式

Expressions 是极地的核心力量。它 expressions 提供了一个模块化结构，允许您将简单的概念组合到复杂的查询中。下面我们将介绍作为所有查询的构建块（或在 Polars 术语上下文中）的基本组件：

select
filter
with_columns
group_by

Select选择

要选择列，我们需要做两件事：

定义 DataFrame 我们想要的数据来源。
选择我们需要的数据。

在下面的示例中，您会看到我们选择 col('*') 。星号代表所有列。

df.select(pl.col("*"))

# 结果如下：
shape: (5, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a   ┆ b        ┆ c                   ┆ d     │
│ --- ┆ ---      ┆ ---                 ┆ ---   │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0   ┆ 0.337071 ┆ 2025-12-01 00:00:00 ┆ 1.0   │
│ 1   ┆ 0.930258 ┆ 2025-12-02 00:00:00 ┆ 2.0   │
│ 2   ┆ 0.24564  ┆ 2025-12-03 00:00:00 ┆ NaN   │
│ 3   ┆ 0.673248 ┆ 2025-12-04 00:00:00 ┆ -42.0 │
│ 4   ┆ 0.460631 ┆ 2025-12-05 00:00:00 ┆ null  │
└─────┴──────────┴─────────────────────┴───────┘

您还可以指定要返回的特定列。有两种方法可以做到这一点。第一个选项是传递列名称，如下所示。

df.select(pl.col("a", "b"))

# 结果如下：
shape: (5, 2)
┌─────┬──────────┐
│ a   ┆ b        │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.337071 │
│ 1   ┆ 0.930258 │
│ 2   ┆ 0.24564  │
│ 3   ┆ 0.673248 │
│ 4   ┆ 0.460631 │
└─────┴──────────┘

Filter过滤器

该 filter 选项允许我们创建 DataFrame .我们使用与之前相同的 DataFrame 方法，并在两个指定日期之间进行过滤。

df.filter(
    pl.col("c").is_between(datetime(2025, 12, 2), datetime(2025, 12, 3)),
)

# 结果如下：
shape: (2, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a   ┆ b        ┆ c                   ┆ d   │
│ --- ┆ ---      ┆ ---                 ┆ --- │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 1   ┆ 0.930258 ┆ 2025-12-02 00:00:00 ┆ 2.0 │
│ 2   ┆ 0.24564  ┆ 2025-12-03 00:00:00 ┆ NaN │
└─────┴──────────┴─────────────────────┴─────┘

您还可以 filter 创建包含多个列的更复杂的过滤器。

df.filter((pl.col("a") <= 3) & (pl.col("d").is_not_nan()))

# 结果如下：
shape: (3, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a   ┆ b        ┆ c                   ┆ d     │
│ --- ┆ ---      ┆ ---                 ┆ ---   │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0   ┆ 0.337071 ┆ 2025-12-01 00:00:00 ┆ 1.0   │
│ 1   ┆ 0.930258 ┆ 2025-12-02 00:00:00 ┆ 2.0   │
│ 3   ┆ 0.673248 ┆ 2025-12-04 00:00:00 ┆ -42.0 │
└─────┴──────────┴─────────────────────┴───────┘

添加列Add columns

with_columns 允许您为分析创建新列。我们创建两个新列 e 和 b+42 .首先，我们将列 b 中的所有值相加，并将结果存储在列 e 中。之后，我们添加 42 的 b 值。创建一个新列 b+42 来存储这些结果。

df.with_columns(pl.col("b").sum().alias("e"), (pl.col("b") + 42).alias("b+42"))

# 结果如下：
shape: (5, 6)
┌─────┬──────────┬─────────────────────┬───────┬──────────┬───────────┐
│ a   ┆ b        ┆ c                   ┆ d     ┆ e        ┆ b+42      │
│ --- ┆ ---      ┆ ---                 ┆ ---   ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   ┆ f64      ┆ f64       │
╞═════╪══════════╪═════════════════════╪═══════╪══════════╪═══════════╡
│ 0   ┆ 0.337071 ┆ 2025-12-01 00:00:00 ┆ 1.0   ┆ 2.646848 ┆ 42.337071 │
│ 1   ┆ 0.930258 ┆ 2025-12-02 00:00:00 ┆ 2.0   ┆ 2.646848 ┆ 42.930258 │
│ 2   ┆ 0.24564  ┆ 2025-12-03 00:00:00 ┆ NaN   ┆ 2.646848 ┆ 42.24564  │
│ 3   ┆ 0.673248 ┆ 2025-12-04 00:00:00 ┆ -42.0 ┆ 2.646848 ┆ 42.673248 │
│ 4   ┆ 0.460631 ┆ 2025-12-05 00:00:00 ┆ null  ┆ 2.646848 ┆ 42.460631 │
└─────┴──────────┴─────────────────────┴───────┴──────────┴───────────┘

分组依据Group by

我们将按功能为组创建一个新 DataFrame 功能。这个新内容 DataFrame 将包括我们想要分组的几个“组”。

df2 = pl.DataFrame(
    {
        "x": range(8),



    
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

# 结果如下：
shape: (8, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ A   │
│ 1   ┆ A   │
│ 2   ┆ A   │
│ 3   ┆ B   │
│ 4   ┆ B   │
│ 5   ┆ C   │
│ 6   ┆ X   │
│ 7   ┆ X   │
└─────┴─────┘

df2.group_by("y", maintain_order=True).len()

# 结果如下：
shape: (4, 2)
┌─────┬─────┐
│ y   ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═════╡
│ A   ┆ 3   │
│ B   ┆ 2   │
│ C   ┆ 1   │
│ X   ┆ 2   │
└─────┴─────┘

df2.group_by("y", maintain_order=True).agg(
    pl.col("*").count().alias("count"),
    pl.col("*").sum().alias("sum"),
)

# 结果如下：
shape: (4, 3)
┌─────┬───────┬─────┐
│ y   ┆ count ┆ sum │
│ --- ┆ ---   ┆ --- │
│ str ┆ u32   ┆ i64 │
╞═════╪═══════╪═════╡
│ A   ┆ 3     ┆ 3   │
│ B   ┆ 2     ┆ 7   │
│ C   ┆ 1     ┆ 5   │
│ X   ┆ 2     ┆ 13  │
└─────┴───────┴─────┘

组合Combination

下面是有关如何组合操作以创建 DataFrame 所需的操作的一些示例。

df_x = df.with_columns((pl.col("a") * pl.col("b")).alias("a * b")).select(
    pl.all().exclude(["c", "d"])
)

print(df_x)

# 结果如下：
shape: (5, 3)
┌─────┬──────────┬──────────┐
│ a   ┆ b        ┆ a * b    │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.337071 ┆ 0.0      │
│ 1   ┆ 0.930258 ┆ 0.930258 │
│ 2   ┆ 0.24564  ┆ 0.491279 │
│ 3   ┆ 0.673248 ┆ 2.019744 │
│ 4   ┆ 0.460631 ┆ 1.842525 │
└─────┴──────────┴──────────┘

df_y = df.with_columns((pl.col("a") * pl.col("b")).alias("a * b")).select(
    pl.all().exclude("d")
)

print(df_y)

# 结果如下：
shape: (5, 4)
┌─────┬──────────┬─────────────────────┬──────────┐
│ a   ┆ b        ┆ c                   ┆ a * b    │
│ --- ┆ ---      ┆ ---                 ┆ ---      │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64      │
╞═════╪══════════╪═════════════════════╪══════════╡
│ 0   ┆ 0.337071 ┆ 2025-12-01 00:00:00 ┆ 0.0      │
│ 1   ┆ 0.930258 ┆ 2025-12-02 00:00:00 ┆ 0.930258 │
│ 2   ┆ 0.24564  ┆ 2025


    
-12-03 00:00:00 ┆ 0.491279 │
│ 3   ┆ 0.673248 ┆ 2025-12-04 00:00:00 ┆ 2.019744 │
│ 4   ┆ 0.460631 ┆ 2025-12-05 00:00:00 ┆ 1.842525 │
└─────┴──────────┴─────────────────────┴──────────┘

合并DataFrame

根据用例，有两种方法可以 DataFrame 组合 s：join 和 concat。

Join

Polars 支持所有类型的连接（例如左连接、右连接、内连接、外连接）。让我们仔细看看如何 join 将两个 DataFrames 合并为一个 DataFrame 。我们两个 DataFrames 都有一个类似“id”的列： a 和 x 。 DataFrames 在此示例中，我们可以将 join 这些列用于。

df = pl.DataFrame(
    {
        "a": range(8),
        "b": np.random.rand(8),
        "d": [1, 2.0, float("nan"), float("nan"), 0, -5, -42, None],
    }
)

df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B"


    
, "C", "X", "X"],
    }
)
joined = df.join(df2, left_on="a", right_on="x")
print(joined)

# 结果如下：
shape: (8, 4)
┌─────┬──────────┬───────┬─────┐
│ a   ┆ b        ┆ d     ┆ y   │
│ --- ┆ ---      ┆ ---   ┆ --- │
│ i64 ┆ f64      ┆ f64   ┆ str │
╞═════╪══════════╪═══════╪═════╡
│ 0   ┆ 0.371542 ┆ 1.0   ┆ A   │
│ 1   ┆ 0.748599 ┆ 2.0   ┆ A   │
│ 2   ┆ 0.490039 ┆ NaN   ┆ A   │
│ 3   ┆ 0.292075 ┆ NaN   ┆ B   │
│ 4   ┆ 0.227413 ┆ 0.0   ┆ B   │
│ 5   ┆ 0.410372 ┆ -5.0  ┆ C   │
│ 6   ┆ 0.549948 ┆ -42.0 ┆ X   │
│ 7   ┆ 0.48205  ┆ null  ┆ X   │
└─────┴──────────┴───────┴─────┘

Concat

我们也可以 concatenate 两个 DataFrames .垂直串联将使 DataFrame 时间更长。水平串联将使 DataFrame 宽度更大。下面你可以看到我们两个 DataFrames 的水平串联的结果。

stacked = df.hstack(df2)
print(stacked)

# 结果如下：
shape: (8, 5)
┌─────┬──────────┬───────┬─────┬─────┐
│ a   ┆ b        ┆ d     ┆ x   ┆ y   │
│ --- ┆ ---      ┆ ---   ┆ --- ┆ --- │
│ i64 ┆ f64      ┆ f64   ┆ i64 ┆ str │
╞═════╪══════════╪═══════╪═════╪═════╡
│ 0   ┆ 0.371542 ┆ 1.0   ┆ 0   ┆ A   │
│ 1   ┆ 0.748599 ┆ 2.0   ┆ 1   ┆ A   │
│ 2   ┆ 0.490039 ┆ NaN   ┆ 2   ┆ A   │
│ 3   ┆ 0.292075 ┆ NaN   ┆ 3   ┆ B   │
│ 4   ┆ 0.227413 ┆ 0.0   ┆ 4   ┆ B   │
│ 5   ┆ 0.410372 ┆ -5.0  ┆ 5   ┆ C   │
│ 6   ┆ 0.549948 ┆ -42.0 ┆ 6   ┆ X   │
│ 7   ┆ 0.48205  ┆ null  ┆ 7   ┆ X   │
└─────┴──────────┴───────┴─────┴─────┘

unset unset数据结构unsetunset

Polars 提供的核心基础数据结构是 Series 和 DataFrame 。

Series

Series是一维数据结构。在序列中，所有元素都具有相同的数据类型。下面的代码片段显示了如何创建简单的命名 Series 对象。

import polars as pl

s = pl.Series("a", [1, 2, 3, 4, 5])
print(s)

# 结果如下：
shape: (5,)
Series: 'a' [i64]
[
    1
    2
    3
    4
    5
]

DataFrame

A DataFrame 是一个由支持 Series 的二维数据结构，它可以看作是的集合（例如列表）的 Series 抽象。可以对执行 DataFrame 的操作与 SQL 在类似查询中执行的操作非常相似。您可以 GROUP BY 、 JOIN 、 PIVOT 定义自定义函数。

from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3, 4, 5],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
            datetime(2022, 1, 4),
            datetime(2022, 1, 5),
        ],
        "float": [4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

print(df)

# 结果如下：
shape: (5, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘

查看数据

本部分重点介绍如何 DataFrame 查看 .我们将使用上一个示例中的作为 DataFrame 起点。

head

默认情况下，该 head 函数显示 . DataFrame 您可以指定要查看的行数（例如）。 df.head(10)

print(df.head(3))

# 结果如下：
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘

tail

该 tail 函数显示 . DataFrame 您还可以指定要查看的行数，类似于 head 。

print(df.tail(3))

# 结果如下：
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00


    
:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘

sample

如果你想得到你 DataFrame 的数据的印象，你也可以使用 sample 。从 sample 中得到 DataFrame n 个随机行。

print(df.sample(2))

# 结果如下：
shape: (2, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │

describe

Describe 返回 . DataFrame 如果可能，它将提供一些快速统计信息。

print(df.describe())

# 结果如下：
shape: (9, 4)
┌────────────┬──────────┬─────────────────────┬──────────┐
│ statistic  ┆ integer  ┆ date                ┆ float    │
│ ---        ┆ ---      ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ str                 ┆ f64      │
╞════════════╪══════════╪═════════════════════╪══════════╡
│ count      ┆ 5.0      ┆ 5                   ┆ 5.0      │
│ null_count ┆ 0.0      ┆ 0                   ┆ 0.0      │
│ mean       ┆ 3.0      ┆ 2022-01-03 00:00:00 ┆ 6.0      │
│ std        ┆ 1.581139 ┆ null                ┆ 1.581139 │
│ min        ┆ 1.0      ┆ 2022-01-01 00:00:00 ┆ 4.0      │
│ 25%        ┆ 2.0      ┆ 2022-01-02 00:00:00 ┆ 5.0      │
│ 50%        ┆ 3.0      ┆ 2022-01-03


    
 00:00:00 ┆ 6.0      │
│ 75%        ┆ 4.0      ┆ 2022-01-04 00:00:00 ┆ 7.0      │
│ max        ┆ 5.0      ┆ 2022-01-05 00:00:00 ┆ 8.0      │
└────────────┴──────────┴─────────────────────┴──────────┘

unsetunset上下文unsetunset

Polars 开发了自己的领域特定语言（DSL）来转换数据。该语言非常易于使用，并允许复杂的查询保持人类可读性。该语言的两个核心组件是上下文和表达式，后者我们将在下一节中介绍。

顾名思义，上下文是指需要计算表达式的上下文。主要有三种情况：

Selection: df.select(...), df.with_columns(...)
Filtering: df.filter()
Group by / Aggregation: df.group_by(...).agg(...)

以下示例在以下 DataFrame 情况下执行：

df = pl.DataFrame(
    {
        "nrs": [1, 2


    
, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

# 结果如下：
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

Selection

选择上下文将表达式应用于列。 select 可能会生成聚合、表达式组合或文本的新列。

选择上下文中的表达式必须生成 Series 长度相同或长度为 1 的表达式。文本被视为 length-1 Series 。

out = df.select(
    pl.sum("nrs"),
    pl.col("names").sort(),
    pl.col("names").first().alias("first name"),
    (pl.mean("nrs") * 10).alias("10xnrs"),
)
print(out)

# 结果如下：
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

从查询中可以看出，选择上下文非常强大，允许您计算彼此独立（并行）的任意表达式。

与 select 语句类似，该 with_columns 语句也进入选择上下文。和 select 之间的 with_columns 主要区别在于保留 with_columns 原始列并添加新列，而 select 删除原始列。

df = df.with_columns(
    pl.sum("nrs").alias("nrs_sum"),
    pl.col("random").count().alias("count"),
)
print(df)

# 结果如下：
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 11      ┆ 5     │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Filtering

筛选上下文根据计算结果为 Boolean 数据类型的一个或多个表达式筛选 a DataFrame 。

out = df.filter(pl.col("nrs") > 2)
print(out)

# 结果如下：
shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64 ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3   ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ 5   ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└─────┴───────┴──────────┴────────┴─────────┴───────┘

Group by

在 group_by 上下文中，表达式适用于组，因此可以产生任何长度的结果（一个组可能有许多成员）。




    
out = df.group_by("groups").agg(
    pl.sum("nrs"),  # sum nrs by groups
    pl.col("random").count().alias("count"),  # count group members
    # sum random where name != null
    pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
    pl.col("names").reverse().alias("reversed names"),
)
print(out)

# 结果如下：
shape: (3, 5)
┌────────┬─────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ --- ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64 ┆ u32   ┆ f64        ┆ list[str]      │
╞════════╪═════╪═══════╪════════════╪════════════════╡
│ B      ┆ 8   ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
│ A      ┆ 3   ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
│ C      ┆ 0   ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
└────────┴─────┴───────┴────────────┴────────────────┘

从结果中可以看出，所有表达式都应用于上下文定义的 group_by 组。除了标准 group_by 、 group_by_dynamic 和 group_by_rolling 之外，还通过上下文进入该组。

unsetunset表达式unsetunset

Polars 有一个强大的概念，称为表达式，这是其非常快速的性能的核心。

表达式是描述如何构造一个或多个系列的操作树。由于输出是 Series，因此可以直接应用一系列表达式（类似于 pandas 中的方法链），每个表达式都会转换上一步的输出。

下面是一个表达式：

pl.col("foo").sort().head(2)

选择列“foo”，然后对列进行排序（不按相反的顺序排列），取排序输出的前两个值。

表达式的力量在于，每个表达式都会产生一个新的表达式，并且它们可以通过管道连接在一起。您可以通过将表达式传递给其中一个 Polars 执行上下文来运行表达式。

在这里，我们通过运行 df.select 以下命令来运行两个表达式：

df.select(pl.col("foo").sort().head(2), pl.col("bar").filter(pl.col("foo") == 1).sum())

所有表达式都是并行运行的，这意味着单独的 Polars 表达式是令人尴尬的并行。请注意，在表达式中可能会进行更多的并行化。

在前面中，我们概述了什么是 Expressions 以及它们如何是无价的。在本节中，我们将重点介绍 Expressions 它们本身。每个部分都概述了它们的作用，并提供了其他示例。

基本运算符

本节介绍如何将基本运算符（例如加法、减法）与表达式结合使用。我们将在以下数据帧的上下文中使用不同的主题提供各种示例。

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

# 结果如下：
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

Numerical

df_numerical = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
)
print(df_numerical)

# 结果如下：
shape: (5, 4)
┌─────────┬─────────┬──────────────┬──────────────┐
│ nrs + 5 ┆ nrs - 5 ┆ nrs * random ┆ nrs / random │
│ ---     ┆ ---     ┆ ---          ┆ ---          │
│ i64     ┆ i64     ┆ f64          ┆ f64          │
╞═════════╪═════════╪══════════════╪══════════════╡
│ 6       ┆ -4      ┆ 0.154163     ┆ 6.486647     │
│ 7       ┆ -3      ┆ 1.480099     ┆ 2.702521     │
│ 8       ┆ -2


    
      ┆ 0.789945     ┆ 11.393198    │
│ null    ┆ null    ┆ null         ┆ null         │
│ 10      ┆ 0       ┆ 0.072875     ┆ 343.054056   │
└─────────┴─────────┴──────────────┴──────────────┘

Logical

df_logical = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),
    (pl.col("random") <= 0.5).alias("random <= .5"),
    (pl.col("nrs") != 1).alias("nrs != 1"),
    (pl.col("nrs") == 1).alias("nrs == 1"),
    ((pl.col("random") <= 0.5) & (pl.col("nrs") > 1)).alias("and_expr"),  # and
    ((pl.col("random") <= 0.5) | (pl.col("nrs") > 1)).alias("or_expr"),  # or
)
print(df_logical)

# 结果如下：
shape: (5, 6)
┌─────────┬──────────────┬──────────┬──────────┬──────────┬─────────┐
│ nrs > 1 ┆ random <= .5 ┆ nrs != 1 ┆ nrs == 1 ┆ and_expr ┆ or_expr │
│ ---     ┆ ---          ┆ ---      ┆ ---      ┆ ---      ┆ ---     │
│ bool    ┆ bool         ┆ bool     ┆ bool     ┆ bool     ┆ bool    │
╞═════════╪══════════════╪══════════╪══════════╪══════════╪═════════╡
│ false   ┆ true         ┆ false    ┆ true     ┆ false    ┆ true    │
│ true    ┆ false        ┆ true     ┆ false    ┆ false    ┆ true    │
│ true    ┆ true         ┆ true     ┆ false    ┆ true     ┆ true    │
│ null    ┆ false        ┆ null     ┆ null     ┆ false    ┆ null    │
│ true    ┆ true         ┆ true     ┆ false    ┆ true     ┆ true    │
└─────────┴──────────────┴──────────┴──────────┴──────────┴─────────┘

列选择

让我们创建一个数据集以在本节中使用：

from datetime import date, datetime

import polars as pl

df = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.datetime_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
).with_row_index("index")
print(df)

# 结果如下：
shape: (3, 7)
┌───────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ index ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ ---   ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32   ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │



    
╞═══════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0     ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1     ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2     ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└───────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘

函数

Polars 表达式具有大量内置函数。这些允许您创建复杂的查询，而无需用户定义的函数。这里要介绍的太多了，但我们将介绍一些更流行的用例。如果要查看所有函数，请转到编程语言的 API 参考。

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

# 结果如下：
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ spam  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

字符串

以下部分讨论对 String 数据执行的操作，这是 DataType 在使用 DataFrames .但是，由于字符串的内存大小不可预测，处理字符串通常效率低下，导致 CPU 访问许多随机内存位置。为了解决这个问题，Polars 使用 Arrow 作为其后端，将所有字符串存储在一个连续的内存块中。因此，字符串遍历对于 CPU 来说是缓存最优且可预测的。

可以通过具有 String 数据类型的列的 .str 属性访问 str 命名空间。在以下示例中，我们创建一个名为 animal 的列，并根据字节数和字符数计算列中每个元素的长度。如果您使用的是 ASCII 文本，那么这两个计算的结果将是相同的，建议使用 lengths ，因为它更快。

df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rab$bit", None]})

out = df.select(
    pl.col("animal").str.len_bytes().alias("byte_count"),
    pl.col("animal").str.len_chars().alias("letter_count"),
)
print(out)

# 结果如下：
shape: (4, 2)
┌────────────┬──────────────┐
│ byte_count ┆ letter_count │
│ ---        ┆ ---          │
│ u32        ┆ u32          │
╞════════════╪══════════════╡
│ 4          ┆ 4            │
│ 11         ┆ 11           │
│ 7          ┆ 7            │
│ null       ┆ null         │
└────────────┴──────────────┘

out = df.select(
    pl.col("animal"),
    pl.col("animal").str.contains("cat|bit").alias("regex"),
    pl.col("animal").str.contains("rab$", literal=True).alias("literal"),
    pl.col("animal").str.starts_with("rab").alias("starts_with"),
    pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out)

# 结果如下：
shape: (4, 5)
┌─────────────┬───────┬─────────┬─────────────┬───────────┐
│ animal      ┆ regex ┆ literal ┆ starts_with ┆ ends_with │
│ ---         ┆ ---   ┆ ---     ┆ ---         ┆ ---       │
│ str         ┆ bool  ┆ bool    ┆ bool        ┆ bool      │
╞═════════════╪═══════╪═════════╪═════════════╪═══════════╡
│ Crab        ┆ false ┆ false   ┆ false       ┆ false     │
│ cat and dog ┆ true  ┆ false   ┆ false       ┆ true      │
│ rab$bit     ┆ true  ┆ true    ┆ true        ┆ false     │
│ null        ┆ null  ┆ null    ┆ null        ┆ null      │


df = pl.DataFrame(
    {
        "a": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
out = df.select(
    pl.col("a").str.extract(r"candidate=(\w+)", group_index=1),
)
print(out)

# 结果如下：
shape: (3, 1)
┌─────────┐
│ a       │
│ ---     │
│ str     │
╞═════════╡
│ messi   │
│ null    │
│ ronaldo │
└─────────┘

df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
out = df.select(
    pl.col("foo").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)
print(out)

# 结果如下：
shape: (2, 1)
┌────────────────┐
│ extracted_nrs  │
│ ---            │
│ list[str]      │
╞════════════════╡
│ ["123", "45"]  │
│ ["678", "910"] │
└────────────────┘

df = pl.DataFrame({"id": [1, 2], "text": ["123abc", "abc456"]})
out = df.with_columns(
    pl.col("text").str.replace(r"abc\b", "ABC"),
    pl.col("text").str.replace_all("a", "-", literal=True).alias("text_replace_all"),
)
print(out)

# 结果如下：



    
shape: (2, 3)
┌─────┬────────┬──────────────────┐
│ id  ┆ text   ┆ text_replace_all │
│ --- ┆ ---    ┆ ---              │
│ i64 ┆ str    ┆ str              │
╞═════╪════════╪══════════════════╡
│ 1   ┆ 123ABC ┆ 123-bc           │
│ 2   ┆ abc456 ┆ -bc456           │
└─────┴────────┴──────────────────┘

缺失值

（ DataFrame 或等效的） Series 中的每一列都是一个 Arrow 数组或基于 Apache Arrow 规范的 Arrow 数组的集合。缺失数据在箭头和极坐标中表示，并带有一个 null 值。此 null 缺失值适用于所有数据类型，包括数值。

Polars还允许 NotaNumber 浮点列的值。 NaN 这些 NaN 值被视为一种浮点数据，而不是缺失数据。我们将在下面单独讨论 NaN 值。

您可以使用 python None 值手动定义缺失值：

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
print(df)

# 结果如下：
shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ null  │
└───────┘

null_count_df = df.null_count()
print(null_count_df)

# 结果如下：
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
└───────┘

is_null_series = df.select(
    pl.col("value").is_null(),
)
print(is_null_series)

# 结果如下：
shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
└───────┘

# 填充缺失值
df = pl.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1, None, 3],
    },
)
print(df)

# 结果如下：
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ null │
│ 3


    
    ┆ 3    │
└──────┴──────┘

# 使用指定文本填充
fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(pl.lit(2)),
)
print(fill_literal_df)

# 结果如下：
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ 2    │
│ 3    ┆ 3    │
└──────┴──────┘

窗口函数

窗口函数是具有超能力的表达式。它们允许您对 select 上下文中的组执行聚合。首先，我们创建一个数据集。

import polars as pl

# then let's load some csv data with information about pokemon
df = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)
print(df.head())

# 结果如下：
shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---    ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str    ┆ str    ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass  ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass  ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass  ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass  ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire   ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘

按选择中的聚合进行分组:

out = df.select(
    "Type 1",
    "Type 2",
    pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
    pl.col("Defense")
    .mean()
    .over(["Type 1", "Type 2"])
    .alias("avg_defense_by_type_combination"),
    pl.col("Attack").mean().alias("avg_attack"),
)
print(out)

# 结果如下：
shape: (163, 5)
┌─────────┬────────┬────────────────────┬─────────────────────────────────┬────────────┐
│ Type 1  ┆ Type 2 ┆ avg_attack_by_type ┆ avg_defense_by_type_combinatio… ┆ avg_attack │
│ ---     ┆ ---    ┆ ---                ┆ ---                             ┆ ---        │
│ str     ┆ str    ┆ f64                ┆ f64                             ┆ f64        │
╞═════════╪════════╪════════════════════╪═════════════════════════════════╪════════════╡
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
│ Fire    ┆ null   ┆ 88.642857


    
          ┆ 58.3                            ┆ 75.349693  │
│ …       ┆ …      ┆ …                  ┆ …                               ┆ …          │
│ Fire    ┆ Flying ┆ 88.642857          ┆ 82.0                            ┆ 75.349693  │
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
│ Dragon  ┆ Flying ┆ 94.0               ┆ 95.0                            ┆ 75.349693  │
│ Psychic ┆ null   ┆ 53.875             ┆ 51.428571                       ┆ 75.349693  │
└─────────┴────────┴────────────────────┴─────────────────────────────────┴────────────┘

每个组的操作数:

filtered = df.filter(pl.col("Type 2") == "Psychic").select(
    "Name",
    "Type 1",
    "Speed",
)
print(filtered)

# 结果如下：
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Slowpoke            ┆ Water  ┆ 15    │
│ Slowbro             ┆ Water  ┆ 30    │
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
│ Exeggcute           ┆ Grass  ┆ 40    │
│ Exeggutor           ┆ Grass  ┆ 55    │
│ Starmie             ┆ Water  ┆ 115   │
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

out = filtered.with_columns(
    pl.col(["Name", "Speed"]).sort_by("Speed", descending=True).over("Type 1"),
)
print(out)

# 结果如下：
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Starmie             ┆ Water  ┆ 115   │
│ Slowbro             ┆ Water  ┆ 30    │
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
│ Exeggutor           ┆ Grass  ┆ 55    │
│ Exeggcute           ┆ Grass  ┆ 40    │
│ Slowpoke            ┆ Water  ┆ 15    │
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

unsetunset列表和数组unset unset

Polars 对 List 列具有一流的支持：即每行都是不同长度的同构元素列表的列。Polars 也有一个 Array 数据类型，类似于 NumPy 的 ndarray 对象，其中行的长度相同。

注意：这与 Python list 的对象不同，Python 的对象元素可以是任何类型。Polars 可以将这些存储在列中，但作为一种通用 Object 数据类型，它不具有我们将要讨论的特殊列表操作功能。

强大的 `List` 操纵

假设我们有来自全州不同气象站的以下数据。当气象站无法获得结果时，将记录错误代码，而不是当时的实际温度。

weather = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)
print(weather)

# 结果如下：
shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ station   ┆ temperatures                    │
│ ---       ┆ ---                             │
│ str       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20        │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22          │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6   │
│ Station 5 ┆ 14 8 E0 16 22 24 E1             │
└───────────┴─────────────────────────────────┘

创建 `List` 列

对于上面创建的 weather DataFrame ，我们很可能需要对每个站点捕获的温度进行一些分析。为了实现这一点，我们首先需要能够获得单独的温度测量值。这是通过以下方式完成的：

out = weather.with_columns(pl.col("temperatures").str.split(" "))
print(out)

# 结果如下：
shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘

操作 `List` 列

Polars 在色谱柱上 List 提供了多种标准操作。如果我们想要前三个测量，我们可以做一个 head(3) .最后三个可以通过 tail(3) 获得，也可以 slice 通过（支持负索引）获得。我们还可以通过 lengths 来识别观测值的数量。让我们看看它们在行动中：

out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
    pl.col("temperatures").list.len().alias("obs"),
)
print(out)

# 结果如下：
shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬─────┐
│ station   ┆ temperatures         ┆ top3               ┆ bottom_3           ┆ obs │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ --- │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ u32 │



    
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪═════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ 10  │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ 13  │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ 8   │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ 11  │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ 7   │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴─────┘

按行计算

我们可以使用 list.eval （ list().eval in Rust）表达式对列表的元素应用任何 Polars 操作！这些表达式完全在 Polars 的查询引擎上运行，并且可以并行运行，因此会得到很好的优化。假设我们有另一组三天的天气数据，适用于不同的台站：

weather_by_day = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

# 结果如下：
shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │



    
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘

推荐课程

《Python · AI&数据科学入门》

点这里👇关注我，回复“python”了解课程

往期精彩阅读

10个赞学姐的午饭就可以有个鸡腿🍗