Say Goodbye to Python Loops: Vectorization Makes Our Code More Efficient

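Introduction

Loops come naturally to us; we learn them in almost every programming language. So, by default, we reach for a loop whenever an operation has to be repeated. But when we are dealing with a very large number of iterations (millions or billions of rows), loops become a serious liability: we can be stuck for hours, only to realize in the end that the approach will not work. This is where vectorization in Python becomes critical.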

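What Is Vectorization?

Vectorization is the technique of expressing operations on a dataset as (NumPy) array operations. Behind the scenes, the operation is applied to all elements of an array or Series in one go, unlike a "for" loop, which processes one row at a time. In this article, we will see how easily Python loops can be replaced with vectorization. This will help us save time and become more proficient at coding.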
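As a minimal sketch of the idea (the array values and the names `prices`, `looped`, and `vectorized` are illustrative and not from the original), the same element-wise operation can be written as an explicit loop or as a single NumPy expression:

```python
import numpy as np

# Illustrative data; any numeric array would do.
prices = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: one element per iteration, in the Python interpreter.
looped = np.empty_like(prices)
for i in range(len(prices)):
    looped[i] = prices[i] * 1.5

# Vectorized version: one expression applied to every element at once.
vectorized = prices * 1.5

print(np.array_equal(looped, vectorized))
```

Both versions produce the same array; the vectorized form simply moves the per-element work out of the Python interpreter and into NumPy's compiled code.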

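Use Case 1: Finding the Sum of Numbers

First, we will look at a basic example of finding the sum of numbers, using a loop and then using vectorization in Python.

Using a Loop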

import time

start = time.time()

# iterative sum
total = 0
# iterating through 1.5 million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))
end = time.time()
print(end - start)
# 1124999250000
# 0.14 seconds

Using Vectorization

import numpy as np

start = time.time()
# vectorized sum - using numpy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999
print(np.sum(np.arange(1500000)))
end = time.time()
print(end - start)
# 1124999250000
# 0.008 seconds

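Compared with iterating over the range function, the vectorized version runs about 18 times faster (0.14 seconds versus 0.008 seconds). The difference becomes even more significant when working with a Pandas DataFrame.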
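Single `time.time()` measurements like those above are noisy, so the exact ratio will vary from machine to machine. The following is a hedged sketch that uses the standard-library `timeit` module to repeat each measurement. The helper names `loop_sum` and `vectorized_sum` are illustrative, and no particular timings are implied:

```python
import timeit

import numpy as np

N = 1_500_000


def loop_sum(n):
    """Sum 0..n-1 with a plain Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum 0..n-1 with NumPy; int64 avoids overflow where the default int is 32-bit."""
    return int(np.sum(np.arange(n, dtype=np.int64)))


# Both approaches must agree before their speed is compared.
assert loop_sum(N) == vectorized_sum(N)

# Best of 3 runs each, to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=1, repeat=3))
print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```

Taking the minimum of several runs is a common convention because the fastest run is the one least disturbed by other activity on the machine.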

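Use Case 2: Mathematical Operations (on a DataFrame)

In data science, developers working with Pandas DataFrames often use loops to perform mathematical operations that create new derived columns. The following example shows how easily such a loop can be replaced with vectorization.

Creating the DataFrame

A DataFrame is tabular data organized in rows and columns. We create a Pandas DataFrame with 5 million rows and 4 columns, filled with random integers from 0 to 49.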

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a', 'b', 'c', 'd'))
df.shape
# (5000000, 4)
df.head()

[Figure: first 5 rows of the DataFrame]

We will create a new column 'ratio' that holds the ratio of column 'd' to column 'c'.

Using a Loop

import time

start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx, 'ratio'] = 100 * (row["d"] / row["c"])

end = time.time()
print(end - start)
# 109 seconds

Using Vectorization

start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])
end = time.time()
print(end - start)
# 0.12 seconds

We can see that, on a DataFrame, the vectorized operation is roughly 900 times faster than the Python loop (109 seconds versus 0.12 seconds).
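One caveat is worth a short sketch: because column 'c' is drawn from a range that includes 0, some ratios divide by zero. The vectorized pandas division does not raise an error; it yields inf (or NaN for 0/0). The following shows one possible way to handle that on a tiny hand-made DataFrame. The values, the name `df_small`, and the choice to mark undefined ratios as NaN are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the second and third rows have c == 0.
df_small = pd.DataFrame({"c": [4, 0, 0, 5], "d": [2, 3, 0, 10]})

# Vectorized ratio, written the same way as in the article.
df_small["ratio"] = 100 * (df_small["d"] / df_small["c"])

# Replace infinities from x/0 with NaN so they are treated as missing.
df_small["ratio"] = df_small["ratio"].replace([np.inf, -np.inf], np.nan)

print(df_small)
```

Whether NaN is the right marker depends on the analysis; dropping those rows or filtering out c == 0 beforehand are equally plausible choices.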

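Use Case 3: if-else Statements (on a DataFrame)

Many of the operations we implement require "if-else" style logic. We can easily replace this logic with vectorized operations in Python. Let's look at the following example to understand it better (we will use the DataFrame created in Use Case 2):

Suppose we want to create a new column 'e' based on some conditions on the existing column 'a'.

Using a Loop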

import time

start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif (row.a <= 25) and (row.a > 0):
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c

end = time.time()
print(end - start)
# Time taken: 177 seconds

Using Vectorization

# using vectorization
start = time.time()
df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']
end = time.time()
print(end - start)
# 0.28007707595825195 seconds

The vectorized operation is about 600 times faster than the Python loop with if-else statements (177 seconds versus 0.28 seconds).
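An alternative way to express the same branching, sketched here with NumPy's `np.select` (which the original does not use), lists the conditions and their results side by side. The first matching condition wins, which mirrors the if/elif/else order. The small DataFrame `df_small` is illustrative:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same column names as the article.
df_small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

# Conditions are checked in order; the first True one selects the value.
conditions = [df_small["a"] == 0, df_small["a"] <= 25]
choices = [df_small["d"], df_small["b"] - df_small["c"]]

# Rows matching no condition fall through to the default (the else branch).
df_small["e"] = np.select(conditions, choices, default=df_small["b"] + df_small["c"])

print(df_small["e"].tolist())
```

Compared with successive `df.loc` assignments, this form keeps the branch order explicit, so the conditions do not need to be applied from least to most specific.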

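Use Case 4 (Advanced): Solving Machine Learning / Deep Learning Networks

Deep learning requires us to solve multiple complex equations, and to do so for millions or even billions of rows. Running loops in Python to solve these equations is very slow, and vectorization is by far the better choice. For example, consider computing the y values for millions of rows in a multiple linear regression equation:

[Equation: multiple linear regression]

We can replace the loop with vectorization. The values of m1, m2, m3... are determined by solving the above equation using the millions of values corresponding to x1, x2, x3... (for simplicity, we look at just a single multiplication step).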
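For reference, a multiple linear regression over n input features is conventionally written as below, using the article's m and x symbols. The intercept term c is an assumption here, since the text only names the m and x values.

```latex
y = m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_n x_n + c
```

The sum of products m_1 x_1 + ... + m_n x_n is exactly a dot product between the coefficient vector and one row of inputs, which is why the step can be vectorized.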

Creating the Data

import numpy as np

# setting initial values of m
m = np.random.rand(1, 5)

# input values for 5 million rows
x = np.random.rand(5000000, 5)

[Figure: output of m]

[Figure: output of x]

Using a Loop

import time
import numpy as np

m = np.random.rand(1, 5)
x = np.random.rand(5000000, 5)

# array holding one result per row (it was used but never defined in the original snippet)
zer = np.zeros(5000000)

tic = time.process_time()
for i in range(0, 5000000):
    total = 0
    for j in range(0, 5):
        total = total + x[i][j] * m[0][j]
    zer[i] = total
toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")
# Computation time = 28.228 seconds

Using Vectorization

[Figure: dot product of two matrices]

tic = time.process_time()
# dot product
np.dot(x, m.T)
toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")
# Computation time = 0.107 seconds

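np.dot implements vectorized matrix multiplication behind the scenes. Based on the timings above (28.228 seconds versus 0.107 seconds), it is roughly 260 times faster than the Python loop.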
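To confirm that the vectorized result matches the loop, a small sanity check like the following can help. It is only a sketch on a reduced, illustrative row count (1,000 rows rather than 5 million), and it uses the `@` operator, which for 2-D arrays computes the same matrix product as `np.dot`:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the check is repeatable
m = rng.random((1, 5))          # one coefficient per feature
x = rng.random((1000, 5))       # reduced row count for a quick check

# Loop version: accumulate the weighted sum row by row.
looped = np.zeros(x.shape[0])
for i in range(x.shape[0]):
    total = 0.0
    for j in range(x.shape[1]):
        total += x[i, j] * m[0, j]
    looped[i] = total

# Vectorized version: (1000, 5) @ (5, 1) -> (1000, 1), flattened to 1-D.
vectorized = (x @ m.T).ravel()

print(np.allclose(looped, vectorized))
```

`np.allclose` is used rather than exact equality because the two versions may add the floating-point products in a different order.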

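Conclusion

Vectorization in Python is very fast and should be preferred over loops, especially when working with very large datasets.

Start applying it over time, and thinking in terms of vectorized code will gradually become second nature.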
