机器学习预测作物产量模型 Flask 部署详细教程（附python代码演练）

介绍

作物产量预测是农业中重要的预测分析技术。这是一种农业实践，可以帮助农民和农业企业预测特定季节的作物产量、何时种植作物、何时收获以获得更高的作物产量。预测分析是一种强大的工具，可以帮助改善农业决策。它可用于作物产量预测、风险缓解、降低化肥成本等。

本文将使用机器学习进行作物产量预测，对天气条件、土壤质量、果实质量等进行分析，并使用 flask 部署。

学习目标

我们将简要介绍使用授粉模拟模型预测作物产量的端到端项目。
我们将跟踪数据科学项目生命周期的每个步骤，包括数据探索、预处理、建模、评估和部署。
最后，我们将使用 Flask API 将模型部署在 render 云服务平台上。

那么让我们开始讨论这个令人兴奋的现实世界问题。

项目介绍

该项目使用的数据集是使用空间显式模拟计算模型生成的，用于分析和研究影响野生蓝莓预测的各种因素，包括：

厂区空间布局
异交和自花授粉
蜂种成分
天气条件（单独和综合）影响农业生态系统中野生蓝莓的授粉效率和产量。

该模拟模型经过近 30 年在美国缅因州和加拿大沿海地区收集的现场观察和实验数据的验证，现已成为野生蓝莓产量预测的假设检验和估计的有用工具。这些模拟数据为研究人员提供了从田间收集的实际数据，用于作物产量预测的各种实验，并为开发人员和数据科学家提供了构建用于作物产量预测的真实机器学习模型的数据。

模拟野生蓝莓田

什么是授粉模拟模型？

授粉仿真建模是利用计算机模型模拟授粉过程的过程。授粉模拟有多种用例，例如：

研究不同因素对授粉的影响，例如气候变化、栖息地丧失和农药
设计有利于授粉的景观
预测授粉对作物产量的影响

授粉模拟模型可用于研究花粉粒在花朵之间的运动、授粉事件的时间以及不同授粉策略的有效性。这些信息可用于提高授粉率和作物产量，从而进一步帮助农民有效生产作物并获得最佳产量。

授粉模拟模型仍在开发中，但它们有可能在未来的农业中发挥重要作用。通过了解授粉的工作原理，我们可以更好地保护和管理这一重要过程。

在我们的项目中，我们将使用具有各种特征的数据集，例如 “ clonesize ”、“ honeybee ”、“ RainingDays ”、“ AverageRainingDays ” 等，这些特征是使用授粉模拟过程创建的，以估计作物产量。

问题陈述

在这个项目中，我们的任务是通过每天的任务，根据其他 17 个特征逐步对收益变量（目标特征）进行分类。评估指标将采用 RMSE 进行评分。我们将使用 Python 的 Flask 框架在基于云的平台上部署模型。

先决条件

该项目非常适合数据科学和机器学习的中级学习者构建他们的组合项目。该领域的初学者如果熟悉以下技能，就可以开始这个项目：

了解 Python 编程语言以及使用 scikit-learn 库的机器学习算法
对使用 Python 的 Flask 框架进行网站开发有基本的了解
了解回归评估指标

数据说明

在本节中，我们将查看项目数据集的每个变量。

Clonesize — m2 — 田间蓝莓克隆的平均大小
Honeybee — bees/m2/min — 田间蜜蜂密度
Bumbles — bees/m2/min — 田间的大黄蜂密度
Andrena — bees/m2/min — 安德雷纳 (Andrena) 田间蜜蜂密度
Osmia — bees/m2/min — 田间的 Osmia 蜜蜂密度
MaxOfUpperTRange — ℃ — 花季最高气温记录
MinOfUpperTRange — ℃ — 每日最高气温最低记录
AverageOfUpperTRange — ℃ — 每日最高气温平均值
MaxOfLowerTRange — ℃ — 低带日气温最高记录
MinOfLowerTRange — ℃ — 低带日气温最低记录
AverageOfLowerTRange — ℃ — 低带日气温平均值
RainingDays — Day — 花季期间降水量大于零的总天数
AverageRainingDays — Day — 整个花季的平均下雨天数
**Fruitset ** — 坐果期的过渡时间
Fruitmass — 水果组的质量
**Seeds ** — 果实中的种子数量
**Yield ** — 作物产量（目标变量）

这些数据对于作物预测用例有什么价值？

该数据集提供了有关野生蓝莓植物空间特征、蜜蜂种类和天气情况的实用信息。因此，它使研究人员和开发人员能够构建机器学习模型来早期预测蓝莓产量。
对于拥有现场观察数据但希望通过将真实数据的使用与计算机模拟生成的数据作为作物产量预测的输入进行比较，来测试和评估不同机器学习算法的性能的其他研究人员来说，该数据集可能至关重要。
不同级别的教育工作者可以使用该数据集来训练农业行业中的机器学习分类或回归问题。

加载数据集

在本节中，我们将在你正在使用的任何环境中加载数据集。在 kaggle 环境中加载数据集。使用 kaggle 数据集或将其下载到本地计算机并在本地环境中运行。

数据集来源：https://github.com/avikumart/Wild-Blue-Berry-Yield-Prediction-Flask-app-deploy/blob/main/WildBlueberryPollinationSimulationData.csv

让我们看一下加载数据集和加载项目库的代码。

import numpy as np # linear algebraimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)import matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.feature_selection import mutual_info_regression, SelectKBestfrom sklearn.cluster import KMeansfrom sklearn.preprocessing import StandardScaler, MinMaxScalerfrom sklearn.model_selection import train_test_split, cross_val_score, KFold from sklearn.model_selection import GridSearchCV, RepeatedKFoldfrom sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor from sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_error, r2_score, mean_absolute_errorimport sklearnfrom sklearn.pipeline import Pipelinefrom sklearn.model_selection import GridSearchCVimport statsmodels.api as smfrom xgboost import XGBRegressorimport shap
# setting up os env in kaggle import os


    
for dirname, _, filenames in os.walk('/kaggle/input'):    for filename in filenames:        print(os.path.join(dirname, filename))
# read the csv file and load first 5 rows in the platform df = pd.read_csv("/kaggle/input/wildblueberrydatasetpollinationsimulation/WildBlueberryPollinationSimulationData.csv",                  index_col='Row#')df.head()

上述代码的输出

# print the metadata of the datasetdf.info()
# data descriptiondf.describe()

上述代码的输出

代码的输出

**“df.info()” 提供了数据帧的摘要，其中包括行数、空值数量、每个变量的数据类型等，而 “df.describe ()”** 提供了数据集的描述性统计信息，如平均值、数据集中每个变量的中位数、计数和百分位数。

探索性数据分析

在本节中，我们将研究农作物数据集的探索性数据分析，并从数据集中得出见解。

数据集的热图

# create featureset and target variable from the datasetfeatures_df = df.drop('yield', axis=1)tar = df['yield']
# plot the heatmap from the datasetplt.figure(figsize=(15,15))sns.heatmap(df.corr(), annot=Truev, vmin=-1, vmax=1)plt.show()

上图显示了数据集相关系数的可视化。使用 Python 的 seaborn 库，我们只需 3 行代码就可以将其可视化。

目标变量的分布

# plot the boxplot using seaborn library of the target variable 'yield'plt.figure(figsize=(5,5))sns.boxplot(x='yield', data=df)plt.show()

上面的代码使用箱线图显示目标变量的分布。我们可以看到分布的中位数约为 6000，有几个收益率最低的异常值。

按数据集的分类特征进行分布

# matplotlib subplot for the categorical feature nominal_df = df[['MaxOfUpperTRange','MinOfUpperTRange','AverageOfUpperTRange','MaxOfLowerTRange',               'MinOfLowerTRange','AverageOfLowerTRange','RainingDays','AverageRainingDays']]
fig, ax = plt.subplots(2,4, figsize=(20,13))for e, col in enumerate(nominal_df.columns):    if e<=3:        sns.boxplot(data=df, x=col, y='yield', ax=ax[0,e])    else:        sns.boxplot(data=df, x=col, y='yield', ax=ax[1,e-4])       plt.show()

数据集中蜜蜂类型的分布

# matplotlib subplot technique to plot distribution of bees in our datasetplt.figure(figsize=(15,10))plt.subplot(2,3,1)plt.hist(df['bumbles'])plt.title("Histogram of bumbles column")plt.subplot(2,3,2)plt.hist(df['andrena'])plt.title("Histogram of andrena column")plt.subplot(2,3,3)


    
plt.hist(df['osmia'])plt.title("Histogram of osmia column")plt.subplot(2,3,4)plt.hist(df['clonesize'])plt.title("Histogram of clonesize column")plt.subplot(2,3,5)plt.hist(df['honeybee'])plt.title("Histogram of honeybee column")plt.show()

让我们记下分析中的一些观察结果：

上部和下部 T 范围列相互关联
雨天数和平均雨天数相互关联
“Fruitmass”、“fruitset” 和 “seeds” 相关
' bumbles ' 列高度不平衡，而 ' andrena ' 和 ' osmia ' 列则不然
与 “ clonesize ” 相比，“Honeybee” 也是一个不平衡的列

数据预处理和数据准备

在本节中，我们将对数据集进行预处理以进行建模。我们将执行 “互信息回归” 以从数据集中选择最佳特征，我们将对数据集中的蜜蜂类型进行聚类，并对数据集进行标准化以实现高效的机器学习建模。

互信息回归

# run the MI scores of the datasetmi_score = mutual_info_regression(features_df, tar, n_neighbors=3,random_state=42)mi_score_df = pd.DataFrame({'columns':features_df.columns, 'MI_score':mi_score})mi_score_df.sort_values(by='MI_score', ascending=False)

上面的代码使用皮尔逊系数计算互回归，以找到与目标变量最相关的特征。我们可以按降序查看最相关的特征以及与目标特征最相关的特征。现在我们将对蜜蜂的类型进行聚类以创建一个新特征。

使用 K 均值聚类

# clustering using kmeans algorithmX_clus = features_df[['honeybee','osmia','bumbles','andrena']]
# standardize the dataset using standard scalerscaler = StandardScaler()scaler.fit(X_clus)X_new_clus = scaler.transform(X_clus)
# K means clustering clustering = KMeans(n_clusters=3


    
, random_state=42)clustering.fit(X_new_clus)n_cluster = clustering.labels_
# add new feature to feature_Df features_df['n_cluster'] = n_clusterdf['n_cluster'] = n_clusterfeatures_df['n_cluster'].value_counts()
---------------------------------[Output]----------------------------------1    3680    2132    196Name: n_cluster, dtype: int64

上面的代码标准化了数据集，然后应用聚类算法将行分为 3 个不同的组。

使用 MinMaxScaler 进行数据标准化

features_set = ['AverageRainingDays','clonesize','AverageOfLowerTRange',               'AverageOfUpperTRange','honeybee','osmia','bumbles','andrena','n_cluster']
# final dataframe  X = features_df[features_set]y = tar.round(1)
# train and test dataset to build baseline model using GBT and RFs by scaling the datasetmx_scaler = MinMaxScaler()X_scaled = pd.DataFrame(mx_scaler.fit_transform(X))X_scaled.columns = X.columns

上面的代码表示标准化特征集 “ X_scaled ” 和将用于建模的目标变量 “ y ”。

建模与评估

在本节中，我们将了解使用梯度增强建模和超参数调整的机器学习建模，以获得所需的模型精度和性能。

另外，使用 statsmodels 库和形状模型解释器查看普通最小二乘回归模型，以可视化哪些特征对于我们的目标作物产量预测最重要。

机器学习建模基线

# let's fit the data to the models lie adaboost, gradientboost and random forestmodel_dict = {"abr": AdaBoostRegressor(),               "gbr": GradientBoostingRegressor(),               "rfr": RandomForestRegressor()             }


    

# Cross value scores of the modelsfor key, val in model_dict.items():    print(f"cross validation for {key}")    score = cross_val_score(val, X_scaled, y, cv=5, scoring='neg_mean_squared_error')    mean_score = -np.sum(score)/5    sqrt_score = np.sqrt(mean_score)     print(sqrt_score)
-----------------------------------[Output]------------------------------------cross validation for abr730.974385377955cross validation for gbr528.1673164806733cross validation for rfr608.0681265123212

在上面的机器学习建模中，我们在梯度提升回归器上得到了最低的均方误差，而在 Adaboost 回归器上得到了最高的误差。现在，我们将使用 scikit-learn 训练训练梯度提升模型，并评估误差，测试 split 方法。

# split the train and test dataX_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# gradient boosting regressor modelingbgt = GradientBoostingRegressor(random_state=42)bgt.fit(X_train,y_train)preds = bgt.predict(X_test)score = bgt.score(X_train,y_train)rmse_score = np.sqrt(mean_squared_error(y_test, preds))r2_score = r2_score(y_test, preds)print("RMSE score gradient boosting machine:", rmse_score)      print("R2 score for the model: ", r2_score)
-----------------------------[Output]-------------------------------------------RMSE score gradient boosting machine: 363.18286194620714R2 score for the model:  0.9321362721127562

在这里，我们可以看到，在没有对模型进行超参数调整的情况下，梯度增强建模的 RMSE 得分约为 363。而模型的 R2 约为 93%，这是比基线精度更好的模型精度。接下来，调整超参数以优化机器学习模型的准确性。

超参数调优

# K-fold split the dataset


    
kf = KFold(n_splits = 5, shuffle=True, random_state=0)
# params grid for tuning the hyperparametersparam_grid = {'n_estimators': [100,200,400,500,800],             'learning_rate': [0.1,0.05,0.3,0.7],             'min_samples_split': [2,4],             'min_samples_leaf': [0.1,0.4],             'max_depth': [3,4,7]             }
# GBR estimator object estimator = GradientBoostingRegressor(random_state=42)
# Grid search CV object clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=kf,                    scoring='neg_mean_squared_error', n_jobs=-1)clf.fit(X_scaled,y)
# print the best the estimator and paramsbest_estim = clf.best_estimator_best_score = clf.best_score_best_param = clf.best_params_print("Best Estimator:", best_estim)print("Best score:", np.sqrt(-best_score))
-----------------------------------[Output]----------------------------------Best Estimator: GradientBoostingRegressor(max_depth=7, min_samples_leaf=0.1,                                           n_estimators=500, random_state=42)Best score: 306.57274619213206

我们可以看到，调整后的梯度提升模型的误差比之前的模型进一步减少，并且我们还优化了 ML 模型的参数。

shaply 模型解释器

机器学习的可解释性是当今机器学习建模的一个非常重要的方面。虽然机器学习模型在许多领域都给出了有希望的结果，但其固有的复杂性使得理解它们如何得出某些预测或决策具有挑战性。

Shap 库使用 “ shaply ” 值来衡量哪些特征对预测目标值有影响。现在让我们看一下梯度增强模型的 “ shap ” 模型解释图。

# shaply tree explainershap_tree = shap.TreeExplainer(bgt)


    
shap_values = shap_tree.shap_values(X_test)shap.summary_plot(shap_values, X_test)

上述代码的输出

在上面的输出图中，很明显，AverageRainingDays 是解释目标变量预测值最有影响力的变量。而 andrena 特征对预测变量的结果影响最小。

使用 FlaskAPI 部署模型

在本节中，我们将使用 FlaskAPI 将机器学习模型部署在名为 render.com 的云服务平台上。在部署之前，需要使用 joblib 扩展名保存模型文件，以便创建可以部署在云端的 API。

保存模型文件

# remove the 'n_cluster' feature from the datasetX_train_n = X_train.drop('n_cluster', axis=1)X_test_n = X_test.drop('n_cluster', axis=1)
# train a model for flask API creation =xgb_model = XGBRegressor(max_depth=9, min_child_weight=7, subsample=1.0)xgb_model.fit(X_train_n, y_train)pr = xgb_model.predict(X_test_n)err = mean_absolute_error(y_test, pr)rmse_n = np.sqrt(mean_squared_error(y_test, pr))
# after training, save the model using joblib libraryjoblib.dump(xgb_model, 'wbb_xgb_model2.joblib')

正如你所看到的，我们在上面的代码中保存了模型文件，以及我们将如何编写 Flask 应用程序文件和模型文件以上传到 github 存储库。

应用程序存储库结构

应用程序存储库的屏幕截图

上图是应用程序存储库的快照，其中包含以下文件和目录。

app.py — Flask 应用程序文件
model.py — 模型预测文件
requirements.txt — 应用程序依赖项
**Model directory ** — 保存的模型文件
templates 目录 — 前端 UI 文件

app.py 文件

from flask import Flask, render_template, Responsefrom flask_restful import reqparse, Apiimport flask
import numpy as np


    
import pandas as pdimport ast
import osimport json
from model import predict_yield
curr_path = os.path.dirname(os.path.realpath(__file__))
feature_cols = ['AverageRainingDays', 'clonesize', 'AverageOfLowerTRange',    'AverageOfUpperTRange', 'honeybee', 'osmia', 'bumbles', 'andrena']
context_dict = {    'feats': feature_cols,    'zip': zip,    'range': range,    'len': len,    'list': list,}
app = Flask(__name__)api = Api(app)
# # FOR FORM PARSINGparser = reqparse.RequestParser()parser.add_argument('list', type=list)
@app.route('/api/predict', methods=['GET','POST'])def api_predict():    data = flask.request.form.get('single input')        # converts json to int     i = ast.literal_eval(data)        y_pred = predict_yield(np.array(i).reshape(1,-1))        return {


    
'message':"success", "pred":json.dumps(int(y_pred))}
@app.route('/')def index():        # render the index.html templete        return render_template("index.html", **context_dict)
@app.route('/predict', methods=['POST'])def predict():    # flask.request.form.keys() will print all the input from form    test_data = []    for val in flask.request.form.values():        test_data.append(float(val))    test_data = np.array(test_data).reshape(1,-1)
    y_pred = predict_yield(test_data)    context_dict['pred']= y_pred
    print(y_pred)
    return render_template('index.html', **context_dict)
if __name__ == "__main__":    app.run()

上面的代码是 Python 文件，它接受用户的输入并在前端打印作物产量预测。

Model.py 文件

import joblib import pandas as pdimport numpy as npimport os
# load the model filecurr_path = os.path.dirname(os.path.realpath(__file__))xgb_model = joblib.load(curr_path + "/model/wbb_xgb_model2.joblib")



    
# function to predict the yielddef predict_yield(attributes: np.ndarray):    """ Returns Blueberry Yield value"""    # print(attributes.shape) # (1,8)
    pred = xgb_model.predict(attributes)    print("Yield predicted")
    return pred[0]

Model.py 文件在运行时加载模型并给出预测的输出。

渲染部署

将所有文件推送到 github 存储库后，你只需在 render.com 上创建一个帐户即可推送包含 app.py 文件以及其他工件的存储库分支。

然后只需简单地推送即可在几秒钟内部署。此外，渲染还提供自动部署选项，确保对部署文件进行的任何更改都会自动反映在网站上。

github 存储库：https://github.com/avikumart/Wild-Blue-Berry-Yield-Prediction-Flask-app-deploy/tree/main

结论

在本文中，我们了解了一个使用机器学习算法预测野生蓝莓产量并使用 FlaskAPI 进行部署的端到端项目。我们开始加载数据集，然后是 EDA、数据预处理、机器学习建模以及云服务平台上的部署。

结果表明，该模型能够以高达 93% 的 R2 预测作物产量。Flask API 可以轻松访问模型并使用它进行预测。它使广泛的用户可以使用它，包括农民、研究人员和政策制定者。现在让我们看看从本文中吸取的一些教训。

我们学习了如何定义项目的问题陈述并执行端到端的 ML 项目管道。
我们了解了探索性数据分析和建模数据集的预处理
最后，我们将机器学习算法应用于我们的特征集，以部署预测模型

常见问题

Q1. 什么是使用机器学习预测作物产量？

农民和农业产业可以利用作物产量预测（一种机器学习应用程序）来准确预测和预测给定年份或季节的特定作物产量。这使他们能够为收获季节做好准备并有效管理相关成本。

Q2。农民和农业行业在智慧农业中使用哪些算法？

在智慧农业中，根据应用场景采用不同的算法。其中一些算法包括决策树回归器、随机森林回归器、梯度提升回归器、深度神经网络等。

Q3。如何在农业中使用人工智能和机器学习？

使用 AI 和 ML 预测作物产量，并预测一个季节收获的估计成本。人工智能算法有助于检测农作物病害和植物分类，以实现农作物的顺利分类和分配。

Q4。产量预测的参数是什么？

温度、昆虫组成、作物高度、土壤位置等参数以及降雨量和湿度等各种天气参数可用于预测作物产量。

Q5. 农作物产量预测项目的目标是什么？

帮助农民和农业产业发展并估算作物产量。另一个目标是帮助政府机构决定农作物产出的价格，并采取适当的措施储存和分配农作物产量。

✄-----------------------------------------------

看到这里，说明你喜欢这篇文章，请点击「在看」或顺手「转发」「点赞」。

欢迎微信搜索「panchuangxx」，添加小编磐小小仙微信，每日朋友圈更新一篇高质量推文（无广告），为您提供更多精彩内容。

▼ ▼ 扫描二维码添加小编 ▼ ▼