Stata-Python Integration: Machine Learning in Stata with Support Vector Machines

Author: 吕卓阳 (Xiamen University)
E-mail: lvzy20@163.com

Acknowledgement: this post is adapted from the following article, with thanks.
Source: Chuck Huber, 2020, Stata/Python integration part 7: Machine learning with support vector machines, Stata blog.

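The Stata/Python integration series is based on nine blog posts that Dr. Chuck Huber, Director of Statistical Outreach at StataCorp, published on the official Stata blog. The posts give a fairly systematic introduction to how Stata and Python work together: how to configure the software, how to pass datasets between Stata and Python, how to call Python packages, and how to run machine-learning analyses.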

Part 1: Setting up Stata to use Python
Part 2: Three ways to use Python in Stata
Part 3: How to install Python packages
Part 4: How to use Python packages
Part 5: Three-dimensional surface plots of marginal predictions
Part 6: Working with APIs and JSON data
Part 7: Machine learning with support vector machines
Part 8: Using the Stata Function Interface to copy data from Stata to Python
Part 9: Using the Stata Function Interface to copy data from Python to Stata

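The Chinese adaptations in this series are:

Stata-Python integration 9: Copying Python data into Stata
Stata-Python integration 8: Copying Stata data into Python
Stata-Python integration 7: Machine learning in Stata with support vector machines
Stata-Python integration 6: Working with APIs and JSON data
Stata-Python integration 5: Three-dimensional plots of marginal effects
Stata-Python integration 4: How to use Python packages
Stata-Python integration 3: How to install Python packages
Stata-Python integration 2: Three ways to call Python in Stata
Stata-Python integration 1: Basic setup for using the two together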

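Contents

1. Exploratory data analysis
2. Using cross-validation to select the best SVM model
3. Evaluating the model on the test set
4. Conclusion
5. References
6. Appendix: do-file used in this post
7. Related posts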

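Since version 16, Stata can interact with Python: we can call Python from within Stata and read Stata data in Python, so each tool can borrow the other's strengths. This post shows how to run a machine-learning analysis from Stata, using a support vector machine (SVM) as the example.

A support vector machine (SVM) is a supervised learning method for binary classification. In its basic form it is a linear classifier that finds the maximum-margin hyperplane separating the two classes; kernel functions, such as the polynomial kernel used below, extend it to nonlinear boundaries. We use data from the U.S. National Health and Nutrition Examination Survey (NHANES) and call Python's sklearn module from Stata to classify respondents as diabetic or not.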
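
To make the maximum-margin idea concrete, here is a minimal sketch on six made-up points (not NHANES data). It assumes a linear kernel and a very large C to approximate a hard-margin SVM. The names X_toy, y_toy and clf are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values, not NHANES)
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("support vectors:")
print(clf.support_vectors_)
print("margin width:", round(margin_width, 3))
```

The support vectors are the observations lying on the margin. Only they determine the fitted boundary, which is why the method is named after them.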

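1. Exploratory data analysis

We use age (age) from the demographic data and glycated hemoglobin (HbA1c) as features, and self-reported diabetes status (diabetes) as the outcome. After preprocessing and merging the data (see the do-file in Section 6), the first five observations look like this.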

. list in 1/5

     +------------------------+
     | diabetes   HbA1c   age |
     |------------------------|
  1. |        1       7    62 |
  2. |        0     5.5    53 |
  3. |        1     5.8    78 |
  4. |        0     5.6    56 |
  5. |        0     5.6    42 |
     +------------------------+

Tabulating the outcome shows that 87.49% of respondents have not been told they have diabetes and 12.51% have.

. tabulate diabetes

Doctor told |
   you have |
   diabetes |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,531       87.49       87.49
          1 |        791       12.51      100.00
------------+-----------------------------------
      Total |      6,322      100.00
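
The class shares matter later, because they set the accuracy that a trivial classifier would reach. The following sketch rebuilds the outcome column from the frequencies in the table above (5,531 zeros and 791 ones) and computes the shares with pandas. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the outcome column from the frequencies reported by tabulate
diabetes = pd.Series([0] * 5531 + [1] * 791, name="diabetes")

# Percentage of observations in each class
shares = diabetes.value_counts(normalize=True).sort_index() * 100
print(shares.round(2))

# Accuracy of a rule that always predicts the majority class (0)
baseline_accuracy = shares.loc[0]
print(f"Majority-class baseline: {baseline_accuracy:.2f}%")
```

Any classifier fitted to these data should be judged against this majority-class baseline, not against 50%.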

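Next, we plot the raw data to see whether the features help separate respondents with and without diabetes, using Python's matplotlib module. For setting up Python in Stata, see the Stata documentation. Once it is configured, typing python in the Command window starts a Python block, and end closes it. We plot HbA1c on the y axis and age on the x axis; navy points are respondents without diabetes and dark red points are respondents with diabetes.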

python:
# Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# Read the Stata dataset into Python
data = pd.read_stata('diabetes.dta',
convert_categoricals=False,
preserve_dtypes=True,
convert_missing=False)

# Define the feature matrix (independent variables)
# and the target variable (dependent variable)
X = data[['age','HbA1c']]
y = data['diabetes']

# Plot the raw data
plt.scatter(X['age'], X['HbA1c'],
c=y,
cmap = mcolors.ListedColormap(["navy", "darkred"]))
plt.xlabel('Age (years)')
plt.ylabel('HbA1c')
plt.xticks((12,20,30,40,50,60,70,80))
plt.yticks((4,6,8,10,12,14,16))
plt.title('Diabetes status by Age and HbA1c')
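# Save the graph before showing it; once plt.show() returns,
# the figure may already be closed and savefig would write an empty image
plt.savefig("scatterplot.png")
plt.show()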
end
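
For readers without diabetes.dta at hand, the read step can be tried on a small stand-in file. The sketch below writes the five observations listed earlier to a temporary .dta file with pandas and reads them back with the same pd.read_stata options as above. The file name diabetes_demo.dta is hypothetical.

```python
import os
import tempfile
import pandas as pd

# The five observations shown by "list in 1/5"
df = pd.DataFrame({
    "diabetes": [1, 0, 1, 0, 0],
    "HbA1c": [7.0, 5.5, 5.8, 5.6, 5.6],
    "age": [62, 53, 78, 56, 42],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "diabetes_demo.dta")  # hypothetical file name
    df.to_stata(path, write_index=False)              # write a Stata dataset
    data = pd.read_stata(path,
                         convert_categoricals=False,
                         preserve_dtypes=True,
                         convert_missing=False)

# Feature matrix and target, as in the main example
X_demo = data[["age", "HbA1c"]]
y_demo = data["diabetes"]
print(data.dtypes)
print(X_demo.shape, y_demo.shape)
```

With preserve_dtypes=True the columns keep the storage types recorded in the .dta file instead of being upcast to pandas defaults.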

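The plot shows that respondents with diabetes tend to be older and to have higher HbA1c levels.

2. Using cross-validation to select the best SVM model

Next, we build the model with Python's sklearn module. We first split the data at random into a training set (60%) and a test set (40%).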

python:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
random_state=0)
end
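
With an outcome this imbalanced, a purely random split can leave the two sets with slightly different class shares. One common variation, not used in the original post, is to pass stratify to train_test_split. The sketch below uses a synthetic outcome with the same class counts as the tabulation above and placeholder features; all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and class counts as the tabulation
rng = np.random.default_rng(0)
y_sim = np.array([0] * 5531 + [1] * 791)
X_sim = rng.normal(size=(y_sim.size, 2))  # placeholder features, not NHANES

# stratify=y_sim keeps the class shares (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sim, y_sim, test_size=0.4, random_state=0, stratify=y_sim)

print("train rows:", len(y_tr), "test rows:", len(y_te))
print("share of class 1 (train):", round(y_tr.mean(), 4))
print("share of class 1 (test): ", round(y_te.mean(), 4))
```

Fixing random_state, as the original code does, makes the split reproducible across runs.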

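We then use cross-validation to choose the SVM's tuning parameters. In general these include the kernel function (kernel), the polynomial degree (degree), and the regularization parameter (C); here we fix the kernel to a polynomial and search over degree and C. We do not review the basics of SVMs in this post; see the earlier post listed in the references for a fuller introduction.

We use k-fold cross-validation: the training set is split into k subsets, the SVM is trained on k-1 of them and evaluated on the remaining one, and this is repeated k times so that each subset serves once as the validation fold. The k scores are averaged, and the parameter combination with the best average score is selected. The sklearn code is: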
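
The k-fold procedure just described can be written out by hand. The sketch below does so for a single parameter combination on synthetic data from make_classification (not NHANES) and checks that the manual loop matches sklearn's cross_val_score. The choices k = 5, degree = 2 and C = 1 are arbitrary illustrations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic two-feature classification data (illustrative only)
X_cv, y_cv = make_classification(n_samples=300, n_features=2, n_informative=2,
                                 n_redundant=0, random_state=0)

k = 5
folds = StratifiedKFold(n_splits=k)

# Manual loop: train on k-1 folds, score on the held-out fold
manual_scores = []
for train_idx, valid_idx in folds.split(X_cv, y_cv):
    model = SVC(kernel="poly", degree=2, C=1)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(model.score(X_cv[valid_idx], y_cv[valid_idx]))

# The same computation through sklearn's helper
helper_scores = cross_val_score(SVC(kernel="poly", degree=2, C=1),
                                X_cv, y_cv, cv=folds, scoring="accuracy")

print("manual mean accuracy:   ", np.mean(manual_scores))
print("cross_val_score average:", helper_scores.mean())
```

GridSearchCV repeats this loop for every combination in the parameter grid and keeps the combination with the highest average score.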

python:
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Do a grid search for the parameters "degree" and "C" using 10-fold
# cross-validation
model = svm.SVC(kernel='poly')
parameters = {'degree':[1,2,3], 'C':[1,2,3]}
poly_svc = GridSearchCV(model,
parameters,
cv=10,
scoring='accuracy')

# Display the parameters that yield the best-fitting model
poly_svc.fit(X_train,y_train)
print(poly_svc.best_params_)
end

The output below shows that the best-fitting SVM uses regularization parameter C = 3 and degree = 3, so we use these values for the model evaluated on the test set.

>>> poly_svc.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=SVC(kernel='poly'),
             param_grid={'C': [1, 2, 3], 'degree': [1, 2, 3]},
             scoring='accuracy')
>>> print(poly_svc.best_params_)
{'C': 3, 'degree': 3}
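
Beyond best_params_, the fitted search object stores the cross-validated score of every combination in cv_results_. The sketch below shows how to tabulate them. It also standardizes the features inside a pipeline, a common practice with kernel SVMs that the original post does not use. The data are synthetic, and names such as pipe and search are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature classification data (not NHANES)
X_gs, y_gs = make_classification(n_samples=300, n_features=2, n_informative=2,
                                 n_redundant=0, random_state=0)

# Assumption: scale the features before the SVM (not done in the original post)
pipe = make_pipeline(StandardScaler(), SVC(kernel="poly"))

# Pipeline parameters are addressed as <step name>__<parameter>
grid = {"svc__degree": [1, 2, 3], "svc__C": [1, 2, 3]}
search = GridSearchCV(pipe, grid, cv=10, scoring="accuracy").fit(X_gs, y_gs)

# One row per parameter combination, best average score first
columns = ["param_svc__degree", "param_svc__C",
           "mean_test_score", "std_test_score"]
table = (pd.DataFrame(search.cv_results_)[columns]
         .sort_values("mean_test_score", ascending=False))
print(table.to_string(index=False))
print(search.best_params_)
```

Looking at the whole table shows how close the competing combinations are. When the best value sits on the edge of the grid, as C = 3 and degree = 3 do above, widening the grid may be worth a try.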

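3. Evaluating the model on the test set

With the parameters chosen by cross-validation, we refit the SVM on the training set and report its accuracy on the test set. Note that cross_val_score below runs its own 10-fold cross-validation within the test set, refitting a model with the same parameters in each fold, so the reported figure is a cross-validated accuracy on the test data.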

# Fit the SVM model using the parameters selected from the grid search
poly_svc = svm.SVC(kernel='poly', degree=3, C=3).fit(X_train, y_train)
scores = cross_val_score(poly_svc, X_test, y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

>>> Accuracy: 0.93 (+/- 0.03)
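
An alternative way to evaluate, shown here as a sketch and not as the original author's method, is to score the model fitted on the training set directly on the held-out test set and compare it with a majority-class baseline. The data are synthetic with about 12% positives, so the printed numbers do not correspond to the NHANES results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced synthetic data (about 12% positives), not NHANES
X_ev, y_ev = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                 n_redundant=0, weights=[0.88, 0.12],
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_ev, y_ev, test_size=0.4,
                                          random_state=0)

# Fit on the training set only, then predict the untouched test set
svc = SVC(kernel="poly", degree=3, C=3).fit(X_tr, y_tr)
pred = svc.predict(X_te)
holdout_accuracy = accuracy_score(y_te, pred)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)

print("hold-out accuracy:", round(holdout_accuracy, 3))
print("baseline accuracy:", round(baseline_accuracy, 3))
print(cm)
```

The confusion matrix separates the two kinds of error, which accuracy alone hides when one class is rare.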

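The SVM classifies about 93% of test observations correctly. For context, always predicting "no diabetes" would already be right for roughly 87% of respondents, given the class shares in Section 1. The code that plots the fitted decision boundary is in the last two Python blocks of the do-file in Section 6.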
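
The decision-boundary plot rests on one idea: evaluate the fitted classifier on a dense grid covering the feature space, then color the grid by predicted class. The sketch below shows that grid step on synthetic data (not NHANES) and leaves out the matplotlib call. The names X_db, clf_db and the step size h = 0.1 are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-feature data and a fitted polynomial SVM (illustrative only)
X_db, y_db = make_classification(n_samples=300, n_features=2, n_informative=2,
                                 n_redundant=0, random_state=0)
clf_db = SVC(kernel="poly", degree=3, C=3).fit(X_db, y_db)

# Build a grid that extends one unit beyond the observed feature ranges
h = 0.1
x_min, x_max = X_db[:, 0].min() - 1, X_db[:, 0].max() + 1
y_min, y_max = X_db[:, 1].min() - 1, X_db[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the class at every grid point, then restore the grid shape
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = clf_db.predict(grid_points).reshape(xx.shape)

print("grid shape:", xx.shape)
print("predicted classes on the grid:", np.unique(Z))
# Passing xx, yy and Z to matplotlib's plt.contourf shades the two regions
```

A smaller h gives a smoother boundary at the cost of more predictions, since the number of grid points grows with 1/h squared.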

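4. Conclusion

By calling Python from Stata, we used an SVM to classify respondents as diabetic or non-diabetic, with about 93% accuracy on the test set. Other Python models, such as random forests or logistic regression, can be called in the same way. In short, the Python integration in Stata 16 lets the two programs share functionality and adds a useful tool for study and research.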

5. References

Chuck Huber, 2020, Stata/Python integration part 7: Machine learning with support vector machines, Stata blog.
田原, 连享会 post, 支持向量机：Stata 和 Python 实现 (Support vector machines: implementation in Stata and Python).

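6. Appendix: do-file used in this post

// Machine learning in Stata
// Use a support vector machine to classify diabetes status with NHANES data:
// age comes from the demographics file, HbA1c from the glycohemoglobin file,
// and diabetes status from the diabetes questionnaire file
// Data preprocessing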
clear
cd "/Users/lvzhuoyang/Desktop/第二次推送"
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT", clear
save age.dta, replace
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/GHB_I.XPT", clear
save glucose.dta, replace
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DIQ_I.XPT", clear
save diabetes, replace

merge 1:1 seqn using "glucose.dta"
drop _merge
merge 1:1 seqn using "age.dta"

rename ridageyr age
rename lbxgh HbA1c
rename diq010 diabetes
recode diabetes (1=1)(2/3=0)(9=.)

keep diabetes age HbA1c
drop if missing(diabetes,age,HbA1c)
save diabetes,replace

erase age.dta
erase glucose.dta

// Descriptive statistics
list in 1/5
tabulate diabetes

// Plot the data with Python
python
## import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

## read the stata dataset into python
data = pd.read_stata('diabetes.dta',convert_categoricals=False,preserve_dtypes=True,convert_missing=False)

## define the feature matrix(independent variables)
## and the target variable(dependent variable)
X = data[['age','HbA1c']]
Y = data['diabetes']

## plot the raw data
plt.scatter(X['age'],X['HbA1c'],c=Y,cmap=mcolors.ListedColormap(["navy","darkred"]))
plt.xlabel('Age(years)')
plt.ylabel('HbA1c')
plt.xticks((12,20,30,40,50,60,70,80))
plt.yticks((4,6,8,10,12,14,16))
plt.title('Diabetes status by Age and HbA1c')

## save the graph before showing it
plt.savefig("scatterplot.png")
plt.show()
end


// Split the data into training and testing datasets
python
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.4,random_state=0)
end


// Use cross-validation to choose parameters for the SVM model
python
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

## do a grid search for the parameters "degree" and "C" using 10-fold cross-validation
model = svm.SVC(kernel='poly')
parameters = {'degree':[1,2,3],'C':[1,2,3]}
poly_svc = GridSearchCV(model,parameters,cv=10,scoring='accuracy')

## fit the grid search and display the parameters that yield the best-fitting model
poly_svc.fit(X_train,Y_train)
print(poly_svc.best_params_)
end

// Test the model on the testing dataset
python
poly_svc = svm.SVC(kernel='poly',degree=3,C=3).fit(X_train,Y_train)
scores = cross_val_score(poly_svc,X_test,Y_test,cv=10,scoring='accuracy')
print("Accuracy: %0.2f(+/- %0.2f)" %(scores.mean(),scores.std()*2))
end

// Plot the results of the SVM model
// Use numpy's meshgrid to build the grid for the two-dimensional plot
python
import numpy as np
## create a mesh on which to plot the results of the SVM model
h=0.1
x_min,x_max=X['age'].min()-1,X['age'].max()+1
y_min,y_max=X['HbA1c'].min()-1,X['HbA1c'].max()+1
xx,yy=np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))


## plot the predicted decision boundary
Z=poly_svc.predict(np.c_[xx.ravel(),yy.ravel()])
Z=Z.reshape(xx.shape)
plt.contourf(xx,yy,Z,cmap=mcolors.ListedColormap(["dodgerblue","red"]),alpha=0.8)
plt.savefig("boundaryplot.png")
plt.show()
end

// Scatterplot on the decision boundary
python
## redraw the decision regions, since the previous figure may have been
## closed by plt.show(), then plot the raw data on top
plt.contourf(xx,yy,Z,cmap=mcolors.ListedColormap(["dodgerblue","red"]),alpha=0.8)
plt.scatter(X['age'],X['HbA1c'],c=Y,cmap=mcolors.ListedColormap(["navy","darkred"]))
plt.xlabel('Age(years)')
plt.ylabel('HbA1c')
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.xticks((12,20,30,40,50,60,70,80))
plt.yticks((4,6,8,10,12,14,16))
plt.title('Diabetes status by Age and HbA1c')
## save the graph before showing it
plt.savefig("contourplot.png")
plt.show()
end

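7. Related posts

Note: the list below was generated with the command
lianxh Stata Python +
To install the latest version of the lianxh command:
ssc install lianxh, replace

Topic: Getting started with Stata

Configuring Stata, Python, Julia and R in Jupyter Notebook

Topic: Stata programming

Stata programming: is there an equivalent of Python's zip() function?

Topic: Text analysis and web scraping

VaR (value at risk): implementation in Stata and Python
Support vector machines: implementation in Stata and Python

Topic: Python-R-Matlab

Working across tools: Python and Stata compared
Python + Stata: producing personalized course certificates in batch

Topic: Other

ES (expected shortfall): implementation in Stata and Python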
