前言在生物信息学数据分析中,许多分析软件都是基于R开发的。这里介绍一个可以在Python 中进行基因富集分析的Python 软件 GSEAPY
(Gene Set Enrichment Analysis in Python)
GSEApy is a python wrapper for GESA and Enrichr.
It’s used for convenient GO enrichments and produce publication-quality figures from python.
GSEAPY 安装可以通过conda
或 pip
进行安装
# if you have conda $ conda install -c conda-forge -c bioconda gseapy# or use pip to install the latest release $ pip install gseapy
pip 安装要是遇到这样的报错
data = self.read(amt=amt, decode_content=decode_content) File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 541, in read raise IncompleteRead(self._fp_bytes_read, self.length_remaining) File "/opt/conda/lib/python3.9/contextlib.py", line 135, in __exit__ self.gen.throw(type, value, traceback) File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher raise ReadTimeoutError(self._pool, None, "Read timed out.") pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
可以使用清华镜像,进行安装:
$ pip install gseapy -i https://pypi.tuna.tsinghua.edu.cn/simple
富集分析 背景信息gene set, 指一组具有相同特征的基因。如一个GO term 对应的多个基因,一个kegg pathway对应的多个基因 gene set library,多个相关的gene set 。如所有GO term组成一个gene set library. enrichment analysis, gene set library 作为注释基因集合,已知的先验知识。对于一个输入基因集合,富集分析通过计算分析哪些注释gene set 显著存在于输入基因集合中。例如:GO 富集分析中,查看哪些GO terms 显著存在于输入基因列表中。 有多种基因集富集分析策略,我们常说的GO/KEGG 富集分析 应该大多数指over represent analysis(ORA)。还有个常用富集方法叫GSEA(Gene Set Enrichment Analysis), 翻译过来也是基因集富集分析。下文GSEA,特指这种策略。
ORA测试数据,可以从GSEApy/tests/data下载。富集的函数是enricher
.
先展示一下,富集的代码:
gene_list="./gene_list.txt" gene_sets='KEGG_2016' gene_sets=['KEGG_2016' ,'KEGG_2013' ] enr = gp.enrichr(gene_list=gene_list, gene_sets=gene_sets, organism='Human' , # don't forget to set organism to the one you desired! e.g. Yeast description='kegg' , outdir='test/enrichr' , # no_plot=True, cutoff=0.5 # test dataset, use lower value from range(0,1) )
运行完后,'test/enrichr'
目录下存放着会有富集的图片以及文本。
(base) jovyan@95c3096ad9ae:~$ ll test /enrichr -rw-r--r-- 1 jovyan users 22003 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.pdf
-rw-r--r-- 1 jovyan users 22130 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.txt -rw-r--r-- 1 jovyan users 25722 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.pdf -rw-r--r-- 1 jovyan users 48458 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.txt
查看KEGG_2016.Human.enrichr.reports.pdf,图片只显示了前10个,这是由参数top_term=10,所决定的
同时,富集结果也保存在enr.results
里,如查看前五个数据
enr.results.head(5)
输出
Gene_set Term Overlap P-value Adjusted P-value Old P-value Old Adjusted P-value Odds Ratio Combined Score Genes 0 KEGG_2016 Osteoclast differentiation Homo sapiens hsa04380 28/132 3.104504e-13 7.885440e-11 0 0 6.659625 191.802220 LILRA6;ITGB3;LILRA2;LILRA5;PPP3R1;FCGR3B;SIRPA... 1 KEGG_2016 Tuberculosis Homo sapiens hsa05152 31/178 4.288559e-12 5.446470e-10 0 0 5.224941 136.763196 RAB5B;ITGB2;PPP3R1;HLA-DMA;FCGR3B;HLA-DMB;CASP... 2 KEGG_2016 Phagosome Homo sapiens hsa04145 28/154 1.614009e-11 1.366528e-09 0 0 5.490501 136.437381 ATP6V1A;RAB5B;ITGB5;ITGB3;ITGB2;HLA-DMA;FCGR3B... 3 KEGG_2016 Rheumatoid arthritis Homo sapiens hsa05323 19/90 2.197884e-09 1.395656e-07 0 0 6.554453 130.668081 ATP6V1A;ATP6V1G1;ATP6V0B;TGFB1;ITGB2;FOS;ITGAL... 4 KEGG_2016 Leishmaniasis Homo sapiens hsa05140 17/73 3.132614e-09 1.591368e-07 0 0 7.422186 145.336773 TGFB1;IFNGR1;PRKCB;IFNGR2;ITGB2;FOS;MAPK14;HLA...
查看enricher函数帮助文档
help(gp.enrichr)
Help on function enrichr in module gseapy.enrichr: enrichr(gene_list, gene_sets, organism='human' , description='' , outdir='Enrichr' , background='hsapiens_gene_ensembl' , cutoff=0.05, format='pdf' , figsize=(8, 6), top_term=10, no_plot=False, verbose=False) ...... ......
由帮助文档可知enricher
函数所需参数如下:
gene_list, 所需查询gene_list,可以是一个列表,也可为文件(一列,每行一个基因) gene_sets, gene set library。该参数,有两种形式: 可以设置enricher自带的gene set library 详细列表可见https://maayanlab.cloud/Enrichr/#libraries。可单个'KEGG_2016',或多个['KEGG_2016','KEGG_2013'] 一种自定义gene set library。可以是gmt文件,或者输入一个字典 gene_sets={'term_A' :['gene1' , 'gene2' ,...], 'term_B' :['gene2' , 'gene4' ,...], ...}
organism,支持(human, mouse, yeast, fly, fish, worm), 自定义gene_set 则无影响。 format, 输出图片格式('pdf','png','eps'...) figsize, 图片大小, (width,height). Default: (6.5,6). 绘图gseapy 也提供了绘图函数进行绘制
# simple plotting function from gseapy.plot import barplot, dotplot# to save your figure, make sure that ``ofname`` is not None
barplot(enr.res2d, title='KEGG_2013' ,)
image.png enr.res2d
存储着最近一次查询富集的结果, 上面的例子中, enr.res2d
储存的是'KEGG_2013']富集结果,因为它是list最后一个.
gene_sets=['KEGG_2016' ,'KEGG_2013' ]
enr.results
保存着所有的富集结果,所以我么也可以挑选数据可视化
barplot(enr.results.loc[enr.results["Gene_set" ] == "KEGG_2016" ,], title='KEGG_2016' ,)
image.png 气泡图也是有的;
image.png GSEA PrerankPrerank
用于已经排好序的数据来做GSEA。如,根据logFC 从大到小排好序后,去做GSEA。
# gsea_data.gsea_data.rnk 是已经排好序的数据 rnk = pd.read_csv("./gsea_data.gsea_data.rnk" , header=None, sep="\t" ) rnk.head()
0 1 CTLA2B 2.502482 SCARA3 2.095578 LOC100044683 1.116398
pre_res = gp.prerank(rnk=rnk, gene_sets='KEGG_2016' , processes=4, outdir='test/prerank_report_kegg' , format='png' , seed=6)
pre_res.res2d.head()
image.png 绘图
from gseapy.plot import gseaplot terms = pre_res.res2d.index# to save your figure, make sure that ofname is not None gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])
image.png 未完待续...
参考https://gseapy.readthedocs.io/en/latest/introduction.html