考虑到咱们生信技能树粉丝对单细胞数据挖掘的需求,我开通了一个专栏《100个单细胞转录组数据降维聚类分群图表复现 》,也亲自示范了几个,不过自己带娃,读博,时间精力有限,所以把剩余的90多个任务安排了学徒,实习生,学员。真的是太棒了 ,群策群力!
值得注意的是本次投稿并不是基于R语言,而是python哦!
作者:xjtu申小忱
这次我们来复现一篇单细胞的文章。 这篇我们只来复现细胞图谱和拟时序分析 像细胞通讯,还有富集分析还是很简单的。 大家可以继续走下去,然后我们来交流讨论! 这篇全篇基于python复现。
首先介绍一下scanpy 网上有很多推文,但是大多是抄的教程,没有自己的理解。我用一句话介绍一下scanpy =pandas+dic。大家只要记住这个就可以完美驾驭!因为平时做深度学习 最常用的库就是 torch+numpy+pandas 所以无论做什么掌握pandas都是关键的! 导库 (如果你还不会安装python的模块,需要自己学一下基础语法哦)import pandas as pd import numpy as npimport pandas as pdimport scanpy as sc import numpy as npimport pandas as pdimport matplotlib.pyplot as plfrom matplotlib import rcParamsimport scanpy as sc
这里说明一下,因为如果大家要发文章肯定需要dpi要求。所以这里可以提前设置好。相等于全局变量。sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) #sc.logging.print_versions() #results_file = './write/pa.h5ad' sc.settings.set_figure_params(dpi=300 , frameon=False , figsize=(3 , 3 ), facecolor='white' )
读入数据 为3个txt文件。大概每个1个G左右 分为 乳腺癌 阳性淋巴结 阴性淋巴结 3个文件raw_UMIcounts = pd.read_table("./GSM4798908_B2019-1.expression_matrix.txt" , header=0 , index_col=0 )#raw_UMIcounts.head(10)
raw_UMIcounts_pl = pd.read_table("./GSM4798909_B2019-2.expression_matrix.txt" , header=0 , index_col=0 )
raw_UMIcounts_nl= pd.read_table("./GSM4798910_B2019-3.expression_matrix.txt" , header=0 , index_col=0 )
pi_gene_bar= raw_UMIcounts.columns
pi_gene_bar=pd.Series(pi_gene_bar)+"_primary"
raw_UMIcounts.columns= pi_gene_bar
raw_UMIcounts_pl.columns = pd.Series(raw_UMIcounts_pl.columns)+"_positive"
raw_UMIcounts_nl.columns = pd.Series(raw_UMIcounts_nl.columns)+"_negtive"
因为我发现作者就是简单合并的所以这里 就是简单合并 相当于R里面的merge(我会类比R让大家看懂!)counts_join=raw_UMIcounts.join(raw_UMIcounts_pl,how="inner" )
counts_join= counts_join.join(raw_UMIcounts_nl,how="inner" )
print(11323 +11356 +11467 )
一共是34146个细胞!
34146
counts_join.T.to_csv("./counts_join.csv" )
这一步就相当于构建了seurat对象!对等理解!adata = sc.read_csv("./counts_join.csv" , first_column_names=True )
adata.write('./join_adata.h5ad' )
保存成h5ad 文件 这样的话下次读取会很快adata=sc.read_h5ad("./join_adata.h5ad" )
看看前20个表达量高的基因sc.pl.highest_expr_genes(adata, n_top=20 , )
下面进行质控 所有内容和在R里是一样的!图片也是一样的!
sc.pp.filter_cells(adata, min_genes=200 ) sc.pp.filter_genes(adata, min_cells=5 )
adata.var['mt' ] = adata.var_names.str.startswith('MT-' ) # annotate the group of mitochondrial genes as 'mt' sc.pp.calculate_qc_metrics(adata, qc_vars=['mt' ], percent_top=None , log1p=False , inplace=True )
sc.pl.violin(adata, ['n_genes_by_counts' , 'total_counts' , 'pct_counts_mt' ], jitter=0.4 , multi_panel=True )
sc.pl.scatter(adata, x='total_counts' , y='pct_counts_mt' ) sc.pl.scatter(adata, x='total_counts' , y='n_genes_by_counts' )
adata = adata[adata.obs.n_genes_by_counts <3000 , :] adata = adata[adata.obs.pct_counts_mt <20 , :]
sc.pl.violin(adata, ['n_genes_by_counts' , 'total_counts' , 'pct_counts_mt' ], jitter=0.4 , multi_panel=True )
这里选择20这个阈值过滤线粒体,虽然有些高,但是原文作者是50过滤,我觉得较为离谱,所以这里选择了比作者小的20.a=[]for i in adata.obs.index: a.append(i[13 :])
from collections import Counter
count = Counter(a) print(count)
Counter({'positive': 11147, 'negtive': 10036, 'primary': 9621})
这是过滤之后三种样本剩余细胞情况,总体细胞和原作者一致,各个类型还是有一定差异的!adata.obs["disease_group" ]=a
Trying to set attribute `.obs` of view, copying.
sc.pp.normalize_total(adata, target_sum=1e4 )
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125 , max_mean=3 , min_disp=0.5 )
sc.pl.highly_variable_genes(adata)
这里是筛选高变基因流程和seurat 也是一样的!adata = adata[:, adata.var.highly_variable]
Counter(adata.var.highly_variable)
Counter({True: 3907})
sc.pp.regress_out(adata, ['total_counts' , 'pct_counts_mt' ])
C:\Users\shenxiaochen\anaconda3\lib\site-packages\anndata\_core\anndata.py:1228: ImplicitModificationWarning: Initializing view as actual. warnings.warn( Trying to set attribute `.obs` of view, copying. ... storing 'disease_group' as categorical
sc.pp.scale(adata, max_value=10 )
标准化数据用于下一步分析 同:seuratsc.tl.pca(adata, svd_solver='arpack' )
sc.pl.pca_variance_ratio(adata, log=True )
pca 降维结果。
results_file= './first_ana.h5ad' adata.write(results_file)
adata=sc.read_h5ad("./first_ana.h5ad" )
这里我们选择20个pcsc.pp.neighbors(adata, n_neighbors=15 , n_pcs=20 )
computing neighbors using 'X_pca' with = 20 finished: added to `.uns['neighbors']` `.obsp['distances']`, distances for each pair of neighbors `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
sc.tl.umap(adata,min_dist=0.3 )
computing UMAP finished: added 'X_umap', UMAP coordinates (adata.obsm) (0:00:14)
adata
AnnData object with n_obs × n_vars = 30804 × 3907 obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'disease_group' var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std' uns: 'hvg', 'pca', 'neighbors', 'umap' obsm: 'X_pca', 'X_umap' varm: 'PCs' obsp: 'distances', 'connectivities'
sc.pl.umap(adata, color="disease_group" ,)
画一下不同疾病组别的umap的图!和原文保持一致!import scanpy as scimport pandas as pdfrom matplotlib.pyplot import rc_context
with rc_context({'figure.figsize' : (5 , 5 )}): sc.pl.umap(adata, color='disease_group' , add_outline=True , legend_fontsize=12 , legend_fontoutline=2 ,frameon=False , title='clustering of cells' , palette='Paired' )
可以看下图,原文不同疾病分组的umap和我复现的是一致的!只是他是躺着的。。。sc.tl.leiden(adata)
running Leiden clustering finished: found 23 clusters and added 'leiden', the cluster labels (adata.obs, categorical) (0:00:04)
这里有两种分群算法,一种是leiden 一种是louvain. python 里默认的是leiden R seurat 是 louvain。看过原文,作者说 leiden 是 louvain 的进化版。所以建议大家还是用leiden。 这里的参数同样和 seurat 一样也是resolution python里默认的是1.原文作者也是1.所以我就没写,大家平时更换不同的参数。。因为原文是基于seurat的, 所以我也是选择了原文的 louvain.sc.tl.louvain(adata)
running Louvain clustering using the "louvain" package of Traag (2017) finished: found 20 clusters and added 'louvain', the cluster labels (adata.obs, categorical) (0:00:05)
#with rc_context({'figure.figsize': (10, 5)}): sc.pl.umap(adata, color=['leiden' ,'louvain' ], add_outline=True ,legend_loc='on data' , legend_fontsize=12 , legend_fontoutline=2 ,frameon=False , title=['clustering of cells' ,'clustering of cells' ], palette='Set1' )
左边是 leiden 右边是 louvain 大家可以对比一下sc.tl.rank_genes_groups(adata, 'louvain' , method='wilcoxon' )#sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes finished: added to `.uns['rank_genes_groups']` 'names', sorted np.recarray to be indexed by group ids 'scores', sorted np.recarray to be indexed by group ids 'logfoldchanges', sorted np.recarray to be indexed by group ids 'pvals', sorted np.recarray to be indexed by group ids 'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:17) C:\Users\shenxiaochen\anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2(
这一步相当于 seurat 里面的find_all_markersresult = adata.uns['rank_genes_groups' ] groups = result['names' ].dtype.names marker=pd.DataFrame( {group + '_' + key[:1 ]: result[key][group] for group in groups for key in ['names' , 'pvals' ]})
marker.to_csv("./marker_louvain.csv" )
下面进行分群注释!cluster2annotation = { '0' :'Bcells' , '1' :'myofibroblasts' , '2' : 'NaiveT' , '3' :'NaiveT' , '4' :'cancercells' , '5' : 'Bcells' , '6' :'NaiveT' , '7' : 'NaiveT' , '8' :'myofibroblasts' , '9' :'CD8Effector' , '10' :'myeloid' , '11' :'Bcells' , '12' : 'endothelial' , '13' :'Macrophages' , '14' : 'cancerstemcells' , '15' :'myeloid' , '16' :'CXCL14cancer' , '17' :'plasma' , '18' :'matureDC' , '19' :'pericytes' }# add a new `.obs` column called `cell type` by mapping clusters to annotation using pandas `map` function adata.obs['cell type' ] = adata.obs['louvain' ].map(cluster2annotation).astype('category' )
根据作者文中提到的一些marker 和 cellmarker ,PanglaoDB 数据库进行注释!这两个数据库是我经常光顾的。。原因就是简单!!!
marker_genes_dict = { 'cancercells' : ['KRT19' ], 'cancerstemcells' : ['KRT19' ,'TOP2A' ], 'CXCL14cancer' : ['CXCL14' ], #'NaiveT': [], #'CD8Effector': [], 'Bcells' : ['CD79A' ,'CD79B' ], 'Macrophages' : ['LYZ' ,'IL1B' ], 'myeloid' :['LYZ' ], 'matureDC' :['LAMP3' ,'CCR7' ], 'plasma' :['JCHAIN' ,'IGHG3' ,'MZB1' ], 'endothelial' :['MCAM' ,'PECAM1' ], 'pericytes' :['ACTA2' ,'TAGLN' ,'MCAM' ], 'myofibroblasts' :['LUM' ,'DCN' ,'TAGLN' ], 'cyclingcells' :['TOP2A' ] }
sc.pl.dotplot(adata, marker_genes_dict, 'louvain' , dendrogram=True )
WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently. using 'X_pca' with n_pcs = 50 Storing dendrogram info using `.uns['dendrogram_louvain']` WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.
画几张marker 基因的图!这些用R都是可以实现。。我曾经无聊的时候写过一写复现的脚本。。后来发现github上也有。。。。png ax = sc.pl.stacked_violin(adata, marker_genes_dict, groupby='louvain' , swap_axes=False , dendrogram=True ,cmap='Paired_r' )
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.
sc.pl.tracksplot(adata, marker_genes_dict, groupby='louvain' , swap_axes=False , dendrogram=True ,cmap='Paired_r' )
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: cancercells, cancerstemcells, CXCL14cancer, etc.
adata.var_names
Index(['CCND1', 'ACTA2', 'POSTN', 'GADD45G', 'PSMA7', 'PDCD11', 'IGFBP7', 'ING5', 'RPL32P3', 'AC241952.1', ... 'AC125807.1', 'CD1A', 'H2AFZP3', 'IGLV1-44', 'CLEC17A', 'EEF1A1P9', 'AC073111.2', 'AL355076.2', 'AC103563.7', 'IGHV5-78'], dtype='object', length=3907)
with rc_context({'figure.figsize' : (4.5 , 3 )}): sc.pl.violin(adata, ['CD79A' , 'CD79B' ], groupby='louvain' )
sc.pl.umap(adata, color='louvain' , legend_loc='on data' , frameon=False , legend_fontsize=10 , legend_fontoutline=2 ,title="" )
sc.pl.umap(adata, color='cell type' , legend_loc='on data' , frameon=False , legend_fontsize=5 , legend_fontoutline=0.5 ,save="jmzeng.jpg" )
WARNING: saving figure to file figures\umapjmzeng.jpg.pdf
根据对应marker注释每一群,和原文基本一致!adata.obs['cell type' ].cat.categories
注释到如下几种细胞类型!Index(['Bcells', 'CD8Effector', 'CXCL14cancer', 'Macrophages', 'NaiveT', 'cancercells', 'cancerstemcells', 'endothelial', 'matureDC', 'myeloid', 'myofibroblasts', 'pericytes', 'plasma'], dtype='object')
adata_new=adata[(adata.obs['cell type' ]== 'CXCL14cancer' )|(adata.obs['cell type' ]=='cancercells' )|(adata.obs['cell type' ]== 'cancerstemcells' ), :]
adata_new.obs['cell type' ].cat.categories
然后作者选择如下三种细胞类型,做了拟时序分析!go! go! go!Index(['CXCL14cancer', 'cancercells', 'cancerstemcells'], dtype='object')
adata_new
这里选用paga 大家都知道的应该是monocle3 但是在一篇 nature methods 当中,作者是认为 paga 才是拟时序分析的最优选择!View of AnnData object with n_obs × n_vars = 2883 × 3907 obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'disease_group', 'leiden', 'louvain', 'cell type' var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std' uns: 'hvg', 'pca', 'neighbors', 'umap', 'disease_group_colors', 'leiden', 'louvain', 'leiden_colors', 'louvain_colors', 'rank_genes_groups', 'dendrogram_louvain', 'cell type_colors' obsm: 'X_pca', 'X_umap' varm: 'PCs' obsp: 'distances', 'connectivities'
sc.tl.paga(adata_new, groups='cell type' )
running PAGA Trying to set attribute `.uns` of view, copying. finished: added 'paga/connectivities', connectivities adjacency (adata.uns) 'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
sc.pl.paga(adata_new, color=['cell type' ,'CXCL14' ])
--> added 'pos', the PAGA positions (adata.uns['paga'])
#adata_new.uns['iroot'] = np.flatnonzero(adata_new.obs['cell type'] == 'cancerstemcells')[0] sc.tl.draw_graph(adata_new, init_pos='paga' )
drawing single-cell graph using layout 'fa' finished: added 'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:17)
adata_new.uns['iroot' ] = np.flatnonzero(adata_new.obs['cell type' ] == 'cancerstemcells' )[0 ] sc.tl.dpt(adata_new)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters. computing Diffusion Maps using n_comps=15(=n_dcs) computing transitions finished (0:00:00) eigenvalues of transition matrix [1. 0.9954424 0.9901249 0.98625064 0.98440194 0.9764193 0.97149456 0.96843475 0.9636642 0.9555143 0.94893146 0.9446058 0.94269824 0.9330055 0.9283923 ] finished: added 'X_diffmap', diffmap coordinates (adata.obsm) 'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00) computing Diffusion Pseudotime using n_dcs=10 finished: added 'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)
这里看出, cancer_stem_cells 分化为 cancer_cells 和 cxcl14_cancer_cells 和原文分析的基本一致!cxcl14_cancer_cells 主要存在于阳性淋巴结和阴性淋巴结!cxcl14_cancer_cells和 cancer_cells时序上基本一致!sc.pl.draw_graph(adata_new, color=['cell type' , 'dpt_pseudotime' ,'disease_group' ], legend_loc='on data' )
sc.tl.paga(adata, groups='cell type' ) sc.pl.paga(adata, color=['cell type' ],node_size_scale=0.5 )
running PAGA finished: added 'paga/connectivities', connectivities adjacency (adata.uns) 'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00) --> added 'pos', the PAGA positions (adata.uns['paga'])
下面我又做了作者没有做的所有细胞类型的拟时序分析!sc.tl.draw_graph(adata, init_pos='paga' )
drawing single-cell graph using layout 'fa' finished: added 'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:05:47)
sc.pl.draw_graph(adata, color=['cell type' ], legend_loc='on data' ,legend_fontsize='xx-small' ,legend_fontweight='normal' )
adata.uns['iroot' ] = np.flatnonzero(adata.obs['cell type' ] == 'NaiveT' )[0 ] sc.tl.dpt(adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters. computing Diffusion Maps using n_comps=15(=n_dcs) computing transitions finished (0:00:00) eigenvalues of transition matrix [1. 0.9967204 0.9959651 0.9953239 0.9936516 0.9933531 0.98944426 0.98545295 0.9851446 0.9843129 0.9827176 0.98058397 0.9788297 0.97733855 0.97462684] finished: added 'X_diffmap', diffmap coordinates (adata.obsm) 'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00) computing Diffusion Pseudotime using n_dcs=10 finished: added 'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)
sc.pl.draw_graph(adata, color=['cell type' , 'dpt_pseudotime' ,'disease_group' ], legend_loc='on data' ,legend_fontsize='xx-small' ,legend_fontweight='normal' )
可以看到从naiveT细胞的逐渐分化轨迹! 最后想和大家说一句话,是我导师教导我的分享给大家。学无止境,得饶人处且饶人,开阔胸襟,是做人的一种境界!写在文末 虽然说上面的代码都是复制粘贴即可运行,但是如果要更好地完成上面的图表,通常是需要掌握5个R包,分别是: scater,monocle,Seurat,scran,M3Drop ,需要熟练掌握它们的对象,:一些单细胞转录组R包的对象 而且分析流程也大同小异:
单细胞转录组数据分析的标准降维聚类分群,并且进行生物学注释后的结果。可以参考前面的例子:人人都能学会的单细胞聚类分群注释 ,我们演示了第一层次的分群。
如果你对单细胞数据分析还没有基础认知,可以看基础10讲:
咱们现在这个专栏《 100个单细胞转录组数据降维聚类分群图表复现 》分享的代码是到此为止,但是一般来说单细胞文章数据分析还有很多进阶图表制作,比如inferCNV看肿瘤拷贝数变异,monocle看拟时序等等。如果你也需要,可以加入我们这个专栏《 100个单细胞转录组数据降维聚类分群图表复现 》 创作团队,获取进阶指引哦!见:急!计划招募100个单细胞爱好者,免费学全套单细胞降维聚类分群和生物学亚群注释
单细胞服务列表