宏基因组数据分析:差异分析(LEfSe安装使用及LDA score计算)

宏基因组数据分析:差异分析(LEfSe安装使用及LDA score计算)文章目录简介安装简介安装报错:Collectingpackagemetadata(current_repodata.json):doneSolvingenvironment:failedwithinitialfrozensolve.Retryingwithflexiblesolve.Solvingenvironment:failedwithrepodatafromcurrent_repodata.json,willretrywithnextre

大家好,又见面了,我是你们的朋友全栈君。

导读

  • LEfSe(Linear discriminant analysis Effect Size)通过将用于统计显着性的标准检验与编码生物一致性和效果相关性的检验相结合,确定最有可能解释类别之间差异的特征。
  • LEfSe分析可以实现多个分组之间的比较,还进行分组比较的内部进行亚组比较分析,从而找到组间在丰度上有显著差异的物种(即biomaker)。

原理

在这里插入图片描述

  1. 首先在多组样本中采用的非参数检验Kruskal-Wallis秩和检验检测不同分组间丰度差异显著的特征;
  2. 然后在上一步中获得的显著差异特征,用成组的Wilcoxon秩和检验进行组间差异分析(若没有亚组,该步跳过);
  3. 最后用线性判别分析(LDA)对数据进行分类和评估差异显著的物种的影响力(即LDA score)。

LDA score计算

计算步骤包括(详细计算过程看源码):

  1. 拟合lda模型,获取第一特征向量;
  2. 对第一特征向量进行标准化;
  3. 根据标准化后的第一特征向量,计算样本新坐标;
  4. 根据分组信息,计算组间距离,作为效应系数;
  5. LDA score = 效应系数 * 标准化后特征向量 ;
  6. 对LDA score进行log转换。
def test_lda_r(cls,feats,cl_sl,boots,fract_sample,lda_th,tol_min,nlogs):
fk = list(feats.keys())
means = dict([(k,[]) for k in feats.keys()])
feats['class'] = list(cls['class'])
clss = list(set(feats['class']))
for uu,k in enumerate(fk):
if k == 'class':
continue
ff = [(feats['class'][i],v) for i,v in enumerate(feats[k])]
for c in clss:
if len(set([float(v[1]) for v in ff if v[0] == c])) > max(float(feats['class'].count(c))*0.5,4):
continue
for i,v in enumerate(feats[k]):
if feats['class'][i] == c:
feats[k][i] = math.fabs(feats[k][i] + lrand.normalvariate(0.0,max(feats[k][i]*0.05,0.01)))
rdict = { 
}
for a,b in feats.items():
if a == 'class' or a == 'subclass' or a == 'subject':
rdict[a] = robjects.StrVector(b)
else:
rdict[a] = robjects.FloatVector(b)
robjects.globalenv["d"] = robjects.DataFrame(rdict)
lfk = len(feats[fk[0]])
rfk = int(float(len(feats[fk[0]]))*fract_sample)
f = "class ~ "+fk[0]
for k in fk[1:]:
f += " + " + k.strip()
ncl = len(set(cls['class']))
min_cl = int(float(min([cls['class'].count(c) for c in set(cls['class'])]))*fract_sample*fract_sample*0.5)
min_cl = max(min_cl,1)
pairs = [(a,b) for a in set(cls['class']) for b in set(cls['class']) if a > b]
for k in fk:
for i in range(boots):
means[k].append([])
for i in range(boots):
for rtmp in range(1000):
rand_s = [lrand.randint(0,lfk-1) for v in range(rfk)]
if not contast_within_classes_or_few_per_class(feats,rand_s,min_cl,ncl):
break
rand_s = [r+1 for r in rand_s]
means[k][i] = []
for p in pairs:
robjects.globalenv["rand_s"] = robjects.IntVector(rand_s)
robjects.globalenv["sub_d"] = robjects.r('d[rand_s,]')
z = robjects.r('z <- suppressWarnings(lda(as.formula('+f+'),data=sub_d,tol='+str(tol_min)+'))')
robjects.r('w <- z$scaling[,1]')
robjects.r('w.unit <- w/sqrt(sum(w^2))')
robjects.r('ss <- sub_d[,-match("class",colnames(sub_d))]')
if 'subclass' in feats:
robjects.r('ss <- ss[,-match("subclass",colnames(ss))]')
if 'subject' in feats:
robjects.r('ss <- ss[,-match("subject",colnames(ss))]')
robjects.r('xy.matrix <- as.matrix(ss)')
robjects.r('LD <- xy.matrix%*%w.unit')
robjects.r('effect.size <- abs(mean(LD[sub_d[,"class"]=="'+p[0]+'"]) - mean(LD[sub_d[,"class"]=="'+p[1]+'"]))')
scal = robjects.r('wfinal <- w.unit * effect.size')
rres = robjects.r('mm <- z$means')
rowns = list(rres.rownames)
lenc = len(list(rres.colnames))
coeff = [abs(float(v)) if not math.isnan(float(v)) else 0.0 for v in scal]
res = dict([(pp,[float(ff) for ff in rres.rx(pp,True)] if pp in rowns else [0.0]*lenc ) for pp in [p[0],p[1]]])
for j,k in enumerate(fk):
gm = abs(res[p[0]][j] - res[p[1]][j])
means[k][i].append((gm+coeff[j])*0.5)
res = { 
}
for k in fk:
m = max([numpy.mean([means[k][kk][p] for kk in range(boots)]) for p in range(len(pairs))])
res[k] = math.copysign(1.0,m)*math.log(1.0+math.fabs(m),10)
return res,dict([(k,x) for k,x in res.items() if math.fabs(x) > lda_th])

安装

# 新建虚拟环境,lefse依赖于python2
conda create -n lefse python=2.7
# 激活虚拟环境
source activate lefse
# conda安装lefse
conda install lefse
# 退出虚拟环境
source deactivate

报错:如果是r2py的问题,基本都是python版本的问题。
解决:虚拟环境中重新安装python2.7(推荐)。另一种方式是对虚拟环境中的conda降级,之后再进行安装。


报错:如果是axis_bgcolor()的问题,那就是matplotlib版本问题。
解决:虚拟环境中重新安装metaplotlib 2.0版本,conda install matplotlib=2.0(推荐)。另一种方式是将lefse-plot_cladogam.py文件中的axis_bgcolor()修改为facecolor()


分析

文件输入格式

文本制表符(单个Tab)分隔的输入文件由特征列表,分类向量和可选的亚组变量和样本ID向量组成。元数据位于前几行,特征矩阵随后。

bodysite                                mucosal         mucosal         mucosal         mucosal         mucosal         non_mucosal     non_mucosal     non_mucosal     non_mucosal     non_mucosal
subsite                                 oral            gut             oral            oral            gut             skin            nasal           skin            ear             nasal
id                                      1023            1023            1672            1876            1672            159005010       1023            1023            1023            1672
Bacteria                                0.99999         0.99999         0.999993        0.999989        0.999997        0.999927        0.999977        0.999987        0.999997        0.999993
Bacteria|Actinobacteria                 0.311037        0.000864363     0.00446132      0.0312045       0.000773642     0.359354        0.761108        0.603002        0.95913         0.753688
Bacteria|Bacteroidetes                  0.0689602       0.804293        0.00983343      0.0303561       0.859838        0.0195298       0.0212741       0.145729        0.0115617       0.0114511
Bacteria|Firmicutes                     0.494223        0.173411        0.715345        0.813046        0.124552        0.177961        0.189178        0.188964        0.0226835       0.192665
Bacteria|Proteobacteria                 0.0914284       0.0180378       0.265664        0.109549        0.00941215      0.430869        0.0225884       0.0532684       0.00512034      0.0365453
Bacteria|Firmicutes|Clostridia          0.090041        0.170246        0.00483188      0.0465328       0.122702        0.0402301       0.0460614       0.135201        0.0115835       0.0537381

数据格式转换

数据分析前需要对数据格式进行转换,原始数据可以为16S那种reads count,也可以是宏基因组相对丰度数据,但需要按照指定格式,行为分组和物种信息,列为样本。-c指定分组行;-s指定亚组行,若没有可以不指定;-u指定样本编号;-o指定归一化后范围,主要针对宏基因组数据,目的是对相对丰度进行放大。

lefse-format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000

详细参数:

usage: lefse-format_input.py [-h] [--output_table OUTPUT_TABLE] [-f { 
c,r}]
[-c [1..n_feats]] [-s [1..n_feats]] [-o float]
[-u [1..n_feats]] [-m { 
f,s}] [-n int]
[-biom_c BIOM_CLASS] [-biom_s BIOM_SUBCLASS]
INPUT_FILE OUTPUT_FILE
LEfSe formatting modules
positional arguments:
INPUT_FILE            the input file, feature hierarchical level can be
specified with | or . and those symbols must not be
present for other reasons in the input file.
OUTPUT_FILE           the output file containing the data for LEfSe
optional arguments:
-h, --help            show this help message and exit
--output_table OUTPUT_TABLE
the formatted table in txt format
-f { 
c,r}              set whether the features are on rows (default) or on
columns
-c [1..n_feats]       set which feature use as class (default 1)
-s [1..n_feats]       set which feature use as subclass (default -1 meaning
no subclass)
-o float              set the normalization value (default -1.0 meaning no
normalization)
-u [1..n_feats]       set which feature use as subject (default -1 meaning
no subject)
-m { 
f,s}              set the policy to adopt with missin values: f removes
the features with missing values, s removes samples
with missing values (default f)
-n int                set the minimum cardinality of each subclass
(subclasses with low cardinalities will be grouped
together, if the cardinality is still low, no pairwise
comparison will be performed with them)
-biom_c BIOM_CLASS    For biom input files: Set which feature use as class
-biom_s BIOM_SUBCLASS
For biom input files: set which feature use as
subclass

lefse分析

转换格式后,可以进行lefse差异分析。需要指定输入和输出文件。还可以通过-a指定组间比较检验水准阈值,-w指定成组比较检验水准阈值,-l指定lda score阈值。除了可以选择lda,还可以选择svm进行分析。

run_lefse.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res

详细参数:

usage: run_lefse.py [-h] [-o str] [-a float] [-w float] [-l float]
[--nlogs int] [--verbose int] [--wilc int] [-r str]
[--svm_norm int] [-b int] [-e int] [-c int] [-f float]
[-s { 
0,1,2}] [--min_c int] [-t str] [-y { 
0,1}]
INPUT_FILE OUTPUT_FILE
LEfSe 1.0
positional arguments:
INPUT_FILE      the input file
OUTPUT_FILE     the output file containing the data for the visualization
module
optional arguments:
-h, --help      show this help message and exit
-o str          set the file for exporting the result (only concise textual
form)
-a float        set the alpha value for the Anova test (default 0.05)
-w float        set the alpha value for the Wilcoxon test (default 0.05)
-l float        set the threshold on the absolute value of the logarithmic
LDA score (default 2.0)
--nlogs int     max log ingluence of LDA coeff
--verbose int   verbose execution (default 0)
--wilc int      wheter to perform the Wicoxon step (default 1)
-r str          select LDA or SVM for effect size (default LDA)
--svm_norm int  whether to normalize the data in [0,1] for SVM feature
waiting (default 1 strongly suggested)
-b int          set the number of bootstrap iteration for LDA (default 30)
-e int          set whether perform the wilcoxon test only among the
subclasses with the same name (default 0)
-c int          set whether perform the wilcoxon test ing the Curtis's
approach [BETA VERSION] (default 0)
-f float        set the subsampling fraction value for each bootstrap
iteration (default 0.66666)
-s { 
0,1,2}      set the multiple testing correction options. 0 no correction
(more strict, default), 1 correction for independent
comparisons, 2 correction for independent comparison
--min_c int     minimum number of samples per subclass for performing
wilcoxon test (default 10)
-t str          set the title of the analysis (default input file without
extension)
-y { 
0,1}        (for multiclass tasks) set whether the test is performed in
a one-against-one ( 1 - more strict!) or in a one-against-
all setting ( 0 - less strict) (default 0)

绘制lefse结果图

将lefse分析结果绘图。可以根据需求指定绘图格式。

lefse-plot_res.py hmp_aerobiosis_small.res hmp_aerobiosis_small.png

详细参数:

usage: lefse-plot_res.py [-h] [--feature_font_size FEATURE_FONT_SIZE]
[--format { 
png,svg,pdf}] [--dpi DPI] [--title TITLE]
[--title_font_size TITLE_FONT_SIZE]
[--class_legend_font_size CLASS_LEGEND_FONT_SIZE]
[--width WIDTH] [--height HEIGHT] [--left_space LS]
[--right_space RS] [--orientation { 
h,v}]
[--autoscale { 
0,1}] [--background_color { 
k,w}]
[--subclades N_SCL]
[--max_feature_len MAX_FEATURE_LEN]
[--all_feats ALL_FEATS] [--otu_only]
[--report_features]
INPUT_FILE OUTPUT_FILE
Plot results
positional arguments:
INPUT_FILE            tab delimited input file
OUTPUT_FILE           the file for the output image
optional arguments:
-h, --help            show this help message and exit
--feature_font_size FEATURE_FONT_SIZE
the file for the output image
--format { 
png,svg,pdf}
the format for the output file
--dpi DPI
--title TITLE
--title_font_size TITLE_FONT_SIZE
--class_legend_font_size CLASS_LEGEND_FONT_SIZE
--width WIDTH
--height HEIGHT       only for vertical histograms
--left_space LS
--right_space RS
--orientation { 
h,v}
--autoscale { 
0,1}
--background_color { 
k,w}
set the color of the background
--subclades N_SCL     number of label levels to be dislayed (starting from
the leaves, -1 means all the levels, 1 is default )
--max_feature_len MAX_FEATURE_LEN
Maximum length of feature strings (def 60)
--all_feats ALL_FEATS
--otu_only            Plot only species resolved OTUs (as opposed to all
levels)
--report_features     Report important features to STDOUT

绘制特征条形图

另外,可以分别对指定的物种,或者所以差异物种绘制条形图。

# 单个特征
lefse-plot_features.py -f one --feature_name "Bacteria.Bacteroidetes" hmp_aerobiosis_small.in hmp_aerobiosis_small.res Bacteroidetes.png
# 差异特征
lefse-plot_features.py -f diff --archive zip hmp_aerobiosis_small.in hmp_aerobiosis_small.res biomarkers.zip

详细参数:

usage: lefse-plot_features.py [-h] [--width WIDTH] [--height HEIGHT]
[--top TOP] [--bot BOT]
[--title_font_size TITLE_FONT_SIZE]
[--class_font_size CLASS_FONT_SIZE]
[--class_label_pos { 
up,down}]
[--subcl_mean { 
y,n}] [--subcl_median { 
y,n}]
[--font_size FONT_SIZE] [-n flt]
[--format { 
png,pdf,svg}] [-f { 
all,diff,one}]
[--feature_name FEATURE_NAME]
[--feature_num FEATURE_NUM]
[--archive { 
zip,none}]
[--background_color { 
k,w}] [--dpi DPI]
INPUT_FILE INPUT_FILE OUTPUT_FILE
Cladoplot
positional arguments:
INPUT_FILE            dataset files
INPUT_FILE            LEfSe output file
OUTPUT_FILE           the file for the output (the zip file if an archive is
required, the output directory otherwise)
optional arguments:
-h, --help            show this help message and exit
--width WIDTH
--height HEIGHT
--top TOP             set maximum y limit (-1.0 means automatic limit)
--bot BOT             set minimum y limit (default 0.0, -1.0 means automatic
limit)
--title_font_size TITLE_FONT_SIZE
--class_font_size CLASS_FONT_SIZE
--class_label_pos { 
up,down}
--subcl_mean { 
y,n}
--subcl_median { 
y,n}
--font_size FONT_SIZE
-n flt                unused
--format { 
png,pdf,svg}
the format for the output file
-f { 
all,diff,one}     wheter to plot all features (all), only those
differentially abundant according to LEfSe or only one
(the one given with --feature_name)
--feature_name FEATURE_NAME
The name of the feature to plot (levels separated by
.)
--feature_num FEATURE_NUM
The number of the feature to plot
--archive { 
zip,none}
--background_color { 
k,w}
set the color of the background
--dpi DPI

绘制系统发育树图

可以将差异结果,结合物种进化关系,绘制系统发育树图。前提是输入文件必须指定物种所有的物种等级。例如metaphlan的输出结果。如果物种太多,可以进一步对差异分析结果进行筛选过滤,只对差异特征进行绘图。

## 可直接绘制单个水平的图,也可将所有水平进行分析后,得到的结果进行绘图
## 如果很多特征都是没有组间差异的,可以对特征进行筛选后再绘图 grep -v '-' 即可
## --clade_sep 默认为1.5,但绘图时会自动调整,也可以自己设定
## --abrv_stop_lev 7 默认为5,即展示到第5水平,可设置到7,暂时到第7水平的结果。可以结合标签参数和缩写参数来展示。
lefse-plot_cladogram.py --dpi 500 all_level_lefse_significant.res all_level_lefse_significant_cladogram.pdf --format pdf --clade_sep 0.05 --abrv_stop_lev 7

详细参数:

usage: lefse-plot_cladogram.py [-h] [--clade_sep CLADE_SEP]
[--max_lev MAX_LEV]
[--max_point_size MAX_POINT_SIZE]
[--min_point_size MIN_POINT_SIZE]
[--point_edge_width MARKEREDGEWIDTH]
[--siblings_connector_width SIBLINGS_CONNECTOR_WIDTH]
[--parents_connector_width PARENTS_CONNECTOR_WIDTH]
[--radial_start_lev RADIAL_START_LEV]
[--labeled_start_lev LABELED_START_LEV]
[--labeled_stop_lev LABELED_STOP_LEV]
[--abrv_start_lev ABRV_START_LEV]
[--abrv_stop_lev ABRV_STOP_LEV]
[--expand_void_lev EXPAND_VOID_LEV]
[--class_legend_vis CLASS_LEGEND_VIS]
[--colored_connector COLORED_CONNECTORS]
[--alpha ALPHA] [--title TITLE]
[--sub_clade SUB_CLADE]
[--title_font_size TITLE_FONT_SIZE]
[--right_space_prop R_PROP]
[--left_space_prop L_PROP]
[--label_font_size LABEL_FONT_SIZE]
[--background_color { 
k,w}]
[--colored_labels { 
0,1}]
[--class_legend_font_size CLASS_LEGEND_FONT_SIZE]
[--dpi DPI] [--format { 
png,svg,pdf}]
[--all_feats ALL_FEATS]
INPUT_FILE OUTPUT_FILE
Cladoplot
positional arguments:
INPUT_FILE            tab delimited input file
OUTPUT_FILE           the file for the output image
optional arguments:
-h, --help            show this help message and exit
--clade_sep CLADE_SEP
--max_lev MAX_LEV
--max_point_size MAX_POINT_SIZE
--min_point_size MIN_POINT_SIZE
--point_edge_width MARKEREDGEWIDTH
--siblings_connector_width SIBLINGS_CONNECTOR_WIDTH
--parents_connector_width PARENTS_CONNECTOR_WIDTH
--radial_start_lev RADIAL_START_LEV
--labeled_start_lev LABELED_START_LEV
--labeled_stop_lev LABELED_STOP_LEV
--abrv_start_lev ABRV_START_LEV
--abrv_stop_lev ABRV_STOP_LEV
--expand_void_lev EXPAND_VOID_LEV
--class_legend_vis CLASS_LEGEND_VIS
--colored_connector COLORED_CONNECTORS
--alpha ALPHA
--title TITLE
--sub_clade SUB_CLADE
--title_font_size TITLE_FONT_SIZE
--right_space_prop R_PROP
--left_space_prop L_PROP
--label_font_size LABEL_FONT_SIZE
--background_color { 
k,w}
set the color of the background
--colored_labels { 
0,1}
draw the label with class color (1) or in black (0)
--class_legend_font_size CLASS_LEGEND_FONT_SIZE
--dpi DPI
--format { 
png,svg,pdf}
the format for the output file
--all_feats ALL_FEATS

参考资料

https://github.com/SegataLab/lefse

版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。

发布者:全栈程序员-用户IM,转载请注明出处:https://javaforall.cn/135291.html原文链接:https://javaforall.cn

【正版授权,激活自己账号】: Jetbrains全家桶Ide使用,1年售后保障,每天仅需1毛

【官方授权 正版激活】: 官方授权 正版激活 支持Jetbrains家族下所有IDE 使用个人JB账号...

(0)
blank

相关推荐

  • xsync文件「建议收藏」

    xsync文件「建议收藏」1、写好xsync.sh脚本#!/bin/bash#1.判断参数个数if[$#-lt1]thenechoNotEnoughArguement!exit;fi#2.遍历集群所有机器forhostinhadoop101hadoop102hadoop103doecho====================$host====================#3.遍历所有目录,挨个发送forfilein$@do

  • pycharm配置路径_如何在pycharm添加解释器

    pycharm配置路径_如何在pycharm添加解释器步骤一:pycharm–>settingforNewProjects步骤二:settingsforNewprojects–>projectInterpreter–>showAll–>Add

  • python和c++哪个好_pycharm怎么创建python项目

    python和c++哪个好_pycharm怎么创建python项目创建项目首先打开Pycharm勾选IconfirmthatIhavereadandacceptthetermsofthisUserAgreement接下来选择Don’tsend就可以了初次登陆的时候因为你没有Project,所以我们可以创建一个Project随后,我们便可以得到以下界面,new1是项目的名称,new1之前的是你项目所在路径,当然你也可以设置其他的名称,尽…

  • c盘替换文件需要权限_windows安装命令

    c盘替换文件需要权限_windows安装命令大家都知道08权限的系统权限设置很严格,且在2003系统中常用到的溢出工具都失效。面对限制IP连接的情况我们及时拿到system权限有账号也上不去这种情况下只能弄shift后门或者放大镜了。但08权限在system权限也操作不了系统文件夹。先查通过whoami查看下登录帐号权限。通过下图我们看到是普通权限我用的到时MS12042这个大家都会用单独讲sysret.ex…

  • 在Ubuntu16.04 安装RabbitVCS[通俗易懂]

    在Ubuntu16.04 安装RabbitVCS[通俗易懂]系统:Ubuntu16.041、添加源将rabbitvcs的添加到源里面。(首次操作会提示是否要添加到源里面,点击ENTER添加,Ctrl+C不添加),这里选择ENTER方便更新。sudoadd-apt-repositoryppa:rabbitvcs/ppa2、导入key如果第一步出现已经导入key的提示,此步骤可以省略sudoapt-keyadv–keyserverkeyserver.ubuntu.com–recv-keys34EF4A..

  • linux vim安装_centos mysql

    linux vim安装_centos mysqlLinuxcentos7系统下配置vim

发表回复

您的电子邮箱地址不会被公开。

关注全栈程序员社区公众号