Using the EggNOG functional annotation database online and locally

Introduction to COG

Home page: https://www.ncbi.nlm.nih.gov/COG/

For the COG one-letter codes, see http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_classes.html

COG one letter code descriptions

INFORMATION STORAGE AND PROCESSING

  • [J] Translation, ribosomal structure and biogenesis
  • [A] RNA processing and modification
  • [K] Transcription
  • [L] Replication, recombination and repair
  • [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING

  • [D] Cell cycle control, cell division, chromosome partitioning
  • [Y] Nuclear structure
  • [V] Defense mechanisms
  • [T] Signal transduction mechanisms
  • [M] Cell wall/membrane/envelope biogenesis
  • [N] Cell motility
  • [Z] Cytoskeleton
  • [W] Extracellular structures
  • [U] Intracellular trafficking, secretion, and vesicular transport
  • [O] Posttranslational modification, protein turnover, chaperones

METABOLISM

  • [C] Energy production and conversion
  • [G] Carbohydrate transport and metabolism
  • [E] Amino acid transport and metabolism
  • [F] Nucleotide transport and metabolism
  • [H] Coenzyme transport and metabolism
  • [I] Lipid transport and metabolism
  • [P] Inorganic ion transport and metabolism
  • [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED

  • [R] General function prediction only
  • [S] Function unknown
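
When reading annotation tables it helps to have the one-letter codes above available as a lookup. The following is a minimal sketch; the names `COG_CATEGORIES` and `describe_cog` are made up for this illustration and are not part of any eggNOG tool.

```python
# COG one-letter codes, copied from the list above.
COG_CATEGORIES = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}


def describe_cog(letters):
    """Return (letter, description) pairs for a category string such as 'KL'."""
    return [(c, COG_CATEGORIES.get(c, "Unknown category")) for c in letters.strip()]


for letter, description in describe_cog("KL"):
    print(letter, description)
```

A protein can carry more than one letter, which is why the helper takes a string and returns one pair per letter.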

Introduction to eggNOG

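Principles and interpretation of eggNOG annotation

Unknown sequences are functionally annotated by comparison with known proteins.

By checking how many proteins are assigned to a given eggNOG ID, and whether that ID is present or absent, you can infer whether a particular metabolic pathway exists.

Each eggNOG ID represents one class of proteins. A multiple sequence alignment of the query sequence with the proteins of the matched eggNOG ID identifies conserved sites and allows their evolutionary relationships to be analyzed.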
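
The presence/absence reasoning described above can be sketched in a few lines. This is only an illustration: the eggNOG IDs (`OG_A`, `OG_B`, `OG_C`) and protein names are placeholders, and the set of IDs that define a pathway has to come from your own knowledge of that pathway.

```python
from collections import Counter


def count_ogs(assignments):
    """Count how many query proteins were assigned to each eggNOG ID."""
    return Counter(og for _, og in assignments)


def missing_ogs(counts, required):
    """Return the required eggNOG IDs that have no assigned protein."""
    return [og for og in required if counts.get(og, 0) == 0]


# Hypothetical (protein, eggNOG ID) pairs; the IDs are placeholders.
assignments = [("p1", "OG_A"), ("p2", "OG_A"), ("p3", "OG_B")]
counts = count_ogs(assignments)

print(counts["OG_A"])
print(missing_ogs(counts, ["OG_A", "OG_B", "OG_C"]))
```

An empty result from `missing_ogs` suggests the pathway may be complete. A non-empty result lists the steps for which no protein was found.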

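eggNOG-mapper online version

eggNOG-mapper is the dedicated tool for searching against the eggNOG database and annotating sequences with it.

Online analysis with eggNOG-mapper takes only three steps, each a few mouse clicks.

1. Open the online tool

http://eggnogdb.embl.de/#/app/emapper

2. Set the parameters

The main settings are the protein sequence file and your e-mail address. The other options can usually be left at their defaults.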

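Note on choosing the method: diamond is relatively slow when there are few sequences but relatively fast when there are many. HMMER annotates distantly related sequences more successfully, but it takes a long time on large datasets. The online version accepts at most 5000 sequences per job.

3. Submit the job

Click the Run button to submit the job. The following window appears.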

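A page with the job status and a list of citations is shown. Note that online analysis both limits the number of sequences and requires queuing, so when many people are using the service you may have to wait a long time.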
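
Before uploading, it can be useful to check whether a file fits within the 5000-sequence limit mentioned above. The following is a minimal sketch; the function names are invented for this example, and the limit is simply the figure quoted in this article.

```python
ONLINE_LIMIT = 5000  # per-job limit of the online tool, as stated in the text


def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def fits_online(n_records, limit=ONLINE_LIMIT):
    """Return True if the number of records is within the online limit."""
    return n_records <= limit


# A tiny in-memory example; for a real file, pass an open file handle instead.
example = [">seq1", "MKT", ">seq2", "MAA", "GGL"]
n = count_fasta_records(example)
print(n, fits_online(n))
```

For a real file, `count_fasta_records(open("proteins.faa"))` works the same way, where `proteins.faa` is a placeholder file name.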

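eggNOG-mapper local version

Installing with conda is the recommended route, because it takes care of dependencies and environment variables for you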

conda install eggnog-mapper 

Manual download and installation

cd ~/software
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3

About the software

less README.md

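eggNOG-mapper uses the eggNOG database to functionally annotate novel gene and protein sequences. It is commonly applied to gene sets from new genomes, transcriptomes, and metagenomes. Function prediction based on orthology is considered more accurate than a traditional homology search, because it avoids transferring annotations directly from paralogs (gene duplication has a high chance of leading to functional divergence).

Help documentation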

https://github.com/jhcepas/eggnog-mapper/wiki

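Installation notes

Software dependencies: python2.7, wget, hmmer3, diamond

Disk space requirements:

  • eggNOG annotation database: ~20GB
  • eggNOG sequence FASTA files: ~20GB
  • eggNOG databases (euk, bact, arch): ~130GB, plus a separate HMM database of 1-35GB for each taxonomic level. You do not need to download them all, only the ones you need.
    The size of each HMM database is listed at http://beta-eggnogdb.embl.de/download/eggnog_4.5/hmmdb_levels/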

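Memory requirements:

HMMER3 annotation is very fast when enough memory is available to hold the database. The memory needed is:

  • Eukaryote database euk: ~90GB
  • Bacteria database bact: ~32GB
  • Archaea database arch: ~10GB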
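
As a quick planning aid, the figures above can be put into a lookup and compared with the RAM you have. This is only a sketch using the approximate numbers from this article; `can_load_in_memory` is a name invented here.

```python
# Approximate RAM (GB) needed for in-memory HMMER searches, from the list above.
HMMER_MEMORY_GB = {"euk": 90, "bact": 32, "arch": 10}


def can_load_in_memory(db, available_gb):
    """Return True if the named database should fit in the given amount of RAM."""
    return available_gb >= HMMER_MEMORY_GB[db]


# Example: a machine with 64 GB of RAM.
print(can_load_in_memory("bact", 64))
print(can_load_in_memory("euk", 64))
```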

Software installation

Besides the conda and wget installs shown above, you can also use git

git clone https://github.com/jhcepas/eggnog-mapper.git

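Database download

  • eggNOG provides HMM databases for 107 taxonomic levels, three optimized databases (eukaryotes euk, bacteria bact, and archaea arch), and one virus-specific database (viruses)
  • The three optimized databases contain all of the corresponding HMMs.
  • For the 107 database subsets, see http://eggnogdb.embl.de/#/app/downloads

Show the program help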

python eggnog-mapper/download_eggnog_data.py -h

mkdir -p eggnog
python eggnog-mapper/download_eggnog_data.py --data_dir eggnog -y -f euk bact arch viruses

Basic usage

cd eggnog-mapper 

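HMMER method

Local search against the bacterial database
Disk based searches on the optimized bacterial database
-i is the input file, --output the output file prefix, -d the database to search, and --data_dir the database location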




python emapper.py -i test/polb.fa --output polb_bact -d bact --data_dir ~/data/db/eggnog 

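diamond method

-m selects the diamond method; the default is hmmer. diamond only shows its speed advantage with more than about a thousand sequences. With a small number of sequences it feels very slow, and its results are also less accurate than those of hmmer, especially for distantly related sequences.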

python emapper.py -i test/polb.fa --output diamond_bact_ -d bact --data_dir ~/data/db/eggnog -m diamond 

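This takes a long time, more than an hour

Interpreting the results

https://github.com/jhcepas/eggnog-mapper/wiki/Results-Interpretation

There are three result files

polb_bact.emapper.annotations
polb_bact.emapper.hmm_hits
polb_bact.emapper.seed_orthologs

The annotations file is the one to focus on; it contains the GO, KEGG, and COG descriptions for each gene

[project_name].emapper.hmm_hits file: list of HMM hits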

For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where evalue, bit-score, query-coverage and the sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.

[project_name].emapper.seed_orthologs file: list of best hits

Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.

[project_name].emapper.annotations file: the consolidated results, and the file that matters most.
This file provides final annotations of each query.

It is a tab-delimited file with 13 columns:

  1. query_name: query sequence name
  2. seed_eggNOG_ortholog: best protein match in eggNOG
  3. seed_ortholog_evalue: best protein match (e-value)
  4. seed_ortholog_score: best protein match (bit-score)
  5. predicted_gene_name: Predicted gene name for query sequences
  6. GO_terms: Comma delimited list of predicted Gene Ontology terms
  7. KEGG_KO: Comma delimited list of predicted KEGG KOs
  8. BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions
  9. Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
  10. Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups
  11. best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)
  12. COG functional categories: COG functional category inferred from best matching OG
  13. eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
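
The 13-column layout above is easy to load with the standard library. The following is a minimal sketch that assumes the column order listed here and that comment lines start with `#`; the example row is entirely made up, and `read_annotations` is a name chosen for this illustration.

```python
import io

# Column names in the order listed above.
ANNOTATION_COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG|evalue|score", "COG functional categories",
    "eggNOG_HMM_model_annotation",
]


def read_annotations(handle):
    """Yield one dict per annotated query, skipping comments and blank lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Split on tabs only, so that empty fields are preserved.
        yield dict(zip(ANNOTATION_COLUMNS, line.split("\t")))


# Hypothetical file content: one comment line and one made-up annotation row.
example = io.StringIO(
    "# placeholder comment line\n"
    "q1\t0000.PLACEHOLDER\t1e-50\t200.0\tpolB\tGO:0003677\tK02336\t\t"
    "bact\tOG1@NOG\tNA|NA|NA\tL\tDNA polymerase\n"
)
records = list(read_annotations(example))
print(records[0]["query_name"], records[0]["COG functional categories"])
```

To read a real result, replace `example` with `open("polb_bact.emapper.annotations")`. Rows with fewer than 13 fields simply produce a shorter dict.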

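Advanced usage

https://github.com/jhcepas/eggnog-mapper/wiki/Advanced-usage-and-tips

Speeding up with large memory and multiple threads

--usemem loads the whole database into memory so that searches run against the preloaded data, --cpu sets the number of threads, and --override forces existing results to be overwritten; without it the run aborts if result files already exist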

python emapper.py -i test/polb.fa --output polb_bact --database bact --data_dir ~/data/db/eggnog --usemem --cpu 10 --override # Total time: 11.8659 secs 

Shared-memory server mode

First load the bacterial database. Only a single database can be specified.

python emapper.py --database bact --data_dir ~/data/db/eggnog --cpu 10 --servermode

Loading the data takes some time

Waiting for server to become ready... localhost 51500

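Wait until it shows:

Server ready listening at localhost:51500 and using 10 CPU cores
Use `emapper.py -d bact:localhost:51500 (...)` to search against this server

Then launch the analysis command

--usemem loads the whole database into memory so that searches run against the preloaded data, and --cpu sets the number of threads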

python emapper.py -i test/polb.fa --output polb_bact --database bact:localhost:51500 --data_dir ~/data/db/eggnog --usemem --cpu 10 --override # Total time: 9.77332 secs 

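Large-scale metagenome mode

https://github.com/jhcepas/eggnog-mapper/wiki/Setting-up-large-scale-analyses

This mode is for annotating large genomes and metagenomic data (>100M proteins).

The analysis has two main steps: the homology search, which is compute-intensive, and the functional annotation, which is I/O-intensive. Splitting the data improves efficiency.

Homology search

1. Split the sequences

Prepare the file and convert it to single-line FASTA

cp /mnt/bai/yongxin/test/meta1809/temp/23prokka_all/mg.faa input_file0.faa
format_fasta_1line.pl -i input_file0.faa -o input_file.faa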
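
The `format_fasta_1line.pl` script used above is a local helper that is not shown in this article. If you do not have it, the same conversion can be sketched in Python as follows; this is an assumed equivalent written for illustration, not the original script.

```python
def to_single_line_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header
                yield "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    # Emit the last record.
    if header is not None:
        yield header
        yield "".join(seq)


example = [">a", "MKT", "LLV", ">b", "MAA"]
single = list(to_single_line_fasta(example))
print(single)
```

After this step every record occupies exactly two lines, which is what makes splitting by line count safe.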

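Split the input into files of 2 million lines (1 million sequences) each. For this test we use 10,000 lines, i.e. 5,000 sequences, per file.

# -l splits by line count; -a sets the suffix width to 3 (default 2); -d uses numeric suffixes
split -l 10000 -a 3 -d input_file.faa input_file.chunk_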
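
Splitting by line count only works when every record occupies exactly two lines. As an alternative, the following sketch splits by record count, so it also handles multi-line FASTA. The function names are invented for this example, and the chunk naming imitates the `split -a 3 -d` suffixes.

```python
def chunk_fasta(lines, records_per_chunk):
    """Yield lists of FASTA lines, each holding at most records_per_chunk records."""
    chunk, n = [], 0
    for line in lines:
        if line.startswith(">"):
            # Start a new chunk once the current one is full.
            if n == records_per_chunk:
                yield chunk
                chunk, n = [], 0
            n += 1
        chunk.append(line)
    if chunk:
        yield chunk


def chunk_name(prefix, index, width=3):
    """Build a zero-padded chunk file name, e.g. input_file.chunk_000."""
    return f"{prefix}{index:0{width}d}"


example = [">a", "MK", "TL", ">b", "MA", ">c", "MG"]
chunks = list(chunk_fasta(example, 2))
for i, chunk in enumerate(chunks):
    print(chunk_name("input_file.chunk_", i), chunk)
```

To write real files, open `chunk_name(...)` for writing inside the loop and write the lines of each chunk to it.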

2. Parallel search

Method 1. Generate commands for a cluster

for f in *.chunk_*; do echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f; done 

Method 2. Parallel computation

time parallel -j 3 --xapply \
  'python emapper.py -m diamond --no_annot --no_file_comments --data_dir ~/data/db/eggnog --cpu 16 -i {1} -o {1}' \
  ::: input_file.chunk*

Elapsed time: real 14m45.579s
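
If GNU parallel is not available, the per-chunk commands can be generated and dispatched from Python. The following sketch uses only the emapper.py flags that appear in this article; `chunk_commands` and `run_all` are names invented here, and `run_all` is defined but not called because it needs a working eggNOG-mapper installation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunk_commands(files, cpu=16, data_dir=None):
    """Build one diamond search command per chunk file."""
    commands = []
    # Skip earlier outputs (*.emapper.*) that the chunk glob would also match.
    for name in sorted(f for f in files if ".emapper." not in f):
        cmd = ["python", "emapper.py", "-m", "diamond", "--no_annot",
               "--no_file_comments", "--cpu", str(cpu), "-i", name, "-o", name]
        if data_dir:
            cmd += ["--data_dir", data_dir]
        commands.append(cmd)
    return commands


def run_all(commands, jobs=3):
    """Run the commands with at most `jobs` running at the same time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() forces evaluation so that any failure is raised here.
        list(pool.map(lambda c: subprocess.run(c, check=True), commands))


files = ["input_file.chunk_001", "input_file.chunk_000",
         "input_file.chunk_000.emapper.seed_orthologs"]
commands = chunk_commands(files)
for cmd in commands:
    print(" ".join(cmd))
```

Threads are sufficient here because each worker only waits for an external process. Note that `subprocess.run` does not expand `~`, so pass an absolute path as `data_dir`.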

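Functional annotation

This step is disk-intensive. It is recommended to store eggnog.db on an SSD or in the in-memory directory /dev/shm

3. Merge the search results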

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs

4. Annotate

To speed things up, copy the database into memory (this took 21 s)

cp ~/data/db/eggnog/eggnog.db /dev/shm

time emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 20 --data_dir /dev/shm --override 

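With the database in memory, processing 10,000 sequences takes about 15 s

We now have a list of annotations for every gene. Combined with a gene abundance matrix, it can be used for all kinds of summaries, differential comparisons, and functional descriptions.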
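
One such summary is the total abundance per COG category in each sample. The following is a minimal sketch with made-up gene names, sample names, and values; `abundance_by_cog` is a name invented for this example, and counting a multi-letter gene once under each of its letters is a design choice, not a rule of the tool.

```python
from collections import defaultdict


def abundance_by_cog(gene_to_cog, gene_abundance):
    """Sum per-sample abundance over genes, grouped by COG category letter.

    A gene with several letters (e.g. 'KL') contributes to each of them.
    Genes without an annotation are skipped.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gene, samples in gene_abundance.items():
        for letter in gene_to_cog.get(gene, ""):
            if not letter.isalpha():
                continue  # ignore separators such as commas or spaces
            for sample, value in samples.items():
                totals[letter][sample] += value
    return {letter: dict(per_sample) for letter, per_sample in totals.items()}


# Hypothetical inputs: COG letters per gene and an abundance matrix.
gene_to_cog = {"g1": "K", "g2": "KL"}
gene_abundance = {
    "g1": {"s1": 1.0, "s2": 2.0},
    "g2": {"s1": 3.0, "s2": 0.5},
    "g3": {"s1": 9.0, "s2": 9.0},  # no annotation, so it is not counted
}
summary = abundance_by_cog(gene_to_cog, gene_abundance)
print(summary)
```

The gene-to-category mapping can be taken from the query_name and COG functional categories columns of the annotations file.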

Appendix 1. emapper.py parameters in detail

python emapper.py -h

usage: emapper.py [-h] [--guessdb] [--database] [--dbtype {hmmdb,seqdb}]
                  [--data_dir] [--qtype {hmm,seq}] [--tax_scope]
                  [--target_orthologs {one2one,many2one,one2many,many2many,all}]
                  [--excluded_taxa] [--go_evidence {experimental,non-electronic}]
                  [--hmm_maxhits] [--hmm_evalue] [--hmm_score] [--hmm_maxseqlen]
                  [--hmm_qcov] [--Z] [--dmnd_db DMND_DB]
                  [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
                  [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
                  [--seed_ortholog_evalue] [--seed_ortholog_score] [--output]
                  [--resume] [--override] [--no_refine] [--no_annot]
                  [--no_search] [--report_orthologs] [--scratch_dir]
                  [--output_dir] [--temp_dir] [--no_file_comments]
                  [--keep_mapping_files] [-m {hmmer,diamond}] [-i] [--translate]
                  [--servermode] [--usemem] [--cpu] [--annotate_hits_table]
                  [--version]

optional arguments:
  -h, --help    show this help message and exit
  --version     show the version number

Target HMM Database Options:
  --guessdb     guess eggnog db based on the provided taxid
  --database, -d    specify the target database for sequence searches (only one database can be given). Choose among: euk, bact, arch, host:port, or a local hmmpressed database
  --dbtype {hmmdb,seqdb}    database type
  --data_dir    Directory to use for DATA_PATH.
  --qtype {hmm,seq}    (author's note) method selection: hmm for few sequences, seq for many

Annotation Options:
  --tax_scope    Fix the taxonomic scope used for annotation, so only orthologs from a particular clade are used for functional transfer. By default, this is automatically adjusted for every query sequence.
  --target_orthologs {one2one,many2one,one2many,many2many,all}    defines what type of orthologs should be used for functional transfer
  --excluded_taxa    (for debugging and benchmark purposes)
  --go_evidence {experimental,non-electronic}    Defines what type of GO terms should be used for annotation (this controls annotation stringency): experimental = Use only terms inferred from experimental evidence; non-electronic = Use only non-electronically curated terms

HMM search_options:
  --hmm_maxhits    Max number of hits to report. Default=1
  --hmm_evalue    E-value threshold. Default=0.001
  --hmm_score    Bit score threshold. Default=20
  --hmm_maxseqlen    Ignore query sequences larger than `maxseqlen`. Default=5000
  --hmm_qcov    min query coverage (from 0 to 1). Default=(disabled)
  --Z    Fixed database size used in phmmer/hmmscan (allows comparing e-values among databases). Default=40,000,000

diamond search_options:
  --dmnd_db DMND_DB    Path to DIAMOND-compatible database
  --matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}    Scoring matrix
  --gapopen GAPOPEN    Gap open penalty
  --gapextend GAPEXTEND    Gap extend penalty

Seed ortholog search option:
  --seed_ortholog_evalue    Min E-value expected when searching for seed eggNOG ortholog. Applies to phmmer/diamond searches. Queries not having a significant seed orthologs will not be annotated. Default=0.001
  --seed_ortholog_score    Min bit score expected when searching for seed eggNOG ortholog. Applies to phmmer/diamond searches. Queries not having a significant seed orthologs will not be annotated. Default=60

Output options:
  --output, -o    base name for output files
  --resume    Resumes a previous execution skipping reported hits in the output file.
  --override    Overwrites output files if they exist.
  --no_refine    Skip hit refinement, reporting only HMM hits.
  --no_annot    Skip functional annotation, reporting only hits
  --no_search    Skip HMM search mapping. Use existing hits file
  --report_orthologs    The list of orthologs used for functional transferred are dumped into a separate file
  --scratch_dir    Write output files in a temporary scratch dir, move them to final the final output dir when finished. Speed up large computations using network file systems.
  --output_dir    Where output files should be written
  --temp_dir    Where temporary files are created. Better if this is a local disk.
  --no_file_comments    No header lines nor stats are included in the output files
  --keep_mapping_files    Do not delete temporary mapping files used for annotation (i.e. HMMER and DIAMOND search outputs)

Execution options:
  -m {hmmer,diamond}    search method; Default:hmmer, diamond is the alternative
  -i    Input FASTA file containing query sequences
  --translate    Assume sequences are genes instead of proteins (nucleotide input is translated to protein)
  --servermode    Loads target database in memory and keeps running in server mode, so another instance of eggnog-mapper can connect to this sever. Auto turns on the --usemem flag (convenient for repeated use)
  --usemem    If a local hmmpressed database is provided as target using --db, this flag will allocate the whole database in memory using hmmpgmd. Database will be unloaded after execution.
  --cpu    number of threads
  --annotate_hits_table    Annotatate TSV formatted table of query->hits. 4 fields required: query, hit, evalue, score. Implies --no_search and --no_refine.

Reference

https://github.com/jhcepas/eggnog-mapper/wiki

Published by: 全栈程序员-站长. When reposting, please credit the source: https://javaforall.net/199080.html Original link: https://javaforall.net