DNA作为生物的遗传物质,由2条互补的链按照A-T,C-G碱基配对方式形成双螺旋结构。在生物信息处理时,常常需要得到已知序列的反向互补序列(反向是由于学术界约定俗成序列方向为5端到3端,因此只互补配对得到的是3端到5端的配对序列,还需要再反向一次变成5端到3端),诸如Python和其他专门的小工具都可以实现,但是这里尝试用Shell来解决问题。
- 问题:
获得fasta格式核酸序列的反向互补配对序列。fasta格式中,>开头的是序列名称,后面的多行为具体的核酸序列,例如:
>gene_name1
ATACTCGATATGATGTAGTGATGTAGTGAGA
CAGTGTGTTGTGTTGTTGTGGGTGTG
>gene_name2
ATACTCGATATGATGTAGTGATGTAGTGAGA
CAGTGGTGGGTGTG - 思路
为了解决这个问题,首先介绍Shell中2个与此相关的命令:逆转字符串命令rev和翻译字符串命令tr;
其次是使用bash的数组将每个序列名称和序列对应起来,形成类似与key-value的关系。
逆转:rev
- rev是逆转每一行字符串顺序的命令
- 使用语法
rev [file …]
当指定文件时(一个或多个),返回文件每一行字符串逆序输出结果,没有指定文件时,则从标准输入读取
# 从文件逆转 [Neptuneyut]$ cat test.txt aaaaaaaaaaaaaaaaaaagggggggggggggggggggg [Neptuneyut]$ rev test.txt test.txt ggggggggggggggggggggaaaaaaaaaaaaaaaaaaa ggggggggggggggggggggaaaaaaaaaaaaaaaaaaa # 从标准输入读取 [Neptuneyut]$ echo aaaaaaaaaaaaaaaaaaagggggggggggggggggggg|rev ggggggggggggggggggggaaaaaaaaaaaaaaaaaaa
- 应用
获取反向互补序列
翻译:tr
tr用于翻译、压缩和删除字符串
- 语法
Usage: tr [OPTION]… SET1 [SET2]
将集合1中的字符替换为集合2中的字符,可以是一一对应的关系,例如set1中2个不同字符对应set2中2个不同字符;也可以是set1中多个字符对应set2中同一个字符。 - 选项
-c, -C, –complement use the complement of SET1
-d, –delete delete characters in SET1, do not translate
-s, –squeeze-repeats replace each input sequence of a repeated character
that is listed in SET1 with a single occurrence
of that character
-t, –truncate-set1 first truncate SET1 to length of SET2,默认进行集合替换
# 将at都替换成A,注意这里与正则式不同,不是将at作为整体替换掉 [Neptuneyut]$ echo aatc|tr at A AAAc - 支持的集合
\NNN character with octal value NNN (1 to 3 octal digits) \\ backslash \a audible BEL \b backspace \f form feed \n new line \r return \t horizontal tab \v vertical tab CHAR1-CHAR2 all characters from CHAR1 to CHAR2 in ascending order [CHAR*] in SET2, copies of CHAR until length of SET1 [CHAR*REPEAT] REPEAT copies of CHAR, REPEAT octal if starting with 0 [:alnum:] all letters and digits [:alpha:] all letters [:blank:] all horizontal whitespace [:cntrl:] all control characters [:digit:] all digits [:graph:] all printable characters, not including space [:lower:] all lower case letters [:print:] all printable characters, including space [:punct:] all punctuation characters [:space:] all horizontal or vertical whitespace [:upper:] all upper case letters [:xdigit:] all hexadecimal digits [=CHAR=] all characters which are equivalent to CHAR 例如:
# 将小写字母转换成大写字母 [Neptuneyut]$ echo aaattt0|tr a-z A-Z AAATTT0 # 或者 [Neptuneyut]$ echo aaattt0|tr [:lower:] [:upper:] AAATTT0 # -c 将不在set1的字符替换为set2 [Neptuneyut]$ echo aaattt-0?9|tr -c a-z0-9 "\t" aaattt 0 9 # -d 不接set2,则将set1中的字符删除 [Neptuneyut]$ echo aaattt-0?9|tr -d a ttt-0?9 # -s 压缩 # 不接set2,压缩set1中连续的字符为1个 [Neptuneyut]$ echo aaattt-0?9 |tr -s at at-0?9 #接set2z,则将连续的set1字符压缩为1个set2字符 [Neptuneyut]$ echo aaattt-0?9 |tr -s t B aaaB-0?9 [Neptuneyut]$ echo aaattt-0?9 |tr -s at AT AT-0?9
反向互补序列
# 生成互补序列 [Neptuneyut]$ echo atttcttcantgatgct|tr a-z A-Z |tr ATCG TAGC TAAAGAAGTNACTACGA [Neptuneyut]$ echo atttcttcantgatgct|tr a-z A-Z |tr ATCG TAGC |rev AGCATCANTGAAGAAAT # 生成反向序列
fasta序列反向互补脚本
ReversedComplementationSequence.sh
#!/usr/bin/env bash # Author:yutao # Date:Tue Oct 22 13:40:49 CST 2019 # Description # To get the reversed complementation nucleotide sequence # Point: # Using bash array to restore ID and Sequence respectively input_seq=$1 #input nucleotide sequence,fasta format ID=() #initialization ID array ID+=($(grep ">" $input_seq)) #save the ID in the array SEQ=() #initialization Sequence array # sed -n "/pattern1/,/pattern2/p" file |sed "/>/d" to get the seq between two pattern for((i=0;i<${#ID[@]};i++)) do next_index=$((i+1)) #curretn index and next index start=${ID[i]} #pattern1 end=${ID[next_index]} #pattern2 seq=$(sed -n "/${start}/,/${end}/p" ${
input_seq}|sed "/>/d"|tr -d "\n") #refromate multi-seq to single seq SEQ+=(${seq}) done for((i=0;i<${#ID[@]};i++)) do echo ${ID[i]} echo ${SEQ[i]}|tr a-z A-Z|tr ATCG TAGC |rev #reversing and complementary sequence done
发布者:全栈程序员-站长,转载请注明出处:https://javaforall.net/205917.html原文链接:https://javaforall.net
