This online tool generates a regular expression from a nucleotide sequence, which may include IUPAC ambiguity codes. The resulting pattern can be used with any string or pattern search program (e.g. the Linux command-line tool grep) to extract occurrences of a given consensus sequence from a large file, for example a FASTA/FASTQ file obtained from a next-generation sequencing experiment.
Consensus nucleotide sequence with IUPAC codes, as extracted from the genome browser:
GCNATAACTMTGTHC
Regular expression with ambiguous IUPAC characters resolved:
GC[ACGT]ATAACT[AC]TGT[ACT]C
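The conversion shown above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name iupac_to_regex is hypothetical, and the sketch assumes DNA input using the standard IUPAC nucleotide codes (no U, no gap characters).

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(seq):
    """Return a regular expression matching every sequence covered by `seq`.

    Hypothetical helper: unambiguous bases are copied as-is, ambiguity
    codes are expanded into a character class such as [AC].
    """
    parts = []
    for code in seq.upper():
        bases = IUPAC.get(code)
        if bases is None:
            raise ValueError("not an IUPAC nucleotide code: %r" % code)
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


print(iupac_to_regex("GCNATAACTMTGTHC"))  # prints GC[ACGT]ATAACT[AC]TGT[ACT]C
```

Because the result uses only literal bases and bracketed character classes, it should work unchanged in grep and in most other regular expression engines.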
Finding the sequence in a FASTQ file on the command line:
grep "GC[ACGT]ATAACT[AC]TGT[ACT]C" SAMPLE_1.fastq
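Note that grep prints only the matching lines and also searches header and quality lines, which in principle could produce spurious hits. If the read names are needed, the same pattern can be applied to the sequence lines only. The following Python sketch assumes a plain four-line-per-record FASTQ file (no wrapped sequences) and searches the forward strand only; the function name matching_reads is hypothetical.

```python
import re
from itertools import islice


def matching_reads(lines, pattern):
    """Yield (header, sequence) for reads whose sequence matches `pattern`.

    Assumes each FASTQ record occupies exactly four lines:
    header, sequence, separator, quality.
    """
    regex = re.compile(pattern)
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header = record[0].rstrip("\n")
        sequence = record[1].rstrip("\n")
        if regex.search(sequence):
            yield header, sequence


# Usage (the file name is taken from the grep example above):
# with open("SAMPLE_1.fastq") as handle:
#     for header, sequence in matching_reads(handle, "GC[ACGT]ATAACT[AC]TGT[ACT]C"):
#         print(header, sequence)
```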
ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We can help you to get the most out of your sequencing experiments by developing data analysis strategies and expert consulting. We organize public workshops and conduct on-site trainings on NGS data analysis.
Last updated on August 07, 2016