Abstract:
Motivation: Transcription regulatory protein factors often bind DNA
as homo-dimers or hetero-dimers. Thus they recognize structured
DNA motifs that are inverted or direct repeats or spaced motif
pairs. However, these motifs are often difficult to identify owing to
their high divergence. Incorporating the motif structure explicitly into
the motif recognition algorithm improves both the recognition efficiency for
highly divergent motifs and the estimation of motif geometric
parameters.
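
To make the motif structures named above concrete, the following minimal Python sketch checks whether a site is an inverted repeat (palindrome, equal to its own reverse complement) or a direct repeat of two half-sites separated by a spacer. The function names and the `half_len` and `spacer_len` parameters are hypothetical and are not taken from SeSiMCMC. Exact matching is an idealization: real binding sites are divergent, so they satisfy these relations only approximately.

```python
# Illustrative only: these helpers are hypothetical and not part of SeSiMCMC.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return site.translate(COMPLEMENT)[::-1]


def is_palindrome(site: str) -> bool:
    """True if the site is an exact inverted repeat (its own reverse complement)."""
    return site == reverse_complement(site)


def is_direct_repeat(site: str, half_len: int, spacer_len: int) -> bool:
    """True if two identical half-sites of length half_len are separated by spacer_len bases."""
    first = site[:half_len]
    second = site[half_len + spacer_len : 2 * half_len + spacer_len]
    return len(second) == half_len and first == second
```

For example, `is_palindrome("GAATTC")` holds because the reverse complement of GAATTC is GAATTC itself, and `is_direct_repeat("TTAACGGGTTAAC", 5, 3)` holds because the half-site TTAAC recurs after a 3-base spacer.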
Result: We present a modification of the Gibbs sampling motif extraction
algorithm, SeSiMCMC (Sequence Similarities by Markov Chain
Monte Carlo), which finds structured motifs of these types, as well
as non-structured motifs, in a set of unaligned DNA sequences. It
employs improved estimators of motif and spacer lengths. The probability
that a sequence does not contain any motif is accounted for in a
rigorous Bayesian manner. We have applied the algorithm to a set of
upstream regions of genes from two Escherichia coli regulons involved
in respiration. We have demonstrated that accounting for a symmetric
motif structure allows the algorithm to identify weak motifs more accurately.
In the examples studied, ArcA binding sites were demonstrated
to have the structure of a direct spaced repeat, whereas NarP binding
sites exhibited a palindromic structure.
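
The Bayesian treatment of motif-free sequences mentioned above can be illustrated with a generic site-sampler calculation. This is a sketch under stated assumptions, not the SeSiMCMC formulation: it assumes at most one site per sequence, a uniform prior over site positions, and a precomputed motif-to-background likelihood ratio for each candidate position. The function name and both parameters are hypothetical.

```python
# Generic illustration; not the estimator implemented in SeSiMCMC.
def posterior_no_motif(likelihood_ratios, prior_absent):
    """Posterior probability that a sequence contains no motif site.

    likelihood_ratios: P(window | motif) / P(window | background) for each
        candidate position (assumed non-negative).
    prior_absent: prior probability that the sequence has no site.
    """
    if not 0.0 <= prior_absent <= 1.0:
        raise ValueError("prior_absent must lie in [0, 1]")
    n = len(likelihood_ratios)
    if n == 0:
        # No candidate positions, so the sequence cannot hold a site.
        return 1.0
    # Weight of "site present", averaged over a uniform position prior.
    present_weight = (1.0 - prior_absent) * sum(likelihood_ratios) / n
    total = prior_absent + present_weight
    if total == 0.0:
        raise ValueError("all hypotheses have zero weight")
    return prior_absent / total
```

When every position looks like background (all ratios equal to 1), the posterior equals the prior. A single strongly motif-like position drives the posterior of absence toward zero.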
Availability: The WWW interface of the program and its FreeBSD (4.0) and Windows 32 console executables are available at http://bioinform.genetika.ru/SeSiMCMC