Friday, April 10, 2009

Similarity Between Sequences

One of the most important applications of clustering in life sciences
is clustering sequences, e.g., DNA or protein sequences. Many clustering
algorithms have been proposed to enhance sequence database searching,
organize sequence databases, generate phylogenetic trees, guide multiple
sequence alignment, and so on. In this specifi c clustering problem, the objects
of interest are biological sequences, which consist of a sequence of symbols,
which could be nucleotides, amino acids, or secondary structure elements
(SSEs). Biological sequences are different from the objects we have discussed
so far, in the sense that they are not defi ned by a collection of attributes. Hence,
the similarity measures we discussed so far are not applicable to biological
sequences.
Over the years, a number of different approaches have been developed for
computing similarity between two sequences (49). The most common are the
alignment-based measures, which fi rst compute an optimal alignment between
two sequences (either globally or locally), and then determine their similarity
by measuring the degree of agreement in the aligned positions of the two
sequences. The aligned positions are usually scored using a symbol-to-symbol
scoring matrix, and in the case of protein sequences, the most commonly used
scoring matrices are PAM (50,51) and BLOSUM (52).
The global sequence alignment (Needleman–Wunsch alignment (53)) aligns
the entire sequences using dynamic programming. The recurrence relations
are the following (49). Given two sequences S1 of length n and S2 of length m,
and a scoring matrix S, let score(i, j) be the score of the optimal alignment of
prefi xes S1[1 . . . i] and S2[1 . . . j].

No comments:

Post a Comment