Friday, April 10, 2009

Similarity Between Sequences

One of the most important applications of clustering in life sciences
is clustering sequences, e.g., DNA or protein sequences. Many clustering
algorithms have been proposed to enhance sequence database searching,
organize sequence databases, generate phylogenetic trees, guide multiple
sequence alignment, and so on. In this specifi c clustering problem, the objects
of interest are biological sequences, which consist of a sequence of symbols,
which could be nucleotides, amino acids, or secondary structure elements
(SSEs). Biological sequences are different from the objects we have discussed
so far, in the sense that they are not defi ned by a collection of attributes. Hence,
the similarity measures we discussed so far are not applicable to biological
sequences.
Over the years, a number of different approaches have been developed for
computing similarity between two sequences (49). The most common are the
alignment-based measures, which fi rst compute an optimal alignment between
two sequences (either globally or locally), and then determine their similarity
by measuring the degree of agreement in the aligned positions of the two
sequences. The aligned positions are usually scored using a symbol-to-symbol
scoring matrix, and in the case of protein sequences, the most commonly used
scoring matrices are PAM (50,51) and BLOSUM (52).
The global sequence alignment (Needleman–Wunsch alignment (53)) aligns
the entire sequences using dynamic programming. The recurrence relations
are the following (49). Given two sequences S1 of length n and S2 of length m,
and a scoring matrix S, let score(i, j) be the score of the optimal alignment of
prefi xes S1[1 . . . i] and S2[1 . . . j].

Assessing Cluster Quality

Clustering results are hard to evaluate, especially for high-dimensional data
and without a priori knowledge of the objects’ distribution, which is quite
common in practical cases. However, assessing the quality of the resulting
clusters is as important as generating the clusters. Given the same data set,
different clustering algorithms with various parameters or initial conditions
will give very different clusters. It is essential to know whether the resulting
clusters are valid and how to compare the quality of the clustering results, so
that the right clustering algorithm can be chosen and the best clustering results
can be used for further analysis.

Case Study: Clustering Gene Expression Data

Recently developed methods for monitoring genomewide mRNA expression
changes such as oligonucleotide chips (68) and cDNA microarrays (69) are
especially powerful because they allow us to monitor quickly and inexpensively
the expression levels of a large number of genes at different time points for
different conditions, tissues, and organisms. Knowing when and under what
conditions a gene or a set of genes is expressed often provides strong clues to
their biological role and function.
Clustering algorithms are used as an essential tool to analyze these data sets
and provide valuable insight into various aspects of the genetic machinery.
There are four distinct classes of clustering problems that can be formulated
from the gene expression data sets, each addressing a different biological
problem. The fi rst problem focuses on fi nding coregulated genes by grouping
genes that have similar expression profi les. These coregulated genes can be
used to identify promoter elements by fi nding conserved areas in their upstream
regions. The second problem focuses on fi nding distinctive tissue types by
grouping tissues whose genes have similar expression profi les. These tissue
groups can then be further analyzed to identify the genes that best distinguish
the various tissues. The third clustering problem focuses on fi nding common
inducers by grouping conditions for which the expression profi les of the
genes are similar. Finding such groups of common inducers will allow us to
substitute different “trigger” mechanisms that still elicit the same response
(e.g., similar drugs, or similar herbicides or pesticides). Finally, the fourth
clustering problem focuses on fi nding organisms that exhibit similar responses
over a specifi ed set of tested conditions by grouping organisms for which the
expression profi les of their genes (in an ortholog sense) are similar. This would
allow us to identify organisms with similar responses to chosen conditions
(e.g., microbes that share a pathway).

Microarray Technologies

The Affymetrix GeneChip oligonucleotide array contains several thousand
ssDNA oligonucleotide probe pairs. Each probe pair consists of an element
containing oligonucleotides that perfectly match the target (PM probe) and an
element containing oligonucleotides with a single base mismatch (MM probe).
A probe set consists of a set of probe pairs corresponding to a target gene.
Similarly, the labeled RNA is extracted from sample cell and hybridizes to its
complementary sequence. The expression level is measured by determining the
difference between the PM and MM probes. Then, for each gene (i.e., probe
set) average difference or log average can be calculated, in which the average
difference is defi ned as the average difference between the PM and MM of
every probe pair in a probe set and log average is defi ned as the average log
ratios of the PM/MM intensities for each probe pair in a probe set.

Clustering Approaches for Gene Expression Data

Since the early days of the development of microarray technologies, a
wide range of existing clustering algorithms have been used, and novel new
approaches have been developed for clustering gene expression data sets.
The most effective traditional clustering algorithms are based either on the
group-average variation of the agglomerative clustering methodology, or on the
K-means approach applied to unit-length gene or condition expression vectors.
Unlike other applications of clustering in life sciences, such as the construction
of phylogenetic trees, or guide trees for multiple sequence alignment, there is
no biological reason that justifi es that the structure of the correct clustering
solution is in the form of a tree. Thus, agglomerative solutions are inherently
suboptimal when compared to partitional approaches, which allow for a wider
range of feasible solutions at various levels of cluster granularity. However,
the agglomerative solutions do tend to produce reasonable and biologically
meaningful results and allow easy visualization of the relationships between
the various genes and/or conditions in the experiments.

Summary of Biologically Relevant Features

The behavior of vcluster and scluster can be controlled by specifying
more than 30 different optional parameters. These parameters can be broadly
categorized into three groups. The fi rst group controls various aspects of the
clustering algorithm, the second group controls the type of analysis and reporting
that is performed on the computed clusters, and the third group controls
the visualization of the clusters. Some of the most important parameters

Building Tree for Large Data Sets

Hierarchical agglomerative trees are used extensively in life sciences
because they provide an intuitive way to organize and visualize the clustering
results. However, there are two limitations with such trees. First, hierarchical
agglomerative clustering may not be the optimal way to cluster data in which
there is no biological reason to suggest that the objects are related to each other
in a tree fashion. Second, hierarchical agglomerative clustering algorithms
have high computational and memory requirements, making them impractical
for data sets with more than a few thousand objects.
To address these problems CLUTO provides the -fulltree option that can be used
to produce a complete tree using a hybrid of partitional and agglomerative
approaches. In particular, when -fulltree is specifi ed, CLUTO builds a complete
hierarchical tree that preserves the clustering solution that was computed. In
this hierarchical clustering solution, the objects of each cluster form a subtree,
and the different subtrees are merged to get an all-inclusive cluster at the end.
Furthermore, the individual trees are combined in a meaningful way, so that
the similarities within each tree are accurately represented.