Friday, April 10, 2009

Similarity Between Sequences

One of the most important applications of clustering in life sciences
is clustering sequences, e.g., DNA or protein sequences. Many clustering
algorithms have been proposed to enhance sequence database searching,
organize sequence databases, generate phylogenetic trees, guide multiple
sequence alignment, and so on. In this specifi c clustering problem, the objects
of interest are biological sequences, which consist of a sequence of symbols,
which could be nucleotides, amino acids, or secondary structure elements
(SSEs). Biological sequences are different from the objects we have discussed
so far, in the sense that they are not defi ned by a collection of attributes. Hence,
the similarity measures we discussed so far are not applicable to biological
sequences.
Over the years, a number of different approaches have been developed for
computing similarity between two sequences (49). The most common are the
alignment-based measures, which fi rst compute an optimal alignment between
two sequences (either globally or locally), and then determine their similarity
by measuring the degree of agreement in the aligned positions of the two
sequences. The aligned positions are usually scored using a symbol-to-symbol
scoring matrix, and in the case of protein sequences, the most commonly used
scoring matrices are PAM (50,51) and BLOSUM (52).
The global sequence alignment (Needleman–Wunsch alignment (53)) aligns
the entire sequences using dynamic programming. The recurrence relations
are the following (49). Given two sequences S1 of length n and S2 of length m,
and a scoring matrix S, let score(i, j) be the score of the optimal alignment of
prefi xes S1[1 . . . i] and S2[1 . . . j].

Assessing Cluster Quality

Clustering results are hard to evaluate, especially for high-dimensional data
and without a priori knowledge of the objects’ distribution, which is quite
common in practical cases. However, assessing the quality of the resulting
clusters is as important as generating the clusters. Given the same data set,
different clustering algorithms with various parameters or initial conditions
will give very different clusters. It is essential to know whether the resulting
clusters are valid and how to compare the quality of the clustering results, so
that the right clustering algorithm can be chosen and the best clustering results
can be used for further analysis.

Case Study: Clustering Gene Expression Data

Recently developed methods for monitoring genomewide mRNA expression
changes such as oligonucleotide chips (68) and cDNA microarrays (69) are
especially powerful because they allow us to monitor quickly and inexpensively
the expression levels of a large number of genes at different time points for
different conditions, tissues, and organisms. Knowing when and under what
conditions a gene or a set of genes is expressed often provides strong clues to
their biological role and function.
Clustering algorithms are used as an essential tool to analyze these data sets
and provide valuable insight into various aspects of the genetic machinery.
There are four distinct classes of clustering problems that can be formulated
from the gene expression data sets, each addressing a different biological
problem. The fi rst problem focuses on fi nding coregulated genes by grouping
genes that have similar expression profi les. These coregulated genes can be
used to identify promoter elements by fi nding conserved areas in their upstream
regions. The second problem focuses on fi nding distinctive tissue types by
grouping tissues whose genes have similar expression profi les. These tissue
groups can then be further analyzed to identify the genes that best distinguish
the various tissues. The third clustering problem focuses on fi nding common
inducers by grouping conditions for which the expression profi les of the
genes are similar. Finding such groups of common inducers will allow us to
substitute different “trigger” mechanisms that still elicit the same response
(e.g., similar drugs, or similar herbicides or pesticides). Finally, the fourth
clustering problem focuses on fi nding organisms that exhibit similar responses
over a specifi ed set of tested conditions by grouping organisms for which the
expression profi les of their genes (in an ortholog sense) are similar. This would
allow us to identify organisms with similar responses to chosen conditions
(e.g., microbes that share a pathway).

Microarray Technologies

The Affymetrix GeneChip oligonucleotide array contains several thousand
ssDNA oligonucleotide probe pairs. Each probe pair consists of an element
containing oligonucleotides that perfectly match the target (PM probe) and an
element containing oligonucleotides with a single base mismatch (MM probe).
A probe set consists of a set of probe pairs corresponding to a target gene.
Similarly, the labeled RNA is extracted from sample cell and hybridizes to its
complementary sequence. The expression level is measured by determining the
difference between the PM and MM probes. Then, for each gene (i.e., probe
set) average difference or log average can be calculated, in which the average
difference is defi ned as the average difference between the PM and MM of
every probe pair in a probe set and log average is defi ned as the average log
ratios of the PM/MM intensities for each probe pair in a probe set.

Clustering Approaches for Gene Expression Data

Since the early days of the development of microarray technologies, a
wide range of existing clustering algorithms have been used, and novel new
approaches have been developed for clustering gene expression data sets.
The most effective traditional clustering algorithms are based either on the
group-average variation of the agglomerative clustering methodology, or on the
K-means approach applied to unit-length gene or condition expression vectors.
Unlike other applications of clustering in life sciences, such as the construction
of phylogenetic trees, or guide trees for multiple sequence alignment, there is
no biological reason that justifi es that the structure of the correct clustering
solution is in the form of a tree. Thus, agglomerative solutions are inherently
suboptimal when compared to partitional approaches, which allow for a wider
range of feasible solutions at various levels of cluster granularity. However,
the agglomerative solutions do tend to produce reasonable and biologically
meaningful results and allow easy visualization of the relationships between
the various genes and/or conditions in the experiments.

Summary of Biologically Relevant Features

The behavior of vcluster and scluster can be controlled by specifying
more than 30 different optional parameters. These parameters can be broadly
categorized into three groups. The fi rst group controls various aspects of the
clustering algorithm, the second group controls the type of analysis and reporting
that is performed on the computed clusters, and the third group controls
the visualization of the clusters. Some of the most important parameters

Building Tree for Large Data Sets

Hierarchical agglomerative trees are used extensively in life sciences
because they provide an intuitive way to organize and visualize the clustering
results. However, there are two limitations with such trees. First, hierarchical
agglomerative clustering may not be the optimal way to cluster data in which
there is no biological reason to suggest that the objects are related to each other
in a tree fashion. Second, hierarchical agglomerative clustering algorithms
have high computational and memory requirements, making them impractical
for data sets with more than a few thousand objects.
To address these problems CLUTO provides the -fulltree option that can be used
to produce a complete tree using a hybrid of partitional and agglomerative
approaches. In particular, when -fulltree is specifi ed, CLUTO builds a complete
hierarchical tree that preserves the clustering solution that was computed. In
this hierarchical clustering solution, the objects of each cluster form a subtree,
and the different subtrees are merged to get an all-inclusive cluster at the end.
Furthermore, the individual trees are combined in a meaningful way, so that
the similarities within each tree are accurately represented.

A Primer on the Visualization of Microarray Data

Introduction
DNA microarrays represent a powerful technology offering unprecedented
scope for discovery (1). However, the ability to measure, in parallel, the gene
expression patterns for thousands of genes represents both the strength and
a key weakness of microarrays. One of the central challenges of functional
genomics has been to cope with the enormity of microarray data sets, and,
indeed, the usefulness of microarrays has been limited by our ability to extract
useful information from these data. In general terms, analyzing microarray
data requires a series of numerical transformations and/or fi lters intended to
extract from the data set the subset of represented genes that may be of interest.
The resulting lists generally represent genes with large variance or periodicity
within their gene expression vectors (2); high fold inductions over a time course
(3); genes that are considered signifi cant by some statistical criterion (4); or
genes that meet some other threshold, such as exceeding a given percentile rank
in the distribution of ratios (5,6). However, examining a spreadsheet of gene
names and expression ratios often provides little insight into the interesting
trends or patterns that may exist within the data. Rather, methods have been
developed for both the classifi cation and display of these data sets. Indeed,
given the non-hypothesis-driven nature of many microarray experiments, the
ability to readily visualize trends in the data assumes paramount importance.

A Note on Gene Expression Data

Array data are invariably expressed in the form of ratios, so a preliminary
discussion is required to indicate how these are normally represented. When
expression ratios are shown numerically, they are most commonly expressed in
logarithmic space, typically to the base 2, i.e., ratio = log2(red intensity/green
intensity). The resulting ratios have the property of being symmetrical about
zero, and a negative sign denotes ratios with a larger denominator (making it
much easier to grasp the size of such ratios, since these ratios are no longer
compressed between 0 and 1, as occurs in real space). Although the base used
is not critical, base 2 logarithms are commonly employed because ratios thus
expressed are easily converted to “fold inductions” in real space, owing to the
familiarity of powers of two as used in the binary numbering system (i.e., a
log2-transformed ratio of 4 = 24 = a 16-fold in difference in real space). Note
that logarithms to the base 10 are seldom employed, for the simple reason that
most biologically relevant ratios tend to fall in the 1- to 10- or 1- to 100-fold
range. However, even when ratios are log transformed, it is still diffi cult to make
sense of columns of numbers, and therefore sensible alternatives are required.

TreeView and Cluster

One of the the most widely used programs for displaying gene expression
data is TreeView, notion that has gained general
currency.

Promoter

Developed by Joe DeRisi, now at UCSF, Promoter is a package designed
to display features associated with genomes and to allow the user to perform
powerful searches to retrieve associated sequence data. This package is available
at both .. The display feature of Promoter is particularly relevant
because it creates genome maps with features annotated in genome order, and
with a correct scale representation of both the size and position of the feature on
the genome or chromosome of interest. Features are also correctly displayed on
either the Crick or Watson strands or can be marked as intergenic. By loading
an optional “gene data” fi le (consisting simply of a tab-delimited text fi le with
feature names and an associated log ratio, with each feature separated with a
carriage return), Promoter also has the ability to map onto the features data a
colorimetric representation of array data. Ultimately, Promoter produces output
in PostScript format suitable for direct importation in Adobe Illustrator or
similar vector-drawing software, where the annotation is then often customized
to suit the fi gure being created. An excerpt from a recently published example
is shown in Fig. 2, which illustrates the results for a single yeast chromosome
for a chromosomal immunoprecipitation array experiment performed using
antibodies against the Rap1 transcription factor (see Chapter 8 for a description
of chromosomal immunoprecipitation on chip methodology).
Although not a focus of this discussion, it should also be noted that Promoter’s
ability to extract primary sequence data from genetic regions corresponding
to hits in microarray experiments is an important preliminary step when
attempting to correlate array results with features of the primary sequence
data, such as when performing scans for the presence of possible regulatory
motifs. Promoter also has the ability to extract primary sequence-containing
motifs specifi ed by the user using the IUPAC degenerate nucleotide code, with
a user-selected acceptable number of mismatches from consensus.

Caryoscope

Unlike the other display methods discussed, Caryoscope developed by Christian Rees of
the Botstein laboratory at Stanford, is not primarily intended to summarize
and display gene expression data but, rather, to display the results of array
comparative genomic hybridizations (aCGH). The essential idea of array-based
CGH is that, instead of cDNA produced from mRNA, genomic DNA (gDNA)
can be directly used in array experiments. Comparative gDNA hybridization
can therefore be used for gene copy estimation (9). As with gene expression
studies, two genomic DNA samples are hybridized at once to the same array,
each having been labeled with a different fl uorophore. Typically, these genomic
DNA samples are obtained from either a normal and a diseased tissue (i.e.,
channel one measures a sample from a cancerous tissue, which may have
undergone gene duplications or deletions, while channel two measures a sample
from a normal tissue) (9,10), or from two closely related species, such as
recently evolutionarily diverged strains of yeast or bacteria (11,12). In either
case, obtaining a log ratio that signifi cantly deviates from zero is presumptive
evidence that either a gene amplifi cation has occurred in one of the tissues, or
that the gene copy number has been increased in the other tissue. Caryoscope is
designed to display these data in a genome-ordered fashion. Separate versions
are available for both yeast and human physical map data. The ability to display
aCGH ratio data in the context of a physical map is particularly useful given that
deletions or duplications often occur on a larger scale than individual genes;
indeed, amplifi cations or deletions often occur in large contiguous regions, and
this is immediately evident on examination of the data with Caryoscope.
A very nice feature of Caryoscope is that a separate diagram is produced
for each of the chromosomes of the organism of interest. As shown in Fig. 3,
Caryoscope displays ratios of greater than zero as red bars to the right of the
line representing the chromosome, while green bars representing ratios of less
than zero are displayed to the left of the line. In this case, color is used only to
indicate the sign of the ratio, while the actual deviation of the log ratio from
zero is proportional to the height of the bar representing a particular locus. In
the example shown, the lines drawn to the right of the main line indicate the
ratio associated with a hypothetical bar of that length. Clearly, most of the
bars in Fig. 3 are of a length consistent with fl uctuation within the “noise” of
the expression ratios, but with notable peaks corresponding to areas of copy
number variation.

DecCor2 exploits

Aside from the visual display, DecCor2 exploits the characteristics of the
distribution of gene expression vector similarities in order to output lists of
genes representing outliers in the distribution. This allows the easy creation
of lists of genes most strongly similar in their expression pattern to a gene of
interest. By specifying the number of standard deviations (SDs) away from
the mean of the distribution that a gene should be before being included in
an output list, the user has direct control over the inclusion parameter. Either
the distribution of gene-specifi c similarities (i.e., a gene-specifi c cutoff) or the
global distribution generated by considering all pairwise similarities (i.e., a
global cutoff) may be used.

Equipment Requirements

DecCor2 was developed and tested on a PC-compatible machine running
Microsoft Windows NT 4.0, but also runs correctly under Windows 2000 and
Windows XP. There is a known problem with Windows 95/98 when searching
for a gene profi le by name: it will not allow entry of the gene name. This
software requires a large amount of memory and 512 megabytes or more of
random access memory is recommended.

Data Formats for DecCor2

DecCor2 minimally requires a data input fi le and may optionally use a
second fi le format that allows the user to highlight specifi c user-specifi ed
sets of genes.

Array Data

The fi rst input fi le that is required is the transcriptional profi ling data. This
should be a tab-delimited text format fi le (which may conveniently be created
using the “Save As” feature from within Microsoft Excel or another spreadsheet
package). Each row of this fi le represents the gene expression vector for a
discrete spot on the arrays. The fi rst column should therefore be a text identifi er
for the spot (this will in most cases be a gene name). The user must ensure
that each row has a unique identifi er, or the behavior of the program will be
undefi ned. Additionally, there must be only one such column of identifi ers.
Future versions should support the inclusion of an optional description column,
but this is not supported at this time. Note that the order in which the gene
names appear in the fi rst column determines the order in which the genes will
be displayed on the screen. If you wish to use the program to display data
in genome order, this order must be refl ected in the order of the genes in
the input file. Note also that it is possible in principle to order genes in
any manner you wish, such as ordering genes by functional category. After
the identifi er column, the number of subsequent columns depends on the
number of measurements available for each array feature. Hence, each subsequent
column represents the log ratios from a particular array. It is crucial
that the ratios be log transformed (the base of the logarithm is unimportant) to
ensure that positive and negative ratios are symmetric around zero. DecCor2
will work with incomplete data sets (resulting, e.g., from spots that have
not passed some quality control measure), provided that empty spots in the
spreadsheet are fi lled in using the “dummy” value of –999 (again, conveniently
done with Excel). There must not be an initial header row describing the
experiments.
An example fi le format for array data is as follows (all columns are tabdelimited):
GeneA 2.13 0.51 1.07 0.25 –2.34
GeneB 1.63 –999 5.14 –0.38 1.4
GeneC –4.71 2.14 –999 3.23 –999

Gene Lists

Once an initial analysis has been performed, DecCor2 also permits the user
to load a custom gene list that results in specifi ed genes being highlighted
on the gene “maps.” To create a gene list, simply create a text fi le with the
name of each unique identifi er that you wish to be highlighted separated by
carriage returns. Any identifi ers that were not used in the original data fi le will
be ignored, and there should be only one identifi er per line.

Concluding a DecCor2 Session

Pressing the “OK” button in the lower right-hand corner exits the graphic
display portion of DecCor2 and initiates the creation of the output fi le. This
output fi le will be created in the directory containing the DecCor2 executable
and will have the name specifi ed by the user during program initialization.
This output fi le, which should be opened in a text editor such as Wordpad that
supports long lines, consists of three portions:
1. A series of 200 bins containing the counts of correlation coeffi cients falling
within each bin. Suitable for constructing a histogram of the global distribution
of correlation coeffi cients, this output often assumes a very normal appearance
when using a large input data set.
2. A rowwise output of all the input genes, followed by a list on the same row of
all the genes that had a correlation coeffi cient higher than the specifi ed number
of SDs above the mean of all the pairwise correlation coeffi cients. The cutoff
correlation coeffi cient used for this output is the aforementioned global cutoff.
3. Another rowwise list of all genes, followed by a list on the same row of all the
genes with a correlation coeffi cient exceeding a gene-specifi c cutoff. In this
case, the cutoff is the mean plus the specifi ed number of SDs calculated from the
distribution of only that gene’s pairwise correlation coeffi cients. Optionally, this
list may be fi ltered to show only genes that are pairwise reciprocal (i.e., GeneB
appears in the list for GeneA, and vice versa). Both of the gene lists generated are
intended to help identify genes that are most similar in their pattern of expression
to the gene listed in the fi rst column. Note that the most similar gene is always
itself, owing to the correlation coeffi cient of 1 for self-similarity.

Tracking data

Spotted Samples
The key piece of information that has to be recorded when using microarrays
concerns the nature of the elements that are spotted on the array. Depending
on exactly what is being spotted, and how it was generated, these data may
be in the form of DBEST clone IDs and accession numbers, polymerase chain
reaction (PCR) primer sequences, or sequences of long oligonucleotides that
are spotted on the array. It has been argued that sequences should be mandatory
because each clone on an array can be represented in public databases by
several different IDs and accession numbers.

Quality Control Information from Preparation of Materials for Spotting

When either cDNA clones are spotted or specifi c regions of a genome are
selected and amplifi ed, they have to go through at least one round of PCR
before printing. It is important that quality control information about these PCR
products be tracked, because it may subsequently be useful for selection of
data from a microarray database. For example, data from spots where the PCR
failed, or the fragment was of an unexpected size, can be fi ltered out. Failed
or anomalous products are printed because PCR reactions are usually done in
96-well format, so to simply spot only the products of those reactions that were
deemed to have worked is impractical—instead, the content of each well in each
plate is spotted. Thus, tracking the quality of those products is necessary.

Physical Location of Products

DNA microarrays on glass slides are usually printed by spotting DNA
solutions from individual wells in the microtiter plates. Each spot on a DNA
microarray uniquely corresponds to the well in a microtiter plate and not to the
identity of a DNA material that has been printed. That is why it is critical to
keep track of the physical location—plate coordinates—of each DNA solution.
In practice, a database user will benefi t from this information at least on two
accounts. First, DNA replicates with the same IDs within the same or different
PCR plates can be easily identifi ed by their different locations on the printing
plates. This facilitates the independent statistical analysis of hybridization data
obtained on the identical DNA sequences. Second, if some nonidentical DNA
sequences demonstrate very similar expression profi les, it is common practice
to verify their relative location on the array. If such genes make a “streak” on
the array (i.e., printed by the same print tip) located next to each other in a
subarray and show identical trend in hybridization outcome (“green” or “red”),
one has to rule out the possibility of cross-contamination during the print. A
similar need arises when one deals with “identical expression profi les” of the
genes that are situated next to each other in a print microtiter plate. Effi cient
recovery of the information about the print plate location of DNA solutions
in question will streamline the analysis as well as help line up the control
experiments.