1. Data File Upload Options
1.1 GEDA Input format 1
1.2 GEDA Input format 2
1.3 Sample Data Files
2. Data Pre-Processing Steps
2.1 Log Transformation
2.2 Normalization options
2.3 Duplicate Gene Entry
2.4 Graphical Output Options
3. Analysis Settings 1: Test for Differentially Expressed Genes
3.1 Permutation Test Option
3.2 Test for Differentially Expressed Genes Options
3.3 Threshold
3.4 Measure of Central Tendency
3.5 Special Options
4. Analysis Settings 2: Sample Classification Operation
4.1 Clustering Algorithms
4.2 Distance Functions
5. Analysis Settings 3: Gene Classification Options
6. Analysis Settings 4: Computational Validation
7. Group Settings
7.1 Number of Samples
7.2 Number of Genes
7.3 Sample Classification Form
7.4 Interpreting the Output
7.5 Future Development
|
|
1. Data File Upload Options
The Gene Expression Data
Analysis Tool allows the user to browse their local drive for their
tab-delimited text files in two formats:
- 1.1 Pitt Standard 1
In Pitt Standard 1, the first column of the data file contains
either gene names or accession numbers (unique gene ID numbers).
- 1.2 Pitt Standard 2
In Pitt Standard 2, the first column of the data file contains
the gene name, and the second column contains the gene accession
number, or unique gene ID. The accession numbers and unique ID numbers
are used to generate a table of links to genomic databases to facilitate
further downstream interpretation (see .
In both Pitt Standard input
formats 1 and 2, the first cell (first row and column) must contain
the string "Name". The first row of each remaining column is labelled
with the array name (sample name), and each column contains the
expression values for that array.
We encourage users to use uncorrected, unnormalized, truly raw
intensity values (no background subtraction) because the benefit of
every data pre-processing step, while motivated by concern over systematic
error, must be weighed against the introduction of additional systematic
error (overcorrection) and the introduction of unwanted measurement error
through error propagation (background values, like intensity values,
are measured with error). Our simulation results have demonstrated to
us that background subtraction and high-pass filter methods for removing
genes that are not 'above background' result in a cost to the performance
of all methods of analysis examined to date.
Example data formats are:
Pitt Standard 1 (GEDA format 1)
Name S1 S2 S3 S4 ... Sm
G1 I11 I12 I13 I14 ... I1m
G2 I21 I22 I23 I24 ... I2m
G3 I31 I32 I33 I34 ... I3m
G4 I41 I42 I43 I44 ... I4m
. . . . . ... .
. . . . . ... .
. . . . . ... .
. . . . . ... .
. . . . . ... .
Gn In1 In2 In3 In4 ... Inm
The dataset is a tab-delimited
text file in which the first column holds the gene names or gene
accession numbers (G1, G2 ... Gn). Consecutive columns hold the
individual arrays, where S1, S2 ... Sm are sample names. Ixy represents
the gene expression intensity value for the xth gene in the yth sample.
Pitt Standard 2 (GEDA Format 2)
Name ACC_NO S1 S2 S3 S4 ... Sm
G1 ACC-1 I11 I12 I13 I14 ... I1m
G2 ACC-2 I21 I22 I23 I24 ... I2m
G3 ACC-3 I31 I32 I33 I34 ... I3m
G4 ACC-4 I41 I42 I43 I44 ... I4m
. . . . . . ... .
. . . . . . ... .
. . . . . . ... .
. . . . . . ... .
. . . . . . ... .
Gn ACC-n In1 In2 In3 In4 ... Inm
The dataset is a tab-delimited
text file in which the first column holds the gene names (G1, G2 ... Gn).
The second column holds the gene accession number, which is used to
generate links to public databases in the output. Consecutive columns
hold the individual arrays, where S1, S2 ... Sm are sample names. Ixy
represents the gene expression intensity value for the xth gene in the
yth sample.
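The two input layouts described above can be read with a few lines of Python. The following is a minimal sketch, not the GEDA tool's own parser: the function name `read_pitt_standard` is hypothetical, and we assume that format 2 can be recognised by the literal `ACC_NO` header shown in the example.

```python
import csv
import io

def read_pitt_standard(text):
    """Parse a tab-delimited Pitt Standard 1 or 2 table held in a string.

    Returns (sample_names, rows). Each row is (gene_name, accession, values);
    accession is None for Pitt Standard 1.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if header[0] != "Name":
        raise ValueError('the first cell must contain the string "Name"')
    # Assumption: format 2 is identified by the header of the second column.
    has_accession = len(header) > 1 and header[1] == "ACC_NO"
    first_value_col = 2 if has_accession else 1
    sample_names = header[first_value_col:]
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        accession = fields[1] if has_accession else None
        values = [float(v) for v in fields[first_value_col:]]
        if len(values) != len(sample_names):
            raise ValueError(f"row {fields[0]!r} has the wrong number of values")
        rows.append((fields[0], accession, values))
    return sample_names, rows

# Tiny Pitt Standard 2 example with two genes and two samples.
example = "Name\tACC_NO\tS1\tS2\nG1\tACC-1\t120.5\t98.0\nG2\tACC-2\t15.0\t22.5\n"
samples, rows = read_pitt_standard(example)
```

Here `rows[0]` holds the name, accession number and intensity values of the first gene; for a Pitt Standard 1 file the accession entry is simply `None`.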
1.3 Sample Data Files
To facilitate use of the GEDA tool for training, we have included
a variety of published cancer data sets that auto-load when the user
selects this option. To facilitate research involving the re-analysis
of published cancer data sets, we provide a list of links to each of
these published cancer microarray data sets under the UPITT Cancer Gene
Expression Dataset (http://bioinformatics1.upmc.edu/Help/UPITTGED.htm).
Individuals who wish to submit links to their published cancer microarray
data sets should send an email to the senior PI (lyonsweilerj@msx.upmc.edu).
|
|
2. Data Pre-Processing Steps
2.1 Log Transformation (not recommended)
We have included an option to log-transform (log2, log10 or ln).
When this option is selected, the individual expression intensity values
are log-transformed. We do not recommend this practice because, in spite
of expectations to the contrary, our simulation studies clearly demonstrate
that a loss of useful information occurs under all methods of analysis
evaluated to date when the data are log-transformed. The tests for differentially
expressed genes are proving to be more robust to the violation of the
assumption of normality than to the ill effects of quasi-normality imposed
by the log transformation.
2.2 Normalization options (Last updated 7/3/03)
Normalization
is most commonly applied to correct for among-array differences in
measured global expression. Three types of bias may exist in
microarray data: linear multiplicative (y = ax), linear additive
(y = x + a), and nonlinear (y = x^a). Various biological,
experimental, and technological sources of variability can exist
simultaneously in any given data set, and in reality they enter the
experiment at various stages of processing. Different approaches to
normalization exist, and as of this writing (7/3/03) they have not been
completely compared using our simulations.
Source of Variation | Type | Corrections
Array-Wide
RNA extraction quality/quantity | linear? multiplicative | various (scaling factors)
tissue heterogeneity | mixed | LCM, GMA, ???
warm handling time | linear additive | randomization/GMA
slide/chip age | linear additive | randomization/GMA
drying time | linear additive | randomization/GMA
washing efficiency | linear additive | randomization/GMA
dye-labeling efficiency | nonlinear mixed | GMA, lowess??
background | linear additive | GMA
scanner intensity | linear? Multiplicative? | ???
Spot/Probe/Gene Specific
sequence variation | nonlinear mixed | PM-MM?, ???
nonspecific cross-hybridization | linear multiplicative | PM-MM?, ???
gene loss | linear multiplicative | signal!
gene amplification | linear multiplicative | signal!
upregulation | linear additive | signal!
downregulation | linear additive | signal!
Normalization for Linear Multiplicative Biases
The options in the GEDA tool for linear multiplicative normalization include
those that implement within-array normalization listed by Kroll et al.
(2002): sum scaling factor, mean scaling factor, median scaling
factor, quantile/percentile scaling factor, trimmed mean scaling
factor and asymmetric trimmed mean scaling factor, and two that implement
among-array normalization. Each of these is briefly described:
Within-Array
Sum: all intensity values on the array are multiplied by the sum of
all intensity values on the array.
Mean: all intensity values on the array are multiplied by the mean of
all intensity values on the array.
Median: all intensity values on the array are multiplied by the median
of all intensity values on the array. In the case of an odd number
of spots, the median is the central value; in the case of an even number
of spots, the median is the mean of the two central values.
Quantile/Percentile: all intensity values on the array are multiplied by
the mean of all intensity values on the array beyond the specified quantile.
Trimmed mean: all intensity values on the array are multiplied by the
mean of all intensity values on the array between the two specified
quantiles.
Among-Array
Median Mean - Chip-wide expression intensity varies from
array to array. A number of papers have performed ‘globalization’ using a
simple multiplicative factor to make the overall mean intensities of all
arrays the same. One such factor is the ‘median mean’ adjustment: the
mean intensity of each array is calculated, the median of these means (the
median mean) is identified, and an array-specific multiplicative factor is
defined as the ratio of that array’s mean to the median mean. Each
intensity value is scaled using that factor,
(1)
NB: We have determined using
simulations that ‘median mean’ normalization does not recover information
lost due to simple linear additive biases for any test for differentially
expressed genes and should not be used unless multiplicative bias is suspected.
Similarly, GMA partially, but not fully, recovers information lost due to multiplicative
biases. Precisely how to tell when multiplicative bias is present is
still under study. We recommend the alternative linear
direct scaling we call the Global Mean Adjustment (see below) until this is
sorted out.
Minimum Mean - This is identical to the median mean, but instead of
normalizing to the median mean, the scaling factor becomes the ratio of a
given array's mean to the minimum of all array means.
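The two among-array scaling options can be sketched as follows. This is an illustration rather than the GEDA implementation: the function name `scale_to_reference_mean` is hypothetical, and we assume that each intensity is divided by the array-specific factor (array mean over reference mean), since that is the direction that makes all array means equal.

```python
from statistics import mean, median

def scale_to_reference_mean(arrays, reference="median"):
    """Rescale each array so that its mean equals a reference mean.

    arrays: list of per-array intensity lists.
    reference: "median" uses the median of the array means,
               "minimum" uses the smallest array mean.
    """
    array_means = [mean(a) for a in arrays]
    target = median(array_means) if reference == "median" else min(array_means)
    scaled = []
    for values, array_mean in zip(arrays, array_means):
        factor = array_mean / target  # ratio of this array's mean to the reference
        # Assumption: dividing by the factor brings every array to the reference mean.
        scaled.append([v / factor for v in values])
    return scaled

arrays = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0], [40.0, 80.0, 120.0]]
median_scaled = scale_to_reference_mean(arrays, "median")
minimum_scaled = scale_to_reference_mean(arrays, "minimum")
```

In this toy example the three arrays differ only by a multiplicative factor, so after scaling they become identical.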
Normalization for Linear Additive Biases
The options in the GEDA tool for linear additive normalization are the
among-array Global Mean Adjustment and Min=0, Max=1 scaling. We also
offer the z-transformation. These are described briefly below:
Global Mean Adjustment (GMA) – highly recommended,
especially if the coefficient of variation among the array mean
expression intensities is > 0.3. This linear normalization function
works as follows. First, the global mean (grand mean) of all intensities
of all arrays is calculated. Then, the difference between each individual
array mean and the grand mean is calculated. This array-specific difference
value is then added to (or subtracted from) each individual expression
intensity value on that array. The result is that all arrays now have the
same overall mean.
We have found no conditions under which the GMA decreases the performance
of the tests for differentially expressed genes (i.e., they always outperform
themselves after GMA) and for some tests (the J5 test in particular) the
GMA recovers all information loss due to simple linear biases (in contrast
to the median mean multiplicative factor in equation 1).
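The adjusted intensity value I'_{ij} of the ith of N genes in the jth
of M arrays under the GMA is
I'_{ij} = I_{ij} - (\bar{I}_{j} - \bar{I})    (2)
where \bar{I}_{j} = \frac{1}{N}\sum_{i=1}^{N} I_{ij} is the mean of array j and
\bar{I} = \frac{1}{NM}\sum_{j=1}^{M}\sum_{i=1}^{N} I_{ij} is the grand mean.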
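The GMA steps described above translate directly into code. The following is a minimal sketch (the function name `global_mean_adjustment` is ours, and arrays are represented as plain lists of intensities):

```python
from statistics import mean

def global_mean_adjustment(arrays):
    """Shift every array so that its mean equals the grand mean.

    arrays: list of per-array intensity lists.
    """
    # Grand mean over all intensities of all arrays.
    grand_mean = mean(v for values in arrays for v in values)
    adjusted = []
    for values in arrays:
        shift = grand_mean - mean(values)  # array-specific additive correction
        adjusted.append([v + shift for v in values])
    return adjusted

# Two arrays that differ by a purely additive offset of 10.
arrays = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
adjusted = global_mean_adjustment(arrays)
```

Because the correction is additive, differences between genes within an array are left unchanged; only the array-wide offset is removed.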
Max1, Min0 (not recommended)- Re-scales
each expression intensity profile so the array-specific maxima and
minima are equal in all arrays.
Z-transformation-like the log transformation, the z-transformation
can change the shape of the distribution, and can reduce the performance
of some tests for differentially expressed genes. The original intensity
value xij is replaced with
(3)
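For comparison, the two rescalings just described can be sketched as below. Two points are assumptions on our part, since the text does not settle them: that each array is transformed independently, and that the z-transformation uses the array's own mean and sample standard deviation.

```python
from statistics import mean, stdev

def min_max_scale(values):
    """Rescale one array so that its minimum is 0 and its maximum is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_transform(values):
    """Replace each value by its distance from the array mean in SD units.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

scaled = min_max_scale([2.0, 4.0, 6.0])
z_scores = z_transform([1.0, 2.0, 3.0])
```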
2.3 Duplicate Gene Entry - this option initiates
a routine, prior to any other pre-processing or analysis, that averages
all genes with the same (identical) string in the name column, removes
all but one of those gene entries, and replaces the solitary remaining
gene entry with the average of all the entries.
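A minimal sketch of the duplicate-averaging routine is given below. The function name and the (name, values) row layout are illustrative choices, not the tool's internal representation.

```python
def average_duplicate_genes(rows):
    """Collapse rows that share an identical gene name into one averaged row.

    rows: list of (gene_name, values) pairs.
    Returns one row per name, in order of first appearance.
    """
    grouped = {}
    for name, values in rows:
        grouped.setdefault(name, []).append(values)
    merged = []
    for name, group in grouped.items():
        # Average column by column across all entries with this name.
        averaged = [sum(column) / len(column) for column in zip(*group)]
        merged.append((name, averaged))
    return merged

rows = [("G1", [1.0, 2.0]), ("G2", [5.0, 5.0]), ("G1", [3.0, 4.0])]
merged = average_duplicate_genes(rows)
```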
2.4 Graphical Output Options - This option
can be selected if the user wishes to view the frequency distribution
of each array individually in the output. We recommend this as
a routine data quality check prior to any interpretative analysis.
|
|
3. Analysis Settings 1: Test for Differentially
Expressed Genes
3.1 Permutation Test Option - Any
test for differentially expressed genes can also be applied as a
permutation test. Under permutation tests, the data set is randomized some
arbitrarily large number of times and the test statistic of interest
is determined for all genes in each random data set. Genes that have
a test statistic that falls within the user-defined tail (usually the upper
and, when appropriate, lower 5% tails) of the resulting distribution of
test statistic values are considered to be significant.
Permutation tests can be considered to have a number of advantages, including
freedom from reliance on analytical distributions and their corollary
assumptions. We have not yet evaluated the use of permutation tests using
simulations, and cannot yet make a recommendation for or against their
use.
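The idea can be illustrated for a single gene. The sketch below is a simplification and makes several assumptions of its own: it uses the absolute difference of group means as the statistic, shuffles the pooled values between the two groups, and reports a two-sided permutation p-value. The function name, the number of permutations and the seed are all illustrative.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of values to the two groups
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if permuted >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)

p_separated = permutation_p_value([10.0, 11.0, 12.0, 13.0, 14.0],
                                  [1.0, 2.0, 3.0, 4.0, 5.0])
p_identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

A gene would be called significant when its p-value falls below the chosen tail probability (for example 0.05).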
3.2 Test for Differentially Expressed Genes Options
Pooled variance t-test - (somewhat recommended). Appropriate
when n1 ≠ n2; identical to the simple t test when n1 = n2.
(4)
Simple t test- Appropriate when n1 =
n2
(5)
Signal-to-Noise – This measure is similar to the t-test.
Our early studies show that it has a high pFDR (positive false discovery
rate).
(6)
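Because equations 4 to 6 are not reproduced in this text, the sketch below uses the usual textbook forms of the three statistics. These are assumptions on our part and may differ in detail from the GEDA implementation; in particular, the signal-to-noise measure is taken here to be the difference in means divided by the sum of the two standard deviations.

```python
from math import sqrt
from statistics import mean, stdev, variance

def pooled_t(a, b):
    """Pooled-variance t statistic; group sizes may differ."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def simple_t(a, b):
    """t statistic for equal group sizes (assumes len(a) == len(b))."""
    n = len(a)
    return (mean(a) - mean(b)) / sqrt((variance(a) + variance(b)) / n)

def signal_to_noise(a, b):
    """Difference in means scaled by the sum of the standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

group_1 = [5.0, 6.0, 7.0]
group_2 = [1.0, 2.0, 3.0]
t_pooled = pooled_t(group_1, group_2)
t_simple = simple_t(group_1, group_2)
s2n = signal_to_noise(group_1, group_2)
```

With equal group sizes the two t statistics coincide, which matches the statement above that the pooled variance test is identical to the simple t test when n1 = n2.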
Simple separability test (Inverse Gaussian) - Let Na and Nb be the number
of samples in Group A and Group B. We define N1 as the number
of samples in Group A for which the expression value for the ith gene
is less than the minimum value of that gene in Group B; N2 as the number
of samples in Group A for which the expression value for the ith gene
is greater than the maximum value of that gene in Group B; N3 as the
number of samples in Group B for which the expression value for the
ith gene is less than the minimum value of that gene in Group A; and
N4 as the number of samples in Group B for which the expression value
for the ith gene is greater than the maximum value of that gene in Group
A. This is equivalent to the proportion of non-overlap between the two
distributions.
The counts N1, N2, N3 and N4 are paired: in the perfectly separable case,
either N1 = Na and N4 = Nb (with N2 = N3 = 0), or N2 = Na and N3 = Nb
(with N1 = N4 = 0).
Our score S is
S = \max\left( \frac{N_1 + N_4}{N_a + N_b}, \frac{N_2 + N_3}{N_a + N_b} \right)    (7)
and has an upper limit of 1.0. The threshold Ts ranges from 0 to 1.
Genes with scores S greater than or equal to Ts are retained (1 = perfect
separability, no overlap in the distributions).
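The score in equation 7 can be computed for one gene as in the following sketch (the function name is illustrative):

```python
def separability_score(group_a, group_b):
    """Simple separability score S for one gene (equation 7)."""
    n1 = sum(1 for v in group_a if v < min(group_b))  # A values below all of B
    n2 = sum(1 for v in group_a if v > max(group_b))  # A values above all of B
    n3 = sum(1 for v in group_b if v < min(group_a))  # B values below all of A
    n4 = sum(1 for v in group_b if v > max(group_a))  # B values above all of A
    total = len(group_a) + len(group_b)
    return max((n1 + n4) / total, (n2 + n3) / total)

perfect = separability_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
partial = separability_score([1.0, 2.0, 5.0], [3.0, 4.0, 6.0])
overlap = separability_score([1.0, 2.0], [1.0, 2.0])
```

In the partially overlapping example, two Group A values lie below every Group B value and one Group B value lies above every Group A value, so S = 3/6.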
Weighted Separability - Under weighted separability, the score for
each gene is weighted using information on the magnitude of the difference
between the means. Specifically, the weight for the ith gene is
w_i = \frac{\bar{x}_{ia} - \bar{x}_{ib}}{\frac{1}{n}\sum_{j=1}^{n} (\bar{x}_{ja} - \bar{x}_{jb})}    (8)
where \bar{x}_{ia} and \bar{x}_{ib} are the means of gene i in Groups A and B
and n is the number of genes. Under weighted separability, the score S
becomes wS and has a lower limit of 0 and an upper limit of Na+Nb.
J5 test- (highly recommended) This is wi used as a test and essentially
compares the difference of means in any gene to the average difference
in means over the whole array. It has proven to be superior using
ROC analysis and FDR analysis under a wide variety of distributions.
It is robust like the t-test (remarkably insensitive to violations of
normality).
(9)
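A minimal sketch of the J5 score follows. One detail is an assumption rather than something stated above: the denominator averages the absolute differences in means, so that up- and down-regulated genes do not cancel each other out. The function name and the row layout are illustrative.

```python
from statistics import mean

def j5_scores(group_a_rows, group_b_rows):
    """J5 score for every gene.

    group_a_rows, group_b_rows: one list of intensities per gene,
    in the same gene order for both groups.
    """
    diffs = [mean(a) - mean(b) for a, b in zip(group_a_rows, group_b_rows)]
    # Assumption: absolute differences are averaged over the whole array.
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    return [d / mean_abs_diff for d in diffs]

# Three genes, two samples per group.
a_rows = [[10.0, 12.0], [5.0, 5.0], [1.0, 3.0]]
b_rows = [[4.0, 6.0], [5.0, 7.0], [4.0, 6.0]]
scores = j5_scores(a_rows, b_rows)
```

The differences in means are 6, -1 and -3, their mean absolute value is 10/3, and the first gene therefore scores 1.8.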
nfold ratio of means-(not recommended) This
version of ‘fold-change’ is not robust to changes in the distribution
and never outperforms the J5 test.
(10)
nfold mean of ratios - (not recommended) This version of
‘fold-change’ is slightly more robust than the ratio of means but never
outperforms the J5 test.
(11)
Most of these tests have been compared using simulations. In
brief, the J5 test appears superior to all others under the following
evaluation criteria: area under the ROC curve (Fig. 1); SN+SP+PU* (= PER*),
where SN is sensitivity, SP is specificity, and PU* is a weighted measure
of the predictive utility of the gene set selected by the test for sample
classification inferences (class discovery and class prediction); and
positive false discovery rate (pFDR; Figs. 2 and 3).
SAM
The SAM module implements Significance Analysis of
Microarrays for the two-class unpaired design. A complete
description of the SAM module as implemented in the GEDA tool is
available separately.
Fig. 1. ROC curve generated using
the associated Gene Expression Data Simulator for three tests under
the truncated Cauchy distribution. Log-transformation of the data prior
to these tests makes each test worse (not shown). The signal-to-noise
test exhibits similar performance to the t-test.
Fig. 2. pFDR and PER* of four
tests for finding differentially expressed genes. Simulation conditions and
models to be described in upcoming papers. The differences among the tests
are not distribution-shape dependent and all tests were performed at a
threshold where the expected Type 1 error rate was 0.05.
Fig. 3. PER* of three tests
for differentially expressed genes under six distributions. This result shows that the
J5 test and the t-test are remarkably robust to differences in the
underlying data distribution but that ratio-of-means fold changes is
less robust.
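BSS/WSS Ratio (Dudoit et al., 2002)
For a given gene j, with samples i = 1, 2, ..., n belonging to classes
k = 1, 2, 3...K, the ratio of the between-group to within-group sums of
squares is
BSS(j)/WSS(j) = \frac{\sum_i \sum_k I(y_i = k)(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_i \sum_k I(y_i = k)(x_{ij} - \bar{x}_{kj})^2}
where x_{ij} is the expression level of gene j in sample i, y_i is the class
of sample i, I(.) is the indicator function, and
\bar{x}_{kj} = average expression level of gene j among samples in class k
\bar{x}_{\cdot j} = average expression level of gene j among all samples.
This ratio was originally proposed by Fisher and forms the basis of linear
discriminant analysis. It has not yet been evaluated using our
simulator.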
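As a sketch of how the ratio is computed for one gene, following the between-group and within-group sums of squares of Dudoit et al. (the function name and the example labels are illustrative):

```python
from statistics import mean

def bss_wss_ratio(values, labels):
    """BSS/WSS ratio for one gene.

    values: expression level of the gene in each sample.
    labels: class label of each sample, in the same order.
    """
    overall_mean = mean(values)
    bss = 0.0
    wss = 0.0
    for k in sorted(set(labels)):
        members = [v for v, label in zip(values, labels) if label == k]
        class_mean = mean(members)
        # Each sample in class k contributes (class mean - overall mean)^2 to BSS.
        bss += len(members) * (class_mean - overall_mean) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)
    return bss / wss

ratio = bss_wss_ratio([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [1, 1, 1, 2, 2, 2])
```

Here the class means are 2 and 8 and the overall mean is 5, so BSS = 54, WSS = 4 and the ratio is 13.5. A gene that is constant within every class has WSS = 0 and would need special handling.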
3.3 Threshold (T)
This is a value that a test statistic (or other measure, such as a
ratio) must exceed for a gene to be considered 'significantly' different
between the specified groups. The threshold value used for the t-test
can be compared to the critical values associated with the presumed Type
1 error risk. A threshold for any test can be selected objectively
using Reiterative Cross-Fold Validation (RCFV). In
this approach, training sets made of 70% of the samples in a data set,
randomly selected, are used to find differentially expressed genes.
These genes are used to predict the class membership of the remaining
30% of the samples. This procedure is iterated over the range of threshold
(T) values. The appropriate threshold for a given test is the smallest
value of T at which the PU is maximized.
An alternative approach is
available in the SAM test, which provides an estimate of the number
of false positives in a significant gene list. From this estimate we
calculate the percentage of true positives over a range of T. Again,
the appropriate threshold for the test is deemed to be the level of
T where %TP is maximized. These two alternative approaches have not
yet been compared using simulations.
3.4 Measure of Central Tendency
Mean - the most popular measure of central tendency (average).
Median - the value in the expression data set that is located
at the numeric middle of the distribution.
Trimmed Mean - this is the average with
x% of the upper and lower tails removed.
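The trimmed mean can be sketched as follows. We assume here that the same percentage is removed from each tail and that the number of values to drop is rounded down; the function name is illustrative.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent` % of the values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # values dropped per tail, rounded down
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

with_outlier = [1.0, 2.0, 3.0, 4.0, 100.0]
plain = sum(with_outlier) / len(with_outlier)
trimmed = trimmed_mean(with_outlier, 20)
```

The single outlier pulls the ordinary mean up to 22, while the 20% trimmed mean stays at 3.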
3.5 Special Options
Jackknife to Reduce False Positives (Lyons-Weiler et al.,
2003)
If this option is selected, the test will be performed N times for
a dataset with N samples, each time leaving one of the samples out. The
jackknife count option specifies the number of gene lists in which
a gene must be found to be retained in the final list. If the jackknife
count = N (the number of samples), the output contains only those genes
that are found in all of the gene lists generated (genes that are
significant regardless of which sample is left out). If the
jackknife count = N-1, the gene list contains the genes that are
significant in at least N-1 of the gene lists (all but at most one). If
the jackknife count = 1, the gene list contains all the genes that are
significant in any gene list (the union of the gene lists).
This last setting recovers true positives that would otherwise be left
out due to outliers. Our studies show that only this setting yields
improved performance; we call this the jackknife to increase true positives.
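The jackknife count logic can be sketched independently of the test used. In the sketch below, `find_significant` is a hypothetical callback standing in for any test for differentially expressed genes; it receives the retained sample names and returns the genes it calls significant.

```python
from collections import Counter

def jackknife_gene_list(sample_ids, find_significant, min_count):
    """Keep genes found significant in at least `min_count` leave-one-out runs."""
    counts = Counter()
    for left_out in sample_ids:
        kept = [s for s in sample_ids if s != left_out]
        counts.update(set(find_significant(kept)))
    return sorted(gene for gene, count in counts.items() if count >= min_count)

def toy_test(kept_samples):
    """Toy stand-in: G1 is always significant, G2 only when outlier S3 is left out."""
    genes = {"G1"}
    if "S3" not in kept_samples:
        genes.add("G2")
    return genes

samples = ["S1", "S2", "S3", "S4"]
strict = jackknife_gene_list(samples, toy_test, min_count=len(samples))
union = jackknife_gene_list(samples, toy_test, min_count=1)
```

With the count set to N only G1 survives; with the count set to 1 the gene masked by the outlier sample is recovered as well.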
MDSS Algorithm (Lyons-Weiler et al., 2003: Abstract)
This algorithm can be used with any combination of test for identifying
differentially expressed genes and classification algorithm. The MDSS
(Maximum Difference SubSet) algorithm applies the test to the original
data set, and then ranks the genes in descending order according to the
computed test statistic. The user must specify the number of genes with
which to begin the next step, in which a classification is derived using
the top M genes, then M-1 genes, and so on, until a perfect bipartition
between the user-defined sample classes is recovered. Then the jackknife
to reduce false positives is applied to remove false positives due to
outliers.
|
4. Analysis Settings 2: Sample
Classification Operation
The sample classification
module in the GEDA tool can be used in two modes for sample classification-
class discovery (unsupervised clustering) and class prediction (semi-supervised).
In the first mode, the user should set the threshold for any test (e.g.,
the t-test) low enough to allow all genes to pass the test (e.g., 0.0001).
Then clustering of the samples will be performed using all genes in
the data set.
The second mode is called ‘semi-supervised’
because the user first identifies significant genes (aka ‘performs feature
selection’) in a supervised manner, and the samples are then classified using
the retained genes as features. The clustering algorithms do not use the
sample labels to enforce the clustering. Most people use this step as a
way to test the informativeness of the set of retained genes, but this
approach (clustering samples using genes found with the same sample set)
tends to lead to overtraining (i.e., the set of retained genes can lead
to correct classification of all of the samples in the original data set
but has no predictive value for samples external to the original data set).
See Computational Validation for information on using leave-one-out validation
or cross-validation to obtain unbiased estimates of classification error.
4.1 Clustering Algorithms
Average Linkage Clustering
Average linkage clustering proceeds by first finding pairs of
expression profiles that are most similar, joining them, calculating
the (sometimes weighted) average between the members of the joined
cluster, re-calculating the pairwise distance, and treating the average
profile as one profile, and repeating the procedure until all profiles
are joined. Average linkage clustering can be conducted using all-pairwise-sample
average of differences or using cluster average differences. The latter
is also known as centroid clustering, but centroids can be calculated
using methods other than simple averages.
Simple Linkage Clustering
Simple linkage clustering (also known as single linkage clustering)
involves finding the single members (samples) of existing clusters that
are most similar and joining the clusters at the places where these
members connect.
Maximum Linkage Clustering
Maximum linkage clustering is also known as complete linkage clustering
and involves finding single members (samples) of existing clusters
that are most different and joining them.
K-means clustering
The user defines a number of groups and some way of identifying
a 'typical' expression profile within each group. The 'typical' profile
becomes the first member of each group. In our implementation, the
'typical' profile is the one nearest to the mean for the group (distance
selected by the user). Next, the algorithm examines each profile in the
data set and assigns it to the cluster at minimum distance.
The centroid's position is recalculated every time a profile is added
to the cluster. This process continues until all the profiles are grouped
into the final required number of clusters. We use a pre-defined number
of re-assignment iterations to avoid infinite loops caused by 'ties'.
We have found that k-means clustering is slightly inferior to average
linkage clustering using Euclidean distance for sample classification.
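To make the average linkage procedure concrete, the sketch below implements the all-pairwise-sample variant: the distance between two clusters is the mean of all pairwise distances between their members. It is a small, unoptimized illustration with names of our own choosing, not the GEDA implementation.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(profiles, distance, n_clusters):
    """Agglomerative clustering; returns clusters as lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(distance(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        candidates = [
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(candidates)  # the closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = average_linkage(profiles, euclidean, n_clusters=2)
```

Replacing the mean in `linkage` with `min` or `max` would give simple (single) linkage or maximum (complete) linkage, respectively.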
4.2 Distance Functions
There are a very large number of distances that could be calculated
for sample and for gene classification. Which distance one selects
does matter - a great deal. In our simulations, we have found that the
Euclidean distance performs best for sample classification under
the case vs. control study design. The distance based on Pearson’s
correlation may be more suitable for gene classification operations, such
as attempts to find co-regulated genes. The following table contains the
equations for the pairwise distances included in the GEDA tool as of this
writing.
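Two of the distances discussed above can be sketched as follows, using their standard definitions (the function names are illustrative, and other distances offered by the tool are not covered here):

```python
from math import sqrt
from statistics import mean

def euclidean_distance(x, y):
    """Straight-line distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 minus Pearson's correlation coefficient: 0 for perfectly
    correlated profiles, 2 for perfectly anti-correlated profiles."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1 - covariance / spread

d_euclid = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_same_shape = pearson_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_opposite = pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The Pearson-based distance ignores differences in scale: the profiles (1, 2, 3) and (2, 4, 6) are at distance 0 even though their Euclidean distance is not zero, which is why the two measures suit different tasks.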
|
5. Analysis Settings 3: Gene
Classification Options (not yet operational)
Clustering Algorithms
Average Linkage Clustering
Average linkage clustering proceeds by first finding pairs of
expression profiles that are most similar, joining them, calculating
the (sometimes weighted) average between the members of the joined
cluster, re-calculating the pairwise distance, and treating the average
profile as one profile, and repeating the procedure until all profiles
are joined. Average linkage clustering can be conducted using all-pairwise-sample
average of differences or using cluster average differences. The latter
is also known as centroid clustering, but centroids can be calculated
using methods other than simple averages.
Simple Linkage Clustering
Simple linkage clustering (also known as single linkage clustering)
involves finding the single members (samples) of existing clusters that
are most similar and joining the clusters at the places where these
members connect.
Maximum Linkage Clustering Maximum linkage clustering is
also known as complete linkage clustering and involves finding single
members (samples) of existing clusters that are most different and joining
them.
Neighbor joining - Not Yet Validated, so not yet implemented.
K-means clustering
The user defines a number of groups and some way of identifying
a 'typical' expression profile within each group. The 'typical' profile
becomes the first member of each group. In our implementation, the
'typical' profile is the one nearest to the mean for the group (distance
selected by the user). Next, the algorithm examines each profile in the
data set and assigns it to the cluster at minimum distance.
The centroid's position is recalculated every time a profile is added
to the cluster. This process continues until all the profiles are grouped
into the final required number of clusters. We use a pre-defined number
of re-assignment iterations to avoid infinite loops caused by 'ties'.
Distance Functions
All of the distance functions available for sample classification
are also available for gene classification. We (and others; Knudsen,
2002) recommend the use of 1 - Pearson's correlation coefficient for
clustering genes, the use of Euclidean distance for clustering samples
using biomarkers discovered using case vs. control study designs, and
the use of Minkowski's distance for clustering samples using biomarkers
discovered using tumor vs. normal adjacent to tumor study designs. Our
recommendation stems from the analysis of a large number of published
cancer data sets and the use of simulations to compare the efficiencies
of various measures of pairwise distance in both applications.
|
6. Analysis Settings 4: Computational
Validation
Leave-one-out validation
In leave-one-out validation, the selected analysis is performed
N times. Each time, a sample is left out, a gene list is determined,
and then the left-out sample is replaced. Clustering (or another
classification algorithm) is performed on that data set. If the bipartition
is found (e.g., all normals in one group, all tumors in the other), then we
know that the class membership of the sample left out was accurately
predicted, and that no other sample was inaccurately placed. The score is
the proportion of the N runs in which the left-out sample was correctly
placed.
An influence measure is also reported for each sample; this number reports
the proportion of all cases in which a sample was included where the
classification was accurate. This allows the user to determine which samples
cause misclassifications.
Cross-Validation
In cross validation, a portion of the samples is used as a training
set, and the remainder is used as a test set. Equal proportions of
both groups are used in the training set. The user defines the X-fold
iterations (number of training sets and test sets to create and analyze),
and the percentage of samples to use in the training set.
The output reports the total percent of X-fold iterations in which the
user-specified bipartition was in fact found in the classification performed
on the combined test and training sets. The influence measure of each
sample is the proportion of times a sample is used in a test set that
leads to an accurate classification.
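The training/test split with equal proportions of both groups can be sketched as below. The function name, the dictionary layout of the group labels and the seed are illustrative assumptions, and the 70% training fraction simply reuses the value mentioned for RCFV.

```python
import random

def stratified_split(groups, train_fraction, seed=None):
    """Split sample names into training and test sets, group by group.

    groups: dict mapping sample name to group number.
    train_fraction: proportion of each group placed in the training set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for group in sorted(set(groups.values())):
        members = sorted(name for name, g in groups.items() if g == group)
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Ten tumor samples (group 1) and ten normal samples (group 2).
groups = {f"T{i}": 1 for i in range(10)}
groups.update({f"N{i}": 2 for i in range(10)})
train, test = stratified_split(groups, 0.7, seed=1)
```

Repeating the split with different seeds gives the X-fold iterations; each sample's influence measure can then be tallied over the iterations in which it landed in the test set.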
Bootstrap
In the bootstrap, pseudoreplicate data sets are created by resampling
genes with replacement to create a data set the same size (number of
genes) as the original data set (Felsenstein, 1985). The test and classification
procedures are performed on the pseudoreplicated data sets. The bootstrap
proportion is reported for all bipartitions above a specified level.
For now, these options are restricted to sample cluster analysis using
the significant gene list as features.
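Creating one bootstrap pseudoreplicate amounts to drawing genes (rows) with replacement until the original number of genes is reached. A minimal sketch, with an illustrative function name and row layout:

```python
import random

def bootstrap_genes(rows, seed=None):
    """Resample gene rows with replacement to the original data set size."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

rows = [("G1", [1.0, 2.0]), ("G2", [3.0, 4.0]), ("G3", [5.0, 6.0])]
pseudoreplicate = bootstrap_genes(rows, seed=42)
```

Some genes will typically appear more than once and others not at all. The bootstrap proportion of a bipartition is the fraction of pseudoreplicates in which the classification procedure recovers it.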
|
7. Group Settings
7.1 Number of Samples - number
of independent samples (number of columns-1). Usually the number of
arrays in single-dye studies. In studies that combine samples from different
extractions (e.g., 2-dye competitive hybridization experiments), it's
the number of extractions. In those cases use the profile of intensity
measured in each channel as a sample.
7.2 Number of Genes - usually the number of
independent probes (number of rows-1). NOT the number of individual
genes. The GEDA tool will average probes that share an identical gene
name, but only if the option to do so is selected.
7.3 Sample Classification Form - Use this form
to tell GEDA which group each sample belongs to. Use a list of sample
names with a tab after each sample name followed by a group number,
for example
SampleA 1
SampleB 1
SampleC 1
SampleD 2
SampleE 2
SampleF 2
The following is also allowed
SampleA 2
SampleB 2
SampleC 2
SampleD 1
SampleE 1
SampleF 1
As is this (or any other order)
SampleA 2
SampleF 1
SampleC 2
SampleD 1
SampleB 2
SampleE 1
A sample can be excluded
from an analysis by assigning it to group '0'. For example, ‘SampleB’
and ‘SampleF’ are excluded from the analysis with the following settings:
SampleA 1
SampleB 0
SampleC 1
SampleD 2
SampleE 2
SampleF 0
In this encoding, samples
B and F will be ignored in all the data pre-processing and analysis.
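The form contents can be interpreted with a few lines of code. The sketch below uses hypothetical function names and assumes one tab-separated "name, group number" pair per line, as in the examples above.

```python
def parse_sample_groups(text):
    """Parse 'sample name <tab> group number' lines into a dictionary."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, group = line.split("\t")
        groups[name.strip()] = int(group)
    return groups

def included_samples(groups):
    """Drop samples assigned to group 0, which are excluded from the analysis."""
    return {name: g for name, g in groups.items() if g != 0}

form = "SampleA\t1\nSampleB\t0\nSampleC\t1\nSampleD\t2\nSampleE\t2\nSampleF\t0\n"
groups = parse_sample_groups(form)
kept = included_samples(groups)
```

Because the result is keyed by sample name, the order of the lines in the form does not matter.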
7.4 Interpreting the Output
A. Data Visualization for Pre-Interpretative Analyses
Histograms
The GEDA tool output includes a frequency histogram for each of the two
groups and, if requested, individual histograms. These can be evaluated to
identify arrays with unusual distributions. Odd distributions can result from
poor sample handling (long ischemia time), incorrect scanner settings,
a bad batch of hybridization or labelling reagents, an outdated (expired)
chip, and so on. It is risky to analyze the data blindly without
proper a priori visualization. Decisions on which samples should be
excluded after a chip is run are similar to decisions on which samples
should not be run based on RNA quality (3':5', for example). Importantly,
such decisions must only use non-interpretative analysis tools such as
boxplots and histograms. For now, it is generally considered inappropriate
to change the study design by removing an expression profile because
one fails to achieve a desired result. Nevertheless, there are some circumstances
where it becomes appropriate (a normal sample was wrongly
labelled as tumor, for example, and this was discovered because it never
clustered with the tumors and a check of the workflow revealed the mislabeling).
Mean (Value) Correlation Graph
Given good RNA quality, good hybridization conditions, and proper
study design, a high correlation of mean values between the two groups
of samples can be expected, even with a small number of samples in
each group. The mean correlation graph is a biplot of the mean intensity
value (after any pre-processing). We report the r^2 value. Correlations
lower than 0.95 should be taken as a sign of a need to improve the study
design. This is still pre-interpretive analysis.
Score Histogram:(Top 50 Genes)
This image shows the score for each of the top 50 genes as overexpressed
(higher in group 1 than in group 2 = positive + red) or underexpressed
(lower in group 1 than in group 2 = negative + green).
B. Mean Histogram - Sorted by Score (Top 50 Genes)
This image shows the groupwise mean value for each gene., sorted
by the magnitude of the score. A closed green rectangle denotes a mean
value that is underexpressed in group 1; an open green rectangle denotes
a mean value that is overexpressed in group 2; a closed red rectangle
denotes a mean value that is overexpressed in group 1; an open red rectangle
denotes a mean value that is underexpressed in group 2.
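The ordering used in these top-50 displays can be sketched as a sort by the absolute value of the score, with the sign deciding the colour (positive is red, negative is green). The function name and the dictionary input are assumptions made for this example.

```python
def top_genes_by_score(scores, n=50):
    """Return the n genes with the largest |score| as (gene, score, colour).

    `scores` maps gene name to test statistic. Positive scores (higher in
    group 1) are coloured red; all other scores are coloured green.
    """
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(gene, s, "red" if s > 0 else "green") for gene, s in ranked[:n]]


scores = {"geneA": 2.0, "geneB": -5.0, "geneC": 0.5}
for gene, score, colour in top_genes_by_score(scores, n=2):
    print(gene, score, colour)
```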
C. Score Histogram
The score histogram shows the frequency distribution of test statistics
for the analyzed data set. This is not the null distribution, but it
is sometimes informative about what a reasonable cut-off might be for a given
test.
D. Retained Genes Table
This is a table of all of the significant (retained) genes sorted
in descending order according to the test statistic. The table includes
links to the LocusLink, Genome View, UCSC, Ensembl, UniGene, dbSNP,
AmiGO and OMIM database entries. These links will only work properly
if the user supplies a data set in the Pitt Standard 2 input format.
E. Cross-fold validation plots
This plot (Fig. 4) provides
two pieces of information. First, it reports where along the range
of threshold values requested by the user the optimal threshold T is
located. Users should always consult this graph before applying any threshold
for interpretative analysis. Second, superior tests for differentially
expressed genes will ‘peak early and stay late’. That is, in
the cross-fold validation plot, poor methods (defined as methods with high
pFDR) will have a steep approach to and a steep decline from the peak. Superior
methods with low pFDR will have high classification accuracy across the
range of threshold values, typically with a steep approach to the peak
with increasing T and a gradual decline in PU after the peak. One way of
expressing this is the area under the curve from the last initial zero to the first
final zero (“from zero to zero”). Superior methods stay high because the classification
algorithm finds useful information even in the smaller sets of significant
genes, since they, too, are mostly true positives. Users who find all
zeros for a given test should try expanding the range of threshold values
explored. These plots can be constructed ‘manually’ by varying T and
recording the result. We will add a feature to ‘automate’ this procedure.
Fig. 4. RCFV plot for three tests for differentially expressed
genes (ovarian cancer data set of Welsh et al. (2001)). Clearly the tests all perform well for this
test set at specific values. A score of 100 for a given test means
that the correct classification was recovered in the test set using
the markers selected with the test set. Because the J5 test has a low
pFDR (and a correspondingly high %TP), the cross-fold validation score
stays high over a broad range of threshold values. A very good method
(one that returns only true positives) will have high predictive utility over
the entire range of threshold values after the initial peak. This result
was achieved using 100 iterations of cross-fold validation per point and
ALCED Clustering.
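The “from zero to zero” area mentioned for the cross-fold validation plot can be sketched with the trapezoid rule. The example assumes the curve has been recorded manually as two parallel lists, the threshold values T and the score (0 to 100) at each T; the function name and the numbers are illustrative only.

```python
def zero_to_zero_area(thresholds, scores):
    """Trapezoidal area from the last initial zero to the first final zero.

    `thresholds` must be in increasing order and `scores` holds the
    cross-fold validation score recorded at each threshold.
    """
    nonzero = [i for i, s in enumerate(scores) if s > 0]
    if not nonzero:
        return 0.0  # all zeros: widen the range of thresholds explored
    start = max(nonzero[0] - 1, 0)                  # last leading zero, if any
    end = min(nonzero[-1] + 1, len(scores) - 1)     # first trailing zero, if any
    area = 0.0
    for i in range(start, end):
        width = thresholds[i + 1] - thresholds[i]
        area += width * (scores[i] + scores[i + 1]) / 2.0
    return area


T = [1, 2, 3, 4, 5, 6]
stays_late = [0, 0, 100, 100, 50, 0]   # peaks early, declines gradually
drops_fast = [0, 0, 100, 0, 0, 0]      # steep approach and steep decline
print(zero_to_zero_area(T, stays_late), zero_to_zero_area(T, drops_fast))
```

With the same peak height, the curve that declines gradually encloses the larger area, which is one way to compare tests on the same range of T.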
F. %TP x T plot
This plot reports the %TP (as
estimated using the results from the SAM module) over the range of
values of T and can be interpreted in a similar way to the cross-fold
validation plots. (*under development)
G. SAM Plot
In the SAM plot (Fig. 5), the
expected value of the test statistic at each rank position is compared
to the ‘observed’ test statistic value at the same rank position. Dotted
lines represent the position of the threshold of the test (critical
value of t) associated with the value of delta. Blue dots represent genes called not
significant, red dots represent genes called significantly overexpressed,
and green dots represent genes called significantly underexpressed. The
number of significant genes is reported, as is
the estimated median number of false significant genes. The pFDR = FP/(TP+FP)
and %TP = TP/(TP+FP) are also reported. Note that the SAM test is asymmetric
(the critical value of t associated with a given delta is different for
finding overexpressed and underexpressed genes). For a complete
description of how SAM is implemented in the GEDA tool click
here.
Fig. 5. The GEDA-generated SAM plot for the colon tumor
samples vs. epithelial-like colon tumor samples in the Alon et al. (1999)
colon tumor gene expression data set.
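The quantities reported with the SAM plot can be sketched as below. The first function follows the general SAM idea of comparing observed and expected statistics rank by rank and moving outward from the origin until they differ by more than delta, which gives separate (asymmetric) cut-offs for over- and underexpressed genes. The second applies the pFDR and %TP formulas given above, using the median number of false significant genes as the estimate of FP. Function names and inputs are assumptions made for this example and do not describe GEDA's own implementation.

```python
def sam_calls(observed, expected, delta):
    """Return (overexpressed, underexpressed) observed statistics.

    `observed` and `expected` are test statistics for the same genes; both
    are sorted here so that they are compared at the same rank position.
    """
    obs, exp = sorted(observed), sorted(expected)
    cut_up = cut_low = None
    # Move right from the origin: first rank where observed exceeds expected by > delta.
    for i in range(len(exp)):
        if exp[i] >= 0 and obs[i] - exp[i] > delta:
            cut_up = obs[i]
            break
    # Move left from the origin: first rank where observed falls below expected by > delta.
    for i in reversed(range(len(exp))):
        if exp[i] < 0 and exp[i] - obs[i] > delta:
            cut_low = obs[i]
            break
    over = [d for d in obs if cut_up is not None and d >= cut_up]
    under = [d for d in obs if cut_low is not None and d <= cut_low]
    return over, under


def pfdr_and_percent_tp(n_significant, median_false_significant):
    """Apply pFDR = FP/(TP+FP) and %TP = TP/(TP+FP), with %TP as a percentage."""
    if n_significant == 0:
        return None  # nothing called significant, so neither ratio is defined
    fp = min(median_false_significant, n_significant)
    pfdr = fp / n_significant
    percent_tp = 100.0 * (n_significant - fp) / n_significant
    return pfdr, percent_tp


expected = [-3, -2, -1, 0, 1, 2, 3]
observed = [-3.2, -2.1, -1.0, 0.1, 1.2, 3.5, 5.0]
over, under = sam_calls(observed, expected, delta=1.0)
print(over, under)
print(pfdr_and_percent_tp(20, 2))
```

In this toy input only the upper tail departs from the expected values by more than delta, so genes are called on one side only, which illustrates the asymmetry noted above.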
7.5 Future Developments
We expect to expand GEDA in about two dozen
ways to meet the growing needs of clinical researchers who use microarray
data, including support for additional data formats, nonlinear transformations/normalization,
more than 20 additional tests for differentially expressed genes, including
likelihood and empirical Bayesian models, options for multiclass analysis,
for generating ROC curves using training and test sets, more than 20 additional
classification algorithms, additional distance measures, classifiers
such as Fisher linear discriminant analysis, diagonal discriminant analysis,
weighted voting, variable selection methods, classification trees (e.g.,
CART and aggregated CART), prediction vote methods, ROC-only methods,
bagging, boosting, probe-level analysis, enhanced error rate estimation
(study-wide and observation-wise), expanded gene clustering capabilities,
heat maps/SOMs, clinical analyses, including uploading clinical and
environmental data (covariates), multivariate analysis, survivorship
analysis, M-A plots, box plots, prediction error rate plots, regression
shrinkage methods (LAR, LASSO, forward stagewise regression), support
vector machines, neural networks, imputation of missing values, modules
to expedite translational research, additional links in the gene list
table to additional genomic, proteomic, structural and regulatory element
databases, enhanced online support and annotation, and support of additional
experimental designs. With additional effort, each of the various
approaches to analysis will be evaluated over a broad range of model
and parameter space with our online Gene Expression Data Simulator.
The specific details of these plans are available upon request, and we
especially welcome contacts from prospective collaborators.
Acknowledgements
The following people are
thanked for their critical evaluation of and feedback on the GEDA tool,
and on our attempts to evaluate and compare the performance of methods
of analysis: Roger Day, Milos Hauskrecht, Art Wetzel, Doug Landsittel,
Vanathi Gopalakrishnan, Takis Benos, Ivet Bahar, Sam Wieand, Tom Richards,
Naftali Kaminski, Tue Nguyen, Bill LaFromboise, Paul Hergenroader, Cathy
Ma, Uma Chandran, Federico Monzon-Bordonaba, Bill Bigbee, Michael Lotze,
Adam Brufsky, John Kirkwood, Michael Becich, Rob Tibshirani, Rebecca
Doerge, and Yoav Benjamini.
Literature Cited
Felsenstein, J. 1985. Confidence limits on phylogenies: an
approach using the bootstrap. Evolution 39:783-791.
Lyons-Weiler, J., S. Patel, and S. Bhattacharya. 2003. A classification-based
machine learning approach for the analysis of genome-wide expression data.
Genome Res 13(3):503-12.
Knudsen, S. 2002. A Biologist's Guide to the Analysis of Microarray
Data. [link: http://www.cbs.dtu.dk/~steen/book.html] [Steen's Homepage:
http://www.cbs.dtu.dk/~steen/]
Citing this Document
This document should be cited as
Lyons-Weiler, J., S. Bhattacharya, and S. Patel. 2003v3. Gene
Expression Data Analysis Tool Online Description and Documentation
version 1.0. http://bioinformatics.upmc.edu.