


Gene Expression Data Analysis Tool Online Description and Documentation v3

J. Lyons-Weiler and S. Patel

http://bioinformatics.upmc.edu/
Last updated 4/29/04
Questions and feedback are most welcome!
lyonsweilerj@upmc.edu



1. Data File Upload Options

1.1 GEDA Input format 1
1.2 GEDA Input format 2
1.3 Sample Data Files

2. Data Pre-Processing Steps

2.1 Log Transformation
2.2 Normalization options
2.3 Duplicate Gene Entry
2.4 Graphical Output Options

3. Analysis Settings 1: Test for Differentially Expressed Genes

3.1 Permutation Test Option
3.2 Test for Differentially Expressed Genes Options
3.3 Threshold
3.4 Measure of Central Tendency
3.5 Special Options

4. Analysis Settings 2: Sample Classification Operation

4.1 Clustering Algorithms
4.2 Distance Functions

5. Analysis Settings 3: Gene Classification Options

6. Analysis Settings 4: Computational Validation

7. Group Settings

7.1 Number of Samples
7.2 Number of Genes
7.3 Sample Classification Form
7.4 Interpreting the Output
7.5 Future Development

1. Data File Upload Options

The Gene Expression Data Analysis Tool allows the user to browse their local drive for their tab-delimited text files in two formats:

  • 1.1 Pitt Standard 1
    In Pitt Standard 1, the first column of the data file contains either gene names or accession numbers (unique gene ID numbers).
  • 1.2 Pitt Standard 2
    In Pitt Standard 2, the first column of the data contains the gene name, and the second column contains the gene accession number, or unique gene ID. The accession numbers and unique ID numbers are used to generate a table of links to genomic databases to facilitate further downstream interpretation (see .

In both Pitt Standard input formats 1 and 2, the first cell (first row and column) must contain the string "Name". The first row of each remaining column is labelled with the array name (sample name), and each column contains the expression values for that array.

We encourage users to use uncorrected, unnormalized, truly raw intensity values (no background subtraction). The benefit of every data pre-processing step, while motivated by concern over systematic error, must be weighed against the introduction of additional systematic error (overcorrection) and the introduction of unwanted measurement error through error propagation (background values, like intensity values, are measured with error). Our simulation results have demonstrated to us that background subtraction and high pass filter methods for removing genes that are not 'above background' come at a cost to the performance of all methods of analysis examined to date.

Example data formats are:

Pitt Standard 1 (GEDA format 1)

Name    S1      S2      S3      S4      ...     Sm
G1      I11     I12     I13     I14     ...     I1m
G2      I21     I22     I23     I24     ...     I2m
G3      I31     I32     I33     I34     ...     I3m
G4      I41     I42     I43     I44     ...     I4m 
.       .       .       .       .       ...     .
.       .       .       .       .       ...     .  
.       .       .       .       .       ...     .  
.       .       .       .       .       ...     .
.       .       .       .       .       ...     .
Gn      In1     In2     In3     In4     ...     Inm

The dataset is a tab-delimited text file where the first column holds the gene names or gene accession numbers (G1,G2...Gn). Consecutive columns are for individual arrays, where S1,S2...Sm are sample names. Ixy represents the Gene Expression Intensity Value for the xth gene in the yth sample.

Pitt Standard 2 (GEDA Format 2)

Name    ACC_NO  S1      S2      S3      S4      ...     Sm
G1      ACC-1   I11     I12     I13     I14     ...     I1m
G2      ACC-2   I21     I22     I23     I24     ...     I2m
G3      ACC-3   I31     I32     I33     I34     ...     I3m
G4      ACC-4   I41     I42     I43     I44     ...     I4m
.       .       .       .       .       .       ...     .
.       .       .       .       .       .       ...     .
.       .       .       .       .       .       ...     .
.       .       .       .       .       .       ...     .
.       .       .       .       .       .       ...     .
Gn      ACC-n   In1     In2     In3     In4     ...     Inm


The dataset is a tab-delimited text file where the first column is for the name of the gene, where G1,G2...Gn are gene names. The second column is for the gene accession number, which will be used to generate links to public databases in the output. Consecutive columns are for individual arrays, where S1,S2...Sm are sample names. Ixy represents the Gene Expression Intensity Value for the xth gene in the yth sample.
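As an illustration of the two layouts above, the following is a minimal sketch of a reader for Pitt Standard 1 and 2 text. It is not the GEDA tool's own parser. The function name `read_geda_table` is hypothetical, and recognizing format 2 by the literal `ACC_NO` header shown in the example is an assumption.

```python
import csv
import io


def read_geda_table(text):
    """Parse Pitt Standard 1 or 2 text into gene names, accessions, sample names, values."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header = rows[0]
    if header[0] != "Name":
        raise ValueError('the first cell must contain the string "Name"')
    # Assumption: format 2 is detected by the "ACC_NO" header used in the example above.
    has_accession = len(header) > 1 and header[1] == "ACC_NO"
    first_value_col = 2 if has_accession else 1
    samples = header[first_value_col:]
    genes, accessions, values = [], [], []
    for row in rows[1:]:
        genes.append(row[0])
        accessions.append(row[1] if has_accession else None)
        # values[x][y] is the intensity of gene x in sample y
        values.append([float(v) for v in row[first_value_col:]])
    return genes, accessions, samples, values
```

The returned `values` list is indexed gene first, sample second, matching the Ixy layout of the example tables.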

1.3 Sample Data Files
To facilitate use of the GEDA tool for training, we have included a variety of published cancer data sets that auto-load when the user selects this option. To facilitate research involving the re-analysis of published cancer data sets, we provide a list of links to each of these published cancer microarray data sets under the UPITT Cancer Gene Expression Dataset (http://bioinformatics1.upmc.edu/Help/UPITTGED.htm). Individuals who wish to submit links to their published cancer microarray data sets should send an email to the senior PI (lyonsweilerj@msx.upmc.edu).

2. Data Pre-Processing Steps

2.1 Log Transformation-(not recommended)
We have included an option to log-transform (log2, log10 or ln). When this option is selected, the individual expression intensity values are log-transformed. We do not recommend this practice because, in spite of expectations to the contrary, our simulation studies clearly demonstrate that a loss of useful information occurs under all methods of analysis evaluated to date when the data are log-transformed. The tests for differentially expressed genes are proving to be more robust to the violation of the assumption of normality than to the ill effects of quasi-normality imposed by the log transformation.
2.2 Normalization options (Last updated 7/3/03)

Normalization is most commonly applied to correct for among-array differences in measured global expression.  There are three types of bias that may exist in microarray data: linear multiplicative (y = ax), linear additive (y = x+a), and nonlinear (y = x^a).  Various biological, experimental, and technological sources of variability can 'exist' simultaneously in any given data set, and in reality they 'enter into' the experiment at various stages of processing.  Different approaches to normalization exist, and as of this writing (7/3/03) have not been completely compared using our simulations.

 

Source of Variation                Type                       Corrections

Array-Wide
RNA extraction quality/quantity    linear? multiplicative     various (scaling factors)
tissue heterogeneity               mixed                      LCM, GMA, ???
warm handling time                 linear additive            randomization/GMA
slide/chip age                     linear additive            randomization/GMA
drying time                        linear additive            randomization/GMA
washing efficiency                 linear additive            randomization/GMA
dye-labeling efficiency            nonlinear mixed            GMA, lowess??
background                         linear additive            GMA
scanner intensity                  linear? Multiplicative?    ???

Spot/Probe/Gene Specific
sequence variation                 nonlinear mixed            PM-MM?, ???
nonspecific cross-hybridization    linear multiplicative      PM-MM?, ???
gene loss                          linear multiplicative      signal!
gene amplification                 linear multiplicative      signal!
upregulation                       linear additive            signal!
downregulation                     linear additive            signal!

 

 

Normalization for Linear Multiplicative Biases

The options in the GEDA tool for linear multiplicative normalization include those that implement within-array normalization listed by Kroll et al. (2002): sum scaling factor, mean scaling factor, median scaling factor,  quantile/percentile scaling factor, trimmed mean scaling factor and asymmetric trimmed mean scaling factor, and two that implement among array normalization.  Each of these is briefly described:

 

Within-Array

Sum: all intensity values on the array are  multiplied by the sum of all intensity values on the array.

Mean: all intensity values on the array are  multiplied by the mean of all intensity values on the array.

Median:  all intensity values on the array are  multiplied by the median of all intensity values on the array.  In the case of an odd number of spots, the median is the central value; in the case of an even number of spots, the median is the mean of the two central values.

Quantile/Percentile: all intensity values on the array are  multiplied by the mean of all intensity values on the array beyond the specified quantile.

Trimmed mean: all intensity values on the array are  multiplied by the mean of all intensity values on the array between the two specified quantiles.

 

Among-Array

Median Mean  - Chip-wide expression intensity varies from array to array. A number of papers have performed ‘globalization’ using a simple multiplicative factor to make the overall mean intensities of all arrays the same. One such factor is the ‘median mean’ adjustment, where the mean intensity of each array is calculated, the median of these means (the median mean) is identified, and an array-specific multiplicative factor is defined as the ratio of that array’s mean to the median mean. Each intensity value is scaled using that factor,

(1)

NB: We have determined using simulations that ‘median mean’ normalization does not recover information lost due to simple linear additive biases for any test for differentially expressed genes, and it should not be used unless multiplicative bias is suspected. Similarly, GMA partially, but not fully, recovers information lost due to multiplicative biases.  Precisely how to tell when multiplicative bias is present is still under study.  We recommend the alternative linear direct scaling we call the Global Mean Adjustment (see below) until this is sorted out.

 

Minimum Mean - This is identical to the median mean, but instead of normalizing to the median mean, the scaling factor becomes the ratio of a given array's mean to the minimum of all array means.

 

Normalization for Linear Additive Biases

The options in the GEDA tool for linear additive normalization are the among-array Global Mean Adjustment and Min=0, Max=1 scaling.  We also offer the z-transformation. These are described briefly below:

Global Mean Adjustment (GMA) - highly recommended, especially if the among-array coefficient of variation of the array mean expression intensities is > 0.3. This linear normalization function works as follows.  First, the global mean (grand mean) of all intensities of all arrays is calculated.  Then, the difference between each individual array mean and the grand mean is calculated.  This array-specific difference is then subtracted from each individual expression intensity value on that array (equivalently, a negative difference is added).  The result is that all arrays now have the same overall mean.


We have found no conditions under which the GMA decreases the performance of the tests for differentially expressed genes (i.e., they always outperform themselves after GMA) and for some tests (the J5 test in particular) the GMA recovers all information loss due to simple linear biases (in contrast to the median mean multiplicative factor in equation  1).


The adjusted intensity value I of the ith of N genes in the jth of M arrays under the GMA is
 

(2)
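The GMA steps described above can be sketched as follows. This is a minimal illustration based only on the verbal description (grand mean, array means, array-specific shift), not the tool's source code; the function name `global_mean_adjustment` and the gene-by-array list layout are assumptions.

```python
def global_mean_adjustment(data):
    """Shift every array so that all arrays share the grand mean.

    data[i][j] is the intensity of gene i on array j.
    Returns a new gene-by-array list; the input is not modified.
    """
    n_genes = len(data)
    n_arrays = len(data[0])
    # Mean intensity of each array (column)
    array_means = [sum(row[j] for row in data) / n_genes for j in range(n_arrays)]
    # Grand mean of all intensities on all arrays
    grand_mean = sum(sum(row) for row in data) / (n_genes * n_arrays)
    # Array-specific additive shift: grand mean minus that array's mean
    shifts = [grand_mean - m for m in array_means]
    return [[row[j] + shifts[j] for j in range(n_arrays)] for row in data]
```

For example, two arrays with means 2 and 12 are both shifted to the grand mean of 7, while the differences among genes within each array are left unchanged.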

Max1, Min0 (not recommended)- Re-scales each expression intensity profile so the array-specific maxima and minima are equal in all arrays.


Z-transformation-like the log transformation, the z-transformation can change the shape of the distribution, and can reduce the performance of some tests for differentially expressed genes. The original intensity value xij is replaced with

(3)

2.3 Duplicate Gene Entry - this option initiates a routine, run prior to any other pre-processing or analysis, that finds all gene entries with the same (identical) string in the name column, removes all but one of those entries, and replaces the values of the solitary remaining entry with the average of all the entries.
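A minimal sketch of the duplicate-averaging step, assuming rows are matched on exact name strings as described. The function name `average_duplicates` is hypothetical, and keeping the averaged entry at the position of the first occurrence is an assumption.

```python
def average_duplicates(genes, values):
    """Collapse rows that share an identical gene name into one averaged row.

    genes: list of gene-name strings; values: list of per-gene intensity rows.
    """
    grouped = {}
    for name, row in zip(genes, values):
        grouped.setdefault(name, []).append(row)
    out_genes, out_values = [], []
    for name, rows in grouped.items():  # dicts keep first-seen order
        out_genes.append(name)
        # Average column by column across all rows carrying this name
        out_values.append([sum(col) / len(rows) for col in zip(*rows)])
    return out_genes, out_values
```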
2.4 Graphical Output Options - This option can be selected if the user wishes to view the frequency distribution of each array individually in the output. We recommend this as a routine data quality check prior to any interpretative analysis.

3. Analysis Settings 1: Test for Differentially Expressed Genes

3.1 Permutation Test Option- Any test for differentially expressed genes can be applied as a permutation test as well. Under permutation tests, the data set is randomized some arbitrarily large number of times and the test statistic of interest is determined for all genes in each random data set. Genes that have a test statistic that falls within the user-defined tail (usually the upper and, when appropriate, lower 5% tails) of the resulting distribution of test statistic values are considered to be significant.
Permutation tests can be considered to have a number of advantages, including freedom from reliance on analytical distributions and their corollary assumptions. We have not yet evaluated the use of permutation tests using simulations, and cannot yet make a recommendation for or against their use.
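To make the idea concrete, here is a minimal sketch of a permutation test for a single gene. It assumes that "randomizing" means shuffling the group labels of the samples, uses the absolute difference of group means as the statistic, and reports a p-value rather than a tail cut-off; none of these choices is stated to be the GEDA implementation, and the function names are hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the difference of means of one gene."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of samples to groups
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)
```

A gene would then be called significant when this value falls below the chosen tail size, for example 0.05.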
3.2 Test for Differentially Expressed Genes Options
Pooled variance t-test- (somewhat recommended). Appropriate when n1 <> n2; identical to the simple t test when n1 = n2.

(4)







Simple t test
- Appropriate when n1 = n2
(5)
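The relationship between the two t statistics can be sketched as below. The pooled form is the standard textbook statistic; treating the "simple" t test as the separate-variance (unpooled) statistic is an assumption, chosen because it is the form that coincides with the pooled statistic when n1 = n2. Function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_t(a, b):
    """Pooled-variance t statistic for one gene in two groups."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * sample_var(a) + (n2 - 1) * sample_var(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))


def unpooled_t(a, b):
    """Separate-variance t statistic (assumed form of the 'simple' t test)."""
    se = math.sqrt(sample_var(a) / len(a) + sample_var(b) / len(b))
    return (mean(a) - mean(b)) / se
```

With equal group sizes the two functions return the same value; with unequal sizes they generally differ.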


Signal-to-Noise – This measure is similar to the t-test. Our early studies show that it has a high pFDR (positive false discovery rate).

(6)


Simple separability test (Inverse Gaussian)
- We define N1 as the number of samples in Group A for which the expression value for the ith gene is less than the minimum value of that gene in Group B; N2 as the number of samples in Group A for which the expression value for the ith gene is greater than the maximum value of that gene in Group B; N3 as the number of samples in Group B for which the expression value for the ith gene is less than the minimum value of that gene in Group A; and N4 as the number of samples in Group B for which the expression value for the ith gene is greater than the maximum value of that gene in Group A. This is equivalent to the proportion of non-overlap between the two distributions.
The counts N1, N2, N3 and N4 are paired: in the perfectly separable case, either N1 = Na and N2 = 0 or vice versa, and either N3 = Nb and N4 = 0 or vice versa. Our score S is the greater of the two proportions,

$$S = \max\left(\frac{N_1 + N_4}{N_a + N_b},\ \frac{N_2 + N_3}{N_a + N_b}\right) \qquad (7)$$

and has an upper limit of 1.0. The threshold Ts ranges from 0 to 1. Genes with scores S greater than or equal to Ts are retained (1 = perfect separability, no overlap in the distribution).
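The counting rule for the simple separability score can be sketched directly from the definitions of N1 through N4. The function name `separability_score` is hypothetical; the logic follows the text for a single gene.

```python
def separability_score(group_a, group_b):
    """Simple separability score S for one gene (1.0 = no overlap)."""
    min_a, max_a = min(group_a), max(group_a)
    min_b, max_b = min(group_b), max(group_b)
    n1 = sum(1 for x in group_a if x < min_b)  # A samples below all of B
    n2 = sum(1 for x in group_a if x > max_b)  # A samples above all of B
    n3 = sum(1 for x in group_b if x < min_a)  # B samples below all of A
    n4 = sum(1 for x in group_b if x > max_a)  # B samples above all of A
    total = len(group_a) + len(group_b)
    return max((n1 + n4) / total, (n2 + n3) / total)
```

For group A = 1, 2, 5 and group B = 4, 6, 7, two A samples fall below every B value and two B samples fall above every A value, giving S = 4/6.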

Weighted Separability
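Under weighted separability, the score for each gene is weighted using information on the magnitude of the difference between the group means. Specifically, the weight for gene i is

$$w_i = \frac{\bar{x}_{ia} - \bar{x}_{ib}}{\frac{1}{n}\sum_{j=1}^{n}\left(\bar{x}_{ja} - \bar{x}_{jb}\right)} \qquad (8)$$

where \bar{x}_{ia} and \bar{x}_{ib} are the means of gene i in groups A and B, and n is the number of genes on the array.

Under weighted separability, the simple separability score S becomes wS and has a lower limit of 0 and an upper limit of Na+Nb.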


J5 test
- (highly recommended) This is wi used as a test and essentially compares the difference of means in any gene to the average difference in means over the whole array. It has proven to be superior using ROC analysis and FDR analysis under a wide variety of distributions. It is robust like the t-test (remarkably insensitive to violations of normality).



(9)
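A minimal sketch of the J5 score as described in words above: each gene's difference in group means divided by the average difference in means over all genes. The default follows the plain-text weight formula literally (signed differences in the denominator). The `absolute` option, which averages absolute differences instead, is a hypothetical addition for cases where signed differences nearly cancel; this document does not say whether the tool does this.

```python
def j5_scores(means_a, means_b, absolute=False):
    """J5-style score per gene from the two lists of group means.

    means_a[i], means_b[i]: mean of gene i in group A and group B.
    absolute=True is an assumption, not taken from the text: it averages
    |difference| in the denominator to avoid cancellation.
    """
    diffs = [a - b for a, b in zip(means_a, means_b)]
    if absolute:
        denom = sum(abs(d) for d in diffs) / len(diffs)
    else:
        denom = sum(diffs) / len(diffs)
    if denom == 0:
        raise ZeroDivisionError("average difference in means is zero")
    return [d / denom for d in diffs]
```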

nfold ratio of means-(not recommended) This version of ‘fold-change’ is not robust to changes in the distribution and never outperforms the J5 test.
(10)


nfold mean of ratios-(not recommended) This version of ‘fold-change’ is slightly more robust than the ratio of means but never outperforms the J5 test.

(11)


Most of these tests have been compared using simulations. In brief, the J5 test appears superior to all others under the following evaluation criteria: area under the ROC curve (Fig. 1); SN+SP+PU* (=PER*, where SN is sensitivity, SP is specificity, and PU* is a weighted measure of the predictive utility of the gene set selected by the test for sample classification inferences (class discovery and class prediction)); and positive false discovery rate (pFDR; Figs. 2 and 3).

 

 

SAM

The SAM module implements Significance Analysis of Microarrays for the two-class unpaired design.  A complete description of the SAM module as implemented in the GEDA tool is provided on a separate page.


Fig. 1. ROC curve generated using the associated Gene Expression Data Simulator for three tests under the truncated Cauchy distribution. Log-transformation of the data prior to these tests makes each test worse (not shown). The signal-to-noise test exhibits similar performance to the t-test.


Fig. 2. pFDR and PER* of four tests for finding differentially expressed genes. Simulation conditions and models to be described in upcoming papers. The differences among the tests are not distribution-shape dependent and all tests were performed at a threshold where the expected Type 1 error rate was 0.05.

Fig. 3. PER* of three tests for differentially expressed genes under six distributions. This result shows that the J5 test and the t-test are remarkably robust to differences in the underlying data distribution but that ratio-of-means fold change is less robust.

BSS/WSS Ratio (Dudoit et al., 2002)

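For a given gene j measured in samples i = 1, 2, ..., n, each belonging to one of K classes (k = 1, 2, 3...K), the ratio defined by Dudoit et al. (2002) is

$$\frac{BSS(j)}{WSS(j)} = \frac{\sum_{i}\sum_{k} I(y_i = k)\,\left(\bar{x}_{kj} - \bar{x}_{\cdot j}\right)^2}{\sum_{i}\sum_{k} I(y_i = k)\,\left(x_{ij} - \bar{x}_{kj}\right)^2}$$

where x_{ij} is the expression level of gene j in sample i, y_i is the class of sample i, I(.) is the indicator function, and

\bar{x}_{kj} = average expression level of gene j among samples in class k

\bar{x}_{\cdot j} = average expression level of gene j among all samples.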

 

This ratio was originally proposed by Fisher and forms the basis of linear discriminant analysis.  It has not yet been evaluated using our simulator.
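As a sketch, the BSS/WSS ratio for one gene is the between-class sum of squares (class mean minus overall mean, summed over samples) divided by the within-class sum of squares (sample value minus its class mean). The function name `bss_wss_ratio` is hypothetical; the computation follows the definition in Dudoit et al. (2002).

```python
def bss_wss_ratio(values, labels):
    """BSS/WSS ratio for one gene.

    values[i]: expression of the gene in sample i; labels[i]: class of sample i.
    Raises ZeroDivisionError when there is no within-class variation.
    """
    overall = sum(values) / len(values)
    class_means = {}
    for k in set(labels):
        members = [v for v, y in zip(values, labels) if y == k]
        class_means[k] = sum(members) / len(members)
    # Each sample contributes its class mean's squared distance from the overall mean
    bss = sum((class_means[y] - overall) ** 2 for y in labels)
    # Each sample contributes its squared distance from its own class mean
    wss = sum((v - class_means[y]) ** 2 for v, y in zip(values, labels))
    return bss / wss
```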

 

3.3 Threshold (T)
This is a value that a test statistic (or other measure such as a ratio) must exceed for a gene to be considered 'significantly' different between the specified groups. The threshold value used for the t-test can be compared to the critical values associated with the presumed Type 1 error risk. The objective selection of a threshold for any test can be determined using Reiterative Cross-Fold Validation (RCFV). In this approach, training sets made of 70% of the samples in a data set, randomly selected, are used to find differentially expressed genes. These genes are used to predict the class membership of the remaining 30% of the samples. This procedure is iterated over the range of threshold (T) values. The appropriate threshold for a given test is the smallest value of T where the PU is maximized.

An alternative approach is available in the SAM test, which provides an estimate of the number of false positives in a significant gene list. From this estimate we calculate the percentage of true positives over a range of T. Again, the appropriate threshold for the test is deemed to be the level of T where %TP is maximized. These two alternative approaches have not yet been compared using simulations.

3.4 Measure of Central Tendency
Mean-the most popular measure of central tendency (average).
Median-the value in the expression data set that is located at the numeric middle of the distribution.

Trimmed Mean-this is the average with x% of the upper and lower tails removed.
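A minimal sketch of the trimmed mean as defined above. Rounding the number of trimmed values down to a whole count per tail is an assumption, and the function name `trimmed_mean` is hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent` % of the values from each tail."""
    xs = sorted(values)
    # Number of values dropped from each end (rounded down; an assumption)
    k = int(len(xs) * percent / 100.0)
    kept = xs[k:len(xs) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)
```

For the values 1, 2, 3, 4, 100 with 20% trimming, the lowest and highest values are dropped and the result is 3.0, whereas the untrimmed mean is 22.0.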
3.5 Special Options
Jackknife to Reduce False Positives (Lyons-Weiler et al., 2003)
If this option is selected, the test will be performed N times for a dataset with N samples, each time leaving one of the samples out. The jackknife count option is used to specify the number of cases in which a gene must be found in the list of significant genes to be retained in the final list. If the jackknife count = the number of samples, the gene list output contains those genes that are found in all of the gene lists generated (genes that are significant regardless of which sample is left out). If the jackknife count = N-1, then the gene list contains all the genes that are significant always except when 1 sample is left out. If the jackknife count = 1, then the gene list contains all the genes that are significant in any gene list (union of gene lists).
With a jackknife count of 1, this implementation recovers true positives that would otherwise be left out due to outliers. Our studies show that only this latter setting yields improved performance; we call this the jackknife to increase true positives.
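The jackknife count rule can be sketched as follows. The callable `select_genes` is a hypothetical stand-in for whichever test for differentially expressed genes is selected: it takes the list of retained sample names and returns the significant genes. Retaining genes found in at least `jackknife_count` of the N leave-one-out lists is an assumption about how the count is applied.

```python
from collections import Counter


def jackknife_gene_list(samples, select_genes, jackknife_count):
    """Genes found significant in at least `jackknife_count` leave-one-out runs.

    samples: list of sample names.
    select_genes: hypothetical callable, kept_samples -> iterable of gene names.
    """
    counts = Counter()
    for left_out in samples:
        kept = [s for s in samples if s != left_out]
        counts.update(set(select_genes(kept)))  # one vote per gene per run
    return sorted(g for g, c in counts.items() if c >= jackknife_count)
```

Setting `jackknife_count` to the number of samples gives the intersection of the gene lists, and setting it to 1 gives their union.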
MDSS Algorithm (Lyons-Weiler et al., 2003: Abstract)
This algorithm can be used with any combination of test for identifying differentially expressed genes and classification algorithm. The MDSS (Maximum Difference SubSet) algorithm applies the test to the original data set, and then ranks the genes in descending order according to the computed test statistic. The user must specify the number of genes (M) with which to begin the next step, in which a classification is derived using the top M genes, then M-1 genes, and so on, until a perfect bipartition between the user-defined sample classes is recovered. Then the jackknife to reduce false positives is applied to remove false positives due to outliers.

4. Analysis Settings 2: Sample Classification Operation

The sample classification module in the GEDA tool can be used in two modes for sample classification- class discovery (unsupervised clustering) and class prediction (semi-supervised). In the first mode, the user should set the threshold for any test (e.g., the t-test) low enough to allow all genes to pass the test (e.g., 0.0001). Then clustering of the samples will be performed using all genes in the data set.

The second mode is called ‘semi-supervised’ because the user first identifies significant genes (aka ‘performs feature selection’) in a supervised manner, and the samples are then classified using the retained genes as features. The clustering algorithms do not use the sample labels to enforce the clustering. Most people use this step as a way to test the informativeness of the set of retained genes, but this approach (clustering samples using genes found with the same sample set) tends to lead to overtraining (i.e., the set of retained genes can lead to correct classification of all of the samples in the original data set but has no predictive value for samples external to the original data set). See Computational Validation for information on using leave-one-out validation or cross-validation to obtain unbiased estimates of classification error.

4.1 Clustering Algorithms
Average Linkage Clustering
Average linkage clustering proceeds by first finding pairs of expression profiles that are most similar, joining them, calculating the (sometimes weighted) average between the members of the joined cluster, re-calculating the pairwise distance, and treating the average profile as one profile, and repeating the procedure until all profiles are joined. Average linkage clustering can be conducted using all-pairwise-sample average of differences or using cluster average differences. The latter is also known as centroid clustering, but centroids can be calculated using methods other than simple averages.
Simple Linkage Clustering
Simple linkage clustering (also known as single linkage clustering) involves finding the single members (samples) of existing clusters that are most similar and joining the clusters at places such that these members connect.
Maximum Linkage Clustering
Maximum linkage clustering is also known as complete linkage clustering. The distance between two clusters is taken to be the distance between their most different single members (samples), and the pair of clusters for which this distance is smallest is joined.
K-means clustering
The user defines a number of groups and some way of identifying a 'typical' expression profile within each group. The 'typical' profile becomes the first member of each group. In our implementation, the 'typical' profile is the one nearest to the mean for the group (distance selected by the user). Next, the algorithm examines each profile in the data set and assigns it to the cluster at minimum distance. The centroid's position is recalculated every time a profile is added to the cluster. This process continues until all the profiles are grouped into the final required number of clusters. We use a pre-defined number of re-assignment iterations to avoid infinite loops caused by 'ties'. We have found that k-means clustering is slightly inferior to average linkage clustering using Euclidean distance for sample classification.
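The three linkage rules in 4.1 differ only in how the distance between two clusters is scored; at each step the pair of clusters with the smallest score is joined. The sketch below illustrates that scoring using the all-pairwise form of average linkage. The function names and the option strings "simple", "maximum" and "average" are hypothetical labels chosen to match the headings above, not identifiers from the tool.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(cluster_a, cluster_b, dist=euclidean, linkage="average"):
    """Distance between two clusters of profiles under a given linkage rule."""
    pairwise = [dist(x, y) for x in cluster_a for y in cluster_b]
    if linkage == "simple":    # closest pair of members (single linkage)
        return min(pairwise)
    if linkage == "maximum":   # most different pair of members (complete linkage)
        return max(pairwise)
    if linkage == "average":   # mean of all pairwise member distances
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: %r" % linkage)


def closest_clusters(clusters, dist=euclidean, linkage="average"):
    """Indices (i, j) of the two clusters that would be joined next."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], dist, linkage))
```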
4.2 Distance Functions
There are a very large number of distances that could be calculated for sample and for gene classification. Which distance one selects does matter - a great deal. In our simulations, we have found that the Euclidean distance performs best for sample classification under the case vs. control study design. The distance based on Pearson’s correlation may be more suitable for gene classification operations, such as attempts to find co-regulated genes. The following table contains the equations for the pairwise distances included in the GEDA tool as of this writing.
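For reference, the three distances named in this documentation (Euclidean, Minkowski, and 1 - Pearson's correlation) can be sketched with their standard textbook definitions. These are not taken from the tool's own equations; in particular, the Minkowski exponent `p` is a free parameter here because its value in the tool is not stated in this text.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is city-block, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pearson_distance(x, y):
    """1 - Pearson's r; 0 for perfectly correlated profiles, 2 for anti-correlated.

    Undefined (division by zero) when either profile is constant.
    """
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

Note that the Pearson-based distance ignores differences in overall magnitude between two profiles, whereas the Euclidean distance does not.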

5. Analysis Settings 3: Gene Classification Options (not yet operational)

Clustering Algorithms
Average Linkage Clustering
Average linkage clustering proceeds by first finding pairs of expression profiles that are most similar, joining them, calculating the (sometimes weighted) average between the members of the joined cluster, re-calculating the pairwise distance, and treating the average profile as one profile, and repeating the procedure until all profiles are joined. Average linkage clustering can be conducted using all-pairwise-sample average of differences or using cluster average differences. The latter is also known as centroid clustering, but centroids can be calculated using methods other than simple averages.
Simple Linkage Clustering
Simple linkage clustering (also known as single linkage clustering) involves finding the single members (samples) of existing clusters that are most similar and joining the clusters at places such that these members connect.
Maximum Linkage Clustering
Maximum linkage clustering is also known as complete linkage clustering. The distance between two clusters is taken to be the distance between their most different single members (samples), and the pair of clusters for which this distance is smallest is joined.
Neighbor joining - Not Yet Validated, so not yet implemented.

K-means clustering
The user defines a number of groups and some way of identifying a 'typical' expression profile within each group. The 'typical' profile becomes the first member of each group. In our implementation, the 'typical' profile is the one nearest to the mean for the group (distance selected by the user). Next, the algorithm examines each profile in the data set and assigns it to the cluster at minimum distance. The centroid's position is recalculated every time a profile is added to the cluster. This process continues until all the profiles are grouped into the final required number of clusters. We use a pre-defined number of re-assignment iterations to avoid infinite loops caused by 'ties'.
Distance Functions
All of the distance functions available for sample classification are also available for gene classification. We (and others; Knudsen, 2002) recommend the use of 1-Pearson's correlation coefficient for clustering genes, the Euclidean distance for clustering samples using biomarkers discovered using case vs. control study designs, and Minkowski's distance for clustering samples using biomarkers discovered using tumor vs. normal adjacent to tumor study designs. Our recommendation stems from the analysis of a large number of published cancer data sets and the use of simulations to compare the efficiencies of various measures of pairwise distance in both applications.

6. Analysis Settings 4: Computational Validation

Leave-one-out validation
In leave-one-out validation, the selected analysis is performed N times. Each time a sample is left out, a gene list is determined, and then the left out sample is replaced. Clustering (or other classification algorithm) is performed on that data set. If the bipartition is found (e.g., all normals in one group, all tumors in the other), then we know that the class membership of the sample left out was accurately predicted, and that no other sample was inaccurately placed. The score is the proportion of N times that a sample was correctly placed.
An influence measure is also reported for each sample; this number reports the proportion of all cases in which a sample was included where the classification was accurate. This allows the user to determine which samples cause misclassifications.
Cross-Validation
In cross validation, a portion of the samples are used as a training set, and the remainder are used as a test set. Equal proportions of both groups are used in the training set. The user defines the X-fold iterations (number of training sets and test sets to create and analyze), and the percentage of samples to use in the training set.
The output reports the total percent of X-fold iterations in which the user-specified bipartition was in fact found in the classification performed on the combined test and training sets. The influence measure of each sample is the proportion of times a sample is used in a test set that leads to an accurate classification.
Bootstrap
In the bootstrap, pseudoreplicate data sets are created by resampling genes with replacement to create a data set the same size (number of genes) as the original data set (Felsenstein, 1985). The test and classification procedures are performed on the pseudoreplicated data sets. The bootstrap proportion is reported for all bipartitions above a specified level.
For now, these options are restricted to sample cluster analysis using the significant gene list as features.
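The resampling step of the bootstrap described above can be sketched as follows: genes (rows) are drawn with replacement until the pseudoreplicate has as many genes as the original. The function name `bootstrap_genes` and the `seed` parameter are hypothetical; the test and classification steps that would follow are not shown.

```python
import random


def bootstrap_genes(genes, values, seed=None):
    """One pseudoreplicate: genes resampled with replacement, same size as the original.

    genes: list of gene names; values: matching list of per-gene intensity rows.
    """
    rng = random.Random(seed)
    n = len(genes)
    picks = [rng.randrange(n) for _ in range(n)]  # indices may repeat
    return [genes[i] for i in picks], [values[i] for i in picks]
```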

7. Group Settings

7.1 Number of Samples - number of independent samples (number of columns-1). Usually the number of arrays in single-dye studies. In studies that combine samples from different extractions (e.g., 2-dye competitive hybridization experiments), it's the number of extractions. In those cases use the profile of intensity measured in each channel as a sample.
7.2 Number of Genes- usually the number of independent probes (number of rows-1). NOT the number of individual genes. The GEDA tool will calculate the averages of probes from different genes if the genes have the identical gene name, but only if the option to do so is selected.
7.3 Sample Classification Form - Use this form to tell GEDA which group each sample belongs to. Use a list of sample names with a tab after each sample name followed by a group number, for example

SampleA 1
SampleB 1 
SampleC 1 
SampleD 2
SampleE 2
SampleF 2

The following is also allowed

SampleA 2
SampleB 2 
SampleC 2
SampleD 1
SampleE 1 
SampleF 1
As is this (or any other order)
SampleA 2
SampleF 1
SampleC 2
SampleD 1
SampleB 2 
SampleE 1 

A sample can be excluded from an analysis by assigning it to group '0'. For example, ‘SampleB’ and ‘SampleF’ are excluded from the analysis with the following settings:

SampleA 1 
SampleB 0 
SampleC 1 
SampleD 2 
SampleE 2 
SampleF 0

In this encoding, samples B and F will be ignored in all the data pre-processing and analysis.
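A minimal sketch of how the sample classification form could be read, with group 0 treated as "excluded". The function name `parse_group_form` is hypothetical. Splitting on the last run of whitespace (rather than strictly on a tab) is an assumption made so that trailing spaces, as in the examples above, are tolerated.

```python
def parse_group_form(text):
    """Read 'sample name <tab> group number' lines.

    Returns (included, excluded): a dict of sample -> group for groups other
    than 0, and a list of samples assigned to group 0.
    """
    included, excluded = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, group = line.rsplit(None, 1)  # split on the last whitespace run
        if int(group) == 0:
            excluded.append(name)
        else:
            included[name] = int(group)
    return included, excluded
```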
7.4 Interpreting the Output
A. Data Visualization for Pre-Interpretative Analyses
Histograms
The GEDA tool output includes frequency histograms for the two groups and, if requested, individual histograms. These can be evaluated to identify arrays with unusual distributions. Odd distributions can result from poor sample handling (long ischemia time), incorrect scanner settings, a bad batch of hybridization or labelling reagents, an outdated (expired) chip, and so on. It is risky to blindly analyze the data without proper a priori visualization. Decisions on which samples should be excluded after a chip is run are similar to decisions on which samples should not be run based on RNA quality (3':5', for example). Importantly, such decisions must only use non-interpretative analysis tools such as boxplots and histograms. For now, it is generally considered inappropriate to change the study design by removing an expression profile because one fails to achieve a desired result. Nevertheless, there are some circumstances where it becomes appropriate (a normal sample was wrongly labelled as tumor, for example, and this was discovered because it never clustered with the normals and a check of the workflow revealed the mislabeling).
Mean (Value) Correlation Graph
Given good RNA quality, good hybridization conditions, and proper study design, a high correlation of mean values between the two groups of samples can be expected, even with a small number of samples in each group. The mean correlation graph plots the mean intensity value of each gene (after any pre-processing) in one group against its mean in the other group. We report the r^2 value. Correlations lower than 0.95 should be taken as a sign that the study design needs improvement. This is still pre-interpretive analysis.
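
As a rough illustration of the quantity reported on this graph, the following Python sketch computes group-wise gene means and the squared correlation between them. It assumes that r^2 is the squared Pearson correlation of the per-gene group means; the documentation does not state the exact formula. The data and the function names group_means and r_squared are hypothetical.

```python
from statistics import mean


def group_means(rows, columns):
    """Mean of the selected array columns for each gene (one row per gene)."""
    return [mean(row[c] for c in columns) for row in rows]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical pre-processed data: 4 genes x 4 arrays.
# Columns 0-1 belong to group 1, columns 2-3 to group 2.
expression = [
    [1.0, 1.2, 1.1, 0.9],
    [5.0, 5.2, 4.9, 5.1],
    [9.8, 10.2, 10.1, 9.9],
    [3.0, 3.4, 3.1, 3.3],
]
m1 = group_means(expression, [0, 1])
m2 = group_means(expression, [2, 3])
r2 = r_squared(m1, m2)
print(r2 >= 0.95)  # compare against the 0.95 guideline from the text
```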
Score Histogram (Top 50 Genes)
This image shows the score for each of the top 50 genes as overexpressed (higher in group 1 than in group 2 = positive + red) or underexpressed (lower in group 1 than in group 2 = negative + green).
B. Mean Histogram - Sorted by Score (Top 50 Genes)
This image shows the groupwise mean value for each gene, sorted by the magnitude of the score. A closed green rectangle denotes a mean value that is underexpressed in group 1; an open green rectangle denotes a mean value that is overexpressed in group 2; a closed red rectangle denotes a mean value that is overexpressed in group 1; a closed red rectangle denotes a mean value that is overexpressed in group 2.
C. Score Histogram
The score histogram shows the frequency distribution of test statistics for the analyzed data set. This is not the null distribution, but it is sometimes informative about what a reasonable cut-off might be for a given test.
D. Retained Genes Table
This is a table of all of the significant (retained) genes, sorted in descending order according to the test statistic. The table includes links to the LocusLink, Genome View, UCSC, Ensembl, UniGene, dbSNP, AmiGO and OMIM database entries. These links will only work properly if the user uploads a data set in the Pitt Standard 2 input format.

E. Cross-fold validation plots
This plot (Fig. 4) provides two pieces of information. First, it shows where, along the range of threshold values requested by the user, the optimal threshold T is located. Users should always consult this graph before applying any threshold for interpretative analysis. Second, it indicates the quality of the test: superior tests for differentially expressed genes will ‘peak early and stay late’. In the cross-fold validation plot, poor methods (defined as methods with high pFDR) will have a steep approach to, and a steep decline from, the peak. Superior methods with low pFDR will have high classification accuracy across the range of threshold values, typically with a steep approach to the peak with increasing T and a gradual decline in predictive utility (PU) after the peak. One way of expressing this is the area under the curve from the last initial zero to the first final zero (“from zero to zero”). The reason for the sustained accuracy is that the classification algorithm finds useful information even in the smaller sets of significant genes because they, too, are mostly true positives. Users who find all zeros for a given test should try expanding the range of threshold values explored. These plots can be constructed ‘manually’ by varying T and recording the result. We will add a feature to ‘automate’ this procedure.


Fig. 4. RCFV plot for three tests for differentially expressed genes (ovarian cancer data set of Welsh et al. (2001)). Clearly, the tests all perform well for this test set at specific values. A score of 100 for a given test means that the correct classification was recovered in the test set using the markers selected with the test set. Because the J5 test has a low pFDR (and a correspondingly high %TP), the cross-fold validation scores stay high over a broad range of threshold values. A very good method (that returns only true positives) will have high predictive utility over the entire range of threshold values after the initial peak. This result was achieved using 100 iterations of cross-fold validation per point and ALCED Clustering.
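
The “from zero to zero” area mentioned above can be sketched in Python as follows. This is one possible reading, not a GEDA feature: it assumes trapezoidal integration over manually recorded (T, score) pairs, and the function name zero_to_zero_area and the example scores are hypothetical.

```python
def zero_to_zero_area(thresholds, scores):
    """Trapezoidal area under the score curve, taken from the last zero
    before the first non-zero score to the first zero after the last
    non-zero score. Returns 0.0 if every score is zero."""
    nonzero = [i for i, s in enumerate(scores) if s != 0]
    if not nonzero:
        return 0.0
    start = max(nonzero[0] - 1, 0)                   # last initial zero, if present
    end = min(nonzero[-1] + 1, len(scores) - 1)      # first final zero, if present
    area = 0.0
    for i in range(start, end):
        width = thresholds[i + 1] - thresholds[i]
        area += width * (scores[i] + scores[i + 1]) / 2.0
    return area


# Hypothetical validation scores recorded while varying T by hand.
T = [1, 2, 3, 4, 5, 6, 7]
stays_late = [0, 0, 90, 100, 95, 90, 0]   # gradual decline after the peak
drops_fast = [0, 0, 60, 100, 40, 0, 0]    # steep approach and steep decline
print(zero_to_zero_area(T, stays_late))
print(zero_to_zero_area(T, drops_fast))
```

Under these assumptions, a test that stays near its peak over a wide range of T yields the larger area, which is the behavior described as ‘peak early and stay late’.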

F. %TP x T plot

This plot reports the %TP (as estimated using the results from the SAM module) over the range of values of T and can be interpreted in a similar way to the cross-fold validation plots. (*under development)

G. SAM Plot

In the SAM plot (Fig. 5), the expected value of the test statistic at each rank position is compared to the ‘observed’ test statistic value at the same rank position. Dotted lines represent the position of the threshold of the test (critical value of t) associated with the value of delta. Blue dots represent genes called not significant; red dots represent genes called significantly overexpressed, and green dots represent genes called significantly underexpressed. The number of significant genes is reported, as is the estimated median number of false significant genes. The pFDR = FP/(TP+FP) and %TP = TP/(TP+FP) are also reported. Note that the SAM test is asymmetric (the critical value of t associated with a given delta is different for finding overexpressed and underexpressed genes). For a complete description of how SAM is implemented in the GEDA tool click here.

Fig. 5. The GEDA-generated SAM plot for the colon tumor samples vs. epithelial-like colon tumor samples in the Alon et al. (1999) colon tumor gene expression data set.
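
The two ratios reported with the SAM plot can be worked through with a short Python sketch. It assumes that FP is the estimated median number of false significant genes, that TP is the number of significant genes minus FP, and that %TP is expressed on a 0-100 scale; the function name and the example counts are hypothetical.

```python
def pfdr_and_percent_tp(n_significant, n_false_significant):
    """Return (pFDR, %TP) from the number of genes called significant and
    the estimated number of false significant genes.

    pFDR = FP / (TP + FP); %TP = 100 * TP / (TP + FP), where TP + FP is the
    total number of genes called significant (assumptions of this sketch).
    """
    if n_significant <= 0:
        raise ValueError("at least one gene must be called significant")
    fp = n_false_significant
    tp = n_significant - fp
    pfdr = fp / n_significant
    percent_tp = 100.0 * tp / n_significant
    return pfdr, percent_tp


# Hypothetical result: 200 genes called significant, 10 estimated false.
pfdr, percent_tp = pfdr_and_percent_tp(200, 10)
print(pfdr, percent_tp)
```

The two quantities are complementary: on a 0-1 scale they always sum to 1, so a low pFDR implies a correspondingly high %TP.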

7.5 Future Developments

We expect to expand GEDA in about two dozen ways to meet the growing needs of clinical researchers who use microarray data, including support for additional data formats, nonlinear transformations/normalization, >20 additional tests for differentially expressed genes, including likelihood and empirical Bayesian models, options for multiclass analysis, for generating ROC curves using training and test sets, > 20 additional classification algorithms, additional distance measures, classifiers such as Fisher linear discriminant analysis, diagonal discriminant analysis, weighted voting, variable selection methods, classification trees (e.g., CART and aggregated CART), prediction vote methods, ROC-only methods, bagging, boosting, probe-level analysis, enhanced error rate estimation (study-wide and observation-wise), expanded gene clustering capabilities, heat maps/SOMs, clinical analyses, including uploading clinical and environmental data (covariates), multivariate analysis, survivorship analysis, M-A plots, box plots, prediction error rate plots, regression shrinkage methods (LAR, LASSO, forward stagewise regression), support vector machines, neural networks, imputation of missing values, modules to expedite translational research, additional links in the gene list table to additional genomic, proteomic, structural and regulatory element databases, enhanced online support and annotation, and support of additional experimental designs. With additional effort, each of the various approaches to analysis will be evaluated over a broad range of model and parameter space with our online Gene Expression Data Simulator. The specific details of these plans are available upon request, and we especially welcome contacts from prospective collaborators.

Acknowledgements

The following people are thanked for their critical evaluation of and feedback on the GEDA tool, and on our attempts to evaluate and compare the performance of methods of analysis: Roger Day, Milos Hauskrecht, Art Wetzel, Doug Landsittel, Vanathi Gopalakrishnan, Takis Benos, Ivet Bahar, Sam Wieand, Tom Richards, Naftali Kaminski, Tue Nguyen, Bill LaFromboise, Paul Hergenroader, Cathy Ma, Uma Chandran, Federico Monzon-Bordonaba, Bill Bigbee, Michael Lotze, Adam Brufsky, John Kirkwood, Michael Becich, Rob Tibshirani, Rebecca Doerge, and Yoav Benjamini.
Literature Cited
Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39:783-791.

Lyons-Weiler J, Patel S, Bhattacharya S. A classification-based machine learning approach for the analysis of genome-wide expression data. Genome Res 2003 Mar;13(3):503-12.


Knudsen, S. 2002. A Biologist's Guide to the Analysis of Microarray Data [link: http://www.cbs.dtu.dk/~steen/book.html] [Steen's Homepage: http://www.cbs.dtu.dk/~steen/]


Citing this Document
This document should be cited as
Lyons-Weiler, J., S. Bhattacharya, and S. Patel. 2003v3. Gene Expression Data Analysis Tool Online Description and Documentation version 1.0. http://bioinformatics.upmc.edu.