More Information about Bioinformatics
Background
Over the past decade, the explosion in molecular genetics and genomics technology has given biologists the ability to produce unprecedented quantities of data about the thousands of genes in individual organisms. Historically, assessing transcriptional changes in even a few genes associated with an environmental challenge was a major undertaking. With this new knowledge and technology, surveying changes across the entire transcriptome is now within reach for some species.
For most genomic studies, the number of independent variables (individual gene expression levels) is huge and the sample sizes (number of individual animals and sites studied) are small. Analyzing these data with conventional statistical approaches yields more unknowns than equations, so there are too few data to solve the problem. In addition, few phenomena in biology are strictly linear, and nonlinear relationships are harder to interpret than linear ones. Therefore, the most appropriate mathematical structure for representing the data is generally unclear (Almeida and Voit, 2003).
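The "more unknowns than equations" problem can be made concrete with a toy example. The sketch below uses invented numbers (two samples and three genes; none of the values come from the project) and a simple linear model chosen only for illustration. It shows that two quite different sets of coefficients reproduce the same observations exactly, so the data alone cannot choose between them.

```python
# Toy illustration with made-up numbers: 2 samples (equations), 3 genes (unknowns).
expression = [
    [1.0, 2.0, 3.0],  # sample 1: expression levels of three hypothetical genes
    [2.0, 1.0, 0.0],  # sample 2
]
response = [14.0, 4.0]  # hypothetical measured response for each sample


def predict(weights, samples):
    """Linear model: predicted response is the weighted sum of expression levels."""
    return [sum(w * x for w, x in zip(weights, row)) for row in samples]


# Two different coefficient vectors; they differ by (1, -2, 1), a direction
# that the two samples cannot "see".
weights_a = [1.0, 2.0, 3.0]
weights_b = [2.0, 0.0, 4.0]

fits_a = predict(weights_a, expression) == response
fits_b = predict(weights_b, expression) == response
print(fits_a, fits_b)
```

With thousands of genes and only a handful of animals, the family of equally good solutions is far larger than in this three-gene case.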
A final complication to the understanding of genetic profiles is that in normal physiological responses, nearly all gene products are part of a metabolic cascade and their individual contributions to metabolic output might be small. However, their collective (epistatic) interactions are likely to be major factors affecting the process and its output. While such effects can be identified by linear statistics, they cannot be effectively modeled using this approach.
Analysis and synthesis of the complex data resulting from genomics research, therefore, require new statistical approaches. “Bioinformatic” approaches can be very different from traditional statistical approaches to data analysis and are based upon the application of non-linear modeling tools (e.g., artificial neural networks) and machine learning to mine the data and information that exist in complex genomic data sets.
Approach
The ultimate goal of the bioinformatics program is to generate mathematical representations of gene expression profiles, which can be used to identify new and innovative biomarkers of environmental stress and accomplish the transition from the dynamics of transcript profiles in individuals to the behaviors of populations and ecosystems. The first part of the goal is a relatively straightforward application of existing bioinformatics tools (e.g., Ball et al., 2002) and requires the application of a multi-layer, feed-forward Artificial Neural Network (ANN). The model architecture is illustrated and described in Figure 1.
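As a rough illustration of what a multi-layer, feed-forward network computes, the sketch below passes one expression profile through a small network. It is not the architecture of Figure 1: the layer sizes, the sigmoid activation, the weights, and the input values are all assumptions made up for this example, and the network is untrained.

```python
import math


def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one output unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the inputs through each (weights, biases) pair in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations


# Hypothetical, untrained network: 4 genes -> 3 hidden units -> 1 output.
network = [
    ([[0.5, -0.2, 0.1, 0.0],
      [-0.3, 0.8, 0.0, 0.4],
      [0.2, 0.2, -0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]

profile = [0.9, 0.1, 0.4, 0.7]  # made-up normalized expression levels
score = feed_forward(profile, network)[0]
print(score)  # a value between 0 and 1, e.g. read as a stress score
```

In practice the weights would be fitted to labeled expression data; the nonlinearity in each unit is what lets the network capture interactions that a purely linear model cannot represent.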
The development of models that can predict population or ecosystem health from gene expression profiles consists of two steps. Step one is the construction of expressions mapping gene expression profiles to the health status of individual organisms. Step two is to make the transition from the molecular dynamics of individuals to the behavior of populations and ecosystems. The essence of this problem is defining the appropriate measures of population and ecosystem health status.
Regardless of the data set and purpose of the analysis, it will be important to identify which of the vast number of genes on the microarray are truly responding to the environmental challenges of interest. This is necessary for several reasons. First, a complex microarray can provide information on the genetic response (transcriptome) for thousands of genes. Such arrays are, however, expensive and time-consuming to develop and may provide more information than can be interpreted or comprehended given the present state of knowledge for most marine species. If a microarray can be reduced to a relatively small number of genes (e.g., hundreds) that represents the overall response of the individual to the challenge, then the biological significance of the data will be easier to comprehend and apply. Reducing the number of genes being evaluated would also shorten the computation time and reduce the hardware resources needed.
There are a variety of approaches for reducing a microarray comprising thousands of genes to a much smaller suite that best represents the overall organism response: clustering and self-organizing maps (Eisen et al., 1998), shrinkage-based metrics (Cherepinsky et al., 2003), and Spearman rank correlations, as well as sensitivity analyses from the ANN itself.
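Of the reduction approaches listed, the Spearman rank correlation is the simplest to sketch. The example below ranks genes by the absolute rank correlation between their expression and a stressor level, and keeps the strongest few. The gene names, data values, and the `top_genes` helper are hypothetical, and this is only one plausible way to apply the statistic, not the project's documented procedure.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def top_genes(expression_by_gene, stressor, keep):
    """Hypothetical helper: keep the genes most rank-correlated with the stressor."""
    ordered = sorted(
        expression_by_gene,
        key=lambda gene: abs(spearman(expression_by_gene[gene], stressor)),
        reverse=True,
    )
    return ordered[:keep]


# Made-up data: one stressor level and three genes measured in five animals.
stressor = [0.1, 0.5, 1.0, 2.0, 4.0]
expression_by_gene = {
    "gene_up": [1.0, 1.2, 1.9, 3.5, 9.0],    # rises with the stressor
    "gene_down": [8.0, 6.0, 5.5, 2.0, 1.0],  # falls with the stressor
    "gene_flat": [3.0, 1.0, 4.0, 1.5, 2.9],  # no consistent trend
}
selected = top_genes(expression_by_gene, stressor, keep=2)
print(selected)
```

Because the statistic works on ranks, it picks up any monotonic response, linear or not, which suits expression data where the shape of the response is unknown. Using the absolute value keeps both up-regulated and down-regulated genes.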
Literature Cited
Almeida JS, Voit EO. 2003. Neural-network-based parameter estimation in S-system models of biological networks. Genome Informatics 14: 114-123.
Cherepinsky V, Feng J, Rejali M, Mishra B. 2003. Shrinkage-based similarity metric for cluster analysis of microarray data. Proc Natl Acad Sci U S A 100(17): 9668-9673.
Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95(25): 14863-14868.
Holland AF, Sanger DM, Gawle CP, Lerberg SB, Santiago MS, Riekerk GH, Zimmerman L, Scott GI. 2004. Linkages between tidal creek ecosystems and the landscape and demographic attributes of their watersheds. Journal of Experimental Marine Biology and Ecology 298: 151-178.