Past Research

Previous Research Interests

Research Interest Statement 2012

I believe plants are the most important life forms on earth. They give us the air we breathe, the clothes we wear, the houses we live in, the energy we burn, and the drugs that cure our diseases or make us feel better. Plants will give us all of these for as long as we last, and will then outlive us all. Despite all this, we know very little about how they do what they do. Even for the best-studied species, Arabidopsis thaliana (a wild mustard), we know less than 20% of what its genes do and how or why they do it.
We want to uncover the molecular mechanisms underlying adaptive traits in plants to understand how these traits evolved. A bottleneck in achieving our goals is the limited understanding of the functions of most genes in plant genomes. With a sequenced genome as a starting point, we are building genome-wide molecular networks of genes and proteins using a combination of computational and empirical approaches. Using these networks, we want to elucidate functions of uncharacterized genes rapidly and systematically. Ultimately we are interested in finding patterns of network evolution to identify the evolutionary paths of functional innovation for adaptation.
The questions that we are pursuing are:
* Why are plants so robust to genetic and environmental perturbations and how do they express this resilience?
* How is plant metabolism wired and how does it evolve?
The approaches and projects we are developing to answer the questions are:

* Computational framework to predict metabolic networks of plants
* Reconstruction of co-function networks in plants
* Identification of genome-wide genetic interaction network of plants
* Empirical testing of plant metabolic networks using genetic and metabolomic approaches
* Novel method of measuring functional similarities
* Identification of all genes involved in complex traits such as salt tolerance
* Computational and empirical identification of signaling pathways and complexes
* Identification of novel classes of transcription factor regulators
* Characterization of novel gene families

We employ several methods in our quest: 1) combination of computational modeling and targeted experimental testing in the lab; 2) systematic collection of large-scale data needed for the modeling through collaboration with other labs; 3) robust, quantitative analysis of the data and the models. Our work is inherently embedded in collaborations with other labs, both at Carnegie and at other institutions. This is my vision of our mode of operation. Our own lab is most invested in the synthesis aspect of the ‘research engine’, with a growing component of experimentation. But we collaborate with many excellent labs in all three aspects.

Research Interest Statement 2009

Two things drive my research these days. One is a desire to uncover the mysteries of how plants process the myriad of information from their environment and reprogram their growth and development. I am fascinated with how plants decide. The other is a desire to uncover the potential for greatness in emerging scientists. I am fascinated with how humans, in particular scientists, decide.
I am trying to engage these two drivers in designing projects to foster the diverse interests of individual members of my group, ranging from evolution of pathways to mechanisms of gene regulation, and to capitalize on the diverse training backgrounds in the group, ranging from molecular biology to physics.
Naturally, this network of inspiration lends itself to projects that can be described as a series of Venn diagrams where the intersections represent collaborative and integrative projects among the members. I believe that the union will help satisfy the two driving forces of my research program. In addition, the turnover of lab members over time provides natural checkpoints and selective forces on the evolution of our collective knowledge and expertise.
We employ several methods in our quest: 1) combination of computational modeling and targeted experimental testing in the lab; 2) systematic collection of large-scale data needed for the modeling through collaboration with other labs; 3) robust, quantitative analysis of the data and the models.
Here are examples of the ‘intersecting’ projects in our group to illustrate the types of questions we are asking and approaches we are taking.
Questions and approaches…

The quest for novel biological processes

Systematic discovery of the role of protein degradation in response to the environment

Exploring the power of metabolomics in dissecting genetic interactions

Systematic discovery of signaling pathways and complexes (Lalonde et al, 2010)

Methodologies and tools…

Genome-wide gene association network (AraNet) (Lee et al, 2010)

Network-guided reverse genetic screening (Lee et al, 2010)

Automated metabolic network generation from a set of protein sequences (Zhang et al, 2010)

Discovering missing links in gene expression clusters (Chen et al, 2009)

The next five to ten years…

I expect that many of the current projects will leave along with the lab members to establish their own groups. I am humbled by so many intriguing, unsolved problems in plant biology and am brewing some concrete ideas about the following topics currently, which may or may not become major projects in my group in the next several years.

Systematic discovery of novel reactions and pathways

Determining the “key” players in functional modules

Mechanism of cross-talk between exogenous and endogenous signals for growth and development

Mechanism of genome-wide homeostasis against genetic and environmental variation

Research Interest Statement 2005

Introduction
Recent technical advances in large-scale sequencing and genomics methods as well as in communications have triggered a scientific revolution with immense potential for extending biological knowledge. They have also posed an immense challenge: how to make optimal use of vast quantities of biological data. Without long-term high quality mechanisms for accessing and analyzing the data, the resources used in generating the data are in danger of going to waste. My long-term goal is to discover the rules and mechanisms underlying the workings of a flowering plant (Arabidopsis thaliana) by building an infrastructure to bring all the available data together, developing computer programs that infer knowledge based on the available data, and engaging the research community to test the inferences. Towards this end, we need the following: 1) standards to code not only how much is known to what extent, but also how much is unknown; 2) a collaborative environment that allows researchers to share information and knowledge effectively; 3) systematic, multi-disciplinary approaches for generating, analyzing, and interpreting data capable of handling large-scale datasets without sacrificing data quality; and 4) multi-disciplinary approaches to develop efficient methods of inferring knowledge. This will result not only in new paradigms in plant biology but also in advancement of our knowledge to a point where we can effectively manipulate the flora to improve human health and our environment.
I have been involved in several ongoing projects that address some of the needs stated above. The projects can be grouped into three categories: biological databases, bio-ontologies, and systems approaches in biology. Biological databases include a database for all information of a single organism, a database for a specific type of information (metabolism) in many species, and a database for managing and exploring literature data for any type of system of interest. Bio-ontologies include designing and building ontologies specific for particular domains of biological knowledge such as biological processes, molecular functions, cellular components of all organisms and anatomical parts and developmental stages for flowering plants. Systems approaches include two small projects in collaboration with plant biologists to address questions about specific aspects of Arabidopsis biology such as deciphering the transcriptional regulatory circuit for cold acclimation in plants and systematic determination of subcellular and tissue localization of proteins of unknown function in planta.
In addition to the projects described above, I have a personal mission to mobilize the research community to contribute to biological databases and share knowledge and expertise, to bridge the gap in information dissemination between traditional scientific journals and biological databases, and to bridge the gap between biologists and computer scientists. I believe that the plant biology community is not taking full advantage of the recent advances in communications and technology. Through TAIR, we are creating and testing mechanisms for researchers to provide data and expertise directly to a database. I am communicating with publishers of major plant journals to share data and establish cross-references between journal websites and databases. I am also in communication with an open-access publisher to create a joint journal devoted to publishing papers that are not suitable for traditional journals, such as functional genomics datasets (e.g., microarray data), methods, and reproducible negative results. Finally, I believe that major breakthroughs in bioinformatics will come from in-depth collaborations between biology experts and computer science experts rather than from people who know a little bit of both. As an editor for Plant Physiology, I am managing the publication of bioinformatics papers in this journal in order to educate plant biologists about bioinformatics. I would be very interested in doing the converse: bringing biology papers into a computer science journal.
Biological databases
There are three types of biological database projects in my group: an organism-specific database (TAIR), a metabolism database (MetaCyc), and a literature curation database (PubSearch). All three projects are carried out in collaboration with other groups. Four years ago, we created TAIR (The Arabidopsis Information Resource, arabidopsis.org) in collaboration with software developers at the National Center for Genome Resources. It is a comprehensive Web-based information resource for the model plant Arabidopsis thaliana. Our primary goal was to develop a new information infrastructure containing all available genomic and genetic data and make it accessible to the public through a set of user-friendly search, browse, and visualization tools. In addition to a comprehensive database and web applications, we developed a set of standards in the semantics and syntax of the data to facilitate curation, exchange, and analysis. It is one of the most used resources for plant research today, with about 900,000 page views accessed by about 30,000 unique IP addresses per month. Currently there are 12,752 registered users and 4,745 laboratories, making our user group one of the largest organism-based biological research communities. MetaCyc (www.metacyc.org) is a collaboration with Peter Karp's group at SRI International and aims to represent all experimentally studied metabolism information (including pathways, reactions, enzymes, compounds, and cellular locations) from microbes and plants in computer- and human-readable formats. It has tremendous potential for genomics (serving as a reference database for inferring metabolic pathway annotation using sequence similarity measures of the enzyme sequences), metabolic engineering (comparing metabolic pathways in different organisms), and biological databases (providing detailed, experimentally verified information for genes of interest).
For most biological databases, the literature is one of the main data sources, and significant resources are devoted to capturing this information. Our long-term goal is to develop a set of systematic procedures and tools for integrating knowledge from the confined context of a research article into the dynamic, broad context of a biological database. We have developed a literature curation tool called PubSearch (www.pubsearch.org), which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In collaboration with Simon Twigger's group at the Medical College of Wisconsin, we are extending PubSearch to include a literature fetching function (PubFetch) and work-tracking function (PubTrack) to create a comprehensive environment to manage the literature data.
Bio-ontologies
Although biology is one of the complex systems where large bodies of knowledge exist, descriptions of rules underlying the knowledge reside in a thick semantic soup. Attempts to standardize nomenclature across organisms have essentially failed and remain a difficult task even within a single organism research community. Recently, a few model organism databases have joined forces to standardize the semantics for describing biological process, molecular function, and cellular components of all organisms (Gene Ontology (GO) Consortium, www.geneontology.org) and my group has been an integral part of this effort since 2000. Although the use of GO is becoming a standard, it has some limitations. For example, it does not accommodate anatomical parts or developmental stages of a multicellular organism. Furthermore, it does not attempt to describe traits or phenotypes. In order to accommodate the description of genes and gene products in Arabidopsis, we developed orthologous vocabulary systems for anatomical parts and developmental stages. In addition, we have established a collaboration with other plant model organism databases such as MaizeDB, Gramene, and IRRI, in a project called Plant Ontology Consortium (www.plantontology.org), to develop shared anatomy and developmental stages ontologies for flowering plants. The establishment and usage of these shared, controlled vocabularies will allow researchers to query across all organisms for knowledge and begin to address correlations between structure and function in explicit, systematic ways.
Systems approaches
If we could obtain all the necessary facts about a biological system in computer- and human-comprehensible forms, we could start to ask new questions about biology. I have two recently started projects in this category: one aims to decipher the transcriptional regulatory network involved in cold acclimation in plants (http://aztec.stanford.edu/cold/), and the other attempts to identify the subcellular location of several hundred proteins of unknown function (http://aztec.stanford.edu/gfp/). The cold-acclimation regulatory circuit project is a collaboration with Mike Thomashow and colleagues at Michigan State University and Oregon State University. We are using a combination of microarray analysis, promoter analysis, phylogenetic analysis, and reverse genetics approaches in cold-acclimating plants such as Arabidopsis and barley and in non-acclimating plants such as rice and tomato to ask which genes are involved specifically in cold acclimation and how those genes are transcriptionally regulated.
In an effort to systematically characterize Arabidopsis proteins with unknown function, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify the subcellular localization, in planta (real-time images of live cells in intact plants), of the products of approximately 800 genes that have no known function. In addition to discovering localization patterns of these novel proteins, we are already identifying potential novel organelles and suborganelles.
FUTURE PLANS (NEXT FIVE YEARS)
In the next five years, I would like to continue the three categories of the projects (biological databases, bio-ontologies, and systems approaches) but make a transition from developing infrastructure and tools to creating applications that use the infrastructure to infer new information or identify patterns. However, I value the critical importance of maintaining and updating the resources, which will be done by professional curators and software developers. Personally, I would like to develop programs that can, for example, predict function based on the knowledge and information embedded in TAIR. Also, I am interested in analyzing the bio-ontologies and their annotations to identify any novel patterns, both regular and irregular. In addition to continuing the existing projects, I intend to initiate a couple of new projects, one on building an infrastructure for metabolomics and the other on analyzing the correlation between networking and scientific success in collaboration with social scientists.
Biological Databases
I would like to transform TAIR into a discovery environment for all plant researchers, educators, and students. The proposed work will include a comprehensive annotation of the genome, transcriptome and proteome, including regulation and phenotype information. TAIR will provide access to all public data resulting from large-scale 'omics' research and traditional 'hypothesis-driven' research in intuitive, powerful, and highly integrated views capable of facilitating new discoveries about plant development and physiology. The project will continue to develop controlled vocabularies and standardized data exchange mechanisms for maximal interoperability with other biological databases and will provide data in explicitly defined and structured formats to facilitate programmatic data retrieval. TAIR's strong support within the plant research community will be utilized to create networks of information connecting TAIR to other plant databases, web resources for specific types of Arabidopsis information, and traditional scientific journals. In addition, TAIR's role as an essential resource in the plant research community requires that a mechanism for long-term support of the project be established. To that end, several potential ways to generate revenues will be explored.
In the next five years, we will focus on completing the plant metabolism information in MetaCyc to a gold standard such that it will effectively have replaced all the textbooks. Towards that end, we will actively solicit collaboration from the classical biochemists and other colleagues from the Society of Phytochemistry in addition to curating the data from the primary literature, reviews, and textbooks and results from functional genomics and proteomics experiments. Once the known information is complete and updated in the database, we can start to ask questions about missing information (e.g., enzymes, compounds, and pathways that are missing in one organism as compared to another). In addition, we should be able to ask questions about the differences and similarities between strategies taken by different organisms.
For PubSearch, I am interested in collaborating with computer scientists to incorporate methods such as Natural Language Processing for more automated literature curation. Our experience of manual and semi-manual extraction of knowledge from literature would provide a good baseline for such a collaborative project.
Bio-ontologies
One of the immediate applications of bio-ontologies is in associating biological objects such as genes. This allows quantitative comparison of genes and can facilitate interoperability (querying of one database by another) if multiple databases use the same ontologies to annotate data objects. I am interested in analyzing the ontologies and their annotated data objects in TAIR, GOC, and POC databases to determine the global organization patterns of the ontologies and the genes using graph theoretical calculations. I am also interested in creating and using ontologies for complex information and will focus on describing phenotype information using multiple ontologies.
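As a minimal sketch of what "quantitative comparison of genes" through shared ontology annotations could look like, the snippet below scores two genes by the overlap of their annotation term sets. The Jaccard index is an assumption chosen for illustration, not a method stated in this document, and the gene names and terms are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Overlap of two annotation sets: |intersection| / |union| (0.0 if both are empty)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


# Hypothetical genes, each annotated with a set of ontology terms.
annotations = {
    "geneA": {"response to cold", "transcription factor activity", "nucleus"},
    "geneB": {"response to cold", "transcription factor activity", "cytoplasm"},
    "geneC": {"photosynthesis", "chloroplast"},
}

similarity_ab = jaccard(annotations["geneA"], annotations["geneB"])  # 2 shared / 4 total
similarity_ac = jaccard(annotations["geneA"], annotations["geneC"])  # nothing shared
```

Because the score depends only on term identifiers, the same comparison works across databases as long as they annotate with the same ontologies.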
Systems approaches
I am particularly interested in the preliminary results from the projects in this category. From the cold acclimation project, we found that transcripts are turned on in a series of waves as a function of time in cold-treated Arabidopsis. In order to group the genes into more discrete regulons (genes that are regulated by the same transcription factor(s)), I feel that we need to learn more about the potential promoter regions. Towards that end, we have gathered and curated all the experimentally verified cis-elements from biological databases. Using PubSearch, we can efficiently extract all other experimentally verified cis-elements. We will use this dataset to map the non-coding sequences of genes and intergenic regions and ask if there are any high-level patterns of cis-element compositions in the non-coding genic and intergenic sequences. If we can define promoters more precisely, then we should be able to develop algorithms to compare promoters. Using better-defined promoter information, we want to analyze the microarray data. In addition, we intend to curate all of the known cis-element/transcription factor relationships. I would like to collaborate with computer scientists interested in developing heuristic algorithms that could predict transcription factor/cis-element relationships based on the curated dataset.
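Mapping curated cis-elements onto non-coding sequence can be sketched as a simple exact-match scan of both strands. This is an illustration under stated assumptions, not the project's actual pipeline: the function names, the example sequence, and the choice of exact matching (rather than position weight matrices) are all hypothetical, and the motif is used only as an example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_sites(seq, motif):
    """Return sorted (position, strand) pairs for every exact match of the motif.

    Positions are 0-based offsets on the given (forward) sequence.
    A palindromic motif is reported once, on the '+' strand only.
    """
    patterns = {"+": motif}
    rc = reverse_complement(motif)
    if rc != motif:
        patterns["-"] = rc
    hits = []
    for strand, pattern in patterns.items():
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


# Hypothetical upstream sequence scanned for an example element.
upstream = "TTACCGACATGGTCGGAA"
sites = find_sites(upstream, "CCGAC")
```

Running the same scan for every curated element over every upstream region yields a per-gene element composition, which is the kind of table one would need before comparing promoters.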
The unknown protein localization project also has several avenues we want to pursue. First, we want to use the experimental data as a training set to determine if we can identify any new targeting/localization signals and motifs by either using existing algorithms (e.g., TargetP) or developing new algorithms in collaboration with computer scientists. In addition, this project is just starting to produce results, and the first 1% of the unknown proteins not only revealed interesting localization patterns such as cell-type and tissue specificity, but also uncovered some novel localization patterns. Some of the novel localization patterns may be novel organelles or suborganelles previously undetected. We are set to capture localization images of 800 genes this year and have submitted a renewal to do another 4,000 genes. Even within the current grant period, we will produce about 8,000 images. In order to group the novel patterns into categories and analyze all of the images efficiently, we need to be able to perform content-based image searching and to cluster the images. I would be very interested in collaborating with computer scientists to develop such programs and use them to analyze these localization patterns.

Research Interest Statement 2002

My goal is to build an infrastructure that allows researchers to share information and knowledge in order to identify new insights and facilitate the process of generating new paradigms in biology. A long-term goal is to systematically delineate what is known and unknown in order to mobilize the research community to work out the rules underlying the workings of an organism.
One of the most efficient ways of solving problems in biology lies in the use of model organisms or systems in which the basic rules are uncovered and applied to more diverse sets of organisms and problems. For higher plants, Arabidopsis thaliana has been adopted as a model organism due to its small genome size, self-compatibility, and short generation time. Since its adoption as a model organism, many tools have been developed for this plant, including facile and efficient methods of transformation, a complete genome sequence, and high-density genetic maps. Capturing and representing biological knowledge from studies using Arabidopsis thaliana is the subject of my research. More specifically, my group has developed a computer-based infrastructure to capture information about the research community and the knowledge generated in the research literature, and has developed a query/analysis/visualization system to allow researchers to identify correlations in the information. In the future, we would like to develop a knowledge-capture system to bring the research findings directly into the computer infrastructure, and develop a simulation system that can accurately predict the outcome of any scenario that may occur in the plant.

I. The Arabidopsis Information Resource (TAIR): A Comprehensive Infrastructure for Arabidopsis Biology Information
Most knowledge resides in the minds of individual researchers and their laboratories. Some of this knowledge is refined into publications. With approximately 11,000 researchers and 4,000 laboratories around the world, the Arabidopsis research community is arguably the largest model organism research community to date, with the possible exception of the human biology research community. The community working on Drosophila melanogaster, an insect that has been the subject of genetic research for almost 100 years (a history more than five times as long as that of Arabidopsis), is about half the size of the Arabidopsis community, at about 5,000 researchers.
In order to capture the knowledge from this large research community, we need to develop an infrastructure that allows researchers to find and share the information and knowledge generated. Advances in computer science and communications technology have established the internet as the most efficient medium for exchanging knowledge. In addition, advances in high-throughput technologies such as sequencing and microarray methods have allowed biologists to produce large quantities of data. Developing an infrastructure to house these large quantities of data and make them accessible has been a problem for many research communities. In collaboration with information technology scientists at the National Center for Genome Resources in Santa Fe, New Mexico, my group has been engaged in developing an infrastructure to house the vast quantities of information for Arabidopsis. The infrastructure is called The Arabidopsis Information Resource (TAIR, http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. For example, researchers can identify genes or proteins of interest based on many parameters (e.g., subcellular localization, expression patterns, or mutant phenotypes) from the text-based search forms, sequence analysis tools, or bulk query forms. SeqViewer (http://arabidopsis.org/servlets/sv) allows visualization of these genes on the genome decorated with clones, transcripts, genetic markers and polymorphisms. The SeqViewer interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence. Alternatively, researchers can visualize these genes mapped on metabolic pathways from the whole cell level down to individual reactions along with metabolic compound structures using AraCyc (http://arabidopsis.org/tools/aracyc).
Upon finding relevant information about genes, researchers can order associated DNA or seed stocks from the Arabidopsis Biological Resource Center (ABRC, http://arabidopsis.org/arbrc). Detailed and up-to-date information about the database content as well as its usage statistics can be found online (http://arabidopsis.org/about).
TAIR uses an object-oriented approach to data representation and software architecture. The underlying database is implemented in a relational database management system (Sybase version 11.9.2). The data is organized in a hierarchical structure where a parent table groups a set of child tables with similar attributes and each node can be linked to other nodes and tables. At the top of the data hierarchy is the TairObject class, which is linked to other top parent classes such as Attribution (source of the data), Reference (experimental evidence source), and Annotation (descriptive information). Thus, the Attribution, Reference and Annotation classes constitute the metadata of all TAIR objects. This design makes it easy to add new data types, offers flexibility, and minimizes the number of linking tables. More detailed information about the database schemas and documentation can be found online (http://arabidopsis.org/search/schemas.html).
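The hierarchy described above can be sketched as follows. This is a simplified illustration in Python, not TAIR's actual schema or code: only the class names TairObject, Attribution, Reference, and Annotation come from the text, while every field name and the Gene and Clone subclasses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribution:
    source: str  # who supplied the data (hypothetical field)


@dataclass
class Reference:
    citation: str  # experimental evidence source (hypothetical field)


@dataclass
class Annotation:
    text: str  # descriptive information (hypothetical field)


@dataclass
class TairObject:
    """Top of the hierarchy: every data type inherits the same metadata links."""
    name: str
    attributions: List[Attribution] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Gene(TairObject):  # hypothetical child type
    chromosome: int = 0


@dataclass
class Clone(TairObject):  # hypothetical child type
    vector: str = ""


gene = Gene(name="example gene", chromosome=1)
gene.annotations.append(Annotation("involved in cold acclimation"))
clone = Clone(name="example clone", vector="example vector")
```

The point of the design is visible in the last lines: a new data type only needs to extend the parent to gain attribution, reference, and annotation links, with no additional linking structure per type.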
TAIR software is developed in a client-server mode using Java Servlet technology. All applications are accessible to users through common web browsers to accommodate the widest possible range of user platforms and operating systems. Software for accessing the database is developed using an object-oriented architecture. A set of Java classes called the TAIR Foundation Classes provides a number of functions to the front-end applications, which use JavaServer Pages. Documentation of the TAIR Application Program Interface can be found in the 'About TAIR' section of the home page. A set of bulk download tools based on flat files uses CGI scripts written in Perl. Finally, a number of weekly updated, static HTML pages serve relevant Arabidopsis and external links information to the community.
This project, in its third year, is accessed by about 20,000 unique internet addresses per month. Researchers around the globe generate approximately 2.5 million hits and access about 500,000 web pages every month. TAIR is currently the most visible Arabidopsis project. For example, when searching for the word 'Arabidopsis' on Google (http://google.com), TAIR is at the top of the list.
II. PubSearch: A Comprehensive Literature Extraction and Curation System
Peer-reviewed research articles remain the best medium for representing and disseminating the refinement of scientific knowledge. For any model organism database (MOD), the literature is one of the main data sources, and significant resources are devoted to capturing this information. Our long-term goal is to develop a set of systematic procedures and tools for integrating knowledge from the confined context of a research article into the dynamic, broad context of a model organism database.
We have developed a literature curation tool called PubSearch, which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In PubSearch, first-pass associations between terms (gene names and keywords) and articles are made automatically by a string matching program that indexes terms to articles. Commonly occurring words such as AND, THE, IF (stop words) are filtered out to keep meaningless associations from being stored. For terms with a higher signal-to-noise ratio, curators verify the matches via the web browser user interface.
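The first-pass indexing step can be sketched as below. This is a minimal illustration of stop-word filtering plus string matching, not PubSearch's actual code (which is written in Java): the function name, the whole-word matching rule, and the sample terms and articles are all assumptions.

```python
import re

# Stop words named in the text; a real list would be much longer.
STOP_WORDS = {"and", "the", "if"}


def index_terms(terms, articles):
    """Map each non-stop-word term to the ids of articles that mention it.

    terms: iterable of gene names or keywords.
    articles: dict of article id -> text (for example, title plus abstract).
    Matching is case-insensitive and restricted to whole words or phrases.
    """
    index = {}
    for term in terms:
        if term.lower() in STOP_WORDS:
            continue  # would match almost every article, so never store it
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        matches = [article_id for article_id, text in articles.items()
                   if pattern.search(text)]
        if matches:
            index[term] = matches
    return index


# Hypothetical terms and articles.
articles = {
    1: "CBF1 regulates cold acclimation in the plant.",
    2: "The role of light in flowering.",
}
index = index_terms(["CBF1", "cold acclimation", "THE"], articles)
```

The resulting term-to-article associations are only candidates; in the workflow described above, curators then verify them through the web interface.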
PubSearch uses a simple database schema in a MySQL database management system (DBMS) (version 3.21), which can be queried and updated over the internet from a web browser through a password-protected login mechanism. The middleware is written in Java (version 1.3) and uses Java Servlet and JavaServer Pages (JSP) technology. The system is currently running on a Red Hat Linux 7.2 system with Tomcat (version 4.0) as the servlet engine. A demo of the current version of this tool and its documentation can be accessed from:
http://tesuque.stanford.edu:9999/pub/index.jsp
Username: demo Password: demo
The tool has been used and refined for the past 6 months by 7 curators at TAIR and 5 Arabidopsis curators at The Institute for Genomic Research (TIGR) to curate over 12,000 articles. The tool is much more convenient and user-friendly than our old system, which involved flat files, and our curation work has become much more efficient as a result.
In addition to providing curators with a sophisticated tool to facilitate literature curation, this project significantly benefits three parts of the research community. First, the Arabidopsis research community benefits from access to accurate and consistent annotations of data objects from the literature, which are produced in a fast, efficient manner. Second, researchers engaged in high-throughput genomic projects benefit by having access to reliable, high-quality annotations that can be used to enhance automated annotations. Often sequence comparison is used to predict the potential function of genes and gene products in a newly sequenced organism; accurate and detailed descriptions of a model genome and its complements will improve the accuracy of the newly sequenced organism’s annotation. Third, members of the computer science research community can use the rules, methods and curated data to develop more sophisticated and accurate algorithms to extract and analyze data from the literature. The set of human-curated data along with explicit rules used for the annotations will provide much-needed test data sets for developing and improving algorithms based on methods such as natural language processing and machine learning. This final application of the tool raises the possibility that manual curation of literature can be greatly reduced, allowing our curation teams the freedom to use their scientific training to explore and question the data collected in MODs, leading to new hypotheses and potential discoveries.
III. Gene Ontology Consortium and Plant Ontology Consortium: Establishing systematic ways of describing biology for all organisms in both human and machine-readable forms
Although biology is one of the complex systems where large bodies of knowledge exist, descriptions of the rules underlying that knowledge reside in a thick semantic soup. Attempts to standardize nomenclature across organisms have essentially failed, and standardization remains a difficult task even within a single organism research community. Recently, a few model organism databases (yeast, mouse, and Drosophila) have joined forces to standardize the semantics with which to describe the roles of genes and gene products (Gene Ontology (GO) Consortium, http://www.geneontology.org), and my group has been an integral part of this effort since 2000. GO attempts to describe the roles of genes and gene products in three large aspects: molecular function, biological process, and cellular component. The controlled vocabulary within each of these three aspects is structured as a directed acyclic graph (DAG), which allows each term to have multiple parents. Two types of parent-child relationship, ‘is a’ and ‘part of’, currently exist in GO. Since joining this group, we have added over 500 terms relevant to plants and restructured about 400 terms within the ontologies to better reflect plant biology. Collectively, the consortium has developed over 12,000 terms. This project has been well received by the biology community; it is currently used by over 10 large databases around the world, including SWISS-PROT and TIGR, and is being incorporated into MEDLINE.
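The DAG structure described above can be illustrated with a minimal sketch. This is not the GO Consortium's software; the class name, method names, and the example terms and edges are assumptions chosen only to show how a term can have several parents through ‘is a’ and ‘part of’ relationships, and how all of a term's ancestors can be collected.

```python
from collections import defaultdict


class Ontology:
    """A tiny controlled vocabulary stored as a directed acyclic graph.

    Hypothetical illustration only; not the GO file format or GO tooling.
    """

    RELATIONS = ("is_a", "part_of")

    def __init__(self):
        # Maps a child term to a list of (relation, parent term) pairs.
        self.parents = defaultdict(list)

    def ancestors(self, term):
        """Return every term reachable by following parent edges."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for _relation, parent in self.parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_edge(self, child, relation, parent):
        """Add a parent-child edge, refusing anything that would form a cycle."""
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self.parents[child].append((relation, parent))


# Illustrative terms and edges, not the actual GO structure.
go = Ontology()
go.add_edge("plastid", "is_a", "organelle")
go.add_edge("plastid", "part_of", "cytoplasm")
go.add_edge("chloroplast", "is_a", "plastid")
print(sorted(go.ancestors("chloroplast")))
```

Because a term inherits every ancestor, a gene annotated to a specific term can also be retrieved by a query on any of its more general parents.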
Although the use of GO is becoming a standard, it has some limitations. For example, it does not accommodate anatomical parts or developmental stages of a multicellular organism. Furthermore, it does not attempt to describe traits or phenotypes. To accommodate the description of genes and gene products in Arabidopsis, we developed orthogonal vocabulary systems for anatomical parts and developmental stages, in collaboration with Jonathan Clarke at the John Innes Centre, UK. In addition, we have established a collaboration with other plant model organism databases such as MaizeDB, Gramene, and IRRI, in a project called the Plant Ontology Consortium, to develop shared anatomy and developmental stage ontologies. In this project, the Arabidopsis vocabularies have been used as the baseline onto which terms from other plants have been added and the structures modified, with the goal of accommodating the description of all plant genes and gene products.
The establishment and usage of these shared, controlled vocabularies will allow researchers to query across all organisms for knowledge and begin to address correlations between structure and function in explicit, systematic ways.
FUTURE PLANS IN THE NEXT FEW YEARS
I. Enhancement of TAIR schema and content
Currently the information in TAIR is heavily focused on the finished genome and its gene complement. In the next few years, we would like to enhance the structure of the TAIR database to represent more information about gene products. This includes genetic, physical, and regulatory relationships between genes and gene products. In addition, the relationship between genotype (a polymorphism in a sequence) and phenotype (of a germplasm harboring the polymorphism(s)) will be established. Finally, more derived relationships of genes and gene products will be stored; these include gene family information based on phylogenetic analysis, expression clusters based on microarray data analysis, and metabolic pathway groupings based on enzymatic assays.
II. Enhancement of TAIR’s query and data input systems
Most of the initial effort on the TAIR project went into developing a database structure to store the complex data types and relationships that represent Arabidopsis biology. In addition, a set of sophisticated query and data retrieval software has been implemented. However, the current set of query tools does not reflect the underlying complexity of the database structure. In the next few years, we will focus on developing a comprehensive set of query tools that allow researchers to retrieve any combinations and correlations of data stored in TAIR. In effect, we will be developing a user interface with which researchers can design and execute Structured Query Language (SQL) queries against the TAIR database.
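One way such an interface could translate a researcher's choices into SQL is sketched below. This is only an illustration under stated assumptions: the `gene` table, its columns, the placeholder records, and the function names are hypothetical and do not describe the actual TAIR schema. The sketch checks table and column names against a known list and passes values as parameters, so user input never becomes part of the SQL text.

```python
import sqlite3

# Hypothetical, simplified schema; the real TAIR tables are not described here.
SCHEMA = {
    "gene": {"locus", "chromosome", "description"},
}


def build_select(table, columns, filters):
    """Turn form-style choices into a parameterized SELECT statement."""
    if table not in SCHEMA:
        raise ValueError(f"unknown table: {table}")
    if not columns:
        raise ValueError("at least one column must be selected")
    allowed = SCHEMA[table]
    for name in list(columns) + list(filters):
        if name not in allowed:
            raise ValueError(f"unknown column: {name}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    params = []
    if filters:
        clauses = []
        for name, value in filters.items():
            clauses.append(f"{name} = ?")  # the value is bound, not interpolated
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def demo():
    """Run a generated query against an in-memory database of placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (locus TEXT, chromosome INTEGER, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [("GENE_A", 1, "placeholder record"), ("GENE_B", 2, "placeholder record")],
    )
    sql, params = build_select("gene", ["locus"], {"chromosome": 2})
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, rows


if __name__ == "__main__":
    print(demo())
```

Restricting identifiers to a known schema is a design choice that keeps the generated SQL predictable while still letting a researcher combine any listed columns and filters.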
In addition, we will develop a set of data entry and update tools to allow researchers to add and update any information in the database. Currently, we have an interactive data entry system only for person or organization profile information. We plan to expand this to allow researchers to add information about genetic markers, genes, proteins, microarray experiments, etc. We will also implement a system that allows a researcher to attach his or her own comments to any information at TAIR. Our long-term goal is to establish TAIR as an essential communication and research tool: the first place a researcher goes to find out about any aspect of Arabidopsis biology. Some aspect of in-house curation will always be essential, but we hope to distribute some of the curation responsibilities to the researchers who generated the data and thus create a co-operative resource.
III. Expansion of TAIR for plant researchers
Because the value of Arabidopsis derives from its utility in understanding other plants, our goal is to build an infrastructure that permits facile, high-resolution linking of specific information about Arabidopsis to similar information in all other plants (and vice versa).
Ultimately, our goal is to provide the common vocabulary, visualization tools, and information retrieval mechanisms that permit integration of all knowledge about Arabidopsis into a seamless whole that can be queried from any perspective. Of equal importance for plant biologists, the ideal TAIR will permit a user to use information about one organism to develop hypotheses about less well-studied organisms. In the next few years, we hope to develop user-friendly tools that permit an individual working outside this model species to formulate a query based on their organism of interest, have that query directed to the relevant knowledge in Arabidopsis, and have the information presented in a way that can be understood by any plant biologist. We will be making efforts to cross-link information in TAIR with information about other plants and organisms in other databases. In addition, we will develop a more comprehensive help system to allow researchers not familiar with Arabidopsis to use the information in TAIR more effectively.
IV. Dissecting the unknown in Arabidopsis
Sequencing the genome revealed the extent of the gaps in our knowledge about Arabidopsis. Approximately 27,000 genes (and 2,000 pseudogenes) have been predicted based on gene prediction programs and sequence comparisons. Of these, approximately 30% have evidence of transcription (e.g. ESTs available) but are not similar to any genes of known function. About 10-15% of the genes have no evidence of transcription at all (termed ‘hypothetical’). Moreover, only approximately 1% of the genes have experimental evidence for subcellular location.
In an effort to systematically characterize the unknown, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify the subcellular localization of approximately 800 genes that have no known function, are not similar to any known genes, and have no localization information. The selected genes, with their 5’ and 3’ intergenic regions, will be PCR-amplified and fused to GFP, and transgenic plants harboring the clones will be examined for subcellular localization. Our role will be to develop a Laboratory Information Management System (LIMS) to store and prioritize the candidate genes for cloning based on a number of criteria (including annotation downloaded from TAIR, existence of full-length cDNA, etc.), track the status of the cloning, upload the preliminary results for internal discussions, and export the data to TAIR and other public repositories. In addition, the experimental results from this study will be used to identify potential novel signal peptides and improve subcellular localization prediction algorithms.
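The prioritization and status tracking described above could be modeled roughly as follows. This is a minimal sketch, not the planned LIMS: the field names, the status labels, and the scoring weights are all assumptions made for illustration, and the real criteria would be set together with the collaborating labs.

```python
from dataclasses import dataclass

# Hypothetical cloning stages, loosely following the steps named in the text.
STATUSES = ("selected", "pcr_amplified", "fused_to_gfp", "transformed", "imaged")


@dataclass
class Candidate:
    locus: str                   # placeholder identifier for a candidate gene
    has_full_length_cdna: bool   # one criterion mentioned in the text
    has_est: bool                # assumed additional criterion
    status: str = "selected"

    def priority(self):
        # Illustrative weights only: a full-length cDNA counts more than an EST.
        return 2 * int(self.has_full_length_cdna) + int(self.has_est)

    def advance(self):
        """Move the candidate to the next cloning stage."""
        index = STATUSES.index(self.status)
        if index + 1 >= len(STATUSES):
            raise ValueError(f"{self.locus} is already at the final stage")
        self.status = STATUSES[index + 1]


def rank(candidates):
    """Order candidates by descending priority, breaking ties by locus name."""
    return sorted(candidates, key=lambda c: (-c.priority(), c.locus))


if __name__ == "__main__":
    queue = rank([
        Candidate("GENE_C", has_full_length_cdna=False, has_est=True),
        Candidate("GENE_A", has_full_length_cdna=True, has_est=True),
        Candidate("GENE_B", has_full_length_cdna=True, has_est=False),
    ])
    queue[0].advance()
    for candidate in queue:
        print(candidate.locus, candidate.priority(), candidate.status)
```

Keeping the stages in a single ordered tuple means the cloning status can only move forward one step at a time, which makes the tracking history easy to audit.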
V. Education and outreach to scientists, educators, and general public
We plan to expand the resources at TAIR for education and outreach. First, we will provide educational resources (e.g. curricula, protocols, professional development materials) for high school and undergraduate-level teachers who are teaching, or interested in teaching, plant courses and laboratories. In addition to gathering these materials ourselves, we will implement an online submission form for teachers and scientists to submit useful, classroom-tested protocols. Second, we will establish a community of teachers and scientists by setting up a mailing list and actively recruiting members of the scientific community to serve as advisors for the teachers. Third, we are developing an extensive set of help pages, a glossary, and tutorials for the resources available at TAIR, to help high school and undergraduate-level teachers and students use TAIR for their projects. This aspect of the project will be enhanced by collaborations with teachers who are interested in developing courses that use TAIR. We are currently in discussion with a couple of local high school and community college teachers.