Genomics Links


Sequence Databases and Retrieval Tools

NCBI is the number one resource for molecular biologists. It maintains GenBank and provides free BLAST and Entrez searches via e-mail, client software, or directly over the Web.

This is the premier Web engine for DNA and protein homology searches.

Entrez is a molecular sequence and document retrieval system, which contains an integrated view of portions of MEDLINE, and all publicly available nucleotide and protein databases. The Protein and Nucleotide entries in Entrez have been compiled from a variety of sources, including GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB. Entrez is extremely useful for obtaining cross-referenced documentation for a particular sequence once you know its database accession number.
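
For readers who would rather script a lookup by accession number than use the Web form, the following is a minimal sketch. It assumes NCBI's E-utilities `efetch` endpoint, which this page does not itself describe. The helper name `build_efetch_url` is hypothetical, and the accession used is only an illustrative value. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities efetch endpoint (not mentioned on this page).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Return an efetch URL for a single accession number (hypothetical helper)."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession number of the record
        "rettype": rettype,  # record format to return
        "retmode": "text",   # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


# Illustrative accession number; substitute the one you are interested in.
url = build_efetch_url("U49845")
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client.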

The European equivalent of NCBI

Mirrors of the GenBank/EMBL databases as well as local databases, including Genome Information Broker for Microbial Genomes (GIB), Protein Data Bank (PDB), and a unified taxonomy database (TXSearch). Provides online tools for FASTA, SSEARCH, and BLAST searches; multiple alignment using "MALIGN" and "CLUSTAL W"; protein secondary structure prediction; and protein 3D structure analysis by threading (LIBRA).

OWL is a non-redundant superset of SWISS-PROT, PIR, GenPept, and NRL-3D. Entries are amalgamated from primary source databases by a process in which redundant and trivially different entries are eliminated.
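
The idea of amalgamating sources while dropping redundant entries can be illustrated with a small sketch. This is not OWL's actual procedure, which is not described here. It assumes that entries arrive in priority order, that "redundant" means an identical sequence, and that "trivially different" covers only letter case and spaces. The function name, identifiers, and sequences are all made up for illustration.

```python
def remove_redundant(entries):
    """Keep the first entry seen for each distinct sequence.

    entries: list of (source, identifier, sequence) tuples, listed in
    priority order, so an earlier source wins over a later one.
    """
    seen = set()
    kept = []
    for source, identifier, sequence in entries:
        # Assumption: differences in case or spacing count as trivial.
        key = sequence.replace(" ", "").upper()
        if key in seen:
            continue  # a matching sequence was already kept
        seen.add(key)
        kept.append((source, identifier, sequence))
    return kept


# Hypothetical entries; identifiers and sequences are invented.
entries = [
    ("SWISS-PROT", "example_1", "MKTAYIAKQR"),
    ("PIR", "example_2", "mktayiakqr"),
    ("GenPept", "example_3", "MKTAYIAKQRQISFVK"),
]
nonredundant = remove_redundant(entries)
print([identifier for _, identifier, _ in nonredundant])
```

Here the PIR entry is dropped because it differs from the SWISS-PROT entry only in letter case, while the longer GenPept sequence is kept.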

TIGR is responsible for creating the first database of cDNA sequences and the first complete microbial genome sequence. It has a tremendous amount of sequence information available, including both "fresh off the sequencer" and processed and annotated data.

A WWW implementation of SRS, similar to the LOOKUP program available at RCR.

The Protein Data Bank is an archive of experimentally determined three-dimensional structures of biological macromolecules, serving a global community of researchers, educators, and students.

The PIR is the host of the PIR-International Protein Sequence Database, a comprehensive, annotated, and non-redundant protein sequence database in which entries are classified into family groups and alignments of each group are available.


The GeneX project aims to provide an Internet-available repository of gene expression data with an integrated toolset that will enable researchers to analyze their own microarray data and compare them with other such data.

GEO is a gene expression and hybridization array data repository as well as an online resource for the retrieval of gene expression data from any organism or artificial source.

A public repository for microarray based gene expression data at the European Bioinformatics Institute. Currently the EBI is establishing a pilot database containing the microarray gene expression data that are available publicly. An Expression Profiler set of tools is in development to facilitate the analysis and clustering of gene expression and sequence data which may help in the discovery of sequence pattern profiles in the regulatory regions of co-expressed genes.

ExpressDB: a relational database of yeast RNA expression data.
As of July 1999, ExpressDB contains 17.5 million pieces of information loaded from 11 yeast gene expression studies. To assist with interpretation, extracts of current Saccharomyces Genome Database (SGD) gene name and description data are linked with their corresponding ORFs. ExpressDB also contains 207 functional groupings of yeast ORFs derived from the MIPS database.

The primary objective of KEGG is to computerize the current knowledge of molecular interactions; namely, metabolic pathways, regulatory pathways, and molecular assemblies. KEGG maintains gene catalogs for all the organisms that have been sequenced and links each known protein to a component on the pathway. KEGG also organizes a database of all chemical compounds in living cells and links each compound to one or more pathways.

A classification of all of the proteins in the PIR database into families and superfamilies based purely on sequence homology.

An example of comparative genomics.

Human Genome

Compare your sequence to all of the latest contigs of human genomic sequences

Ensembl provides automatic annotation of human genome data. Ensembl takes raw DNA sequence contigs from the public Human Genome Project and runs a number of computer programs to annotate genes, transcripts (ESTs), introns and exons, mapped STSs, etc. The results are stored in a relational database and are accessible via a Web-based Genome Browser.

Private Human Genome database with limited free access.

The Genome Database (GDB) stores and curates human genomic mapping data submitted by researchers worldwide and provides this information electronically to the scientific community.

The Genome Sequence DataBase is dedicated to supporting scientific research and development by creating, maintaining and distributing a complete, timely, accurate and useful collection of DNA sequences and related information. The core sequence data at GSDB are incorporated within GenBank, but the GSDB is an on-line, client-server, relational database enabling complex SQL queries and much additional annotation.

Protein Pattern and Structural Analysis

PROSITE is Dr. Amos Bairoch's meticulously annotated database of biologically significant protein sites, patterns and profiles that help to identify to which known family of protein (if any) a new sequence belongs. This server allows only text searches of the database.
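
To give a feel for what a PROSITE pattern encodes, here is a minimal sketch that translates the pattern notation into a Python regular expression and scans a sequence with it. It assumes the standard pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name and the test sequence are hypothetical, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")            # patterns end with a period
    regex = regex.replace("-", "")         # '-' only separates positions
    # Excluded residues: {P} becomes [^P]. Do this before touching '('.
    regex = regex.replace("{", "[^").replace("}", "]")
    # Repeat counts: x(2,4) becomes x{2,4}.
    regex = regex.replace("(", "{").replace(")", "}")
    # 'x' is any residue; '<' and '>' anchor the sequence ends.
    regex = regex.replace("x", ".").replace("<", "^").replace(">", "$")
    return regex


# The N-glycosylation site pattern, used here as a familiar example.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPSL"  # made-up sequence for illustration
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, so a real scanner would need extra care where sites can overlap.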

Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The BLOCKS database is created automatically by taking the most highly conserved regions from groups of proteins in the PROSITE database and using them to search the SWISS-PROT database. These sequences are then aligned to form the BLOCKS database. An online search tool is available to compare user entered sequences against the database and also for text-based searches.

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family. Usually the motifs do not overlap, but are separated along a sequence. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs. The database thus provides a useful adjunct to PROSITE. This server provides both sequence similarity and text-based database searches. An interesting interactive multiple sequence alignment editor (known as CINEMA) is also available.

The ProDom protein domain database consists of an automatic compilation of 9600 homologous domains detected in the SWISS-PROT database by the DOMAINER algorithm (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci. 3:482-492). The server provides sequence similarity searches of user sequences against the consensus sequences of the domains. Beautiful graphical representations are available of multiple alignments of all SWISS-PROT sequences that contain each domain.

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering most common protein domains. You can search your favorite sequence against Pfam; access, view, and download individual alignments from Pfam; or download HMMs that you can use locally if you have installed the HMMER hidden Markov model software. Output pages are hyperlinked to other relevant databases, including SWISS-PROT, GenBank, PDB, PROSITE, and MEDLINE.

MEME is a tool for the discovery of conserved regions in groups of unaligned (but related) DNA or protein sequences. It is one of the best tools to identify potential common transcription factor binding sites in clusters of co-expressed sequences found in microarray experiments.

Pretty Printing and Shading of Multiple-Alignment files. A Web implementation of the same BOXSHADE program running on the RCR Alpha.

ProMod is a Protein Modeling tool which requires similarities with experimentally determined protein structures. ProMod is based on knowledge-based protein modeling methods. The structure database used by Swiss-Model is derived from the Brookhaven Protein Data Bank (PDB).

Provides multiple sequence alignments and predictions of secondary structure, residue solvent accessibility and the location of transmembrane helices

Protein secondary structure prediction from single sequence or a set of sequences

Pratt is a tool that allows the user to search for patterns conserved in sets of unaligned protein sequences. The user can specify what kind of patterns should be searched for, and how many sequences should match a pattern to be reported.

Protein motif and structural prediction tools. Offers several tools linked to the PDB including MOOSE, Protein Kinase Database, and PDB Toolbox.

Offers web-based BLAST searching of protein domains and cross-references to the other major protein databases.

The UCLA-DOE Protein Fold-Recognition server is a new project aimed to help in the computational analysis and prediction of structure from amino acid sequences. It is a comprehensive package providing users with computation time, storage and collection of data, and organization of the results for easy analysis.

DNA Pattern Analysis

The Restriction Enzyme Database is a collection of information about restriction enzymes, methylases, the microorganisms from which they have been isolated, recognition sequences, cleavage sites, methylation specificity, the commercial availability of the enzymes, and references - both published and unpublished observations.

Paste in your DNA sequence, choose your enzymes and get an instant restriction map.
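
The core of such a restriction map is a search for each enzyme's recognition sequence. The sketch below shows that step only, and is not the method used by the site above. The function name and the DNA sequence are made up. The three enzymes and their recognition sequences are the familiar EcoRI, BamHI, and HindIII sites. Positions are 1-based starts of the recognition sequence, not the exact cleavage positions.

```python
import re

# Recognition sequences for three common enzymes (all palindromic,
# so scanning one strand is enough).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(sequence, enzymes):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    sequence = sequence.upper()
    sites = {}
    for name, site in enzymes.items():
        # A lookahead lets overlapping occurrences be reported as well.
        matches = re.finditer("(?=" + re.escape(site) + ")", sequence)
        sites[name] = [m.start() + 1 for m in matches]
    return sites


# Made-up sequence for illustration.
dna = "ttgaattcaaggatccttaagcttgaattc"
site_map = find_sites(dna, ENZYMES)
print(site_map)
```

Sorting all positions across enzymes gives the fragment boundaries for a simple linear map.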

This is the home of the TRANSFAC database (implemented as the local file TFDATA at RCR). It compiles data about gene regulatory DNA sequences and protein factors binding to them. This web site provides on-line programs that help to identify putative promoter or enhancer structures within your DNA sequences and to suggest their features. This site also provides a huge list of WWW links to sources of useful biology (and other) information on the Web.

The Eukaryotic Promoter Database is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally.

SIGNAL SCAN finds homologies of published signal sequences to your sequence; most of these are transcriptional elements.

Other sites providing interesting Bioinformatics Resources

An EXCELLENT online reference to Bioinformatics. This is an online version of a chapter from a new book to be published by Cold Spring Harbor Press called GENOME ANALYSIS: A LABORATORY MANUAL

An extensive listing of bioinformatics teaching resources available on the Web. Maintained by Georg Fullen

at Heidelberg, Germany

at European Bioinformatics Institute, Cambridge, UK

The ExPASy WWW server is dedicated to molecular biology with an emphasis on data relevant to proteins. It allows you to browse through a number of databases produced in Geneva, such as SWISS-PROT, PROSITE, SWISS-2DPAGE, SWISS-3DIMAGE and SeqAnalRef. It also allows access to various sequence analysis tools.

The BCM Search Launcher is an ongoing project to organize molecular biology-related search and analysis services available on the WWW by function, providing a single point of entry for related searches. WWW servers are grouped into the following categories: protein sequence/pattern searches, nucleic acid sequence searches, multiple sequence alignments, pairwise sequence alignments, gene features (motifs), sequence utilities, and protein secondary structure prediction.

They provide a web version of the GCG documentation (the same text found in GenHelp), and a list of other useful web sites.

The Dictionary of Cell Biology was first published in 1989, and has since been translated into several languages. It is intended to provide quick access to easily-understood and cross-referenced definitions of terms frequently encountered in reading the modern biology literature. This server contains the text of the Second edition, published in April 1995, together with enhancements, hypertext links and new entries which are destined for the third edition.

GCG is the home of the Wisconsin Sequence Analysis Package, the most comprehensive suite of DNA and protein sequence analysis tools available, and the core software offered by the RCR. The GCG web site offers the company newsletter, advertisements for GCG products, and some links to other biocomputing sites that offer useful information such as online documentation and tutorials for the GCG software.

This guide was written by Cary O'Donnell of the AFRC Computing Division, Harpenden Herts, AL5 2JE UK. It is widely considered to be the best available tutorial for GCG.

Provides a number of interesting services including: The Arabidopsis, Rice, Corn, Pine, and Brassica napus cDNA Sequence Analysis Projects, The Virtual Genome Center with information about Candida albicans molecular biology, Neuroscience Database Program, and a web-based Recombinant DNA Technology Course.

Other Lists of Bioinformatics Resources

maintained by Richard M.K. Yu at the City University of Hong Kong.

(a mirror of Pedro's list is also available at the University of Dusseldorf, Germany)
Pedro has collected an awesome set of WWW links that offer everything from on-line reading of your favorite journals to Web-based multiple sequence alignment tools. Virtually every biologist's work can benefit from the resources listed on this page.

the Molecular Biology server at the University of Geneva, Switzerland.

list of DNA and Protein analysis links.

references and links

A complete listing of DNA sequence manipulation tools available on the WWW

Created by Emmanuel Skoufos, Ph.D., Yale University School of Medicine. This page organizes links to existing search engines in a coherent, stepwise fashion.

A commercial site (funded by vendors whose products are featured) that contains many useful links and embedded mini-search engines for specific databases.

Maintained by Keith Robison at Harvard University. A huge, well organized list of "biosciences" resources available on the Internet.

An excellent and no-nonsense collection of links with an emphasis on protein databases.

Compiled by New England Biolabs, Inc. Here are a series of pointers to Internet resources of biological interest. These are tools used by our scientists on a daily basis. We hope they can be of help in your work.

Yahoo is the ultimate, comprehensive, hierarchically organized list of Internet resources. If you can't find it anywhere else, then look here.

