Table Browser User's Guide
 

Table of Contents:

The UCSC Table Browser User's Guide was last updated on 23 October 2003. Questions and feedback are welcome.


  Introduction
  The Table Browser provides a query interface to the Genome Browser database tables. Because the Table Browser uses the same database as the Genome Browser, the two views are always consistent. (Chromosomal coordinates usually change with each assembly because of newly filled gaps or assembly procedure refinements.)

What can the Table Browser do?

The Table Browser can retrieve DNA sequence data or annotation data, for Genome Browser tracks or for intersections/unions of tracks, over a given position range, over a given set of accessions, or over the whole genome. It can display some basic statistics calculated over selected data. It can generate a custom track and automatically add it to the user's session, so that it can be graphically displayed in the Genome Browser. It allows combined queries on two tables, or on a user custom track and a table, so compound queries are possible.

One goal of the Table Browser is to provide an extremely flexible query interface. The user can specify constraints on the values of fields, to narrow down the set of results. The output format is also highly configurable.

Some things that the Table Browser cannot do (yet) include joining together relational tables in a unified interface. The Table Browser is not as powerful as running MySQL on a local installation of the database itself.

About the Genome Browser databases and tables

The Genome Browser Database actually consists of several separate databases, one for each genome assembly.

The Table Browser divides the tables in a database into two categories: positional and non-positional. Some output formats and query options are only applicable to positional tables, thus the distinction.

  • Positional tables contain data associated with specific locations in the genome. mRNA alignments, gene predictions, cross-species alignments, and other annotations are stored in positional tables. Each Genome Browser track is based on a positional table (but may incorporate information from non-positional tables).
  • Non-positional tables contain data not tied to genomic locations. For example, a table that relates a Genethon marker name to a Marshfield marker name is non-positional. Many tables relate internal numeric mRNA IDs to information such as author, tissue, keyword, etc. Some "meta" tables contain information about the structure of the database itself or describe external files containing sequence data.

Positional tables can be further subdivided into several categories based on the type of data they describe. Some tables describe alignments, so they describe a block structure for each element. Other tables do not describe a block structure (only a start and an end). Some tables specify a translation start & end in addition to a transcription start & end. Some tables specify the strand, some don't. Most, but not all, give a name for each element. Based on what a table describes, different query and output formatting options may be offered.

Descriptions of the database tables are available here.


  Table Browser Forms
  The following sections describe the Table Browser pages with the given titles.

Choose Organism & Assembly

This page prompts the user to select the organism (e.g. "Human") and assembly (e.g. "Nov. 2002"). This is the gateway page for the Table Browser; it will appear when the user clicks on the "Tables" link from the main page of the genome server, or whenever the Table Browser is invoked without specifying which genome assembly to use.

Choose a Table

This page is the entry point for the Table Browser when the genome assembly has already been specified (for example, when the user follows the "Tables" link from the Genome Browser). There are four categories of tables:
  • Browser tracks: Tables that correspond to Genome Browser tracks.
  • Custom tracks: User custom tracks in the current session.
  • Positional tables: All positional tables, whether or not they correspond to Genome Browser tracks, both from the current assembly and from the hgFixed shared database.
  • Non-positional tables: Tables without genomic coordinates, both from the current assembly and from the hgFixed shared database.
In web browsers that support Javascript, whenever a selection is made from one of the above menus, the other three menus are reset to avoid confusion. In older web browsers without Javascript, it is possible to select tables from multiple menus. In that case, the Table Browser will use the leftmost selected table. Other selected tables will be ignored.

When a browser track, custom track or positional table is selected, items can be selected from the table by position or by item name (which is usually an accession for gene/mRNA/EST tracks). The default position is "genome", which selects all items in the table. Any position or search term accepted by the Genome Browser will also be accepted by the Table Browser. If you have chosen to select items by name/accession, then you must paste or upload a list of names/accessions before retrieving data.

When a non-positional table is selected, the position and any specified names/accessions are ignored. All items in the table will be returned by default. To restrict non-positional table items by name or accession, paste values into the filter input for the appropriate field on the Advanced Query page.

The "Get all fields" button simply retrieves all elements from the table (at the specified position, if the table is positional), and prints them out as tab-separated plain text.

The "Get sequence" button works only when a browser track, custom track, or positional table has been selected. It is for retrieving FASTA sequence data for selected items.

The "Advanced Query" button leads to a richer choice of output formats and filtering options, described below.

The "Describe table" button displays available information about the selected table.

Paste in Names/Accessions

Here the user can enter in a list of item names (usually accessions for gene/mRNA/EST tracks), which may contain wildcard characters ("?" to match exactly one character, "*" to match 0 or more characters).

Note: name/accession matching is not supported for non-positional tables.

Upload a File of Names/Accessions

On this page, the user can upload a file containing item names (usually accessions for gene/mRNA/EST tracks). Wildcards are not allowed; the names must have exact matches in the table, or they will be ignored.

Note: name/accession matching is not supported for non-positional tables.

Describe Table

This page shows descriptive information about the currently selected table and its fields. Buttons at the top and bottom of the page offer the same choice of actions as the Choose a Table page.
  • AutoSql definition (if available): Many browser database tables are generated and processed using SQL and C code generated by the tool AutoSql. The AutoSql input format includes the name, data type and descriptive comment for each field in a table and for the table itself. It is often the best available documentation for a table.
  • Link to table description (if available): The Genome Browser Database document is another source of information about tables. Most database tables are described in corresponding sections of the document. Many sections contain just the autoSql definition, but some sections contain more information about the format and other related tables.
  • Field SQL types: Just in case the autoSql definition is outdated (this can happen - please email the genome mailing list if you encounter this, thanks!) or not available, we display the SQL field names and types for the table as it truly exists in the current database. For convenience, there are buttons which lead to histograms showing the frequency of occurrences of values of text fields within the current position range. (The Summary/Statistics output type, available via Advanced Query, provides statistics about integer values of position and exon/block fields.)
  • Row count and first few rows: The total number of rows in the table and the contents of the first few rows give an idea of the data size and example values. No filtering or batch-querying is reflected in the count, and the example rows will probably not be from the current position (unless it is "genome"). They are shown only as a descriptive example.
  • Browser Track information (if available): If the table is the main (positional) table for a Genome Browser track, then the data type, track group name and HTML track description are shown.

Advanced Query

This page offers choices for output format and filtering of results in several sections: For convenience, the "Get results" button (for non-intersected results) appears in both the output format section and the filtering section. It is the same button (same action), offered in two places to save the user some scrolling.

Intersect Results

At this point the user has specified two positional tables, as well as filtering constraints (if any) on the primary table. The output format has also been selected -- for intersection of results, the output format must be FASTA, BED/Custom Track, Hyperlinks, GTF, or Summary/Statistics. Now, the user may specify filtering constraints on the secondary table, and must choose a mode for combining the results from each table:

BED/Custom Track Options

When the user has selected the BED/Custom Track format, the Table Browser prompts for a few additional parameters:
  • Include custom track header:
    If this box is checked, then a custom track "track" header line will be generated with the name, description, visibility, and url fields. Defaults are supplied, so the user does not have to enter any values.
  • If the user has chosen to include a custom track header in the output, then the user can also check Load this custom track into my session so that the results of the query will be locally stored (until 4 hours after last usage). To view the custom track in the Genome Browser, click the "Get BED" button and view (or save) the resulting BED data; then click your web browser's "Back" button, then click the "Genome Browser" link in the blue bar at the top of the page.
  • Create one BED record per:
    Here the user has the option to retrieve BED just for certain components of each record, or from upstream/downstream of each record. The options are different for each type of table.

FASTA (DNA sequence) Options

This form prompts the user for additional parameters of FASTA (DNA sequence) output:
  • Sequence Retrieval Region Options:
    If the table is a gene prediction or alignment table, i.e. specifying exon, translation or block structure, then the user can select which subregions of table records are used to fetch DNA sequence. The user also has the choice to concatenate the chunks of DNA sequence for each selected subregion into one sequence, or to retrieve a separate sequence for each selected subregion. (Concatenating is useful when all regions are selected, or when exons are selected and introns are unselected to see spliced mRNA sequence.)
  • Sequence Formatting Options:
    The user can select upper or lower case (for some subregions if the table specifies structures), and can mask locations of repetitive sequences (RepeatMasker) if desired.

Choose Fields

This form prompts the user to select which fields of the table to include in tab-separated output when the user has selected the "Tab Separated, Choose Fields..." output format.

  Output Data Formats
  The Table Browser returns query results in one of the following formats. Note: When querying non-positional tables, only the Tab Separated and Summary/Statistics formats are available.

BED/Custom Track

BED stands for Browser Extensible Data. It is a plain-text data format, and is the internal representation used for all user custom tracks and for many Genome Browser tracks. The BED format is described on the Data File Formats page.

Selecting BED as the output data format allows the user to retain their query results as user custom tracks, which can then be viewed in the Genome Browser or queried again in the Table Browser to form compound queries.

FASTA (DNA Sequence)

FASTA is a very simple plain text format for displaying nucleotide or protein sequence. (The Table Browser only outputs DNA sequence, not protein sequence, at this time.) For each record in FASTA output, there is one header line which begins with ">" and contains a description of the record, followed by one or more lines whose letters represent the sequence.

Below is an example FASTA record returned by the Table Browser, with an explanation of the header description fields. (The header line is artificially broken into two lines for better display/printing of this web page.)


>hg12_acembly_glutee_0 range=chr22:15586756-15586867 5'pad=0 3'pad=0 \
 revComp=TRUE strand=- repeatMasking=none
GGCACTGACAGGACTTGACACATTCTAATCTCCTGAGGGCCGCTAACAAC
ATAGATGACAGTCTGGACAACACAAAGGGTCAACAGGTGCTCCGAGTGTC
CGGCAGGGCTAA
  • hg12_acembly_glutee_0: The database is hg12 (Human June 2002), the table is acembly, the feature name is glutee, and the user has selected to return a separate FASTA record for each exon/intron on the FASTA/DNA Sequence options form and this is the first (_0) record (the first exon).
  • range=chr22:15586756-15586867: The feature extends from 15586756 to 15586867 on chr22. Note: the user may choose to generate a single FASTA record for all exons, excluding introns. In that case, the record may be shorter than the length implied by the range.
  • 5'pad=0 3'pad=0: The user has chosen to include no extra bases at the 5' and 3' ends of each feature.
  • revComp=TRUE strand=-: This feature is on the '-' strand, so the sequence has been reverse complemented relative to the genomic sequence at the given range.
  • repeatMasking=none: The user has chosen not to mask out (i.e. force to 'N' or lower-case) the bases that have been identified as repeats by RepeatMasker.

GTF

GTF stands for Gene Transfer Format. It is also known as GFF version 2. The spec can be found here.

Hyperlinks to Genome Browser

For convenience, the Table Browser can generate links to the Genome Browser at the location of each feature identified by a query on a positional table.

Tab Separated

This plain text output format simply contains the contents of a table, one table record per line, with fields (columns) separated by tab characters. The first line of output is a comment, starting with "#" and displaying the field names separated by tab characters. This format is easy to load into spreadsheets or process with scripts, if the user wishes to perform additional analysis on a table.

Summary/Statistics

This format summarizes the user's query (including any filtering constraints or comparison of tables. For positional tables, some statistics about the features are computed (count, min/max/average/stdev of length, etc.) and displayed. For all tables, the fields of the table and their SQL database types are displayed. For text fields, the user may click to obtain a frequency histogram of a field's values in the query results.

  Filtering Results by Constraining Field Values
  By default, the table browser retrieves all records for the specified coordinate range (use "genome" to search all chromosomes) or set of item names/accessions. To restrict the set of records retrieved from this range, you may enter constraints on the values of fields. The individual field constraints are combined with AND. In other words, a record must meet all individual field constraints in order to be retrieved. The default values will match all records, so the only constraints that apply are the ones that you edit.

Constraints on strings

Text fields are compared to words or patterns containing wildcard characters. Wildcards are "*" (to match 0 or more characters) and "?" (to match a single character). Each space-separated word/pattern in a text field box will be matched against the value of that field in each record. If any word/pattern matches the value, then the record meets the constraint on that field.

Constraints on numbers

Numeric fields are compared to table data using an operator such as <, >, != (not equal) followed by a number. To specify a range, enter two numbers (start and end) separated by whitespace and/or a comma (",").

Free-Form Query

When the filters above aren't enough, free-form queries allow fancier constraints that typically relate two or more field names of the selected table. If you are familiar with SQL syntax, queries here are just "where" clauses (with wildcards as above). Queries can combine simple constraints with AND, OR, and NOT using parentheses as needed for clarity.

A simple constraint consists of a field name listed above, a comparison operator (see below), and a value: a number, string, wildcard value (see below), or another field name. In place of a field name, you may use an arithmetic expression of numeric field names.

String or wildcard values for text comparisons must be quoted. Single or double quotes may be used. If comparing to a literal string value, you may use the "=" (or "!=") operator. If comparing to a wildcard value, you must use the "LIKE" (or "NOT LIKE") operator.

Numeric comparison operators include <, <=, =, != (not equal), >=, and > . Arithmetic operators include +, -, *, and /. Other SQL comparison keywords may be used.

Free-form query examples (taken from the refGene table):

  • txStart = cdsStart searches for gene models missing expected 5' UTR upstream sequence (if strand is '+'; 3' UTR downstream if strand is '-').
  • chrom NOT LIKE "chr??" filters search to chromosomes 1-9 plus X and Y.
  • (cdsEnd - cdsStart) > 10000 selects genes whose coding regions span over 10 kbp.
  • (txStart != cdsStart) AND (txEnd != cdsEnd) AND exonCount = 1 finds single exon genes with both 3' and 5' flanking UTR.
  • ((cdsEnd - cdsStart) > 30000) AND (exonCount=2 OR exonCount=3) finds genes with long spans but only 2-3 exons.


  Filtering Results by Comparing Two Tables
  Sometimes it is interesting to see where the features of two positional tables do or do not cover the same positions in the genome. The Table Browser offers several kinds of comparisons of tables by feature positions.

Comparisons that preserve primary table structure

Some comparisons preserve the structure of the primary table. In other words, if the primary table describes exon structure and is filtered by one of these comparisons to any other table, then the results will describe exon structure (unless the user chooses an output format in which the structure is lost). Primary table features are kept or discarded based on the amount of positional overlap with secondary table features.
  • Any overlap: A primary table record will appear in the output if any of its base positions are also covered by any feature in the secondary table.
  • No overlap: A primary table record will appear in the output only if none of its base positions are also covered by any feature in the secondary table.
  • Overlap over a specified threshold: A primary table record will appear in the output if the ratio of the number of its base positions covered by secondary table features to the total number of its base positions is greater than the user-specified threshold.
  • Overlap under a specified threshold: A primary table record will appear in the output if the ratio of the number of its base positions covered by secondary table features to the total number of its base positions is less than the user-specified threshold.
Note: If the primary table has exon/block structure, then only base positions in exons/blocks will be counted.

Base-by-base comparisons

The positions covered by features of the primary and secondary tables can be compared one base position at a time, or base-by-base (as opposed to feature-by-feature like above). If the user selects one of the following comparisons, then the structure of the primary table will not be preserved. In other words, even if the primary table describes exon structure, the results will not describe exon structure. Only position ranges will be returned, with no information about exon/block structure, strand, or translation region.
  • Base-by-base intersection (AND): If a nucleotide position is covered by at least one feature of both the primary table and the secondary table, then it will be included in the output.
  • Base-by-base union (OR): If a nucleotide position is covered by at least one feature of either the primary table or the secondary table, then it will be included in the output.
Note: If the primary table has exon/block structure, then only base positions in exons/blocks will be counted.

Base-by-base complement (NOT): Before either of the above comparisons are applied, the set of positions covered by a table can be inverted. In other words, if this is selected, then any position covered by the table's features will be considered not covered, and vice versa. This gives the user more flexibility in comparing table positions.


  Retrieving Subregions of Features
 

Subregions of gene predictions

A gene prediction table specifies both exon structure and translated region. The choices of granularity are
  • Whole Gene: the block structure and translated region will be preserved in the BED data.
  • Upstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 5' end of the gene to N bases away. If the strand is '+', the 5' end is the low coordinate ("start"). If the strand is '-', the 5' end is the high coordinate ("end"). Block structure and translated region info will be lost.
  • Exons plus N bases at each end: the BED data will contain one simple record per exon, subtracting N from the start and adding N to the end. (N can be negative to make the region smaller instead of larger.)
  • Introns plus N bases at each end: the BED data will contain one simple record per intron, subtracting N from the start and adding N to the end. (N can be negative to make the region smaller instead of larger.)
  • 5' UTR Exons: the BED data will contain one simple record per 5' UTR exon. Depending on the value of the strand, 5' may start at the low coordinate "start" and proceed up (+), or it may start at the high coordinate "end" and proceed down (-).
  • Coding Exons: the BED data will contain one simple record per coding exon. (If the translated region boundaries fall in the middle of exons, then the exons will be split at those boundaries.)
  • 3' UTR Exons: the BED data will contain one simple record per 3' UTR exon. Depending on the value of the strand, 3' may start at the high coordinate "end" and proceed down (+), or it may start at the low coordinate "start" and proceed up (-);
  • Downstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 3' end of the gene to N bases away. If the strand is '+', the 3' end is the high coordinate ("end"). If the strand is '-', the 3' end is the low coordinate ("start"). Block structure and translated region info will be lost.
Note: if a feature is close to the beginning or end of a chromosome and upstream/downstream bases are added, they may be truncated in order to avoid extending past the edge of the chromosome.

Subregions of alignments

An alignment table specifies block structure, but not translated region.
  • Whole Alignment: the block structure will be preserved in the BED data.
  • Upstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 5' end of the alignment to N bases away. If the strand is '+', the 5' end is the low coordinate ("start"). If the strand is '-', the 5' end is the high coordinate ("end").
  • Blocks plus N bases at each end: the BED data will contain one simple record per block, subtracting N from the start and adding N to the end. (N can be negative to make the region smaller instead of larger.)
  • Regions between blocks plus N bases at each end: the BED data will contain one simple record per intron, subtracting N from the start and adding N to the end. (N can be negative to make the region smaller instead of larger.)
  • Downstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 3' end of the gene to N bases away. If the strand is '+', the 3' end is the high coordinate ("end"). If the strand is '-', the 3' end is the low coordinate ("start"). Block structure and translated region info will be lost.
Note: if a feature is close to the beginning or end of a chromosome and upstream/downstream bases are added, they may be truncated in order to avoid extending past the edge of the chromosome.

Subregions of "thick/thin" records

A thick/thin table specifies a "thick" region in the middle of each feature (surrounded by "thin" regions, as displayed in the Genome Browser). The "thick" region usually refers to translated region (but note that this type of table does not specify block structure).
  • Whole Gene: the "thick" (translated) region information will be preserved in the BED data.
  • Upstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 5' end of the record to N bases away. If the strand is '+', the 5' end is the low coordinate ("start"). If the strand is '-', the 5' end is the high coordinate ("end").
  • 5' UTR: the BED data will contain one simple record per 5' UTR. Depending on the value of the strand, 5' may start at the low coordinate "start" and proceed up to the "thick" region (+), or it may start at the high coordinate "end" and proceed down (-) to the "thick" region.
  • CDS: the BED data will contain one simple record (the "thick" region).
  • 3' UTR: the BED data will contain one simple record per 3' UTR. Depending on the value of the strand, 3' may start at the high coordinate "end" and proceed up to the "thick" region (+), or it may start at the low coordinate "start" and proceed down (-) to the "thick" region.
  • Downstream by N bases:
the BED data will contain one simple (chrom/start/end/name) record per feature, from the 3' end of the record to N bases away. If the strand is '+', the 3' end is the high coordinate ("end"). If the strand is '-', the 3' end is the low coordinate ("start"). Note: if a feature is close to the beginning or end of a chromosome and upstream/downstream bases are added, they may be truncated in order to avoid extending past the edge of the chromosome.

Subregions of simple records

A simple table specifies on the the start and end position (no block structure or thick/thin region). However, upstream and downstream regions are available. If a strand is specified, then the meaning of upstream and downstream depend on the strand.
  • Whole Gene: simple records (with strand and score, if the table specifies strand and score) are output.
  • Upstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 5' end of the record to N bases away. If the strand is '+', the 5' end is the low coordinate ("start"). If the strand is '-', the 5' end is the high coordinate ("end").
  • Downstream by N bases: the BED data will contain one simple (chrom/start/end/name) record per feature, from the 3' end of the record to N bases away. If the strand is '+', the 3' end is the high coordinate ("end"). If the strand is '-', the 3' end is the low coordinate ("start").
Note: if a feature is close to the beginning or end of a chromosome and upstream/downstream bases are added, they may be truncated in order to avoid extending past the edge of the chromosome.

  Example Uses of the Table Browser
  All of these examples assume that the user has already selected a genome assembly (for example, Human Nov. 2002).

Get DNA sequence for RefSeq genes

  • If there is a particular gene of interest, enter the gene's NM_?????? accession number and click on the Look up button.
    Otherwise, enter a position range, e.g. chr22:15500000-15750000
  • Select refGene from the "Positional tables" menu.
    Click on the Advanced query... button.
  • Select FASTA (DNA Sequence) from the output format menu.
    Click on the Get results button.
  • Review the subregion and formatting options.
    Click on the Get sequence button.
FASTA for all RefSeq genes in the position range will be displayed.

Get locations of all RefSeq genes that start with HOX

The refGene table contains the location of RefSeq genes, indexed by accession number (NM_??????). The refFlat table contains the same information, but indexed by gene name.
  • To find all RefSeq genes that start with HOX in the whole genome, change the position to genome.
  • Select refFlat from the "Positional tables" menu.
    Click on the Advanced query... button.
  • Select Hyperlinks to the from the output format menu.
    In the "Filter refFlat records" section, enter HOX* in the box at the end of the line for the geneName field ('geneName [does] match ...').
    Click on the Get results button.
A page of links into the Genome Browser at the location of each HOX* gene will appear.

Get human alignments to mouse at ESPN

Note: This example assumes the existence of the blastzBestMouse track in the current genome assembly/database, and the existence of the ESPN gene in the current assembly's RefSeq Genes (refGene/refFlat) track. Before starting this example, make sure the current database is set to Human, June 2002 or later.
  • Set position to ESPN and click the Look up button.
  • Select chrN_blastzBestMouse from the "Positional tables" menu.
    Click on the Get all fields button.
The contents of the chr1_blastzBestMouse table that fall at least partly in the selected position range will be printed out as tab-separated text. The format used by the Genome Browser for tables containing alignments (such as chrN_blastzBestMouse) is PSL.

Get locations of repeats identified on exons of chr21 RefSeq genes

  • Set position to chr21 and click the Look up button.
  • Select chrN_rmsk from the "Positional tables" menu.
    Click on the Advanced query... button.
  • Select BED/Custom Track from the output format menu.
    Scroll down to the "(Optional) Intersect Results" section.
    Select refGene from the "Positional tables" menu.
    Click the Intersect Results... button.
  • Select All chrN_rmsk records that have any overlap with refGene as the way to combine the tables.
    Scroll down to the bottom of the page and click on the BED/Custom Track button.
  • Experiment with the BED options form by clicking the Get BED button, looking at the results, then using your web browser's Back button to return to the form and try other settings.
If you choose the Load this custom track into my session option when experimenting with BED/Custom Track options, then the results will be kept temporarily (until 8 hours after last usage) on the server. To view your custom track, click your web browser's Back button, then click the "Genome Browser" link on the blue bar near the top of the page. Click the Submit button, and your Table Browser generated custom track will appear at or near the top of the browser tracks image.

Download the contents of a table for local processing

  • Enter the position range (or gene name, accession, etc.) of interest and click on the Look up button.
  • Select a table from either the "Positional tables" menu or the "Non-positional tables" menu.
    Click the Get all fields button.
Please don't set the position to "genome" very often with this method unless you are also filtering the results with constraints on field values (or comparison to another table). The amount of data could be very large, could take a long time, and could unnecessarily and unfairly load the server. Instead, if you need to download the entire contents of tables, see the downloads page.