Description
The UCSC Genes track is a set of gene predictions based on data from
RefSeq, GenBank, CCDS, UniProt, Rfam, and the tRNA Genes track. The
track includes both protein-coding genes and non-coding RNA genes.
Both types of genes can produce non-coding transcripts, but non-coding
RNA genes do not produce protein-coding transcripts. This is a
moderately conservative set of predictions. Transcripts of
protein-coding genes require the support of one RefSeq RNA, or one
GenBank RNA sequence plus at least one additional line of evidence.
Transcripts of non-coding RNA genes require the support of one Rfam or
tRNA prediction. Compared to RefSeq, this gene set generally has
about 10% more protein-coding genes, approximately four times as many
putative non-coding genes, and about twice as many splice
variants.
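The support requirements described above can be written as a small predicate. This is only an illustrative sketch, not the pipeline's code: the function name, the argument names, and the idea of passing evidence as simple counts are all assumptions made for the example.

```python
def has_sufficient_support(gene_type, refseq_rnas=0, genbank_rnas=0,
                           other_evidence=0, rfam_or_trna=0):
    """Return True if a transcript meets the support rule stated in the text.

    All parameters are hypothetical counts of supporting items:
      refseq_rnas    -- RefSeq RNAs supporting the transcript
      genbank_rnas   -- GenBank RNA sequences supporting the transcript
      other_evidence -- additional lines of evidence beyond the RNA
      rfam_or_trna   -- Rfam or tRNA predictions supporting the transcript
    """
    if gene_type == "coding":
        # One RefSeq RNA alone is enough; a GenBank RNA needs extra evidence.
        return refseq_rnas >= 1 or (genbank_rnas >= 1 and other_evidence >= 1)
    if gene_type == "noncoding":
        # Non-coding RNA genes need one Rfam or tRNA prediction.
        return rfam_or_trna >= 1
    raise ValueError(f"unknown gene type: {gene_type!r}")


# A GenBank RNA with no further evidence does not qualify on its own.
print(has_sufficient_support("coding", genbank_rnas=1))
print(has_sufficient_support("coding", genbank_rnas=1, other_evidence=1))
```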
Display Conventions and Configuration
This track generally follows the display conventions for gene prediction
tracks. The exons for putative non-coding genes and untranslated regions
are represented by relatively thin blocks, while those for coding open
reading frames are thicker. The following color key is used:
- Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
- Dark blue -- transcript has been reviewed or validated by the RefSeq, SwissProt, or CCDS staff
- Medium blue -- other RefSeq transcripts
- Light blue -- non-RefSeq transcripts
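The color key above amounts to a short decision rule. The sketch below is an assumption-laden illustration: the function and its three boolean flags are hypothetical, and the precedence (black checked first, then dark blue, then medium blue) is assumed from the order of the list rather than stated by the track documentation.

```python
def transcript_color(has_pdb_entry, reviewed, is_refseq):
    """Map hypothetical transcript flags to the color names in the key.

    has_pdb_entry -- feature has a corresponding PDB entry
    reviewed      -- reviewed or validated by RefSeq, SwissProt, or CCDS staff
    is_refseq     -- transcript is a RefSeq transcript
    """
    if has_pdb_entry:          # assumed to take precedence over the blues
        return "black"
    if reviewed:
        return "dark blue"
    if is_refseq:              # "other RefSeq transcripts"
        return "medium blue"
    return "light blue"        # non-RefSeq transcripts


print(transcript_color(has_pdb_entry=False, reviewed=False, is_refseq=True))
```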
This track contains an optional codon coloring
feature that allows users to quickly validate and compare gene predictions.
To display codon colors, select the genomic codons option from the
Color track by codons pull-down menu.
Methods
The UCSC Genes are built using a multi-step pipeline:
- RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping
only the best alignments for each RNA and discarding alignments of less
than 98% identity.
- Alignments are broken up at non-intronic gaps, with small isolated
fragments thrown out.
- Alignments are merged in from the hg19 tRNA track.
- Perfect alignments of sequences predicted as human non-coding genes by
Rfam are also merged in, provided they fall in regions that are syntenic
with the mm9 mouse genome. Perfect alignments to regions that are not
syntenic with mm9 are excluded because these regions are enriched for
pseudogenes.
- A splicing graph is created for each set of overlapping alignments. This
graph has an edge for each exon or intron, and a vertex for each splice
site, start, and end. Each RNA that contributes to an edge is kept as
evidence for that edge. Gene models from the Consensus CDS project (CCDS)
are also added to the graph.
- A similar splicing graph is created in the mouse, based on mouse RNA and
ESTs. If the mouse graph has an edge that is orthologous to an edge in the
human graph, the mouse edge is added to the evidence for the human edge.
- If an edge in the splicing graph is supported by two or more human ESTs,
the EST support is added as evidence for the edge.
- If there is an Exoniphy prediction for an exon, that is added as evidence.
- The graph is traversed to generate all unique transcripts. The traversal is
guided by the initial RNAs to avoid a combinatorial explosion in alternative
splicing. All RefSeq transcripts are output. For other multi-exon transcripts
to be output, an edge supported by at least one additional line of evidence
beyond the RNA is required. Single-exon genes require either two RNAs or two
additional lines of evidence beyond the single RNA.
- Protein predictions are generated. For non-RefSeq transcripts we use the
txCdsPredict program to determine if the transcript is protein-coding and if so,
the locations of the start and stop codons.
The program weighs as positive evidence the length of the protein, the
presence of a Kozak consensus sequence at the start codon, and the length of
the orthologous predicted protein in other species.
As negative evidence it considers nonsense-mediated decay and start codons in
any frame upstream of the predicted start codon. For RefSeq transcripts,
the RefSeq protein prediction is used directly instead of this procedure.
For CCDS proteins, the CCDS protein is used directly.
- The corresponding UniProt protein is found, if any.
- The transcript is assigned a permanent "uc" accession. If the
transcript was not in the previous release of UCSC Genes, the
accession ends with the suffix ".1" indicating that this is the first
version of this transcript. If the transcript is identical to some
transcript in the previous release of UCSC Genes, the accession is
re-used with the same version number. If the transcript is not
identical to any transcript in the previous release, but
overlaps a similar transcript with a compatible structure, the
previous accession is re-used with the version number incremented.
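The splicing-graph step in the pipeline above can be illustrated with a minimal sketch. It assumes each alignment is given as a sorted list of (start, end) exon blocks on one chromosome and strand; the function names, the edge tuples, and the sample accessions are hypothetical and do not reflect the actual data structures of the UCSC pipeline.

```python
from collections import defaultdict


def build_splicing_graph(alignments):
    """Build edges for exons and introns, keeping supporting RNAs as evidence.

    alignments: dict mapping an RNA accession to a sorted list of
                (start, end) exon blocks (hypothetical input format).
    Returns a dict mapping (kind, start, end) -> set of RNA accessions,
    where kind is "exon" or "intron".
    """
    edges = defaultdict(set)
    for rna, exons in alignments.items():
        for i, (start, end) in enumerate(exons):
            edges[("exon", start, end)].add(rna)
            if i + 1 < len(exons):
                # The intron runs from this exon's end to the next exon's start.
                edges[("intron", end, exons[i + 1][0])].add(rna)
    return dict(edges)


def splice_vertices(edges):
    """Vertices are the distinct edge endpoints: starts, ends, splice sites."""
    return sorted({pos for (_, start, end) in edges for pos in (start, end)})


# Two RNAs sharing first and last exons; rnaB skips the middle exon.
sample = {
    "rnaA": [(100, 200), (300, 400), (500, 600)],
    "rnaB": [(100, 200), (500, 600)],
}
graph = build_splicing_graph(sample)
for edge in sorted(graph):
    print(edge, sorted(graph[edge]))
```

In this toy input the shared exons collect both RNAs as evidence, while the exon-skipping intron from 200 to 500 is supported only by rnaB. Additional evidence sources such as mouse orthology or ESTs could be added to the same sets under different labels.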
Related Data
The UCSC Genes transcripts are annotated in numerous tables, each of
which is also available as a downloadable file. These include tables
that link UCSC Genes transcripts to external datasets (such as
knownToLocusLink, which maps UCSC Genes transcripts to Entrez
identifiers, previously known as Locus Link identifiers), and tables
that detail some property of UCSC Genes transcript sequences (such as
knownToPfam, which identifies any Pfam domains found in the
UCSC Genes protein-coding transcripts). One can see a full list of
the associated tables in the Table Browser by selecting UCSC Genes in
the track menu; this list is then available in the table
menu. Note that some of these tables refer to UCSC Genes by its
former name of Known Genes, sometimes abbreviated as
known or kg. While the complete set of annotation
tables is too long to describe, some of the more important tables are
described below.
- kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA
sequences (if any) on which each transcript was based.
- knownToRefSeq identifies the RefSeq transcript that each
UCSC Genes transcript is most closely associated with. That RefSeq
transcript is either the RefSeq on which the UCSC Genes transcript was
based, if there is one, or otherwise it's the RefSeq transcript that
the UCSC Genes transcript overlaps at the most bases.
- knownGeneMrna contains the mRNA sequence that represents
each UCSC Genes transcript. If the transcript is based on a RefSeq
transcript, then this table contains the RefSeq transcript, including
any portions that do not align to the genome.
- knownGeneTxMrna contains the mRNA sequence for each UCSC
Genes transcript. In contrast to the sequences in knownGeneMrna,
these sequences are derived by obtaining the sequence of each exon
from the reference genome and concatenating these exonic sequences.
- knownGenePep contains the protein sequences derived from
the knownGeneMrna transcript sequences. Any protein-level
annotations, such as the contents of the knownToPfam table, are based
on these sequences.
- knownGeneTxPep contains the protein translation (if any)
of each mRNA sequence in knownGeneTxMrna.
- knownIsoforms maps each transcript to a cluster
ID, which identifies a cluster of isoforms of the same gene.
- knownCanonical identifies the canonical isoform of each
cluster ID, or gene. Generally, this is the longest isoform.
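The way knownGeneTxMrna sequences are described above (exon sequences taken from the reference genome and concatenated) can be sketched in a few lines. This is an illustration only: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, and reverse-complementing minus-strand transcripts is an assumption about orientation rather than something stated in the table description.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_from_exons(chrom_seq, exons, strand="+"):
    """Concatenate exon sequences cut from a chromosome sequence.

    chrom_seq -- chromosome sequence as a string
    exons     -- list of (start, end) pairs, 0-based half-open (assumed)
    strand    -- "+" or "-"; minus-strand output is reverse-complemented
                 (an assumption made for this sketch)
    """
    spliced = "".join(chrom_seq[start:end] for start, end in sorted(exons))
    return reverse_complement(spliced) if strand == "-" else spliced


# Toy chromosome with two exons separated by a three-base intron.
chrom = "AAACCCGGGTTTACGT"
print(transcript_from_exons(chrom, [(0, 3), (6, 9)]))
print(transcript_from_exons(chrom, [(0, 3), (6, 9)], strand="-"))
```

A sequence built this way always matches the reference genome, whereas a knownGeneMrna entry based on a RefSeq transcript may include portions that do not align to the genome.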
Credits
The UCSC Genes track was produced at UCSC using a computational pipeline
developed by Jim Kent, Chuck Sugnet, Melissa Cline and Mark Diekhans.
It is based on data from NCBI RefSeq, UniProt (including TrEMBL and
TrEMBL-NEW), CCDS, and GenBank, as well as data from Rfam and
the Todd Lowe lab.
Our thanks to the people running these databases
and to the scientists worldwide who have made contributions to them.
Data Use Restrictions
Copyright information from the UniProt website:
Copyright 2002-2009 UniProt Consortium.
We have chosen to apply the Creative Commons Attribution-NoDerivs
License to all copyrightable parts of our databases. This means that you are
free to copy, distribute, display and make commercial use of these databases,
provided you give us credit. However, if you intend to distribute a modified
version of one of our databases, you must ask us for permission first.
All databases and documents in the UniProt FTP directory may be copied and
redistributed freely, without advance permission, provided that this
copyright statement is reproduced with each copy.
References
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.
GenBank: update.
Nucleic Acids Res. 2004 Jan 1;32:D23-6.
Chan PP, Lowe TM.
GtRNAdb: A database of transfer RNA genes detected in genomic sequence.
Nucleic Acids Res. 2009 Jan;37:D93-7.
Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD,
Nawrocki EP, Kolbe DL, Eddy SR, Bateman A.
Rfam: Wikipedia, clans and the "decimal" release.
Nucleic Acids Res. 2011 Jan;39:D141-5.
Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D.
The UCSC Known Genes.
Bioinformatics. 2006 May 1;22(9):1036-46.
Kent WJ.
BLAT - the BLAST-like alignment tool.
Genome Res. 2002 Apr;12(4):656-64.
Lowe TM, Eddy SR.
tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence.
Nucleic Acids Res. 1997 Mar 1;25(5):955-64.