Package org.snpeff.snpEffect.factory
Class SnpEffPredictorFactoryRefSeq
java.lang.Object
org.snpeff.snpEffect.factory.SnpEffPredictorFactory
org.snpeff.snpEffect.factory.SnpEffPredictorFactoryRefSeq
This class creates a SnpEffectPredictor from a TXT file dumped using the UCSC Table Browser.
RefSeq table schema: http://genome.ucsc.edu/cgi-bin/hgTables
field         example               SQL type                               info    description
bin           585                   smallint(5)                            range   Indexing field to speed chromosome range queries.
name          NR_026818             varchar(255)                           values  Name of gene (usually transcript_id from GTF)
chrom         chr1                  varchar(255)                           values  Reference sequence chromosome or scaffold
strand        -                     char(1)                                values  + or - for strand
txStart       34610                 int(10)                                range   Transcription start position
txEnd         36081                 int(10)                                range   Transcription end position
cdsStart      36081                 int(10)                                range   Coding region start
cdsEnd        36081                 int(10)                                range   Coding region end
exonCount     3                     int(10)                                range   Number of exons
exonStarts    34610,35276,35720,    longblob                                       Exon start positions
exonEnds      35174,35481,36081,    longblob                                       Exon end positions
score         0                     int(11)                                range
name2         FAM138A               varchar(255)                           values  Alternate name (e.g. gene_id from GTF)
cdsStartStat  unk                   enum('none', 'unk', 'incmpl', 'cmpl')  values  enum('none','unk','incmpl','cmpl')
cdsEndStat    unk                   enum('none', 'unk', 'incmpl', 'cmpl')  values  enum('none','unk','incmpl','cmpl')
exonFrames    -1,-1,-1,             longblob                                       Exon frame {0,1,2}, or -1 if no frame for exon
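As an illustration of the schema above, the following is a minimal sketch of how one tab-separated row of such a dump could be split into typed fields. It is not the parser used by this class: the name RefSeqRow and all of its members are hypothetical, and the hasCds() check assumes the convention visible in the example row, where a non-coding transcript has cdsStart equal to cdsEnd.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of SnpEff: one row of a UCSC RefSeq table dump. */
final class RefSeqRow {
    final String name;          // transcript accession, e.g. NR_026818
    final String chrom;
    final boolean minusStrand;
    final int txStart, txEnd, cdsStart, cdsEnd;
    final int[] exonStarts, exonEnds, exonFrames;
    final String name2;         // alternate (gene) name, e.g. FAM138A

    private RefSeqRow(String[] f) {
        // f[0] is 'bin', an indexing field that carries no annotation.
        name = f[1];
        chrom = f[2];
        minusStrand = f[3].equals("-");
        txStart = Integer.parseInt(f[4]);
        txEnd = Integer.parseInt(f[5]);
        cdsStart = Integer.parseInt(f[6]);
        cdsEnd = Integer.parseInt(f[7]);
        int exonCount = Integer.parseInt(f[8]);
        exonStarts = parseIntList(f[9]);
        exonEnds = parseIntList(f[10]);
        // f[11] is 'score'; f[13] and f[14] are cdsStartStat and cdsEndStat.
        name2 = f[12];
        exonFrames = parseIntList(f[15]);
        if (exonStarts.length != exonCount || exonEnds.length != exonCount) {
            throw new IllegalArgumentException("exonCount does not match exon lists for " + name);
        }
    }

    /** Parses a comma-terminated list such as "34610,35276,35720," into an int array. */
    static int[] parseIntList(String s) {
        if (s.isEmpty()) return new int[0];
        // String.split drops trailing empty strings, so the final comma is harmless.
        return Arrays.stream(s.split(",")).mapToInt(Integer::parseInt).toArray();
    }

    /** Parses one line with the 16 tab-separated columns listed in the schema. */
    static RefSeqRow parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length < 16) {
            throw new IllegalArgumentException("Expected 16 columns, got " + f.length);
        }
        return new RefSeqRow(f);
    }

    /** Assumption: an empty coding interval (cdsStart == cdsEnd) marks a non-coding transcript. */
    boolean hasCds() {
        return cdsStart < cdsEnd;
    }

    public static void main(String[] args) {
        String line = String.join("\t", "585", "NR_026818", "chr1", "-", "34610", "36081",
                "36081", "36081", "3", "34610,35276,35720,", "35174,35481,36081,",
                "0", "FAM138A", "unk", "unk", "-1,-1,-1,");
        RefSeqRow row = RefSeqRow.parse(line);
        System.out.println(row.name + " " + row.name2 + " exons=" + row.exonStarts.length);
    }
}
```

The exonCount check is a design choice of this sketch: failing early on an inconsistent row is usually easier to debug than building a transcript with mismatched exon coordinates.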
RefSeq accession format (e.g. NM_, NR_ codes): http://www.ncbi.nlm.nih.gov/RefSeq/key.html
Accession Molecule Method Note
AC_123456 Genomic Mixed Alternate complete genomic molecule. This prefix is used for records that are provided to reflect an alternate assembly or annotation. Primarily used for viral, prokaryotic records.
AP_123456 Protein Mixed Protein products; alternate protein record. This prefix is used for records that are provided to reflect an alternate assembly or annotation. The AP_ prefix was originally designated for bacterial proteins but this usage was changed.
NC_123456 Genomic Mixed Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456 Genomic Mixed Incomplete genomic region; supplied to support the NCBI genome annotation pipeline. Represents either non-transcribed pseudogenes, or larger regions representing a gene cluster that is difficult to annotate via automatic methods.
NM_123456789 mRNA Mixed Transcript products; mature messenger RNA (mRNA) transcripts.
NP_123456789 Protein Mixed Protein products; primarily full-length precursor products but may include some partial proteins and mature peptide products.
NR_123456 RNA Mixed Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NT_123456 Genomic Automated Intermediate genomic assemblies of BAC and/or Whole Genome Shotgun sequence data.
NW_123456789 Genomic Automated Intermediate genomic assemblies of BAC or Whole Genome Shotgun sequence data.
NZ_ABCD12345678 Genomic Automated A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
XM_123456789 mRNA Automated Transcript products; model mRNA provided by a genome annotation process; sequence corresponds to the genomic contig.
XP_123456789 Protein Automated Protein products; model proteins provided by a genome annotation process; sequence corresponds to the genomic contig.
XR_123456 RNA Automated Transcript products; model non-coding transcripts provided by a genome annotation process; sequence corresponds to the genomic contig.
YP_123456789 Protein Mixed Protein products; no corresponding transcript record provided. Primarily used for bacterial, viral, and mitochondrial records.
ZP_12345678 Protein Automated Protein products; annotated on NZ_ accessions (often via computational methods).
NS_123456 Genomic Automated Genomic records that represent an assembly which does not reflect the structure of a real biological molecule. The assembly may represent an unordered assembly of unplaced scaffolds, or it may represent an assembly of DNA sequences generated from a biological sample that may not represent a single organism.
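The prefix table above can be turned into a simple lookup. The sketch below maps the two-letter prefix of an accession to the Molecule column of the table; the class RefSeqAccession, its Molecule enum, and the molecule() method are hypothetical names introduced for illustration and are not part of SnpEff.

```java
import java.util.Map;

/** Hypothetical helper, not part of SnpEff: classifies RefSeq accessions by prefix. */
final class RefSeqAccession {
    enum Molecule { GENOMIC, MRNA, RNA, PROTEIN, UNKNOWN }

    // Prefix to molecule type, copied from the Molecule column of the table above.
    private static final Map<String, Molecule> BY_PREFIX = Map.ofEntries(
            Map.entry("AC", Molecule.GENOMIC),
            Map.entry("AP", Molecule.PROTEIN),
            Map.entry("NC", Molecule.GENOMIC),
            Map.entry("NG", Molecule.GENOMIC),
            Map.entry("NM", Molecule.MRNA),
            Map.entry("NP", Molecule.PROTEIN),
            Map.entry("NR", Molecule.RNA),
            Map.entry("NT", Molecule.GENOMIC),
            Map.entry("NW", Molecule.GENOMIC),
            Map.entry("NZ", Molecule.GENOMIC),
            Map.entry("XM", Molecule.MRNA),
            Map.entry("XP", Molecule.PROTEIN),
            Map.entry("XR", Molecule.RNA),
            Map.entry("YP", Molecule.PROTEIN),
            Map.entry("ZP", Molecule.PROTEIN),
            Map.entry("NS", Molecule.GENOMIC));

    /** Returns the molecule type for an accession such as "NM_123456789". */
    static Molecule molecule(String accession) {
        // A RefSeq accession has a two-letter prefix followed by an underscore.
        if (accession == null || accession.indexOf('_') != 2) {
            return Molecule.UNKNOWN;
        }
        return BY_PREFIX.getOrDefault(accession.substring(0, 2), Molecule.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(molecule("NM_123456789"));
        System.out.println(molecule("NR_026818"));
    }
}
```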
$ zcat genes.txt.gz | cut -f 2 | cut -b 1,2 | sort | uniq -c
34466 NM
6548 NR
Author: pcingola
Field Summary
Fields inherited from class org.snpeff.snpEffect.factory.SnpEffPredictorFactory
MARK, MIN_TOTAL_FRAME_COUNT
Constructor Summary
Method Summary
Methods inherited from class org.snpeff.snpEffect.factory.SnpEffPredictorFactory
add, add, add, add, add, add, addMarker, addSequences, adjustChromosomes, adjustTranscripts, beforeExonSequences, codingFromCds, collapseZeroLenIntrons, createRandSequences, deleteRedundant, exonsFromCds, exonsFromCds, findGene, findGene, findMarker, findTranscript, findTranscript, getOrCreateChromosome, getProteinByTrId, parsePosition, readExonSequences, replaceTranscript, setCircularCorrectLargeGap, setCreateRandSequences, setDebug, setFastaFile, setFileName, setRandom, setReadSequences, setStoreSequences, setVerbose, showChromoNamesDifferences
Field Details
CDS_STAT_COMPLETE
Constructor Details
SnpEffPredictorFactoryRefSeq
Method Details
create
Specified by: create in class SnpEffPredictorFactory
readRefSeqFile
protected void readRefSeqFile()
Read and parse the RefSeq file.