Before the advent of the SMap tags and associated code, virtual sequences were only built in the context of a Feature Map (FMap), and only Sequence class objects could be mapped. This meant that the Sequence class had to contain everything: genes, homologies and all other features. This document refers to this as the "Old" style of mapping.
With the arrival of SMap a number of changes were made:
In this document we will distinguish between tags that are used by the code for building a virtual sequence and mapping features on to it, as opposed to tags that are used by FMap in the display of the virtual sequence.
We will also look at tags that are being moved out of the Sequence class into their own classes. This has already happened for homologies, with some very useful consequences, as we shall see later.
A "tag set" is a set of tags and data that occur in a defined order and can be processed by acedb code regardless of the class they appear in. These tag sets are colour coded in this document to help identify the significant parts of the tag set:
feature_tag [anonymous tag and object reference] [feature specific tags and data]
Where:
feature_tag is the tag that the code searches for to find out what sort of feature it is processing.
anonymous tag and object reference are sometimes included to allow object references of arbitrary class to be inserted into the tag set (this is also known as the "tag2 system"). Although this tag must be present, the code does not use it, hence both it and the object reference following it are completely user-definable.
feature specific tags and data are the tags and data that specify the feature.
Some examples:
Source_Exons Int UNIQUE Int
Homol DNA_homol ?Sequence XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
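As an illustration only, the three parts of a tag set can be separated mechanically from a model line. The sketch below is not acedb code: the function name split_tag_set, the TagSet record and the rule used to spot the anonymous tag (a plain tag name directly followed by a ?Class reference) are assumptions made for this example.

```python
from typing import List, NamedTuple, Optional

# Primitive data types used in the model lines quoted above.
DATA_TYPES = {"Int", "Float", "Text"}


class TagSet(NamedTuple):
    feature_tag: str
    anonymous_tag: Optional[str]   # "tag2" tag, present but ignored by the code
    object_class: Optional[str]    # class of the object reference, without "?"
    xref_tag: Optional[str]        # tag named after XREF, if any
    feature_data: List[str]        # remaining feature specific tags and data


def split_tag_set(model_line: str) -> TagSet:
    """Split one model line into the three parts of a tag set."""
    tokens = model_line.split("//")[0].split()   # drop any trailing comment
    feature_tag, rest = tokens[0], tokens[1:]
    anonymous_tag = object_class = xref_tag = None

    # Assumed rule: "<plain tag> ?Class" directly after the feature tag is
    # the anonymous tag and object reference.
    has_anonymous = (
        len(rest) >= 2
        and rest[0] not in DATA_TYPES
        and rest[0] != "UNIQUE"
        and not rest[0].startswith(("?", "#"))
        and rest[1].startswith("?")
    )
    if has_anonymous:
        anonymous_tag, object_class = rest[0], rest[1][1:]
        rest = rest[2:]
        if len(rest) >= 2 and rest[0] == "XREF":
            xref_tag, rest = rest[1], rest[2:]

    return TagSet(feature_tag, anonymous_tag, object_class, xref_tag, rest)


homol = split_tag_set(
    "Homol DNA_homol ?Sequence XREF DNA_homol ?Method Float Int Int Int Int #Homol_info"
)
exons = split_tag_set("Source_Exons Int UNIQUE Int")
```

Here homol carries an anonymous tag (DNA_homol) and an object reference of class Sequence, while exons has neither and consists only of the feature tag and its data.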
The Sequence class contains a number of tags used by the code to prepare virtual sequences. They are highlighted in the Sequence class below:
?Sequence DNA UNIQUE ?DNA UNIQUE Int // Int is the length
Structure From Source UNIQUE ?Sequence
Source_Exons Int UNIQUE Int // start at 1
Subsequence ?Sequence XREF Source UNIQUE Int UNIQUE Int
Clone_left_end ?Clone XREF Clone_left_end UNIQUE Int
Clone_right_end ?Clone XREF Clone_right_end UNIQUE Int
Origin Genetic_code UNIQUE ?Genetic_code // specify a different genetic coding.
Method UNIQUE ?Method UNIQUE Float // score
Visible Title UNIQUE ?Text
Other_name ?Text // for repeats
Matching_Genomic ?Sequence XREF Matching_cDNA
Matching_cDNA ?Sequence XREF Matching_Genomic
Corresponding_protein UNIQUE ?Protein XREF Corresponding_DNA
Clone ?Clone XREF Sequence
Locus ?Locus XREF Sequence
Paired_read ?Sequence XREF Paired_read // dl 020110
etc.
Reference ?Paper XREF Sequence
Expression_construct ?Clone // archaic
Expr_pattern ?Expr_pattern XREF Sequence
// tag2 system: names of all objects following next tag are shown in the
// general annotation display column as "tag:objname"
Properties Genomic_canonical
cDNA cDNA_EST
EST_5 // Indicate whether this is a 5' or 3' read [010423 dl]
EST_3
Coding CDS UNIQUE Int UNIQUE Int // start, end in spliced DNA coords,
// default: 1, end-of-CDS
Start_not_found UNIQUE Int // Gives position of start frame for protein
// translation when start of CDS is before first
// exon in this object (should be in range 1 -> 3).
End_not_found
Show_in_reverse_orientation // Draw 3' reads in reverse orientation [010423 dl]
Splices Confirmed_intron Int Int #Splice_confirmation
Predicted_5 ?Method Int Int UNIQUE Float // (x, x+1) or (x, x-1)
Predicted_3 ?Method Int Int UNIQUE Float // (x, x+1) or (x, x-1)
Oligo ?Oligo XREF In_sequence Int UNIQUE Int // for OSP and human mapping mostly
Assembly_tags Text Int Int Text // type, start, stop, comment
Allele ?Allele XREF Sequence UNIQUE Int UNIQUE Int UNIQUE Text
// start, stop, replacement sequence
// if an insertion point Text is transposon name (distinguished
// by containing non ACTG letters), and (n, n+1) = T A, so indicates
// direction (if known).
// if a deletion, put '-' as the replacement sequence
EMBL_feature CAAT_signal Int Int Text #EMBL_info
GC_signal Int Int Text #EMBL_info
allele_seq Int Int Text #EMBL_info
etc.
conflict Int Int Text #EMBL_info
LTR Int Int Text #EMBL_info
terminator Int Int Text #EMBL_info
// EMBL_features are for legitimate EMBL feature table entries only
Homol DNA_homol ?Sequence XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
Pep_homol ?Protein XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
Motif_homol ?Motif XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
// We will generate a column for each distinct ?Method. So for
// distinct Worm_EST and Worm_genomic columns, use ?Method objects
// Worm_EST_Blastn and Worm_genomic_Blastn.
Feature ?Method Int Int UNIQUE Float UNIQUE Text #Feature_info
// Float is score, Text is note
// note is shown on select, and same notes are neighbours
// again, each method has a column double-click shows the method.
// Absorb Assembly_tags?
?Splice_confirmation cDNA
EST
Homology
UTR
False
=======================================================================================
Open issue for discussion: with the addition of new classes we have a problem that we did not have to the same extent with the Sequence class. Within the Sequence class the use of the Allele tag was already defined and so was unlikely to change. This is no longer true: the Allele tag can now be used in other contexts, for example following an SMap tag, which makes things trickier.

/* Really we should check that the Allele tag is in the correct format here,
 * there is a big question about how/if we want to go down this road.....
 * We don't want to do this sort of checking at this level because the
 * performance implications are BAD....but we do have a problem because with Smap
 * the same tag may be used in one object as part of the Smap tags and in another
 * to signify that an object is an allele etc....
 */

- we could do nothing, which is the approach the old code took.
- we could check the models every time we call smapconvert and make sure that the tags we need to process are in the correct format; complex but possible.
- we could take Keith's suggestion of having a "Properties" tag which basically says that anything following it is a magic tag, so changing it may break things.

The last sounds like a workable halfway house.
=======================================================================================
The new tags can be neatly split into tags that give the mapping of the feature on the virtual sequence and those that specify the type and attributes (as required by the acedb code) of a feature.
The Sequence class contains two mapping tag sets:
?Sequence
Structure From Source UNIQUE ?Sequence
Subsequence ?Sequence XREF Source UNIQUE Int UNIQUE Int
These tags are only recognised within a Sequence class object. acedb will continue to process them, but they are maintained for backwards compatibility only.
See below for a discussion of the Source_exons tag.
The new SMap tag set provides a much more flexible and sophisticated mechanism for creating virtual sequences. The full tag set is:
SMap S_Parent UNIQUE <anonymous parent tag> UNIQUE <anonymous parent object> XREF <xref tag in parent object>
S_Child <anonymous child tag> <anonymous child object> XREF <xref tag in child object> UNIQUE Int UNIQUE Int #SMap_info
It is a core assumption of SMap that each child object has just one parent, hence the plethora of UNIQUE tags following S_Parent. Note that you do need to make both the anonymous tag and the anonymous object reference UNIQUE.
The S_Child tag set specifies the mapping of a child object in its parent but can also specify other information about the child including any gaps in the alignment of child to parent. This information is specified in the SMap_info sub-model described below.
Here is an example of how to specify the tags for some fictitious classes. The XREFs are a bit arcane but are very important as an aid to curation, ensuring that all the correct child-parent tags are created and maintained as objects are added to the database. (Note that only the SMap tags have been included.)
// Chromosome has no parent and can only have link objects as children.
//
?Chromosome Remark Text
SMap S_Child Link_child ?Link XREF Chromosome_parent UNIQUE Int UNIQUE Int #SMap_info
// Link can have chromosome or link as parent, and link, sequence or allele as children.
//
?Link Remark Text
SMap S_Parent UNIQUE Chromosome_parent UNIQUE ?Chromosome XREF Link_child
Link_parent UNIQUE ?Link XREF Link_child
S_Child Sequence_child ?Sequence XREF Link_parent UNIQUE Int UNIQUE Int #SMap_info
Link_child ?Link XREF Link_parent UNIQUE Int UNIQUE Int #SMap_info
Allele_child ?Allele XREF Link_parent UNIQUE Int UNIQUE Int #SMap_info
// Sequence can have sequence or link as parent and sequence or allele as children.
//
?Sequence
SMap S_Parent UNIQUE Sequence_parent UNIQUE ?Sequence XREF Sequence_child
Link_parent UNIQUE ?Link XREF Sequence_child
S_Child Sequence_child ?Sequence XREF Sequence_parent UNIQUE Int UNIQUE Int #SMap_info
Allele_child ?Allele XREF Sequence_parent UNIQUE Int UNIQUE Int #SMap_info
// allele can have sequence or link as parent and no children.
//
?Allele Remark Text
DNA UNIQUE ?DNA UNIQUE Int
SMap S_Parent UNIQUE Sequence_parent UNIQUE ?Sequence XREF Allele_child
Link_parent UNIQUE ?Link XREF Allele_child
Genetic_code UNIQUE ?Genetic_code
Method UNIQUE ?Method
CDS UNIQUE Int UNIQUE Int
A couple of points are worth making here. In this example the anonymous tags have "_parent" and "_child" suffixes; this is for clarity only, and you can call these tags what you like. Also note how the UNIQUE tags following S_Parent ensure there is only ever one parent object, while the tags following S_Child allow an arbitrary number of child objects.
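The one-parent, many-children rule can be pictured with a small in-memory model. This is a sketch under stated assumptions, not how acedb stores objects: the SMapObject class, its method names and the object names are invented for the example, and coordinates are assumed to be 1-based, forward-strand and ungapped (no SMap_info).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class SMapObject:
    name: str
    parent: Optional["SMapObject"] = None   # S_Parent: at most one parent
    # S_Child: child name -> (start, end) of the child in this object
    children: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def add_child(self, child: "SMapObject", start: int, end: int) -> None:
        """Record an S_Child mapping and the matching S_Parent link."""
        if child.parent is not None and child.parent is not self:
            raise ValueError(
                f"{child.name} already has parent {child.parent.name}"
            )
        child.parent = self
        self.children[child.name] = (start, end)


def to_root_coord(obj: SMapObject, pos: int) -> Tuple[str, int]:
    """Walk up the single-parent chain, converting pos at each step."""
    while obj.parent is not None:
        start, _end = obj.parent.children[obj.name]
        pos = start + pos - 1               # child position 1 sits at 'start'
        obj = obj.parent
    return obj.name, pos


chromosome = SMapObject("CHROMOSOME_I")
link = SMapObject("LINK_A")
sequence = SMapObject("SEQ_1")
chromosome.add_child(link, 1001, 5000)
link.add_child(sequence, 201, 700)
root_name, root_pos = to_root_coord(sequence, 10)
```

Because each object has at most one parent, the walk from any child to the top of the hierarchy is unambiguous, which is what makes building the virtual sequence a simple chain of coordinate conversions.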
The intention of SMap_info is twofold: to provide additional information about the way that a child maps to a parent (i.e. gaps, scaling and mismatches), and to provide a mechanism for optimising the way virtual sequences are constructed, by allowing the acedb code to find out certain crucial bits of information about the contents of a child object without having to go to the expense of loading the object from disk.
#SMap_info Mapping Align Int UNIQUE Int UNIQUE Int
// if no Align assume whole alignment is ungapped starting at 1 in child
// Otherwise one row per ungapped alignment section of parent_start child_start [length]
// first int is position in parent, second in child
// third int is length - only necessary when gaps in both sequences align
AlignDNAPep Int UNIQUE Int UNIQUE Int
AlignPepDNA Int UNIQUE Int UNIQUE Int
// These two tags are analogous to Align, but scale length
// for the case of a dna alignment to peptide or vice-versa.
Mismatch Int UNIQUE Int
// start end of mismatch region in child coords
// mismatches from this sequence or children are ignored
// if no end then only the specified base
// if no Ints then mismatches OK anywhere in this sequence
Content Homol_only // useful for -nohomol gff option
Feature_only // useful for -nofeature gff option
Method UNIQUE ?Method // methods used in child (optional)
No_DNA // child contains no dna, SMap can be more efficient in building dna sequence.
Display Max_mag UNIQUE Float // don't show if more bases per line
Min_mag UNIQUE Float // don't show if fewer bases per line
Currently only the Mapping tags are supported by acedb as these are essential for supporting gapped alignments and mismatches. The Content and Display tags would provide good optimisations but do not provide extra function.
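To make the Align rows concrete, here is a minimal sketch of converting a child position to a parent position. It is an illustration, not the acedb implementation. It assumes forward orientation, and it assumes that when the optional length is omitted a block runs up to the start of the next block in whichever sequence reaches it first. The function name child_to_parent and its arguments are hypothetical.

```python
from typing import List, Optional, Sequence

# One Align row: (parent_start, child_start) or (parent_start, child_start, length)
AlignRow = Sequence[int]


def child_to_parent(pos: int, s_child_start: int,
                    align: List[AlignRow]) -> Optional[int]:
    """Map a 1-based child position to the parent, or None if it is in a gap."""
    if not align:
        # No Align rows: the whole alignment is ungapped, starting at 1 in the child.
        return s_child_start + pos - 1

    # With Align rows the parent positions come from the rows themselves.
    rows = sorted(align, key=lambda row: row[1])
    for i, row in enumerate(rows):
        parent_start, child_start = row[0], row[1]
        if len(row) > 2:
            length = row[2]                       # explicit block length
        elif i + 1 < len(rows):
            # Assumed: block ends where the next block begins in either sequence.
            length = min(rows[i + 1][1] - child_start,
                         rows[i + 1][0] - parent_start)
        else:
            length = None                         # last block is open-ended
        offset = pos - child_start
        if offset >= 0 and (length is None or offset < length):
            return parent_start + offset
    return None


gap_in_parent = [(100, 1), (160, 51, 20)]   # parent 150..159 has no child bases
gap_in_child = [(100, 1), (150, 61)]        # child 51..60 is unaligned
```

With gap_in_parent, child position 51 lands on parent position 160. With gap_in_child, child position 55 falls in the unaligned stretch and maps to nothing.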
The Source_exons tag set is given as part of the mapping tag set in the Sequence class:
Structure From Source UNIQUE ?Sequence
Source_Exons Int UNIQUE Int // start at 1
Subsequence ?Sequence XREF Source UNIQUE Int UNIQUE Int
and should properly be added to the SMap tag set since the SMap code must use the Source_exons tag both to position exons and also to retrieve the spliced DNA for the exons. The tag set is almost meaningless without the context of a parent object, although it is possible to specify a Sequence class object that contains DNA and also Source_exons and in this case the Source_exons tag could be interpreted in the context of the DNA.
The tag is also used however by the acedb code to recognise an object that is "gene-like" and should be dumped, displayed etc in that style. So to some extent the tag has a double purpose and is described both here and in the following section on Type tag sets.
In summary then the Source_exons tag set conveys both mapping and type information.
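The mapping role of Source_exons can be sketched as follows: given the DNA of the span and the exon pairs, the spliced DNA is the concatenation of the exon slices, and a position in spliced coordinates (as used by the CDS tag) can be converted back to a position in the span. The function names are hypothetical and the code is an illustration only, assuming 1-based inclusive coordinates.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # (start, end), 1-based inclusive, local to the span


def spliced_dna(span_dna: str, exons: List[Exon]) -> str:
    """Concatenate the exon slices of the span DNA in coordinate order."""
    return "".join(span_dna[start - 1:end] for start, end in sorted(exons))


def spliced_to_span(pos: int, exons: List[Exon]) -> int:
    """Convert a 1-based position in the spliced DNA to a span position."""
    if pos < 1:
        raise ValueError("positions start at 1")
    remaining = pos
    for start, end in sorted(exons):
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length
    raise ValueError("position lies beyond the end of the spliced DNA")


example_exons = [(7, 9), (1, 3)]
example_dna = "AAACCCGGGTTT"
```

For the example data the spliced DNA is the first and third triplets joined together, and spliced position 4 is the first base of the second exon, i.e. span position 7.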
The purpose of these tags is to tell acedb what type of feature it is going to map onto the virtual sequence. The code needs this information for several reasons:
i.e. something has to tell the code what sort of thing it is supposed to produce and what format the data it needs will be in.
Each tag set has tags and data in a predefined and constant format. Provided the tag and data format are preserved, a tag set can be embedded in any class that is to be smapped onto the virtual sequence. The following sections describe each tag set that can be used for virtual sequence building.
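The idea that a tag set is processed by its tag rather than by the class it lives in can be sketched as a dispatch table. This is an illustration only. Objects are modelled as plain dictionaries, the "Transcript" class is hypothetical, and the handler names are invented here and do not correspond to acedb functions.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, int, int]   # (segment type, start, end)


def convert_exons(rows: list) -> List[Segment]:
    """Source_Exons rows are (start, end) pairs; order does not matter."""
    return [("exon", start, end) for start, end in sorted(rows)]


def convert_cds(rows: list) -> List[Segment]:
    """CDS is UNIQUE Int UNIQUE Int, so there is a single (start, end) row."""
    start, end = rows[0]
    return [("cds", start, end)]


# Feature tag -> handler. The class of the object is never consulted.
HANDLERS: Dict[str, Callable[[list], List[Segment]]] = {
    "Source_Exons": convert_exons,
    "CDS": convert_cds,
}


def convert_object(obj: dict) -> List[Segment]:
    """Run every handler whose feature tag is present in the object."""
    segments: List[Segment] = []
    for tag, handler in HANDLERS.items():
        if tag in obj["tags"]:
            segments.extend(handler(obj["tags"][tag]))
    return segments


sequence_obj = {"class": "Sequence",
                "tags": {"Source_Exons": [(58, 72), (1, 25)], "CDS": [(1, 40)]}}
transcript_obj = {"class": "Transcript",            # hypothetical class
                  "tags": {"Source_Exons": [(58, 72), (1, 25)], "CDS": [(1, 40)]}}
```

Both objects yield the same segments, because only the tags and the format of the data that follows them matter.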
Source_Exons Int UNIQUE Int
Used to specify a set of exons (and by implication introns) for genes, pseudogenes, transcripts and all things gene-like. Each pair of integers specifies the start/end coordinates of an exon. The coordinates are local, i.e. they run from 1 to the length of the span of the exons. Although the order the pairs come in doesn't matter (they are sorted by the code), each pair must be given smallest coordinate first. You can specify an exon of length 1 by giving the same coordinate twice.
Example: Source_Exons 1 25 // first exon starts at 1
58 72
157 178
200 200 // exon of length 1.
345 379 // last exon end coordinate is also the span of the exons.
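The rules above (pairs may come in any order, smallest coordinate first within a pair, introns implied by the gaps between exons) can be sketched as below, using the coordinates from the example. The function names are hypothetical. Treating directly adjacent exons as having no intron between them is an assumption of this sketch.

```python
from typing import List, Tuple

Exon = Tuple[int, int]


def normalise_exons(pairs: List[Exon]) -> List[Exon]:
    """Check each pair is smallest-coordinate-first, then sort the pairs."""
    for start, end in pairs:
        if start < 1 or start > end:
            raise ValueError(f"bad exon pair: {start} {end}")
    return sorted(pairs)


def implied_introns(exons: List[Exon]) -> List[Exon]:
    """Return the gaps between consecutive sorted exons."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - prev_end > 1
    ]


# The example pairs, deliberately given out of order.
exons = normalise_exons([(345, 379), (1, 25), (58, 72), (157, 178), (200, 200)])
introns = implied_introns(exons)
span = exons[-1][1]    # end coordinate of the last exon
```

For these pairs the implied introns are 26-57, 73-156, 179-199 and 201-344, and the span is 379.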
Confirmed_intron Int Int #Splice_confirmation
?Splice_confirmation cDNA
EST
Homology
UTR
False
Used to specify an intron that has been confirmed by alignment data from various sources (EST, cDNA etc.).
Example: Confirmed_intron Int Int #Splice_confirmation
Allele
Specifies an allele, alleles are drawn in various styles to indicate transposon insertions, other insertions or deletions.
Example: errrr... Allele (!)
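The comments on the Allele tag in the Sequence model say that a replacement Text of '-' marks a deletion and that a Text containing letters other than A, C, G and T is a transposon name. A minimal sketch of that distinction follows. The function name, the returned labels and the case-insensitive comparison are assumptions made for illustration, and the sketch does not attempt to separate other insertions from plain sequence replacements.

```python
def classify_allele(text: str) -> str:
    """Classify an Allele entry from its replacement-sequence Text."""
    if not text:
        return "unspecified"              # no Text given
    if text == "-":
        return "deletion"                 # '-' as the replacement sequence
    # Assumed: the comparison ignores case, so "acgt" counts as sequence.
    if any(ch not in "ACGT" for ch in text.upper()):
        return "transposon_insertion"     # Text is a transposon name
    return "sequence_change"              # Text is replacement sequence


examples = {
    "-": classify_allele("-"),
    "Tc1": classify_allele("Tc1"),
    "ACGT": classify_allele("ACGT"),
}
```

A transposon whose name happens to consist only of the letters A, C, G and T would be misread as sequence by this rule, which is a limitation of the convention itself rather than of the sketch.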
Homol <anonymous homol tag> <anonymous homol object> XREF <xref tag in homol object> ?Method Float Int Int Int Int #Homol_info
// We will generate a column for each distinct ?Method. So for
// distinct Worm_EST and Worm_genomic columns, use ?Method objects
// Worm_EST_Blastn and Worm_genomic_Blastn.
Example: Homol DNA_homol ?Sequence XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
Pep_homol ?Protein XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
Motif_homol ?Motif XREF DNA_homol ?Method Float Int Int Int Int #Homol_info
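The "one column per distinct ?Method" rule amounts to grouping homology rows by their method. The sketch below is an illustration only. The Homol record, the meaning given to the four Ints (start/end in this object, then start/end in the target) and the target names are assumptions, since this document does not define them.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Homol(NamedTuple):
    target: str        # name of the homologous object
    method: str        # name of the ?Method object
    score: float
    start: int         # assumed: start in this object
    end: int           # assumed: end in this object
    target_start: int  # assumed: start in the target
    target_end: int    # assumed: end in the target


def columns_by_method(homols: List[Homol]) -> Dict[str, List[Homol]]:
    """Group homologies so that each distinct method becomes one column."""
    columns: Dict[str, List[Homol]] = defaultdict(list)
    for homol in homols:
        columns[homol.method].append(homol)
    return dict(columns)


homols = [
    Homol("EST_A", "Worm_EST_Blastn", 95.0, 100, 250, 1, 151),
    Homol("GENOMIC_B", "Worm_genomic_Blastn", 88.5, 300, 420, 10, 130),
    Homol("EST_C", "Worm_EST_Blastn", 72.0, 500, 610, 1, 111),
]
columns = columns_by_method(homols)
```

Using separate methods such as Worm_EST_Blastn and Worm_genomic_Blastn therefore gives two columns, even though all three rows are DNA homologies.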
Feature ?Method Int Int UNIQUE Float UNIQUE Text #Feature_info
I don't know what uses this at the moment.....
#Feature_info EMBL_dump UNIQUE EMBL_dump_YES
EMBL_dump_NO
// overrides for embl dump based on method
EMBL_qualifier Text
// additional to those in the method, includes '/'
Frame UNIQUE Frame_0 /* in frame */
Frame_1 /* 1 base then codon */
Frame_2 /* 2 bases then codon */
Used for any feature that can be positioned on a virtual sequence and may have some sort of score and text for display. The text is a note describing the feature and will be dumped or displayed.
Example:
Clone_left_end ?Clone XREF Clone_left_end UNIQUE Int
Clone_right_end ?Clone XREF Clone_right_end UNIQUE Int
Specifies what exactly...I can't remember....look this up.....
Example: Clone_left_end B0303 XREF Clone_left_end 425
Feature ?Method Int Int UNIQUE Float UNIQUE Text #Feature_info
// Float is score, Text is note
// note is shown on select, and same notes are neighbours
_Transcribed_gene = str2tag("Transcribed_gene") ;
NOT IN CVS MODELS.WRM
_Clone_left_end = str2tag("Clone_left_end") ;
_Clone_right_end = str2tag("Clone_right_end") ;
Clone_left_end ?Clone XREF Clone_left_end UNIQUE Int
Clone_right_end ?Clone XREF Clone_right_end UNIQUE Int
_Splices = str2tag("Splices") ;
_Confirmed_intron = str2tag("Confirmed_intron") ;
_Predicted_5 = str2tag("Predicted_5") ;
_Predicted_3 = str2tag("Predicted_3") ;
Splices Confirmed_intron Int Int #Splice_confirmation
Predicted_5 ?Method Int Int UNIQUE Float // (x, x+1) or (x, x-1)
Predicted_3 ?Method Int Int UNIQUE Float // (x, x+1) or (x, x-1)
_EMBL_feature = str2tag("EMBL_feature") ;
EMBL_feature CAAT_signal Int Int Text #EMBL_info
GC_signal Int Int Text #EMBL_info
TATA_signal Int Int Text #EMBL_info
allele_seq Int Int Text #EMBL_info
conflict Int Int Text #EMBL_info
mat_peptide Int Int Text #EMBL_info
misc_binding Int Int Text #EMBL_info
misc_feature Int Int Text #EMBL_info
misc_signal Int Int Text #EMBL_info
misc_recomb Int Int Text #EMBL_info
modified_base Int Int Text #EMBL_info
mutation Int Int Text #EMBL_info
old_sequence Int Int Text #EMBL_info
polyA_signal Int Int Text #EMBL_info
polyA_site Int Int Text #EMBL_info
prim_binding Int Int Text #EMBL_info
prim_transcript Int Int Text #EMBL_info
promoter Int Int Text #EMBL_info
repeat_region Int Int Text #EMBL_info
repeat_unit Int Int Text #EMBL_info
satellite Int Int Text #EMBL_info
sig_peptide Int Int Text #EMBL_info
variation Int Int Text #EMBL_info
enhancer Int Int Text #EMBL_info
protein_bind Int Int Text #EMBL_info
stem_loop Int Int Text #EMBL_info
primer_bind Int Int Text #EMBL_info
transit_peptide Int Int Text #EMBL_info
misc_structure Int Int Text #EMBL_info
precursor_RNA Int Int Text #EMBL_info
LTR Int Int Text #EMBL_info
terminator Int Int Text #EMBL_info
// EMBL_features are for legitimate EMBL feature table entries only
_Method = str2tag("Method");
if (bsFindTag(obj, str2tag("Source_Exons")))
convertExons(conv_info, &exons_found) ;
Source_Exons Int UNIQUE Int // start at 1
if (bsFindTag(obj, str2tag("CDS")))
convertCDS(conv_info, &cds_min, exons_found);
Coding CDS UNIQUE Int UNIQUE Int // start, end in spliced DNA coords,
// default: 1, end-of-CDS
CDS_predicted_by ?Method Float // score of method
if (bsFindTag(obj, str2tag("Start_not_found")))
convertStart(conv_info, cds_min) ;
End_not_found
Start_not_found UNIQUE Int // Gives position of start frame for protein
// translation when start of CDS is before first
// exon in this object (should be in range 1 -> 3).
if (bsFindTag(conv_info->obj, str2tag("Assembled_from")))
sinf->flags |= SEQ_VIRTUAL_ERRORS ;
NOT IN CVS MODELS
if (bsFindTag(conv_info->obj, str2tag("Genomic_canonical")) ||
bsFindTag(conv_info->obj, str2tag("Genomic")))
sinf->flags |= SEQ_CANONICAL ;
Properties Genomic_canonical
Clone = str2tag("Clone") ;
Clone ?Clone XREF Sequence
Show_in_reverse_orientation = str2tag("Show_in_reverse_orientation") ;
Paired_read = str2tag("Paired_read") ;
NOT IN CVS MODELS
in method obj
Join_blocks = str2tag("Join_blocks") ;
_Tm = str2tag ("Tm") ;
_Temporary = str2tag ("Temporary") ;
Status Temporary (in ?Oligo)
_EST = str2tag("EST") ;
_cDNA = str2tag("cDNA") ;
_Homology = str2tag("Homology") ;
_UTR = str2tag("UTR") ;
_False = str2tag("False");
?Splice_confirmation cDNA
EST
Homology
UTR
False
seg->key = _CDS ;