SeqAn3 3.3.0
The Modern C++ library for sequence analysis.
Structured sequence files contain intra-molecular interactions of RNA or protein. Usually, but not necessarily, they contain the nucleotide or amino acid sequences and descriptions as well. Interactions can be represented either as a fixed secondary structure, where every character is assigned at most one interaction partner (structure of minimum free energy), or as an annotated sequence, where every character is assigned a set of interaction partners with specific base pair probabilities.
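The "at most one interaction partner" property of a fixed secondary structure can be illustrated with plain dot-bracket notation. The following is a minimal standard-library sketch, not SeqAn3 code: the function name `partner_table` is hypothetical, and it assumes that only `(`, `)` and `.` occur (no pseudoknot brackets).

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of SeqAn3): for every position of a dot-bracket
// string, return the index of its interaction partner, or std::nullopt if unpaired.
std::vector<std::optional<std::size_t>> partner_table(std::string_view structure)
{
    std::vector<std::optional<std::size_t>> partners(structure.size());
    std::vector<std::size_t> open_positions; // stack of still unmatched '(' positions

    for (std::size_t i = 0; i < structure.size(); ++i)
    {
        switch (structure[i])
        {
        case '(':
            open_positions.push_back(i);
            break;
        case ')':
            if (open_positions.empty())
                throw std::invalid_argument{"unbalanced ')' in structure"};
            partners[i] = open_positions.back();
            partners[open_positions.back()] = i;
            open_positions.pop_back();
            break;
        case '.':
            break; // unpaired position
        default:
            throw std::invalid_argument{"unsupported structure character"};
        }
    }

    if (!open_positions.empty())
        throw std::invalid_argument{"unbalanced '(' in structure"};

    return partners;
}
```

For the structure `((..))` this pairs position 0 with 5 and position 1 with 4, while positions 2 and 3 stay unpaired. An annotated sequence with base pair probabilities would instead need a set of (partner, probability) entries per position.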
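The structured sequence file abstraction supports writing ten different fields: seqan3::field::seq, seqan3::field::id, seqan3::field::bpp, seqan3::field::structure, seqan3::field::structured_seq, seqan3::field::energy, seqan3::field::react, seqan3::field::react_err, seqan3::field::comment and seqan3::field::offset.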
The member functions accept any combination of these fields. If the field ID of an argument cannot be deduced, it is assumed to correspond to the field ID of the respective template parameter.
This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cout or std::ostringstream, that you want to write to, and/or if you cannot use file-extension-based detection but know that your output file has a certain format.
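The idea behind extension-based format selection can be sketched with the standard library alone. This is an illustration, not the SeqAn3 implementation: `sketch_format` and `detect_format` are hypothetical names, and the mapping of only `.dbn` to the Vienna format is an assumption of this sketch rather than the library's full extension list.

```cpp
#include <filesystem>
#include <optional>

// Hypothetical format tag for this sketch; SeqAn3 itself uses format types
// such as seqan3::format_vienna.
enum class sketch_format
{
    vienna
};

// Pick a format from the file extension. Only ".dbn" is mapped here (assumption).
std::optional<sketch_format> detect_format(std::filesystem::path const & file_name)
{
    if (file_name.extension() == ".dbn")
        return sketch_format::vienna;

    return std::nullopt; // unknown extension: the caller has to name the format explicitly
}
```

When no format can be derived from the name, which is always the case for a stream such as std::cout, the format has to be passed explicitly. That is the situation the second constructor covers.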
In most cases the template parameters are deduced completely automatically, both when constructing from a file name and when writing to a stream such as std::cout with an explicitly given format.
Note that naming the class without angle brackets is not the same as writing structure_file_output<> (with angle brackets). In the latter case the template parameters are explicitly set to their default values; in the former case automatic deduction happens, which chooses different parameters depending on the constructor arguments. Prefer deduction over explicit defaults.
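The difference between deduction and explicit defaults is a general property of C++17 class template argument deduction. The following sketch shows it on a hypothetical stand-in class, `sketch_output`, which is not the SeqAn3 type and has a single template parameter chosen for illustration only.

```cpp
#include <ostream>
#include <sstream>
#include <type_traits>

// Hypothetical stand-in for an output file class template (not the SeqAn3 type).
template <typename stream_type = std::ostream>
class sketch_output
{
public:
    using stream_t = stream_type;

    explicit sketch_output(stream_type & stream) : stream_{&stream}
    {}

    // Forward one line of text to the underlying stream.
    void write_line(char const * text)
    {
        *stream_ << text << '\n';
    }

private:
    stream_type * stream_;
};

// Returns true if the two spellings yield different types for the same argument.
inline bool deduction_differs_from_defaults()
{
    std::ostringstream buffer;
    sketch_output deduced{buffer};     // no angle brackets: stream_type is deduced as std::ostringstream
    sketch_output<> defaulted{buffer}; // angle brackets: stream_type is the default, std::ostream

    deduced.write_line("a");
    defaulted.write_line("b");

    return !std::is_same_v<decltype(deduced), decltype(defaulted)>;
}
```

Both objects write to the same buffer, but they have different types. With deduction, the type follows the constructor argument; with `<>`, it is fixed to the defaults regardless of the argument.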
You can write to this file record-wise.
The easiest way to write to a structure file is to use the push_back() or emplace_back() member functions. These work similarly to how they work on a std::vector. If you pass a tuple to push_back() or give arguments to emplace_back(), the seqan3::field ID of the i-th tuple element/argument is assumed to be the i-th value of selected_field_ids, i.e. by default the first is assumed to be seqan3::field::seq, the second seqan3::field::id and the third seqan3::field::structure. You may give fewer fields than are selected if the format you are writing to can cope with fewer (e.g. for Vienna it is sufficient to write seqan3::field::seq, seqan3::field::id and seqan3::field::structure, even if selected_field_ids also contains seqan3::field::energy).
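The positional interpretation of arguments can be mimicked with a small stand-in writer. This is a standard-library sketch, not the SeqAn3 class: `sketch_vienna_writer` and `write_example_record` are hypothetical, plain strings replace SeqAn3 alphabets, and the three-line layout with a `> ` header is an assumption about how a Vienna-style record looks.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical stand-in for an output file (not the SeqAn3 class). Arguments are
// interpreted by position: sequence first, then ID, then structure.
class sketch_vienna_writer
{
public:
    explicit sketch_vienna_writer(std::ostream & stream) : stream_{&stream}
    {}

    // Write one record as three lines: header, sequence, structure.
    void emplace_back(std::string_view seq, std::string_view id, std::string_view structure)
    {
        *stream_ << "> " << id << '\n' << seq << '\n' << structure << '\n';
    }

private:
    std::ostream * stream_;
};

// Convenience wrapper that writes a single record into a string.
inline std::string write_example_record()
{
    std::ostringstream buffer;
    sketch_vienna_writer writer{buffer};
    writer.emplace_back("GGGAAACCC", "hairpin", "(((...)))");
    return buffer.str();
}
```

Note that the ID is passed second but written first. The argument order reflects the selected field order, not the layout of the format on disk.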
You may also use the output file's iterator for writing; however, this rarely provides an advantage.
If you want to pass a combined object for the seqan3::field::seq and seqan3::field::structure fields to push_back() / emplace_back(), or if you want to change the order of the parameters, you can pass a non-empty fields trait object to the structure_file_output constructor to select the fields that are used for interpreting the arguments.
For example, a trait such as seqan3::fields<seqan3::field::id, seqan3::field::structured_seq> makes the file interpret the first argument as the ID and the second as the combined sequence and structure.
A different way of passing custom fields to the file is to pass a seqan3::record, instead of a tuple, to push_back(). The seqan3::record clearly indicates which of its elements has which seqan3::field ID, so the file will use that information instead of the template argument. This is especially handy when reading from one file and writing to another, because you don't have to configure the output file to match the input file; it will just work.
You can write multiple records at once by assigning a range of records to the file.
Record-wise writing in batches also works for writing from input files directly to output files, because input files are also input ranges in SeqAn. This can be combined with file-based views to create I/O pipelines.
The record-based interface treats the file as a range of tuples (the records), but in certain situations you might have the data as columns, i.e. a tuple-of-ranges, instead of range-of-tuples.
In that case you can use column-based writing, which relies on operator=() and seqan3::views::zip().
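The relationship between the two layouts can be sketched without SeqAn3. The helper below, `zip_columns`, is hypothetical and eager: it copies three equally long columns into a vector of records, whereas a zip view would combine them lazily. Plain strings stand in for SeqAn3 alphabets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// One record of this sketch: sequence, ID, structure (in that order).
using sketch_record = std::tuple<std::string, std::string, std::string>;

// Turn a tuple-of-ranges (three columns) into a range-of-tuples (records).
// Hypothetical helper; the text above names seqan3::views::zip() for this role.
std::vector<sketch_record> zip_columns(std::vector<std::string> const & seqs,
                                       std::vector<std::string> const & ids,
                                       std::vector<std::string> const & structures)
{
    // All columns must describe the same number of records.
    assert(seqs.size() == ids.size() && ids.size() == structures.size());

    std::vector<sketch_record> records;
    records.reserve(seqs.size());
    for (std::size_t i = 0; i < seqs.size(); ++i)
        records.emplace_back(seqs[i], ids[i], structures[i]);

    return records;
}
```

The resulting range of records has the shape that record-wise batch assignment expects, which is why zipping the columns and assigning the result is enough to write column-based data.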
Currently, the only implemented format is seqan3::format_vienna. More formats will follow soon.