Structured sequence files contain intra-molecular interactions of RNA or protein. Usually, but not necessarily, they contain the nucleotide or amino acid sequences and descriptions as well. Interactions can be represented either as fixed secondary structure, where every character is assigned at most one interaction partner (structure of minimum free energy), or an annotated sequence, where every character is assigned a set of interaction partners with specific base pair probabilities.
The structured sequence file abstraction supports reading ten different fields:
- seqan3::field::seq (sequence)
- seqan3::field::id (identifier)
- seqan3::field::bpp (annotated sequence)
- seqan3::field::structure (secondary structure)
- seqan3::field::structured_seq (sequence and structure in one range)
- seqan3::field::energy (minimum free energy)
- seqan3::field::react (reactivity)
- seqan3::field::react_err (reactivity error)
- seqan3::field::comment (free text)
- seqan3::field::offset (index of first sequence character)
The first three fields are retrieved by default (and in that order). The seqan3::field::structured_seq may be selected to have sequence and structure directly stored in a more memory-efficient combined container. If you select this field you must not select seqan3::field::seq or seqan3::field::structure.
Construction and specialisation
This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cin or std::istringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your input file has a certain format.
In most cases the template parameters are deduced completely automatically, e.g. reading from a std::istringstream:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
> example
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))... (-3.71))";
int main()
{
{
fout.
emplace_back(
"GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA"_rna4,
"S.cerevisiae_tRNA-PHE M10740/1-73",
"(((((((..((((........)))).((((.........)))).....(((((.......))))))))))))."_wuss51);
fout.emplace_back("UUGGAGUACACAACCUGUACACUCUUUC"_rna4, "example", "..(((((..(((...)))..)))))..."_wuss51);
}
}
A class for writing structured sequence files, e.g. Stockholm, Connect, Vienna, ViennaRNA bpp matrix ...
Definition io/structure_file/output.hpp:63
void emplace_back(arg_t &&arg, arg_types &&... args)
Write a record to the file by passing individual fields.
Definition io/structure_file/output.hpp:366
Provides seqan3::structure_file_output and corresponding traits classes.
The SeqAn namespace for literals.
Provides seqan3::rna4, container aliases and string literals.
T temp_directory_path(T... args)
Provides the WUSS format for RNA structure.
Note that this is not the same as writing structure_file_input<>
(with angle brackets). In the latter case they are explicitly set to their default values, in the former case automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file, structure_file_input<>
would have also worked, but for opening from stream it would not have.
In some cases, you do need to specify the arguments, e.g. if you want to read amino acids:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
> example
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))... (-3.71))";
int main()
{
}
You can define your own traits type to further customise the types used by and returned by this class, see seqan3::structure_file_default_traits_rna for more details. As mentioned above, specifying at least one template parameter yourself means that you loose automatic deduction so if you want to read amino acids and want to read from a string stream you need to give all types yourself:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
ACEWACEW
HGEBHHHH
> example
ACEWACEWACEWACEW
HGEBHHHHHGEBHHHH)";
int main()
{
}
A class template that holds a choice of seqan3::field.
Definition record.hpp:128
Type that contains multiple types.
Definition type_list.hpp:29
Provides seqan3::type_list.
Reading record-wise
You can iterate over this file record-wise:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
ACEWACEW
HGEBHHHH
> example
ACEWACEWACEWACEW
HGEBHHHHHGEBHHHH)";
int main()
{
using structure_file_input_t =
for (auto & rec : fin)
{
}
}
Provides seqan3::debug_stream and related types.
auto const to_char
A view that calls seqan3::to_char() on each element in the input range.
Definition to_char.hpp:63
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
Provides seqan3::views::to_char.
In the above example, rec has the type seqan3::structure_file_input::record_type which is a specialisation of seqan3::record and behaves like a std::tuple (that's why we can access it via get). Instead of using the seqan3::field based interface on the record, you could also use std::get<0>
or even std::get<rna4_vector>
to retrieve the sequence, but it is not recommended, because it is more error-prone.
Note: It is important to write auto &
and not just auto
, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
> example
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))... (-3.71))";
int main()
{
using record_type = typename decltype(fin)::record_type;
for (auto & rec : fin)
records.push_back(
std::move(rec));
}
SeqAn specific customisations in the standard namespace.
Reading record-wise (decomposed records)
Instead of using get
on the record, you can also use structured bindings to decompose the record into its elements:
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
ACEWACEW
HGEBHHHH
> example
ACEWACEWACEWACEW
HGEBHHHHHGEBHHHH)";
int main()
{
using structure_file_input_t =
for (auto & [seq, id, structure] : fin)
{
}
}
@ structure
Fixed interactions, usually a string of structure alphabet characters.
@ seq
The "sequence", usually a range of nucleotides or amino acids.
constexpr auto const & get(configuration< configs_t... > const &config) noexcept
This is an overloaded member function, provided for convenience. It differs from the above function o...
Definition configuration.hpp:415
In this case you immediately get the three elements of the tuple: seq
of seqan3::structure_file_input::seq_type, id
of seqan3::structure_file_input::id_type and structure
of seqan3::structure_file_input::structure_type. But beware: with structured bindings you do need to get the order of elements correctly!
Reading record-wise (custom fields)
If you want to skip specific fields from the record you can pass a non-empty fields trait object to the structure_file_input constructor to select the fields that should be read from the input. For example to choose a combined field for SEQ and STRUCTURE (see above). Or to never actually read the STRUCTURE, if you don't need it. The following snippets demonstrate the usage of such a fields trait object.
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
> example
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))... (-3.71))";
int main()
{
for (auto & [id, struc_seq] : fin)
{
<< '\n';
}
}
Provides seqan3::views::elements.
When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids
parameter) are ignored.
Views on files
Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:
#include <ranges>
auto input = R"(> S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
> example
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))... (-3.71))";
int main()
{
auto minimum_length5_filter = std::views::filter(
[](auto const & rec)
{
return std::ranges::size(rec.sequence()) >= 5;
});
for (auto & rec : fin | minimum_length5_filter)
}
constexpr auto to_char
Return the char representation of an alphabet object.
Definition alphabet/concept.hpp:386
The generic concept for a (biological) sequence.
The main SeqAn3 namespace.
Definition aligned_sequence_concept.hpp:29
End of file
You can check whether a file is at end by comparing begin() and end() (if they are the same, the file is at end).
Formats
Currently, the only implemented format is seqan3::format_vienna. More formats will follow soon.