SeqAn3 3.3.0
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
Reading Sequence Files

Sequence files are the most generic and common biological files. Well-known formats include FASTA and FASTQ, but some may also be interested in treating SAM or BAM files as sequence files, discarding the alignment.

The Sequence file abstraction supports reading three different fields:

  1. seqan3::field::seq
  2. seqan3::field::id
  3. seqan3::field::qual

The first three fields are retrieved by default (and in that order). The last field may be selected to have sequence and qualities directly stored in a more memory-efficient combined container. If you select the last field you may not select seqan3::field::seq or seqan3::field::qual.

Construction and specialisation

This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cin or std::istringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your input file has a certain format.

In most cases the template parameters are deduced completely automatically:

#include <filesystem>
int main()
{
using namespace seqan3::literals;
auto fasta_file = std::filesystem::current_path() / "my.fasta";
{
// Create a ./my.fasta file.
seqan3::sequence_file_output fout{fasta_file};
fout.emplace_back("ACGT"_dna4, "TEST1");
fout.emplace_back("AGGCTGA"_dna4, "Test2");
fout.emplace_back("GGAGTATAATATATATATATATAT"_dna4, "Test3");
}
// FASTA with DNA sequences assumed, regular std::ifstream taken as stream
seqan3::sequence_file_input fin{fasta_file};
}
A class for reading sequence files, e.g. FASTA, FASTQ ...
Definition sequence_file/input.hpp:210
A class for writing sequence files, e.g. FASTA, FASTQ ...
Definition io/sequence_file/output.hpp:69
void emplace_back(arg_t &&arg, arg_types &&... args)
Write a record to the file by passing individual fields.
Definition io/sequence_file/output.hpp:339
T current_path(T... args)
Provides seqan3::dna4, container aliases and string literals.
Provides seqan3::sequence_file_output and corresponding traits classes.
The SeqAn namespace for literals.
Provides seqan3::sequence_file_input and corresponding traits classes.

Reading from a std::istringstream:

#include <sstream>
#include <utility>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
// ^ no need to specify the template arguments
}
The FASTA format.
Definition format_fasta.hpp:80

Note that this is not the same as writing sequence_file_input<> (with angle brackets). In the latter case they are explicitly set to their default values, in the former case automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file, sequence_file_input<> would have also worked, but for opening from stream it would not have. In some cases, you do need to specify the arguments, e.g. if you want to read amino acids:

#include <sstream>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
}

You can define your own traits type to further customise the types used by and returned by this class, see seqan3::sequence_file_input_default_traits_dna for more details. As mentioned above, specifying at least one template parameter yourself means that you loose automatic deduction so if you want to read amino acids and want to read from a string stream you need to give all types yourself:

#include <sstream>
// ... input had amino acid sequences
auto input = R"(>TEST1
FQTWE
>Test2
KYRTW
>Test3
EEYQTWEEFARAAEKLYLTDPMKV)";
int main()
{ // Use amino acid traits below
using sequence_file_input_type =
sequence_file_input_type fin{std::istringstream{input}, seqan3::format_fasta{}};
}
A class template that holds a choice of seqan3::field.
Definition record.hpp:128
A traits type that specifies input as amino acids.
Definition sequence_file/input.hpp:170
Type that contains multiple types.
Definition type_list.hpp:29
Provides seqan3::type_list.

Reading record-wise

You can iterate over this file record-wise:

#include <sstream>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
for (auto & record : fin)
{
seqan3::debug_stream << "ID: " << record.id() << '\n';
seqan3::debug_stream << "SEQ: " << record.sequence() << '\n';
// a quality field also exists, but is not printed, because we know it's empty for FASTA files.
}
}
Provides seqan3::debug_stream and related types.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
The class template that file records are based on; behaves like a std::tuple.
Definition record.hpp:193

In the above example, record has the type seqan3::sequence_file_input::record_type which is seqan3::sequence_record. Note: It is important to write auto & and not just auto, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:

#include <sstream>
#include <utility>
#include <vector>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
using record_type = typename decltype(fin)::record_type;
for (auto & rec : fin)
records.push_back(std::move(rec));
}
SeqAn specific customisations in the standard namespace.

Reading record-wise (decomposed records)

Instead of using member accessor on the record, you can also use structured bindings to decompose the record into its elements:

auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
for (auto & [sequence, id, quality] : fin)
{
seqan3::debug_stream << "ID: " << id << '\n';
seqan3::debug_stream << "SEQ: " << sequence << '\n';
seqan3::debug_stream << "EMPTY QUAL." << quality << '\n'; // quality is empty for FASTA files
}
}
The generic concept for a (biological) sequence.

In this case you immediately get the two elements of the tuple: sequence of seqan3::sequence_file_input::sequence_type and id of seqan3::sequence_file_input::id_type. But beware: with structured bindings you do need to get the order of elements correctly!

Reading record-wise (custom fields)

If you want to skip specific fields from the record you can pass a non-empty fields trait object to the sequence_file_input constructor to select the fields that should be read from the input. For example to choose a combined field for SEQ and QUAL (see above). Or to never actually read the QUAL, if you don't need it. The following snippets demonstrate the usage of such a fields trait object.

#include <sstream>
#include <vector>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
using seqan3::get;
for (auto & [id, seq, qual] : fin) // the order is now different, "id" comes first, because it was specified first
{
seqan3::debug_stream << "ID: " << id << '\n';
seqan3::debug_stream << "SEQ: " << seq << '\n';
seqan3::debug_stream << "QUAL: " << qual << '\n';
}
}
@ seq
The "sequence", usually a range of nucleotides or amino acids.
@ qual
The qualities, usually in Phred score notation.
constexpr auto const & get(configuration< configs_t... > const &config) noexcept
This is an overloaded member function, provided for convenience. It differs from the above function o...
Definition configuration.hpp:415

When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids parameter) are ignored.

Views on files

Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:

#include <ranges>
#include <sstream>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
auto minimum_length5_filter = std::views::filter(
[](auto const & rec)
{
return std::ranges::size(rec.sequence()) >= 5;
});
for (auto & rec : fin | minimum_length5_filter) // only record with sequence length >= 5 will "appear"
{
seqan3::debug_stream << "IDs of seq_length >= 5: " << rec.id() << '\n';
// ...
}
}

End of file

You can check whether a file is at end by comparing begin() and end() (if they are the same, the file is at end).

Formats

We currently support reading the following formats: