SeqAn3 3.3.0
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
Writing Sequence Files

Sequence files are the most generic and common biological files. Well-known formats include FASTA and FASTQ, but some may also be interested in treating SAM or BAM files as sequence files, discarding the alignment.

The Sequence file abstraction supports writing three different fields:

  1. seqan3::field::seq
  2. seqan3::field::id
  3. seqan3::field::qual

The member functions take any and either of these fields. If the field ID of an argument cannot be deduced, it is assumed to correspond to the field ID of the respective template parameter.

Construction and specialisation

This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cout or std::ostringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your output file has a certain format.

In most cases the template parameters are deduced completely automatically:

#include <filesystem>
int main()
{
auto fasta_file = std::filesystem::current_path() / "my.fasta";
// FASTA format detected, std::ofstream opened for file
}
A class for writing sequence files, e.g. FASTA, FASTQ ...
Definition io/sequence_file/output.hpp:69
T current_path(T... args)
Provides seqan3::sequence_file_output and corresponding traits classes.

Writing to std::cout:

int main()
{
// ^ no need to specify the template arguments
}
The FASTA format.
Definition format_fasta.hpp:80

Note that this is not the same as writing sequence_file_output<> (with angle brackets). In the latter case they are explicitly set to their default values, in the former case automatic deduction happens which chooses different parameters depending on the constructor arguments. Prefer deduction over explicit defaults.

Writing record-wise

You can iterate over this file record-wise:

#include <sstream>
#include <string>
#include <tuple>
int main()
{
using namespace seqan3::literals;
for (int i = 0; i < 5; ++i) // ...
{
std::string id{"test_id"};
seqan3::dna5_vector seq{"ACGT"_dna5};
// ...
fout.emplace_back(seq, id); // as individual variables
// or:
fout.push_back(std::tie(seq, id)); // as a tuple
}
}
Provides seqan3::dna5, container aliases and string literals.
@ seq
The "sequence", usually a range of nucleotides or amino acids.
The SeqAn namespace for literals.
T tie(T... args)

The easiest way to write to a sequence file is to use the push_back() or emplace_back() member functions. These work similarly to how they work on a std::vector. If you pass a tuple to push_back() or give arguments to emplace_back() the seqan3::field ID of the i-th tuple-element/argument is assumed to be the i-th value of selected_field_ids, i.e. by default the first is assumed to be seqan3::field::seq, the second seqan3::field::id and the third one seqan3::field::qual. You may give less fields than are selected if the actual format you are writing to can cope with less (e.g. for FASTA it is sufficient to write seqan3::field::seq and seqan3::field::id, even if selected_field_ids also contains seqan3::field::qual at the third position). You may also use the output file's iterator for writing, however, this rarely provides an advantage.

Writing record-wise (custom fields)

If you want to change the order of the parameters, you can pass a non-empty fields trait object to the sequence_file_output constructor to select the fields that are used for interpreting the arguments. The following snippets demonstrates the usage of such a fields trait object.

#include <sstream>
#include <string>
#include <tuple>
#include <vector>
int main()
{
using namespace seqan3::literals;
for (int i = 0; i < 5; i++)
{
std::string id{"test_id"};
// vector of combined data structure:
{'A'_dna5, '1'_phred42},
{'C'_dna5, '3'_phred42}};
auto view_on_seq = seqan3::views::elements<0>(seq_qual);
auto view_on_qual = seqan3::views::elements<1>(seq_qual);
// ...
// Note that the order of the arguments is different from the default `seq, id, qual`,
// because you specified that ID should be first in the fields template argument.
fout.emplace_back(id, view_on_seq, view_on_qual);
// or:
fout.push_back(std::tie(id, view_on_seq, view_on_qual));
}
}
The FASTQ format.
Definition format_fastq.hpp:80
Provides seqan3::views::elements.
Provides seqan3::phred42 quality scores.
Provides quality alphabet composites.
A class template that holds a choice of seqan3::field.
Definition record.hpp:128

A different way of passing custom fields to the file is to pass a seqan3::record – instead of a tuple – to push_back(). The seqan3::record clearly indicates which of its elements has which seqan3::field ID so the file will use that information instead of the template argument. This is especially handy when reading from one file and writing to another, because you don't have to configure the output file to match the input file, it will just work:

#include <sstream>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
seqan3::format_fastq{}}; // doesn't have to match the configuration
for (auto & r : fin)
{
if (true) // r fulfills some criterium
fout.push_back(r);
}
}
A class for reading sequence files, e.g. FASTA, FASTQ ...
Definition sequence_file/input.hpp:210
Provides seqan3::sequence_file_input and corresponding traits classes.

Writing record-wise in batches

You can write multiple records at once, by assigning to the file:

#include <sstream>
#include <string>
#include <tuple>
#include <vector>
int main()
{
using namespace seqan3::literals;
{"NATA"_dna5, "2nd"},
{"GATA"_dna5, "Third"}}; // a range of "records"
fout = range;
// the same as:
range | fout;
}

File I/O pipelines

Record-wise writing in batches also works for writing from input files directly to output files, because input files are also input ranges in SeqAn:

#include <filesystem>
#include <sstream>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
// file format conversion in one line:
// with seqan3::sequence_file_output as a variable:
fout = fin;
// or in pipe notation:
}

This can be combined with file-based views to create I/O pipelines:

# include <iterator>
# include <ranges>
# include <sstream>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
// minimum_average_quality_filter and minimum_sequence_length_filter need to be implemented first
auto minimum_sequence_length_filter = std::views::filter(
[](auto rec)
{
return std::ranges::distance(rec.sequence()) >= 50;
});
auto minimum_average_quality_filter = std::views::filter(
[](auto const & record)
{
double qual_sum{0}; // summation of the qualities
for (auto chr : record.base_qualities())
qual_sum += seqan3::to_phred(chr);
// check if average quality is greater than 20.
return qual_sum / (std::ranges::distance(record.base_qualities())) >= 20;
});
input_file | minimum_average_quality_filter | minimum_sequence_length_filter | std::views::take(3)
}
constexpr auto to_phred
The public getter function for the Phred representation of a quality score.
Definition alphabet/quality/concept.hpp:100
The main SeqAn3 namespace.
Definition aligned_sequence_concept.hpp:29
The class template that file records are based on; behaves like a std::tuple.
Definition record.hpp:193

Column-based writing

The record-based interface treats the file as a range of tuples (the records), but in certain situations you might have the data as columns, i.e. a tuple-of-ranges, instead of range-of-tuples. You can use column-based writing in that case, it uses operator=() and seqan3::views::zip():

#include <sstream>
#include <string>
using namespace seqan3::literals;
struct data_storage_t
{
seqan3::concatenated_sequences<seqan3::dna4_vector> sequences{"ACGT"_dna4, "AAA"_dna4};
};
int main()
{
data_storage_t data_storage{};
// ... in your file writing function:
fout = seqan3::views::zip(data_storage.sequences, data_storage.ids);
}
Container that stores sequences concatenated internally.
Definition concatenated_sequences.hpp:89
Provides seqan3::concatenated_sequences.
Provides seqan3::dna4, container aliases and string literals.
seqan::std::views::zip zip
A view adaptor that takes several views and returns tuple-like values from every i-th element of each...
Definition zip.hpp:27
Provides seqan3::views::zip.

Formats

We currently support writing the following formats: