Learning Objective:
Learning Objective:
You will get an overview of how to read and write SAM/BAM files. This tutorial is a walk-through with links into the API documentation and is also meant as a source for copy-and-paste code.
Introduction
SAM files are used to store pairwise alignments between two (biological) sequences. There are also other output formats, like BLAST, which can store sequence alignments, but in this tutorial we will focus on SAM/BAM files. In addition to the alignment, these formats store information such as the start positions or mapping qualities. SAM files are a little more complex than sequence files but the basic design is the same. If you are new to SeqAn, we strongly recommend completing the tutorial Sequence File Input and Output first.
SAM/BAM file formats
SAM format
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section (see the official SAM specifications).
Here is an example of a SAM file:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 5M * 0 0 TAGGC *
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
The following table summarises the columns of a SAM file:
# | SAM Column ID | Description |
1 | QNAME | Query template NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost mapping POSition |
5 | MAPQ | MAPping Quality |
6 | CIGAR | CIGAR string |
7 | RNEXT | Reference name of the mate/next read |
8 | PNEXT | Position of the mate/next read |
9 | TLEN | observed Template LENgth |
10 | SEQ | segment SEQuence |
11 | QUAL | ASCII of Phred-scaled base QUALity+33 |
If you want to read more about the SAM format, take a look at the official specifications.
BAM format
BAM is the binary format version of SAM. It provides the same data as the SAM format with negligible and subtle differences in most use cases.
SAM file fields
To make things clearer, here is the table of SAM columns and the corresponding fields of a SAM file record:
SAM files provide following additional fields:
File extensions
The formerly introduced formats can be identified by the following file name extensions (this is important for automatic format detection from a file name as you will learn in the next section).
File Format | File Extensions |
SAM | .sam |
BAM | .bam |
You can access and modify the valid file extensions via the file_extension
member variable in a format tag:
int main()
{
}
Provides seqan3::debug_stream and related types.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
Meta-header for the IO / SAM File submodule .
Reading SAM files
Before we start, you should copy and paste this example file into a file location of your choice (we use the current path in the examples, so make sure you adjust your path).
- Attention
- Make sure the file you copied is tab delimited!
Construction
The construction works analogously to sequence files by passing a file name, in which case, all template parameters are automatically deduced (by the file name extension). Or you can pass a stream (e.g. std::cin or std::stringstream), but then you need to know your format beforehand:
int main()
{
return 0;
}
T current_path(T... args)
Accessing individual record members
You can access a record member like this:
int main()
{
for (auto && record : fin)
{
}
}
See seqan3::sam_record for all data accessors.
Assignment 1: Accumulating mapping qualities
Let's assume we want to compute the average mapping quality of a SAM file.
For this purpose, write a small program that
Use the following file to test your program:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
It should output:
Solution
#include <ranges>
int main()
{
double sum{};
std::ranges::for_each(fin,
[&sum, &count](auto & record)
{
sum += record.mapping_quality();
});
}
constexpr ptrdiff_t count
Count the occurrences of a type in a pack.
Definition type_pack/traits.hpp:164
Alignment representation in SAM/BAM files
The SAM format is the common output format of read mappers where you align short read sequences to one or more large reference sequences. In fact, the SAM format stores those alignment information only partially: It does not store the reference sequence but only the query/read sequence and a CIGAR string representing the alignment based on the read.
Take this SAM record as an example:
r003 73 ref 3 17 1M1D4M * 0 0 TAGGC *
The record gives you the following information: A read with name r003
has been mapped to a reference with name ref
at position 3
(in the reference, counting from 1) with a quality of 17
(Phred scaled). The flag has a value of 73
which indicates that the read is paired, the first in pair, but the mate is unmapped (see this website for a nice explanation of SAM flags). Fields set to 0
or *
indicate empty fields and contain no valuable information.
The cigar string is 1M1D4M
which represents the following alignment:
1 2 3 4 5 6 7 8 9 ...
ref N N N N N N N N N ...
read T - A G G C
where the reference sequence is not known (represented by N
). You will learn in the next section how to handle additional reference sequence information.
If you want to read up more about cigar strings, take a look at the SAM specifications or the SAMtools paper.
Reading the CIGAR string
By default, the seqan3::sam_file_input will always read the seqan3::sam_record::cigar_sequence and store it into a std::vector<seqan3::cigar>:
int main()
{
for (auto & record : fin)
seqan3::debug_stream << record.cigar_sequence() <<
'\n';
}
The main SeqAn3 namespace.
Definition aligned_sequence_concept.hpp:29
Transforming the CIGAR information into an alignment
In SeqAn, we represent an alignment as a tuple of two seqan3::aligned_sequence
s.
The conversion from a CIGAR string to an alignment can be done with the function seqan3::alignment_from_cigar
. You need to pass the reference sequence with the position the read was aligned to and the read sequence:
auto sam_file_raw = R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)";
int main()
{
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
for (auto && rec : fin)
{
alignment_from_cigar(rec.cigar_sequence(), reference, rec.reference_position().value(), rec.sequence());
}
}
Provides the function seqan3::alignment_from_cigar.
auto alignment_from_cigar(std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query)
Construct an alignment from a CIGAR string and the corresponding sequences.
Definition alignment_from_cigar.hpp:84
@ alignment
The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.
The SeqAn namespace for literals.
The code will print the following:
(ACT-,C-GT)
(CTGATCGAG,AGGCTGN-A)
(T-G-A-TC,G-AGTA-T)
Assignment 2: Combining sequence and alignment files
Read the following reference sequence FASTA file (see the sequence file tutorial if you need a reminder):
>chr1
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGG
>chr2
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGTAATGGCATCAAAATCGGCATG
Then read the following SAM file while providing the reference sequence information.
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:45
@SQ SN:chr2 LN:66
r001 99 chr1 7 60 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 chr1 9 60 5S6M * 0 0 GCCTAAGCTAA *
r004 0 chr2 16 60 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 chr2 18 10 5M * 0 0 TAGGC *
Only use
With that information do the following:
- Filter the alignment records and only take those with a mapping quality >= 30. (Take a look at the tutorial Reading paired-end reads for a reminder about using views on files)
- For the resulting alignments, print which read was mapped against which reference id and the number of
seqan3::gap
s in each sequence (aligned reference and read sequence).
Hint
Given your reference sequences, you need to know which reference sequence to use for the alignment. For this purpose, you can access record.reference_id()
which gives you the position of the respective reference sequence.
Your program should print the following:
r001 mapped against 0 with 1 gaps in the read sequence and 2 gaps in the reference sequence.
r003 mapped against 0 with 0 gaps in the read sequence and 0 gaps in the reference sequence.
r004 mapped against 1 with 14 gaps in the read sequence and 0 gaps in the reference sequence.
Solution
#include <ranges>
int main()
{
for (
auto &&
record : reference_file)
{
reference_sequences.push_back(std::move(
record.sequence()));
}
auto mapq_filter = std::views::filter(
{
return record.mapping_quality() >= 30;
});
for (
auto &
record : mapping_file | mapq_filter)
{
reference_sequences[
record.reference_id().value()],
record.reference_position().value(),
size_t sum_reference{};
++sum_reference;
size_t sum_read = std::ranges::count(std::get<1>(alignment),
seqan3::gap{});
<< (reference_id ?
std::to_string(reference_id.value()) :
"unknown reference") <<
" with "
<< sum_read << " gaps in the read sequence and " << sum_reference
<< " gaps in the reference sequence.\n";
}
}
The alphabet of a gap character '-'.
Definition gap.hpp:39
Provides seqan3::dna5, container aliases and string literals.
constexpr auto const & get(configuration< configs_t... > const &config) noexcept
This is an overloaded member function, provided for convenience. It differs from the above function o...
Definition configuration.hpp:415
SeqAn specific customisations in the standard namespace.
Provides the seqan3::record template and the seqan3::field enum.
The class template that file records are based on; behaves like a std::tuple.
Definition record.hpp:193
Writing alignment files
Writing records
When writing a SAM file without any further specifications, the default file assumes that all fields are provided. Since those are quite a lot for alignment files, we usually want to write only a subset of the data stored in the SAM format and default the rest.
For this purpose, you can use the seqan3::sam_record to write out a partial record.
int main()
{
record.sequence() =
"ACGTACGT"_dna5;
record.cigar_sequence() = {{4,
'M'_cigar_operation},
{2, 'I'_cigar_operation},
{2, 'M'_cigar_operation},
{2, 'D'_cigar_operation}};
}
A class for writing SAM files, both SAM and its binary representation BAM are supported.
Definition io/sam_file/output.hpp:74
The record type of seqan3::sam_file_input.
Definition sam_file/record.hpp:29
A class template that holds a choice of seqan3::field.
Definition record.hpp:128
Type that contains multiple types.
Definition type_list.hpp:29
Note that this only works because in the SAM format all fields are optional. If we provide fewer fields when writing, default values are written.
Assignment 3: Writing id and sequence information
Create a small program that writes the following unmapped (see seqan3::sam_flag) read ids and sequences:
read1: ACGATCGACTAGCTACGATCAGCTAGCAG
read2: AGAAAGAGCGAGGCTATTTTAGCGAGTTA
Your ids can be of type std::string
and your sequences of type std::vector<seqan3::dna4>
.
Your resulting SAM file should look like this:
read1 4 * 0 0 * * 0 0 ACGATCGACTAGCTACGATCAGCTAGCAG *
read2 4 * 0 0 * * 0 0 AGAAAGAGCGAGGCTATTTTAGCGAGTTA *
Solution
int main()
{
"AGAAAGAGCGAGGCTATTTTAGCGAGTTA"_dna4};
for (size_t i = 0; i < ids.size(); ++i)
{
}
}
Provides seqan3::dna4, container aliases and string literals.
sam_flag
An enum flag that describes the properties of an aligned read (given as a SAM record).
Definition sam_flag.hpp:76
@ unmapped
The read is not mapped to a reference (unaligned).