The BAM format is an efficient method for storing and sharing data from modern, highly parallel sequencers. While primarily used for storing alignment information, BAMs can (and frequently do) store unaligned reads as well.
There is a growing number of general-purpose SAM/BAM manipulation programs, including SAMtools, Picard, and Bamtools. This tool is not intended to duplicate the complex suite of tasks those programs perform; it simply extracts raw sequences (with qualities). We envision it being primarily useful to those wishing to duplicate or extend previous analyses.
This program requires the standard GNU development environment (gcc and make), along with the SAMtools source code (included) and the zlib compression library.
Download the distribution and extract it with tar -xzf. Change into the extracted directory and run
1.1.0 (18 August 2010)
Altered default handling of read names
1.0.0 (17 August 2010)
Several assumptions are made about the format of the BAM file:
If these assumptions are correct, the extracted sequence.txt files will contain the same information as the files used to create the BAM file. However, the presentation of the data will differ slightly from the original:
None of these differences affects the actual data, only its representation. The files will not be byte-for-byte identical, but they will contain the same biological data. For example, compare the output of the original sequence files (s_1_1_sequence.txt and s_1_2_sequence.txt) to the same data extracted from a BAM file (s_1_1_extracted.txt and s_1_2_extracted.txt). Notice that the Quality Encoding is different, but the remaining values are identical. Also note (on the Files tab) that the extracted files are smaller than the originals.
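One way to check the "same biological data" claim yourself is to compare the two files record by record after decoding the quality strings to numeric Phred scores. The sketch below is illustrative only and is not part of bam2fastq. It assumes the original files use a Phred+64 encoding and the extracted files use Phred+33; those offsets are assumptions, so adjust them to match your data. Read names are deliberately not compared, since they may be altered during extraction. The function names are hypothetical.

```python
from itertools import zip_longest


def read_fastq(lines):
    """Yield (name, sequence, quality_string) from FASTQ lines, 4 lines per record."""
    rows = [line.rstrip("\n") for line in lines]
    for i in range(0, len(rows) - 3, 4):
        yield rows[i][1:], rows[i + 1], rows[i + 3]


def to_phred(quality, offset):
    """Decode an ASCII quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def same_biological_data(original, extracted,
                         original_offset=64, extracted_offset=33):
    """Return True if both inputs hold the same sequences and Phred scores.

    The default offsets (64 and 33) are assumptions about the two encodings.
    Read names are ignored on purpose.
    """
    pairs = zip_longest(read_fastq(original), read_fastq(extracted))
    for a, b in pairs:
        if a is None or b is None:
            return False  # one input has more records than the other
        if a[1] != b[1]:
            return False  # sequences differ
        if to_phred(a[2], original_offset) != to_phred(b[2], extracted_offset):
            return False  # decoded quality scores differ
    return True
```

Any iterable of lines works as input, so open file handles can be passed directly, for example same_biological_data(open("s_1_1_sequence.txt"), open("s_1_1_extracted.txt")).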
By default, pair names in the BAM file are modified slightly to allow for BAMs that don't quite meet specification.
This behavior can be disabled with the
bam2fastq [options] <bam file>
-o FILENAME, --output FILENAME
% (replaced with the lane number) and
_2 to distinguish PE reads, removed for SE reads). [Default: s_%#_sequence.txt]
-f, --force, --overwrite
--output, overwriting existing files if necessary [Default: exit program rather than overwrite files]
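To illustrate how the --output template might expand, here is a minimal sketch in Python. It is not the program's actual implementation. It assumes, based on the default template s_%#_sequence.txt, that % stands for the lane number and # stands for the mate suffix (_1 or _2 for paired-end reads, nothing for single-end reads). The function name expand_output_name is hypothetical.

```python
def expand_output_name(template, lane, mate=None):
    """Expand an output filename template.

    Assumed placeholders (inferred from the default template):
      %  -> lane number
      #  -> "_1" or "_2" for paired-end reads, empty for single-end reads

    mate is 1 or 2 for paired-end reads, or None for single-end reads.
    """
    suffix = "" if mate is None else f"_{mate}"
    return template.replace("%", str(lane)).replace("#", suffix)


if __name__ == "__main__":
    default = "s_%#_sequence.txt"
    print(expand_output_name(default, lane=1, mate=1))  # s_1_1_sequence.txt
    print(expand_output_name(default, lane=1, mate=2))  # s_1_2_sequence.txt
    print(expand_output_name(default, lane=1))          # s_1_sequence.txt
```

Under these assumptions, the default template produces the same names as the example files mentioned earlier (s_1_1_sequence.txt and s_1_2_sequence.txt) for a paired-end run on lane 1.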