192_725_82_F3 16 ENSMUST00000070533 1 255 5H31M14H * 0 0 GCGGCGGCGGGCGAGCGGGCGCTAGAGTAGG * AS:i:176 NM:i:1 CM:i:5 XX:Z:CcTACTCTAGcGcCCGcTcGCCCGCCGCCGC MD:Z:23g7

#http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output

Default bowtie output

bowtie outputs one alignment per line. Each line is a collection of 8 fields separated by tabs; from left to right, the fields are:

1. Name of read that aligned. Note that the SAM specification disallows whitespace in the read name. If the read name contains any whitespace characters, Bowtie 2 will truncate the name at the first whitespace character. This is similar to the behavior of other tools.
2. Reference strand aligned to, + for forward strand, - for reverse
3. Name of reference sequence where alignment occurs, or numeric ID if no name was provided
4. 0-based offset into the forward reference strand where leftmost character of the alignment occurs
5. Read sequence (reverse-complemented if orientation is -). If the read was in colorspace, then the sequence shown in this column is the sequence of decoded nucleotides, not the original colors. See the Colorspace alignment section for details about decoding. To display colors instead, use the --col-cseq option.
6. ASCII-encoded read qualities (reversed if orientation is -). The encoded quality values are on the Phred scale and the encoding is ASCII-offset by 33 (ASCII char !). If the read was in colorspace, then the qualities shown in this column are the decoded qualities, not the original qualities. See the Colorspace alignment section for details about decoding. To display colors instead, use the --col-cqual option.
7. If -M was specified and the prescribed ceiling was exceeded for this read, this column contains the value of the ceiling, indicating that at least that many valid alignments were found in addition to the one reported.
   Otherwise, this column contains the number of other instances where the same sequence aligned against the same reference characters as were aligned against in the reported alignment. This is not the number of other places the read aligns with the same number of mismatches. The number in this column is generally not a good proxy for that number (e.g., the number in this column may be '0' while the number of other alignments with the same number of mismatches might be large).
8. Comma-separated list of mismatch descriptors. If there are no mismatches in the alignment, this field is empty. A single descriptor has the format offset:reference-base>read-base. The offset is expressed as a 0-based offset from the high-quality (5') end of the read.

SAM bowtie output

Following is a brief description of the SAM format as output by bowtie when the -S/--sam option is specified. For more details, see the SAM format specification.

When -S/--sam is specified, bowtie prints a SAM header with @HD, @SQ and @PG lines. When one or more --sam-RG arguments are specified, bowtie will also print an @RG line that includes all user-specified --sam-RG tokens separated by tabs.

Each subsequent line corresponds to a read or an alignment. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are:

1. Name of read that aligned
2. Sum of all applicable flags. Flags relevant to Bowtie are:
     1    The read is one of a pair
     2    The alignment is one end of a proper paired-end alignment
     4    The read has no reported alignments
     8    The read is one of a pair and has no reported alignments
     16   The alignment is to the reverse reference strand
     32   The other mate in the paired-end alignment is aligned to the reverse reference strand
     64   The read is the first (#1) mate in a pair
     128  The read is the second (#2) mate in a pair
   Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. A paired-end read that aligns to the reverse strand as one end of a proper pair and is the first mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).
3. Name of reference sequence where alignment occurs, or ordinal ID if no name was provided
4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs
5. Mapping quality
6. CIGAR string representation of alignment
7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.
8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.
9. Inferred insert size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.
10. Read sequence (reverse-complemented if aligned to the reverse strand)
11. ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file.
12. Optional fields. Fields are tab-separated. For descriptions of all possible optional fields, see the SAM format specification. bowtie outputs some of these optional fields for each alignment, depending on the type of the alignment:

    NM:i:<N>  Aligned read has an edit distance of <N>.
    CM:i:<N>  Aligned read has an edit distance of <N> in colorspace. This field is present in addition to the NM field in -C/--color mode, but is omitted otherwise.
    MD:Z:<S>  For aligned reads, <S> is a string representation of the mismatched reference bases in the alignment. See SAM format specification for details. For colorspace alignments, <S> describes the decoded nucleotide alignment, not the colorspace alignment.
    XA:i:<N>  Aligned read belongs to stratum <N>. See Strata for definition.
    XM:i:<N>  For a read with no reported alignments, <N> is 0 if the read had no alignments. If -m was specified and the read's alignments were suppressed because the -m ceiling was exceeded, <N> equals the -m ceiling + 1, to indicate that there were at least that many valid alignments (but all were suppressed). In -M mode, if the alignment was randomly selected because the -M ceiling was exceeded, <N> equals the -M ceiling + 1, to indicate that there were at least that many valid alignments (of which one was reported at random).

MD tag, from the SAM format specification:

MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
String for mismatching positions. The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.

#old samtools has wrong SAM format for minus!
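The flag sums and the ‘10A5^AC6’ walk-through quoted from the manual above can be checked with a few lines of Python. This is only a sketch: FLAG_BITS, decode_flag, tokenize_md and md_reference_length are names made up for illustration and are not part of bowtie or samtools. The flag meanings are the ones listed in the manual text, and lowercase letters are accepted in MD only because some records in these notes carry them.

```python
import re

# Flag bits exactly as listed in the bowtie manual excerpt above.
FLAG_BITS = {
    1: "read is one of a pair",
    2: "alignment is one end of a proper paired-end alignment",
    4: "read has no reported alignments",
    8: "read is one of a pair and has no reported alignments",
    16: "alignment is to the reverse reference strand",
    32: "other mate is aligned to the reverse reference strand",
    64: "read is the first (#1) mate in a pair",
    128: "read is the second (#2) mate in a pair",
}


def decode_flag(flag):
    """Return the bits that are set in a SAM FLAG value, lowest bit first."""
    return [bit for bit in sorted(FLAG_BITS) if flag & bit]


# One alternative per MD element: match count, ^deletion, or single mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def tokenize_md(md):
    """Split an MD string into (kind, value) tuples (hypothetical helper)."""
    tokens = []
    for m in MD_TOKEN.finditer(md):
        if m.group(1) is not None:
            tokens.append(("match", int(m.group(1))))
        elif m.group(2) is not None:
            tokens.append(("deletion", m.group(2)))
        else:
            tokens.append(("mismatch", m.group(3)))
    return tokens


def md_reference_length(md):
    """Number of reference bases covered by an MD string."""
    total = 0
    for kind, value in tokenize_md(md):
        total += value if kind == "match" else len(value)
    return total
```

For the first record in these notes, md_reference_length("23g7") gives 31, which agrees with the 31M in its CIGAR string.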
#new samtools:
/data/results/tools/samtools/samtools-1.3/samtools calmd -b ../raw/6hrep3F-4p3.sam.bam /data/results/reference/mmu/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa > foo.bam 2> foo
[bam_fillmd1] different MD for read '5806664-1': '8A10' -> '10T8'

samtools view foo.bam |grep 295988-1
295988-1 16 chr10 12175052 0 33M * 0 0 AACATCAACAACAACAACAACAACAACAAGGCG qqq!!qqqqqqqqqqqqqqqqqqqqqqqqqqqq NM:i:1 X1:i:3 MD:Z:4a28
sam 295988-1 16 chr10 12175052 0 33M * 0 0 AACATCAACAACAACAACAACAACAACAAGGCG qqq!!qqqqqqqqqqqqqqqqqqqqqqqqqqqq NM:i:1 X1:i:3 MD:Z:28T4
AACATCAACAACAACAACAACAACAACAAGGCG
    A
295988-1 - chr10 12175051 AACATCAACAACAACAACAACAACAACAAGGCG qqq!!qqqqqqqqqqqqqqqqqqqqqqqqqqqq 2 28:A>T
>mm9_dna range=chr10:12175052-12175084 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AACAACAACAACAACAACAACAACAACAAGGCG

grep NM:i:1 /data/images/proton/DKlab/mr/parclip/shrimp/IFN-15mMm.bam.sam | head -33 | tail
630_3509_3409_F3 0 ENSMUST00000070533 235 255 15H25M10H * 0 0 GGCTCTGGGCAAGGACTGGCTCCAG * AS:i:156 NM:i:1 CS:Z:T023302010303111123222300310302121032201223..3021.. CM:i:3 XX:Z:GgCTCTgGGCAAgGACTGGCTCCAG MD:Z:7T17

bt 5096016-1 - chr6 34385835 TTGTGTTGTTGTTGTTGTTGTTGTTGTTGTGAATA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq 0 32:T>G,33:G>T
samtools view foo.bam | grep 5096016-1
5096016-1 16 chr6 34385836 25 35M * 0 0 TTGTGTTGTTGTTGTTGTTGTTGTTGTTGTGAATA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq X1:i:1 NM:i:2 MD:Z:1g0t32
5096016-1 16 chr6 34385836 25 35M * 0 0 TTGTGTTGTTGTTGTTGTTGTTGTTGTTGTGAATA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq NM:i:1 X1:i:1 MD:Z:32G0T1
TTGTGTTGTTGTTGTTGTTGTTGTTGTTGTGAATA
TGT
>mm9_dna range=chr6:34385836-34385869 5'pad=0 3'pad=0 strand=+ repeatMasking=none
TGTTGTTGTTGTTGTTGTTGTTGTTGTTGTGAAT

bt 1030429-1 + chr13 62908934 TACACCACCACCACCACCACCAACACCACCACCAA qqqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqqq 0 22:C>A
sam 1030429-1 0 chr13 62908935 25 35M * 0 0 TACACCACCACCACCACCACCAACACCACCACCAA qqqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqqq NM:i:1 X1:i:1 MD:Z:22A12

sam 1489890-1 16 chr12 53162620 25 35M * 0 0 AACCACAGTTGTCGTTGTTGTTGTTGTTGTTGTTG qqqqqqqqqqq!!qqqqqqqqqqqqqqqqqqqqqq NM:i:1 X1:i:1 MD:Z:22C12
bt 1489890-1 - chr12 53162619 AACCACAGTTGTCGTTGTTGTTGTTGTTGTTGTTG qqqqqqqqqqq!!qqqqqqqqqqqqqqqqqqqqqq 0 22:T>C

bt 668_188_2536_F3 - ENSMUST00000160944-chr1-+-3044314-3044814 91 TACAAGGCCTAATGGTGATTCCTACAG IIIIIIIIIIIIIIIIIIIIIIIIIII 0 8:C>T
sam 668_188_2536_F3 16 ENSMUST00000160944-chr1-+-3044314-3044814 92 255 18H27M5H * 0 0 TACAAGGCCTAATGGTGATTCCTACAG * AS:i:196 NM:i:1 CS:Z:T10110321132120321001303203020113102330.010..3131.. CM:i:2 XX:Z:CTGTAGgAATCAcCATTAGGCCTTGTA MD:Z:18C8
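
The bt/sam record pairs above suggest a simple rule for turning a bowtie mismatch list into a forward-strand MD string. The sketch below assumes, based only on those records, that the descriptor offset counts from the 5' end of the original read, while the reference base in the descriptor is already given on the forward strand. descriptors_to_md is a hypothetical helper, not a bowtie or samtools function, and it handles mismatches only (no indels, no clipping).

```python
def descriptors_to_md(strand, read_len, descriptors):
    """Build a forward-strand MD string from bowtie mismatch descriptors.

    strand      -- '+' or '-' (field 2 of the default bowtie output)
    read_len    -- length of the aligned read sequence
    descriptors -- field 8, e.g. '32:T>G,33:G>T' (empty string if no mismatch)
    """
    mismatches = []
    for item in descriptors.split(","):
        if not item:
            continue  # empty field: the alignment has no mismatches
        offset_text, bases = item.split(":")
        ref_base = bases.split(">")[0]
        offset = int(offset_text)
        if strand == "-":
            # Assumption: offsets are relative to the read's 5' end, so on the
            # reverse strand they must be flipped to forward coordinates.
            offset = read_len - 1 - offset
        mismatches.append((offset, ref_base.upper()))

    mismatches.sort()
    parts = []
    prev = 0  # first forward position not yet covered by the MD string
    for pos, ref_base in mismatches:
        parts.append(str(pos - prev))  # run of matches before this mismatch
        parts.append(ref_base)         # MD records the reference base
        prev = pos + 1
    parts.append(str(read_len - prev))  # trailing run of matches
    return "".join(parts)
```

With the records above, descriptors_to_md("-", 33, "28:A>T") gives 4A28 and descriptors_to_md("-", 35, "32:T>G,33:G>T") gives 1G0T32, which match the calmd output for 295988-1 and 5096016-1 apart from letter case. The SAM position is the 0-based bowtie offset plus 1 (12175051 becomes 12175052).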