Consider the following example, based on a problem in a real dataset, running on 64bit Linux:
$ mrfast --version
mrFAST 2.5.0.0 with FastHASH
$ ls
my_read.fastq my_ref.fasta
$ more my_ref.fasta
mitochondrial_fragment
ATTCTTTTTTTAGGTGGTGTGTGGTTAGGTATGGGGTTAGGGGAGTGGCAAAGAGAAGTG
TTTATTAAACATTCTTATGGCCGTAGATAGCATATCGATTATACGAGACCTTCGTAAGAT
CAATCCCCACTAGCATTGCTCATACAGGTTAACTCAATAGGAGGAGCTGGGGTAGAACGT
ATCTAGTTCGGGGGTAACCGCAGTTCAATGAAAGTGACGACGTCGGATGGAACAAACTTA
ATACCACCAGTTGTGCTAACGATTGTTATCTCAATCTATCCCAACAGGCCCCCAGGTAGT
GATGAGTGGTGGAATGGTACAGGGTACCAGTGGGTGAAGAGCGTCACGAACCAGGGAATA
CGGAGTACAGAGTTGAGCGCCCGGGGCTCCGCCCCCGGCTTTTATAGCGCGAGACGTGGT
CAGTCGATTCAGCGTTAGGTTTTAAACTCCTTTGGCAAAGATTGATTCTAGCGATCCAGA
GACCCTGCCTGGCATAAAAGTCTTTATTAGCACCAGTAGGTTCAATAAGGTAGTAGTCCA
ATAGAATGGAAAACTCGAGATCTAATCTCTCGATTTCCTAGTGTCATGGAAATCAGCCAG
GTTCTCTTCATCTGCAACAGTAGAAGAAGAAGAGAGGCTAGCGAGAGAGTCTTATGGCGG
AGACGCTAAGGCTTAAATGTAATGTAGATAACCCCTTACGGAACACTTGAGTGCGACGTA
GACTACATAATCCCTCAGGGATATTAGCTCTGCTCGATTAACAATAGCATACTTTGTTAC
ACGGAGTGTATCTGGGGGGAATAATACTAACTTACTTAGCACTATCGCGATGCTACGCAT
TCGCTCTTTCGCTAAATAAGATACGACGATGAGTGGTTGGTGGAGAGAATAACCGATTCT
AACTTGATAATTCGCATGAAATAATTTTTTATTTGTTTTTTTTTTTGCTCTTAATTTTAG
AGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAA
$ more my_read.fastq
@example_read
ATTCAACGGATCCTTTTTATTAGAATAAAAATAAACATCCTCTAAAATTAAGAGCAAACACAAAACAAATAAAAAA
+
BCCB@BBAA@AB=DCBC@=?:;ADDBB@8>DBBAB@9?>BB>?4:=ABA=>><=;>@<2<>@<=919>94>=9:AA
$ samtools faidx my_ref.fasta
$ mrfast --index my_ref.fasta
Generating Index from my_ref.fasta
- mitochondrial_fragment
DONE in 0.23s!
$ mrfast --search my_ref.fasta --seq my_read.fastq -o my_mapping.sam
Sequence length: 76 bp. Error threshold is set to 4 bp.
You can override this value using the -e parameter.
1 sequence is read in 0.00. (0 discarded) [Mem:0.00 M]
-----------------------------------------------------------------------------------------------------------
| Seq. Name | Loading Time | Mapping Time | Memory Usage(M) | Total Mappings Mapped reads |
-----------------------------------------------------------------------------------------------------------
| mitochondrial_fragment | 0.02 | 0.00 | 128.01 | 1 1 |
-----------------------------------------------------------------------------------------------------------
Total: 0.02 0.00
Total Time: 0.02
Total No. of Reads: 1
Total No. of Mappings: 1
Avg No. of locations verified: 0
$ samtools view -b -S my_mapping.sam > my_mapping.bam
[samopen] SAM header is present: 1 sequences.
Line 4, sequence length 76 vs 12 from CIGAR
Parse error at line 4: CIGAR and sequence length are inconsistent
Aborted (core dumped)
$ more my_mapping.sam
@HD VN:1.4 SO:unknown
@SQ SN:mitochondrial_fragment LN:1000
@PG ID:mrFAST PN:mrFAST VN:2.5.0.0
example_read 16 mitochondrial_fragment 925 255 12M * 0 0 TTTTTTATTTGTTTTGTGTTTGCTCTTAATTTTAGAGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAAT
AA:9=>49>919=<@><2<@>;=<>>=ABA=:4?>BB>?9@BABBD>8@BBDDA;:?=@CBCD=BA@AABB@BCCB
NM:i:4 MD:Z:12
Looking at the SAM file from mrfast, samtools is correct: The CIGAR field is 12M, so the read should be 12bp long. However, the SEQ and QUAL fields are both 76bp long (which is the original read length).
Notice the mapping position is 925, assuming all 76bp mapped that would mean it matched the reference from bases 925 to 1000 inclusive (which is right up to the end of the reference).
Last 76 bases of the reference,
TTTTTTATTTGTTTTTTTTTTTGCTCTTAATTTTAGAGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAA
The 76 bases of the read (reverse complemented as per the FLAG)
TTTTTTATTTGTTTTGTGTTTGCTCTTAATTTTAGAGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAAT
By eye there is a reasonable alignment here including run of 55 perfect matches later on. One possible alignment would be:
TTTTTTATTTGTTTTTTTTTTTGCTCTTAATTTTAGAGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAA-
TTTTTTATTTGT-TTTGTGTTTGCTCTTAATTTTAGAGGATGTTTATTTTTATTCTAATAAAAAGGATCCGTTGAAT
That would have CIGAR string 12M1I63M1D (12 + 63 + 1 = 76), which does start with 12M as in mrfast's output, but the position of the insert is arbitrary.
My guess is there the problem is due to the alignment ending (or starting depending on how you view the strand) right at the end of the reference sequence.
known and being worked on in 2.5.0.1
Oh good - any ETA for 2.5.0.1's release?
Surprisingly your SourceForge SVN is empty - where is the repository since I'd like to apply some of the other fixes coming up?
Thanks!
No, we use a private repository
I guess I'll just try again with 2.5.0.1 once it is released, and in the meantime check for and remove these problem entries.
Problem persists in the latest release, mrFAST v2.6.0.2, although now the SAM output contains multiple invalid entries:
Here none of the five suggested alignments have a valid CIGAR string given the sequence is length 76bp.