Bwa patch to deal with poor insert size distributions

October 26, 2012

BWA's sampe step, by default, tries to improve mis-alignments by performing a Smith-Waterman around the area of inconsistent pairs. It uses the insert size distributions of the pairs to decide when to call a pair inconsistent. Basically, if we have a bad library and the insert size distribution does not have a very well defined pick at the desired insert size, bwa may see many more inconsistent pairs than normal, leading to a painful increase in the running time of the process. A good dataset (A lane of Illumina 100bp), under typical hardware, may compute the alignments in 1 day. If the insert size of the pairs in the library is not consistent, that process can take weeks.

With a "healthy" dataset the sampe step may look like this:

[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (142, 179, 232)
[infer_isize] low and high boundaries: 100 and 412 for estimating avg and std
[infer_isize] inferred external isize from 214690 pairs: 191.612 +/- 65.113
[infer_isize] skewness: 0.911; kurtosis: 0.380; ap_prior: 1.45e-05
[infer_isize] inferred maximum insert size: 643 (6.94 sigma)
[bwa_sai2sam_pe_core] time elapses: 20.77 sec
[bwa_sai2sam_pe_core] changing coordinates of 4232 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 11865 out of 14058 Q17 singletons are mated.
[bwa_paired_sw] 524 out of 4808 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 9.27 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 1.65 sec
[bwa_sai2sam_pe_core] print alignments... 2.94 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.

You want to focus on:

[bwa_paired_sw] 524 out of 4808 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 9.27 sec

Bwa computed 262144 alignments in approximately 10 seconds. Now look at the output for a bad library (focusing on interesting lines):

[bwa_paired_sw] 1143 out of 56132 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 6839.43 sec

That's approximately 690 times slower than the previous execution.

Bwa sampe's step has a flag to disable pair rescuing. But you don't want to completely disable this feature since it is useful for normal libraries. That is way I wrote this patch, which adds a new flag to sampe (-d) that will disable SW only when a certain threshold has been reached (user defined). Here is the same dataset than before but using -d 4000:

[bwa_paired_sw] Too many rescue attemps, disabling Smith-Waterman for unmapped mates.
[bwa_paired_sw] 1122 out of 1868 Q17 singletons are mated.
[bwa_paired_sw] 95 out of 2133 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 261.08 sec

Much better. If you want to use this feature fell free to use my bwa fork.