BWA's sampe step, by default, tries to improve mis-alignments by performing a Smith-Waterman around the area of inconsistent pairs. It uses the insert size distributions of the pairs to decide when to call a pair inconsistent. Basically, if we have a bad library and the insert size distribution does not have a very well defined pick at the desired insert size, bwa may see many more inconsistent pairs than normal, leading to a painful increase in the running time of the process. A good dataset (A lane of Illumina 100bp), under typical hardware, may compute the alignments in 1 day. If the insert size of the pairs in the library is not consistent, that process can take weeks.
With a "healthy" dataset the sampe step may look like this:
[bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] (25, 50, 75) percentile: (142, 179, 232) [infer_isize] low and high boundaries: 100 and 412 for estimating avg and std [infer_isize] inferred external isize from 214690 pairs: 191.612 +/- 65.113 [infer_isize] skewness: 0.911; kurtosis: 0.380; ap_prior: 1.45e-05 [infer_isize] inferred maximum insert size: 643 (6.94 sigma) [bwa_sai2sam_pe_core] time elapses: 20.77 sec [bwa_sai2sam_pe_core] changing coordinates of 4232 alignments. [bwa_sai2sam_pe_core] align unmapped mate... [bwa_paired_sw] 11865 out of 14058 Q17 singletons are mated. [bwa_paired_sw] 524 out of 4808 Q17 discordant pairs are fixed. [bwa_sai2sam_pe_core] time elapses: 9.27 sec [bwa_sai2sam_pe_core] refine gapped alignments... 1.65 sec [bwa_sai2sam_pe_core] print alignments... 2.94 sec [bwa_sai2sam_pe_core] 262144 sequences have been processed. ...
You want to focus on:
[bwa_paired_sw] 524 out of 4808 Q17 discordant pairs are fixed. [bwa_sai2sam_pe_core] time elapses: 9.27 sec
Bwa computed 262144 alignments in approximately 10 seconds. Now look at the output for a bad library (focusing on interesting lines):
[bwa_paired_sw] 1143 out of 56132 Q17 discordant pairs are fixed. [bwa_sai2sam_pe_core] time elapses: 6839.43 sec
That's approximately 690 times slower than the previous execution.
Bwa sampe's step has a flag to disable pair rescuing. But you don't want to
completely disable this feature since it is useful for normal libraries. That
is way I wrote this patch,
which adds a new flag to sampe (-d) that will disable SW only when a certain
threshold has been reached (user defined). Here is the same dataset than
before but using
[bwa_paired_sw] Too many rescue attemps, disabling Smith-Waterman for unmapped mates. [bwa_paired_sw] 1122 out of 1868 Q17 singletons are mated. [bwa_paired_sw] 95 out of 2133 Q17 discordant pairs are fixed. [bwa_sai2sam_pe_core] time elapses: 261.08 sec
Much better. If you want to use this feature fell free to use my bwa fork.