Class SingleBarcodeDistanceMetric
-
Constructor Summary
ConstructorsConstructorDescriptionSingleBarcodeDistanceMetric
(byte[] barcodeBases, byte[] readBases, byte[] readQualities, int minimumBaseQuality, int maximalInterestingDistance) -
Method Summary
Modifier and TypeMethodDescriptionint
+++lifted from Commons Lang Text +++int
int
Deprecated.int
Similar to Hamming distance but this version doesn't penalize matching bases with low quality for the read.
-
Constructor Details
-
SingleBarcodeDistanceMetric
public SingleBarcodeDistanceMetric(byte[] barcodeBases, byte[] readBases, byte[] readQualities, int minimumBaseQuality, int maximalInterestingDistance)
-
-
Method Details
-
hammingDistance
public int hammingDistance() -
leniantHammingDistance
Deprecated. -
lenientHammingDistance
public int lenientHammingDistance()Similar to Hamming distance but this version doesn't penalize matching bases with low quality for the read.- Returns:
- the edit distance between the barcode(s) and the read(s)
-
freeDistance
public int freeDistance()+++lifted from Commons Lang Text +++Modified to specific to comparing barcodes to reads. This means ignoring insertions or deletions in the last position as it would amount to double counting otherwise. Based on https://www.pnas.org/content/115/27/E6217
Also, in case of passing the threshold this algorithm has been modified to return maximalInterestingDistance + 1
This implementation only computes the distance if it is less than or equal to the threshold value, returning maximalInterestingDistance + 1 otherwise. The advantage is performance: unbounded distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only computing a diagonal stripe of width 2k + 1 of the cost table. It is also possible to use this to compute the unbounded Levenshtein distance by starting the threshold at 1 and doubling each time until the distance is found; this is O(dm), where d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe eg. p[] = |#|#|#|* d[] = *|#|#|#| We must ignore the entry to the leftRev of the leftmost member We must ignore the entry above the rightmost member. * As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1. In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5 1 |#|#| | | | 2 |#|#|#| | | 3 | |#|#|#| | 4 | | |#|#|#| 5 | | | |#|#|
in addition, we terminate the calculation early and return threshold +1 if we detect that there is no way to get to the bottom right corner without going over the thershold. This implementation decreases memory usage by using two single-dimensional arrays and swapping them back and forth instead of allocating an entire n by m matrix. This requires a few minor changes, such as immediately returning when it's detected that the stripe has run off the matrix and initially filling the arrays with large values so that entries we don't compute are ignored. See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
-