A New Fast Approximate Multi-String Search Method - Faster Grepping with ugrep
==============================================================================

<p align="right"><i>by Dr. Robert van Engelen, July 16, 2019.<br>Genivia Research Labs<br>Last updated August 28, 2023.</i></p>

<p><a class="github-button" href="https://github.com/Genivia/ugrep" data-count-href="/Genivia/ugrep/stargazers" data-show-count="true" data-count-aria-label="# stargazers on GitHub" aria-label="Star Genivia/ugrep on GitHub">Star</a> <a class="github-button" href="https://github.com/Genivia/ugrep/archive/master.zip" data-icon="octicon-cloud-download" aria-label="Download Genivia/ugrep on GitHub">Download</a> <a style="vertical-align:top;" href="https://github.com/Genivia/ugrep/actions/workflows/c-cpp.yml"><img src="https://github.com/Genivia/ugrep/actions/workflows/c-cpp.yml/badge.svg" alt="build status"></a> <a style="vertical-align:top;" target="_blank" href="https://opensource.org/licenses/BSD-3-Clause" rel="nofollow"><img alt="license" src="https://img.shields.io/badge/license-BSD%203--Clause-blue.svg"></a></p>

Introduction
------------

Fast approximate multi-string matching and search algorithms are critical to boost the performance of search engines and file system search utilities. In this article I present PM-*k*, a new class of algorithms for approximate multi-string matching and searching that I developed in 2019 for the new fast file search utility ugrep. PM-4 is used by ugrep to accelerate regex pattern matching. This article adds technical details to a [video introduction](https://www.youtube.com/watch?v=kA7qZgmfwD8) of the principle of the new method that I presented at the [Performance Summit IV](https://performancesummitiv.splashthat.com) on December 1, 2020. This article also presents a performance benchmark comparison with other grep tools, includes a SIMD implementation with AVX intrinsics, and gives a hardware description of the method.
You can download Genivia's ultra fast [ugrep file search utility](get-ugrep.html) that uses the new method to boost search performance. [Reproducible performance benchmarks](https://github.com/Genivia/ugrep-benchmarks) are posted publicly and updated regularly with new releases. If you are interested in the PM-*k* class of multi-string search methods and would like clarification, want to receive consultation, or if you found a problem, then please [contact us](contact.html) without obligation. Source code included herein is released under the [BSD-3 license.](files/BSD-3.txt)

[![To top](images/go-up.png) To top](#)

The PM-*k* multi-string search method
-------------------------------------

PM-*k* (Predict Match) uses a fixed window of size *k* that slides over the given text to search for potential matches of the string patterns. Consider the following simple example. Our goal is to search for all occurrences of the 7 string patterns `a`, `an`, `the`, `do`, `dog`, `own`, `end` in the text shown below:

`the quick brown fox jumps over the lazy dog`<br>
`^^^         ^^^                ^^^  ^   ^^^`

We ignore shorter matches that are part of longer matches. So `do` is not a match in `dog`, since we want to match `dog`. We also ignore word boundaries in the text. For example, `own` matches part of `brown`. This actually makes the search harder, since we cannot simply inspect and match words between spaces.

Existing state-of-the-art methods are fast, such as [Bitap](https://en.wikipedia.org/wiki/Bitap_algorithm) ("shift-or matching") to find a single matching string in text and [Hyperscan](https://github.com/intel/hyperscan), which essentially uses Bitap "buckets" and hashing to find matches of multiple string patterns. Bitap slides a window over the searched text to predict matches based on the letters it has shifted into the window. The window length of Bitap is the minimum length among all string patterns we search for.
Short Bitap windows generate many false positives. In the worst case the shortest string among all string patterns is one letter long, which severely limits the performance of Bitap. For example, Bitap finds as many as 10 potential match locations in the example text for the string patterns:

`the quick brown fox jumps over the lazy dog`<br>
`^ ^         ^    ^        ^ ^  ^ ^  ^   ^  `

These potential matches marked `^` correspond to the letters with which the patterns start, i.e. `a`, `t`, `d`, `o` and `e` are starting letters. The remaining part of each string pattern is ignored and must be matched separately afterwards. Hyperscan essentially uses Bitap buckets, which means that an additional optimization can be applied to separate the string patterns into different buckets depending on the properties of the string patterns. The number of buckets is bounded by the SIMD architectural constraints of the machine to optimize Hyperscan. However, as a Bitap-based method, having a few short strings among the set of string patterns hampers the performance of Hyperscan.

We can do better than Bitap-based methods. To understand how, consider associating each letter in the text with a letter in the string patterns at offsets, like so:

`the quick brown fox jumps over the lazy dog`<br>
`^^^         ^^^                ^^^  ^   ^^^`<br>
`012         012                012  0   012`

Here we have matching letters at offsets in the strings we search:

- at offset 0 we match one of the letters `a`, `t`, `d`, `o`, `e` of the pattern `a|an|the|do|dog|own|end`<br>
- at offset 1 we match one of the letters `n`, `h`, `o`, `w` of the pattern `an|the|do|dog|own|end`<br>
- at offset 2 we match one of the letters `e`, `g`, `n`, `d` of the pattern `the|dog|own|end`

The trick is to predict matches based on all available information, meaning the letters at word (string pattern) offsets 0, 1 and 2 (and higher for longer words).
To do so, we define a function `predictmatch` that returns `TRUE` if a word (string pattern) possibly matches the text in a window, and otherwise returns `FALSE`. We also define two functions `matchbit` and `acceptbit` that can be implemented as arrays or matrices. The functions take a character `c` and an offset `k`: `matchbit(c, k) = 1` if `word[k] = c` for any word in the set of string patterns, and `acceptbit(c, k) = 1` if any word ends at `k` with `c`. With these two functions, `predictmatch` is defined as follows in pseudo code to predict string pattern matches up to 4 characters long against a sliding window of length 4:

    func predictmatch(window[0:3])
      var c0 = window[0]
      var c1 = window[1]
      var c2 = window[2]
      var c3 = window[3]
      if acceptbit(c0, 0) then return TRUE
      if matchbit(c0, 0) then
        if acceptbit(c1, 1) then return TRUE
        if matchbit(c1, 1) then
          if acceptbit(c2, 2) then return TRUE
          if matchbit(c2, 2) then
            if matchbit(c3, 3) then return TRUE
      return FALSE

We will eliminate control flow and replace it with logical operations on bits. For a window of size 4, we need 8 bits (twice the window size). The 8 bits are ordered as follows, where `!` means the logical complement (the not operator):

    bit7 = ! acceptbit(c0, 0)
    bit6 = ! matchbit(c0, 0)
    bit5 = ! acceptbit(c1, 1)
    bit4 = ! matchbit(c1, 1)
    bit3 = ! acceptbit(c2, 2)
    bit2 = ! matchbit(c2, 2)
    bit1 = ! matchbit(c3, 3)
    bit0 = unused, assigned 1

The `predictmatch` computation now becomes:

    func predictmatch(window[0:3])
      var c0 = window[0]
      var c1 = window[1]
      var c2 = window[2]
      var c3 = window[3]
      var bit7 = ! acceptbit(c0, 0)
      var bit6 = ! matchbit(c0, 0)
      var bit5 = ! acceptbit(c1, 1)
      var bit4 = ! matchbit(c1, 1)
      var bit3 = ! acceptbit(c2, 2)
      var bit2 = ! matchbit(c2, 2)
      var bit1 = ! matchbit(c3, 3)
      var bit0 = 1
      if bit7 == 0 then return TRUE
      if bit6 == 0 then
        if bit5 == 0 then return TRUE
        if bit4 == 0 then
          if bit3 == 0 then return TRUE
          if bit2 == 0 then
            if bit1 == 0 then return TRUE
      return FALSE

What did we gain with this change? Nothing much, it may seem. However, we can now simplify `predictmatch` by removing all expensive branches and replacing them with bit-wise logical operations to create a *branchfree* implementation:

    bits = (mem[c0] & 0xc0) | (mem[c1] & 0x30) | (mem[c2] & 0x0c) | (mem[c3] & 0x03)
    return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)

where the `mem[]` array represents `matchbit` and `acceptbit` compactly using 8 bits, or 2 bits per offset for 4 offsets, computed as follows. Note that this simplification should improve execution speed, because the branches are not very predictable and bit-wise logical operations on 8 bits are relatively cheap.

Parameter `k` is the window size of a PM-*k* method, e.g. `k = 4` in our example to illustrate how PM-4 works. Let |Σ| be the alphabet size, e.g. |Σ| is 256 for 8-bit characters. The objective is to assign `mem[i, j]` the value 0, 1, 2 or 3, such that even values indicate `matchbit(c, i) = 1` and values less than 2 indicate `acceptbit(c, i) = 1`:

    for i = 0 to k-1
      for j = 0 to |Σ|-1
        mem[i, j] = 3
    for each word in the set of words
      var m = min(|word|, k)
      for i = 0 to m - 2
        mem[i, word[i]] &= 2
      mem[m - 1, word[m - 1]] = 0

When this initialization step is completed, we obtain `mem[]` from the set of words (string patterns). To compute `predictmatch` efficiently for any window size `k`, we define:

    func predictmatch(mem[0:k-1, 0:|Σ|-1], window[0:k-1])
      var d = 0
      for i = 0 to k - 1
        d |= mem[i, window[i]] << 2*(k-i-1)
      var t = d
      for i = 0 to k - 3
        d |= d >> 2
      d = (d >> 1) | t
      return (d != (1 << 2*k) - 1)

The `predictmatch` function returns `TRUE` when a potential match is found in the window.
Searching for a match of a word in the given text `data` with `predictmatch` is performed as follows:

    func search(mem[0:k-1, 0:|Σ|-1], data)
      for i = 0 to |data| - k
        if predictmatch(mem, data[i:i+k-1]) then
          if data[i:i+k-1] matches a word in the set of words then
            return i
      return -1

This function returns the index of the match in the text `data` or -1 when not found. Note that the last `k-1` characters in the text data are not searched, because the window cannot extend beyond the data. To search all data requires either padding the data with `k-1` dummy characters that cannot match, or performing a partial `predictmatch` on fewer characters as we near the end of the text data. Note that the next step in `search` is to verify a possible match against the set of words. This last step can be expensive when many false positives are found by `predictmatch`, that is, when few words actually match but the number of predictions is much higher.

[![To top](images/go-up.png) To top](#)

PM-*k* with hashing
-------------------

Hashing combines multiple characters into a single hash value. This effectively correlates the characters that make up a word (a string pattern). The words `own` and `end` are in our set of words. The simple `predictmatch` function will find a potential match for words that are not in the set of words, such as `ond` and `ewn`, which are false positives. Hashing improves the accuracy of the predicted match by lowering the false positive rate. The words `own` and `end` no longer produce false matches for `ond` and `ewn`, for example. False positives are not entirely excluded, however, unless the hash is perfect, which requires 32-bit hash values to match windows of 4 characters. The resulting `mem[]` array would be too large to be practical. Therefore, finding a sweet spot that reduces false positives as much as possible with a small range of hash values is important.
The updated pseudo code `predictmatch` function with hashing is defined as follows:

    func predictmatch(mem[0:k-1, 0:2^H-1], window[0:k-1])
      const salt = ... // optional, some constant between 0 and 2^H-1
      var h = salt
      var d = 0
      for i = 0 to k - 1
        h = hash(h, window[i])
        d |= mem[i, h] << 2*(k-i-1)
      var t = d
      for i = 0 to k - 3
        d |= d >> 2
      d = (d >> 1) | t
      return (d != (1 << 2*k) - 1)

The only difference is that a hash value `h` is computed with `h = hash(h, window[i])` over the text in the window, and `mem` is indexed by `h` instead of by a character. Correspondingly, `mem[]` is constructed and indexed by hash values instead of single characters. Hash values range from `0` to `2^H-1` for `H`-bit hashes. Salting the hash is optional; it may or may not improve the accuracy depending on the hashing function `hash` chosen. We do not need to hash the first character in the window, so we can change the `i` loop above to run from 1, initializing `h = window[0]`. This means that we can use `mem[]` to quickly find a starting character in the text data first and then call `predictmatch`:

    func search(mem[0:k-1, 0:2^H-1], data)
      for i = 0 to |data| - k
        if mem[0, data[i]] != 3 then
          if predictmatch(mem, data[i:i+k-1]) then
            if data[i:i+k-1] matches a word in the set of words then
              return i
      return -1

This approach reduces computational overhead.

[![To top](images/go-up.png) To top](#)

PM-*k* with hashing and Bitap
-----------------------------

Combining PM-*k* with Bitap can reduce the false positive rate with almost no overhead if the number of strings searched is limited. The idea is to filter possible matches with Bitap prior to `predictmatch`. This approach works well when we search a small set of string patterns and/or when the string patterns are all longer than *k* characters. If a single string among the string patterns is shorter than a few characters, then the benefit of Bitap quickly evaporates, and it offers no advantage if one string pattern consists of just one character.
Also, if the number of strings to search is high, then Bitap produces many false positives and is not effective as a pre-filter for PM-*k*. In the implementation section I introduce a threshold value to determine if Bitap is useful, based on the minimum string pattern length and Bitap "entropy". When I mention "entropy", think about the number and complexity of the string patterns to search. The more strings we have and the more complex they are, the higher the entropy. Bitap is effective if the entropy is sufficiently low. The next section compares the performance of PM-4 methods with hashing and Bitap. The performance comparisons were performed with ugrep and with a [PM-4 implementation in C.](#PM-4-implementation-in-C)

[![To top](images/go-up.png) To top](#)

Performance results
-------------------

Determining the performance of multi-string matching and search methods can have many pitfalls. It depends on the test set consisting of a representative corpus to search and representative search patterns. Therefore, I make no absolute claim that the method is always faster than any other method. We also do not want to cherry-pick a specific test set to beat other methods, just as other method implementers can do to try to beat PM-4 in very specific cases. Instead, we should randomize our experiment as much as possible to obtain a fair assessment of performance.

### Experiments to determine the impact of PM-4

To conduct fair performance experiments, I chose a 100,000,000 byte [English Wikipedia dump](files/enwik8.zip). This text file contains 1,128,023 words, mostly English words, some names, abbreviations and numbers.
The word length frequency of the 348,191 [distinct words](files/words.zip) in this file is depicted in the graph shown below:

<div class="chart"><a href="images/wordlengths.png" data-lightbox="image-1"><img alt="word length frequency" src="images/wordlengths.png"/></a></div>

The hashing function used in this experiment is a PM-4 shift-and-xor hash named `<<3/<<3/<<3`, see my [PM-4 implementation in C,](#PM-4-implementation-in-C) which consists of three operations to hash inputs `a` and `b` and a prior hash value `h`:

    #define hash1(a,b) (uint16_t)((((uint16_t)(a) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash3(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))

The following performance test was conducted with clang 12.0.0 -O2 on a MacBook Pro with a 2.9 GHz Intel Core i7 and 16 GB 2133 MHz LPDDR3 RAM running MacOS. The best time of 30 runs is shown, under minimal machine load. The performance of ugrep with the new search method presented in this article is compared with Hyperscan's simple grep tool, ripgrep 13.0.0 and GNU grep 3.10. The data labelled "length *N* and up" represent the elapsed search time of a random set of 1,000 string patterns of length *N* and longer, picked at random from the 348,191 words, where the probability distribution of lengths corresponds to the word length frequencies in the graph shown above. Each set of data labelled "length *N* and up" contains at least one string of length *N*.

<div class="chart"><a href="images/performance.png" data-lightbox="image-1"><img alt="1,000 string search performance elapsed time (lower is better)" src="images/performance.png"/></a></div>

From this bar chart we observe that the new PM-4 multi-string search method implemented in ugrep performs very well across the board. The performance gap is smaller for longer strings to search for.
This is not entirely surprising, because Hyperscan uses Bitap-based methods, which work best for a set of strings when the strings are not too short. The shortest string pattern determines the Bitap predictability of matches, which is lower for short string patterns. It turns out that PM-4 significantly improves the search speed compared to Bitap and Hyperscan for representative string patterns. The reason is that PM-4+hash significantly improves the match prediction rate of `predictmatch` when compared to PM-4 and Bitap for arbitrary strings picked at random from the 348,191 distinct words, where the probability distribution of lengths corresponds to the word length frequencies. The log-log graph below shows these results for 2 words, 4 words, 8 words, and so on up to 1024 words searched:

<div class="chart"><a href="images/rates-all.png" data-lightbox="image-1"><img alt="prediction rates" src="images/rates-all.png"/></a></div>

The results show the average taken over 100 runs of random string searches per point. Higher values are better. The top rate in this graph is 1, meaning perfect prediction; 0.1 is a 10% prediction rate and 0.01 is a 1% prediction rate. Observe that combining Bitap with hashed PM-4 (PM-4+hash+Bitap) gives the highest prediction rates.

A clearer picture of the efficacy of PM-4 emerges when we restrict the length of the searched words to 4 characters. In the following comparison, words that are 1 to 4 characters long are randomly picked from the 348,191 distinct words. The log-log graph is shown below for 2 words, 4 words, 8 words, and so on up to 1024 words:

<div class="chart"><a href="images/rates-to4.png" data-lightbox="image-1"><img alt="prediction rates for string lengths up to 4" src="images/rates-to4.png"/></a></div>

To evaluate the efficacy of the hashing method on prediction rates, I compare the performance of three hash functions. I tested many more, but picked these three as the better performing ones.
The `<<3/<<3/<<3` hashing function I chose is rather simple, shifting and xor-ing four bytes to produce hash values to index `mem[]`. We expect this hashing function to give better runtime performance with low computational overhead, but it may produce more false positives when compared to more advanced hashing functions that are computationally more expensive. A more traditional hashing function is the `*33/*33/*33` hash that multiplies the previous hash value by 33 and adds the next character to the hash:

    #define hash1(a,b) (uint16_t)((((uint16_t)(a) << 5) + (uint16_t)(a) + (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash2(h,b) (uint16_t)((((uint16_t)(h) << 5) + (uint16_t)(h) + (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash3(h,b) (uint16_t)((((uint16_t)(h) << 5) + (uint16_t)(h) + (uint16_t)(b)) & (HASH_MAX - 1))

I also compared the results to an alternative "cheap" hashing function that puts more emphasis on the first two characters, named `<<6/<<3/<<0`:

    #define hash1(a,b) (uint16_t)((((uint16_t)(a) << 6) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash3(h,b) (uint16_t)((((uint16_t)(h) << 0) ^ (uint16_t)(b)) & (HASH_MAX - 1))

The match prediction rates for these three hashing functions used with PM-4+hash+Bitap are shown in the log-log graph below for 2 patterns, 4 patterns, 8 patterns, and so on up to 1024 patterns randomly picked from the 348,191 distinct words:

<div class="chart"><a href="images/rates-hashing-all.png" data-lightbox="image-1"><img alt="prediction rates" src="images/rates-hashing-all.png"/></a></div>

The results are not conclusive, although they indicate that `*33/*33/*33` performs better.
When the length of the words searched is restricted to 4, we get a better picture of the efficacy of the three hashing methods on prediction rates:

<div class="chart"><a href="images/rates-hashing-to4.png" data-lightbox="image-1"><img alt="prediction rates for string lengths up to 4" src="images/rates-hashing-to4.png"/></a></div>

These results show the average taken over 100 runs of randomly picked words per point, up to 4 characters long. The difference is more pronounced, with the `*33/*33/*33` hashing function performing better in this case. However, the `*33/*33/*33` hash function is typically more expensive to execute compared to the `<<3/<<3/<<3` and `<<6/<<3/<<0` hashes. Therefore, ugrep uses the `<<3/<<3/<<3` hash, which offers good runtime performance and reasonably good match prediction in general for words of arbitrary length.

### Benchmarking with a MacBook M1 Pro 16GB LPDDR5

This benchmark test compares the performance of ugrep, ripgrep (a pre-13.0.0 release when I wrote this article) and GNU grep when searching for sets of words in the 100MB benchmark file on a MacBook M1 Pro 10 core 16GB LPDDR5. Each point in the log-log graph is the average elapsed time of 100 randomized searches with random word sets for each set size *N* of words searched, for *N*=1,2,4,...,1024:

<div class="chart"><a href="images/performance-log-log-word-search-m1.png" data-lightbox="image-1"><img alt="" src="images/performance-log-log-word-search-m1.png"/></a></div>

Ugrep easily beats ripgrep and GNU grep up to 256 words searched. For larger sets of words, ripgrep comes closer to ugrep, with a slight advantage for ripgrep. Ripgrep is slightly faster when many matching patterns are returned and output; output seems somewhat faster in ripgrep when several KB to MB are output (not very common in practical use cases).
Ripgrep also uses letter frequency search heuristics, which work very well to search English text corpora, such as this large benchmark file (but may otherwise not perform as efficiently).

Statistical variance in the elapsed time is interesting. The following log-log graph shows the same experiment with min-max bars added to highlight the wide range of the elapsed time, not only the average elapsed time:

<div class="chart"><a href="images/performance-ave-min-max-word-search-m1.png" data-lightbox="image-1"><img alt="" src="images/performance-ave-min-max-word-search-m1.png"/></a></div>

The min-max range of ripgrep is huge compared to ugrep. This means that ripgrep sometimes takes a lot of time to return the results and sometimes gets lucky and returns results faster. To be most fair in this experiment, I repeated the test 4 times for each set size *N*=1,2,4,...,1024 and removed the one or two most extreme standard deviation cases. High deviation cases may indicate that the machine is not spending 100% CPU time on the search or that something else is taking time away. The min-max bars in the graph also suggest that it is possible to cherry-pick specific search patterns for which either ripgrep or ugrep is best. So when ripgrep claims to be faster than ugrep, it is likely because of the specific choice of search pattern in the performance comparison, rather than a factual technical assessment. Likewise, I am careful not to make general claims that ugrep is always faster. It is generally very fast and mostly faster for common search patterns.

### Benchmarking on a MacBook Pro 2.9 GHz Intel Core i7 16 GB 2133 MHz LPDDR3

This benchmark test compares the performance of ugrep, ripgrep (a pre-13.0.0 release when I wrote this article) and GNU grep when searching for sets of words in the 100MB benchmark file on a MacBook Pro with a 2.9 GHz Intel Core i7 and 16 GB 2133 MHz LPDDR3 RAM.
Each point in the log-log graph is the average of 100 random trials for each set size *N* of words searched, for *N*=1,2,4,...,1024:

<div class="chart"><a href="images/performance-log-log-word-search-x64.png" data-lightbox="image-1"><img alt="" src="images/performance-log-log-word-search-x64.png"/></a></div>

Ugrep and ripgrep performance is more similar compared to the M1 Pro benchmark test. Ripgrep is slightly faster when many matching patterns are returned and output; output seems somewhat faster in ripgrep when several KB to MB are output (not very common in practical use cases). Ripgrep has a bug when searching 32 words and produces the wrong results. This is due to ripgrep's use of letter frequency search heuristics, which work very well for English corpora such as this large benchmark file, but can be buggy when not implemented correctly. These buggy data points were removed from this graph to be fair. Note the strange behavior of ripgrep for 64 words searched, suggesting that another problem might be lurking there.

Statistical variance in the elapsed time is interesting. The following log-log graph shows the same experiment with min-max bars added to highlight the wide range of the elapsed time, not only the average elapsed time:

<div class="chart"><a href="images/performance-ave-min-max-word-search-x64.png" data-lightbox="image-1"><img alt="" src="images/performance-ave-min-max-word-search-x64.png"/></a></div>

The min-max range of ripgrep is much wider compared to ugrep. This means that ripgrep sometimes takes a lot of time to return the results and sometimes gets lucky and returns results faster. This graph includes ripgrep's bug that produces incorrect results for *N*=32, as observed in the average and min-max elapsed time. To be most fair in this experiment, I repeated the test 4 times for each set size *N*=1,2,4,...,1024 and removed the one or two most extreme standard deviation cases.
High deviation cases may indicate that the machine is not spending 100% CPU time on the search or that something else is taking time away.

[![To top](images/go-up.png) To top](#)

PM-*k* correctness proof
------------------------

In this section I prove the correctness of `predictmatch` with a short and simple constructive proof. The objective is to show that the logic-based version of `predictmatch` is semantically identical to the control-flow-based version of `predictmatch`. Let's consider the case PM-4 with a window of 4 characters long for `predictmatch`:

    bits = (mem[c0] & 0xc0) | (mem[c1] & 0x30) | (mem[c2] & 0x0c) | (mem[c3] & 0x03)
    return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)

Now align the `bits` patterns as follows, where the digits 1 to 7 denote the `bits` values indexed from 0 (low order) to 7 (high order) and `-` represents a zero bit:

    bits >> 6 = [-,-,-,-,-,-,7,6]
    bits >> 4 = [-,-,-,-,7,6,5,4]
    bits >> 2 = [-,-,7,6,5,4,3,2]
    bits >> 1 = [-,7,6,5,4,3,2,1]

Note that `predictmatch` returns `TRUE` when

    bits >> 6 = [-,-,-,-,-,-,7,6]
    bits >> 4 = [-,-,-,-,7,6,5,4]
    bits >> 2 = [-,-,7,6,5,4,3,2]
    bits >> 1 = [-,7,6,5,4,3,2,1]
                   ^   ^   ^   ^
                   |   |   |   \__ if bit6, bit4, bit2, bit1 are all zero then return TRUE
                   |   |   \______ if bit6, bit4, bit3 are all zero then return TRUE
                   |   \__________ if bit6, bit5 are all zero then return TRUE
                   \______________ if bit7 is zero then return TRUE

That is

    if bit7 is zero then return TRUE
    if bit6, bit5 are all zero then return TRUE
    if bit6, bit4, bit3 are all zero then return TRUE
    if bit6, bit4, bit2, bit1 are all zero then return TRUE
    return FALSE

By the initialization of `mem[]` as defined earlier, we have the 8 `bits` assigned

    bits = [bit7,bit6,bit5,bit4,bit3,bit2,bit1,bit0]

where

    bit7 = ! acceptbit(c0, 0)
    bit6 = ! matchbit(c0, 0)
    bit5 = ! acceptbit(c1, 1)
    bit4 = ! matchbit(c1, 1)
    bit3 = ! acceptbit(c2, 2)
    bit2 = ! matchbit(c2, 2)
    bit1 = ! matchbit(c3, 3)
    bit0 = unused, assigned 1

Combining the above we have

    if acceptbit(c0, 0) then return TRUE
    if matchbit(c0, 0) and acceptbit(c1, 1) then return TRUE
    if matchbit(c0, 0) and matchbit(c1, 1) and acceptbit(c2, 2) then return TRUE
    if matchbit(c0, 0) and matchbit(c1, 1) and matchbit(c2, 2) and matchbit(c3, 3) then return TRUE
    return FALSE

which is semantically identical to

    if acceptbit(c0, 0) then return TRUE
    if matchbit(c0, 0) then
      if acceptbit(c1, 1) then return TRUE
      if matchbit(c1, 1) then
        if acceptbit(c2, 2) then return TRUE
        if matchbit(c2, 2) then
          if matchbit(c3, 3) then return TRUE
    return FALSE

This completes the simple constructive proof for the case `k = 4`. It is not difficult to see that the proof generalizes to any positive integer `k`.

[![To top](images/go-up.png) To top](#)

PM-4 implementation in C
------------------------

We start by slightly optimizing the pseudo code

    return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)

to the factored form

    return (((((((bits >> 2) | bits) >> 2) | bits) >> 1) | bits) != 0xff)

where it is assumed that `bit0` in `bits` is assigned 1, see my earlier comment in this article.
An implementation of `predictmatch` in C with a very simple, computationally efficient, `<<3/<<3/<<3` hashing function that produces reasonably high match prediction rates as discussed:

    #define HASH_MAX 4096

    #define hash1(a,b) (uint16_t)((((uint16_t)(a) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
    #define hash3(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))

    static inline int predictmatch(uint8_t mem[], const char *window)
    {
      uint8_t b0 = window[0];
      uint8_t b1 = window[1];
      uint8_t b2 = window[2];
      uint8_t b3 = window[3];
      uint16_t h1 = hash1(b0, b1);
      uint16_t h2 = hash2(h1, b2);
      uint16_t h3 = hash3(h2, b3);
      uint8_t a0 = mem[b0];
      uint8_t a1 = mem[h1];
      uint8_t a2 = mem[h2];
      uint8_t a3 = mem[h3];
      uint8_t b = (a0 & 0xc0) | (a1 & 0x30) | (a2 & 0x0c) | (a3 & 0x03);
      uint8_t m = ((((((b >> 2) | b) >> 2) | b) >> 1) | b);
      return m != 0xff;
    }

The initialization of `mem[]` with a set of `n` string patterns is done as follows:

    void init(int n, const char **patterns, uint8_t mem[])
    {
      int i;
      memset(mem, 0xff, sizeof(uint8_t)*HASH_MAX);
      for (i = 0; i < n; ++i)
      {
        uint8_t b0 = (uint8_t)patterns[i][0];
        uint8_t b1 = b0 ? (uint8_t)patterns[i][1] : 0;
        uint8_t b2 = b1 ? (uint8_t)patterns[i][2] : 0;
        uint8_t b3 = b2 ? (uint8_t)patterns[i][3] : 0;
        uint16_t h1 = hash1(b0, b1);
        uint16_t h2 = hash2(h1, b2);
        uint16_t h3 = hash3(h2, b3);
        mem[b0] &= 0x3f | ((b1 != 0) << 7);
        mem[h1] &= 0xcf | ((b2 != 0) << 5);
        mem[h2] &= 0xf3 | ((b3 != 0) << 3);
        mem[h3] &= 0xfc;
      }
      mem[0] &= 0x3f;
    }

The `mem[0] &= 0x3f` statement flags `\0` as a predicted match to detect a terminating `\0` in the text data to search.
A program to search a file for matching string patterns reads the specified file into memory and uses the remaining arguments as string patterns to search:

    int main(int argc, const char **argv)
    {
      uint8_t mem[HASH_MAX];
      size_t hits = 0, total = 0;
      const char *filename = argv[1];
      char *buffer, *ptr, *end;
      struct stat st;
      FILE *fp = fopen(filename, "r");
      if (fp == NULL || fstat(fileno(fp), &st) != 0)
      {
        perror("cannot stat file");
        exit(1);
      }
      buffer = (char*)malloc(st.st_size + 4); // +4 to add \0 and padding
      if (fread(buffer, st.st_size, 1, fp) != 1)
      {
        perror("cannot read file");
        exit(1);
      }
      fclose(fp);
      buffer[st.st_size] = '\0'; // make \0-terminated
      init(argc - 2, &argv[2], mem);
      ptr = buffer;
      end = buffer + st.st_size;
      while (1)
      {
        while ((mem[(uint8_t)*ptr] & 0xc0) == 0xc0)
          ptr++;
        if (ptr >= end)
          break;
        if (predictmatch(mem, ptr))
        {
          size_t len = match(argc - 2, &argv[2], ptr);
          if (len > 0)
          {
            printf("match at %zu\n", ptr - buffer);
            ++hits;
            ptr += len - 1;
          }
          ++total;
        }
        ++ptr;
      }
      printf("hits=%zu rate=%.2f%%\n", hits, (float)hits/total*100);
    }

A simple and inefficient `match` function can be defined as

    size_t match(int n, const char **patterns, const char *ptr)
    {
      size_t len;
      int i;
      for (i = 0; i < n; ++i)
        if (*ptr == *patterns[i] && memcmp(patterns[i], ptr, (len = strlen(patterns[i]))) == 0)
          return len;
      return 0;
    }

This `match` version is not efficient for large sets of string patterns. It is best to use a tree, DFA or regex to match multiple strings.

[![To top](images/go-up.png) To top](#)

PM-4 with Bitap implementation in C
-----------------------------------

Adding Bitap to PM-4 in C is almost trivial.
The initialization is updated as follows to include `bitap` array initialization given the string patterns to search:

```c
int init(int n, const char **patterns, uint8_t mem[], uint16_t bitap[], int *entropy)
{
  int i, j, min = 16;
  memset(mem, 0xff, sizeof(uint8_t)*HASH_MAX);
  memset(bitap, 0xff, sizeof(uint16_t)*256);
  *entropy = 0;
  for (i = 0; i < n; ++i)
  {
    uint8_t  b0 = (uint8_t)patterns[i][0];
    uint8_t  b1 = b0 ? (uint8_t)patterns[i][1] : 0;
    uint8_t  b2 = b1 ? (uint8_t)patterns[i][2] : 0;
    uint8_t  b3 = b2 ? (uint8_t)patterns[i][3] : 0;
    uint16_t h1 = hash1(b0, b1);
    uint16_t h2 = hash2(h1, b2);
    uint16_t h3 = hash3(h2, b3);
    mem[b0] &= 0x3f | ((b1 != 0) << 7);
    mem[h1] &= 0xcf | ((b2 != 0) << 5);
    mem[h2] &= 0xf3 | ((b3 != 0) << 3);
    mem[h3] &= 0xfc;
    for (j = 0; j < min; ++j)
    {
      uint8_t b = patterns[i][j];
      if (b == 0)
        break;
      bitap[b] &= ~(1 << j);
    }
    min = j;
  }
  mem[0] &= 0x3f;
  for (i = 0; i < 256; ++i)
  {
    bitap[i] |= ~((1 << min) - 1);
    for (j = 0; j < min; ++j)
      *entropy += ((bitap[i] & (1 << j)) == 0);
  }
  *entropy /= min;
  return min;
}
```

The function returns the minimum length of the string patterns for Bitap. The `min` variable is limited to 16, the number of bits in the Bitap bit vector. The `entropy` parameter measures the complexity of the string patterns, capped to `min` length. A small entropy makes Bitap more effective, since the expected false positive rate will be low. If the entropy is too high, Bitap should not be used. An entropy threshold of 16 or less appears optimal for ugrep when using short patterns, based on performance benchmarks. This threshold roughly corresponds to the point of convergence of the PM-4+hash prediction rate with the PM-4+hash+Bitap prediction rate in the log-log prediction rate graph shown earlier. For sets of 16 or more string patterns, the difference in prediction rate is negligible.
The search program is updated as follows to include a Bitap filter in the loop if the entropy is sufficiently low:

```c
int main(int argc, const char **argv)
{
  uint8_t mem[HASH_MAX];
  uint16_t bitap[256];
  uint16_t bits, mask;
  int min, entropy;
  size_t hits = 0, total = 0, bitaps = 0;
  const char *filename = argv[1];
  char *buffer, *ptr, *end;
  struct stat st;
  FILE *fp = fopen(filename, "r");
  if (fp == NULL || fstat(fileno(fp), &st) != 0)
  {
    perror("cannot stat file");
    exit(1);
  }
  buffer = (char*)malloc(st.st_size + 4); // add \0 and padding
  if (fread(buffer, st.st_size, 1, fp) != 1)
  {
    perror("cannot read file");
    exit(1);
  }
  fclose(fp);
  buffer[st.st_size] = '\0'; // make \0-terminated
  min = init(argc - 2, &argv[2], mem, bitap, &entropy);
  bits = ~0;
  mask = 1 << (min - 1);
  ptr = buffer;
  end = buffer + st.st_size - min + 1;
  if ((min > 4 && entropy < 200) || entropy < 16)
  {
    while (1)
    {
      while (ptr < end && ((bits = (bits << 1) | bitap[(uint8_t)*ptr]) & mask) != 0)
        ptr++;
      if (ptr >= end)
        break;
      ++bitaps;
      if (predictmatch(mem, ptr - min + 1))
      {
        size_t len = match(argc - 2, &argv[2], ptr - min + 1);
        if (len > 0)
        {
          printf("match at %zu\n", ptr - min + 1 - buffer);
          ++hits;
          ptr += len - min;
          bits = ~0;
        }
        ++total;
      }
      ++ptr;
    }
  }
  else
  {
    while (1)
    {
      while ((mem[(uint8_t)*ptr] & 0xc0) == 0xc0)
        ptr++;
      if (ptr >= end)
        break;
      if (predictmatch(mem, ptr))
      {
        size_t len = match(argc - 2, &argv[2], ptr);
        if (len > 0)
        {
          printf("match at %zu\n", ptr - buffer);
          ++hits;
          ptr += len - 1;
        }
        ++total;
      }
      ++ptr;
    }
  }
  printf("hits=%zu rate=%.2f%% (bitap rate=%.2f%%)\n", hits, (float)hits/total*100, (float)hits/bitaps*100);
}
```

This combination with Bitap offers the benefit of `predictmatch` to predict matches fairly accurately for short string patterns and of Bitap to improve prediction for long string patterns.

[![To top](images/go-up.png) To top](#)

PM-4 SIMD implementation in C with AVX2
---------------------------------------

SIMD vectorization of PM-4 for x86-64 systems can be achieved with AVX2 (Advanced Vector Extensions).
We require AVX2 gather instructions to fetch the hash values stored in `mem`. AVX2 gather instructions are not available in SSE/SSE2/AVX. The idea is to execute four PM-4 `predictmatch` tests in parallel, predicting matches at four adjacent window positions simultaneously. When no match is predicted at any of the four positions, we advance the window by four bytes instead of just one byte.

However, the AVX2 implementation typically does not run much faster than the scalar version; it runs at about the same speed. The performance of PM-4 is memory-bound, not CPU-bound. The scalar version of `predictmatch()` described in a previous section already performs very well due to a good mix of instruction opcodes. Therefore, the performance depends more on memory access latencies and not as much on CPU optimizations. Despite being memory-bound, PM-4 has excellent spatial and temporal locality in its memory access patterns, which makes the algorithm competitive.

Assuming `hash1()`, `hash2()` and `hash3()` are identical in performing a left shift by 3 bits and a xor, the PM-4 implementation with AVX2 is:

```c
static inline int predictmatch(uint8_t mem[], const char *window)
{
  uint16_t mask;
  __m128i vb;
  __m128i vh = _mm_cvtepu8_epi32(_mm_loadu_si32(&window[0]));
  __m128i va = _mm_and_si128(_mm_i32gather_epi32((int32_t*)mem, vh, 1), _mm_set1_epi32(0xc0));
  if (_mm_movemask_epi8(_mm_cmpeq_epi32(va, _mm_set1_epi32(0xc0))) == 0x0000ffff)
    return -1;
  vb = _mm_cvtepu8_epi32(_mm_loadu_si32(&window[1]));
  vh = _mm_xor_si128(_mm_slli_epi32(vh, 3), vb);
  va = _mm_or_si128(va, _mm_and_si128(_mm_i32gather_epi32((int32_t*)mem, vh, 1), _mm_set1_epi32(0x30)));
  vb = _mm_cvtepu8_epi32(_mm_loadu_si32(&window[2]));
  vh = _mm_and_si128(_mm_xor_si128(_mm_slli_epi32(vh, 3), vb), _mm_set1_epi32(0x0fff));
  va = _mm_or_si128(va, _mm_and_si128(_mm_i32gather_epi32((int32_t*)mem, vh, 1), _mm_set1_epi32(0x0c)));
  vb = _mm_cvtepu8_epi32(_mm_loadu_si32(&window[3]));
  vh = _mm_and_si128(_mm_xor_si128(_mm_slli_epi32(vh, 3), vb), _mm_set1_epi32(0x0fff));
  va = _mm_or_si128(va, _mm_and_si128(_mm_i32gather_epi32((int32_t*)mem, vh, 1), _mm_set1_epi32(0x03)));
  va = _mm_or_si128(_mm_srli_epi32(_mm_or_si128(_mm_srli_epi32(_mm_or_si128(_mm_srli_epi32(va, 2), va), 2), va), 1), va);
  va = _mm_cmpeq_epi32(va, _mm_set1_epi32(0xff));
  mask = ~(uint16_t)_mm_movemask_epi8(va);
  return mask == 0 ? -1 : ctz(mask) >> 2; // ctz(): count trailing zeros, e.g. __builtin_ctz()
}
```

This AVX2 implementation of `predictmatch()` returns -1 when no match was found in the given window, which means that the pointer can advance by four bytes to attempt the next matches. Otherwise, `predictmatch()` returns the offset from the pointer (i.e. from the start of the `window`) at which a pattern match is predicted. Therefore, we update `main()` as follows (Bitap is not used):

```c
while (ptr < end)
{
  int k = 0;
  while ((k = predictmatch(mem, ptr)) == -1)
    ptr += 4;
  ptr += k;
  if (ptr >= end)
    break;
  size_t len = match(argc - 2, &argv[2], ptr);
  if (len > 0)
  {
    printf("match at %zu\n", ptr - buffer);
    ++hits;
    ptr += len - 1;
  }
  ++total;
  ++ptr;
}
```

However, we have to be careful with this update and make additional changes to `main()` to allow the AVX2 gathers to access `mem` as 32-bit integers rather than single bytes. This means that `mem` should be padded with 3 bytes in `main()`:

```c
uint8_t mem[HASH_MAX + 3];
```

These three bytes do not need to be initialized, since the AVX2 gather operations are masked to extract only the lower order bits located at lower addresses (little endian). Furthermore, because `predictmatch()` performs match prediction at four window positions simultaneously, we must make sure that the window can extend beyond the input buffer by 3 bytes. We set these bytes to `\0` to indicate the end of input in `main()`:

```c
buffer = (char*)malloc(st.st_size + 4); // +4 to make room for \0 padding
...
memset(&buffer[st.st_size], 0, 4); // add terminating \0 and \0 padding
```

Again, I should emphasize that the AVX2 version of `predictmatch()` may not run faster than the scalar version.
The performance on a MacBook Pro (2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3) shows that the AVX2 version runs at about the same speed, or slightly faster when the predictor returns no or only a few matches in large corpora. Perhaps future SIMD extensions will offer a more compelling advantage.

[![To top](images/go-up.png) To top](#)

PM-*k* hardware implementation
------------------------------

An implementation in hardware with logic gates greatly accelerates the performance of PM-*k*. The diagram below roughly outlines a possible implementation in hardware of PM-4 with hashing:

<div class="chart"><a href="images/pm-4-hardware.png" data-lightbox="image-1"><img alt="PM-4 hardware" src="images/pm-4-hardware.png"/></a></div>

Assuming the window is placed over the string `ABXK` in the input, the matcher predicts a possible match by hashing the input characters (1) from the left to the right as clocked by (4). The memorized hashed patterns are stored in four memories `mem` (5), each with a fixed number of addressable entries `A` addressed by the hash outputs `H`. Each `mem` outputs an `acceptbit` on `D1` and a `matchbit` on `D0`, which are gated through a set of OR gates (6). The outputs are combined by the NAND gate (7) to output a match prediction (3). Prior to matching, all string patterns are "learned" by the memories `mem` by hashing the string presented on the input, for example the string pattern `AB`:

<div class="chart"><a href="images/pm-4-hardware-learn.png" data-lightbox="image-1"><img alt="PM-4 hardware learning patterns" src="images/pm-4-hardware-learn.png"/></a></div>

The control lines (2) are activated as shown, where the third line (`d` for `A` and `0` for `B`) represents the `acceptbit` for `D1`, and `D2` for the `matchbit` is held low (6) to write to `mem` (5).

[![To top](images/go-up.png) To top](#)

Bitap implemented with AVX2 and AVX512
--------------------------------------

A SIMD Bitap implementation can be combined with PM-*k* as before to speed up match prediction.
The Bitap "shift-or" algorithm can be fully vectorized in AVX2 and AVX512 by placing the 256-byte Bitap array `bit[]` in four 64-bit vectors. Note that only the first 64 bits of each of the four 128-bit vectors are used, because we use a shuffle to index the vectors, based on [an approach posted by Fabian Giesen](https://twitter.com/rygorous/status/1527928984373035008), but extended to cover the full 256-byte Bitap array. The 32-bit Bitap state `state` and the bit mask `mask` are initialized before the matching loop starts. Besides the four bit vectors, we also need two auxiliary vectors `statv` and `tempv`. The input is read from memory at a rate of eight characters at a time:

```c
// four 64-bit integer vectors to hold 256-byte bit[] array
__m128i bit0 = _mm_loadu_si64(bit);
__m128i bit1 = _mm_loadu_si64(bit + 64);
__m128i bit2 = _mm_loadu_si64(bit + 128);
__m128i bit3 = _mm_loadu_si64(bit + 192);
uint32_t state = ~0;
uint32_t mask = (1 << (min - 1)); // min is the shift window size (1 to 8)
__m256i maskv = _mm256_set_epi32(mask,      mask << 1, mask << 2, mask << 3,
                                 mask << 4, mask << 5, mask << 6, mask << 7);
__m256i statv;
__m256i tempv = _mm256_set1_epi32(0);
const char *s = input;        // input data
const char *e = input + size; // end of input data
while (s < e - 7)
{
  // fetch next eight characters c0...c7 as a single 64-bit little endian integer [c7...c0]
  __m128i charv = _mm_loadu_si64(s);
  // bit[] Bitap table lookup for eight characters
  __m128i loadv = _mm_shuffle_epi8(bit0, charv);
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit1, charv));
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit2, charv));
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit3, charv));
  // convert [bit[c7]...bit[c0]] to eight 32-bit integers
  tempv = _mm256_cvtepu8_epi32(loadv);
  // shift eight 32-bit integers 0, 1, ..., 7 bits to the left [bit[c7],bit[c6]<<1,...,bit[c0]<<7]
  tempv = _mm256_sllv_epi32(tempv, _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));
  // shift state left by 8 bits and broadcast to vector of 32-bit integers
  statv = _mm256_set1_epi32(state << 8);
  // apply "shift-or" in parallel
  // [ state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2 | bit[c6]<<1 | bit[c7],
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2 | bit[c6]<<1,
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2,
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3,
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4,
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5,
  //   state<<8 | bit[c0]<<7 | bit[c1]<<6,
  //   state<<8 | bit[c0]<<7 ]
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  tempv = _mm256_slli_si256(tempv, 4);
  statv = _mm256_or_si256(statv, tempv);
  // fetch next state from high order [ state<<8 | bit[c0]<<7 | bit[c1]<<6 | ... | bit[c7], ... ]
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 7));
  // mask the states
  // [ mask    & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2 | bit[c6]<<1 | bit[c7] ),
  //   mask<<1 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2 | bit[c6]<<1 ),
  //   mask<<2 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 | bit[c5]<<2 ),
  //   mask<<3 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 | bit[c4]<<3 ),
  //   mask<<4 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 | bit[c3]<<4 ),
  //   mask<<5 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 | bit[c2]<<5 ),
  //   mask<<6 & ( state<<8 | bit[c0]<<7 | bit[c1]<<6 ),
  //   mask<<7 & ( state<<8 | bit[c0]<<7 ) ]
  tempv = _mm256_and_si256(statv, maskv);
  // compare each to zero
  tempv = _mm256_cmpeq_epi32(tempv, _mm256_set1_epi32(0));
  // if one of the eight masked states is zero then break
  if (_mm256_movemask_epi8(tempv) != 0)
    break;
  s += 8;
}
// find out which masked state was zero to set the state and position of the potential match
int move = _mm256_movemask_epi8(tempv);
if ((move & 0x0000000f))
{
  state = _mm256_cvtsi256_si32(statv) >> 7;
}
else if ((move & 0x000000f0))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 1)) >> 6;
  s += 1;
}
else if ((move & 0x00000f00))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 2)) >> 5;
  s += 2;
}
else if ((move & 0x0000f000))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 3)) >> 4;
  s += 3;
}
else if ((move & 0x000f0000))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 4)) >> 3;
  s += 4;
}
else if ((move & 0x00f00000))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 5)) >> 2;
  s += 5;
}
else if ((move & 0x0f000000))
{
  state = _mm256_cvtsi256_si32(_mm256_shuffle_epi32(statv, 6)) >> 1;
  s += 6;
}
else if ((move & 0xf0000000))
{
  s += 7;
}
```

After much testing, I found that the AVX2 version does not run any faster than serial Bitap,
unfortunately. The Bitap method is IO-bound rather than CPU-bound, which limits the throughput of this approach. Still, I had expected some performance improvement. It is not clear how or if AVX2 can or will lead to a performance improvement over serial Bitap. Perhaps someone smarter than me will figure out a simpler and/or better way to store the 256-byte Bitap array in vectors and perform shift-or in parallel.

The AVX512 version is very similar, but fetches 16 characters at a time from the input stored in memory:

```c
// four 64-bit integer vectors to hold 256-byte bit[] array
__m128i bit0 = _mm_loadu_si64(bit);
__m128i bit1 = _mm_loadu_si64(bit + 64);
__m128i bit2 = _mm_loadu_si64(bit + 128);
__m128i bit3 = _mm_loadu_si64(bit + 192);
uint32_t state = ~0;
uint32_t mask = (1 << (min - 1)); // min is the shift window size (1 to 8)
__m512i maskv = _mm512_set_epi32(mask,       mask << 1,  mask << 2,  mask << 3,
                                 mask << 4,  mask << 5,  mask << 6,  mask << 7,
                                 mask << 8,  mask << 9,  mask << 10, mask << 11,
                                 mask << 12, mask << 13, mask << 14, mask << 15);
__m512i statv;
__m512i tempv = _mm512_set1_epi32(0);
uint16_t move = 0;
const char *s = input;        // input data
const char *e = input + size; // end of input data
while (s < e - 15)
{
  // fetch next 16 characters c0...c15 as a single 128-bit little endian integer [c15...c0]
  __m128i charv = _mm_loadu_si128((const __m128i*)s);
  // bit[] Bitap table lookup for sixteen characters
  __m128i loadv = _mm_shuffle_epi8(bit0, charv);
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit1, charv));
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit2, charv));
  charv = _mm_sub_epi8(charv, _mm_set1_epi8(16));
  loadv = _mm_xor_si128(loadv, _mm_shuffle_epi8(bit3, charv));
  // convert [bit[c15]...bit[c0]] to 16 32-bit integers
  tempv = _mm512_cvtepu8_epi32(loadv);
  // shift 16 32-bit integers 0, 1, ..., 15 bits to the left [bit[c15],bit[c14]<<1,...,bit[c0]<<15]
  tempv = _mm512_sllv_epi32(tempv, _mm512_set_epi32(0, 1, 2,  3,  4,  5,  6,  7,
                                                    8, 9, 10, 11, 12, 13, 14, 15));
  // shift state left by 16 bits and broadcast to vector of 32-bit integers
  statv = _mm512_set1_epi32(state << 16);
  // apply "shift-or" in parallel (see AVX2 version for details)
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  // second half
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  tempv = _mm512_slli_si512(tempv, 4);
  statv = _mm512_or_si512(statv, tempv);
  // fetch next state from high order [ state<<16 | bit[c0]<<15 | bit[c1]<<14 | ... | bit[c15], ... ]
  state = _mm512_cvtsi512_si32(_mm512_shuffle_epi32(statv, 15));
  // mask the states (see AVX2 version for details)
  tempv = _mm512_and_si512(statv, maskv);
  // compare each to zero
  move = _mm512_cmpeq_epi32_mask(tempv, _mm512_set1_epi32(0));
  // if one of the sixteen masked states is zero then break
  if (move != 0)
    break;
  s += 16;
}
if (move != 0)
{
  // find out which masked state was zero to set the state and position of the potential match
  int k = 0;
  while ((move & 0x1) == 0)
  {
    ++k;
    move >>= 1;
  }
  state = _mm512_cvtsi512_si32(_mm512_shuffle_epi32(statv, k)) >> (15 - k);
  s += k;
}
```

The AVX512 version runs faster than the serial implementation, but how much faster depends on the CPU. I encourage you to try it out and share the results with the community. To use the Bitap AVX implementations, the `bit[]` (or `bitap[]`) array must first be constructed or pre-processed by xor-ing the values across the four vectors before the `bit[]` array can be used.

[![To top](images/go-up.png) To top](#)

Summary
-------

This article introduced the PM-*k* multi-string approximate search method. Another way to look at PM-*k* is to consider it a class of methods that can be combined with existing multi-string search methods. One such example is PM-4 combined with hashing and Bitap. Multi-string predictive matching with PM-4 hashing and Bitap boosts the performance of multi-string and regex pattern search in ugrep. The implementation in ugrep demonstrates that its search performance beats other state-of-the-art search tools and methods. The generalization to regex patterns, as implemented by ugrep, becomes obvious when we consider that we can generate all strings up to *k* characters long from the given regex pattern. This is feasible when *k* is not too large. Therefore, ugrep uses PM-4 with hashing and Bitap to predict matches to optimize the performance of the DFA-based POSIX regex matcher.

A few more details about ugrep: it uses PM-4 in combination with string matching.
When the initial part of the regex pattern is a fixed string, it makes sense to search for the string part of the regex first and match the rest of the regex using PM-4. For example, when we search with the regex "ab(c|d|ef|ghi)", we search for "ab" first in the input using fast SIMD algorithms. When "ab" matches, we apply PM-4 to predict a possible match for the rest of the pattern "(c|d|ef|ghi)". If the regex matches patterns longer than four characters (four bytes), then ugrep uses a hashing method instead of PM-4 to predict matches. For example, when we search with the regex "ab(cdef|ghijhk)", we search for "ab" first and then apply hashing to predict a possible match for "(cdef|ghijhk)". Besides the low-hanging fruit of classic string search, Bitap, and hashing, ugrep uses PM-4 in clever ways and also applies some additional techniques, not discussed further here, to speed up search.

[![To top](images/go-up.png) To top](#)

Downloads
---------

Free open source [ugrep file search utility.](get-ugrep.html)

[English Wikipedia dump](files/enwik8.zip) and [all unique words extracted](files/words.zip) and sorted with `ugrep -ow '[a-zA-Z]+' enwik8 | sort -u`.

[C source code of PM-4 with hashing and Bitap](files/pm4-bitap.c) BSD-3 licensed

[C++ source code to randomly pick words from a file](files/pick.cpp) BSD-3 licensed

[![To top](images/go-up.png) To top](#)

<p align="right"><i>Copyright (c) 2023, Robert van Engelen, Genivia Inc. All rights reserved.</i></p>