# Fuzzy Search
A fuzzy search is a process that locates or filter items by given an approximate similar query string. It's done using approximate string matching algorithms.
ctrlpvim/ctrlp.vim is a vim plugin that supports full path fuzzy finding.
ctrl+p in normal mode, you can fuzzy find files by your inputs. For example,
mo would possibly match below files:
Such a plugin helps finding specific files faster.
junegunn/fzf is a general-purpose command-line Fuzzy finder. It can fuzzily process any list like files, command history, etc.
Such utility help typing commands faster.
// Assume you know what it is.
# Approximate String Matching
The formal definition of approximate string matching can be as below:
Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors).
The closeness of approximate is measured by the distance, the number of minimal string operations necessary to convert a query to the string to match. For example, the editors know that you have a typa and decide to pop up typo as a suggestion, because typa has only one character to be substituted.
Some of the most well-known algorithms for the distance calculation includes:
- Hamming distance.
- Levenshtein distance, or edit distance.
- Damerau-Levenshtein distance.
Another approach is using string similarity join algorithm.
# Hamming Distance
def hamming_distance(s1, s2): """Return the Hamming distance between equal-length sequences""" if len(s1) != len(s2): raise ValueError("Undefined for sequences of unequal length") return sum(el1 != el2 for el1, el2 in zip(s1, s2))
- Easy to understand.
- Only for calculating a set of words of equal length.
In general, the hamming distance is impractical, but it's helpful to learn other distance algorithms.
# Levenshtein Distance
Levenshtein distance, also called edit distance, calculates the number of operations including deletion, insertion, and substitution between the given query and the given term.
For example, a minimal edit script that transforms
enqueuezebra is 2:
- Insert b: enqueuezebro.
- Substitute o to a: enqueuezebra.
# Damerau-Levenshtein distance
Damerau-Levenshtein distance is a variation of Levenshtein distance by adding an extra rule - transposition of two adjacent letters also counts as one of the operations, alongside with insertion, deletion, and substitution.
difflib.get_close_matches returns a list of the best "good enough" matches. It's the quickest scripting function ready to use.
>>> from difflib import get_close_matches >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy']) ['apple', 'ape']
The Bitap algorithm is an approximate string matching algorithm, which tells if
a given text contains a substring that approximately equal to a given pattern.
It's fast because it's based on bitwise operations. It's most often used in the fuzzy search. Unix utility
agrep is atop bitap algorithm. The Bitap algorithm
can be based on both Hamming distance and Levenshtein distance.
An example implementation is here: https://gist.github.com/soasme/22c6f083bc971ff381724fd3308a4be2
The disadvantage of Bitap is that it requires a fixed-length bit bucket for calculation. The algorithm supports a large bit bucket, however, has poor performance when it's long. Luckily, in most cases, the search term has only a few characters.
# Bigram Comparing
Bigram comparing works well on variable length strings. The idea is to de-composite the string into a set of bigrams - words that are written with two letters in an alphabetic writing system.
enqueuezero can be transformed to
We then calculate the number of same bigrams in both query and string to match.
Below is an example of Python implementation. 1
def get_bigrams(s): s = s.lower() return [s[i:i+2] for i in range(len(s))] def get_similarity_score(query, match): query_bigrams = get_bigrams(query) match_bigrams = get_bigrams(match) hit = 0 for x in query_bigrams: for y in match_bigrams: if x == y: hit += 1 break return (hit * 2.0) / (len(query_bigrams) + len(match_bigrams))
Fuzzy search can be applied whenever there is a search box. Under the hood, the fuzzy
search requires approximate string matching. Among all algorithms, the Bitap algorithm
is perhaps the best-known for approximate string matching. However, it doesn't
fit the case in which the searching dataset is huge since it requires a full scanning.
get_close_matches in standard lib
difflib is the handiest tool
ready to use.
Installing fuzzy search plugin or utilities save a few seconds every time and thus several hours and days in your work and life.