# # Fuzzy Search

## # Overview

A fuzzy search is a process that locates or filter items by given an approximate similar query string. It's done using approximate string matching algorithms.

## # Solutions

### # ctrlp.vim

ctrlpvim/ctrlp.vim is a vim plugin that supports full path fuzzy finding.

By typing `ctrl+p` in normal mode, you can fuzzy find files by your inputs. For example, typing `mo` would possibly match below files:

• modified.txt
• colors/molokai.vim
• docs/easymotion.txt

Such a plugin helps finding specific files faster.

### # fzf

junegunn/fzf is a general-purpose command-line Fuzzy finder. It can fuzzily process any list like files, command history, etc.

Such utility help typing commands faster.

### # spellcheck

// Assume you know what it is.

## # Patterns

### # Approximate String Matching

The formal definition of approximate string matching can be as below:

Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors).

The closeness of approximate is measured by the distance, the number of minimal string operations necessary to convert a query to the string to match. For example, the editors know that you have a typa and decide to pop up typo as a suggestion, because typa has only one character to be substituted.

Some of the most well-known algorithms for the distance calculation includes:

• Hamming distance.
• Levenshtein distance, or edit distance.
• Damerau-Levenshtein distance.

Another approach is using string similarity join algorithm.

### # Hamming Distance

Example code:

``````def hamming_distance(s1, s2):
"""Return the Hamming distance between equal-length sequences"""
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(el1 != el2 for el1, el2 in zip(s1, s2))
``````
• Pros
• Easy to understand.
• Cons
• Only for calculating a set of words of equal length.

In general, the hamming distance is impractical, but it's helpful to learn other distance algorithms.

### # Levenshtein Distance

Levenshtein distance, also called edit distance, calculates the number of operations including deletion, insertion, and substitution between the given query and the given term.

For example, a minimal edit script that transforms `enqueuezero` to `enqueuezebra` is 2:

• Insert b: enqueuezebro.
• Substitute o to a: enqueuezebra.

### # Damerau-Levenshtein distance

Damerau-Levenshtein distance is a variation of Levenshtein distance by adding an extra rule - transposition of two adjacent letters also counts as one of the operations, alongside with insertion, deletion, and substitution.

### # difflib

Python function `difflib.get_close_matches` returns a list of the best "good enough" matches. It's the quickest scripting function ready to use.

``````>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
``````

### # Bitap

The Bitap algorithm is an approximate string matching algorithm, which tells if a given text contains a substring that approximately equal to a given pattern. It's fast because it's based on bitwise operations. It's most often used in the fuzzy search. Unix utility `agrep` is atop bitap algorithm. The Bitap algorithm can be based on both Hamming distance and Levenshtein distance.

An example implementation is here: https://gist.github.com/soasme/22c6f083bc971ff381724fd3308a4be2

The disadvantage of Bitap is that it requires a fixed-length bit bucket for calculation. The algorithm supports a large bit bucket, however, has poor performance when it's long. Luckily, in most cases, the search term has only a few characters.

### # Bigram Comparing

Bigram comparing works well on variable length strings. The idea is to de-composite the string into a set of bigrams - words that are written with two letters in an alphabetic writing system.

For example, `enqueuezero` can be transformed to `en`, `nq`, `qu`, `ue`, ..., `ro`. We then calculate the number of same bigrams in both query and string to match. Below is an example of Python implementation. 1

``````def get_bigrams(s):
s = s.lower()
return [s[i:i+2] for i in range(len(s))]

def get_similarity_score(query, match):
query_bigrams = get_bigrams(query)
match_bigrams = get_bigrams(match)
hit = 0
for x in query_bigrams:
for y in match_bigrams:
if x == y:
hit += 1
break
return (hit * 2.0) / (len(query_bigrams) + len(match_bigrams))
``````

## # Conclusions

Fuzzy search can be applied whenever there is a search box. Under the hood, the fuzzy search requires approximate string matching. Among all algorithms, the Bitap algorithm is perhaps the best-known for approximate string matching. However, it doesn't fit the case in which the searching dataset is huge since it requires a full scanning. Python function `get_close_matches` in standard lib `difflib` is the handiest tool ready to use.

Installing fuzzy search plugin or utilities save a few seconds every time and thus several hours and days in your work and life.