Hyperloglog

HyperLogLog is an algorithm that can solve Count Distinct problem. A Count Distinct problem is something like getting number 5 for a data set like [1, 3, 2, 1, 5, 2, 4], for it has [1, 2, 3, 4, 5] 5 elements.

HyperLogLog can provide estimated count on a very large data stream.

Theory

References:

Enhancements

HyperLogLog++

HyperLogLogPlusPlus is an enhancement of HyperLogLog.

Tools

Redis

Commands PFADD, PFCOUNT, and PFMERGE are the main interfaces for manipulating HyperLogLog data structure in Redis.

redis> PFADD hll a b c d e f g
(integer) 1
redis> PFCOUNT hll
(integer) 7

redis> PFADD hll1 foo bar zap a
(integer) 1
redis> PFADD hll2 a b c foo
(integer) 1
redis> PFMERGE hll3 hll1 hll2
"OK"
redis> PFCOUNT hll3
(integer) 6

References:

Datasketch

Source code of the script test.py:

from datasketch import HyperLogLog

data = [1, 2, 3, 5, 2, 4, 1, 6, 7]

hll = HyperLogLog()
for element in data:
    hll.update(str(element).encode('utf8'))

print(f'estimate: {hll.count()}')
print(f'real: {len(set(data))}')

Run the script:

$ mkdir testdatasketch
$ python3 -mvenv venv
$ source venv/bin/activate
(venv) $ pip install datasketch
(venv) $ python test.py
estimate: 7.097484291802821
real: 7

References:

Logswan

Logswan can find unique IP addresses in very large log files but only consumes 4MB memory.

$ logswan access.log

References:

Strategies

Short-term Bucket & Long-term Merge

Example redis usage: PFADD users:<date>:<hour>:<MM> <UID>, MM=[00, 10, 20, 30, 40, 50]

References: