# # HyperLogLog

## # Overview

HyperLogLog is an algorithm that can solve Count Distinct problem. A Count Distinct problem is something like getting number 5 for a data set like [1, 3, 2, 1, 5, 2, 4], for it has [1, 2, 3, 4, 5] 5 elements.

HyperLogLog can provide estimated count on a very large data stream.

• Fast (O(1)).
• Memory efficient.
• Can be distributed and paralleled.
• Only provides estimated count.

## # Theory

• Hash each item.
• Place each item into 32 buckets based on first 5-bits of the hash.
• Update value for each item in 32 buckets based on last 27-bits of the hash.
• Apply to some fancy mathematical formulas to get the number based on the 32 buckets data.

References:

## # Enhancements

### # HyperLogLog++

HyperLogLogPlusPlus is an enhancement of HyperLogLog.

• Uses 64-bit integers rather than 32-bit
• Sparse data structure rather than one huge array
• Better bias correction algorithm at lower cardinalities.

## # Tools

### # Redis

Commands `PFADD`, `PFCOUNT`, and `PFMERGE` are the main interfaces for manipulating HyperLogLog data structure in Redis.

``````redis> PFADD hll a b c d e f g
(integer) 1
redis> PFCOUNT hll
(integer) 7

redis> PFADD hll1 foo bar zap a
(integer) 1
redis> PFADD hll2 a b c foo
(integer) 1
redis> PFMERGE hll3 hll1 hll2
"OK"
redis> PFCOUNT hll3
(integer) 6
``````

References:

Source code of the script `test.py`:

``````from datasketch import HyperLogLog

data = [1, 2, 3, 5, 2, 4, 1, 6, 7]

hll = HyperLogLog()
for element in data:
hll.update(str(element).encode('utf8'))

print(f'estimate: {hll.count()}')
print(f'real: {len(set(data))}')
``````

Run the script:

``````\$ mkdir testdatasketch
\$ python3 -mvenv venv
\$ source venv/bin/activate
(venv) \$ python test.py
estimate: 7.097484291802821
real: 7
``````

References:

### # Logswan

Logswan can find unique IP addresses in very large log files but only consumes 4MB memory.

``````\$ logswan access.log
``````

References:

## # Strategies

### # Short-term Bucket & Long-term Merge

• Keep HyperLogLog data for every 10 minutes for the past 24 hours.
• Merge HyperLogLog data into hourly HLL after 24 hours.
• Merge HyperLogLog data into daily HLL after 60 days.
• ...

Example redis usage: `PFADD users:<date>:<hour>:<MM> <UID>`, MM=[00, 10, 20, 30, 40, 50]

References: