# Count Distinct
# Problem
Count-distinct problem is a problem of finding the number of distinct elements in a data set or data stream, within which you might possibly see some repeated elements. For example, [1, 3, 2, 1, 5, 2, 4]
has 5 distinct elements [1, 2, 3, 4, 5]
.
# Solutions
# Unix commands
Sort input from stdin, and then count lines with unique values.
$ echo '1
3
2
1
5
2
4' > data.txt
$ sort data.txt | uniq | wc -l
5
- Advantage
- Easy to use
- Disadvantage
- Poor performance when data set grows.
- Huge memory usage when data set grows.
- Limited data input types
- Can't handle data set greater than 10^9 (Memory can store so many data).
# Python script
Hold values into a Python set
data structure, and then count the size of the set.
>>> dataset = [1, 3, 2, 1, 5, 2, 4]
>>> distinct = set()
>>> for element in dataset:
... distinct.add(element)
...
>>> print(len(distinct))
5
- Advantage
- Easy to use
- Good performance
- Broad data input types
- Disadvantage
- Huge memory usage when data set grows.
- Can't handle data set greater than 10^9 (Memory can store so many data).
References
# SQL Database
Counting distinct values from a table is a built-in feature for most SQL databases.
> SELECT COUNT(DISTINCT value) FROM table;
- Advantage
- Can handle huge data set (when index is properly set).
- Disadvantage
- Need to create connection to a database.
- Limited use case
Reference
# Redis HyperLogLog Commands
Applying dataset with HyperLogLog algorithm when inserting data. HyperLogLog can give estimated counting results.
redis> PFADD dataset 1 3 2 1 5 2 4
(integer) 1
redis> PFCOUNT dataset
(integer) 5
- Advantage
- Fast (O(1))
- Memory efficient (a few KB in memory even counting 10^9 data set).
- Can be paralleled.
- Disadvantage
- Only provide estimated counting, not accurate value.
Reference