Count Distinct

Problem

Count-distinct problem is a problem of finding the number of distinct elements in a data set or data stream, within which you might possibly see some repeated elements. For example, [1, 3, 2, 1, 5, 2, 4] has 5 distinct elements [1, 2, 3, 4, 5].

Solutions

Unix commands

Sort input from stdin, and then count lines with unique values.

$ echo '1
3
2
1
5
2
4' > data.txt
$ sort data.txt | uniq | wc -l
       5

Python script

Hold values into a Python set data structure, and then count the size of the set.

>>> dataset = [1, 3, 2, 1, 5, 2, 4]
>>> distinct = set()
>>> for element in dataset:
...     distinct.add(element)
...
>>> print(len(distinct))
5

References

SQL Database

Counting distinct values from a table is a built-in feature for most SQL databases.

> SELECT COUNT(DISTINCT value) FROM table;

Reference

Redis HyperLogLog Commands

Applying dataset with HyperLogLog algorithm when inserting data. HyperLogLog can give estimated counting results.

redis> PFADD dataset  1 3 2 1 5 2 4
(integer) 1

redis> PFCOUNT dataset
(integer) 5

Reference