
PCSA

Agenda
Use Case
PCSA
Sample Input
DC  SP (50)  APP (100)  MDN (100 M)
A   youtube  app1       123456789
A   google   app2       123456789
A   youtube  app1       938745695
A   google   app1       987694567
A   youtube  app3       123456789
A   google   app4       123456789
A   youtube  app1       938745695
A   google   app2       987694567

1. You want to find the unique subscriber count.
2. MDN (mobile number as a string) represents a subscriber uniquely.
Solution
DimensionSet: DC, APP, SP
MeasureSet: List of MDNs

DC  SP (50)  APP (100)  MDN (100 M)
A   youtube  app1       List (123456789, 938745695, 938745695)
A   google   app2       List (123456789, 987694567)
A   google   app1       List (987694567)
A   youtube  app3       List (123456789)
A   google   app4       List (123456789)
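A minimal Python sketch of this exact approach on the sample rows above. It swaps the slide's List for a set per dimension combination, so duplicate MDNs collapse on insert; the names `rows`, `mdns_by_dims` and `unique_counts` are illustrative and not from the original.

```python
from collections import defaultdict

# Sample input rows: (DC, SP, APP, MDN)
rows = [
    ("A", "youtube", "app1", "123456789"),
    ("A", "google", "app2", "123456789"),
    ("A", "youtube", "app1", "938745695"),
    ("A", "google", "app1", "987694567"),
    ("A", "youtube", "app3", "123456789"),
    ("A", "google", "app4", "123456789"),
    ("A", "youtube", "app1", "938745695"),
    ("A", "google", "app2", "987694567"),
]

# One set of MDNs per (DC, SP, APP) combination (a set, not the slide's List)
mdns_by_dims = defaultdict(set)
for dc, sp, app, mdn in rows:
    mdns_by_dims[(dc, sp, app)].add(mdn)

# Unique subscriber count per dimension combination
unique_counts = {dims: len(mdns) for dims, mdns in mdns_by_dims.items()}
```

Memory grows with the number of distinct MDNs held per combination, which is the cost the rest of the deck tries to avoid.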
Issue with the solution
Storing the list of MDNs for every dimension
combination consumes too much space/memory.

Objective:
Achieve similar results with less space/memory:
4 KB or below for each dimension (for Insta).
Linear Probabilistic Counting
Algo insert:
    bit[] buffer
    for i in stream:
        h = hash(i)
        p = h % buffer.size
        buffer[p] = 1
Algo count:
    m = buffer.size
    w = number of 1-bits in buffer
    return -m * ln((m - w) / m)
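The insert and count steps can be sketched in Python as follows. This is an illustration under assumptions, not the deck's implementation: the class name `LinearCounter`, the use of SHA-1 as the hash, and the one-byte-per-bit storage are all choices made here for readability.

```python
import hashlib
import math


class LinearCounter:
    """Linear probabilistic counter over a bitmap of m bits (illustrative sketch)."""

    def __init__(self, m):
        self.m = m
        # One byte per bit for clarity; a real bitmap would pack 8 bits per byte.
        self.bits = bytearray(m)

    def insert(self, item):
        # Assumed hash: first 8 bytes of SHA-1, read as an unsigned integer.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        self.bits[h % self.m] = 1

    def count(self):
        w = sum(self.bits)  # number of 1-bits in the buffer
        if w == self.m:
            raise ValueError("bitmap is full; the estimate is undefined")
        return -self.m * math.log((self.m - w) / self.m)
```

A 4 KB budget corresponds to `LinearCounter(32768)`, since 4 KB holds 32768 bits when packed. Inserting the same MDN twice sets the same bit, so duplicates do not change the estimate.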
LPC
Pros:
Very low average error rate: 2%
Cons:
Handles only low cardinality, up to about 20,000
We use it only for cardinality up to 12,000
PCSA (Probabilistic Counting with Stochastic
Averaging)
Algo:
    for i in stream:
        h = hash(i)
        q, r = quotient and remainder of h / number of buffers
        (899)
        k = position of the first 1-bit in q
        choose buffer r
        set bit k of that buffer to 1
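The insert loop above can be sketched in Python as below. Several details are assumptions made for illustration: the slide's "(899)" is read as the number of buffers, each buffer is taken to be 32 bits wide, "first 1-bit" is read as the lowest-order 1-bit (0-based), and SHA-1 stands in for the unspecified hash. The name `pcsa_insert` is hypothetical.

```python
import hashlib

NUM_BUFFERS = 899  # assumption: the slide's "(899)" is the number of buffers
BUFFER_BITS = 32   # assumption: width of each buffer in bits


def pcsa_insert(buffers, item):
    """Record one item in a list of integer bitmaps (illustrative sketch)."""
    # Assumed hash: first 8 bytes of SHA-1, read as an unsigned integer.
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")

    # r picks the buffer, q supplies the bit pattern to inspect.
    q, r = divmod(h, len(buffers))

    # k = 0-based position of the lowest 1-bit in q, capped at the buffer width.
    if q == 0:
        k = BUFFER_BITS - 1
    else:
        k = min((q & -q).bit_length() - 1, BUFFER_BITS - 1)

    buffers[r] |= 1 << k
```

Usage: start from `buffers = [0] * NUM_BUFFERS` and call `pcsa_insert(buffers, mdn)` for every row. With these assumed sizes the state is 899 x 32 bits, about 3.5 KB, which fits the 4 KB target.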
Uniform Random Hash:
In a uniform random hash, each bit position is 0 or 1
with equal probability.

Suppose the last 1-bit in the buffer is at the 3rd position.
For a hash to have its first 1 at the 3rd position,
its 1st and 2nd bits must be 0.
Probability:
    0 on the 1st bit: 1/2
    0 on the 1st and 2nd bits: 1/4
    1 on the 3rd bit and 0 on the 1st and 2nd bits: 1/8

That means we might have seen about 8 unique items.
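The 1/2, 1/4, 1/8 pattern can be checked exactly by enumerating every value of a small toy hash, each value being equally likely under a uniform hash. The 8-bit width and the helper name `first_one_position` are assumptions for illustration, and positions are counted from the low-order end starting at 1.

```python
from fractions import Fraction

BITS = 8  # width of the toy hash; any width >= 3 gives the same three results


def first_one_position(value, bits=BITS):
    """1-based position of the first 1-bit from the low-order end; None if value is 0."""
    for pos in range(1, bits + 1):
        if value & (1 << (pos - 1)):
            return pos
    return None


# Enumerate every possible hash value and count how many have
# their first 1-bit at position k.
total = 2 ** BITS
prob = {
    k: Fraction(sum(1 for v in range(total) if first_one_position(v) == k), total)
    for k in (1, 2, 3)
}
```

Each step to the right halves the probability, so an event with probability 1/8 is expected roughly once per 8 distinct items.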

Count (single buffer):
    y = position of last 1-bit in buff
    count = function(2^y)
Count (n buffers):
    l_j = position of first 0-bit in buffer j
    y = (l_1 + l_2 + ... + l_n) / n    (average first 0-bit position)
    count = function(2^y * n)
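A hedged Python sketch of the n-buffer count. The slide leaves `function` unspecified; this sketch assumes the usual Flajolet-Martin correction, dividing by the constant 0.77351, and uses 0-based bit positions. The names `first_zero_position` and `pcsa_estimate` are hypothetical, and buffers are plain Python integers used as bitmaps.

```python
PHI = 0.77351  # Flajolet-Martin correction constant (assumed form of "function")


def first_zero_position(bitmap):
    """0-based position of the first (lowest) 0-bit in an integer bitmap."""
    pos = 0
    while bitmap & (1 << pos):
        pos += 1
    return pos


def pcsa_estimate(buffers):
    """Estimate the distinct count from n bitmaps (illustrative sketch)."""
    n = len(buffers)
    # y = average first 0-bit position across all buffers
    y = sum(first_zero_position(b) for b in buffers) / n
    return n * (2 ** y) / PHI
```

For example, the four bitmaps `0b111`, `0b1`, `0b11`, `0b1011` have first 0-bit positions 3, 1, 2, 2, so y = 2 and the estimate is 4 * 2^2 / 0.77351. With all buffers empty this formula returns n / 0.77351 rather than 0, so it overestimates very small cardinalities.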
PCSA
Pros:
Low average error rate: 5%
Can handle large cardinality.
