Why Certain Numbers Are More Common in Data Sets

Why is it that, in many data sets, there are about six times more numbers starting with the
digit 1 than with the digit 9? - Quora
10/30/12 1:38 PM

Benford's Law Probability Statistics (mathematical science)
Why is it that, in many data sets, there are about six times more numbers starting with the digit 1 than with the digit 9?
This phenomenon is called Benford's Law: http://en.wikipedia.org/wiki/Ben...
5 Answers

Ben Golub, Ph.D. probability courses at Stanford...
325 votes by Yair Livne, Brian Roemmele, Patrick Yan, (more)
If you list all the countries in the world and their populations, 27% of the numbers will start with the digit 1. Only 3% of them will start with the digit 9. Something very similar holds if you look at the heights of the 60 tallest structures in the world whether you measure in meters or in feet [1].
This phenomenon, called Benford's Law, not only connects up to some of the deepest math that is known, but also helps auditors detect fraud in things like taxes and elections. Benford's Law often strikes people as unintuitive because it seems that every digit should have an equal opportunity to start country populations or heights of skyscrapers, like this:
(The delightful figures are from http://www.thecleverest.com/benf... ) This egalitarian intuition about leading digits turns out to be misleading. The situation where every digit is equally likely to start numbers is actually the anomalous one.
http://www.quora.com/Why-is-it-that-in-many-data-sets-there-are-abou-times-more-numbers-starting-with-the-digit-1-than-with-the-digit-9
[Figure: Frank Benford]

When he wrote about this phenomenon in 1938, the physicist Frank Benford gave us not just an interesting mathematical factoid. His observation (made before him by Simon Newcomb in 1881) can be very useful if you're trying to spot or succeed in financial fraud [6]. Auditors in many states use the fact that most people don't know or understand Benford's Law to spot tax fraud. The leading digits in large spreadsheets of legitimate financial numbers (light green in the figure below) tend to be very close to Benford's Law (blue), while ones filled in by guessing randomly look way off (orange), and fraudulent numbers (red) tend to look even more bizarre. When tax sleuths notice these tell-tale patterns of numbers with unnatural sources, they call people in for a human audit [3, 4].
The people who naively make up numbers probably think that the leading digits in a natural data set are uniformly distributed. But they're wrong, and here's why. First, observe that if you multiply a number by 2, then very often the first digit of the result will be 1. Certainly if the original number started with 5, 6, 7, 8 or 9.
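The doubling observation is easy to check by brute force. Here is a minimal sketch, assuming we take the integers 1 through 9999 as the starting list (my choice for illustration, because every leading digit then occurs equally often), and tally leading digits before and after multiplying by 2:

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


# In 1..9999 each leading digit 1-9 occurs exactly 1111 times (uniform).
numbers = range(1, 10_000)
before = Counter(leading_digit(n) for n in numbers)
after = Counter(leading_digit(2 * n) for n in numbers)

print("digit  before  after doubling")
for d in range(1, 10):
    print(f"{d:5d}  {before[d]:6d}  {after[d]:6d}")
```

In this list the count of leading 1's goes from 1111 to 5555 after doubling, since every number that started with 5, 6, 7, 8 or 9 now starts with 1.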
So if you begin with the intuitively appealing uniform distribution of leading digits (every leading digit being equally likely) and then multiply all the numbers by 2, the distribution of leading digits will no longer be uniform: there will now be a lot of leading 1's. Weird, eh? (To describe this phenomenon, I say that multiplication by 2 privileges 1 as a leading digit.)

This already tells you that the uniform distribution of leading digits is not really very stable. It doesn't like to persist. It is easy to upset by the innocuous operation of multiplying everything by 2, which is difficult to avoid in the wild!

Second, it turns out that many naturally occurring tables of numbers can be thought of as arising from taking some original list and multiplying each entry by a random number of twos. In view of this, it is natural that we see lower digits overrepresented, and higher digits underrepresented, in many naturally occurring data sets.

==

To explore the explanation in more depth, let's focus on the example of country populations. These tend to grow over time. Think of growing as starting from a random size and being multiplied by 2 a (random) number of times, different for each country (depending on growth rate). Since multiplication by 2 privileges the digit 1 as a leading digit, it's not surprising that a lot of the final numbers start with ones. More than start with nines.

(By the way, it's not just multiplication by 2 that privileges 1 as a leading digit. Multiplication by most numbers privileges lower initial digits, in a sense that is made precise below. So does division by most numbers.)

With building heights, there are two potential explanations. One is the same growth story that we have for cities. Our building ability improves by some random amount every, say, 20 years, and that...
26 Jul, 2011

Andrew Lucas, Physics and math nerd
38 votes by Stephan Hoyer, Thomas Brstad, Edwin Khoo, (more)
Benford's Law: Suppose a data set contains many entries, spread over many orders of magnitude. Then the probability of finding that the first digit in the decimal expansion of a given number in that data set is a 1,2,...,9 is given by 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046. (This is true regardless of the scaling of the data set, or where the data comes from, etc.)

Explanation: The intuitive (if you know a bit of math) explanation for Benford's law goes as follows. The probability that the first digit of a set of numbers is 1,2,...,9 is the same regardless of the scale; i.e., the units you use to measure those numbers. Thus, consider the following argument: Let $x$ be a given number in the data set. If we rescale the data set $N$ times, each time with (independent and identically distributed, random) scaling constant $\lambda_i$, $i = 1, \ldots, N$, then

$$\log_{10} x \;\to\; \log_{10} x + \sum_{i=1}^{N} \log_{10} \lambda_i .$$

Note by taking the log we have reduced the problem to a sum of random variables $\log_{10} \lambda_i$. Regardless of the distribution of the $\lambda_i$, a central limit theorem dictates that the distribution of $\log_{10} x$ is Gaussian as $N \to \infty$. The precise nature of this distribution, however, won't be all that important. Denote by $p(y)$ the probability density function on $\log_{10} x$, or $y$. Denote by $P_d$ the probability that a number in the data set ($x$) has first index $d$. Working with the log variables,

$$P_d = \sum_{n=-\infty}^{\infty} \int_{n + \log_{10} d}^{\,n + \log_{10}(d+1)} p(y)\, dy .$$

Approximating that the integrand is constant at each step of the integral in the above formula reduces the integral to

$$P_d \approx \sum_{n=-\infty}^{\infty} c_n \left[ \log_{10}(d+1) - \log_{10} d \right] .$$

Here $c_n$ are constants dependent on the probability distribution on the log variables. Approximating that the $c_n$ do not depend on $d$ (the distribution is widely spread around many orders of magnitude), gives that

$$P_d \propto \log_{10}\left(1 + \frac{1}{d}\right) .$$
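The "widely spread" approximation can be checked numerically. The sketch below assumes, for illustration only, that y (the base-10 log of a data value) is exactly Gaussian, and evaluates the sum over decades of the probability that y falls between n + log10(d) and n + log10(d+1). The mean and standard deviations are made-up values, not anything from the answer above.

```python
import math


def normal_cdf(y: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of Normal(mu, sigma) at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def first_digit_probs(mu: float, sigma: float, decades: int = 40) -> list[float]:
    """Probability of each first digit 1..9 when y = log10(x) ~ Normal(mu, sigma).

    For digit d, sum over decades n the probability that
    n + log10(d) <= y < n + log10(d + 1).
    """
    probs = []
    for d in range(1, 10):
        lo, hi = math.log10(d), math.log10(d + 1)
        total = 0.0
        for n in range(-decades, decades + 1):
            total += normal_cdf(n + hi, mu, sigma) - normal_cdf(n + lo, mu, sigma)
        probs.append(total)
    return probs


benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
wide = first_digit_probs(mu=3.0, sigma=2.0)    # spread over many decades
narrow = first_digit_probs(mu=3.0, sigma=0.1)  # concentrated near 10**3

for d in range(1, 10):
    print(f"{d}  benford={benford[d - 1]:.3f}  "
          f"wide={wide[d - 1]:.3f}  narrow={narrow[d - 1]:.3f}")
```

With the wide spread the numbers line up with log10(1 + 1/d); with the narrow one they do not, which is the sense in which the data has to cover many orders of magnitude.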
Asking that these probabilities be normalized leads to the probability of finding a 1,2,...,9 being 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046. This empirically agrees well with Benford's law.

26 Mar, 2011

Sridhar Ramesh
6 votes by Edwin Khoo, Luke Bornheimer, Jay Wacker, (more)
"Benford's law" is the claim that some dataset's logarithm is uniformly distributed; that is, it is the claim that the amount of datapoints between the numbers A and B is the same as the amount of datapoints between kA and kB, for any scaling factor k (for example, if a list of positive numbers satisfies Benford's law, then there should be as many datapoints in the interval from 2 to 5 as there are in the interval from 6 to 15 [the latter being the former scaled up by a factor of 3]). One might expect this scale-invariance property to appear whenever looking at a compilation of similar data reported in many different units (the different units acting as different scaling factors). And, even for datasets reported entirely in the same units: one sometimes says Benford's law should only be expected for datasets which equally span many orders of magnitude; but if this is meant to be taken as "There is the same amount of data at each order of magnitude", then this is tautological; that's just another way of phrasing the scaling-invariance property which simply is Benford's law. So if one restricts attention to those datasets which seem likely to equally span many orders of magnitude in this sense, then one is restricting attention to those datasets which seem likely to satisfy Benford's law (and ought not, therefore, be that surprised if/when they do).
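The 2-to-5 versus 6-to-15 example can be tried on synthetic data. This is only a sketch under an assumption of mine: the data is a log-uniform list whose exponents are evenly spaced over six decades (1 to 10**6), which is the idealized case described above.

```python
# Log-uniform sample: exponents evenly spaced over six decades, 1 to 10**6.
n = 60_000
data = [10 ** (6 * i / n) for i in range(n)]


def count_between(values, a, b):
    """Count values v with a <= v < b."""
    return sum(1 for v in values if a <= v < b)


in_2_to_5 = count_between(data, 2, 5)
in_6_to_15 = count_between(data, 6, 15)  # the same interval scaled up by 3

print(in_2_to_5, in_6_to_15)
```

The two counts should agree to within a point or two of rounding at the interval edges, because both intervals have the same width on a log scale.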
---Addendum: It occurs to me that there is an important subtlety not previously accounted for by my above original "It's just scale-invariance!" answer: I stated above that Benford's Law is the claim that some data's logarithm is uniformly distributed. But actually, this isn't exactly right; Benford's Law is actually the claim that some data's logarithm modulo log(10) is uniformly distributed.
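To connect the corrected statement to the familiar digit frequencies: if log10(x) modulo 1 is uniform on [0, 1), then the chance that x starts with digit d is just the length of the interval from log10(d) to log10(d+1). A minimal sketch (the function name is mine):

```python
import math


def benford_probability(d: int) -> float:
    """P(first digit = d) when log10(x) mod 1 is uniform on [0, 1)."""
    return math.log10(d + 1) - math.log10(d)


table = {d: round(benford_probability(d), 3) for d in range(1, 10)}
print(table)
# {1: 0.301, 2: 0.176, 3: 0.125, 4: 0.097, 5: 0.079,
#  6: 0.067, 7: 0.058, 8: 0.051, 9: 0.046}
```

The nine interval lengths add up to log10(10) = 1, so no separate normalization is needed, and the ratio of the first entry to the last is roughly the "six times" in the question.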
What's so important about this relaxation? Well, even data which is normally distributed (and thus not at all uniformly distributed) becomes very nearly uniformly distributed when you take its residue modulo a value which is not very high in comparison to its standard deviation. (In other words, the fractional component of normally distributed data is approximately uniformly distributed so long as the variance isn't too low*). And, of course, by the usual reasoning of the Central Limit Theorem, we might expect a normal distribution to arise whenever adding together many independent random variables.

Putting this all together, then, we should expect Benford's law to very nearly hold whenever looking at a product of many independent random variables of sizable variance, simply because we expect such data to be nearly log-normally distributed with a sizable standard deviation (by the Central Limit Theorem), which means its logarithm modulo log(10) will be nearly uniformly distributed, which is precisely the claim of Benford's law. This, I think, is the phenomenon accounting for the ubiquity of Benford-esque distributions in many real-world datasets.

[*: An excellent account of this fact via Fourier theory (and its applicability to explaining Benford's Law) can be found in Chapter 34 of The Scientist and Engineer's Guide to Digital Signal Processing, available here: http://www.dmae.upm.es/Webperson... . In short, a random variable's residue modulo some period is described by a periodic probability density function, whose Fourier series is given by the values of the random variable's characteristic function at the multiples of the period. Thus, it will be uniformly distributed just in case the original variable's characteristic function is zero at the nonzero multiples of the period.

In particular, if a random variable is normally distributed, then its characteristic function is Gaussian centered at 0, with dispersion parameter inversely proportional to the standard deviation (remember, we're looking at the characteristic function, not the density function). Thus, so long as the standard deviation is reasonably large relative to the sampling period, the characteristic function decays so quickly that its values at the nonzero multiples of the sampling period will be nearly zero, and so the random variable's residue modulo the sampling period will be nearly uniformly distributed.]

26 Mar, 2011

Garrick Toubassi
7 votes by Ben Cunningham, Luke Bornheimer, Edward Castao, (more)
Intuitively, any value that grows exponentially (population, Dow Jones, etc.) will exhibit this property independent of growth rate or initial value. For example: when the Dow is at 10,000, it has to double to get to 20,000 and knock that 1 off (umm yeah, still waiting for that event). But when it was at 9,000 it only had to grow by 11% to knock off that 9. Oh Dow 9,000, I hardly knew ye.

26 Jul, 2011

David Tung, an entrepreneur

Interesting. The fact seems to come from data whose deviation (from prior data) is continuous and small enough. It can be studied by 1, 2, ..., 9, 10, assuming the change (deviation) is less than or on the order of 100% (or a more precise limitation is needed here). The total relative dwell time is [equation missing] So the probability of leading n is simply [equation missing]

24 Oct, 2011
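The exponential-growth intuition in the two answers above can be sketched as a small simulation: grow a value by a fixed percentage per step and count how many steps it spends on each leading digit. The starting value, the 1% growth rate and the number of steps are all hypothetical, chosen only so that the value crosses many powers of ten.

```python
import math
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant decimal digit of a positive float.

    Scientific notation always puts that digit first in the string.
    """
    return int(f"{x:.15e}"[0])


value = 100.0    # hypothetical starting level
growth = 1.01    # hypothetical growth factor per step (1%)
steps = 20_000   # enough steps to cross dozens of powers of ten

dwell = Counter()
for _ in range(steps):
    dwell[leading_digit(value)] += 1
    value *= growth

shares = {d: dwell[d] / steps for d in range(1, 10)}
for d in range(1, 10):
    print(f"{d}  share={shares[d]:.3f}  log10(1+1/{d})={math.log10(1 + 1 / d):.3f}")
```

The value lingers on a leading 1 because it has to double to leave it, and rushes through a leading 9 because about 11% growth is enough, so the dwell shares come out close to log10(1 + 1/d).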