
Benford’s Law and Lognormal Distributions


Benford’s law is nowadays extremely popular (see e.g. http://en.wikipedia.org/…). It is usually claimed that, for a given data set, changing units does not affect the distribution of the first digit. Thus, it should be related to scale-invariant distributions. Heuristically, scale (or unit) invariance means that the density (or probability function) $f(x)$ of the measure $X$ should be proportional to $f(kx)$. Because densities integrate to 1, the proportionality coefficient has to be $k^{-1}$, and therefore $f$ should satisfy the functional equation $kf(kx)=f(x)$ for all $x$ in $(1,\infty)$ and $k$ in $(0,\infty)$. The solution of this functional equation is $f(x)=x^{-1}$ (up to a multiplicative constant), which I guess can be proved easily by solving an ordinary differential equation.
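One way the differential equation argument might go, assuming $f$ is differentiable (an assumption that is not stated in the heuristic above): differentiate the functional equation with respect to $k$, then set $k=1$. The constant $c$ below is introduced here for illustration.

```latex
\begin{aligned}
\frac{\partial}{\partial k}\bigl[k f(kx)\bigr] &= f(kx) + kx\, f'(kx) = 0
  && \text{(the right-hand side } f(x) \text{ does not depend on } k\text{)} \\
f(x) + x f'(x) &= 0
  && \text{(setting } k = 1\text{)} \\
\bigl(x f(x)\bigr)' &= 0
  \;\Longrightarrow\; f(x) = \frac{c}{x}
\end{aligned}
```

The constant $c$ cancels in any ratio of integrals of $f$, so taking $f(x)=x^{-1}$ loses nothing when computing digit probabilities.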


Now if $D$ denotes the first digit of $X$, in base 10, then

$$\mathbb{P}(D=d)=\frac{\int_d^{d+1} f(x)\,dx}{\int_1^{10} f(x)\,dx}=\frac{\log(d+1)-\log(d)}{\log(10)}=\frac{\log\left(1+\frac{1}{d}\right)}{\log(10)}$$

which is the so-called Benford’s law. The distribution looks as follows:

> (benford=log(1+1/(1:9))/log(10))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 
[6] 0.06694679 0.05799195 0.05115252 0.04575749
> names(benford)=1:9
> sum(benford)
[1] 1
> barplot(benford,col="white",ylim=c(-.045,.3))
> abline(h=0)

To compute the empirical distribution from a sample, use the following function

> firstdigit=function(x){
+ if(x>=1){x=as.numeric(substr(as.character(x),1,1)); zero=FALSE}
+ if(x<1){zero=TRUE}
+ while(zero==TRUE){
+ x=x*10; zero=FALSE
+ if(trunc(x)==0){zero=TRUE}
+ }
+ return(trunc(x))
+ }

and then

> Xd=sapply(X,firstdigit)
> table(Xd)/1000
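As a possible alternative to the loop above, here is a minimal vectorized sketch. The name `first_digit` is hypothetical (it is not from the original post), and the function assumes strictly positive, finite inputs. It reads the leading digit off the scientific-notation string produced by base R's `formatC`.

```r
# Hypothetical helper: leading (first significant) digit of positive numbers.
first_digit <- function(x) {
  # Assumption: inputs are finite and strictly positive (true for lognormal draws)
  stopifnot(is.numeric(x), all(is.finite(x)), all(x > 0))
  # Scientific notation always starts with the leading digit, e.g. "2.88...e-03"
  sci <- formatC(x, format = "e", digits = 15)
  as.integer(substr(sci, 1, 1))
}

first_digit(c(0.00288, 0.5, 1, 42, 9876.5))
# 2 5 1 4 9
```

Because it works on a whole vector at once, it can be called as `first_digit(X)` without `sapply`.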

In Benford’s Law: An Empirical Investigation and a Novel Explanation, we can read

It is not a mathematical article, so do not expect any formal proof in this paper. At least, we can run a Monte Carlo simulation and see what happens if we generate samples from a lognormal distribution with variance parameter $\sigma^2$. For instance, with a unit variance,

> set.seed(1)
> s=1
> X=rlnorm(n=1000,0,s)
> Xd=sapply(X,firstdigit)
> table(Xd)/1000
    1     2     3     4     5     6     7     8     9 
0.288 0.172 0.121 0.086 0.075 0.072 0.073 0.053 0.060 
> T=rbind(benford,-table(Xd)/1000)
> barplot(T,col=c("red","white"),ylim=c(-.045,.3))
> abline(h=0)

Clearly, it is not far from Benford’s law. A more formal test can be considered, for instance Pearson’s $\chi^2$ goodness-of-fit test.

> T=table(Xd)
> chisq.test(T,p=benford)

	Chi-squared test for given probabilities

data:  T 
X-squared = 10.9976, df = 8, p-value = 0.2018
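To see where this statistic comes from, the sketch below recomputes it by hand from the frequencies printed earlier, multiplied by the sample size of 1,000 to recover counts. The variable names are illustrative, and only base R is used.

```r
# Benford probabilities for digits 1..9
benford <- log10(1 + 1 / (1:9))

# Observed counts: the printed frequencies times the sample size (1,000)
observed <- c(288, 172, 121, 86, 75, 72, 73, 53, 60)
expected <- sum(observed) * benford

# Pearson statistic and its p-value with 9 - 1 = 8 degrees of freedom
stat <- sum((observed - expected)^2 / expected)
pval <- pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)

round(c(statistic = stat, p.value = pval), 4)
```

The result should agree with the `X-squared = 10.9976` and `p-value = 0.2018` reported by `chisq.test`.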

So yes, Benford’s law is not rejected at the 5% level. Now, if we consider the case where $\sigma$ is smaller (say 0.9), it is a rather different story,

compared with the case where $\sigma$ is larger (say 1.1).

It is possible to generate several samples (always of the same size, here 1,000 observations), changing only the parameter $\sigma$, and to compute the $p$-value of the test each time. There is one tricky part: when generating samples from lognormal distributions with small variance, some digits might not appear at all. In that case, the table of counts has missing entries, which is a problem for the test. So we just use here

> T=table(Xd)
> T=T[as.character(1:9)]
> T[is.na(T)]=0
> PVAL[i]=chisq.test(T,p=benford)$p.value
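The fragment above can be embedded in a full simulation loop along the following lines. This is only a sketch under stated assumptions: the helper names `first_digit` and `benford_pvalues`, the grid of $\sigma$ values, and the number of replications (200, to keep it quick) are all illustrative choices, not those of the original experiment. Missing digits are handled with `factor(..., levels = 1:9)`, a swapped-in alternative to the `is.na` patch above.

```r
# Benford probabilities for digits 1..9
benford <- log10(1 + 1 / (1:9))

# Hypothetical helper: leading digit read from scientific notation (assumes x > 0)
first_digit <- function(x) {
  as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))
}

# Chi-squared p-values for `nsim` lognormal samples of size `n`
benford_pvalues <- function(sigma, nsim = 200, n = 1000) {
  replicate(nsim, {
    x <- rlnorm(n, meanlog = 0, sdlog = sigma)
    # factor() with fixed levels keeps a zero count for digits that never appear
    counts <- table(factor(first_digit(x), levels = 1:9))
    chisq.test(counts, p = benford)$p.value
  })
}

set.seed(1)
sigmas <- c(0.5, 0.9, 1.1, 1.5)          # illustrative grid, not the original one
PVAL <- sapply(sigmas, benford_pvalues)   # one column of p-values per sigma
colnames(PVAL) <- sigmas
colMeans(PVAL < 0.05)                     # share of samples rejected at the 5% level
```

Calling `boxplot(PVAL)` on the resulting matrix would then draw one box of $p$-values per value of $\sigma$.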

Boxplots of the $p$-value of the test are the following,

When $\sigma$ is too small, it is clearly not Benford’s distribution: for half (or more) of our samples, the $p$-value is lower than 5%. On the other hand, when $\sigma$ is large enough, Benford’s distribution is the distribution of the first digit of lognormal samples, since 95% of our samples have $p$-values higher than 5% (and the distribution of the $p$-value is almost uniform on the unit interval). Here is the proportion of samples where the $p$-value was lower than 5% (on 5,000 generations each time)

Note that it is also possible to compute the $p$-value of a Kolmogorov-Smirnov test, testing whether the $p$-values have a uniform distribution,

> ks.test(PVAL[,s], "punif")$p.value

Indeed, if $\sigma$ is larger than about 1.15, it looks like Benford’s law is a suitable distribution for the first digit.
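The Kolmogorov-Smirnov check relies on the fact that, when the null hypothesis holds, the $p$-values of a test are approximately uniform on the unit interval. A minimal sketch of that principle follows. It draws first digits directly from Benford’s distribution rather than from a lognormal sample, an assumption made here purely for illustration, and the 200 replications are an arbitrary choice.

```r
# Benford probabilities for digits 1..9
benford <- log10(1 + 1 / (1:9))

set.seed(1)
# p-values of the chi-squared test when the digits really follow Benford's law
pvals <- replicate(200, {
  digits <- sample(1:9, size = 1000, replace = TRUE, prob = benford)
  counts <- table(factor(digits, levels = 1:9))
  chisq.test(counts, p = benford)$p.value
})

# Kolmogorov-Smirnov test of the p-values against the uniform distribution
ks_p <- ks.test(pvals, "punif")$p.value
ks_p
```

A small value of `ks_p` would suggest the $p$-values are not uniform, which is the signal used above to flag values of $\sigma$ for which Benford’s law does not fit.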



Published at DZone with permission of
