Benford's Law and Your Computer
Benford's law is surprisingly pervasive. It shows up where you might not
expect. For example, if you use Benford's law to analyze the file sizes on your
computer you should find a good fit.
Finding Benford's Law on your computer
Note: To make this work you will have to have a Windows based
computer with MS Excel installed. It is also assumed that you are comfortable
with your computer and know how to use MS Excel.
Getting the File Size Information
In this step you will collect the file size information and load it into MS
Excel.
- Create a scratch directory on
your computer.
- Open a DOS window and change
to the scratch directory.
- Run the following command:
dir /S /A-D /-C c:\ | sort > files.txt
- Use the Notepad or WordPad
text editor to delete everything in files.txt down to the first
dated entries. Dated entries look like this:
01/01/1999 12:09a 1576 ModemLog_LT Win Modem.txt
01/01/1999 12:09a 141021 SETUP.LOG
01/01/1999 12:11a 226 MMDET.LOG
01/01/1999 12:12a 147 ModemDet.txt
01/01/1999 12:12a 2208 SVCPACK.PNF
- Save your changes and exit.
- Start MS Excel running. Use
it to load the file you just created: files.txt. When you load the
file MS Excel specify that the data is in a "fixed width"
format.

- Mark all columns except the
number column as "Do Not Import" so that they won't be loaded
into your spreadsheet.

- Once you've finished load the
data your spreadsheet should contain only a single column of numbers.

Analyzing The Data
Benford's law is analysis of the frequency of occurrence of the first,
second, or third digits. in this example we will be looking at the first digit
only, so our first task is to separate out that digit for each file size
listed.
- To get the first digit of
each number we want to use the =LEFT() macro. So for each number in column
A we want the macro =LEFT(An) in column B.

- Get a count of how many times
each digit occurs. For this we will use the =COUNTIF(start:stop,n) macro.
With this macro "start" and "stop" are column ranges
and "n" is the number you are looking for. So the macro
=COUNTIF(B2:B1000,1) will count all of the occurrences of the number 1 in
column B between rows 2 and 1000 inclusive.
Notice that I added a column labeled "Count Of" to make it
easier to see which digit is being counted by =COUNTIF() in the next
column. 
Did you notice that in the example we count ten
digits (1,2,3,4,5,6,7,8,9,0) rather than just 1..9? That's because some files
have a zero length. For the purpose of this example we are going to ignore the
zero length files. 
- Now we need to know the total
count of non-zero file length entries so we can divide the count of digits
and obtain a ratio. To do this we will use the =sum(start:stop) command.
We will sum the digit counts rather than the raw data so that we do not
inadvertently count zero length files.

- Generate the ratios by
dividing each digit count by the total count. At this point a simple
visual inspection of the numbers should show a pretty good correlation
between the values Benford's Law predicts and the actual percentages.

- Now that we have the real
world data from the computer, we need to have the ratios predicted by
Benford's law. Place them in the column just to the right of your
calculated percentages.
Benford's Law Predicted Values
1 0.30103
2 0.17609
3 0.12494
4 0.09691
5 0.07918
6 0.06695
7 0.05799
8 0.05115
9 0.04576

- Build a chart from the data.
Here's how I did it.


- The final plot is a pretty
good, but not perfect fit. Is the difference significant?

- NOTE: If your chart
shows an odd spike you might want to check your computer to see if there
are a large number of similar files. This typically happens with font
definition files but there are other possible reasons. Odd sets like that
can skew the results.
This example courtesy of Jerry
Anderson - an IT Auditor at Nike Inc.
For questions, comments, or ideas please contact the page owner Dr. Mark Nigrini.