Welcome to the Benford's Law Excel Macro! This program is designed to make it easy and quick to apply Benford's Law to any dataset.
Benford's Law is a statistical principle that predicts the frequency of digits in sets of numerical data. It states that in many real-world datasets, the leading digit is likely to be a small number. For example, in a dataset of prices, the number "1" is likely to be the leading digit more often than any other number.
This macro allows you to quickly and easily analyze your own data to see if it follows Benford's Law. Simply input your data into an Excel spreadsheet and run the macro to get a report on the distribution of leading digits in your dataset.
Whether you're a researcher, a data analyst, or just curious about Benford's Law, this macro is a useful tool for understanding and analyzing numerical data.
Benford's Law is a statistical principle that states that in many real-world datasets, the leading digit (the leftmost digit) is likely to be a small number.
For example, in a set of numbers that represent the population sizes of cities, it is more likely that a city will have a population starting with the digit 1 (such as 10,000 or 100,000) than a population starting with the digit 9 (such as 90,000 or 900,000).
This phenomenon can be explained by the fact that many real-world datasets are distributed along a logarithmic scale.
On a logarithmic scale, numbers are evenly spaced out in terms of their orders of magnitude. For example, the distance between 1 and 10 is the same as the distance between 10 and 100. This means that leading digits are distributed in the exact same way in every order of magnitude.
Another property of logarithmic scale is that regions that start with a smaller leading digit (such as 1, 2 or 3) are much wider than regions that start with a larger leading digit (such as 7, 8 or 9). This makes it more likely for numbers of a dataset that is spread across multiple orders of magnitude to fall into regions with smaller leading digits.
Benford's Law can be expressed mathematically as follows:
where P(d)
is the probability that a number will have the digit d
as its leading digit
and
d
is a digit from 1 to 9.
The following table lists the frequencies of each digit according to Benford's Law:
Digit | Frequency |
---|---|
1 | 30.1% |
2 | 17.6% |
3 | 12.5% |
4 | 9.7% |
5 | 7.9% |
6 | 6.7% |
7 | 5.8% |
8 | 5.1% |
9 | 4.6% |
Benford's Law has been found to hold true for a wide variety of datasets, including financial data, population data, and physical constants. It can be used as a tool for detecting fraudulent or anomalous data, as well as for understanding the underlying distribution of values in a dataset. It is important to note that Benford's Law is not a perfect prediction and may not hold true for all datasets, but it can still be a useful tool for understanding and analyzing numerical data.
In order to make this macro available for every workbook you will need to add it to your Personal Macro Workbook. By default, the personal macro workbook doesn’t exist. You need to first create it:
.bas
file from Github.bas
file and select it.Once the installation is complete, the macro will be available for use in any Excel workbook. To use the macro, you can either select it from the Macros dropdown menu on the View tab, or you can add it to the ribbon for easy access.
To add the macro to the ribbon, follow these steps:
BenfordsLawMacro
and click Add>>.
Now the macro will be available on the "Review" tab in the new group you created. Simply click the button to run the macro and apply Benford's Law to your dataset.
That's it! You're now ready to start using the Excel macro to easily and quickly apply Benford's Law to your datasets.
Benford's Law is a statistical principle that can be used to analyze a dataset to determine whether it is likely to be a random, natural dataset or whether it has been artificially manipulated. In order to analyze a dataset with Benford's Law, the following requirements should be met:
The dataset should be large enough to be statistically significant. Benford's Law is based on the idea that large datasets follow a predictable pattern, so it is important to have a large enough sample size to accurately apply the law.
The dataset should be numeric and continuous. Benford's Law is based on the distribution of the first digits of numbers, so the dataset must consist of numeric values that can be analyzed in this way. The values should also be continuous, rather than discrete or categorical, in order to accurately apply the law.
The data should be collected from a natural or random source. Benford's Law is most effective when applied to datasets that are generated naturally or randomly, rather than artificially manipulated.
The data should be unbiased. The data should not be influenced by any external factors or biases that could affect the distribution of the first digits.
The data should be independent. Each data point should be independent of the others, meaning that it should not be influenced by or related to any other data points in the dataset.
Overall, a dataset that meets these requirements is more likely to be accurately analyzed using Benford's Law, and the results of the analysis will be more reliable and meaningful.
As soon as we run the BenfordsLawMacro
we will be asked to select which cells we want to
analyze.
The recommended action will be to select every full column in which we are interested.
You can also just focus on a specific set of rows but remember that is better to analyze as many values as possible for the statistical tests to be significant.
When we confirm our selection, a new Excel worksheet will appear with the name
Benford's Report
.
In this worksheet we will find a table and a plot which I will describe separately.
The table is the most important part of this generated worksheet. It not only contains the expected and observed frequencies for each most significant digit on every column we chose to analyze, but also some suggestions on how to interpret the results.
In the "suggestion" module of the table, users are presented with three valuable insights to consider when analyzing their data.
The first recommendation is the number of entries analyzed, which should be as high as possible in order to increase the significance of the statistical tests.
Next, a chi square test is presented, complete with the chi square value, the corresponding p-value, and an interpretation of the p-value that reveals whether the observed frequency aligns with the expected frequency or not.
Lastly, users are provided with the minimum and maximum values observed, offering a comprehensive overview of the range of values present in the data. If the range of values is not very large, Benford's Law may not be applicable to the dataset.
After performing the tests, we have gained some insights into whether each column follows Benford's law or not. However, it is important to note that Benford's law may not always be applicable to every dataset, even if the tests indicate that it follows the law. Therefore, it is recommended to carefully analyze and verify the results for yourself to determine the appropriateness of using Benford's law for your specific dataset.
The plot on this slide is a grouped column chart that compares the expected frequencies predicted by Benford's Law, represented in orange, with the observed frequencies from our dataset, represented in a colorblind friendly color scale of blue.
The specific shade of blue used for the observed frequencies is chosen dynamically based on the number of observations. Orange and blue are complementary colors, which helps to make the chart easier to read and interpret.
By comparing the expected frequencies to the observed frequencies from our dataset, we can determine how closely the actual data follows Benford's Law and identify any deviations from the expected pattern.
This chart allows us to easily visualize and analyze these differences, helping us to understand any patterns or trends that may be present in our data.
This example will explaing the walkthrough of executing this macro on a dataset with cricket runs from different players. The columns are the following:
In this dataset the only columns we are able to analyze are the last three, the numeric ones.
Now that we have our results, by looking at the table we can extract the following information for each of the columns we chose:
On the plot we can confirm that the first "Inns" column, eventho is quite similar, its not following the distribution as close as we would expect.
We can also notice that as we mentioned, both "Runs" and "MSD_runs" present the exact same distribution, but because of the information from the table we should only focus on the "Runs" column.
This Macro has been coded by Luis Rodríguez Rubio for a 10 ECTS project at Aarhus University and has been supervised by Hans Jörg Schulz.
This project was aligned with the Data Visualization course of the Computer Science Master's Degree.