
Benford's Law


Munkie


Any math nerds amongst nerds wanna nerd out on this?

I was watching Connected on Netflix. Check it out if you get a chance. I like Latif Nasser from Radiolab, and his curious, excited energy offers a respite from the more thespian delivery of David Attenborough et al. in a lot of documentaries.

They had an episode on Benford's Law. In an idiot's words: basically, if you pick a subject and collect a bunch of numbers about it, you'll find a weirdly consistent pattern in their first digits. About 30.1% of the numbers in a data set are likely to begin with the digit "1". About 17.6% begin with 2, and so on, down to the digit 9 at about 4.6%.
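For anyone who wants the exact figures: the usual statement of Benford's Law is that a leading digit d shows up with probability log10(1 + 1/d). The post above does not quote that formula, so treat this as a minimal sketch of the standard form. The names `benford_probability` and `benford_table` are just illustrative.

```python
import math

def benford_probability(digit):
    """Probability that the first digit equals `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

# Expected share of each leading digit, 1 through 9.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"{d}: {p:.1%}")
```

The nine shares add up to 100%, with 1 at roughly 30.1%, 2 at roughly 17.6%, and 9 at roughly 4.6%.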

It seems weirdly spooky at first, right? I mean, you can flag possible tax fraud with this law. If you look at all the numbers in a company's or person's accounting and they don't largely conform to Benford's Law, that's a red flag that they may be cooking their books. Crazy, right?

But is it that crazy?

The whole episode I was thinking "this all makes a lot of sense." Think of it as just a general trends question--"more likely than not"/"less likely than not".

Let's choose CDs/DVDs owned (I apologize to anyone born this century). If you own CDs, can we agree that it's more likely you own somewhere from 10 to 99 of them, as opposed to 9 or fewer? I think so.

If you own CDs in the double digits (where a large share of the US population likely sits), is it more likely you'd have 10-19, 20-29, 30-39, etc.? I dunno, and a fairly even spread might seem reasonable. But you probably have more people who own 10-19 CDs than own 90-99 CDs. Right? 10-19 is a range where an aspiring collector might decide it's a waste of space and stop, whereas a 90-99 owner is more likely to continue into 100+ than the 10-19 owner is to move into 20+.

But what if we look at triple-digit CD/DVD owners? I feel like owning discs in the 100-199 range is much, much more common than the 200-299 range. Benford's Law would suggest it's close to double (about 30% versus about 17%), which agrees with my scientific gut check.

What about wealth? If you're a millionaire, is it more likely that you have between 1 and 2 million or between 9 and 10 million? Of course the former, because more people can reach that benchmark than the 9-10 range.

 

So Benford's Law itself, as applied to the "random" occurrence of numbers (it's not actually random, we just expect it to be random because the pattern is larger than the human mind can arrange without help), does not confuse me. It's simply a function of counting, which always starts small and gets bigger, fizzling out at some point due to whatever factors. That "fizzling" or attrition process makes Benford's Law a mathematical inevitability. Given leading digits from 1 to 9, a number is more likely to begin with 1 than 2, more likely 2 than 3, more likely 3 than 4, and so on.

The ONLY thing that actually interests me about this otherwise obvious pattern is the fairly consistent distribution across data sets. Around 30% 1s, around 17% 2s, etc until 5% 9s. 

That's a consistency that moves past the obvious conclusion that there are more 1s than 9s.

So what's the deal? 


2 hours ago, Munkie said:

So what's the deal? 

If I recall correctly, Benford’s Law works best when the dataset contains values that span several orders of magnitude. Something about the probability distribution over a log scale... It’s been a long while since I studied this stuff.

I also think it was less of a Law-Law and more of a really-good-rule-of-thumb Law. Like Godwin’s Law or Occam’s Razor, and not like Newton’s First Law.
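A rough way to see the "several orders of magnitude" point is to compare two made-up data sets: one spread evenly on a log scale across six orders of magnitude, and one squeezed into a narrow range. Both data sets are assumptions chosen for illustration (a seeded random generator, a 150-180 range standing in for something like heights in cm), not anything from the episode.

```python
import random
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:e}"[0])

def leading_digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed "wide" data: evenly spread on a log scale from 1 to 1,000,000.
wide = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
# Assumed "narrow" data: everything between 150 and 180.
narrow = [rng.uniform(150, 180) for _ in range(100_000)]

wide_shares = leading_digit_shares(wide)
narrow_shares = leading_digit_shares(narrow)

print("wide:  ", {d: round(s, 3) for d, s in wide_shares.items()})
print("narrow:", {d: round(s, 3) for d, s in narrow_shares.items()})
```

The wide set should land close to the Benford shares, while every value in the narrow set starts with 1, so the pattern has no room to show up.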


If you pick a number, say 1000, to get to 2000 you have to double in magnitude. So you spend a lot of time with 1 as the leading digit. But if you choose 9000 and want to get to 10000, that is only an 11% increase to get out of the 9s as a leading digit and back to the 1s.

Here are people smarter than me who talk about it. I have been watching Numberphile on YouTube for a while because I find it fascinating.
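The "time spent" idea can be sketched by letting a value grow by a fixed percentage per step and counting how many steps it spends on each leading digit. The 1% growth rate and the 1,000 to 10,000,000 range are assumptions picked for illustration, and `time_share_by_leading_digit` is a hypothetical helper name.

```python
from collections import Counter

def time_share_by_leading_digit(start, stop, growth_per_step):
    """Grow `start` by a fixed factor per step until it reaches `stop`,
    and return the share of steps spent on each leading digit."""
    steps = Counter()
    value = start
    while value < stop:
        steps[int(str(int(value))[0])] += 1
        value *= growth_per_step
    total = sum(steps.values())
    return {d: steps[d] / total for d in range(1, 10)}

# Relative growth needed to leave each leading digit.
growth_to_leave_1s = (2000 - 1000) / 1000    # 1.0, i.e. a 100% increase
growth_to_leave_9s = (10000 - 9000) / 9000   # about 0.111, i.e. an 11% increase

# Assumed scenario: 1% growth per step across four orders of magnitude.
growth_shares = time_share_by_leading_digit(1000.0, 10_000_000.0, 1.01)
for d, share in growth_shares.items():
    print(f"{d}: {share:.1%}")
```

With steady percentage growth, the share of steps spent on each digit comes out near the Benford figures, which is the doubling argument in numeric form.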

 


I think we're all going to say essentially the same thing, but we're going to couch it in different terms depending on how we view number systems. I tend to think in logarithmic terms. It's just a product of my background. Others may think in terms of "doubling", which is also logarithmic thinking, just in base 2 instead of base 10 or base e.

What I find more interesting is a comparison of the systems where it applies and the systems where it does not. Computer systems tend *not* to follow Benford's law. We set up systems which are not "naturally occurring".

From the Wikipedia entry:

Quote

Distributions known to disobey Benford's law

The square roots and reciprocals of successive natural numbers do not obey this law.[42] Telephone directories violate Benford's law because the (local) numbers have a mostly fixed length and do not start with the long-distance prefix (in the North American Numbering Plan, the digit 1).[43] Benford's law is violated by the populations of all places with population at least 2500 from five US states according to the 1960 and 1970 censuses, where only 19% began with digit 1 but 20% began with digit 2, for the simple reason that the truncation at 2500 introduces statistical bias.[42] The terminal digits in pathology reports violate Benford's law due to rounding.[44]

 

Those are some of the more boring examples, but the wiki article does highlight some interesting cases, like prices and accounts with fixed minimum or maximum values (e.g. $100 minimum balance, $250,000 FDIC insurance, etc.) or non-random distributions. Of particular interest is the list of things that we think of as non-random which still follow Benford.
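To put one "disobeys" case next to one "non-random but obeys" case, here is a small sketch. The square roots of successive natural numbers come from the quote above. Powers of 2 are my own pick as a commonly cited non-random sequence that follows Benford, and the cutoffs (10,000 roots, 1,000 powers) are arbitrary.

```python
import math
from collections import Counter

def shares(digits):
    """Turn a list of leading digits into a share per digit 1-9."""
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Leading digits of sqrt(1), sqrt(2), ..., sqrt(10000), via scientific notation.
root_shares = shares([int(f"{math.sqrt(n):e}"[0]) for n in range(1, 10_001)])

# Leading digits of 2**1, 2**2, ..., 2**1000, read from the exact integers.
power_shares = shares([int(str(2 ** k)[0]) for k in range(1, 1001)])

print("square roots:", {d: round(s, 3) for d, s in root_shares.items()})
print("powers of 2: ", {d: round(s, 3) for d, s in power_shares.items()})
```

In this run the square roots actually lean the wrong way (few 1s, many 9s), while the powers of 2 track the Benford shares closely.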


5 hours ago, PumpkinHead said:

If you pick a number, say 1000, to get to 2000 you have to double in magnitude. So you spend a lot of time with 1 as the leading digit. But if you choose 9000 and want to get to 10000, that is only an 11% increase to get out of the 9s as a leading digit and back to the 1s.

Right, that's basically what I "felt" when I watched the episode but couldn't articulate, despite a wall of rambling.

The video explains it nicely.

I'd summarize it to say: if you count up from 1 and stop at some arbitrary cutoff, the chance that a number picked at random from that run begins with 1 can never be lower than 11% (1 in 9). The odds can only go up from there, and in Benford-type data they come out around 30%.

On the other hand, the chance of the number beginning with a 9 can never be higher than 11% (and comes out around 5%). This is because you can't count to 9 without having 1, 2, 3, 4, 5, 6, 7 and 8 already present and watering down your odds.
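The 11% floor and ceiling can be checked directly for the simplest case: count 1, 2, 3, ... up to a cutoff n and ask what share of those numbers starts with a given digit. This is a sketch under that counting assumption only (real data sets are not plain counts), and it checks the bounds, not the 30% and 5% averages. `running_shares` is a hypothetical helper name.

```python
def running_shares(digit, max_n):
    """For each cutoff n = 1..max_n, the share of the integers 1..n
    whose leading digit is `digit`. Index i holds the value for n = i + 1."""
    target = str(digit)
    hits = 0
    result = []
    for n in range(1, max_n + 1):
        if str(n)[0] == target:
            hits += 1
        result.append(hits / n)
    return result

ones = running_shares(1, 9999)
nines = running_shares(9, 9999)

print(f"lowest share of 1s:  {min(ones):.4f}")   # reached at n = 9, 99, 999, 9999
print(f"highest share of 9s: {max(nines):.4f}")  # reached at the same cutoffs
```

Both extremes are exactly 1 in 9: the 1s bottom out and the 9s peak right when a full block of digits (1-9, 10-99, and so on) has just been completed.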

Interesting stuff!

