DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Databases
  4. The Probability of Long Runs

The Probability of Long Runs

John Cook user avatar by
John Cook
·
Nov. 15, 12 · Interview
Like (0)
Save
Tweet
Share
5.62K Views

Join the DZone community and get the full member experience.

Join For Free

suppose you’ve written a program that randomly assigns test subjects to one of two treatments, a or b, with equal probability. the researcher using your program calls you to tell you that your software is broken because it has assigned treatment a to seven subjects in a row.

you might argue that the probability of seven a’s in a row is 1/2^7 or about 0.008. not impossible, but pretty small. maybe the software is broken.

but this line of reasoning grossly underestimates the probability of a run of 7 identical assignments. if someone asked the probability that the next 7 assignments would all be a’s, then 1/2^7 would be the right answer. but that’s not the same as asking whether an experiment is likely to see a run of length 7 because run could start any time, not just on the next assignment. also, the phone didn’t ring out of the blue: it rang precisely because there had just been a run.

suppose you have a coin that has probability of heads p and you flip this coin n times. a rule of thumb says that the expected length of the longest run of heads is about

-\frac{\log n(1-p)}{\log p}

provided that n (1- p ) is much larger than 1.

so in a trial of n = 200 subjects with p = 0.5, you’d expect the longest run of heads to be about seven in a row. when p is larger than 0.5, the longest expected run will be longer. for example, if p = 0.6, you’d expect a run of about 9.

the standard deviation of the longest run length is roughly 1/log(1/ p ), independent of n . for coin flips with equal probability of heads or tails, this says an approximate 95% confidence interval would be about 3 either side of the point estimate. so for 200 tosses of a fair coin, you’d expect the longest run of heads to be about 7 ± 3, or between 4 and 10.

the following python code gives an estimate of the probability that the longest run is between a and b inclusive, based on an extreme value distribution.

def prob(a, b, n, p):
    r = -log(n*(1-p))/log(p)
    cdf = lambda x: exp(- p**x )
    return cdf(b + 1 - r) - cdf(a - r)

what if you were interested in the longest run of head or tails? with a fair coin, this just adds 1 to the estimates above. to see this, consider a success to be when consecutive coins turn up the same way. this new sequence has the same expected run lengths, but a run of length m in this sequence corresponds to a run of length m + 1 in the original sequence.

for more details, see “the surprising predictability of long runs” by mark f. schilling, mathematics magazine 85 (2012), number 2, pages 141–149.

COinS Row (database) Software Python (language) Testing Deviation (statistics) Distribution (differential geometry) Magazine Mark (designation)

Published at DZone with permission of John Cook, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Quick Pattern-Matching Queries in PostgreSQL and YugabyteDB
  • 5 Factors When Selecting a Database
  • Agile Transformation With ChatGPT or McBoston?
  • Mind Map Reuse in Software Groups

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: