Feeds

Go Deh: Predicting results from small samples.

Planet Python - Fri, 2024-05-10 16:09

 

I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in, which allowed me to halt the remaining simulations, or make advance preparations for the final result.

I looked it up at the time and, indeed, there is an equation: if you want to know the pass rate of a "large" population to within a given accuracy, it will give you the minimum sample size to use.

To some, that's all gobbledygook, so I'll try to explain with some code.

Explanation

Let's say you have a large randomised population of pass/fails, or ones and zeroes:

from random import sample
population_size = 100_000    # must be large, > 65K?
sample_size = 123            # Actual
p_hat = 0.5                  # Population pass rate, 0.5 == 50%

_ones = int(population_size * p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes

And say we take a sample from it and compute the pass rate of that single, smaller sample:

def random_sample() -> list[int]:
    return sample(population, k=sample_size)

pass_rate = (sum(random_sample())  # how many ones
             / sample_size)        # convert to a pass rate
print(pass_rate)  # e.g. 0.59027552674230146

Every time we run that pass_rate expression we get a different value. It is random. We need to run that pass_rate calculation many times to get an idea of what the likely pass_rate would be when using the given sample size:

runs_per_sample = 10_000     # each sample size is run this many times ~2500
pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]
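To get a feel for how widely those simulated rates spread around 0.5, a quick illustrative check (not part of the original run) is to print their deciles using the standard library's statistics.quantiles:

from statistics import quantiles

# Nine cut points dividing the simulated pass rates into ten equal groups;
# with sample_size = 123 they spread widely around the true 0.5.
print(quantiles(pass_rates, n=10))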

I have learnt that the question to ask is not how often the sample pass rate exactly equals the population pass rate, but rather to define an acceptable margin of error (say 5%) and ask that the sample pass rates fall within that margin a certain percentage of the time.

epsilon = 0.05               # target margin of error in sample pass rate. 0.05 == +/-5%
p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054

So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population, 0.5 or 50%, only 0.4 or 40% of the time!

We need more!

What is actually done is we state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat we want, epsilon, usually +/-5% or +/-3%; and the confidence_level we require for the sample pass rate to land within that margin of error.

There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
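Those calculators typically use the normal-approximation formula n = z**2 * p_hat * (1 - p_hat) / E**2, where z is the z-score for the chosen confidence level (about 1.96 for 95%) and E is the absolute margin of error; in this post's terms the margin is relative, so E = p_hat * epsilon. A minimal sketch (the function name is mine):

from math import ceil

def min_sample_size(p_hat: float, epsilon: float, z: float = 1.96) -> int:
    "Normal-approximation sample size; epsilon is relative, so E = p_hat * epsilon."
    margin = p_hat * epsilon                             # absolute margin of error, E
    return ceil(z**2 * p_hat * (1 - p_hat) / margin**2)

print(min_sample_size(0.5, 0.05))  # 1537 -- close to the n the search below arrives at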

Doing it myself

I calculated for one specific sample size, above. Obviously, if I calculated pass_rates over a range of sample_sizes, with increasing runs_per_sample, I could search out the sample size needed.

That is done in my next program. I have to switch to using the numpy library for its speed and sample_size becomes a range.

When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that, because of the randomness, is not itself monotonically increasing, e.g.

range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...]  # confidence levels

The range of sample_size, from the first occurrence of a confidence level >= the requested confidence level to the last occurrence of a confidence level < the requested confidence level, is then slightly widened, and the runs_per_sample increased, for another iteration to get a better result.
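In outline, that bracketing step looks something like this (a sketch with hypothetical names; the full search code follows below):

def bracket(sample_sizes: range, confidence_levels: list[float],
            target: float) -> tuple[int, int]:
    "First sample size whose confidence level reaches target, and the last still below it."
    n_min = next((n for n, c in zip(sample_sizes, confidence_levels) if c >= target),
                 sample_sizes.start)
    n_max = next((n for n, c in reversed(list(zip(sample_sizes, confidence_levels))) if c < target),
                 sample_sizes.stop)
    return n_min, n_max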

Here's a sample of the output I get when searching:

Sample output

$ time python3 sample_search.py
2098 <= n <= 2610
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(50, 5000, 512), 250)

2013 <= n <= 2525
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)

1501 <= n <= 2781
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)

1757 <= n <= 2013
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)

1714 <= n <= 1970
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)

1458 <= n <= 2098
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)

1586 <= n <= 1714
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)

1564 <= n <= 1692
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)

1500 <= n <= 1564
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)

1553 <= n <= 1585
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)

1541 <= n <= 1581
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)

1503 <= n <= 1575
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)

1503 <= n <= 1535
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)
 Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.
real    3m49.023s
user    3m49.022s
sys     0m1.361s


My Code

# -*- coding: utf-8 -*-
"""
Created on Wed May  8 14:04:17 2024

@author: paddy
"""
# %%
from random import sample

import numpy as np


def sample_search(population_size, sample_size, p_hat, epsilon,
                  confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000           # must be large, > 65K?
        sample_size = range(1400, 1750, 16) # (+min, +max, +step)
        p_hat = 0.5                         # Population pass rate, 0.5 == 50%
        epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000            # each sample size is run this many times ~2500

    Return:
        min,max range for the sample size, n, satisfying inputs.
    """

    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size * p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1

        return array10

    def rates_of_samples(population: np.ndarray, sample_size_range: range,
                         runs_per_sample: int) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."
        # many_samples_many_rates = [( np.random.shuffle(population),          # shuffle every *run*
        #                              [population[:s_count].sum() / s_count  # The pass rate for samples
        #                               for s_count in sample_size_range]
        #                              )[1]                                    # drop the shuffle
        #                            for _ in range(runs_per_sample)]          # Every run

        many_samples_many_rates = [[np.random.shuffle(population),          # shuffle every *sample*
                                    [population[:s_count].sum() / s_count  # The pass rate for samples
                                     for s_count in sample_size_range]
                                    ][1]                                    # drop the shuffle
                                   for _ in range(runs_per_sample)]         # Every run

        return list(zip(*many_samples_many_rates))  # Transpose to by_sample_size_then_runs

    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target
    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start

    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop

    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)


def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step

    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass

    md = (mn + mx) / 2

    return mn, mx, md, st


def next_sample_size(new_samples, last_samples,
                     runs_per_sample,
                     widener=1.33  # Widen range by
                     ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)

    # Next range of samples computed in names with prefix s_

    increase_runs = True
    if n_max == l_max:
        # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = n_mid + (n_max - n_mid) * widener

    if n_min == l_min:
        # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = n_mid + (n_min - n_mid) * widener

    s_min, s_max = max(1, int(s_min)), int(s_max + 0.5)
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1

    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer

    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)

    return next_sample_range, runs_per_sample

# %%
if __name__ == '__main__':

    population_size = 100_000           # must be large, > 65K?
    sample_size = range(50, 5_000, 512) # Increasing!
    p_hat = 0.50                        # Population pass rate, 0.5 == 50%
    epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250               # each sample size is run this many times at start, ~250
    max_runs_per_sample = 35_000

    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon,
                                  confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f"  For {p_hat, epsilon, confidence_level =}\n"
              f"  Using {population_size, sample_size, runs_per_sample =}\n")

        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

    print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
          f"with a confidence level of {confidence_level*100.:.0f}%.")

END.


 

Categories: FLOSS Project Planets

Bounteous.com: Moderate All the Content: Establishing Workflows in Drupal 10

Planet Drupal - Fri, 2024-05-10 11:04
Learn about workflow configuration and customizations to empower your website’s content approver and publisher roles.
Categories: FLOSS Project Planets

Bounteous.com: Composability and Drupal: Going Headless at Scale

Planet Drupal - Fri, 2024-05-10 11:04
Discover how composable architectures offer unparalleled speed, agility, and flexibility to empower organizations in navigating the ever-changing landscape of technological advancements and evolving consumer needs, and how Drupal can be a key part of a composable solution!
Categories: FLOSS Project Planets

Bounteous.com: Upgrading to Drupal 10 (And Beyond) With Composer

Planet Drupal - Fri, 2024-05-10 11:04
Every iteration of Drupal brings a multitude of security improvements, accessibility improvements, and a host of new features created by the Drupal community.
Categories: FLOSS Project Planets

Bounteous.com: Introduction to ChatOps with Acquia BLT and Slack

Planet Drupal - Fri, 2024-05-10 11:04
Learn how to set up ChatOps with Acquia BLT to improve your team’s communication and efficiency by automatically sharing Drupal DevOps messaging to a single channel.
Categories: FLOSS Project Planets

Bounteous.com: What’s New in Acquia Site Studio 6.9?

Planet Drupal - Fri, 2024-05-10 11:04
Acquia has been busy releasing new features for their low-code, drag-and-drop solution called Site Studio!
Categories: FLOSS Project Planets

Bounteous.com: Use the Acquia CMS Headless Beta to Improve Headless Applications

Planet Drupal - Fri, 2024-05-10 11:04
As a developer, building a partially or fully headless Drupal site can feel like a daunting task. There are always more questions than answers when getting started.
Categories: FLOSS Project Planets

Bounteous.com: Building Enterprise Drupal Sites with Acquia Build and Launch Tool (BLT)

Planet Drupal - Fri, 2024-05-10 11:04
Learn more about Acquia’s Build and Launch Tool (BLT) and how this Drupal-specific extensible toolset can help you build, test, and deploy your code.
Categories: FLOSS Project Planets

Bounteous.com: Drupal 10: Uncovering New Features and Benefits

Planet Drupal - Fri, 2024-05-10 11:04
Drupal 10 continues to pave the way for great user experiences into the future. Check out the new features and benefits that Drupal 10 has to offer!
Categories: FLOSS Project Planets

Bounteous.com: Your Team's Technical Guide to Drupal Code Reviews

Planet Drupal - Fri, 2024-05-10 11:04
A checklist and technical guide for completing Drupal code reviews to improve your codebase and processes.
Categories: FLOSS Project Planets

Bounteous.com: The Acquia Triple Certification: Distinguishing Yourself as a Drupal Developer

Planet Drupal - Fri, 2024-05-10 11:04
Distinguish yourself as a Drupal Developer by getting Acquia Triple Certified. Learn more about the exams and the impact they can have on your career.
Categories: FLOSS Project Planets

Bounteous.com: PHP 7 to 8: Entering the Modern Era of Programming Languages

Planet Drupal - Fri, 2024-05-10 11:04
Whether it’s using new syntax, experiencing the speed boosts of the JIT compiler, or trying out new features of the language, PHP 8 has many improvements that developers and Drupal sites can benefit from.
Categories: FLOSS Project Planets

Bounteous.com: Our Guide to Upgrading Your Site with Drupal 9

Planet Drupal - Fri, 2024-05-10 11:04
Learn how to successfully migrate your Drupal 7 site to the new Drupal 9 platform, and keep these critical considerations in mind as you plan your future steps.
Categories: FLOSS Project Planets

Bounteous.com: Acquia Cloud IDE: First Impressions From a Senior Developer

Planet Drupal - Fri, 2024-05-10 11:04
Acquia Cloud IDE may be the next big thing in local Drupal development. Learn the basics of Cloud IDE and hear our first impressions of the product.
Categories: FLOSS Project Planets

Bounteous.com: Press Release: Bounteous Recognized as Acquia Global Partner and One of the First Acquia Practice Certified Partners

Planet Drupal - Fri, 2024-05-10 11:04
Bounteous announces its elevated status as an Acquia Global Partner and recognition as one of Acquia’s first Practice Certified Partners.
Categories: FLOSS Project Planets

Bounteous.com: Supercharging Drupal Platforms with The Power of Acquia

Planet Drupal - Fri, 2024-05-10 11:04
Drupal is a powerful content management platform, and Acquia offers a suite of world-class digital marketing solutions. Together, they create the leading open digital experience platform.
Categories: FLOSS Project Planets

Bounteous.com: Headless Commerce With Drupal

Planet Drupal - Fri, 2024-05-10 11:04
In this article, we explore why commerce websites are slow, how headless can help, and why Drupal is a great fit in the complicated world of Commerce.
Categories: FLOSS Project Planets

Bounteous.com: Customizing Your Drupal Commerce Forms

Planet Drupal - Fri, 2024-05-10 11:04
Your digital shopping experience and checkout flow can be as distinctive as your brand. Customize your Drupal Commerce forms through these entry points to deeply and efficiently tailor all aspects of your shopping and checkout experience.
Categories: FLOSS Project Planets

Bounteous.com: How to Approach Your Drupal Website Build

Planet Drupal - Fri, 2024-05-10 11:04
You can use a variety of approaches to build your website with Drupal. The approach you take depends on several factors. Learn how those factors influence your approach.
Categories: FLOSS Project Planets

Bounteous.com: Speaking at Drupal Events: A Non-Code Way to Contribute to Drupal

Planet Drupal - Fri, 2024-05-10 11:04
Explore different ways to get involved in the Drupal community including speaking at Drupal events through the eyes of first-time speaker Irene Dobbs.
Categories: FLOSS Project Planets
