Go Deh: Predicting results from small samples.
I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in, which allowed me to halt the rest of the simulations, or make advance preparations for the final result.
I looked it up at the time and, indeed, there is an equation: if you want to know the pass rate of a "large" population to within a given accuracy, it will give you the minimum sample size to use.
To some, that's all gobbledygook, so I'll try to explain with some code.
Explanation

Let's say you have a large randomised population of pass/fails, or ones and zeroes:
from random import sample

population_size = 100_000  # must be large, > 65K?
sample_size = 123          # Actual
p_hat = 0.5                # Population pass rate, 0.5 == 50%
_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes
Now suppose we take a sample from it and compute the pass rate of that single, smaller sample.
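The pass-rate expression below uses a helper, random_sample, that isn't defined in this excerpt. A minimal sketch, assuming it simply draws sample_size items from population without replacement, would be:

def random_sample() -> list[int]:
    "Hypothetical helper: draw sample_size items from the population without replacement."
    return sample(population, k=sample_size)

With that in place, the single-sample pass rate is: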
pass_rate = (sum(random_sample())  # how many ones
             / sample_size)        # convert to a pass rate
print(pass_rate) # e.g. 0.59027552674230146
Every time we run that pass_rate expression we get a different value. It is random. We need to run that pass_rate calculation many times to get an idea of what the likely pass_rate would be when using the given sample size:
runs_per_sample = 10_000  # each sample size is run this many times ~2500
pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]
I have learnt that the question to ask is not how often the sample pass rate exactly equals the population pass rate, but instead to define an acceptable margin of error (say 5%) and ask what fraction of the time the sample pass rate falls within that margin:
epsilon = 0.05  # target margin of error in sample pass rate. 0.05 == +/-5%
p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054
So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population (0.5, or 50%) only 0.4, or 40%, of the time!
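To get a feel for how that confidence climbs as the sample grows, the same calculation can be wrapped in a loop over a few trial sample sizes. This is only a rough sketch reusing the names defined above (it is slow in pure Python, and the printed values will vary from run to run):

for trial_size in (123, 500, 1_000, 1_500, 2_000):
    runs = 1_000  # fewer runs than above, just for a quick look
    trial_rates = [sum(sample(population, k=trial_size)) / trial_size
                   for _ in range(runs)]
    in_range = sum(p_hat_min <= rate < p_hat_max for rate in trial_rates)
    print(trial_size, in_range / runs)  # confidence level climbs with trial_size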
We need more!

What is actually done is to state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat that we want, epsilon, usually +/-5% or +/-3%; and the confidence_level for sample pass rates landing within that margin of error.
There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
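Those calculators are typically based on the standard normal-approximation formula n = z**2 * p * (1 - p) / E**2, where z is the z-score for the chosen confidence level (about 1.96 for 95%) and E is the absolute margin of error. Note that epsilon in this post is relative to p_hat, so E = p_hat * epsilon. A minimal sketch, separate from the program below and reusing p_hat and epsilon from above:

from statistics import NormalDist

confidence_level = 0.95
z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)  # two-sided z-score, ~1.96 for 95%
E = p_hat * epsilon                                       # absolute margin of error: 0.5 * 0.05 = 0.025
n = z**2 * p_hat * (1 - p_hat) / E**2                     # normal-approximation sample size
print(round(n))  # ~1537, in line with the ~1535 the simulation search below arrives at

The rest of the post finds essentially the same n empirically, by simulation.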
Doing it myself

I calculated for one specific sample size above. Obviously, if I calculated pass_rates over a range of sample_sizes, increasing runs_per_sample as I went, I could search out the sample size needed.
That is done in my next program. I switch to the numpy library for its speed, and sample_size becomes a range.
When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that, because of the randomness, does not always increase monotonically, e.g.
range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...] # confidence levels
The range of sample_size between the first occurrence of a confidence level >= the requested confidence level and the last occurrence of a confidence level < the requested confidence level is then slightly widened, and runs_per_sample increased, for another iteration to get a better result.
Here's a sample of the output I get when searching:
2013 <= n <= 2525 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)
1501 <= n <= 2781 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)
1757 <= n <= 2013 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)
1714 <= n <= 1970 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)
1458 <= n <= 2098 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)
1586 <= n <= 1714 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)
1564 <= n <= 1692 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)
1500 <= n <= 1564 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)
1553 <= n <= 1585 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)
1547 <= n <= 1579 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)
1547 <= n <= 1579 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)
1541 <= n <= 1581 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)
1501 <= n <= 1533 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)
1501 <= n <= 1533 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)
1503 <= n <= 1575 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)
1503 <= n <= 1535 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)
Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.
real    3m49.023s
user    3m49.022s
sys     0m1.361s
"""
@author: paddy
"""
# %%
from random import sample

import numpy as np
def sample_search(population_size, sample_size, p_hat, epsilon,
                  confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000            # must be large, > 65K?
        sample_size = range(1400, 1750, 16)  # (+min, +max, +step)
        p_hat = 0.5               # Population pass rate, 0.5 == 50%
        epsilon = 0.05            # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95   # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000  # each sample size is run this many times ~2500

    Return:
        min,max range for the sample size, n, satisfying inputs.
    """

    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size*p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1
        return array10
    def rates_of_samples(population: np.ndarray, sample_size_range: range,
                         runs_per_sample: int
                         ) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."

        # many_samples_many_rates = [( np.random.shuffle(population),         # shuffle every *run*
        #                              [population[:s_count].sum() / s_count  # The pass rate for samples
        #                               for s_count in sample_size_range]
        #                            )[1]                                     # drop the shuffle
        #                            for _ in range(runs_per_sample)]         # Every run

        many_samples_many_rates = [[np.random.shuffle(population),            # shuffle every *sample*
                                    [population[:s_count].sum() / s_count     # The pass rate for samples
                                     for s_count in sample_size_range]
                                    ][1]                                      # drop the shuffle
                                   for _ in range(runs_per_sample)]           # Every run

        return list(zip(*many_samples_many_rates))  # Transpose to by_sample_size_then_runs
    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target
    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax
                      for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start

    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop

    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)
def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step
    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass
    md = (mn + mx) / 2

    return mn, mx, md, st


def next_sample_size(new_samples, last_samples, runs_per_sample,
                     widener=1.33  # Widen range by
                     ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)

    # Next range of samples computes in names with prefix s_
    increase_runs = True
    if n_max == l_max:  # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = (n_mid + (n_max - n_mid) * widener)

    if n_min == l_min:  # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = (n_mid + (n_min - n_mid) * widener)

    s_min, s_max = (max(1, int(s_min)), int(s_max + 0.5))
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1

    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer

    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)

    return next_sample_range, runs_per_sample
# %%
if __name__ == '__main__':
    population_size = 100_000   # must be large, > 65K?
    sample_size = range(50, 5_000, 512)  # Increasing!
    p_hat = 0.50                # Population pass rate, 0.5 == 50%
    epsilon = 0.05              # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95     # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250       # each sample size is run this many times at start, ~250
    max_runs_per_sample = 35_000
    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon,
                                  confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f" For {p_hat, epsilon, confidence_level =}\n"
              f" Using {population_size, sample_size, runs_per_sample =}\n")

        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

    print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
          f"with a confidence level of {confidence_level*100.:.0f}%.")
END.