Go Deh: Predicting results from small samples.
I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in, which allowed me to halt the rest of the simulations, or make advance preparations for the final result.
I looked it up at the time and, indeed, there is an equation: if you want to know the pass rate of a "large" population to within a given accuracy, it will give you the minimum sample size to use.
To some, that's all gobbledygook, so I'll try to explain with some code.
Explanation

Let's say you have a large randomised population of pass/fails, or ones and zeroes:
from random import sample

population_size = 100_000  # must be large, > 65K?
sample_size = 123          # Actual
p_hat = 0.5                # Population pass rate, 0.5 == 50%
_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes
Now suppose we take a sample from it and compute the pass rate of that single, smaller sample.
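The pass-rate expression below uses a helper, random_sample, that isn't defined in this excerpt. A minimal sketch, assuming it simply draws sample_size items from population without replacement, would be:

def random_sample() -> list[int]:
    "Hypothetical helper: draw sample_size items from the population without replacement."
    return sample(population, k=sample_size)

With that in place, the single-sample pass rate is: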
pass_rate = (sum(random_sample())  # how many ones
             / sample_size)        # convert to a pass rate
print(pass_rate) # e.g. 0.59027552674230146
Every time we run that pass_rate expression we get a different value. It is random. We need to run that pass_rate calculation many times to get an idea of what the likely pass_rate would be when using the given sample size:
runs_per_sample = 10_000  # each sample size is run this many times ~2500
pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]
I have learnt that the question to ask is not how often the sample pass rate exactly equals the population pass rate, but instead to define an acceptable margin of error (say 5%) and ask what fraction of the time the sample pass rate falls within that margin:
epsilon = 0.05  # target margin of error in sample pass rate. 0.05 == +/-5%
p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054
So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population (0.5, or 50%) only 0.4, or 40%, of the time!
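To get a feel for how that confidence climbs as the sample grows, the same calculation can be wrapped in a loop over a few trial sample sizes. This is only a rough sketch reusing the names defined above (it is slow in pure Python, and the printed values will vary from run to run):

for trial_size in (123, 500, 1_000, 1_500, 2_000):
    runs = 1_000  # fewer runs than above, just for a quick look
    trial_rates = [sum(sample(population, k=trial_size)) / trial_size
                   for _ in range(runs)]
    in_range = sum(p_hat_min <= rate < p_hat_max for rate in trial_rates)
    print(trial_size, in_range / runs)  # confidence level climbs with trial_size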
We need more!

What is actually done is to state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat that we want, epsilon, usually +/-5% or +/-3%; and the confidence_level for sample pass rates landing within that margin of error.
There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
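Those calculators are typically based on the standard normal-approximation formula n = z**2 * p * (1 - p) / E**2, where z is the z-score for the chosen confidence level (about 1.96 for 95%) and E is the absolute margin of error. Note that epsilon in this post is relative to p_hat, so E = p_hat * epsilon. A minimal sketch, separate from the program below and reusing p_hat and epsilon from above:

from statistics import NormalDist

confidence_level = 0.95
z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)  # two-sided z-score, ~1.96 for 95%
E = p_hat * epsilon                                       # absolute margin of error: 0.5 * 0.05 = 0.025
n = z**2 * p_hat * (1 - p_hat) / E**2                     # normal-approximation sample size
print(round(n))  # ~1537, in line with the ~1535 the simulation search below arrives at

The rest of the post finds essentially the same n empirically, by simulation.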
Doing it myself

I calculated for one specific sample size above. Obviously, if I calculated pass_rates over a range of sample_sizes, increasing runs_per_sample as I went, I could search out the sample size needed.
That is done in my next program. I switch to the numpy library for its speed, and sample_size becomes a range.
When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that, because of the randomness, does not always increase monotonically, e.g.
range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...] # confidence levels
The range of sample_size between the first occurrence of a confidence level >= the requested confidence level and the last occurrence of a confidence level < the requested confidence level is then slightly widened, and runs_per_sample increased, for another iteration to get a better result.
Here's a sample of the output I get when searching:
2013 <= n <= 2525 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)
1501 <= n <= 2781 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)
1757 <= n <= 2013 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)
1714 <= n <= 1970 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)
1458 <= n <= 2098 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)
1586 <= n <= 1714 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)
1564 <= n <= 1692 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)
1500 <= n <= 1564 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)
1553 <= n <= 1585 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)
1547 <= n <= 1579 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)
1547 <= n <= 1579 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)
1541 <= n <= 1581 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)
1501 <= n <= 1533 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)
1501 <= n <= 1533 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)
1503 <= n <= 1575 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)
1503 <= n <= 1535 For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95) Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)
Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.
real    3m49.023s
user    3m49.022s
sys     0m1.361s
"""
@author: paddy
"""
# %%
from random import sample

import numpy as np
def sample_search(population_size, sample_size, p_hat, epsilon,
                  confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000            # must be large, > 65K?
        sample_size = range(1400, 1750, 16)  # (+min, +max, +step)
        p_hat = 0.5               # Population pass rate, 0.5 == 50%
        epsilon = 0.05            # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95   # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000  # each sample size is run this many times ~2500

    Return:
        min,max range for the sample size, n, satisfying inputs.
    """

    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size*p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1
        return array10
    def rates_of_samples(population: np.ndarray, sample_size_range: range,
                         runs_per_sample: int
                         ) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."

        # many_samples_many_rates = [( np.random.shuffle(population),         # shuffle every *run*
        #                              [population[:s_count].sum() / s_count  # The pass rate for samples
        #                               for s_count in sample_size_range]
        #                            )[1]                                     # drop the shuffle
        #                            for _ in range(runs_per_sample)]         # Every run

        many_samples_many_rates = [[np.random.shuffle(population),            # shuffle every *sample*
                                    [population[:s_count].sum() / s_count     # The pass rate for samples
                                     for s_count in sample_size_range]
                                    ][1]                                      # drop the shuffle
                                   for _ in range(runs_per_sample)]           # Every run

        return list(zip(*many_samples_many_rates))  # Transpose to by_sample_size_then_runs
    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target
    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax
                      for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start

    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop

    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)
def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step
    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass
    md = (mn + mx) / 2

    return mn, mx, md, st


def next_sample_size(new_samples, last_samples, runs_per_sample,
                     widener=1.33  # Widen range by
                     ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)

    # Next range of samples computes in names with prefix s_
    increase_runs = True
    if n_max == l_max:  # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = (n_mid + (n_max - n_mid) * widener)

    if n_min == l_min:  # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = (n_mid + (n_min - n_mid) * widener)

    s_min, s_max = (max(1, int(s_min)), int(s_max + 0.5))
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1

    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer

    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)

    return next_sample_range, runs_per_sample
# %%
if __name__ == '__main__':
    population_size = 100_000   # must be large, > 65K?
    sample_size = range(50, 5_000, 512)  # Increasing!
    p_hat = 0.50                # Population pass rate, 0.5 == 50%
    epsilon = 0.05              # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95     # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250       # each sample size is run this many times at start, ~250
    max_runs_per_sample = 35_000
    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon,
                                  confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f" For {p_hat, epsilon, confidence_level =}\n"
              f" Using {population_size, sample_size, runs_per_sample =}\n")

        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

    print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
          f"with a confidence level of {confidence_level*100.:.0f}%.")
END.