The Law of Large Numbers and the Central Limit Theorem: A Polling Simulation
Nov. 24, 2008
This is for those who
say: "Math was my worst subject in high school". If you've ever
placed a bet at the casino, at the track or played the lottery, you already
know the basics. It's about probability. It's about common sense. It's not all
that complicated.
It's for Excel
spreadsheet users who enjoy creating math models.
The Excel model can
be downloaded:
Monte Carlo Polling Simulation Excel Model
This Word Doc is for
those who don’t have Excel but want to view the simulation tables and graphics:
Monte Carlo Polling Simulation Doc
It's for reporters,
blogs and politicians who seek the truth: Robert Koehler, Brad Friedman, John
Conyers, Barbara Boxer, Mark Miller, Fitrakis, Wasserman, Kathy Dopp, Steve
Freeman, Ron Baiman, Jonathan Simon, Alistair Thompson, Paul Krugman, Keith
Olbermann, Mike Malloy, Randi Rhodes, Thom Hartman, Stephanie Miller, Joseph
Cannon, Sam Seder, Janeane Garofalo, etc.
It's for those who
have taken algebra, probability or statistics and want to see how the math is
applied to election polling.
It's for graduates
with degrees in mathematics, political science, an MBA, etc. who may or may not
be familiar with simulation concepts. Simulation is a powerful tool for
analyzing uncertainty in simple and complex models, such as coin flipping and
election polling.
It's for browsers who
frequent Discussion Forums.
It's for those
Corporate Media reporters who are still waiting for editor approval to discuss the
documented evidence of election fraud, statistical and anecdotal, in all
elections since 2000.
In Selection 2000,
Gore won the popular vote by 540,000. But Bush won the election by a single
vote. SCOTUS voted along party lines: Bush 5, Gore 4. That stopped the Florida
recount in its tracks. Gore easily won the state. Why does the
"liberal" media continue to spread the misinformation that he lost?
It's for the exit
poll naysayers who promote faith-based hypothetical arguments in their
unrelenting attempts to debunk the accuracy of the pre-election and exit polls.
________________________________________________________________________
FALSE RECALL,
RELUCTANT RESPONDERS, HOW THEY VOTED IN 2000: IMPLAUSIBLE, CONTRADICTORY AND
MATHEMATICALLY IMPOSSIBLE
Naysayers have a
problem with the 2004 pre-election and exit polls. Regardless of how many were
taken or how large the samples, the results are never good enough for them.
They prefer to cite two implausible hypotheticals: reluctant Bush responders
(rBr) and Gore voter memory lapse ("false recall").
How do pollsters
handle non-responders? They just increase the sample-size! Furthermore,
statistical studies show that there is no discernible correlation between
non-response rates and survey results.
How do pollsters
handle "false recall"? They know that in a large sample,
forgetfulness on the part of Gore and Bush voters will cancel out! There is no
evidence that Gore voters forget any more than Bush voters.
On the contrary, if
someone you knew robbed you in broad daylight, would you forget who it was four
years later? In 2000, Gore and the voters were robbed in broad daylight.
Naysayers claim that
bias favored Kerry in the pre-election and exit polls. Yet they offer no
evidence to back it up. They claim that Gore voters forgot and told the exit
pollsters they voted for Bush in 2000. It's their famous "false
recall" hypothetical. They were forced to use it when they could not come
up with a plausible explanation for the impossible weightings of Bush and Gore
voter turnout in the Final National Exit poll.
According to the
final 2004 NEP, which Bush won by 51-48%, 43% of the 13660 respondents voted
for Bush in 2000 while only 37% voted for Gore. This contradicts the reluctant
Bush responder (rBr) hypothesis. Furthermore, 43% of the 122.3 million who
voted in 2004 is 52.57mm, yet Bush only got 50.45 mm votes in 2000. The 43/37%
split is a mathematical impossibility.
In addition,
approximately 1.75 mm Bush 2000 voters died prior to the 2004 election.
Therefore, no more than 48.7 mm of Bush 2000 voters could have turned out to
vote in 2004. The Bush 2000 voter share was 48.7/122.3 (or 39.8%), assuming
that all of the Bush 2000 voters still living came to the polls. These
mathematical facts are beyond dispute. Kerry won the final 1:25pm exit poll by
50.93-48.66%, assuming equal 39.8% weights.
For the same reason,
Kerry must have done even better than his 51.4-47.6% winning margin at the
12:22am timeline (13047 respondents). Here the Bush/Gore mix was 41/39%. But we
have just shown that 39.8% was the absolute maximum Bush share. If we apply equal weightings to the 12:22am
results, then Kerry won by 52.25-46.77%, a 6.7 million vote margin
(63.8-57.1mm).
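The turnout arithmetic above can be checked in a few lines. This is only a sketch that reuses the figures quoted in the text (122.3 million 2004 voters, 50.45 million Bush 2000 votes, about 1.75 million deaths, the 43% NEP weight); the constant and function names are made up for illustration.

```python
# Figures quoted in the text, in millions of votes.
VOTES_2004 = 122.3
BUSH_2000_VOTES = 50.45
BUSH_2000_DEATHS = 1.75       # approximate, as stated in the text
NEP_BUSH_2000_WEIGHT = 0.43   # Bush 2000 share of respondents in the final NEP


def implied_bush_2000_voters(weight=NEP_BUSH_2000_WEIGHT, total=VOTES_2004):
    """Bush 2000 voters (millions) implied by the final NEP weighting."""
    return weight * total


def max_bush_2000_share(votes=BUSH_2000_VOTES, deaths=BUSH_2000_DEATHS,
                        total=VOTES_2004):
    """Largest possible Bush 2000 share of the 2004 electorate (100% turnout of survivors)."""
    return (votes - deaths) / total


if __name__ == "__main__":
    print(f"implied by 43% weight: {implied_bush_2000_voters():.1f} million")
    print(f"actual Bush 2000 vote: {BUSH_2000_VOTES:.2f} million")
    print(f"maximum possible share: {max_bush_2000_share():.1%}")
```

The implied count exceeds the actual Bush 2000 vote, which is the impossibility the text describes.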
First-time voters and
those who sat out the 2000 election, as well as Nader and Gore 2000 voters,
were overwhelmingly Kerry voters. The recorded Bush 2004 vote was 62 million.
Where did he get the 13 million new voters from 2000? How do the naysayers
explain it? Only by ignoring the mathematical facts and raising new implausible
theories.
It’s time to put on
the defoggers. We’ve had enough disinformation, obfuscation and
misrepresentation. Let the sunshine in. Let's review the basics.
________________________________________________________________________
A COIN-FLIP EXPERIMENT
Consider this
experiment. Flip a fair coin 10 times. Calculate the percentage of heads. Write
it down. Increase to 20 flips. Calculate the new total percentage. Write it
down.
Keep flipping. Write
down the percentage after every ten flips. Stop at 100. That's our final coin
flip sample-size.
When you're all done,
check the percentages. Is the sequence converging to 50%? That’s the true
population mean (average). That's the Law of Large Numbers.
The coin-flip is
easily simulated in Excel. Likewise, in the polling simulations which follow,
we will analyze the result of polling experiments over a range of trials
(sample size).
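For readers without Excel, the coin-flip experiment can also be sketched in Python. This is an illustration, not the Excel model itself; the function name and the seed are arbitrary choices.

```python
import random


def running_heads_percentages(total_flips=100, step=10, seed=None):
    """Flip a fair coin and record the cumulative % of heads every `step` flips."""
    rng = random.Random(seed)
    heads = 0
    checkpoints = []
    for flip in range(1, total_flips + 1):
        if rng.random() < 0.5:          # heads with probability 0.5
            heads += 1
        if flip % step == 0:            # write it down after every `step` flips
            checkpoints.append((flip, 100.0 * heads / flip))
    return checkpoints


if __name__ == "__main__":
    for flips, pct in running_heads_percentages(seed=2008):
        print(f"{flips:4d} flips: {pct:5.1f}% heads")
```

Raising total_flips to 10,000 or more makes the drift toward 50% much easier to see than it is with only 100 flips.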
_____________________________________________________
THE MATHEMATICAL
FOUNDATION
This model
demonstrates the Law of Large Numbers (LLN). LLN is the foundation and bedrock
of statistical analysis. LLN is illustrated through simulations of polling
samples. In a statistical context, LLN states that the mean (average) of a
random sample taken from a large population is likely to be very close to the
(true) mean of the population.
Start of math jargon
alert...
In probability
theory, several laws of large numbers say that the mean (average) of a sequence
of random variables with a common distribution converges to their common mean
as the size of the sequence approaches infinity.
The Central Limit
Theorem (CLT) is another famous result. The sample means (averages) of an
independent series of random samples (i.e. polls) taken from the same
population will tend to be normally distributed (the bell curve) as the size
of each sample increases. This holds for ALL practical statistical distributions.
End of math jargon
alert....
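Both results can be seen in a small experiment: run many independent polls of the same size and look at how the poll averages are spread. The sketch below assumes a 52.6% true share, 1000 respondents per poll and 2000 polls; those settings and the function names are illustrative only.

```python
import math
import random
import statistics


def simulate_poll_shares(p=0.526, n=1000, polls=2000, seed=None):
    """Return the share observed in each of `polls` simulated polls of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(polls)]


def summarize(shares, p, n):
    """Compare the simulated spread of poll means with the theoretical one."""
    theory_sd = math.sqrt(p * (1 - p) / n)
    within = sum(abs(s - p) <= 1.96 * theory_sd for s in shares) / len(shares)
    return statistics.mean(shares), statistics.pstdev(shares), theory_sd, within


if __name__ == "__main__":
    p, n = 0.526, 1000
    mean, sd, theory_sd, within = summarize(simulate_poll_shares(p, n, seed=1), p, n)
    print(f"mean of poll shares: {mean:.4f} (true share {p})")
    print(f"spread of poll shares: {sd:.4f} (theory {theory_sd:.4f})")
    print(f"polls within 1.96 standard deviations: {within:.1%}")
```

The mean of the polls should sit close to the true share (LLN), and roughly 95% of the polls should land within 1.96 standard deviations of it, as the bell curve predicts (CLT).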
It's really not all
that complicated. Naysayers never consider LLN or CLT. They maintain that polls
are not random-samples. They would have us believe that professional pollsters
are incapable of creating accurate surveys (i.e. effectively random samples)
through systematic, clustered or stratified sampling, especially when Bush is
running.
LLN and CLT say nothing
about bias.
________________________________________________________________
POLLING SAMPLE-SIZE
Just like in the
above coin-flipping example, the Law of Large Numbers takes effect as poll
sample-size increases. That's why the National Exit Poll was designed to survey
at least 13000 respondents.
Note the increasing
sequence of polling sample size as we go from the pre-election state (600) and
national (1000) polls to the state and national exit polls: Ohio (1963),
Florida (2846) and the National (13047).
Here is the National
Exit Poll Timeline:
Updated; respondents; vote share
3:59pm: 8349; Kerry led 51-48
7:33pm: 11027; Kerry led 51-48
12:22am: 13047; Kerry led 51-48
1:25pm: 13660; Bush led 51-48
The final was matched
to the vote.
So much for letting
LLN and CLT do their magic.
________________________________________________________________
USING RANDOM NUMBERS
TO SIMULATE A SEQUENCE OF POLLS
Random number
simulation is the best way to illustrate LLN:
1) Assume a true 2-party vote percentage for Kerry (i.e. 52.6%).
2) Simulate a series of 8 polls of varying sample size.
3) Calculate the sample mean vote share and win probability for each poll.
4) Confirm LLN by noting that as the poll sample size increases, the sample
mean (average) converges to the population mean ("true" vote).
It's just like
flipping a coin.
Assume there is a p
=52.6% probability that a random poll respondent voted for Kerry (HEADS).
This represents
Kerry's TRUE vote (his population mean)
Bush is TAILS with a
47.4% (1-p) probability.
A random number (RN)
between zero and one is generated for each respondent.
If RN is LESS than
Kerry's TRUE share, the vote goes to Kerry.
If RN is GREATER than
Kerry's TRUE share, the vote goes to Bush.
For example, assume
Kerry's TRUE 52.6% vote share (.526).
If RN is less than .526, Kerry's poll count is
increased by one.
If RN is greater than
.526, Bush's poll count is increased by one.
The sum of Kerry's
votes is divided by the poll sample size (i.e. 13047). This is Kerry's simulated
2-party vote share. It approaches his TRUE 52.6% vote share as the poll sample
size increases.
The LLN works in
polling the same way as in the coin flip experiment.
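The random-number procedure above can be sketched in Python as follows. The eight sample sizes are the ones mentioned earlier in the text; whether the Excel model uses exactly these eight is an assumption, and the function names are illustrative.

```python
import math
import random

TRUE_SHARE = 0.526  # Kerry's assumed TRUE 2-party share (from the text)
# Sample sizes mentioned in the text; using them as the 8 polls is an assumption.
SAMPLE_SIZES = [600, 1000, 1963, 2846, 8349, 11027, 13047, 13660]


def simulate_poll(n, p=TRUE_SHARE, rng=random):
    """One poll: each respondent goes to Kerry if RN < p, otherwise to Bush."""
    kerry = sum(1 for _ in range(n) if rng.random() < p)
    return kerry / n


def margin_of_error(share, n, cluster_effect=0.0):
    """95% MoE, scaled by (1 + cluster_effect) as in the text's example."""
    return 1.96 * math.sqrt(share * (1 - share) / n) * (1 + cluster_effect)


def win_probability(share, moe):
    """Normal-CDF probability that the true share exceeds 50%."""
    stdev = moe / 1.96
    return 0.5 * (1 + math.erf((share - 0.5) / (stdev * math.sqrt(2))))


if __name__ == "__main__":
    rng = random.Random(2004)
    for n in SAMPLE_SIZES:
        share = simulate_poll(n, rng=rng)
        moe = margin_of_error(share, n)
        print(f"n={n:6d}  Kerry={share:.2%}  MoE={moe:.2%}  "
              f"P(win)={win_probability(share, moe):.1%}")
```

Each run gives slightly different shares, but the larger samples stay closer to 52.6% and carry a smaller MoE.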
________________________________________________________________
THE STATE ELECTORAL
VOTE SIMULATION
In addition to
simulating Kerry's popular 2-party vote, the model also includes a State
Electoral Vote (EV) Simulator. The method is similar to the previous National
polling samples, with this exception:
Each simulation
consists of 100 election trials.
When the F9 key is
pressed, one hundred Monte Carlo simulation election trials are executed for
each of the 50 states and DC. In each trial, a random number (RN) is generated
for each state.
The RN is compared to
the probability of Kerry winning the state. If RN is less than the probability,
the state EV is added to his total. If RN is greater, Bush wins the state.
If Kerry's total EV
exceeds 269, he wins the election trial.
For example:
1) Assume that Kerry and Bush were tied in the FL exit poll.
Therefore, the probability that Kerry would win FL is 50%.
If RN is less than 0.50, Kerry wins FL's 27 electoral votes.
2) Assume that Kerry won the CA exit poll by 55-45%.
The probability of winning the state was 99.9%.
If RN is less than .999, Kerry wins CA's 55 electoral votes.
The number of election trials
that Kerry wins (out of the 100) gives his electoral vote win probability
as a percentage. In addition to Kerry's mean (average) EV, his
median (middle), maximum and minimum electoral vote totals are calculated for
the 100 trials.
Kerry's state win
probability is calculated using the Excel Normal Distribution Function (NDF).
Inputs to the NDF:
1) Kerry's 2-party share of the state exit poll
2) the standard deviation, Stdev = MoE/1.96, where MoE is the poll Margin of Error.
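A minimal Python sketch of the electoral vote simulator described above is given here. It is not the Excel model: the only state inputs are the two examples from the text (FL and CA), so the winning threshold is taken as half of the electoral votes supplied, which equals 269 only when all 50 states and DC are listed. Function and variable names are hypothetical.

```python
import math
import random
import statistics


def state_win_probability(share, moe):
    """P(true share > 50%) from the normal CDF with Stdev = MoE / 1.96."""
    stdev = moe / 1.96
    return 0.5 * (1 + math.erf((share - 0.5) / (stdev * math.sqrt(2))))


def simulate_elections(states, trials=100, seed=None):
    """states maps a name to (electoral_votes, kerry_win_probability)."""
    rng = random.Random(seed)
    total_ev = sum(ev for ev, _ in states.values())
    needed = total_ev // 2  # 269 when all 538 electoral votes are present
    ev_totals = []
    for _ in range(trials):
        # One RN per state: the state's EV go to Kerry if RN < his win probability.
        ev = sum(votes for votes, prob in states.values() if rng.random() < prob)
        ev_totals.append(ev)
    wins = sum(ev > needed for ev in ev_totals)
    return {
        "win_probability": wins / trials,
        "mean": statistics.mean(ev_totals),
        "median": statistics.median(ev_totals),
        "max": max(ev_totals),
        "min": min(ev_totals),
    }


# Only the two examples given in the text; a full run would list every state and DC.
EXAMPLE_STATES = {"FL": (27, 0.50), "CA": (55, 0.999)}

if __name__ == "__main__":
    print(simulate_elections(EXAMPLE_STATES, trials=100, seed=1))
```

Calling simulate_elections again with a different seed plays the role of pressing F9 in the spreadsheet.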
__________________________________________________________________
THE MARGIN OF ERROR
The MoE (at the 95%
confidence level) is the interval surrounding the sample mean which has a
95% probability of containing the TRUE
population mean.
For example, assume a
2% MoE for a state exit poll won by Kerry: 52-48%. The probability is 95% that
Kerry's TRUE vote is in the interval from 50% to 54%. The (one tail)
probability is 97.5% that Kerry's vote will exceed the interval lower limit of
50%.
This is the standard
formula used to calculate the MoE:
MoE = 1.96 * sqrt (p*(1-p)/n) * DE
n is the sample size,
p and 1-p are the 2-party vote shares.
DE is the exit poll "design effect": the ratio of the total number of
respondents required using cluster random sampling to the number required
using simple random sampling. In this model it enters as a multiplier on the
MoE (1.30 for a 30% cluster effect). A cluster randomized trial which has a
large design effect will require many more samples. As the number of
respondents increases so does the design effect. We can only
estimate the impact of the DE on the MoE.
But DE is only a factor in exit polls. There is no equivalent adjustment
made to the MoE in pre-election or approval polls.
The MoE decreases as
the sample-size (n) increases while the sample poll mean approaches the
population mean. It's the Law of Large Numbers. For a given sample size n, the
MoE is at its maximum value when p = 0.50. As p moves away from 0.50 in either
direction, the MoE declines. In the p = 0.50 case, the formula can be
simplified to: MoE = 1.96 * .5 / sqrt (n) = .98 / sqrt (n)
Let's calculate the
MoE for the 12:22am National Exit Poll:
n = 13047 sampled respondents
p = Kerry's true 2-party vote share = .526
1-p = Bush's vote share = .474
MoE = 1.96 * sqrt (.526*.474/13047) = .0086 = 0.86%
Adjusting for an assumed 30% exit poll cluster design effect,
MoE = 1.30 * 0.86% = 1.12%
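The same calculation can be written as a short Python function. The function name and the cluster_effect parameter are illustrative; the cluster effect is applied as a straight multiplier (1 + cluster effect), following the example above.

```python
import math


def margin_of_error(p, n, cluster_effect=0.0, z=1.96):
    """MoE = z * sqrt(p*(1-p)/n), scaled by (1 + cluster_effect) as in the text."""
    return z * math.sqrt(p * (1 - p) / n) * (1 + cluster_effect)


if __name__ == "__main__":
    base = margin_of_error(0.526, 13047)
    adjusted = margin_of_error(0.526, 13047, cluster_effect=0.30)
    print(f"simple random sample MoE: {base:.2%}")
    print(f"with 30% cluster effect:  {adjusted:.2%}")
    print(f"p = 0.50 shortcut .98/sqrt(n): {0.98 / math.sqrt(13047):.2%}")
```

Carrying full precision through the multiplication gives about 1.11% rather than 1.12%; the small difference comes from rounding 0.86% before multiplying by 1.30.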
Pollsters use proven
methodologies, such as cluster sampling, stratified sampling, etc. to attain a
near-perfect random sample. Why would a polling firm include the MoE for a poll
that was not an effective random sample?
________________________________________________________________
CALCULATING
PROBABILITIES
Kerry win
probabilities are the main focus of the simulation. They closely match the
theoretical probabilities obtained from the Excel Normal Distribution function.
The probabilities are
calculated using two methods:
1) running the
simulation and counting the votes
2) calculating the
Excel Normal Distribution function
Prob = NORMDIST (P, V, Stdev, true)
P = .526 is the mean Kerry poll vote share
V = 0.50 is the majority vote threshold.
Stdev = MoE/1.96. The standard deviation is a measure of dispersion around the mean.
Given that Kerry
led by 3% in the 2-party vote (12:22am National Exit Poll), his popular vote
win probability was close to 100%. And that assumes a 30% cluster effect!
For a 2% lead
(51-49), the win probability is 97.5% (still very high).
For a 1% lead
(50.5-49.5), it's 81% (4 out of 5).
For a 50/50 tie, it’s
50%.
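For readers without Excel, the normal-distribution calculation can be approximated in Python with the standard library's error function. This is a sketch: normdist_cdf is a hypothetical stand-in for NORMDIST with the cumulative flag set, and the 1.12% MoE is the 12:22am figure with the assumed 30% cluster effect.

```python
import math


def normdist_cdf(x, mean, stdev):
    """Cumulative normal distribution, analogous to NORMDIST(x, mean, stdev, TRUE)."""
    return 0.5 * (1 + math.erf((x - mean) / (stdev * math.sqrt(2))))


def win_probability(poll_share, moe, threshold=0.50):
    """Probability that the true share exceeds the threshold, given the poll share."""
    return normdist_cdf(poll_share, threshold, moe / 1.96)


if __name__ == "__main__":
    MOE = 0.0112  # 12:22am National Exit Poll MoE with the assumed 30% cluster effect
    for share in (0.526, 0.51, 0.505, 0.50):
        print(f"Kerry {share:.1%}: win probability {win_probability(share, MOE):.1%}")
```

The results depend on the MoE supplied. With a 1.12% MoE the 50.5-49.5 case comes out near 81% and the 51-49 case near 96%; a 51-49 lead gives exactly 97.5% when the MoE is 1.0%, since the lead is then 1.96 standard deviations above 50%.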
The following
probabilities are calculated in the model:
1) The confidence
level for Kerry's minimum vote share (MVS).
There is a 97.5%
probability that Kerry's true vote exceeds MVS.
The MVS increases as
the polling sample size grows.
2) The probability of
Bush obtaining his recorded two-party vote (51.24%).
The probability is
virtually zero that Bush's recorded vote would be almost 4% higher than his
47.4% two-party share.
3) The probability of
the state exit poll discrepancy from the recorded vote is a function of the
magnitude of the deviation, the MoE and cluster effect. The normal distribution
is used to calculate the probability.
4) The probability
that the MoE is exceeded in any given state is 1 in 40. The probability that
the MoE is exceeded in at least N states is calculated using the binomial
distribution function. The cluster effect makes a big difference in the
probability calculation. As the cluster effect is increased, so is the MoE,
which is therefore less likely to be exceeded.
Assuming a 30%
cluster effect, the vote discrepancy exceeded the exit poll MoE for Bush in 10
states. The probability of this occurrence is 1 in 2.5 MILLION.
Assuming a 20%
cluster effect, the MoE was exceeded in 13 states, a 1 in 4.5 BILLION
probability.
For a cluster effect
of 12% or less, the MoE was exceeded in 16 states, a 1 in 19 TRILLION
probability!
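The binomial calculation behind these odds can be sketched in Python. The per-state probability of 1 in 40 (0.025) comes from the text; the number of states, n = 50, is an assumption made for this illustration, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n=50, p=0.025):
    """Binomial tail: probability the MoE is exceeded in at least k of n states."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    for states in (10, 13, 16):
        odds = 1 / prob_at_least(states)
        print(f"MoE exceeded in at least {states} states: about 1 in {odds:,.0f}")
```

With these inputs the odds come out in the millions, billions and tens of trillions respectively, the same orders of magnitude as the figures quoted above.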
_______________________________________________________________
SIMULATION
GRAPHICS
http://www.geocities.com/electionmodel/MonteCarloPollingSimulation_26397_image001.gif
http://www.geocities.com/electionmodel/MonteCarloPollingSimulation_22578_image001.gif
http://www.geocities.com/electionmodel/MonteCarloPollingSimulation_21396_image001.gif
________________________________________________________________
DOWNLOADING THE EXCEL
MODEL AND RUNNING THE SIMULATION
http://us.share.geocities.com/electionmodel/MonteCarloPollingSimulation.zip
Two inputs drive the
state and national vote simulations:
1) Kerry's 2-party
true vote share (52.6%)
2) exit poll cluster
effect (set to 30%).
Press F9 to run the
simulation.
The graphs illustrate
polling simulation output based on the inputs:
1- Kerry's 2-party
vote (true population mean): 52.60%
2- The Exit Poll
Cluster effect (zero for pre-election): 30%
Play
"what-if" to see the effect of changing assumptions:
Lower Kerry's 2-party
vote share.
Press F9 to run the
simulation.
Note how a 1%
reduction in Kerry's "true vote" results in a decline of his polling
popular and electoral vote shares, corresponding win probabilities and minimum
vote at the 97.5% confidence level.
________________________________________________________________
Introduction to Statistics and Probability
List of statistical topics
List of probability topics
Opinion polls
Margin of error
Random sampling
Standard deviation
Standard score
Normal distribution
Central limit theorem
Correlation
Illustration of the central limit theorem
Independent identically distributed random variables
Statistical hypothesis testing
Law of large numbers
Least squares
Probability theory
Odds
Random data
Statistical power
Testing hypotheses
Monte Carlo Simulation and numerical analysis