Introduction

Polling is widely treated as the foundation of electoral predictions. However, its predictive power may be considerably weaker than expected. In this week’s post, I explore this by building a few predictive models for national and state races using only polling data, and I analyze the results.

Despite polling’s prominence, the circumstances of this election make the data particularly limited. Since President Biden exited the race in mid-July, only eight weeks of polling data directly compare the current two candidates. That data spans from 15 weeks before the election to 7 weeks prior, so the models in this post are restricted to the same window in past elections.

There are a few ways to incorporate polling into a regression model. The simplest, though likely the least accurate, is to average all available data for a candidate over a specific period. A more complex option is to treat each week as a separate variable in the regression and then apply a regularization method, which can shrink or drop the coefficients on uninformative weeks while keeping the mean-squared error low. While this method might be preferable for predicting national vote share, it presents challenges at the state level: it requires weekly polling in each state, and that kind of data simply doesn’t exist. As a result, I relied on the simpler method of averaging polling support over a period for my models. Another potential approach is to weight each week’s polling data manually, giving more weight to polls conducted closer to the election. However, determining the correct weights would require extensive analysis to avoid arbitrary decisions.
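To make the difference between plain averaging and recency weighting concrete, here is a minimal Python sketch. The weekly figures are invented placeholders rather than real polling data, and the linear weighting scheme is an assumption chosen only for illustration; it is exactly the kind of arbitrary choice that would need further analysis before being used in a model.

```python
# Hypothetical weekly polling averages for one candidate, keyed by
# weeks before the election (placeholder values, not real polls).
weekly_support = {15: 45.0, 13: 46.0, 11: 46.5, 9: 47.0, 7: 47.5}


def simple_average(support):
    """Unweighted mean of the weekly values (the averaging approach)."""
    return sum(support.values()) / len(support)


def recency_weighted_average(support, max_week=16):
    """Weighted mean in which weeks closer to the election count more.

    The weight (max_week - weeks_out) is an illustrative assumption.
    """
    weights = {week: max_week - week for week in support}
    total_weight = sum(weights.values())
    return sum(support[week] * weights[week] for week in support) / total_weight


print(round(simple_average(weekly_support), 2))
print(round(recency_weighted_average(weekly_support), 2))
```

Because support rises toward the election in this toy series, the recency-weighted figure comes out above the simple average; with a different trend, the ordering would flip.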

National Model

Using the averaging approach described above, this model applies ordinary least squares (OLS) to measure the relationship between the average polling support for each party’s candidate and that candidate’s eventual two-party vote share, across the elections from 1968 to 2020. The polling data spans from 15 weeks before the election to 7 weeks prior.
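For readers who want to see the mechanics, the sketch below fits a one-variable OLS line in plain Python and uses it to make a prediction. It is not the code behind this post: the function name `fit_ols` and all input numbers are hypothetical placeholders, so the fitted values do not match the tables in this post.

```python
def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Placeholder inputs, NOT the real 1968-2020 data: one polling average
# and one two-party vote share per past election.
poll_avg = [42.0, 45.0, 47.0, 50.0, 52.0]
vote_share = [46.0, 48.5, 49.0, 51.5, 53.0]

intercept, slope, r_squared = fit_ols(poll_avg, vote_share)

# Predict the vote share for a hypothetical current polling average.
prediction = intercept + slope * 48.0
print(round(slope, 3), round(r_squared, 3), round(prediction, 2))
```

A slope below 1, as in Table 1, means that a one-point gain in polling support translates into less than a one-point gain in predicted vote share.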

I created separate models for each party due to how polling data is structured. Each poll essentially asks whether a voter supports a particular candidate, but a lack of support for one candidate does not necessarily indicate support for the opposing party. It simply means the voter isn’t committed to the candidate in question. Therefore, it’s not accurate to only focus on polling responses about one party’s candidate when predicting the election outcome.

Table 1: National Polling Model Coefficients, P-Values, and R-Squared for Republicans and Democrats

Party	Polling Coefficient	P-Value	R-Squared
Republican	0.6699376	0.0001198	0.7218553
Democrat	0.5008233	0.0059140	0.4814119

As shown in the table above, both models are statistically significant at the 0.05 level. However, both the R-squared value and the coefficient are considerably higher for the Republican model (0.72 and 0.67) than for the Democratic model (0.48 and 0.50). This indicates that Republican polling in this period tracks the eventual two-party vote share more closely than Democratic polling does. This finding aligns with recent elections, where polls have tended to overestimate how well Democratic candidates would perform compared to their actual results.

Table 2: 2024 National Predicted 2-Party Vote Share

Party	Predicted 2PV (2024)	Lower 95% Interval	Upper 95% Interval
Republican	50.88708	44.78523	56.98894
Democrat	50.23635	41.87147	58.60124

Interestingly, both models predict a vote share greater than 50% for their respective parties, which is, of course, impossible since the two-party shares must sum to 100%. This is likely because the election and the polling are extremely close. As I mentioned earlier, polls only measure whether a candidate is supported, without accounting for voters who are uncommitted or plan to vote for a third party. Typically, polling support of around 47-48% corresponds to an eventual vote share of around 51-52%, as 5-10% of voters often remain undecided or support third-party candidates. In this case, it seems that, according to the 2024 polls, more voters are committed to one of the two main party candidates than usual. This could explain why the models predict that both candidates will receive more than 50% of the vote, an anomaly that reflects the tight race and polarized electorate.

Lastly, while both candidates are predicted to receive more than 50% of the vote, the predicted vote share for the Republican candidate is about 0.65 percentage points higher than for the Democratic candidate. This might seem surprising, given that most polls favor Harris in a contest against Trump. However, once you account for the historical pattern the models capture, in which Republican candidates tend to outperform their poll standings, the results make more sense. Historically, Democrats have underperformed their poll averages from weeks 15 to 7, while Republicans have exceeded theirs. As a result, despite Harris leading in the polls, Trump is still predicted to gain more of the vote share.
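The gap between the two predictions, and the fact that they sum to more than 100, can be checked directly from the Table 2 values. The rescaling step at the end is not part of the models in this post; it is just one simple, assumed way to force the two shares to sum to 100 for comparison.

```python
# Point predictions taken from Table 2.
rep_2pv = 50.88708
dem_2pv = 50.23635

gap = rep_2pv - dem_2pv    # Republican lead in percentage points
total = rep_2pv + dem_2pv  # exceeds 100 because the models are fit separately

# Assumed reconciliation (not the author's method): rescale both
# predictions proportionally so that they sum to 100.
rep_rescaled = 100 * rep_2pv / total
dem_rescaled = 100 * dem_2pv / total

print(round(gap, 2), round(total, 2))
print(round(rep_rescaled, 2), round(dem_rescaled, 2))
```

Rescaling leaves the ordering of the two candidates unchanged and shrinks the gap only slightly, so the qualitative conclusion does not depend on it.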

State Models

While predicting the national two-party vote share is interesting, that metric doesn’t determine the winner of the election. Instead, the outcome will likely hinge on the winners of the seven key swing states I identified in my first post:

North Carolina
Georgia
Arizona
Nevada
Pennsylvania
Michigan
Wisconsin

To assess whether polls could help predict the outcome of these states in the 2024 election, I replicated the methodology above, fitting a separate model for each state. However, the amount of historical polling data available for each state is much more limited. For some states, like Nevada, no data was available for an entire election, or the average for an election rested on a single poll. As a result, many of the models were not statistically significant, as shown in the tables below.

Table 3: State Polling Model Coefficients, P-Values, and R-Squared for Democratic Candidates

State	Polling Coefficient	P-Value	R-Squared
Arizona	0.0675212	0.7864333	0.0111972
Georgia	-0.0422767	0.8792238	0.0041723
Michigan	0.2908922	0.1325988	0.2114132
North Carolina	0.8065565	0.0005226	0.7158211
Pennsylvania	-0.1412926	0.3147346	0.1670427
Nevada	0.6011793	0.2791756	0.1908194
Wisconsin	0.2237993	0.0650050	0.3635602

Table 4: State Polling Model Coefficients, P-Values, and R-Squared for Republican Candidates

State	Polling Coefficient	P-Value	R-Squared
Arizona	0.3977551	0.0241515	0.5397860
Georgia	0.4160517	0.0247965	0.5959767
Michigan	0.1785548	0.4191225	0.0662982
North Carolina	0.5414656	0.0009602	0.6803417
Pennsylvania	0.2938137	0.1202547	0.3532072
Nevada	0.2291605	0.7357058	0.0204175
Wisconsin	0.0970018	0.6602218	0.0253794

For the Democratic models, only two states passed the statistical significance test at the 0.1 level: Wisconsin and North Carolina. Both had numerous polls for every election in the dataset, which explains why a predictive model could be developed. For the Republican models, only three states passed the same test and are worth considering: North Carolina, Arizona, and Georgia.
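The screening step can be reproduced from the p-values in Tables 3 and 4. The sketch below is a minimal illustration in Python; the helper name `significant_states` is hypothetical, and the 0.1 threshold is the one used in this post.

```python
ALPHA = 0.10  # significance level used in this post

# P-values copied from Table 3 (Democratic models).
dem_p_values = {
    "Arizona": 0.7864333, "Georgia": 0.8792238, "Michigan": 0.1325988,
    "North Carolina": 0.0005226, "Pennsylvania": 0.3147346,
    "Nevada": 0.2791756, "Wisconsin": 0.0650050,
}

# P-values copied from Table 4 (Republican models).
rep_p_values = {
    "Arizona": 0.0241515, "Georgia": 0.0247965, "Michigan": 0.4191225,
    "North Carolina": 0.0009602, "Pennsylvania": 0.1202547,
    "Nevada": 0.7357058, "Wisconsin": 0.0970018 + 0.5632200,
}


def significant_states(p_values, alpha=ALPHA):
    """Return the states whose model p-value falls below alpha."""
    return sorted(state for state, p in p_values.items() if p < alpha)


print(significant_states(dem_p_values))
print(significant_states(rep_p_values))
```

Note that Wisconsin’s Democratic model (p ≈ 0.065) passes only because the threshold is 0.1; at the 0.05 level used for the national models, it would be dropped.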

Table 5: 2024 Democratic Predicted 2-Party Vote Share

State	Predicted 2PV (Democratic)	Lower 95% Interval	Upper 95% Interval
North Carolina	48.87390	40.67340	57.07439
Wisconsin	52.28988	46.16889	58.41087

Table 6: 2024 Republican Predicted 2-Party Vote Share

State	Predicted 2PV (Republican)	Lower 95% Interval	Upper 95% Interval
Arizona	53.59233	47.18375	60.00090
Georgia	52.51831	47.06603	57.97059
North Carolina	54.15122	45.64429	62.65815

Given these results, I generated 2024 predictions only for the state models that were statistically significant. Both the Democratic and Republican models predict a Republican victory in North Carolina. Additionally, the Republican model forecasts wins in Arizona and Georgia, while the Democratic model predicts a Democratic win in Wisconsin.
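These calls follow from comparing each point estimate in Tables 5 and 6 with 50%. The short Python sketch below makes that rule explicit and also checks whether each 95% interval excludes 50%. The helper `summarize` is a hypothetical name introduced for illustration.

```python
# (state, party modeled, predicted 2PV, lower 95%, upper 95%)
# Values copied from Tables 5 and 6.
predictions = [
    ("North Carolina", "Democratic", 48.87390, 40.67340, 57.07439),
    ("Wisconsin", "Democratic", 52.28988, 46.16889, 58.41087),
    ("Arizona", "Republican", 53.59233, 47.18375, 60.00090),
    ("Georgia", "Republican", 52.51831, 47.06603, 57.97059),
    ("North Carolina", "Republican", 54.15122, 45.64429, 62.65815),
]


def summarize(party, point, lower, upper):
    """Return the predicted winner and whether the interval excludes 50%."""
    other = "Republican" if party == "Democratic" else "Democratic"
    winner = party if point > 50 else other
    decisive = lower > 50 or upper < 50
    return winner, decisive


for state, party, point, lower, upper in predictions:
    winner, decisive = summarize(party, point, lower, upper)
    print(f"{state} ({party} model): {winner} win, decisive={decisive}")
```

Every interval in Tables 5 and 6 contains 50%, so each of these calls is a lean rather than a confident prediction.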

Concluding Thoughts

Overall, while state polling provides valuable granularity, the limited amount of data makes predicting outcomes much harder at the state level than at the national level.

As for polling in general, I believe the conclusion is the same one we reached last week for economic factors: while polling can be a decent standalone predictor, especially as the data gets closer to election day, it is most accurate when combined with other factors.

Post #3: Polling

Avi Agarwal

2024/09/23
