In July 2021, The Economist highlighted a finding by data scientist Youyang Gu that inequality is the factor most strongly correlated with covid deaths. Interestingly, he also found that restriction stringency is not significantly correlated with covid mortality. In this example, using our Causal Inference analysis, we show that a higher stringency index is linked to lower covid deaths after controlling for factors such as population density, nursing home resident rate, median age, and mean temperature.
The original dataset prepared by Gu includes measures such as inequality, population density, restriction strictness, and obesity rate for each state. Specifically, for inequality, Gu uses the Gini inequality index, where zero means perfect equality and one means perfect inequality. For restrictions, Gu uses the stringency index tracked by a group from the Blavatnik School of Government at the University of Oxford. The full list of included metrics is given below.
After selecting Correlation Analysis and the uploaded data set, we pick deaths_per_100k as the correlation target and all other columns as compared factors. After clicking Run, the Correlation Analysis returns the most correlated factors and their Spearman-r values. Only statistically significant correlations are shown. As the chart below shows, the Gini inequality index is indeed the most correlated factor among all compared factors. The next most correlated factors are the number of nursing home residents per 100K people, population density, the unemployment rate in March 2021, and the average flu death rate.
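The same kind of ranking can be sketched outside the tool. The snippet below is a minimal illustration, assuming SciPy's `spearmanr` and a 5% significance threshold. The helper name `significant_spearman`, the threshold, and the synthetic data are assumptions made for illustration, not Actable AI's implementation or the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def significant_spearman(df, target, alpha=0.05):
    """Rank-correlate every other column with `target` and keep significant ones."""
    rows = []
    for col in df.columns:
        if col == target:
            continue
        r, p = spearmanr(df[col], df[target])
        if p < alpha:  # hypothetical threshold; the tool's cutoff is not stated
            rows.append({"factor": col, "spearman_r": r, "p_value": p})
    out = pd.DataFrame(rows, columns=["factor", "spearman_r", "p_value"])
    # Sort by absolute strength so strong negative correlations rank high too.
    order = out["spearman_r"].abs().sort_values(ascending=False).index
    return out.reindex(order).reset_index(drop=True)


# Synthetic stand-in data (NOT the dataset used in this example).
rng = np.random.default_rng(0)
n = 50
gini = rng.normal(size=n)
demo = pd.DataFrame({
    "gini_coefficient": gini,
    "unrelated": rng.normal(size=n),
    "deaths_per_100k": 2.0 * gini + rng.normal(scale=0.5, size=n),
})
result = significant_spearman(demo, "deaths_per_100k")
print(result)
```

Spearman's r works on ranks, so it captures any monotonic relationship rather than only linear ones.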
As observed in Gu's study, the stringency index is not strongly correlated with covid deaths (at least, its correlation is not statistically significant). However, we should not conclude that there is certainly no association between the stringency index and covid deaths, because the absence of correlation does not imply the absence of association. For example, a correlation coefficient cannot uncover some complicated associations, as shown in the bottom row of this example on Wikipedia.
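A small constructed example makes the point concrete. In the sketch below, `y` is completely determined by `x`, yet both the Pearson and Spearman coefficients are zero because the relationship is symmetric rather than monotonic. The data are synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# x is exactly symmetric around zero; y depends on x deterministically.
x = np.arange(-100, 101) / 100.0
y = x ** 2

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")

# The dependence is real: once the symmetry is removed, it shows up clearly.
folded_r, _ = spearmanr(np.abs(x), y)
print(f"Spearman r of |x| vs y: {folded_r:.3f}")
```

Both coefficients on the raw `x` come out at zero (up to floating-point error), while the folded version is perfectly rank-correlated.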
To better examine the potential association between a target and the examined factors, it is better to build a predictive model that predicts the target using the examined factors as its predictors. Once a quality predictive model is built, we can then look at the importance of each predictor in the model's decisions. In Actable AI, that can easily be done by selecting Regression Analysis to apply to this dataset. After selecting the predicted target column and predictor columns, we also select Optimize for performance to further improve the quality of the model, and Cross Validation (with 10 folds). As the data set is small, Cross Validation lets every row be used for both training and evaluation, and it gives an estimate of the uncertainty in the trained model's performance.
In the above figure, the trained model's performance and feature importance are displayed after running the analysis. The black scores are average scores and the red scores are their standard errors across Cross Validation splits. The importance of a feature is defined as how much of the model's performance is lost when the values of that feature are randomly shuffled.
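The definition above can be sketched in a few lines. The snippet below is a minimal illustration, assuming scikit-learn and a random forest as a stand-in model (the model that Actable AI trains is not specified here). The function name `cv_permutation_importance` and the synthetic data are hypothetical. For each of the 10 folds it records the held-out R2 and the drop in R2 after shuffling each feature, then reports means and standard errors across folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def cv_permutation_importance(X, y, n_splits=10, seed=0):
    """Mean and standard error of R2 and of per-feature R2 drops across folds."""
    rng = np.random.default_rng(seed)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores, drops = [], []
    for train_idx, test_idx in kfold.split(X):
        # Stand-in model: an assumption, not the tool's actual learner.
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        base = r2_score(y[test_idx], model.predict(X[test_idx]))
        scores.append(base)
        fold_drops = []
        for j in range(X.shape[1]):
            X_perm = X[test_idx].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j only
            fold_drops.append(base - r2_score(y[test_idx], model.predict(X_perm)))
        drops.append(fold_drops)
    scores, drops = np.asarray(scores), np.asarray(drops)

    def stderr(a):
        return a.std(axis=0, ddof=1) / np.sqrt(len(a))

    return scores.mean(), stderr(scores), drops.mean(axis=0), stderr(drops)


# Synthetic data: feature 0 matters most, feature 2 not at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

mean_r2, se_r2, importance, importance_se = cv_permutation_importance(X, y)
print(f"R2 = {mean_r2:.2f} +/- {se_r2:.2f}")
print("importance:", np.round(importance, 3))
```

Shuffling is done on the held-out fold only, so the importance reflects what the fitted model relies on when predicting unseen rows.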
The R2 and the root mean square error (RMSE) indicate that the model has high predictive power (R2 = 0 means no predictive power). Interestingly, unlike in the correlation analysis, the stringency index is the 5th most important factor in predicting covid deaths. It is not as strong as the Gini index or the number of nursing home residents, but it is comparable to obesity and the percentage of non-white population.
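To make the two metrics concrete, the sketch below computes R2 and RMSE by hand with NumPy. The numbers are made up for illustration and are not values from the dataset. It shows why R2 = 0 corresponds to no predictive power: a model that always predicts the mean of the target scores exactly zero.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Hypothetical deaths per 100k for five states.
y_true = np.array([120.0, 180.0, 150.0, 210.0, 90.0])

# Baseline that ignores all predictors and always predicts the mean.
baseline = np.full_like(y_true, y_true.mean())
baseline_r2 = r_squared(y_true, baseline)
baseline_rmse = rmse(y_true, baseline)

print("baseline R2:", baseline_r2)
print("baseline RMSE:", round(baseline_rmse, 2))
print("perfect-model R2:", r_squared(y_true, y_true))
```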
However, just as with correlation, association does not mean causation. Ultimately, in order to take meaningful actions, we are interested in the causal effect of these factors on the target variable. Specifically, if we know the causal effect of the stringency index on covid deaths, then we can adjust different non-pharmaceutical interventions accordingly. Using Causal Inference, our users can control for different confounders to isolate the effect of a given treatment on an outcome.
The results below show that, when controlling for other variables (population density, inequality, obesity rate, median age, nursing resident rate, mean temperature, urban population rate) as confounders, the stringency index has a negative association with covid deaths (-2.4, with a 95% confidence interval of [-4.7, -0.1]). That is, an increase of 1.0 in the stringency index is estimated to reduce covid deaths by 2.4 per 100k people, with a 95% confidence interval of 0.1 to 4.7 fewer deaths per 100k people.
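The idea of controlling for confounders can be illustrated with a simple linear stand-in. The sketch below is not Actable AI's estimator. It assumes a linear model and uses the partialling-out approach: remove what the confounders explain from both treatment and outcome, then regress the residuals on each other. The function names, the normal-approximation interval (1.96 standard errors), and the synthetic data are assumptions. The data are built so that a confounder hides a true effect of -2.4 from a naive regression.

```python
import numpy as np


def residualize(v, C):
    """Remove the part of v that is linearly explained by confounders C."""
    C1 = np.column_stack([np.ones(len(v)), C])
    beta, *_ = np.linalg.lstsq(C1, v, rcond=None)
    return v - C1 @ beta


def adjusted_effect(t, y, C):
    """Effect of t on y after partialling out C, with an approximate 95% CI."""
    t_res, y_res = residualize(t, C), residualize(y, C)
    effect = (t_res @ y_res) / (t_res @ t_res)
    resid = y_res - effect * t_res
    dof = len(y) - C.shape[1] - 2  # confounders + intercept + treatment
    se = np.sqrt((resid @ resid) / dof / (t_res @ t_res))
    return effect, (effect - 1.96 * se, effect + 1.96 * se)


# Synthetic data: the confounder raises both the treatment and the outcome.
rng = np.random.default_rng(0)
n = 2000
confounder = rng.normal(size=n)
treatment = confounder + rng.normal(size=n)
outcome = -2.4 * treatment + 4.8 * confounder + rng.normal(size=n)

naive_slope = np.polyfit(treatment, outcome, 1)[0]
effect, (ci_low, ci_high) = adjusted_effect(
    treatment, outcome, confounder.reshape(-1, 1)
)
print(f"naive slope:     {naive_slope:.2f}")
print(f"adjusted effect: {effect:.2f}  95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

In this constructed case the naive slope is close to zero while the adjusted estimate recovers the negative effect, which mirrors how a factor can look uncorrelated and still have an effect once confounders are accounted for.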
However, the effect might be heterogeneous: under different conditions, the effect of the measures may differ. We further break down the effect by nursing resident rate by selecting that feature as the effect modifier and re-running the analysis. The result is interesting. In states with fewer nursing residents per 100k people, the mean effect is smaller and much more uncertain, while in states with more nursing residents per 100k people, the effect is considerably stronger and statistically significant (a narrower confidence interval that lies entirely below 0).
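A rough way to see effect heterogeneity is to estimate the adjusted effect separately for low and high values of the effect modifier. The sketch below does this with a median split and ordinary least squares. This is a simplification chosen for illustration, not the method the tool uses. The helper names and the synthetic data (where the true effect grows stronger as the modifier increases) are assumptions.

```python
import numpy as np


def ols_slope(t, y, C):
    """Coefficient on t from an OLS fit of y on [1, t, C]."""
    X = np.column_stack([np.ones(len(t)), t, C])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


def effect_by_modifier(t, y, C, modifier):
    """Adjusted effect of t on y below and above the modifier's median."""
    high = modifier > np.median(modifier)
    return {
        "low": ols_slope(t[~high], y[~high], C[~high]),
        "high": ols_slope(t[high], y[high], C[high]),
    }


# Synthetic data: the true effect is -1 - 3 * modifier.
rng = np.random.default_rng(0)
n = 2000
confounder = rng.normal(size=n)
modifier = rng.uniform(size=n)  # stands in for nursing resident rate
treatment = confounder + rng.normal(size=n)
outcome = (
    (-1.0 - 3.0 * modifier) * treatment
    + 2.0 * confounder
    + rng.normal(size=n)
)

controls = np.column_stack([confounder, modifier])
effects = effect_by_modifier(treatment, outcome, controls, modifier)
print({k: round(v, 2) for k, v in effects.items()})
```

A median split discards information about how the effect varies within each half, so it is best read as a quick diagnostic rather than a full conditional effect estimate.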
To check for reverse causality, we ran the same analysis on only the states with a lower-than-average death rate and observed a similar effect. As these states were more likely to impose restrictions as a precaution than as a reaction, we conclude that reverse causality is unlikely to affect our analysis.
In this analysis, we demonstrate that within 10 minutes one can use Actable AI to find the factors most correlated with, and most predictive of, covid deaths across US states. We also estimate the causal relationship (when causal assumptions are met) between the stringency index and covid deaths when various factors are controlled for. The analysis confirms that the Gini inequality index and the number of nursing home residents are strongly associated with COVID-19 deaths. Even though the stringency index does not show a significant correlation with COVID-19 deaths, it has some predictive association and a significant effect when other factors are controlled for. The effect is particularly strong in states with a higher nursing resident rate.
The dataset is from The Economist's 2021 article on why some places have suffered more COVID-19 deaths than others. The dataset consists of 16 variables, listed below:
deaths_per_100k: Number of covid deaths per 100K people
dem_margin_2020: Tendency to vote for the Democratic Party
perc_25plus_with_bachelors: Percentage of people aged 25 and over with a Bachelor's degree
income_per_capita: Income per capita
perc_pop_at_least_1_dose_may_2021: Percentage of population with at least one dose of vaccine in May 2021
stringency_index: Restriction stringency index by Blavatnik School of Government, Oxford
population_per_sq_mi: Population per square mile
perc_urban: Percentage of population living in urban areas
median_age: Median age
mean_temperature: Mean temperature
perc_blue_collar_jobs: Percentage of people with blue collar jobs
obesity_rate: Obesity rate
perc_pop_nonwhite: Percentage of non-white population
flu_death_rate: Average flu death rate
gini_coefficient: Gini inequality index
nursing_resid_per_100k: Number of nursing residents per 100k people
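
Before running any analysis on a copy of this data, it can help to confirm that all 16 columns are present. The sketch below is one hedged way to do that with pandas. The column names come from the list above, while the helper `missing_columns` and the demo frame are hypothetical.

```python
import pandas as pd

# Column names as listed in this example.
EXPECTED_COLUMNS = [
    "deaths_per_100k",
    "dem_margin_2020",
    "perc_25plus_with_bachelors",
    "income_per_capita",
    "perc_pop_at_least_1_dose_may_2021",
    "stringency_index",
    "population_per_sq_mi",
    "perc_urban",
    "median_age",
    "mean_temperature",
    "perc_blue_collar_jobs",
    "obesity_rate",
    "perc_pop_nonwhite",
    "flu_death_rate",
    "gini_coefficient",
    "nursing_resid_per_100k",
]


def missing_columns(df):
    """Return the expected columns that are absent from df, in list order."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Demo frame that deliberately lacks the last column.
demo = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(demo))
```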