In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image

Ideas on what to do

Targets to predict: #deaths, #cases, some measure of whether the curve is flattening, %deaths/cases

1.) Feature selection (choose the features that best predict high deaths/cases)

  1. Plot the variance of each feature?
  2. Run a logistic regression on each feature individually (does not account for correlations between features; see the univariate-scoring sketch after this list)
  3. Determine the best subset of features (seems difficult without brute-forcing)
  4. Something with PCA (how do we map the principal components back to the original features?)

2.) Using the time series, determine which counties have lowered their case/death counts

  1. Same as #1, but find the features that predict a lowered case/death count (flattening the curve)
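
As a rough sketch of idea 1.2 (scoring each feature on its own), sklearn's f_regression ranks features with a univariate linear fit; the X and y here are toy stand-ins, not the real tables loaded below:

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Toy stand-ins for the real feature matrix and target (hypothetical data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = 2 * X['a'] + rng.normal(size=100)

# f_regression fits a univariate linear model per column and returns
# F-statistics and p-values, so features can be ranked one at a time.
F, pval = f_regression(X, y)
print(sorted(zip(X.columns, F), key=lambda t: -t[1]))  # 'a' should rank first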

Interesting features in datasets

Counties:

  1. Rural-UrbanContinuumCode2013 (urban/rural)
  2. Demographic Info
  3. Health Resource Availability (#ICU beds, etc.)
  4. Health Outcomes and Risk Factors (other health conditions that can worsen outcomes)
  5. Social Distancing and Mobility (private data)

Deaths & Cases

  1. Match to county info via FIPS
  2. Gives the number of (new?) deaths/cases for each date

States

  1. Doesn't seem as useful - just case counts for different states/regions (including China, etc.)
  2. No feature data on the states (so going by county seems better)

Load Data

In [2]:
states = pd.read_csv('covid19/4.18states.csv')
counties = pd.read_csv('covid19/abridged_couties.csv')
cases = pd.read_csv('covid19/time_series_covid19_confirmed_US.csv')
deaths = pd.read_csv('covid19/time_series_covid19_deaths_US.csv')

1) Data Cleaning and Preprocessing

Deaths and Cases

In [3]:
print('Num rows:', len(deaths))
print('Num cols:', len(deaths.columns))
s = deaths.isnull().sum(axis=0)
s.loc[s > 0]
Num rows: 3255
Num cols: 100
Out[3]:
FIPS      4
Admin2    7
dtype: int64
In [4]:
print('Num rows:', len(cases))
print('Num cols:', len(cases.columns))
s = cases.isnull().sum(axis=0)
s.loc[s > 0]
Num rows: 3255
Num cols: 99
Out[4]:
FIPS      4
Admin2    7
dtype: int64
In [5]:
for col in deaths.columns:
    if col not in cases.columns:
        print(col)
Population
In [6]:
# Verify that the NaNs in FIPS correspond between deaths and cases
print(deaths[deaths['FIPS'].isnull()]['Admin2'])
print(cases[cases['FIPS'].isnull()]['Admin2'])
3147                          Dukes and Nantucket
3148                                  Kansas City
3253    Michigan Department of Corrections (MDOC)
3254       Federal Correctional Institution (FCI)
Name: Admin2, dtype: object
3147                          Dukes and Nantucket
3148                                  Kansas City
3253    Michigan Department of Corrections (MDOC)
3254       Federal Correctional Institution (FCI)
Name: Admin2, dtype: object

From the cells above, we see that the deaths and cases dataframes are very similar. They share all columns except Population, which only deaths has, and the rows with NaNs in FIPS (and in Admin2) are the same in both dataframes.

Because only 4 of the 3255 rows are missing a FIPS value, we drop those rows. We ignore the NaNs in Admin2, since it only holds the county name and can be filled back in from the FIPS code if needed (see the lookup sketch below). We also postpone filling in Population until we decide we need it.

In [7]:
# Drop the rows with a missing FIPS, then verify that 4 rows were removed
deaths = deaths[deaths['FIPS'].notnull()]
cases = cases[cases['FIPS'].notnull()]
print('Deaths new num rows:', len(deaths))
print('Cases new num rows:', len(cases))
Deaths new num rows: 3251
Cases new num rows: 3251
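
If the Admin2 county names are needed later, a FIPS-to-name lookup can be rebuilt from the rows that have both fields; a minimal sketch (fips_to_name is our own name):

# Build a FIPS -> county-name lookup from rows where Admin2 is present.
fips_to_name = deaths.dropna(subset=['Admin2']).set_index('FIPS')['Admin2'].to_dict()
print(fips_to_name.get(1001.0))  # Autauga; FIPS is still a float column at this point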

Counties

There are 87 columns in the counties dataset, but some of them have a very large number of NaNs. Later, when we model the key indicators of coronavirus cases and deaths, we will decide on a cutoff for dropping columns. For example, if a column has more than 324 NaNs (over 10% of its rows missing), we could drop it.

In [8]:
# Counties Dataset
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
print("Number of nans in each column:")
counties.isnull().sum(axis=0).sort_values(ascending=False)[:10]
# np.array(counties.isnull().sum(axis=0).tolist())
Num rows: 3244
Num cols: 87
Number of nans in each column:
Out[8]:
3-YrMortalityAge1-4Years2015-17      3179
mortality2015-17Estimated            3149
3-YrMortalityAge5-14Years2015-17     3149
3-YrMortalityAge<1Year2015-17        2772
3-YrMortalityAge15-24Years2015-17    2610
3-YrMortalityAge25-34Years2015-17    2270
3-YrMortalityAge35-44Years2015-17    1916
3-YrDiabetes2015-17                  1744
HPSAUnderservedPop                   1141
HPSAServedPop                        1141
dtype: int64
In [9]:
MAX_NANS_PER_COL = 2000  # drop columns missing more than this many values
number_nans_in_col = counties.isnull().sum(axis=0).values
high_nan_cols = counties.columns[number_nans_in_col > MAX_NANS_PER_COL]
counties = counties.drop(high_nan_cols, axis=1)
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Num rows: 3244
Num cols: 81

Similarly, some rows have very few entries. In an extreme example, the rows for New York City and Kansas City have essentially no feature data at all. We drop rows that are missing values for more than 40 features.

In [10]:
print("Number of nans in each county:")
counties.set_index("CountyName").isnull().sum(axis=1).sort_values(ascending=False)[:10]
Number of nans in each county:
Out[10]:
CountyName
Kansas City                        78
Dade                               78
Yellowstone Nat Park               78
South Boston City                  78
New York City                      78
Clifton Forge City                 76
Aguijan                            75
Prince of Wales-Outer Ketchikan    75
Rota                               75
Skagway-Hoonah-Angoon              75
dtype: int64
In [11]:
MAX_NANS_PER_ROW = 40  # drop rows missing values for more than this many features
number_nans_in_row = counties.isnull().sum(axis=1).values
high_nan_idx = np.where(number_nans_in_row > MAX_NANS_PER_ROW)[0]
counties = counties.drop(counties.index[high_nan_idx], axis=0)  # map positions to index labels
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Num rows: 3225
Num cols: 81

The data appears to be ordered roughly geographically (counties from the same state are grouped together). We assume that consecutive rows have similar values, so we fill in the NaNs using ffill followed by bfill. This sidesteps the problem of defining a mean or median for categorical columns.

In [12]:
# Fill NaNs (forward-fill, then back-fill) and verify that none remain
counties = counties.ffill().bfill()
counties.isnull().sum(axis=0).sort_values(ascending=False)[:5]
Out[12]:
HPSAUnderservedPop                0
PopMale10-142010                  0
#EligibleforMedicare2018          0
MedicareEnrollment,AgedTot2017    0
3-YrDiabetes2015-17               0
dtype: int64
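
One caveat: a plain ffill can copy a value across a state boundary. A safer variant (a sketch assuming the State column is still present, not what we ran above) fills within each state first and then falls back to a global pass:

# Fill within each state so values do not leak across state borders,
# then run a global ffill/bfill for any cells that are still empty.
filled = counties.groupby('State', group_keys=False).apply(lambda g: g.ffill().bfill())
filled = filled.ffill().bfill()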

Removing features with 0 variance

In [13]:
# 'federal guidelines' is (nearly) constant across counties, so the box collapses to a point
sns.boxplot(counties['federal guidelines'], color='green', orient='h')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae558f748>

Some of the columns are effectively constant: all of their values are (nearly) the same. Such features have no predictive power and only make the model more prone to overfitting. In this step we drop every float column whose range (max minus min) is less than 1; note from the output below that this heuristic also catches FracMale2017, a fraction whose values all fall inside a unit-length interval.

In [14]:
print("Removed Features: ")
for col in counties: 
    data = counties[col]
    if data.dtype == 'float64': 
        if np.max(data) - np.min(data) < 1: 
            print("\t" + col)
            counties = counties.drop(col, axis=1)
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Removed Features: 
	FracMale2017
	federal guidelines
	foreign travel ban
Num rows: 3225
Num cols: 78
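
For comparison, sklearn ships a direct zero-variance filter; a minimal sketch of this alternative (not what we ran above):

from sklearn.feature_selection import VarianceThreshold

# Fit on the float columns only; threshold=0.0 flags exactly-constant features.
float_cols = counties.select_dtypes(include='float64')
selector = VarianceThreshold(threshold=0.0).fit(float_cols)
kept = float_cols.columns[selector.get_support()]
print(len(kept), 'of', float_cols.shape[1], 'float columns have nonzero variance')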

Normalization

Some features take values on much larger scales than others. For example, raw population counts are much larger than population densities. To prevent the large-scale features from dominating predictions, we demean each quantitative feature and scale it to unit variance (a z-score), leaving categorical features unchanged.

In [15]:
# z-score each float (quantitative) column in place
for col in counties:
    data = counties[col]
    if data.dtype == 'float64':
        counties[col] = (data - np.mean(data)) / np.std(data)
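
The same transform is available as sklearn's StandardScaler; a sketch of the equivalent call (an alternative to the loop above, not something we ran):

from sklearn.preprocessing import StandardScaler

# StandardScaler applies (x - mean) / std column-wise, matching the loop above.
float_cols = counties.select_dtypes(include='float64').columns
counties[float_cols] = StandardScaler().fit_transform(counties[float_cols])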

2) Exploratory Data Analysis (EDA)

In [16]:
plt.rcParams.update({'figure.max_open_warning': 0})

quan_counties = counties.copy() 
print("Removed categorical features: ")
for col in quan_counties: 
    data = quan_counties[col]
    if data.dtype != 'float64': 
        print("\t" + col)
        quan_counties = quan_counties.drop(col, axis=1)

df = quan_counties
l = df.columns.values
number_of_columns = 12
number_of_rows = (len(l) - 1) // number_of_columns + 1  # one row per 12 plots, rounding up
plt.figure(figsize=(18, 5 * number_of_rows))
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.set_style('whitegrid')
    sns.boxplot(df[l[i]], color='green', orient='v')
plt.tight_layout()
Removed categorical features: 
	countyFIPS
	CountyName
	StateName
	State
	CensusRegionName
	CensusDivisionName
In [17]:
# Sanity check: male and female populations in the same age bracket are highly correlated
plt.scatter(counties['PopMale15-192010'], counties['PopFmle15-192010'])
plt.xlabel('PopMale15-192010')
plt.ylabel('PopFmle15-192010')
plt.title("PopMale15-192010 vs PopFmle15-192010")

# Correlation heatmap over all quantitative features
plt.figure(figsize=(24, 16))
sns.heatmap(quan_counties.corr(), cmap='Blues', annot=False)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae6b7a588>

Boxplots

In [18]:
df.boxplot(["3-YrDiabetes2015-17"])
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae8e5dc48>

Distributions

In [19]:
l = df.columns.values
number_of_columns = 12
number_of_rows = (len(l) - 1) // number_of_columns + 1  # one row per 12 plots, rounding up
plt.figure(figsize=(2 * number_of_columns, 5 * number_of_rows))
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.distplot(df[l[i]], kde=True)
In [20]:
counties[">50 gatherings"]
df.boxplot([">500 gatherings"])
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x2baeb440b48>

3) Prediction Methods

In [21]:
counties["countyFIPS"] = counties["countyFIPS"].astype(int)

deaths['FIPS'] = deaths['FIPS'].astype(int)
counties_deaths = counties.set_index(['countyFIPS']).join(deaths.set_index("FIPS"), how='inner')
In [22]:
cases["totalCases"] = cases["4/9/20"] + cases["4/10/20"] + \
                     cases["4/11/20"] + cases["4/12/20"] + \
                     cases["4/13/20"] + cases["4/14/20"] + \
                     cases["4/15/20"] + cases["4/16/20"] + \
                     cases["4/17/20"] + cases["4/18/20"] 
cases['FIPS'] = cases['FIPS'].astype(int)
counties_deaths = counties_deaths.join(cases[['FIPS', 'totalCases']].set_index("FIPS"), how='inner') 
In [23]:
cases[['FIPS', 'totalCases']]
Out[23]:
FIPS totalCases
0 60 0
1 66 1332
2 69 119
3 72 9122
4 78 505
... ... ...
3248 90053 5862
3249 90054 0
3250 90055 2
3251 90056 0
3252 99999 1030

3251 rows × 2 columns

In [24]:
counties_deaths["totalDeaths"] = counties_deaths["4/9/20"] + counties_deaths["4/10/20"] + \
                                 counties_deaths["4/11/20"] + counties_deaths["4/12/20"] + \
                                 counties_deaths["4/13/20"] + counties_deaths["4/14/20"] + \
                                 counties_deaths["4/15/20"] + counties_deaths["4/16/20"] + \
                                 counties_deaths["4/17/20"] + counties_deaths["4/18/20"] 
In [25]:
# Keep a reference (not a copy) to the joined table so targets like totalDeaths stay available
counties_deaths_Y = counties_deaths
In [26]:
counties_deaths
Out[26]:
STATEFP COUNTYFP CountyName StateName State lat lon POP_LATITUDE POP_LONGITUDE CensusRegionName ... 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20 4/16/20 4/17/20 4/18/20 totalCases totalDeaths
1001 -1.859112 -0.955075 Autauga AL Alabama -1.187502 0.456038 -0.890312 0.383511 South ... 1 1 1 1 1 1 2 2 213 12
1003 -1.859112 -0.936366 Baldwin AL Alabama -1.556236 0.362066 -1.209442 0.289002 South ... 1 1 1 2 2 2 2 2 815 15
1005 -1.859112 -0.917656 Barbour AL Alabama -1.323811 0.564593 -0.997648 0.471754 South ... 0 0 0 0 0 0 0 0 112 0
1007 -1.859112 -0.898947 Bibb AL Alabama -1.093581 0.414331 -0.803552 0.336302 South ... 0 0 0 0 0 0 0 0 173 0
1009 -1.859112 -0.880238 Blount AL Alabama -0.890688 0.463250 -0.652394 0.376258 South ... 0 0 0 0 0 0 0 0 153 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56037 1.510462 -0.618305 Sweetwater WY Wyoming 0.678175 -1.477386 0.594553 -1.315647 West ... 0 0 0 0 0 0 0 0 84 0
56039 1.510462 -0.599596 Teton WY Wyoming 1.099150 -1.624522 0.907541 -1.426627 West ... 0 0 0 0 0 0 0 0 557 0
56041 1.510462 -0.580886 Uinta WY Wyoming 0.603030 -1.622962 0.544118 -1.425372 West ... 0 0 0 0 0 0 0 0 43 0
56043 1.510462 -0.562177 Washakie WY Wyoming 1.139160 -1.373058 0.992246 -1.212542 West ... 0 0 0 0 0 0 0 0 51 0
56045 1.510462 -0.543467 Weston WY Wyoming 1.125734 -1.101535 0.975577 -0.944329 West ... 0 0 0 0 0 0 0 0 0 0

3140 rows × 178 columns

Manually selected feature space.

In [27]:
quan_counties = counties_deaths.copy() 
for col in quan_counties: 
    data = quan_counties[col]
    if data.dtype != 'float64': 
        quan_counties = quan_counties.drop(col, axis=1)

mod_counties_deaths = quan_counties.copy()

keep_col = [
#     'STATEFP', 
#     'COUNTYFP', 
#     'lat', 
#     'lon', 
#     'POP_LATITUDE',
#     'POP_LONGITUDE', 
#     'PopTotalMale2017', 
#     'PopTotalFemale2017',
#     'PopulationEstimate65+2017', 
#     'CensusPopulation2010', 
#     '#EligibleforMedicare2018', 
#     '3-YrDiabetes2015-17', 
#     '#FTEHospitalTotal2017',
#     "TotalM.D.'s,TotNon-FedandFed2017",
#     '#Hospitals', '#ICU_beds',
#     'PopFmle<52010',
#     'PopFmle5-92010', 
#     'PopFmle10-142010', 
#     'PopFmle20-242010', 
#     'PopFmle25-292010', 
#     'PopFmle30-342010',
#     'PopFmle35-442010', 
#     'PopFmle45-542010', 
#     'PopFmle55-592010',
#     'PopFmle65-742010', 
#     'PopFmle60-642010', 
#     'PopFmle>842010',
#     '3-YrMortalityAge25-34Years2015-17',
#     '3-YrMortalityAge35-44Years2015-17',
#     '3-YrMortalityAge45-54Years2015-17',
#     '3-YrMortalityAge55-64Years2015-17',
#     '3-YrMortalityAge65-74Years2015-17',
#     '3-YrMortalityAge75-84Years2015-17',
#     '3-YrMortalityAge85+Years2015-17', 
#     'HPSAShortage', 'HPSAServedPop', 'HPSAUnderservedPop', 'Lat',
#     'Long_'
#     'PopFmle75-842010',
#     'PopFmle15-192010',
    'Rural-UrbanContinuumCode2013',
    'PopulationEstimate2018',   
    'PopulationDensityperSqMile2010',
    'MedianAge2010',
    'MedicareEnrollment,AgedTot2017',
    'DiabetesPercentage',
    'HeartDiseaseMortality', 'StrokeMortality', 'Smokers_Percentage',
    'RespMortalityRate2014', 
    '#HospParticipatinginNetwork2017', 
    'dem_to_rep_ratio', 
    'PopMale<52010', 
    'PopMale5-92010',
    'PopMale10-142010',
    'PopMale15-192010', 
    'PopMale20-242010', 
    'PopMale25-292010',
    'PopMale30-342010', 
    'PopMale35-442010', 
    'PopMale45-542010',
    'PopMale55-592010', 
    'PopMale60-642010', 
    'PopMale65-742010',
    'PopMale75-842010', 
    'PopMale>842010', 
    'stay at home',
    '>50 gatherings', '>500 gatherings', 'public schools',
    'restaurant dine-in', 'entertainment/gym', 'SVIPercentile',
]
for col in list(mod_counties_deaths.columns):
    if col not in keep_col:
        mod_counties_deaths = mod_counties_deaths.drop(col, axis=1)
In [28]:
mod_counties_deaths
Out[28]:
Rural-UrbanContinuumCode2013 PopulationEstimate2018 PopulationDensityperSqMile2010 MedianAge2010 MedicareEnrollment,AgedTot2017 DiabetesPercentage HeartDiseaseMortality StrokeMortality Smokers_Percentage RespMortalityRate2014 ... PopMale65-742010 PopMale75-842010 PopMale>842010 stay at home >50 gatherings >500 gatherings public schools restaurant dine-in entertainment/gym SVIPercentile
1001 -1.077454 -0.142765 -0.111776 -0.644517 -0.168923 -0.152650 0.439721 1.881349 0.180903 1.038131 ... -0.156487 -0.180098 -0.232144 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.211549
1003 -0.710373 0.350410 -0.098468 0.170389 0.612586 -0.519781 -0.022648 0.227993 0.013956 -0.584252 ... 0.620327 0.543473 0.353014 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.978034
1005 0.390870 -0.236044 -0.147108 -0.247002 -0.246738 1.368321 0.784870 1.054671 1.284945 0.332209 ... -0.243427 -0.250246 -0.249406 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 1.748374
1007 -1.444536 -0.243577 -0.143738 -0.485511 -0.277422 0.738954 0.895578 2.009426 0.471857 1.197362 ... -0.273868 -0.276424 -0.281627 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 0.365064
1009 -1.444536 -0.135967 -0.113461 -0.247002 -0.134069 1.158532 0.880383 1.497118 0.498475 1.348336 ... -0.090763 -0.126720 -0.188991 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.250712
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56037 0.023789 -0.180872 -0.162683 -1.479299 -0.243966 -0.467334 0.198768 -0.866482 0.178611 0.815208 ... -0.237201 -0.251883 -0.252858 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.439886
56039 0.757952 -0.241509 -0.162043 -0.664393 -0.283130 -2.198094 -2.030591 -1.285642 -0.815159 -1.656994 ... -0.287013 -0.297080 -0.280476 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -1.339946
56041 0.757952 -0.249956 -0.159196 -1.260666 -0.296014 0.004692 0.007742 -0.924699 -0.063910 1.021618 ... -0.301080 -0.296058 -0.283928 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.152803
56043 0.757952 -0.287650 -0.162915 0.309520 -0.317634 -0.309992 -0.589214 -0.994559 -0.163447 -0.284662 ... -0.318376 -0.304034 -0.283353 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.429046
56045 0.757952 -0.290438 -0.163380 0.408898 -0.324321 -0.860688 -0.758532 -1.215782 -0.210294 1.031644 ... -0.334173 -0.317736 -0.289682 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.518912

3140 rows × 33 columns

Recursive feature elimination.

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = mod_counties_deaths
y = counties_deaths_Y['totalDeaths']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Note: SVC is a classifier, so RFE here treats each integer death count as a
# discrete class label rather than a continuous target.
svc = SVC(kernel="linear", C=1)

n_features = list(range(5, 6))  # only n = 5 for now; widen this range to compare subset sizes
score = []
ranking = []

for n in n_features:
    rfe = RFE(estimator=svc, n_features_to_select=n, step=1)
    rfe.fit(X_train, y_train)
    score.append(rfe.score(X_test, y_test))
    ranking.append(rfe.ranking_)
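
The next section turns to cross-validation; a natural extension (a hedged sketch, not run here) is sklearn's RFECV, which picks the number of features by cross-validated score instead of fixing n_features_to_select:

from sklearn.feature_selection import RFECV

# RFECV eliminates features recursively and uses 5-fold CV to choose how many
# to keep; with SVC on ~1,900 training rows this can be quite slow.
rfecv = RFECV(estimator=SVC(kernel="linear", C=1), step=1, cv=5)
rfecv.fit(X_train, y_train)
print('Optimal number of features:', rfecv.n_features_)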

4) Cross-validation

In [30]:
# (# of zero predictions on the training set) - (# of training rows)
# = -(number of nonzero predictions), so 36 training counties are predicted nonzero
sum(rfe.predict(X_train) == 0) - len(X_train)
Out[30]:
-36
In [31]:
mod_counties_deaths.columns
Out[31]:
Index(['Rural-UrbanContinuumCode2013', 'PopulationEstimate2018',
       'PopulationDensityperSqMile2010', 'MedianAge2010',
       'MedicareEnrollment,AgedTot2017', 'DiabetesPercentage',
       'HeartDiseaseMortality', 'StrokeMortality', 'Smokers_Percentage',
       'RespMortalityRate2014', '#HospParticipatinginNetwork2017',
       'dem_to_rep_ratio', 'PopMale<52010', 'PopMale5-92010',
       'PopMale10-142010', 'PopMale15-192010', 'PopMale20-242010',
       'PopMale25-292010', 'PopMale30-342010', 'PopMale35-442010',
       'PopMale45-542010', 'PopMale55-592010', 'PopMale60-642010',
       'PopMale65-742010', 'PopMale75-842010', 'PopMale>842010',
       'stay at home', '>50 gatherings', '>500 gatherings', 'public schools',
       'restaurant dine-in', 'entertainment/gym', 'SVIPercentile'],
      dtype='object')
In [32]:
a = list(zip(ranking[0], mod_counties_deaths.columns))
a.sort(key=lambda x: x[0])
a
Out[32]:
[(1, 'MedicareEnrollment,AgedTot2017'),
 (1, 'StrokeMortality'),
 (1, 'Smokers_Percentage'),
 (1, '#HospParticipatinginNetwork2017'),
 (1, 'SVIPercentile'),
 (2, '>50 gatherings'),
 (3, 'stay at home'),
 (4, 'dem_to_rep_ratio'),
 (5, 'Rural-UrbanContinuumCode2013'),
 (6, 'MedianAge2010'),
 (7, '>500 gatherings'),
 (8, 'RespMortalityRate2014'),
 (9, 'PopMale75-842010'),
 (10, 'DiabetesPercentage'),
 (11, 'HeartDiseaseMortality'),
 (12, 'restaurant dine-in'),
 (13, 'public schools'),
 (14, 'PopMale>842010'),
 (15, 'entertainment/gym'),
 (16, 'PopMale5-92010'),
 (17, 'PopMale55-592010'),
 (18, 'PopMale65-742010'),
 (19, 'PopMale20-242010'),
 (20, 'PopMale10-142010'),
 (21, 'PopMale60-642010'),
 (22, 'PopulationDensityperSqMile2010'),
 (23, 'PopMale45-542010'),
 (24, 'PopMale<52010'),
 (25, 'PopMale35-442010'),
 (26, 'PopulationEstimate2018'),
 (27, 'PopMale15-192010'),
 (28, 'PopMale30-342010'),
 (29, 'PopMale25-292010')]
In [33]:
# Which feature did RFE rank 20th?
mod_counties_deaths.columns[np.where(ranking[0] == 20)[0]]
Out[33]:
Index(['PopMale10-142010'], dtype='object')