In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image

Ideas on what to do

Targets to predict: #deaths, #cases, some measure of whether the curve is flattening, %deaths/cases

1.) Feature selection (choose the features that best predict high deaths/cases)

  1. Plot the variance of each feature?
  2. Run a logistic regression on each feature individually (does not account for correlations between features; see the univariate-scoring sketch after this list)
  3. Determine the best subset of features (seems difficult without brute-forcing)
  4. Something with PCA (how do we map the principal components back to the original features?)

2.) Using the time series, determine which counties have lowered their case/death counts

  1. Same as #1, but find the features that predict a lowered case/death count (flattening the curve)
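
As a rough sketch of idea 1.2 (scoring each feature on its own), sklearn's f_regression ranks features with a univariate linear fit; the X and y here are toy stand-ins, not the real tables loaded below:

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Toy stand-ins for the real feature matrix and target (hypothetical data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = 2 * X['a'] + rng.normal(size=100)

# f_regression fits a univariate linear model per column and returns
# F-statistics and p-values, so features can be ranked one at a time.
F, pval = f_regression(X, y)
print(sorted(zip(X.columns, F), key=lambda t: -t[1]))  # 'a' should rank first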

Interesting features in datasets

Counties:

  1. Rural-UrbanContinuumCode2013 (urban/rural)
  2. Demographic Info
  3. Health Resource Availability (#ICU beds, etc.)
  4. Health Outcomes and Risk Factors (other health conditions that can worsen outcomes)
  5. Social Distancing and Mobility (private data)

Deaths & Cases

  1. Match to county info via FIPS
  2. Gives the number of (new?) deaths/cases for each date

States

  1. Doesn't seem as useful - just case counts for different states/regions (including China, etc.)
  2. No feature data on the states (so going by county seems better)

Load Data

In [2]:
states = pd.read_csv('covid19/4.18states.csv')
counties = pd.read_csv('covid19/abridged_couties.csv')
cases = pd.read_csv('covid19/time_series_covid19_confirmed_US.csv')
deaths = pd.read_csv('covid19/time_series_covid19_deaths_US.csv')

1) Data Cleaning and Preprocessing

Deaths and Cases

In [3]:
print('Num rows:', len(deaths))
print('Num cols:', len(deaths.columns))
s = deaths.isnull().sum(axis=0)
s.loc[s > 0]
Num rows: 3255
Num cols: 100
Out[3]:
FIPS      4
Admin2    7
dtype: int64
In [4]:
print('Num rows:', len(cases))
print('Num cols:', len(cases.columns))
s = cases.isnull().sum(axis=0)
s.loc[s > 0]
Num rows: 3255
Num cols: 99
Out[4]:
FIPS      4
Admin2    7
dtype: int64
In [5]:
for col in deaths.columns:
    if col not in cases.columns:
        print(col)
Population
In [6]:
# Verify that the NaNs in FIPS correspond between deaths and cases
print(deaths[deaths['FIPS'].isnull()]['Admin2'])
print(cases[cases['FIPS'].isnull()]['Admin2'])
3147                          Dukes and Nantucket
3148                                  Kansas City
3253    Michigan Department of Corrections (MDOC)
3254       Federal Correctional Institution (FCI)
Name: Admin2, dtype: object
3147                          Dukes and Nantucket
3148                                  Kansas City
3253    Michigan Department of Corrections (MDOC)
3254       Federal Correctional Institution (FCI)
Name: Admin2, dtype: object

From the cells above, we see that the deaths and cases dataframes are very similar. They share all columns except Population, which only deaths has, and the rows with NaNs in FIPS (and in Admin2) are the same in both dataframes.

Because only 4 of the 3255 rows are missing a FIPS value, we drop those rows. We ignore the NaNs in Admin2, since it only holds the county name and can be filled back in from the FIPS code if needed (see the lookup sketch below). We also postpone filling in Population until we decide we need it.

In [7]:
# Drop the rows with a missing FIPS, then verify that 4 rows were removed
deaths = deaths[deaths['FIPS'].notnull()]
cases = cases[cases['FIPS'].notnull()]
print('Deaths new num rows:', len(deaths))
print('Cases new num rows:', len(cases))
Deaths new num rows: 3251
Cases new num rows: 3251
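
If the Admin2 county names are needed later, a FIPS-to-name lookup can be rebuilt from the rows that have both fields; a minimal sketch (fips_to_name is our own name):

# Build a FIPS -> county-name lookup from rows where Admin2 is present.
fips_to_name = deaths.dropna(subset=['Admin2']).set_index('FIPS')['Admin2'].to_dict()
print(fips_to_name.get(1001.0))  # Autauga; FIPS is still a float column at this point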

Counties

There are 87 columns in the counties dataset, but some of them have a very large number of NaNs. Later, when we model the key indicators of coronavirus cases and deaths, we will decide on a cutoff for dropping columns. For example, if a column has more than 324 NaNs (over 10% of its rows missing), we could drop it.

In [8]:
# Counties Dataset
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
print("Number of nans in each column:")
counties.isnull().sum(axis=0).sort_values(ascending=False)[:10]
# np.array(counties.isnull().sum(axis=0).tolist())
Num rows: 3244
Num cols: 87
Number of nans in each column:
Out[8]:
3-YrMortalityAge1-4Years2015-17      3179
mortality2015-17Estimated            3149
3-YrMortalityAge5-14Years2015-17     3149
3-YrMortalityAge<1Year2015-17        2772
3-YrMortalityAge15-24Years2015-17    2610
3-YrMortalityAge25-34Years2015-17    2270
3-YrMortalityAge35-44Years2015-17    1916
3-YrDiabetes2015-17                  1744
HPSAUnderservedPop                   1141
HPSAServedPop                        1141
dtype: int64
In [9]:
MAX_NANS_PER_COL = 2000  # drop columns missing more than this many values
number_nans_in_col = counties.isnull().sum(axis=0).values
high_nan_cols = counties.columns[number_nans_in_col > MAX_NANS_PER_COL]
counties = counties.drop(high_nan_cols, axis=1)
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Num rows: 3244
Num cols: 81

Similarly, some rows have very few entries. In an extreme example, the rows for New York City and Kansas City have essentially no feature data at all. We drop rows that are missing values for more than 40 features.

In [10]:
print("Number of nans in each county:")
counties.set_index("CountyName").isnull().sum(axis=1).sort_values(ascending=False)[:10]
Number of nans in each county:
Out[10]:
CountyName
Kansas City                        78
Dade                               78
Yellowstone Nat Park               78
South Boston City                  78
New York City                      78
Clifton Forge City                 76
Aguijan                            75
Prince of Wales-Outer Ketchikan    75
Rota                               75
Skagway-Hoonah-Angoon              75
dtype: int64
In [11]:
MAX_NANS_PER_ROW = 40  # drop rows missing values for more than this many features
number_nans_in_row = counties.isnull().sum(axis=1).values
high_nan_idx = np.where(number_nans_in_row > MAX_NANS_PER_ROW)[0]
counties = counties.drop(counties.index[high_nan_idx], axis=0)  # map positions to index labels
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Num rows: 3225
Num cols: 81

The data appears to be ordered roughly geographically (counties from the same state are grouped together). We assume that consecutive rows have similar values, so we fill in the NaNs using ffill followed by bfill. This sidesteps the problem of defining a mean or median for categorical columns.

In [12]:
# Fill NaNs (forward-fill, then back-fill) and verify that none remain
counties = counties.ffill().bfill()
counties.isnull().sum(axis=0).sort_values(ascending=False)[:5]
Out[12]:
HPSAUnderservedPop                0
PopMale10-142010                  0
#EligibleforMedicare2018          0
MedicareEnrollment,AgedTot2017    0
3-YrDiabetes2015-17               0
dtype: int64
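
One caveat: a plain ffill can copy a value across a state boundary. A safer variant (a sketch assuming the State column is still present, not what we ran above) fills within each state first and then falls back to a global pass:

# Fill within each state so values do not leak across state borders,
# then run a global ffill/bfill for any cells that are still empty.
filled = counties.groupby('State', group_keys=False).apply(lambda g: g.ffill().bfill())
filled = filled.ffill().bfill()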

Removing features with 0 variance

In [13]:
# 'federal guidelines' is (nearly) constant across counties, so the box collapses to a point
sns.boxplot(counties['federal guidelines'], color='green', orient='h')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae558f748>

Some of the columns are effectively constant: all of their values are (nearly) the same. Such features have no predictive power and only make the model more prone to overfitting. In this step we drop every float column whose range (max minus min) is less than 1; note from the output below that this heuristic also catches FracMale2017, a fraction whose values all fall inside a unit-length interval.

In [14]:
print("Removed Features: ")
for col in counties: 
    data = counties[col]
    if data.dtype == 'float64': 
        if np.max(data) - np.min(data) < 1: 
            print("\t" + col)
            counties = counties.drop(col, axis=1)
print('Num rows:', len(counties))
print('Num cols:', len(counties.columns))
Removed Features: 
	FracMale2017
	federal guidelines
	foreign travel ban
Num rows: 3225
Num cols: 78
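
For comparison, sklearn ships a direct zero-variance filter; a minimal sketch of this alternative (not what we ran above):

from sklearn.feature_selection import VarianceThreshold

# Fit on the float columns only; threshold=0.0 flags exactly-constant features.
float_cols = counties.select_dtypes(include='float64')
selector = VarianceThreshold(threshold=0.0).fit(float_cols)
kept = float_cols.columns[selector.get_support()]
print(len(kept), 'of', float_cols.shape[1], 'float columns have nonzero variance')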

Normalization

Some features take values on much larger scales than others. For example, raw population counts are much larger than population densities. To prevent the large-scale features from dominating predictions, we demean each quantitative feature and scale it to unit variance (a z-score), leaving categorical features unchanged.

In [15]:
# z-score each float (quantitative) column in place
for col in counties:
    data = counties[col]
    if data.dtype == 'float64':
        counties[col] = (data - np.mean(data)) / np.std(data)
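
The same transform is available as sklearn's StandardScaler; a sketch of the equivalent call (an alternative to the loop above, not something we ran):

from sklearn.preprocessing import StandardScaler

# StandardScaler applies (x - mean) / std column-wise, matching the loop above.
float_cols = counties.select_dtypes(include='float64').columns
counties[float_cols] = StandardScaler().fit_transform(counties[float_cols])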

2) Exploratory Data Analysis (EDA)

In [16]:
plt.rcParams.update({'figure.max_open_warning': 0})

quan_counties = counties.copy() 
print("Removed categorical features: ")
for col in quan_counties: 
    data = quan_counties[col]
    if data.dtype != 'float64': 
        print("\t" + col)
        quan_counties = quan_counties.drop(col, axis=1)

df = quan_counties
l = df.columns.values
number_of_columns = 12
number_of_rows = (len(l) - 1) // number_of_columns + 1  # one row per 12 plots, rounding up
plt.figure(figsize=(18, 5 * number_of_rows))
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.set_style('whitegrid')
    sns.boxplot(df[l[i]], color='green', orient='v')
plt.tight_layout()
Removed categorical features: 
	countyFIPS
	CountyName
	StateName
	State
	CensusRegionName
	CensusDivisionName
In [17]:
# Sanity check: male and female populations in the same age bracket are highly correlated
plt.scatter(counties['PopMale15-192010'], counties['PopFmle15-192010'])
plt.xlabel('PopMale15-192010')
plt.ylabel('PopFmle15-192010')
plt.title("PopMale15-192010 vs PopFmle15-192010")

# Correlation heatmap over all quantitative features
plt.figure(figsize=(24, 16))
sns.heatmap(quan_counties.corr(), cmap='Blues', annot=False)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae6b7a588>

Boxplots

In [18]:
df.boxplot(["3-YrDiabetes2015-17"])
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bae8e5dc48>

Distributions

In [19]:
l = df.columns.values
number_of_columns = 12
number_of_rows = (len(l) - 1) // number_of_columns + 1  # one row per 12 plots, rounding up
plt.figure(figsize=(2 * number_of_columns, 5 * number_of_rows))
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.distplot(df[l[i]], kde=True)
In [20]:
counties[">50 gatherings"]
df.boxplot([">500 gatherings"])
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x2baeb440b48>

3) Prediction Methods

In [21]:
counties["countyFIPS"] = counties["countyFIPS"].astype(int)

deaths['FIPS'] = deaths['FIPS'].astype(int)
counties_deaths = counties.set_index(['countyFIPS']).join(deaths.set_index("FIPS"), how='inner')
In [22]:
cases["totalCases"] = cases["4/9/20"] + cases["4/10/20"] + \
                     cases["4/11/20"] + cases["4/12/20"] + \
                     cases["4/13/20"] + cases["4/14/20"] + \
                     cases["4/15/20"] + cases["4/16/20"] + \
                     cases["4/17/20"] + cases["4/18/20"] 
cases['FIPS'] = cases['FIPS'].astype(int)
counties_deaths = counties_deaths.join(cases[['FIPS', 'totalCases']].set_index("FIPS"), how='inner') 
In [23]:
cases[['FIPS', 'totalCases']]
Out[23]:
FIPS totalCases
0 60 0
1 66 1332
2 69 119
3 72 9122
4 78 505
... ... ...
3248 90053 5862
3249 90054 0
3250 90055 2
3251 90056 0
3252 99999 1030

3251 rows × 2 columns

In [24]:
counties_deaths["totalDeaths"] = counties_deaths["4/9/20"] + counties_deaths["4/10/20"] + \
                                 counties_deaths["4/11/20"] + counties_deaths["4/12/20"] + \
                                 counties_deaths["4/13/20"] + counties_deaths["4/14/20"] + \
                                 counties_deaths["4/15/20"] + counties_deaths["4/16/20"] + \
                                 counties_deaths["4/17/20"] + counties_deaths["4/18/20"] 
In [25]:
# Keep a reference (not a copy) to the joined table so targets like totalDeaths stay available
counties_deaths_Y = counties_deaths
In [26]:
counties_deaths
Out[26]:
STATEFP COUNTYFP CountyName StateName State lat lon POP_LATITUDE POP_LONGITUDE CensusRegionName ... 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20 4/16/20 4/17/20 4/18/20 totalCases totalDeaths
1001 -1.859112 -0.955075 Autauga AL Alabama -1.187502 0.456038 -0.890312 0.383511 South ... 1 1 1 1 1 1 2 2 213 12
1003 -1.859112 -0.936366 Baldwin AL Alabama -1.556236 0.362066 -1.209442 0.289002 South ... 1 1 1 2 2 2 2 2 815 15
1005 -1.859112 -0.917656 Barbour AL Alabama -1.323811 0.564593 -0.997648 0.471754 South ... 0 0 0 0 0 0 0 0 112 0
1007 -1.859112 -0.898947 Bibb AL Alabama -1.093581 0.414331 -0.803552 0.336302 South ... 0 0 0 0 0 0 0 0 173 0
1009 -1.859112 -0.880238 Blount AL Alabama -0.890688 0.463250 -0.652394 0.376258 South ... 0 0 0 0 0 0 0 0 153 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56037 1.510462 -0.618305 Sweetwater WY Wyoming 0.678175 -1.477386 0.594553 -1.315647 West ... 0 0 0 0 0 0 0 0 84 0
56039 1.510462 -0.599596 Teton WY Wyoming 1.099150 -1.624522 0.907541 -1.426627 West ... 0 0 0 0 0 0 0 0 557 0
56041 1.510462 -0.580886 Uinta WY Wyoming 0.603030 -1.622962 0.544118 -1.425372 West ... 0 0 0 0 0 0 0 0 43 0
56043 1.510462 -0.562177 Washakie WY Wyoming 1.139160 -1.373058 0.992246 -1.212542 West ... 0 0 0 0 0 0 0 0 51 0
56045 1.510462 -0.543467 Weston WY Wyoming 1.125734 -1.101535 0.975577 -0.944329 West ... 0 0 0 0 0 0 0 0 0 0

3140 rows × 178 columns

Manually selected feature space.

In [27]:
quan_counties = counties_deaths.copy() 
for col in quan_counties: 
    data = quan_counties[col]
    if data.dtype != 'float64': 
        quan_counties = quan_counties.drop(col, axis=1)

mod_counties_deaths = quan_counties.copy()

keep_col = [
#     'STATEFP', 
#     'COUNTYFP', 
#     'lat', 
#     'lon', 
#     'POP_LATITUDE',
#     'POP_LONGITUDE', 
#     'PopTotalMale2017', 
#     'PopTotalFemale2017',
#     'PopulationEstimate65+2017', 
#     'CensusPopulation2010', 
#     '#EligibleforMedicare2018', 
#     '3-YrDiabetes2015-17', 
#     '#FTEHospitalTotal2017',
#     "TotalM.D.'s,TotNon-FedandFed2017",
#     '#Hospitals', '#ICU_beds',
#     'PopFmle<52010',
#     'PopFmle5-92010', 
#     'PopFmle10-142010', 
#     'PopFmle20-242010', 
#     'PopFmle25-292010', 
#     'PopFmle30-342010',
#     'PopFmle35-442010', 
#     'PopFmle45-542010', 
#     'PopFmle55-592010',
#     'PopFmle65-742010', 
#     'PopFmle60-642010', 
#     'PopFmle>842010',
#     '3-YrMortalityAge25-34Years2015-17',
#     '3-YrMortalityAge35-44Years2015-17',
#     '3-YrMortalityAge45-54Years2015-17',
#     '3-YrMortalityAge55-64Years2015-17',
#     '3-YrMortalityAge65-74Years2015-17',
#     '3-YrMortalityAge75-84Years2015-17',
#     '3-YrMortalityAge85+Years2015-17', 
#     'HPSAShortage', 'HPSAServedPop', 'HPSAUnderservedPop', 'Lat',
#     'Long_'
#     'PopFmle75-842010',
#     'PopFmle15-192010',
    'Rural-UrbanContinuumCode2013',
    'PopulationEstimate2018',   
    'PopulationDensityperSqMile2010',
    'MedianAge2010',
    'MedicareEnrollment,AgedTot2017',
    'DiabetesPercentage',
    'HeartDiseaseMortality', 'StrokeMortality', 'Smokers_Percentage',
    'RespMortalityRate2014', 
    '#HospParticipatinginNetwork2017', 
    'dem_to_rep_ratio', 
    'PopMale<52010', 
    'PopMale5-92010',
    'PopMale10-142010',
    'PopMale15-192010', 
    'PopMale20-242010', 
    'PopMale25-292010',
    'PopMale30-342010', 
    'PopMale35-442010', 
    'PopMale45-542010',
    'PopMale55-592010', 
    'PopMale60-642010', 
    'PopMale65-742010',
    'PopMale75-842010', 
    'PopMale>842010', 
    'stay at home',
    '>50 gatherings', '>500 gatherings', 'public schools',
    'restaurant dine-in', 'entertainment/gym', 'SVIPercentile',
]
for col in list(mod_counties_deaths.columns):
    if col not in keep_col:
        mod_counties_deaths = mod_counties_deaths.drop(col, axis=1)
In [28]:
mod_counties_deaths
Out[28]:
Rural-UrbanContinuumCode2013 PopulationEstimate2018 PopulationDensityperSqMile2010 MedianAge2010 MedicareEnrollment,AgedTot2017 DiabetesPercentage HeartDiseaseMortality StrokeMortality Smokers_Percentage RespMortalityRate2014 ... PopMale65-742010 PopMale75-842010 PopMale>842010 stay at home >50 gatherings >500 gatherings public schools restaurant dine-in entertainment/gym SVIPercentile
1001 -1.077454 -0.142765 -0.111776 -0.644517 -0.168923 -0.152650 0.439721 1.881349 0.180903 1.038131 ... -0.156487 -0.180098 -0.232144 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.211549
1003 -0.710373 0.350410 -0.098468 0.170389 0.612586 -0.519781 -0.022648 0.227993 0.013956 -0.584252 ... 0.620327 0.543473 0.353014 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.978034
1005 0.390870 -0.236044 -0.147108 -0.247002 -0.246738 1.368321 0.784870 1.054671 1.284945 0.332209 ... -0.243427 -0.250246 -0.249406 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 1.748374
1007 -1.444536 -0.243577 -0.143738 -0.485511 -0.277422 0.738954 0.895578 2.009426 0.471857 1.197362 ... -0.273868 -0.276424 -0.281627 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 0.365064
1009 -1.444536 -0.135967 -0.113461 -0.247002 -0.134069 1.158532 0.880383 1.497118 0.498475 1.348336 ... -0.090763 -0.126720 -0.188991 1.331732 -0.410221 -1.204969 -0.693872 -0.158131 1.214887 -0.250712
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56037 0.023789 -0.180872 -0.162683 -1.479299 -0.243966 -0.467334 0.198768 -0.866482 0.178611 0.815208 ... -0.237201 -0.251883 -0.252858 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.439886
56039 0.757952 -0.241509 -0.162043 -0.664393 -0.283130 -2.198094 -2.030591 -1.285642 -0.815159 -1.656994 ... -0.287013 -0.297080 -0.280476 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -1.339946
56041 0.757952 -0.249956 -0.159196 -1.260666 -0.296014 0.004692 0.007742 -0.924699 -0.063910 1.021618 ... -0.301080 -0.296058 -0.283928 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.152803
56043 0.757952 -0.287650 -0.162915 0.309520 -0.317634 -0.309992 -0.589214 -0.994559 -0.163447 -0.284662 ... -0.318376 -0.304034 -0.283353 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.429046
56045 0.757952 -0.290438 -0.163380 0.408898 -0.324321 -0.860688 -0.758532 -1.215782 -0.210294 1.031644 ... -0.334173 -0.317736 -0.289682 -0.098586 -0.410221 0.054391 0.476493 -0.158131 -0.349259 -0.518912

3140 rows × 33 columns

Recursive feature elimination.

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = mod_counties_deaths
y = counties_deaths_Y['totalDeaths']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Note: SVC is a classifier, so RFE here treats each integer death count as a
# discrete class label rather than a continuous target.
svc = SVC(kernel="linear", C=1)

n_features = list(range(5, 6))  # only n = 5 for now; widen this range to compare subset sizes
score = []
ranking = []

for n in n_features:
    rfe = RFE(estimator=svc, n_features_to_select=n, step=1)
    rfe.fit(X_train, y_train)
    score.append(rfe.score(X_test, y_test))
    ranking.append(rfe.ranking_)
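
The next section turns to cross-validation; a natural extension (a hedged sketch, not run here) is sklearn's RFECV, which picks the number of features by cross-validated score instead of fixing n_features_to_select:

from sklearn.feature_selection import RFECV

# RFECV eliminates features recursively and uses 5-fold CV to choose how many
# to keep; with SVC on ~1,900 training rows this can be quite slow.
rfecv = RFECV(estimator=SVC(kernel="linear", C=1), step=1, cv=5)
rfecv.fit(X_train, y_train)
print('Optimal number of features:', rfecv.n_features_)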

4) Cross-validation

In [30]:
# (# of zero predictions on the training set) - (# of training rows)
# = -(number of nonzero predictions), so 36 training counties are predicted nonzero
sum(rfe.predict(X_train) == 0) - len(X_train)
Out[30]:
-36
In [31]:
mod_counties_deaths.columns
Out[31]:
Index(['Rural-UrbanContinuumCode2013', 'PopulationEstimate2018',
       'PopulationDensityperSqMile2010', 'MedianAge2010',
       'MedicareEnrollment,AgedTot2017', 'DiabetesPercentage',
       'HeartDiseaseMortality', 'StrokeMortality', 'Smokers_Percentage',
       'RespMortalityRate2014', '#HospParticipatinginNetwork2017',
       'dem_to_rep_ratio', 'PopMale<52010', 'PopMale5-92010',
       'PopMale10-142010', 'PopMale15-192010', 'PopMale20-242010',
       'PopMale25-292010', 'PopMale30-342010', 'PopMale35-442010',
       'PopMale45-542010', 'PopMale55-592010', 'PopMale60-642010',
       'PopMale65-742010', 'PopMale75-842010', 'PopMale>842010',
       'stay at home', '>50 gatherings', '>500 gatherings', 'public schools',
       'restaurant dine-in', 'entertainment/gym', 'SVIPercentile'],
      dtype='object')
In [32]:
a = list(zip(ranking[0], mod_counties_deaths.columns))
a.sort(key=lambda x: x[0])
a
Out[32]:
[(1, 'MedicareEnrollment,AgedTot2017'),
 (1, 'StrokeMortality'),
 (1, 'Smokers_Percentage'),
 (1, '#HospParticipatinginNetwork2017'),
 (1, 'SVIPercentile'),
 (2, '>50 gatherings'),
 (3, 'stay at home'),
 (4, 'dem_to_rep_ratio'),
 (5, 'Rural-UrbanContinuumCode2013'),
 (6, 'MedianAge2010'),
 (7, '>500 gatherings'),
 (8, 'RespMortalityRate2014'),
 (9, 'PopMale75-842010'),
 (10, 'DiabetesPercentage'),
 (11, 'HeartDiseaseMortality'),
 (12, 'restaurant dine-in'),
 (13, 'public schools'),
 (14, 'PopMale>842010'),
 (15, 'entertainment/gym'),
 (16, 'PopMale5-92010'),
 (17, 'PopMale55-592010'),
 (18, 'PopMale65-742010'),
 (19, 'PopMale20-242010'),
 (20, 'PopMale10-142010'),
 (21, 'PopMale60-642010'),
 (22, 'PopulationDensityperSqMile2010'),
 (23, 'PopMale45-542010'),
 (24, 'PopMale<52010'),
 (25, 'PopMale35-442010'),
 (26, 'PopulationEstimate2018'),
 (27, 'PopMale15-192010'),
 (28, 'PopMale30-342010'),
 (29, 'PopMale25-292010')]
In [33]:
# Which feature did RFE rank 20th?
mod_counties_deaths.columns[np.where(ranking[0] == 20)[0]]
Out[33]:
Index(['PopMale10-142010'], dtype='object')