Logistic Regression

All code used in this module is available here.

Loading Packages

using DataFrames, CSV, ScikitLearn, PyPlot

Loading the dataset

data = CSV.File("covid_cleaned.csv") |> DataFrame
20352×16 DataFrame
   Row │ intubed  pneumonia  age    pregnancy  diabetes  copd   asthma  inmsupr  hypertension  other_disease  cardiovascular  obesity  renal_chronic  tobacco  contact_other_covid  covid_res 
       │ Int64    Int64      Int64  Int64      Int64     Int64  Int64   Int64    Int64         Int64          Int64           Int64    Int64          Int64    Int64                Int64     
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │       0          0     25          0         0      0       0        0             0              0               0        0              0        0                    1          1
     2 │       0          0     52          0         0      0       0        0             0              0               0        1              0        1                    1          1
     3 │       0          1     51          0         0      0       0        0             0              0               0        0              0        0                    1          1
     4 │       1          1     67          0         1      0       0        0             1              0               0        1              0        0                    1          1
     5 │       0          1     59          0         1      0       0        0             0              0               0        0              0        0                    1          1
     6 │       0          0     52          0         1      0       0        0             1              0               1        0              0        0                    0          1
     7 │       0          1     54          0         0      0       0        0             0              0               0        0              0        0                    0          1
     8 │       0          1     78          0         0      0       0        0             1              0               0        1              0        0                    1          1
   ⋮   │    ⋮         ⋮        ⋮        ⋮         ⋮        ⋮      ⋮        ⋮          ⋮              ⋮              ⋮            ⋮           ⋮           ⋮              ⋮               ⋮
 20346 │       1          1     65          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20347 │       0          1     49          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20348 │       0          1     80          0         1      0       0        0             0              0               0        0              0        0                    0          0
 20349 │       0          0     13          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20350 │       1          0     23          0         0      0       0        0             0              1               0        0              0        1                    0          0
 20351 │       0          1      1          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20352 │       0          1     55          0         0      0       0        0             0              0               0        1              0        0                    0          0
                                                                                                                                                                            20337 rows omitted

Specifying the target and predictors

ScikitLearn works with plain arrays rather than DataFrames, so we convert the predictors to a matrix and the target to a vector.

X = Matrix(data[!, Not(:covid_res)]) # every column except the target
y = Vector(data[!, :covid_res]) # :covid_res is our target variable

Splitting the data into training set and test set

@sk_import model_selection: train_test_split
# test_size sets the fraction of rows held out for testing (here 33%); random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
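
To make the split less of a black box, here is a minimal pure-Julia sketch of the idea: shuffle the row indices, then hold out a fraction of them. `simple_split` is a hypothetical helper written for this illustration, not a ScikitLearn.jl function, and its shuffle will not reproduce the exact rows that `train_test_split` selects. It assumes the test count is rounded up, which is consistent with the 13635 training rows and 6717 test rows that appear as support totals in the reports below.

```julia
using Random

# Hypothetical helper (not part of ScikitLearn.jl): shuffle the row indices,
# then hold out a `test_size` fraction of the rows as the test set.
function simple_split(X::AbstractMatrix, y::AbstractVector; test_size=0.33, seed=42)
    n = size(X, 1)
    idx = shuffle(MersenneTwister(seed), 1:n)   # random permutation of 1:n
    n_test = ceil(Int, test_size * n)           # assumption: round the test count up
    test_idx = idx[1:n_test]
    train_idx = idx[n_test+1:end]
    return X[train_idx, :], X[test_idx, :], y[train_idx], y[test_idx]
end

# Toy data, made up for illustration: 10 rows, 2 columns.
X_demo = reshape(collect(1:20), 10, 2)
y_demo = collect(1:10)

Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo)
println(size(Xtr, 1), " training rows, ", size(Xte, 1), " test rows")
```

With 10 rows and `test_size=0.33`, the sketch keeps 6 rows for training and 4 for testing. The same rounding applied to our 20352 rows gives `ceil(Int, 0.33 * 20352) == 6717`.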

Model Definition

@sk_import linear_model: LogisticRegression
simplelogistic = LogisticRegression(penalty=:none) # plain logistic regression, no regularization

Model Fitting

fit!(simplelogistic,X_train,y_train)
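
For intuition about what the fitted model does at prediction time, the sketch below passes a linear combination of the predictors through the logistic (sigmoid) function and thresholds the result at 0.5. The weights `w`, the intercept `b`, and the helper names `predict_proba_sketch` and `predict_sketch` are hypothetical and chosen only for illustration; they are not the coefficients learned from the COVID data, and this is not how ScikitLearn is implemented internally.

```julia
# Logistic (sigmoid) function: maps any real number into (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical weights and intercept, for illustration only.
w = [0.8, -0.5]
b = 0.1

# Probability of class 1 for each row of X, then a 0/1 label at threshold 0.5.
predict_proba_sketch(X, w, b) = sigmoid.(X * w .+ b)
predict_sketch(X, w, b) = Int.(predict_proba_sketch(X, w, b) .>= 0.5)

# Toy inputs: three rows, two predictors.
X_demo = [1.0 2.0; 3.0 1.0; 0.0 0.0]
println(predict_sketch(X_demo, w, b))   # prints [0, 1, 1]
```

The first row has a linear score of -0.1, so its probability falls just below 0.5 and it is labelled 0; the other two rows have positive scores and are labelled 1.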

Model Evaluation

Classification report with the training data:

y_pred = predict(simplelogistic,X_train)

@sk_import metrics: classification_report
print(classification_report(y_train,y_pred))
>              precision    recall  f1-score   support

           0       0.63      0.45      0.52      5684
           1       0.67      0.81      0.74      7951

    accuracy                           0.66     13635
   macro avg       0.65      0.63      0.63     13635
weighted avg       0.65      0.66      0.65     13635

Classification report with the test data:

y_pred = predict(simplelogistic,X_test)
print(classification_report(y_test,y_pred))
>              precision    recall  f1-score   support

           0       0.64      0.45      0.53      2763
           1       0.68      0.82      0.74      3954

    accuracy                           0.67      6717
   macro avg       0.66      0.64      0.64      6717
weighted avg       0.66      0.67      0.66      6717
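
The precision, recall and f1-score columns in the reports above can be recomputed from simple counts. The following is a minimal sketch on made-up labels; `class_metrics` is a hypothetical helper, not part of ScikitLearn, and it returns `NaN` when a class is never predicted.

```julia
# Per-class metrics from true and predicted label vectors.
function class_metrics(y_true, y_hat, label)
    tp = sum((y_true .== label) .& (y_hat .== label))   # true positives
    fp = sum((y_true .!= label) .& (y_hat .== label))   # false positives
    fn = sum((y_true .== label) .& (y_hat .!= label))   # false negatives
    prec = tp / (tp + fp)                 # of the predicted `label`, how many were right
    rec = tp / (tp + fn)                  # of the actual `label`, how many were found
    f1 = 2 * prec * rec / (prec + rec)    # harmonic mean of the two
    return (precision=prec, recall=rec, f1=f1)
end

# Toy labels, made up for illustration (not taken from the COVID data).
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat_demo  = [1, 0, 1, 0, 1, 1, 0, 1]
println(class_metrics(y_true_demo, y_hat_demo, 1))

# Sanity check against the test report: the rounded precision and recall
# for class 1 (0.68 and 0.82) give an f1-score of about 0.74.
println(round(2 * 0.68 * 0.82 / (0.68 + 0.82), digits=2))
```

In both reports the recall for class 0 is only 0.45, so the model misses more than half of the negative cases even though its overall accuracy is about 0.66.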

Cross Validation

@sk_import model_selection: cross_val_score
cross_val_score(LogisticRegression(penalty=:none), X_train, y_train) # one accuracy score per fold
5-element Array{Float64,1}:
 0.6615328199486615
 0.6563989732306564
 0.6604327099376605
 0.6585991932526586
 0.654932159882655
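
The five fold scores are easier to compare with the earlier reports once they are summarized. A small sketch, with the scores copied from the output above:

```julia
using Statistics

# Fold accuracies copied from the cross_val_score output above.
scores = [0.6615328199486615, 0.6563989732306564, 0.6604327099376605,
          0.6585991932526586, 0.654932159882655]

println("mean accuracy:    ", round(mean(scores), digits=3))
println("std across folds: ", round(std(scores), digits=3))
```

The mean is about 0.658 with a spread of roughly 0.003 between folds, close to the 0.66 training accuracy and 0.67 test accuracy reported earlier, which suggests the result does not depend much on which rows land in which fold.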

Confusion Matrix

@sk_import metrics: plot_confusion_matrix
plot_confusion_matrix(simplelogistic,X_train,y_train)
PyPlot.gcf()
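
The numbers behind such a plot can also be tabulated directly. Below is a minimal sketch on made-up labels; `confusion_counts` is a hypothetical helper that assumes the labels are 0 and 1, with rows for the true label and columns for the predicted label.

```julia
# Count matrix with rows = true label, columns = predicted label.
# Assumes binary labels coded as 0 and 1.
function confusion_counts(y_true, y_hat)
    cm = zeros(Int, 2, 2)
    for (t, p) in zip(y_true, y_hat)
        cm[t + 1, p + 1] += 1    # +1 because Julia arrays are 1-indexed
    end
    return cm
end

# Toy labels, made up for illustration (not taken from the COVID data).
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat_demo  = [1, 0, 1, 0, 1, 1, 0, 1]

cm = confusion_counts(y_true_demo, y_hat_demo)
println(cm)   # prints [2 1; 1 4]
```

Here the diagonal (2 and 4) holds the correct predictions, and the off-diagonal entries hold one false positive and one false negative. For the real model, the same function could be applied to `y_train` and the training predictions.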