Naive Bayes Classifier

All code used in this module is available here.

Loading Packages

using DataFrames, CSV, ScikitLearn, PyPlot

Loading the dataset

data = CSV.File("covid_cleaned.csv") |> DataFrame
20352×16 DataFrame
   Row │ intubed  pneumonia  age    pregnancy  diabetes  copd   asthma  inmsupr  hypertension  other_disease  cardiovascular  obesity  renal_chronic  tobacco  contact_other_covid  covid_res 
       │ Int64    Int64      Int64  Int64      Int64     Int64  Int64   Int64    Int64         Int64          Int64           Int64    Int64          Int64    Int64                Int64     
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │       0          0     25          0         0      0       0        0             0              0               0        0              0        0                    1          1
     2 │       0          0     52          0         0      0       0        0             0              0               0        1              0        1                    1          1
     3 │       0          1     51          0         0      0       0        0             0              0               0        0              0        0                    1          1
     4 │       1          1     67          0         1      0       0        0             1              0               0        1              0        0                    1          1
     5 │       0          1     59          0         1      0       0        0             0              0               0        0              0        0                    1          1
     6 │       0          0     52          0         1      0       0        0             1              0               1        0              0        0                    0          1
     7 │       0          1     54          0         0      0       0        0             0              0               0        0              0        0                    0          1
     8 │       0          1     78          0         0      0       0        0             1              0               0        1              0        0                    1          1
   ⋮   │    ⋮         ⋮        ⋮        ⋮         ⋮        ⋮      ⋮        ⋮          ⋮              ⋮              ⋮            ⋮           ⋮           ⋮              ⋮               ⋮
 20346 │       1          1     65          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20347 │       0          1     49          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20348 │       0          1     80          0         1      0       0        0             0              0               0        0              0        0                    0          0
 20349 │       0          0     13          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20350 │       1          0     23          0         0      0       0        0             0              1               0        0              0        1                    0          0
 20351 │       0          1      1          0         0      0       0        0             0              0               0        0              0        0                    0          0
 20352 │       0          1     55          0         0      0       0        0             0              0               0        1              0        0                    0          0
                                                                                                                                                                            20337 rows omitted

ScikitLearn works on plain arrays rather than DataFrames, so we convert the features to a matrix and the target to a vector:

X = Matrix(data[!,Not(:covid_res)]) # feature matrix: every column except the target
y = Vector(data[!,:covid_res]) # :covid_res is our target variable

Splitting the data into training set and test set

@sk_import model_selection: train_test_split
# test_size sets the fraction of rows held out for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
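
To make the splitting step less of a black box, here is a minimal pure-Julia sketch of the idea: shuffle the row indices, then cut them into a test part and a training part. The helper name `simple_split` and the toy data are hypothetical, and the sketch will not reproduce the exact rows that `train_test_split` selects with `random_state=42`.

```julia
using Random

# Hypothetical helper illustrating a shuffled train/test split.
# It is NOT the scikit-learn implementation, only a sketch of the idea.
function simple_split(X::AbstractMatrix, y::AbstractVector; test_size=0.33, seed=42)
    n = size(X, 1)
    n_test = ceil(Int, test_size * n)          # number of rows held out for testing
    idx = randperm(MersenneTwister(seed), n)   # reproducible random permutation of 1:n
    test_idx, train_idx = idx[1:n_test], idx[n_test+1:end]
    return X[train_idx, :], X[test_idx, :], y[train_idx], y[test_idx]
end

# Toy data: 10 rows, 2 features
X_demo = reshape(collect(1:20), 10, 2)
y_demo = collect(1:10)

Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo)
println(size(Xtr), " ", size(Xte))
```

With 10 rows and `test_size=0.33`, four rows go to the test set and six to the training set.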

Model Definition

@sk_import naive_bayes: GaussianNB
gnb = GaussianNB()
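
For intuition about what a Gaussian Naive Bayes model does, the following is a simplified sketch written from scratch in Julia. It estimates a prior, a mean and a variance per class and feature, then predicts the class with the highest log posterior. The names `SimpleGNB`, `fit_gnb` and `predict_gnb` are hypothetical, and the fixed `eps` smoothing is an assumption that differs from scikit-learn's own variance smoothing.

```julia
using Statistics

# Hypothetical container for the fitted parameters (not the ScikitLearn type).
struct SimpleGNB
    classes::Vector{Int}
    logpriors::Vector{Float64}   # log P(class)
    means::Matrix{Float64}       # one row per class, one column per feature
    vars::Matrix{Float64}        # per-class, per-feature variances
end

function fit_gnb(X::AbstractMatrix, y::AbstractVector{<:Integer}; eps=1e-9)
    classes = sort(unique(y))
    k, d = length(classes), size(X, 2)
    logpriors, means, vars = zeros(k), zeros(k, d), zeros(k, d)
    for (i, c) in enumerate(classes)
        Xc = X[y .== c, :]                                  # rows belonging to class c
        logpriors[i] = log(size(Xc, 1) / length(y))
        means[i, :] = vec(mean(Xc, dims=1))
        vars[i, :] = vec(var(Xc, dims=1, corrected=false)) .+ eps  # eps avoids division by zero
    end
    return SimpleGNB(classes, logpriors, means, vars)
end

# Log prior plus the sum of independent per-feature Gaussian log densities
function log_score(m::SimpleGNB, i::Int, x::AbstractVector)
    μ, σ2 = m.means[i, :], m.vars[i, :]
    return m.logpriors[i] + sum(-0.5 .* log.(2π .* σ2) .- (x .- μ) .^ 2 ./ (2 .* σ2))
end

# Pick, for every row, the class with the highest score
predict_gnb(m::SimpleGNB, X::AbstractMatrix) =
    [m.classes[argmax([log_score(m, i, X[r, :]) for i in eachindex(m.classes)])]
     for r in 1:size(X, 1)]

# Toy data: two well-separated clusters
X_demo = [1.0 2.0; 1.2 1.8; 0.8 2.2; 5.0 6.0; 5.2 5.8; 4.8 6.2]
y_demo = [0, 0, 0, 1, 1, 1]

model = fit_gnb(X_demo, y_demo)
println(predict_gnb(model, [1.1 2.1; 5.1 5.9]))
```

The "naive" part is the sum inside `log_score`: each feature contributes independently, given the class.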

Model Fitting

fit!(gnb,X_train,y_train)

Model Evaluation

Classification report with the training data:

y_pred = predict(gnb,X_train)

@sk_import metrics: classification_report
print(classification_report(y_train,y_pred))
              precision    recall  f1-score   support

           0       0.60      0.40      0.48      4245
           1       0.65      0.81      0.72      5931

    accuracy                           0.64     10176
   macro avg       0.62      0.60      0.60     10176
weighted avg       0.63      0.64      0.62     10176
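
The columns of the report are related to each other, and a short sketch can make that explicit. The block below takes the rounded precision, recall and support values from the training report above and recomputes the F1 scores, the two averages and the accuracy. The variable names are illustrative, and because the inputs are already rounded to two decimals the results agree with the report only up to rounding.

```julia
# Precision, recall and support copied from the training report above
prec    = [0.60, 0.65]   # classes 0 and 1
rec     = [0.40, 0.81]
support = [4245, 5931]

# F1 is the harmonic mean of precision and recall
f1 = 2 .* prec .* rec ./ (prec .+ rec)

macro_f1    = sum(f1) / length(f1)                  # unweighted mean over classes
weighted_f1 = sum(f1 .* support) / sum(support)     # mean weighted by class support
# recall * support is the number of correctly predicted samples in each class
accuracy    = sum(rec .* support) / sum(support)

println("f1 per class: ", round.(f1, digits=2))
println("macro f1:     ", round(macro_f1, digits=2))
println("weighted f1:  ", round(weighted_f1, digits=2))
println("accuracy:     ", round(accuracy, digits=2))
```

The low recall for class 0 (0.40) is what pulls the accuracy down: the model labels most negative cases as positive.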

Classification report with the test data:

y_pred_test = predict(gnb,X_test)
print(classification_report(y_test,y_pred_test))

Confusion Matrix

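@sk_import metrics: plot_confusion_matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions,
# ConfusionMatrixDisplay.from_estimator(gnb, X_train, y_train) is the replacement.
plot_confusion_matrix(gnb,X_train,y_train) # confusion matrix on the training data
PyPlot.gcf()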
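
The plot shows counts of true labels against predicted labels. As a rough sketch of how those counts are tallied, the hypothetical helper `confusion_counts` below builds the same kind of matrix in plain Julia, with rows for true labels and columns for predicted labels (the layout scikit-learn uses). The toy labels are made up for illustration.

```julia
# Hypothetical helper: rows are true labels, columns are predicted labels.
function confusion_counts(y_true::AbstractVector, y_pred::AbstractVector)
    labels = sort(unique(vcat(y_true, y_pred)))
    index = Dict(l => i for (i, l) in enumerate(labels))   # label -> row/column position
    cm = zeros(Int, length(labels), length(labels))
    for (t, p) in zip(y_true, y_pred)
        cm[index[t], index[p]] += 1
    end
    return labels, cm
end

# Toy labels, not taken from the COVID dataset
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

labels, cm = confusion_counts(y_true_demo, y_pred_demo)
println(cm)   # prints [2 1; 1 2]
```

The diagonal holds the correct predictions, so dividing its sum by the total number of samples gives the accuracy.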