Advanced Data Handling

All code used in this module is available here.
Please read Fundamentals of Data Handling before reading this module.


Most of the times, the data you get wouldn't be in the format that you require them to be in. In those cases you rely on the querying functionalities (like those supported by SQL) that your language supports for data manipulation. A package that provides a really neat querying syntax is DataFramesMeta. In this module, you will learn how to use various functionalities from the DataFramesMeta package to manipulate your raw data.

Loading the packages and the cars dataset from Rdatasets

using RDatasets
using StatsBase
using DataFrames
using DataFramesMeta
cars = dataset("datasets", "mtcars")

Before going further, we need to know the column names. We will use the describe() function to get the column names as well as some other summary statistics of the data.

describe(cars)
12×7 DataFrame
 Row │ variable  mean     min          median  max         nmissing  eltype   
     │ Symbol    Union…   Any          Union…  Any         Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────
   1 │ Model              AMC Javelin          Volvo 142E         0  String
   2 │ MPG       20.0906  10.4         19.2    33.9               0  Float64
   3 │ Cyl       6.1875   4            6.0     8                  0  Int64
   4 │ Disp      230.722  71.1         196.3   472.0              0  Float64
   5 │ HP        146.688  52           123.0   335                0  Int64
   6 │ DRat      3.59656  2.76         3.695   4.93               0  Float64
   7 │ WT        3.21725  1.513        3.325   5.424              0  Float64
   8 │ QSec      17.8487  14.5         17.71   22.9               0  Float64
   9 │ VS        0.4375   0            0.0     1                  0  Int64
  10 │ AM        0.40625  0            0.0     1                  0  Int64
  11 │ Gear      3.6875   3            4.0     5                  0  Int64
  12 │ Carb      2.8125   1            2.0     8                  0  Int64

Let's say you only want data about the cars that have 4 cylinders. How would we get that?

@linq cars |>
        where(:Cyl .==4)
11×12 DataFrame
 Row │ Model           MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb  
     │ String          Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Datsun 710         22.8      4    108.0     93     3.85    2.32     18.61      1      1      4      1
   2 │ Merc 240D          24.4      4    146.7     62     3.69    3.19     20.0       1      0      4      2
   3 │ Merc 230           22.8      4    140.8     95     3.92    3.15     22.9       1      0      4      2
   4 │ Fiat 128           32.4      4     78.7     66     4.08    2.2      19.47      1      1      4      1
   5 │ Honda Civic        30.4      4     75.7     52     4.93    1.615    18.52      1      1      4      2
   6 │ Toyota Corolla     33.9      4     71.1     65     4.22    1.835    19.9       1      1      4      1
   7 │ Toyota Corona      21.5      4    120.1     97     3.7     2.465    20.01      1      0      3      1
   8 │ Fiat X1-9          27.3      4     79.0     66     4.08    1.935    18.9       1      1      4      1
   9 │ Porsche 914-2      26.0      4    120.3     91     4.43    2.14     16.7       0      1      5      2
  10 │ Lotus Europa       30.4      4     95.1    113     3.77    1.513    16.9       1      1      5      2
  11 │ Volvo 142E         21.4      4    121.0    109     4.11    2.78     18.6       1      1      4      2

With the same logic, we can also get the details of a specific car model:

@linq cars |>
        where(:Model .== "Honda Civic")
1×12 DataFrame
 Row │ Model        MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb  
     │ String       Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64 
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Honda Civic     30.4      4     75.7     52     4.93    1.615    18.52      1      1      4      2

If you want to have more than one condition, you can list all them together with a comma separating each condition.

@linq cars |>
        where(:Cyl .==4, :Gear .==5)
2×12 DataFrame
 Row │ Model          MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb  
     │ String         Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64 
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Porsche 914-2     26.0      4    120.3     91     4.43    2.14      16.7      0      1      5      2
   2 │ Lotus Europa      30.4      4     95.1    113     3.77    1.513     16.9      1      1      5      2

But suppose you want to have an OR logic between your conditions.

@linq cars |>
        where( .|(:Cyl .==4, :Cyl .==6))
18×12 DataFrame
 Row │ Model           MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb  
     │ String          Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4          21.0      6    160.0    110     3.9     2.62     16.46      0      1      4      4
   2 │ Mazda RX4 Wag      21.0      6    160.0    110     3.9     2.875    17.02      0      1      4      4
   3 │ Datsun 710         22.8      4    108.0     93     3.85    2.32     18.61      1      1      4      1
   4 │ Hornet 4 Drive     21.4      6    258.0    110     3.08    3.215    19.44      1      0      3      1
   5 │ Valiant            18.1      6    225.0    105     2.76    3.46     20.22      1      0      3      1
  ⋮  │       ⋮            ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋮       ⋮      ⋮      ⋮      ⋮
  14 │ Fiat X1-9          27.3      4     79.0     66     4.08    1.935    18.9       1      1      4      1
  15 │ Porsche 914-2      26.0      4    120.3     91     4.43    2.14     16.7       0      1      5      2
  16 │ Lotus Europa       30.4      4     95.1    113     3.77    1.513    16.9       1      1      5      2
  17 │ Ferrari Dino       19.7      6    145.0    175     3.62    2.77     15.5       0      1      5      6
  18 │ Volvo 142E         21.4      4    121.0    109     4.11    2.78     18.6       1      1      4      2
                                                                                               8 rows omitted

But out of all the columns, if you are only interested in a subset of the columns in the data, you can use the select() function to choose the ones you are interested in.

@linq cars |>
        where(:Cyl .!=4) |>
        select(:Model, :MPG)
21×2 DataFrame
 Row │ Model                MPG     
     │ String               Float64 
─────┼──────────────────────────────
   1 │ Mazda RX4               21.0
   2 │ Mazda RX4 Wag           21.0
   3 │ Hornet 4 Drive          21.4
   4 │ Hornet Sportabout       18.7
   5 │ Valiant                 18.1
  ⋮  │          ⋮              ⋮
  17 │ Camaro Z28              13.3
  18 │ Pontiac Firebird        19.2
  19 │ Ford Pantera L          15.8
  20 │ Ferrari Dino            19.7
  21 │ Maserati Bora           15.0
                     11 rows omitted

You can also order the data using orderby() command. By default it will be in the ascending order. To list them in a descending way you can add -ve sign to the variable.

@linq cars |>
        where(:Cyl .==4) |>
        select(:Model, :MPG) |>
        orderby(-:MPG)
11×2 DataFrame
 Row │ Model           MPG     
     │ String          Float64 
─────┼─────────────────────────
   1 │ Toyota Corolla     33.9
   2 │ Fiat 128           32.4
   3 │ Honda Civic        30.4
   4 │ Lotus Europa       30.4
   5 │ Fiat X1-9          27.3
   6 │ Porsche 914-2      26.0
   7 │ Merc 240D          24.4
   8 │ Datsun 710         22.8
   9 │ Merc 230           22.8
  10 │ Toyota Corona      21.5
  11 │ Volvo 142E         21.4

Sometimes in your data pre-processing, you have to create new meaningful variables from your dataset. This is called feature engineering and can be achieved with the help of transform() command in DataFramesMeta.

@linq cars |>
        transform(KPL = :MPG .* 0.425144) |>
        by(:Cyl, meanMPG = mean(:MPG))
3×2 DataFrame
 Row │ Cyl    meanMPG 
     │ Int64  Float64 
─────┼────────────────
   16  19.7429
   24  26.6636
   38  15.1

unstack and stack are commands we use to reshape a data from wide format to long format and vice-versa.

unstack(cars,:Model,:Cyl,:MPG)
32×4 DataFrame
 Row │ Model              6          4          8String             Float64?   Float64?   Float64?  
─────┼────────────────────────────────────────────────────
   1 │ Mazda RX4               21.0  missing    missing   
   2 │ Mazda RX4 Wag           21.0  missing    missing   
   3 │ Datsun 710         missing         22.8  missing   
   4 │ Hornet 4 Drive          21.4  missing    missing   
   5 │ Hornet Sportabout  missing    missing         18.7
  ⋮  │         ⋮              ⋮          ⋮          ⋮
  28 │ Lotus Europa       missing         30.4  missing   
  29 │ Ford Pantera L     missing    missing         15.8
  30 │ Ferrari Dino            19.7  missing    missing   
  31 │ Maserati Bora      missing    missing         15.0
  32 │ Volvo 142E         missing         21.4  missing   
                                           22 rows omitted