Describing Data

All code used in this module is available here.
using RDatasets 
data = dataset("datasets", "iris")

Once you have loaded your data into your Julia work environment, an easy way to get a quick summary of your dataset is to call the describe() function:

describe(data)
5×8 DataFrame
│ Row │ variable    │ mean    │ min    │ median │ max       │ nunique │ nmissing │ eltype                   │
│     │ Symbol      │ Union…  │ Any    │ Union… │ Any       │ Union…  │ Nothing  │ DataType                 │
├─────┼─────────────┼─────────┼────────┼────────┼───────────┼─────────┼──────────┼──────────────────────────┤
│ 1   │ SepalLength │ 5.84333 │ 4.3    │ 5.8    │ 7.9       │         │          │ Float64                  │
│ 2   │ SepalWidth  │ 3.05733 │ 2.0    │ 3.0    │ 4.4       │         │          │ Float64                  │
│ 3   │ PetalLength │ 3.758   │ 1.0    │ 4.35   │ 6.9       │         │          │ Float64                  │
│ 4   │ PetalWidth  │ 1.19933 │ 0.1    │ 1.3    │ 2.5       │         │          │ Float64                  │
│ 5   │ Species     │         │ setosa │        │ virginica │ 3       │          │ CategoricalString{UInt8} │

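So how do we interpret the above table? Each row summarizes one column of the dataset, and eltype gives the type of the values in that column. For the four numeric columns, mean, min, median and max are the usual summary statistics. Species is categorical, so it has no mean or median; instead, nunique reports that it takes 3 distinct values, which we can list with unique():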

unique(data.Species)
3-element Array{String,1}:
  "setosa"
  "versicolor"
  "virginica"

But suppose you are only interested in the statistics of a particular column or a bunch of columns.

In that case you can use the functions provided by the Statistics standard library that ships with Julia. Suppose I am more interested in understanding PetalWidth:

Standard Deviation:

using Statistics
std(data.PetalWidth)
0.7622376689603466

Variance:

var(data.PetalWidth)
0.581006263982103
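
The two numbers above are related: the standard deviation is the square root of the variance, and by default both std() and var() use the corrected (n - 1) denominator. A minimal sketch of that relationship, using a made-up vector x (an assumption for illustration, not the iris column):

```julia
using Statistics

# Hypothetical toy sample, not taken from the iris dataset
x = [0.2, 0.4, 1.3, 1.5, 2.1]

n = length(x)
m = sum(x) / n                               # sample mean
manual_var = sum((x .- m) .^ 2) / (n - 1)    # corrected variance: divide by n - 1
manual_std = sqrt(manual_var)                # standard deviation is sqrt(variance)

println(manual_var ≈ var(x))                 # compare with the library result
println(manual_std ≈ std(x))

# The uncorrected (population) variance divides by n instead
pop_var = var(x; corrected = false)
println(pop_var ≈ sum((x .- m) .^ 2) / n)
```

If you want the population versions on your own data, passing corrected = false to var() or std() should be enough.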

Mean:

mean(data.PetalWidth)
1.1993333333333336

Median:

median(data.PetalWidth)
1.3
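
The median is the 0.5 quantile, so the same value can also be obtained from quantile(), which is part of Statistics as well. A small sketch, again on a made-up vector x rather than the iris column:

```julia
using Statistics

# Hypothetical toy sample, not taken from the iris dataset
x = [0.2, 0.2, 1.3, 1.5, 2.5]

med = median(x)
q50 = quantile(x, 0.5)                    # 0.5 quantile of the same data
println(med ≈ q50)

# Several quantiles at once: lower quartile, median, upper quartile
qs = quantile(x, [0.25, 0.5, 0.75])
println(qs)
```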

Correlation:

cor(data.PetalWidth, data.PetalLength)
0.9628654314027961
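
The value returned by cor() is the Pearson correlation coefficient, that is, the covariance of the two columns divided by the product of their standard deviations. The sketch below checks this on two made-up vectors x and y (assumptions for illustration only) and also shows that cor() accepts a matrix, which is handy when you care about a bunch of columns at once:

```julia
using Statistics

# Hypothetical toy vectors, not taken from the iris dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

# Pearson correlation written out by hand
manual_cor = cov(x, y) / (std(x) * std(y))
println(manual_cor ≈ cor(x, y))

# Passing a matrix treats each column as a variable and
# returns the matrix of pairwise correlations
M = hcat(x, y)
C = cor(M)
println(size(C))
println(C[1, 2] ≈ cor(x, y))
```

For a DataFrame, one way to get such a matrix is to convert the numeric columns to a Matrix first; how that conversion is spelled depends on your DataFrames version.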