Describing Data

All code used in this module is available here.

using RDatasets 
data = dataset("datasets", "iris")

Once you have loaded your data into your Julia work environment, an easy way to get a quick summary of your dataset is to use the describe() function

describe(data)

5×8 DataFrame
│ Row │ variable    │ mean    │ min    │ median │ max       │ nunique │ nmissing │ eltype                   │
│     │ Symbol      │ Union…  │ Any    │ Union… │ Any       │ Union…  │ Nothing  │ DataType                 │
├─────┼─────────────┼─────────┼────────┼────────┼───────────┼─────────┼──────────┼──────────────────────────┤
│ 1   │ SepalLength │ 5.84333 │ 4.3    │ 5.8    │ 7.9       │         │          │ Float64                  │
│ 2   │ SepalWidth  │ 3.05733 │ 2.0    │ 3.0    │ 4.4       │         │          │ Float64                  │
│ 3   │ PetalLength │ 3.758   │ 1.0    │ 4.35   │ 6.9       │         │          │ Float64                  │
│ 4   │ PetalWidth  │ 1.19933 │ 0.1    │ 1.3    │ 2.5       │         │          │ Float64                  │
│ 5   │ Species     │         │ setosa │        │ virginica │ 3       │          │ CategoricalString{UInt8} │

So how do we interpret the above table ?

The first column in the above table shows the column number of all columns in the data a.k.a the position of each column in the data
The third, fourth, fifth, and sixth columns report the summary statistics of each column in the dataset. These summary statistics are important for understanding the dispersion and shape of the distribution of the variables you are interested.
nunique returns the no of unique elements in a column. This is an important information when your data contains categorical variables. In this case, we can see that Species column is categorical in nature and has 3 unique elements/levels. To get the levels information, you can either use levels() or unique(). While levels() is just reserved for categorical type, unique()can be used for any element type.

unique(data.Species)

3-element Array{String,1}:
  "setosa"
  "versicolor"
  "virginica"

One of the many serious problems everyone face who deal with real-world data is the problem of missing data. It is very important to know if your data have missing elements. nmissing returns the no of missing values in each column of the data. For iris data, we don't have any missing values and hence the field is empty.

But suppose you are only interested in the statistics of a particular column or a bunch of columns.

In that case you can use the functions provided by the Statistics standard library that comes with Julia. Suppose I am more interested in understanding PetalWidth

Standard Deviation:

using Statistics
std(data.PetalWidth)

0.7622376689603466

Variance:

var(data.PetalWidth)

0.581006263982103

Mean:

mean(data.PetalWidth)

1.1993333333333336

Median:

median(data.PetalWidth)

1.3

Correlation:

cor(data.PetalWidth, data.PetalLength)

0.9628654314027961

Computational Cognitive Modeling Lab

Describing Data