Getting the data

From URLs
From Your Computer
From Packages
From other languages

Though we will be mostly using toy datasets from RDatasets, you are encouraged to explore and fiddle with publicly available open datasets for your own learning and for your class project. Some of the data repositories you should checkout are:

Kaggle: For all sorts of datasets.
UCI Machine Learning Repository: For all sorts of datasets.
UCLA Library : For Psychological datasets.
CMU DataShop: For Educational & Psychological datasets.
CRCN: For Neuroscience datasets.
OpenNeuro: For Neuroscience datasets
RDatasets: A collection of common datasets used in all Statistical/ Machine Learning textbooks

Once you have found a dataset, the first step in your Machine Learning programing workflow is to load the data into your work environment. Either the dataset came with a package, or you found a dataset from a repository like the UCI Machine Learning Repository, or you already have them in your computer. Here we describe how you can load the datasets in each of the mentioned scenarios.

From URLs

Suppose the data you wanted to use is available in a .csv format in a public domain like this https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv

The following code will help you to download the URL to your current working directory:

url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv"
dataname = "iris.csv" # Here you can give any name you wish instead of iris 
download(url, dataname)

"iris.csv"

🔍 Useful Tip

If in case you forgot your current working directory (where the above code will be downloading the dataset into), you just can run the following code and it will print your current working directory.

pwd()

"C:\\Users\\BBS"

👆 In this case, Julia is telling us that we are at file location C:\Users\BBS

From Your Computer

Suppose you downloaded the dataset using the above step or you already had the dataset in your current working directory. Then you can load the dataset into your Julia environment using the following code:

using CSV # This is a pacakge we use for loading CSV Files.
using DataFrames

🔍 If you haven't installed the package, you can check the Software Setup page to learn how to install packages in Julia.

data = CSV.read("iris.csv", DataFrame) # This will load the dataset and convert it into a DataFrame

150×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species   
     │ Float64       Float64      Float64       Float64      String    
─────┼─────────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa
   4 │          4.6          3.1           1.5          0.2  setosa
   5 │          5.0          3.6           1.4          0.2  setosa
  ⋮  │      ⋮             ⋮            ⋮             ⋮           ⋮
 146 │          6.7          3.0           5.2          2.3  virginica
 147 │          6.3          2.5           5.0          1.9  virginica
 148 │          6.5          3.0           5.2          2.0  virginica
 149 │          6.2          3.4           5.4          2.3  virginica
 150 │          5.9          3.0           5.1          1.8  virginica
                                                       140 rows omitted

From Packages

Sometimes you want toy datasets to develop the "proof-of-concept" code or check your algorithms. In that case,Rdatasets is a good starting point. RDatasets provides access to many of the standard datasets that are generally used to get started in data science and machine learning.

The RDatasets comes with the Iris data, and the following code will illustrate how to load them into your Julia environment.

using RDatasets 
data = dataset("datasets", "iris")

150×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species   
     │ Float64       Float64      Float64       Float64      String    
─────┼─────────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa
   4 │          4.6          3.1           1.5          0.2  setosa
   5 │          5.0          3.6           1.4          0.2  setosa
  ⋮  │      ⋮             ⋮            ⋮             ⋮           ⋮
 146 │          6.7          3.0           5.2          2.3  virginica
 147 │          6.3          2.5           5.0          1.9  virginica
 148 │          6.5          3.0           5.2          2.0  virginica
 149 │          6.2          3.4           5.4          2.3  virginica
 150 │          5.9          3.0           5.1          1.8  virginica
                                                       140 rows omitted

From other languages

Sometimes some parts of your data science/machine learning pipeline will be handled by another team whom might be using a different programming language. And for their ease of use, they might also be saving data in their language's native data format. In those cases you can use NPZ(Python), RData (R), or MAT (Matlab) packages to load those data.

Python

using NPZ
data = npzread("iris.npz") # To load NPZ data
npzwrite("iris_new.npz", data)  # To write NPZ data

R

using RData
data = RData.load("iris.rda") # To load NPZ data

MATLAB

using MAT
data = matread("iris.mat") # To load MAT data
matwrite("iris_new.mat",data) # To write MAT data