Get and Clean Tidy Data

.title[
# Get and Clean Tidy Data
]
.author[
### Christopher Gandrud
]
.institute[
### HU DYNAMICS
]
.date[
### 2022-06-10 (updated: 2023-05-06)
]

---

# 📝 Lesson Preview

- Retrieving and cleaning (tabular) public data

- APIs

- Cleaning grouped data

- Introduction to SQL and quierying relational databases

---

# Accessing data

Significant amounts of data are open for researchers and the public generally.

However, the level of **accessibility varies**:

- use restrictions

- format

- documentation

- version control

- data privacy restrictions

---

# So . . .

We are only going to begin **scratching the surface** of the data access
**obstacles** you are likely to encounter.

---

# Retrieving and cleaning (tabular) public data

---

# Remember to programmatically tie your research to your data

Do as much **data gathering** and **cleaning** as possible programmatically:

- Fully document for reproducible research.

- Can find (inevitable) mistakes.

- Easy to update when the data is updated.

- Can apply methods to other data sets.

---

# "Easy" automatic data gathering

1. Plain-text data (e.g. CSV) stored at secure (https) URL, not embedded in
a larger HTML marked-up website.

2. Data stored in a database with a well structured API (Application Programming
    Interface), that has a corresponding R package.

---

# Non-Secure URL Plain-text data

Use `read.table()` or `read.csv()` (just a wrapper for `read.csv()` with `sep = ','`).

Include the URL rather than the file path.

```r
read.table('http://SOMEDATA.csv')
```

---

# Loading compressed plain-text data

You can download and load data files stored in compressed formats (e.g. `.zip`).

1. Download the compressed file into a **temporary file**.

2. Uncompress the file and pass it to `read.table()`, `import()`, etc.

---

# Loading compressed plain-text data

Load data from @Pemstein2010 in a file called *uds_summary.csv*.

Note: this example no longer works because the website was taken down.

```r
# For simplicity, store the URL in an object called 'URL'.

URL <- "https://bit.ly/3z4Lay2"

# Create a temporary file called 'temp' to put the zip file into.
temp <- tempfile()

# Download the compressed file into the temporary file.
download.file(URL, temp)

# Decompress the file and convert it into a data frame
uds <- read.csv(gzfile(temp, "uds_summary.csv"))

# Delete the temporary file.
unlink(temp)
```

---

# Secure (https) URL Plain-text data

Use `import()` from the *rio* package.

Data on GitHub is stored at secure URLs. Select the **RAW** URL:

```r
URL <- paste0('https://raw.githubusercontent.com/christophergandrud/',
 'LegislativeViolence/master/Data/LegViolenceDescriptives.csv')
leg_violence <- readr::read_csv(URL)
head(leg_violence)
```

```
## # A tibble: 6 × 10
## country year violence violence_y_cum udems_mean dem_age polity2 durable
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980 0 0 -2.03 0 NA 0
## 2 Afghanistan 1981 0 0 -1.84 0 NA 0
## 3 Afghanistan 1982 0 0 -1.83 0 NA 0
## 4 Afghanistan 1983 0 0 -1.61 0 NA 0
## 5 Afghanistan 1984 0 0 -1.62 0 NA 0
## 6 Afghanistan 1985 0 0 -1.83 0 NA 0
## # ℹ 2 more variables: disproportionality <dbl>, maj <dbl>
```

---

# Versioning and reproducible research

Data maintainers (unfortunately) often change data sets with little or no documentation.

`source_data` allows you to notice these changes by assigning each file a unique
[SHA1 Hash](http://en.wikipedia.org/wiki/SHA-1).

Each download can be checked against the Hash:

```r
leg_violence <- repmis::source_data(URL,
 sha1 = '01cff579b689cea9ef9c98e433ce3122745cc5cb')
```

```
## Downloading data from: https://raw.githubusercontent.com/christophergandrud/LegislativeViolence/master/Data/LegViolenceDescriptives.csv
```

```
## Specified SHA-1 hash matches downloaded data file.
```

🏊 Dive deeper: [source code](https://github.com/christophergandrud/repmis/blob/3989ad8f5d6270ea1c939d8e53ad6c3c14afe4f1/R/utils.R#L39-L51) for hash check.

---

# Excel Files

The `source_XlsxData()` function in repmis does the same hash check as `source_data()`, but for Excel files. It builds on `read.xlsx()` for loading locally stored Excel files.

Can also use `read_excel()` from [readxl](https://readxl.tidyverse.org/).

Note: Excel data often needs **a lot of cleaning** before it is useful
for statistical/graphical analyses.

---

# Caching

`source_data()` allows you to **cache** data with `cache = TRUE`
([source code](https://github.com/christophergandrud/repmis/blob/3989ad8f5d6270ea1c939d8e53ad6c3c14afe4f1/R/source_data.R#L84-L113)).

Caching stores data so that future requests for the data can be done faster.

Creates a locally stored version of the data set which is useful for 
reproducibility.

The cache and the source can become out of sync, hence the need to be able to compare their hashes.

There are other approaches to achieve similar goals, like [makefiles](https://www.gnu.org/software/make/manual/html_node/Introduction.html).

---

# Data APIs

API = Application Programming Interface, a documented way for programs to talk
to each other.

Data API = a documented way to access data from one program stored with another.

---

# R and Data APIs

R can interact with most data APIs using the [httr](https://github.com/hadley/httr)
package.

Even easier: users have written API-specific packages to interact with 
particular data APIs.

---

# World Bank Development Indications with WDI

Access the [World Bank's Development Indicators](http://data.worldbank.org/indicator)
with the WDI package.

Alternative Energy Use example:

```r
# Load WDI package
library(WDI)

# Per country alternative energy use as % of total energy use
alt_energy <- WDI(indicator = 'EG.USE.COMM.CL.ZS')
```

Note: The indicator ID is at the end of the indicator's URL on the World Bank site.

---

# Financial Data with quantmod

The [quantmod](http://www.quantmod.com/) package allows you to access financial
data from a variety of sources (e.g. [Yahoo Finance](http://finance.yahoo.com/),
[Google Finance](https://www.google.com/finance),
[US Federal Reserve's FRED database](http://research.stlouisfed.org/fred2/)).

```r
library(quantmod)

# Download Yen/USD exchange rate
YenDollar <- getSymbols(Symbols = 'DEXJPUS', src = 'FRED')

head(YenDollar)
```

```
## [1] "DEXJPUS"
```

See also [tidyquant](https://business-science.github.io/tidyquant/).

---

# Other API-R packages

There are many more R packages that interact with web data APIs.

For a good beginner list see:
<http://cran.r-project.org/web/views/WebTechnologies.html>

---

# Working with non-table data

| Format                                      | R packages                                                              |
| ------------------------------------------- | ----------------------------------------------------------------------- |
| Excel, Stata, SPSS, SAS                     | rio, readxl                                                                     |
| [JSON](http://en.wikipedia.org/wiki/JSON)   | [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html) |
| Databases, e.g. a Google BigQuery data base | [dbplyr](https://dbplyr.tidyverse.org/)                                 |

---

# Data Cleaning

---

# 🙇 Remember: Tidy data semantics + structure

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

---

# 🙇 Remember: Tidy data

| Person       | treatment | result |
| ------------ | --------- | ------ |
| John Smith   | a         |        |
| Jane Doe     | a         | 16     |
| Mary Johnson | a         | 3      |
| John Smith   | b         | 2      |
| Jane Doe     | b         | 11     |
| Mary Johnson | b         | 1      |

---

# 🙇 Remember: Messy to Tidy data

First identify what your observations and variables are.

Then use R tools to convert your data into this format.

**tidyr** and is particularly useful.

---

# 🙇 Remember: Messy to tidy data

Using `pivot_longer()`

```r
tidied <- tidyr::pivot_longer(data = messy, 
 cols = c("a", "b"), # Columns to pivot
 names_to = "treatment",
 values_to = "result")
tidied
```

```
## # A tibble: 6 × 3
## person treatment result
## <chr> <chr> <dbl>
## 1 John Smith a NA
## 2 John Smith b 2
## 3 Jane Doe a 16
## 4 Jane Doe b 11
## 5 Mary Johnson a 3
## 6 Mary Johnson b 1
```

---

# 🙇 Remember: Tidy to messy data

Sometimes it is useful to reverse this operation with `pivot_wider`.

```r
messy_again <- tidyr::pivot_wider(data = tidied, 
 names_from = "treatment",
 values_from = result)

messy_again
```

```
## # A tibble: 3 × 3
## person a b
## <chr> <dbl> <dbl>
## 1 John Smith NA 2
## 2 Jane Doe 16 11
## 3 Mary Johnson 3 1
```

---

# 🙇 Remember: Other issues cleaning data

Always **look at** and **poke your data**.

For example, see if:

- Missing values are designated with `NA`

- Variable classes are what you expect them to be.

- Distributions are what you expect them to be.

[dlookr](https://cran.r-project.org/web/packages/dlookr/vignettes/diagonosis.html)   a data quality checker package.

---

# Join (merge) tidy data frames

Once you have tidy data frames, you can merge them for analysis.

In general: **each observation** must have a **unique key identifier** to merge 
them on.

These identifiers **must match exactly across the data frames**.

---

# 🔑 Key-Value pairs

Key-value pairs are a fundamental data representation.

Data frames don't have explicit key-value pairs: e.g.

$$
\langle \mathrm{Christopher},\: \mathrm{Lecturer}\rangle
$$

`Christopher` is the key and the value is `Lecturer`.

🤓 Fun fact: this is also called a `2-tuple`. Other languages like Python and Julia handle these with data structures called "dictionaries".

---

# Data frames!

Data frames don't have explicit key-value pairings

- Nice for flexibility

- Bad for data manipulation speed and consistency

---

# Merging data

```r
tail(alt_energy, n = 3)
```

```
##       iso2c  country EG.USE.COMM.CL.ZS year
## 16490    ZW Zimbabwe                NA 1962
## 16491    ZW Zimbabwe                NA 1961
## 16492    ZW Zimbabwe                NA 1960
```

```r
tail(leg_violence, n = 3)
```

```
##         country year violence violence_y_cum udems_mean dem_age polity2 durable
## 5481     Taiwan 2010        1              3         NA      NA      NA      NA
## 5482    Ukraine 2010        1              2         NA      NA      NA      NA
## 5483 Yugoslavia 1991       NA              0         NA      NA      -5       0
##      disproportionality maj
## 5481              16.89  NA
## 5482               3.59  NA
## 5483                 NA  NA
```

---

# Create unique ID: country codes

Unique identifier will be
[iso 2 letter country code](http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)
and **year**.

Use the [countrycode](https://github.com/vincentarelbundock/countrycode)
package to turn *leg_violence* data's
country names to `iso2c`.

```r
library(countrycode)

# Assign iso2c codes base on correlates of war codes
leg_violence$iso2c <- countrycode(leg_violence$country, 
 origin = 'country.name',
 destination = 'iso2c', 
 warn = TRUE)
```

**NOTE**: Always check the data to make sure the correct codes have been applied!

---

# Creating IDs: Time

Time units may be important components of observation IDs.

Use the [lubridate](http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html)
package to standardise dates.

---

# Creating IDs: Time

```r
library(lubridate)

# Create time data
times <- c('Sep. 17 1980', 'March 23 2000', 'Nov. 3 2003')
mdy(times)
```

```
## [1] "1980-09-17" "2000-03-23" "2003-11-03"
```

Note: Times should always go from **longest to shortest** unit (e.g. YEAR-MONTH-DAY). Makes dates **sortable**.

---

# Join data

```r
# Keep only desired variables
leg_violence <- leg_violence[, c('iso2c', 'year', 'udems_mean')]
names(leg_violence)
```

```
## [1] "iso2c"      "year"       "udems_mean"
```

```r
combined <- full_join(x = alt_energy, 
 y = leg_violence,
 by = c('iso2c', 'year'))
head(combined, n = 3)
```

```
##   iso2c                     country EG.USE.COMM.CL.ZS year udems_mean
## 1    ZH Africa Eastern and Southern                NA 2021         NA
## 2    ZH Africa Eastern and Southern                NA 2020         NA
## 3    ZH Africa Eastern and Southern                NA 2019         NA
```

---

# (Many) of the joins

| Name         | Description                                               |
| ------------ | --------------------------------------------------------- |
| `full_join`  | Returns all rows and columns from both data sets          |
| `inner_join` | Returns all rows where there is a match in both data sets |
| `left_join`  | Returns all rows from x                                   |
| `right_join` | Returns all rows from y                                   |

Joins are a crucial part of Structure Query Language (**SQL**) work that is common in commercial data science for working with data bases

Always **check your data** after a merge to see if you did what you wanted to do!

---

# Clean up

You many want to do some post merge cleaning. For example assign new variable
names:

```r
names(combined) <- c('iso2c', 'year', 'country',
 'alt_energy_use', 'uds_median')
```

```r
combined <- dplyr::rename(combined, new_year = year)
```

---

# Reorder variables

```r
combined <- dplyr::relocate(combined, country)

head(combined)
```

```
##                       country iso2c EG.USE.COMM.CL.ZS new_year udems_mean
## 1 Africa Eastern and Southern    ZH                NA     2021         NA
## 2 Africa Eastern and Southern    ZH                NA     2020         NA
## 3 Africa Eastern and Southern    ZH                NA     2019         NA
## 4 Africa Eastern and Southern    ZH                NA     2018         NA
## 5 Africa Eastern and Southern    ZH                NA     2017         NA
## 6 Africa Eastern and Southern    ZH                NA     2016         NA
```

---

# More data transformations with dplyr

Set up for examples

```r
# Create fake grouped data
library(randomNames)
library(dplyr)
library(tidyr)

people <- randomNames(n = 1000)
people <- sort(rep(people, 4))
year <- rep(2010:2013, 1000)
trend_income <- c(30000, 31000, 32000, 33000)

income <- replicate(trend_income + rnorm(4, sd = 20000),
 n = 1000) %>%
 data.frame() %>%
 gather(obs, value, X1:X1000)

income$value[income$value < 0] <- 0
data <- data.frame(people, year, income = income$value)
```

---

# dplyr

```r
head(data)
```

```
##             people year     income
## 1   Abeyta, Amanda 2010   929.1221
## 2   Abeyta, Amanda 2011 21536.3039
## 3   Abeyta, Amanda 2012 25090.8023
## 4   Abeyta, Amanda 2013 69358.1747
## 5 Abeyta, Courtney 2010     0.0000
## 6 Abeyta, Courtney 2011 59779.2156
```

---

# Filter dplyr

Select rows

```r
higher_income <- filter(data, income > 60000)

head(higher_income)
```

```
##                people year   income
## 1      Abeyta, Amanda 2013 69358.17
## 2        Aken, Blanca 2012 74046.89
## 3 al-Abdulla, Maisoon 2011 67877.35
## 4    al-Aboud, Kabeer 2013 65387.64
## 5    al-Ahmad, Raaida 2012 63470.02
## 6      al-Amen, Muzna 2013 61506.77
```

---

# Select dplyr

Select columns

```r
people_income <- select(data, people, income)

# OR
people_income <- select(data, -year)

head(people_income)
```

```
##             people     income
## 1   Abeyta, Amanda   929.1221
## 2   Abeyta, Amanda 21536.3039
## 3   Abeyta, Amanda 25090.8023
## 4   Abeyta, Amanda 69358.1747
## 5 Abeyta, Courtney     0.0000
## 6 Abeyta, Courtney 59779.2156
```

---

# dplyr with grouped data

Tell dplyr what the groups are in the data with `group_by`.

```r
group_data <- group_by(data, people)

head(group_data)[1:5, ]
```

```
## # A tibble: 5 × 3
## # Groups: people [2]
## people year income
## <chr> <int> <dbl>
## 1 Abeyta, Amanda 2010 929.
## 2 Abeyta, Amanda 2011 21536.
## 3 Abeyta, Amanda 2012 25091.
## 4 Abeyta, Amanda 2013 69358.
## 5 Abeyta, Courtney 2010 0
```

Note: the following functions work on **non-grouped data** as well.

---

# dplyr with grouped data

Now that we have declared the data as grouped, we can do operations on each group.

For example, we can extract the highest and lowest income years for each person:

```r
min_max_income <- summarize(group_data,
 min_income = min(income),
 max_income = max(income))

head(min_max_income)[1:3, ]
```

```
## # A tibble: 3 × 3
## people min_income max_income
## <chr> <dbl> <dbl>
## 1 Abeyta, Amanda 929. 69358.
## 2 Abeyta, Courtney 0 59779.
## 3 Acevedo Salinas, Maria 21407. 53774.
```

---

# dplyr with grouped data

We can sort the data using `arrange`.

```r
# Sort highest income for each person in ascending order
ascending <- arrange(min_max_income, max_income)

head(ascending)[1:3, ]
```

```
## # A tibble: 3 × 3
## people min_income max_income
## <chr> <dbl> <dbl>
## 1 el-Sala, Wisaam 249. 11010.
## 2 el-Murad, Mawdood 7776. 12010.
## 3 Roel, Kimberly 0 13230.
```

---

# dplyr with grouped data

Add `desc` to sort in descending order

```r
descending <- arrange(min_max_income, desc(max_income))

head(descending)[1:3, ]
```

```
## # A tibble: 3 × 3
## people min_income max_income
## <chr> <dbl> <dbl>
## 1 O'Brien, Coleburn 38709. 102495.
## 2 Slusarczyk, Kathryn 17792. 99101.
## 3 Ball, Taulric 11272. 97805.
```

---

# dplyr with grouped data

`summarize` creates a new data frame with the summarised data.

We can use `mutate` to add new columns to the original data frame.

```r
data <- mutate(group_data,
 min_income = min(income),
 max_income = max(income))

head(data)[1:3, ]
```

```
## # A tibble: 3 × 5
## # Groups: people [1]
## people year income min_income max_income
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Abeyta, Amanda 2010 929. 929. 69358.
## 2 Abeyta, Amanda 2011 21536. 929. 69358.
## 3 Abeyta, Amanda 2012 25091. 929. 69358.
```

---

# Introduction to SQL and accessing databases from R

SQL (Structured Query Language) is a programming language used to manage and manipulate **relational databases**.

Relational databases aren't super common in academics, but **industry runs on them**.

---

# What is a "relational database"?

> A relational database is a type of database that stores data in tables with rows and columns. Each table represents a **specific entity** or concept, such as customers, orders, or products. The **relationships** between the tables are defined by common fields, or **keys**, which allow data to be linked and combined from multiple tables.

🗯
If you understand **tidy data principles** and merging data, working with SQL databases should be conceptually easy. **dbplyr** even lets you use dplyr syntax to query databases/generate SQL from R code (`translate_sql`).

The devil is in the **connections** (to the tables) and **syntax**.

---

# High level example

- Illustrates basic steps:

- Connect to database (project)
  
  - Run query and return output to R

- From [bigrquery](https://bigrquery.r-dbi.org/#dplyr) docs

- Assumes you have already set up BigQuery.

```r
library(dplyr)

# Connect to natality table for a prespecified connection
natality <- tbl(con, "natality")

# Query table
natality %>%
  select(year, month, day, weight_pounds) %>% 
  head(10) %>%
  collect()
```

---

# SQL Syntax walk through

Assume we have two databases with the data as above for `alt_engergy` and `leg_violence`. 
The following SQL query joins the "alt_energy" and "leg_violence" databases for the specific countries and years:

```sql
SELECT *
FROM alt_energy
JOIN leg_violence
ON alt_energy.iso2c = leg_violence.iso2c AND alt_energy.year = leg_violence.year
WHERE (alt_energy.iso2c = 'DE' OR alt_energy.iso2c = 'KR') AND alt_energy.year >= 2001
LIMIT 100;
```

- `*` selects all columns from both data sets. Specify specific columns with the (unquoted) names, separated by `,` commas.

- `FROM` and `JOIN` specifies the base table and the one to join with it.

- `WHERE` clause filters the results to include only specific rows.
  
- `LIMIT 100` limits the data returned to the first 100 rows (for convenience)

---

# Note!

Note, to run a SQL example from R we would need to connect to set up database connections, which we don't have time for in this course and are service specific.

Instead we aim to give you the core principles needed so that you can get started faster with a relational database you do encounter "in the wild".

---

# But you can play with SQL from BigQuery's web interface

[Google BigQuery](https://cloud.google.com/bigquery) is one popular relational database hosting platform.

You can interact with it using **dbplyr** or [bigrquery](https://bigrquery.r-dbi.org/)

To follow along, you will need to sign up for a [Google Cloud](https://cloud.google.com/bigquery) account. You need a credit card, but "free" with ~$300 credit for 90 days.

- Once you are signed up, Click "Analyze and manage data (BigQuery)" to access BigQuery"

---

# BigQuery web interface

---

# 🥅 Practice

Download two data sets (that conceivable could be merged) using the API-based packages discussed today. Either in the same or a linked R script  **clean** and **merge** the data.

And/Or set up GoogleBig Query and practice querying from the included example data sets.