File Organisation and Tidy Data Structures

.title[
# File Organisation and Tidy Data Structures
]
.author[
### Christopher Gandrud
]
.institute[
### HU DYNAMICS
]
.date[
### 2022-06-10 (updated: 2023-05-06)
]

---

# 📝 Lesson Preview

- Why is file management important?

- What is tidy data?

- Manipulating tidy data (1)

---

# 🙇 Remember: Practical tips for reproducible research

- ✏️ Document Everything!

- 📄 Everything is a (text) file.

- 🤓 All files should be human readable.

- ⋈ Explicitly tie your files together.

- 🗂 Have a plan to organise, store, and make your files available. [And iterate on this plan -- refactor regularly]

---

# Importance of understanding files/file structures

This topic may seem kind of . . . dry.

Why not just click and drag files with the GUI (Graphical User Interface)?

---

# Importance of understanding files/file structures

1. **Crucial for programmatic access and manipulation**: automated data collection and management _is_ automated file manipulation.

2. **Reproducibility**: other researchers only have your files. If they are **well
    organised** and the **links** between the files are **explicitly stated**
    then they can better understand what you did.

+ Clearest way of explicitly stating links is **dynamically** using file
    paths in your source code.

3. **You**: well organised files will be easier for you to find/understand/use
    in the future.

---

# 📄 (text) Files

---

# 📄 Why text files?

(Almost) all files are ultimately text files. Note very large data is usually stored in _binary_ file formats

- E.g. a website is typically just a series of connected `.html`, `.js`,
and `.css` files.

- **These are text files!** Despite different file extensions.

+ To see this explore a webpage with
    [Chrome Developer Tools](https://developers.google.com/web/tools/chrome-devtools/)

---

# 📄 Why text files?

Text files are **versitile**.

- Store your data (`.csv`, `json`), store your analysis code (`.R`), store your
presentation markup (`.Rmd`).

- They are **simple** and are **not dependent on particular software**.

- Any text editor can open them.

- Helps **future-proof** research.

- Easy to **version control**.

---

# 📄 CSV Example

CSV (Comma Separated Values)

- All columns are separated by commas--`,`

- All rows are separated by new lines--`\n`

---

# CSV Example

In CSV this:

```
iso2c, country, score
US, United States, 1.086
US, United States, 1.094
US, United States, 1.050
```

Makes:

---

# 🗂 Organise your files!

---

# 🌳 File paths & trees

Files are organised **hierarchically** into (upside down) trees.

```
Root
 └── Parent
      ├── Child-1
      └── Child-2
```

---

# 🌳 Root

**Root** directories are the **first level of a disk**.

```
Root
 └── Parent
      ├── Child-1
      └── Child-2
```

They are the root out of which the file tree grows.

**Naming Conventions**:

**Linux/Mac**: `/`

- e.g. `/Parent` means that the `Parent` directory is a child
of the root directory.

**Windows**: the disk is partitioned, e.g. the `C` partition is denoted `C:\`.

- `C:\Parent` indicates that the `Parent` directory is a child
of the `C` partition.

---

# 🚸 Sub (child) directories

Sub (child) directories are denoted with a forward slash `/` in Linux/Mac and a back slash `\` in Windows, e.g.:

```sh
# Linux/Mac
/Parent/Child

# Windows
C:\Parent\Child
```

---

# R tip:

- In R for Windows you either use two backslashes `\\` (`\` is the R [escape
character](http://en.wikipedia.org/wiki/Escape_character))

- Or `/` in **relative paths** in R for Windows, it will know
what you mean.

---

# 🏗 Working directories

A **working directory** is the directory where the program looks for files/other
directories.

⚠️ **Always remember the working directory.** Otherwise you may open/save files that you do not want to open/save.

---

# 🏗 Working directories

**In R:**

```r
# Find working directory
getwd()
```

```
## [1] "/Users/cgandrud/git_repos/intro-data-retrieval-management/static/slides"
```

```r
# List all files in the working directory
list.files()
```

```
##  [1] "code"                       "data"                      
##  [3] "get-json-html_cache"        "get-json-html.html"        
##  [5] "get-json-html.Rmd"          "get-tidy-data_cache"       
##  [7] "get-tidy-data.html"         "get-tidy-data.Rmd"         
##  [9] "img"                        "libs"                      
## [11] "main.bib"                   "pdf-and-text-as-data_cache"
## [13] "pdf-and-text-as-data_files" "pdf-and-text-as-data.html" 
## [15] "pdf-and-text-as-data.Rmd"   "tidy-data-structures_cache"
## [17] "tidy-data-structures.html"  "tidy-data-structures.Rmd"  
## [19] "why-automated-data_cache"   "why-automated-data_files"  
## [21] "why-automated-data.html"    "why-automated-data.Rmd"
```

```R
# Set root as working directory
setwd('/')
```

---

# 🤓 in the Terminal Shell

In the **Terminal Shell**:

```sh
# Find working directory
pwd
```

```sh
# Set root as working directory
cd /
```

---

# 🤓 Naming tips

- Data management and analyses files are often kept in a directory called `src` (source).

- Images are often in an `img` directory.

- If you need to indicate distinct versions, use **version numbers**. Never use "Final" → `FinalFinal-ReallyFinal-Final5`

- Always, everywhere use the **same date format**: YYYY-MM-DD (e.g. 2020-05-22)

+ [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601)

+ Easy to sort, parse, and use for joins

---

# Relative vs. Absolute file paths

Use **relative file paths** when possible.

- **Absolute file path**: the entire path on a particular system,
    + E.g. `/Parent/Project1/data.csv`

- **Relative file path**: the path relative to the working directory.

+ E.g. if `/Parent` is the working directory then the relative path for
    `data.csv` is `Project1/data.csv`.

Why?

- Crucial for **automation**. Your scripts will run easily on **other computers**.

- **Enhances reproducibility**.
Easier for your collaborators. Easier for you when you use another computer.

---

# Back up with `../`

You can refer to files in the working directory and further down by stating them, in reference to the working directory. E.g.

```
list.files("child")
```

Will show the files in the *child* directory, which is in the working directory.

You can also go **back up the tree** from the working directory with `../`. For example,

```
list.files("../") 
```

will list the files in the directory above the working directory.

---

# File & directory name conventions

⚠️ **Don't use spaces** in your file names.

They can create problems for programs that treat spaces as an indication that the
path has ended.

Alternatives to spaces:

- `kebab-case` (e.g. `data-analysis.R`)

- `CamelCase` (ex. `DataAnalysis.R`)

- `file_underscore` (ex. `data_analysis.R`)

---

# Load files into R--Dynamically Link

There are a number of R commands to load files, depending on the file type.

- Load Data: `read.table`, `read.csv` `read.dta` `xlsx::read.xlsx`,
`rio::import`, `jsonlite::read_json`, `readr::read_csv`

```r
read.csv('data/test-data.csv')
```

- Save Data: e.g. `write.csv()`, `write.dta()`, `jsonlite::write_json()`

- Databases: [dbplyr](https://dbplyr.tidyverse.org/)

- Load and run R source code: `source()`

---

# R Input/Output of well structured data from URLs

🧨🎇😵URLs are also **file paths for files on the internet**.

You can use them the same way as local file paths.

The [rio](https://github.com/leeper/rio) package can import many different
file types (including from URLS) with the same function: `import()`.

```r
library(rio)
disprop <- import('http://bit.ly/Ss6zDO', format = 'csv')
head(disprop)
```

```
##               country iso2c year disproportionality
## 1             Andorra    AD 2001               9.97
## 2             Andorra    AD 2005               6.10
## 3             Andorra    AD 2009               8.41
## 4             Andorra    AD 2011              17.20
## 5 Antigua and Barbuda    AG 1971              16.90
## 6 Antigua and Barbuda    AG 1976              18.54
```

---

# Advanced: Git and GitHub for source code management

Not covered in this course, but you should always manage your source code with a tool like Git/GitHub.

See [here](https://docs.github.com/en/get-started/using-git/about-git) for more details.

Let me know if you are interested and I can give you an introduction if we have time.

---

# 🧹 Tidy Structured Data

---

# Rows and columns have no meaning!

Many (not all) statistical data sets are organised into **rows** and **columns**.

Rows and columns have **no inherent meaning**.

Wickham (2014) laid out **principles of data
tidying** to standardize statistical semantics.

- Links the **physical structure** of a data set to its **meaning**
(semantics).

---

# Two ways to represent the same data

| Person       |  treatmentA | treatmentB |
| ------------ | ----------- | ---------- |
| John Smith   |             | 2          |
| Jane Doe     | 16          | 11         |
| Mary Johnson | 3           | 1          |

<br>

| Treatment  | John Smith | Jane Doe | Mary Johnson |
| ---------- | ---------- | -------- | ------------ |
| treatmentA |            | 16       | 3            |
| treatmentB | 2          | 11       | 1            |

---

# Data sets are **collections of values**.

All values are assigned to a **variable** and an **observation**.

- **Variable**: all values measuring the same attribute across units

- **Observation**: all values measured within the same unit across attributes.

---

# Tidy data semantics + structure

<br>

1. Each **variable** forms a column.

2. Each **observation** forms a row.

3. Each **type** of observational unit forms a table.

---

# Tidy data

| Person       |  treatment  | result     |
| ------------ | ----------- | ---------- |
| John Smith   | a           |            |
| Jane Doe     | a           | 16         |
| Mary Johnson | a           | 3          |
| John Smith   | b           | 2          |
| Jane Doe     | b           | 11         |
| Mary Johnson | b           | 1          |

---

# 🧹 Messy to Tidy data

First identify what your observations and variables are.

Then use R tools to convert your data into this format.

**tidyr** is particularly useful.

---

# Messy to tidy data

```r
# Create messy (wide) data
messy <- data.frame(
  person = c("John Smith", "Jane Doe", "Mary Johnson"),
  a = c(NA, 16, 3),
  b = c(2, 11, 1)
)

messy
```

```
##         person  a  b
## 1   John Smith NA  2
## 2     Jane Doe 16 11
## 3 Mary Johnson  3  1
```

---

# Messy to tidy data

Using `pivot_longer()`

```r
tidied <- tidyr::pivot_longer(data = messy, 
                     cols = c("a", "b"),     # Columns to pivot
                     names_to = "treatment",
                     values_to = "result")
tidied
```

```
## # A tibble: 6 × 3
##   person       treatment result
##   <chr>        <chr>      <dbl>
## 1 John Smith   a             NA
## 2 John Smith   b              2
## 3 Jane Doe     a             16
## 4 Jane Doe     b             11
## 5 Mary Johnson a              3
## 6 Mary Johnson b              1
```

---

# Tidy to messy data

Sometimes it is useful to reverse this operation with `pivot_wider`.

```r
messy_again <- tidyr::pivot_wider(data = tidied, 
                                 names_from = "treatment",
                                 values_from = result)

messy_again
```

```
## # A tibble: 3 × 3
##   person           a     b
##   <chr>        <dbl> <dbl>
## 1 John Smith      NA     2
## 2 Jane Doe        16    11
## 3 Mary Johnson     3     1
```

---

# 🧹 Data cleaning tips

Always **look at** and **poke your data**.

For example, see if:

- Missing values are designated with `NA`

- Variable types are what you expect them to be.

- Distributions are what you expect them to be.

---

# 🥅 Practice

Using R or the terminal, create a new research project file structure with a file tree for storing your data management and analysis scripts, the data, and a paper writing up the results.

Optional Advanced: Store this project on GitHub

---

# 🥅 With a partner

Find a CSV formatted data set online and programmatically load it into R.

If it is in long format, pivot wider.

If it is in wide format, pivot longer.