class: center, middle, inverse, title-slide .title[ # File Organisation and Tidy Data Structures ] .author[ ### Christopher Gandrud ] .institute[ ### HU DYNAMICS ] .date[ ### 2022-06-10 (updated: 2023-05-06) ] --- # ๐ Lesson Preview - Why is file management important? - What is tidy data? - Manipulating tidy data (1) --- # ๐ Remember: Practical tips for reproducible research - โ๏ธ Document Everything! - ๐ Everything is a (text) file. - ๐ค All files should be human readable. - โ Explicitly tie your files together. - ๐ Have a plan to organise, store, and make your files available. [And iterate on this plan -- refactor regularly] --- # Importance of understanding files/file structures This topic may seem kind of . . . dry. Why not just click and drag files with the GUI (Graphical User Interface)? --- # Importance of understanding files/file structures 1. **Crucial for programmatic access and manipulation**: automated data collection and management _is_ automated file manipulation. 2. **Reproducibility**: other researchers only have your files. If they are **well organised** and the **links** between the files are **explicitly stated** then they can better understand what you did. + Clearest way of explicitly stating links is **dynamically** using file paths in your source code. 3. **You**: well organised files will be easier for you to find/understand/use in the future. --- class: inverse, center, middle # ๐ (text) Files --- # ๐ Why text files? (Almost) all files are ultimately text files. Note very large data is usually stored in _binary_ file formats - E.g. a website is typically just a series of connected `.html`, `.js`, and `.css` files. - **These are text files!** Despite different file extensions. + To see this explore a webpage with [Chrome Developer Tools](https://developers.google.com/web/tools/chrome-devtools/) --- # ๐ Why text files? Text files are **versitile**. - Store your data (`.csv`, `json`), store your analysis code (`.R`), store your presentation markup (`.Rmd`). - They are **simple** and are **not dependent on particular software**. - Any text editor can open them. - Helps **future-proof** research. - Easy to **version control**. --- # ๐ CSV Example CSV (Comma Separated Values) - All columns are separated by commas--`,` - All rows are separated by new lines--`\n` --- # CSV Example In CSV this: ``` iso2c, country, score US, United States, 1.086 US, United States, 1.094 US, United States, 1.050 ``` Makes: <img src="img/CSVBasic.png" width="299" /> --- class: inverse, center, middle # ๐ Organise your files! --- # ๐ณ File paths & trees Files are organised **hierarchically** into (upside down) trees. ``` Root โโโ Parent โโโ Child-1 โโโ Child-2 ``` --- # ๐ณ Root **Root** directories are the **first level of a disk**. ``` Root โโโ Parent โโโ Child-1 โโโ Child-2 ``` They are the root out of which the file tree grows. **Naming Conventions**: **Linux/Mac**: `/` - e.g. `/Parent` means that the `Parent` directory is a child of the root directory. **Windows**: the disk is partitioned, e.g. the `C` partition is denoted `C:\`. - `C:\Parent` indicates that the `Parent` directory is a child of the `C` partition. --- # ๐ธ Sub (child) directories Sub (child) directories are denoted with a forward slash `/` in Linux/Mac and a back slash `\` in Windows, e.g.: ```sh # Linux/Mac /Parent/Child # Windows C:\Parent\Child ``` --- # R tip: - In R for Windows you either use two backslashes `\\` (`\` is the R [escape character](http://en.wikipedia.org/wiki/Escape_character)) - Or `/` in **relative paths** in R for Windows, it will know what you mean. --- # ๐ Working directories A **working directory** is the directory where the program looks for files/other directories. โ ๏ธ **Always remember the working directory.** Otherwise you may open/save files that you do not want to open/save. --- # ๐ Working directories **In R:** ```r # Find working directory getwd() ``` ``` ## [1] "/Users/cgandrud/git_repos/intro-data-retrieval-management/static/slides" ``` ```r # List all files in the working directory list.files() ``` ``` ## [1] "code" "data" ## [3] "get-json-html_cache" "get-json-html.html" ## [5] "get-json-html.Rmd" "get-tidy-data_cache" ## [7] "get-tidy-data.html" "get-tidy-data.Rmd" ## [9] "img" "libs" ## [11] "main.bib" "pdf-and-text-as-data_cache" ## [13] "pdf-and-text-as-data_files" "pdf-and-text-as-data.html" ## [15] "pdf-and-text-as-data.Rmd" "tidy-data-structures_cache" ## [17] "tidy-data-structures.html" "tidy-data-structures.Rmd" ## [19] "why-automated-data_cache" "why-automated-data_files" ## [21] "why-automated-data.html" "why-automated-data.Rmd" ``` ```R # Set root as working directory setwd('/') ``` --- # ๐ค in the Terminal Shell In the **Terminal Shell**: ```sh # Find working directory pwd ``` ```sh # Set root as working directory cd / ``` --- # ๐ค Naming tips - Data management and analyses files are often kept in a directory called `src` (source). - Images are often in an `img` directory. - If you need to indicate distinct versions, use **version numbers**. Never use "Final" โ `FinalFinal-ReallyFinal-Final5` - Always, everywhere use the **same date format**: YYYY-MM-DD (e.g. 2020-05-22) + [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) + Easy to sort, parse, and use for joins --- # Relative vs. Absolute file paths Use **relative file paths** when possible. - **Absolute file path**: the entire path on a particular system, + E.g. `/Parent/Project1/data.csv` - **Relative file path**: the path relative to the working directory. + E.g. if `/Parent` is the working directory then the relative path for `data.csv` is `Project1/data.csv`. Why? - Crucial for **automation**. Your scripts will run easily on **other computers**. - **Enhances reproducibility**. Easier for your collaborators. Easier for you when you use another computer. --- # Back up with `../` You can refer to files in the working directory and further down by stating them, in reference to the working directory. E.g. ``` list.files("child") ``` Will show the files in the *child* directory, which is in the working directory. You can also go **back up the tree** from the working directory with `../`. For example, ``` list.files("../") ``` will list the files in the directory above the working directory. --- # File & directory name conventions โ ๏ธ **Don't use spaces** in your file names. They can create problems for programs that treat spaces as an indication that the path has ended. Alternatives to spaces: - `kebab-case` (e.g. `data-analysis.R`) - `CamelCase` (ex. `DataAnalysis.R`) - `file_underscore` (ex. `data_analysis.R`) --- # Load files into R--Dynamically Link There are a number of R commands to load files, depending on the file type. - Load Data: `read.table`, `read.csv` `read.dta` `xlsx::read.xlsx`, `rio::import`, `jsonlite::read_json`, `readr::read_csv` ```r read.csv('data/test-data.csv') ``` - Save Data: e.g. `write.csv()`, `write.dta()`, `jsonlite::write_json()` - Databases: [dbplyr](https://dbplyr.tidyverse.org/) - Load and run R source code: `source()` --- # R Input/Output of well structured data from URLs ๐งจ๐๐ตURLs are also **file paths for files on the internet**. You can use them the same way as local file paths. The [rio](https://github.com/leeper/rio) package can import many different file types (including from URLS) with the same function: `import()`. ```r library(rio) disprop <- import('http://bit.ly/Ss6zDO', format = 'csv') head(disprop) ``` ``` ## country iso2c year disproportionality ## 1 Andorra AD 2001 9.97 ## 2 Andorra AD 2005 6.10 ## 3 Andorra AD 2009 8.41 ## 4 Andorra AD 2011 17.20 ## 5 Antigua and Barbuda AG 1971 16.90 ## 6 Antigua and Barbuda AG 1976 18.54 ``` --- # Advanced: Git and GitHub for source code management Not covered in this course, but you should always manage your source code with a tool like Git/GitHub. See [here](https://docs.github.com/en/get-started/using-git/about-git) for more details. Let me know if you are interested and I can give you an introduction if we have time. --- class: inverse, center, middle # ๐งน Tidy Structured Data --- # Rows and columns have no meaning! Many (not all) statistical data sets are organised into **rows** and **columns**. Rows and columns have **no inherent meaning**. Wickham (2014) laid out **principles of data tidying** to standardize statistical semantics. - Links the **physical structure** of a data set to its **meaning** (semantics). --- # Two ways to represent the same data | Person | treatmentA | treatmentB | | ------------ | ----------- | ---------- | | John Smith | | 2 | | Jane Doe | 16 | 11 | | Mary Johnson | 3 | 1 | <br> | Treatment | John Smith | Jane Doe | Mary Johnson | | ---------- | ---------- | -------- | ------------ | | treatmentA | | 16 | 3 | | treatmentB | 2 | 11 | 1 | --- # Data sets are **collections of values**. All values are assigned to a **variable** and an **observation**. - **Variable**: all values measuring the same attribute across units - **Observation**: all values measured within the same unit across attributes. --- # Tidy data semantics + structure <br> 1. Each **variable** forms a column. 2. Each **observation** forms a row. 3. Each **type** of observational unit forms a table. .small[[Wickham (2014)](https://www.jstatsoft.org/article/view/v059i10)] --- # Tidy data | Person | treatment | result | | ------------ | ----------- | ---------- | | John Smith | a | | | Jane Doe | a | 16 | | Mary Johnson | a | 3 | | John Smith | b | 2 | | Jane Doe | b | 11 | | Mary Johnson | b | 1 | --- # ๐งน Messy to Tidy data First identify what your observations and variables are. Then use R tools to convert your data into this format. **tidyr** is particularly useful. --- # Messy to tidy data ```r # Create messy (wide) data messy <- data.frame( person = c("John Smith", "Jane Doe", "Mary Johnson"), a = c(NA, 16, 3), b = c(2, 11, 1) ) messy ``` ``` ## person a b ## 1 John Smith NA 2 ## 2 Jane Doe 16 11 ## 3 Mary Johnson 3 1 ``` --- # Messy to tidy data Using `pivot_longer()` ```r tidied <- tidyr::pivot_longer(data = messy, cols = c("a", "b"), # Columns to pivot names_to = "treatment", values_to = "result") tidied ``` ``` ## # A tibble: 6 ร 3 ## person treatment result ## <chr> <chr> <dbl> ## 1 John Smith a NA ## 2 John Smith b 2 ## 3 Jane Doe a 16 ## 4 Jane Doe b 11 ## 5 Mary Johnson a 3 ## 6 Mary Johnson b 1 ``` --- # Tidy to messy data Sometimes it is useful to reverse this operation with `pivot_wider`. ```r messy_again <- tidyr::pivot_wider(data = tidied, names_from = "treatment", values_from = result) messy_again ``` ``` ## # A tibble: 3 ร 3 ## person a b ## <chr> <dbl> <dbl> ## 1 John Smith NA 2 ## 2 Jane Doe 16 11 ## 3 Mary Johnson 3 1 ``` --- # ๐งน Data cleaning tips Always **look at** and **poke your data**. For example, see if: - Missing values are designated with `NA` - Variable types are what you expect them to be. - Distributions are what you expect them to be. --- class: inverse, center, middle background-color: #FF8C03 # ๐ฅ Practice Using R or the terminal, create a new research project file structure with a file tree for storing your data management and analysis scripts, the data, and a paper writing up the results. Optional Advanced: Store this project on GitHub --- class: inverse, center, middle background-color: #FF8C03 # ๐ฅ With a partner Find a CSV formatted data set online and programmatically load it into R. If it is in long format, pivot wider. If it is in wide format, pivot longer.