Syllabus

May 6 2023

[Core Methods: Data Science] Programming Methods for Data Retrieval & Management

Instructor: Christopher Gandrud, PhD

Email: christopher.gandrud@gmail.com

Location: TBD

📜 Course Description

Part I: [Optional] Introduction to R Foundations

The optional first half-day of the course is for students who are new to the R programming language or want to refresh the basics. We’ll learn about the fundamental object types and data structures. We’ll then learn how to write functions and program control flow for more complex operations, both in sequence and in parallel.

These skills will be directly useful for the follow-up workshop on programming methods for automated data retrieval and management.

Part II: Data Retrieval & Management

The rapid growth of the World Wide Web over the past two decades has tremendously changed the way in which we share, collect and publish data. The web is full of data that are of great interest to scientists and businesses alike. Firms, public institutions and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behaviour. But how can we efficiently collect data from the internet; retrieve information from social networks, search engines and dynamic web pages; tap web services; and finally process and manage the large volume of collected data with statistical software? What was once a fundamental problem for the social sciences, the scarcity and inaccessibility of observations, is quickly turning into an abundance of data. The internet offers non-reactive measurements of the behaviour and preferences of political and other actors (for example, citizens, representatives, courts, and media).

The aim of the course is to provide the technical basis for web data collection methods and subsequent data management. Furthermore, we will study state-of-the-art applications from the social sciences that exploit the potential of web-based data to tackle both classical and new questions of social science. This course will provide an introduction to the basics of web data collection practice with R and, optionally, some tasks in Python. The sessions are hands-on; participants will practice every step of the process with R using various examples. Doctoral candidates will learn how to scrape content from static and dynamic web pages, connect to APIs from popular web services, read out and process user data, and set up scraper programs that run automatically. For the practical part, course participants are expected to independently design and carry out data collection for their own empirical applications.

🗝 Enrollment

Prerequisite(s): Some experience with statistical programming languages such as Stata, R, or Python.

Format

The course will take place over two days, Friday and Saturday, 12-13 May 2023.

We’ll break the course into 1.5-hour blocks. Each block ends with a 15-minute break that doubles as time to ask 1:1 questions.

All of the blocks include interactive time where you can use the material presented in the block in a semi-structured exercise with a partner.

Each day, we’ll take 1 hour for lunch. At the end of the first day we’ll catch up with a retro using the 4 L’s format (Liked, Learned, Lacked, Longed For). Hopefully this will help us identify areas for improvement for the second day.

Session Plan

May 12 (9.00-17.00)

[Optional] Introduction to R Foundations 1
[Optional] Introduction to R Foundations 2

[Lunch]

Introduction to reproducibility, data management, and automation in research
File Organisation and Tidy Data Structures
GET Tidy Data (including some SQL)

May 13 (9.00-17.00)

GET and Parse HTML Data
Parse PDFs and Text as Data

[Lunch]

Pair Project

Collaboration tools

Retro Board: Jamboard link

Code: All of the materials are generated using code in GitHub. If you find a typo or other issue in the materials, please open an Issue or, better still, submit a fix with a Pull Request.

📖 Recommended Readings

🙀 Live coding examples

Most of the rough and tumble live coding examples are available here.