Resources

{FakeDataR}

{FakeDataR} is an R package for creating synthetic copies of real datasets locally. The copies preserve the original structure, schema, and types while protecting privacy, which reduces the risk of exposing sensitive data. The package is designed to support Large Language Model (LLM) workflows and reproducible sharing. It includes heuristics for identifying sensitive fields, which can then be faked or dropped, and it supports exporting the synthetic data along with a JSON schema and a README prompt as an LLM bundle. It suits quick, privacy-preserving synthetic data generation without cloud processing.

A year with Visible Long-Covid Tracking

Dr. Mowinckel shares insights from a year of tracking Long Covid symptoms with the Visible app. The app monitors heart rate, HRV, daily symptoms, and functional capacity through the FUNCAP27 questionnaire. The post details how she collected and analyzed personal health data to understand recovery patterns, pacing strategies, and warning signs. It also looks at the tools within Visible that help visualize progress, such as heart rate graphs and a functional capacity semi-circle, making it a useful resource for people managing Long Covid.

Applied Data Skills

The 'Applied Data Skills' book by Emily Nordmann and Lisa DeBruine is designed to teach the fundamentals of data processing and presentation using R. It guides learners through data import, cleaning, summarization, visualization, and report generation, aiming to provide skills for professional reporting and presenting. The book is part of a 10-week course with each chapter introducing new concepts and practical exercises. It emphasizes learning through practice, error resolution, and the efficient use of help resources rather than memorization. The goal is to enable learners to create automated, updateable reports and visualizations with R.

Cleaning Biodiversity Data in R

This book is a specialized resource for ecology and biodiversity data professionals, detailing processes for cleaning geo-referenced biodiversity data in R. It goes beyond general cleaning techniques to address challenges unique to biodiversity datasets. It is freely available under a CC BY-NC-ND license. The authors acknowledge the lands and environmental knowledge of Indigenous Australian peoples, showing sensitivity to cultural heritage in data practices.

Data Cleaning Flipbook

A flipbook with examples of data cleaning using R and the tidyverse.

Data cleaning for data sharing | Crystal Lewis

A tutorial by Crystal Lewis on data cleaning for data sharing, published February 14, 2023.

Don't use Quarto documents to clean or analyze data

This post argues against using Quarto, R Markdown, or Jupyter documents for data cleaning and analysis, on the grounds that these formats are meant for communication rather than exploratory work. Diego Catalan Molina advises that data should already be clean before it is loaded into a document, and that the document should serve as a vehicle for telling a story. He suggests building an engaging outline around why the findings matter and using these documents only to share results, not every plot or table produced during the exploratory phase of the analysis.

Easily clean up messy databases with fuzzy matching in R

This article introduces data journalists to fuzzy matching techniques in R for cleaning up databases with inconsistently entered text. It outlines the core problem: the same information is often recorded in several different ways, and a computer does not naturally treat those variants as identical. The tutorial explains 'fuzzy' matching, which identifies similarities in letter patterns so that near-identical text can be grouped together more accurately. It loads R packages such as tidyverse and stringdist to demonstrate the process. Practical examples from the 2025 IRE conference schedule data show how to extract and clean session names that may contain entry mistakes, using fuzzy matching to consolidate the categories.

Extract Data from Professional Volleyball Leagues in North America with {rvolleydata}

The R package {rvolleydata} is designed for those interested in analyzing professional volleyball data, providing a simple interface to collect structured data from North American leagues such as League One Volleyball Pro (LOVB), Athletes Unlimited Pro Volleyball (AUPVB), and Major League Volleyball (MLV). The package can be installed from CRAN for stable use or from GitHub for the development version. Comprehensive usage guidelines are available in the package vignette, which helps users employ {rvolleydata} effectively to obtain clean and tidy volleyball league data for their analyses.

How (and Why) I came to Use R for Data Analysis and Evaluation

Alberto Espinoza recounts his journey with R for data analysis and evaluation, marking 10 years since he first encountered R during his graduate assistantship. With no prior knowledge of R, he was tasked with assisting and then leading statistics labs that used it. Despite early challenges and a steep learning curve, he came to see R as more powerful than software like SPSS or Excel. He went on to use R for graduate projects, market research, data preparation for Tableau, and SurveyMonkey analysis. Espinoza outlines R's advantages (reproducibility, efficiency, clarity, and an extensive package ecosystem) and underlines its significance in his professional growth.

How to Turn Messy PDFs into Clean Data Frames with R and {ellmer}

Albert Rapp demonstrates how to use the {ellmer} package to leverage AI models for extracting data from messy PDF files. If you’ve ever struggled with getting clean data out of PDFs, you know how challenging this task can be. This tutorial shows how AI can streamline this traditionally painful process, making it much easier to transform unstructured PDF content into usable data frames in R.

Modern Data Visualization with R

Modern Data Visualization with R is a comprehensive guide by Robert Kabacoff on data visualization techniques using the R programming language. This book, available in both online and print versions, emphasizes the use of ggplot2 for creating a variety of charts and plots. Covering topics from importing and cleaning data to customizing and saving graphs, the book includes worked examples and best practices to help readers create publication-ready graphics. The content also introduces interactive graphing tools and offers advice on graph aesthetics such as color choice and signal-to-noise ratio.
