3  Data Cleaning and Transformation

3.1 Introduction

In this lesson, we’ll tackle the often-messy reality of real-world data: dirty data. We’ll learn how to identify common data quality issues like missing values, outliers, and inconsistencies, and then apply techniques in R to clean and transform the data into a usable format. We’ll finish by committing the cleaned project back to version control.

3.2 Learning Objectives

  • Identify common data quality issues.
  • Detect missing values and apply appropriate handling methods (removal, imputation).
  • Detect outliers and apply appropriate handling methods (trimming, capping, transformation).
  • Correct inconsistent data entries (e.g., typos, inconsistent formatting).
  • Commit and push the cleaned project to version control.

3.3 What is Dirty Data?

Dirty data refers to data that is inaccurate, incomplete, inconsistent, or otherwise unreliable. Common sources of dirty data include:

  • Human error during data entry
  • Data integration issues from multiple sources
  • Software bugs
  • Inconsistent data standards

3.4 The exam_scores Dataset: Let’s Get Specific

We’ll use our exam_scores dataset (or a modified version with intentional errors) to illustrate these cleaning techniques. We’ll assume the dataset contains columns like:

  • student_id: Unique identifier for each student.
  • study_hours: Number of hours spent studying.
  • score: Exam score (out of 100).
  • grade: Letter grade (A, B, C, D, F).
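
Before hunting for problems, it helps to confirm the dataset has the structure we expect. A minimal sketch using str() and summary(), run here on a small hypothetical in-memory slice of exam_scores so the example is self-contained (the real file is loaded from the course repository below):

```r
# A tiny hypothetical slice of exam_scores, for illustration only
exam_sample <- data.frame(
  student_id  = 1:4,
  study_hours = c(2.0, 5.0, 1.0, 3.0),
  score       = c(NA, 88, 52, 76),
  grade       = c("C", "B", "F", "c")
)

str(exam_sample)     # column names, types, and the first few values
summary(exam_sample) # ranges, plus an NA count for each numeric column that has them
```

summary() already hints at two of the issues we look for below: the NA in score and the stray lowercase grade.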

3.5 Identifying Data Quality Issues

  1. Missing Values (NA):

    • Let’s check how many missing values the dataset contains:
    library(tidyverse)
    exam_scores <- read.csv("https://raw.githubusercontent.com/sijuswamyresearch/R-for-Data-Analytics/refs/heads/main/data/exam_scores.csv")

    sum(is.na(exam_scores))
[1] 1
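
sum(is.na()) counts NAs across the whole data frame; colSums(is.na()) pinpoints which column they live in. A sketch on a hypothetical slice mirroring the full dataset:

```r
# Hypothetical slice with one missing score, mirroring the full dataset
exam_sample <- data.frame(
  student_id  = 1:3,
  study_hours = c(2.0, 5.0, 1.0),
  score       = c(NA, 88, 52),
  grade       = c("C", "B", "F")
)

colSums(is.na(exam_sample))     # NA count per column
which(is.na(exam_sample$score)) # row index of the missing score
```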

The scan found one missing value. The simplest approach is to drop the affected row:

# Remove rows containing NA values
exam_scores <- na.omit(exam_scores)
exam_scores
   student_id study_hours score grade
2           2         5.0    88     B
3           3         1.0    52     F
4           4         3.0    76     c
5           5         4.0    82     B
6           6         2.0   100     C
7           7         6.0    95     A
8           8         1.5    58     F
9           9         3.5    79     C
10         10         4.5    85     B
11         11         2.5    73     C
12         12         5.5    91     A
13         13         0.5    45     f
14         14         3.0    77     C
15         15         4.0    83     B
16         16        24.0    68     C
17         17         6.0    96     A
18         18         1.0    55     F
19         19         3.5    10     B
20         20         4.5    86     B
21         21         2.5    74     C
22         22         5.5    92     A
23         23         0.5    48     F
24         24         3.0    78     C
25         25         4.0    84     B
26         26         2.0    69     C
27         27         6.0    97     A
28         28         1.0    56     F
29         29         3.5    81     B
30         30        48.5    87     b
  2. Outliers:

    • Let’s identify potential outliers in the study_hours column using a boxplot:
    ggplot(exam_scores, aes(y = study_hours)) +
      geom_boxplot() +
      labs(title = "Boxplot of Study Hours")

    • We can then calculate the IQR and identify values outside the typical range:
    Q1 <- quantile(exam_scores$study_hours, 0.25)
    Q3 <- quantile(exam_scores$study_hours, 0.75)
    IQR <- Q3 - Q1
    
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    
    outliers <- exam_scores %>%
      filter(study_hours < lower_bound | study_hours > upper_bound)
    
    print(outliers)
      student_id study_hours score grade
    1         16        24.0    68     C
    2         30        48.5    87     b
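    • The 1.5 × IQR rule is one convention; another is the z-score rule, which flags values far from the mean in standard-deviation units. A base-R sketch on hypothetical study-hour values (a single extreme value inflates the sample SD in small samples, so a threshold of 2 is used here instead of the textbook 3):

```r
study_hours <- c(2, 5, 1, 3, 4, 2, 6, 1.5, 3.5, 48.5) # hypothetical values, one extreme
z <- (study_hours - mean(study_hours)) / sd(study_hours)
study_hours[abs(z) > 2] # values more than 2 SDs from the mean
```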
  3. Inconsistent Data:

    • Let’s check for inconsistent grade entries (e.g., lowercase “a” instead of uppercase “A”):
    unique(exam_scores$grade) #See if the dataset has what you expect
    [1] "B" "F" "c" "C" "A" "f" "b"
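
unique() lists the distinct labels; table() also counts them, which shows how widespread each inconsistency is. A sketch on a hypothetical grade vector:

```r
grade <- c("B", "F", "c", "C", "A", "f", "b", "C", "B") # hypothetical mixed-case grades
table(grade) # counts per label; lowercase and uppercase are tallied separately
```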

3.6 Handling Missing Values

  1. Removal: Remove rows where the exam score is NA (a column-targeted version of the na.omit() call above).

    exam_scores_no_na <- exam_scores %>%
      filter(!is.na(score))
  2. Imputation: Replace missing values with a plausible substitute instead of dropping rows.

    • Mean/Median Imputation: Replace NAs with the mean or median of the column. Here we use a simple ifelse to impute with the mean:
    exam_scores <- exam_scores %>%
      mutate(score = ifelse(is.na(score), mean(score, na.rm = TRUE), score))
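
The mean is pulled around by the very outliers we flagged above, so the median is often a more robust imputation value. The same ifelse pattern works; shown here on a hypothetical score vector:

```r
score <- c(NA, 88, 52, 76, 100) # hypothetical scores with one missing value
score_imputed <- ifelse(is.na(score), median(score, na.rm = TRUE), score)
score_imputed # the NA becomes the median, 82; other values are unchanged
```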

3.7 Handling Outliers

We can handle the extreme study_hours values so that they exert less influence on the analysis.

  1. Capping (Winsorizing):

    # Cap values above the 95th percentile for study hours.
    upper_threshold <- quantile(exam_scores$study_hours, 0.95)
    exam_scores <- exam_scores %>%
      mutate(study_hours = ifelse(study_hours > upper_threshold, upper_threshold, study_hours))
    print(exam_scores)
       student_id study_hours score grade
    2           2         5.0    88     B
    3           3         1.0    52     F
    4           4         3.0    76     c
    5           5         4.0    82     B
    6           6         2.0   100     C
    7           7         6.0    95     A
    8           8         1.5    58     F
    9           9         3.5    79     C
    10         10         4.5    85     B
    11         11         2.5    73     C
    12         12         5.5    91     A
    13         13         0.5    45     f
    14         14         3.0    77     C
    15         15         4.0    83     B
    16         16        16.8    68     C
    17         17         6.0    96     A
    18         18         1.0    55     F
    19         19         3.5    10     B
    20         20         4.5    86     B
    21         21         2.5    74     C
    22         22         5.5    92     A
    23         23         0.5    48     F
    24         24         3.0    78     C
    25         25         4.0    84     B
    26         26         2.0    69     C
    27         27         6.0    97     A
    28         28         1.0    56     F
    29         29         3.5    81     B
    30         30        16.8    87     b
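
The objectives also list transformation as an outlier strategy: a log transform compresses large values so they exert less leverage, without discarding them. log1p() (i.e., log(1 + x)) is convenient because it is defined at zero hours. A sketch on hypothetical values:

```r
study_hours <- c(0.5, 2, 5, 24, 48.5) # hypothetical values, two of them extreme
log1p(study_hours) # compresses the right tail while preserving order
```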

3.8 Correcting Inconsistent Data

Text Cleaning: Let’s standardize the grade labels so that equivalent entries share a single representation, converting the stray lowercase grades to uppercase:

    library(stringr)

    exam_scores <- exam_scores %>%
      mutate(grade = str_to_upper(grade))

    print(unique(exam_scores$grade)) #Check if things are good now!
[1] "B" "F" "C" "A"
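
Finally, the learning objectives call for committing the cleaned project back to version control. A minimal Git sketch, assuming the project is already a Git repository with a configured remote; the commit message is a placeholder:

```shell
git add .            # stage the cleaned script and data
git status --short   # review what is about to be committed
git commit -m "Clean exam_scores: handle NAs, outliers, inconsistent grades"
git push             # upload to the remote (e.g., GitHub)
```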