This section covers descriptive statistics, which summarize your data and describe its central tendency, variability, and shape. We'll calculate common descriptive statistics in R and interpret what they mean.
We'll also use dplyr::summarize() to efficiently calculate descriptive statistics.

Measures of central tendency describe the “center” of a dataset.
Mean: The sum of the values divided by the number of values.
scores <- c(75, 80, 92, 68, 85)
mean(scores)
[1] 80
Median: The middle value when the data is sorted.
median(scores)
[1] 80
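To make the two definitions concrete, here is a minimal sketch that computes both values by hand for the same `scores` vector. The names `manual_mean`, `sorted_scores`, and `manual_median` are illustrative, and the median shortcut assumes an odd number of values.

```r
scores <- c(75, 80, 92, 68, 85)

# Mean: total divided by the number of values
manual_mean <- sum(scores) / length(scores)

# Median: middle element of the sorted data (valid here because length is odd)
sorted_scores <- sort(scores)
manual_median <- sorted_scores[(length(sorted_scores) + 1) / 2]

manual_mean    # 80
manual_median  # 80
```

In practice `mean()` and `median()` are preferable, since `median()` also handles even-length vectors by averaging the two middle values.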
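Mode: The most frequently occurring value. Base R has no built-in function for the statistical mode, so we define one:
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest count
}
values <- c(1, 2, 2, 3, 4, 4, 4, 5)
getmode(values)
[1] 4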
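Because `which.max()` returns only the first maximum, `getmode()` reports a single value even when several values tie for the highest count. The sketch below shows one possible way to return all tied values; `get_modes` is a hypothetical helper written for this illustration, not a base R function.

```r
# Hypothetical helper: return every value tied for the highest count
get_modes <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

get_modes(c(1, 2, 2, 3, 4, 4, 4, 5))  # 4
get_modes(c(1, 1, 2, 2, 3))           # 1 2
```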
When to Use Which Measure:
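As a rough guide, the mean suits roughly symmetric data without extreme values, the median is more robust when the data is skewed or contains outliers, and the mode is the usual choice for categorical data. The sketch below illustrates the first two points. The extra score of 10 is a hypothetical value added for illustration and is not part of the original example.

```r
scores <- c(75, 80, 92, 68, 85)
scores_with_outlier <- c(scores, 10)  # hypothetical extra low score

mean(scores)                 # 80
median(scores)               # 80

# One extreme value pulls the mean down much more than the median
mean(scores_with_outlier)    # 68.33333
median(scores_with_outlier)  # 77.5
```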
Measures of dispersion describe the spread or variability of a dataset.
Standard Deviation: A measure of how spread out the data is around the mean.
sd(scores)
[1] 9.192388
Variance: The square of the standard deviation.
var(scores)
[1] 84.5
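Both `var()` and `sd()` compute the sample versions of these statistics, dividing by n - 1 rather than n. A minimal sketch that reproduces the two results above by hand (the names `squared_dev`, `manual_var`, and `manual_sd` are illustrative):

```r
scores <- c(75, 80, 92, 68, 85)
n <- length(scores)

# Squared deviations from the mean
squared_dev <- (scores - mean(scores))^2

# Sample variance divides by n - 1; the standard deviation is its square root
manual_var <- sum(squared_dev) / (n - 1)
manual_sd <- sqrt(manual_var)

manual_var  # 84.5
manual_sd   # 9.192388
```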
Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1).
IQR(scores)
[1] 10
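To see where this number comes from, the quartiles themselves can be inspected with `quantile()`. This is a small sketch using the same `scores` vector and R's default quantile method; the name `q` is illustrative.

```r
scores <- c(75, 80, 92, 68, 85)

# 25th and 75th percentiles (Q1 and Q3)
q <- quantile(scores, probs = c(0.25, 0.75))
q

# IQR is Q3 minus Q1; unname() drops the "75%" label from the result
manual_iqr <- unname(q[2] - q[1])
manual_iqr  # 10
```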
Range: The difference between the maximum and minimum values.
range(scores)
[1] 68 92
diff(range(scores))  # range() returns the min and max; diff() gives max minus min
[1] 24
dplyr::summarize()
The dplyr package provides the summarize() function, which makes it easy to calculate multiple descriptive statistics at once:
library(dplyr)
# Replace this URL with the path to your local file.
exam_scores <- read.csv("https://raw.githubusercontent.com/sijuswamyresearch/R-for-Data-Analytics/refs/heads/main/data/exam_scores.csv")

exam_scores %>%
  summarize(
    mean_score = mean(score),
    median_score = median(score),
    sd_score = sd(score),
    iqr_score = IQR(score),
    min_score = min(score),
    max_score = max(score)
  )
  mean_score median_score sd_score iqr_score min_score max_score
1   74.33333         78.5 19.38598        21        10       100
You can calculate descriptive statistics for different groups within your data using dplyr::group_by() in combination with summarize():
exam_scores %>%
  group_by(grade) %>%
  summarize(
    mean_score = mean(score),
    median_score = median(score),
    sd_score = sd(score)
  )
# A tibble: 7 × 4
  grade mean_score median_score sd_score
  <chr>      <dbl>        <dbl>    <dbl>
1 A           94.2         95       2.59
2 B           74.9         83.5    26.3
3 C           75.9         74      10.2
4 F           53.8         55       3.90
5 b           87           87      NA
6 c           76           76      NA
7 f           45           45      NA
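In the output above, the lowercase grades b, c, and f form their own groups, and their sd_score is NA because `sd()` returns NA when a group holds a single value. If those labels are meant to match B, C, and F (an assumption about the data, not something the source states), one possible fix is to normalize the case before grouping. The sketch below uses a small hypothetical data frame, `toy_scores`, so that it runs on its own.

```r
library(dplyr)

# Hypothetical data with inconsistent grade capitalization
toy_scores <- data.frame(
  grade = c("A", "A", "B", "b", "F", "f"),
  score = c(95, 93, 84, 87, 55, 45)
)

grade_summary <- toy_scores %>%
  mutate(grade = toupper(grade)) %>%  # treat "b" and "B" as the same grade
  group_by(grade) %>%
  summarize(
    n = n(),                          # group size, useful for spotting tiny groups
    mean_score = mean(score),
    sd_score = sd(score)
  )

grade_summary
```

Including the group size `n` alongside the other statistics makes single-observation groups easy to spot before interpreting their standard deviation.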
Descriptive statistics provide insights into the distribution of your data:
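For instance, comparing the mean with the median gives a rough hint about skewness: a mean noticeably below the median often indicates a few unusually low values, as in the exam summary above (mean 74.3, median 78.5, minimum 10). This is a heuristic rather than a rule. The sketch below uses a hypothetical vector `x` and base R's `summary()` to view the main statistics together.

```r
# Hypothetical scores with one very low value
x <- c(10, 62, 70, 78, 81, 85, 90, 100)

# Minimum, quartiles, median, mean, and maximum in one call
summary(x)

mean(x)              # 72
median(x)            # 79.5
mean(x) < median(x)  # TRUE: the low value pulls the mean below the median
```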
exam_scores.csv dataset (or your own dataset).