6  Data Exploration Versus Presentation

6.1 Introduction

In this section, we’ll explore the critical distinction between data exploration and data presentation. Both are essential components of the data analysis process, but they have different goals, methods, and considerations. Knowing the difference between them will help you tailor your analysis and communication to the appropriate context and audience.

6.2 Learning Objectives

  • Understand the purpose and characteristics of data exploration.
  • Understand the purpose and characteristics of data presentation.
  • Identify the key differences between data exploration and presentation.
  • Choose appropriate visualization methods for exploration vs. presentation.
  • Create compelling visuals for data presentation.
  • Understand data visualization bias.

6.3 What is Data Exploration?

Data exploration is an iterative process of discovering patterns, relationships, and anomalies in a dataset. It’s a detective-like activity, guided by curiosity and a desire to understand the underlying data.

6.3.1 Characteristics of Data Exploration:

  • Goal: Discover patterns, generate hypotheses, and understand data characteristics.
  • Audience: Primarily for the analyst/data scientist themselves (or a small team).
  • Methods:
    • Data cleaning and transformation.
    • Calculation of summary statistics.
    • Visualization (histograms, scatter plots, box plots, etc.).
    • Hypothesis generation.
  • Emphasis: Flexibility, experimentation, and uncovering insights.
  • Level of Polish: Often informal, with less emphasis on aesthetics or perfection.

6.4 What is Data Presentation?

Data presentation focuses on communicating findings and insights to a specific audience. It’s about telling a compelling story with the data in a clear, concise, and visually appealing manner.

6.4.1 Characteristics of Data Presentation:

  • Goal: Communicate specific findings, insights, and conclusions to an audience.
  • Audience: Stakeholders, decision-makers, clients, or the general public.
  • Methods:
    • Carefully selected visualizations (charts, graphs, tables).
    • Clear and concise narrative.
    • Context and background information.
    • Key takeaways and recommendations.
  • Emphasis: Clarity, accuracy, visual appeal, and persuasiveness.
  • Level of Polish: High, with attention to detail in design, labeling, and formatting.

6.5 Key Differences Between Data Exploration and Presentation

Feature Data Exploration Data Presentation
Goal Discovery, understanding Communication, persuasion
Audience Analyst, small team Stakeholders, decision-makers, wider audience
Focus Flexibility, experimentation Clarity, accuracy, visual appeal
Visualizations Quick, diverse, exploratory Polished, focused, impactful
Narrative Minimal, internal notes Clear, compelling, tailored to audience
Level of Detail High (all aspects of the data) Selective (highlights key insights)
“Aha!” moments New insights Confirmation of insights
Polished Level Quick, functional Professional, polished

6.6 Choosing Appropriate Visualization Methods

The choice of visualization method depends on whether you are in the exploration or presentation phase.

6.6.1 Exploration Visualizations:

  • Histograms: Quickly understand the distribution of a single variable.
  • Scatter Plots: Explore relationships between two numerical variables.
  • Box Plots: Compare the distribution of a variable across different groups.
  • Correlation Matrices: Identify correlations between multiple numerical variables.
  • Density Plots: See the shape of a distribution.
  • Parallel Coordinate Plots: To visualize several different dimensions.
  • 3-D visualizations: Useful for getting a bird’s-eye view of your data.

6.6.2 Presentation Visualizations:

  • Bar Charts: Compare the magnitudes of different categories (ensure proper baseline and clear labels).
  • Line Charts: Show trends over time or across ordered categories (ensure clear labels and scales).
  • Pie Charts: (Use sparingly) Show proportions or percentages of different categories (limit the number of slices and ensure clear labels).
  • Maps: Display geographical data and patterns (use appropriate projections and color schemes).
  • Scatter Plots (Annotated): Highlight specific points or trends in a relationship between two variables.

6.6.3 Common rules of thumb for presenting datasets for people

  • Always Label clearly
  • Make sure axes are consistent
  • Do not cut off axes
  • Do not hide information

6.7 Creating Compelling Visuals for Data Presentation

For data presentation, it’s crucial to create visuals that are clear, accurate, and engaging. Here are some tips:

  • Choose the Right Chart Type: Select the chart type that best conveys your message and suits your data.
  • Simplify the Visual: Remove unnecessary clutter, such as gridlines, extra labels, or overly complex designs.
  • Use Color Effectively: Use color to highlight important elements or create visual contrast. Be mindful of colorblindness.
  • Tell a Story: Use the visualization to tell a clear and compelling story. Add a title, caption, and annotations to guide the viewer.

6.7.0.1 Example: Improved Bar Chart for Presentation

Let’s say you want to present the average exam score for each grade. Here’s how you might create a more polished version for presentation:

Initial Exploration Chart:

library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.3
#Basic graph, for exploration
 exam_scores <- read.csv("https://raw.githubusercontent.com/sijuswamyresearch/R-for-Data-Analytics/refs/heads/main/data/exam_scores.csv")
ggplot(exam_scores, aes(x = grade, y = score)) +
  geom_bar(stat = "summary", fun = "mean")

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Assuming the grade is the main column with the average score.
ggplot(exam_scores, aes(x = grade, y = score)) +
  geom_bar(stat = "summary", fun = "mean", fill = "steelblue") +
  labs(title = "Average Exam Score by Grade",
       x = "Grade",
       y = "Average Score") +
  theme_minimal() +
  ylim(0, 100) # Force score values between 0 and 100.

6.8 Dangers of Data Visualization

Data visualization can be very powerful for communicating the information with your stakeholders. But it can also give the audience an incorrect understanding of your data

6.8.1 common pitfalls

  • Incorrect data scale

  • Hidden and misrepresented data

  • Skewed charts

  1. Data Scale

When communicating results make sure that the Y scale correctly represents the values for an accurate picture. Here is an example of an incorrect visualization of the data.

library(tidyverse)
set.seed(123)
data <- data.frame(Category = c("A", "B"),
                   Value = c(60, 55))

ggplot(data, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity") +
  ylim(0, 60) +
  labs(title = "Misleading: Y-axis starts at 50",
       x = "Category", y = "Value")

  1. Hidden or misinterpreted data

One common visualziation mistake is to have incorrect scales and to remove data. If you had one set of data, and only communicated a small part of the data set the audience can get a very wrong idea about the quality and the overall trend.

This also applies to things such as selecting the data to use, or cleaning data that may be relevant.

  1. Skewing Charts

Make sure you’re choosing the best methods to display the data. Charts like a pie chart can often be more confusing than helpful.

6.9 Conclusion

Data exploration and data presentation are two distinct but complementary phases of the data analysis process. Data exploration is about discovering patterns and generating hypotheses. Data presentation is about communicating findings and insights to a specific audience. By understanding the key differences between these phases, you can tailor your analysis and communication to the appropriate context and audience. Choose the right visualization and format for the given situation.

Practice

  • Take a dataset you’ve used previously.

  • Create exploratory visualizations to understand the data.

  • Select a key insight you want to communicate.

  • Design a polished visualization to present that insight to a specific audience (e.g., a non-technical stakeholder).