Project: Data Journey Narrative


Characterize the journey that your data took to get to you. (BatesDataJourneysCapturing2016?; Madrigal and Meyer 2021) Some relevant questions to answer might include

  • Who generated these data? Why? How?
  • What measurement instruments were used to generate these data? What infrastructure has been used to store and transmit the data?
  • What was the original intended use of these data? What are some other ways the data have been used?
  • How has the mutability/immutability of the data been important to its journey & use? How has its mutability/immutability been created and maintained?
  • How will you need to transform the data further for your project? How do the physical and material properties of the data facilitate this, and/or create friction?
  • What values have been crystallized in the data? How have power relations and values shaped what is/isn’t included in the data?
  • Based on these factors, do the data appear to be fit for purpose? That is, based on what you know so far, do the data appear to be appropriate for answering your research question?

Include references for the sources you use to answer these questions. You can work on this step of the project at the same time as the EDA step, but I will ask you to turn in the data journey narrative first. A good target length for your narrative is 500-1,000 words long, not counting any references.

Exploratory Data Analysis


This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation. (John Wilder Tukey 1977)

Exploratory Data Analysis (EDA) is “an attitude, AND a flexibility, AND some graph paper (or transparencies, or both)” (John W. Tukey 1980)

Exploratory and Confirmatory Research

Especially in the wake of the replication crisis, one common distinction is between exploratory and confirmatory research (Wagenmakers et al. 2012)

Wagenmakers et al. (2012) figure 1. The caption reads, in part: “A continuum of experimental exploration and the corresponding continuum of statistical wonkiness. On the far left of the continuum, researchers find their hypothesis in the data by post-hoc theorizing, and the corresponding statistics are ‘wonky,’ dramatically overestimating the evidence for the hypothesis.”

Exploratory vs. confirmatory research
Confirmatory Exploratory
hypothesis testing hypothesis development
specified in advance adaptable
algorithmic free, creative
mechanical objectivity
(Daston and Galison 2007)
pure subjectivity?
avoids inferential errors makes errors?
rigorous lacking rigor?
real science?? h*cking around with data??
assumes experimental methods relevant to all methods

I agree that it’s important to

  • be thoughtful about how much confidence we’re placing in our conclusions
  • interpret findings from one study in light of other studies


But the confirmatory/exploratory distinction can overhype the confirmatory side

  • Making us too rigid and narrow-minded about what counts as good science

Better Models for EDA I: Developing Phenomena

Bogen and Woodward (1988) by way of Brown (2002)

The data/phenomena/theory distinction
Data Phenomena Theories/ Causal processes
Ex: Spreadsheet of numbers, downloaded from Qualtrics Ex: Correlation between partisanship and sharing Covid misinformation Ex: Conservative susceptibility to anxiety hypothesis

collected abstracted or extracted from data postulated
not explained explained by theories explain phenomena
highly local to time, place, sample, procedure varying scope universal?
“raw,” messy, unwieldy “processed,” clean, stylized “laws of nature”?

EDA as phenomena development

  • cleaning messy data
    • identifying and mitigating (where possible) errors and idiosyncracies
  • identifying local patterns (“local phenomena”)


  • not yet claiming these will be stable and appear elsewhere
  • not yet worrying (much) about explanations

Better Models for EDA II: Epicycle of Analysis

Epicycle of analysis model, Peng and Matsui (2016), 5

Peng and Matsui (2016)

  • Data analysis is organized into 5 activities
  • Each activity involves the same 3-step “epicycle” process
    1. Develop expectations
    2. Collect information
    3. Compare and revise expectations
  • Not “the scientific method”! (Peng and Matsui 2016, 4)
    • “Highly iterative and non-linear”
    • “information is learned at each step, which then informs
      • whether (and how) to refine, and redo, the [previous] step …, or
      • whether (and how) to proceed to the next step.”


Aligning the goals of EDA with steps in the “epicycle of analysis” (Peng and Matsui 2016)
Goals of EDA Epicycle step
Determine if there are problems with the data 2. Collecting information
Determine whether our question can be answered with these data 3. Comparing and revising expectations
Develop sketch of an answer 1. Developing expectations

Discussion

  • For each of these models, how well do they fit the way you’ve been taught to do science?
  • How do they challenge the way you’ve been taught to do science?

References