Project: Data Journey Narrative

Characterize the journey that your data took to get to you. (BatesDataJourneysCapturing2016?; Madrigal and Meyer 2021) Some relevant questions to answer might include

Who generated these data? Why? How?
What measurement instruments were used to generate these data? What infrastructure has been used to store and transmit the data?
What was the original intended use of these data? What are some other ways the data have been used?
How has the mutability/immutability of the data been important to its journey & use? How has its mutability/immutability been created and maintained?
How will you need to transform the data further for your project? How do the physical and material properties of the data facilitate this, and/or create friction?
What values have been crystallized in the data? How have power relations and values shaped what is/isn’t included in the data?
Based on these factors, do the data appear to be fit for purpose? That is, based on what you know so far, do the data appear to be appropriate for answering your research question?

Include references for the sources you use to answer these questions. You can work on this step of the project at the same time as the EDA step, but I will ask you to turn in the data journey narrative first. A good target length for your narrative is 500-1,000 words long, not counting any references.

Exploratory Data Analysis

This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation. (John Wilder Tukey 1977)

Exploratory Data Analysis (EDA) is “an attitude, AND a flexibility, AND some graph paper (or transparencies, or both)” (John W. Tukey 1980)

Exploratory and Confirmatory Research

Especially in the wake of the replication crisis, one common distinction is between exploratory and confirmatory research (Wagenmakers et al. 2012)

Wagenmakers et al. (2012) figure 1. The caption reads, in part: “A continuum of experimental exploration and the corresponding continuum of statistical wonkiness. On the far left of the continuum, researchers find their hypothesis in the data by post-hoc theorizing, and the corresponding statistics are ‘wonky,’ dramatically overestimating the evidence for the hypothesis.”

Exploratory vs. confirmatory research
Confirmatory	Exploratory
hypothesis testing	hypothesis development
specified in advance	adaptable
algorithmic	free, creative
mechanical objectivity (Daston and Galison 2007)	pure subjectivity?
avoids inferential errors	makes errors?
rigorous	lacking rigor?
real science??	*hcking around with data??**
assumes experimental methods	relevant to all methods

I agree that it’s important to

be thoughtful about how much confidence we’re placing in our conclusions
interpret findings from one study in light of other studies

But the confirmatory/exploratory distinction can overhype the confirmatory side

Making us too rigid and narrow-minded about what counts as good science

Better Models for EDA I: Developing Phenomena

Bogen and Woodward (1988) by way of Brown (2002)

The data/phenomena/theory distinction
Data	Phenomena	Theories/ Causal processes
Ex: Spreadsheet of numbers, downloaded from Qualtrics	Ex: Correlation between partisanship and sharing Covid misinformation	Ex: Conservative susceptibility to anxiety hypothesis

collected	abstracted or extracted from data	postulated
not explained	explained by theories	explain phenomena
highly local to time, place, sample, procedure	varying scope	universal?
“raw,” messy, unwieldy	“processed,” clean, stylized	“laws of nature”?

EDA as phenomena development

cleaning messy data
- identifying and mitigating (where possible) errors and idiosyncracies
identifying local patterns (“local phenomena”)

not yet claiming these will be stable and appear elsewhere
not yet worrying (much) about explanations

Better Models for EDA II: Epicycle of Analysis

Peng and Matsui (2016)

Data analysis is organized into 5 activities
Each activity involves the same 3-step “epicycle” process
1. Develop expectations
2. Collect information
3. Compare and revise expectations
Not “the scientific method”! (Peng and Matsui 2016, 4)
- “Highly iterative and non-linear”
- “information is learned at each step, which then informs
  - whether (and how) to refine, and redo, the [previous] step …, or
  - whether (and how) to proceed to the next step.”

Aligning the goals of EDA with steps in the “epicycle of analysis” (Peng and Matsui 2016)
Goals of EDA	Epicycle step
Determine if there are problems with the data	2. Collecting information
Determine whether our question can be answered with these data	3. Comparing and revising expectations
Develop sketch of an answer	1. Developing expectations

Discussion

For each of these models, how well do they fit the way you’ve been taught to do science?
How do they challenge the way you’ve been taught to do science?