Project and data management

Reading

Some data management disasters

  • The Economist (2011); “Video: Keith Baggerly, ‘When Is Reproducibility an Ethical Issue? Genomics, Personalized Medicine, and Human Error’” (n.d.)
  • Herndon, Ash, and Pollin (2014), but you can just read Bailey and Borwein (n.d.) and/or Cassidy (n.d.), and/or watch “Reinhart & Rogoff - Growth in a Time of Debt - EXERCISE!” (2019)
  • Laskowski (n.d.); Viglione (2020); Pennisi (2020)
  • “How Excel May Have Caused Loss of 16,000 Covid Tests in England” (2020)

Dumpster organization

  • Dump all of your files into one place
  • Use search tools to find what you want
  • Just assume that things aren’t getting corrupted
  • The way many Gen Z students think about their files? (Chin 2021)

The first rule of data management

Do not edit your data
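In practice this means treating raw files as read-only: do all cleaning in memory and write results under a new name, following the naming convention below. A minimal sketch in R (file and column names are hypothetical):

```r
## Imagine data/00_survey_raw.csv, gathered outside the pipeline:
raw <- data.frame(id = 1:4, age = c(34, NA, 51, 28))

## Do all cleaning in memory...
cleaned <- subset(raw, !is.na(age))

## ...then save the result under the *next* number in the pipeline,
## never back over the raw file:
## write.csv(cleaned, "data/01_survey_clean.csv", row.names = FALSE)
nrow(cleaned)  # 3 rows survive the filter; the raw data are untouched
```

If a cleaning step later turns out to be wrong, the raw file is still there to re-run from.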

Project organization

Noble’s (2009) sample folder structure was designed for experimental biologists, but its goals generalize:
  • Keep your project self-contained
  • Locate files quickly
  • Play nicely with version control
  • Self-document key relationships between project files
  • Work with your data without editing it

A model for computational social science projects

.
├── R
├── data
├── paper
│   ├── _quarto.yml
│   └── paper.qmd
├── plots
├── readme.md
├── scripts
└── talk
  • https://github.com/dhicks/project_template

  • Configured as a GitHub “template,” making it easy to create new repositories for new projects

  • Designated folders for data, plots/outputs, and utility functions

File naming convention

Files in scripts, data, and plots should generally use a sequential naming convention:

  • Scripts in scripts should have sequential two-digit prefixes, starting with 01_:
    • 01_scrape.R
    • 02_parse.R
    • 03_eda.R, and so on
  • Data and plot files (data and plots) should use a parallel naming convention:
    • 00_ indicates raw data (produced or gathered outside of the pipeline in scripts)
    • 01_ indicates plots and intermediate data files produced by script number 01, and so on
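A payoff of the two-digit prefixes: alphabetical order is execution order, so the pipeline can be driven mechanically. A sketch (the commented-out loop assumes the template’s scripts folder):

```r
## Two-digit prefixes make lexicographic sort match execution order:
scripts <- c("03_eda.R", "01_scrape.R", "02_parse.R")
sort(scripts)
## "01_scrape.R" "02_parse.R" "03_eda.R"

## So the whole pipeline can be re-run in order with something like:
## for (f in sort(list.files("scripts", pattern = "^[0-9]{2}_.*\\.R$",
##                           full.names = TRUE))) {
##     source(f)
## }
```

Note that without the leading zero, 10_topic_modeling.R would sort before 2_parse.R.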

The model in action: A mid-sized text-mining project

Published paper: https://doi.org/10.1162/qss_a_00150
GitHub repo: https://github.com/dhicks/orus
23 directories, 274 files (plus 160k data files)

.
├── ORU\ faculty
│   └── ORU\ Publications.fld
├── QSS\ forms
├── R
├── data
│   ├── author_histories
│   ├── authors_meta
│   ├── docs
│   ├── ldatuning_results
│   ├── parsed_blocks
│   ├── pubs
│   └── temp
├── paper
│   ├── img
│   └── scraps
├── plots
├── presentations
└── scripts
    ├── 12_analysis_cache
    │   └── html
    ├── 12_analysis_files
    │   └── figure-html
    └── scraps

The analysis pipeline

├── scripts
│   ├── 01_parse_faculty_list.R
│   ├── 02_Scopus_search_results.R
│   ├── 03_match.R
│   ├── 03_matched.csv
│   ├── 04_author_meta.R
│   ├── 05_filtering.R
│   ├── 06_author_histories.R
│   ├── 07_complete_histories.R
│   ├── 08_text_annotation.R
│   ├── 09_build_vocab.R
│   ├── 10_topic_modeling.R
│   ├── 11_depts.R
│   ├── 11_depts.html
│   ├── 12_analysis\ copy.html
│   ├── 12_analysis-matched.html
│   ├── 12_analysis.R
│   ├── 12_analysis.html
│   ├── 12_analysis_cache
│   │   └── html
│   │       ├── __packages
│   │       ...
│   │       └── topic_viz_41d0cb157a88d4ec41810a16e769f5d5.rdx
│   ├── 12_analysis_files
│   │   └── figure-html
│   │       ├── author-dept\ distance-1.png
│   │       ...
│   │       └── topic_viz-2.png
│   ├── api_key.R
│   └── scraps
│       ├── 02_parse_pubs_list.R
│       ├── 03_coe_pubs.R
│       ├── 03_match_auids.R
│       ├── 07.R
│       ├── 12_regressions.R
│       ├── BML-CMSI\ deep\ dive.R
│       ├── Hellinger_low_memory.R
│       ├── dept_hell_net.R
│       ├── divergence\ against\ lagged\ distributions.R
│       ├── exploring\ topics.R
│       ├── fractional_authorship.R
│       ├── hellinger.R
│       ├── model_scratch.R
│       ├── multicore.R
│       ├── net_viz.R
│       ├── prcomp.R
│       ├── propensity.R
│       ├── rs_diversity.R
│       ├── spacyr.R
│       ├── topic\ counts\ rather\ than\ entropies.R
│       ├── topic_cosine_sim.R
│       ├── unit-level.R
│       ├── weighted\ regression.R
│       ├── word-topic_distance.R
│       ├── xx_construct_samples.R
│       └── xx_oru_complete_histories.R

Intermediate data files

data
├── *ORUs\ -\ DSL\ -\ Google\ Drive.webloc
├── 00_UCD_2016.csv
├── 00_UCD_2017.csv
├── 00_UCD_2018.csv
├── 00_faculty_list.html
├── 00_manual_matches.csv
├── 00_publications_list.html
├── 01_departments.csv
├── 01_departments_canonical.csv
├── 01_faculty.Rds
├── 02_pubs.Rds
├── 03_codepartmentals.Rds
├── 03_dropout.Rds
├── 03_matched.Rds
├── 03_unmatched.Rds
├── 04_author_meta.Rds
├── 04_dropouts.Rds
├── 04_genderize
├── 04_namsor.Rds
├── 05_author_meta.Rds
├── 05_dept_dummies.Rds
├── 05_dropouts.Rds
├── 05_layout.Rds
├── 05_matched.Rds
├── 06_author_histories.Rds
├── 07_coauth_count.Rds
├── 07_parsed_histories.Rds
├── 08_phrases.Rds
├── 09_H.Rds
├── 09_atm.csv
├── 09_vocab.tex
├── 10_atm.csv
├── 10_atm_pc.Rds
├── 10_aytm.csv
├── 10_aytm_comp.csv
├── 10_aytm_did.csv
├── 10_model_stats.Rds
├── 10_models.Rds
├── 11_au_dept_xwalk.Rds
├── 11_departments.csv
├── 11_departments_canonical.csv
├── 11_dept_dummies.Rds
├── 11_dept_gamma.Rds
├── 11_dept_term_matrix.Rds
├── 11_oru_gamma.Rds
├── 11_oru_term_matrix.Rds
├── 11_test_train.Rds
├── 12_layout.Rds
├── author_histories [7665 entries exceeds filelimit, not opening dir]
├── authors_meta [6020 entries exceeds filelimit, not opening dir]
├── docs [145144 entries exceeds filelimit, not opening dir]
├── ldatuning_results
│   ├── tuningResult_comp.Rds
│   ├── tuningResult_comp.docx
│   ├── tuningResult_comp.pdf
│   ├── tuningResult_did.Rds
│   └── tuningResult_did.pdf
├── ldatuning_results-20190415T164055Z-001.zip
├── parsed_blocks [430 entries exceeds filelimit, not opening dir]
├── pubs [282 entries exceeds filelimit, not opening dir]
└── temp

A reminder on paths

  • Windows and Unix-based systems write paths differently
  • Use file.path() or (even better) the here package to construct paths
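For example, using a data file name from the project above (the here package is assumed to be installed):

```r
## file.path() assembles paths without hard-coding a separator;
## R accepts forward slashes on Windows as well
p <- file.path("data", "00_UCD_2016.csv")
p
## "data/00_UCD_2016.csv"

## The here package resolves paths from the project root, so a script
## gives the same answer whether run from the root or from scripts/:
## library(here)
## df <- read.csv(here("data", "00_UCD_2016.csv"))
```

Hard-coded absolute paths like C:\Users\me\project\data break the moment the project moves to another machine.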

Exercise: Organize your EDA

.
├── R
├── data
├── paper
├── plots
├── readme.md
├── scripts
└── talk
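One way to lay down this skeleton from the R console. The sketch builds in a temporary directory so it is self-contained; in practice you would run it in your project root, or just create a repository from the GitHub template:

```r
## Create the template's top-level folders plus a readme
root <- file.path(tempdir(), "my-project")  # stand-in for your project root
dirs <- c("R", "data", "paper", "plots", "scripts", "talk")

invisible(lapply(file.path(root, dirs), dir.create,
                 recursive = TRUE, showWarnings = FALSE))
invisible(file.create(file.path(root, "readme.md")))

list.files(root)  # the seven top-level entries from the tree above
```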

Data management plans

  • Much like a research plan, a data management plan provides an overview of the steps you’ll take to gather, publish, and maintain your data
    • Since 2011, NSF has required a 2-page data management plan for most types of proposals

Common elements

  • Who is responsible for data management
  • Who else will have access to which data
  • How data will be collected
  • Data formatting standards
  • Whether and how data will be archived and made available for reuse

Data management plan examples and resources

FAIR principles for published data

  • Findable
    • F1. (meta)data are assigned a globally unique and persistent identifier
    • F2. data are described with rich metadata (defined by R1 below)
    • F3. metadata clearly and explicitly include the identifier of the data it describes
    • F4. (meta)data are registered or indexed in a searchable resource
  • Accessible
    • A1. (meta)data are retrievable by their identifier using a standardized communications protocol
      • A1.1 the protocol is open, free, and universally implementable
      • A1.2 the protocol allows for an authentication and authorization procedure, where necessary
    • A2. metadata are accessible, even when the data are no longer available
  • Interoperable
    • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
    • I2. (meta)data use vocabularies that follow FAIR principles
    • I3. (meta)data include qualified references to other (meta)data
  • Reusable
    • R1. meta(data) are richly described with a plurality of accurate and relevant attributes
      • R1.1. (meta)data are released with a clear and accessible data usage license
      • R1.2. (meta)data are associated with detailed provenance
      • R1.3. (meta)data meet domain-relevant community standards

CARE principles

Applications of FAIR Principles have the potential to neglect the rights of Indigenous Peoples and their protocols for cultural, spiritual and ecological information. (Jennings et al. 2023)

  • Collective benefit
    • C1. For inclusive development and innovation
    • C2. For improved governance and citizen engagement
    • C3. For equitable outcomes
  • Authority to control
    • A1. Recognizing rights and interests
    • A2. Data for governance [self-governance and self-determination]
    • A3. Governance of data
  • Responsibility
    • R1. For positive relationships
    • R2. For expanding capability and capacity
    • R3. For Indigenous languages and worldviews
  • Ethics
    • E1. For minimizing harm and maximizing benefit
    • E2. For justice
    • E3. For future use

https://www.gida-global.org/care

Discussion

FAIR

  • Findable
  • Accessible
  • Interoperable
  • Reusable

CARE

  • Collective benefit
  • Authority to control
  • Responsibility
  • Ethics

  • Where/how do the FAIR and CARE principles complement each other?
  • Where/how do they conflict with each other?
  • What implications do the CARE principles have for those of us who don’t work with Indigenous data?

References