Project and data management
Reading
Some data management disasters
- The Economist (2011), “Video: Keith Baggerly, "When Is Reproducibility an Ethical Issue? Genomics, Personalized Medicine, and Human Error"” (n.d.)
- Herndon, Ash, and Pollin (2014), but you can just read Bailey and Borwein (Jon) (n.d.), Cassidy (n.d.), and/or watch Reinhart & Rogoff - Growth in a Time of Debt - EXERCISE! (2019)
- Laskowski (n.d.), Viglione (2020), Pennisi (2020)
- “How Excel May Have Caused Loss of 16,000 Covid Tests in England” (2020)
Dumpster organization
- Dump all of your files into one place
- Use search tools to find what you want
- Just assume that things aren’t getting corrupted
- The way many Gen Z students think about their files? (Chin 2021)
The first rule of data management
Do not edit your data
Project organization
- Keep your project self-contained
- Locate files quickly
- Play nicely with version control
- Self-document key relationships between project files
- Work with your data without editing it
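One way to work with data without editing it is to treat the raw file as read-only and write every cleaned version to a new file. The following is a minimal sketch in base R, not a prescribed workflow. The temporary project folder, the file names 00_raw.csv and 01_clean.csv, and the toy data frame are all hypothetical.

```r
# Hypothetical project folder inside a fresh temporary directory
proj <- tempfile("demo_project_")
dir.create(file.path(proj, "data"), recursive = TRUE)

# Stand-in for raw data gathered outside the pipeline (toy values)
raw_path <- file.path(proj, "data", "00_raw.csv")
write.csv(data.frame(id = 1:3, value = c(10, NA, 30)),
          raw_path, row.names = FALSE)

# Mark the raw file read-only to guard against accidental edits
Sys.chmod(raw_path, mode = "0444")
checksum_before <- tools::md5sum(raw_path)

# Clean a copy in memory, then write the result to a NEW file
raw <- read.csv(raw_path)
clean <- raw[!is.na(raw$value), ]
clean_path <- file.path(proj, "data", "01_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)

# The raw file is byte-for-byte unchanged
checksum_after <- tools::md5sum(raw_path)
stopifnot(identical(unname(checksum_before), unname(checksum_after)))
```

Comparing checksums before and after is a cheap way to confirm that a script only read the raw data.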
File naming convention
Files in the scripts, data, and plots folders should generally use a sequential naming convention:
- Scripts in scripts should have filenames starting with 01_: 01_scrape.R, 02_parse.R, 03_eda.R, and so on
- Data and plot files (data and plots) should use a parallel naming convention:
  - 00_ indicates raw data (produced or gathered outside of the pipeline in scripts)
  - 01_ indicates plots and intermediate data files produced by script number 01, and so on
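Because the convention is regular, it can be checked and used programmatically. The sketch below, in base R, flags script names that lack a two-digit prefix and uses the prefix of each data file to look up the script that produced it. The script names come from the examples above. notes.R and the data file names are hypothetical.

```r
# Script names from the examples above, plus one hypothetical stray file
script_files <- c("01_scrape.R", "02_parse.R", "03_eda.R", "notes.R")

# Convention: two digits, an underscore, a name, and the .R extension
follows_convention <- grepl("^[0-9]{2}_.+\\.R$", script_files)
stray_files <- script_files[!follows_convention]

# Hypothetical data files that use the parallel convention
data_files <- c("00_raw.csv", "01_scraped.Rds", "02_parsed.Rds")
prefix <- sub("^([0-9]{2})_.*$", "\\1", data_files)

# Prefix 00 marks raw data; any other prefix points to a numbered script
producer <- ifelse(
  prefix == "00",
  "raw data (outside the pipeline)",
  script_files[match(prefix, substr(script_files, 1, 2))]
)
data.frame(data_file = data_files, produced_by = producer)
```

The same prefix lookup works for files in plots, since they share the numbering of the script that created them.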
The model in action: A mid-sized text-mining project
Published paper: https://doi.org/10.1162/qss_a_00150
GitHub repo: https://github.com/dhicks/orus
23 directories, 274 files (plus 160k data files)
.
├── ORU\ faculty
│   └── ORU\ Publications.fld
├── QSS\ forms
├── R
├── data
│   ├── author_histories
│   ├── authors_meta
│   ├── docs
│   ├── ldatuning_results
│   ├── parsed_blocks
│   ├── pubs
│   └── temp
├── paper
│   ├── img
│   └── scraps
├── plots
├── presentations
└── scripts
    ├── 12_analysis_cache
    │   └── html
    ├── 12_analysis_files
    │   └── figure-html
    └── scraps
The analysis pipeline
├── scripts
│   ├── 01_parse_faculty_list.R
│   ├── 02_Scopus_search_results.R
│   ├── 03_match.R
│   ├── 03_matched.csv
│   ├── 04_author_meta.R
│   ├── 05_filtering.R
│   ├── 06_author_histories.R
│   ├── 07_complete_histories.R
│   ├── 08_text_annotation.R
│   ├── 09_build_vocab.R
│   ├── 10_topic_modeling.R
│   ├── 11_depts.R
│   ├── 11_depts.html
│   ├── 12_analysis\ copy.html
│   ├── 12_analysis-matched.html
│   ├── 12_analysis.R
│   ├── 12_analysis.html
│   ├── 12_analysis_cache
│   │   └── html
│   │       ├── __packages
│   │       ...
│   │       └── topic_viz_41d0cb157a88d4ec41810a16e769f5d5.rdx
│   ├── 12_analysis_files
│   │   └── figure-html
│   │       ├── author-dept\ distance-1.png
│   │       ...
│   │       └── topic_viz-2.png
│   ├── api_key.R
│   └── scraps
│       ├── 02_parse_pubs_list.R
│       ├── 03_coe_pubs.R
│       ├── 03_match_auids.R
│       ├── 07.R
│       ├── 12_regressions.R
│       ├── BML-CMSI\ deep\ dive.R
│       ├── Hellinger_low_memory.R
│       ├── dept_hell_net.R
│       ├── divergence\ against\ lagged\ distributions.R
│       ├── exploring\ topics.R
│       ├── fractional_authorship.R
│       ├── hellinger.R
│       ├── model_scratch.R
│       ├── multicore.R
│       ├── net_viz.R
│       ├── prcomp.R
│       ├── propensity.R
│       ├── rs_diversity.R
│       ├── spacyr.R
│       ├── topic\ counts\ rather\ than\ entropies.R
│       ├── topic_cosine_sim.R
│       ├── unit-level.R
│       ├── weighted\ regression.R
│       ├── word-topic_distance.R
│       ├── xx_construct_samples.R
│       └── xx_oru_complete_histories.R
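A side benefit of numbered scripts is that the whole pipeline can be rerun in order with a few lines of code. The sketch below is an assumption about how one might do this, not how the project above is actually run. It builds three tiny hypothetical scripts in a temporary folder, selects only the files that match the two-digit convention, and sources them in sequence.

```r
# Hypothetical scripts folder in a temporary directory
scripts_dir <- tempfile("scripts_")
dir.create(scripts_dir)

# Toy scripts, written out of order on purpose; each records that it ran
writeLines('steps <- c(steps, "eda")',    file.path(scripts_dir, "03_eda.R"))
writeLines('steps <- c(steps, "scrape")', file.path(scripts_dir, "01_scrape.R"))
writeLines('steps <- c(steps, "parse")',  file.path(scripts_dir, "02_parse.R"))
# A scratch file without a numeric prefix; it should be skipped
writeLines('steps <- c(steps, "scratch")', file.path(scripts_dir, "scratch.R"))

# Keep only files named like 01_name.R, sorted by their prefix
pipeline <- sort(list.files(scripts_dir,
                            pattern = "^[0-9]{2}_.*\\.R$",
                            full.names = TRUE))

# Run each script in a shared environment so we can inspect the order
run_env <- new.env()
run_env$steps <- character(0)
for (script in pipeline) {
  source(script, local = run_env)
}
run_env$steps
```

The filename pattern also leaves out rendered output such as .html files that sit next to the scripts.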
Intermediate data files
data
├── *ORUs\ -\ DSL\ -\ Google\ Drive.webloc
├── 00_UCD_2016.csv
├── 00_UCD_2017.csv
├── 00_UCD_2018.csv
├── 00_faculty_list.html
├── 00_manual_matches.csv
├── 00_publications_list.html
├── 01_departments.csv
├── 01_departments_canonical.csv
├── 01_faculty.Rds
├── 02_pubs.Rds
├── 03_codepartmentals.Rds
├── 03_dropout.Rds
├── 03_matched.Rds
├── 03_unmatched.Rds
├── 04_author_meta.Rds
├── 04_dropouts.Rds
├── 04_genderize
├── 04_namsor.Rds
├── 05_author_meta.Rds
├── 05_dept_dummies.Rds
├── 05_dropouts.Rds
├── 05_layout.Rds
├── 05_matched.Rds
├── 06_author_histories.Rds
├── 07_coauth_count.Rds
├── 07_parsed_histories.Rds
├── 08_phrases.Rds
├── 09_H.Rds
├── 09_atm.csv
├── 09_vocab.tex
├── 10_atm.csv
├── 10_atm_pc.Rds
├── 10_aytm.csv
├── 10_aytm_comp.csv
├── 10_aytm_did.csv
├── 10_model_stats.Rds
├── 10_models.Rds
├── 11_au_dept_xwalk.Rds
├── 11_departments.csv
├── 11_departments_canonical.csv
├── 11_dept_dummies.Rds
├── 11_dept_gamma.Rds
├── 11_dept_term_matrix.Rds
├── 11_oru_gamma.Rds
├── 11_oru_term_matrix.Rds
├── 11_test_train.Rds
├── 12_layout.Rds
├── author_histories [7665 entries exceeds filelimit, not opening dir]
├── authors_meta [6020 entries exceeds filelimit, not opening dir]
├── docs [145144 entries exceeds filelimit, not opening dir]
├── ldatuning_results
│   ├── tuningResult_comp.Rds
│   ├── tuningResult_comp.docx
│   ├── tuningResult_comp.pdf
│   ├── tuningResult_did.Rds
│   └── tuningResult_did.pdf
├── ldatuning_results-20190415T164055Z-001.zip
├── parsed_blocks [430 entries exceeds filelimit, not opening dir]
├── pubs [282 entries exceeds filelimit, not opening dir]
└── temp
A reminder on paths
- Windows and Unix-based systems write paths differently
- Use file.path() or (even better) the here package to construct paths
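As a small illustration of the point above, file.path() assembles a path from its components, so no separator is typed by hand. The folder and file names are hypothetical. The call to here::here() is guarded because the here package may not be installed.

```r
# Build a relative path from its parts instead of typing separators
raw_path <- file.path("data", "00_raw.csv")
raw_path
# "data/00_raw.csv"

# Components can be stored in variables and reused across scripts
data_dir <- "data"
plot_dir <- "plots"
clean_path <- file.path(data_dir, "01_clean.Rds")
plot_path <- file.path(plot_dir, "03_histogram.png")

# With the here package (if installed), paths are resolved relative to
# the project root rather than the current working directory
if (requireNamespace("here", quietly = TRUE)) {
  print(here::here("data", "00_raw.csv"))
}
```

The advantage of here is that a script finds the same files whether it is run from the project root, from scripts, or from a rendered notebook.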
Exercise: Organize your EDA
.
├── R
├── data
├── paper
├── plots
├── readme.md
├── scripts
└── talk
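If it helps to get started, the empty skeleton above can be created from R. This is only a scaffolding sketch: it builds the folders inside a temporary directory (a hypothetical location, to be replaced with your own project root) and leaves the actual organizing of files to you.

```r
# Hypothetical project root; swap in your own project folder
project_root <- tempfile("eda_project_")

# Folders from the layout above
folders <- c("R", "data", "paper", "plots", "scripts", "talk")
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE)
}

# Empty readme to fill in with a description of the project
file.create(file.path(project_root, "readme.md"))

# Check the result
list.files(project_root)
```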
Data management plans
- Much like a research plan, data management plans provide an overview of the steps you’ll take to gather, publish, and maintain your data
- Since 2011, NSF has required a 2-page data management plan for most types of proposals
Common elements
- Who is responsible for data management
- Who else will have access to which data
- How data will be collected
- Data formatting standards
- Whether and how data will be archived and made available for reuse
Data management plan examples and resources
FAIR principles for published data
- Findable
  - F1. (meta)data are assigned a globally unique and persistent identifier
  - F2. data are described with rich metadata (defined by R1 below)
  - F3. metadata clearly and explicitly include the identifier of the data it describes
  - F4. (meta)data are registered or indexed in a searchable resource
- Accessible
  - A1. (meta)data are retrievable by their identifier using a standardized communications protocol
    - A1.1 the protocol is open, free, and universally implementable
    - A1.2 the protocol allows for an authentication and authorization procedure, where necessary
  - A2. metadata are accessible, even when the data are no longer available
- Interoperable
  - I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  - I2. (meta)data use vocabularies that follow FAIR principles
  - I3. (meta)data include qualified references to other (meta)data
- Reusable
  - R1. meta(data) are richly described with a plurality of accurate and relevant attributes
    - R1.1. (meta)data are released with a clear and accessible data usage license
    - R1.2. (meta)data are associated with detailed provenance
    - R1.3. (meta)data meet domain-relevant community standards
CARE principles
Applications of FAIR Principles have the potential to neglect the rights of Indigenous Peoples and their protocols for cultural, spiritual and ecological information. (Jennings et al. 2023)
- Collective benefit
  - C1. For inclusive development and innovation
  - C2. For improved governance and citizen engagement
  - C3. For equitable outcomes
- Authority to control
  - A1. Recognizing rights and interests
  - A2. Data for governance [self-governance and self-determination]
  - A3. Governance of data
- Responsibility
  - R1. For positive relationships
  - R2. For expanding capability and capacity
  - R3. For Indigenous languages and worldviews
- Ethics
  - E1. For minimizing harm and maximizing benefit
  - E2. For justice
  - E3. For future use
Discussion
FAIR
- Findable
- Accessible
- Interoperable
- Reusable
CARE
- Collective benefit
- Authority to control
- Responsibility
- Ethics
- Where/how do the FAIR and CARE principles complement each other?
- Where/how do they conflict with each other?
- What implications do the CARE principles have for those of us who don’t work with Indigenous data?