  • luxzia0

Centering the Data in Data Science 2: Data Provenance (first published on Medium on 03.07.2021)

(Image credit: photo by ev on Unsplash)

If you are building a language model to understand TikTok comments, a language model trained on news articles is not going to be particularly helpful. The provenance of the data is key to its quality. If you are studying medical outcomes, data from the 1930s would rarely be useful: medical care has improved dramatically in the last century, and health outcomes are typically much better. Only when we understand what Miriam Posner refers to as the data biography can we truly understand our data: its usefulness, when it is ethical to use, and other facets such as our trust in its cohesiveness and relevance.

Some essential questions we should ask ourselves as end users of a dataset include “who put this data together?” and “when did they put it together, and for what purpose?” Why should we care who put the data together? Knowing the who tells us whom to approach with the questions we will inevitably have about the data. For example, can they tell us about their research history, or other histories that are relevant to other aspects of the dataset?

Understanding the time period, or asking when, is often crucial before we can use a dataset with confidence. For a model to enjoy sustained use, its predictions must be useful and applicable to the problem at hand. That utility is inherently limited by the fact that models are built on historical data. A good rule of thumb is that the more up-to-date the data is, the more relevant the model’s predictions will be at the time it is deployed. There are exceptions and confounding effects to be aware of, such as historical biases in language models, or the economic and supply chain models that were disrupted in 2020 by the black swan event of the Covid-19 pandemic. For those reasons, many models are continuously retrained on fresh data to stay current for their purposes.

Understanding the purpose of a dataset’s creation can yield insights that might not otherwise be captured in its metadata or documentation. Datasets are not free of context. A dataset is a product of its time, creators, and place. Those characteristics are imbued into its very nature.

The place, or where, of a dataset may hold special and hidden relevance. Medical data from a country with a robust public health and testing system, such as the National Health Service (NHS) in the United Kingdom, will differ in quality from the spottier collection in a large country with privatized health care such as the United States. Let’s consider Covid-19 genomic data as an example. If we know that genomic data comes from the NHS and that it was collected for a multitude of purposes, we can infer its methods, how accurately it reflects the spread of the virus, and the prevalence of various strains within the UK. If we contrast that with data collected in the United States, or with early data from Hubei province in China (which was also spotty at a crucial point in the evolution of the SARS-CoV-2 genome), we better understand how reliable each source is for modeling strain prevalence. When we know the identity of a dataset’s creators, its original timeline, and its place and purpose, we gain insight into what the data truly represents and can validate its authority.

Just as important as the other aspects of your dataset are the collection methods by which it was put together. You need to know the methods for statistical and ethical reasons, as well as to qualify the dataset’s relevance.

Statistical arguments can only be defended using a dataset if the gathered statistics are known to be representative and measurable. If a dataset has demographic information for 1,000 people, can we confirm that it is representative? We need to be able to trust that the individuals were chosen randomly enough to represent the intended population, without being influenced by factors such as selection bias. If the data were collected via a method that is known to be unreliable, such as surveys where response bias is very high (people often lie in surveys to portray their behavior as better than it actually is, as discovered by anthropologists working on the Tucson Garbage Project), then the dataset may not be usable for specific applications because the data may be insufficiently accurate.
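One rough way to sanity-check representativeness is to compare the group counts in a sample against known population shares using a chi-square goodness-of-fit statistic. The sketch below is only an illustration: the age groups, counts, and population shares are made-up numbers, and `chi_square_statistic` is a hypothetical helper, not part of any library. A large statistic flags a mismatch on the attributes you measured; a small one does not prove the sample is free of selection bias.

```python
def chi_square_statistic(sample_counts, population_shares):
    """Chi-square goodness-of-fit statistic for sample counts vs. expected shares."""
    total = sum(sample_counts.values())
    statistic = 0.0
    for group, share in population_shares.items():
        expected = total * share  # count we would expect if the sample were representative
        observed = sample_counts.get(group, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical sample of 1,000 people and hypothetical population shares.
sample_counts = {"18-34": 450, "35-64": 400, "65+": 150}
population_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

statistic = chi_square_statistic(sample_counts, population_shares)

# 5.991 is the 5% critical value of the chi-square distribution with
# 2 degrees of freedom (number of groups minus one).
CRITICAL_VALUE_DF2 = 5.991
looks_skewed = statistic > CRITICAL_VALUE_DF2
print(f"chi-square = {statistic:.1f}, skewed on age: {looks_skewed}")
```

With these invented numbers the statistic is 107.5, far above the critical value, so this sample over-represents younger people relative to the assumed population. In practice you would take the population shares from a trusted source such as a census.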

If the information was collected in questionable circumstances, that too can indicate the data will not be reliable for researchers’ intended purposes. For example, if the dataset was collected from prisoners, or from students who were afraid their responses could impact their grades, the respondents may not have given accurate information freely. Related to this, we must ask whether the data was given willingly. As seen in the controversy surrounding the facial recognition dataset MegaFace, Flickr users did not willingly consent to the use of their images by a profit-seeking organization. Data collected from prisoners, images of individuals from surveillance cameras, and personal data sold by banks are often collected without the subjects’ knowledge or consent.

(Image credit: photo by Arno Senoner on Unsplash)

While the EU’s GDPR took effect in 2018, many countries still lack data privacy laws that would give legal standing to these concerns. Regardless, for a company’s reputation and future legal standing, it’s still best practice to source data ethically and with consent.

Finally, for data such as visual or language data that is annotated by a third party, one must also know basic information and metrics about the annotation to properly judge the quality of a dataset. For instance, in language data, you would want to know whether the annotators are first or second language speakers, what dialect(s) of a language they speak, their current geographic region, and other demographic data. Additionally, you want to know the platform or data collection method, the type of tasks, the quality benchmarking or testing used in selecting an annotator, the level of expertise of an annotator, and the metrics on annotation. Examples of annotation metrics to consider are Cohen’s kappa and Krippendorff’s alpha, both measures of inter-annotator agreement. They can be evaluated per annotator and per judgment to tell you the quality of the annotations and to help you decide how many annotator votes should be required for a label to be judged correct. Smaller models in recommendation engines, localized search on an application or site, and benchmarking datasets may well depend on this kind of well-curated annotated data.
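To make the agreement metric concrete, here is a minimal sketch of Cohen’s kappa for two annotators, written with the standard library only. The function name and the example labels are my own illustration, not taken from any particular annotation platform. Kappa compares the observed agreement with the agreement you would expect by chance given each annotator’s label frequencies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # based on how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels from two annotators on ten comments.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this invented example the annotators agree on 8 of 10 items, chance agreement is 0.5, and kappa works out to 0.60. Cohen’s kappa only covers pairs of annotators; for more than two annotators or for missing judgments, Krippendorff’s alpha is the more general choice.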


Statisticians and social scientists alike have long explored data provenance, and most keep notes or systems of their own for tracking it. In the earlier part of this century, scientists and developers of the Semantic Web noted the need for tools to trace the history of a dataset. Multiple publications have addressed the topic, and W3C standards for documentation were laid out in 2010.

In machine learning, these topics were not addressed until the last few years. For a data scientist, the best tools I can recommend are probably still Datasheets for Datasets or IBM’s FactSheets (as I discussed in previous posts, and as covered by Gebru et al.). Regardless of methodology, AI companies and projects should start to document the history and sourcing of the data they use for auditing purposes. We are quickly entering an age in which customers will begin to demand answers for why AI applications produce the results they do, and governments will begin to respond with greater regulation.
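As a starting point, the who, when, where, why, and how questions from this post can be captured in a small record that travels with the dataset. The sketch below is loosely inspired by the idea of a datasheet, but it is not the Datasheets for Datasets or FactSheets schema; the `ProvenanceRecord` class, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Minimal, hypothetical provenance record for a dataset."""
    dataset_name: str
    creators: Optional[str] = None           # who put the data together
    collected_when: Optional[str] = None     # time period of collection
    collected_where: Optional[str] = None    # place or source system
    purpose: Optional[str] = None            # why it was collected
    collection_method: Optional[str] = None  # how it was collected
    consent_obtained: Optional[bool] = None  # was the data given willingly
    annotation_notes: Optional[str] = None   # annotators, agreement metrics


def unanswered_questions(record: ProvenanceRecord) -> List[str]:
    """Return the names of provenance fields that are still undocumented."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example values are invented for illustration.
record = ProvenanceRecord(
    dataset_name="example-comments",
    creators="Hypothetical Lab",
    collected_when="2020-01 to 2020-06",
    purpose="sentiment research",
)

gaps = unanswered_questions(record)
print("Still undocumented:", ", ".join(gaps))
```

Listing the gaps explicitly makes it easy to block a dataset from being used in training or auditing until the missing provenance questions have been answered.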

Acknowledgements: Thanks to Will Roberts for his editing chops on this.

Recommended readings:

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv: 1803.09010

Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. In Transactions of the Association for Computational Linguistics 6, 587–604.

Yogesh Simmhan, Beth Plale, and Dennis Gannon. 2005. A Survey of Data Provenance in e-Science. In ACM SIGMOD Record 34(3): 31–36.

John Richards, David Piorkowski, Michael Hind, Stephanie Houde, and Aleksandra Mojsilovic. 2020. A Methodology for Creating AI FactSheets. arXiv: 2006.13796
