(Image credit: photo by Emily Morter on Unsplash)
Data science is the science of data. In any scientific field, scientists understand the data of their domain; in linguistics, or language science, that data is language. Understanding data, how it works, and which data to use might seem very abstract, but it is the very meaning of what it is to be a data scientist.
There’s an adage that data scientists spend 80% of their time dealing with data and about 20% dealing with actual machine learning problems. Despite that, almost all writing for data scientists covers problems such as hyperparameter optimization, edge cases in TensorFlow, or other deep learning issues rather than the heart of data science, which is data.
In this post and subsequent ones, I’ll examine the dimensions we should consider when using an existing dataset, whether as is or altered, or when creating our own from scratch for a problem. In this first post, I’ll cover ownership and licensing issues. In subsequent posts, I’ll discuss data provenance, cohesiveness, usage, and other topics as they arise.
What follows draws on my own experience as a data scientist, linguistic archivist, and field linguist and, just as importantly, on conversations with experienced social scientists, statisticians, and AI ethicists, and on work such as that of Timnit Gebru and Emily Bender.
First dimension: Who owns the data?
As we recently saw with the press coverage of the MegaFace dataset, data are often shared and reshared without any description of who owns them and how they may be used, and unclear ownership can be a big problem for downstream users. Even when a dataset’s original creators were explicit about the licensing and allowed usage, it can be difficult to pin down what is and isn’t acceptable.
While the original publishers of the photos that seeded the eventual MegaFace dataset had good intentions (to level the playing field, since other datasets were prohibitively expensive and beyond the reach of ordinary researchers), their supposed safeguards were insufficient. They provided links to the photos, which meant that if users deleted their original photos from Flickr, the photos would no longer be used in the dataset. These safeguard mechanisms didn’t solve the problems of access and ownership.
The Creative Commons licenses under which many of the photos were published on Flickr do allow some non-commercial use of the work, but the licenses were created before large-scale data collection for machine learning became common. Additionally, the images were traceable back to the original users, which violated privacy policies.
The University of Washington researchers who assembled MegaFace have allowed it to be downloaded widely, even though under laws such as Illinois’ Biometric Information Privacy Act it is illegal for a person’s facial data to be in the dataset without consent. The dataset is likely to have been used by multiple companies, although it is unclear to what extent it has been used to develop commercial technology. However, many companies and researchers have certainly used it without deeper consideration of the data’s origin or ownership.
It’s easy to forgive people for being excited to use these datasets. For novice data scientists and those more experienced but expanding their skills, the availability of free and ethically sourced datasets is key to their growth. For university researchers, access to free or low-cost datasets is also central to continuing to do research on often very limited budgets. For research, education, and benchmarking especially, the UC Irvine Machine Learning Repository is a common source of datasets, as are the datasets bundled with the scikit-learn and NLTK libraries in Python.
Commercial use is a different matter. It is fine to look for free datasets whose owners, whether individuals or organizations, allow them to be used for commercial purposes, but many do not. Be clear on this. If you are using a dataset to make money, you should be willing to treat it as an investment like any other asset, such as your AWS usage or the time of your employees. Good datasets are well-curated, often labeled, and are an investment of time. When I was younger, I was always told that if I sold any puppies my dog had, I should charge money even if the dog was a total mutt. The reasoning behind this was that people willing to pay for an animal would value that animal and take care of it. The same is true for a dataset.
(Image credit: photo by NICHOLAS BYRNE on Unsplash)
Of course, the acceptable uses under many licenses are murky, as legal language can be abstruse and hard to interpret, especially when certain scenarios were not taken into account at the time the license was written. This is the case with a lot of data licensed under Creative Commons. With the 2019 release of the Montréal Data License, the creators hope to address data usage with regard to the specific needs of machine learning researchers and data scientists. This license distinguishes usages such as benchmarking versus training on data, direct release of a model versus API usage in a SaaS company, and other issues that are particular to machine learning. While the license does not yet have widespread adoption, it can be hoped that within the next few years these issues will be clarified to a greater extent. Furthermore, as suggested by Jo and Gebru [1], the development of data consortia to curate and manage datasets, like those that exist for archival purposes in many other fields (such as the Linguistic Data Consortium for linguistics), could take much of the guesswork out of this process and lead to clarity for developers, safeguards for the privacy of the public at large, and fewer legal issues for companies.
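To make those distinctions concrete, here is a minimal Python sketch of how a team might record which uses a dataset's license grants and compare them with what it plans to do. The class and function names (Use, LicenseGrant, uncovered_uses) are hypothetical, and the categories are taken from the paragraph above rather than from the license's legal text, so treat this as a bookkeeping aid, not as an interpretation of any license.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Use(Enum):
    """Usage categories mentioned above (illustrative, not legal terms)."""
    BENCHMARKING = auto()
    TRAINING = auto()
    MODEL_RELEASE = auto()  # distributing a trained model directly
    API_ACCESS = auto()     # exposing a trained model behind an API, e.g. SaaS


@dataclass(frozen=True)
class LicenseGrant:
    """What a dataset's license is understood to allow (hypothetical record)."""
    dataset: str
    permitted: frozenset         # set of Use values the license grants
    commercial_ok: bool = False  # whether commercial use is allowed at all


def uncovered_uses(grant, planned, commercial):
    """Return the planned uses that the grant does not cover, sorted by name."""
    if commercial and not grant.commercial_ok:
        # Without a commercial grant, every planned commercial use is uncovered.
        return sorted(planned, key=lambda u: u.name)
    return sorted(set(planned) - grant.permitted, key=lambda u: u.name)


if __name__ == "__main__":
    grant = LicenseGrant("toy-dataset", frozenset({Use.BENCHMARKING, Use.TRAINING}))
    gaps = uncovered_uses(grant, {Use.TRAINING, Use.API_ACCESS}, commercial=False)
    print([u.name for u in gaps])
```

An empty result only means the recorded grant covers the planned uses; whether the record itself reflects the license correctly is still a question for whoever reads the license.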
Solutions for Now
While the Montréal Data License, which explicitly states how data can be used in machine learning, now exists, it isn’t commonplace yet. However, there are several tools that can help you think through and mitigate issues of data ownership. Among these are Datasheets for Datasets [2] (the datasheet template is at the end of the paper) and the Data Nutrition Label Project. For the moment, the responsibility for investigating the licensing and ownership of the data falls on the individual, institution, or company using it, whatever their purpose. The bottom line for any data use should be to ensure that you have the consent of the owners of the data, whether tacitly through the licensing or explicitly in written communication. Having an individual or team dedicated to sourcing and verifying a dataset’s history and ownership, along with a lawyer or policy point person, is probably the best way to ensure that the data you use is both legal and ethical, as well as a good fit given the two dimensions discussed here.
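As one way to put that responsibility into practice, the sketch below keeps a small ownership record per dataset and lists the questions still unanswered. The field names and the open_questions helper are assumptions made for illustration; they are not taken from the Datasheets for Datasets template or the Data Nutrition Label, which are far more thorough.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OwnershipRecord:
    """Minimal, hypothetical record of what is known about a dataset."""
    name: str
    owner: Optional[str] = None         # person or organization holding the rights
    license_name: Optional[str] = None  # license as stated by the source
    source_url: Optional[str] = None    # where the data was obtained
    consent_documented: bool = False    # tacit via license or explicit in writing
    legal_review_done: bool = False     # checked by a lawyer or policy point person


def open_questions(record: OwnershipRecord) -> List[str]:
    """Return the ownership questions that the record leaves unanswered."""
    questions = []
    if not record.owner:
        questions.append("Who owns the data?")
    if not record.license_name:
        questions.append("Under what license is it shared?")
    if not record.source_url:
        questions.append("Where was it obtained?")
    if not record.consent_documented:
        questions.append("Is the owners' consent documented?")
    if not record.legal_review_done:
        questions.append("Has a lawyer or policy point person reviewed it?")
    return questions


if __name__ == "__main__":
    record = OwnershipRecord(name="toy-dataset", license_name="CC BY-NC 4.0")
    for question in open_questions(record):
        print(question)
```

Keeping such a record next to the data means the questions get asked once, in writing, rather than being rediscovered by each new user of the dataset.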
Acknowledgments: many thanks to Will Roberts for edits, Victoria Heath at MAIEI for advice sought for another project and Richard Pang at NYU for asking the questions that inspired this blog post.
Papers referenced or used as a basis for this piece:
[1] Eun Seo Jo and Timnit Gebru. 2020. Lessons From Archives: Strategies for Collecting Sociocultural Data in Machine Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–16.
[2] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010.
Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. In Transactions of the Association for Computational Linguistics 6, 587–604.
Misha Benjamin, Paul Gagnon, Negar Rostamzadeh, Chris Pal, Yoshua Bengio, and Alex Shee. 2019. Towards the Standardization of Data Licenses: The Montréal Data License. arXiv:1903.12262.
Further resources:
More on Creative Commons uses: https://wiki.creativecommons.org/wiki/Data#Frequently_asked_questions_about_data_and_CC_licenses
Copyright laws and machine learning in the US and EU: