# ai4lam Metadata/Discovery WG Monthly Meeting
## April 12, 2022
**Attending:**
Jeremy Nelson (Stanford)
Mike Trizna (Smithsonian)
Erik Radio (Colorado)
David Lowe (Texas A&M)
**Regrets:**
**Notetaker (alpha by first name):**
## Helpful Links
## Project Documents and Data
## Agenda Topics
Updates, announcements, intros
Our invited guest speaker for this meeting will be Dr. Emily M. Bender from the University of Washington, who is a leading researcher in Natural Language Processing and Computational Linguistics. Her work in Data Statements for language data sets—as a unique type of descriptive metadata—holds the potential to be the basis of a fascinating conversation for library, archives, and museum metadata professionals who handle or advise on cataloging large data sets. She has indicated that it will be useful for attendees to familiarize themselves with these two documents in particular prior to the meeting:
1. Bender, E. M., & Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://doi.org/10.1162/tacl_a_00041
Abstract. In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. We argue that data statements will help alleviate issues related to exclusion and bias in language technology, lead to better precision in claims about how natural language processing research can generalize and thus better engineering results, protect companies from public embarrassment, and ultimately lead to language technology that meets its users in their own preferred linguistic style and furthermore does not misrepresent them to others.
2. Bender, E. M., Friedman, B., & McMillan-Major, A. (2022). Data Statements: A Guide for Writing Data Statements for Natural Language Processing. https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf
Podcast: https://www.youtube.com/watch?v=VaxNN3YRhBA
Dr. Bender runs a master’s program in computational linguistics at the University of Washington, with a course on ethics in the curriculum. In 2017 she taught a course on ethical issues in NLP; the material covered what can go wrong, but the field does not have much support for how to mitigate these issues. Positionality statements offered a precedent for systematically describing participants; Data Statements bring that same systematic approach to data.
Data Statements are a schema for documenting NLP datasets, informed by the linguistic characteristics of the data. After the first version of the schema was finished, other groups pursued similar efforts in parallel, with some cross-pollination among them: Datasheets for Datasets, Dataset Nutrition Labels, and Model Cards for Model Reporting. The team applied value sensitive design: imagine a future and its possible downsides, for example, what negative effects might follow from requiring Data Statements for all NLP datasets. An online workshop in May 2020 drew participants from all over; attendees drafted Data Statements from worksheets, and the comments in those discussions produced Data Statements 2.0, with the schema available at http://techpolicylab.uw.edu/data-statements/. A sketch of the schema’s elements appears after the list below.
How Data Statements differ from these related efforts:
- Not as structured: human readable rather than machine readable.
- A much bigger focus on “who is doing the speaking.”
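To make the schema concrete for metadata practitioners, here is a minimal sketch of the Data Statement elements as a Python dataclass. The field names follow the schema elements described in Bender & Friedman (2018); the dataclass itself and the example values are illustrative assumptions, since Data Statements are deliberately human-readable prose rather than a machine-readable format.

```python
from dataclasses import dataclass

@dataclass
class DataStatement:
    """Illustrative container for Data Statement schema elements.

    Field names follow Bender & Friedman (2018); this structure is an
    assumption for illustration only, since Data Statements are prose
    documents, not a machine-readable serialization.
    """
    curation_rationale: str    # why these texts were selected
    language_variety: str      # e.g., a BCP-47 tag plus a prose description
    speaker_demographic: str   # who is doing the speaking
    annotator_demographic: str
    speech_situation: str      # time, place, and modality of the language use
    text_characteristics: str  # genre, topic, structure

# Hypothetical example for a digitized-newspaper OCR corpus.
example = DataStatement(
    curation_rationale="OCR text from 1850s regional newspapers digitized for research access.",
    language_variety="en-US; 19th-century American newspaper register.",
    speaker_demographic="Largely unknown; authors were rarely credited. Recording that we do not know is itself valuable.",
    annotator_demographic="Library staff who reviewed and corrected OCR output.",
    speech_situation="Published journalistic prose, 1850-1859.",
    text_characteristics="Short news items and advertisements; OCR noise present.",
)
print(example.speaker_demographic)
```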
### Question: Metadata as Dataset
Could you describe techniques for treating metadata itself as a dataset?
- Who are the speakers? Catalogers may have limited knowledge; if we don’t know, documenting that we don’t know is itself valuable.
- How do you document your dataset? This is a close match to what the researcher needs to know.
- Writing a Data Statement is a useful exercise as you create the dataset.
- It helps you scope out what you have in your collection.
### Question: About Datasets that have existed for a while?
- A dataset can have multiple Data Statements.
- Legacy datasets often lack documentation; Data Statements 2.0 adds header information on how to cite the creators of the Data Statement versus the creators of the dataset.
- It is possible to recover some of the information from the publications, and you can also ask the original researchers for it.
- Metadata about conflicting Data Statements is valuable when evaluating two or three different Data Statements for the same data.
### Question: LAM use of these tools?
- Large datasets like digitized newspapers: OCR works differently for different time periods, so a model trained on newspapers from 1950 would not work as well on newspapers from the 1850s. The same holds for OCR of handwriting as well as printed text.
- Music and video collections.
- Pattern matching over materials that have been curated and selected, with some metadata already created, has advantages over data that has been extracted by scraping.
- For text, we often have a lot of data about who created and curated it, which makes for interesting structured datasets to accompany Data Statements.
- Keep tools and models separate from the data they were trained on.
- Named entities are very time sensitive; there is a lot of value in working with material that is already curated the way libraries curate it.
### Question: Use of a single Data Statement Repository?
What is the life cycle of a Data Statement like?
- We want Data Statements to travel with the data, with matching versions.
- A catalog of Data Statements that have been used and are available would help. Could the ACM host such a repository? It is unclear what the appropriate governing body for managing Data Statements would be.
- Maybe what is needed is a couple of sites that just collect DOIs (a sketch of DOI-based metadata retrieval follows).
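As an aside on the DOI idea, here is a minimal sketch, an assumption rather than anything proposed at the meeting, of how a site that collects only DOIs can still recover citation metadata through doi.org content negotiation, using the Crossref-registered DOI of the 2018 paper cited above:

```python
import requests

# Resolve a DOI to citation metadata via doi.org content negotiation.
# CSL JSON is served for Crossref- and DataCite-registered DOIs.
resp = requests.get(
    "https://doi.org/10.1162/tacl_a_00041",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
meta = resp.json()
print(meta["title"])  # the registered title of the cited paper
```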
### Question: Legal aspects of Data Statements, like Open Source licenses; could something be done the same way?
- Licensing of Data Statements is important: there is a license for the Data Statement itself and a license for the constituent dataset.
- Beyond open source terms, could Data Statements restrict a dataset to particular uses?
### Question: Fill us in on the broader picture of the Tech Policy Lab?
- Dr. Bender is a relative newcomer to the lab. The premise: if we are going to create policy that works with technology, we need to span expertise in policy and technology, with people deeply knowledgeable about policy interacting with people who know the technology.
- You never want to regulate technologies themselves but their affordances. For example, regulations written around the DVD no longer apply once a different method or format comes along; think instead about the affordances a technology provides.
- The lab has had some success with libraries on DRM: where DRM is overly restrictive, libraries have won the ability to rip DVDs and CDs, on the grounds that such DRM conflicts with first sale.
- The 2018 paper discussed policies around dataset regulation, though the policy implementations are unclear.
- When machine learning impacts people, should the impacted users know what the AI model was trained on and be able to make other informed decisions about the dataset?
- Similarly, conversational AI (not really AI, and not really conversation) is not very transparent about what domain it is searching over.
### Question: Open access for papers and data management plans are obviously good but hard for institutions; how would the Data Statements approach become more widely implemented?
- Some people provide their papers as open access even if the journal is not.
- Data Statements are similar: early adopters use them now, and uptake grows as the culture changes, but higher uptake would definitely happen if funders and institutions required Data Statements; NeurIPS, for example, now asks for dataset documentation.
- The Data Statements guide explains, for each element, why it is useful both for the person writing the Data Statement and for the person using it.
- In computational linguistics and machine learning venues, Data Statements should not count against publication page limits.
### Question: FAIR principles for Data Statements?
- She came across the FAIR principles in prior work on interoperability among the proliferating tools for endangered languages and Open Languages efforts.
- FAIR supports discoverability of datasets but not necessarily accessibility; there is a tension between access and protecting sensitive data.
### Question: Along the lines of the FAIR principles, do big models trained on a lot of data (like all of the Internet) limit the usability of Data Statements?
- See On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 (https://dl.acm.org/doi/10.1145/3442188.3445922). How big is too big? Plan the documentation along with the dataset, and only build a dataset of a size that the documentation can support.
- The PaLM paper from Google includes a datasheet for its dataset, based on sampling 1% of the data.
- Jesse Dodge et al. documented C4, the Colossal Clean Crawled Corpus, covering a large portion of the data: https://arxiv.org/abs/2104.08758 (PDF: https://homes.cs.washington.edu/~msap/pdfs/dodge2021documentingC4.pdf; published version: https://aclanthology.org/2021.emnlp-main.98/).
- It is hard to do documentation post hoc.
- Related papers are listed in her course: https://faculty.washington.edu/ebender/2021_575/
### Question: Data Statements are for documenting datasets; is there a project using Data Statements for bibliographic data?
- Training, test, and validation data all warrant documentation.
- One of the best practices came out of the workshop: participants paired up to create Data Statements for different projects, using the schema as an interview guide between the partners.
- With RDF and the Semantic Web, the move from triples to mining full text lost something in the process (see the sketch below).
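For context on the triples remark, here is a minimal sketch using the rdflib library and a hypothetical catalog record URI: each bibliographic fact is an explicit subject-predicate-object triple, exactly the structure that full-text mining has to reconstruct.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

g = Graph()
# Hypothetical record URI for illustration only.
record = URIRef("https://example.org/catalog/record/123")

# Each bibliographic fact is an explicit subject-predicate-object triple.
g.add((record, DC.title, Literal("Data Statements for Natural Language Processing")))
g.add((record, DC.creator, Literal("Bender, Emily M.")))
g.add((record, DC.creator, Literal("Friedman, Batya")))
g.add((record, DC.date, Literal("2018")))

print(g.serialize(format="turtle"))
```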
### Question: From the course link, what is the level of interest? Wondering about AI and how it is misapplied, and how much CS students engage with sociology research that is very valuable and could be applied?
- The course is focused on master’s students in the Linguistics Department; some have a CS background.
- These are interesting problems. The course works taught standalone, but making it required can backfire; it is better to thread ethics throughout the curriculum, situating the values and creating a cohort that can apply them more broadly.
- The work is situated within a hierarchy of knowledge: within CS, machine learning is more privileged, which shapes what gets valued and exposed.