March 9, 2021

title: ai4lam Metadata/Discovery WG Monthly Meeting

March 9, 2021#

9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris

Attending

*Name (institution) *
Tim Thompson (Yale)
Jeremy Nelson (Stanford)
Niek Verhoeff (Amsterdam City Archives)
Mike Trizna (Smithsonian)
Corey Harper (Elsevier Labs / Univ. of Amsterdam)

Regrets

David Lowe (Texas A&M)

**Notetaker (alpha by first name): **

[]{#anchor-1}Helpful Links

Metadata WG Zotero Group Library

[]{#anchor-2}Project Documents and Data

WG charter

<<<<<<< HEAD

=======

9c0c03e (Adds remaining 2021 meetings)

WG Google Drive folder

[]{#anchor-3}Agenda Topics

Updates, announcements
Presentation by Kat Thornton on Software Emulation as a Service

a. Background reading:
> Thornton, K., Cochrane, E., Ledoux, T., Caron, B., & > Wilson, C. (2017). Modeling the Domain of Digital Preservation > in Wikidata. 10. > https://phaidra.univie.ac.at/view/o:931058
Discussion of Model Cards

a. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., > Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). > Model Cards for Model Reporting. Proceedings of the > Conference on Fairness, Accountability, and Transparency, > 220–229. > https://doi.org/10.1145/3287560.3287596

[]{#anchor-4}Notes

Introductions
Presentation by Kat Thornton, modeling and use cases around metadata surrounding machine learning and training data. Kat’s work is adjacent, using Wikidata for software emulation. Purpose for creating metadata and curated through Wikidata for use by others outside the group.
In-browser emulation of software titles in Yale’s collection, no longer have software/environment available to run. Get data into and out of these legacy applications, Kat’s provide metadata software formats and titles.
Wikidata as a data garden. Right around 1 year mark of global pandemic, source of health and balance engaging with the natural world, taking the garden as inspiration, think about Wikidata as a data garden. Watch it grow and then harvest the results. “Data seed”, creating new items in Wikidata and adding statements to existing items.
Many gardens collaboration in Wikidata. Other , multi-lingual support in Wikidata over 300 languages, even if they aren’t speaking or know about other gardeners. Other editors plant or tend data in other domains. Kat focused on Computing, complementary domains from other like scholarly publications, many millions of publications described in Wikidata, create connections between articles with software being used in these articles.
In terms of data harvest, SPARQL queries get specific data baskets and then see what is available and watch how changes over time. Produce from these garden can be harvested at any and amount.
Data changed in the domain in Computing in 2016 11,055 software described in wikidata, now over 85,000 large increase in activity and curation. Before there 400 file formats now over 5,000
RDF graph, mulit-lingual expands the use, property names and items can be stored in multiple languages. Label and descriptions translations uneven distributed in wikidata over human language,
Portal for Wikidata for Digital Preservation, portal pulls from Wikidata API and SPARQL endpoint. Example of searching for Debian. Language drop-down, switch to property names translated into the language. Wikidata multi-language supported in third party more useful for more people. International preservation community has more access to metadata available in Wikidata.
Why create an RDF data layer? RDF datasets can be automatically combined, RDF is appropriate for evolving data models. Advantage of SPARQL, ability to write very precise queries for work supporting other external systems. Ask complex questions about the data, over 900 software titles have user manuals available on the web. SPARQL query for software titles that have manuals, results have links to manuals across the web.
Over 20,000 scientific articles that reference 900 software titles, dozens of file formats connected to their defining standard. More than 1,500 RFC connected to technologies/formats they describe. Many editors continually contribute to these items to increase coverage. Really nice for people doing computing research tied to software formats, can provide access to WIkidata and do not need to be stored. Experts can specify how formats are used in what software and help create connections in wikidata, easy to point users for further information to do research.
Graph visualization of software emulators that are linked and are available in the Easy system. Can click on these nodes and get information from running those queries on Wikidata and make the system more feature rich.
WIkidata for Cultural Heritage Buzz, LoC now using Wikidata authority data, ARL white paper, OCLC’s using WIkidata Creating Library Linked Data in a production system. Other national libraries using Wikidata to manage their library data or their wikibase and have written up their results.
How do we improve the quality of Wikidata? Multiple models, constraint violations, Wikidata designed to accommodate multiple epistemological models, community-based systems to nudge people to collaborate on data models. Early days in building tools around data quality in Wikidata, common experience creative exploratory SPARQL queries to see if the Wikidata can meet their needs. New namespace in Wikidata, schema namespace, “e”, share models in shape expressions formal modelling and shape for RDF http://shex.io, example Schema for vr headset in Wikidata schema namespace. Test for conformance and test namespace. Relative new but very promising approach for modeling data in WIkidata. Very exciting to have this global data store, more useful for producers and consumers can agree.
Question: Wikidata constantly being used in ML community, columns in tables entity-linking in Wikidata. Kinds of software applications represented in Wikidata, representation of ML models, codes, BERT and other, resources along with papers, datasets linking, etc.? WIkidata expand out to larger documentation of AI space?

Some of that in Wikidata, a lot of more work to be done. Not of aware of anyone who is creating links to all resources mentioned in an article. Possibility is there but a more is needed. EaaSE project looking on legacy software past, other people are focusing on current software. WikiProject Informatics can post interesting questions, ask or encourage people to do more work on this, post question. Infrastructure is there for supporting, properties specific for datasets and connecting to publications. Not being systemically curated right now.

What work that would like? Write a bot framework, PyWiki data and other custom bots, take the json formats and programmatically update Wikidata as more work.

Is the wikidata comfortable with the relatively sparse metadata? Authors, dates, not, create scaffold for other data updates over time? Write a set of inter-connection of bots combine to be a richer, suggest posting to WikiProject Informatics, help with curation to be more througal.
Question: How tricky version can be? Software with different properties with different versions of software? Talk about software and version with different items, 30-40 versions of different file formats. For software there is more variability, legacy software ok with different versions. For open-source software link to Github for other versions, different versions stored in tabular format with different links and date. Table-based version approach was more optimal for some software. EaaE, really want to know, ex: these versions of ClarisWork, these files will work, later versions may have different properties. Priority needs of researcher, discussion on the wikidata, some degree of messiness of dealing with WIkidata ecosystem, require some work. Wikidata a much younger project than 20-year Wikipedia.
Question: Model Cards concerned about ethical considerations of models bias with different attributes, MC have a narrative component, piece of documentation for each model, mix of structured and very broad narrative. Wikidata approach for MC? Room for leverage Wikipedia for narrative component? Wikipedia challenge of modeling for one article for Model cards, or create articles per model card. Some structured and unstructured, different type of repository and create per model card item. Link out with a property that links out to the narrative. If MC is published a publication, create metadata in wikidata and link out to the publication item and require user to go out the narrative repo.

Mentioned tabular data in MC, MC have summary tables, properties for recording tabular data? How will stored? Will have to create new properties, Wikidata community would be interested in collaborative. They won’t be opposed to these, editors interested and lively conversations. Create own Wikibase for MC and create as many properties as we want, use subset of Wikidata properties that make sense. Trade-off, make WIkibase move fast and create model. Interim step and show community and about properties. Ethical considerations for MC, too important and need to be included in Wikidata, trade-off propose properties and negotiate structure. Tabular property, not widely used in property namespace. Example was only used in Software versions. In some software items, software version property. Earlier software p348 software is a string, p4669 has english label of tabular data version, data is tabular. Tabular data type not widely used. Propose tabular structure takes more careful discussion and justification, property proposal may be more involved.

How many properties did you create? Some more difficult to be proposed? Any other suggestions? Yes, created properties related to digital preservation work, pretty straightforward only a couple of weeks. External identifier is the easiest to be approved, point to resource on the web. Usually the Wikidata open to external identifiers. Other types of properties take more conversations. In file formats, patterns some further qualifiers, know pattern and know what qualifier, like encoding. Offset, indicate point location in file. Without these qualifiers, less useful for external tools. More complicated and different qualifiers set, used by different tools and workflows. Can be complicated, also proposed properties for non-digital preservation, harder to be approved. Lot of factors, properties proposal review process more thoroughly, some people or more focused on who proposed and what volume and quality of what an individual contributed can be taken in account. 20,000 users make at least one action in Wikidata per month, at any given time, not even a dozen users engaged in properties proposal. Property namespace and engage with these proposals. Who is reviewing these proposal. Special class of account, can be property creator role to create a property.

Next steps? Available for follow-up? Yes!

Cory Model Card related topic, data nutrition labels? Data nutrition project https://datanutrition.org/ very cool.