Background

This project is focussed on uploading metadata for New Zealand academic theses to Wikidata, in order for them to be more openly citable and accessible. We believe this is the first attempt to upload a national dataset of theses.

The project came about while Giantflightlessbirds was a Wikipedian in Residence at Lincoln University. During that short residency, librarian Zeborah raised the possibility of adding Lincoln University's theses to Wikidata. She had an opportunity to present to her academic librarian colleagues on adding thesis metadata to Wikidata at the online conference Aotearoa Institutional Repositories Community Days, 30 September – 1 October 2021. In preparation for this presentation she reached out to Giantflightlessbirds, who in turn invited Ambrosia10 and DrThneed to join the discussion. This group met several times to discuss the proposal of uploading all New Zealand academic theses to Wikidata and to prepare for the conference presentation.

Discussion documents, slides and other project documentation are being collated in Google Docs folders, as some of the participating academic librarians are not Wikimedians. Some of this documentation is linked from the documentation section of this page.

Scope

The intention is to collect metadata for theses from New Zealand universities and polytechnics, and to upload a core set of statements for each thesis in the first instance. Once this core set has been uploaded, there is potential for further work to increase the findability and linkage of the theses. For example, the data includes keywords, often from controlled vocabularies such as ANZSRC, which could be mapped to main subject statements. We also have data connecting theses to degree programmes and advisors.

A dataset of approximately 66,500 theses has been compiled from 13 New Zealand institutions. The theses range from diploma and bachelor's theses through to Doctor of Science, and span the period 1907 to 2022. While many of the theses are digitised and available through an institutional repository, others are represented only by their metadata. Because the data varies both within and between institutions, a lot of clean-up and standardisation is required. Deborah Fitchett has done significant work aggregating and collating the data in Excel, and DrThneed will clean it up in OpenRefine and upload it to Wikidata. There will likely be some problems to resolve with institutions where, for example, a thesis is held in more than one library and has been modelled differently by each.

Funding

The thesis dataset is large and complex, with 66.5k items in several languages. It includes some apparent duplicate items within and between institutions that need to be clarified with the academic librarians involved, and some incomplete data that may need follow-up. The inconsistencies in data format between institutions will take a lot of time to standardise and clean up. For instance, we have counted more than 50 ways of indicating in a title that a work is a thesis. These additions need to be removed so that the title of each thesis is as the author intended and makes a good citation.
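As an illustration of the title clean-up described above, here is a minimal Python sketch. It is not the project's actual workflow (the clean-up is planned for OpenRefine), and the three patterns and the example titles are hypothetical stand-ins for the 50-plus variants found in the real data.

```python
import re

# Hypothetical examples of thesis markers appended to titles.
# The project's real list of 50+ variants is not reproduced here.
THESIS_MARKER_PATTERNS = [
    # ": a thesis submitted in partial fulfilment of ..." and similar
    r"\s*[:;,.\-]?\s*\b(?:a|an)\s+(?:thesis|dissertation)\s+(?:submitted|presented)\b.*$",
    # trailing "[thesis]" or "(dissertation)"
    r"\s*[\[(]\s*(?:thesis|dissertation)\s*[\])]\s*\.?\s*$",
    # trailing "- PhD thesis", ": MSc thesis" and similar
    r"\s*[:;,.\-]\s*(?:ph\.?d\.?|m\.?sc\.?|masters?)\s+thesis\s*\.?\s*$",
]

COMPILED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in THESIS_MARKER_PATTERNS]


def clean_title(raw_title: str) -> str:
    """Remove trailing thesis markers so the title reads as the author wrote it."""
    title = raw_title.strip()
    for pattern in COMPILED_PATTERNS:
        title = pattern.sub("", title)
    return title.strip()


if __name__ == "__main__":
    examples = [
        "Soil carbon in Canterbury pastures : a thesis submitted in partial "
        "fulfilment of the requirements for the degree of Master of Science",
        "Kea foraging behaviour [Thesis]",
        "A history of Lincoln",
    ]
    for example in examples:
        print(clean_title(example))
```

Any real pattern list would need checking against the data, since an over-eager pattern could remove words that genuinely belong to a title.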

We estimate that the data cleaning, checking and upload to Wikidata will take approximately 200 hours of work by an experienced data wrangler. At an hourly rate of NZ$25 this amounts to NZ$5,000.

We are approaching Wikimedia Aotearoa New Zealand to support obtaining a contractor to complete this work.

Progress

Ambrosia10 and DrThneed used a small sample dataset to work on mapping the thesis data to Wikidata properties, and Ambrosia10 developed a Wikidata Cradle schema for an academic thesis in consultation with the other members of the group as well as the academic librarians contributing the data. This ontology will likely need to be modified during the project.

Zeborah undertook significant work collating and aggregating the data and was able to pass the dataset on to DrThneed at the beginning of March. DrThneed then spent time exploring the dataset and began a small trial upload of 116 theses to Wikidata, to test both the proposed workflow and the schema that had previously been created.

Feedback is being gathered from the participating institutions, and as at April 2022 DrThneed is continuing to prepare the dataset for upload to Wikidata. The upload of a core set of statements for the full thesis dataset is expected to be complete in May or June 2022.

A small team met before Christmas to work on ANZSRC vocabularies in Wikidata, which would be a useful prelude to uploading keywords to the theses items. Progress on the ANZSRC Mix'n'Matches has been slow but we intend to return to this work after upload of the core statements for the main dataset.

DrThneed has created a dashboard that measures edits to Wikidata items with the statement "on focus list of NZThesisProject".

Events

Documentation

Tools

DrThneed has made some Wikidata property dashboards to track progress on the project. They are linked from the Wikidata project page. One table shows properties for theses, and another shows properties for people (thesis authors). A third table shows some properties we don't expect to find, such as volume number and published in; this helps check that our thesis items haven't been inappropriately merged with other types of publications.

The Wikidata project page also contains a link to some Histropedia timelines, and some SPARQL queries to visualise the data, for example a map of where authors have been educated or employed, bubble charts of main subjects or author occupations, and links between advisors and students.
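To give a sense of what such a query might look like, here is a hedged Python sketch that builds a SPARQL query for a bubble chart of main subjects. It is not one of the queries from the project page. It assumes the focus-list statement uses property P5008 ("on focus list of Wikimedia project") and main subject uses P921; the project's own item ID is not given on this page, so the function takes it as a parameter and the example uses an obvious placeholder.

```python
def main_subject_bubble_query(project_qid: str, limit: int = 50) -> str:
    """Build a Wikidata Query Service query counting focus-list items per main subject.

    project_qid is the Wikidata item for the project; it must be supplied by
    the caller because it is not stated in this document.
    """
    if not (project_qid.startswith("Q") and project_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {project_qid!r}")
    return f"""#defaultView:BubbleChart
SELECT ?subject ?subjectLabel (COUNT(?thesis) AS ?theses) WHERE {{
  ?thesis wdt:P5008 wd:{project_qid} ;   # on focus list of Wikimedia project
          wdt:P921 ?subject .            # main subject
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?subject ?subjectLabel
ORDER BY DESC(?theses)
LIMIT {limit}
"""


if __name__ == "__main__":
    # "Q0000000" is a placeholder, not the real project item.
    print(main_subject_bubble_query("Q0000000"))
```

The printed text can be pasted into the Wikidata Query Service once the placeholder is replaced with the real project item.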

Tasks

If you would like to help, some easy tasks are making sure the theses are cited on relevant author Wikipedia pages, or matching authors to author name strings in the Mix'n'match tool.

Citing theses on Wikipedia

This Google Sheet shows theses by people who have Wikipedia pages (updated 23 March 2023). Unfortunately we have discovered that CiteQ is not currently helpful for citing theses, as the citations are not tracked by Altmetric, which makes the impact of all the work harder to see. We are therefore replacing CiteQ citations with the "cite thesis" template. To make this easier, the Google Sheet now contains the citation with ref tags, ready to paste into the Wikipedia page without any need for source editing. A 4-minute "how to" video has been uploaded to YouTube showing how to create a new citation or replace an existing one.

For reference purposes, here is the old Google Sheet.

Do you like working in other language Wikipedias?

This Google Sheet has a short list of thesis authors who do not have an English Wikipedia page, but do have one in another language (languages are shown in the last column). It would be great to cite the theses on those pages, so that non-English speakers can see the work exists. The first sheet in the file contains some instructions for how to go about this if you don't speak the language concerned. If you are fluent, you will of course find it much faster!

Mix'n'match

The Mix'n'match tool is a way to match the author name strings from the thesis project to authors on Wikidata. If you search Wikidata and do not find the author, try removing middle names, initials and so on. If you are sure the person is not in Wikidata, click the 'new' button to create an item for them. You may be able to find other identifiers to add to the new record, such as ORCID or ResearchGate. If they have a university profile page, you can add the university as an 'employer' statement and use their profile URL as the reference URL for the statement. You do NOT need to link the author and the thesis item. DrThneed will periodically download matches from the Mix'n'match catalogue and match the authors and theses, and also add other information such as advisors.
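The search tip above (dropping middle names and initials) can be sketched in Python as follows. This is only an illustration of the idea, not part of Mix'n'match or the project's tooling; the function name and the example name are made up.

```python
def name_variants(full_name: str) -> list[str]:
    """Return progressively shorter versions of a name to try when searching."""
    original = " ".join(full_name.split())      # normalise whitespace
    parts = original.replace(".", " ").split()  # "A." becomes "A"
    if len(parts) < 2:
        return [original]

    first, *middle, last = parts
    # Keep full middle names but drop single-letter initials.
    without_initials = " ".join([first] + [m for m in middle if len(m) > 1] + [last])
    # Drop everything between the first and last name.
    first_and_last = f"{first} {last}"

    variants = []
    for candidate in (original, without_initials, first_and_last):
        if candidate not in variants:           # skip duplicates, keep order
            variants.append(candidate)
    return variants


if __name__ == "__main__":
    # Hypothetical author name string.
    for variant in name_variants("Jane A. Mary Smith"):
        print(variant)
```

Shorter variants return more candidates, so each result still needs checking against dates, institutions or other identifiers before matching.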

If you are not familiar with the Mix'n'match tool, this screen capture shows how to match items, using the Alexander Turnbull Library catalogue as an example.

Participants

Outcomes and impact

October 2022 slides
