CNI: Coalition for Networked Information

Towards Aiding Research by Improving Access to Electronic Theses and Dissertations from Multiple Domains


December 1, 2021

Jian Wu
Assistant Professor
Old Dominion University

Edward Fox
Professor
Virginia Polytechnic Institute and State University

Funded by the Institute of Museum and Library Services, Virginia Tech and Old Dominion University are collaborating on a project to bring computational access to book-length documents, demonstrating the process with electronic theses and dissertations (ETDs). Since the project launched, the team has made substantial progress on data acquisition, information extraction, and classification. The team has collected the largest corpus of ETDs, containing about 500,000 full-text documents and their metadata. The corpus was built by actively crawling the institutional ETD repositories of university libraries in the United States while honoring the crawling policies of the target websites.

To build robust text representations for downstream tasks, we investigated a language model specific to ETDs. This model, called ETDBERT, was built by fine-tuning Bidirectional Encoder Representations from Transformers (BERT) on a corpus of 300 million tokens extracted from a subset of the collected ETDs spanning 195 disciplines. ETDBERT was evaluated with intrinsic and extrinsic metrics and outperformed traditional text representations on a subject domain classification task. Compared with SciBERT, which was trained on a single Tensor Processing Unit (TPU) for seven days, ETDBERT requires far fewer training resources while achieving comparable performance on a subject domain classification task. We attribute this to the multi-disciplinary sampling of our training corpus.

The further improved language model we plan to build should help even more with tasks such as novelty measurement, automatic subject categorization, and long text summarization, allowing us to better understand the nuances of knowledge in ETDs and to provide robust and scalable related services.
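The abstract notes that the crawl honored the crawling policies of the target websites but does not say how. One common mechanism is a site's robots.txt file, and the sketch below shows how a crawler might check candidate URLs against it using Python's standard urllib.robotparser module. This is an illustration under stated assumptions, not the project's actual crawler: the robots.txt content, the "etd-crawler" user agent, and the repository.example.edu URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an ETD repository (illustrative only).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 5
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the policy permits this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


policy = build_policy(SAMPLE_ROBOTS_TXT)

# Hypothetical candidate URLs discovered during a crawl.
candidates = [
    "https://repository.example.edu/etd/12345",
    "https://repository.example.edu/admin/login",
    "https://repository.example.edu/search?q=thesis",
]

to_fetch = allowed_urls(policy, "etd-crawler", candidates)

# Seconds to wait between requests, or None if the site does not specify one.
delay = policy.crawl_delay("etd-crawler")
```

In a real crawl, the policy would be loaded from each repository's own robots.txt (for example with RobotFileParser.set_url followed by read), and the crawler would pause for the requested delay between requests.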

https://opening-etds.github.io/



Filed Under: CNI Fall 2021 Project Briefings, Electronic Theses & Dissertations (ETDs), Emerging Technologies, Project Briefing Pages
Tagged With: cni2021fall, Project Briefings & Plenary Sessions, Videos

Last updated: Monday, July 25th, 2022

 
