Scholar name data collection and creation


Metadata Elements notes metadata that should be collected about scholars or authors of journal articles.  This page explores how that data will be collected and managed.  It is the result of a literature search on scholar name data and name disambiguation, an examination of name schemas and data systems, informal interviews with key informants, and initial research by Linda MacAfee and Robert Martel.

One issue with scholar/author names is name disambiguation. More than one person will certainly share the same name in the Knowledge for All database, so it is important to have a means by which authors and their associated works can be identified and similar names can be disambiguated. Disambiguation will lead to more precise searching and make citation analysis possible. Various tools that facilitate name disambiguation are available or in development. The ideal tool would limit the names to scholars who publish, provide rich metadata about each scholar, provide unique name identifiers, and provide open data. Currently, it does not appear that such a resource exists. Current options are noted below.

The Open Researcher and Contributor ID (ORCID) project “aims to solve the author/contributor name ambiguity problem in scholarly communications by creating a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current author ID schemes.” It is a promising project but is still in the beta development stage, so it is uncertain whether this tool will meet Knowledge for All's needs for name disambiguation. We do not know yet what data will be provided for scholars, how that data will be made available, or how the tool will work. ORCID's About page states that in 2012 they plan to start charging fees to organizations that wish to use the service.

Friend of a Friend (FOAF) is a semantic RDF ontology used to describe people, their activities, and their relationships with objects and each other on the web.  It does not involve a centralized database of information about people but rather is a method by which that information is made available in a standardized format.  It may enable Knowledge for All to harvest scholar/author information from multiple sources.  However, it is not limited to the scholarly community.
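
As a rough illustration of what harvesting FOAF data might involve, the sketch below reads a small FOAF description in RDF/XML and pulls out each person's name and homepage. The sample document, the example.org addresses, and the harvest_people function are hypothetical and exist only for illustration. A real harvester would more likely use a full RDF library, since FOAF data can be serialized in several formats that this simple parser does not handle.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and FOAF.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# Hypothetical FOAF description (illustration only, not real data).
SAMPLE_FOAF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/people/jsmith#me">
    <foaf:name>Jane Smith</foaf:name>
    <foaf:homepage rdf:resource="http://example.org/people/jsmith"/>
  </foaf:Person>
</rdf:RDF>
"""


def harvest_people(rdf_xml: str) -> list[dict]:
    """Return one dict per foaf:Person element found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    people = []
    for person in root.iter(f"{{{FOAF_NS}}}Person"):
        homepage = person.find(f"{{{FOAF_NS}}}homepage")
        people.append({
            # rdf:about holds the URI that identifies the person.
            "uri": person.get(f"{{{RDF_NS}}}about"),
            "name": person.findtext(f"{{{FOAF_NS}}}name"),
            "homepage": (
                homepage.get(f"{{{RDF_NS}}}resource")
                if homepage is not None else None
            ),
        })
    return people


for entry in harvest_people(SAMPLE_FOAF):
    print(entry["name"], entry["uri"])
```

Because FOAF places no restriction on who is described, harvested records would still need to be filtered down to scholars who publish.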

ArXiv, an open e-prints archive, has an authority records system that assigns author identifiers in an attempt to disambiguate author names and enable retrieval of all publications by a particular author. The system has some limitations and is by no means comprehensive, but the data is open.

There are various closed-access databases of scholar names, such as Thomson Reuters ResearcherID and Scholar Universe. These tools could potentially be used by contributors with access through their institutions but are not reliable sources for Knowledge for All due to their closed status.

The Library of Congress Authority Files are the most established and widely used name authority records. They are free to search but can only be downloaded one at a time. Names are uniquely identified by adding biographical information, such as birth and death dates. Authority records are based mainly on book publications rather than journal publications, and so may not be adequate for Knowledge for All.

The Virtual International Authority File is a joint project of multiple libraries, implemented and hosted by OCLC, that aggregates library authority files and makes some attempt to disambiguate names.  It is currently available to organizations that apply and are accepted to be members, which requires contributing name authority data that meets certain criteria.  In the future the service will be freely available.

The International Standard Name Identifier (ISNI) is a standard for assigning unique identifying numbers to names, similar to the ISBN. Authors must register for a number and the database is not publicly searchable.

Knowledge for All could use one of these external resources to identify and disambiguate scholar names, or create its own internal system of author name disambiguation and unique identifiers within the database. This process is explored by Torvik and Smalheiser (2009).  In his article, “Metadata for Name Disambiguation and Collocation” (2010), Jeffrey Beall recommends collecting the following additional data for name disambiguation, as necessary:

  • Preferred or authorized form of the name
  • Other forms of the name, including earlier names, nicknames, pseudonyms, shortened or longer forms of the name, name in other languages or scripts, names associated with the person’s office
  • Birth date
  • Death date
  • Gender
  • Life events
  • Family
  • Works
  • Languages the person normally writes or creates in or the person’s native language(s)
  • Brief biography
  • Unique identifier
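
If Knowledge for All builds an internal system, the elements in the list above could be held in a record along the following lines. This is only a minimal sketch: the ScholarRecord class, its field names, the use of a random UUID as the unique identifier, and the needs_disambiguation helper are all assumptions made for illustration, not part of Beall's recommendation or of any existing Knowledge for All schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScholarRecord:
    """Hypothetical record holding the name metadata elements listed above."""
    preferred_name: str                               # preferred or authorized form
    other_names: list[str] = field(default_factory=list)  # earlier names, pseudonyms, etc.
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    gender: Optional[str] = None
    life_events: list[str] = field(default_factory=list)
    family: list[str] = field(default_factory=list)
    works: list[str] = field(default_factory=list)
    languages: list[str] = field(default_factory=list)
    biography: Optional[str] = None
    # Assumption: a random UUID stands in for whatever identifier scheme is chosen.
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)


def needs_disambiguation(a: ScholarRecord, b: ScholarRecord) -> bool:
    """True when two distinct records share the same preferred name (ignoring case)."""
    return (
        a.identifier != b.identifier
        and a.preferred_name.casefold() == b.preferred_name.casefold()
    )


# Two hypothetical scholars with the same name but different birth dates.
first = ScholarRecord(preferred_name="Smith, John", birth_date="1950")
second = ScholarRecord(preferred_name="Smith, John", birth_date="1982")
print(needs_disambiguation(first, second))
```

Only the preferred name is required here, which matches the "as necessary" qualification: the remaining elements would be filled in when two records cannot otherwise be told apart.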

Beall, Jeffrey. “Metadata for Name Disambiguation and Collocation.” Future Internet 2.1 (2010): 1-15. Retrieved 6 April 2011 from http://www.mdpi.com/1999-5903/2/1/1/

Torvik, Vetle I. and Neil R. Smalheiser. “Author Name Disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data 3.3 (July 1, 2009): 11.