Libraries have been in the business of sharing book or monograph records for a long time, and extensive infrastructure exists to facilitate this process, from centralized record creation and distribution facilities to defined data standards and formats. Citation or journal article records, on the other hand, have largely been monopolized by research database vendors, who sell libraries access to searchable sets of records at high prices. Libraries' recent surge toward adopting “discovery layers” that allow users to search journal articles, monograph records, and other resources simultaneously has created an additional need for access to aggregate citation data, and vendors are capitalizing on this need by selling access to citation data collections at equally high prices.
While the need for libraries to have their own citation research databases and collections of citation data that they can use as they please is clear, having a centralized repository and record sharing system for journal article records is new. For ideas and models we can look to smaller scale citation data projects, reference management software tools and repositories, and broader digital collections such as institutional repositories. However, many questions need to be addressed, including what metadata standards and data elements will we use, how will we define the collection parameters, how will we generate the data, and how will we categorize and classify the records? The Content Planning section of the Planning and Development Research Report examines these questions. Recommendations are made based on information collected through a review of similar projects, literature searches, and informal interviews with key informants in the scholarly community.
This document was prepared by Amanda Stevens in April 2011, with assistance from students in Dalhousie University's School of Information Management's Digital Libraries class of fall 2010.
A formal collections policy for Knowledge for All should be developed and approved by the Board of Directors and/or appropriate Committees and members. The collections policy should delineate what will and won't be included in the Knowledge for All digital collection. Thus far it has been decided that the Knowledge for All collection will consist of journal and journal article metadata for all published scholarly journals, but within that are many grey areas which will be explored here. Recommendations are based on information found in library and information studies scholarly literature, publications by academic libraries, and interviews with researchers.
It is recommended that Knowledge for All adopt a collections policy that allows for gradual collections development and expansion, starting off with a narrow focus and outlining a schedule to expand over time. This would allow the collection to expand as the project becomes more established and acquires more resources.
The Collections Policy should define what is a scholarly journal in contrast to a book, magazine, grey literature, or other types of publications. Here is a working definition:
A scholarly journal is a publication that is published annually, semi-annually, quarterly, or monthly in print and/or electronic format. Its primary purpose is to report primary results of research or overviews of research results to other researchers. Articles are written by experts in the field who cite their sources. Scholarly journals exercise quality control on content, usually through a peer-review system.
One easy way to distinguish scholarly journals from other types of publications is to check whether they are peer-reviewed. However, there are journals which meet all other criteria but are not peer-reviewed and so could still be considered scholarly. The Knowledge for All community will need to decide whether to include non-peer-reviewed scholarly journals. A search in Ulrichsweb retrieves 28,693 active and "refereed" (peer-reviewed) journals. There is another option to search for "academic/scholarly" periodicals, which retrieves 47,191 publications. However, this list includes things like "1,012 GMAT Practice Questions" and "Allyn Museum Bulletin" so the numbers cannot be considered accurate for our purposes. We are in the process of gathering our own data about journals, but in the meantime we will use Ulrichsweb.
It is recommended that Knowledge for All develop a checklist for defining a scholarly journal.
By this definition, the following types of publications would not be included in the Knowledge for All collection:
However, pre-prints will be included in the collection in order to considerably increase accessibility to free, full-text versions of articles. And it may be desirable or advantageous to include other publication types noted above in the Knowledge for All database in the future, as the project grows or as requested by the community. It should also be acknowledged that scholarly research is constantly evolving, with new forms of scholarly publication appearing, such as blog posts. The Knowledge for All Collections Policy should be frequently revisited and adapted based on the fluid nature of scholarly publishing and the needs of the community.
Within some disciplines there are special types of research sources which may not strictly fit the definition of scholarly journal but could be considered for the Knowledge for All collection because of their importance to researchers in those disciplines. Further consultation is needed with subject experts to identify these sources and determine the importance of these publications. Knowledge for All will then need to consider what would be required to accommodate inclusion of these sources in the collection and then determine the best approach. Here is a working list of these publications and their disciplines:
Additional researchers should be consulted to identify other special types of publications relevant to specific disciplines.
It is recommended that we begin with an initial list of journals that fit into the standard definition of scholarly journal and add other journals and types of publications if recommended by the community. A community consultation process will be developed that facilitates making these decisions.
A major strength of the Knowledge for All citation database is that it will allow users to search across disciplines, and so the aim is to include journals from all subject areas, including scientific, technical, and medical; law; social sciences; humanities; and fine arts.
During the initial stages of building the Knowledge for All community, we may not have contributors with the expertise to index in all subject areas, or we may not have access to full-text journals in all subject areas. Thus, it may be necessary to limit by subject area during initial pilot and development phases.
Until the 1990s almost all journals were published in print format and article-level metadata was available in printed indexes. The first electronic journals began appearing in the late 1980s (Langschied, 1991), but did not become significant until the 1990s, when there was rapid growth. In 1991 there were 110 peer-reviewed electronic journals and by 1997 there were approximately 1,049 (Chan, 1999). A search for “refereed” and “online” active journals in Ulrichsweb, a comprehensive periodical index, yields 21,610 results while a search for just “refereed” active journals yields 28,693 results. Thus, approximately 75% of current peer-reviewed scholarly journals are published electronically. Many older print journals and their metadata are now available electronically. It is recommended that Knowledge for All include both print and electronic journals in its collection. However, we could choose to focus on electronic journals if metadata for these journals is more readily available.
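The 75% estimate can be checked directly from the two Ulrichsweb counts quoted above. This is only a minimal arithmetic sketch; the variable names are illustrative.

```python
# Counts reported from the Ulrichsweb searches quoted in the text.
refereed_active = 28_693          # "refereed" active journals
refereed_active_online = 21_610   # "refereed" and "online" active journals

# Share of active refereed journals that are available online.
online_share = refereed_active_online / refereed_active
print(f"{online_share:.1%}")  # prints 75.3%
```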
Harvestable metadata could be more readily available for some journals for a number of reasons, and the Collections Policy could specify that these journals are collected first. This could include focusing on free and open access journals. The Collections Policy could also favour free and open access journals under the assumption that users will prefer to be able to link to full-text not dependent on access via subscription.
The first peer-reviewed journals were published in 1665. In the 19th century there was an explosion in the number of journals produced, caused by the increased specialisation and diversification of academic research and by inexpensive mass publication on cheap wood-pulp paper. Another growth period occurred post-WWII, when commercial publishers began to take up journal publishing. In 1962 it was estimated there were around 30,000 scientific and technical journals (Bourne, 1962). Data regarding the number of journals published during different time periods will be collected. As Knowledge for All aims to index all published scholarly journal literature, it will not exclude older publications. However, due to data availability issues, the resources required to index past literature, and the prioritization of research published in the last ten years for most disciplines, it is recommended that we focus first on current and recently published research, then work our way back.
It is recommended that Knowledge for All aim to include scholarly journals published in every language in its collection in order to truly be an international project. Indeed, providing localized international access to multilingual resources could be a unique advantage of Knowledge for All over other databases. A search in Ulrichsweb finds there are 25,528 refereed periodicals published in English and 3,165 that are not published in English. Some of the English language journals are also published in other languages or contain some text in other languages, but it is impossible to determine how many using Ulrichsweb. The number of non-English scholarly journals may increase if we broaden the definition of scholarly beyond peer-reviewed.
However, in order to include non-English scholarly journals for non-English speakers in the Knowledge for All collection, we will need to have:
Internationalization of Knowledge for All will be discussed further on the Internationalization page and in the Technology Plan.
It is recommended that we focus first on collecting English-language journals until the project is established enough internationally to include non-English language journals.
A final option for gradual collection development is to focus initially on high impact journals.
Bourne, Charles P. “The World's Technical Journal Literature: An Estimate of Volume, Origin, Language, Field, Indexing, and Abstracting.” American Documentation (April 1962): 159-168.
Biblarz, Dora. Guidelines for a Collection Development Policy Using the Conspectus Model. International Federation of Library Associations and Institutions - Section on Acquisition and Collection Development (2001). Retrieved from http://www.ifla.org/VII/s14.
Chan, Lisa. “Electronic journals and academic libraries.” Library Hi Tech 17.1 (1999): 10-16.
Langschied, Linda. "The changing shape of the electronic journal." Serials Review 17.3 (Fall 1991): 7-13.
As detailed in Collections policy recommendations, the end goal of Knowledge for All is to collect journal article metadata for all current and past scholarly journals in all subject areas and languages. Metadata elements details specific data elements needed for different types of content in the database. This document identifies and analyzes different methods for collecting and generating past and current journal article metadata. These ideas were generated through reading discussions on relevant listservs; informal interviews with librarians, researchers, and developers; extensive Internet searching for metadata repositories; and initial research conducted by Carly Currie, Mary Zazelenchuk, Andrea Crabbe, and Alyssa Graybeal.
Most of the journal article data needed is factual data (such as title, author, year), which, as discussed in Copyright of journal article metadata, can be more easily harvested from existing collections of metadata without violating copyright compared to subject terms and abstracts, which are subject to copyright as literary works. Below I identify potential sources of factual journal article metadata and methods for collecting or generating factual journal article metadata. The accompanying Metadata Sources document lists specific collections of journal article metadata that could be potentially harvested or acquired for the Knowledge for All database. These are by no means exhaustive, but they give a sense of what is available and what to consider in selecting methods and sources. They are categorized by type of metadata source, which corresponds with the categories noted here. I address subject indexing and abstracts in a separate document.
An important partner and resource in collecting journal article data is the Open Knowledge Foundation's Open Bibliographic Data Working Group, which is making different kinds of bibliographic data open and harvestable. It is recommended that Knowledge for All work closely with this group to share strategies and resources.
Summary of general recommendations for metadata collection and creation:
Publishers may be willing to provide their journal article metadata to Knowledge for All as a means of publicizing their content. Willing publishers could enter into a data sharing agreement with Knowledge for All and either upload their data into the system as it becomes available or have their data automatically harvested at regularly scheduled intervals. We could request both current and past data from publishers.
This option may appeal more to open access publishers and non-profit publishers. Some open access publishers, such as Public Library of Science, have already been contacted and have expressed a desire to share their data. Other open access and non-profit publishers should be approached. Commercial publishers should also be approached, especially smaller ones, and possibly when Knowledge for All has reached a more advanced stage of development and popularity.
Another group of publishers that may be interested in providing their journal article metadata is organizations that publish journals in non-Western countries and languages other than English, as this group may feel neglected by commercial publishers and want a new way to publicize their content.
Publishers are not included in the Metadata Sources document, with the exception of Public Library of Science and BioMed Central, because there are so many.
Institutional repositories (IRs), or collections of research articles, theses, and dissertations by university faculty and students at a particular institution, are becoming increasingly common at universities. They are usually managed by university libraries and sometimes accompanied by open access policies that require faculty to deposit copies of all of their published works in the IR. There are also online repositories of eprints, or digital versions of research articles, that are not restricted to a particular institution and are usually subject-specific, where authors are encouraged to deposit copies of their articles. Articles deposited in IRs and eprint archives are often pre-prints, or first drafts that have not yet undergone the peer-review editing process. This is due to copyright issues, although some journals also allow authors to deposit post-print versions of articles in repositories. The Self-Archiving FAQ for the Budapest Open Access Initiative (BOAI) states that 68% of journals allow self-archiving of post-print articles, while 32% do not. Journal policies for self-archiving can be searched in the Sherpa Romeo site.
Most institutional and eprint repositories follow an open access model and so would likely be willing to provide Knowledge for All with their metadata. In fact, their data is often made available via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). IRs are not included in the Metadata Sources list because there are too many of them, but subject-specific repositories are. The Canadian Association of Research Libraries (CARL) Institutional Repositories Pilot Project Harvester is a search tool for searching across IR content from participating Canadian institutions, and could be a useful tool for locating IRs, as well as potentially harvesting metadata. OpenDOAR is a directory of academic open access repositories with full text resources. OpenDOAR can be used to search repository contents as well as repositories but not to harvest metadata as it only searches Google indexes.
One problem with this source and method is that articles would be acquired individually, so it may be challenging to collect complete journal holdings. This method would need to be used alongside other methods. Also, records may lack complete metadata relating to their publication in journals, and this additional metadata may need to be collected separately.
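As a rough illustration of what OAI-PMH harvesting could involve, the sketch below builds a ListRecords request URL and pulls Dublin Core fields out of a response. The repository address and the sample response are invented for illustration, and a real harvester would also need to fetch the URL, follow resumption tokens, and handle errors, none of which is shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Collect the Dublin Core elements of each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        identifier = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        dc = {}
        for el in rec.iter():
            if el.tag.startswith(f"{{{DC_NS}}}"):
                name = el.tag.split("}", 1)[1]
                dc.setdefault(name, []).append((el.text or "").strip())
        records.append({"oai_identifier": identifier, "dc": dc})
    return records


# Invented sample response; a real one would come from the repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Article</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>2010</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

Keeping every Dublin Core element as a list reflects the fact that elements such as creator can repeat within one record.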
The fact that these articles may be pre-prints necessitates having a field that indicates the version of the article and ensuring that the record links to the appropriate full-text version of the article. If an article record is obtained for a post-print version of the same article, the two records should be linked.
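One way to picture the version field and the linking of pre-print and post-print records is sketched below. The class name, field names, and version labels are assumptions made for illustration, not a settled design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleRecord:
    record_id: str
    title: str
    version: str                         # e.g. "pre-print" or "post-print" (labels assumed)
    full_text_url: Optional[str] = None  # should point at this version's full text
    related_versions: set = field(default_factory=set)


def link_versions(a, b):
    """Record that two records describe versions of the same article."""
    a.related_versions.add(b.record_id)
    b.related_versions.add(a.record_id)


pre = ArticleRecord("rec-1", "An Example Article", "pre-print",
                    "https://repository.example.org/123.pdf")
post = ArticleRecord("rec-2", "An Example Article", "post-print")
link_versions(pre, post)
print(pre.related_versions, post.related_versions)
```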
Many journals provide a free table of contents (TOC) subscription service in which subscribers receive the table of contents for upcoming issues of journals via e-mail or RSS feed. There are also services such as JournalTOCs or ticTOCs which provide a search interface for journal TOCs and a means of subscribing to multiple journal TOCs at once. This could be a means of acquiring current but not past journal article data. Some TOC services are provided via e-mail while others are available via RSS feed. RSS feeds are in XML, and so the data could easily be imported into the Knowledge for All database. TOC services are included in the Metadata Sources document.
One issue is that the quality of data and the schemas of TOCs will vary among publishers. In terms of copyright issues, it may depend on which TOC service is used. An administrator of the ticTOCs Journal Tables of Contents Service stated, "ticTOCs merely directs you to the publisher's feeds, therefore any questions about re-using the content of publishers' feeds for an OA citation database should be directed at the individual publishers," whereas JournalTOCs has an API that provides free access to the metadata that has been collected by JournalTOCs. The JournalTOCs API is a very promising avenue to collect journal article data as well as journal data and should be investigated further.
Numerous free, non-commercial online citation databases, like Knowledge for All, already exist. These primarily index journals in specific subject areas and have been created and maintained by scholars that specialize in those subjects, such as Latin American Periodicals Tables of Contents, and databases that index open access journals, such as the Directory of Open Access Journals (DOAJ). Many of the subject-specific databases are called “table of contents” services by their creators or even have the words “table of contents” or “TOC” in their names, but they differ from TOC services like ticTOCs in that they provide past as well as current journal article metadata. While the databases of open access journals are fairly sophisticated and follow current data standards, there is a lot of variation in the type of data and format of subject-specific databases, and the usability of the data varies. Citation databases are included in the Metadata Sources list.
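To show why XML feeds are straightforward to ingest, here is a minimal sketch that reads article titles and links from a TOC feed. It assumes a plain RSS 2.0 layout and uses an invented feed; actual publisher feeds may use other RSS flavours and extra namespaces, so the element paths would need adjusting per source.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed standing in for a journal's TOC feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal of Examples</title>
    <item>
      <title>First Example Article</title>
      <link>https://journal.example.org/articles/1</link>
    </item>
    <item>
      <title>Second Example Article</title>
      <link>https://journal.example.org/articles/2</link>
    </item>
  </channel>
</rss>"""


def toc_entries(feed_xml):
    """Return one dict per feed item with journal, title, and link."""
    channel = ET.fromstring(feed_xml).find("channel")
    journal = channel.findtext("title", default="")
    return [
        {
            "journal": journal,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in channel.findall("item")
    ]


for entry in toc_entries(SAMPLE_FEED):
    print(entry)
```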
Citation databases present an important opportunity to Knowledge for All on multiple fronts. They make good potential partners due to sharing similar values and goals – both as metadata sources and indexers. Citation databases were likely created because certain subjects are not well represented in commercial databases or their creators want to provide free and open access to research. When contacting these organizations about partnerships, Knowledge for All should emphasize that it aims to be comprehensive and provide free and open access to research. Knowledge for All can offer a better interface and search features, more complete data, a larger community of contributors, and more comprehensive content than most of these sources. Some citation databases have already been contacted and expressed willingness to contribute their data to Knowledge for All. Some have mentioned challenges with maintaining their own systems.
A concern with citation databases is that data from some sources is not in an easily harvestable or digestible format. For example, the LINGUIST citation database only has citation data in HTML. In some cases we will need to determine whether the time and effort it will take to harvest data in a challenging format is worth it. The organizations that maintain citation databases may have limited resources and expertise to provide their data in other ways.
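The effort involved in harvesting HTML-only data can be gauged from a small sketch like the one below. The markup is hypothetical (citations in list items with a "citation" class) and does not describe any particular database's pages; each source would need its own parsing rules, and the output is still an unstructured citation string that needs further splitting into fields.

```python
from html.parser import HTMLParser


class CitationListParser(HTMLParser):
    """Collect the text of <li class="citation"> items (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.citations = []
        self._buffer = None  # holds text pieces while inside a citation item

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "citation") in attrs:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            self.citations.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


SAMPLE_HTML = """<ul>
  <li class="citation">Doe, J. (2009). <i>An example title</i>. Example Journal 1(2).</li>
  <li>Not a citation</li>
</ul>"""

parser = CitationListParser()
parser.feed(SAMPLE_HTML)
print(parser.citations)
```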
In order to offer full-text searching to users, Knowledge for All will aim to acquire PDF files of full-text journal articles to extract and index the full text without actually storing the PDF files or making them available to users. This method could also be used to harvest citations or references in articles. In addition, journal article metadata could be extracted from PDF files using Zotero or a similar tool or process, then added to the Knowledge for All database. Contributors could upload PDF files for extraction, but it is recommended that a further process be created to upload and harvest PDF files in bulk if this is selected as a primary method of collecting metadata. Metadata extracted from PDF files is not always accurate and will need to be edited.
Use of this method depends on having access to PDF files for all articles, which then depends on having an established community of contributors with access to full-text PDF files. They, in turn, will need to ensure their institutional licenses do not prohibit metadata harvesting of PDF files for journal articles. Another issue is that PDF versions of articles may not exist or be available for all older journal content. In terms of full-text searchability, Knowledge for All may not be able to provide full-text searching for all articles in the database but could aim to provide it over time.
Reference management software, which is widely used by scholars, allows people to build personal citation databases for the purpose of storing and accessing research literature and easily generating bibliographies. Some reference management tools additionally create centralized databases of user-added citations. Thus, Knowledge for All could appeal to both individual users and organizations that manage these tools to contribute their citation databases to the Knowledge for All project. Infrastructure could be created that would allow ongoing sharing of records added to centralized reference databases or personal reference databases with Knowledge for All. In return, Knowledge for All could offer the following:
Zotero, a popular open-source reference management tool maintained by a non-profit organization, is a promising potential partner that has shown interest in a similar project through its partnership with the Internet Archive. Zotero, Mendeley, and CiteULike are included in the Metadata Sources list, but the latter two are less likely partners due to being closed source and operated by private companies.
One issue with using this source of metadata is varied metadata quality and the need to sort, combine, and edit many duplicate records. As with institutional repositories and eprint archives, journal holdings will not be complete. The Knowledge for All system would need to be able to ingest data in various reference management formats, including RIS, BibTeX, and Zotero RDF.
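The ingest and de-duplication problem can be sketched for one of these formats. The example below parses RIS text into records and collapses records that share a normalized title and year. The matching rule is a deliberately crude assumption; a production system would likely compare more fields (for example DOI, authors, or journal) and keep the best data from each duplicate rather than simply keeping the first record seen.

```python
import re

# RIS lines look like "TY  - JOUR": a two-character tag, two spaces, a hyphen.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of {tag: [values]} dicts."""
    records, current = [], None
    for line in text.splitlines():
        match = RIS_LINE.match(line.strip())
        if not match:
            continue
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":            # start of a record
            current = {"TY": [value]}
        elif tag == "ER":          # end of a record
            if current is not None:
                records.append(current)
            current = None
        elif current is not None:
            current.setdefault(tag, []).append(value)
    return records


def dedupe_key(record):
    """Crude matching key (an assumption): normalized title plus year."""
    title = (record.get("TI") or record.get("T1") or [""])[0]
    year = (record.get("PY") or record.get("Y1") or [""])[0][:4]
    return (re.sub(r"[^a-z0-9]+", " ", title.lower()).strip(), year)


def merge_duplicates(records):
    """Keep the first record seen for each matching key."""
    merged = {}
    for rec in records:
        merged.setdefault(dedupe_key(rec), rec)
    return list(merged.values())


SAMPLE_RIS = """TY  - JOUR
AU  - Doe, Jane
TI  - An Example Article
PY  - 2010
ER  -
TY  - JOUR
AU  - Doe, J.
T1  - An example article.
Y1  - 2010/03/01
ER  -
"""

parsed = parse_ris(SAMPLE_RIS)
print(len(parsed), len(merge_duplicates(parsed)))
```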
It would also be possible to design Knowledge for All so that masses of users could upload data in the style of a reference management software project, but this might be a redundant effort compared to partnering with or utilizing an existing project.
The least desirable method of collecting journal article metadata is for contributors to enter it by hand, since this would take the most time and effort. However, it may be necessary for some articles and journals where we are not able to acquire the data using any of the other methods or sources noted here.
Even when using the automated metadata collection methods above, there will inevitably be missing data to be added and errors to correct. One thing that will distinguish Knowledge for All from other citation search tools is its high quality metadata, and so having robust quality control and editing processes is essential. A large number of volunteer editors will be needed to carry out this work. Clear and precise data standards should be created, agreed upon, maintained, and distributed to editors to ensure consistent and high quality metadata.
After creating a collections policy for Knowledge for All that defines what types of journals will be collected, Knowledge for All will need to develop a comprehensive list of all past and current scholarly journals that will be included in its collection. The following data should be collected for each journal:
This information is needed for estimating resources required, planning workflows, and determining how and where journal article metadata can be collected. Knowing how many journals are published, how often, in which subject areas, and in which languages, as well as numbers of past journals that need to be indexed, will help determine how many volunteer indexers and editors are needed, what subject and language expertise they must have, how much time it will take to collect data, and which metadata sources can be used. In addition, journal data is an essential component of the metadata needed in the operational Knowledge for All database. Ulrichsweb was used to gather some of this aggregate data to make collections policy recommendations, but more complete data is needed – particularly as we finish planning and begin operations.
Some preliminary work has been done to gather the journal data, but this document will mainly explain how further data will be collected.
Many of the sources listed in Metadata Sources also contain journal data, but in varied forms. Some, such as the Directory of Open Access Journals, provide a file of journal data in CSV format that is easy to harvest while others simply have HTML pages of journal lists. Data in easily harvested formats will be collected first, with other sources being used to fill in gaps if necessary. Additional sources that contain only journal data have been located through Internet searching; these are not currently listed in Metadata Sources. The Open Knowledge Foundation's Open Bibliographic Data Working Group is another source of journal title data that is constantly growing.
The source of journal data which is most comprehensive, in that it probably covers all titles published, is Simon Fraser University's (SFU) CUFTS Knowledgebase, a journal database used primarily for libraries to link to their full-text holdings. SFU has provided its Knowledgebase files to Knowledge for All, but the data is dispersed over 371 files that include duplicate entries and non-journals, and not every file has the same fields (although field names are used consistently across files).
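Consolidating many files with overlapping entries and differing field sets could be approached along the lines of the sketch below. The tab-delimited layout and the field names (title, issn, publisher) are assumptions for illustration and are not taken from the actual Knowledgebase files; filtering out non-journals is not attempted here.

```python
import csv
import io


def merge_journal_files(file_objects, key_field="issn"):
    """Merge rows from several delimited files into one dict keyed by ISSN.

    Files may have different columns; the first non-empty value seen for
    each field is kept. Rows without a key are skipped in this sketch.
    """
    journals = {}
    for fh in file_objects:
        for row in csv.DictReader(fh, delimiter="\t"):
            key = (row.get(key_field) or "").replace("-", "").strip().upper()
            if not key:
                continue
            merged = journals.setdefault(key, {})
            for name, value in row.items():
                if name is None:
                    continue  # extra unnamed columns are ignored
                if value and not merged.get(name):
                    merged[name] = value.strip()
    return journals


# Two invented files with different columns and one overlapping journal.
FILE_A = "title\tissn\nExample Journal\t1234-5678\n"
FILE_B = (
    "title\tissn\tpublisher\n"
    "Example Journal\t12345678\tExample Press\n"
    "Another Journal\t8765-4321\t\n"
)

journals = merge_journal_files([io.StringIO(FILE_A), io.StringIO(FILE_B)])
for issn, data in journals.items():
    print(issn, data)
```

Keying on a normalized ISSN is one plausible choice; journals without an ISSN would need a fallback such as normalized title matching plus manual review.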
After considering the above information and consulting with others, it has been determined that the following steps should be taken to collect journal data:
Knowing what elements of data will be included and managed in the Knowledge for All system is essential for designing and developing the technological infrastructure and approaching content collection and creation. Metadata elements have been identified through examining existing citation databases in a variety of subject areas and reading scholarly literature about metadata in digital collections.
The Knowledge for All database will contain data about three main types of content: journals, journal articles, and scholars/authors. These content types will be related, as journal articles will be part of journals and scholars will be linked with journal articles through the author/creator field. A journal issue content type could also be used to link articles to journals. Alternatively, journal issue information could be included in every article record, but it may be desirable to have a separate journal issue content type in order to minimize data needed for journal article records and organize indexing workflows. Another significant node type in the Knowledge for All system will be contributors (indexers, editors, developers, etc.), but that will be addressed in the Contributors and Workflows section of the planning documentation. There may be considerable overlap between scholars and contributors.
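The relationships among these content types, including the optional journal issue type, might be modelled roughly as follows. All class and field names are illustrative assumptions and cover only the linking fields, not the full set of data elements.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    journal_id: str
    title: str


@dataclass
class JournalIssue:
    issue_id: str
    journal_id: str      # links the issue to its journal
    volume: str
    number: str
    year: int


@dataclass
class Scholar:
    scholar_id: str
    name: str


@dataclass
class JournalArticle:
    article_id: str
    issue_id: str        # links the article to a journal via its issue
    title: str
    author_ids: list = field(default_factory=list)  # links to scholars


def journal_for_article(article, issues, journals):
    """Resolve an article's journal by going through its issue."""
    issue = issues[article.issue_id]
    return journals[issue.journal_id]


journals = {"j1": Journal("j1", "Example Journal")}
issues = {"i1": JournalIssue("i1", "j1", "17", "1", 1999)}
scholars = {"s1": Scholar("s1", "Doe, Jane")}
article = JournalArticle("a1", "i1", "An Example Article", ["s1"])

print(journal_for_article(article, issues, journals).title)
```

With a separate issue type, volume, number, and year are stored once per issue instead of being repeated in every article record.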
Below I have made initial recommendations for data elements needed for each type of content. These elements are not mapped to any particular metadata schema.
Journal:
Journal Issue:
Journal article:
Some subject-specific databases include other metadata elements which would largely only be relevant within that subject. These include classification, population, location, age group, tests and measures, grant information, and methodology for psychology; type of literature, time period, subject author, subject work, literary theme, literary genre, and media for literature; and study design, place of study, period of study, materials, methods, edited by, and reviewed by for medicine. We intend to include these in the Knowledge for All system to allow for highly refined searching within disciplines. However, the fields will only be searchable if a user is searching within a particular discipline rather than doing a general, interdisciplinary search, and if the data for specialized fields is not available elsewhere volunteer indexers will need to create original data.
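The rule that specialized fields are searchable only within a discipline could be expressed as a simple lookup, as in this sketch. The general field set and the selection of discipline fields shown are partial and chosen for illustration only.

```python
# Fields available in any search (an illustrative subset).
GENERAL_FIELDS = {"title", "author", "year"}

# A few of the discipline-specific fields named in the text.
DISCIPLINE_FIELDS = {
    "psychology": {"population", "age group", "methodology"},
    "literature": {"literary theme", "literary genre", "time period"},
    "medicine": {"study design", "place of study", "period of study"},
}


def searchable_fields(discipline=None):
    """Return the fields a user may search, given an optional discipline."""
    return GENERAL_FIELDS | DISCIPLINE_FIELDS.get(discipline, set())


print(sorted(searchable_fields()))
print(sorted(searchable_fields("psychology")))
```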
An article's list of citations or its bibliography is not a necessary element but ideally it will be included to allow for citation analysis. As discussed in Legal Issues, citations may be protected under copyright.
Administrative metadata
Additional administrative metadata which could be included for the above content types include:
Other administrative metadata elements will likely be added as the system's technical infrastructure and workflows are developed.
As the Knowledge for All database will not actually include digitized objects, metadata related to preservation of digital objects is not relevant.
Scholar names/Personal authors:
Other data possibly needed for name disambiguation (discussed further in Scholar Name Data Collection and Creation):
In selecting metadata schemas or standards to use for the Knowledge for All citation database, it is important to consider the information or metadata elements needed for different types of content in the database (outlined in Metadata Elements), standards used by sources that metadata will be harvested from (noted in Metadata Sources), standards used by institutions that will be harvesting data from Knowledge for All, and standards used by other tools that may be used in the data harvesting, creation, and editing process. There are many different metadata schemas for bibliographic description in use by different institutions and standards are always changing, so interoperability is the key as opposed to conforming strictly to one particular standard. This process will inevitably evolve as workflows are developed, data harvesting methods are determined, and technical infrastructure is designed. Here are some initial recommendations. In addition to considering standards and schemas used by other repositories, I reviewed research on metadata standards used in digital libraries and specifically metadata standards used for electronic journal article data. Initial research by Vanessa Black and Rebecca Prescott was also utilized.
There is no standard metadata schema for journal article records. The most common bibliographic metadata schema for digital materials in general is Dublin Core. Although Dublin Core is widely recognized as insufficient for describing journal article metadata due to its simplicity and limited number of elements, many institutions still use Dublin Core because of its interoperability and wide use. The Dublin Core Metadata Initiative Citation Working Group analyzed this issue and published Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata. Many institutions which have chosen to use Dublin Core for journal articles have also documented how they adapted it to meet their needs (see references). As stated by Apps and MacIntyre (2002), “Dublin Core should remain a ‘core’ set of metadata elements, with domain-specific metadata recorded according to more complex standards, whether extensions to Dublin Core or separate standards.” In addition, Dublin Core is the metadata standard required by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which itself is quickly becoming the standard through which institutions make their metadata available. It is recommended that Knowledge for All use a form of Dublin Core or incorporate Dublin Core elements into its metadata schema.
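To make the Dublin Core recommendation more concrete, here is a minimal sketch in Python of how one journal article might be expressed with Dublin Core elements. It is an illustration under stated assumptions, not a proposed Knowledge for All schema: the `record` wrapper element, the `article_to_dc` function, and the dictionary keys are hypothetical, and the sample data is taken from the Beall (2010) reference cited on this page. The journal, volume, issue, and page details, which simple Dublin Core cannot hold in separate elements, are packed into a single `dcterms:bibliographicCitation` string.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for simple Dublin Core and DCMI Metadata Terms.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Sample article, taken from the Beall (2010) reference on this page.
SAMPLE = {
    "title": "Metadata for Name Disambiguation and Collocation",
    "authors": ["Beall, Jeffrey"],
    "journal": "Future Internet",
    "volume": "2",
    "issue": "1",
    "year": "2010",
    "pages": "1-15",
}


def article_to_dc(article):
    """Build a Dublin Core XML element tree for one article.

    The <record> wrapper and the dictionary keys are hypothetical; only the
    dc: and dcterms: element names come from Dublin Core itself.
    """
    ET.register_namespace("dc", DC)
    ET.register_namespace("dcterms", DCTERMS)

    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}title").text = article["title"]
    for name in article["authors"]:
        # One dc:creator element per author.
        ET.SubElement(record, f"{{{DC}}}creator").text = name
    ET.SubElement(record, f"{{{DC}}}date").text = article["year"]
    ET.SubElement(record, f"{{{DC}}}type").text = "Text"

    # Journal, volume, issue, year, and pages share one citation string.
    citation = (
        f'{article["journal"]} {article["volume"]}.{article["issue"]} '
        f'({article["year"]}): {article["pages"]}'
    )
    ET.SubElement(record, f"{{{DCTERMS}}}bibliographicCitation").text = citation
    return record


if __name__ == "__main__":
    print(ET.tostring(article_to_dc(SAMPLE), encoding="unicode"))
```

The single citation string shows why Dublin Core alone is often considered too coarse for article records: volume, issue, and pages cannot be searched separately unless they are also stored in a richer schema.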
Metadata Object Description Schema (MODS) is another important metadata schema for descriptive bibliographic data because of its wide use and compatibility with MARC, the data format in which libraries exchange bibliographic records. Like Dublin Core, it is built on an XML foundation. Although not widely used for journal article metadata, MODS is a more complex and specific schema than Dublin Core and can be used in conjunction with Dublin Core for more specificity and granularity.
The Open Knowledge Foundation's Open Bibliographic Data project has developed BibJSON, a simple description of how to represent bibliographic metadata in the JSON format, and is using this schema for its open bibliographic data. This option should be examined further.
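As a rough illustration of what a JSON representation of an article record could look like, the Python sketch below builds a BibJSON-style dictionary and serializes it. The key names (`type`, `title`, `author`, `journal`, `year`, `volume`, `number`, `pages`) are modeled loosely on BibJSON conventions and should be treated as assumptions to be checked against the current BibJSON description; the function name `to_bibjson_like` is hypothetical. The sample data comes from the Torvik and Smalheiser (2009) reference cited on this page.

```python
import json


def to_bibjson_like(title, authors, journal, year,
                    volume=None, number=None, pages=None):
    """Return a BibJSON-style dict for one article.

    Key names are assumptions modeled on BibJSON conventions, not a
    verified rendering of the specification.
    """
    record = {
        "type": "article",
        "title": title,
        # Authors are objects rather than plain strings so that identifiers
        # or affiliations could be attached to each one later.
        "author": [{"name": name} for name in authors],
        "journal": {"name": journal},
        "year": str(year),
    }
    # Optional fields are only written when a value is known.
    for key, value in (("volume", volume), ("number", number), ("pages", pages)):
        if value is not None:
            record[key] = str(value)
    return record


if __name__ == "__main__":
    record = to_bibjson_like(
        title="Author Name Disambiguation in MEDLINE",
        authors=["Vetle I. Torvik", "Neil R. Smalheiser"],
        journal="ACM Transactions on Knowledge Discovery from Data",
        year=2009,
        volume=3,
        number=3,
    )
    print(json.dumps(record, indent=2))
```

Because the record is plain JSON, it can be exchanged with web applications without an XML toolchain, which is part of what makes this option worth examining.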
Metadata Encoding and Transmission Standard (METS) is a standard for encoding descriptive, administrative, and structural metadata that could be used to bundle multiple metadata sets together.
Scholar name authority files should follow the Metadata Authority Description Schema (MADS) and the Friend of a Friend (FOAF) ontology so as to be compatible with MARC and easily used by other systems.
Allinson, Julie, Pete Johnston, and Andy Powell. “A Dublin Core Application Profile for Scholarly Works.” Ariadne 50 (January 2007). Retrieved 8 April 2011 from http://www.ariadne.ac.uk/issue50/allinson-et-al/.
Apps, Ann and Ross MacIntyre. “Dublin Core Metadata for Electronic Journals.” Lecture Notes in Computer Science 1923 (2000): 93-102. Retrieved 31 March 2011 from http://eprints.rclis.org/bitstream/10760/12183/1/appsmacecdl2000_full.html
Apps, Ann and Ross MacIntyre. “zetoc: a Dublin Core Based Current Awareness Service.” Journal of Digital Information 2 (2002). Retrieved 31 March 2011 from http://journals.tdl.org/jodi/article/viewArticle/39
Dappert, Angela and Markus Enders. “Using METS, PREMIS and MODS for Archiving eJournals.” D-Lib Magazine 14.9/10 (September/October 2008). Retrieved 4 April 2011 from http://www.dlib.org/dlib/september08/dappert/09dappert.html
Directory of Open Access Journals' XML schema for journal articles
Metadata Elements notes metadata that should be collected about scholars or authors of journal articles. This page explores how that data will be collected and managed. It is the result of a literature search on scholar name data and name disambiguation, an examination of name schemas and data systems, informal interviews with key informants, and initial research by Linda MacAfee and Robert Martel.
One issue to deal with regarding scholar/author names is name disambiguation. When more than one person shares the same name, which will certainly be the case in the Knowledge for All database, it is important to have a means by which authors and their associated works can be identified and similar names can be disambiguated. Disambiguation will lead to more precise searching and make citation analysis possible. There are various tools available or in development that facilitate name disambiguation. The ideal tool would limit the names to scholars who publish, provide rich metadata about each scholar, provide unique name identifiers, and provide open data. Currently, it does not appear that such a resource exists. Current options are noted below.
The Open Researcher and Contributor ID (ORCID) project “aims to solve the author/contributor name ambiguity problem in scholarly communications by creating a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current author ID schemes.” It is a promising project but is still in the beta development stage, so it is uncertain whether this tool will meet Knowledge for All's needs for name disambiguation. We do not know yet what data will be provided for scholars, how that data will be made available, or how the tool will work. ORCID's About page states that in 2012 they plan to start charging fees to organizations that wish to use the service.
Friend of a Friend (FOAF) is a semantic RDF ontology used to describe people, their activities, and their relationships with objects and each other on the web. It does not involve a centralized database of information about people but rather is a method by which that information is made available in a standardized format. It may enable Knowledge for All to harvest scholar/author information from multiple sources. However, it is not limited to the scholarly community.
arXiv, an open e-prints archive, has an authority records system that assigns author identifiers in an attempt to disambiguate author names and enable retrieval of all publications by a particular author. It has some limitations and is by no means comprehensive, but the data is open.
There are various closed-access databases of scholar names, such as Thomson Reuters ResearcherID and Scholar Universe. These tools could potentially be used by contributors with access through their institutions, but they cannot be relied on because of their closed status.
The Library of Congress Authority Files are the most established and widely used name authority records. They are free to search but can only be downloaded one at a time. Names are uniquely identified by adding additional biographical information, such as birth and death dates. Authority records are based mainly on book publications rather than journal publications, and so may not be adequate for Knowledge for All.
The Virtual International Authority File is a joint project of multiple libraries, implemented and hosted by OCLC, that aggregates library authority files and makes some attempt to disambiguate names. It is currently available to organizations that apply and are accepted to be members, which requires contributing name authority data that meets certain criteria. In the future the service will be freely available.
The International Standard Name Identifier (ISNI) is a standard for assigning unique identifying numbers to names, similar to the ISBN. Authors must register for a number and the database is not publicly searchable.
Knowledge for All could utilize one of these external resources to identify and disambiguate scholar names or create its own internal system of author name disambiguation and unique identifiers within the database. This process is explored by Torvik and Smalheiser (2009). In his article, “Metadata for Name Disambiguation and Collocation” (2010), Jeffrey Beall recommends collecting the following additional data for name disambiguation, as necessary:
Beall, Jeffrey. “Metadata for Name Disambiguation and Collocation.” Future Internet 2.1 (2010): 1-15. Retrieved 6 April 2011 from http://www.mdpi.com/1999-5903/2/1/1/
Torvik, Vetle I. and Neil R. Smalheiser. “Author Name Disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data 3.3 (July 1, 2009): 11.
The Knowledge for All system should have two levels of subject classification for articles and, ideally, an abstract in every article record. Each article in the database should be classified in one or more broad subject categories and have multiple more specific subject terms assigned to it. This system of classification will allow us to provide rich searching and browsing capabilities in the citation database. It will also be useful for connecting volunteer indexers and editors with journals in which they have subject expertise.
The primary purpose of broad subject classification in the K4All database is to allow users to limit their searching to a specific discipline or set of disciplines. While the goal should be to classify an article or journal in a single broad subject category (and categories should be broad enough to allow that), classification in more than one category should be allowed for multidisciplinary articles or journals. Broad subjects should be selected and defined thoroughly to facilitate easy classification. This could be a flat or hierarchical list of subject categories. JournalSeek provides a model flat list while Directory of Open Access Journals provides a model hierarchical list.
Broad subject classification terms have been assigned to Thesauri for Subject Indexing records and Metadata Sources records, but the vocabulary has not been 'controlled' yet. The working, uncontrolled list of subject categories can be viewed here.
Classification of articles into one or more broad subject classes can occur at either the article level or the journal level. An advantage of classifying entire journals in one or more broad subject categories is that indexers do not need to make this classification for each individual article. However, with interdisciplinary journals or journals whose articles might typically fall into more than one broad subject class, this could result in misclassification of journal articles. As a compromise, it is recommended that journals be classified under broad subjects but that indexers be able to override that classification on a per-article basis if needed, and that interdisciplinary journals be flagged so that indexers assigned to those journals pay special attention to this field and correct it if needed.
Subject indexing, or applying subject terms from controlled vocabularies to journal articles, will be an important aspect of the Knowledge for All system. It will provide search precision that is absent from tools that search only for keywords in full text, such as Google Scholar, and from tools that use subject terms not drawn from controlled vocabularies, which is the case with many commercial databases.
Unless the source's license allows it, Knowledge for All will not be able to copy subject terms from existing journal article records due to copyright restrictions. This is discussed in more detail in Copyright of Journal Article Metadata. There may be exceptions when the subject terms were chosen from the same thesaurus used by Knowledge for All (such as the ubiquitous MeSH) and the metadata record is part of an open data set. Otherwise, volunteer indexers will select subject terms from controlled vocabularies for all articles in the Knowledge for All database.
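The compromise described above (journal-level classification, a per-article override, and a flag on interdisciplinary journals) can be sketched as a small data model. This is only an illustration in Python under stated assumptions: the class names, field names, and the sample journal and article titles are hypothetical and do not describe any existing Knowledge for All schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Journal:
    title: str
    broad_subjects: List[str]          # assigned once, at the journal level
    interdisciplinary: bool = False    # flagged journals need closer review


@dataclass
class Article:
    title: str
    journal: Journal
    # Set by an indexer only when the journal-level subjects do not fit.
    subject_override: Optional[List[str]] = None

    def broad_subjects(self) -> List[str]:
        """Return the override if one exists, else inherit from the journal."""
        if self.subject_override:
            return list(self.subject_override)
        return list(self.journal.broad_subjects)

    def needs_subject_review(self) -> bool:
        """True when the journal is flagged and no indexer has set an override."""
        return self.journal.interdisciplinary and self.subject_override is None


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    journal = Journal("Example Journal of Health Informatics",
                      ["Medicine"], interdisciplinary=True)
    inherited = Article("Example article A", journal)
    corrected = Article("Example article B", journal,
                        subject_override=["Computer Science"])

    print(inherited.broad_subjects(), inherited.needs_subject_review())
    print(corrected.broad_subjects(), corrected.needs_subject_review())
```

Storing the override separately from the journal's subjects means a later change to the journal-level classification still flows through to every article that has not been individually corrected.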
There are currently many thesauri available online for different subject areas and in different languages. They are being located and listed on the Thesauri for Subject Indexing page. Once a near-complete list is composed, the Knowledge for All community will need to decide which thesauri will be used for different subject areas and ensure that indexers consistently use those thesauri.
It may be necessary to adapt and modify existing thesauri for Knowledge for All or to import terms and structures into the Knowledge for All system. If so, thesauri that allow this should be selected, or Knowledge for All can request permission from thesaurus maintainers. Another option is to develop thesauri from scratch for Knowledge for All, but this takes considerable time and skill and so should be avoided where possible.
Some original thesaurus construction and modification will likely be necessary to sufficiently represent and describe all subjects covered by all published scholarly journal literature, particularly in areas that may be neglected or misrepresented by existing thesauri. These gaps will be identified once a complete list of existing thesauri is compiled and these thesauri are analyzed by subject specialists, or as indexers begin to use them. Thesaurus construction and indexing can be controversial due to the politics of language and naming (de la tierra, 2008). It is recommended that Knowledge for All make every effort to include diverse perspectives in the construction and modification of its thesauri.
Copying of abstracts from journal metadata records is also restricted by copyright, unless the source's license allows it. As discussed in detail in Copyright of Journal Article Metadata, the creation of abstracts for most articles in the system would take considerable time and is not a feasible option. Thus, it is recommended that Knowledge for All find a way around the copyright issue with abstracts and make every effort to provide access to existing abstracts of articles in the database.
de la tierra, tatiana. “Latina Lesbian Subject Headings: The Power of Naming.” Radical Cataloging: Essays at the Front. Ed. K.R. Roberto. Jefferson: McFarland, 2008. 94-102.