Journal article data collection and creation
As detailed in Collections policy recommendations, the end goal of Knowledge for All is to collect journal article metadata for all current and past scholarly journals in all subject areas and languages. Metadata elements details specific data elements needed for different types of content in the database. This document identifies and analyzes different methods for collecting and generating past and current journal article metadata. These ideas were generated through reading discussions on relevant listservs; informal interviews with librarians, researchers, and developers; extensive Internet searching for metadata repositories; and initial research conducted by Carly Currie, Mary Zazelenchuk, Andrea Crabbe, and Alyssa Graybeal.
Most of the journal article data needed is factual data (such as title, author, and year). As discussed in Copyright of journal article metadata, factual data is easier to harvest from existing collections of metadata without violating copyright than subject terms and abstracts are, since those are subject to copyright as literary works. Below I identify potential sources of factual journal article metadata and methods for collecting or generating it. The accompanying Metadata Sources document lists specific collections of journal article metadata that could potentially be harvested or acquired for the Knowledge for All database. These lists are by no means exhaustive, but they give a sense of what is available and what to consider in selecting methods and sources. The collections are categorized by type of metadata source, corresponding to the categories noted here. I address subject indexing and abstracts in a separate document.
An important partner and resource in collecting journal article data is the Open Knowledge Foundation's Open Bibliographic Data Working Group, which is making different kinds of bibliographic data open and harvestable. It is recommended that Knowledge for All work closely with this group to share strategies and resources.
Summary of general recommendations for metadata collection and creation:
- Even when a source's data is explicitly open access and harvestable, Knowledge for All should contact the organization and obtain permission to harvest the data, making it clear that the data will become ODC-By.
- Clear and precise data standards should be created, agreed upon, maintained, and distributed to editors to ensure consistent, high-quality metadata.
- Design the Knowledge for All system so that data can be uploaded in a wide variety of formats and schemas.
- Develop a robust system for identifying and managing duplicate article records.
- Prioritize metadata sources that follow open data standards and open access principles.
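As one illustration of the duplicate-management recommendation above, the following is a minimal sketch of how candidate duplicates might be grouped when the same article arrives from several sources. The record layout (`id`, `title`, `year`, `authors`), the "Surname, Given" author order, and the choice of matching key are all assumptions made for this example, not part of any agreed Knowledge for All schema.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def match_key(record):
    """Build a comparison key from title, year, and first author's surname."""
    first_author = record["authors"][0] if record["authors"] else ""
    surname = first_author.split(",")[0]  # assumes "Surname, Given" order
    return (normalize(record["title"]), str(record["year"]), normalize(surname))


def group_duplicates(records):
    """Return groups of records that share a match key (candidate duplicates)."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical records, as they might arrive from three different sources.
records = [
    {"id": "a1", "title": "Open Access: A Review", "year": 2010,
     "authors": ["García, Ana"]},
    {"id": "b7", "title": "open access - a review", "year": "2010",
     "authors": ["Garcia, A."]},
    {"id": "c3", "title": "Metadata Harvesting", "year": 2009,
     "authors": ["Lee, Kim"]},
]
candidates = group_duplicates(records)
```

Exact-key matching of this kind only flags candidates. Records with misspelled titles or differing years would be missed, so a fuller system would probably add fuzzy matching and route uncertain cases to editors for review.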
Publishers
Publishers may be willing to provide their journal article metadata to Knowledge for All as a means of publicizing their content. Willing publishers could enter into a data sharing agreement with Knowledge for All and either upload their data into the system as it becomes available or have their data automatically harvested at regularly scheduled intervals. We could request both current and past data from publishers.
This option may appeal more to open access publishers and non-profit publishers. Some open access publishers, such as Public Library of Science, have already been contacted and have expressed a desire to share their data. Other open access and non-profit publishers should be approached. Commercial publishers should also be approached, especially smaller ones, and possibly when Knowledge for All has reached a more advanced stage of development and popularity.
Another group of publishers that may be interested in providing their journal article metadata is organizations that publish journals in non-Western countries and in languages other than English, as this group may feel neglected by commercial publishers and want a new way to publicize their content.
Publishers are not included in the Metadata Sources document, with the exception of Public Library of Science and BioMed Central, because there are so many.
Institutional repositories and eprint archives
Institutional repositories (IRs), or collections of research articles, theses, and dissertations by university faculty and students at a particular institution, are becoming increasingly common at universities. They are usually managed by university libraries and are sometimes accompanied by open access policies that require faculty to deposit copies of all of their published works in the IR. There are also online repositories of eprints, or digital versions of research articles, that are not restricted to a particular institution and are usually subject-specific, where authors are encouraged to deposit copies of their articles. Articles deposited in IRs and eprint archives are often pre-prints, or first drafts that have not yet undergone the peer-review editing process. This is due to copyright issues, although some journals also allow authors to deposit post-print versions of articles in repositories. The Self-Archiving FAQ for the Budapest Open Access Initiative (BOAI) states that 68% of journals allow self-archiving of post-print articles, while 32% do not. Journal policies for self-archiving can be searched on the Sherpa Romeo site.
Most institutional and eprint repositories follow an open access model and so would likely be willing to provide Knowledge for All with their metadata. In fact, their data is often made available via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). IRs are not included in the Metadata Sources list because there are too many of them, but subject-specific repositories are. The Canadian Association of Research Libraries (CARL) Institutional Repositories Pilot Project Harvester is a tool for searching across IR content from participating Canadian institutions; it could be useful for locating IRs and potentially for harvesting metadata. OpenDOAR is a directory of academic open access repositories with full-text resources. It can be used to search for repositories and to search repository contents, but not to harvest metadata, as it only searches Google indexes.
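To make the OAI-PMH option more concrete, here is a minimal sketch of reading unqualified Dublin Core records out of a ListRecords response. An OAI-PMH repository is queried over HTTP with parameters such as verb=ListRecords and metadataPrefix=oai_dc. To keep the example self-contained, the response below is a hand-written sample rather than the output of a real repository, and the identifier, field selection, and function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical repository and article).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2011-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Article</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:date>2010</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text):
    """Extract identifier, title, creators, and date from each record."""
    root = ET.fromstring(xml_text)
    parsed = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # e.g. deleted records carry a header but no metadata
            continue
        parsed.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "creators": [e.text for e in dc.findall("dc:creator", NS)],
            "date": dc.findtext("dc:date", default="", namespaces=NS),
        })
    return parsed


harvested = parse_list_records(SAMPLE_RESPONSE)
```

A real harvester would also need to follow the protocol's resumptionToken mechanism to page through large result sets, and to handle repositories that expose richer metadata formats than oai_dc.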
One problem with this source and method is that articles would be acquired individually, so it may be challenging to collect complete journal holdings. This method would need to be used alongside other methods. In addition, records may lack complete metadata about their publication in journals, and this additional metadata may need to be collected separately.
The fact that these articles may be pre-prints necessitates having a field that indicates the version of the article and ensuring that the record links to the appropriate full-text version of the article. If an article record is obtained for a post-print version of the same article, the two records should be linked.
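The version field and record linking described here could be modelled along the following lines. This is only a sketch: the class name, field names, and the version labels are hypothetical and would need to be reconciled with the Metadata elements document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArticleRecord:
    record_id: str
    title: str
    version: str  # assumed labels: "pre-print" or "post-print"
    full_text_url: Optional[str] = None  # should point at this version's text
    linked_record_ids: List[str] = field(default_factory=list)


def link_versions(first, second):
    """Cross-link two records that describe versions of the same article."""
    if second.record_id not in first.linked_record_ids:
        first.linked_record_ids.append(second.record_id)
    if first.record_id not in second.linked_record_ids:
        second.linked_record_ids.append(first.record_id)


# Hypothetical pre-print and post-print records for one article.
preprint = ArticleRecord(
    "r1", "An Example Article", "pre-print",
    full_text_url="http://repository.example.org/1234")
postprint = ArticleRecord(
    "r2", "An Example Article", "post-print",
    full_text_url="http://journal.example.org/articles/1")
link_versions(preprint, postprint)
```

Keeping the link symmetric means a user who finds either version can reach the other, and calling `link_versions` twice on the same pair does not create duplicate links.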
Table of contents services
Many journals provide a free table of contents (TOC) subscription service in which subscribers receive the table of contents for upcoming issues via e-mail or RSS feed. There are also services, such as JournalTOCs and ticTOCs, that provide a search interface for journal TOCs and a means of subscribing to multiple journal TOCs at once. This could be a means of acquiring current but not past journal article data. RSS feeds are in XML, so their data could be readily imported into the Knowledge for All database. TOC services are included in the Metadata Sources document.
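As a rough illustration of importing TOC data from an RSS feed, the sketch below reads article titles and links from a simple RSS 2.0 document. The feed content is a hand-written sample, and the output field names are assumptions. Real publisher feeds may use other RSS or Atom variants with namespaced elements, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (hypothetical journal and articles).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Journal of Examples: Table of Contents</title>
    <item>
      <title>First Example Article</title>
      <link>http://journal.example.org/articles/1</link>
    </item>
    <item>
      <title>Second Example Article</title>
      <link>http://journal.example.org/articles/2</link>
    </item>
  </channel>
</rss>
"""


def parse_toc_feed(xml_text):
    """Return one dict per feed item, tagged with the channel title."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:  # not an RSS 2.0 document
        return []
    source = channel.findtext("title", default="")
    return [
        {
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in channel.findall("item")
    ]


toc_entries = parse_toc_feed(SAMPLE_FEED)
```

Feeds of this basic kind carry little beyond a title and a link, so author, volume, and page data would often have to come from richer feed extensions or from another source.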
One issue is that the data quality and schemas of TOCs will vary among publishers. Copyright considerations may depend on which TOC service is used. An administrator of the ticTOCs Journal Tables of Contents Service stated, "ticTOCs merely directs you to the publisher's feeds, therefore any questions about re-using the content of publishers' feeds for an OA citation database should be directed at the individual publishers," whereas JournalTOCs has an API that provides free access to the metadata that JournalTOCs has collected. The JournalTOCs API is a very promising avenue for collecting journal article data as well as journal data and should be investigated further.
Citation databases
Numerous free, non-commercial online citation databases, like Knowledge for All, already exist. Some index journals in specific subject areas and have been created and maintained by scholars who specialize in those subjects, such as Latin American Periodicals Tables of Contents; others index open access journals, such as the Directory of Open Access Journals (DOAJ). Many of the subject-specific databases are called “table of contents” services by their creators or even have the words “table of contents” or “TOC” in their names, but they differ from TOC services like ticTOCs in that they provide past as well as current journal article metadata. While the databases of open access journals are fairly sophisticated and follow current data standards, the subject-specific databases vary widely in the type and format of their data and in how usable that data is. Citation databases are included in the Metadata Sources list.
Citation databases present an important opportunity to Knowledge for All on multiple fronts. Because they share similar values and goals, they make good potential partners, both as metadata sources and as indexers. Citation databases were likely created because certain subjects are not well represented in commercial databases or because their creators want to provide free and open access to research. When contacting these organizations about partnerships, Knowledge for All should emphasize that it aims to be comprehensive and provide free and open access to research. Knowledge for All can offer a better interface and search features, more complete data, a larger community of contributors, and more comprehensive content than most of these sources. Some citation databases have already been contacted and expressed willingness to contribute their data to Knowledge for All. Some have mentioned challenges with maintaining their own systems.
A concern with citation databases is that data from some sources is not in an easily harvestable or digestible format. For example, the LINGUIST citation database only has citation data in HTML. In some cases we will need to determine whether harvesting data in a challenging format is worth the time and effort. The organizations that maintain citation databases may have limited resources and expertise to provide their data in other ways.
PDF file extraction
In order to offer full-text searching to users, Knowledge for All will aim to acquire PDF files of full-text journal articles to extract and index the full text without actually storing the PDF files or making them available to users. This method could also be used to harvest citations or references in articles. In addition, journal article metadata could be extracted from PDF files using Zotero or a similar tool or process, then added to the Knowledge for All database. Contributors could upload PDF files for extraction, but it is recommended that a further process be created to upload and harvest PDF files in bulk if this is selected as a primary method of collecting metadata. Metadata extracted from PDF files is not always accurate and will need to be edited.
Use of this method depends on having access to PDF files for all articles, which in turn depends on having an established community of contributors with access to full-text PDF files. Those contributors will need to ensure that their institutional licenses do not prohibit metadata harvesting of PDF files for journal articles. Another issue is that PDF versions of articles may not exist or be available for all older journal content. In terms of full-text searchability, Knowledge for All may not be able to provide full-text searching for all articles in the database but could aim to provide it over time.
Reference management databases
Reference management software, which is widely used by scholars, allows people to build personal citation databases for the purpose of storing and accessing research literature and easily generating bibliographies. Some reference management tools additionally create centralized databases of user-added citations. Thus, Knowledge for All could appeal to both individual users and organizations that manage these tools to contribute their citation databases to the Knowledge for All project. Infrastructure could be created that would allow ongoing sharing of records added to centralized reference databases or personal reference databases with Knowledge for All. In return, Knowledge for All could offer the following:
- Enhanced and improved citation records, edited for quality and indexed
- Additional citation data or a comprehensive collection of citation data
- Sophisticated search interface and features
Zotero, a popular open-source reference management tool maintained by a non-profit organization, is a promising potential partner that has shown interest in a similar project through its partnership with the Internet Archive. Zotero, Mendeley, and CiteULike are included in the Metadata Sources list, but the latter two are less likely partners due to being closed source and operated by private companies.
One issue with this source of metadata is varied metadata quality and the need to sort, combine, and edit many duplicate records. As with institutional repositories and eprint archives, journal holdings will not be complete. The Knowledge for All system would need to be able to ingest data in various reference management formats, including RIS, BibTeX, and Zotero RDF.
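To give a sense of what ingesting one of these formats involves, here is a minimal sketch of a RIS reader. RIS is a line-based format in which each line starts with a two-character tag (for example TY for record type, AU for author, ER for end of record). The mapping from tags to output field names below is an assumption for illustration, it covers only a few common tags, and the sample record is invented.

```python
import re

# A RIS line: two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Partial, assumed mapping from RIS tags to hypothetical field names.
FIELD_MAP = {
    "TI": "title", "T1": "title",
    "JO": "journal", "JF": "journal",
    "PY": "year", "Y1": "year",
    "VL": "volume", "IS": "issue",
    "SP": "start_page", "EP": "end_page",
}

SAMPLE_RIS = """TY  - JOUR
AU  - Doe, Jane
AU  - Roe, Richard
TI  - An Example Article
JO  - Journal of Examples
PY  - 2010
VL  - 12
SP  - 45
EP  - 67
ER  -
"""


def parse_ris(text):
    """Parse RIS text into a list of dicts, one per TY ... ER record."""
    parsed, current = [], None
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or unrecognized lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":
            current = {"type": value, "authors": []}
        elif current is None:
            continue  # ignore tags that appear outside a record
        elif tag == "ER":
            parsed.append(current)
            current = None
        elif tag in ("AU", "A1"):
            current["authors"].append(value)
        elif tag in FIELD_MAP:
            current[FIELD_MAP[tag]] = value
    return parsed


ris_records = parse_ris(SAMPLE_RIS)
```

BibTeX and Zotero RDF would each need their own reader, but all of them could emit the same internal record structure so that duplicate detection and editing work on one common form.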
It would also be possible to design Knowledge for All so that masses of users could upload data in the style of a reference management software project, but this might be a redundant effort compared to partnering with or utilizing an existing project.
Data entry and editing
The least desirable method of collecting journal article metadata is for contributors to enter it by hand, since this would take the most time and effort. However, it may be necessary for some articles and journals where we are not able to acquire the data using any of the other methods or sources noted here.
Even when the automated metadata collection methods above are used, there will inevitably be missing data to add and errors to correct. One thing that will distinguish Knowledge for All from other citation search tools is its high-quality metadata, so robust quality control and editing processes are essential. A large number of volunteer editors will be needed to carry out this work. Clear and precise data standards should be created, agreed upon, maintained, and distributed to editors to ensure consistent, high-quality metadata.
