Research Guides: Digital Humanities Resources: Datasets

Open Access Image

The Open Image Collections is a collection of digital image sources suitable for teaching, learning and research. Sources include museum digital collections, stock images, photo archives, design resources and image search engines. The project was started by Hedren Sum, a librarian at Nanyang Technological University. The 'openness' of each image source is labeled on the website.

Smithsonian Open Access

Smithsonian Open Access offers nearly 3 million 2D and 3D digital items from across the Smithsonian’s 19 museums, nine research centers, libraries, archives, and the National Zoo.

Users can retrieve open access metadata through the Smithsonian’s public API, which is hosted on api.data.gov and requires registering for an API key.
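As a rough illustration of what a search request might look like, the Python sketch below builds a query URL without sending it. The endpoint path and the parameter names q, rows and api_key are assumptions based on the public documentation and should be checked against the current API reference. DEMO_KEY is a placeholder for a registered key, and build_si_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL for the Smithsonian Open Access search endpoint;
# verify it against the current API documentation before use.
SI_SEARCH_URL = "https://api.si.edu/openaccess/api/v1.0/search"


def build_si_search_url(query, api_key, rows=10):
    """Return a search URL for the given query (no request is sent)."""
    # Parameter names are assumptions, not confirmed by this guide.
    params = {"q": query, "rows": rows, "api_key": api_key}
    return f"{SI_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # DEMO_KEY is a placeholder; substitute the key you registered for.
    print(build_si_search_url("botanical illustration", "DEMO_KEY"))
```

The resulting URL can be passed to any HTTP client; the response is expected to be JSON.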

Harvard Art Museums

The collections in Harvard Art Museums include approximately 250,000 objects in all media, featuring European & American art from the Middle Ages to the present day.

The Harvard Art Museums provide a REST-style service for developers who wish to explore the museums’ collections and integrate them into their own projects. The museums also publish documentation for the Harvard Art Museums API.

The Museum of Modern Art (MoMA) Collection

MoMA's evolving collection contains almost 200,000 works from around the world spanning the last 150 years.

MoMA's GitHub page provides datasets containing the metadata of each artwork and artist. The datasets are updated monthly.
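Once downloaded, metadata of this kind can be explored with standard tools. The sketch below counts works per artist using Python's csv module. The three-row sample is an invented stand-in for the real file, and the column names Title, Artist and Date are assumptions about its layout; count_works_by_artist is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for the downloaded artworks file.
# The column names are assumptions; check the header of the real CSV.
SAMPLE_CSV = """Title,Artist,Date
Work A,Artist One,1950
Work B,Artist Two,1961
Work C,Artist One,1972
"""


def count_works_by_artist(csv_file):
    """Count rows per value of the 'Artist' column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Artist"] for row in reader)


counts = count_works_by_artist(io.StringIO(SAMPLE_CSV))
```

For a real download, the same function could be called on a file opened with open(path, newline="", encoding="utf-8"), where path points to the artworks CSV.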

Rijksmuseum

The Rijksmuseum is a Dutch national museum dedicated to arts and history. Rijksmuseum data services provide access to object metadata, bibliographic data, controlled vocabularies and user-generated content.

Users can access and download the object metadata through Rijksmuseum's API.

Resources for Computational Text Analysis

HathiTrust Digital Library

HathiTrust Digital Library is the product of the HathiTrust, a partnership of major research institutions and libraries working together to preserve our cultural record of print materials. As of January 2013, the digital library comprises over 10 million volumes, over 3.2 million of which are in the public domain, and includes almost half the print holdings at SU Libraries. HathiTrust provides its members with full-text searching across the entire repository, full-text PDF downloads for items that are in the public domain or otherwise not under copyright, and full-text access to brittle out-of-print items in SU Libraries.

Researchers can conduct computational analysis of works in the HathiTrust Digital Library through the HathiTrust Research Center (HTRC), which offers a suite of tools and services for text-based, data-driven research, such as HTRC Algorithms and Data Capsule.

Gale's Nineteenth Century Collections

Gale's Nineteenth Century Collections is a multi-year global digitization and publishing program focusing on primary source collections of the long nineteenth century. With the Term Frequency function, researchers can check how many documents are relevant to key terms over a specific period of time.
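The idea behind a term frequency view can be sketched in a few lines of Python: count, for each year, how many documents contain a given term. This is a simplified illustration and not Gale's implementation; the four sample documents are invented, and documents_per_year is a hypothetical helper.

```python
import re
from collections import Counter


def documents_per_year(documents, term):
    """Count documents containing `term` (case-insensitive, whole word) per year.

    `documents` is an iterable of (year, text) pairs.
    """
    term = term.lower()
    counts = Counter()
    for year, text in documents:
        # Split the text into lowercase words so "Revolution" matches "revolution".
        words = set(re.findall(r"[a-z']+", text.lower()))
        if term in words:
            counts[year] += 1
    return dict(sorted(counts.items()))


# Invented sample corpus for illustration only.
docs = [
    (1848, "Revolution spreads across the continent."),
    (1848, "The harvest was poor this year."),
    (1871, "A revolution in railway travel."),
    (1871, "Talk of revolution fills the papers."),
]

print(documents_per_year(docs, "revolution"))  # {1848: 1, 1871: 2}
```

Counting documents rather than raw word occurrences keeps one long document from dominating a year's total.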

Europeana Newspapers

The Europeana Newspapers project has converted 10 million historic newspaper pages to full text for Europeana. It has also developed a number of open source software tools, such as the Named Entity Recognition Tool for Europeana Newspapers.

LC for Robots

The Library of Congress Lab provides a list of APIs, bulk downloads, and tutorials that help researchers explore machine-readable access to its digital collections.
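One commonly documented route to machine-readable data on loc.gov is to request a page as JSON by adding a format parameter to its URL. The sketch below assumes that parameter is fo=json, which should be confirmed in the Library's API documentation; as_json_url is a hypothetical helper and sends no request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def as_json_url(url):
    """Return `url` with the (assumed) fo=json parameter added or overwritten."""
    parts = urlsplit(url)
    # Keep any query parameters already present on the URL.
    query = dict(parse_qsl(parts.query))
    query["fo"] = "json"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(as_json_url("https://www.loc.gov/collections/"))
```

The same helper works for URLs that already carry a query string, such as a search results page.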

Digital Public Library of America

The Digital Public Library of America (DPLA) aims to provide public access to digital holdings within America’s libraries, archives, museums, and other cultural heritage institutions. DPLA offers a public API and bulk downloads that grant access to all of DPLA’s records under a permissive license.
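Records returned by an API like this usually need a little unpacking before analysis. The sketch below pulls titles out of a response, assuming each record sits in a docs list and carries its title under sourceResource, either as a string or as a list of strings. That shape is an assumption to verify against DPLA's API documentation, the sample response is invented, and extract_titles is a hypothetical helper.

```python
import json

# Invented sample standing in for an API response; the "docs" and
# "sourceResource" keys are assumptions about the response shape.
SAMPLE_RESPONSE = json.loads("""
{"count": 2, "docs": [
  {"sourceResource": {"title": "Map of Boston"}},
  {"sourceResource": {"title": ["Letter", "Envelope"]}}
]}
""")


def extract_titles(response):
    """Collect every title from a response dict into a flat list."""
    titles = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title", [])
        # Normalize a single string to a one-item list.
        if isinstance(title, str):
            title = [title]
        titles.extend(title)
    return titles


titles = extract_titles(SAMPLE_RESPONSE)
```

Records that lack a title are skipped rather than raising an error, which is convenient when working through large bulk downloads.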

New York Times APIs

NYT offers ten APIs to facilitate a wide range of uses, from custom link lists to complex visualizations.

Open American National Corpus

An open access linguistic corpus consisting of 15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.

Reddit API

Reddit provides an API for accessing data from its posts, threads, comments, users and more. Historic Reddit data can be downloaded from this website.

Documenting the Now

An organization and a set of tools, materials and media for chronicling historically significant events via social media.