Syracuse University Libraries

Introduction to Text Mining with HathiTrust Research Center

About this Research Guide

This research guide offers step-by-step instructions for conducting a basic text mining project with the HathiTrust Digital Library (HTDL) and the HathiTrust Research Center (HTRC).

Step 1: Create Accounts

Create an HTDL account

 

  1. Click “LOG IN” in the top right corner.

  2. A login window will appear.

  3. If you are affiliated with an HT partner institution, select your institution and then follow the directions for institutional login.

  4. If you are not affiliated with an HT partner institution, you can log in as a guest.

  5. Click on “See options to log in as a guest” in the login window.

  6. You will be directed to a new page where you can log in with a Google, Facebook, Twitter, AOL, LinkedIn, Windows Live (Hotmail), Yahoo!, or University of Michigan Friend Account.

  7. Click on an option of your choice and follow the directions.

Create an HTRC Analytics account

  1. Click “Sign Up” in the top right corner.

  2. Use an email address from an academic institution and follow security guidelines for the password.

  3. Activate your account using the link sent to you by email.

Step 2: Build a Collection

After creating the accounts, build a collection in the HathiTrust Digital Library based on your research interests.

Create a new collection

 

  1. Click “My Collections” in the top 

  2. Click the “New Collection” button.

  3. Enter a name and description for the collection, then save your changes.

 

Search for the volumes you want by applying the filters

 

For instance, if you want to build a collection of books on American poetry published after 1900 in the United States, you can apply the following filters. Note that if you want the collection to be processed by the HTRC Analytics tools, you must apply the “Full-View” filter.

 

Add volumes to your collection 

 

You can select volumes one by one and add them to your collection. For a large collection, it is more efficient to select all the volumes on the page, add them to the collection, and remove unwanted or duplicate results from your collection later.

 

  1. Click the checkbox next to the titles you would like to add to your collection

  2. Click on the “Select Collection” bar and choose the title of your newly created collection. 

  3. Click “Add Selected”

 

You can then find your collection on the “My Collections” page.

 


 

Step 3: Build and Validate a Workset

Build a workset

 

Next, build a workset in HTRC Analytics based on the collection you built in HTDL.

 

  1. Log in to HTRC Analytics. Click “Worksets” in the top bar.

  2. Click on “Create a Workset” and choose “Import from HathiTrust”.

  3. Copy the URL of your collection (in HTDL).

  4. Paste it into the collection URL field (in HTRC), and click “fetch”.

  5. Fill in the other information and click “Create Workset”.

 

Validate a workset

 

Some of the volumes in the workset may not be processable by HTRC Analytics, so you need to validate the workset before you run any analytics programs.

 

  1. On the workset page, select “Validate a workset”.

  2. Choose the workset that you want to validate

  3. Validate it and check the result.

In this example, there are 16 invalid volumes that need to be removed from the workset, so we download the file of “valid volumes”. This file lists the volume_id, title, and author of each valid volume.

 

  1. Download the list of valid volumes
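Before uploading the file, it can be worth a quick check of its contents. The following is a minimal Python sketch, assuming the downloaded file is comma-separated with a header row naming the columns volume_id, title, and author; the real delimiter and header may differ. The sample rows and volume IDs are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded "valid volumes" file.
# Assumption: comma-separated, with a header row volume_id,title,author.
SAMPLE = """volume_id,title,author
vol.0001,Collected Poems,Author A
vol.0002,Selected Verse,Author B
vol.0001,Collected Poems,Author A
"""

def unique_volumes(handle):
    """Return rows with distinct volume_id values, keeping first occurrences."""
    seen = set()
    rows = []
    for row in csv.DictReader(handle):
        if row["volume_id"] not in seen:
            seen.add(row["volume_id"])
            rows.append(row)
    return rows

volumes = unique_volumes(io.StringIO(SAMPLE))
print(len(volumes), "unique volumes")
```

To run this on a real download, replace `io.StringIO(SAMPLE)` with an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your own copy of the file.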

 

 

Now, we can build a new workset by uploading this file.

 

  1. On the ‘Create a workset’ page, click “Upload File”.

  2. Fill in the information. Click on “Choose File” to select the list of valid volumes, then click “Create Workset”.

 

Now you have the workset ready for analytics.

 

Step 4: Apply the Analytics Tools

HTRC Algorithms are web-based, click-and-run tools for computational text analysis of volumes in the HathiTrust Digital Library. Below are brief instructions for two of the four analytics tools in HTRC Algorithms.

Token Count and Tag Cloud Creator

 

Token Count and Tag Cloud Creator is the most straightforward tool in HTRC Analytics. It generates a tag cloud showing the most frequently occurring words in the workset, and a file listing those words and the number of times each occurs.

 

  1. Click “Algorithms” at the top of the HTRC Analytics page.

  2. Click ‘Execute’ for the ‘Token Count and Tag Cloud Creator’.

  3. Select the (validated) workset, and specify a name and the language.

  4. Use a stopword list.

Stop words are generally the most common words in a language, such as ‘and’, ‘in’, and ‘she’. You may want to exclude these words from the word cloud. You can use the default stopword list or create a new one.

  5. Use a replacement rule.

A replacement rule replaces words in the workset with their standardized versions. You can use the default list or create a new one.

  6. Specify how many words you want to display in the word cloud and click ‘Submit’.

  7. Wait for the program to run automatically. It may take a few minutes.

  8. Check the status of your project by clicking ‘Job’ on the algorithm page.

  9. Once it has finished, you can find your work in the ‘Completed Jobs’ list.

  10. Click the name of the job to view its results. You can find the word cloud in the ‘Output’ section.
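To make the stop word and replacement steps more concrete, here is a minimal Python sketch of the same idea. It is only an illustration, not the code HTRC runs: the stop word set, the replacement map, the simple tokenizer, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical inputs; the HTRC default lists are not reproduced here.
STOPWORDS = {"and", "in", "she", "the"}
REPLACEMENTS = {"colour": "color"}  # variant spelling -> standardized form

def count_tokens(text, stopwords, replacements, top_n):
    """Lowercase, tokenize, standardize, drop stop words, and count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [replacements.get(tok, tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in stopwords]
    return Counter(tokens).most_common(top_n)

text = "She loved the colour red, and the color red loved her in return."
top_words = count_tokens(text, STOPWORDS, REPLACEMENTS, top_n=3)
print(top_words)
```

Here the replacement is applied before the stop word filter, so that a standardized form can itself be filtered out. Whether HTRC applies the two steps in this order is not stated in this guide.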

 

Named Entity Recognition

 

Named entity recognition (NER) seeks to locate and classify named entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

 

  1. Click “Algorithms” at the top of the HTRC Analytics page.

  2. Click ‘Execute’ for the ‘Named Entity Recognizer’.

  3. Specify the name of the project, the workset you want to use, and the language of the project. Click ‘Submit’.

  4. Wait for the program to run automatically. It may take more than ten minutes to finish.

  5. Check the status of your project by clicking ‘Job’ on the algorithm page.

  6. Once it has finished, you can find your work in the ‘Completed Jobs’ list.

  7. Click the name of the job to view its results.

 

The NER output in HTRC is a list of named entities, their types, and the volume and page on which each appears.

 

You can download the output file and process it on your own.

 

 

Here I use R for some basic analysis. You can also use Python, Excel, or other software and programming tools.

 

For instance, you can check the distribution of each type of entity.

Entity type     Count
DATE            123,374
LOCATION        153,741
MISC            92,983
MONEY           2,615
ORGANIZATION    69,698
PERCENT         746
PERSON          951,210
TIME            8,332
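A tally like this can also be produced in Python. The sketch below assumes the downloaded NER output is a comma-separated file with a header row and columns named vol_id, page_seq, entity, and type; these column names are an assumption, so check them against your own file. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for the downloaded NER output.
# Assumption: CSV with columns vol_id, page_seq, entity, type.
SAMPLE = """vol_id,page_seq,entity,type
vol1,12,Boston,LOCATION
vol1,12,Jane Doe,PERSON
vol2,3,1901,DATE
vol2,7,John Roe,PERSON
"""

def entity_type_counts(handle):
    """Count how many entity mentions fall under each type."""
    return Counter(row["type"] for row in csv.DictReader(handle))

type_counts = entity_type_counts(io.StringIO(SAMPLE))
for entity_type, n in sorted(type_counts.items()):
    # Left-align the type, right-align the count with thousands separators.
    print(f"{entity_type:<14}{n:>8,}")
```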

 

 

After a little data manipulation in R, you can also visualize the frequency of particular words in many ways.

 

For instance, you can chart the top 15 city names in the volumes.

Or you can draw a map showing the distribution of state names.
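The counting behind a "top 15 cities" chart could look roughly like the Python sketch below. It assumes a comma-separated NER output with columns named vol_id, page_seq, entity, and type (an assumption about the file layout), and it uses a tiny hypothetical set of city names, since the NER output labels locations but does not say which ones are cities. The sample rows are invented.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; assumption: CSV columns vol_id, page_seq, entity, type.
SAMPLE = """vol_id,page_seq,entity,type
vol1,1,Boston,LOCATION
vol1,2,boston,LOCATION
vol1,3,Chicago,LOCATION
vol2,1,Boston,PERSON
vol2,4,Ohio,LOCATION
"""

# Hypothetical gazetteer; a real analysis would need a fuller list of cities.
CITIES = {"boston", "chicago"}

def top_cities(handle, cities, n):
    """Count LOCATION entities whose normalized name is a known city."""
    counts = Counter()
    for row in csv.DictReader(handle):
        name = row["entity"].strip().lower()
        if row["type"] == "LOCATION" and name in cities:
            counts[name] += 1
    return counts.most_common(n)

top = top_cities(io.StringIO(SAMPLE), CITIES, n=15)
print(top)
```

The resulting list of (name, count) pairs can then be passed to whatever plotting tool you prefer. Swapping the city set for a set of state names gives the counts needed for a state distribution map.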