Syracuse University Libraries

ProQuest TDM Studio Guide

Step One: Create Dataset

1. Go to the Dashboard Page of the ProQuest TDM Studio and Click "Create New Dataset"

When you click "+ Create New Dataset," you can choose to browse either a list of available ProQuest databases or a list of available publication titles.

screenshot showing Datasets display

Note 1: TDM supports Chrome and Firefox only.

Note 2: TDM has a limit of ten datasets for each institution. 

 

2. Search for Content

To explore licensed ProQuest content prior to accessing the TDM Studio platform, please visit this list of ProQuest databases. Please note that not all content available through these databases is available on the TDM Studio platform. If you would like additional information about content specific to TDM Studio, please reach out to one of our team members via the "Contact Us" box on the left.

Search for Databases or Publications based on keyword in title

screenshot showing publication search interface

Note: you can select multiple publications or databases.

3. Select Articles

Search for articles using a combination of keywords and date range. 

In this example, we select dissertations on digital humanities/scholarship published in 2019: 

screenshot showing list of 241 search results and filtering options

Note: the maximum dataset size is 2,000,000 documents.

4. Name Your Dataset

Specify the name of your dataset. 

screenshot of Database Details form

You can find your newly built dataset on the dashboard page. Once the submission process is complete, the STATUS will change from "Queued" to "Completed". This usually takes less than ten minutes.

screenshot of active datasets with new dataset queued for processing

Step Two: Transfer Data to Notebook

1. Check the Status of Your Dataset

If the status is 'Completed' and the number of documents is correct, you can go ahead and work with your dataset.

screenshot of dh2019 dataset details

tdm-ale-data/100/corpus/dh2019/

2. Open Jupyter Notebook

Click the 'Open Jupyter Notebook' button at the upper-right corner of the dashboard page.

screenshot of Jupyter Notebook button

Note: in the Jupyter Notebook home directory, you can find two folders that contain useful guides on working with text data in the environment.

screenshot of the two folders in the home directory

3. Transfer Data to Notebook

Open or create a notebook and execute the following three lines of code to transfer the data. In the first two lines, change 'dh2019' to the name of your dataset.

screenshot of initial entry box on Jupyter notebook
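The exact lines appear in the screenshot above. As a rough sketch only (the bucket name below is a placeholder, not the real value; the local path is modeled on the corpus path and export command shown elsewhere in this guide), the cell looks something like:

```shell
# Hypothetical sketch of the transfer cell -- run inside a Jupyter cell,
# where the leading "!" executes a shell command.
# Replace [tdm-studio-bucket] with the bucket shown in your guide notebook,
# and 'dh2019' with the name of your dataset.
!aws s3 sync s3://[tdm-studio-bucket]/tdm-ale-data/100/corpus/dh2019/ /home/ec2-user/SageMaker/data/dh2019/
```

Copy the exact lines from the guide notebooks in your home directory rather than retyping this sketch.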

Once the code is executed, you can find your dataset under the 'data' directory.

screenshot of folder navigation, including data folder

screenshot of dh2019 project folder in navigation

screenshot of contents of dh2019 folder

 

Step Four: Read and Parse Files

1. Create a list of file directories

You can create a list of file directories using Python's glob module.

screenshot of glob process output
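A minimal sketch of this step (the 'data/dh2019' path is an assumption carried over from the transfer step; the function name is ours):

```python
import glob

def list_xml_files(directory):
    """Return a sorted list of paths to the XML files in a dataset directory."""
    return sorted(glob.glob(f"{directory}/*.xml"))

# Example usage (path assumed from the transfer step):
# files = list_xml_files("data/dh2019")
```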

2. Read and parse all the files

The files are in XML format. To retrieve the content you need, you can use Python's 'lxml' and 'BeautifulSoup' libraries to process them.

For instance, you can retrieve the abstracts of dissertations in the dh2019 dataset using the following code:

screenshot of beautiful soup input box

screenshot of beautiful soup output box
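A minimal sketch of this kind of extraction, assuming the abstract sits in an element literally named 'abstract' (inspect your own files first; ProQuest's actual element names may differ):

```python
from bs4 import BeautifulSoup

def get_abstract(xml_text):
    """Return the first <abstract> element (as a string) from an XML record,
    or '' if no such element exists.
    The element name 'abstract' is an assumption -- check your XML structure."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser is backed by lxml
    tag = soup.find("abstract")
    return str(tag) if tag else ""
```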

3. Text cleaning

As we can see, the abstracts still contain many XML tags, which make it difficult to conduct computational analysis.

We can clean the XML tags using the following code:

screenshot of cleanTag process
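A minimal regex-based sketch of this cleaning step (the function name is ours; the guide's own code is shown in the screenshot above):

```python
import re

def clean_tags(text):
    """Strip XML/HTML tags from a string and collapse leftover whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)       # replace each tag with a space
    return re.sub(r"\s+", " ", no_tags).strip()   # normalize whitespace

# clean_tags("<p>Digital <b>humanities</b></p>")  ->  "Digital humanities"
```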

Now we have the clean text ready for analysis.

4. Save files

If you need to save the clean text for you or your teammates to analyze, you can save the file in CSV format.

To do so, first transform the text into a data frame and then save it as a CSV file.

screenshot of csv creation script
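A minimal sketch using pandas (the function, variable, and column names here are hypothetical, not the guide's own):

```python
import pandas as pd

def save_as_csv(texts, path):
    """Put a list of cleaned abstracts into a one-column data frame
    and write it out as a CSV file."""
    df = pd.DataFrame({"abstract": texts})
    df.to_csv(path, index=False)
    return df

# Example usage:
# save_as_csv(clean_abstracts, "dh2019_abstracts.csv")
```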

You can find your file in the home directory:

You can click the file to inspect it, but avoid doing so if the file is very large.

Step Five: Do the Analysis

There are two ways of doing text analysis with documents from ProQuest TDM:

1. Coding in Jupyter Notebook on ProQuest TDM

You can create a new R/Python notebook:

screenshot of notebook selection box

Or you can open an existing one, for instance, 'dh2019.ipynb'.

screenshot of folders and notebook links

2. Upload your scripts/notebook from your computer to ProQuest TDM

If you are more comfortable writing scripts in your own editor or notebook, or you already have scripts ready to use, you can upload them to ProQuest TDM Studio. It is a two-step process.

Step one: Upload the file from your computer to the 'temporary file' directory

You need to click the "My Files" button at the upper-left corner of the Jupyter Notebook environment.

screenshot of browser-bar upload tool

In the temporary files folder, click the "Upload Files" button at the upper right corner.

screenshot of upload files interface

You can find your file in the folder if the upload is successful.

screenshot of file upload location

Step two: Upload the file from the 'temporary files' directory to the TDM notebook workspace

To use the scripts/notebook in the Jupyter Notebook workspace, you need to upload the file from the "temporary files" folder.

First, you need to click the "upload" button at the upper-right corner of the notebook workspace.

Your scripts are in "My Files > Temporary Files" directory.

screenshot of local file location

screenshot of local temporary files folder

Once you find your file, you can double click it to upload.

screenshot of temporary files contents

Finally, you can click the blue "Upload" button to upload it.

screenshot of upload interface

Now you can use your uploaded file in the workspace.

screenshot of uploaded file in workspace

 

Step Six: Export the Result

To export the file, you need to execute one line of code:

aws s3 cp /home/ec2-user/SageMaker/[Your Filename] s3://pq-tdm-studio-results/tdm-ale-data/100/results

For instance, if you want to export "dh2019_barchart2_png" from the home directory, you can execute this line:

screenshot of file export jupyter notebook box
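In a Jupyter cell this looks like the following (the leading "!" tells the notebook to run a shell command; the filename is the example from above):

```shell
# Export a file from the notebook workspace to the results bucket.
# Replace the filename with the file you want to export.
!aws s3 cp /home/ec2-user/SageMaker/dh2019_barchart2_png s3://pq-tdm-studio-results/tdm-ale-data/100/results
```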

Once the code is successfully executed, you and your teammates will receive a one-time download link to the file, like this:

screenshot of email sent to user with export link

Note: the current export limit is 15 MB per week.