1. Go to the Dashboard page of ProQuest TDM Studio and click "Create New Dataset"
When you click "+ Create New Dataset", you can choose between a list of available ProQuest databases and a list of available publication titles.
Note 1: TDM Studio supports Chrome and Firefox only.
Note 2: TDM Studio has a limit of ten datasets per institution.
2. Search for Content
To explore licensed ProQuest content prior to accessing the TDM Studio platform, please visit this list of ProQuest databases. Please note that not all content available through these databases is available on the TDM Studio platform. If you would like additional information about content specific to TDM Studio, please reach out to one of our team members via the "Contact Us" box on the left.
Search for databases or publications by keyword in the title.
Note: you can select multiple publications or databases.
3. Select Articles
Search for articles using a combination of keywords and date range.
In this example, we select dissertations on digital humanities/scholarship published in 2019:
Note: the size limit of a dataset is 2,000,000 documents.
4. Name your dataset
Specify the name of your dataset.
You can find your newly built dataset on the dashboard page. Once the submission process is completed, the "Queued" in STATUS will change to "Completed". This usually takes less than ten minutes.
1. Check the Status of Your Dataset
If the status is 'Completed' and the number of documents is correct, you can go ahead and work with your dataset.
tdm-ale-data/100/corpus/dh2019/
2. Open Jupyter Notebook
Click the 'Open Jupyter Notebook' button at the upper-right corner of the dashboard page.
Note: in the Jupyter Notebook home directory, you can find two folders that contain useful guides on working with text data in the environment.
3. Transfer Data to Notebook
Open or create a notebook and execute the following three lines of code to transfer the data. In the first two lines, you need to change 'dh2019' to the name of your dataset.
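The exact lines are provided in the guides inside your notebook environment; as a sketch of the idea (the bucket name 'pq-tdm-studio' is an assumption, and the 'tdm-ale-data/100' prefix follows the example path shown elsewhere in this guide):

```python
# A sketch only: the bucket name 'pq-tdm-studio' is an assumption, and the
# 'tdm-ale-data/100' prefix follows the example path in this guide; copy the
# exact lines from the guides provided in your notebook environment.
dataset = 'dh2019'  # change 'dh2019' to the name of your dataset
source = f's3://pq-tdm-studio/tdm-ale-data/100/corpus/{dataset}/'
# In a notebook cell, the copy itself runs as a shell command, for example:
# !aws s3 sync s3://pq-tdm-studio/tdm-ale-data/100/corpus/dh2019/ ./data/dh2019/
print(f'aws s3 sync {source} ./data/{dataset}/')
```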
Once the code is executed, you can find your dataset under the 'data' directory.
1. Create a list of file directories
You can create a list of file directories using Python's glob module.
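A minimal sketch of this step, assuming the dataset was transferred into a 'data/dh2019/' directory as above (adjust the path for your own dataset name):

```python
import glob
import os

# Assumes the dataset was transferred into 'data/dh2019/' in the previous
# step; adjust the directory for your own dataset name.
def list_xml_files(directory):
    """Return a sorted list of paths to every .xml file in `directory`."""
    return sorted(glob.glob(os.path.join(directory, '*.xml')))

files = list_xml_files('data/dh2019')
```

Sorting the list keeps the order of files stable across runs, which makes later steps easier to reproduce.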
2. Read and parse all the files
The files are in XML format. To retrieve the content you need, you can use Python's 'lxml' and 'BeautifulSoup' libraries to process these files.
For instance, you can retrieve the abstracts of dissertations in the dh2019 dataset using the following code:
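As a standard-library sketch of the same idea (the 'abstract' element name is an assumption; inspect one file from your own dataset to confirm the actual tag names, and the same approach works with lxml or BeautifulSoup):

```python
import xml.etree.ElementTree as ET

# Sketch only: the 'abstract' tag name is an assumption; open one XML file
# from your dataset to confirm the actual element names.
def get_abstract(xml_path, tag='abstract'):
    """Return the text of the first element whose name ends with `tag`
    (case-insensitive), or None if no such element exists."""
    tree = ET.parse(xml_path)
    for elem in tree.iter():
        if elem.tag.lower().endswith(tag):
            return ''.join(elem.itertext()).strip()
    return None
```

Applied over the file list from the previous step, this yields one abstract (or None) per document, e.g. `abstracts = [get_abstract(p) for p in files]`.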
3. Text cleaning
As we can see, the abstracts still contain many tags, which makes it difficult to conduct computational analysis.
We can clean the XML tags using the following code:
Now we have the clean text ready for analysis.
4. Save files
If you need to save the clean text for you or your teammates to analyze, you can save the file in CSV format.
To do so, you need to first transform the text into a data frame and then save it as a CSV file.
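A minimal sketch with pandas (the column and file names here are just examples; 'abstracts' stands in for the list of cleaned abstract strings from the previous step):

```python
import pandas as pd

# 'abstracts' stands in for the list of cleaned abstract strings produced
# in the previous step; the column and file names are just examples.
abstracts = ['First clean abstract.', 'Second clean abstract.']
df = pd.DataFrame({'abstract': abstracts})
df.to_csv('dh2019_abstracts.csv', index=False)
```

Setting index=False keeps the row index out of the CSV, so the file contains only the text column.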
You can find your file in the home directory:
You can click the file to inspect it, though this is not recommended if the file is large.
There are two ways of doing text analysis with documents from ProQuest TDM Studio:
1. Coding in Jupyter Notebook on ProQuest TDM
You can create a new R/Python notebook:
Or you can open an existing one, for instance, 'dh2019.ipynb'.
2. Upload your scripts/notebook from your computer to ProQuest TDM
If you are more comfortable writing scripts in your own editor or notebook, or you already have scripts ready to use, you can upload them to ProQuest TDM Studio. It is a two-step process.
Step one: Upload the file from your computer to the 'temporary file' directory
You need to click the "My Files" button at the upper-left corner of the Jupyter Notebook environment.
In the temporary files folder, click the "Upload Files" button at the upper right corner.
You can find your file in the folder if the upload is successful.
Step two: Upload the file from the 'temporary files' directory to the TDM notebook workspace
To use the scripts/notebook in the Jupyter Notebook workspace, you need to upload the file from the "temporary files" folder.
First, you need to click the "upload" button at the upper-right corner of the notebook workspace.
Your scripts are in "My Files > Temporary Files" directory.
Once you find your file, you can double-click it to select it.
Finally, click the blue "Upload" button to complete the upload.
Now you can use your uploaded file in the workspace.
To export the file, you need to execute one line of code:
aws s3 cp /home/ec2-user/SageMaker/[Your Filename] s3://pq-tdm-studio-results/tdm-ale-data/100/results
For instance, if you want to export "dh2019_barchart2.png" from the home directory, you can execute this line:
aws s3 cp /home/ec2-user/SageMaker/dh2019_barchart2.png s3://pq-tdm-studio-results/tdm-ale-data/100/results
Once the code is successfully executed, you and your teammates will receive a one-time only download link to the file like this:
Note: the current export limit is 15 MB per week.