
Text and data mining: 2. Creating a dataset

An introduction to text and data mining concepts and an overview of the steps involved in undertaking text and data mining as part of a research project.

Choosing a dataset

Before you begin text or data mining, you need to create a dataset, which is called a corpus. The choices you make in assembling a corpus to mine and analyse will be crucial to the success of your project. Having developed a research question, you need to:

  • Consider what content or information you need that will answer your research question. 
  • Review what resources are available for mining.

This will save time and help you to choose the mining methods best suited to your project.

To use TDM, you must:

Assembling a corpus

Considerations when assembling a corpus:

  • Is the data available to me to use?
  • Where is the data coming from? (Primary or secondary sources? Is there any bias?)
  • What is the geographical coverage of the data set?
  • What is the time period or date range that the data covers?
  • Is the data clean and ready to use? What kinds of cleaning might the data require?
  • What kind of data can you access? For instance, do you have access to metadata, abstracts or full text?
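Several of the considerations above (date range, geographical coverage, full-text availability) can be checked quickly once you have metadata for a candidate corpus. The following is a minimal Python sketch, assuming the metadata is held as a list of dictionaries. The records, the field names (`year`, `place`, `has_full_text`) and the function name `summarise_corpus` are all hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical metadata records; the field names are illustrative only.
records = [
    {"id": "doc1", "year": 1901, "place": "Sydney", "has_full_text": True},
    {"id": "doc2", "year": 1915, "place": "Melbourne", "has_full_text": False},
    {"id": "doc3", "year": 1915, "place": "Sydney", "has_full_text": True},
]

def summarise_corpus(records):
    """Report the date range, places covered and number of full-text records."""
    if not records:
        raise ValueError("The corpus has no records to summarise")
    years = [r["year"] for r in records]
    return {
        "date_range": (min(years), max(years)),
        "places": Counter(r["place"] for r in records),
        "full_text_count": sum(1 for r in records if r["has_full_text"]),
    }

summary = summarise_corpus(records)
print("Date range:", summary["date_range"])
print("Places:", dict(summary["places"]))
print("Records with full text:", summary["full_text_count"])
```

A summary like this will not tell you whether a source is biased, but it can show gaps in coverage before you commit to a mining method.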

Can a computer read your text?

In order for text and data mining to occur, a computer must be able to read your text. A simple, but not fool-proof, test to see if your text is machine-readable is to use the 'find' command to search for a word that you can see in your document. If the computer can find it, it can read your text.
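The same 'find' test can be run in code once the text has been extracted from a document. The following is a minimal Python sketch, assuming the document's text is already available as a string (extracting text from a PDF or image is a separate step not shown here). The function name `is_word_findable` is hypothetical.

```python
import re

def is_word_findable(text: str, word: str) -> bool:
    """Return True if `word` appears as a whole word in `text`, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Text as a computer would see it in a machine-readable file.
readable = "Text and data mining requires machine-readable text."
# A scanned image with no text layer typically yields no extractable text.
scanned = ""

print(is_word_findable(readable, "mining"))
print(is_word_findable(scanned, "mining"))
```

As with the manual test, a successful search for one word does not guarantee that the rest of the document is readable.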

Scanned images of typed text can be made machine-readable using software that performs optical character recognition (OCR). OCR software looks at an image, identifies the text and then adds the text to the file.

Staff and students at the University have access to Adobe Acrobat, which can be used to undertake OCR. You will need to check any file generated by OCR for quality and accuracy. A computer may be able to read some words in the resulting file but not others due to inconsistencies in image or text quality.
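One rough way to flag OCR output that needs closer checking is to count tokens that do not look like ordinary words or numbers. The following Python sketch is a simple heuristic under stated assumptions, not a measure of accuracy: it treats any token mixing letters with digits or symbols as suspicious, so it will also flag legitimate items such as product codes. The function names are hypothetical.

```python
import string

# Punctuation that may legitimately wrap a word, including curly quotes.
EDGE_PUNCTUATION = string.punctuation + "“”‘’"

def looks_ordinary(token: str) -> bool:
    """Return True if the token looks like a plain word or a plain number."""
    core = token.strip(EDGE_PUNCTUATION)
    if not core:
        return False  # the token was punctuation only
    # Allow internal hyphens and apostrophes, as in "time-consuming" or "doesn't".
    letters_only = core.replace("-", "").replace("'", "").replace("’", "")
    digits_only = core.replace(",", "").replace(".", "")
    return letters_only.isalpha() or digits_only.isdigit()

def suspicious_token_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that do not look ordinary."""
    tokens = text.split()
    if not tokens:
        return 0.0
    suspicious = [t for t in tokens if not looks_ordinary(t)]
    return len(suspicious) / len(tokens)

clean = "The quick brown fox jumps over the lazy dog."
noisy = "Th3 qu1ck br0wn fox jum|ps ov3r the l@zy dog."

print(round(suspicious_token_ratio(clean), 2))
print(round(suspicious_token_ratio(noisy), 2))
```

A high ratio suggests the file needs manual review or a better-quality scan; a low ratio does not prove the text is correct, because OCR can also produce real but wrong words.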

OCR software doesn’t work very well on handwritten text in images, as handwriting is irregular and sometimes illegible. Handwriting will usually need to be transcribed to get it into machine-readable form; that is, you’ll need to read the text and manually type it out. This is a very time-consuming process and may not be feasible if you have a lot of handwritten text. Intelligent character recognition (ICR), a form of OCR that can learn and recognise handwriting, is in development, but transcription by hand is the most practical option at present.

Text and data mining databases

Library licensed databases

Text and data mining is permitted in a number of the databases that the Library provides access to for University staff and students. Check out the full list of databases available for text and data mining, including database licence and access conditions, to see which might be useful for your project.


Publicly available data for mining

The following is a list of publicly available data sources that permit data or text mining under a variety of licensing arrangements. For most of these sources, data is accessed either by directly downloading content from the data source or by using an application programming interface (API). Downloading content can be quite a manual process and may take a long time if you need to access a large number of documents. Using an API provided by the data source allows you to automate the process of accessing and downloading content. However, you will need to write some programming code to interact with and call on the API.

| Data source | Description | Data access | Further information |
| --- | --- | --- | --- |
| BioMed Central | Scholarly articles in STM (Science, Technology, and Medicine) fields from peer-reviewed open access journals | Direct download, web harvest, or API | BioMed Central’s Open Data policy |
| British Library Datasets | A wide variety of datasets released by the British Library | Direct download | Each dataset may have its own licensing conditions and re-use stipulations |
| BOM data feeds | Weather data from the Bureau of Meteorology | Direct download | BOM weather data services |
| Data.gov.au | Public datasets from Australian government agencies | API | CKAN API guide |
| PLOS | Open access content and metrics from PLOS journals | API or direct download | Text and Data Mining at PLOS |
| PolMine corpora | Debates in the German Bundestag and meeting records of the United Nations General Assembly | Install data packages using R | PolMine corpora information |
| PubMed Central Open Access Subset | Open access subset of full text archive of biomedical and life sciences journal literature at the US National Institutes of Health’s National Library of Medicine | Direct download or web harvest | Do text mining/retrieving full text |
| State Library NSW datasets and social media archive | Digitised content from the State Library's collections and archived social media content discussing life in NSW | Datasets: direct download. Social media: API | Access to open data and About the SLNSW social media archive |
| Trove | Digital content from Australian libraries, hosted by the National Library of Australia | API | Building with Trove |
| Twitter | Social media data from online social networking service | API with key and access token | API overview |
| WikiData | Structured data from Wikipedia and other open knowledge bases | Direct download or API | Wikidata: data access |
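To give a sense of what calling an API involves, the following is a minimal Python sketch of a dataset search against a CKAN-style API such as the one used by Data.gov.au. It rests on several assumptions: the base URL is assumed and should be checked against the CKAN API guide, the function names are hypothetical, and the sample response is hand-written to match the general shape of a CKAN `package_search` reply rather than taken from real data.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the data.gov.au CKAN action API; confirm before use.
BASE_URL = "https://data.gov.au/data/api/3/action/package_search"

def build_search_url(query: str, rows: int = 5) -> str:
    """Build a package_search URL for a free-text query."""
    return BASE_URL + "?" + urlencode({"q": query, "rows": rows})

def extract_titles(response_text: str) -> list[str]:
    """Pull dataset titles out of a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("The API reported an unsuccessful request")
    return [dataset.get("title", "") for dataset in payload["result"]["results"]]

def search_datasets(query: str, rows: int = 5) -> list[str]:
    """Call the API over the network and return matching dataset titles."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return extract_titles(response.read().decode("utf-8"))

# Offline demonstration with a made-up response, so the parsing step
# can be checked without network access.
sample = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"title": "Example dataset A"}, {"title": "Example dataset B"}],
    },
})

print(build_search_url("rainfall", rows=2))
print(extract_titles(sample))
```

Calling `search_datasets("rainfall")` would send a live request. Before automating downloads from any source, check its licence and any rate limits or API key requirements.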


If you need help understanding the licensing conditions and what text and data mining is permitted for a specific data source, please contact researchdatasupport@sydney.edu.au.

Library support

Support services

The Library provides a number of services to help you get started with text and data mining.

Consultations

Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!