Before you begin text or data mining, you need to create a dataset, which is called a corpus. The choices you make in assembling a corpus to mine and analyse will be crucial to the success of your project. Having developed a research question, you need to:
This will save time and help you to choose the mining methods best suited to your project.
To use TDM, you must:
Considerations when assembling a corpus:
For text and data mining to occur, a computer must be able to read your text. A simple, but not foolproof, test of whether your text is machine-readable is to use the 'find' command to search for a word that you can see in your document. If the computer can find it, it can read your text.
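For text that has already been saved as a plain-text file, the same 'find' test can be sketched in Python. This is a minimal illustration rather than a complete check: the function name `contains_word` and the UTF-8 default are assumptions, and it is not meant for PDFs or images, which need dedicated software to extract their text.

```python
def contains_word(raw: bytes, word: str, encoding: str = "utf-8") -> bool:
    """Return True if `word` can be found in the decoded file contents.

    `raw` is the file's contents as bytes, for example from
    open(path, "rb").read(). The search ignores upper and lower case.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid text in this encoding, so treat the
        # file as not machine-readable for the purposes of this test.
        return False
    return word.casefold() in text.casefold()


# Hypothetical sample content standing in for a real file.
sample = "The Quick brown fox".encode("utf-8")
print(contains_word(sample, "quick"))
```

As with the manual test, a positive result for one word does not guarantee that every word in the file is readable.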
Scanned images of typed text can be made machine-readable using software that performs optical character recognition (OCR). OCR software looks at an image, identifies the text and then adds the text to the file.
Staff and students at the University have access to Adobe Acrobat, which can be used to undertake OCR. You will need to check any file generated by OCR for quality and accuracy. A computer may be able to read some words in the resulting file but not others due to inconsistencies in image or text quality.
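One rough way to spot-check OCR output is to measure how many of its words appear in a list of known words. The sketch below assumes you supply your own word list; the function name `ocr_word_score` and the sample vocabulary are hypothetical, and a low score only signals that a file needs closer manual checking.

```python
import re


def ocr_word_score(text, vocabulary):
    """Return (share of recognised words, list of unrecognised words)."""
    # Split the text into lower-cased runs of letters, digits and underscores.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    if not words:
        return 0.0, []
    known = {v.lower() for v in vocabulary}
    unknown = [w for w in words if w not in known]
    return (len(words) - len(unknown)) / len(words), unknown


# Hypothetical vocabulary and OCR output containing two typical misreadings.
sample_vocabulary = {"the", "quick", "brown", "fox", "jumps"}
score, suspects = ocr_word_score("Tbe quick brown f0x jumps", sample_vocabulary)
print(f"{score:.0%} of words recognised; check: {suspects}")
```

Names, technical terms and historical spellings will often be missing from a general word list, so the unrecognised words are best treated as prompts for review rather than confirmed errors.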
OCR software doesn’t work very well on handwritten text in images, as handwriting is irregular and sometimes illegible. Handwriting will usually need to be transcribed to make it machine-readable; that is, you’ll need to read the text and type it out manually. This is a very time-consuming process and may not be feasible if you have a lot of handwritten text. Intelligent character recognition (ICR), a form of OCR that can learn and recognise handwriting, is in development, but transcription by hand is the most practical option at present.
Text and data mining is permitted in a number of the databases that the Library provides access to for University staff and students. Check out the full list of databases available for text and data mining, including database licence and access conditions, to see which might be useful for your project.
The following is a list of publicly available data sources that permit data or text mining under a variety of licensing arrangements. For most of these sources, data is accessed either by downloading content directly from the data source or by using an application programming interface (API). Downloading content can be quite a manual process and may take a long time if you need to access a large number of documents. Using an API provided by the data source allows you to automate the process of accessing and downloading content; however, you will need to write some programming code to interact with and call the API.
|Data source||Description||Data access||Further information|
|BioMed Central||Scholarly articles in STM (Science, Technology, and Medicine) fields from peer-reviewed open access journals||Direct download, web harvest, or API||BioMed Central’s Open Data policy|
|British Library Datasets||A wide variety of datasets released by the British Library||Direct download||Each dataset may have its own licensing conditions and re-use stipulations|
|BOM data feeds||Weather data from the Bureau of Meteorology||Direct download||BOM weather data services|
|Data.gov.au||Public datasets from Australian government agencies||API||CKAN API guide|
|PLOS||Open access content and metrics from PLOS journals||API or direct download||Text and Data Mining at PLOS|
|PolMine corpora||Debates in the German Bundestag and meeting records of the United Nations General Assembly||Install data packages using R||PolMine corpora information|
|PubMed Central Open Access Subset||Open access subset of full text archive of biomedical and life sciences journal literature at the US National Institutes of Health’s National Library of Medicine||Direct download or web harvest||Do text mining/retrieving full text|
|State Library NSW datasets and social media archive||Digitised content from the State Library's collections and archived social media content discussing life in NSW||Datasets: direct download; social media: API||Access to open data and About the SLNSW social media archive|
|Trove||Digital content from Australian libraries, hosted by the National Library of Australia||API||Building with Trove|
|Social media data from online social networking service||API with key and access token||API overview|
|Wikidata||Structured data from Wikipedia and other open knowledge bases||Direct download or API||Wikidata: data access|
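To illustrate what calling an API involves, here is a minimal Python sketch for a CKAN-based source such as Data.gov.au. It uses CKAN's `package_search` action with its `q`, `rows` and `start` parameters. The base URL is an assumption and should be confirmed against the CKAN API guide, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the data.gov.au CKAN API; confirm it in the CKAN API guide.
SEARCH_URL = "https://data.gov.au/data/api/3/action/package_search"


def build_search_url(query, rows=10, start=0):
    """Build a package_search request URL for one page of results."""
    return SEARCH_URL + "?" + urlencode({"q": query, "rows": rows, "start": start})


def dataset_titles(payload):
    """Extract dataset titles from a decoded package_search response."""
    if not payload.get("success"):
        return []
    return [item.get("title", "") for item in payload["result"]["results"]]


def fetch_titles(query, rows=10, start=0):
    """Call the API over the network and return the dataset titles."""
    with urlopen(build_search_url(query, rows, start), timeout=30) as response:
        return dataset_titles(json.load(response))
```

Calling `fetch_titles("rainfall", rows=5)` would request the first five matching datasets, and increasing `start` in steps of `rows` pages through the rest. Check each source's licence and access conditions before automating downloads.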
If you need help understanding the licensing conditions and what text and data mining is permitted for a specific data source, please contact firstname.lastname@example.org.