
Text and data mining: 5. Text and data mining methods

An introduction to text and data mining concepts and an overview of the steps involved in undertaking text and data mining as part of a research project.

Text and data mining methods

Topic modelling

Topic modelling is a method that looks across all of the texts in your corpus and identifies groups of words that tend to appear in the same documents as each other. To get into the details of how this method works, take a look at Ted Underwood's post Topic modeling made just simple enough.
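To make this concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation, one common topic modelling algorithm. The four-document corpus and the choice of two topics are assumptions for illustration only; real corpora need far more, and longer, documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; a real topic model needs many more, and longer, documents.
docs = [
    "the court heard evidence from the witness",
    "the judge adjourned the court until monday",
    "rainfall flooded the river and nearby farms",
    "farmers reported crop losses after the flood",
]

# Turn the documents into word counts, dropping common English stopwords.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit a two-topic model (the number of topics is a choice you must make).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words in each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```

Note that the model only returns weighted word lists; deciding that one topic is "about" courts and the other "about" flooding is still an interpretive step, as discussed under Limitations below.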

 

Why is it used?

Topic modelling can be used to:

  • Perform an initial exploration of a corpus by providing an overview of the discourses or topics that appear in your texts
  • Discover overarching themes or concepts in a large corpus that might be missed when reading each text individually, or when the corpus is too large for close reading to be feasible
  • Assist in undertaking a literature review by identifying gaps or trends in existing research

 

Limitations

  • Topic modelling is unsuitable for short documents or small corpora; a corpus must contain enough text to establish well-defined themes and concepts.
  • Topic modelling finds groups of words that a statistical analysis has identified as appearing together; however, identifying the topic that unites each group of words is up to you. Further investigation may be required to decide what a topic is about, and some of the identified groups may not hold any useful meaning for your research question.
  • You should not rely solely on topic modelling to make definitive judgements about a text. As with any form of research, you should do a closer reading of the texts to validate any theories about the corpus.
  • Documents need to be cleaned before topic modelling. An uncleaned dataset will produce topics dominated by stopwords such as conjunctions, or add noise to otherwise useful topics. More information on cleaning a dataset is in Section 4 Cleaning and preparing data.

 


Sentiment analysis

Sentiment analysis is used to determine whether the emotion in a text is positive, negative or neutral. This is done by scoring sentences based on the presence of positive and negative words and phrases.

  • The algorithm relies on a list of words and phrases that have been scored by a person (e.g. -1 for negative words and +1 for positive words)
  • It analyses the text and assigns each sentence a score based on the presence of words from that list
  • Depending on the algorithm and word list, it may also detect negations (such as don’t, isn’t) or intensifiers (such as very, really) and adjust the score accordingly. For example, the sentence “It was a really good game overall.” might be scored higher than the sentence “It was a good game overall.”
  • It then averages the sentence scores to produce an overall sentiment score for the document, as in the sketch below
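A minimal sketch of this lexicon-based approach in plain Python follows. The word list, scores, negation and intensifier handling are all invented for illustration; real projects typically rely on established, human-scored lexicons such as AFINN or VADER.

```python
import re

# A tiny invented lexicon; real lexicons score thousands of words and phrases.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "don't", "isn't", "never"}
INTENSIFIERS = {"very", "really"}

def score_sentence(sentence):
    """Score one sentence, adjusting for negations and intensifiers."""
    words = re.findall(r"[\w']+", sentence.lower())
    score = 0.0
    for i, word in enumerate(words):
        if word not in LEXICON:
            continue
        value = LEXICON[word]
        prev = words[i - 1] if i > 0 else ""
        if prev in NEGATIONS:        # "not good" flips the polarity
            value = -value
        elif prev in INTENSIFIERS:   # "really good" strengthens it
            value *= 1.5
        score += value
    return score

def score_document(text):
    """Average the sentence scores to get the document's overall sentiment."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(score_sentence(s) for s in sentences) / len(sentences)

print(score_document("It was a really good game overall. The food was not good."))
# 0.25: the first sentence scores 1.5, the second -1.0
```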

 

Why is it used?

Sentiment analysis can be used to:

  • Gauge the overall mood of a text
  • Examine social media posts and website comments to investigate emotional responses to topics
  • Examine texts that document significant events, such as newspapers, to determine public perceptions and, potentially, the evolution of perceptions through time 

 

Limitations

  • Sarcasm and satire can throw off outputs, since words are scored at face value.
  • Sentiment analysis doesn't work well with short pieces of text, which may contain too few scored words to give a reliable result.

 


Term frequency and TF-IDF

Term frequency is a method that looks at how often a word or a phrase appears in a document or in your corpus. In its simplest form, term frequency is calculated by counting the number of times the term is used, providing insight into the topics under most frequent discussion in your text.
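In code, simple term frequency is just counting. A minimal sketch, assuming a document has already been loaded into a string:

```python
from collections import Counter
import re

text = "The river flooded. The river burst its banks and the town flooded."

# Lowercase the text, split it into word tokens, then count them.
tokens = re.findall(r"[a-z']+", text.lower())
print(Counter(tokens).most_common(5))
# [('the', 3), ('river', 2), ('flooded', 2), ('burst', 1), ('its', 1)]
```

Note how "the" dominates the counts; this is the stopword problem described under Limitations below.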

To surface more useful terms, many researchers pair term frequency with inverse document frequency, which down-weights terms that also appear frequently in other documents in the corpus. This combination is usually referred to as TF-IDF, and it is a useful way of identifying terms that are distinctive to particular documents in your corpus, differentiating them from terms that are common across all or most of the documents.
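As a sketch, scikit-learn's TfidfVectorizer computes these scores for a whole corpus in a few lines (the three toy documents are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the river flooded the town",
    "the election results surprised the town",
    "the river burst its banks",
]

# Build TF-IDF scores, dropping common English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # one row per document

# For each document, print its highest-scoring (most distinctive) term.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    print(f"Doc {i}: {terms[row.argmax()]}")
```

Here "town" and "river" each appear in two documents, so they are down-weighted relative to terms like "election" or "banks" that occur in only one.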

 

Why is it used?

  • It can be used to provide insights into how language is used across a sample of documents, for instance how a word or term falls in and out of use over time.
  • Through visualising frequent terms – such as in a word cloud – you can get an idea of the overall content of a document or group of documents.
  • Term frequency is also used for indexing and information retrieval, where documents containing more instances of a search term are ranked above documents containing fewer.

 

Limitations

  • Does not account for synonyms e.g. run, sprint.
  • Does not account for homographs, e.g. tear (noun, liquid produced when crying) vs tear (verb, to rip something).
  • Can produce unhelpful results if your corpus hasn't been appropriately cleaned, e.g. frequencies dominated by stopwords such as "of" and "the". 

 


Collocation analysis

A collocation is a group of two or more words that tend to appear close together more often than would be expected by chance. Statistical tests are used to identify co-occurring words, and the strength of each association is evaluated to determine whether the co-occurrence is greater than random chance. Collocations can be multiword phrases, such as middle management or crystal clear, or they can be words that appear near each other but not always directly together. For example, door and knock are likely to appear in close proximity, as in the phrase 'A knock came at the door', but they don't necessarily form a distinct phrase.
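As a sketch, NLTK's collocation tools can rank word pairs by a statistical association measure such as pointwise mutual information (PMI). The toy text and the frequency filter of 2 are assumptions for illustration:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = (
    "middle management approved the plan . "
    "the plan seemed crystal clear to middle management . "
    "a knock came at the door . "
    "middle management heard a knock at the door ."
)
tokens = text.split()  # a real project would use a proper tokenizer

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore word pairs that occur only once

# Rank bigrams by PMI; higher scores indicate a stronger association.
measures = BigramAssocMeasures()
for bigram in finder.nbest(measures.pmi, 5):
    print(bigram)
```

Measures other than PMI (for example the log-likelihood ratio) are also available and can rank pairs differently, which is the limitation noted below.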

 

Why is it used?

  • Understanding the contexts in which words are used and the associated meanings that words can gain due to regularly co-occurring with particular other words. This can include investigating how language can be used to construct and reinforce societal norms and stereotypical assumptions, e.g. the collocation illegal immigrant can reinforce negative ideas around immigration and migrants.
  • Identifying and understanding subtle differences in meaning and usage in near synonyms, e.g. strong and powerful have similar meanings, but we would use strong tea and powerful computer rather than the other way around.
  • Identifying idioms
  • Understanding phrase construction by native speakers of a language

 

Limitations

  • Different statistical methods can identify different words as collocates from the same text. It's important to understand your chosen method and the information it gives you, so that you can use the most appropriate measure for the question you want to answer.

 


Library support

Support services

The Library provides a number of services to help you get started with text and data mining. The Library also provides support and information for systematic reviews.

Consultations

Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!
