Text and data mining: 5. Text and data mining methods

An introduction to text and data mining concepts and an overview of the steps involved in undertaking text and data mining as part of a research project.

Text and data mining methods

Topic modelling

Topic modelling is a method that looks across all of your text and identifies groups of words that tend to appear in the same documents as each other. For the details of how this method works, see Ted Underwood's post Topic modeling made just simple enough.
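As a rough illustration of how such groups of words can be found, here is a minimal sketch of one widely used topic model, latent Dirichlet allocation (LDA), fitted with a simple Gibbs sampler. This guide does not say which algorithm a given tool uses, so the choice of LDA, the function name top_words_per_topic, the parameter values and the toy documents are all assumptions made for illustration. A real project would use an established tool and a much larger corpus.

```python
import random
from collections import Counter


def top_words_per_topic(docs, num_topics=2, top_n=3, iterations=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling. docs is a list of token lists.

    All parameter values are illustrative assumptions, not recommendations.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly re-assign each token, favouring topics that are already
    # common in its document and that already contain the same word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Report the most frequent words assigned to each topic.
    return [
        [w for w, c in topic_word[k].most_common() if c > 0][:top_n]
        for k in range(num_topics)
    ]


# Made-up example documents, already cleaned and tokenised.
example_docs = [
    "whale ship sea whale".split(),
    "sea ship whale sea".split(),
    "vote election party vote".split(),
    "party election vote party".split(),
]
print(top_words_per_topic(example_docs))
```

The output is only a list of word groups. Deciding what each group is about, and whether it is meaningful, remains the researcher's job.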

 

Why is it used?

Topic modelling can be used to:

  • Perform an initial exploration of a corpus by providing an overview of the discourses or topics that appear in your texts
  • Discover overarching themes or concepts in a large corpus that might be missed if you were to read each text individually, or if a corpus is too large for close reading to be feasible
  • Assist in undertaking a literature review by identifying gaps or trends in existing research

 

Limitations

  • Topic modelling is unsuitable for short documents or small corpora. A corpus must have enough data to establish defined themes and concepts.
  • Topic modelling finds groups of words that a statistical analysis has identified as appearing together. Identifying the topic that unites each group of words is up to you. Further investigation may be required to decide what a topic is about, and some of the identified groups of words may not have any useful meaning for your research question.
  • You should not solely rely on topic modelling to make prescriptive judgements about the text. Like any form of research, you should do a closer reading of the texts in order to validate any theories about the corpus.  
  • Documents need to be cleaned prior to using topic modelling. An uncleaned dataset will produce topics made up of function words such as conjunctions, or add noise to topics that do contain useful content. More information on cleaning a dataset is in Section 4 Cleaning and preparing data.

 

Sentiment analysis

Sentiment analysis is used to determine if the emotion in a text is positive, negative or neutral. A common approach is to score sentences based on the presence of positive and negative words and phrases.

  • The algorithm relies on a list of words and phrases which have been scored by a person
  • The algorithm analyses the text and assigns a score to each sentence based on the presence of words from the list (e.g. -1 for each negative word and +1 for each positive word)
  • Depending on the algorithm and word list, it may also detect negations (such as don’t, isn’t) or intensifiers (such as very, really) and adjust the score accordingly. For example, the sentence “It was a really good game overall.” might be scored higher than the sentence “It was a good game overall.”
  • The algorithm then divides the sum of the sentence scores by the number of sentences to give the overall sentiment score for the document
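The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular sentiment tool works: the word list, the negation and intensifier lists, the 1.5 multiplier and the function names are all assumptions made up for the example.

```python
import re

# Hypothetical hand-scored word list; a real lexicon would be far larger.
LEXICON = {"good": 1, "great": 1, "enjoyable": 1,
           "bad": -1, "boring": -1, "terrible": -1}
NEGATIONS = {"not", "don't", "isn't", "wasn't", "never"}
INTENSIFIERS = {"very": 1.5, "really": 1.5}  # assumed multipliers


def score_sentence(sentence):
    """Sum the scores of listed words, adjusting for a preceding
    intensifier and/or negation."""
    tokens = re.findall(r"[a-z']+", sentence.lower().replace("’", "'"))
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = float(LEXICON[token])
        prev = tokens[i - 1] if i > 0 else ""
        if prev in INTENSIFIERS:
            value *= INTENSIFIERS[prev]
            prev = tokens[i - 2] if i > 1 else ""
        if prev in NEGATIONS:
            value = -value
        score += value
    return score


def score_document(text):
    """Average the sentence scores over the number of sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(score_sentence(s) for s in sentences) / len(sentences)


print(score_sentence("It was a good game overall."))         # 1.0
print(score_sentence("It was a really good game overall."))  # 1.5
print(score_document("It was a good game overall. The ending was boring."))  # 0.0
```

A score above zero suggests a positive document, below zero a negative one, and close to zero a neutral or mixed one.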

 

Why is it used?

Sentiment analysis can be used to:

  • Gauge the overall mood of a text
  • Examine social media posts and website comments to investigate emotional responses to topics
  • Examine texts that document significant events, such as newspapers, to determine public perceptions and, potentially, the evolution of perceptions through time 

 

Limitations

  • Sarcasm and satire can throw off the results, because the words used do not match the intended meaning.
  • It does not work well with short pieces of text.

 

Term frequency and TF-IDF

Term frequency is a method that looks at how often a word or a phrase appears in a document or in your corpus. In its simplest form, term frequency is calculated by counting the number of times the term is used, providing insight into the topics under most frequent discussion in your text.
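In its simplest form, this counting can be sketched with Python's standard library. The function name term_frequencies, the basic tokenisation rule and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how many times each lower-cased word appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = term_frequencies("The cat sat on the mat. The dog sat too.")
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```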

To surface more useful terms, many researchers combine term frequency with inverse document frequency, which down-weights a term according to how many documents in the corpus contain it. This technique is often referred to as TF-IDF, and it is a useful way of identifying terms that are distinctive to particular documents in your corpus, as opposed to terms that are common in all or most of the documents.
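One common formulation multiplies a term's relative frequency in a document by the logarithm of (number of documents / number of documents containing the term). The sketch below assumes that formulation. Other tools use smoothed or otherwise adjusted variants, and the function name tf_idf and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


example_docs = [
    ["the", "whale", "swims"],
    ["the", "ship", "sails"],
    ["the", "whale", "sings"],
]
for doc_scores in tf_idf(example_docs):
    print(doc_scores)
```

In this example "the" appears in every document and so scores zero everywhere, while "swims" scores higher than "whale" in the first document because it appears in no other document.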

 

Why is it used?

  • It can be used to provide insights into how language is used across a sample of documents, for instance how a word or term falls in and out of use over time.
  • Through visualising frequent terms – such as in a word cloud – you can get an idea of the overall content of a document or groups of documents.  
  • Term frequency is also used for indexing documents and information retrieval, where documents containing more instances of a search term are shown before documents containing fewer.
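The retrieval use mentioned in the last point can be sketched as a simple ranking by raw count. Real search systems use more refined weighting. The function name rank_by_term and the sample documents are assumptions for illustration.

```python
import re


def rank_by_term(docs, term):
    """docs maps a document name to its text. Returns the names ordered
    by how often the term occurs, highest count first."""
    counts = {
        name: re.findall(r"[a-z']+", text.lower()).count(term.lower())
        for name, text in docs.items()
    }
    return sorted(counts, key=counts.get, reverse=True)


example_docs = {
    "a": "whale whale ship",
    "b": "ship",
    "c": "whale",
}
print(rank_by_term(example_docs, "whale"))  # ['a', 'c', 'b']
```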

 

Limitations

  • Does not account for synonyms, e.g. run and sprint are counted as separate terms.
  • Does not account for homographs, e.g. tear (noun, liquid produced when crying) vs tear (verb, to rip something).
  • Can produce unhelpful results if your corpus hasn't been appropriately cleaned, e.g. frequencies dominated by stopwords such as "of" and "the". 
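The effect of stopwords on frequency counts can be seen in a small sketch. The stopword list here is a tiny made-up sample and the function name content_word_counts is an assumption. A real project would use a fuller list as part of the cleaning described in Section 4.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists contain many more words.
STOPWORDS = {"a", "and", "in", "of", "the", "to"}


def content_word_counts(text):
    """Count words after lower-casing and dropping stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


text = "The history of the library and the history of the book"
print(Counter(re.findall(r"[a-z']+", text.lower())).most_common(1))  # [('the', 4)]
print(content_word_counts(text).most_common(1))                      # [('history', 2)]
```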

 

Library support

Support services

The Library provides a number of services to help you get started with text and data mining.

Consultations

Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!