Common methods include:
Topic modelling is a method that looks across all of your text and identifies groups of words that tend to appear in the same documents as each other. For the details of how this method works, see Ted Underwood's post 'Topic modeling made just simple enough'.
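To make the idea concrete, here is a minimal sketch of one widely used topic modelling approach, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling. It is an illustration rather than a recommended implementation: the function names (lda_gibbs, top_words), the tiny sample documents and the parameter values (n_topics, alpha, beta, n_iter) are all assumptions made for this example, and real projects normally rely on an established library and a far larger corpus.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model. docs is a list of token lists (hypothetical helper)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]        # words per topic in each document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                      # total words assigned to each topic
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each word's topic given all the other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic

def top_words(word_counts, n=3):
    """Return the n most frequent words currently assigned to one topic."""
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    return [word for word, count in ranked[:n] if count > 0]

# Made-up sample documents, already split into words.
docs = [
    "whale ship sea captain whale".split(),
    "sea ship sail captain".split(),
    "vote election party vote".split(),
    "party election campaign vote".split(),
]
topics, doc_topic = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topics):
    print(k, top_words(counts))
```

Because the sampler is random, the groups it finds can vary with the seed and the number of iterations, and on a corpus this small they should not be read as meaningful results.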
Topic modelling can be used to:
Sentiment analysis is used to determine whether the emotion expressed in a text is positive, negative or neutral. In its simplest form, this is done by scoring sentences based on the presence of positive and negative adjectives and phrases.
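The scoring idea described above can be sketched in a few lines of Python. This is a deliberately simple illustration: the word lists POSITIVE and NEGATIVE are a made-up mini-lexicon, and the function names are assumptions for this example rather than part of any particular tool.

```python
import re

# Hypothetical mini-lexicon for illustration only.
# Published sentiment lexicons contain thousands of scored entries.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "poor", "terrible", "sad", "hate"}

def score_sentence(sentence):
    """Count positive words minus negative words in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

def classify(sentence):
    """Label a sentence as positive, negative or neutral from its score."""
    score = score_sentence(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in [
    "The service was great and I love it",
    "A terrible, sad ending",
    "The book is on the table",
]:
    print(classify(sentence), "-", sentence)
```

A word-counting approach like this does not handle negation ('not good') or sarcasm, which is one reason established tools use larger lexicons and additional rules or trained models.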
Sentiment analysis can be used to:
Term frequency is a method that looks at how often a word or a phrase appears in a document or in your corpus. In its simplest form, term frequency is calculated by counting the number of times the term is used, providing insight into the topics under most frequent discussion in your text.
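As a minimal sketch of the simplest form of term frequency, the following Python counts how often each word appears in a short piece of text. The sample sentence and the helper names (tokenize, term_frequency) are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(text):
    """Count the number of times each word is used."""
    return Counter(tokenize(text))

sample = "The cat sat on the mat. The cat slept."  # made-up sample text
tf = term_frequency(sample)
print(tf.most_common(2))  # [('the', 3), ('cat', 2)]
```

Note that a very common word such as 'the' tops the list, which is why raw counts are usually filtered or weighted before they are interpreted.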
To filter for more useful terms, many researchers couple this method with inverse document frequency, which down-weights a term according to how many documents in the corpus contain it. This technique is often referred to as TF-IDF, and it's a useful way of identifying terms that are distinctive of particular documents in your corpus, differentiating these from terms that are common to all or most of the documents in the corpus.
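The weighting can be sketched as follows, using one common formulation among several: term frequency is the share of a document's words taken up by the term, and inverse document frequency is the logarithm of the total number of documents divided by the number of documents containing the term. The three-sentence corpus and the function names are assumptions for this example; library implementations often add smoothing and normalisation, so their numbers will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up three-document corpus.
corpus = [
    "the whale swam in the sea",
    "the ship sailed on the sea",
    "the captain hunted the whale",
]
scores = tf_idf(corpus)
for doc_scores in scores:
    best = max(doc_scores, key=doc_scores.get)
    print(best, round(doc_scores[best], 3))
```

In this sketch 'the' appears in every document, so its weight is zero, while words that occur in only one document receive the highest weights.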
A collocation is a group of two or more words that tend to appear close together more often than would be expected by chance. Statistical tests are used to identify co-occurring words, and the strength of the association is evaluated to determine whether the co-occurrence is greater than random chance. Collocations can be multiword phrases, such as 'middle management' or 'crystal clear', or they can be words that appear near each other but not always directly together. For example, 'door' and 'knock' are likely to appear in close proximity, as in the phrase 'A knock came at the door', but they don't necessarily form a distinct phrase.
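One simple way to measure the strength of association between adjacent words is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is illustrative only: PMI is just one of several association measures, and the sample text, the function name bigram_pmi and the min_count threshold are assumptions for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable, inflated PMI values
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Made-up sample text.
text = (
    "Middle management met today. The report praised middle management. "
    "The report was crystal clear and the plan was crystal clear."
)
scores = bigram_pmi(tokenize(text))
for pair, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(value, 2))
```

This sketch only considers words that are directly adjacent. Finding pairs such as 'knock' and 'door' would require counting co-occurrences within a window of several words instead.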
The Library provides a number of services to help you get started with text and data mining. The Library also provides support and information for systematic reviews.
Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!
Aboriginal and Torres Strait Islander peoples are advised that this website may contain images, voices and names of people who have died.
The University of Sydney Library acknowledges that its facilities sit on the ancestral lands of Aboriginal and Torres Strait Islander peoples, who have for thousands of generations exchanged knowledge for the benefit of all.