
Text and data mining: 4. Cleaning and preparing data

An introduction to text and data mining concepts and an overview of the steps involved in undertaking text and data mining as part of a research project.

Cleaning and preparing data

Having generated a corpus, you now need to take some steps to make sure that your texts are in a form that a computer can understand and work with. ‘Pre-processing’ is a catch-all term used for the different activities that you undertake to get your documents ready to be analysed. You may only use a few pre-processing techniques, or you may decide to use a wide array, depending on your documents, the kind of text you have and the kinds of analyses you want to perform.

The first pre-processing step in any TDM project is to identify the cleaning that will need to be done to enable your analysis. Cleaning refers to steps that you take to standardise your text and to remove text and characters that aren’t relevant. After performing these steps, you'll be left with a nice ‘clean’ text dataset that is ready to be analysed.

Some TDM methods require that extra context be added to your corpus before analysis can be undertaken. Pre-processing techniques, such as parts of speech tagging and named entity recognition, can enable these analyses by categorising and assigning meaning to elements in your text.

Cleaning and pre-processing methods are sometimes built into the interface of the TDM tool that you are using, so it is worth checking what your tool already offers.

Alternatively, you may need to do some programming to appropriately prepare your corpus for your analyses. Tutorials from the Programming Historian are a great way to get started with learning programming to perform cleaning and TDM analyses.

 

Cleaning and other pre-processing techniques

Tokenisation

Many TDM methods are based on counting words or short phrases. However, a computer doesn't know what words or phrases are – to it, the texts in your corpus are just long strings of characters. You need to tell the computer how to split the text up into meaningful segments that it can count and perform calculations on. These segments are called tokens, and the process of splitting your text is called tokenisation.

It’s common to split your text up into individual words as tokens, but other kinds of tokenisation can be useful too. For instance, if you were interested in specific phrases such as 'artificial intelligence' or 'White Australia Policy', or if you were investigating how some words tend to be used together, then you might split your text up into two- or three-word units. If you wanted to analyse sentence structures and features, then you might start by tokenising your text into individual sentences. Often you may tokenise your text in several different ways to enable different analyses.

For languages that don’t separate words in their writing, such as Chinese, Thai or Vietnamese, tokenisation will require more thought to identify how the text will need to be split to enable the desired analysis.

Example: ‘The cat sat on a mat. Then the cat saw a rat.’
This text could be tokenised by:
 
Word (sometimes called a unigram):
    The
    cat
    sat
    on
    a
    mat.
    Then
    the
    cat
    saw
    a
    rat.
Two word phrase (often called bigrams or 2-grams):
    The cat
    cat sat
    sat on
    on a
    a mat.
    mat. Then
    Then the
    the cat
    cat saw
    saw a
    a rat.
Three word phrase (often called trigrams or 3-grams):
    The cat sat
    cat sat on
    sat on a
    on a mat.
    a mat. Then
    mat. Then the
    Then the cat
    the cat saw
    cat saw a
    saw a rat.
Sentence:
    The cat sat on a mat.
    Then the cat saw a rat.

 

Potential pitfalls: Splitting up words can change meaning or cause things to be grouped incorrectly in cases where multiple words are used to indicate a single thing. For example, ‘southeast’ vs ‘south east’ vs ‘south-east’, or place names like 'New South Wales’ or ‘Los Angeles’, or multi-word concepts like ‘global warming’ and ‘social distancing’. Using both phrase tokenisation, such as bigrams and trigrams, as well as single words can help to mitigate this issue.

Sentence-level tokenisation can be complicated by the fact that full stops are used in contexts other than the end of sentences; for example, ‘Ms.’, ‘etc.’, ‘e.g.’ and ‘.com.au’. Using a list of abbreviations that may contain full stops can help you identify these cases and improve your tokenisation.
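If you are doing your own programming, tokenisation is usually done with an existing library rather than written from scratch. Below is a minimal sketch (assuming Python with the NLTK package and its ‘punkt’ tokeniser data installed) that tokenises the example sentence in each of the ways above; note that NLTK treats the full stop as its own token, so the output differs slightly from the hand-split lists.

from nltk.tokenize import word_tokenize, sent_tokenize  # requires: nltk.download('punkt')
from nltk.util import ngrams

text = "The cat sat on a mat. Then the cat saw a rat."

words = word_tokenize(text)          # unigrams; punctuation becomes its own token
bigrams = list(ngrams(words, 2))     # two-word phrases
trigrams = list(ngrams(words, 3))    # three-word phrases
sentences = sent_tokenize(text)      # sentence tokens

print(words)      # ['The', 'cat', 'sat', 'on', 'a', 'mat', '.', 'Then', ...]
print(sentences)  # ['The cat sat on a mat.', 'Then the cat saw a rat.']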

Converting your text to lower case

Computers will often treat capitalised versions of words as being different to their lowercase counterparts, which can cause problems during analysis. Making all text lowercase can solve this problem.

Example:  How many times is ‘cheetah’ used within my documents?
With capitals included:
- cheetah = 7
- Cheetah = 2
However, if we make everything lowercase, we get the answer we are interested in right away:
- cheetah = 9
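A minimal sketch of this in Python (using an illustrative list of tokens rather than a real corpus):

from collections import Counter

tokens = ["Cheetah", "speed", "cheetah", "habitat", "Cheetah", "cheetah"]

raw_counts = Counter(tokens)                       # 'Cheetah' and 'cheetah' are counted separately
lower_counts = Counter(t.lower() for t in tokens)  # all variants are grouped as 'cheetah'

print(raw_counts["cheetah"], raw_counts["Cheetah"])  # 2 2
print(lower_counts["cheetah"])                       # 4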

 

Potential pitfalls: Sometimes capital letters allow us to distinguish between things that are different. For example, if your documents referred to a person named ‘Rose’ and also to the flower called a ‘rose’, then converting the name to all lowercase would result in these two different things being grouped together.

Other pre-processing techniques, such as named entity recognition, can help avoid this pitfall. Consider the content of your documents to decide if converting to lowercase will be useful, or if you need to do other pre-processing first.

Word replacement

Variations in spelling can cause problems in text analysis as the computer will treat the different spellings as being different words, rather than as referring to the same thing. Solve this by choosing a single spelling and replacing any other variants in your text with that version.

For a large corpus, you would typically tokenise the text into words first and then standardise the spelling of the resulting tokens. Alternatively, you can use tools such as VARD to do the work for you.

Example:   Uncorrected text contains: paediatric, pediatric, and pædiatric
Choose a variant to standardise to – replace all variants with: paediatric
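A minimal sketch of this kind of replacement in Python (the variant list and tokens here are purely illustrative):

# Map each known variant to the spelling you have chosen to standardise to.
spelling_map = {"pediatric": "paediatric", "pædiatric": "paediatric"}

tokens = ["the", "pediatric", "ward", "and", "pædiatric", "clinic"]
standardised = [spelling_map.get(token, token) for token in tokens]

print(standardised)  # ['the', 'paediatric', 'ward', 'and', 'paediatric', 'clinic']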

 

Potential pitfalls: As long as you have identified all of the spelling variations of interest, this method is unlikely to have unintended consequences. If, however, you are specifically looking at the use of different spellings or how spellings of a word change over time, using such a method would not be helpful.

Punctuation and non-alphanumeric character removal

Punctuation or special characters can add clutter to your data and make analysing the text more difficult. Errors in OCR can also result in unusual non-alphanumeric characters being mistakenly added to your text. Identifying characters in your text that are neither letters nor numbers, and then removing them, is a simple way of clearing out this clutter.

Example:  If uncorrected text contains: ‘coastline’ and ‘coastline;’, they will be identified as different words.
Removing the punctuation will correctly identify them as the same word.
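A minimal sketch of this in Python, using a regular expression that keeps letters, digits, underscores and whitespace and removes everything else (the sample text is illustrative):

import re

text = "The 'coastline' stretched for 100 km; the coastline was rugged!"
cleaned = re.sub(r"[^\w\s]", "", text)

print(cleaned)  # The coastline stretched for 100 km the coastline was rugged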

 

Potential pitfalls: If you are interested in how certain punctuation or special characters are used, then blanket deleting all non-alphanumeric characters will remove information that you’re interested in. This will also be the case where you are using corpora with mixed languages or texts where punctuation is important (e.g. names or words in French).

You will need to take a more targeted approach to any non-alphanumeric character removal. Also consider if you need punctuation to help you do other pre-processing, for example, tokenisation by sentences.

Stopwords

There are a lot of commonly used words, such as ‘the’, ‘is’, ‘that’, ‘a’, etc., that would completely dominate an analysis but don’t offer much insight into the text in your documents. These words, which we want to filter out before analysing our text, are called ‘stopwords’. There are many existing stopword lists that you can use to remove common words for numerous languages. If there are specific words that are common in your documents but aren’t relevant to your analysis, you can customise existing stopword lists by adding your own words to them.
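A minimal sketch of stopword removal in Python (assuming NLTK and its ‘stopwords’ data are installed; the added word ‘cat’ is just an illustrative custom stopword):

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

stop_words = set(stopwords.words("english"))
stop_words.add("cat")  # add your own corpus-specific words to the list

tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['sat', 'mat']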

Potential pitfalls: When using a stopword list, particularly one created by someone else, it’s important to look it over before using it to make sure that it doesn’t contain any words that you are interested in or words that you will need to undertake the TDM method that you've chosen.

Parts of speech tagging

Parts of speech tagging is used to provide context to text. Text is often referred to as ‘unstructured data’ – data that has no defined structure or pattern. For a computer, text is just a long string of characters that doesn’t mean anything. However, we can run analyses that look at the context in which words or tokens are used in order to categorise them in certain ways.

In parts of speech tagging, all the words in your text get categorised as belonging to different word classes, such as nouns, verbs, adjectives, prepositions, determiners, etc. Having this extra information attached to words enables further processing and analyses, such as lemmatisation, sentiment analysis, or any analysis where you wish to look closer at a specific class of words.

Parts of speech taggers are pieces of software that have ‘learned’ how to classify words by being trained on text that has been manually classified by humans. There are different ways that tagging software can work, but usually a tagger will look at probabilities, such as how a word has most often been tagged previously. For example, ‘dogs’ can be both a noun and a verb, but the noun form is much more common.

A tagger can also look at how tags tend to be ordered. For example, given the phrase ‘this erudite scholar’, if the tagger is unfamiliar with the word ‘erudite’, it can look at the words around it to identify that a word with a determiner in front of it and a noun following it is likely to be an adjective.

Example: They refuse to permit us to obtain the refuse permit
Parts of speech tagged: They (pronoun) refuse (verb) to (to) permit (verb) us (pronoun) to (to) obtain (verb) the (determiner) refuse (noun) permit (noun)
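A minimal sketch of tagging this sentence in Python (assuming NLTK with its ‘punkt’ and ‘averaged_perceptron_tagger’ data installed); NLTK reports Penn Treebank tag codes rather than the plain-English labels above:

import nltk

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
#  ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]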

 

Potential pitfalls: If the parts of speech tagging software that you use was trained on text that is very different to the text in your corpus, it might struggle to correctly classify a significant number of the words in your text. For example, if the tagger was trained on modern newspapers, it might have a hard time tagging social media posts or 18th century novels.

Also, it's worth noting that sometimes a phrase is just written in an ambiguous or unclear way. For example, in the phrase ‘the duchess was entertaining last night’, ‘entertaining’ could be a verb (the duchess threw a party last night) or an adjective (the duchess was a delightful and amusing companion last night). 

Named entity recognition

Like parts of speech tagging, this method is used to provide context and structure to text. 

Named entity recognition (NER) is a process where software analyses text to locate things within the text that a human would recognise as a distinct entity. These entities are then classified into categories, such as person, location, organisation, nationality, time, date, etc.

Some named entity recognisers, such as spaCy, have a set of predefined categories that they have been trained to identify. Others, such as Stanford NER, allow you to define your own categories. Defining your own categories means you will need to train the recogniser to identify the entities that you are interested in. To do this, you will need to manually classify many documents, a time-consuming and laborious process.

Having entities tagged within your text allows you to ask questions about the texts in more natural ways. For example, if you wanted to know all the people mentioned in your text, your computer couldn’t tell you that information until you had performed NER, as it doesn’t know what a person is.

After the entities in your text have been classified, it’s very easy for the computer to list all the entities with a Person tag for you. There are many different questions you can start to explore using named entity recognition, such as ‘which people are mentioned in the same documents as each other?’, ‘what places are important in my text?’, or ‘do the entities mentioned in the text change across different time periods?’.

Example: Apple is an American tech company whose headquarters are located in Cupertino, California. It was founded by Steve Jobs and Steve Wozniak in April 1976.
NER tagged text: [Apple (organisation)] is an [American (nationality)] tech company whose headquarters are located in [Cupertino, California (geopolitical entity)]. It was founded by [Steve Jobs (person)] and [Steve Wozniak (person)] in [April 1976 (date)].
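A minimal sketch of this in Python (assuming spaCy and its small English model ‘en_core_web_sm’ are installed); the exact labels and spans can vary between model versions:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is an American tech company whose headquarters are located in "
          "Cupertino, California. It was founded by Steve Jobs and Steve Wozniak in April 1976.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG, American NORP, Cupertino GPE, California GPE,
#      Steve Jobs PERSON, Steve Wozniak PERSON, April 1976 DATE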

 

Potential pitfalls: As with parts of speech tagging, it’s good to ensure that the NER tool you use has been trained on text that is similar to the kind of text you are working with. For instance, an NER tagger trained on American terms and locations may mislabel ‘Fairfax’ as a geographic location if we run it over newspapers published by the Australian media company Fairfax.

A tagger is never going to get everything right, so you will likely end up with some missed or misclassified entities.

Stemming and lemmatisation

There may be some cases where it would be helpful for your analysis for different words with the same root to be recognised as being the same. For instance, ‘swim’, ‘swims’, ‘swimming’, ‘swam’ and ‘swum’, would normally all be treated as different words by your computer, but you might want them to all be recognised as forms of ‘swim’.

Stemming and lemmatisation are two different methods that attempt to reduce words down to a core root so that they can be grouped in this way. Particularly when getting started with text mining, you are unlikely to need to use these more complex standardisation methods, but it’s good to be aware of their existence.

 

Stemming

In stemming, a set of general rules is used to identify the end bits of words that can be chopped off to leave the core root of the word. The resulting ‘stem’ may or may not be a real word. Several different stemming algorithms exist, such as the Snowball or Lancaster stemmers. These will produce different results, so look at the rules they execute and trial them on your data to decide which will suit your needs. Implementing a stemming algorithm will require you to undertake some programming.

Example:
replaces => replac
replaced => replac
replacement => replac

So, all three words, which would have been counted separately, are now grouped together through a single word stem, ‘replac’, although this stem itself isn’t a valid English word.
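A minimal sketch of this in Python (assuming NLTK is installed), using the Snowball stemmer:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["replaces", "replaced", "replacement"]:
    print(word, "=>", stemmer.stem(word))
# replaces => replac
# replaced => replac
# replacement => replac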

 

Potential pitfalls: Languages are irregular and complex, so any stemming algorithm won’t do exactly what you want it to 100% of the time. Overstemming occurs when the algorithm cuts off too much of the word, so that it either loses its meaning or ends up with the same stem as other, unrelated words. Understemming is when not enough is cut off and two words that humans would recognise as being related are not given the same stem. No algorithm will be perfect, so you will have to test and decide whether any of them do a good enough job to achieve your purpose.

 

Lemmatisation

Lemmatisation involves the computer analysing words to give you the dictionary root form of each word. This analysis requires the lemmatiser to understand the context in which the word is used, so before you can lemmatise your text you will need to pre-process it with parts of speech tagging. There are several different lemmatisers that you could choose to use; however, all of them will require you to undertake some programming.

Example:
‘am’, ‘are’, ‘is’, ‘was’, ‘were’ would all be lemmatised to ‘be’
'rose’ (noun) would be lemmatised to ‘rose’
'rose’ (verb) would be lemmatised to ‘rise’
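A minimal sketch of this in Python (assuming NLTK and its WordNet data are installed); here the part of speech is supplied by hand, whereas in practice it would come from a tagger:

from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

lemmatiser = WordNetLemmatizer()

print(lemmatiser.lemmatize("was", pos="v"))   # be
print(lemmatiser.lemmatize("rose", pos="n"))  # rose
print(lemmatiser.lemmatize("rose", pos="v"))  # rise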

 

Potential pitfalls: As in stemming, precision and nuances can be lost during lemmatisation. For example, ‘operating’ can have quite different meanings from the verb form through to compound noun forms, such as ‘operating theatre’ or ‘operating system’. If the lemmatiser that you use reduces all of these to ‘operate’, then you can end up grouping things together that are actually separate concepts or things. Lemmatisation is also a slower process than stemming, as more analysis is involved.

Library support

Support services

The Library provides a number of services to help you get started with text and data mining. The Library also provides support and information for systematic reviews.

Consultations

Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!
