Digital Humanities: Text mining & corpus linguistics

An introduction to digital humanities

What is text mining?

Text mining involves the use of computational tools and techniques to automatically discover new and unexpected information from an aggregated body of machine-readable text or data. A text mining project starts from a research question and requires preparing data to suit it: collating the data or text corpus, becoming familiar with the data and cleaning it, formatting it, and selecting an analysis method. Text mining is a generic term for computationally analysing a large body of text and can involve a variety of different analysis techniques.

Corpus linguistics is the study of language contained in bodies of texts. Corpus linguists use specialised computer software to analyse naturally occurring language in computerised text collections known as corpora. Computational stylistics and stylometry are concerned with the study of features of style that can be measured, identifying patterns of language usage in text.

Practices around text mining

Pre-processing practices:

  • Optical character recognition (OCR)
    • OCR is the conversion of typed or printed text in scanned documents or images into machine-readable, electronic text. OCR is usually an important step when preparing a document corpus for analysis in text and data mining.
  • Tokenisation
    • Tokenisation is the process of splitting text into individual words (tokens) or sequences of adjacent words; a sequence of n adjacent words is known as an n-gram.
  • Stemming/lemmatisation
    • Stemming reduces individual words to a root string, usually by stripping endings. For example, the word "running" might be stemmed to the root "run".
    • Lemmatisation groups the inflected forms of a word so that they can be analysed as a single term. For example, "runs" and "ran" are both associated with the word "run".
  • Entity extraction
    • Also known as named entity recognition, this is a text mining technique where an algorithm identifies key terms (entities) in a text and classifies them into pre-defined categories (e.g. locations, people, organisations). This technique transforms unstructured data into structured data that can be understood by machines for analysis.
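
To make the tokenisation, stemming and lemmatisation steps above more concrete, here is a minimal Python sketch that uses only the standard library. The function names, the suffix list and the tiny lemma table are all illustrative assumptions, not taken from any of the tools listed in this guide. Real projects usually rely on an established stemmer (such as the Porter algorithm) and a dictionary-backed lemmatiser.

```python
import re


def tokenise(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def crude_stem(word):
    """Strip a few common endings. Illustrative only, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. "runn" becomes "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lookup table: lemmatisation needs dictionary knowledge,
# because irregular forms such as "ran" cannot be found by stripping endings.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}


def lemmatise(word):
    """Return the dictionary form of a word if it is in the toy table."""
    return LEMMAS.get(word, word)


tokens = tokenise("The cat sat on the mat")
print(tokens)                 # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngrams(tokens, 2)[:2])  # [('the', 'cat'), ('cat', 'sat')]
print(crude_stem("running"))  # run
print(crude_stem("ran"))      # ran (stemming misses irregular forms)
print(lemmatise("ran"))       # run
```

The last two lines show the practical difference between the two approaches: the stemmer leaves "ran" unchanged, while the lemmatiser maps it to "run" because it consults a word list.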

Text mining practices:

  • Topic modeling 
    • Topic modeling is a method used to discover abstract topics that occur in a corpus. It is frequently used for distant reading of texts to get a "feel" for the literature, as well as to discover new and interesting topics that may not be apparent from reading the texts individually.
  • Sentiment analysis 
    • Sentiment analysis is a technique where an algorithm identifies affect words and assigns each a score according to whether the word carries negative or positive sentiment. The scores are then combined across the document to produce a sentiment score, which can help identify whether the overall sentiment of the document is positive or negative.
  • Term frequency
    • Term frequency counts how many times a word appears in a text. It is often combined with inverse document frequency (TF-IDF) to gauge how distinctive the term is for that text compared with the other documents in the corpus.
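
As a rough illustration of the last two practices, the sketch below counts term frequencies, weights them with one common variant of TF-IDF, and computes a simple lexicon-based sentiment score. The function names and the four-word lexicon are hypothetical and chosen only for this example. Text mining libraries differ in how they smooth and normalise TF-IDF, and real sentiment lexicons contain thousands of scored words.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Count how many times each term occurs in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight terms by relative frequency times log(N / document frequency).

    docs is a list of token lists; one {term: weight} dict is returned
    per document. This is one common variant, assumed for illustration.
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "happy": 1, "bad": -1, "sad": -1}


def sentiment_score(tokens, lexicon=LEXICON):
    """Sum the lexicon scores of all tokens; unknown words score 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


docs = [["whale", "sea", "sea"], ["garden", "sea"]]
print(term_frequency(docs[0])["sea"])  # 2
print(tf_idf(docs)[0]["sea"])          # 0.0, because "sea" is in every document
print(sentiment_score(["a", "good", "happy", "sad", "day"]))  # 1
```

A term that appears in every document receives a weight of zero under this variant, which is why TF-IDF tends to surface the words that set one text apart from the rest of the corpus.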

Tools used in text mining


  • TAPoR – Text Analysis Portal for Research
    • A catalogue of tools that can be used to undertake text analysis.
  • Voyant Tools
    • A web-based reading and analysis environment for digital texts.
    • Used to analyse Shakespeare plays: Wilhelm, T., Burghardt, M. & Wolff, C. (2013). "To See or Not to See" - An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In R. Franken-Wendelstorf, E. Lindinger & J. Sieck (Eds.) Kultur und Informatik: Visual Worlds & Interactive Spaces, pp. 175-185. Glückstadt: Verlag Werner Hülsbusch.
  • Data Science Toolkit
    • A collection of open source tools for data analysis, including several text analysis tools.
    • Used to measure political responsiveness with Tweets: Barberá, P., Bonneau, R., Egan, P., Jost, J. T., Nagler, J., & Tucker, J. (2014, August). Leaders or followers? Measuring political responsiveness in the US Congress using social media data. In 110th American Political Science Association Annual Meeting.
  • Stylo R package
    • A suite of stylometric tools provided as a package for use with the analysis software R.
    • Used to analyse the provenance of the unfinished novel The Dark Tower, generally attributed to C. S. Lewis: Oakes, M. P. (2018, September). Computer stylometry of C. S. Lewis’s The Dark Tower and related texts. Digital Scholarship in the Humanities, 33(3), pp. 637–650, doi:10.1093/llc/fqx043
  • TXM
    • Textometry software for textual analysis.
    • Used to annotate Medieval French texts: Stein, A., & Prévost, S. (2013). Syntactic annotation of medieval texts. In P. Bennett, M. Durrell, S. Scheible & R. Whitt (Eds.), New methods in historical corpora, 3, pp. 275-282.
  • AntConc
    • A freeware corpus analysis toolkit for concordancing and text analysis.
    • Used to analyse the Twitter backchannel from a digital humanities conference: Ross, C., Terras, M., Warwick, C., & Welsh, A. (2011). Enabled backchannel: Conference twitter use by digital humanists. Journal of Documentation, 67(2), 214-237. doi:10.1108/00220411111109449

Support available from the Library


Want to talk to someone about how to start your own text mining project? Send us an email!

External Support


Tinker is a Digital Humanities toolbox and directory of tools and research methods. It also includes links to example projects to give you an idea of how digital humanities can help you with your research.