Text and data mining: 3. Licensing, copyright and ethics

An introduction to text and data mining concepts and an overview of the steps involved in undertaking text and data mining as part of a research project.

Licensing, copyright and ethics

From the outset, ensure that your text and data mining activities, and the subsequent publication of your research, comply with all applicable licensing terms and conditions, copyright law and ethical requirements.

Licensing

Each data provider has its own standards and procedures that you must follow to use its data legally. For example, many data providers license their data to be mined for research purposes only, and either prohibit data mining with potential commercial applications or require special negotiation before allowing it.

If you have any questions about licensing conditions or negotiating permission for potential commercial applications of data mining with data providers, please contact library.digitalcollections@sydney.edu.au.

Copyright

The large datasets used in text and data mining are often sourced from pre-existing research outputs, original creative works, or proprietary data owned by commercial enterprises. This means that text and data mining may require you to access, copy and process material that is protected by copyright.

If you have any questions or need guidance on complying with copyright during data mining activities, please contact the Library’s Copyright Services team.

Ethics

Text and data mining sometimes involves the collation and linkage of separate datasets; you should take care to seek appropriate ethics approvals and conduct privacy impact assessments before commencing.

Even if all the original datasets contain de-identified data, data linkage and data mining can sometimes have the unforeseen consequence of enabling re-identification of de-identified data.
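
To illustrate how linkage can undo de-identification, here is a minimal Python sketch. It assumes two hypothetical de-identified datasets that happen to share the fields postcode and birth_year; the records, field names and the function unique_matches are invented for illustration and do not come from any real dataset or library. A record whose combination of shared fields is unique in both datasets can be joined one-to-one, which ties together attributes that neither dataset revealed on its own.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def key_of(record: Record, fields: Sequence[str]) -> Tuple[str, ...]:
    """Return the values of the shared fields as a hashable key."""
    return tuple(record[f] for f in fields)


def unique_matches(left: List[Record], right: List[Record],
                   fields: Sequence[str]) -> List[Tuple[Record, Record]]:
    """Pair up records whose shared-field values are unique in BOTH datasets."""
    left_counts = Counter(key_of(r, fields) for r in left)
    right_counts = Counter(key_of(r, fields) for r in right)
    right_by_key = {key_of(r, fields): r for r in right}
    matches = []
    for rec in left:
        k = key_of(rec, fields)
        # A count of 1 on both sides means the link is unambiguous.
        if left_counts[k] == 1 and right_counts[k] == 1:
            matches.append((rec, right_by_key[k]))
    return matches


# Made-up records: neither dataset contains names or other direct identifiers.
health = [
    {"postcode": "2000", "birth_year": "1980", "diagnosis": "A"},
    {"postcode": "2000", "birth_year": "1980", "diagnosis": "B"},
    {"postcode": "2006", "birth_year": "1975", "diagnosis": "C"},
]
survey = [
    {"postcode": "2000", "birth_year": "1980", "occupation": "X"},
    {"postcode": "2006", "birth_year": "1975", "occupation": "Y"},
]

linked = unique_matches(health, survey, ["postcode", "birth_year"])
for health_rec, survey_rec in linked:
    print(health_rec["diagnosis"], "<->", survey_rec["occupation"])
```

A check of this kind, run before linking datasets, is one possible input to a privacy impact assessment. It is not a substitute for ethics review.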

Online mining etiquette

Even if the licence permits it, some approaches to text and data mining are considered poor etiquette due to the inconvenience they can cause to data providers.

For example, bulk scraping of a data provider's website to extract information can place a significant burden on the data provider's servers. Similarly, when using an API to automate accessing and downloading content, you should use rate limiting to control the number of requests you send to the data provider's servers over a given time period. Failing to rate-limit your automated requests can cause slow response times or even downtime for other users.
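
Rate limiting can be as simple as enforcing a minimum pause between consecutive requests. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the class RateLimiter, the function fetch_politely and the rate of two requests per second are all illustrative, and the fetch argument stands in for whatever API call you actually make. Use the limit stated in the provider's documentation or licence where one is given.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until min_interval has elapsed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_politely(items: Iterable[T], fetch: Callable[[T], R],
                   limiter: RateLimiter) -> List[R]:
    """Call fetch(item) for each item, pausing between calls as required."""
    results = []
    for item in items:
        limiter.wait()
        results.append(fetch(item))
    return results


if __name__ == "__main__":
    # Hypothetical usage: replace the lambda with a real API request.
    limiter = RateLimiter(max_per_second=2.0)
    print(fetch_politely(["a", "b", "c"], lambda doc_id: doc_id.upper(), limiter))
```

The limiter only spaces out requests from a single script. If you run several scripts in parallel, their combined request rate is what the provider's servers experience.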

Best practice is to check the requirements of the data provider and comply with their preferences regarding data mining activities.
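
One machine-readable place where a provider may publish crawling preferences is a robots.txt file, and Python's standard library can read it. The sketch below is an assumption-laden illustration: the file contents, the user agent name example-research-bot and the data.example.org URLs are all hypothetical, and the rules are parsed from a string so that the example runs without network access. A robots.txt file does not replace the provider's licence or terms of use, which remain the authoritative statement of what is permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

AGENT = "example-research-bot"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)

# Is this user agent allowed to fetch each URL?
print(rules.can_fetch(AGENT, "https://data.example.org/articles/1"))
print(rules.can_fetch(AGENT, "https://data.example.org/private/1"))

# Requested pause between requests in seconds, or None if not specified.
print(rules.crawl_delay(AGENT))
```

To read a live file instead, RobotFileParser also provides set_url() and read(), which download and parse the robots.txt at a given address.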

Library support

Support services

The Library provides a number of services to help you get started with text and data mining. The Library also provides support and information for systematic reviews.

Consultations

Want to talk to someone about how to start your own text mining project? Chat to your Academic Liaison Librarian or email us for help!

Aboriginal and Torres Strait Islander peoples are advised that this website may contain images, voices and names of people who have died.

The University of Sydney Library acknowledges that its facilities sit on the ancestral lands of Aboriginal and Torres Strait Islander peoples, who have for thousands of generations exchanged knowledge for the benefit of all.