Automating the construction of higher order data representations from heterogeneous biodiversity datasets

Nicolson, Nicky

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/19620

Title:	Automating the construction of higher order data representations from heterogeneous biodiversity datasets
Authors:	Nicolson, Nicky
Advisors:	Tucker, A
Keywords:	Machine learning;Biodiversity informatics;Specimen digitisation;Clustering;Record linkage
Issue Date:	2019
Publisher:	Brunel University London
Abstract:	Datasets created from large-scale specimen digitisation drive biodiversity research, but these are often heterogeneous: incomplete and fragmented. As aggregated data volumes increase, there have been calls to develop a “biodiversity knowledge graph” to better interconnect the data and support meta-analysis, particularly relating to the process of species description. This work maps data concepts and inter-relationships, and aims to develop automated approaches to detect the entities required to support these kinds of meta-analyses. An example is given using trends analysis on name publication events and their authors, which shows that despite implementation and widespread adoption of major changes to the process by which authors can publish new scientific names for plants, the data show no difference in the rates of publication. A novel data-mining process based on unsupervised learning is described, which detects specimen collectors and events preparatory to species description, allowing a larger set of data to be used in trends analysis. Record linkage techniques are applied to these two datasets to integrate data on authors and collectors to create a generalised agent entity, assessing specialisation and classifying working practices into separate categories. Recognising the role of agents (collectors, authors) in the processes (collection, publication) contributing to the recognition of new species, it is shown that features derived from data-mined aggregations can be used to build a classification model to predict which agent-initiated units of work are particularly valuable for species discovery. Finally, shared collector entities are used to integrate distributed specimen products of a single collection event across institutional boundaries, maximising impact of expert annotations. An inferred network of relationships between institutions based on specimen sharing relationships allows community analysis and the definition of optimal co-working relationships for efficient specimen digitisation and curation.
Description:	This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London
URI:	http://bura.brunel.ac.uk/handle/2438/19620
Appears in Collections:	Computer Science; Dept of Computer Science Theses

Files in This Item:

File	Description	Size	Format
FulltextThesis.pdf		42.13 MB	Adobe PDF
