State tagging for improved Earth and environmental data quality assurance

Tso, Chak-Hau Michael; Henrys, Peter; Rennie, Susannah; Watkins, John. 2020 State tagging for improved Earth and environmental data quality assurance. Frontiers in Environmental Science, 8, 46. 14 pp.

N527884JA.pdf - Published Version
Available under License Creative Commons Attribution 4.0.


Environmental data allow us to monitor the constantly changing environment that we live in. They allow us to study trends and help us to develop better models of the processes in our environment, and these models, in turn, can provide information to improve management practices. To ensure that the data are reliable for analysis and interpretation, they must undergo quality assurance procedures. Such procedures generally include standard operating procedures during sampling and laboratory measurement (if applicable), as well as data validation upon entry to databases. The latter usually involves compliance (i.e., format) and conformity (i.e., value) checks, most often in the form of single-parameter range tests. Such tests take no account of the system state in which each measurement is made, and give the user little contextual information on the probable cause when a measurement is flagged as out of range. We propose the use of data science techniques to tag each measurement with an identified system state. The term “state” is defined loosely here; states are identified using k-means clustering, an unsupervised machine learning method, and their meaning is open to specialist interpretation. Once the states are identified, state-dependent prediction intervals can be calculated for each observational variable. This approach gives the user more contextual information to resolve out-of-range flags and yields prediction intervals for observational variables that take changes in system state into account. Users can then apply further analysis and filtering as they see fit. We illustrate our approach with two well-established long-term monitoring datasets in the UK: moth and butterfly data from the UK Environmental Change Network (ECN), and the UK CEH Cumbrian Lakes monitoring scheme. Our work contributes to the ongoing development of a better data science framework that allows researchers and other stakeholders to find and use the data they need more readily.
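The workflow described in the abstract (cluster the measurements into states, then compute an interval per state) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the function names `kmeans`, `state_intervals` and `tag` are hypothetical, and the interval is assumed to be a simple normal approximation (mean ± 1.96 standard deviations), which may differ from the prediction intervals used in the paper.

```python
import random
import statistics


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: tag each point with the index of its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(statistics.fmean(col) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


def state_intervals(values, labels, z=1.96):
    """Per-state interval mean +/- z * sd (normal approximation, an assumption)."""
    intervals = {}
    for state in set(labels):
        vals = [v for v, lab in zip(values, labels) if lab == state]
        mu = statistics.fmean(vals)
        sd = statistics.stdev(vals) if len(vals) > 1 else 0.0
        intervals[state] = (mu - z * sd, mu + z * sd)
    return intervals


def tag(value, state, intervals):
    """Return (value, state, inside_state_interval) for one measurement."""
    lo, hi = intervals[state]
    return value, state, lo <= value <= hi


# Synthetic example (made-up numbers): two regimes of driver variables
# (temperature, day length) and an observed count that depends on the regime.
rng = random.Random(42)
drivers = ([(rng.gauss(5, 1), rng.gauss(8, 0.5)) for _ in range(50)]
           + [(rng.gauss(20, 1), rng.gauss(16, 0.5)) for _ in range(50)])
counts = ([rng.gauss(10, 2) for _ in range(50)]
          + [rng.gauss(100, 10) for _ in range(50)])

labels, centroids = kmeans(drivers, k=2)
intervals = state_intervals(counts, labels)

# A count of 60 lies inside the global observed range, so a single-parameter
# range test would accept it, yet it falls outside the interval of the state
# that the first (cold-regime) measurement belongs to.
suspect = tag(60.0, labels[0], intervals)
print(intervals)
print(suspect)
```

The point of the sketch is the last step: a single global range test says nothing about a value of 60, whereas the state tag shows that it is unusual for the conditions under which it was recorded. What each state means, and how many states to use, remains a matter for specialist interpretation.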

Item Type: Publication - Article
Digital Object Identifier (DOI):
UKCEH and CEH Sections/Science Areas: Pollution (Science Area 2017-)
Soils and Land Use (Science Area 2017-)
ISSN: 2296-665X
Additional Information: Open Access paper - full text available via Official URL link.
Additional Keywords: data science, quality assurance, data analytics, environmental monitoring, environmental informatics, clustering (unsupervised) algorithms
NORA Subject Terms: Data and Information
Date made live: 05 Jun 2020 14:42 +0 (UTC)
