Spherical k-means clustering is good for interpreting multivariate species occurrence data
Hill, Mark O.; Harrower, Colin A. (ORCID: https://orcid.org/0000-0001-5070-5293); Preston, Christopher D. 2013 Spherical k-means clustering is good for interpreting multivariate species occurrence data. Methods in Ecology and Evolution, 4 (6). 542-551. 10.1111/2041-210X.12038
Text: N500812PP.pdf - Accepted Version (818kB)
Abstract/Summary
1. Clustering multivariate species data can be an effective way of showing groups of species or samples with similar characteristics. Most current techniques classify the samples first and then the species. A disadvantage of classifying the samples first is that relatively subtle differences between occurrence profiles of species can be obscured.
2. The k-means method of clustering minimizes the sum of squared distances between cluster centres and cluster members. If the entities to be clustered are projected on the unit sphere, then a natural measure of dispersion is the sum of squared chord distances separating the entities from their cluster centres; k-means clustering with this measure of dispersion is called spherical k-means (SKM). We also consider a variant in which the sum of squared perpendicular distances to a central ray is minimized.
3. Unweighted SKM is liable to produce clusters of very rare species. This feature can be avoided if each point on the unit sphere is weighted by the length of the ray that produced it. The standard SKM algorithm converges to very numerous local optima. To avoid this problem, we have developed a computationally intensive algorithm that uses multiple randomizations to select high-quality seed species.
4. The species clustering can be used to define simplified attributes for the samples. If the samples are then classified using the same technique, the resulting matrix of clustered species and clustered samples provides a biclustering of the data. The strength of the relationship between clusters can be measured by their mutual information, which is effectively the entropy of the biclustering.
5. The technique was tested on five ecological and biogeographical datasets ranging in size from 30 species in 20 samples to 1405 species in 3857 samples. Several variants of SKM were compared, together with results from the established program Twinspan. When judged by entropy, SKM always performed adequately and produced the best clustering in all datasets but the smallest.
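Points 2 to 4 of the abstract can be illustrated with a minimal Python sketch. It is not the authors' program. The function names, parameters and the use of NumPy are assumptions made for illustration. Plain random restarts stand in for the paper's randomization-based seed selection, which the abstract does not specify. The mutual information is computed from occurrence totals within each block of the biclustering, which is one plausible reading of point 4.

```python
import numpy as np


def spherical_kmeans(X, k, weighted=True, n_restarts=20, n_iter=100, seed=0):
    """Cluster the rows of X on the unit sphere.

    Returns (labels, centres, dispersion), where dispersion is the
    (optionally weighted) sum of squared chord distances to the centres.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lengths = np.linalg.norm(X, axis=1)
    if np.any(lengths == 0):
        raise ValueError("rows with no occurrences cannot be projected")
    U = X / lengths[:, None]                      # points on the unit sphere
    # Assumed weighting: each point weighted by the length of its ray.
    w = lengths if weighted else np.ones(len(X))
    best = None
    for _ in range(n_restarts):                   # stand-in for seed selection
        centres = U[rng.choice(len(U), size=k, replace=False)].copy()
        labels = None
        for _ in range(n_iter):
            # Smallest chord distance is the same as largest cosine.
            new_labels = np.argmax(U @ centres.T, axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            for j in range(k):
                members = labels == j
                if members.any():
                    resultant = (w[members, None] * U[members]).sum(axis=0)
                    norm = np.linalg.norm(resultant)
                    if norm > 0:
                        centres[j] = resultant / norm
        # Squared chord distance between unit vectors: |u - c|^2 = 2 - 2 u.c
        cos = np.einsum("ij,ij->i", U, centres[labels])
        dispersion = float(np.sum(w * (2.0 - 2.0 * cos)))
        if best is None or dispersion < best[2]:
            best = (labels.copy(), centres.copy(), dispersion)
    return best


def bicluster_mutual_information(X, row_labels, col_labels):
    """Mutual information (in nats) between row clusters and column clusters."""
    X = np.asarray(X, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    counts = np.zeros((row_labels.max() + 1, col_labels.max() + 1))
    # Sum the occurrences falling in each (row cluster, column cluster) block.
    np.add.at(counts, (row_labels[:, None], col_labels[None, :]), X)
    P = counts / counts.sum()
    expected = P.sum(axis=1, keepdims=True) @ P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / expected[mask])))


# Toy data (hypothetical): six species in four samples, two obvious groups.
toy = np.array([[1, 1, 0, 0], [2, 2, 0, 0], [1, 2, 0, 0],
                [0, 0, 1, 1], [0, 0, 3, 3], [0, 0, 1, 2]])
species_labels, species_centres, _ = spherical_kmeans(toy, k=2)
sample_labels, _, _ = spherical_kmeans(toy.T, k=2)
toy_mi = bicluster_mutual_information(toy, species_labels, sample_labels)
```

Clustering the transposed matrix with the same function mirrors the abstract's step of classifying the samples "using the same technique", although the paper first simplifies the sample attributes using the species clusters, a step this sketch omits.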
Item Type: | Publication - Article |
---|---|
Digital Object Identifier (DOI): | 10.1111/2041-210X.12038 |
Programmes: | CEH Topics & Objectives 2009 - 2012 > Biodiversity > BD Topic 1 - Observations, Patterns, and Predictions for Biodiversity > BD - 1.2 - Data collection systems to record and assess changes ... |
UKCEH and CEH Sections/Science Areas: | UKCEH Fellows Pywell |
ISSN: | 2041-210X |
Additional Keywords: | biclustering, biogeography, co-clustering, entropy, mutual information, R-mode, Twinspan |
NORA Subject Terms: | Mathematics Botany Data and Information |
Date made live: | 04 Apr 2013 10:30 +0 (UTC) |
URI: | https://nora.nerc.ac.uk/id/eprint/500812 |