The analysis of grain-size distributions has a long tradition in Quaternary Science and disciplines studying Earth surface and subsurface deposits. The decomposition of multi-modal grain-size distributions into inherent subpopulations, commonly termed end-member modelling analysis (EMMA), is increasingly recognised as a tool to infer the underlying sediment sources, transport and (post-)depositional processes. Most of the existing deterministic EMMA approaches are only able to deliver one out of many possible solutions, thereby shortcutting uncertainty in model parameters. Here, we provide user-friendly computational protocols that support deterministic as well as robust (i.e. explicitly accounting for incomplete knowledge about input parameters in a probabilistic approach) EMMA, in the free and open software framework of R.

In addition, and going beyond previous validation tests, we compare the
performance of available grain-size EMMA algorithms using four real-world
sediment types, covering a wide range of grain-size distribution shapes
(alluvial fan, dune, loess and floodplain deposits). These were randomly
mixed in the lab to produce a synthetic data set. Across all algorithms, the
original data set was modelled with mean

Many studies in Quaternary Science aim to reconstruct past Earth surface dynamics using sedimentary proxies. Earth surface dynamics include a variety of processes that mix process-related components (Buccianti et al., 2006). Sediment from different sources can be transported and deposited by a multitude of sedimentological processes that have been linked to climate, vegetation, geological and geomorphological dynamics (Bartholdy et al., 2007; Folk and Ward, 1957; Macumber et al., 2018; Stuut et al., 2002; Tjallingii et al., 2008; Vandenberghe, 2013; Vandenberghe et al., 2004, 2018). During transport, grain-size subpopulations are affected by different transport energies, and, thus, distinct grain-size distributions are created upon deposition. Accordingly, it is possible to infer source areas, transport pathways and transport processes as well as the related sedimentary environment from measured grain-size distributions. This basic concept has been exploited for more than 60 years (Flemming, 2007; Folk and Ward, 1957; Hartmann, 2007; Visher, 1969). However, the approach is limited when sediments are transported by more than one process and become mixed during and after deposition (Bagnold and Barndorff-Nielsen, 1980; Vandenberghe et al., 2018).

The advent of fast, high-resolution grain-size measurements through
laser diffraction allows the assessment of grain-size distributions of large
sample sets in a short time and reveals the sediment mixing effects in
multiple modes or distinct shoulders in the grain-size distribution curves.
Although widely used, classic measures of bulk distributions such as sand,
silt and clay contents or mean grain size,

To overcome these limitations and to improve process interpretation and attribution of associated drivers from sedimentary archives (Dietze et al., 2014), two ways have been proposed to decompose multi-modal grain-size distributions and to quantify the dominant grain-size subpopulations: parametric and non-parametric approaches. Among the former, commonly used curve fitting approaches describe a sediment sample as a combination of a finite number of parametric distribution functions such as (skewed) log-normal, log-hyperbolic or Weibull distributions (Bagnold and Barndorff-Nielsen, 1980; Gan and Scholz, 2017; Sun et al., 2002). However, curve fitting solutions are non-unique, and subpopulations might not be detected if a fixed number of functions is fitted to individual samples (Paterson and Heslop, 2015; Weltje and Prins, 2003), whereas other approaches such as non- and semi-parametric mixture models (Hunter et al., 2011; Lindsay and Lesperance, 1995) are still very poorly explored in the field of grain-size distribution analyses.

Non-parametric end-member modelling or mixing analysis (EMMA) aims to
describe a whole data set as a combination of discrete subpopulations, based
on eigenspace analysis and compositional data constraints (Aitchison,
1986). A multidimensional grain-size data set

Five approaches of non-parametric EMMA have been proposed: Weltje (1997) has developed a FORTRAN algorithm based on simplex expansion, which has been translated to a set of scripts for R (R Core Team, 2017) called RECA (R-based Endmember Composition Algorithm), including a fuzzy c-means clustering approach (Seidel and Hlawitschka, 2015). Available as MATLAB scripts, the algorithm by Dietze et al. (2012) has included eigenvector rotation, whereas Yu et al. (2015) have introduced a Bayesian EMMA (BEMMA) and Paterson and Heslop (2015) have used approaches from hyperspectral image processing (AnalySize). Based on the MATLAB algorithm by Dietze et al. (2012), Dietze and Dietze (2016) compiled a prototype R package (EMMAgeo v. 0.9.4).

Most EMMA approaches are deterministic (i.e. one single model solution
without any uncertainty estimates) and require the user to set a fixed
number of end-members

Previous studies of EMMA performance (Weltje and Prins, 2007; Seidel and Hlawitschka, 2015; Paterson and Heslop, 2015) either used measured data without information on the true loadings and scores or were based on ideally designed synthetic data. However, natural process end-members can overlap substantially and may have a varying or multi-modal grain-size distribution shape due to unstable transport conditions (e.g. gradual fining of aeolian dust with transport distance) and deposition (e.g. reworking by soil formation; Dietze et al., 2016; Vandenberghe et al., 2018).

Recently, van Hateren et al. (2018) compared the concepts and performances of AnalySize, RECA, BEMMA, EMMAgeo and a diffuse reflectance spectroscopy (DRS) unmixing approach (Heslop et al., 2007). They used numerically mixed real-world grain-size samples and compared the modelled end-member loadings with the real-world distributions and the modelled scores with randomised mixing ratios, as suggested by Schulte et al. (2014). Van Hateren et al. (2018) confirmed earlier studies and highlighted that geological background knowledge is crucial for end-member interpretation, but they also pointed to strong differences in model performance. However, their descriptions are mainly based on verbal comparisons of graphic data representations, and the validation data are not available for future comparative studies.
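The numerical mixing used in such validation tests can be sketched as follows. This is a minimal illustration in Python (standard library only), not the procedure of any cited study: the function name `mix_samples`, the uniform random draw of mixing ratios and the toy end-member distributions are assumptions made for this example. Each synthetic sample is a weighted sum of known end-member distributions, with weights that are non-negative and sum to one.

```python
import random


def mix_samples(end_members, n_samples, seed=0):
    """Numerically mix known end-member distributions with random ratios.

    end_members: list of distributions, each a list of class fractions
                 that sums to 1 (hypothetical input layout).
    Returns (ratios, mixtures), both lists with one entry per sample.
    """
    rng = random.Random(seed)
    n_classes = len(end_members[0])
    ratios, mixtures = [], []
    for _ in range(n_samples):
        # Draw one random weight per end-member and rescale to sum to 1.
        raw = [rng.random() for _ in end_members]
        total = sum(raw)
        weights = [r / total for r in raw]
        # Linear mixing: class-wise weighted sum of the end-members.
        mixture = [
            sum(w * em[j] for w, em in zip(weights, end_members))
            for j in range(n_classes)
        ]
        ratios.append(weights)
        mixtures.append(mixture)
    return ratios, mixtures


# Toy example with two end-members and four grain-size classes.
toy_end_members = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.3, 0.6]]
toy_ratios, toy_mixtures = mix_samples(toy_end_members, n_samples=5, seed=1)
```

Because the mixing ratios are stored alongside the mixtures, modelled scores can later be compared directly with the known ratios.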

Here, we introduce new operational modes and protocols for the comprehensive open-source R package EMMAgeo as a tool for quantifying process-related grain-size subpopulations in mixed sediments. We aim to clarify information provided by the reference documentation of the first version of the package (v. 0.9.4; Dietze and Dietze, 2016) and by Dietze et al. (2014), regarding parameter estimation and optimisation, and we add a new approach to estimating the uncertainty of the end-member scores. We evaluate the performance and validity of EMMAgeo using a real-world grain-size data set with fully known end-member compositions and unbiased quantitative measures. For comparison, the same data set is modelled with other available grain-size end-member algorithms. An evaluation and validation of both process end-member distribution shapes and mixing ratios are provided. Finally, general constraints for the interpretation of end-members are discussed. The detailed Supplement is intended to help users to apply the EMMAgeo protocols and to reproduce the results, making use of the raw data published along with this study.

EMMAgeo in its current version 0.9.6 (Dietze and Dietze, 2019) contains 22 functions (Table S1 in the Supplement), the example data set for this study and full documentation of these items. EMMAgeo provides a systematic chain of data pre-processing, parameter estimation and optimisation, the actual modelling and the inference of model uncertainties.

EMMAgeo is based on the EMMA MATLAB code by Dietze et al. (2012), which was slightly modified, i.e. vectorisation of looped
calculations to increase computation speed, a new coding of the scaling
procedure (Miesch, 1976) and additional measures of model
performance. Following Dietze et al. (2012), the core function

A deterministic and a robust operational EMMA mode can be run by a function
and two protocols, respectively. First, EMMA can be performed with a
user-based decision on all parameters, which is comparable to existing
algorithms. This

The second and third protocol of

Flow chart of the two robust EMMA protocols.

The

The range of the number of end-members

End-member loadings from different model parameter settings tend to cluster
at similar main mode positions, which Dietze et al. (2012, 2014) used to
manually identify robust end-members. To identify these mode clusters within
EMMAgeo (step 7),

With the mean robust loadings, i.e. the unweighted mean of all similarly
likely loadings of step 9, it is possible to optimise the model with respect
to different quality criteria by changing

The

Sediment outcrops of four depositional environments were sampled near the
city of Dresden, Germany (Fig. 2). These represent natural sedimentological
end-members (EM

Three parallel samples (0.3–2.0

The example data set

To run the FORTRAN-based approach by Weltje (1997), provided by Jan-Berend Stuut
(personal communication, 2017), the grain-size classes of

Running the collection of the five RECA R scripts (Seidel and
Hlawitschka, 2015) required manual installation of the additional packages
compositions (Van den Boogaart et al., 2014), e1071
(Meyer et al., 2017) and nnls (Mullen and van Stokkum,
2012), loading all scripts and manual screen input of the model parameters.
RECA needs to be run completely to the end until consequences of parameter
changes can be inspected. The decision on

AnalySize by Paterson and Heslop (2015) provides a MATLAB GUI, in
which

Bayesian EMMA (BEMMA) in MATLAB (Yu et al., 2015) does not allow
a predefined

The performance of all approaches was evaluated in two steps. First, we
compared the original data set

Second, knowing which natural end-members have been mixed to create the
example data set

Figure 3 shows the default graphical output after the EMMA algorithm has
modelled the data set with four end-members. Panels a and b depict

Default graphical output of the R function

Comparison of model performance (total, sample-wise and grain-size
class-wise coefficients of variation (

The scores of EM1 to EM4 accounted for 20 %, 20 %, 31 % and 29 % of the
variance of

In the extended protocol, an

Parameter optimisation steps in the extended protocol of robust
EMMA.

Figure 5a shows all 223 end-member loadings from 96 EMMA runs that agree
with the parameter space of

The resulting robust EM3 and EM4 loadings show high class-wise standard
deviations (SDs) around the mode positions (Fig. 6a). EM1 has a continuously
narrow uncertainty envelope (i.e.

With the compact protocol, the same parameter space (

Defining the limits by the automatic kernel density estimate approach
suggested only three out of four natural end-members as robust ones,
combining all loadings around class 100 (Fig. 5b, black line). Setting the
kernel bandwidth arbitrarily to 0.5 would allow separation of the two
overlapping modes around EM

The resulting end-members are shown in Fig. 6b. They are similar to the
plotted output of the deterministic version (Fig. 3) but extended by
uncertainty polygons, the different representation of scores and slightly
different mode positions, grain-size class-wise

The full benchmark reveals that all approaches successfully model the data
sets. The output of RECA shows difficulties in reaching the minimum
convexity error of

The average

The main absolute deviations of

Model performance to unmix and reproduce the example data set

The above criteria quantify how well the approaches modelled the data set
(Eq. 1), whereas their ability to reproduce the true “mixed ingredients”
is addressed here. The

A graphical comparison of the grain-size class-wise deviations of input
end-member distributions and modelled loadings (Fig. 8) shows that all
EMMAgeo-based models underestimate the main mode grain-size classes (i.e.
curves are below the

Natural versus modelled end-member grain-size distributions for
all evaluated models. Deviation of main mode (in number of classes).

Concerning the reproduction of the initial mixing ratios (Fig. 9, Table 2b),
variability among the models is higher, and all approaches show some
unsystematic over- and underestimation, especially for EM in samples in which real mixing ratios were zero (vertical point clusters along the 0 %

Natural versus modelled end-member mixing ratios for all evaluated
models.

The modal grain-size classes of the four EM

The functionality of EMMA has improved significantly since the introduction of the MATLAB algorithm of Dietze et al. (2012). Computation speed, which was already 1 to 3 orders of magnitude faster than for other algorithms (Paterson and Heslop, 2015), has increased further. In addition, many new and detailed ways to explore end-members (with deterministic EMMA) and to estimate and describe the associated uncertainties of all end-member components (with robust EMMA) have been implemented. The plot output of both EMMA modes is a comprehensive visualisation of all relevant information. It allows direct process interpretation in terms of plausibility of loadings and scores, model performance and identification of outliers.

Both EMMA modes, deterministic and robust, result in consistently similar
outputs. Deviations of individual modes of robust loadings from known
EM

Unmixing quality is very high regardless of the model used, suggesting that
all approaches in this benchmark are able to reproduce the input grain-size
data set with unmixed end-member subpopulations. There is no model with an
outstanding performance. Model deviations of

The validation against known input end-member composition showed that all
EMMA approaches are equally applicable. When comparing end-member loadings
with the EM

If the correct grain-size distribution shape of underlying process end-members is targeted, RECA of Seidel and Hlawitschka (2015) and EMMA by Weltje (1997) are most suitable from our benchmark study (Table 2a). RECA had problems with reaching the convexity error threshold, which could result from our data set with largely overlapping natural process end-members.

When quantifying the contribution of end-members to a given sample, robust
EMMA, EMMA according to Weltje (1997) and AnalySize performed best (Table 2b).
Robustly estimated scores using EMMAgeo reproduced original mixing
proportions very well and in a range comparable to the other available
end-member algorithms. However, as all approaches and earlier EMMA
evaluations showed, very low and high scores (

If uncertainty estimates for both loadings and scores are considered important, then only robust EMMA is suitable. The inclusion of uncertainties for loadings and scores is a key precondition for propagating model results to further data analysis, for example to interpret grain-size end-members as proxies for sediment sources (loadings) in environmental archives as they evolve with time (scores). As van Hateren et al. (2018) emphasise, changes in the model results will inevitably result in diverging interpretations of the assumed sedimentary processes. Also, the interpretations of the scores in their spatial (samples across a landscape) or temporal (samples downcore) context will be affected. Thus, it is extremely important to provide some estimate of the inherent uncertainty in both the proxy definition and in the sample domain. So far only robust EMMA can deliver such information. Yet, necessary parameter estimates and diverging start conditions evidently exist in the other models too.
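One simple way to summarise such uncertainty is a class-wise mean and standard deviation across several loading vectors that describe the same end-member in different model runs. The Python sketch below illustrates only this summary step; the function name `classwise_envelope` and the input layout are assumptions for the example and are not part of EMMAgeo, which provides its own routines in R.

```python
from statistics import mean, stdev


def classwise_envelope(loadings):
    """Summarise several loading vectors of one end-member.

    loadings: list of runs, each a list with one value per grain-size
              class (hypothetical layout; all runs have equal length).
    Returns (means, sds), each with one value per grain-size class.
    """
    # Transpose so that each tuple holds all runs for one class.
    columns = list(zip(*loadings))
    means = [mean(col) for col in columns]
    # The sample standard deviation needs at least two runs.
    sds = [stdev(col) if len(col) > 1 else 0.0 for col in columns]
    return means, sds


# Two toy runs with three grain-size classes each.
toy_means, toy_sds = classwise_envelope([[0.1, 0.6, 0.3], [0.3, 0.4, 0.3]])
```

Plotting the mean with a band of plus and minus one standard deviation gives an uncertainty envelope of the kind discussed above. Classes with a wide band are those in which the runs disagree most.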

If the distribution shape of an inherent natural grain-size end-member is known, EMMAgeo allows quantification of its contribution to the data set by including it as unscaled loadings in both deterministic and robust EMMA or by assigning the known main mode class limits when selecting robust end-members (step 4; Fig. 1b). Finally, if free and open-source software is a criterion – which is increasingly the case for journals and funding agencies (David et al., 2016; Munafò et al., 2017) – RECA and EMMAgeo remain the only options.

In previous benchmark studies, EMMAgeo performed less well, which
Paterson and Heslop (2015) attributed to the implementation of the
non-negativity and sum-to-one constraints. van Hateren et
al. (2018) pointed to the secondary modes as cause of the deviations of
scores from the mixing ratios. We cannot confirm the poor performance of
EMMAgeo in our study, as it is not fully clear how van
Hateren et al. (2018) determined the EMMAgeo loading curves, which they
evaluate graphically. They note that in EMMAgeo the

Yet, the occurrence of artificial secondary modes below the main modes of the end-members is more pronounced in EMMAgeo compared to other unmixing algorithms. The inherent compositional data constraints lead to an intimate linkage of the distribution shape of one end-member with the distribution shapes of other loadings. However, when excluding hardly interpretable secondary modes from global measures of model quality, the performance of EMMA is well within the range of other available algorithms. As repeatedly noted in articles applying EMMAgeo (Dietze et al., 2012, 2014) but also highlighted for other approaches in the benchmark study of Paterson and Heslop (2015), secondary modes are model artefacts and should not be interpreted genetically.

However, to test the impact of artificial secondary modes on model
performance, we modelled the EM

Going beyond classical measures of grain-size properties, EMMA is well suited to quantify sedimentary processes from mixed sediment sequences in space and time. However, interpretation of grain-size end-members requires expert knowledge about the investigated sedimentary system. Hence, when applying EMMA to any set of grain-size data, the interpretability of the resulting end-members needs to be checked. For this, both end-member components should be considered: the shape and position of the main modes of the loadings and the spatio-temporal or stratigraphic context of the scores. For example, the effectiveness of a process in sorting sediment could be interpreted in the classical sense from the shape of the end-member loadings (excluding artificial modes), with broader peaks indicating poorer sorting than narrow peaks (Friedman, 1961).

Like any other statistical method, EMMA is a tool, and interpretation of grain-size end-members relies on contextual knowledge. There may be processes that contribute to the overall sediment composition but that are not size-selective or do not sort sediment of various grain-size classes in a typical way. For example, event-triggered turbidity currents in lakes caused problems in attributing a single sedimentary process to end-members in the study by Dietze et al. (2014) because the typical fining-upwards trend was also reflected by several end-members that contributed to samples of “normal” deposition.

Closely related is the constraint of stationarity in processes, which implies that through space and time each transport process must create an identical grain-size distribution. For example, fining of aeolian material from one distinct source area with downwind transport distance (Pye, 1995) might rather be explored by a gradual approach, e.g. by running EMMA in a moving window over a data set to detect shifts in stationarity.
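The moving-window idea can be sketched as follows. This Python example only shows how an ordered sample set might be split into overlapping windows and passed to an unmixing routine. The function names `moving_windows` and `run_windowed`, and the `unmix` callback, are hypothetical placeholders. No such helper is claimed to exist in EMMAgeo.

```python
def moving_windows(n_samples, width, step=1):
    """Return (start, stop) index pairs of windows over ordered samples."""
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive integers")
    if width > n_samples:
        return []
    return [(s, s + width) for s in range(0, n_samples - width + 1, step)]


def run_windowed(samples, width, step, unmix):
    """Apply a user-supplied unmixing function to each window.

    unmix: any callable that takes a list of samples and returns a
           model result (placeholder for an end-member model run).
    """
    return [
        (start, stop, unmix(samples[start:stop]))
        for start, stop in moving_windows(len(samples), width, step)
    ]


# Toy usage: 'len' stands in for a real unmixing routine.
toy_results = run_windowed(list(range(10)), width=4, step=3, unmix=len)
```

Comparing the end-member mode positions between neighbouring windows would then indicate where the assumption of stationarity breaks down. The choice of window width trades resolution against the number of samples available for each model run.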

Post-depositional processes that change grain sizes, e.g. due to permafrost conditions or soil formation, could strongly disturb the original grain-size characteristics. In the worst case, a lacustrine sediment archive composed of different aeolian and fluvial sediment end-members (Dietze et al., 2013) can be affected by ongoing cryogenic and active-layer dynamics in such a way that all modelled end-members overlap and peak in similar grain-size classes, “erasing” primary signals related to sediment deposition. If post-depositional activity overprints the original depositional processes, EMMA can detect the overprinting processes as single end-members and would allow quantification of the intensity of the overprint, e.g. soil formation (Dietze et al., 2016) or weathering (Sun et al., 2002; Xiao et al., 2012).

Sediments affected by the processes mentioned above can influence end-member modelling in many ways. For example, EMMA could result in rather low explained variances, and the modes of affected end-member loadings would become broader and/or may even be better represented by additional but nevertheless spurious end-members. In the worst case, modes of end-member loadings overlap strongly or cannot be unmixed at all.

EMMAgeo allows the characterisation of multi-modal grain-size distributions by end-member subpopulations. New protocols allow a quick analysis, including modelling of associated uncertainties for both end-member loadings and scores. Using four known natural end-members, which represent typical sediment types found in terrestrial systems, the performance of EMMAgeo in unmixing the correct end-member distribution shapes and mixing ratios is within the same order as the performance of other available end-member modelling algorithms, which all perform very well. Hence, all of these algorithms are powerful tools for characterisation of different sediment source, transport, depositional and even post-depositional processes. In comparison to other algorithms, EMMAgeo is the only available open-source grain-size unmixing approach that includes uncertainty estimates. An inherent strength of the fully free R package is a large flexibility for users to modify the parameter settings and workflows with the new protocols, reproduce results and continue data evaluation.

Once genetically interpretable grain-size end-members are derived, their loadings can be described by classical descriptive measures (Folk and Ward, 1957; Blott and Pye, 2001). This allows a statistically robust determination and comparison of mean, sorting and shape measures across sites and data sets by describing and quantifying processes that sort sediment better or poorer than other processes.
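As an illustration of such descriptive measures, the Python sketch below computes mean, sorting (standard deviation) and skewness of an end-member loading with a simple method-of-moments calculation. Note that this is not the graphical method of Folk and Ward (1957); the function name `moment_measures` and the assumption that class midpoints are given in phi units are made for this example only.

```python
import math


def moment_measures(midpoints, fractions):
    """Method-of-moments statistics of a grain-size distribution.

    midpoints: class midpoints (assumed here to be in phi units).
    fractions: class fractions or percentages; rescaled to sum to 1.
    Returns (mean, sorting, skewness).
    """
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must have a positive sum")
    f = [x / total for x in fractions]
    mean = sum(fi * m for fi, m in zip(f, midpoints))
    variance = sum(fi * (m - mean) ** 2 for fi, m in zip(f, midpoints))
    sorting = math.sqrt(variance)
    if sorting == 0:
        # All material in one class: skewness is undefined, report 0.
        return mean, 0.0, 0.0
    skewness = sum(fi * (m - mean) ** 3 for fi, m in zip(f, midpoints)) / sorting ** 3
    return mean, sorting, skewness


# Toy loading: symmetric distribution over three classes.
toy_mean, toy_sorting, toy_skewness = moment_measures([1.0, 2.0, 3.0], [25, 50, 25])
```

A larger sorting value indicates a broader, more poorly sorted loading. Artificial secondary modes should be excluded before such measures are calculated, since they would inflate the sorting value.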

Many future applications in the fields of Quaternary Science, sedimentology, geology, geomorphology and hydrology could gain new insights from applying EMMAgeo to compositional data sets that represent mixtures. In contrast to classical linear decomposition methods such as principal component analysis, EMMA has the potential to quantify (and not just qualify) different sources or processes of modern and past sedimentary environments that contribute to a sample set, including associated model uncertainties.

The Supplement contains the example data set, end-member
measurement data, mixing ratios and output of the other approaches included
in the comparison. The R package EMMAgeo in its latest release version 0.9.6
(Dietze and Dietze, 2019;

The supplement related to this article is available online at:

ED and MD improved the original EMMA algorithm, workflows and auxiliary functionalities. ED compiled the operational modes of EMMA and MD established the EMMAgeo package. Both authors wrote the paper.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Connecting disciplines – Quaternary archives and geomorphological processes in a changing environment”. It is a result of the First Central European Conference on Geomorphology and Quaternary Sciences, Gießen, Germany, 23–27 September 2018.

Thomas Hösel and Claudia Ziener prepared the example data set. Philip Schulte performed grain-size analysis using the laser particle sizer at RWTH Aachen. Jan-Berend Stuut provided the data from the FORTRAN code of Weltje (1997), and Mitch D'Arcy provided language editing. Kai Hartmann and Andreas Borchers supported the initial development of EMMA, and Kirsten Elger coordinated the DOI and landing page. Many users of former versions of the MATLAB and R scripts greatly helped to improve EMMAgeo.

The article processing charges for this open-access publication were covered by a Research Centre of the Helmholtz Association.