2
1 1
X2
1 1
This Chapter serves as the reference for rOpenSci’s standards for statistical software. Software accepted for peer-review must fit one or more of our categories, and thus all packages must comply with the General Standards listed in the first of the following sections, as well as at least one of the category-specific sets of standards listed in the subsequent sections.
Our standards are open and intended to change and evolve in response to public feedback. Please contribute via the GitHub discussions pages for this book. We particularly encourage anybody preparing software for submission to discuss any aspects of our standards, including applicability, validity, phrasing, expectations, reasons for standards, and even the addition or removal of specific standards.
These general standards, and all category-specific standards that follow, are intended to serve as recommendations for best practices. Note in particular that many standards are written using the word “should” in explicit acknowledgement that adhering to such standards may not always be possible. All standards phrased in these terms are intended to be interpreted as applicable under such conditions as “Where possible”, or “Where applicable”. Developers are requested to note any standards which they deem not applicable to their software via the srr
package, as described in Chapter 3.
The R language defines the following data types:
class = "numeric"
/ typeof = "double"
)The base R system also includes what are considered here to be direct extensions of fundamental types to include:
The continuous type has a typeof
of “double” because that represents the storage mode in the C representation of such objects, while the class
as defined within R is referred to as “numeric”. While typeof
is not the same as class
, with reference to continuous variables, “numeric” may be considered identical to “double” throughout.
The term “character” is interpreted here to refer to a vector each element of which is an individual “character” object. The term “string” does not relate to any official R nomenclature, but is used here to refer for convenience to a character vector of length one; in other words, a “string” is the sole element of a single-length “character” vector.
We consider that statistical software submitted under our system will either (i) implement or extend prior methods, in which case the primary reference will be to the most relevant published version(s) of prior methods; or (ii) be an implementation of some new method. In the second case, it will be expected that the software will eventually form the basis of an academic publication. Until that time, the most suitable reference for equivalent algorithms or implementations should be provided.
The second and third options additionally require references to comparable algorithms or implementations to be documented somewhere within the software, including references to all known implementations in other computer languages. (A common location for such is a statement of “Prior Art” or similar at the end of the main README
document.)
We encourage these to placed within a repository’s CONTRIBUTING.md
file, as in this example. A simple Life Cycle Statement may be formed by selecting one of the following four statements.
This package is
- In a stable state of development, with minimal subsequent development
envisioned.
- In a stable state of development, with active subsequent development
primarily in response to user feedback.
- In a stable state of development, with some degree of active subsequent
development as envisioned by the primary authors.
- In an initially stable state of development, with a great deal of active
subsequent development envisioned.
Developers should not presume anywhere in the documentation of software that specific statistical terminology may be “generally understood”, and therefore not need explicit clarification. Even terms which many may consider sufficiently generic as to not require such clarification, such as “null hypotheses” or “confidence intervals”, will generally need explicit clarification. For example, both the estimation and interpretation of confidence intervals are dependent on distributional properties and associated assumptions. Any particular implementation of procedures to estimate or report on confidence intervals will accordingly reflect assumptions on distributional properties (among other aspects), both the nature and implications of which must be explicitly clarified.
The following standards describe several forms of what might be considered “Supplementary Material”. While there are many places within an R package where such material may be included, common locations include vignettes, or in additional directories (such as data-raw
) listed in .Rbuildignore
to prevent inclusion within installed packages.
Where software supports a publication, all claims made in the publication with regard to software performance (for example, claims of algorithmic scaling or efficiency; or claims of accuracy), the following standard applies:
Where claims regarding aspects of software performance are made with respect to other extant R packages, the following standard applies:
This section considers general standards for Input Structures. These standards may often effectively be addressed through implementing class structures, although this is not a general requirement. Developers are nevertheless encouraged to examine the guide to S3 vectors in the vctrs
package as an example of the kind of assurances and validation checks that are possible with regard to input data. Systems like those demonstrated in that vignette provide a very effective way to ensure that software remains robust to diverse and unexpected classes and types of input data. Packages such checkmate
enable direct and simple ways to check and assert input structures.
It is important to note for univariate data that single values in R are vectors with a length of one, and that 1
is of exactly the same data type as 1:n
. Given this, inputs expected to be univariate should:
match.arg()
or equivalent where applicable to only permit expected values.
tolower()
or equivalent to ensure input of character parameters is not case dependent; or explicitly document that parameters are strictly case-sensitive.
integer
via as.integer()
as.numeric()
as.character()
(and not paste
or paste0
)
as.factor()
as...()
functions
factor
type, secondary documentation should explicitly state whether these should be ordered
or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.
A few packages implement R versions of “static type” forms common in other languages, whereby the type of a variable must be explicitly specified prior to assignment. Use of such approaches is encouraged, including but not restricted to approaches documented in packages such as vctrs
, or the experimental package typed
. One additional standard for vector input is:
The units
package provides a good example, in creating objects that may be treated as vectors, yet which have a class structure that does not inherit from the vector
class. Using these objects as input often causes software to fail. The storage.mode
of the underlying objects may nevertheless be examined, and the objects transformed or processed accordingly to ensure such inputs do not lead to errors.
This sub-section concerns input in “tabular data” forms, meaning the base R forms array
, matrix
, and data.frame
, and other forms and classes derived from these. Tabular data generally have two dimensions, although may have more (such as for array
objects). There is a primary distinction within R itself between array
or matrix
representations, and data.frame
and associated representations. The former are restricted to storing data of a single uniform type (for example, all integer
or all character
values), whereas data.frame
as associated representations (generally) store each column as a list item, allowing different columns to hold values of different types. Further noting that a matrix
may, as of R version 4.0, be considered as a strictly two-dimensional array, tabular inputs for the purposes of these standards are considered to imply data represented in one or more of the following forms:
matrix
form when referring to specifically two-dimensional data of one uniform typearray
form as a more general expression, or when referring to data that are not necessarily or strictly two-dimensionaldata.frame
tibble
data.table
tsibble
for time series, or sf
for spatial data.Both matrix
and array
forms are actually stored as vectors with a single storage.mode
, and so all of the preceding standards G2.0–G2.5 apply. The other rectangular forms are not stored as vectors, and do not necessarily have a single storage.mode
for all columns. These forms are referred to throughout these standards as “data.frame
-type tabular forms”, which may be assumed to refer to data represented in either the base::data.frame
format, and/or any of the classes listed in the final of the above points.
General Standards applicable to software which is intended to accept any one or more of these data.frame
-type tabular inputs are then that:
Software need not necessarily test abilities to accept different types of inputs, because that may require adding packages to the Suggests
field of a package for that purpose alone. Nevertheless, software which somehow uses (through Depends
or Suggests
) any packages for representing tabular data should confirm in tests the ability to accept these types of input.
sf
-format data) or added (such as insertion of variable or column names where none were provided).
Note, for example, that an array
may have column names which start with numeric values, but that a data.frame
may not.
2
1 1
X2
1 1
If array
or matrix
class objects are accepted as input, then G2.8 implies that routines should be implemented to check for such conversion of column names.
The next standard concerns the following inconsistencies between three common tabular classes in regard the column extraction operator, [
.
data.frame
returns a vector
by default, and a data.frame
if drop = FALSE
.tibble
returns a single-column tibble
by default, and a vector
if drop = TRUE
.data.table
always returns a data.table
, and the drop
argument has no effect.Given such inconsistencies,
Adherence to the above standard G2.8 will ensure that any implicitly or explicitly assumed default behaviour will yield consistent results regardless of input classes.
Columns of tabular inputs
The follow standards apply to data.frame
-like tabular objects (including all derived and otherwise compatible classes), and so do not apply to matrix
or array
objects.
data.frame
-like tabular objects which have columns which do not themselves have standard class attributes (typically, vector
) are appropriately processed, and do not error without reason. This behaviour should be tested. Again, columns created by the units
package provide a good test case.
data.frame
-like tabular objects which have list columns should ensure that those columns are appropriately pre-processed either through being removed, converted to equivalent vector columns where appropriate, or some other appropriate treatment such as an informative error. This behaviour should be tested.
NA
) data, with options minimally including:
na.rm = FALSE
-type parameters (such as mean()
, sd()
or cor()
).
NaN
, Inf
and -Inf
), including potentially ignoring or removing such values.
This standard applies to all computer languages included in any package. In R, values can be affirmed to be integers through is.integer()
, or asserting that the storage.mode()
of an object is “integer”. One way to compare numeric values with tolerance is with the all.equal()
function, which accepts an additional tolerance
parameter with a default for numeric
comparison of sqrt(.Machine$double.eps)
, which is typically around e(-8–10). In other languages, including C and C++, comparisons of floating point numbers are commonly implemented by conditions such as if (abs(a - b) < tol)
, where tol
specifies the tolerance for equality.
Importantly, R functions such as duplicated()
and unique()
rely on equality comparisons, and this standard extends to require that software should not apply any functions which themselves rely on equality comparisons to floating point numbers.
stats::cov
function.
Estimates of covariance can be very sensitive to outliers, and a variety of methods have been developed for “robust” estimates of covariance, implemented in such packages as rms
, robust
, and sandwich
. Adhering to this standard merely requires an ability for a user to specify a particular covariance function, such as through an additional parameter. The stats::cov
function can be used as a default, and additional packages such as the three listed here need not necessarily be listed as Imports
to a package.
All packages should follow rOpenSci standards on testing and continuous integration, including aiming for high test coverage. Extant R packages which may be useful for testing include testthat
, tinytest
, roxytest
, and xpectr
.
NA
) or undefined (NaN
, Inf
) values, the absence of any such values in return objects should be explicitly tested.
For testing statistical algorithms, tests should include tests of the following types:
Note that authors should ensure that they use at least v3 of the testthat
package, which introduced a testthat_tolerance()
, defaulting to the value defined by base::all_equal()
of sqrt(.Machine$double.eps)
on all expect_equal()
expectations.
NA
fields or columns or all identical fields or columns
.Machine$double.eps
) to data does not meaningfully change results
Thorough testing of statistical software may require tests on large data sets, tests with many permutations, or other conditions leading to long-running tests. In such cases it may be neither possible nor advisable to execute tests continuously, or with every code change. Software should nevertheless test any and all conditions regardless of how long tests may take, and in doing so should adhere to the following standards:
<MYPKG>_EXTENDED_TESTS="true"
environment variable.
The extended tests can be then run automatically by GitHub Actions for example by adding the following to the env
section of the workflow:
MYPKG_EXTENDED_TESTS: ${{contains(github.event.head_commit.message, 'run-extended')}}
Extended tests will then be run in response to any commit message which contains the phrase run-extended
.
CONTRIBUTING.md
or tests/README.md
file.
Bayesian and Monte Carlo software centres on quantitative estimation of components of Baye’s theorem, particularly on estimation or application of prior and/or posterior probability distributions. The procedures implemented to estimate the properties of such distributions are commonly based on random sampling procedures, hence referred to as “Monte Carlo” routines in reference to the random yet quantifiable nature of casino games. The scope of this category also includes algorithms which focus on sampling routines only, such as Markov-Chain Monte Carlo (MCMC) procedures, independent of application in Bayesian analyses.
The term “model” is understood with reference here to Bayesian software to refer to an encoded description of how parameters specifying aspects of one or more prior distributions are transformed into (properties of) one or more posterior distributions.
Some examples of Bayesian and Monte Carlo software include:
bayestestR
package which “provides tools to describe … posterior distributions”ArviZ
package python package for exploratory analyses of Bayesian models, particularly posterior distributions.GammaGompertzCR
package, which features explicit diagnostics of MCMC convergence statistics.BayesianNetwork
package, which is in many ways a wrapper package primarily serving a shiny
app, and is also accordingly a package in the EDA category.fmcmc
package, which is a “classic” MCMC package which directly provides its own implementation, and generates its own convergence statistics.rsimsum
package which “summarise[s] results from Monte Carlo simulation studies”. Many of the statistics generated by this package are useful for assessing and comparing Bayesian and Monte Carlo software in general. (See also the MCMCvis
package, with more of a focus on visualisation.)walkr
package for “MCMC Sampling from Non-Negative Convex Polytopes”. This package is also indicative of the difficulties of deriving generally applicable assessments of software in this category, because MCMC sampling relies on fundamentally different inputs and outputs than many other MCMC routines.Click on the following link to view a demonstration Application of Bayesian and Monte Carlo Standards.
Bayesian and Monte Carlo Software (hereafter referred to for simplicity as “Bayesian Software”) is presumed to perform one or more of the following steps:
This chapter details standards for each of these steps, each prefixed with “BS”.
Prior to actual standards for documentation of inputs, we note one terminological standard for Bayesian software which uses the term “hyperparameter”:
This standard reflects the dual facts that this term is frequently used in Bayesian software, yet has no unambiguous definition or interpretation. The term “hyperparameter” is also used in other statistical contexts in ways that are often distinctly different from its common use in Bayesian analyses. Examples of the kinds of clarifications required to adhere to this standard include,
Hyperparameters refer here to parameters determining the form of prior distributions that conditionally depend on other parameters.
Such a clarification would then require further explicit distinction between “parameters” and “hyperparameters”. The remainder of these standards does not refer to “hyperparameters”, rather attempts to make explicit distinctions between different kinds of parameters, such as distributional or algorithmic control parameters. Beyond this standard, Bayesian Software should provide the following documentation of how to specify inputs:
README
, either as textual description or example code
This section contains standards primarily intended to ensure that input data, including model specifications, are validated prior to passing through to the main computational algorithms.
Bayesian Software is commonly designed to accept generic one- or two-dimensional forms such as vector, matrix, or data.frame
objects, for which the following standard applies.
The second set of standards in this section concern specification of prior distributions, model structures, or other equivalent ways of specifying hypothesised relationships among input data structures. R already has a diverse range of Bayesian Software with distinct approaches to this task, commonly either through specifying a model as a character vector representing an R function, or an external file either as R code, or encoded according to some alternative system (such as for rstan
).
Bayesian Software should:
The following example demonstrates how standards like the above (BS2.4-2.5) might be addressed. Consider the following function which defines a log-likelihood estimator for a linear regression, controlled via a vector of three distributional parameters, p
:
ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
Pre-processing stages should be used to determine:
x
and y
, are commensurate (BS2.1); non-commensurate inputs should error by default.p
(BS2.3)The latter task is not necessarily straightforward, because the definition of the function, ll()
, will itself generally be part of the input to an actual Bayesian Software function. This functional input thus needs to be examined to determine expected lengths of hyperparameter vectors. The following code illustrates one way to achieve this, relying on utilities for parsing function calls in R, primarily through the getParseData
function from the utils
package. The parse data for a function can be extracted with the following line:
x <- getParseData (parse (text = deparse (ll)))
The object x
is a data.frame
of every R token (such as an expression, symbol, or operator) parsed from the function ll
. The following section illustrates how this data can be used to determine the expected lengths of vector inputs to the function, ll()
.
Input arguments used to define parameter vectors in any R software are accessed through R’s standard vector access syntax of vec[i]
, for some element i
of a vector vec
. The parse data for such begins with the SYMBOL
of vec
, the [
, a NUM_CONST
for the value of i
, and a closing ]
. The following code can be used to extract elements of the parse data which match this pattern, and ultimately to extract the various values of i
used to access members of vec
.
vector_length <- function (x, i) {
xn <- x [which (x$token %in% c ("SYMBOL", "NUM_CONST", "'['", "']'")), ]
# split resultant data.frame at first "SYMBOL" entry
xn <- split (xn, cumsum (xn$token == "SYMBOL"))
# reduce to only those matching the above pattern
xn <- xn [which (vapply (xn, function (j)
j$text [1] == i & nrow (j) > 3,
logical (1)))]
ret <- NA_integer_ # default return value
if (length (xn) > 0) {
# get all values of NUM_CONST as integers
n <- vapply (xn, function (j)
as.integer (j$text [j$token == "NUM_CONST"] [1]),
integer (1), USE.NAMES = FALSE)
# and return max of these
ret <- max (n)
}
return (ret)
}
That function can then be used to determine the length of any inputs which are used as hyperparameter vectors:
ll <- function (p, x, y) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
p <- parse (text = deparse (ll))
x <- utils::getParseData (p)
# extract the names of the parameters:
params <- unique (x$text [x$token == "SYMBOL"])
lens <- vapply (params, function (i) vector_length (x, i), integer (1))
lens
#> y p x
#> NA 3 NA
And the vector p
is used as a hyperparameter vector containing three parameters. Any initial value vectors can then be examined to ensure that they have this same length.
Not all Bayesian Software is designed to accept model inputs expressed as R code. The rstan
package, for example, implements its own model specification language, and only allows distributional parameters to be named, and not addressed by index. While this largely avoids problems of mismatched lengths of parameter vectors, the software (at v2.21.1) does not ensure the existence of named parameters prior to starting the computational chains. This ultimately results in each chain generating an error when a model specification refers to a non-existent or undefined distributional parameter. Such controls should be part of a single pre-processing stage, and so should only generate a single error.
Computational parameters are considered here distinct from distributional parameters, and commonly passed to Bayesian functions to directly control computational processes. They typically include parameters controlling lengths of runs, lengths of burn-in periods, numbers of parallel computations, other parameters controlling how samples are to be generated, or convergence criteria. All Computational Parameters should be checked for general “sanity” prior to calling primary computational algorithms. The standards for such sanity checks include that Bayesian Software should:
While admittedly not always possible to define, plausible ranges may be as simple as ensuring values are greater than zero. Where possible, checks should nevertheless ensure appropriate responses to extremely large values, for example by issuing diagnostic messages about likely long computational times. The following two sub-sections consider particular cases of computational parameters.
Bayesian software generally relies on sequential random sampling procedures, with each sequence uniquely determined by (among other aspects) the value at which it is started. Given that, Bayesian software should:
Bayesian Software which implements or enables multiple computational chains should:
To avoid potential confusion between separate parameters to control random seeds and starting values, we recommended a single “starting values” rather than “seeds” argument, with appropriate translation of these parameters into seeds where necessary.
All Bayesian Software should implement computational parameters to control output verbosity. Bayesian computations are often time-consuming, and often performed as batch computations. The following standards should be adhered to in regard to output verbosity:
In additional to the General Standards for missing values (G2.13–2.16), and in particular G2.13, Bayesian Software should:
NA
, Inf
) values, and that such values, or entire rows including any such values, will be automatically removed from input data.
Where appropriate, Bayesian Software should:
An appropriate test for BS3.2 would confirm that system.time()
or equivalent timing expressions for perfectly collinear data should be less than equivalent routines called with non-collinear data. Alternatively, a test could ensure that perfectly collinear data passed to a function with a stopping criteria generated no results, while specifying a fixed number of iterations may generate results.
As mentioned, analytic algorithms for Bayesian Software are commonly algorithms to simulate posterior distributions, and to draw samples from those simulations. Numerous extant R packages implement and offer sampling algorithms, and not all Bayesian Software will internally implement sampling algorithms. The following standards apply to packages which do implement internal sampling algorithms:
Regardless of whether or not Bayesian Software implements internal sampling algorithms, it should:
An example of posterior validation is the Simulation Based Calibration approach implemented in the rstan
function sbc
). (Note also that the BayesValidate
package has not been updated for almost 15 years, so should not be directly used, although ideas from that package may be adapted for validation purposes.) Beyond this, where possible or applicable, Bayesian Software should:
This is often achieved by having default behaviour to stop after specified numbers of iterations regardless of convergence.
Unlike software in many other categories, Bayesian Software should generally return several kinds of distinct data, both the raw data derived from statistical algorithms, and associated metadata. Such distinct and generally disparate forms of data will be generally best combined into a single object through implementing a defined class structure, although other options are possible, including (re-)using extant class structures (see the CRAN Task view on Bayesian Inference for reference to other packages and class systems). Regardless of the precise form of return object, and whether or not defined class structures are used or implemented, the following standards apply:
The latter standard may also include returning a unique hash computed from the input data, to enable results to be uniquely associated with that input data. With regard to the input function, or alternative means of specifying prior distributions:
Where convergence checkers are implemented or provided:
With regard to additional methods implemented for, or dispatched on, return objects:
print
method for return objects
plot
method for return objects
Beyond these points:
summary
methods for return objects
Bayesian software should implement the following parameter recovery tests:
An example of adhering to this standard would be documentation or tests which demonstrate or confirm that computation times increase approximately logarithmically with increasing sizes of input data.
Exploration is a part of all data analyses, and Exploratory Data Analysis (EDA) is not something that is entered into and exited from at some point prior to “real” analysis. Exploratory Analyses are also not strictly limited to Data, but may extend to exploration of Models of those data. The category could thus equally be termed, “Exploratory Data and Model Analysis”, yet we opt to utilise the standard acronym of EDA in this document.
Summary statistics are generally intended to aid data exploration, and software providing summary statistics is also considered here as a form of EDA software. For simplicity, both kinds of software are referred to throughout these standards as “EDA software”, a phrase intended at all times to also encompass summary statistics software.
The category of EDA is somewhat different to many other categories considered here. Primary differences include:
Examples of EDA software include:
gtsummary
, which provides, “Presentation-ready data summary and analytic result tables.”smartEDA
package (with accompanying JOSS paper) “for automated exploratory data analysis”. The package, “automatically selects the variables and performs the related descriptive statistics. Moreover, it also analyzes the information value, the weight of evidence, custom tables, summary statistics, and performs graphical techniques for both numeric and categorical variables.” This package is potentially as much a workflow package as it is a statistical reporting package, and illustrates the ambiguity between these two categories.modeLLtest
package (with accompanying JOSS paper) is “An R Package for Unbiased Model Comparison using Cross Validation.” Its main functionality allows different statistical models to be compared, likely implying that this represents a kind of meta package.insight
package (with accompanying JOSS paper) provides “a unified interface to access information from model objects in R,” with a strong focus on unified and consistent reporting of statistical results.arviz
software for python (with accompanying JOSS paper) provides “a unified library for exploratory analysis of Bayesian models in Python.”iRF
package (with accompanying JOSS paper) enables “extracting interactions from random forests”, yet also focusses primarily on enabling interpretation of random forests through reporting on interaction terms.Click on the following link to view a demonstration Application of Exploratory Data Analysis Standards.
Reflecting these considerations, the following standards are somewhat differently structured than equivalent standards developed to date for other categories, particularly through being more qualitative and abstract. In particular, while documentation is an important component of standards for all categories, clear and instructive documentation is of paramount importance for EDA Software, and so warrants its own sub-section within this document.
The following refer to Primary Documentation, implying in main package README
or vignette(s), and Secondary Documentation, implying function-level documentation.
The Primary Documentation (README
and/or vignette(s)) of EDA software should:
Important distinctions between kinds of questions include whether they are inferential, predictive, associative, causal, or representative of other modes of statistical enquiry. The Secondary Documentation (within individual functions) of EDA software should:
A further primary difference of EDA software from that of our other categories is that input data for statistical software may be generally presumed of one or more specific types, whereas EDA software often accepts data of more general and varied types. EDA software should aim to accept and appropriately transform as many diverse kinds of input data as possible, through addressing the following standards, considered in terms of the two cases of input data in uni- and multi-variate form. All of the general standards for kinds of input (G2.0 - G2.12) apply to input data for EDA Software.
The following standards refer to an index column, which is understood to imply an explicitly named or identified column which can be used to provide a unique index index into any and all rows of that table. Index columns ensure the universal applicability of standard table join operations, such as those implemented via the dplyr
package.
attribute
on a table, x
, of attr(x, "index") <- <index_col_name>
.
For EDA software which either implements custom classes or explicitly sets attributes specifying index columns, these attributes should be used as the basis of all table join operations, and in particular:
EDA software designed to accept multi-tabular input should:
DM
package).
Classes are understood here to be the classes define single input objects, while Sub-Classes refer to the class definitions of components of input objects (for example, of columns of an input data.frame
). EDA software which is intended to receive input in general vector formats (see Uni-variate Input section of General Standards) should ensure that it complies with G2., so that vector input is appropriately processed regardless of input class. An additional standard for EDA software is that,
The following code illustrates some ways by which “metadata” defining classes and additional attributes associated with a standard vector object may by modified.
x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#>
#> $extra_attribute
#> [1] "another attribute"
#>
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301
All statistical software should appropriately deal with such input data, as exemplified by the storage.mode()
, length()
, and sum()
functions of the base
package, which return the appropriate values regardless of redefinition of class or additional attributes.
storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"
Tabular inputs in data.frame
class may contain columns which are themselves defined by custom classes, and which possess additional attributes. The ability of software to accept such inputs is covered by the Tabular Input section of the General Standards.
EDA software will generally not directly implement what might be considered as statistical algorithms in their own right. Where algorithms are implemented, the following standards apply.
Both of these standards also relate to the following standards for output values, visualisation, and summary output.
Examples of such compliance include ensuring that sum
, min
, or max
values applied to integer
-type vectors return integer
values.
print
and plot
methods give sensible results. Default summary
methods may also be implemented.
Visualization commonly represents one of the primary functions of EDA Software, and thus visualization output is given greater consideration in this category than in other categories in which visualization may nevertheless play an important role. In particular, one component of this sub-category is Summary Output, taken to refer to all forms of screen-based output beyond conventional graphical output, including tabular and other text-based forms. Standards for visualization itself are considered in the two primary sub-categories of static and dynamic visualization, where the latter includes interactive visualization.
Prior to these individual sub-categories, we consider a few standards applicable to visualization in general, whether static or dynamic.
graphics
package) should consider accessibility
numeric
types, rather should also use some version of round(., digits)
, formatC
, sprintf
, or similar functions for numeric formatting according the parameter described in EA4.1.storage.mode
, class
, or equivalent defining attribute of each column.
An example of compliance with the latter standard is the print.tibble
method of the tibble
package.
pretty()
function).
Dynamic visualization routines are commonly implemented as interfaces to javascript
routines. Unless routines have been explicitly developed as an internal part of an R package, standards shall not be considered to apply to the code itself, rather only to decisions present as user-controlled parameters exposed within the R environment. That said, one standard may nevertheless be applied, which aims to maximise inter-operability between packages.
data.frame
-type tabular objects
numeric
values either using testthat::expect_equal()
or equivalent with a defined value for the tolerance
parameter, or using round(..., digits = x)
with some defined value of x
prior to testing equality.
vdiffr
package or equivalent.
Tests for graphical output are frequently only run as part of an extended test suite.
R has an extensive and diverse ecosystem of Machine Learning (ML) software which is very well described in the corresponding CRAN Task View. Unlike most other categories of statistical software considered here, the primary distinguishing feature of ML software is not (necessarily or directly) algorithmic, rather pertains to a workflow typical of machine learning tasks. In particular, we consider ML software to approach data analysis via the two primary steps of:
A single ML task generally yields two distinct outputs:
Click on the following link to view a demonstration Application of Machine Learning Software Standards.
A Machine Learning Workflow
Given those initial considerations, we now attempt the difficult task of envisioning a typical standard workflow for inherently diverse ML software. The following workflow ought to be considered an “extensive” workflow, with shorter versions, and correspondingly more restricted sets of standards, possible dependent upon envisioned areas of application. For example, the workflow presumes input data to be too large to be stored as a single entity in local memory. Adaptation to situations in which all training data can be loaded into memory may mean that some of the following workflow stages, and therefore corresponding standards, may not apply.
Just as typical workflows are potentially very diverse, so are outputs of ML software, which depend on areas of application and intended purpose of software. The following refers to the “desired output” of ML software, a phrase which is intentionally left non-specific, but which it intended to connote any and all forms of “response variable” and other “pre-specified outputs” such as categorical labels or validation data, along with outputs which may not necessarily be able to be pre-specified in simple uni- or multi-variate form, such as measures of distance between sets of training and validation data.
Such “desired outputs” are presumed to be quantified in terms of a “loss” or “cost” function (hereafter, simply “loss function”) quantifying some measure of distance between a model estimate (resulting from applying the model to one or more components of a training data set) and a pre-defined “valid” output (during training), or a test data set (following training).
Given the foregoing considerations, we consider a typical ML workflow to progress through (at least some of) the following steps:
training
and test
data, along with optional additional categories and labels such as validation
data used, for example, to determine accuracy of models applied to training data yet prior to testing.Importantly, ML workflows may be partly iterative. This may in turn potentially confound distinctions between training and test data, and accordingly confound expectations commonly placed upon statistical analyses of statistical independence of response variables. ML routines such as cross-validation repeatedly (re-)partition data between training and test sets. Resultant models can then not be considered to have been developed through application to any single set of truly “independent” data. In the context of the standards that follow, these considerations admit a potential lack of clarity in any notional categorical distinction between training and test data, and between model specification and training.
The preceding workflow mentioned a couple of concepts the interpretations of which in the context of these standards may be seen by clicking on the corresponding items below. Following that, we proceed to standards for ML software, enumerated and developed with reference to the preceding workflow steps. In order that the following standards initially adhere to the enumeration of workflow steps given above, more general standards pertaining to aspects such as documentation and testing are given following the initial five “workflow” standards.
The following definition comes from a vignette for the rray
package named Broadcasting.
This concept runs counter to aspects of standards in other categories, which often suggest that functions should error when passed input objects which do not have commensurate dimensions. Broadcasting is a pre-processing step which enables objects with incommensurate dimensions to be dimensionally reconciled.
The following demonstration is taken directly from the rray
package (which is not currently on CRAN).
Broadcasting is commonly employed in ML software because it enables ML operations to be implemented on objects with incommensurate dimensions. One example is image analysis, in which training data may all be dimensionally commensurate, yet test images may have different dimensions. Broadcasting allows data to be submitted to ML routines regardless of potentially incommensurate dimensions.
This parameter is particularly important for training ML algorithms like neural networks, the results of which can be very sensitive to variations in learning rates. A useful overview of the importance of learning rates, and a useful approach to automatically determining appropriate values, is given in this blog post.
Partly because of widespread and current relevance, the category of Machine Learning software is one for which there have been other notable attempts to develop standards. A particularly useful reference is the MLPerf organization which, among other activities, hosts several github repositories providing reference datasets and benchmark conditions for comparing performance aspects of ML software. While such reference or benchmark standards are not explicitly referred to in the current version of the following standards, we expect them to be gradually adapted and incorporated as we start to apply and refine our standards in application to software submitted to our review system.
Many of the following standards refer to the labelling of input data as “testing” or “training” data, along with potentially additional labels such as “validation” data. In regard to such labelling, the following two standards apply,
The following three standards (ML1.2–ML1.4) represent three possible design intentions for ML software. Only one of these three will generally be applicable to any one piece of software, although it is nevertheless possible that more than one of these standards may apply. The first of these three standards applies to ML software which is intended to process, or capable of processing, input data as a single (generally tabular) object.
TRUE
/FALSE
or 0
/1
values, or which uses some other system such as missing (NA
) values to denote test data); and/orThe second of these three standards applies to ML software which is intended to process, or capable of processing, input data represented as multiple objects which exist in local memory.
list
item), or should enable an additional means of categorically distinguishing training from test data (such as via an additional parameter which provides explicit labels). Where applicable, distinction of validation and any other data should also accord with this standard.
The third of these three standards for data input applies to ML software for which data are expected to be input as references to multiple external objects, generally expected to be read from either local or remote connections.
The following standard applies to all ML software regardless of the applicability or otherwise of the preceding three standards.
Missing data are handled differently by different ML routines, and it is also difficult to suggest generally applicable standards for pre-processing missing values in ML software. The General Standards for missing values (G2.13–G2.16) do not apply to Machine Learning software, in the place of which the following standards attempt to cover a practical range of typical approaches and applications.
As reflected in the workflow envisioned at the outset, ML software operates somewhat differently to statistical software in many other categories. In particular, ML software often requires explicit specification of a workflow, including specification of input data (as per the standards of the preceding sub-section), and of both transformations and statistical models to be applied to those data. This section of standards refers exclusively to the transformation of input data as a pre-processing step prior to any specification of, or submission to, actual models.
print
method which summarizes the input data set (as per ML1.5 above) and associated transformations (see the following standard).
Standards for most other categories of statistical software suggest that pre-processing routines should ensure that input data sets are commensurate, for example, through having equal numbers of cases or rows. In contrast, ML software is commonly intended to accept input data which can not be guaranteed to be dimensionally commensurate, such as software intended to process rectangular image files which may be of different sizes.
Beyond broadcasting and dimensional transformations, the following standards apply to the pre-processing stages of ML software.
For all transformations applied to input data, whether of dimension (ML2.1) or scale (ML2.2),
A “model” in the context of ML software is understood to be a means of specifying a mapping between input and output data, generally applied to training and validation data. Model specification is the step of specifying how such a mapping is to be constructed. The specification of what the values of such a model actually are occurs through training the model, and is described in the following sub-section. These standards also refer to control parameters which specify how models are trained. These parameters commonly include values specifying numbers of iterations, training rates, and parameters controlling algorithmic processes such as re-sampling or cross-validation.
nofit = FALSE
).
print
method which summarises the model specification, including values of all relevant parameters.
A function fulfilling ML3.0–3.2 might, for example, permit the following arguments:
data
: Input data specification constructed according to ML1
model
: An optional previously-trained modelcontrol
: A list of parameters controlling how the model algorithm is to be applied during the subsequent training phase (ML4).A function with the arguments defined above would fulfil the preceding three standards, because the data
stage would represent the output of ML1, while the model
stage would allow for different pre-trained models to be submitted using the same data and associated specifications (ML3.1). The provision of a separate .data
argument would fulfil ML3.2 by allowing one or both model
or control
parameters to be re-defined while submitting the same data
object.
Control parameters are considered here to specify how a model is to be applied to a set of training data. These are generally distinct from parameters specifying the actual model (such as model architecture). While we recommend that control parameters be submitted as items of a single named list, this is neither a firm expectation nor an explicit part of the current standards.
ML software often involves manipulation of large numbers of rectangular arrays for which graphics processing units (GPUs) are often more efficient than central processing units (CPUs). ML software thus commonly offers options to train models using either CPUs or GPUs. While these standards do not currently suggest any particular design choice in this regard, we do note the following:
libcudacxx
.
This library can be “switched on” through activating a single C++ header file to switch from CPU to GPU.
Model training is the stage of the ML workflow envisioned here in which the actual computation is performed by applying a model specified according to ML3 to data specified according to ML1 and ML2.
The following standards apply to ML software which implements batch processing, commonly to train models on data sets too large to be loaded in their entirety into memory.
According to that standard, it would for example be inappropriate to have a parameter, nepochs
, described as “Number of epochs used in model training”. Rather, the definition and particular implementation of “epoch” must be explicitly defined.
As described at the outset, ML software does not always rely on pre-specified and categorical distinctions between training and test data. For example, models may be fit to what is effectively one single data set in which specified cases or rows are used as training data, and the remainder as test data. Re-sampling generally refers to the practice of re-defining categorical distinctions between training and test data. One training run accordingly connotes training a model on one particular set of training data and then applying that model to the specified set of test data. Re-sampling starts that process anew, through constructing an alternative categorical partition between test and training data.
Even where test and training data are distinguished by more than a simple data-internal category (such as a labelling column), for example, by being stored in distinctly-named sub-directories, re-sampling may be implemented by effectively shuffling data between training and test sub-directories.
Model output is considered here as a stage distinct from model performance. Model output refers to the end result of model training (ML4), while model performance involves the assessment of a trained model against a test data set. The present section first describes standards for model output, which are standards guiding the form of a model trained according to the preceding standards (ML4). Model Performance is then considered as a separate stage.
print
method which summarises important aspects of the model object, including but not limited to summaries of input data and algorithmic control parameters.
saveRDS
are not appropriate for storing local copies of trained models, an explicit function should be provided for that purpose, and should be demonstrated with example code.
The R6
system for representing classes in R is an example of a system with explicit functionality, all components of which are accessible by a simple ls()
call. Adherence to ML5.2a would nevertheless require explicit description of the ability of ls()
to supply a list of all functions associated with an object. The mlr
package, for example, uses R6
classes, yet neither explicitly describes the use of ls()
to list all associated functions, nor explicitly lists those functions.
Model performance refers to the quantitative assessment of a trained model when applied to a set of test data.
The remaining sub-sections specify general standards beyond the preceding workflow-specific ones.
The following standard applies to packages which are intended or other able to only encompass a restricted subset of the six primary workflow steps enumerated at the outset. Envisioned here are packages explicitly intended to aid one particular aspect of the general workflow envisioned here, such as implementations of ML optimization functions, or specific loss measures.
The following standard applies to models in both untrained and trained forms, considered to be the respective outputs of the preceding standards ML3 and ML4.
expand.grid()
is recommended.
The following example illustrates:
Var1 Var2 Var3
1 archA optA costA
2 archB optA costA
3 archA optB costA
4 archB optB costA
5 archA optC costA
6 archB optC costA
7 archA optA costB
8 archB optA costB
9 archA optB costB
10 archB optB costB
11 archA optC costB
12 archB optC costB
13 archA optA costC
14 archB optA costC
15 archA optB costC
16 archB optB costC
17 archA optC costC
18 archB optC costC
All possible combinations of these categorical parameters could then be tested by iterating over the rows of that output.
This sub-section details standards for Regression and Supervised Learning Software – referred to from here on for simplicity as “Regression Software”. Regression Software implements algorithms which aim to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, \(Y\)). In contrast, the analogous category of Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data, and a second set of “output” data which are not necessarily known or given prior to the analysis.
Common purposes of Regression Software are to fit models to estimate relationships or to make predictions between specified inputs and outputs. Regression Software includes tools with inferential or predictive foci, Bayesian, frequentist, or probability-free Machine Learning (ML) approaches, parametric or or non-parametric approaches, discrete outputs (such as in classification tasks) or continuous outputs, and models and algorithms specific to applications or data such as time series or spatial data. In many cases other standards specific to these subcategories may apply.
Examples of the diversity of Regression and Unsupervised Learning software include the following.
xrnet
to perform “hierarchical regularized regression to incorporate external data”, where “external data” in this case refers to structured meta-data as applied to genomic features.survPen
is, “an R package for hazard and excess hazard modelling with multidimensional penalized splines”areal
is, “an R package for areal weighted interpolation”.ChiRP
is a package for “Chinese Restaurant Process mixtures for regression and clustering”, which implements a class of non-parametric Bayesian Monte Carlo models.klrfome
is a package for, “kernel logistic regression on focal mean embeddings,” with a specific and exclusive application to the prediction of likely archaeological sites.gravity
is a package for “estimation methods for gravity models in R,” where “gravity models” refers to models of spatial interactions between point locations based on the properties of those locations.compboost
is an example of an R package for gradient boosting, which is inherently a regression-based technique, and so standards for regression software ought to consider such applications.ungroup
is, “an R package for efficient estimation of smooth distributions from coarsely binned data.” As such, this package is an example of regression-based software for which the input data are (effectively) categorical. The package is primarily intended to implement a particular method for “unbinning” the data, and so represents a particular class of interpolation methods.registr
is a package for “registration for exponential family functional data,” where registration in this context is effectively an interpolation method applied within a functional data analysis context.ggeffects
for “tidy data frames of marginal effects from regression models.” This package aims to make statistics quantifying marginal effects readily understandable, and so implements a standard (tidyverse-based) methodology for representing and visualising statistics relating to marginal effects.Click on the following link to view a demonstration Application of Regression and Supervised Learning Standards.
The following standards are divided among several sub-categories, with each standard prefixed with “RE”.
See Max Kuhn’s RStudio blog post for examples of how to implement and describe such conversions.
Examples documentation addressing this standard include clarifying that software accepts only numeric inputs in vector
or matrix
form, or that all inputs must be in data.frame
form with both column and row names.
attributes()
.
This standard reflects the common process in regression software of transforming a rectangular input structure into a modified version which includes additional columns of model fits or predictions. Software which constructs such modified versions anew often copies numeric values from input columns, and may implicitly drop additional information such as attributes. This standard requires all such information to be retained.
factor
, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
NA
or NaN
values from Inf
values (for example, through use of na.omit()
and related functions from the stats
package).
Note that fulfilling this standard ensures compliance with all General Standard for missing values (G2.13–G2.16).
These pre-processing routines should also be tested as described below.
The following standards apply to the model fitting algorithms of Regression Software which implement or rely on iterative algorithms which are expected to converge to generate model statistics. Regression Software which implements or relies on iterative convergence algorithms should:
lm
, glm
, or model objects from other packages), or creating a new class of model objects.
Regression Software should provide functions to access or extract as much of the following kinds of model data as possible or practicable. Access should ideally rely on class-specific methods which extend, or implement otherwise equivalent versions of, the methods from the stats
package which are named in parentheses in each of the following standards.
Model objects should include, or otherwise enable effectively immediate access to the following descriptors. It is acknowledged that not all regression models can sensibly provide access to these descriptors, yet should include access provisions to all those that are applicable.
coeff()
/ coefficients()
)
confint()
)
formula()
)
nobs()
)
vcov()
)
Note that compliance with RE4.6 should also heed General Standard G3.1 in offering user control over covariance algorithms. Regression Software should further provide simple and direct methods to return or otherwise access the following form of data and metadata, where the latter includes information on any transformations which may have been applied to the data prior to submission to modelling routines.
Regression software may additionally opt to provide simple and direct methods to return or otherwise access the following:
Not all regression software is intended to, or can, provide distinct abilities to extrapolate or forecast. Moreover, identifying cases in which a regression model is used to extrapolate or forecast may often be a non-trivial exercise. It may nevertheless be possible, for example when input data used to construct a model are unidimensional, and data on which a prediction is to be based extend beyond the range used to construct the model. Where reasonably unambiguous identification of extrapolation or forecasting using a model is possible, the following standards apply:
Distinct from extrapolation or forecasting abilities, the following standard applies to regression software which relies on, or otherwise provides abilities to process, categorical grouping variables:
predict()
methods.
print
method which provides an on-screen summary of model (input) parameters and (output) coefficients.
summary
methods for model objects, and in particular should implement distinct summary
methods for any cases in which calculation of summary statistics is computationally non-trivial (for example, for bootstrapped estimates of confidence intervals).
Beyond the General Standards for documentation, Regression Software should explicitly describe the following aspects, and ideally provide extended documentation including summary graphical reports of:
plot
methods, either through explicit implementation, extension of methods for existing model objects, or through ensuring default methods work appropriately.
plot
method is NOT a generic plot
method dispatched on the class of return objects (that is, through an S3-type plot.<myclass>
function or equivalent), that method dispatch (or equivalent) should nevertheless exist in order to explicitly direct users to the appropriate function.
plot
method should produce a plot of the fitted
values of the model, with optional visualisation of confidence intervals or equivalent.
The following standard applies only to software fulfilling RE4.14-4.15, and the conditions described prior to those standards.
predict()
method), the default plot
method should provide clear visual distinction between modelled (interpolated) and forecast (extrapolated) values.
Tests for Regression Software should include the following conditions and cases:
Tests for Regression Software should
Standards for spatial software begin with a consideration and standardisation of domains of applicability. Following that we proceed to standards according to which spatial software is presumed to perform one or more of the following steps:
Each standard for spatial software is prefixed with “SP”.
Many developers of spatial software in R, including many of those those featured on the CRAN Task view on “Analysis of Spatial Data”, have been primarily focussed on geographic data; that is, data quantifying positions, structures, and relationships on the Earth and other planets. Spatial analyses are nevertheless both broader and more general than geography alone. In particular, spatial software may be geometric – that is, concerned with positions, structures, and relationships in space in any general or specific sense, not necessarily confined to geographic systems alone.
It is important to distinguish these two domains because many algorithms and procedures devised in one of these two domains are not necessarily (directly) applicable in the other, most commonly because geometric algorithms presume space to be rectilinear or Cartesian, while geographic algorithms (generally) presume it be have a specific curvilinear form (commonly spherical or elliptical). Algorithms designed for Cartesian space may not be directly applicable in curvilinear space, and vice-versa.
Moreover, spatial software and algorithms might be intended to apply in spaces of arbitrary dimensionality. The phrase “Cartesian” refers to any space of arbitrary dimensionality in which all dimensions are orthogonal and described by straight lines; dimensions in a curvilinear space or arbitrary dimensionality are described by curved lines. A planar geometry is a two-dimensional Cartesian space; a spherical geometry is a two- (or maybe three-)dimensional curvilinear space.
One of the earliest and still most widely used R spatial packages, spatstat
(first released 2002), describes itself as, “[f]ocused mainly on two-dimensional point patterns, including multitype/marked points, in any spatial region.” Routines from this package are thus generally applicable to two-dimensional Cartesian data only, even through the final phrase might be interpreted to indicate a comprehensive generality. spatstat
routines may not necessarily give accurate results when applied in curvilinear space.
These considerations motivate the first standard for spatial software:
We encourage the use of clear and unambiguous phrases such as “planar”, “spherical”, “Cartesian”, “rectilinear” or “curvilinear”, along with clear indications of dimensionality such as “two-” or “three-dimensional.” Concepts of dimensionality should be interpreted to refer explicitly to the dimensionality of independent spatial coordinates. Elevation is a third spatial dimension, and time may also be considered an additional dimension. Beyond those two, other attributes measured at spatial locations do not represent additional dimensions.
These considerations of domains of applicability permeate much of the ensuring standards, which distinguish “geometric software” from “geographic software”, where these phrases are to be interpreted as shorthand references to software intended for use in the respective domains.
Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Spatial Software in particular, a range of class systems have been developed, for which we refer to the CRAN Task view on “Analysis of Spatial Data”. Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures.
As for our standards for Time-Series Software, these standards for Spatial Software also suggest that software should use explicit class systems designed and intended for spatial data. New packages may implement new class systems for spatial data, and these may even be as simple as appending a class attribute to a matrix of coordinates. The primary motivation of the following standard is nevertheless to encourage and enhance inter-operability with the rich system of classes for spatial data in R.
Spatial Workflows, Packages, and Classes
Spatial software encompasses an enormous diversity, yet workflows implemented by spatial software often share much in common. In particular, coordinate reference systems used to precisely relate pairs of coordinates to precise locations in a curvilinear space, and in particular to the Earth’s ellipsoid, need to be able to be compared and transformed regardless of the specificities of individual software. This ubiquitous need has fostered the development of the PROJ
library for representing and transforming spatial coordinates. Several other libraries have been built on top or or alongside that, notably including the GDAL
(“Geospatial Data Abstraction Library”) and GEOS
(“Geometry Engine, Open Source”) libraries. These libraries are used by, and integrated within, most geographical spatial software commonly used today, and will likely continue to be used.
While not a standard in itself, it is expected that spatial software should not, absent very convincing and explicit justification, attempt to reconstruct aspects of these generic libraries. Given that, the following standards aim to ensure that spatial software remains as compatible as possible with workflows established by preceding packages which have aimed to expose and integrate as much of the functionality of these generic libraries as possible. The use of specific class systems for spatial data, and the workflows encapsulated in associated packages, ensures maximal ongoing compatibility with these libraries and with spatial workflows in general.
Notable class systems and associated packages in R include sp
, sf
, and raster
, and more recent extensions such as stars
, terra
, and s2
. With regard to these packages, the following single standard applies, because the maintainer of sp has made it clear that new software should build upon sf, not sp.
sp
package, rather should use sf
.
More generally,
This standard is further refined in a number of subsequent standards concerning documentation and testing.
GDAL
, and therefore by the sf
package) should include example and test code which load those data in spatial formats, rather than R-specific binary formats such as .Rds
.
See the sf
vignette on “Reading, Writing and Converting Simple Features” for useful examples.
Coordinate Reference Systems
As described above, one of the primary reasons for the development of classes in Spatial Software is to represent the coordinate reference systems in which data are represented, and to ensure compatibility with the PROJ
system and other generic spatial libraries. The PROJ
standards and associated software library have been recently (2020) updated (to version number 7) with “breaking changes” that are not backwards-compatible with previous versions, and in particular with the long-standing version 4. The details and implications of these changes within the context of spatial software in R can be examined in this blog entry on r-spatial.org
, and in this vignette for the rgdal
package. The “breaking” nature of these updates partly reflects analogous “breaking changes” associated with updates in the “Well-Known Text” (WKT) system for representing coordinate reference systems.
The following standard applies to software which directly or indirectly relies on geographic data which uses or relies upon coordinate reference systems.
PROJ
, and with WKT2
representations. The primary implication, described in detail in the articles linked to above, is that:
General Input Structures
New spatial software may nevertheless eschew these prior packages and classes in favour of implementing new classes. Whether or not prior classes are used or expected, geographic software should accord as much as possible with the principles of these prior systems by according with the following standards:
The following standards will be conditionally applicable to some but not all spatial software. Procedures for standards deemed not applicable to a particular piece of software are described in the srr
package.
An example of software which would not adhere to SP3.2 would be where input data were a simple matrix of spatial coordinates, and sampling were implemented using the sample()
function to randomly select elements of those input data (like sample(nrow(xy), n)
). In the context of an example based on the sample()
function, adhering to the standard would require including an additional prob
vector where each point was weighted by the local density of surrounding points. Doing so would lead to higher probabilities of samples being taken from central clusters of higher densities than from outlying extreme points. Note that the standard merely suggests that software should enable such density-based samples to be taken, not that it must, or even necessarily should by default.
Algorithms for spatial software are often related to other categories of statistical software, and it is anticipated that spatial software will commonly also be subject to standards from these other categories. Nevertheless, because spatial analyses frequently face unique challenges, some of these category-specific standards also have extension standards when applied to spatial software. The following standards will be applicable for any spatial software which also fits any of the other listed categories of statistical software.
Regression Software
Unsupervised Learning Software
The following standard applies to any spatial unsupervised learning software which uses clustering algorithms.
Machine Learning Software
One common application in which machine learning algorithms are applied to spatial software is in analyses of raster images. The first of the following standards applies because the individual cells or pixels of these raster images represent fixed spatial coordinates. (This standard also renders ML2.1 inapplicable).
A definition of broadcasting is given at the end of the introduction to corresponding Machine Learning Standards, just above Input Data Specification.
A simple example might be to provide examples or extended documentation which compares the effects of sampling both test and training data from the same spatial region versus sampling them from distinct regions. Although there are no comparable General Standard for Machine Learning Software, procedures for sampling spatial data may have particularly pronounced effects on results, and this standard attempts to foster a “best practice” of documenting how such effects may arise with a given piece of software.
A more concrete example may be to demonstrate a particular technique for generating distinct test and training data such as spatial partitioning (Muenchow n.d.; Brenning 2012; Schratz et al. 2019; Valavi et al. 2019). There may nevertheless be cases in which such sampling from a common spatial region is appropriate, for example for software intended to analyse or model temporally-structured spatial data for which a more appropriate distinction might be temporal rather than spatial. Adherence to this standard merely requires that the potential for any such confounding effects be explicitly documented (and possibly tested as well).
For (functions within) Spatial Software which return spatial data:
Spatial Software which returns objects in a custom class structure explicitly designed to represent or include spatial data should:
plot
methods for any implemented class system.
An example of SP5.1 might be ensuring that longitude is placed on the x-axis, latitude on the y, although standard orientations may depend on coordinate reference systems and other aspects of data and software design. The preceding three standards will generally not apply to software which returns objects in a custom class structure yet which is not inherently spatial.
Spatial Software which returns objects with geographical coordinates should:
html
-based) visualisations of results.
The following standards apply to all Spatial Software which is intended or able to be applied to data represented in curvilinear systems, notably including all geographical data. The only Spatial Software to which the following standards do not (necessarily) apply would be software explicitly intended to be applied exclusively to Cartesian spatial data, and which ensured appropriate rejection of curvilinear data according to SP2.0b.
Round-Trip Tests
This standard is applicable to any software which implements any routines for coordinate transformations, even if those routines are implemented via PROJ
. Conversely, software which has no routines for coordinate transformations need not adhere to SP6.0, even if that software relies on PROJ
for other purposes.
Extreme Geographical Coordinates
While such tests should generally confirm that software generates reliable results to such extreme coordinates, software which is unable to generate reliable results to such inputs should nevertheless include tests to indicate both approximate bounds of reliability, and the expected characteristics of unreliable results.
The remaining standards for testing Spatial Software extend directly from the preceding Algorithmic Standards (SP3), with the same sub-section headings used here.
Unsupervised Learning Software
Machine Learning Software
The category of Time Series software is arguably easier to define than the preceding categories, and represents any software the primary input of which is intended to be temporally structured data. Importantly, while “temporally structured” may often imply temporally ordered, this need not necessarily be the case. The primary definition of temporally structured data is that they possess some kind of index which can be used to extract temporal relationships.
Time series software is presumed to perform one or more of the following steps:
This document details standards for each of these steps, each prefixed with “TS”.
Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Time Series Software in particular, a range of class systems have been developed, for which we refer to the section “Time Series Classes” in the CRAN Task view on Time Series Analysis”, and the class-conversion package tsbox
. Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures. In particular, because of the long history of time series software in R, and the variety of class systems for representing time series data, new time series packages should accept as many different classes of input as possible by according with the following standards:
The core algorithms of time-series software are often ultimately applied to simple vector objects, and some time series software accepts simple vector inputs, assuming these to represent temporally sequential data. Permitting such generic inputs nevertheless prevents any such assumptions from being asserted or tested. Missing values pose particular problems in this regard. A simple na.omit()
call or similar will shorten the length of the vector by removing any NA
values, and will change the explicit temporal relationship between elements. The use of explicit classes for time series generally ensures an ability to explicitly assert properties such as strict temporal regularity, and to control for any deviation from expected properties.
Such documentation should include a demonstration of how to input data in at least one commonly used class for time-series such as ts
.
tsbox
package provides one convenient approach for this).
For Time Series Software which relies on or implements custom classes or types for representing time-series data, the following standards should be adhered to:
While most common packages and classes for time series data assume absolute temporal scales such as those represented in POSIX
classes for dates or times, time series may also be quantified on relative scales where the temporal index variable quantifies intervals rather than absolute times or dates. Many analytic routines which accept time series inputs in absolute form are also appropriately applied to analogous data in relative form, and thus many packages should accept time series inputs both in absolute and relative forms. Software which can or should accept times series inputs in relative form should:
units
package for attributing SI units to R vectors.
One critical pre-processing step for Time Series Software is the appropriate handling of missing data. It is convenient to distinguish between implicit and explicit missing data. For regular time series, explicit missing data may be represented by NA
values, while for irregular time series, implicit missing data may be represented by missing rows. The difference is demonstrated in the following table.
Time | value |
08:43 | 0.71 |
08:44 | NA |
08:45 | 0.28 |
08:47 | 0.34 |
08:48 | 0.07 |
The value for 08:46 is implicitly missing, while the value for 08:44 is explicitly missing. These two forms of missingness may connote different things, and may require different forms of pre-processing. With this in mind, and beyond the General Standards for missing data (G2.13–G2.16), the following standards apply:
This latter standard is a modified version of General Standard G2.14, with additional requirements via TS2.1b.
Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should:
The two options in the last point (TS2.4b) respectively translate to enabling transformations to ensure stationarity by providing appropriate routines, generally triggered by some function parameter, or advising on appropriate transformations, for example by directing users to additional functions able to implement appropriate transformations.
Where auto-covariance matrices are constructed or otherwise used within or as input to functions, they should:
index
attribute of the time series data as an attribute of the auto-covariance matrix.
General Standard G3.1 also applies to all Time Series Software which constructs or uses auto-covariance matrices.
Analytic algorithms are considered here to reflect the core analytic components of Time Series Software. These may be many and varied, and we explicitly consider only a small subset here.
Statistical software which implements forecasting routines should:
For (functions within) Time Series Software which return time series data:
tsbox
package to re-convert from standard internal format (see 1.4, above); or
For (functions within) Time Series Software which return data other than direct series:
Time Series Software which internally implements routines for transforming data to achieve stationarity and which returns forecast values should:
Where Time Series Software implements or otherwise enables forecasting abilities, it should return one of the following three kinds of information. These are presented in decreasing order of preference, such that software should strive to return the first kind of object, failing that the second, and only the third as a last resort.
distributional
package as used in the fable
package for time-series forecasting).
Beyond these particular standards for return objects, Time Series Software which implements or otherwise enables forecasting should:
Time Series Software should:
plot
methods for any implemented class system.
For the results of forecast operations, Time Series Software should
This sub-section details standards for Dimensionality Reduction, Clustering, and Unsupervised Learning Software – referred to from here on for simplicity as “Unsupervised Learning Software”. Software in this category is distinguished from Regression Software though the latter aiming to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, “Y”), whereas Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data, and a second set of “output” data which are not necessarily known or given prior to the analysis. A key distinction in Unsupervised Learning Software and Algorithms is between that for which output data represent (generally numerical) transformations of the input data set, and that for which output data are discrete labels applied to the input data. Examples of the former type include dimensionality reduction and ordination software and algorithms, and examples of the latter include clustering and discrete partitioning software and algorithms.
Some examples of Dimensionality Reduction, Clustering, and Unsupervised Learning software include:
ivis
implements a dimensionality reduction technique using a “Siamese Neural Network architecture.tsfeaturex
is a package to automate “time series feature extraction,” which also provides an example of a package for which both input and output data are generally incomparable with most other packages in this category.iRF
is another example of a generally incomparable package within this category, here one for which the features extracted are the most distinct predictive features extracted from repeated iterations of random forest algorithms.compboost
is a package for component-wise gradient boosting which may be sufficient general to potentially allow general application to problems addressed by several packages in this category.iml
package may offer usable functionality for devising general assessments of software within this category, through offering a “toolbox for making machine learning models interpretable” in a “model agnostic” way.Click on the following link to view a demonstration Application of Dimensionality Reduction, Clustering, and Unsupervised Learning Standards.
vector
or matrix
form, or that all inputs must be in data.frame
form with both column and row names.
The following code demonstrates an example of a routine from the base stats
package which fails to meet this standard.
#> Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536"): missing value where TRUE/FALSE needed
The latter fails, yet issues an uninformative error message that clearly indicates a failure to provide sufficient checks on the class of input data.
Such messages need not necessarily be provided by default, but should at least be optionally available.
The data.frame
function inserts default row and column names where these are not explicitly specified.
#> X1 X2
#> 1 1 6
#> 2 2 7
#> 3 3 8
#> 4 4 9
#> 5 5 10
Generic row names are almost always simple integer sequences, which the following condition confirms.
#> [1] TRUE
Generic column names may come in a variety of formats. The following code uses a grep
expression to match any number of characters plus an optional leading zero followed by a generic sequence of column numbers, appropriate for matching column names produced by generic construction of data.frame
objects.
#> [1] TRUE
Messages should be issued in both of these cases.
The following code illustrates that the hclust
function does not implement any such checks or assertions, rather it silently returns an object with default labels.
#> [1] "1" "2" "3" "4" "5" "6"
attributes()
, to corresponding aspects of return objects.
An example of a function according with UL1.3 is stats::cutree()
#> Alabama Alaska Arizona Arkansas California Colorado
#> 1 2 3 4 5 4
The row names of USArrests
are transferred to the output object. In contrast, some routines from the cluster
package do not comply with this standard:
#> [1] 1 2 3 4 3 4
The case labels are not appropriately carried through to the object returned by agnes()
to enable them to be transferred within cutree()
. (The labels are transferred to the object returned by agnes
, just not in a way that enables cutree
to inherit them.)
scale()
or equivalent transformations without explaining why scale is applied, and explicitly illustrating and contrasting the consequences of not applying such transformations.
Example of compliance with this standard are the documentation entries for the center
and scale.
parameters of the stats::prcomp()
function.
factor
, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
NA
or NaN
values from Inf
values.
This standard applies beyond General Standards G2.13–G2.16, through the additional requirement of implementing explicit parameters.
Note that the stats::cutree()
function does not accord with this standard:
#>
#> 1 2 3 4 5 6 7 8 9 10
#> 3 3 3 6 5 10 2 5 5 8
The cutree()
function applies arbitrary integer labels to the groups, yet the order of labels is not related to the order of group sizes.
The stats::prcomp
function accords with this standard:
#> Importance of first k=5 (out of 21) components:
#> PC1 PC2 PC3 PC4 PC5
#> Standard deviation 2529.6298 2157.3434 1459.4839 551.68183 369.10901
#> Proportion of Variance 0.4591 0.3339 0.1528 0.02184 0.00977
#> Cumulative Proportion 0.4591 0.7930 0.9458 0.96764 0.97741
The proportion of variance explained by each component decreasing with increasing numeric labelling of the components.
array
-like data with no row names) should provide an additional parameter to enable cases to be labelled.
While many algorithms such as Hierarchical clustering can not (readily) be used to predict memberships of new data, other algorithms can nevertheless be applied to perform this task. The following demonstrates how the output of stats::hclust
can be used to predict membership of new data using the class:knn()
function. (This is intended to illustrate only one of many possible approaches.)
#> [1] 2 2 1 1 2
#> Levels: 1 2 3
The stats::prcomp()
function implements its own predict()
method which conforms to this standard:
#> PC1 PC2 PC3 PC4
#> North Carolina 165.17494 -30.693263 -11.682811 1.304563
#> Maryland 129.44401 -4.132644 -2.161693 1.258237
#> Ohio -49.51994 12.748248 2.104966 -2.777463
#> Colorado 35.78896 14.023774 12.869816 1.233391
#> Georgia 41.28054 -7.203986 3.987152 -7.818416
Many unsupervised learning algorithms serve to label, categorise, or partition data. Software which performs any of these tasks will commonly output some kind of labelling or grouping schemes. The above example of principal components illustrates that the return object records the standard deviations associated with each component:
#> Standard deviations (1, .., p=4):
#> [1] 83.732400 14.212402 6.489426 2.482790
#>
#> Rotation (n x k) = (4 x 4):
#> PC1 PC2 PC3 PC4
#> Murder 0.04170432 -0.04482166 0.07989066 -0.99492173
#> Assault 0.99522128 -0.05876003 -0.06756974 0.03893830
#> UrbanPop 0.04633575 0.97685748 -0.20054629 -0.05816914
#> Rape 0.07515550 0.20071807 0.97408059 0.07232502
#> Importance of components:
#> PC1 PC2 PC3 PC4
#> Standard deviation 83.7324 14.21240 6.4894 2.48279
#> Proportion of Variance 0.9655 0.02782 0.0058 0.00085
#> Cumulative Proportion 0.9655 0.99335 0.9991 1.00000
Such output accords with the following standard:
The above example of principal components is one where there are no inter-group relationships, and so that standard is fulfilled by providing information on intra-group variances alone. Discrete clustering algorithms, in contrast, yield results for which inter-group relationships are meaningful, and such relationships can generally be meaningfully provided. The hclust()
routine, like many clustering routines, simply returns a scheme for devising an arbitrary number of clusters, and so can not meaningfully provide variances or relationships between such. The cutree()
function, however, does yield defined numbers of clusters, yet devoid of any quantitative information on variances or equivalent.
#> Named int [1:50] 1 1 1 2 1 2 3 1 4 2 ...
#> - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
Compare that with the output of a largely equivalent routine, the clara()
function from the cluster
package.
#> size max_diss av_diss isolation
#> [1,] 4 24.708298 14.284874 1.4837745
#> [2,] 6 28.857755 16.759943 1.7329563
#> [3,] 6 44.640565 23.718040 0.9677229
#> [4,] 6 28.005892 17.382196 0.8442061
#> [5,] 6 15.901258 9.363471 1.1037219
#> [6,] 7 29.407822 14.817031 0.9080598
#> [7,] 4 11.764353 6.781659 0.8165753
#> [8,] 3 8.766984 5.768183 0.3547323
#> [9,] 3 18.848077 10.101505 0.7176276
#> [10,] 5 16.477257 8.468541 0.6273603
That object contains information on dissimilarities between each observation and cluster medoids, which in the context of UL3.4 is “information on intra-group variances or equivalent”. Moreover, inter-group information is also available as the “silhouette” of the clustering scheme.
print
method which provides an on-screen summary of model (input) parameters and methods used to generate results. The print
method may also summarise statistical aspects of the output data or results.
print
method should always ensure only a restricted number of rows of any result matrices or equivalent are printed to the screen.
The prcomp
objects returned from the function of the same name include potential large matrices of component coordinates which are by default printed in their entirety to the screen. This is because the default print behaviour for most tabular objects in R (matrix
, data.frame
, and objects from the Matrix
package, for example) is to print objects in their entirety (limited only by such options as getOption("max.print")
, which determines maximal numbers of printed objects, such as lines of data.frame
objects). Such default behaviour ought be avoided, particularly in Unsupervised Learning Software which commonly returns objects containing large numbers of numeric entries.
summary
methods for model objects which should summarise the primary statistics used in generating the model (such as numbers of observations, parameters of methods applied). The summary
method may also provide summary statistics from the resultant model.
plot
methods, either through explicit implementation, extension of methods for existing model objects, through ensuring default methods work appropriately, or through explicit reference to helper packages such as factoextra
and associated functions.
plot
method is NOT a generic plot
method dispatched on the class of return objects (that is, through an S3-type plot.<myclass>
function or equivalent), that method dispatch (or equivalent) should nevertheless exist in order to explicitly direct users to the appropriate function.
Unsupervised Learning Software should test the following properties and behaviours:
The following tests should be implement for Unsupervised Learning Software for which inputs are presumed or required to be scaled in any particular ways (such as having mean values of zero).
With regard to labelling of output data, tests for Unsupervised Learning Software should:
With regard to prediction, tests for Unsupervised Learning Software should:
For Unsupervised Learning Software which implements batch processing routines:
This sub-section details standards for Software which represents, transforms, or otherwise processes probability distributions. Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software. Reflecting that expectation, standards for probability distributions will be expected to only pertain to some (potentially small) portion of code in any package.
Packages which utilise distributional functions to extract uni- or multi-variate estimates as a final algorithmic step, for example to provide numeric probability estimates, are not considered probability distributions software, and are not required to comply with these standards.
These standards apply to any package which performs operations on probability distributions. Operations include, but are not limited to, transformation, representation, convolution, integration, inversion, fitting, or re-scaling. The definition of probability distributions software ultimately depends on the notion of an “operation,” and it is ultimately up to package authors, in conversation with reviewers, to decide whether or not these Probability Distribution Standards might apply. If in doubt, the same principle applies here as to all other categories of standards: If at least half of the following standards apply, or could conceivably be applied, to a package, then it should be considered a probability distributions package.
This standard applies, for example, to all cases where results of some algorithm are assumed to comply with some “known” statistical distribution, and are accordingly transformed or summarised. Software should then provide references demonstrating that such distributional properties may indeed be assumed to apply. This standard will not apply to any routines for general processing of probability distributions.
These standards encourage the use of packages for general representation of probability distributions, especially as this allows distributional assumptions to be readily tested, refined, and updated, rather than remaining hard-coded and effectively fixed. The CRAN Task View on Probability Distributions has a sub-section under the “Miscellaneous” heading on Unified interface to handle distributions. Packages mentioned in that sub-section include:
stats
package distributed wtih base R
;distr
family of packages, which offer an extremely powerful and flexibility range of S4-class objects for representing and manipulating probability distributions;distributions3
and distributional
packages for representing and manipulating probability distributions as S3 objects; anddistr6
package for distributions as R6
objects.The follow standard should be adhered to where possible:
Any one package will generally only be able to fulfil either this or the preceding standard (PD1.0): it will either use a particular distribution, and thus need to adhere to PD1.0, or it will treat distributions more generally, and thus need to adhere to PD2.0.
An exemplary discussion of conditions under which numeric manipulations may be considered is provided in the Analytical and Numerical Methods vignette of the distr6
package.
This standard enables assumptions on distributions to be readily tested and updated, and applies even to packages which use only one single and specific distribution in accordance with PD1.0. The names of distributions are generally best passed as single character values, processed via calls like do.call(get(dist_name), list(args))
(although many other approaches are also possible). This standard is also important for the testing standards which follow.
The following standard applies to operations on probability distributions which require calls to optimisation algorithms such as optimize()
, optim()
, or any equivalent numerical optimisation routines from stats
or other packages.
See below for additional testing standards which also apply to probability distribution packages which use optimisation algorithms.
See below for additional testing standards which also apply to probability distribution packages which use integration algorithms.
Fitting distributions is an important component of many statistical analyses, yet R currently has only two packages for general distributional fitting: fitdistrplus
and fitteR
. The field of distributional fitting is currently in very active development, and there are no notably “stable” approaches nor widely-used algorithms. This is reflected in the almost complete lack of mention of distributional fitting in the CRAN Task View on Probability Distributions. The very last point in the current version of that Task View describes “Parameter Estimation”, and links to both of these packages.
Given this dynamically evolving nature of code and algorithms for distributional fitting, this book currently provides no standards for this aspect. We nevertheless encourage any authors using or implementing distributional fitting procedures to help develop standards, for which we recommend use of the GitHub discussions channel for these standards.
The following standards refer and apply to functions which process probability distributions, meaning functions defined in accordance with PD2.1, above. Such functions are referred to in the following standards as probability distribution functions.
Numeric equality should always be tested within a defined tolerance (see General Standard G3.0).
A test fulfilling this standard will thus serve the dual purpose of testing the numeric results of a probability distribution function, and enabling anybody reading the test file to understand how those numeric results are derived.
A package may justifiably rely on one single kind of probability distribution. Adherence to this standard would then require that the function notionally accept one other distribution as well, with a test then reflecting an expectation that results generated with this alternative distribution will differ somehow.
The following standards only apply to packages which use either optimisation or integration algorithms (or both), and so comply with PD3.2 and PD3_3 for optimisation, or with PD3.4 and PD3_5 for integration.
The following applies to any procedures other than simple one-dimensional optimisation or integration via routines such as stats::optimize()
or stats::integrate()
.
Use of the stats::optim()
function, for example, would already meet this standard through complying with the previous PD4.3, because optim()
includes a method
parameter naming one of several available optimisation methods. Many optimisation and integration routines nevertheless implement a single method, in which case adherence to this standard would require testing results against equivalent results generated via at least one alternative method.