UncertainData.jl

Motivation

UncertainData.jl was born to systematically deal with uncertain data, and to sample from uncertain datasets more rigorously. It makes workflows involving uncertain data of different types and from different sources significantly easier.

Package philosophy

Way too often in data analysis, uncertainties in observational data are ignored or not dealt with in a systematic manner. The core concept of the package is that uncertain data should live in the probability domain, not as single-value representations of the data (e.g. the mean).

In this package, uncertain data values are therefore stored as probability distributions. Only when performing a computation or plotting are the uncertain values realized, by resampling the probability distributions that furnish them.
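
For instance, a measurement of 2.2 ± 0.3 can be represented by a normal distribution and only realized when explicitly resampled. A minimal sketch, assuming the UncertainValue constructor and resample function as documented for the package:

```julia
using Distributions, UncertainData

# A measurement of 2.2 with standard deviation 0.3, stored as a
# probability distribution rather than as a single number.
x = UncertainValue(Normal, 2.2, 0.3)

# The value is only realized when explicitly resampled.
resample(x)          # a single draw from the furnishing distribution
resample(x, 10_000)  # 10_000 independent draws
```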

Organising uncertain data

Individual uncertain observations may be collected in UncertainDatasets, which can be sampled according to user-provided sampling constraints. Indices of the observations (e.g. time, depth or any other index) can likewise be represented as probability distributions and sampled subject to constraints.
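
A sketch of collecting uncertain values in a dataset and constraining them before sampling, assuming the constrain function and constraint types (e.g. TruncateQuantiles) described in the package documentation:

```julia
using Distributions, UncertainData

vals = [UncertainValue(Normal, 2.2, 0.3),
        UncertainValue(Gamma, 4.0, 0.5),
        UncertainValue(Uniform, 1.0, 3.0)]
d = UncertainDataset(vals)

# Truncate each furnishing distribution to its 5th-95th percentile range,
# i.e. decide that extreme tail values are not acceptable data.
dc = constrain(d, TruncateQuantiles(0.05, 0.95))

resample(dc)  # one realization: a vector with one draw per uncertain value
```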

The UncertainIndexValueDataset type allows you to work with datasets where both the indices and the data values are uncertain. This is useful when, for example, you want to draw realizations of your dataset while enforcing sequential resampling constraints, such as strictly increasing age models.
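
For instance, uncertain ages paired with uncertain measurements might be resampled so that each drawn age is larger than the previous one. A sketch, assuming the StrictlyIncreasing sequential constraint from the package documentation:

```julia
using Distributions, UncertainData

# Uncertain indices (here, ages) and uncertain measured values.
ages = [UncertainValue(Normal, t, 2.0) for t in [10.0, 15.0, 25.0]]
vals = [UncertainValue(Normal, v, 0.1) for v in [0.7, 1.1, 0.9]]
data = UncertainIndexValueDataset(ages, vals)

# Draw a realization whose index draws are guaranteed to be strictly
# increasing, e.g. to respect a strictly increasing age model.
draw = resample(data, StrictlyIncreasing())
```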

Mathematical operations

Several elementary mathematical operations and trigonometric functions are supported for uncertain values. Computations are done using a resampling approach.
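
In practice, expressions like the ones below resample their operands and combine the draws. A sketch, assuming the resampling-based arithmetic described in the package documentation:

```julia
using Distributions, UncertainData

x = UncertainValue(Normal, 1.0, 0.1)
y = UncertainValue(Gamma, 2.0, 1.0)

# Arithmetic and trigonometric functions resample the operands and
# combine the draws into a new uncertain value.
z = x + y
w = sin(x)

# The equivalent manual computation: combine independent draws yourself.
zs = resample(x, 10_000) .+ resample(y, 10_000)
```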

Statistics on uncertain datasets

Statistics on uncertain observations and uncertain datasets are obtained using a resampling approach.
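
For example, summary statistics take the number of draws as an argument. A sketch, assuming the mean(uv, n)-style methods documented for the package:

```julia
using Distributions, UncertainData
using Statistics

x = UncertainValue(Normal, 2.2, 0.3)

# Statistics on a single uncertain value: draw n samples, then compute.
mean(x, 10_000)
std(x, 10_000)
quantile(x, 0.9, 10_000)

# Element-wise statistics over a dataset: one estimate per uncertain value.
d = UncertainDataset([x, UncertainValue(Uniform, 1.0, 3.0)])
[mean(v, 10_000) for v in d]
```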

Basic workflow

  1. Define uncertain values by probability distributions.
  2. Define uncertain datasets by gathering uncertain values.
  3. Use sampling constraints to constrain the support of the distributions furnishing the uncertain values (i.e. apply subjective criteria to decide what is acceptable data and what is not).
  4. Resample the uncertain values or uncertain datasets.
  5. Extend existing algorithms to accept uncertain values/datasets.
  6. Quantify the uncertainty in your dataset, or in whatever measure your algorithm computes (a sketch of the full workflow follows below).
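
Put together, the steps above might look like this. A minimal end-to-end sketch, under the same API assumptions as the examples above; sum stands in for whatever algorithm you want to feed with realizations:

```julia
using Distributions, UncertainData
using Statistics

# 1-2. Define uncertain values and gather them in a dataset.
d = UncertainDataset([UncertainValue(Normal, 2.2, 0.3),
                      UncertainValue(Gamma, 4.0, 0.5)])

# 3. Constrain the support: keep draws within two standard deviations.
dc = constrain(d, TruncateStd(2))

# 4-6. Resample repeatedly, apply your algorithm to each realization,
# and summarize the spread of the results to quantify the uncertainty.
results = [sum(resample(dc)) for _ in 1:10_000]
mean(results), std(results)
```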

A related package is Measurements.jl, which propagates errors exactly and handles correlated uncertainties. However, Measurements.jl accepts only normally distributed values. This package serves a slightly different purpose: it provides an easy way of handling uncertainties of many different types, uses a resampling approach to obtain statistics when needed, and offers a rich set of sampling constraints that make it easy to reason about and plot your uncertain data under different assumptions.

If your data are normally distributed and your uncertainties are correlated, Measurements.jl may therefore be a better (and faster) choice.

Contributing

If you have questions, or a good idea for new functionality that could be useful in the package, please submit an issue or, even better, a pull request.