List of resampling schemes and their purpose

For collections of uncertain data, sampling constraints can be represented using the ConstrainedIndexValueResampling type. This allows for passing complicated sampling constraints as a single input argument to functions that accept uncertain value collections. Sequential constraints also make it possible to impose constraints on the indices of datasets while sampling.

Constrained

Constrained resampling

UncertainData.Resampling.ConstrainedIndexValueResampling - Type
ConstrainedIndexValueResampling(constraints::NTuple{N_DATASETS, NTuple{N_VARIABLES, Union{SamplingConstraint, Vector{<:SamplingConstraint}}}}, n::Int)

Indicates that resampling should be performed with constraints on a set of uncertain index-value datasets. See examples for usage.

Fields

  • constraints. The constraints for the datasets, represented as a tuple of length N_DATASETS, where the i-th element is itself an N_VARIABLES-length tuple containing the constraints for the N_VARIABLES different variables. See "Indexing" below for details. Constraints for each individual variable must be supplied as either a single sampling constraint or a vector of sampling constraints whose length matches the length of the variable (Union{SamplingConstraint, Vector{<:SamplingConstraint}}). For example, if the j-th variable of the i-th dataset contains 352 observations, then constraints[i, j] must be either a single sampling constraint (e.g. TruncateStd(1.1)) or a vector of 352 sampling constraints (e.g. [TruncateStd(1.0 + rand()) for i = 1:352]).
  • n::Int. The number of draws.

Indexing

Assume c is an instance of ConstrainedIndexValueResampling. Then

  • c[i] returns the NTuple of constraints for the i-th dataset, and
  • c[i, j] returns the constraint(s) for the j-th variable of the i-th dataset.
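The two indexing forms can be illustrated with a small stand-in type. This is only a sketch: ToyConstrainedResampling is a hypothetical struct written for this example, not the package's implementation, and plain strings stand in for real sampling constraints.

```julia
# Hypothetical stand-in type; NOT the UncertainData.jl implementation.
struct ToyConstrainedResampling{T<:Tuple}
    constraints::T   # tuple of per-dataset tuples of constraints
    n::Int           # number of draws
end

# c[i]: all constraints for the i-th dataset
Base.getindex(c::ToyConstrainedResampling, i::Int) = c.constraints[i]

# c[i, j]: constraint(s) for the j-th variable of the i-th dataset
Base.getindex(c::ToyConstrainedResampling, i::Int, j::Int) = c.constraints[i][j]

# Strings stand in for real sampling constraints here.
c = ToyConstrainedResampling((("idx_1", "val_1"), ("idx_2", "val_2")), 1)

second_dataset = c[2]      # ("idx_2", "val_2")
first_values   = c[1, 2]   # "val_1"
```

Two-argument indexing is simply nested tuple indexing, which is why c[i, j] can return either a single constraint or a vector of constraints.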

Example

Defining ConstrainedIndexValueResamplings.

Assume we want to constrain three separate uncertain index-value datasets, with different sampling constraints for the indices and the values of each dataset.

# (index constraints, value constraints) for the 1st, 2nd and 3rd datasets
c1 = (TruncateStd(1), TruncateStd(1.1))
c2 = (TruncateStd(0.5), TruncateQuantiles(0.1, 0.8))
c3 = (TruncateQuantiles(0.05, 0.95), TruncateQuantiles(0.33, 0.67))
c = ConstrainedIndexValueResampling(c1, c2, c3)

Now,

  • c[2] returns the NTuple of constraints for the 2nd dataset, and
  • c[1, 2] returns the constraint(s) for the 2nd variable of the 1st dataset.

Controlling the number of draws

The number of draws defaults to 1 if not specified. To indicate that more than one draw should be performed, just input the number of draws before supplying the constraints to the constructor.

c1 = (TruncateStd(1), TruncateStd(1.1))
c2 = (TruncateStd(0.5), TruncateQuantiles(0.1, 0.8))

# A single draw
c_single = ConstrainedIndexValueResampling(c1, c2)

# Multiple (300) draws
c_multiple = ConstrainedIndexValueResampling(300, c1, c2) 

Detailed example

Let's say we have two uncertain index-value datasets x and y. We want to constrain the furnishing distributions/populations for both the time indices and the values of x and y. For x, truncate the indices at 0.8 times the standard deviation around their mean, and for y, truncate the indices at 1.5 times the standard deviation around their mean. Next, truncate x's values to roughly their central 20% range (about the 40th to 60th percentiles), and truncate y's values to roughly their central 80% range (about the 10th to 90th percentiles).

All this information can be combined in a ConstrainedIndexValueResampling instance. This instance can be passed on to any function that accepts uncertain index-value datasets, to indicate that resampling should be performed on truncated versions of the distributions/populations furnishing the datasets.

# some noise, so we don't truncate all furnishing distributions/population at 
# exactly the same quantiles.
r = Uniform(0, 0.01)

constraints_x_inds = TruncateStd(0.8)
constraints_y_inds = TruncateStd(1.5)
constraints_x_vals = [TruncateQuantiles(0.4 + rand(r), 0.6 + rand(r)) for i = 1:length(x)];
constraints_y_vals = [TruncateQuantiles(0.1 + rand(r), 0.9 + rand(r)) for i = 1:length(y)];

cs_x = (constraints_x_inds, constraints_x_vals)
cs_y = (constraints_y_inds, constraints_y_vals)

resampling = ConstrainedIndexValueResampling(cs_x, cs_y)

Sequential

Sequential resampling

UncertainData.Resampling.SequentialResampling - Type
SequentialResampling{SequentialSamplingConstraint}

Indicates that resampling should be performed sequentially, so that each draw satisfies the given sequential sampling constraint.

Fields

  • sequential_constraint::SequentialSamplingConstraint. The sequential sampling constraint, for example StrictlyIncreasing().
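To make the idea of a sequential constraint concrete, here is a naive rejection-sampling sketch in plain Julia. It is an illustration only and does not reflect how the package draws sequential realisations; the intervals and the helper draw_strictly_increasing are hypothetical names introduced for this example.

```julia
using Random

# Hypothetical uncertain indices, each given as a (lower, upper) interval.
intervals = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0), (3.5, 6.0)]

# Draw one value uniformly from each interval and retry until the whole
# draw is strictly increasing (naive rejection sampling).
function draw_strictly_increasing(rng, intervals; maxtries = 10_000)
    for _ in 1:maxtries
        draw = [lo + (hi - lo) * rand(rng) for (lo, hi) in intervals]
        all(diff(draw) .> 0) && return draw
    end
    error("no strictly increasing draw found in $maxtries tries")
end

draw = draw_strictly_increasing(MersenneTwister(42), intervals)
```

Overlapping intervals are what make the constraint matter: without it, independent draws from neighbouring points could come out in the wrong order.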

Examples

SequentialResampling(StrictlyIncreasing())

Sequential and interpolated resampling

UncertainData.Resampling.SequentialInterpolatedResampling - Type
SequentialInterpolatedResampling{SequentialSamplingConstraint, InterpolationGrid}

Indicates that resampling should be done by first resampling sequentially, then interpolating the sample to an interpolation grid.

Fields

  • sequential_constraint::SequentialSamplingConstraint. The sequential sampling constraint, for example StrictlyIncreasing().
  • grid::InterpolationGrid. The grid onto which the resampled draw (generated according to the sequential constraint) is interpolated, for example RegularGrid(0, 100, 2.5).

Examples

For example, SequentialInterpolatedResampling(StrictlyIncreasing(), RegularGrid(0:2:100)) indicates a sequential draw that is then interpolated to the grid 0:2:100.


Binned resampling

BinnedResampling

UncertainData.Resampling.BinnedResampling - Type
BinnedResampling(left_bin_edges, n::Int; bin_repr = UncertainScalarKDE)
BinnedResampling(UncertainScalarKDE, left_bin_edges, n::Int)
BinnedResampling(UncertainScalarPopulation, left_bin_edges, n::Int)
BinnedResampling(RawValues, left_bin_edges, n::Int)

Indicates that binned resampling should be performed.

Fields

  • left_bin_edges. The left edgepoints of the bins. Either a range or some custom type which implements minimum and step methods.
  • n. The number of draws. Each point in the dataset is sampled n times. If there are m points in the dataset, then the total number of draws is n*m.
  • bin_repr. How each bin should be represented: UncertainScalarKDE summarises each bin as a kernel density estimate of its distribution, UncertainScalarPopulation represents the values in each bin as an equiprobable population, and RawValues skips summarising and returns the raw values falling in each bin.
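The following plain-Julia sketch shows why minimum and step are enough to assign a value to a bin, and spells out the n*m draw count. It assumes half-open bins of the form [edge, edge + step); that convention and the helper bin_index are assumptions made for this illustration, not taken from the package.

```julia
left_bin_edges = 0:20:200

# Index of the bin containing x, assuming half-open bins [edge, edge + step).
bin_index(x, edges) = floor(Int, (x - minimum(edges)) / step(edges)) + 1

b1 = bin_index(0.0, left_bin_edges)     # 1
b2 = bin_index(19.9, left_bin_edges)    # 1
b3 = bin_index(20.0, left_bin_edges)    # 2
b4 = bin_index(135.0, left_bin_edges)   # 7

# With m points in the dataset, each sampled n times,
# the total number of draws is n * m.
m, n = 50, 10_000
total_draws = n * m   # 500000
```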

Examples

using UncertainData

# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200

# The number of samples per point in the dataset
n_draws = 10000

# Create the resampling scheme. Use kernel density estimates of the distribution
# in each bin.
resampling = BinnedResampling(grid, n_draws, bin_repr = UncertainScalarKDE)

# Represent each bin as an equiprobable population
resampling = BinnedResampling(grid, n_draws, bin_repr = UncertainScalarPopulation)

# Keep raw values for each bin (essentially the same as UncertainScalarPopulation,
# but avoids storing an additional vector of weights for the population members).
resampling = BinnedResampling(grid, n_draws, bin_repr = RawValues)

BinnedWeightedResampling

UncertainData.Resampling.BinnedWeightedResampling - Type
BinnedWeightedResampling(left_bin_edges, weights, n::Int; bin_repr = UncertainScalarKDE)
BinnedWeightedResampling(UncertainScalarKDE, left_bin_edges, weights, n::Int)
BinnedWeightedResampling(UncertainScalarPopulation, left_bin_edges, weights, n::Int)
BinnedWeightedResampling(RawValues, left_bin_edges, weights, n::Int)

Indicates that binned resampling should be performed, but weighting each point in the dataset differently.

Fields

  • left_bin_edges. The left edgepoints of the bins. Either a range or some custom type which implements minimum and step methods.
  • weights. The relative probability weights assigned to each point.
  • n. The total number of draws. These are distributed among the points of the dataset according to weights.
  • bin_repr. How each bin should be represented: UncertainScalarKDE summarises each bin as a kernel density estimate of its distribution, UncertainScalarPopulation represents the values in each bin as an equiprobable population, and RawValues skips summarising and returns the raw values falling in each bin.
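As a rough illustration of how a fixed total of n draws can be spread over points according to relative weights, the sketch below computes the expected number of draws per point and then realises one allocation by inverse-CDF sampling. The weights and the helper draw_indices are hypothetical, and the package may allocate draws differently.

```julia
using Random

# Hypothetical relative weights for a 4-point dataset (not normalised).
w = [1.0, 2.0, 3.0, 4.0]
n = 1_000

# Expected number of draws per point: n * w_i / sum(w)
expected = n .* w ./ sum(w)   # [100.0, 200.0, 300.0, 400.0]

# One way to realise the allocation: draw point indices by inverse-CDF sampling.
function draw_indices(rng, w, n)
    cw = cumsum(w) ./ sum(w)   # cumulative normalised weights
    return [min(searchsortedfirst(cw, rand(rng)), length(w)) for _ in 1:n]
end

idxs = draw_indices(MersenneTwister(1), w, n)

# Number of draws that landed on each point; the counts sum to n.
counts = [count(==(k), idxs) for k in eachindex(w)]
```

Unlike BinnedResampling, where n is the number of draws per point, n here is the total budget, so points with small weights may receive very few draws.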

Examples

using UncertainData, StatsBase

# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200

# Assume our dataset has 50 points. We'll assign random weights to them.
wts = Weights(rand(50))

# The total number of draws (on average 1000000/50 = 20000 draws per point
# if weights are equal)
n_draws = 1000000

# Create the resampling scheme. Use kernel density estimates of the distribution
# in each bin.
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = UncertainScalarKDE)

# Represent each bin as an equiprobable population
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = UncertainScalarPopulation)

# Keep raw values for each bin (essentially the same as UncertainScalarPopulation,
# but avoids storing an additional vector of weights for the population members).
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = RawValues)

BinnedMeanResampling

UncertainData.Resampling.BinnedMeanResampling - Type
BinnedMeanResampling

Binned resampling where each bin is summarised using the mean of all draws falling in that bin.

Fields

  • left_bin_edges. The left edgepoints of the bins. Either a range or some custom type which implements minimum and step methods.
  • n. The number of draws. Each point in the dataset is sampled n times. If there are m points in the dataset, then the total number of draws is n*m.

Examples

using UncertainData

# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200

# The number of samples per point in the dataset
n_draws = 10000

# Create the resampling scheme
resampling = BinnedMeanResampling(grid, n_draws)

BinnedMeanWeightedResampling

UncertainData.Resampling.BinnedMeanWeightedResampling - Type
BinnedMeanWeightedResampling

Binned resampling where each bin is summarised using the mean of all draws falling in that bin. Points in the dataset are sampled with probabilities according to weights.

Fields

  • left_bin_edges. The left edgepoints of the bins. Either a range or some custom type which implements minimum and step methods.
  • weights. The relative probability weights assigned to each point.
  • n. The total number of draws. These are distributed among the points of the dataset according to weights.

Examples

using UncertainData, StatsBase

# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200

# Assume our dataset has 50 points. We'll assign random weights to them.
wts = Weights(rand(50))

# The total number of draws (on average 1000000/50 = 20000 draws per point
# if weights are equal)
n_draws = 1000000

# Create the resampling scheme
resampling = BinnedMeanWeightedResampling(grid, wts, n_draws)

Interpolated-and-binned resampling

InterpolateAndBin

UncertainData.Resampling.InterpolateAndBinType - Type
InterpolateAndBin{L}(f::Function, left_bin_edges, intp::L, intp_grid,
    extrapolation_bc::Union{<:Real, Interpolations.BoundaryCondition})

Indicates that a dataset consisting of both indices and values should first be interpolated to the intp_grid grid using the provided intp scheme (e.g. Linear()). After interpolating, assign the interpolated values to the bins defined by left_bin_edges and summarise the values falling in each bin using the summary function f (e.g. mean).
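The interpolate-then-bin-then-summarise sequence described above can be sketched in plain Julia, without the package or Interpolations.jl. The helper lininterp, the toy data, and the half-open bin convention are all assumptions made for this illustration; NaN handling and extrapolation are left out.

```julia
using Statistics

# Piecewise-linear interpolation of (t, y) at tq.
# Assumes t is sorted and tq lies within [t[1], t[end]].
function lininterp(t, y, tq)
    k = clamp(searchsortedlast(t, tq), 1, length(t) - 1)
    w = (tq - t[k]) / (t[k + 1] - t[k])
    return (1 - w) * y[k] + w * y[k + 1]
end

# Toy, unevenly spaced data with y == t, so interpolated values are easy to check.
t = [0.0, 3.0, 4.0, 10.0]
y = [0.0, 3.0, 4.0, 10.0]

# Step 1: interpolate to a regular grid.
intp_grid = 0.0:1.0:10.0
yi = [lininterp(t, y, tq) for tq in intp_grid]

# Step 2: assign each grid point to a bin, assuming half-open bins [edge, edge + step).
left_bin_edges = 0.0:5.0:10.0
binidx = [floor(Int, (tq - minimum(left_bin_edges)) / step(left_bin_edges)) + 1
          for tq in intp_grid]

# Step 3: summarise the interpolated values in each bin with f.
f = mean
binned = [f(yi[binidx .== b]) for b in 1:length(left_bin_edges)]
# binned is approximately [2.0, 7.0, 10.0]
```

Interpolating first puts the same number of evenly spaced values in every full-width bin, so the summary in each bin is not dominated by wherever the original observations happened to cluster.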

Example

using UncertainData, Interpolations, StatsBase

# Assume we have the following unevenly spaced data with some `NaN` values
T = 100
time = sample(1.0:T*5, T, ordered = true, replace = false)
y1 = rand(T)
time[rand(1:T, 10)] .= NaN 
y1[rand(1:T, 10)] .= NaN 

# We want to first interpolate the dataset linearly to a regular time grid
# with steps of `0.1` time units.
intp = Linear()
intp_grid = 0:0.1:1000
extrapolation_bc = Flat(OnGrid())

# Then, bin the dataset in time bins `50` time units wide, collect all 
# values in each bin and summarise them using `f`.
f = mean
left_bin_edges = 0:50:1000

r = InterpolateAndBin(f, left_bin_edges, intp, intp_grid, extrapolation_bc)