List of resampling schemes and their purpose
For collections of uncertain data, sampling constraints can be represented using the ConstrainedIndexValueResampling type. This allows complicated sampling constraints to be passed as a single input argument to functions that accept uncertain value collections. Sequential constraints also make it possible to impose constraints on the indices of datasets while sampling.
Constrained
Constrained resampling
UncertainData.Resampling.ConstrainedIndexValueResampling — Type
ConstrainedIndexValueResampling(constraints::NTuple{N_DATASETS, NTuple{N_VARIABLES, Union{SamplingConstraint, Vector{<:SamplingConstraint}}}}, n::Int)
Indicates that resampling should be performed with constraints on a set of uncertain index-value datasets. See examples for usage.
Fields
constraints. The constraints for the datasets. The constraints are represented as a tuple of length N_DATASETS, where the i-th tuple element is itself an N_VARIABLES-length tuple containing the constraints for the N_VARIABLES different variables. See "Indexing" below for details. Constraints for each individual variable must be supplied as either a single sampling constraint, or as a vector of sampling constraints with length matching the length of the variable (Union{SamplingConstraint, Vector{<:SamplingConstraint}}). For example, if the j-th variable for the i-th dataset contains 352 observations, then constraints[i, j] must be either a single sampling constraint (e.g. TruncateStd(1.1)) or a vector of 352 different sampling constraints (e.g. [TruncateStd(1.0 + rand()) for i = 1:352]).
n::Int. The number of draws.
Indexing
Assume c is an instance of ConstrainedIndexValueResampling. Then c[i] returns the NTuple of constraints for the i-th dataset, and c[i, j] returns the constraint(s) for the j-th variable of the i-th dataset.
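The indexing rules above can be illustrated with a small stand-in type. This is only a sketch: ToyConstraints is a hypothetical name, strings stand in for real sampling constraints, and nothing here is taken from the package's implementation.

```julia
# Hypothetical stand-in for a constraint container (not the package's type).
struct ToyConstraints{T<:Tuple}
    constraints::T   # one inner tuple per dataset
    n::Int           # number of draws
end

# c[i] gives the tuple of constraints for the i-th dataset
Base.getindex(c::ToyConstraints, i::Int) = c.constraints[i]
# c[i, j] gives the constraint(s) for the j-th variable of the i-th dataset
Base.getindex(c::ToyConstraints, i::Int, j::Int) = c.constraints[i][j]

# Two datasets, each with (index constraint, value constraint).
# Strings are placeholders for real sampling constraints.
c = ToyConstraints((("idx_1", "val_1"), ("idx_2", "val_2")), 1)

second_dataset = c[2]      # ("idx_2", "val_2")
first_values = c[1, 2]     # "val_1"
```

Storing the constraints as a tuple of tuples keeps the dataset/variable structure explicit, so two-argument indexing is just nested tuple indexing.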
Example
Defining ConstrainedIndexValueResamplings
Assume we want to constrain three separate uncertain index-value datasets, with different sampling constraints for the indices and the values of each dataset.
# (index constraints, value constraints) for the 1st, 2nd and 3rd datasets
c1 = (TruncateStd(1), TruncateStd(1.1))
c2 = (TruncateStd(0.5), TruncateQuantiles(0.1, 0.8))
c3 = (TruncateQuantiles(0.05, 0.95), TruncateQuantiles(0.33, 0.67))
c = ConstrainedIndexValueResampling(c1, c2, c3)
Now, c[2] returns the NTuple of constraints for the 2nd dataset, and c[1, 2] returns the constraint(s) for the 2nd variable of the 1st dataset.
Controlling the number of draws
The number of draws defaults to 1 if not specified. To indicate that more than one draw should be performed, just input the number of draws before supplying the constraints to the constructor.
c1 = (TruncateStd(1), TruncateStd(1.1))
c2 = (TruncateStd(0.5), TruncateQuantiles(0.1, 0.8))
# A single draw
c_single = ConstrainedIndexValueResampling(c1, c2)
# Multiple (300) draws
c_multiple = ConstrainedIndexValueResampling(300, c1, c2)
Detailed example
Let's say we have two uncertain index-value datasets x and y. We want to constrain the furnishing distributions/populations for both the time indices and the values, for both x and y. For x, truncate the indices at 0.8 times the standard deviation around their mean, and for y, truncate the indices at 1.5 times the standard deviation around their mean. Next, truncate x's values to roughly their central 20% quantile range (between about the 40th and 60th percentiles), and truncate y's values to roughly their central 80% quantile range (between about the 10th and 90th percentiles).
All this information can be combined in a ConstrainedIndexValueResampling instance. This instance can be passed on to any function that accepts uncertain index-value datasets, to indicate that resampling should be performed on truncated versions of the distributions/populations furnishing the datasets.
using UncertainData, Distributions
# Some noise, so we don't truncate all furnishing distributions/populations at
# exactly the same quantiles.
r = Uniform(0, 0.01)
constraints_x_inds = TruncateStd(0.8)
constraints_y_inds = TruncateStd(1.5)
# One value constraint per point in x
constraints_x_vals = [TruncateQuantiles(0.4 + rand(r), 0.6 + rand(r)) for i = 1:length(x)]
# One value constraint per point in y
constraints_y_vals = [TruncateQuantiles(0.1 + rand(r), 0.9 + rand(r)) for i = 1:length(y)]
cs_x = (constraints_x_inds, constraints_x_vals)
cs_y = (constraints_y_inds, constraints_y_vals)
resampling = ConstrainedIndexValueResampling(cs_x, cs_y)
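To make the two kinds of truncation used in this example concrete, here is a minimal sketch on an empirical sample, using only the Statistics and Random standard libraries. The helpers within_std and within_quantiles are hypothetical names introduced for illustration. The package constraints act on the furnishing distributions/populations themselves, so this only mimics the idea.

```julia
using Statistics, Random

Random.seed!(1)
draws = randn(10_000)   # stand-in for draws from one furnishing distribution

# Keep only the draws within `k` standard deviations of the sample mean.
function within_std(v, k)
    μ, σ = mean(v), std(v)
    return filter(x -> abs(x - μ) <= k * σ, v)
end

# Keep only the draws between the `lo` and `hi` sample quantiles.
function within_quantiles(v, lo, hi)
    qlo, qhi = quantile(v, [lo, hi])
    return filter(x -> qlo <= x <= qhi, v)
end

narrow_inds = within_std(draws, 0.8)             # like truncating at 0.8 std
central_20 = within_quantiles(draws, 0.4, 0.6)   # roughly the central 20%
central_80 = within_quantiles(draws, 0.1, 0.9)   # roughly the central 80%
```

The quantile-based truncation keeps a fixed fraction of the sample regardless of its shape, whereas the standard-deviation-based truncation keeps a fraction that depends on the distribution.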
Sequential
Sequential resampling
UncertainData.Resampling.SequentialResampling — Type
SequentialResampling{SequentialSamplingConstraint}
Indicates that resampling should be done by resampling sequentially.
Fields
sequential_constraint::SequentialSamplingConstraint. The sequential sampling constraint, for example StrictlyIncreasing().
Examples
SequentialResampling(StrictlyIncreasing())
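What a strictly increasing sequential constraint asks for can be sketched with plain rejection sampling: draw every uncertain index independently and redraw until the realisation is strictly increasing. This is an illustrative assumption and not necessarily the algorithm the package uses. The helper draw_strictly_increasing and the uniform intervals are hypothetical.

```julia
using Random

# Hypothetical helper: each index i is uniform on [lo[i], hi[i]].
# Redraw the whole sequence until it is strictly increasing.
function draw_strictly_increasing(rng, lo, hi; max_tries = 10_000)
    for _ in 1:max_tries
        draw = lo .+ (hi .- lo) .* rand(rng, length(lo))
        all(diff(draw) .> 0) && return draw
    end
    error("no strictly increasing realisation found in $max_tries tries")
end

rng = MersenneTwister(42)
lo = [0.0, 0.8, 1.9, 3.1]   # lower bounds of the uncertain indices
hi = [1.2, 2.1, 3.3, 4.0]   # upper bounds (neighbouring intervals overlap)

t = draw_strictly_increasing(rng, lo, hi)
```

Rejection sampling is simple but becomes slow when neighbouring indices overlap heavily, since most independent draws are then out of order.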
Sequential and interpolated resampling
UncertainData.Resampling.SequentialInterpolatedResampling — Type
SequentialInterpolatedResampling{SequentialSamplingConstraint, InterpolationGrid}
Indicates that resampling should be done by first resampling sequentially, then interpolating the sample to an interpolation grid.
Fields
sequential_constraint::SequentialSamplingConstraint. The sequential sampling constraint, for example StrictlyIncreasing().
grid::InterpolationGrid. The grid onto which the resampled draw (generated according to the sequential constraint) is interpolated, for example RegularGrid(0, 100, 2.5).
Examples
For example, SequentialInterpolatedResampling(StrictlyIncreasing(), RegularGrid(0:2:100)) indicates a sequential draw that is then interpolated to the grid 0:2:100.
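The interpolation step can be sketched in plain Julia as follows. The helper interp_linear is hypothetical, it assumes linear interpolation, and it marks grid points outside the range of the draw with NaN. The package may handle the interpolation scheme and out-of-range points differently.

```julia
# Hypothetical helper: linearly interpolate the draw (t, v) onto `grid`.
# Assumes `t` is strictly increasing. Grid points outside [t[1], t[end]]
# are set to NaN (an assumption made for this sketch).
function interp_linear(t, v, grid)
    out = fill(NaN, length(grid))
    for (k, g) in enumerate(grid)
        (g < first(t) || g > last(t)) && continue
        i = clamp(searchsortedlast(t, g), 1, length(t) - 1)
        w = (g - t[i]) / (t[i + 1] - t[i])      # position between t[i] and t[i+1]
        out[k] = (1 - w) * v[i] + w * v[i + 1]
    end
    return out
end

t = [0.5, 2.7, 4.1, 7.9]   # one strictly increasing draw of the indices
v = [1.0, 3.0, 2.0, 6.0]   # the matching draw of the values
grid = 0:2:8

vi = interp_linear(t, v, grid)   # values on the regular grid
```

A strictly increasing draw is what makes this step well defined: with repeated or out-of-order indices the interpolant would not be a function of the index.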
Binned resampling
BinnedResampling
UncertainData.Resampling.BinnedResampling — Type
BinnedResampling(left_bin_edges, n::Int; bin_repr = UncertainScalarKDE)
BinnedResampling(UncertainScalarKDE, left_bin_edges, n::Int)
BinnedResampling(UncertainScalarPopulation, left_bin_edges, n::Int)
BinnedResampling(RawValues, left_bin_edges, n::Int)
Indicates that binned resampling should be performed.
Fields
left_bin_edges. The left edge points of the bins. Either a range or some custom type which implements minimum and step methods.
n. The number of draws. Each point in the dataset is sampled n times. If there are m points in the dataset, then the total number of draws is n*m.
bin_repr. A type of uncertain value indicating how each bin should be summarised (UncertainScalarKDE for kernel density estimated distributions in each bin, UncertainScalarPopulation to represent values in each bin as an equiprobable population), or RawValues to skip summarising and return the raw values falling in each bin.
Examples
using UncertainData
# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200
# The number of samples per point in the dataset
n_draws = 10000
# Create the resampling scheme. Use kernel density estimates to represent the
# distribution in each bin.
resampling = BinnedResampling(grid, n_draws, bin_repr = UncertainScalarKDE)
# Represent each bin as an equiprobable population
resampling = BinnedResampling(grid, n_draws, bin_repr = UncertainScalarPopulation)
# Keep raw values for each bin (essentially the same as UncertainScalarPopulation,
# but avoids storing an additional vector of weights for the population members).
resampling = BinnedResampling(grid, n_draws, bin_repr = RawValues)
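The bookkeeping behind binned resampling (n draws per point, n*m draws in total, each draw assigned to a bin through minimum and step of the bin edges) can be sketched without the package. Everything below is a toy assumption: the indices are modelled as normal distributions, the values are fixed, and the last edge is treated as closing the final bin, which may differ from the package's convention.

```julia
using Random

rng = MersenneTwister(1)

# Toy dataset with m = 4 points: uncertain index ~ Normal(μ[i], σ[i]), fixed value.
μ = [15.0, 42.0, 47.0, 88.0]
σ = [2.0, 3.0, 3.0, 5.0]
vals = [1.0, 2.0, 3.0, 4.0]

left_bin_edges = 0:20:100   # bins [0, 20), [20, 40), ..., [80, 100)
n = 1000                    # draws per point

nbins = length(left_bin_edges) - 1
lo, w = minimum(left_bin_edges), step(left_bin_edges)
raw = [Float64[] for _ in 1:nbins]   # raw values falling in each bin

for i in eachindex(μ), _ in 1:n
    t = μ[i] + σ[i] * randn(rng)        # one draw of the index of point i
    b = floor(Int, (t - lo) / w) + 1    # bin that the draw falls in
    1 <= b <= nbins && push!(raw[b], vals[i])   # draws outside the grid are dropped
end

total_draws = n * length(μ)   # n*m draws in total
```

Here raw plays the role of the raw per-bin values. Summarising each bin, for instance by a kernel density estimate or a population, would be a further step applied to each element of raw.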
BinnedWeightedResampling
UncertainData.Resampling.BinnedWeightedResampling — Type
BinnedWeightedResampling(left_bin_edges, weights, n::Int; bin_repr = UncertainScalarKDE)
BinnedWeightedResampling(UncertainScalarKDE, left_bin_edges, weights, n::Int)
BinnedWeightedResampling(UncertainScalarPopulation, left_bin_edges, weights, n::Int)
BinnedWeightedResampling(RawValues, left_bin_edges, weights, n::Int)
Indicates that binned resampling should be performed, but weighting each point in the dataset differently.
Fields
left_bin_edges. The left edge points of the bins. Either a range or some custom type which implements minimum and step methods.
weights. The relative probability weights assigned to each point.
n. The total number of draws. These are distributed among the points of the dataset according to weights.
bin_repr. A type of uncertain value indicating how each bin should be summarised (UncertainScalarKDE for kernel density estimated distributions in each bin, UncertainScalarPopulation to represent values in each bin as an equiprobable population), or RawValues to skip summarising and return the raw values falling in each bin.
Examples
using UncertainData, StatsBase
# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200
# Assume our dataset has 50 points. We'll assign random weights to them.
wts = Weights(rand(50))
# The total number of draws (on average 10000000/50 = 200000 draws per point
# if weights are equal)
n_draws = 10000000
# Create the resampling scheme. Use kernel density estimates to represent the
# distribution in each bin.
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = UncertainScalarKDE)
# Represent each bin as an equiprobable population
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = UncertainScalarPopulation)
# Keep raw values for each bin (essentially the same as UncertainScalarPopulation,
# but avoids storing an additional vector of weights for the population members).
resampling = BinnedWeightedResampling(grid, wts, n_draws, bin_repr = RawValues)
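How a total number of draws might be distributed among points according to relative weights can be sketched as follows. This assumes that each draw picks a point with probability proportional to its weight, and it uses inverse-CDF sampling on the cumulative weights. It illustrates the idea only and is not the package's implementation.

```julia
using Random

rng = MersenneTwister(7)

w = [1.0, 1.0, 2.0, 4.0]   # relative weights for a 4-point toy dataset
n = 80_000                 # total number of draws

p = w ./ sum(w)            # normalised sampling probabilities
cp = cumsum(p)             # cumulative probabilities

counts = zeros(Int, length(w))   # how many draws each point receives
for _ in 1:n
    # First point whose cumulative probability reaches the uniform draw
    i = min(searchsortedfirst(cp, rand(rng)), length(w))
    counts[i] += 1
end

expected = n .* p   # expected number of draws per point
```

With equal weights every point would receive about n/m draws. With unequal weights, heavily weighted points dominate the bins they fall in.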
BinnedMeanResampling
UncertainData.Resampling.BinnedMeanResampling — Type
BinnedMeanResampling
Binned resampling where each bin is summarised using the mean of all draws falling in that bin.
Fields
left_bin_edges. The left edge points of the bins. Either a range or some custom type which implements minimum and step methods.
n. The number of draws. Each point in the dataset is sampled n times. If there are m points in the dataset, then the total number of draws is n*m.
Examples
using UncertainData
# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200
# The number of samples per point in the dataset
n_draws = 10000
# Create the resampling scheme
resampling = BinnedMeanResampling(grid, n_draws)
BinnedMeanWeightedResampling
UncertainData.Resampling.BinnedMeanWeightedResampling — Type
BinnedMeanWeightedResampling
Binned resampling where each bin is summarised using the mean of all draws falling in that bin. Points in the dataset are sampled with probabilities according to weights.
Fields
left_bin_edges. The left edge points of the bins. Either a range or some custom type which implements minimum and step methods.
weights. The relative probability weights assigned to each point.
n. The total number of draws. These are distributed among the points of the dataset according to weights.
Examples
using UncertainData, StatsBase
# Resample on a grid from 0 to 200 in steps of 20
grid = 0:20:200
# Assume our dataset has 50 points. We'll assign random weights to them.
wts = Weights(rand(50))
# The total number of draws (on average 10000000/50 = 200000 draws per point
# if weights are equal)
n_draws = 10000000
# Create the resampling scheme
resampling = BinnedMeanWeightedResampling(grid, wts, n_draws)
Interpolated-and-binned resampling
InterpolateAndBin
UncertainData.Resampling.InterpolateAndBin — Type
InterpolateAndBin{L}(f::Function, left_bin_edges, intp::L, intp_grid, extrapolation_bc::Union{<:Real, Interpolations.BoundaryCondition})
Indicates that a dataset consisting of both indices and values should first be interpolated to the intp_grid grid using the provided intp scheme (e.g. Linear()). After interpolating, the interpolated values are assigned to the bins defined by left_bin_edges, and the values falling in each bin are summarised using the summary function f (e.g. mean).
Example
using UncertainData, Interpolations, StatsBase
# Assume we have the following unevenly spaced data with some `NaN` values
T = 100
time = sample(1.0:T*5, T, ordered = true, replace = false)
y1 = rand(T)
time[rand(1:T, 10)] .= NaN
y1[rand(1:T, 10)] .= NaN
# We want to first interpolate the dataset linearly to a regular time grid
# with steps of `0.1` time units.
intp = Linear()
intp_grid = 0:0.1:1000
extrapolation_bc = Flat(OnGrid())
# Then, bin the dataset in time bins `50` time units wide, collect all
# values in each bin and summarise them using `f`.
f = mean
left_bin_edges = 0:50:1000
r = InterpolateAndBin(f, left_bin_edges, intp, intp_grid, extrapolation_bc)
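The second stage, binning already-interpolated values and summarising each bin with f, can be sketched in plain Julia. The toy values below stand in for the output of the interpolation step. Skipping NaN values and returning NaN for empty bins are assumptions of this sketch, as is treating the last edge as the end of the final bin.

```julia
using Statistics

# Values already interpolated onto a regular grid (toy numbers, NaN = missing)
intp_grid = 0:1:9
y = [0.0, 1.0, 2.0, NaN, 4.0, 5.0, 6.0, 7.0, NaN, 9.0]

left_bin_edges = 0:5:10   # bins [0, 5) and [5, 10)
f = mean                  # summary function applied to each bin

lo, w = minimum(left_bin_edges), step(left_bin_edges)
nbins = length(left_bin_edges) - 1
binned = [Float64[] for _ in 1:nbins]

for (x, yi) in zip(intp_grid, y)
    isnan(yi) && continue               # skip missing values
    b = floor(Int, (x - lo) / w) + 1    # bin that grid point x falls in
    1 <= b <= nbins && push!(binned[b], yi)
end

# One summary value per bin; NaN marks bins without any values
binned_summary = [isempty(b) ? NaN : f(b) for b in binned]
```

Interpolating to a fine grid before binning gives every bin a comparable number of values, which is useful when the original observations are unevenly spaced.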