Title: | Iterative Proportional Fitting Algorithms for Nested Structures |
---|---|
Description: | The Iterative Proportional Fitting (IPF) algorithm operates on count data. This package offers implementations for several algorithms that extend this to nested structures: 'parent' and 'child' items for both of which constraints can be provided. The fitting algorithms include Iterative Proportional Updating <https://trid.trb.org/view/881554>, Hierarchical IPF <doi:10.3929/ethz-a-006620748>, Entropy Optimization <https://trid.trb.org/view/881144>, and Generalized Raking <doi:10.2307/2290793>. Additionally, a number of replication methods is also provided such as 'Truncate, replicate, sample' <doi:10.1016/j.compenvurbsys.2013.03.004>. |
Authors: | Kirill Müller [aut, cph] (Creator of the package), Kay W. Axhausen [ths] (Advisor of Kirill Müller), Amarin Siripanich [aut, cre] (Contributed `ml_replicate()`), Taha H. Rashidi [ths] (Advisor of Amarin Siripanich) |
Maintainer: | Amarin Siripanich <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.6.1.9008 |
Built: | 2024-11-22 03:04:59 UTC |
Source: | https://github.com/mlfit/mlfit |
The Iterative Proportional Fitting (IPF) algorithm operates on count data. This package offers implementations for several algorithms that extend this to nested structures: 'parent' and 'child' items for both of which constraints can be provided. The fitting algorithms include Iterative Proportional Updating https://trid.trb.org/view/881554, Hierarchical IPF doi:10.3929/ethz-a-006620748, Entropy Optimization https://trid.trb.org/view/881144, and Generalized Raking doi:10.2307/2290793. Additionally, a number of replication methods is also provided such as 'Truncate, replicate, sample' doi:10.1016/j.compenvurbsys.2013.03.004.
To use this package, you need to:
Specify your fitting problem with ml_problem()
Optionally, convert the fitting problem to a structure that can be
processed by the algorithms with flatten_ml_fit_problem()
; this is
helpful if you want to run the same fitting problem with multiple
algorithms and compare results.
Compute weights with one of the algorithms provided in this package with
ml_fit()
or one of the specialized functions
Analyze weights or residuals, e.g. with compute_margins()
Maintainer: Amarin Siripanich [email protected] (Contributed 'ml_replicate()')
Authors:
Kirill Müller (Creator of the package) [copyright holder]
Other contributors:
Kay W. Axhausen (Advisor of Kirill Müller) [thesis advisor]
Taha H. Rashidi (Advisor of Amarin Siripanich) [thesis advisor]
Useful links:
Report bugs at https://github.com/mlfit/mlfit/issues
These functions allows checking a fit in terms of the original input data.
compute_margins(ml_problem, weights, verbose = FALSE) margin_to_df(controls, count = NULL, verbose = FALSE)
compute_margins(ml_problem, weights, verbose = FALSE) margin_to_df(controls, count = NULL, verbose = FALSE)
ml_problem |
A fitting problem created by
|
weights |
A vector with one entry per row of the original reference sample |
verbose |
If |
controls |
Margins as returned by |
count |
Name of control total column, autodetected by default. |
compute_margins()
computes margins in the format used for the input
controls (i.e., as expected by the controls
parameter of the
ml_problem()
function), based on a reference sample and a weight vector.
margins_to_df()
converts margins to a data frame for easier comparison.
compute_margins()
returns a named list with two components,
individual
and group
. Each contains a list of margins as data.frame
s.
margins_to_df()
returns a data frame with the following columns:
..control.type..
Type of the control total: either individual
or group
.
..control.name..
Name of the control total, if exists.
..id..
Name of the control category.
..count..
Count of the control category.
path <- toy_example("Tiny") problem <- readRDS(path) fit <- ml_fit(ml_problem = problem, algorithm = "entropy_o") margins <- compute_margins(problem, fit$weights) margins margin_to_df(problem$controls) margin_to_df(margins)
path <- toy_example("Tiny") problem <- readRDS(path) fit <- ml_fit(ml_problem = problem, algorithm = "entropy_o") margins <- compute_margins(problem, fit$weights) margins margin_to_df(problem$controls) margin_to_df(margins)
Calibrate sample weights according to known marginal population totals. Based on initial sample weights, the so-called g-weights are computed by generalized raking procedures. The final sample weights need to be computed by multiplying the resulting g-weights with the initial sample weights.
dss( X, d, totals, q = NULL, method = c("raking", "linear", "logit"), bounds = NULL, maxit = 500, ginv = gginv(), tol = 1e-06, attributes = FALSE )
dss( X, d, totals, q = NULL, method = c("raking", "linear", "logit"), bounds = NULL, maxit = 500, ginv = gginv(), tol = 1e-06, attributes = FALSE )
X |
a matrix of calibration variables. |
d |
a numeric vector giving the initial sample (or design) weights. |
totals |
a numeric vector of population totals corresponding to the
calibration variables in |
q |
a numeric vector of positive values accounting for heteroscedasticity. Small values reduce the variation of the g-weights. |
method |
a character string specifying the calibration method to be
used. Possible values are |
bounds |
a numeric vector of length two giving bounds for the g-weights
to be used in the logit method. The first value gives the lower bound (which
must be smaller than or equal to 1) and the second value gives the upper
bound (which must be larger than or equal to 1). If |
maxit |
a numeric value giving the maximum number of iterations. |
ginv |
a function that computes the Moore-Penrose generalized
inverse (default: an optimized version of |
tol |
relative tolerance; convergence is achieved if the difference of all residuals (relative to the corresponding total) is smaller than this tolerance. |
attributes |
should additional attributes (currently
|
A numeric vector containing the g-weights.
This is a faster implementation of parts of
sampling::calib()
from package sampling
. Note that the
default calibration method is raking and that the truncated linear method is
not yet implemented.
Andreas Alfons, with improvements by Kirill Müller
Deville, J.-C. and Särndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376–382.
Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423), 1013–1020.
obs <- 1000 vars <- 100 Xs <- matrix(runif(obs * vars), nrow = obs) d <- runif(obs) / obs totals <- rep(1, vars) g <- dss(Xs, d, totals, method = "linear", ginv = solve) g2 <- dss(Xs, d, totals, method = "raking")
obs <- 1000 vars <- 100 Xs <- matrix(runif(obs * vars), nrow = obs) d <- runif(obs) / obs totals <- rep(1, vars) g <- dss(Xs, d, totals, method = "linear", ginv = solve) g2 <- dss(Xs, d, totals, method = "raking")
This function transforms a multi-level fitting problem to a representation more suitable for applying the algorithms: A matrix with one row per controlled attribute and one column per household, a weight vector with one weight per household, and a control vector.
flatten_ml_fit_problem( ml_problem, model_matrix_type = c("combined", "separate"), verbose = FALSE ) as_flat_ml_fit_problem(x, model_matrix_type = c("combined", "separate"), ...)
flatten_ml_fit_problem( ml_problem, model_matrix_type = c("combined", "separate"), verbose = FALSE ) as_flat_ml_fit_problem(x, model_matrix_type = c("combined", "separate"), ...)
ml_problem |
A fitting problem created by
|
model_matrix_type |
Which model matrix building strategy to use? See details. |
verbose |
If |
x |
An object |
... |
Further parameters passed to the algorithm |
The standard way to build a model matrix (model_matrix = "combined"
)
is to include intercepts and avoid repeating redundant attributes.
A simpler model matrix specification, available via model_matrix = "separate"
,
is suggested by Ye et al. (2009) and required for the ml_fit_ipu()
implementation:
Here, simply one column per target value is used, which
results in a larger model matrix if more than one control is given.
An object of classes flat_ml_fit_problem
,
essentially a named list.
path <- toy_example("Tiny") flat_problem <- flatten_ml_fit_problem(ml_problem = readRDS(path)) flat_problem fit <- ml_fit_dss(flat_problem) fit$flat_weights fit$weights
path <- toy_example("Tiny") flat_problem <- flatten_ml_fit_problem(ml_problem = readRDS(path)) flat_problem fit <- ml_fit_dss(flat_problem) fit$flat_weights fit$weights
The gginv
function creates a function that
calculates the Moore-Penrose generalized inverse of a matrix X using a
fixed tolerance value and a custom
implementation for computing the singular value decomposition.
gginv(tol = sqrt(.Machine$double.eps), svd = base::svd)
gginv(tol = sqrt(.Machine$double.eps), svd = base::svd)
tol |
A relative tolerance to detect zero singular values. |
svd |
A function that computes the singular value decomposition of a matrix |
The svd
argument is expected to adhere to the interface of
base::svd()
. It will be called as svd(x)
(with the
nu
and nv
arguments unset) and is expected to return a named
list with components d
, u
and v
.
A function that accepts one argument X
that computes a MP
generalized inverse matrix for it.
Adapted implementation from the MASS
package.
These functions reweight a reference sample to match constraints given by aggregate controls.
ml_fit()
accepts an algorithm as argument and calls the
corresponding function. This is useful if the result of multiple algorithms
are compared to each other.
ml_fit( ml_problem, algorithm = c("entropy_o", "dss", "ipu", "hipf"), verbose = FALSE, ..., tol = 1e-06 ) is_ml_fit(x) ## S3 method for class 'ml_fit' format(x, ...) ## S3 method for class 'ml_fit' print(x, ...) ml_fit_dss( ml_problem, method = c("raking", "linear", "logit"), ginv = gginv(), tol = 1e-06, verbose = FALSE ) ml_fit_entropy_o( ml_problem, verbose = FALSE, tol = 1e-06, dfsane_args = list() ) ml_fit_hipf( ml_problem, diff_tol = 16 * .Machine$double.eps, tol = 1e-06, maxiter = 2000, verbose = FALSE ) ml_fit_ipu( ml_problem, diff_tol = 16 * .Machine$double.eps, tol = 1e-06, maxiter = 2000, verbose = FALSE )
ml_fit( ml_problem, algorithm = c("entropy_o", "dss", "ipu", "hipf"), verbose = FALSE, ..., tol = 1e-06 ) is_ml_fit(x) ## S3 method for class 'ml_fit' format(x, ...) ## S3 method for class 'ml_fit' print(x, ...) ml_fit_dss( ml_problem, method = c("raking", "linear", "logit"), ginv = gginv(), tol = 1e-06, verbose = FALSE ) ml_fit_entropy_o( ml_problem, verbose = FALSE, tol = 1e-06, dfsane_args = list() ) ml_fit_hipf( ml_problem, diff_tol = 16 * .Machine$double.eps, tol = 1e-06, maxiter = 2000, verbose = FALSE ) ml_fit_ipu( ml_problem, diff_tol = 16 * .Machine$double.eps, tol = 1e-06, maxiter = 2000, verbose = FALSE )
ml_problem |
A fitting problem created by
|
algorithm |
Algorithm to use |
verbose |
If |
... |
Further parameters passed to the algorithm |
tol |
Tolerance, the algorithm has succeeded when all target values are reached within this tolerance. |
x |
An object |
method |
Calibration method, one of |
ginv |
Function that computes the Moore-Penrose pseudoinverse. |
dfsane_args |
Additional arguments (as a named list) passed to the
|
diff_tol |
Tolerance, the algorithm stops when relative difference of control values between iterations drops below this value |
maxiter |
Maximum number of iterations. |
All functions return an object of class ml_fit
, which is
a named list under the hood. The class matches the function called,
e.g., the return value of the ml_fit_ipu
function also is of class
ml_fit_ipu
.
All returned objects contain at least the following components, which can be
accessed with $
or [[
:
weights
: Resulting weights, compatible to the original reference sample
tol
: The input tolerance
iterations
: The actual number of iterations required to obtain the result
flat
: The flattened fitting problem, see flatten_ml_fit_problem()
flat_weights
: Weights in terms of the flattened fitting problem
residuals
: Absolute residuals
rel_residuals
: Relative residuals
success
: Are the residuals within the tolerance?
is_ml_fit()
returns a logical.
Deville, J.-C. and Särndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87 (418), 376–382.
Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88 (423), 1013–1020.
Bar-Gera, H., Konduri, K. C., Sana, B., Ye, X., & Pendyala, R. M. (2009, January). Estimating survey weights with multiple constraints using entropy optimization methods. In 88th Annual Meeting of the Transportation Research Board, Washington, DC.
Müller, K. and Axhausen, K. W. (2011), Hierarchical IPF: Generating a synthetic population for Switzerland, paper presented at the 51st Congress of the European Regional Science Association, University of Barcelona, Barcelona.
Ye, X., K. Konduri, R. M. Pendyala, B. Sana and P. A. Waddell (2009) A methodology to match distributions of both household and person attributes in the generation of synthetic populations, paper presented at the 88th Annual Meeting of the Transportation Research Board, Washington, D.C., January 2009.
path <- toy_example("Tiny") fit <- ml_fit(ml_problem = readRDS(path), algorithm = "entropy_o") fit fit$weights fit$tol fit$iterations fit$flat fit$flat_weights fit$residuals fit$rel_residuals fit$success ml_fit_dss(ml_problem = readRDS(path)) ml_fit_dss(ml_problem = readRDS(path), ginv = solve) ml_fit_entropy_o(ml_problem = readRDS(path)) ml_fit_hipf(ml_problem = readRDS(path)) ml_fit_ipu(ml_problem = readRDS(path))
path <- toy_example("Tiny") fit <- ml_fit(ml_problem = readRDS(path), algorithm = "entropy_o") fit fit$weights fit$tol fit$iterations fit$flat fit$flat_weights fit$residuals fit$rel_residuals fit$success ml_fit_dss(ml_problem = readRDS(path)) ml_fit_dss(ml_problem = readRDS(path), ginv = solve) ml_fit_entropy_o(ml_problem = readRDS(path)) ml_fit_hipf(ml_problem = readRDS(path)) ml_fit_ipu(ml_problem = readRDS(path))
The ml_problem()
function is the first step for fitting a reference
sample to known control totals with mlfit.
All algorithms (see ml_fit()
) expect an object created by this function (or
optionally processed with flatten_ml_fit_problem()
).
The special_field_names()
function is useful for the field_names
argument
to ml_problem
.
ml_problem( ref_sample, controls = list(individual = individual_controls, group = group_controls), field_names, individual_controls = NULL, group_controls = NULL, prior_weights = NULL, geo_hierarchy = NULL ) is_ml_problem(x) ## S3 method for class 'ml_problem' format(x, ...) ## S3 method for class 'ml_problem' print(x, ...) special_field_names( groupId, individualId, individualsPerGroup = NULL, count = NULL, zone = NULL, region = NULL, prior_weight = NULL )
ml_problem( ref_sample, controls = list(individual = individual_controls, group = group_controls), field_names, individual_controls = NULL, group_controls = NULL, prior_weights = NULL, geo_hierarchy = NULL ) is_ml_problem(x) ## S3 method for class 'ml_problem' format(x, ...) ## S3 method for class 'ml_problem' print(x, ...) special_field_names( groupId, individualId, individualsPerGroup = NULL, count = NULL, zone = NULL, region = NULL, prior_weight = NULL )
ref_sample |
The reference sample |
controls |
Control totals, by default initialized from the
|
field_names |
Names of special fields, construct using
|
individual_controls , group_controls
|
Control totals at individual and group level, given as a list of data frames where each data frame defines a control |
prior_weights |
(Deprecated) Use |
geo_hierarchy |
A table shows mapping between a larger zoning level to
many zones of a smaller zoning level. The column name of the larger level
should be specified in |
x |
An object |
... |
Ignored. |
groupId , individualId
|
Name of the column that defines the ID of the group or the individual |
individualsPerGroup |
Obsolete. |
count |
Name of control total column in control tables (use first numeric column in each control by default). |
region , zone
|
Name of the column that defines the region of the reference sample or the zone of the controls. Note that region is a larger area that contains more than one zone. |
prior_weight |
Name of the column that defines the prior weight of the reference sample. Prior (or design) weights at group level; by default a vector of ones will be used, which corresponds to random sampling of groups. |
An object of class ml_problem
or a list of them if geo_hierarchy
was given, essentially a named list with the following components:
refSample
The reference sample, a data.frame
.
controls
A named list with two components, individual
and group
. Each contains a list of controls as data.frame
s.
fieldNames
A named list with the names of special fields.
is_ml_problem()
returns a logical.
# Create example from Ye et al., 2009 # Provide reference sample ye <- tibble::tribble( ~HHNR, ~PNR, ~APER, ~HH_VAR, ~P_VAR, 1, 1, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 3, 1, 3, 2, 4, 2, 1, 1, 2, 5, 2, 1, 3, 3, 6, 3, 1, 1, 3, 7, 3, 1, 1, 3, 8, 3, 1, 2, 4, 9, 3, 2, 1, 4, 10, 3, 2, 3, 4, 11, 3, 2, 3, 5, 12, 3, 2, 2, 5, 13, 3, 2, 2, 5, 14, 3, 2, 3, 6, 15, 2, 2, 1, 6, 16, 2, 2, 2, 7, 17, 5, 2, 1, 7, 18, 5, 2, 1, 7, 19, 5, 2, 2, 7, 20, 5, 2, 3, 7, 21, 5, 2, 3, 8, 22, 2, 2, 1, 8, 23, 2, 2, 2 ) ye # Specify control at household level ye_hh <- tibble::tribble( ~HH_VAR, ~N, 1, 35, 2, 65 ) ye_hh # Specify control at person level ye_ind <- tibble::tribble( ~P_VAR, ~N, 1, 91, 2, 65, 3, 104 ) ye_ind ye_problem <- ml_problem( ref_sample = ye, field_names = special_field_names( groupId = "HHNR", individualId = "PNR", count = "N" ), group_controls = list(ye_hh), individual_controls = list(ye_ind) ) ye_problem fit <- ml_fit_dss(ye_problem) fit$weights
# Create example from Ye et al., 2009 # Provide reference sample ye <- tibble::tribble( ~HHNR, ~PNR, ~APER, ~HH_VAR, ~P_VAR, 1, 1, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 3, 1, 3, 2, 4, 2, 1, 1, 2, 5, 2, 1, 3, 3, 6, 3, 1, 1, 3, 7, 3, 1, 1, 3, 8, 3, 1, 2, 4, 9, 3, 2, 1, 4, 10, 3, 2, 3, 4, 11, 3, 2, 3, 5, 12, 3, 2, 2, 5, 13, 3, 2, 2, 5, 14, 3, 2, 3, 6, 15, 2, 2, 1, 6, 16, 2, 2, 2, 7, 17, 5, 2, 1, 7, 18, 5, 2, 1, 7, 19, 5, 2, 2, 7, 20, 5, 2, 3, 7, 21, 5, 2, 3, 8, 22, 2, 2, 1, 8, 23, 2, 2, 2 ) ye # Specify control at household level ye_hh <- tibble::tribble( ~HH_VAR, ~N, 1, 35, 2, 65 ) ye_hh # Specify control at person level ye_ind <- tibble::tribble( ~P_VAR, ~N, 1, 91, 2, 65, 3, 104 ) ye_ind ye_problem <- ml_problem( ref_sample = ye, field_names = special_field_names( groupId = "HHNR", individualId = "PNR", count = "N" ), group_controls = list(ye_hh), individual_controls = list(ye_ind) ) ye_problem fit <- ml_fit_dss(ye_problem) fit$weights
This function replicates each entry in a reference sample based on its fitted
weights. This is useful if the result of multiple replication algorithms
are compared to each other, or to generate a full synthetic population
based on the result of a ml_fit
object. Note that, all individual
and group ids of the synthetic population are not the same as those in
the original reference sample, and the total number of groups replicated
is always very close to or equal the sum of the fitted group weights.
ml_replicate( ml_fit, algorithm = c("pp", "trs", "round"), verbose = FALSE, .keep_original_ids = FALSE )
ml_replicate( ml_fit, algorithm = c("pp", "trs", "round"), verbose = FALSE, .keep_original_ids = FALSE )
ml_fit |
A |
algorithm |
Replication algorithm to use. "trs" is the 'Truncate, replicate, sample' integerisation algorithm proposed by Lovelace et al. (2013), "pp" is weighted sampling with replacement, and "round" is just simple rounding. |
verbose |
If |
.keep_original_ids |
If |
The function returns a replicated sample in data.frame
in the format used in the reference sample of the input ml_fit
object.
Lovelace, R., & Ballas, D. (2013). ‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41, 1-11.
path <- toy_example("Tiny") fit <- ml_fit(ml_problem = readRDS(path), algorithm = "entropy_o") syn_pop <- ml_replicate(fit, algorithm = "trs") syn_pop
path <- toy_example("Tiny") fit <- ml_fit(ml_problem = readRDS(path), algorithm = "entropy_o") syn_pop <- ml_replicate(fit, algorithm = "trs") syn_pop
Returns the paths to all available toy examples, or to a specific toy
example. Load via readRDS()
.
toy_example(name = NULL)
toy_example(name = NULL)
name |
Name of the example, default: return all |
A named vector of file system paths.
toy_example() # Load example with results from Ye et al. (2009) readRDS(toy_example("Tiny"))
toy_example() # Load example with results from Ye et al. (2009) readRDS(toy_example("Tiny"))