Package {Iscores}


Type: Package
Title: Proper Scoring Rules for Missing Value Imputation
Version: 1.2.0
Description: Provides tools for evaluating and ranking missing value imputation methods using proper scoring rules. Implements the Energy-I-Score and the DR-I-Score for the assessment of deterministic, stochastic and multiple imputation methods for numerical and mixed datasets, following Näf et al. (2022) <doi:10.48550/arXiv.2106.03742> and Näf et al. (2025) <doi:10.48550/arXiv.2507.11297>.
Encoding: UTF-8
RoxygenNote: 7.3.3
Imports: energy, kernlab, pbapply, pbmcapply, ranger, scoringRules, stats
Suggests: knitr, mice, rmarkdown, spelling, testthat (≥ 3.0.0)
Config/testthat/edition: 3
VignetteBuilder: knitr
URL: https://krystynagrzesiak.github.io/Iscores/
License: GPL-3
Language: en-US
NeedsCompilation: no
Packaged: 2026-06-08 17:05:50 UTC; Krysia
Author: Krystyna Grzesiak [aut, cre], Loris Michel [aut, ctb], Meta-Lina Spohn [aut, ctb], Jeffrey Näf [aut, ctb]
Maintainer: Krystyna Grzesiak <krygrz11@gmail.com>
Repository: CRAN
Date/Publication: 2026-06-08 18:20:25 UTC

Compute the imputation KL-based scoring rules

Description

Computes the density ratio I-Score (DR-I-Score), a KL-based scoring rule for evaluating imputations.

Usage

DR_IScore(
  X,
  imputation_func = NULL,
  X_imp = NULL,
  m = 5,
  n_proj = 100,
  n_trees_per_proj = 5,
  min_node_size = 10,
  n_cores = 1,
  projection_function = NULL,
  ...
)

Arguments

X

data containing missing values denoted with NA's.

imputation_func

an imputing function. If NULL, please provide imputed datasets X_imp and m.

X_imp

a list of imputed datasets. If NULL it will be obtained using imputation_func.

m

the number of multiple imputations to consider. Defaults to 5.

n_proj

an integer specifying the number of projections to consider for the score.

n_trees_per_proj

an integer, the number of trees per projection.

min_node_size

an integer, the minimum number of observations in a terminal node (leaf) of each tree.

n_cores

an integer, the number of cores to use.

projection_function

a function providing the user-specific projections.

...

used for compatibility

Value

a numeric value, the score obtained for the provided imputation method.

References

This method is described in detail in:

Näf, Jeffrey, Meta-Lina Spohn, Loris Michel, and Nicolai Meinshausen. 2022. “Imputation Scores.” https://arxiv.org/abs/2106.03742.

Examples

set.seed(111)
X <- random_mcar_data(100, 3, 0.2)
imputation_func <- exp_imputation
DR_IScore(X, imputation_func, m = 2, n_proj = 10, n_trees_per_proj = 2)



Balancing of Classes

Description

Balancing of Classes

Usage

class.balancing(X_proj_complete, Y.proj, drawA, X_imp, ids.with.missing, vars)

Arguments

X_proj_complete

matrix with complete projected observations.

Y.proj

matrix with projected imputed observations.

drawA

vector of indices corresponding to current missingness pattern.

X_imp

matrix of full imputed observations.

ids.with.missing

vector of indices of observations with missing values.

vars

vector of the variables in the projection.

Value

a list of new X_proj_complete and Y.proj.


Combine two projection forests

Description

Combine two projection forests

Usage

combine2Forests(mod1, mod2)

Arguments

mod1

A fitted forest object.

mod2

A fitted forest object.

Value

A forest object containing trees from both input forests.


Combine a list of forests

Description

Combine a list of forests

Usage

combineForests(list.rf)

Arguments

list.rf

A list of fitted forest objects.

Value

A single forest object obtained by combining all forests in list.rf.


Calculates IScores for multiple imputation functions

Description

Calculates IScores for multiple imputation functions

Usage

compare_Iscores(X, methods_list, score = c("energy_IScore", "DR_IScore"), ...)

Arguments

X

data containing missing values denoted with NA's.

methods_list

a named list of imputing functions.

score

a character vector naming the scores to calculate: "energy_IScore", "DR_IScore", or both.

...

other arguments to be passed to energy_IScore or DR_IScore

Value

a vector of IScores for provided methods

Examples

set.seed(111)
X <- random_mcar_data(100, 3, 0.2)
methods_list <- list(exp = exp_imputation,
                     norm = norm_imputation)
compare_Iscores(X, methods_list = methods_list, m = 2,
                n_proj = 10, n_trees_per_proj = 2)


Compute the density ratio score

Description

Compute the density ratio score

Usage

compute_drScore(object, Z = Z, n_trees_per_proj, n_proj)

Arguments

object

a crf object.

Z

a matrix of candidate points.

n_trees_per_proj

an integer, the number of trees per projection.

n_proj

an integer specifying the number of projections.

Value

a numeric value, the DR I-Score.


Computation of the density ratio score

Description

Computes the density ratio score using a random forest model based on random projections.

Usage

densityRatioScore(
  X,
  X_imp,
  pattern = NULL,
  n_proj = 10,
  n_trees_per_proj = 1,
  projection_function = NULL,
  min_node_size = 1,
  normal_proj = TRUE
)

Arguments

X

A numeric matrix of observed data that may contain missing values denoted by NA.

X_imp

A numeric matrix of imputed values with the same dimensions as X.

pattern

A vector or pattern indicating the missingness structure.

n_proj

An integer specifying the number of random projections.

n_trees_per_proj

An integer specifying the number of trees grown per projection.

projection_function

A function that generates user-defined projections.

min_node_size

An integer specifying the minimum number of observations in a terminal node (leaf) of each tree.

normal_proj

Logical. If TRUE, sampling is performed from both missing (NA) and observed values. If FALSE, sampling is performed only from missing (NA) values.

Details

The method builds multiple random forests on projected versions of the data to estimate the density ratio between observed and imputed distributions.
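
The classification idea behind a density ratio estimate can be illustrated with a minimal sketch. It is not the package's implementation: a logistic regression (base R glm) is swapped in for the projected random forests, the data are simulated, and the truncation bound eps is a hypothetical value chosen only for illustration. With equally sized classes, the fitted probability p that a point is "observed" gives the ratio estimate p / (1 - p).

```r
set.seed(1)

# Simulated stand-ins: "observed" points from N(0, 1), "imputed" points from N(1, 1).
obs <- data.frame(x = rnorm(200, mean = 0))
imp <- data.frame(x = rnorm(200, mean = 1))

# Label observed points 1 and imputed points 0, then fit a classifier.
dat <- rbind(cbind(obs, y = 1), cbind(imp, y = 0))
fit <- glm(y ~ x, data = dat, family = binomial())

# Predicted probability of the "observed" class at a few candidate points.
Z <- data.frame(x = c(-1, 0.5, 2))
p <- predict(fit, newdata = Z, type = "response")

# Truncate probabilities away from 0 and 1 so the log ratio stays finite
# (eps is an assumed value, not taken from the package).
eps <- 1e-9
p <- pmin(pmax(p, eps), 1 - eps)

# Estimated log density ratio observed / imputed at the candidate points.
log_ratio <- log(p / (1 - p))
log_ratio
```

For these two normal distributions the true log ratio is 0.5 - x, so the estimate should decrease as x increases.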

Value

An object representing a fitted random forest model based on random projections.


Convert a factor vector to one-hot encoding

Description

Converts a factor vector into a one-hot encoded matrix with one column per factor level.

Usage

do_one_hot(vec)

Arguments

vec

A factor vector to be encoded.

Details

Missing values in 'vec' are preserved as rows containing 'NA' values.

Value

A numeric matrix with one row per element of 'vec' and one column per factor level. Column names are prefixed with '"level_"'.
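
The encoding described above can be sketched in base R as follows. The function name one_hot_sketch is hypothetical and the code is an illustration of the documented behaviour (NA rows preserved, "level_" prefix), not the source of do_one_hot.

```r
one_hot_sketch <- function(vec) {
  stopifnot(is.factor(vec))
  lv <- levels(vec)
  # Compare every element with every level; NA elements yield NA rows.
  out <- outer(as.character(vec), lv, "==") * 1
  colnames(out) <- paste0("level_", lv)
  out
}

vec <- factor(c("a", "b", NA, "a"))
encoded <- one_hot_sketch(vec)
encoded
```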


Energy distance

Description

Calculating energy distance/statistic.

Usage

edistance(X, X_imp, rescale = FALSE)

Arguments

X

a complete original dataset (without missing values).

X_imp

an imputed dataset

rescale

a logical indicating whether the returned value should be rescaled. Defaults to FALSE. See the "Details" section for more information.

Details

This function uses the eqdist.e function from the energy package. By default, following that implementation, it returns the energy statistic

E(X, Y) = \frac{nm}{n + m} \hat{\varepsilon}(X, Y),

where \hat{\varepsilon}(X, Y) is the raw energy distance and n and m are the numbers of observations in the two samples. To obtain the raw energy distance, use rescale = TRUE.
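
The relation between the two quantities can be checked with a small base R sketch. The function energy_distance_sketch is hypothetical and computes the usual sample version of the energy distance (twice the mean between-sample distance minus the two mean within-sample distances); it is meant as an illustration and is not guaranteed to match eqdist.e in every detail.

```r
energy_distance_sketch <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n <- nrow(X)
  m <- nrow(Y)
  # Euclidean distances between all rows of the pooled sample.
  D <- as.matrix(dist(rbind(X, Y)))
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  d_xy <- mean(D[ix, iy])  # mean between-sample distance
  d_xx <- mean(D[ix, ix])  # mean within-sample distance for X
  d_yy <- mean(D[iy, iy])  # mean within-sample distance for Y
  raw <- 2 * d_xy - d_xx - d_yy
  c(raw = raw, statistic = n * m / (n + m) * raw)
}

set.seed(1)
X <- matrix(rnorm(100), nrow = 25)
Y <- matrix(rnorm(100, mean = 2), nrow = 25)
res <- energy_distance_sketch(X, Y)
res
```

With n = m = 25 the scaling factor nm / (n + m) equals 12.5.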

Value

A numeric value giving the energy distance between the original dataset and the imputed dataset.

Examples

X <- matrix(rnorm(100), nrow = 25)
X_imp <- matrix(rnorm(100), nrow = 25)
edistance(X, X_imp)


Calculates the energy-I-Score for an imputation function

Description

Calculates the energy-I-Score for a given imputation function, for numerical or mixed data.

Usage

energy_IScore(
  X,
  imputation_func,
  X_imp = NULL,
  multiple = TRUE,
  N = 50,
  max_length = NULL,
  skip_if_needed = TRUE,
  scale = FALSE,
  n_cores = 1,
  silent = TRUE
)

Arguments

X

data containing missing values denoted with NA's.

imputation_func

a function that imputes data.

X_imp

an imputed dataset of the same size as X. NULL by default, in which case it is obtained by applying imputation_func to X.

multiple

a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1.

N

a numeric value, the number of samples from the imputation distribution H. Defaults to 50.

max_length

the maximum number of variables X_j to consider; a smaller value can speed up the computation. Defaults to NULL, meaning that all columns are considered.

skip_if_needed

a logical indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training.

scale

a logical value. If TRUE, each variable is scaled in the score.

n_cores

the number of cores to use for parallelization.

silent

a logical indicating whether warnings and messages should be suppressed.

Details

This function relies on functions energy_Iscore_num and energy_Iscore_cat. Depending on the presence of factor-type data, these functions compute a score either for purely numerical data or for mixed data types.

If you want to compute the score for numerical data, make sure that the dataset does not contain any factor-type variables.

If you want to compute the score for categorical data, ensure that all categorical variables are preserved as factors.

If your imputation method does not support categorical variables represented as factors, implement a wrapper function that handles the appropriate data type conversions before and after imputation.
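
A wrapper of this kind might look like the following minimal sketch. Both function names are hypothetical, and the inner imputer (column means on integer codes) is a deliberately crude stand-in for a method that only accepts numeric input; the point is the conversion before and after imputation, not the imputation itself.

```r
# Hypothetical numeric-only imputer: replaces NA by the column mean.
numeric_only_imputer <- function(X_num) {
  for (j in seq_along(X_num)) {
    miss <- is.na(X_num[[j]])
    X_num[[j]][miss] <- mean(X_num[[j]], na.rm = TRUE)
  }
  X_num
}

# Wrapper: factors -> integer codes -> impute -> factors with original levels.
factor_aware_wrapper <- function(X_miss) {
  fac_names <- names(X_miss)[vapply(X_miss, is.factor, logical(1))]
  lvls <- lapply(X_miss[fac_names], levels)
  X_num <- X_miss
  for (nm in fac_names) X_num[[nm]] <- as.integer(X_miss[[nm]])
  X_imp <- numeric_only_imputer(X_num)
  for (nm in fac_names) {
    k <- length(lvls[[nm]])
    # Map imputed values back to the nearest valid level code.
    codes <- pmin(pmax(round(X_imp[[nm]]), 1), k)
    X_imp[[nm]] <- factor(lvls[[nm]][codes], levels = lvls[[nm]])
  }
  X_imp
}

X_mixed <- data.frame(x = c(1, NA, 3), f = factor(c("a", NA, "b")))
X_mixed_imp <- factor_aware_wrapper(X_mixed)
str(X_mixed_imp)
```

A function with this shape (data with NA in, completed data with the same column types out) could then be supplied as imputation_func.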

Value

a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for the individual columns.

References

Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.

Examples

set.seed(111)
X <- random_mcar_data(100, 4)
imputation_func <- exp_imputation
energy_IScore(X, imputation_func)

X <- random_mcar_mixed_data(100, 4, 2)
imputation_func <- median_mode_imputation
energy_IScore(X, imputation_func)


energy-I-Score for imputation of mixed data (categorical and numerical)

Description

energy-I-Score for imputation of mixed data (categorical and numerical)

Usage

energy_Iscore_cat(
  X,
  imputation_func,
  X_imp = imputation_func(X),
  multiple = TRUE,
  N = 50,
  max_length = NULL,
  skip_if_needed = TRUE,
  scale = FALSE,
  n_cores = 1,
  silent = TRUE
)

Arguments

X

data containing missing values denoted with NA's.

imputation_func

a function that imputes data.

X_imp

an imputed dataset of the same size as X. By default it is obtained by calling imputation_func(X).

multiple

a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1.

N

a numeric value, the number of samples from the imputation distribution H. Defaults to 50.

max_length

the maximum number of variables X_j to consider; a smaller value can speed up the computation. Defaults to NULL, meaning that all columns are considered.

skip_if_needed

a logical indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training.

scale

a logical value. If TRUE, each variable is scaled in the score.

n_cores

the number of cores to use for parallelization.

silent

a logical indicating whether warnings and messages should be suppressed.

Details

Categorical variables should be stored as factors. If your imputation method needs an additional conversion of the data (for example one-hot encoding), implement it inside the function passed as imputation_func. You can use the miceDRF:::onehot_to_factor and miceDRF:::factor_to_onehot functions.

Value

a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for the individual columns.

References

This method is described in detail in:

Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.


energy-I-Score for imputation of numerical data

Description

Calculates the energy-I-Score for one imputation function on purely numerical data.

Usage

energy_Iscore_num(
  X,
  imputation_func,
  X_imp = imputation_func(X),
  multiple = TRUE,
  N = 50,
  max_length = NULL,
  skip_if_needed = TRUE,
  scale = FALSE,
  n_cores = 1,
  silent = TRUE
)

Arguments

X

data containing missing values denoted with NA's.

imputation_func

a function that imputes data.

X_imp

an imputed dataset of the same size as X. By default it is obtained by calling imputation_func(X).

multiple

a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1.

N

a numeric value, the number of samples from the imputation distribution H. Defaults to 50.

max_length

the maximum number of variables X_j to consider; a smaller value can speed up the computation. Defaults to NULL, meaning that all columns are considered.

skip_if_needed

a logical indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training.

scale

a logical value. If TRUE, each variable is scaled in the score.

n_cores

the number of cores to use for parallelization.

silent

a logical indicating whether warnings and messages should be suppressed.

Value

a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for the individual columns.

References

This method is described in detail in:

Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.


Standard exponential imputation

Description

Imputes all missing values by independent draws from an exponential distribution with rate 1.

Usage

exp_imputation(X_miss)

Arguments

X_miss

A data set containing missing values.

Value

A completed data set with all missing values replaced by draws from an Exp(1) distribution.

Examples

X <- random_mcar_data(100, 3)
X_imp <- exp_imputation(X)
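
For readers who want to write their own baseline imputer with the same contract (data with NA in, completed data out), the idea can be sketched in base R. The name exp_imputation_sketch is hypothetical and the code is an illustration, not the source of exp_imputation; it assumes all columns are numeric.

```r
exp_imputation_sketch <- function(X_miss) {
  X_imp <- as.matrix(X_miss)
  na_pos <- is.na(X_imp)
  # One independent Exp(1) draw per missing entry; observed entries are kept.
  X_imp[na_pos] <- rexp(sum(na_pos), rate = 1)
  as.data.frame(X_imp)
}

set.seed(42)
X_small <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA))
X_small_imp <- exp_imputation_sketch(X_small)
X_small_imp
```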


Internal function for changing factors to numerical

Description

A supplementary function for data management

Usage

factor_to_numeric(factor_col)

Arguments

factor_col

a factor column

Details

This function converts factor variables to numeric variables.


One hot encoding

Description

A supplementary function for one-hot encoding

Usage

factor_to_onehot(dat)

Arguments

dat

a data set containing factor and numeric columns.

Details

This function converts factor variables into a one-hot encoding.


Extract and group missing-data patterns

Description

Identifies unique missingness patterns in a data matrix and groups observations according to these patterns. If more than one pattern occurs only once, such singleton patterns are merged into a single group.

Usage

get_pattern_data(X)

Arguments

X

A matrix or data frame that may contain missing values.

Details

Missingness patterns are represented by a logical matrix obtained from is.na(X). Only rows containing at least one missing value are used to define the unique patterns.

If more than one pattern is represented by a single observation, these singleton patterns are merged using merge_singleton_patterns().

Value

A list with three elements:

patterns

A matrix of unique missingness patterns.

groups

A list of integer vectors giving row indices for each pattern.

average_diff

A logical indicating whether singleton patterns were merged.
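
The pattern extraction step described in Details can be sketched in base R as follows. This illustration uses hypothetical variable names and leaves out the merging of singleton patterns, whose exact rule is not spelled out here.

```r
X_pat <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(NA, 2, 3, 4, NA),
  c = c(1, 2, 3, NA, 5)
)

# Logical missingness matrix; keep only rows with at least one NA.
M <- is.na(X_pat)
rows_na <- unname(which(rowSums(M) > 0))
M_na <- M[rows_na, , drop = FALSE]

# Unique patterns, and row indices grouped by pattern (keyed by a 0/1 string).
patterns <- unique(M_na)
keys <- apply(M_na, 1, function(r) paste(as.integer(r), collapse = ""))
groups <- split(rows_na, keys)

patterns
groups
```

Here row 3 is complete and is therefore excluded, while rows 1 and 5 share the pattern in which only b is missing.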


Median/mode imputation

Description

Imputes numerical variables using their median and categorical variables using their most frequent observed category.

Usage

median_mode_imputation(X_miss)

Arguments

X_miss

A data set containing missing values.

Value

A completed data set with all missing values imputed.

Examples

X <- random_mcar_mixed_data(100, 3, n_fac = 1)
X_imp <- median_mode_imputation(X)
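
The rule stated in the Description can be sketched in base R. The name median_mode_sketch is hypothetical and the code is an illustration rather than the source of median_mode_imputation; ties between equally frequent categories are resolved by taking the first one, which is an assumption.

```r
median_mode_sketch <- function(X_miss) {
  X_imp <- X_miss
  for (j in seq_along(X_imp)) {
    col <- X_imp[[j]]
    miss <- is.na(col)
    if (!any(miss)) next
    if (is.factor(col)) {
      tab <- table(col)                        # counts of observed categories
      col[miss] <- names(tab)[which.max(tab)]  # most frequent category
    } else {
      col[miss] <- median(col, na.rm = TRUE)   # median of observed values
    }
    X_imp[[j]] <- col
  }
  X_imp
}

X_mm <- data.frame(x = c(1, NA, 3, 10), f = factor(c("a", "b", NA, "b")))
X_mm_imp <- median_mode_sketch(X_mm)
X_mm_imp
```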


Merge singleton missingness patterns

Description

Merges missingness patterns that occur only once (singleton patterns) into a single pattern. If the merged pattern already exists among the current patterns, the corresponding groups of observations are combined. Otherwise, a new pattern is created and appended.

Usage

merge_singleton_patterns(patterns, groups, ind_singletons)

Arguments

patterns

A numeric matrix where each row represents a unique missingness pattern.

groups

A list of integer vectors. Each element contains the indices of observations corresponding to a given pattern in patterns.

ind_singletons

An integer vector indicating indices of patterns in patterns that occur only once.

Value

A list with two elements:

patterns

Updated matrix of unique missingness patterns.

groups

Updated list of observation indices grouped by pattern.


Standard normal imputation

Description

Imputes all missing values by independent draws from a standard normal distribution.

Usage

norm_imputation(X_miss)

Arguments

X_miss

A data set containing missing values.

Value

A completed data set with all missing values replaced by draws from a N(0,1) distribution.

Examples

X <- random_mcar_data(100, 3)
X_imp <- norm_imputation(X)


Generate random data with MCAR missing values

Description

Generates a numerical dataset consisting of independent standard normal variables and introduces missing values according to a Missing Completely at Random (MCAR) mechanism.

Usage

random_mcar_data(n, p, ratio = 0.2)

Arguments

n

Number of observations.

p

Number of numerical variables.

ratio

Proportion of entries to replace with missing values.

Value

A data frame with n rows and p numerical variables containing missing values.

Examples

X <- random_mcar_data(100, 3, ratio = 0.2)
head(X)
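
A generator of this kind can be sketched in base R. The name mcar_sketch is hypothetical, and the masking rule (a fixed number of entries chosen uniformly at random) is an assumption; the package function may instead mask each entry independently.

```r
mcar_sketch <- function(n, p, ratio = 0.2) {
  # Independent standard normal entries.
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # MCAR: the positions set to NA do not depend on the data values.
  n_miss <- round(ratio * n * p)
  X[sample(n * p, n_miss)] <- NA
  as.data.frame(X)
}

set.seed(7)
X_sim <- mcar_sketch(100, 3, ratio = 0.2)
mean(is.na(X_sim))
```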


Generate random mixed data with MCAR missing values

Description

Generates a mixed dataset containing independent standard normal variables and categorical variables, then introduces missing values according to a Missing Completely at Random (MCAR) mechanism.

Usage

random_mcar_mixed_data(n, p, n_fac = 1, ratio = 0.2)

Arguments

n

Number of observations.

p

Number of numerical variables.

n_fac

Number of categorical variables.

ratio

Proportion of entries to replace with missing values.

Value

A data frame containing p numerical variables and n_fac factor variables with missing values.

Examples

X <- random_mcar_mixed_data(100, 3, n_fac = 2, ratio = 0.2)
str(X)


Sampling of Projections

Description

Sampling of Projections

Usage

sample_vars_proj(ids_x_na, X, projection_function = NULL, normal_proj = TRUE)

Arguments

ids_x_na

a vector of indices corresponding to NA in the given missingness pattern.

X

a matrix of the observed data containing missing values.

projection_function

a function providing the user-specific projections.

normal_proj

a logical. If TRUE, variables are sampled from those that are NA in the pattern and additionally from those that are observed. If FALSE, variables are sampled only from those that are NA in the pattern.

Value

a vector of variables corresponding to the projection.
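
The effect of normal_proj can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name sample_vars_sketch, the argument p (number of columns, used instead of X), and the subset sizes, which are drawn uniformly at random. The package may sample differently.

```r
sample_vars_sketch <- function(ids_x_na, p, normal_proj = TRUE) {
  stopifnot(length(ids_x_na) >= 1)
  # Non-empty random subset of the variables that are NA in the pattern.
  k_na <- sample.int(length(ids_x_na), 1)
  vars <- ids_x_na[sample.int(length(ids_x_na), k_na)]
  if (normal_proj) {
    # Additionally draw a (possibly empty) subset of the observed variables.
    ids_obs <- setdiff(seq_len(p), ids_x_na)
    if (length(ids_obs) > 0) {
      k_obs <- sample.int(length(ids_obs) + 1, 1) - 1
      vars <- c(vars, ids_obs[sample.int(length(ids_obs), k_obs)])
    }
  }
  sort(vars)
}

set.seed(3)
vars_na_only <- sample_vars_sketch(c(2, 4), p = 5, normal_proj = FALSE)
vars_mixed <- sample_vars_sketch(c(2, 4), p = 5, normal_proj = TRUE)
```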


Truncation of probability

Description

Truncation of probability

Usage

truncProb(p)

Arguments

p

a numeric value between 0 and 1 to be truncated

Value

a numeric value, the truncated probability.