mcstatsim
packageThe mcstatsim
package offers an efficient, functional
programming-based approach for statistical simulations, centralizing the
process in a single higher-order function for better
manageability. Besides, it includes ready-to-use functions for
well-known simulation targets.
The core runsim()
function processes simulation
parameters via expanded grid, mapping them to the simulation
function.
Replication management is performed through the
replicater()
function, which estimates the remaining time
(ETA) with an option to hide progress via
show_progress = FALSE
. Outputs are deliberately structured
as dataframe to simplify analysis and visualization, addressing the
limitations of list outputs in data manipulation.
You can install the latest development version of
mcstatsim
from gitHub:
# install.packages("devtools")
#devtools::install_github("ielbadisy/mcstatsim")
Here is a basic example to get you started with
mcstatsim
:
library(mcstatsim)
# Define a simple simulation function
<- function(a, b) {
sim_function Sys.sleep(0.1) # Simulate a time-consuming process
return(data.frame(result = a + b))
}
# Generate a grid of parameters
<- expand.grid(a = 1:3, b = 4:6)
params
# Run simulations
<- runsim(n = 3, grid_params = params, sim_func = sim_function) results
This example demonstrates how to define a simple simulation function,
create a grid of parameters for the simulation, and run the simulations
in parallel using mcstatsim
.
To illustrate the utility of this package in a concrete example, we
will simulate the evaluation of several imputation methods to assess
their effectiveness in preserving the accuracy of coefficient estimates
in a Cox regression model. This will demonstrate the creation of a
simulation function, setting up parameters, and using
mcstatsim
to run these simulations in parallel.
Simulation aim: We will to assess the performance of some imputation methods regarding their capacity to preserve the values of coefficients estimates. For this aim, we well set up the following simulation design:
Generate fully observed data ->
data_complete
estimate the beta coefficients values from
data_complete
Introduce missigness under MCAR to complete dataset generated in
(1) -> data_missing
impute the dataset generated at (3) ->
data_imputed
Use the following simulation targets to compute the distortion
between beta from data_complete
and beta
data_imputed
:
\[\text{Bias} = \frac{1}{N} \sum_{i=1}^{N} (\hat{\beta}_i - \beta_{\text{true}})\]
\[\text{Coverage} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\beta_{\text{true}} \in \text{CI}_i)\]
\[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{\beta}_i - \beta_{\text{true}})^2\]
NB: All these metrics are already implemented in the
mcstatsim
package (see ?calc_bias()
,
?calc_coverage()
, and ?calc_rmse()
.
Since we want to preserve the spirit of the functional prgramming style, all our simulation design step will be decomposed as helper functions. In others words, the simulation steps will be translated to functions as follow:
::p_load(mcstatsim, survival, dplyr, ggplot2)
pacman
## (1) generate fully observed data -> `data_complete`
<- function(n = 300, maxTime = 7, logHR = 0.5) {
gencox <- 0.1
lambda <- 1.6
rho <- 0.09
rateC
# covariates
<- rnorm(n)
x1 <- x1^2 + x1 + runif(n)
x2 <- rbinom(n, 1, 0.5)
x3
# estimated survival times
<- runif(n)
U <- (-log(U) / (lambda * exp(logHR * (x1 + x2 + x3))))^(1 / rho)
Tlat <- rexp(n, rate = rateC)
Ctimes
# follow-up times and event indicators
<- pmin(Tlat, Ctimes)
time <- as.numeric(Tlat <= Ctimes)
status <- ifelse(time > maxTime, maxTime, time)
time <- ifelse(time >= maxTime, 1, status)
status
<- data.frame(time, status, x1, x2, x3)
data $x3 <- as.factor(data$x3)
data
return(data)
}
## (2) estimate the beta coefficients values from `data_complete`
<- function(data) {
estimate_coxest <- survival::Surv(time, status) ~ x1 + x2 + x3
myFormula <- summary(survival::coxph(myFormula, data = data))$coef
coefs 1]
coefs[,
}
## (3) introduce missigness under MCAR to complete dataset generated in (1) -> `data_missing`
<- function(x, covariates = names(x), p = 0.3) {
introduce_MCAR
stopifnot(is.data.frame(x), p >= 0 && p <= 1, all(covariates %in% names(x)))
<- lapply(x[covariates], function(z) {
x[covariates] sample(length(z), floor(p * length(z)))] <- NA
z[
z
})
return(x)
}
## (4) impute the dataset generated at (3) -> `data_imputed`
<- function(data, method) {
imputer stopifnot(is.data.frame(data))
if (is.factor(method)) {method <- as.character(method)}
<- c("knn", "cart", "missforest", "missranger", "misscforest", "complete")
supported_methods stopifnot(method %in% supported_methods)
<- switch(method,
data_imputed knn = VIM::kNN(data)[names(data)],
cart = simputation::impute_cart(data, .~.),
missforest = missForest::missForest(data, xtrue = data, verbose = FALSE)$ximp,
missranger = missRanger::missRanger(data, pmm.k = 5, num.trees = 100, verbose = 0),
misscforest = suppressWarnings(missCforest::missCforest(data, ntree = 10L)),
complete = data[stats::complete.cases(data), ])
return(data_imputed)
}
## (5) compute the simulation targets
<- function(data, truelogHR) {
evaluate_coxest <- survival::Surv(time, status) ~ x1 + x2 + x3
myFormula <- summary(survival::coxph(myFormula, data = data))$coef
coefs <- as.data.frame(coefs)
estimates
<- data.frame(
out estimates = estimates$coef,
ci_lower = estimates$coef - 1.96 * estimates$`se(coef)`,
ci_upper = estimates$coef + 1.96 * estimates$`se(coef)`
)
$bias <- mcstatsim::calc_bias(out$estimates, truelogHR)$bias
out$coverage <- mcstatsim::calc_coverage(out$ci_lower, out$ci_upper, truelogHR)$coverage
out$rmse <- mcstatsim::calc_rmse(out$estimates, truelogHR)$rmse
out
return(out)
}
<- function(n, logHR, pmiss, covariates = c("x2"), method = method){
simcox <- gencox(n = n)
data_complete <- estimate_coxest(data_complete)
truelogHR <- introduce_MCAR(data_complete, covariates = covariates, p = pmiss)
data_missing <- imputer(data_missing, method = method)
data_imputed <- evaluate_coxest(data_missing, truelogHR = truelogHR)
res_est <- cbind(n, pmiss, method, covariates, res_est, row.names = NULL)
res return(res)
}
Now, to link our simulation function to the simulation parameters, we will generate a grid of paremeters with the same names as the argument of simulation function:
<- expand.grid(n = c(200, 500),
params logHR = 0.5,
pmiss = c(0.2, 0.5),
method = c("knn", "cart", "missforest", "missranger",
"misscforest", "complete"))
In one line of code, we can lunch our simulation:
::p_load(survival, missRanger, missCforest, missForest, VIM, simputation, ggplot2, tidyr)
pacman<- mcstatsim::runsim(n = 3, grid_params = params, sim_func = simcox) sim_res
Finaly, we retreive all the results in one single dataset (i.e
sim_res
) for ploting and table production:
$bias <- abs(sim_res$bias)
sim_res
<- gather(sim_res, metric, value, c(bias, rmse))
sim_res2 ggplot(sim_res2, aes(x=value, y=method, fill=method)) +
stat_summary(fun=mean, geom="point", shape=20, size=3, color="red") +
theme_bw() +
theme(axis.text.x = element_text(hjust = 1),
legend.position = "none") +
labs(x = "", y = "") +
facet_grid(pmiss~metric, scales="free")
#ggsave("quick_res.png")
library(dplyr)
<- sim_res %>%
summary_table group_by(n, pmiss, method) %>% # Group by number of observations, missingness, and method
summarise(
Mean_Estimate = mean(estimates),
Mean_CI_Lower = mean(ci_lower),
Mean_CI_Upper = mean(ci_upper),
Mean_Bias = mean(bias),
Coverage = mean(coverage),
Mean_RMSE = mean(rmse),
.groups = 'drop'
)
::kable(summary_table, format = "markdown", caption = "Summary of Simulation Results") knitr
Let’s skip interpreting the results since the simulation design isn’t
complete yet—we need to add more simulation targets and assess different
hyperparameter values. However, this provides a good demonstration of
how the mcstatsim
package can efficiently organize Monte
Carlo simulations without the complexities of for-loops and managing
numerous parameters.
No Dependencies: mcstatsim
is
designed to be lightweight and standalone, requiring no additional
packages for its core functionalities, simplifying installation and
usage.
Functional programming approach: Streamlines the process of setting up and running simulations.
Parallel execution: Leverages multiple cores to speed up the execution of simulations.
Structured output: Returns simulation results in a dataframe, facilitating quick analysis and visualization.
Contributions are welcome! If you’d like to help improve
mcstatsim
, please feel free to submit a pull request.