fda.vi implements the novel Bayesian basis function selection method of da Cruz, de Souza, and Sousa (2024) for functional data analysis. The method smooths one or multiple functional curves simultaneously, while accounting for within-curve correlation, by expressing each curve as a linear combination of basis functions (B-splines or Fourier) and using a variational expectation-maximization (VEM) algorithm to select essential basis functions while selectively removing unnecessary ones via sparsity-inducing priors.
Key method features:
Let \(y_{ij}\) denote observation \(j\) of curve \(i\), for \(i = 1, \ldots, m\) curves each measured at evaluation points \(t_{ij}\), \(j = 1, \ldots, n_i\). Each curve is modelled as
\[ y_{ij} = g_i(t_{ij}) + \varepsilon_i(t_{ij}), \qquad g_i(t_{ij}) = \sum_{k=1}^{K} Z_{ki}\,\beta_{ki}\,B_k(t_{ij}) \]
where \(B_k(\cdot)\) are \(K\) unknown basis functions, \(\beta_{ki}\) are the basis coefficients, and \(Z_{ki} \in \{0,1\}\) are inclusion indicators drawn from independent Bernoulli sparsity-inducing priors. The errors \(\varepsilon_i(t)\) follow a zero-mean Gaussian process with Ornstein-Uhlenbeck covariance function:
\[ \psi(t, t') = \sigma^2 \exp\!\left(-w\,|t - t'|\right) \]
with decay parameter \(w > 0\) and variance \(\sigma^2 > 0\). The parameter \(\tau^2\) controls the regularization of the basis coefficients.
The VEM algorithm iterates between:
until convergence or the maximum number of iterations is achieved.
library(fda.vi)
data(toy_curves)
# Fit at a single K
fit <- vem_fit(
y = toy_curves$y,
Xt = toy_curves$Xt,
K = 8,
center = FALSE,
scale = FALSE
)
#>
#> Running VEM Smooth |
#> Basis Type: cubic_bspline | Selection: mean ---
summary(fit)
#> ------------------------------------------------
#> VEM Smooth Fit Summary
#> ------------------------------------------------
#> Basis Type cubic_bspline
#> Curves (m): 3
#> Basis Functions (K): 8
#> Active Bases (per curve): 6, 6, 6
#>
#> Model Parameters (Representative):
#> Point estimate for decay parameter (w): 6.217
#>
#> Posterior q(sigma^2) ~ IG(delta1, delta2):
#> delta1 (shape): 97
#> delta2 (scale): 0.859869
#>
#> Posterior q(tau^2) ~ IG(lambda1, lambda2):
#> lambda1 (shape): 12.001
#> lambda2 (scale): 1517.36
#> ------------------------------------------------The toy_curves dataset contains three simulated curves
generated using \(K = 8\) cubic
B-spline basis functions, with known basis coefficients (basis functions
2 and 5 are not relevant, with corresponding coefficients set to zero),
and Ornstein-Uhlenbeck correlated errors with \(\sigma = 0.1\) and \(w = 6\).
toy_curves Datasetdata(toy_curves)
str(toy_curves)
#> List of 7
#> $ y :List of 3
#> ..$ curve_1: num [1:50] 1.559 1.177 0.737 0.488 0.232 ...
#> ..$ curve_2: num [1:50] 1.5876 1.0917 0.6612 0.3333 0.0734 ...
#> ..$ curve_3: num [1:50] 1.553 1.102 0.804 0.481 0.214 ...
#> $ Xt : num [1:50] 0 0.0204 0.0408 0.0612 0.0816 ...
#> $ true_coef: num [1:8] 1.5 0 -1 0.8 0 -0.5 1.2 -0.9
#> $ K : num 8
#> $ m : num 3
#> $ sigma : num 0.1
#> $ w : num 6The dataset is a named list with three elements:
y: a list of 3 numeric vectors, each of length 50,
containing the observed noisy curve valuesXt: a numeric vector of 50 equally spaced evaluation
points on \([0, 1]\)true_coef: the true basis coefficients used to generate
the data, c(1.5, 0, -1, 0.8, 0, -0.5, 1.2, -0.9) — basis
functions 2 and 5 are not relevant, with corresponding coefficients set
to zeroplot(toy_curves$Xt, toy_curves$y[[1]],
type = "p", pch = 16, cex = 0.6, col = "steelblue",
xlab = "t", ylab = "y(t)", main = "Toy Curves Dataset")
for (i in 2:3) {
points(toy_curves$Xt, toy_curves$y[[i]],
pch = 16, cex = 0.6,
col = c("firebrick", "forestgreen")[i - 1])
}
legend("topright", legend = paste("Curve", 1:3),
col = c("steelblue", "firebrick", "forestgreen"),
pch = 16, bty = "n")When a single integer is passed to K,
vem_fit fits the model directly at that basis size without
GCV tuning:
When a vector of candidate values is passed to K,
vem_fit fits the model at each candidate and selects the
\(K\) minimizing the mean generalized
cross-validation (GCV) score across all curves:
fit_gcv <- vem_fit(
y = toy_curves$y,
Xt = toy_curves$Xt,
K = c(6, 8, 10, 15)
)
#>
#> Running VEM Smooth |
#> Basis Type: cubic_bspline | Selection: mean ---
fit_gcv$best_K
#> [1] 10
fit_gcv$tuning$gcv_matrix
#> K_6 K_8 K_10 K_15
#> [1,] 0.04349140 0.002860308 0.002250097 0.005072672
#> [2,] 0.05563686 0.004690865 0.004270509 0.007115469
#> [3,] 0.05803666 0.003215976 0.002124001 0.004428491Setting selection_metric = "per_curve" selects the best
\(K\) independently for each curve,
returning a composite fit with the results obtained from the optimal fit
per curve:
For periodic functional data, a Fourier basis can be used by setting
basis_type = "fourier":
fit_f <- vem_fit(
y = toy_curves$y,
Xt = toy_curves$Xt,
K = 10,
basis_type = "fourier"
)
#>
#> Running VEM Smooth |
#> Basis Type: fourier | Selection: mean ---
summary(fit_f)
#> ------------------------------------------------
#> VEM Smooth Fit Summary
#> ------------------------------------------------
#> Basis Type fourier
#> Curves (m): 3
#> Basis Functions (K): 10
#> Active Bases (per curve): 3, 4, 4
#>
#> Model Parameters (Representative):
#> Point estimate for decay parameter (w): 0.9096
#>
#> Posterior q(sigma^2) ~ IG(delta1, delta2):
#> delta1 (shape): 100
#> delta2 (scale): 23.9913
#>
#> Posterior q(tau^2) ~ IG(lambda1, lambda2):
#> lambda1 (shape): 15.001
#> lambda2 (scale): 9.5855
#> ------------------------------------------------summary(fit)
#> ------------------------------------------------
#> VEM Smooth Fit Summary
#> ------------------------------------------------
#> Basis Type cubic_bspline
#> Curves (m): 3
#> Basis Functions (K): 8
#> Active Bases (per curve): 6, 6, 6
#>
#> Model Parameters (Representative):
#> Point estimate for decay parameter (w): 6.217
#>
#> Posterior q(sigma^2) ~ IG(delta1, delta2):
#> delta1 (shape): 97
#> delta2 (scale): 0.859869
#>
#> Posterior q(tau^2) ~ IG(lambda1, lambda2):
#> lambda1 (shape): 12.001
#> lambda2 (scale): 1517.36
#> ------------------------------------------------The summary reports:
coef(fit)
#> Curve_1 Curve_2 Curve_3
#> B1 1.5105434 1.6158541 1.5027916
#> B2 0.0000000 0.0000000 0.0000000
#> B3 -1.0173887 -1.1296271 -0.9889452
#> B4 0.4855913 0.7182764 0.8098905
#> B5 0.0000000 0.0000000 0.0000000
#> B6 -0.4088634 -0.7414497 -0.5324928
#> B7 1.0851575 1.1127756 1.2152763
#> B8 -1.0093923 -1.1186148 -0.9470451Returns a \(K \times m\) matrix of
estimated basis coefficients. Inactive basis functions (PIP \(\leq 0.5\)) have their coefficients set to
zero by the sparsity-inducing prior. For toy_curves the
true zeros at positions 2 and 5 should be recovered:
The posterior inclusion probabilities (PIPs) for all basis functions
across all curves are stored in fit$model$prob and can be
inspected directly:
K <- fit$best_K
m <- length(toy_curves$y)
pip_mat <- matrix(fit$model$prob, nrow = K, ncol = m)
rownames(pip_mat) <- paste0("B", 1:K)
colnames(pip_mat) <- paste0("Curve_", 1:m)
round(pip_mat, 3)
#> Curve_1 Curve_2 Curve_3
#> B1 1.000 1 1
#> B2 0.000 0 0
#> B3 1.000 1 1
#> B4 1.000 1 1
#> B5 0.000 0 0
#> B6 0.998 1 1
#> B7 1.000 1 1
#> B8 1.000 1 1Predictions based on the vem_fit object. Returns a list
of length \(m\) (one numeric vector per
curve), where each vector has length equal to the number of evaluation
points requested.
da Cruz, A. C., de Souza, C. P. E., & Sousa, P. H. T. O. (2024). Fast Bayesian basis selection for functional data representation with correlated errors. arXiv:2405.20758. https://arxiv.org/abs/2405.20758
citation("fda.vi")
#> Warning in citation("fda.vi"): could not determine year for 'fda.vi' from
#> package DESCRIPTION file
#> To cite package 'fda.vi' in publications use:
#>
#> Kinsey S, da Cruz A, Toledo Oliveira Sousa P (????). _fda.vi:
#> Functional Data Analysis using Variational Inference_. R package
#> version 1.0.0, <https://github.com/desouzalab/fda.vi>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {fda.vi: Functional Data Analysis using Variational Inference},
#> author = {Stephen Kinsey and Ana Carolina {da Cruz} and Pedro Henrique {Toledo Oliveira Sousa}},
#> note = {R package version 1.0.0},
#> url = {https://github.com/desouzalab/fda.vi},
#> }