| Type: | Package |
| Title: | Directed Dependence Coefficient |
| Version: | 1.1.0 |
| Maintainer: | Yuping Wang <yuping.wang@plus.ac.at> |
| Description: | Directed Dependence Coefficient (didec) is a measure of functional dependence. Multivariate Feature Ordering by Conditional Independence (MFOCI) is a variable selection algorithm based on didec. Hierarchical Variable Clustering (VarClustPartition) is a variable clustering method based on didec. For more information, see the paper by Ansari and Fuchs (2025, <doi:10.48550/arXiv.2212.01621>), and the paper by Fuchs and Wang (2024, <doi:10.1016/j.ijar.2024.109185>). |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| Imports: | copBasic (≥ 2.2.3), cowplot (≥ 1.1.2), dendextend (≥ 1.17.1), FOCI (≥ 0.1.3), ggplot2 (≥ 3.4.4), graphics (≥ 4.3.0), grDevices (≥ 0.5-1), gtools (≥ 3.9.5), pcaPP (≥ 2.0-5), phylogram (≥ 2.1.0), RANN (≥ 2.6.1), rlang (≥ 1.1.4), stats (≥ 4.3.0) |
| Depends: | R (≥ 3.5) |
| NeedsCompilation: | no |
| Packaged: | 2026-01-30 09:50:04 UTC; yupin |
| Author: | Yuping Wang [aut, cre], Sebastian Fuchs [aut], Jonathan Ansari [aut] |
| Repository: | CRAN |
| Date/Publication: | 2026-02-02 08:30:02 UTC |
Average diameter & Maximum split of every partition of a given dendrogram
Description
Average diameter & Maximum split of every partition of a given dendrogram
Usage
Adiam.Msplit(X, dend = dend, dist.func = "PD", estim.method = c("copula"))
Arguments
X |
a data frame for a set of variables X |
dend |
a dendrogram |
dist.func |
PD / MPD / kendall / footrule |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
Value
a data frame
Estimate for T(Y,X) based on function codec
Description
Estimate for T(Y,X) based on function codec
Usage
Codec.Tq(X, Y)
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
Value
a value
Estimate for T_bar(Y,X) based on function codec & a sample of all / all increasing / all decreasing permutations
Description
Estimate for T_bar(Y,X) based on function codec & a sample of all / all increasing / all decreasing permutations
Usage
Codec.Tq.Perm(X, Y, method = c("sample"))
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
method |
permutation method: sample / increasing / decreasing / full |
Value
a value
Estimate for \xi(Y,X) using codec function
Description
Estimate for \xi(Y,X) using codec function
Usage
CodecCorr(X, y)
Arguments
X |
a data frame for input vector X |
y |
a data frame for output vector Y |
Value
a value
Estimate for T(Y,X) based on dimension reduction principle
Description
Estimate for T(Y,X) based on dimension reduction principle
Usage
Copula.Tq(X, Y)
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
Value
a value
Estimate for T_bar(Y,X) based on dimension reduction principle
Description
Estimate for T_bar(Y,X) based on dimension reduction principle
Usage
Copula.Tq.Perm(X, Y, method = c("sample"))
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
method |
permutation method: sample / increasing / decreasing / full |
Value
a value
Estimate for \xi(Y,X) based on dimension reduction principle
Description
Estimate for \xi(Y,X) based on dimension reduction principle
Usage
CopulaCorr(X, y)
Arguments
X |
a data frame for input vector X |
y |
a data frame for output vector Y |
Value
a value
Markov product estimate from a single (q=1) endogenous and (p>=1) exogenous variables based on dimension reduction
Description
Markov product estimate from a single (q=1) endogenous and (p>=1) exogenous variables based on dimension reduction
Usage
MPhi(X, y)
Arguments
X |
a data frame for input vector X |
y |
a data frame for output vector Y |
Value
a value
Silhouette value for the i-th variable given a variable partition
Description
Silhouette value for the i-th variable given a variable partition
Usage
Silhouette(i, df, partition, dist.func = "PD", estim.method = c("copula"))
Arguments
i |
the index of the variable |
df |
a data frame for all variables |
partition |
a partition |
dist.func |
PD / MPD / kendall / footrule |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
Value
a value for Silhouette
Silhouette coefficients given a dendrogram
Description
Silhouette coefficients given a dendrogram
Usage
Silhouette.coefficient(X, dend, dist.func = "PD", estim.method = c("copula"))
Arguments
X |
a data frame for a set of variables X |
dend |
a dendrogram |
dist.func |
PD / MPD / kendall / footrule |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
Value
a data frame
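The idea of scoring every cut of a dendrogram by a Silhouette coefficient can be sketched in base R. The example below is an illustration under stated assumptions, not the package's implementation: it uses the classical silhouette width (Kaufman, 1990) with the stand-in pairwise dissimilarity 1 - |Pearson correlation| instead of PD / MPD / kendall / footrule, and the helper name silhouette_widths is hypothetical.

```r
set.seed(42)
n <- 200
X1 <- rnorm(n); X2 <- X1 + rnorm(n, sd = 0.05)
X3 <- rnorm(n); X4 <- X3 + rnorm(n, sd = 0.05)
X <- data.frame(X1, X2, X3, X4)

# Assumed stand-in dissimilarity between variables: 1 - |Pearson correlation|.
d <- as.dist(1 - abs(cor(X)))
hc <- hclust(d, method = "complete")

# Classical silhouette width of each variable for a given partition `cl`.
# Hypothetical helper; singleton clusters get the conventional value 0.
silhouette_widths <- function(d, cl) {
  d <- as.matrix(d)
  vapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)
    a <- sum(d[i, own]) / (sum(own) - 1)   # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(vapply(others, function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Score each cut of the dendrogram with at least two and at most p - 1 clusters.
ks <- 2:(ncol(X) - 1)
mean_sil <- sapply(ks, function(k) mean(silhouette_widths(d, cutree(hc, k = k))))
best_k <- ks[which.max(mean_sil)]
clusters <- base::split(names(X), cutree(hc, k = best_k))
```

With two nearly duplicated pairs of variables, the cut into two clusters gives silhouette widths close to 1, while the cut into three clusters produces singletons with width 0, so the mean is maximized at two clusters.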
Hierarchical variable clustering and partition.
Description
VarClustPartition is a hierarchical variable clustering algorithm based on the directed dependence coefficient (didec) or a concordance measure (Kendall's tau or Spearman's footrule) according to a pre-selected number of clusters or an optimality criterion (Adiam&Msplit or Silhouette coefficient).
Usage
VarClustPartition(
X,
trans = FALSE,
trans.method = c("standardization"),
dist.method = c("PD"),
estim.method = c("copula"),
linkage = FALSE,
link.method = c("complete"),
part.method = c("optimal"),
part.criterion = c("Adiam&Msplit"),
num.cluster = NULL,
plot = FALSE
)
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the variables to be clustered. |
trans |
A logical. If |
trans.method |
An optional character string specifying a method for data standardization. This must be one of the strings |
dist.method |
An optional character string computing a distance function for clustering. This must be one of the strings |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient if |
linkage |
A logical. If |
link.method |
An optional character string selecting a linkage method. This must be one of the strings |
part.method |
An optional character string selecting a partitioning method. This must be one of the strings |
part.criterion |
An optional character string selecting a criterion for the optimal partition if |
num.cluster |
An integer value for the pre-selected number of clusters if |
plot |
A logical. If |
Details
VarClustPartition performs a hierarchical variable clustering based on the directed dependence coefficient (didec) and provides a partition of the set of variables.
If dist.method == "PD" (perfect dependence) or dist.method == "MPD" (mutual perfect dependence), the clustering is performed using didec either as a directed ("PD") or as a symmetric ("MPD") dependence coefficient.
If dist.method == "kendall" or dist.method == "footrule", clustering is performed using either multivariate Kendall's tau ("kendall") or multivariate Spearman's footrule ("footrule"). "kendall" uses the function cor.fk, which is provided in the R package pcaPP, to calculate bivariate Kendall's tau.
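For orientation, the bivariate building blocks of the two concordance measures can be sketched in base R. This is a minimal sketch and not the package's code: it uses stats::cor with method = "kendall" as a stand-in for pcaPP::cor.fk, it covers only the bivariate case (the package works with multivariate versions), and the helper names kendall_tau and spearman_footrule are hypothetical. The footrule formula below is the usual sample version and assumes there are no ties.

```r
# Bivariate Kendall's tau via base R (stand-in for pcaPP::cor.fk).
kendall_tau <- function(x, y) cor(x, y, method = "kendall")

# Sample version of bivariate Spearman's footrule (assumes no ties).
spearman_footrule <- function(x, y) {
  n <- length(x)
  1 - 3 * sum(abs(rank(x) - rank(y))) / (n^2 - 1)
}

set.seed(1)
x <- rnorm(30)
y_mono <- exp(x)       # strictly increasing transformation of x
y_indep <- rnorm(30)   # drawn independently of x

tau_mono <- kendall_tau(x, y_mono)
foot_mono <- spearman_footrule(x, y_mono)
tau_indep <- kendall_tau(x, y_indep)
foot_indep <- spearman_footrule(x, y_indep)
```

Both measures equal 1 for the strictly increasing relationship, because they depend on the data only through ranks.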
Instead of using one of the above-mentioned four multivariate measures for the clustering, the option linkage == TRUE enables the use of bivariate linkage methods,
including complete linkage (link.method == "complete"), average linkage (link.method == "average") and single linkage (link.method == "single").
Note that the multivariate distance methods are computationally demanding because higher-dimensional dependencies are included in the calculation, in contrast to linkage methods which only incorporate pairwise dependencies.
A pre-selected number of clusters num.cluster can be realized with the option part.method == "selected".
Otherwise (part.method == "optimal"), the number of clusters is determined by maximizing the intra-cluster similarity (similarity within the same cluster) and minimizing the inter-cluster similarity (similarity among the clusters). Two optimality criteria (Fuchs & Wang 2024) are available:
"Adiam&Msplit": Adiam measures the intra-cluster similarity and Msplit measures the inter-cluster similarity.
"Silhouette": A mixed coefficient incorporating the intra-cluster similarity and the inter-cluster similarity. The optimal number of clusters corresponds to the maximum Silhouette coefficient.
Value
A list containing:
- dendrogram
A dendrogram without colored branches;
- num.cluster
An integer value determining the number of clusters after partitioning;
- clusters
A list containing the clusters after partitioning.
Author(s)
Yuping Wang, Sebastian Fuchs
References
S. Fuchs, Y. Wang, Hierarchical variable clustering based on the predictive strength between random vectors, Int. J. Approx. Reason. 170, Article ID 109185, 2024.
P. Hansen, B. Jaumard, Cluster analysis and mathematical programming, Math. Program. 79 (1) 191–215, 1997.
L. Kaufman, Finding Groups in Data, John Wiley & Sons, 1990.
Examples
library(didec)
n <- 50
X1 <- rnorm(n,0,1)
X2 <- X1
X3 <- rnorm(n,0,1)
X4 <- X3 + X2
X <- data.frame(X1=X1,X2=X2,X3=X3,X4=X4)
vcp <- VarClustPartition(X,
dist.method = c("PD"),
part.method = c("optimal"),
part.criterion = c("Silhouette"),
plot = TRUE)
vcp$clusters
data("bioclimatic")
X <- bioclimatic[c(2:4,9)]
vcp1 <- VarClustPartition(X,
linkage = TRUE,
link.method = c("complete"),
dist.method = "PD",
part.method = "optimal",
part.criterion = "Silhouette",
plot = TRUE)
vcp1$clusters
vcp2 <- VarClustPartition(X,
linkage = TRUE,
link.method = c("complete"),
dist.method = "footrule",
part.method = "optimal",
part.criterion = "Adiam&Msplit",
plot = TRUE)
vcp2$clusters
Bioclimatic variables
Description
A data set of bioclimatic variables for n=1,862 locations homogeneously distributed over the global landmass from CHELSA ("Climatologies at high resolution for the earth’s land surface areas").
Usage
bioclimatic
Format
An object of class data.frame with 1862 rows and 19 columns.
References
D.N. Karger, O. Conrad, J. Böhner, T. Kawohl, H. Kreft, R.W. Soria-Auza, N.E. Zimmermann, H.P. Linder, M. Kessler, Climatologies at high resolution for the Earth's land surface areas, Sci. Data 4(1), 2017.
Examples
data(bioclimatic)
head(bioclimatic)
Cluster a set of variables using distance function based on predictive measure
Description
Cluster a set of variables using distance function based on predictive measure
Usage
clust.Tq(X, estim.method = c("copula"), mutual = FALSE)
Arguments
X |
a data frame for a set of variables X |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
mutual |
whether to use the type B function (mutual perfect dependence) or not |
Value
a list for hierarchical clustering result
Clustering a set of variables using distance function based on multivariate concordance measures
Description
Clustering a set of variables using distance function based on multivariate concordance measures
Usage
clust.concor.M(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
a list for hierarchical clustering result
Estimation for multivariate concordance measures
Description
Estimation for multivariate concordance measures
Usage
concor.M(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
a value of the estimator for the multivariate concordance measures
Read a dendrogram from a list for hierarchical clustering result
Description
Read a dendrogram from a list for hierarchical clustering result
Usage
dendrogram(clust, step = TRUE)
Arguments
clust |
a list for hierarchical clustering result |
step |
whether to use the clustering step as the y-axis or not |
Value
an object of class "dendrogram"
Diameter of a class of variables based on different distance function
Description
Diameter of a class of variables based on different distance function
Usage
diam(X, dist.func = "PD", estim.method = c("copula"))
Arguments
X |
a data frame for a set of variables X |
dist.func |
PD / MPD / kendall / footrule |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
Value
a value
Computes the directed dependence coefficient.
Description
The directed dependence coefficient (didec) estimates the degree of functional dependence of a random vector Y on a random vector X, based on an i.i.d. sample of (X,Y).
Usage
didec(
X,
Y,
trans = FALSE,
trans.method = c("standardization"),
estim.method = c("copula"),
perm = FALSE,
perm.method = c("decreasing")
)
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
Y |
A numeric matrix or data.frame/data.table. Contains the response vector Y. |
trans |
A logical. If |
trans.method |
An optional character string specifying the data standardization method. This must be one of the strings |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. This must be one of the strings |
perm |
A logical. If |
perm.method |
An optional character string specifying a method for permuting the response variables. This must be one of the strings |
Details
The directed dependence coefficient (didec) is an extension of Azadkia & Chatterjee's measure of functional dependence (Azadkia & Chatterjee, 2021) to a vector of response variables introduced in (Ansari & Fuchs, 2025).
estim.method specifies two methods for estimating the directed dependence coefficient. "codec" uses the function codec which estimates Azadkia & Chatterjee’s measure of functional dependence and is provided in the R package FOCI. "copula" estimates the directed dependence coefficient based on a dimension reduction principle; see (Fuchs 2024). The value returned by didec may be positive or negative. In the asymptotic limit, however, it is guaranteed to lie between 0 and 1.
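The finite-sample behaviour described above (values that can fall slightly outside [0, 1] but settle inside it as n grows) is easiest to see on a closely related, simpler coefficient. The sketch below computes Chatterjee's rank correlation for one predictor and one response without ties. It is an illustration only and is neither the "codec" nor the "copula" estimator used by didec; the function name chatterjee_xi is hypothetical.

```r
# Chatterjee's rank correlation for univariate x and y (assumes no ties).
chatterjee_xi <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])   # ranks of y after sorting the pairs by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(123)
n <- 100
x <- rnorm(n)

xi_monotone <- chatterjee_xi(x, x)              # y is an increasing function of x
xi_functional <- chatterjee_xi(x, x^2)          # y is a non-monotone function of x
xi_independent <- chatterjee_xi(x, rnorm(n))    # y drawn independently of x
```

For y = x the ranks along the sorted sample are 1, ..., n, so the estimate equals 1 - 3/(n + 1), which is below 1 for every finite n. For independent data the estimate fluctuates around 0 and may come out negative.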
By definition, didec is invariant under permutations of the variables within the predictor vector X. Invariance under permutations within the q-dimensional response vector Y is achieved by computing the arithmetic mean over all possible permutations.
In addition to the option "full", which runs all q! permutations of (1, ..., q), less computationally intensive options are available: "sample" uses a random selection of q permutations, while "increasing" and "decreasing" use cyclic permutations such as (1,2,...,q), (2,...,q,1).
Note that when the number of variables q is large, choosing "full" may result in long computation times.
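The cost difference between cyclic and full permutation schemes can be sketched as follows. This is a rough illustration under assumptions, not the package's code: "increasing" is taken to mean the cyclic shifts of (1, ..., q) and "decreasing" the cyclic shifts of (q, ..., 1), and the names cyclic_perms, average_over_perms and toy_stat are hypothetical. The toy statistic merely depends on column order and stands in for an order-dependent estimate.

```r
# All q cyclic shifts of (1, ..., q) or of (q, ..., 1) (assumed meaning).
cyclic_perms <- function(q, direction = c("increasing", "decreasing")) {
  direction <- match.arg(direction)
  base <- if (direction == "increasing") seq_len(q) else rev(seq_len(q))
  lapply(seq_len(q), function(s) base[((seq_len(q) + s - 2) %% q) + 1])
}

# Arithmetic mean of a statistic over a list of column permutations of Y.
average_over_perms <- function(Y, perms, stat) {
  mean(vapply(perms, function(p) stat(Y[, p, drop = FALSE]), numeric(1)))
}

# Toy order-dependent statistic: column means weighted by column position.
toy_stat <- function(Y) sum(seq_len(ncol(Y)) * colMeans(Y))

set.seed(7)
Y <- matrix(rnorm(40), ncol = 4)
q <- ncol(Y)

perms_inc <- cyclic_perms(q, "increasing")
perms_dec <- cyclic_perms(q, "decreasing")
avg_inc <- average_over_perms(Y, perms_inc, toy_stat)
avg_dec <- average_over_perms(Y, perms_dec, toy_stat)
n_full <- factorial(q)   # number of evaluations needed by "full"
```

Here q = 4, so a cyclic scheme needs 4 evaluations while "full" would need 24. For this particular toy statistic every column visits every position exactly once across the cyclic shifts, so both averages coincide; that property is specific to the toy example.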
Value
The degree of functional dependence of the random vector Y on the random vector X.
Author(s)
Yuping Wang, Sebastian Fuchs, Jonathan Ansari
References
J. Ansari, S. Fuchs, A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2025.
M. Azadkia, S. Chatterjee, A simple measure of conditional dependence, Ann. Stat. 49 (6), 2021.
S. Fuchs, Quantifying directed dependence via dimension reduction, J. Multivariate Anal. 201, Article ID 105266, 2024.
Distance function based on T
Description
Distance function based on T
Usage
dist.Tq(X, Y, estim.method = c("copula"), mutual = FALSE)
Arguments
X |
a data frame for vector X |
Y |
a data frame for vector Y |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
mutual |
use mutual perfect dependence or not |
Value
a value for distance between two vectors
Distance function based on multivariate concordance measures
Description
Distance function based on multivariate concordance measures
Usage
dist.concor.M(X, Y, method = c("footrule"))
Arguments
X |
a data frame for vector X |
Y |
a data frame for vector Y |
method |
kendall / footrule |
Value
a value for distance between two vectors
Distance Matrix Computation using distance function based on T^q
Description
Distance Matrix Computation using distance function based on T^q
Usage
dist.mat.T(X, estim.method = c("copula"), mutual = FALSE)
Arguments
X |
a data frame for vector X |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
mutual |
use type B function (mutual perfect dependence) or not |
Value
an object of class "dist"
Distance Matrix Computation using distance function based on multivariate concordance measures
Description
Distance Matrix Computation using distance function based on multivariate concordance measures
Usage
dist.mat.concor(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
an object of class "dist"
Multivariate feature ordering by conditional independence.
Description
A variable selection algorithm based on the directed dependence coefficient (didec).
Usage
mfoci(
X,
Y,
trans = FALSE,
trans.method = c("standardization"),
estim.method = c("copula"),
perm = FALSE,
perm.method = c("decreasing"),
pre.selected = NULL,
select.method = c("forward"),
autostop = TRUE,
max.num = NULL
)
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
Y |
A numeric matrix or data.frame/data.table. Contains the response vector Y. |
trans |
A logical. If |
trans.method |
An optional character string specifying a method for data standardization. This must be one of the strings |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient |
perm |
A logical. If |
perm.method |
An optional character string specifying a method for permuting the response variables. This must be one of the strings |
pre.selected |
An integer vector for indexing pre-selected components from predictor X. |
select.method |
An optional character string specifying a feature selection method. This must be one of the strings |
autostop |
A logical. If |
max.num |
An integer for limiting the maximal number of selected variables if |
Details
mfoci involves a forward feature selection algorithm for multiple-outcome data that employs the directed dependence coefficient (didec) at each step.
If autostop == TRUE the algorithm stops at the first non-increasing value of didec, thereby selecting a subset of variables.
Otherwise, all predictor variables are ranked according to their predictive strength measured by didec.
In addition to the forward feature selection algorithm, this function also provides a best subset selection, which can be accomplished by select.method == "subset".
This method selects features by calculating the directed dependence coefficient of all possible feature combinations.
Note that the features selected by this method are not ordered.
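The forward selection logic with and without automatic stopping can be sketched generically. The code below is a simplified illustration, not the implementation of mfoci: the function forward_select is hypothetical, and toy_score is an artificial scoring function that stands in for the directed dependence coefficient of a candidate feature set.

```r
# Greedy forward selection over p features for a set-valued score.
# With autostop = TRUE, stop at the first non-increasing score.
forward_select <- function(p, score, autostop = TRUE) {
  selected <- integer(0)
  values <- numeric(0)
  current <- -Inf
  while (length(selected) < p) {
    candidates <- setdiff(seq_len(p), selected)
    cand_scores <- vapply(candidates,
                          function(j) score(c(selected, j)), numeric(1))
    best <- which.max(cand_scores)
    if (autostop && cand_scores[best] <= current) break
    selected <- c(selected, candidates[best])
    values <- c(values, cand_scores[best])
    current <- cand_scores[best]
  }
  data.frame(index = selected, value = values)
}

# Artificial score (assumption): additive gains minus a size penalty.
w <- c(0.1, 0.5, 0.3, 0.0)
toy_score <- function(S) sum(w[S]) - 0.05 * length(S)

ranked_stop <- forward_select(4, toy_score, autostop = TRUE)   # selects a subset
ranked_all <- forward_select(4, toy_score, autostop = FALSE)   # ranks all features
```

In this toy setting the fourth feature lowers the score, so the run with automatic stopping keeps features 2, 3 and 1 in that order, whereas the run without it appends feature 4 as the last-ranked variable.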
Value
A list containing:
- features
A vector listing all features in X;
- pre.selected.features
A vector listing the pre.selected features in X if pre.selected != NULL;
- selected.features
A data.frame listing the selected and ranked variables and the corresponding values of the directed dependence coefficient if select.method == "forward"; a vector listing the selected features if select.method == "subset";
- valueT
The values of the directed dependence coefficient if select.method == "subset".
Author(s)
Sebastian Fuchs, Jonathan Ansari, Yuping Wang
References
J. Ansari, S. Fuchs, A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2025.
Examples
library(didec)
df <- as.data.frame(bioclimatic)
X <- df[, c(9:12)]
Y <- df[, c(1,8)]
mfoci(X, Y, pre.selected = c(1, 3))
Plot the trade-off of Adiam and Msplit
Description
Plot the trade-off of Adiam and Msplit
Usage
plot_Adiam.Msplit(tradeoff, main = NULL, sub = NULL)
Arguments
tradeoff |
a data frame |
main |
main title |
sub |
sub title |
Value
ggplot
Plot the Silhouette coefficient
Description
Plot the Silhouette coefficient
Usage
plot_Silhouette.coefficient(Silhouette_Index, main = NULL, sub = NULL)
Arguments
Silhouette_Index |
a data frame of Silhouette coefficient |
main |
main title |
sub |
sub title |
Value
ggplot
Plotting a dendrogram with colored branches
Description
Plotting a dendrogram with colored branches
Usage
plot_dendrogram(
dend,
num.cluster = num.cluster,
linkage = FALSE,
ylab = ylab,
cex.lab = 0.6,
cex.axis = 0.6
)
Arguments
dend |
a dendrogram |
num.cluster |
the number of colored branches |
linkage |
logical; if TRUE, the linkage method is used |
ylab |
a string |
cex.lab |
a value |
cex.axis |
a value |
Value
plot
Powerset without empty set
Description
Powerset without empty set
Usage
powerset(s)
Arguments
s |
Value
a list
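Enumerating all non-empty subsets of a set can be sketched in base R with combn. This is one possible way to build such a list and not necessarily how powerset is implemented; the function name nonempty_subsets is hypothetical.

```r
# All non-empty subsets of the vector s, ordered by subset size.
nonempty_subsets <- function(s) {
  idx <- seq_along(s)
  unlist(
    lapply(idx, function(k) {
      combn(idx, k, FUN = function(i) s[i], simplify = FALSE)
    }),
    recursive = FALSE
  )
}

s <- c("a", "b", "c")
subs <- nonempty_subsets(s)
n_subs <- length(subs)   # 2^3 - 1 subsets for a set of size 3
```

A set with p elements has 2^p - 1 non-empty subsets, which is why an exhaustive search over feature combinations becomes expensive as p grows.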
Split of two classes of variables based on different distance function
Description
Split of two classes of variables based on different distance function
Usage
split(X, Y, dist.func = "PD", estim.method = c("copula"))
Arguments
X |
a data frame for a set of variables X |
Y |
a data frame for a set of variables Y |
dist.func |
PD / MPD / kendall / footrule |
estim.method |
An optional character string specifying a method for estimating the directed dependence coefficient. |
Value
a value