| Type: | Package |
| Title: | Multivariate Joint Grid Discretization |
| Version: | 0.3.2 |
| Date: | 2025-12-12 |
| Depends: | R (≥ 3.5.0) |
| Author: | Jiandong Wang [aut],
Sajal Kumar |
| Maintainer: | Joe Song <joemsong@nmsu.edu> |
| Description: | Discretize multivariate continuous data using a grid that captures the joint distribution by preserving clusters in the original data. It can handle both labeled and unlabeled data. Both published methods (Wang et al. 2020) <doi:10.1145/3388440.3412415> and new methods are included. Joint grid discretization can prepare data for model-free inference of association, function, or causality. |
| Imports: | Rcpp, Ckmeans.1d.dp, cluster, fossil, dqrng, mclust, Rdpack, plotrix |
| Suggests: | FunChisq, knitr, testthat (≥ 2.1.0), rmarkdown |
| RdMacros: | Rdpack |
| License: | LGPL (≥ 3) |
| Encoding: | UTF-8 |
| LinkingTo: | BH, Rcpp |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | yes |
| VignetteBuilder: | knitr |
| Packaged: | 2025-12-12 13:19:38 UTC; joesong |
| Repository: | CRAN |
| Date/Publication: | 2025-12-12 13:40:07 UTC |
Cluster Multivariate Data
Description
The function obtains clusters from data using the given number of clusters, which may be a range.
Usage
cluster(data, k, method, noise)
Arguments
data |
input continuous multivariate data |
k |
the number(s) of clusters |
method |
the method for clustering |
noise |
whether to add jitter noise to the data |
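As a rough illustration of clustering over a range of k, the sketch below picks the number of clusters by mean silhouette width, using stats::kmeans() and cluster::silhouette(). It is only an assumption about how a "kmeans+silhouette" style selection can work, not the package's actual implementation. The toy data and the names ks, scores, and best_k are hypothetical.

```r
# Sketch: choose k from a range by mean silhouette width.
# Assumes the 'cluster' package is installed; data and names are hypothetical.
set.seed(42)
centers <- c(0, 10, 20)
data <- cbind(
  x = unlist(lapply(centers, function(m) rnorm(30, mean = m, sd = 0.5))),
  y = unlist(lapply(centers, function(m) rnorm(30, mean = m, sd = 0.5)))
)

ks <- 2:6                      # candidate numbers of clusters
d <- dist(data)                # pairwise Euclidean distances
scores <- sapply(ks, function(k) {
  fit <- kmeans(data, centers = k, nstart = 50)
  sil <- cluster::silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])     # average silhouette width for this k
})
best_k <- ks[which.max(scores)]
best_k
```

With three well-separated groups, the silhouette criterion should favor the k that matches the number of groups.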
Discretize Multivariate Continuous Data by Cluster-Preserving Grid
Description
Discretize multivariate continuous data using a grid that captures the joint distribution by preserving clusters in the original data.
Usage
discretize.jointly(
data,
k = c(2:10),
min_level = 1,
max_level = 100,
cluster_method = c("Ball+BIC", "kmeans+silhouette", "PAM"),
grid_method = c("DP approx likelihood 1-way", "DP approx likelihood 2-way",
"DP exact likelihood", "DP Compressed majority", "DP", "Sort+split",
"MultiChannel.WUC"),
eval_method = c("ARI", "purity", "upsllion", "CAIR"),
cluster_label = NULL,
cutoff = 0,
entropy = FALSE,
noise = FALSE,
dim_reduction = FALSE,
scale = FALSE,
variance = 0.5,
nthread = 1
)
Arguments
data |
a numeric matrix for multivariate data or a numeric vector for univariate data. In case of a matrix, columns are continuous variables; rows are observations. |
k |
either an integer, an integer vector,
or |
min_level |
an integer or an integer vector, to specify the minimum number of levels
along each dimension. If a vector of size |
max_level |
an integer or an integer vector, to specify the maximum
number of levels along each dimension. It works in the
same way as |
cluster_method |
a character string to specify a clustering
method to be used. Ignored if
|
grid_method |
a character string to specify a grid
discretization method. Default:
|
eval_method |
a character string to specify a method to evaluate quality of discretized data. |
cluster_label |
a vector of labels for each data point or
observation. It can be class labels on the input |
cutoff |
a numeric value. A grid line is added only when the
quality of the line is not smaller than |
entropy |
a logical to choose either entropy
( |
noise |
a logical to apply jitter noise to original
data if |
dim_reduction |
a logical to turn on/off
dimension reduction. Default: |
scale |
a logical to specify linear
scaling of the variable in each dimension
if |
variance |
a numeric value to specify noise variance to be added to the data |
nthread |
an integer to specify number of CPU threads to use. Automatically adjusted if invalid or exceeding available cores. |
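To make the idea of evaluating discretized data against cluster labels concrete, here is a minimal sketch of purity, one of the listed eval_method options. It uses the common textbook definition (the fraction of points that carry the majority cluster label of their grid cell). The package's exact computation may differ, and the toy vectors cell and clabel are hypothetical.

```r
# Sketch: purity of a discretization with respect to cluster labels.
# 'cell' is a hypothetical joint grid cell id for each observation;
# 'clabel' is a hypothetical cluster label for each observation.
cell   <- c("a", "a", "a", "b", "b", "c", "c", "c")
clabel <- c(1, 1, 2, 2, 2, 3, 3, 1)

tab <- table(cell, clabel)           # counts of labels within each cell
majority <- apply(tab, 1, max)       # size of the majority label per cell
purity <- sum(majority) / length(cell)
purity
```

A purity of 1 would mean that every grid cell contains points from a single cluster only.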
Details
The function implements both published algorithms described in (Wang et al. 2020) and new algorithms for multivariate discretization.
The included grid discretization methods can be summarized into three categories:
By Density
- "Sort+split" (Wang et al. 2020) sorts clusters by mean in each dimension. It then splits a consecutive pair of clusters only if the sum of the error rates of the clusters is less than or equal to 50%. It is possible that no grid line will be added in a certain dimension. The maximum number of lines is the number of clusters minus one.
By SSE (Sum of Squared Errors)
- "MultiChannel.WUC" splits each dimension by the weighted within-cluster sum of squared distances, computed by Ckmeans.1d.dp::MultiChannel.WUC(). It is applied to the projection of the data onto each dimension. The channel of each point is defined by its multivariate cluster label.
- "DP" orders labels by data in each dimension and then cuts the data into a maximum of max_level bins. It evaluates the quality of each cut to find the best number of bins.
- "DP Compressed majority" orders labels by data in each dimension. It then compresses runs of neighboring points that share the same label, which avoids discretizing within consecutive points of the same cluster and greatly reduces the runtime of dynamic programming. Then it cuts the data into a maximum of max_level bins and evaluates the quality of each cut by the majority label of the data to find the best number of bins.
By cluster likelihood
- "DP exact likelihood" orders labels by data in each dimension. It then compresses runs of neighboring points that share the same label, which avoids discretizing within consecutive points of the same cluster and greatly reduces the runtime of dynamic programming. Then it cuts the data into a maximum of max_level bins.
- "DP approx likelihood 1-way" is a sped-up version of the "DP exact likelihood" method, but it is not always optimal.
- "DP approx likelihood 2-way" is a bidirectional variant of the "DP approx likelihood 1-way" method. It performs approximate dynamic programming in both the forward and backward directions and selects the better of the two results. This approach provides additional robustness compared to the one-directional version, but optimality is not always achieved.
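The label compression step shared by several of the DP methods above can be sketched in base R with order() and rle(). This is an illustrative reading of "compresses labels neighbored by the same label", not the package's internal code. The toy data are hypothetical, and placing candidate cuts at midpoints between runs is an assumption.

```r
# Sketch: compress cluster labels along one dimension into runs.
# 'x' is one hypothetical dimension; 'labels' are hypothetical cluster labels.
x      <- c(0.1, 0.4, 0.5, 2.1, 2.3, 2.2, 4.0, 4.4)
labels <- c(1, 1, 1, 2, 2, 2, 3, 3)

ord  <- order(x)                 # order observations along this dimension
xs   <- x[ord]
runs <- rle(labels[ord])         # runs of identical neighboring labels

# Position of the last point of every run except the final one
ends <- cumsum(runs$lengths)
ends <- ends[-length(ends)]

# Candidate cut points lie only between runs (midpoints are an assumption)
candidates <- (xs[ends] + xs[ends + 1]) / 2
candidates
```

Eight points collapse into three runs here, so a dynamic program would only need to consider two candidate cuts instead of seven.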
Value
A list that contains four items:
D |
a matrix of discretized values from original |
grid |
a list of numeric vectors of decision boundaries for each variable/dimension. |
clabels |
a vector of cluster labels for each observation in |
csimilarity |
a similarity score between clusters from joint discretization
|
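The relation between grid and D can be pictured with base R's findInterval(): each variable's decision boundaries split its values into consecutive levels. The sketch below is an assumption about that relation, not the package's code. The boundaries are hypothetical, and numbering levels from 1 and sending a value equal to a boundary to the upper level are conventions chosen here for illustration.

```r
# Sketch: map continuous values to discrete levels from decision boundaries.
# 'data' and 'grid' are hypothetical stand-ins for the input and the
# per-dimension boundaries described above.
data <- cbind(x = c(0.2, 1.7, 3.9, 2.5), y = c(10, 35, 20, 5))
grid <- list(c(1.0, 3.0), c(15))

D <- sapply(seq_len(ncol(data)), function(j) {
  findInterval(data[, j], grid[[j]]) + 1L   # levels numbered from 1
})
colnames(D) <- colnames(data)
D
```

A dimension with m boundaries yields at most m + 1 levels, so x has three levels and y has two in this toy case.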
Note
The default grid_method has changed
from "Sort+split" (Wang et al. 2020) (up to released package version 0.1.0.2)
to "DP approx likelihood 1-way" (since version 0.3.2),
representing a major improvement.
Author(s)
Jiandong Wang, Sajal Kumar, and Mingzhou Song
References
Wang J, Kumar S, Song M (2020). “Joint Grid Discretization for Biological Pattern Discovery.” In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. ISBN 9781450379649, doi:10.1145/3388440.3412415.
See Also
See Ckmeans.1d.dp for discretizing univariate continuous data.
Examples
# using a specified k
x = rnorm(100)
y = sin(x)
z = cos(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=5)$D
# using a range of k
x = rnorm(100)
y = log1p(abs(x))
z = tan(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=c(3:10))$D
# using k = Inf
x = c()
y = c()
mns = seq(0,1200,100)
for(i in 1:12){
x = c(x,runif(n=20, min=mns[i], max=mns[i]+20))
y = c(y,runif(n=20, min=mns[i], max=mns[i]+20))
}
data = cbind(x, y)
discretized_data = discretize.jointly(data, k=Inf)$D
# using an alternative clustering method instead of k-means
library(cluster)
x = rnorm(100)
y = log1p(abs(x))
z = sin(x)
data = cbind(x, y, z)
# pre-cluster the data using partitioning around medoids (PAM)
cluster_label = pam(x=data, diss = FALSE, metric = "euclidean", k = 5)$clustering
discretized_data = discretize.jointly(data, cluster_label = cluster_label)$D
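A further sketch, using only arguments listed in Usage, selects a non-default grid_method and inspects the returned grid along with D. The toy data are hypothetical, and the call is guarded so that the snippet still runs when the package is not installed.

```r
# Sketch: choose a grid method explicitly and inspect the result.
# The toy data are hypothetical: three well-separated 2-D blobs.
set.seed(1)
x <- c(rnorm(30, 0, 0.5), rnorm(30, 5, 0.5), rnorm(30, 10, 0.5))
y <- c(rnorm(30, 0, 0.5), rnorm(30, 5, 0.5), rnorm(30, 10, 0.5))
data <- cbind(x, y)

# Run the package call only if GridOnClusters is available
if (requireNamespace("GridOnClusters", quietly = TRUE)) {
  res <- GridOnClusters::discretize.jointly(
    data, k = 3, grid_method = "DP exact likelihood"
  )
  print(res$grid)                        # decision boundaries per dimension
  print(table(res$D[, 1], res$D[, 2]))   # cross-tabulated discrete levels
}
```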
Generate Simulated Data
Description
Generate Simulated Data
Usage
gen_simdata(cord, sim_table, noise = 0.3, plot = FALSE)
Arguments
cord |
data matrix that records the index for each cluster on each dimension |
sim_table |
a matrix |
noise |
a numeric value to specify noise level |
plot |
a logical to turn on or off plotting |
Plotting Grid on Continuous Data
Description
Plots discretized data based on a grid that preserves clusters in the original data.
Usage
## S3 method for class 'GridOnClusters'
plot(
x,
xlab = NULL,
ylab = NULL,
main = NULL,
main.table = NULL,
col,
line_col = "black",
cex = 1.125,
sub = NULL,
pch = 19,
plot.table = TRUE,
...
)
Arguments
x |
the result generated by discretize.jointly |
xlab |
the horizontal axis label |
ylab |
the vertical axis label |
main |
the title of the clustering scatter plots |
main.table |
the title of the discretized data plots |
col |
the color of data points |
line_col |
the color of grid lines |
cex |
A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. |
sub |
the subtitle |
pch |
the symbol for points on the scatter plots |
plot.table |
a logical to show the contingency
table. Default: |
... |
additional graphical parameters |
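The contingency table mentioned under plot.table can be thought of as a cross-tabulation of the discretized levels of two variables. The sketch below builds such a table with base R's table(). It assumes a two-column matrix of discrete levels like the D item returned by discretize.jointly. The values are hypothetical, and the plot method may lay out its table differently.

```r
# Sketch: contingency table of two discretized variables.
# 'D' is a hypothetical matrix of discrete levels (rows are observations).
D <- cbind(x = c(1, 1, 2, 2, 3, 3, 3),
           y = c(1, 1, 2, 2, 2, 3, 3))

tab <- table(x = D[, "x"], y = D[, "y"])   # counts per joint grid cell
tab
```

Each entry counts the observations that fall into one joint grid cell, so the entries sum to the number of observations.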
Deprecated: Please use plot() instead
Description
Plots examples of jointly discretizing continuous data based on grids that preserve clusters in the original data.
Usage
plotGOCpatterns(data, res)
Arguments
data |
the input continuous data matrix |
res |
the result generated by discretize.jointly |