Type: | Package |
Title: | Robust and Flexible Model-Based Clustering for Data Sets with Missing Values at Random |
Version: | 3.0.5 |
Description: | Implementations of various robust and flexible model-based clustering methods for data sets with missing values at random. Two main models are: Multivariate Contaminated Normal Mixture (MCNM, Tong and Tortora, 2022, <doi:10.1007/s11634-021-00476-1>) and Multivariate Generalized Hyperbolic Mixture (MGHM, Wei et al., 2019, <doi:10.1016/j.csda.2018.08.016>). Mixtures via some special or limiting cases of the multivariate generalized hyperbolic distribution are also included: Normal-Inverse Gaussian, Symmetric Normal-Inverse Gaussian, Skew-Cauchy, Cauchy, Skew-t, Student's t, Normal, Symmetric Generalized Hyperbolic, Hyperbolic Univariate Marginals, Hyperbolic, and Symmetric Hyperbolic. Funding: This work was partially supported by the National Science foundation NSF Grant NO. 2209974. |
Imports: | mvtnorm (≥ 1.1-2), mnormt (≥ 2.0.2), cluster (≥ 2.1.2), MASS (≥ 7.3), numDeriv (≥ 8.1.1), Bessel (≥ 0.6.0), mclust (≥ 5.0.0), mice (≥ 3.10.0) |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
LazyData: | true |
Repository: | CRAN |
RoxygenNote: | 7.3.3 |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2025-10-23 15:33:12 UTC; hungt |
Author: | Hung Tong [aut, cre], Cristina Tortora [aut, ths, dgs] |
Maintainer: | Hung Tong <hungtongmx@gmail.com> |
Date/Publication: | 2025-10-23 15:50:10 UTC |
Multivariate Contaminated Normal Mixture (MCNM)
Description
Carries out model-based clustering using a multivariate contaminated normal mixture (MCNM). The function automatically determines whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).
Usage
MCNM(
X,
G,
criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
max_iter = 20,
epsilon = 0.01,
init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
clusters = NULL,
eta_min = 1.001,
progress = TRUE
)
Arguments
X |
An |
G |
An integer vector specifying the numbers of clusters, which must be at least 1. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize
the EM algorithm. "kmedoids" clustering is used by default. Alternative
methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen,
a vector |
clusters |
(optional) A numeric vector of length |
eta_min |
(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
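The Aitken-based stopping rule controlled by epsilon can be sketched in base R as follows. This is one common form of the criterion and only an assumption about how it is applied here; the package may use a different variant. The helper name aitken_converged is hypothetical and is not a package function.

```r
# Aitken-based convergence check on a vector of log-likelihood values.
# `aitken_converged` is a hypothetical helper, not part of MixtureMissing.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)             # three values are needed to estimate the rate
  l0 <- loglik[k - 2]; l1 <- loglik[k - 1]; l2 <- loglik[k]
  if (l1 == l0) return(TRUE)           # the log-likelihood has stalled
  a <- (l2 - l1) / (l1 - l0)           # Aitken acceleration
  if (!is.finite(a) || a >= 1) return(FALSE)
  l_inf <- l1 + (l2 - l1) / (1 - a)    # estimated asymptotic log-likelihood
  abs(l_inf - l1) < epsilon
}

# Toy sequence that converges geometrically to -100
ll <- -100 - 10 * 0.5^(1:12)
aitken_converged(ll[1:10])   # FALSE
aitken_converged(ll[1:11])   # TRUE
```

A larger epsilon stops the EM algorithm earlier, so it trades precision of the estimates for speed, much like a smaller max_iter.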
Details
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
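As a rough illustration of how several of these criteria relate to the log-likelihood, the sketch below uses their textbook forms, in which smaller values are better. The helper info_criteria is hypothetical, and the sign and scaling conventions used by the package may differ.

```r
# Textbook forms of a few information criteria (smaller is better here).
# `info_criteria` is a hypothetical helper; the package's conventions may differ.
info_criteria <- function(loglik, npar, n) {
  c(
    AIC  = -2 * loglik + 2 * npar,
    BIC  = -2 * loglik + npar * log(n),
    AIC3 = -2 * loglik + 3 * npar,
    CAIC = -2 * loglik + npar * (log(n) + 1),
    AICc = -2 * loglik + 2 * npar + 2 * npar * (npar + 1) / (n - npar - 1)
  )
}

# Illustrative values: final log-likelihood, number of parameters, sample size
ic <- info_criteria(loglik = -250, npar = 11, n = 150)
round(ic, 2)
```

All five criteria share the term -2 * loglik and differ only in how strongly they penalize the number of parameters.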
Value
An object of class MixtureMissing
with:
model |
The model used to fit the data set. |
pi |
Mixing proportions. |
mu |
Component location vectors. |
Sigma |
Component dispersion matrices. |
alpha |
Component proportions of good observations. |
eta |
Component degrees of contamination. |
z_tilde |
An |
v_tilde |
An |
clusters |
A numeric vector of length |
outliers |
A logical vector of length |
data |
The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations. |
complete |
An |
npar |
The breakdown of the number of parameters to estimate. |
max_iter |
Maximum number of iterations allowed in the EM algorithm. |
iter_stop |
The actual number of iterations needed when fitting the data set. |
final_loglik |
The final value of log-likelihood. |
loglik |
All the values of log-likelihood. |
AIC |
Akaike information criterion. |
BIC |
Bayesian information criterion. |
KIC |
Kullback information criterion. |
KICc |
Corrected Kullback information criterion. |
AIC3 |
Modified AIC. |
CAIC |
Bozdogan's consistent AIC. |
AICc |
Small-sample version of AIC. |
ent |
Entropy. |
ICL |
Integrated Completed Likelihood criterion. |
AWE |
Approximate weight of evidence. |
CLC |
Classification likelihood criterion. |
init_method |
The initialization method used in model fitting. |
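The roles of alpha and eta can be sketched with a single contaminated normal component: a proportion alpha of observations comes from N(mu, Sigma), and the remainder comes from N(mu, eta * Sigma) with eta > 1. The base-R code below is a minimal sketch under that definition; the helper names dmvn and prob_good are hypothetical, and the 0.5 threshold for flagging outliers is an assumption rather than a statement about the package internals.

```r
# Density of a d-variate normal distribution, using stats::mahalanobis().
dmvn <- function(x, mu, Sigma) {
  d <- length(mu)
  exp(-0.5 * mahalanobis(x, mu, Sigma)) / sqrt((2 * pi)^d * det(Sigma))
}

# Probability that an observation is "good" within one contaminated normal component.
prob_good <- function(x, mu, Sigma, alpha, eta) {
  good <- alpha * dmvn(x, mu, Sigma)               # uncontaminated part
  bad  <- (1 - alpha) * dmvn(x, mu, eta * Sigma)   # inflated-covariance part
  good / (good + bad)
}

mu <- c(0, 0)
Sigma <- diag(2)
x <- rbind(c(0, 0), c(5, 5))   # a central point and a remote point
v <- prob_good(x, mu, Sigma, alpha = 0.9, eta = 10)
v < 0.5                        # FALSE TRUE: only the remote point is flagged
```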
References
Punzo, A. and McNicholas, P.D., 2016. Parsimonious mixtures of multivariate
contaminated normal distributions. Biometrical Journal, 58(6), pp.1506-1537.
Tong, H. and Tortora, C., 2022. Model-based clustering and outlier detection
with missing data. Advances in Data Analysis and Classification.
Examples
data('auto')
#++++ With no missing values ++++#
X <- auto[, c('engine_size', 'city_mpg', 'highway_mpg')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
#++++ With missing values ++++#
X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
Multivariate Generalized Hyperbolic Mixture (MGHM)
Description
Carries out model-based clustering using a multivariate generalized hyperbolic mixture (MGHM). The function automatically determines whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).
Usage
MGHM(
X,
G,
model = c("GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H", "SH"),
criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
max_iter = 20,
epsilon = 0.01,
init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
clusters = NULL,
outlier_cutoff = 0.95,
deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07), r
= 6, v = 2, show.details = FALSE),
progress = TRUE
)
Arguments
X |
An |
G |
An integer vector specifying the numbers of clusters, which must be at least 1. |
model |
A string indicating the mixture model to be fitted; "GH" for generalized hyperbolic by default. See the details section for a list of available distributions. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize
the EM algorithm. "kmedoids" clustering is used by default. Alternative
methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen,
a vector |
clusters |
(optional) A vector of length |
outlier_cutoff |
(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture. |
deriv_ctrl |
(optional) A list containing arguments to control the numerical
procedures for calculating the first and second derivatives. Some values are
suggested by default. Refer to functions |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
Details
Besides the generalized hyperbolic distribution, the function can fit mixtures via its special and limiting cases. Available distributions include
GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
Value
An object of class MixtureMissing
with:
model |
The model used to fit the data set. |
pi |
Mixing proportions. |
mu |
Component location vectors. |
Sigma |
Component dispersion matrices. |
beta |
Component skewness vectors. Only available if |
lambda |
Component index parameters. Only available if |
omega |
Component concentration parameters. Only available if |
df |
Component degrees of freedom. Only available if |
z_tilde |
An |
clusters |
A numeric vector of length |
outliers |
A logical vector of length |
data |
The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations. |
complete |
An |
npar |
The breakdown of the number of parameters to estimate. |
max_iter |
Maximum number of iterations allowed in the EM algorithm. |
iter_stop |
The actual number of iterations needed when fitting the data set. |
final_loglik |
The final value of log-likelihood. |
loglik |
All the values of log-likelihood. |
AIC |
Akaike information criterion. |
BIC |
Bayesian information criterion. |
KIC |
Kullback information criterion. |
KICc |
Corrected Kullback information criterion. |
AIC3 |
Modified AIC. |
CAIC |
Bozdogan's consistent AIC. |
AICc |
Small-sample version of AIC. |
ent |
Entropy. |
ICL |
Integrated Completed Likelihood criterion. |
AWE |
Approximate weight of evidence. |
CLC |
Classification likelihood criterion. |
init_method |
The initialization method used in model fitting. |
References
Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions.
Canadian Journal of Statistics, 43(2):176–198.
Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic
distributions and mixtures of skew-t distributions for model-based clustering
with incomplete data. Computational Statistics & Data Analysis, 130:18–41.
Examples
data('bankruptcy')
#++++ With no missing values ++++#
X <- bankruptcy[, 2:3]
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
#++++ With missing values ++++#
set.seed(1234)
X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
US Cost of Living Indices in 2019 Data Set
Description
The data set contains the 2019 cost of living indices of 50 states in five different categories: grocery, housing, transportation, utilities, and miscellaneous (Washington DC is not included). The indices are calculated by first determining the average cost of living in the United States to be used as a baseline set at 100. States are then measured against this baseline. For example, a state with a cost of living index of 200 is twice as expensive as the national average.
Usage
UScost
Format
A data frame with 50 rows and 7 variables. There are no missing values.
- Abbr
State abbreviation.
- State
State name.
- Grocery
Grocery index.
- Housing
Housing index.
- Utilities
Utilities index.
- Transportation
Transportation index.
- Misc
Miscellaneous index.
Source
https://worldpopulationreview.com
Automobile Data Set
Description
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, and -3 that it is probably pretty safe.
Usage
auto
Format
A data frame with 205 rows and 26 variables. The first 15 variables are continuous, while the last 11 variables are categorical. There are 45 rows with missing values.
- normalized_losses
continuous from 65 to 256.
- wheel_base
continuous from 86.6 to 120.9.
- length
continuous from 141.1 to 208.1.
- width
continuous from 60.3 to 72.3.
- height
continuous from 47.8 to 59.8.
- curb_weight
continuous from 1488 to 4066.
- engine_size
continuous from 61 to 326.
- bore
continuous from 2.54 to 3.94.
- stroke
continuous from 2.07 to 4.17.
- compression_ratio
continuous from 7 to 23.
- horsepower
continuous from 48 to 288.
- peak_rpm
continuous from 4150 to 6600.
- city_mpg
continuous from 13 to 49.
- highway_mpg
continuous from 16 to 54.
- price
continuous from 5118 to 45400.
- symboling
-3, -2, -1, 0, 1, 2, 3.
- make
alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
- fuel_type
diesel, gas.
- aspiration
std, turbo.
- num_doors
four, two.
- body_style
hardtop, wagon, sedan, hatchback, convertible.
- drive_wheels
4wd, fwd, rwd.
- engine_location
front, rear.
- engine_type
dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
- num_cylinders
eight, five, four, six, three, twelve, two.
- fuel_system
1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
Source
Kibler, D., Aha, D.W., & Albert, M. (1989). Instance-based prediction of real-valued attributes. Computational Intelligence, Vol 5, 51–57. https://archive.ics.uci.edu/ml/datasets/automobile
Bankruptcy Data Set
Description
The data set contains the ratio of retained earnings (RE) to total assets and the ratio of earnings before interest and taxes (EBIT) to total assets for 66 American firms. Half of the selected firms had filed for bankruptcy.
Usage
bankruptcy
Format
A data frame with 66 rows and 3 variables:
- Y
Status of the firm: 0 for bankruptcy and 1 for financially sound.
- RE
Ratio of retained earnings.
- EBIT
Ratio of earnings before interest and taxes.
Source
Altman E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4): 589-609 https://www.jstor.org/stable/2978933
Binary Classification Evaluation
Description
Evaluate the performance of a classification model by comparing its predicted labels to the true labels. Various metrics are returned to give insight into how well the model classifies the observations. This function is included to aid the evaluation of outlier detection by MCNM and MtM when the true outliers are known in advance.
Usage
evaluation_metrics(true_labels, pred_labels)
Arguments
true_labels |
A 0-1 or logical vector denoting the true labels. The meaning of 0 and 1 (or FALSE and TRUE) is up to the user. |
pred_labels |
A 0-1 or logical vector denoting the predicted labels. The meaning of 0 and 1 (or FALSE and TRUE) is up to the user. |
Value
A list with the following slots:
matr |
The confusion matrix built upon true labels and predicted labels. |
TN |
True negative. |
FP |
False positive (type I error). |
FN |
False negative (type II error). |
TP |
True positive. |
TPR |
True positive rate (sensitivity). |
FPR |
False positive rate. |
TNR |
True negative rate (specificity). |
FNR |
False negative rate. |
precision |
Precision or positive predictive value (PPV). |
accuracy |
Accuracy. |
error_rate |
Error rate. |
FDR |
False discovery rate. |
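The definitions behind these slots can be sketched in base R. The code below assumes that 1 marks the positive class (for example, an outlier), which is a choice made here for illustration; it computes the metrics directly rather than calling evaluation_metrics.

```r
# Assumption: 1 denotes the positive class (e.g., an outlier), 0 the negative class.
true_labels <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1)
pred_labels <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)

# Cells of the confusion matrix
TP <- sum(true_labels == 1 & pred_labels == 1)
FP <- sum(true_labels == 0 & pred_labels == 1)
TN <- sum(true_labels == 0 & pred_labels == 0)
FN <- sum(true_labels == 1 & pred_labels == 0)

# Rates derived from the four counts
metrics <- c(
  TPR        = TP / (TP + FN),   # sensitivity
  FPR        = FP / (FP + TN),
  TNR        = TN / (TN + FP),   # specificity
  FNR        = FN / (FN + TP),
  precision  = TP / (TP + FP),
  accuracy   = (TP + TN) / length(true_labels),
  error_rate = (FP + FN) / length(true_labels),
  FDR        = FP / (FP + TP)
)
round(metrics, 3)
```

Each pair (TPR, FNR), (TNR, FPR), (precision, FDR), and (accuracy, error_rate) sums to 1, which is a quick sanity check on any confusion matrix.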
Examples
#++++ Inputs are 0-1 vectors ++++#
evaluation_metrics(
true_labels = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
pred_labels = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)
#++++ Inputs are logical vectors ++++#
evaluation_metrics(
true_labels = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
pred_labels = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)
Extractor function for MixtureMissing
Description
Extract values from MixtureMissing
objects or from outputs of
select_mixture.
Usage
extract(
object,
what = c("model", "parameters", "cluster", "posterior", "outlier", "missing",
"imputed", "complete", "information"),
criterion = c("AIC", "BIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
m_code = NULL
)
Arguments
object |
A |
what |
The specific value to be extracted. See the return section for possible values. |
criterion |
If |
m_code |
Only used in the case when |
Details
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
Value
One of the following, depending on what:
- what = "model"
A data frame showing the component distribution and its abbreviation, number of clusters, and whether the data set is complete or incomplete.
- what = "parameters"
A list containing the relevant parameters.
- what = "cluster"
A numeric vector of length n indicating cluster memberships determined by the model.
- what = "posterior"
An n by G matrix where each row indicates the expected probabilities that the corresponding observation belongs to each cluster.
- what = "outlier"
A logical vector of length n indicating observations that are outliers. Only available if model is CN or t; NULL otherwise with a warning.
- what = "missing"
A data frame showing how many observations (cases) have missing values and the number of missing values per variable.
- what = "imputed"
The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations.
- what = "complete"
An n by d logical matrix indicating which cells have no missing values.
- what = "information"
A data frame showing the number of clusters, final observed log-likelihood value, number of parameters, and desired information criteria.
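The link between the "posterior" and "cluster" outputs can be sketched as follows, assuming that memberships are assigned to the cluster with the highest posterior probability (a maximum a posteriori rule, which is an assumption here). The matrix z is made-up illustrative data, not package output.

```r
# Hypothetical 3 by 2 posterior matrix: rows are observations, columns are clusters.
z <- rbind(
  c(0.90, 0.10),
  c(0.20, 0.80),
  c(0.55, 0.45)
)

# Assumed rule: each observation joins the cluster with the largest probability.
clusters <- apply(z, 1, which.max)
clusters   # 1 2 1
```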
Examples
#++++ With no missing values ++++#
X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "model")
extract(mod, what = "parameters")
extract(mod, what = "cluster")
#++++ With missing values ++++#
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "outlier")
extract(mod, what = "missing")
extract(mod, what = "imputed")
Missing-Data Pattern Generation
Description
Generate all possible missing patterns in a multivariate data set. The function
can be used to complement the function ampute() from package mice, in which a
matrix of patterns is needed to allow for general missing-data patterns under
the missing at random (MAR) mechanism. Using this function, each observation
can have more than one missing value.
Usage
generate_patterns(d)
Arguments
d |
The number of variables or columns of the data set. |
Details
An observation cannot have all of its values missing, and a complete observation
does not count as a missing-data pattern. Note that a large value of d may
result in a memory allocation error.
Value
A matrix where 0 indicates that a variable should have missing values
and 1 indicates that a variable should remain complete. This matrix has d
columns and 2^d - 2 rows.
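The count of 2^d - 2 rows follows from listing every 0/1 combination of d variables and dropping the all-missing and all-complete rows. The base-R sketch below illustrates this; patterns_sketch is a hypothetical helper, and its row order need not match that of generate_patterns.

```r
# Enumerate all 0/1 patterns of length d, then drop the two excluded rows.
# `patterns_sketch` is a hypothetical helper, not the package function.
patterns_sketch <- function(d) {
  grid <- as.matrix(expand.grid(rep(list(0:1), d)))
  n_observed <- rowSums(grid)
  keep <- n_observed > 0 & n_observed < d   # exclude all-missing and all-complete
  unname(grid[keep, , drop = FALSE])
}

p <- patterns_sketch(4)
dim(p)   # 14 4
```

Because the number of rows doubles with each added variable, the matrix grows quickly as d increases.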
Examples
generate_patterns(4)
#++++ To use with the function ampute() from package mice ++++#
library(mice)
patterns_matr <- generate_patterns(4)
data_missing <- ampute(iris[1:4], prop = 0.5, patterns = patterns_matr)$amp
Missing Values Generation
Description
A convenient function that randomly introduces missing values to an at-least-bivariate data set. The user can specify either the proportion of observations that contain some missing values or the exact number of observations that contain some missing values. Note that the function does not guarantee that the underlying missing-data mechanism is missing at random (MAR).
Usage
hide_values(X, prop_cases = 0.1, n_cases = NULL)
Arguments
X |
An |
prop_cases |
(optional) Proportion of observations that contain some missing values.
|
n_cases |
(optional) Number of observations that contain some missing values.
|
Details
If subject to missingness, an observation can have at least 1 and at
most ncol(X) - 1 missing values. Depending on the data set, the number of
rows with missing values in the resulting matrix is not guaranteed to match
the specified proportion.
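The constraint of between 1 and ncol(X) - 1 missing values per affected row can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's sampling scheme; hide_sketch is a hypothetical helper.

```r
# Randomly blank out between 1 and d - 1 cells in each of n_cases rows.
# `hide_sketch` is a hypothetical helper, not the package function.
hide_sketch <- function(X, n_cases) {
  X <- as.matrix(X)
  d <- ncol(X)
  rows <- sample.int(nrow(X), n_cases)   # rows that will receive missing values
  for (i in rows) {
    k <- sample.int(d - 1, 1)            # number of cells to hide in this row
    X[i, sample.int(d, k)] <- NA         # hide k randomly chosen cells
  }
  X
}

set.seed(1)
X_miss <- hide_sketch(iris[, 1:4], n_cases = 20)
n_missing <- rowSums(is.na(X_miss))
sum(n_missing > 0)   # 20 rows contain missing values
max(n_missing)       # never exceeds ncol(X) - 1 = 3
```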
Value
The original n by d matrix or data frame with missing values.
Examples
set.seed(1234)
hide_values(iris[1:4])
hide_values(iris[1:4], prop_cases = 0.5)
hide_values(iris[1:4], n_cases = 80)
Cluster Initialization using a Heuristic Method
Description
Initialize cluster memberships and component parameters to start the EM algorithm using a heuristic clustering method or user-defined labels.
Usage
initialize_clusters(
X,
G,
init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
clusters = NULL
)
Arguments
X |
An |
G |
The number of clusters, which must be at least 1. If |
init_method |
(optional) A string specifying the method to initialize
the EM algorithm. "kmedoids" clustering is used by default. Alternative
methods include "kmeans", "hierarchical", "mclust", and "manual". When
"manual" is chosen, a vector |
clusters |
A numeric vector of length |
Details
Available heuristic methods include k-medoids clustering, k-means clustering,
and hierarchical clustering. Alternatively, the user can enter pre-specified
cluster memberships, making other initialization methods possible. If the given
data set contains missing values, only observations with complete records will
be used to initialize clusters. However, in this case, except when G = 1, the
resulting cluster memberships will be set to NULL since they represent those
complete records rather than the original data set as a whole.
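Once labels are available, initial component parameters can be obtained from per-cluster sample statistics. The base-R sketch below illustrates the idea with user-defined labels; the G by d layout of mu_hat and the list layout of Sigma_hat are choices made here for illustration and may not match the structure returned by initialize_clusters.

```r
# Initial parameters from a given labeling (here, the iris species).
X <- as.matrix(iris[, 1:4])
labels <- as.integer(iris$Species)
G <- max(labels)

# Mixing proportions: share of observations in each cluster
pi_hat <- as.numeric(table(labels)) / nrow(X)

# Component means: one row per cluster (assumed layout)
mu_hat <- t(sapply(1:G, function(g) colMeans(X[labels == g, , drop = FALSE])))

# Component covariance matrices: one d by d matrix per cluster (assumed layout)
Sigma_hat <- lapply(1:G, function(g) cov(X[labels == g, , drop = FALSE]))

pi_hat
dim(mu_hat)
```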
Value
A list with the following slots:
pi |
Component mixing proportions. |
mu |
A |
Sigma |
A |
clusters |
A numeric vector with values from 1 to |
References
Everitt, B., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. John Wiley & Sons.
Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an
introduction to cluster analysis, volume 344. John Wiley & Sons.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering
algorithm. Applied Statistics, 28, 100-108. doi: 10.2307/2346830.
Examples
#++++ Initialization using a heuristic method ++++#
set.seed(1234)
init <- initialize_clusters(iris[1:4], G = 3)
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'kmeans')
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'hierarchical')
#++++ Initialization using user-defined labels ++++#
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'manual',
clusters = as.numeric(iris$Species))
#++++ Initial parameters and pairwise scatterplot showing the mapping ++++#
init$pi
init$mu
init$Sigma
init$clusters
pairs(iris[1:4], col = init$clusters, pch = 16)
Mean Imputation
Description
Replace missing values in a data set with the mean of the other observed values.
Usage
mean_impute(X)
Arguments
X |
An |
Value
A complete data matrix with missing values imputed accordingly.
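A minimal base-R sketch of mean imputation is given below. It assumes that each missing value is replaced by the mean of the observed values in the same column, which is the usual definition but an assumption about this function; mean_impute_sketch is a hypothetical helper.

```r
# Column-wise mean imputation (assumed behavior, for illustration only).
mean_impute_sketch <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[!miss, j])   # mean of the observed values in column j
  }
  X
}

X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
  NA,  2,  2,
   3, NA,  5,
   4,  3,  2,
  NA, NA,  3,
   7,  2, NA,
  NA,  4,  2
))
X_imp <- mean_impute_sketch(X)
X_imp[1, 1]   # 14 / 3, the mean of 3, 4, and 7
```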
References
Schafer, J. L. and Graham, J. W. (2002). Missing data: our view of the state of the art.
Psychological Methods, 7(2):147–177.
Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data.
Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 3rd edition.
Examples
X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
NA, 2, 2,
3, NA, 5,
4, 3, 2,
NA, NA, 3,
7, 2, NA,
NA, 4, 2
))
mean_impute(X)
MixtureMissing Plotting
Description
Provide four model-based clustering plots for a MixtureMissing object. The options
include (1) pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles;
(2) pairwise scatter plots highlighting in red observations whose values are missing but are replaced by
expectations obtained in the EM algorithm; (3) parallel plot of up to the first 10 variables of a multivariate
data set; and (4) plots of estimated density in the form of contours. A single or multiple options
can be specified. In the latter case, interactive mode will be triggered for the user to choose.
Usage
## S3 method for class 'MixtureMissing'
plot(
x,
what = c("classification", "missing", "parallel", "density"),
nlevels = 15,
drawlabels = TRUE,
addpoints = TRUE,
cex.point = 1,
cex.axis = 1,
cex.labels = 2,
lwd = 1,
col_line = "gray",
...
)
Arguments
x |
A |
what |
A string or a character vector specifying the desired plots. See the details section for a list of available plots. |
nlevels |
Number of contour levels desired; 15 by default. |
drawlabels |
Contour levels are labelled if |
addpoints |
Colored points showing cluster memberships are added if |
cex.point |
A numerical value giving the amount by which data points should be magnified relative to the default. |
cex.axis |
The magnification to be used for axis annotation. |
cex.labels |
A numerical value to control the character size of variable labels. |
lwd |
The contour line width, a positive number, defaulting to 1. |
col_line |
The color of contour; "gray" by default. |
... |
Arguments to be passed to methods, such as graphical parameters. |
Details
The plots that can be retrieved include
- what = "classification"
Pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles.
- what = "missing"
Pairwise scatter plots highlighting in red observations whose values are missing but are replaced by expectations obtained in the EM algorithm.
- what = "parallel"
Parallel plot of up to the first 10 variables of a multivariate data set.
- what = "density"
Plots of estimated density in the form of contours.
Value
No return value, called to visualize the fitted model's results.
Examples
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MCNM(X, G = 2, max_iter = 10)
plot(mod, what = 'classification')
Print for MixtureMissing
Description
Print a MixtureMissing object.
Usage
## S3 method for class 'MixtureMissing'
print(x, ...)
Arguments
x |
A |
... |
Further arguments passed to or from other methods. |
Details
The description includes information on the complete or incomplete data, number of clusters, and component distribution.
Value
No return value, called to print the fitted model's description.
Examples
#++++ With no missing values ++++#
X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)
#++++ With missing values ++++#
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)
Mixture Model Selection
Description
Fit mixtures via various distributions and decide the best model based on a given information criterion. The distributions include the multivariate contaminated normal, the multivariate generalized hyperbolic, and special and limiting cases of the multivariate generalized hyperbolic.
Usage
select_mixture(
X,
G,
model = c("CN", "GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H",
"SH"),
criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
max_iter = 20,
epsilon = 0.01,
init_method = c("kmedoids", "kmeans", "hierarchical", "manual"),
clusters = NULL,
eta_min = 1.001,
outlier_cutoff = 0.95,
deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07), r
= 6, v = 2, show.details = FALSE),
progress = TRUE
)
Arguments
X |
An |
G |
The number of clusters, which must be at least 1. If |
model |
A vector of character strings indicating the mixture model(s) to be fitted. All distributions are considered by default. See the details section for a list of available distributions. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize
the EM algorithm. "kmedoids" clustering is used by default. Alternative
methods include "kmeans", "hierarchical", and "manual". When "manual" is chosen,
a vector |
clusters |
(optional) A vector of length |
eta_min |
(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. This is only relevant for the CN mixture. |
outlier_cutoff |
(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture. |
deriv_ctrl |
(optional) A list containing arguments to control the numerical
procedures for calculating the first and second derivatives. Some values are
suggested by default. Refer to functions |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
Details
The function can fit mixtures via the contaminated normal distribution, generalized hyperbolic distribution, and special and limiting cases of the generalized hyperbolic distribution. Available distributions include
CN - Contaminated Normal
GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
Value
A list with
best_mod |
An object of class |
all_mod |
A list of objects of class |
criterion |
A numeric vector containing the chosen information criterion values of all models of consideration. The vector is in the order of best-to-worst models. |
Each object of class MixtureMissing has slots that depend on the fitted model. See
the returned values of MCNM and MGHM.
References
Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions.
Canadian Journal of Statistics, 43(2):176–198.
Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic
distributions and mixtures of skew-t distributions for model-based clustering
with incomplete data. Computational Statistics & Data Analysis, 130:18–41.
Examples
data('bankruptcy')
#++++ With no missing values ++++#
X <- bankruptcy[, 2:3]
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)
#++++ With missing values ++++#
set.seed(1234)
X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)
Summary for MixtureMissing
Description
Summarizes the main information regarding a MixtureMissing object.
Usage
## S3 method for class 'MixtureMissing'
summary(object, ...)
Arguments
object |
A |
... |
Arguments to be passed to methods, such as graphical parameters. |
Details
Information includes the model used to fit the data set, initialization method, clustering table, total outliers, outliers per cluster, mixing proportions, component means and variances, final log-likelihood value, and information criteria.
Value
No return value, called to summarize the fitted model's results.
Examples
#++++ With no missing values ++++#
X <- auto[, c('horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
#++++ With missing values ++++#
X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)