% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/acc_univariate_outlier.R
\name{acc_univariate_outlier}
\alias{acc_univariate_outlier}
\title{Identify univariate outliers by four different approaches}
\usage{
acc_univariate_outlier(
  resp_vars = NULL,
  study_data,
  label_col,
  item_level = "item_level",
  exclude_roles,
  n_rules = length(unique(criteria)),
  max_non_outliers_plot = 10000,
  criteria = c("tukey", "3sd", "hubert", "sigmagap"),
  meta_data = item_level,
  meta_data_v2
)
}
\arguments{
\item{resp_vars}{\link{variable list} the name of the continuous measurement
variable}

\item{study_data}{\link{data.frame} the data frame that contains the measurements}

\item{label_col}{\link{variable attribute} the name of the column in the metadata
with labels of variables}

\item{item_level}{\link{data.frame} the data frame that contains metadata
attributes of study data}

\item{exclude_roles}{\link{variable roles} a character (vector) of variable roles
not included}

\item{n_rules}{\link{integer} from=1 to=4. The number of rules that must be violated
to flag a variable as containing outliers. The default is the number of
selected \code{criteria} (4 if all are used), i.e. all.}

\item{max_non_outliers_plot}{\link{integer} from=0. Maximum number of non-outlier
points to be plotted. If more points exist, only a subsample will be
plotted. Note that sampling is not deterministic.}

\item{criteria}{\link{set} tukey | 3SD | hubert | sigmagap. a vector with
methods to be used for detecting outliers.}

\item{meta_data}{\link{data.frame} old name for \code{item_level}}

\item{meta_data_v2}{\link{character} path to workbook-like metadata file, see
\code{\link{prep_load_workbook_like_file}} for details.
\strong{ALL LOADED DATAFRAMES WILL BE PURGED},
using \code{\link{prep_purge_data_frame_cache}},
if you specify \code{meta_data_v2}.}
}
\value{
a list with:
\itemize{
\item \code{SummaryTable}: \code{\link{data.frame}} with the columns
\code{Variables}, \code{Mean}, \code{SD}, \code{Median}, \code{Skewness}, \code{Tukey (N)},
\verb{3SD (N)}, \code{Hubert (N)}, \code{Sigma-gap (N)}, \code{NUM_acc_ud_outlu},
\verb{Outliers, low (N)}, \verb{Outliers, high (N)}, \code{Grading}
\item \code{SummaryData}: \code{\link{data.frame}} with the columns
\code{Variables}, \code{Mean}, \code{SD}, \code{Median}, \code{Skewness}, \code{Tukey (N)},
\verb{3SD (N)}, \code{Hubert (N)}, \code{Sigma-gap (N)}, \code{Outliers (N)},
\verb{Outliers, low (N)}, \verb{Outliers, high (N)}
\item \code{SummaryPlotList}: \code{\link[ggplot2:ggplot]{ggplot2::ggplot}} univariate outlier plots
}
}
\description{
A classical but still popular approach to detecting univariate outliers is the
boxplot method introduced by Tukey (1977). The boxplot is a simple graphical
tool to display information about continuous univariate data (e.g., median,
lower and upper quartile). Outliers are defined as values lying more
than \eqn{1.5 \times IQR} below the 1st (Q25) or above the 3rd (Q75) quartile. The
strength of Tukey's method is that it makes no distributional assumptions
and thus is also applicable to skewed or non-mound-shaped data
(Marsh and Seo, 2006). Nevertheless, this method tends to flag frequently
occurring measurements, which are then falsely interpreted as true outliers.

A somewhat more conservative approach in terms of symmetric and/or normal
distributions is the 3SD approach, i.e. any measurement not in
the interval of \eqn{mean(x) +/- 3 * \sigma} is considered an outlier.
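
As a rough illustration of the two rules described so far, the following
sketch flags values in a numeric vector. The helper names
\code{tukey_flags} and \code{three_sd_flags} are hypothetical and not part of
\code{dataquieR}; the package's own implementation may differ in details such
as the quantile type used.
\preformatted{
# Hypothetical helpers for illustration only (not dataquieR functions).

# Tukey rule: flag values more than 1.5 * IQR outside the quartiles.
tukey_flags <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# 3SD rule: flag values more than 3 standard deviations from the mean.
three_sd_flags <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  abs(x - m) > 3 * s
}

x <- c(1:20, 100)
which(tukey_flags(x))     # positions flagged by the Tukey rule
which(three_sd_flags(x))  # positions flagged by the 3SD rule
}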

Both methods mentioned above are not ideally suited to skewed distributions.
As many biomarkers, such as laboratory measurements, follow skewed
distributions, the methods above may be insufficient. The approach of Hubert
and Vandervieren (2008) adjusts the boxplot for the skewness of the
distribution. This approach is implemented in several R packages; \code{\link{dataquieR}} uses
\code{\link[robustbase:mc]{robustbase::mc}} for this purpose.
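
A minimal sketch of the skewness-adjusted fences, assuming the exponential
adjustment published by Hubert and Vandervieren (2008) with the medcouple as
skewness measure. The helper \code{adjusted_fences} and its \code{mc} argument
are hypothetical; how \code{dataquieR} applies the medcouple internally is not
shown here and may differ.
\preformatted{
# Hypothetical helper for illustration only (not a dataquieR function).
# By default the medcouple is computed with robustbase::mc; a precomputed
# value can be passed instead.
adjusted_fences <- function(x, mc = robustbase::mc(x)) {
  x <- x[!is.na(x)]
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  if (mc >= 0) {
    # right-skewed or symmetric: shorten the lower, extend the upper fence
    c(lower = q[1] - 1.5 * exp(-4 * mc) * iqr,
      upper = q[2] + 1.5 * exp(3 * mc) * iqr)
  } else {
    # left-skewed: extend the lower, shorten the upper fence
    c(lower = q[1] - 1.5 * exp(-3 * mc) * iqr,
      upper = q[2] + 1.5 * exp(4 * mc) * iqr)
  }
}

x <- c(1, 2, 2, 3, 3, 3, 4, 5, 7, 12, 30)
if (requireNamespace("robustbase", quietly = TRUE)) {
  print(adjusted_fences(x))
}
}
With a medcouple of 0 the fences reduce to the ordinary Tukey fences.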

Another completely heuristic approach is also included to identify outliers.
The approach is based on the assumption that the distances between
measurements of the same underlying distribution should be homogeneous. To
understand this approach:
\itemize{
\item consider an ordered sequence of all measurements.
\item between these measurements all distances are calculated.
\item the occurrence of larger distances between two neighboring measurements
may then indicate a distortion of the data. For the heuristic definition of a
large distance, \eqn{1 * \sigma} has been chosen.
}
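
The steps above can be sketched as follows. The description does not state
which values are flagged once a large gap is found, so this sketch assumes
that all values lying beyond a large gap, as seen from the median, count as
outliers. The helper \code{sigma_gap_flags} and its argument \code{k} are
hypothetical and not part of \code{dataquieR}.
\preformatted{
# Hypothetical helper for illustration only (not a dataquieR function).
sigma_gap_flags <- function(x, k = 1) {
  ok <- !is.na(x)
  xs <- sort(x[ok])                         # ordered measurements
  gaps <- diff(xs)                          # distances between neighbors
  big <- which(gaps > k * stats::sd(xs))    # gaps larger than k * sigma
  med <- stats::median(xs)
  # Assumption: a large gap entirely below (above) the median cuts off
  # low (high) outliers; gaps that straddle the median are ignored.
  lower_idx <- big[xs[big + 1] <= med]
  upper_idx <- big[xs[big] >= med]
  lower_cut <- if (length(lower_idx) > 0) max(xs[lower_idx]) else -Inf
  upper_cut <- if (length(upper_idx) > 0) min(xs[upper_idx + 1]) else Inf
  ok & (x <= lower_cut | x >= upper_cut)
}

x <- c(1:20, 100)
which(sigma_gap_flags(x))  # positions beyond a gap larger than one sigma
}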

Note that the plots are not deterministic, because they use
\link[ggplot2:geom_jitter]{ggplot2::geom_jitter}.

\link{Indicator}
}
\details{
\emph{\strong{Hint}}: \emph{The function is designed for unimodal data only.}
}
\section{ALGORITHM OF THIS IMPLEMENTATION:}{
\itemize{
\item Select all variables of type float in the study data
\item Remove missing codes from the study data (if defined in the metadata)
\item Remove measurements deviating from limits defined in the metadata
\item Identify outliers according to the approaches of Tukey (Tukey 1977),
3SD (Saleem et al. 2021), Hubert (Hubert and Vandervieren 2008),
and SigmaGap (heuristic)
\item An output data frame is generated which indicates the number of possible
outliers, the direction of deviations (Outliers, low; Outliers, high) for all methods,
and a summary score which sums up the deviations of the different rules
\item A scatter plot is generated for all examined variables, flagging
observations according to the number of violated rules (step 5).
}
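
The way the individual rules feed into the summary score can be sketched as
below, assuming one logical flag per observation and rule, and assuming that
an observation counts as an outlier when at least \code{n_rules} rules are
violated. The flag matrix is made-up example data, not package output.
\preformatted{
# One row per observation, one logical column per rule (made-up values).
flags <- cbind(
  tukey    = c(FALSE, TRUE,  TRUE),
  sd3      = c(FALSE, FALSE, TRUE),
  hubert   = c(FALSE, FALSE, TRUE),
  sigmagap = c(FALSE, TRUE,  TRUE)
)

n_rules <- 4                         # here: all four rules must be violated
n_violated <- rowSums(flags)         # number of violated rules per observation
is_outlier <- n_violated >= n_rules  # assumption: "at least n_rules"
}
Lowering \code{n_rules} makes the combined flag more sensitive, since fewer
rules need to agree.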
}

\seealso{
\itemize{
\item \link{acc_robust_univariate_outlier}
\item \href{https://dataquality.qihs.uni-greifswald.de/VIN_acc_impl_robust_univariate_outlier.html}{Online Documentation}
}
}
