% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/standardize.R
\name{standardize}
\alias{standardize}
\title{Standardize a formula and data frame for regression}
\usage{
standardize(formula, data, family = gaussian, scale = 1, offset, ...)
}
\arguments{
\item{formula}{A regression \code{\link[stats]{formula}}.}

\item{data}{A data.frame containing the variables in \code{formula}.}

\item{family}{A regression \code{\link[stats]{family}} (default gaussian).}

\item{scale}{The desired scale for the regression frame. Must be a single
positive number. See 'Details'.}

\item{offset}{An optional \code{\link[stats]{offset}} vector. Offsets can
also be included in the \code{formula} (e.g. \code{y ~ x + offset(o)}), but
if this is done, then the column \code{o} (in this example) must be in any 
data frame passed as the \code{newdata} argument to 
\code{\link[=predict.standardized]{predict}}.}

\item{...}{Currently unused.  If \code{na.action} is specified in \code{...}
and is anything other than \code{na.pass}, a warning is issued and the
argument is ignored.}
}
\value{
A \code{\link[=standardized-class]{standardized}} object. The
  \code{formula}, \code{data}, and \code{offset} elements of the object can 
  be used in calls to regression functions.
}
\description{
Create a \code{\link[=standardized-class]{standardized}} object which places
all variables in \code{data} on the same scale based on \code{formula},
making regression output easier to interpret.
For mixed effects regressions, this also offers computational benefits, and
for Bayesian regressions, it also makes determining reasonable priors easier.
}
\details{
First \code{\link[stats]{model.frame}} is called. Then,
if \code{family = gaussian}, the response is checked to ensure that it is
numeric and has more than two unique values.  If \code{\link{scale_by}} is
used on the response in \code{formula}, then the \code{scale} argument to
\code{scale_by} is ignored and forced to \code{1}.  If \code{\link{scale_by}}
is not called, then \code{\link[base]{scale}} is used with default arguments.
The result is that gaussian responses are on unit scale (i.e. have mean
\code{0} and standard deviation \code{1}), or, if \code{\link{scale_by}} is
used on the left hand side of \code{formula}, unit scale within each
level of the specified conditioning factor.
Offsets in gaussian models are divided by the standard deviation of the
response prior to scaling (within-factor-level if \code{\link{scale_by}}
is used on the response).  In this way, if the transformed offset is added
to the transformed response, and then placed back on the response's original
scale, the result would be the same as if the un-transformed offset had
been added to the un-transformed response.
For all other values of \code{family}, the response and offsets are not checked.
If offsets are used within the \code{formula}, then they will be in the
\code{formula} and \code{data} elements of the \code{\linkS4class{standardized}}
object.  If the \code{offset} argument to the \code{standardize} function is
used, then the offset provided in the argument will be
in the \code{offset} element of the \code{\linkS4class{standardized}} object
(scaled if \code{family = gaussian}).

For the predictors in the formula, any random effects grouping factors
are first coerced to factor, and unused levels are dropped.  The
levels of the resulting factor are then recorded in the \code{groups} element.
Then for the remaining predictors, regardless of their original
class, if they have only two unique non-\code{NA} values, they are coerced
to unordered factors.  Then, \code{\link{named_contr_sum}} and
\code{\link{scaled_contr_poly}} are called for unordered and ordered factors,
respectively, using the \code{scale} argument provided in the call
to \code{standardize} as the \code{scale} argument to the contrast
functions.  For numeric variables, if the variable contains a call to
\code{\link{scale_by}}, then, regardless of whether the call to
\code{\link{scale_by}} specifies \code{scale}, the value of \code{scale}
in the call to \code{standardize} is used.  If the numeric variable
does not contain a call to \code{\link{scale_by}}, then
\code{\link[base]{scale}} is called, ensuring that the result has
standard deviation \code{scale}.

With the default value of \code{scale = 1}, the result is a
\code{\linkS4class{standardized}} object which contains a formula and data
frame (and offset vector if the \code{offset} argument to the 
\code{standardize} function was used) which can be used to fit regressions 
where the predictors are all on a similar scale.  Its data frame
has numeric variables on unit scale, unordered factors with named sum
contrasts, and ordered factors with orthogonal polynomial contrasts
on unit scale.  For gaussian regressions, the response is also placed on
unit scale.  If \code{scale = 0.5} (for example),
then gaussian responses would still
be placed on unit scale, but unordered factors' named sum contrasts would
take on values {-0.5, 0, 0.5} rather than {-1, 0, 1}, the standard deviation
of each column in the contrast matrices for ordered factors would be
\code{0.5} rather than \code{1}, and the standard deviation of numeric
variables would be \code{0.5} rather than \code{1} (within-factor-level
in the case of \code{\link{scale_by}} calls).
}
\section{Note}{
 The \code{\link{scale_by}}
  function is supported so long as it is not nested within other function
  calls.  The \code{\link[stats]{poly}} function is supported so long as
  it is either not nested within other function calls, or is nested as the
  transformation of the numeric variable in a \code{\link{scale_by}} call.
  If \code{\link[stats]{poly}} is used, then the \code{lsmeans} function
  will yield misleading results (as would normally be the case).

  In previous versions of \code{standardize} (v0.2.0 and earlier),
  \code{na.action} could be specified.  Starting with v0.2.1, specifying
  something other than \code{na.pass} is ignored with a warning.  Instead,
  apply \code{na.omit} or \code{na.exclude} when calling regression
  fitting functions on the elements returned in the 
  \code{\link[=standardized-class]{standardized}} object.
}
\examples{
dat <- expand.grid(ufac = letters[1:3], ofac = 1:3)
dat <- as.data.frame(lapply(dat, function(n) rep(n, 60)))
dat$ofac <- factor(dat$ofac, ordered = TRUE)
dat$x <- rpois(nrow(dat), 5)
dat$z <- rnorm(nrow(dat), rep(rnorm(30), each = 18), rep(runif(30), each = 18))
dat$subj <- rep(1:30, each = 18)
dat$y <- rnorm(nrow(dat), -2, 5)

sobj <- standardize(y ~ log(x + 1) + scale_by(z ~ subj) + ufac + ofac +
  (1 | subj), dat)

sobj
sobj$formula
head(dat)
head(sobj$data)
sobj$contrasts
sobj$groups
mean(sobj$data$y)
sd(sobj$data$y)
mean(sobj$data$log_x.p.1)
sd(sobj$data$log_x.p.1)
with(sobj$data, tapply(z_scaled_by_subj, subj, mean))
with(sobj$data, tapply(z_scaled_by_subj, subj, sd))
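
# A base-R sketch (not the package's implementation) of what the two
# tapply() calls above check: after within-group scaling, every level of
# the grouping factor has mean 0 and standard deviation equal to the
# chosen scale.  The demo_* names are illustrative only.
demo_g <- rep(1:5, each = 10)
demo_z <- rnorm(50, mean = demo_g, sd = demo_g)
demo_scale <- 1
demo_scaled <- ave(demo_z, demo_g,
  FUN = function(v) demo_scale * (v - mean(v)) / sd(v))
tapply(demo_scaled, demo_g, mean)
tapply(demo_scaled, demo_g, sd)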

sobj <- standardize(y ~ log(x + 1) + scale_by(z ~ subj) + ufac + ofac +
  (1 | subj), dat, scale = 0.5)

sobj
sobj$formula
head(dat)
head(sobj$data)
sobj$contrasts
sobj$groups
mean(sobj$data$y)
sd(sobj$data$y)
mean(sobj$data$log_x.p.1)
sd(sobj$data$log_x.p.1)
with(sobj$data, tapply(z_scaled_by_subj, subj, mean))
with(sobj$data, tapply(z_scaled_by_subj, subj, sd))
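
# A base-R sketch of the contrast scaling described in 'Details' for
# scale = 0.5.  As an assumption made for illustration, contr.sum and
# contr.poly stand in for named_contr_sum and scaled_contr_poly; this is
# not the package's implementation.  The demo_* names are illustrative only.
demo_scale <- 0.5
demo_sum <- demo_scale * contr.sum(3)
demo_sum  # sum contrasts rescaled by demo_scale
demo_poly <- apply(contr.poly(3), 2, function(v) demo_scale * v / sd(v))
apply(demo_poly, 2, sd)  # column standard deviations after rescaling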

\dontrun{
library(lme4)  # provides lmer
mod <- lmer(sobj$formula, sobj$data)
# this next line causes warnings about contrasts being dropped, but
# these warnings can be ignored (i.e. the statement still evaluates to TRUE)
all.equal(predict(mod, newdata = predict(sobj, dat)), fitted(mod))
}
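
# A base-R sketch of the offset scaling described in 'Details' for gaussian
# models (ungrouped case, not the package's implementation): dividing the
# offset by the response's standard deviation means that adding it to the
# scaled response and back-transforming matches adding the raw offset to
# the raw response.  The demo_* names are illustrative only.
demo_y <- rnorm(100, 10, 3)
demo_o <- runif(100)
demo_y_scaled <- (demo_y - mean(demo_y)) / sd(demo_y)
demo_o_scaled <- demo_o / sd(demo_y)
demo_back <- (demo_y_scaled + demo_o_scaled) * sd(demo_y) + mean(demo_y)
all.equal(demo_back, demo_y + demo_o)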

}
\seealso{
For scaling and contrasts, see \code{\link[base]{scale}},
  \code{\link{scale_by}}, \code{\link{named_contr_sum}}, and
  \code{\link{scaled_contr_poly}}. For putting new data into the same space
  as the standardized data, see \code{\link[=predict.standardized]{predict}}.
  For the elements in the returned object, see
  \code{\linkS4class{standardized}}.
}

