% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_analysis.R
\name{fit_regression}
\alias{fit_regression}
\title{Search for the best regression model(s)}
\usage{
fit_regression(data, lookup = NULL, census = TRUE, yr, project, scenario,
  popn, param = "N", vs = c("GS1"), count_data = c("Nextant", "Nall",
  "Nalleles", "N", "AM", "AF", "Subadults", "Juv", "nDams", "nBroods",
  "nProgeny", "nImmigrants", "nEmigrants", "nHarvested", "nSupplemented",
  "YrExt", "Alleles"), ic = "aic", l = 1, ncand = 30, set_size = NA,
  save2disk = TRUE, dir_out = "DataAnalysis/FitRegression")
}
\arguments{
\item{data}{The long format of census (from \code{conv_l_yr}) or run (lrun,
the second element) of the output from \code{collate_run}}

\item{lookup}{(Optional) A look-up table where the scenario names are listed
together with the (missing) variables needed to fit the regression models}

\item{census}{Whether the input is census data}

\item{yr}{The year that has to be used in the analysis if census=TRUE}

\item{project}{The Vortex project name}

\item{scenario}{Vortex scenario name}

\item{popn}{The sequential number of the population (in integer)}

\item{param}{The dependent variable}

\item{vs}{Character vector with independent variable(s)}

\item{count_data}{Character vector with param(s) that are counts and would use
a Poisson error distribution}

\item{ic}{Information criterion}

\item{l}{Level for glmulti search: 1 main effects, 2 main effects + interactions}

\item{ncand}{The threshold of candidate models after which switch to the
genetic search method, default: 30}

\item{set_size}{Value to be used in confsetsize (from \code{\link[glmulti]{glmulti}}
 The number of models to be looked for, i.e. the size of the returned confidence
set.)}

\item{save2disk}{Whether to save the output to disk, default: TRUE}

\item{dir_out}{The local path to store the output.
Default: DataAnalysis/FitRegression}
}
\value{
A \code{glmulti} object with the best models found.
}
\description{
\code{fit_regression} fits either a Generalized Linear Model or a betareg model
to the data and search for the best model(s) given a list of predictors using
the R package glmulti.
}
\details{
\code{fit_regression} fits a different type of regression model depending on the
dependent variable. When this is a count (e.g. N or the number of alleles),
the function will fit a Generalized Linear Model. The first fit is attempted
with a Poisson error distribution and if the dispersion parameter (calculated
as \eqn{residual deviance / df} is larger than (the somewhat
arbitrary cut off of) 1.5, the model will be refitted with a quasipoisson
error distribution (a message is displayed if this happens).

\code{fit_regression} establishes whether the dependent variable is a count
by searching for it in \code{count_data}. If the users generated their own
dependent variable (e.g. through a PS), this has to be included in \code{count_data}
to indicate \code{fit_regression} that it is analysing count data.

When the number of alleles is the dependent variable (from .run files), this
is rounded to integer (to meet R requirement that count data are integers)
before a GLM is fitted to the data.

If \code{param} is a proportion (e.g. Gene Diversity and Inbreeding), then
the function uses a Beta regression from the R package betareg (Cribari-Neto
& Zeileis 2010). Different link functions are tested and the one with the
lowest AIC value is selected. The selected link function is displayed on the
R console and the difference in the AIC scores relative to the best link
function is also displayed.

In the initial fit of the model the main and interactions effects are included.

Successively, a search for the best model is carried out. This is performed
with the R package \code{\link[glmulti]{glmulti}} (Calcagno & de Mazancourt 2010).
\code{fit_regression} will conduct an exhaustive search if ncand is less or
equal to the number of candidate models, otherwise it will use a genetic
search method (see glmulti documentations for more details about the search
methods). When \code{\link[glmulti]{glmulti}} uses the genetic search method,
two small files (with extension \code{.modgen.back} and \code{.mods.back.} are
written in the working directory even if \code{save2disk=FALSE}.

\code{fit_regression} explicitly ignores NA.

Depending on the data, fitting several Beta regression models to complete the
search may be a long (and memory hungry) process. Also, the package betareg
has the limitation (at least at the moment of writing) that cannot handle
analysis of data when the dependent variable takes value of either exactly 0
or 1.

See vignette for a more detailed explanation of \code{fit_regression}.
}
\examples{
# Using Pacioni et al. example data. See ?pac.run.lhs and ?pac.lhs for more
# details.
data(pac.run.lhs, pac.lhs)

# Remove base scenario from .stdat data
pac.lhs.no.base <- pac.lhs[!pac.lhs$scen.name == 'ST_LHS(Base)', ]

# Use function lookup_table to obtain correct parameter values at year 0
lkup.ST_LHS <- lookup_table(data=pac.lhs.no.base, project='Pacioni_et_al',
                            scenario='ST_LHS',
                            pop='Population 1',
                            SVs=c('SV1', 'SV2', 'SV3', 'SV4', 'SV5', 'SV6', 'SV7'),
                            save2disk=FALSE)

# Remove base scenario from .run output in long format
lrun.ST_LHS.no.base <- pac.run.lhs[[2]][!pac.run.lhs[[2]]$Scenario == 'ST_LHS(Base)', ]

reg <- fit_regression(data=lrun.ST_LHS.no.base, lookup=lkup.ST_LHS,
                      census=FALSE,
                      project='Pacioni_et_al', scenario='ST_LHS', popn=1,
                      param='N', vs=c('SV1', 'SV2', 'SV3'), l=2,  ncand=30,
                      save2disk=FALSE)

 # Clean up of residual files written by glmulti
 # Note, in some OS (W) these files may be locked because in use by R and have
 # to be manually after the R session has been either terminated or restarted
 file.remove(c('Pacioni_et_al_ST_LHS_N.modgen.back',
               'Pacioni_et_al_ST_LHS_N.mods.back'))

# Example of information you can obtained once you have run fit_regression

# The formula for the best model
bestmodel <- reg@formulas[1]

# The formulae for the best 30 model
bestmodels <- reg@formulas

# List of IC values
qaicvalues <- reg@crits

# QAIC differences between the first 5 best models (stored in 'delta')
delta <- as.vector(NULL)
for (i in 1:5) {
  del <- qaicvalues[i+1] - qaicvalues[i]
  delta <- c(delta, del)
}

# The best model's coefficients
coef.best <- coef(reg@objects[[1]])

# The model averaged coefficients
coef.all <- glmulti::coef.glmulti(reg)
coefs <- data.frame(Estimate=coef.all[,1],
                    Lower=coef.all[,1] - coef.all[,5],
                    Upper=coef.all[,1] + coef.all[,5])

# Plot IC profile
plot(reg, type='p')

# Plot of model averaged importance of terms
plot(reg, type='s')
}
\references{
Calcagno, V., and C. de Mazancourt. 2010. glmulti: an R package
for easy automated model selection with (generalized) linear models. Journal
of Statistical Software 34:1-29.

Cribari-Neto, F., and Zeileis, A. (2010) Beta regression in R. Journal of
Statistical Software 34(2).
}
