The base functions of optimall allow users to
efficiently determine optimum allocation, select samples, and split
strata during the design of a multi-wave stratified sampling survey.
Despite these features, an efficient sampling workflow in R still
requires the user to manually organize the many moving parts of the
process including design data, a list of samples, data extracted from
the samples, and merged data for each wave. When a sampling design
involves many waves, these parts can be difficult to keep track of and
may be prone to errors. Even more, it may be difficult to go back and
reproduce results at the end of a long sampling process.
This vignette describes an addition to optimall called
the multiwave object, which stores the metadata, design, samples, and
merged data from each step of the multi-wave sampling process in an
accessible format. It is optional for the user to use, but it offers the
advantages of automatic organization, efficient compatibility with other
functions in optimall, and an option for a summary of the
sampling design to be printed at any point. It is recommended that user
thoroughly reads the introductory package vignette titled “Using
Optimall” before reading this vignette.
This addition contributes towards optimall's goal of
being a tool to streamline the often cumbersome aspects of the
multi-wave sampling workflow.
The multiwave object uses the S4 class system, which makes it slightly more complicated to work with in R. Fortunately, all of these potential complications are dealt with in the package design and will be unseen by the user. Here is what the user will see:
The multiwave object contains three S4 classes:
Wave: The Wave class (green in figure above) holds the metadata, design, samples, sampled data, and data for a single wave in multi-wave sampling.
Phase: The Phase class (orange in figure above) holds the metadata and a list of Wave objects for each phase.
Multiwave: The Multiwave class (dark blue in figure above) contains the metadata and a list of Phase objects. It holds the entire sampling design and is the class that the user will interact with the most.
The light blue objects in the figure above represent the slots that hold the survey information directly:
Metadata: The metadata slot holds an initially
empty list. Relevant information can be added as named elements,
including titles, data dictionaries, and arguments for
optimall functions.
Design: The design slot holds the data frame
specifying the sampling design of the current wave. In order for it to
be used with sample_strata(), it must contain at least one
column holding the strata names and one column holding the number of
samples to be selected from each strata. It is typically the output of
optimum_allocation() or
allocate_wave().
Samples: The samples slot contains a character vector of the sample ids that were selected in a specific wave.
Sampled Data: The sampled data slot contains a data frame holding the data collected in a specific wave.
Data: The data slot contains a data frame with
the full data as it gets updated for each wave (or phase for phase 1).
It is typically the sampled data merged with the (full) data of the
previous wave. In this way, the data in the data slot of a study’s
ultimate sampling wave is the full study data that can be used for
analysis with the survey package.
To work with the multiwave object, users only need to know a few functions. This section describes how to initialize an object, how to access and write slots of it, and how to deploy some of its useful features. This will be demonstrated in the context of the same two-phase, multiwave sampling design used in the package vignette.
A multiwave object is initialized by the function
new_multiwave():
We now have an object that can hold all of the information for our two-phase survey for which the second phase will be conducted over three waves. Note that the length of the ’waves‘ argument must match the number of phases. Phase 1 will almost always have one wave.
As is standard for an S4 object in R, components of the multiwave
object are stored in slots. To access and write slots of the multiwave
object, the user could use @ and $:
#To access overall metadata
MySurvey@metadata
#> list()
#To write overall metadata. We may want to include a title.
MySurvey@metadata <- list(title = "Maternal Weight Survey")
#To access Phase 2 metadata
MySurvey@phases$phase2@metadata
#> list()
#To access Phase 2, Wave 2 design
MySurvey@phases$phase2@waves$wave2@design
#> data frame with 0 columns and 0 rowsbut this is overly complicated and potentially unstable. Instead, any
slot of the multiwave object can be accessed or written using the
function get_data():
#To access overall metadata
get_data(MySurvey, phase = NA, slot = "metadata")
#> $title
#> [1] "Maternal Weight Survey"
#To write overall metadata
get_data(MySurvey, phase = NA, slot = "metadata") <- list(title = "Maternal Weight Survey")
#To access Phase 2 metadata
get_data(MySurvey, phase = 2, slot = "metadata")
#> list()
#To access Phase 2, Wave 2 design
get_data(MySurvey, phase = 2, wave = 2, slot = "design")
#> data frame with 0 columns and 0 rowsNote that calls to get data from phase 1 do not require
wave to be specified, since phase 1 only consists of one
wave.
Let’s suppose that our phase 1 data has been collected. We can add it to the phase 1 data slot of our object and include any relevant metadata:
head(phase1)
#>                    new_strata old_strata   id mat_weight_est  race diabetes
#> 1  Black.MWC_est_(9.75,15.06]      Black 5586      12.176368 Black        0
#> 2  Black.MWC_est_(9.75,15.06]      Black 7322      11.774911 Black        0
#> 3  Black.MWC_est_(9.75,15.06]      Black 3602      10.649515 Black        0
#> 4  Black.MWC_est_(9.75,15.06]      Black 2734      12.026532 Black        0
#> 5 Black.MWC_est_(15.06,38.46]      Black  581      16.828686 Black        0
#> 6 Black.MWC_est_[-30.21,9.75]      Black 6531       9.163454 Black        1
#>   obesity
#> 1       0
#> 2       0
#> 3       0
#> 4       1
#> 5       0
#> 6       0
get_data(MySurvey, phase = 1, slot = "data") <- phase1
#Make Phase 1 data dict
phase1_data_dictionary <- data.frame(
  "Variable" = c( "id", "race", "mat_weight_est", "diabetes", "obesity"),
  "Description" = c("unique identifier", 
                    "race of mother",
                    "error-prone estimate of maternal weight change 
                    during pregnancy",
                    "1/0 indicator for diabetes in the mother during 
                    pregnancy",
                    "1/0 indicator for childhood obesity in child"))
head(phase1_data_dictionary)
#>         Variable
#> 1             id
#> 2           race
#> 3 mat_weight_est
#> 4       diabetes
#> 5        obesity
#>                                                                             Description
#> 1                                                                     unique identifier
#> 2                                                                        race of mother
#> 3 error-prone estimate of maternal weight change \n                    during pregnancy
#> 4       1/0 indicator for diabetes in the mother during \n                    pregnancy
#> 5                                          1/0 indicator for childhood obesity in child
get_data(MySurvey, phase = 1, slot = "metadata") <- list(data_dict = phase1_data_dictionary)At any point during the multi-wave sampling workflow,
optimall allows users to view a diagram of the structure of
their survey with multiwave_diagram():
Notice that the title of the survey, “Maternal Weight Survey”, was found from the overall survey metadata and that boxes are colored based on whether they have been filled yet. Slots that have been filled are blue and contain a short description of their contents. This function enables users to track their progress during a multiwave sampling survey.
optimall functions with fewer argumentsAnother advantage of the multiwave object is that the primary
functions of optimall such as
optimum_allocation(), allocate_wave(), and
sample_strata() can be called on the object using the
function apply_multiwave(), The
apply_multiwave() function takes the standard arguments to
the function as well as phase and wave, which
are used to determine the input dataframe(s) and the slot of the object
where the output should be placed. If the arguments, including names of
columns (which tend to be repetitive when used without the multiwave
object framework), are specified in the metadata, the function will find
them itself, allowing calls to the function to be much simpler. It is
recommended that users gain familiarity with the basic uses of these
functions on dataframes before using them on multiwave objects.
When working with a multiwave object, a new function called
merge_samples() also becomes available. This allows users
to quickly and efficiently merge the sampled data with the
previous wave’s data.
To demonstrate, we return to the example of sampling sepal width from the iris dataset with an adaptive, multi-wave design. Suppose that we have collected data on sepal length, petal width, and petal length for all 150 iris plants in phase 1, but we have not yet collected any of the ”expensive” sepal width variable. We can start by placing the data in the appropriate slot of our multiwave object:
# Initialize Multiwave
IrisSurvey <- new_multiwave(phases = 2, waves = c(1,3))
# Add id column to iris dataset
iris <- cbind(datasets::iris, id = 1:150)
# To place iris data in Phase 1
get_data(IrisSurvey, phase = 1, slot = "data") <-
    subset(iris, select = -Sepal.Width)Now we want to begin our first wave of sampling sepal width in phase
2. Since we expect that sepal length is correlated with our variable of
interest, we decide to x-allocate the first wave of samples using
integer-valued Neyman allocation on the inexpensive sepal length
variable. Since we are working in the multiwave object framework, we can
use apply_multiwave() to apply the
optimum_allocation() function:
IrisSurvey <- apply_multiwave(IrisSurvey, phase = 2, wave = 1,
                            fun = "optimum_allocation",
                            strata = "Species", y = "Sepal.Length",
                            nsample = 30, method = "WrightII")Since "strata" will be "Species" for every
wave, we may instead move that argument to the phase metadata so that we
don’t have to repetitively specify the same argument in every function
call:
Now we no longer have to specify strata in the function
call:
IrisSurvey <- apply_multiwave(IrisSurvey, phase = 2, wave = 1,
                            fun = "optimum_allocation",
                            y = "Sepal.Length",
                            nsample = 30, method = "WrightII")In the absence of a specific strata argument, apply_multiwave() turns
to the wave, phase, and then overall metadata to look for the missing
argument. In this case, it finds strata = "Species" in the
phase metadata. By specifying the phase and
wave in the function call, we are telling
optimum_allocation() to use the most recent version of the data (the
full data from the previous wave, or in this case, the phase 1 data) as
input and to output the results in the corresponding slot of the
specified wave. As such, both calls to apply_multiwave()
output an updated multiwave object with the results of
optimum_allocation() in the phase 2, wave 1
"design" slot:
get_data(IrisSurvey, phase = 2, wave = 1, slot = "design")
#>       strata npop   sd  n_sd stratum_fraction stratum_size
#> 1     setosa   50 0.35 17.62             0.23            7
#> 2 versicolor   50 0.52 25.81             0.33           10
#> 3  virginica   50 0.64 31.79             0.43           13The allocate_wave() function can be applied to multiwave
objects in the same manner and will be demonstrated in the following
examples section. After these functions have been used to specify a
”design” data frame (or a manually created design data frame has been
placed in the ”design” slot to implement a different allocation
strategy), we can use apply_multiwave() to apply
sample_strata() and select the ids to sample for the given
sampling wave:
IrisSurvey <- apply_multiwave(IrisSurvey, phase = 2, wave = 1,
                            fun = "sample_strata", id = "id",
                            design_strata = "strata",
                            n_allocated = "stratum_size")Note that we did not have to specify the data or
design_data as we do in the standard version of
split_strata() because they are extracted using the
phase and wave arguments. We also did not have
to specify the strata argument again because it was
available in the phase metadata. The result of this call to
apply_multiwave() is an updated IrisSurvey
with a character vector of ids to sample in the ”samples” slot:
get_data(IrisSurvey, phase = 2, wave = 1, slot = "samples")
#>  [1] "11"  "21"  "25"  "26"  "27"  "28"  "33"  "54"  "57"  "67"  "68"  "69" 
#> [13] "75"  "80"  "85"  "89"  "98"  "103" "106" "113" "115" "129" "131" "132"
#> [25] "134" "136" "137" "140" "141" "144"When working with a multiwave object, a new function called
merge_samples() also becomes available. This function
allows users to quickly and efficiently merge the sampled data with the
previous wave’s data. Suppose that we have collected the sepal width for
these 30 plant ids and placed the data in the “sampled_data” slot of
phase 2, wave 1 of IrisSurvey:
get_data(IrisSurvey, phase = 2, wave = 1, slot = "sampled_data") <-
  iris[iris$id %in% get_data(IrisSurvey,
                             phase = 2,
                             wave = 1,
                             slot = "samples"),
       c("id", "Sepal.Width")]We can call merge_samples() to smoothly merge the
sampled_data of the current wave with the (full)
data from the previous wave and place the output in the
data slot of the current wave:
IrisSurvey <- merge_samples(IrisSurvey, phase = 2, wave = 1,
                          id = "id", sampled_ind = "sampled_phase2")In the "data" slot of phase 2, wave 1, we will now have
an updated dataframe with all of the phase 1 data and a sepal width
column that and has the sampled data for the selected ids and
NA values for the rest of the plants. There is also a new
column called "sampled_phase2" that holds an indicator for
which samples have been sampled in phase 2 thus far:
head(get_data(IrisSurvey, phase = 2, wave = 1, slot = "data"))
#>   Sepal.Length Petal.Length Petal.Width Species id Sepal.Width sampled_phase2
#> 1          5.1          1.4         0.2  setosa  1          NA              0
#> 2          4.9          1.4         0.2  setosa  2          NA              0
#> 3          4.7          1.3         0.2  setosa  3          NA              0
#> 4          4.6          1.5         0.2  setosa  4          NA              0
#> 5          5.0          1.4         0.2  setosa  5          NA              0
#> 6          5.4          1.7         0.4  setosa  6          NA              0Calls to merge_samples() in later waves of phase 2 will
update the phase sampled indicator each time. The
sample_strata function will use this column to ensure that
units sampled in previous waves are not selected again. The utility of
merge_samples() and apply_multiwave() are
demonstrated further below.
optimall with a Multiwave
ObjectIn this section, we demonstrate an example of how the multiwave object can be useful in the adaptive multi-wave sampling workflow. In this example, we will create and execute Wave 1 of Phase 2 of our sample.
The first step of Phase 2, Wave 1 is to specify the metadata for both
the phase and wave. Note that can be useful to specify a title,
description, function arguments, and anything else relevant to the
survey in the metadata, but nothing is required to be
specified. If no function arguments are found in the metadata, they must
instead be specified during the call to apply_multiwave().
In this example we choose to specify a description and function
arguments in the wave metadata:
# Metadata for Phase 2 including description,
# and column names to be used in function calls.
# Note that each element name corresponds to at least one argument of a 
# function that will be called later on in the multi-wave workflow.
get_data(MySurvey, phase = 2, slot = "metadata") <- 
  list(description = "Phase 2 of Maternal Weight Survey in which we
       seek to validate 750 samples across three waves.",
       strata = "new_strata", # strata column in data (used in multiple funcs)
       id = "id", # name of id column (used in sample_strata and merge_samples)
       y = "mat_weight_true", # col for which to minimize variance 
                                  # (used in optimum_allocation)
       design_strata = "strata", # strata column in designs (used for sample_strata)
       n_allocated = "n_to_sample" # n allocated to strata in designs
                                    # (used for sample_strata)
       )
# Then, metadata for Wave 1 of Phase 2 
get_data(MySurvey, phase = 2, wave = 1, slot = "metadata") <- 
  list(description = "First wave of 250 
       sampled using proportional sampling")After the metadata is specified, we create a dataframe specifying the survey design. It contains one variable holding the stratum names and one variable specifying how many samples should be allocated to each stratum in Wave 1. We will use proportional sampling for the first 250 samples.
#Design for Wave 1
get_data(MySurvey, phase = 2, wave = 1, slot = "design") <- data.frame(
  strata = names(table(phase1$new_strata)), 
  n_to_sample = round(250.3*as.vector(table(phase1$new_strata))/10335)
  ) #250.3 to make sure 250 samples after rounding
get_data(MySurvey, phase = 2, wave = 1, slot = "design")
#>                        strata n_to_sample
#> 1 Black.MWC_est_(15.06,38.46]          15
#> 2  Black.MWC_est_(9.75,15.06]          28
#> 3 Black.MWC_est_[-30.21,9.75]          18
#> 4 Other.MWC_est_(15.06,30.94]           8
#> 5  Other.MWC_est_(9.75,15.06]          22
#> 6  Other.MWC_est_[-5.39,9.75]          11
#> 7 White.MWC_est_(15.06,51.69]          40
#> 8  White.MWC_est_(9.75,15.06]          75
#> 9 White.MWC_est_[-25.68,9.75]          33We can use this design to randomly select ids to sample from the
previous wave (in this case the overall Phase 1 data) using
apply_multiwave() to apply the function
sample_strata() to the multiwave object. When applied to an
object of class “Multiwave”, sample_strata() will find
"data" and "design_data" in the object and
will look in the wave, phase, then overall metadata for the other
arguments to sample_strata() if they are not provided in
the call. It will return the same multiwave object with a vector of ids
to sample in the "samples" slot of the specified wave.
# Get list of ids to sample using stratified random sampling 
# without replacement
set.seed(456)
MySurvey <- apply_multiwave(MySurvey, phase = 2, wave = 1,
                            fun = "sample_strata",
                            strata = "new_strata",
                            id = "id",
                            wave2a = NULL, #No one has been sampled yet
                            design_strata = "strata", #from design
                            n_allocated = "n_to_sample"
                            )
# check that it worked
head(get_data(MySurvey, phase = 2, wave = 1, slot = "samples"))
#> [1] "5702" "3127" "3267" "3369" "1573" "3227"
length(get_data(MySurvey, phase = 2, wave = 1, slot = "samples"))
#> [1] 250
# But, notice that we had already specified most of the arguments to 
# sample_strata in the phase metadata. So, we can get the same result
# with a much shorter call to the function
set.seed(456)
MySurvey <- apply_multiwave(MySurvey, phase = 2, wave = 1, fun = "sample_strata")
ids_wave1 <- get_data(MySurvey, phase = 2, wave = 1, slot = "samples") 
#Check that  it gives same results
head(ids_wave1)
#> [1] "5702" "3127" "3267" "3369" "1573" "3227"
length(ids_wave1)
#> [1] 250We can use these sample ids to sample the Phase 2, Wave 1 data from
MatWgt_Sim, which holds the (hypothetically unknown)
mat_weight_true variable:
# We can use these ids to get the data:
get_data(MySurvey, phase = 2, wave = 1, slot = "sampled_data") <- 
  MatWgt_Sim[MatWgt_Sim$id %in% ids_wave1, c("id", "mat_weight_true")]Then, we can use the function merge_samples() to merge
the sampled data (250 rows of data sampled in wave 1 of phase 2) that we
just collected into our full dataframe (10,335 rows of data sampled in
phase 1). This function generates the dataframe for the
"data" slot of the specified wave using the
"sampled data" and the data from the previous wave (Phase 1
data in this case).
MySurvey <- apply_multiwave(MySurvey, phase = 2, wave = 1, fun = "merge_samples",
                            sampled_ind = "already_sampled_ind")All the slots for this wave are now filled! We can check to make sure
everything that we have done so far looks good using
multiwave_diagram():
It does! Now that we have sampled data which we can use to estimate
stratum standard deviations, we can use allocate_wave() to
allocate samples optimally for the next wave. When applied to a
multiwave object using apply_multiwave(),
allocate_wave() will find the data from the previous wave
(now the previous wave is Wave 1 of Phase 2) and will look for other
function arguments for allocate_wave() in the metadata. The
output will be placed in the "design" slot of the specified
wave:
MySurvey <- apply_multiwave(MySurvey, 
                                    phase = 2, 
                                    wave = 2, 
                                    fun = "allocate_wave",
                                    nsample = 250,
                                    already_sampled = "already_sampled_ind")
get_data(MySurvey, phase = 2, wave = 2, slot = "design")
#>                        strata npop nsample_optimal nsample_actual nsample_prior
#> 1 Black.MWC_est_(15.06,38.46]  628              62             62            15
#> 2  Black.MWC_est_(9.75,15.06] 1154              35             35            28
#> 3 Black.MWC_est_[-30.21,9.75]  745              61             61            18
#> 4 Other.MWC_est_(15.06,30.94]  325              15             15             8
#> 5  Other.MWC_est_(9.75,15.06]  929              29             29            22
#> 6  Other.MWC_est_[-5.39,9.75]  456              16             16            11
#> 7 White.MWC_est_(15.06,51.69] 1631             106            106            40
#> 8  White.MWC_est_(9.75,15.06] 3084             100            100            75
#> 9 White.MWC_est_[-25.68,9.75] 1383              76             76            33
#>   n_to_sample   sd
#> 1          47 5.19
#> 2           7 1.57
#> 3          43 4.30
#> 4           7 2.36
#> 5           7 1.66
#> 6           5 1.84
#> 7          66 3.42
#> 8          25 1.70
#> 9          43 2.91Now the design for Wave 2 is specified. We can continue with similar steps for future waves!
Through these examples, we see the many benefits that the creation of
a Multiwave object offers in optimall.