The REDCapDM package will first allow us to read both data exported directly from REDCap and through an API connection. Next, it will allow us to create reports of queries such as outliers or missing values, and to track them. Finally, we will be able to preprocess previously downloaded data.
Functions included in the package:
redcap_data: read data.
rd_query: identification of queries.
rd_event: identification of events that are ‘missing’.
check_queries: control queries.
rd_transform: data pre-processing.
rd_rlogic: transformation of REDCap logic into R logic.
rd_insert_na: manual entry of a ‘missing’ in specific variables using a filter.
Exported data from REDCap:
datos <- redcap_data(data_path="C:/Users/username/example.r",
dic_path="C:/Users/username/example_dictionary.csv")Data using API:
datos_api <- redcap_data(uri ="https://redcap.idibell.cat/api/",
token = "55E5C3D1E83213ADA2182A4BFDEA")For the examples we will use COVICAN dataset that is already included
in the package. For more information use ?covican.
datos_redcap <- covicanList of 2
$ data :'data.frame': 342 obs. of 56 variables:
$ dictionary:'data.frame': 21 obs. of 8 variables:
datos_redcap$data:
record_id redcap_event_name redcap_data_access_group
1 100-6 initial_visit_arm_1 hospital_11
2 100-6 follow_up_visit_da_arm_1 hospital_11
21 100-13 initial_visit_arm_1 hospital_11
22 100-13 follow_up_visit_da_arm_1 hospital_11
34 100-16 initial_visit_arm_1 hospital_11
35 100-16 follow_up_visit_da_arm_1 hospital_11
datos_redcap$dictionary:
field_name form_name field_type
1 record_id inclusionexclusion_criteria text
2 inc_1 inclusionexclusion_criteria radio
3 inc_2 inclusionexclusion_criteria radio
4 inc_3 inclusionexclusion_criteria radio
6 exc_1 inclusionexclusion_criteria radio
10 screening_fail_crit inclusionexclusion_criteria calc
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary
)
data <- datos$data
#To print the results
datos$results1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
As you can see, there are 4 steps in the transformation:
There were two autocalculated fields, as we see in the summary we have been able to recalculate both of them and there is one that has changed:
data %>%
dplyr::select(d_birth, d_ingreso, age, age_recalc) %>%
dplyr::filter(age != age_recalc) d_birth d_ingreso age age_recalc
1 1945-04-16 2020-04-16 74 75
#Checkbox no gatekeeper
table(data$type_underlying_disease_haematological_cancer)
No Yes
103 87
#Checkbox with gatekeeper: [type_underlying_disease(0)]='1'
#In the original data set:
table(datos_redcap$data$type_underlying_disease___0, datos_redcap$data$underlying_disease_hemato___1)
0 1
0 103 0
1 84 3
#In the transformed data set:
table(data$type_underlying_disease_haematological_cancer, data$underlying_disease_hemato_acute_myeloid_leukemia)
No Yes
No 0 0
Yes 84 3
str(data$dm) Factor w/ 2 levels "No","Yes": 1 NA 2 NA 2 NA 1 NA 2 NA ...
In order to perform the transformation by event we have to add the path where the file with the correspondence between events and forms is located. In this case, we will use the path where we have saved the file of this correspondence that we have downloaded from the COVICAN project.
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_event"
)
#To print the results
datos$results1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by event
Now a step in the transformation has been added, which is to split the preprocessed data for each of the events in the study so that the function returns
datos$data# A tibble: 2 × 3
events vars df
<chr> <list> <list>
1 initial_visit_arm_1 <chr [25]> <df [190 × 25]>
2 follow_up_visit_da_arm_1 <chr [8]> <df [152 × 8]>
To perform the transformation by form we must also add the path of the file of the events and forms
datos <- rd_transform(data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form"
)
#To print the results
datos$results1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by form
As before, a final step is added, which consists of splitting the preprocessed data for each form of the study so that the function returns
datos$data# A tibble: 6 × 4
form events vars df
<chr> <list> <list> <list>
1 inclusionexclusion_criteria <chr [1]> <chr [11]> <df [190 × 11]>
2 demographics <chr [1]> <chr [9]> <df [190 × 9]>
3 comorbidities <chr [1]> <chr [10]> <df [190 × 10]>
4 vital_signs <chr [2]> <chr [7]> <df [177 × 7]>
5 laboratory_findings <chr [2]> <chr [7]> <df [177 × 7]>
6 microbiological_studies <chr [1]> <chr [6]> <df [190 × 6]>
checkbox_labels: specify the name of the categories that will have the checkbox variables. Default is No/Yes, we can change it to N/Y.
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
checkbox_labels = c("N", "Y")
)
data <- datos$dataWe see how the categories of the following checkboxes have changed, for example
table(data$type_underlying_disease_haematological_cancer)
N Y
103 87
exclude_to_factor: specify the name of a variable that we do not want to be transformed to a factor. For example, if we want the variable dm to keep its numeric version
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
exclude_to_factor = "dm"
)
data <- datos$data
table(data$dm)
0 1
140 45
keep_labels: logical argument to retain data set tags that are processed from REDCap
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
keep_labels = TRUE
)
data <- datos$data
str(data[,1:5])'data.frame': 342 obs. of 5 variables:
$ record_id : 'labelled' chr "100-13" "100-13" "100-16" "100-16" ...
..- attr(*, "label")= Named chr ""
.. ..- attr(*, "names")= chr "record_id"
$ redcap_event_name : 'labelled' chr "initial_visit_arm_1" "follow_up_visit_da_arm_1" "initial_visit_arm_1" "follow_up_visit_da_arm_1" ...
..- attr(*, "label")= Named chr "Event Name"
.. ..- attr(*, "names")= chr "redcap_event_name"
$ redcap_event_name.factor : Factor w/ 5 levels "Initial visit",..: 1 2 1 2 1 2 1 2 1 2 ...
..- attr(*, "label")= Named chr ""
.. ..- attr(*, "names")= chr "redcap_data_access_group"
$ redcap_data_access_group : 'labelled' chr "hospital_11" "hospital_11" "hospital_11" "hospital_11" ...
..- attr(*, "label")= Named chr "Patients older than 18 years"
.. ..- attr(*, "names")= chr "inc_1"
$ redcap_data_access_group.factor: Factor w/ 26 levels "Hospital 1","Hospital 2",..: 11 11 11 11 11 11 11 11 11 11 ...
..- attr(*, "label")= Named chr "Cancer patients"
.. ..- attr(*, "names")= chr "inc_2"
delete_vars: specify strings whereby any variables containing them will be removed from the data set. By default all variables containing ’_complete’ are removed. In this case we did not have any complete variables. We want for example to remove all inclusion and exclusion criteria
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
delete_vars = c("inc_", "exc_")
)
data <- datos$data
names(data) [1] "record_id"
[2] "redcap_event_name"
[3] "redcap_event_name.factor"
[4] "redcap_data_access_group"
[5] "redcap_data_access_group.factor"
[6] "screening_fail_crit"
[7] "screening_fail_crit_recalc"
[8] "d_birth"
[9] "d_ingreso"
[10] "age"
[11] "age_recalc"
[12] "dm"
[13] "type_dm"
[14] "copd"
[15] "fio2_aportado"
[16] "analitica_disponible"
[17] "potassium"
[18] "resp_freq"
[19] "hemato_neo"
[20] "leukemia"
[21] "type_underlying_disease_haematological_cancer"
[22] "type_underlying_disease_solid_tumour"
[23] "underlying_disease_hemato_acute_myeloid_leukemia"
[24] "underlying_disease_hemato_myelodysplastic_syndrome"
[25] "underlying_disease_hemato_chronic_myeloid_leukaemia"
[26] "underlying_disease_hemato_acute_lymphoblastic_leukaemia"
[27] "underlying_disease_hemato_hodgkin_lymphoma"
[28] "underlying_disease_hemato_non_hodgkin_lymphoma"
[29] "underlying_disease_hemato_multiple_myeloma"
[30] "underlying_disease_hemato_myelofibrosis"
[31] "underlying_disease_hemato_aplastic_anaemia"
[32] "urine_culture"
which_event: in the event-driven transformation, specify whether you only want a specific event to be returned. For example, we only want the first visit
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_event",
which_event = "initial_visit_arm_1"
)
data <- datos$data
table(data$redcap_event_name)
initial_visit_arm_1
190
which_form: in the transformation by form, specify if you only want a specific form to be returned. For example, we only want the form demographics
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form",
which_form = "demographics"
)
data <- datos$data
names(data)[1] "record_id" "redcap_event_name"
[3] "redcap_data_access_group" "redcap_event_name.factor"
[5] "redcap_data_access_group.factor" "d_ingreso"
[7] "d_birth" "age"
[9] "age_recalc"
wide: so that the data set that is returned in the transformation by form is in wide format or not. We do this for the laboratory findings form
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form",
which_form = "laboratory_findings",
wide = TRUE
)
data <- datos$data
head(data)# A tibble: 6 × 5
record_id analitica_disponible_1 analitica_disponible_2 potassium_1 potassiu…¹
<chr> <fct> <fct> <dbl> <dbl>
1 100-13 Yes Yes 3.66 4.1
2 100-16 Yes No 4.04 NA
3 100-31 Yes <NA> 4.58 NA
4 100-34 Yes No 3.48 NA
5 100-36 Yes No 4.09 NA
6 100-52 Yes Yes 3.7 7.15
# … with abbreviated variable name ¹potassium_2
It transforms the REDCap logic into logic that can be evaluated in R. This function is built into the recalculate function, but it may be useful to use it separately. Let’s see how it transforms the logics of the autocalculated field calculations.
#screening failure
rd_rlogic(logic = "if([exc_1]='1' or [inc_1]='0' or [inc_2]='0' or [inc_3]='0',1,0)",
data = datos_redcap$data)[1] "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"
#age
rd_rlogic(logic = 'rounddown(datediff([d_birth],[d_ingreso],"y","dmy"),0)',
data = datos_redcap$data)[1] "floor(lubridate::time_length(lubridate::interval(data$d_birth,data$d_ingreso), 'year'))"
Function to set missing variables when a certain logic is fulfilled. Useful for example in the checkboxes that we do not have a gatekeeper. For example, we put the checkbox without gatekeeper (type_underlying_disease) to missing when the age is less than 65 years old.
datos <- rd_transform(
data = datos_redcap$data,
dic = datos_redcap$dictionary
)
data <- datos$data
#Before inserting missings
table(data$type_underlying_disease_haematological_cancer)
No Yes
103 87
data2 <- rd_insert_na(
data = data,
filter = rep("age < 65", 2),
vars = grep("type_underlying_disease", names(data), value = TRUE)
)
#After inserting missings
table(data2$type_underlying_disease_haematological_cancer)
No Yes
65 50
| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code |
|---|---|---|---|---|---|---|---|---|
| 100-58 | Hospital 11 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 100-58-1 |
| 102-113 | Hospital 24 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 102-113-1 |
| 105-11 | Hospital 5 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 105-11-1 |
| 105-11 | Hospital 5 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 105-11-2 |
| 105-56 | Hospital 5 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 105-56-1 |
| 105-56 | Hospital 5 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 105-56-2 |
Mandatory arguments:
example <- rd_query(variables = c("copd", "age"),
expression = c("%in%NA", "%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 5 |
Optional arguments:
example<- rd_query(variables = c("age", "copd"),
variables_names = c("Age", "Chronic obstructive pulmonary disease"),#### OPCIONAL
expression = c("%in%NA", "%in%NA"),
query_name = c("Age is missing at baseline visit", "COPD"), #### OPCIONAL
instrument = c("Inclusión del paciente","Inclusión"), #### OPCIONAL
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)Sin filtro:
example <- rd_query(variables = c("age", "copd", "potassium"),
expression = c("%in%NA", "%in%NA", "%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data=datos_redcap$data)Warning: Some of the variables that were checked for missings present a branching logic.
Check the results tab of output for more details (...$results).
# Printing results
example$results| Variables | Description | Total | Branching logic |
|---|---|---|---|
| age | Age | 5 | - |
| copd | Chronic pulmonary disease | 6 | - |
| potassium | Potassium | 31 | [analitica_disponible]=‘1’ |
Applying filter:
example <- rd_query(variables = c("potassium"),
expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
filter = c("analitica_disponible=='1'"))Warning: Some of the variables that were checked for missings present a branching logic.
Check the results tab of output for more details (...$results).
# Printing results
example$results| Variables | Description | Total | Branching logic |
|---|---|---|---|
| potassium | Potassium | 21 | [analitica_disponible]=‘1’ |
Simple:
example <- rd_query(variables=c("age"),
expression=c(">20"),
event="initial_visit_arm_1",
dic=datos_redcap$dictionary,
data=datos_redcap$data)
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| age | Age | 185 |
Complex:
example <- rd_query(variables=c("age", "copd"),
expression=c("(>20 & <70) | %in%NA", "==1"),
event="initial_visit_arm_1",
dic=datos_redcap$dictionary,
data=datos_redcap$data)
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| age | Age | 108 |
| copd | Chronic pulmonary disease | 21 |
Same expression for all variables:
example <- rd_query(variables = c("copd","age","dm"),
expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)Warning: There are more variables than expressions, so the same expression was
applied to all variables
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 5 |
| dm | Diabetes (treated with insulin or antidiabetic … | 5 |
Not defining an event:
example <- rd_query(variables = c("copd"),
expression = c("%in%NA"),
dic = datos_redcap$dictionary,
data = datos_redcap$data)Warning: event = NA, but the dataset presents a variable that indicates the
presence of events, please specify the event.
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 158 |
negate: negation of expression used
example <- rd_query(variables = c("copd"),
expression = c("%in%NA"),
negate = TRUE,
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 184 |
addTo: join queries to an existing data frame
example2 <- rd_query(variables = c("age"),
expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data=datos_redcap$data,
addTo = example)
# Printing results
example2$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 184 |
| age | Age | 5 |
report_title: customize the queries table title
example <- rd_query(variables = c("copd", "age"),
expression = c("%in%NA", "<20"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_title = "Missing COPD values in the baseline event")
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 6 |
report_zeros: choose whether variables with zero queries should be reported in the table
example <- rd_query(variables = c("copd", "age"),
expression = c("%in%NA", "<20"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_zeros = TRUE)
# Printing results
example$results| Variables | Description | Total |
|---|---|---|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 0 |
Simple:
example <- rd_event(event = "follow_up_visit_da_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Print results
example$results| Events | Description | Total |
|---|---|---|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 38 |
Filter:
example <- rd_event(event = "follow_up_visit_da_arm_1",
filter = "screening_fail_crit==0",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Print results
example$results| Events | Description | Total |
|---|---|---|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |
Several events:
example <- rd_event(event = c("initial_visit_arm_1","follow_up_visit_da_arm_1"),
filter = "screening_fail_crit==0",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_zeros = TRUE)
# Print results
example$results| Events | Description | Total |
|---|---|---|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |
| initial_visit_arm_1 | Initial visit | 0 |
check <- check_queries(old = example$queries, new = queries_nuevas)
# Print results
check$results| State | Total |
|---|---|
| Modified | 18 |
| Unmodified | 16 |
| New | 1 |
Query control output:
| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code | Modification |
|---|---|---|---|---|---|---|---|---|---|
| 101-36 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 101-36-1 | Unmodified |
| 102-6 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-6-1 | Unmodified |
| 102-10 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-10-1 | Unmodified |
| 102-13 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-13-1 | Unmodified |
| 102-23 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-23-1 | Unmodified |
| 102-27 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-27-1 | Modified |