library(parsermd)
library(stringr)
A common workflow in educational settings involves creating homework assignments that contain both student scaffolding and instructor solutions within the same document. This vignette demonstrates how to use parsermd to process such documents and automatically generate separate versions for students and instructors.
The typical workflow involves:
Let’s start by examining a sample homework assignment that follows this pattern. The assignment includes multiple exercises, each with two code chunks:
- one with the -student suffix, which contains scaffolding code
- one with the -key suffix, which contains complete solutions
# Load the sample assignment
assignment_path = system.file("examples/hw03-full.qmd", package = "parsermd")
cat(readLines(assignment_path), sep = "\n")
#> ---
#> title: "Homework 3 - Data Analysis with R"
#> author: "Your Name"
#> date: "Due: Friday, March 15, 2024"
#> format: html
#> execute:
#> warning: false
#> message: false
#> ---
#>
#> ## Setup
#>
#> Load the required packages for this assignment:
#>
#> ```{r setup}
#> library(tidyverse)
#> library(palmerpenguins)
#> ```
#>
#> ## Exercise 1: Basic Data Exploration
#>
#> Examine the `penguins` dataset from the `palmerpenguins` package. Your task is to create a summary of the dataset that shows the number of observations and variables, and identify any missing values.
#>
#> ```{r ex1-student}
#> # Write your code here to:
#> # 1. Display the dimensions of the penguins dataset
#> # 2. Show the structure of the dataset
#> # 3. Count missing values in each column
#>
#> ```
#>
#> ```{r ex1-key}
#> # Solution: Basic data exploration
#> # 1. Display dimensions
#> cat("Dataset dimensions:", dim(penguins), "\n")
#> cat("Rows:", nrow(penguins), "Columns:", ncol(penguins), "\n\n")
#>
#> # 2. Show structure
#> str(penguins)
#>
#> # 3. Count missing values
#> cat("\nMissing values by column:\n")
#> penguins %>%
#> summarise(across(everything(), ~ sum(is.na(.))))
#> ```
#>
#> ## Exercise 2: Data Visualization
#>
#> Create a scatter plot showing the relationship between flipper length and body mass for penguins. Color the points by species and add appropriate labels and a title.
#>
#> ```{r ex2-student}
#> # Create a scatter plot with:
#> # - x-axis: flipper_length_mm
#> # - y-axis: body_mass_g
#> # - color by species
#> # - add appropriate labels and title
#>
#> ggplot(data = penguins, aes(x = ___, y = ___)) +
#> geom_point(aes(color = ___)) +
#> labs(
#> title = "___",
#> x = "___",
#> y = "___"
#> )
#> ```
#>
#> ```{r ex2-key}
#> # Solution: Scatter plot of flipper length vs body mass
#> ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
#> geom_point(aes(color = species), alpha = 0.8, size = 2) +
#> labs(
#> title = "Penguin Flipper Length vs Body Mass by Species",
#> x = "Flipper Length (mm)",
#> y = "Body Mass (g)",
#> color = "Species"
#> ) +
#> theme_minimal() +
#> scale_color_viridis_d()
#> ```
#>
#> ## Exercise 3: Statistical Analysis
#>
#> Calculate summary statistics for bill length by species. Create a table showing the mean, median, standard deviation, and count for each species.
#>
#> ```{r ex3-student}
#> # Calculate summary statistics for bill_length_mm by species
#> # Include: mean, median, standard deviation, and count
#> # Remove missing values before calculating
#>
#> penguins %>%
#> # Add your code here
#>
#> ```
#>
#> ```{r ex3-key}
#> # Solution: Summary statistics for bill length by species
#> penguins %>%
#> filter(!is.na(bill_length_mm)) %>%
#> group_by(species) %>%
#> summarise(
#> count = n(),
#> mean_bill_length = round(mean(bill_length_mm), 2),
#> median_bill_length = round(median(bill_length_mm), 2),
#> sd_bill_length = round(sd(bill_length_mm), 2),
#> .groups = "drop"
#> ) %>%
#> arrange(desc(mean_bill_length))
#> ```
#>
#> ## Exercise 4: Advanced Data Manipulation
#>
#> Filter the dataset to include only penguins with complete data (no missing values), then create a new variable called `bill_ratio` that represents the ratio of bill length to bill depth. Finally, identify which species has the highest average bill ratio.
#>
#> ```{r ex4-student}
#> # Step 1: Filter for complete cases
#> # Step 2: Create bill_ratio variable (bill_length_mm / bill_depth_mm)
#> # Step 3: Calculate average bill_ratio by species
#> # Step 4: Identify species with highest average ratio
#>
#> ```
#>
#> ```{r ex4-key}
#> # Solution: Advanced data manipulation
#> complete_penguins = penguins %>%
#> # Remove rows with any missing values
#> filter(complete.cases(.)) %>%
#> # Create bill_ratio variable
#> mutate(bill_ratio = bill_length_mm / bill_depth_mm)
#>
#> # Calculate average bill ratio by species
#> bill_ratio_summary = complete_penguins %>%
#> group_by(species) %>%
#> summarise(
#> avg_bill_ratio = round(mean(bill_ratio), 3),
#> n = n(),
#> .groups = "drop"
#> ) %>%
#> arrange(desc(avg_bill_ratio))
#>
#> print(bill_ratio_summary)
#>
#> # Identify species with highest average bill ratio
#> highest_ratio_species = bill_ratio_summary %>%
#> slice_max(avg_bill_ratio, n = 1) %>%
#> pull(species)
#>
#> cat("\nSpecies with highest average bill ratio:", as.character(highest_ratio_species))
#> ```
#>
#> ## Bonus Exercise: Conditional Logic
#>
#> Write a function that categorizes penguins as "small", "medium", or "large" based on their body mass. Use the following criteria:
#> - Small: body mass < 3500g
#> - Medium: body mass between 3500g and 4500g
#> - Large: body mass > 4500g
#>
#> Apply this function to create a new column and create a summary table.
#>
#> ```{r bonus-student}
#> # Create a function to categorize penguins by size
#> categorize_size = function(mass) {
#> # Add your conditional logic here
#>
#> }
#>
#> # Apply the function and create summary
#> ```
#>
#> ```{r bonus-key}
#> # Solution: Conditional logic for size categorization
#> categorize_size = function(mass) {
#> case_when(
#> is.na(mass) ~ "Unknown",
#> mass < 3500 ~ "Small",
#> mass >= 3500 & mass <= 4500 ~ "Medium",
#> mass > 4500 ~ "Large"
#> )
#> }
#>
#> # Apply the function and create summary
#> penguins_with_size = penguins %>%
#> mutate(size_category = categorize_size(body_mass_g))
#>
#> # Create summary table
#> size_summary = penguins_with_size %>%
#> count(species, size_category) %>%
#> pivot_wider(names_from = size_category, values_from = n, values_fill = 0)
#>
#> print(size_summary)
#>
#> # Overall size distribution
#> penguins_with_size %>%
#> count(size_category) %>%
#> mutate(percentage = round(n / sum(n) * 100, 1))
#> ```
First, let’s parse the assignment document to understand its structure:
# Parse the assignment
rmd = parse_rmd(assignment_path)
# Display the document structure
print(rmd)
#> ├── YAML [5 fields]
#> ├── Heading [h2] - Setup
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 2 lines] - setup
#> ├── Heading [h2] - Exercise 1: Basic Data Exploration
#> │ ├── Markdown [1 line]
#> │ ├── Chunk [r, 5 lines] - ex1-student
#> │ └── Chunk [r, 12 lines] - ex1-key
#> ├── Heading [h2] - Exercise 2: Data Visualization
#> │ ├── Markdown [1 line]
#> │ ├── Chunk [r, 13 lines] - ex2-student
#> │ └── Chunk [r, 11 lines] - ex2-key
#> ├── Heading [h2] - Exercise 3: Statistical Analysis
#> │ ├── Markdown [1 line]
#> │ ├── Chunk [r, 7 lines] - ex3-student
#> │ └── Chunk [r, 12 lines] - ex3-key
#> ├── Heading [h2] - Exercise 4: Advanced Data Manipulation
#> │ ├── Markdown [1 line]
#> │ ├── Chunk [r, 5 lines] - ex4-student
#> │ └── Chunk [r, 25 lines] - ex4-key
#> └── Heading [h2] - Bonus Exercise: Conditional Logic
#> ├── Markdown [6 lines]
#> ├── Chunk [r, 7 lines] - bonus-student
#> └── Chunk [r, 25 lines] - bonus-key
We can also examine the document as a tibble to better understand the chunk labels and structure:
# Convert to tibble for easier inspection
as_tibble(rmd)
#> # A tibble: 24 × 4
#> sec_h2 type label ast
#> <chr> <chr> <chr> <list>
#> 1 <NA> rmd_yaml <NA> <yaml>
#> 2 Setup rmd_heading <NA> <heading [h2]>
#> 3 Setup rmd_markdown <NA> <markdown>
#> 4 Setup rmd_chunk setup <chunk [r]>
#> 5 Exercise 1: Basic Data Exploration rmd_heading <NA> <heading [h2]>
#> 6 Exercise 1: Basic Data Exploration rmd_markdown <NA> <markdown>
#> 7 Exercise 1: Basic Data Exploration rmd_chunk ex1-student <chunk [r]>
#> 8 Exercise 1: Basic Data Exploration rmd_chunk ex1-key <chunk [r]>
#> 9 Exercise 2: Data Visualization rmd_heading <NA> <heading [h2]>
#> 10 Exercise 2: Data Visualization rmd_markdown <NA> <markdown>
#> # ℹ 14 more rows
To create the student version, we need to keep all non-chunk content together with the student chunks (those with the -student suffix), and drop the solution chunks:
# Select student chunks and all non-chunk content
student_version = rmd |>
  rmd_select(
    # Easier to specify the nodes we want to remove
    !has_label("*-key")
  )
# Display the student version structure
student_version
#> ├── YAML [5 fields]
#> ├── Heading [h2] - Setup
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 2 lines] - setup
#> ├── Heading [h2] - Exercise 1: Basic Data Exploration
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 5 lines] - ex1-student
#> ├── Heading [h2] - Exercise 2: Data Visualization
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 13 lines] - ex2-student
#> ├── Heading [h2] - Exercise 3: Statistical Analysis
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 7 lines] - ex3-student
#> ├── Heading [h2] - Exercise 4: Advanced Data Manipulation
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 5 lines] - ex4-student
#> └── Heading [h2] - Bonus Exercise: Conditional Logic
#> ├── Markdown [6 lines]
#> └── Chunk [r, 7 lines] - bonus-student
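To see why negating the selection keeps the headings, the Markdown text, and the setup chunk, it can help to mimic the glob match in base R. The sketch below is only an illustration: the labels vector is made up, it assumes that a node without a label counts as a non-match, and it uses glob2rx() rather than parsermd's own matching code.

```r
# Hypothetical labels for a few nodes; NA stands for nodes without a label
# (YAML, headings, Markdown text)
labels = c(NA, NA, "setup", "ex1-student", "ex1-key")

# Translate the glob "*-key" into a regular expression and test each label
is_key = !is.na(labels) & grepl(glob2rx("*-key"), labels)

# Negating the match keeps everything that is not a solution chunk
keep = !is_key
labels[keep]
```

Only the ex1-key entry is dropped; unlabeled nodes and the setup chunk survive because they never matched the pattern in the first place.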
If we don’t want to let students in on the fact that these chunks were written just for them, we can use rmd_modify() to remove the -student suffix:
student_version = student_version |>
  rmd_modify(
    function(node) {
      rmd_node_label(node) = stringr::str_remove(rmd_node_label(node), "-student")
      node
    },
    has_label("*-student")
  )
# Show the document structure to see the label changes
student_version
#> ├── YAML [5 fields]
#> ├── Heading [h2] - Setup
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 2 lines] - setup
#> ├── Heading [h2] - Exercise 1: Basic Data Exploration
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 5 lines] - ex1
#> ├── Heading [h2] - Exercise 2: Data Visualization
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 13 lines] - ex2
#> ├── Heading [h2] - Exercise 3: Statistical Analysis
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 7 lines] - ex3
#> ├── Heading [h2] - Exercise 4: Advanced Data Manipulation
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 5 lines] - ex4
#> └── Heading [h2] - Bonus Exercise: Conditional Logic
#> ├── Markdown [6 lines]
#> └── Chunk [r, 7 lines] - bonus
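One detail worth checking when renaming labels is where the pattern is allowed to match. The pattern "-student" removes the first occurrence anywhere in the label, which is fine for the labels in this assignment but could surprise with others. A minimal sketch of the difference, using base R's sub() (which, like stringr::str_remove(), replaces only the first match) and a made-up third label:

```r
# Hypothetical labels; the last one contains "-student" twice
labels = c("ex1-student", "bonus-student", "ex-student-loans-student")

# Unanchored: removes the first "-student" wherever it occurs
unanchored = sub("-student", "", labels)

# Anchored: removes "-student" only at the end of the label
anchored = sub("-student$", "", labels)
```

For the first two labels both versions agree; for the third, only the anchored pattern strips the trailing suffix and leaves the rest of the label intact.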
Let’s see what the student version looks like as a document:
# Convert to document and display first few sections
as_document(student_version) |>
cat(sep = "\n")
#> ---
#> title: Homework 3 - Data Analysis with R
#> author: Your Name
#> date: 'Due: Friday, March 15, 2024'
#> format: html
#> execute:
#> warning: false
#> message: false
#> ---
#>
#> ## Setup
#>
#> Load the required packages for this assignment:
#>
#>
#> ```{r setup}
#> library(tidyverse)
#> library(palmerpenguins)
#> ```
#>
#> ## Exercise 1: Basic Data Exploration
#>
#> Examine the `penguins` dataset from the `palmerpenguins` package. Your task is to create a summary of the dataset that shows the number of observations and variables, and identify any missing values.
#>
#>
#> ```{r ex1}
#> # Write your code here to:
#> # 1. Display the dimensions of the penguins dataset
#> # 2. Show the structure of the dataset
#> # 3. Count missing values in each column
#>
#> ```
#>
#> ## Exercise 2: Data Visualization
#>
#> Create a scatter plot showing the relationship between flipper length and body mass for penguins. Color the points by species and add appropriate labels and a title.
#>
#>
#> ```{r ex2}
#> # Create a scatter plot with:
#> # - x-axis: flipper_length_mm
#> # - y-axis: body_mass_g
#> # - color by species
#> # - add appropriate labels and title
#>
#> ggplot(data = penguins, aes(x = ___, y = ___)) +
#> geom_point(aes(color = ___)) +
#> labs(
#> title = "___",
#> x = "___",
#> y = "___"
#> )
#> ```
#>
#> ## Exercise 3: Statistical Analysis
#>
#> Calculate summary statistics for bill length by species. Create a table showing the mean, median, standard deviation, and count for each species.
#>
#>
#> ```{r ex3}
#> # Calculate summary statistics for bill_length_mm by species
#> # Include: mean, median, standard deviation, and count
#> # Remove missing values before calculating
#>
#> penguins %>%
#> # Add your code here
#>
#> ```
#>
#> ## Exercise 4: Advanced Data Manipulation
#>
#> Filter the dataset to include only penguins with complete data (no missing values), then create a new variable called `bill_ratio` that represents the ratio of bill length to bill depth. Finally, identify which species has the highest average bill ratio.
#>
#>
#> ```{r ex4}
#> # Step 1: Filter for complete cases
#> # Step 2: Create bill_ratio variable (bill_length_mm / bill_depth_mm)
#> # Step 3: Calculate average bill_ratio by species
#> # Step 4: Identify species with highest average ratio
#>
#> ```
#>
#> ## Bonus Exercise: Conditional Logic
#>
#> Write a function that categorizes penguins as "small", "medium", or "large" based on their body mass. Use the following criteria:
#> - Small: body mass < 3500g
#> - Medium: body mass between 3500g and 4500g
#> - Large: body mass > 4500g
#>
#> Apply this function to create a new column and create a summary table.
#>
#>
#> ```{r bonus}
#> # Create a function to categorize penguins by size
#> categorize_size = function(mass) {
#> # Add your conditional logic here
#>
#> }
#>
#> # Apply the function and create summary
#> ```
We can also save this to a file:
# Save student version (not run in vignette)
as_document(student_version) |>
writeLines("homework-student.qmd")
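Before distributing a saved student file, it may be worth confirming that no solution chunk slipped through. The following is a rough sketch in base R, not part of parsermd: student_lines is a hand-written stand-in for the lines of the student document, and the check simply looks for chunk headers that still mention -key.

```r
# Build a code fence without typing literal backticks
fence = strrep("`", 3)

# Stand-in for the lines of the student document (hypothetical content)
student_lines = c(
  "## Exercise 1",
  "",
  paste0(fence, "{r ex1}"),
  "# Write your code here",
  fence
)

# Write to a temporary file, then read it back
path = tempfile(fileext = ".qmd")
writeLines(student_lines, path)
saved = readLines(path)

# Flag chunk headers that still carry a -key label
leaked = saved[startsWith(saved, fence) & grepl("-key", saved, fixed = TRUE)]
stopifnot(length(leaked) == 0)
```

Writing to tempfile() keeps the check from overwriting a real assignment file while experimenting.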
For the instructor key, we want to keep all non-chunk content together with the solution chunks (those with the -key suffix), and drop the student chunks:
# Select solution chunks and all non-chunk content
instructor_key = rmd |>
  rmd_select(
    # Again, it is easier to specify the nodes we want to remove
    !has_label("*-student")
  )
# Display the instructor key structure
instructor_key
#> ├── YAML [5 fields]
#> ├── Heading [h2] - Setup
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 2 lines] - setup
#> ├── Heading [h2] - Exercise 1: Basic Data Exploration
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 12 lines] - ex1-key
#> ├── Heading [h2] - Exercise 2: Data Visualization
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 11 lines] - ex2-key
#> ├── Heading [h2] - Exercise 3: Statistical Analysis
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 12 lines] - ex3-key
#> ├── Heading [h2] - Exercise 4: Advanced Data Manipulation
#> │ ├── Markdown [1 line]
#> │ └── Chunk [r, 25 lines] - ex4-key
#> └── Heading [h2] - Bonus Exercise: Conditional Logic
#> ├── Markdown [6 lines]
#> └── Chunk [r, 25 lines] - bonus-key
Let’s examine the instructor key document:
# Convert to document
instructor_doc = as_document(instructor_key)
# Display first part of the document
cat(head(instructor_doc, 50), sep = "\n")
#> ---
Sometimes instructors may want a very streamlined version that contains only the solution code without all the instructional text.
We can create this by keeping only the YAML, the exercise headings, and the solution chunks, and by adding #| include: false for the setup chunk:
# Select only headings and solution chunks
minimalist_key = rmd |>
  rmd_select(
    # Keep yaml and exercise headings for structure
    has_type("rmd_yaml"),
    has_heading(c("Exercise *", "Bonus*")),
    # Keep only solution chunks
    has_label(c("*-key", "setup"))
  ) |>
  rmd_modify(
    function(node) {
      rmd_node_options(node) = list(include = FALSE)
      node
    },
    has_label("setup")
  )
# Display the minimalist key structure
minimalist_key
#> ├── YAML [5 fields]
#> ├── Chunk [r, 2 lines] - setup
#> ├── Heading [h2] - Exercise 1: Basic Data Exploration
#> │ └── Chunk [r, 12 lines] - ex1-key
#> ├── Heading [h2] - Exercise 2: Data Visualization
#> │ └── Chunk [r, 11 lines] - ex2-key
#> ├── Heading [h2] - Exercise 3: Statistical Analysis
#> │ └── Chunk [r, 12 lines] - ex3-key
#> ├── Heading [h2] - Exercise 4: Advanced Data Manipulation
#> │ └── Chunk [r, 25 lines] - ex4-key
#> └── Heading [h2] - Bonus Exercise: Conditional Logic
#> └── Chunk [r, 25 lines] - bonus-key
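The structure above no longer has a Setup heading, even though the setup chunk itself was kept. A small base R sketch of the heading globs shows why; it is an approximation that uses glob2rx() on the heading text and is not parsermd's actual matching code.

```r
# Section headings from the sample assignment
headings = c(
  "Setup",
  "Exercise 1: Basic Data Exploration",
  "Bonus Exercise: Conditional Logic"
)

# Glob patterns passed to has_heading() above
patterns = c("Exercise *", "Bonus*")

# A heading is kept when it matches at least one pattern
matches_any = function(heading) {
  any(vapply(patterns, function(p) grepl(glob2rx(p), heading), logical(1)))
}
keep = vapply(headings, matches_any, logical(1), USE.NAMES = FALSE)
headings[keep]
```

"Setup" matches neither pattern, so the heading is dropped, while the chunk labeled setup is picked up separately by the label selection.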
# Convert to document
minimalist_doc = as_document(minimalist_key)
cat(minimalist_doc, sep = "\n")
#> ---
#> title: Homework 3 - Data Analysis with R
#> author: Your Name
#> date: 'Due: Friday, March 15, 2024'
#> format: html
#> execute:
#> warning: false
#> message: false
#> ---
#>
#> ```{r setup}
#> #| include: false
#> library(tidyverse)
#> library(palmerpenguins)
#> ```
#>
#> ## Exercise 1: Basic Data Exploration
#>
#> ```{r ex1-key}
#> # Solution: Basic data exploration
#> # 1. Display dimensions
#> cat("Dataset dimensions:", dim(penguins), "\n")
#> cat("Rows:", nrow(penguins), "Columns:", ncol(penguins), "\n\n")
#>
#> # 2. Show structure
#> str(penguins)
#>
#> # 3. Count missing values
#> cat("\nMissing values by column:\n")
#> penguins %>%
#> summarise(across(everything(), ~ sum(is.na(.))))
#> ```
#>
#> ## Exercise 2: Data Visualization
#>
#> ```{r ex2-key}
#> # Solution: Scatter plot of flipper length vs body mass
#> ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
#> geom_point(aes(color = species), alpha = 0.8, size = 2) +
#> labs(
#> title = "Penguin Flipper Length vs Body Mass by Species",
#> x = "Flipper Length (mm)",
#> y = "Body Mass (g)",
#> color = "Species"
#> ) +
#> theme_minimal() +
#> scale_color_viridis_d()
#> ```
#>
#> ## Exercise 3: Statistical Analysis
#>
#> ```{r ex3-key}
#> # Solution: Summary statistics for bill length by species
#> penguins %>%
#> filter(!is.na(bill_length_mm)) %>%
#> group_by(species) %>%
#> summarise(
#> count = n(),
#> mean_bill_length = round(mean(bill_length_mm), 2),
#> median_bill_length = round(median(bill_length_mm), 2),
#> sd_bill_length = round(sd(bill_length_mm), 2),
#> .groups = "drop"
#> ) %>%
#> arrange(desc(mean_bill_length))
#> ```
#>
#> ## Exercise 4: Advanced Data Manipulation
#>
#> ```{r ex4-key}
#> # Solution: Advanced data manipulation
#> complete_penguins = penguins %>%
#> # Remove rows with any missing values
#> filter(complete.cases(.)) %>%
#> # Create bill_ratio variable
#> mutate(bill_ratio = bill_length_mm / bill_depth_mm)
#>
#> # Calculate average bill ratio by species
#> bill_ratio_summary = complete_penguins %>%
#> group_by(species) %>%
#> summarise(
#> avg_bill_ratio = round(mean(bill_ratio), 3),
#> n = n(),
#> .groups = "drop"
#> ) %>%
#> arrange(desc(avg_bill_ratio))
#>
#> print(bill_ratio_summary)
#>
#> # Identify species with highest average bill ratio
#> highest_ratio_species = bill_ratio_summary %>%
#> slice_max(avg_bill_ratio, n = 1) %>%
#> pull(species)
#>
#> cat("\nSpecies with highest average bill ratio:", as.character(highest_ratio_species))
#> ```
#>
#> ## Bonus Exercise: Conditional Logic
#>
#> ```{r bonus-key}
#> # Solution: Conditional logic for size categorization
#> categorize_size = function(mass) {
#> case_when(
#> is.na(mass) ~ "Unknown",
#> mass < 3500 ~ "Small",
#> mass >= 3500 & mass <= 4500 ~ "Medium",
#> mass > 4500 ~ "Large"
#> )
#> }
#>
#> # Apply the function and create summary
#> penguins_with_size = penguins %>%
#> mutate(size_category = categorize_size(body_mass_g))
#>
#> # Create summary table
#> size_summary = penguins_with_size %>%
#> count(species, size_category) %>%
#> pivot_wider(names_from = size_category, values_from = n, values_fill = 0)
#>
#> print(size_summary)
#>
#> # Overall size distribution
#> penguins_with_size %>%
#> count(size_category) %>%
#> mutate(percentage = round(n / sum(n) * 100, 1))
#> ```
When creating homework assignments for processing with parsermd, consider these best practices:
- Use a consistent chunk naming convention (e.g. ex1-student, ex2-key)
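Because every step in this vignette relies on the label suffixes, a quick consistency check can catch an exercise that is missing one half of its pair. The sketch below works on a plain character vector of labels; the vector is hypothetical and deliberately includes an ex3-key chunk with no student counterpart.

```r
# Hypothetical chunk labels, following the sample assignment's convention
labels = c("setup", "ex1-student", "ex1-key", "ex2-student", "ex2-key", "ex3-key")

# Exercise names implied by each suffix
student = sub("-student$", "", grep("-student$", labels, value = TRUE))
key = sub("-key$", "", grep("-key$", labels, value = TRUE))

# Exercises missing one half of the pair
missing_key = setdiff(student, key)
missing_student = setdiff(key, student)
```

Here missing_student flags ex3, while missing_key is empty. In practice the labels could be taken from the label column of the tibble shown earlier, dropping the NA entries first.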