Know your nodes

Introduction

The parsermd package parses R Markdown and Quarto documents into an Abstract Syntax Tree (AST) representation. This vignette introduces the different types of AST nodes and their properties, helping you understand how parsermd represents document structure.

AST Container - rmd_ast

The rmd_ast object serves as the container for all parsed document nodes. It holds a linear sequence of nodes representing different document elements, where each node type corresponds to a specific R Markdown or Quarto construct (headings, code chunks, text, etc.).

Important: The AST represents documents as a linear sequence of nodes, not a nested tree structure. This means that structural elements like fenced divs are represented as separate opening and closing nodes in the sequence, rather than as nodes with children.

The default print method for rmd_ast objects (flat = FALSE) presents an implicit tree structure based on heading levels. This provides a hierarchical view that reflects the document’s logical organization, where content is grouped under the most recent heading.

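Properties:

  - nodes: list of AST node objects in document order (the list passed to rmd_ast() in the example below)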

Example:

Raw text that would be parsed:

---
title: "Example Document"
---

# Introduction

This is some text.

```{r}
x <- 1:5
mean(x)
```

This would create an rmd_ast object containing:

  1. rmd_yaml node with the title
  2. rmd_heading node with “Introduction”
  3. rmd_markdown node with “This is some text.”
  4. rmd_chunk node with the R code
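A minimal sketch of parsing the text above, assuming the parsermd package is installed and that parse_rmd() accepts the document contents as a single string. The names fence, doc_lines, and ast_parsed are illustrative, and the backtick fence is built with strrep() only so that the example can itself sit inside a fenced block.

```r
library(parsermd)

# Three backticks, built programmatically so this example's own fence stays intact
fence = strrep("`", 3)

doc_lines = c(
  "---",
  'title: "Example Document"',
  "---",
  "",
  "# Introduction",
  "",
  "This is some text.",
  "",
  paste0(fence, "{r}"),
  "x <- 1:5",
  "mean(x)",
  fence,
  ""
)

# Assumption: a single string that is not a file path is treated as document text
ast_parsed = parse_rmd(paste(doc_lines, collapse = "\n"))

# Linear view: expected to list the YAML, heading, markdown, and chunk nodes in order
print(ast_parsed, flat = TRUE)
```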

Programmatic creation:

ast = rmd_ast(list(
  rmd_yaml(list(title = "Example Document")),
  rmd_heading(name = "Introduction", level = 1L),
  rmd_markdown(lines = "This is some text."),
  rmd_chunk(
    engine = "r",
    code = c("x <- 1:5", "mean(x)")
  )
))

Hierarchical view (flat = FALSE):

print(ast, flat = FALSE)
#> ├── YAML [1 field]
#> └── Heading [h1] - Introduction
#>     ├── Markdown [1 line]
#>     └── Chunk [r, 2 lines] -

Linear view (flat = TRUE):

print(ast, flat = TRUE)
#> ├── YAML [1 field]
#> ├── Heading [h1] - Introduction
#> ├── Markdown [1 line]
#> └── Chunk [r, 2 lines] -
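The hierarchical view can be derived from the flat sequence alone. The sketch below uses base R and a hand-written data frame (nodes, a hypothetical stand-in for the AST) to compute an indentation depth for each node from heading levels. It assumes heading levels are not skipped, and parsermd's print method may compute this differently.

```r
# Hypothetical stand-in for a flat AST: one row per node, in document order
nodes = data.frame(
  type  = c("rmd_yaml", "rmd_heading", "rmd_markdown", "rmd_chunk"),
  level = c(NA, 1L, NA, NA)  # heading level, NA for non-heading nodes
)

# Level of the most recent heading at or before each node (0 before any heading)
current = 0L
last_level = integer(nrow(nodes))
for (i in seq_len(nrow(nodes))) {
  if (!is.na(nodes$level[i])) current = nodes$level[i]
  last_level[i] = current
}

# A heading sits one step below its parent level; other nodes sit under the latest heading
is_heading = !is.na(nodes$level)
nodes$depth = ifelse(is_heading, nodes$level - 1L, last_level)

nodes$depth
#> [1] 0 0 1 1
```

The depths match the two printed views: the YAML and heading nodes sit at the top level, while the markdown and chunk nodes are indented under the heading.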

S7 Class System

parsermd uses the S7 object system for all AST node types. S7 provides formal class definitions with typed properties that are validated when an object is created or modified.

Key S7 Features in parsermd:

Property Access:

# Create a heading node
heading = rmd_heading(name = "Section Title", level = 2L)

# Access properties with @
heading@name
#> [1] "Section Title"
heading@level
#> [1] 2
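Properties can also be listed and updated. A short sketch, assuming parsermd and S7 are installed; S7::prop_names() is an S7 function, and the new level value here is arbitrary.

```r
library(parsermd)

heading = rmd_heading(name = "Section Title", level = 2L)

# List the property names defined on the node
S7::prop_names(heading)

# Properties can be updated with @; S7 checks the new value against
# the property's declared type
heading@level = 3L
heading@level
#> [1] 3
```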

Core Node Types

Document Structure Nodes

YAML Header - rmd_yaml

The rmd_yaml node represents YAML front matter at the beginning of documents.

Properties:

  - yaml: list of the parsed YAML fields

Example:

Raw text that would be parsed:

---
title: "My Document"
author: "John Doe"
date: "2023-01-01"
---

Programmatic creation:

yaml_node = rmd_yaml(list(
  title = "My Document",
  author = "John Doe",
  date = "2023-01-01"
))
yaml_node
#> <rmd_yaml>
#>  @ yaml:List of 3
#>  .. $ title : chr "My Document"
#>  .. $ author: chr "John Doe"
#>  .. $ date  : chr "2023-01-01"

Markdown Headings - rmd_heading

The rmd_heading node represents section headings in markdown.

Properties:

  - name: character, the heading text
  - level: integer, the heading level (1 for a single #)

Example:

Raw text that would be parsed:

# Introduction

Programmatic creation:

heading_node = rmd_heading(
  name = "Introduction", 
  level = 1L
)
heading_node
#> <rmd_heading>
#>  @ name : chr "Introduction"
#>  @ level: int 1

Markdown Text - rmd_markdown

The rmd_markdown node represents plain markdown text content.

Properties:

  - lines: character vector, one element per line of text

Example:

Raw text that would be parsed:

This is a paragraph.
With multiple lines.

Programmatic creation:

markdown_node = rmd_markdown(
  lines = c("This is a paragraph.", "With multiple lines.")
)
markdown_node
#> <rmd_markdown>
#>  @ lines: chr [1:2] "This is a paragraph." "With multiple lines."

Code and Execution Nodes

Executable Code Chunks - rmd_chunk

The rmd_chunk node represents executable code chunks with options and metadata.

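Properties:

  - engine: character, the language engine (for example "r")
  - label: character, the chunk label
  - options: list of chunk options
  - code: character vector, one element per line of code
  - indent: character, indentation preceding the chunk ("" by default)
  - n_ticks: integer, number of backticks in the fence (3 by default)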

Chunk Option Formats:

Chunks support two option formats that can be used independently or together:

  1. Traditional format: Options specified in the chunk header after the engine and label, for example {r chunk-label, eval=TRUE, echo=FALSE}

  2. YAML format: Options specified as YAML comments within the chunk

    ```{r chunk-label}
    #| eval: true
    #| echo: false
    ```

Option Conflict Resolution:

When the same option is specified in both formats, YAML options take precedence over traditional options. A warning is emitted when conflicts occur:

    {r eval=TRUE}
    #| eval: false

In this case, eval: false (YAML) wins over eval=TRUE (traditional), and the parser emits: “YAML options override traditional options for: eval”
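The precedence rule can be pictured as a list merge in which YAML values replace same-named traditional values. The helper below, merge_chunk_options(), is hypothetical and written in base R for illustration; it is not the parser's implementation.

```r
# Hypothetical helper: merge two option lists so that YAML values win,
# warning about any option that appears in both
merge_chunk_options = function(traditional, yaml) {
  conflicts = intersect(names(traditional), names(yaml))
  if (length(conflicts) > 0) {
    warning(
      "YAML options override traditional options for: ",
      paste(conflicts, collapse = ", "),
      call. = FALSE
    )
  }
  # modifyList() replaces entries of the first list with same-named entries of the second
  modifyList(traditional, yaml)
}

# The warning is suppressed here only to keep the output short
opts = suppressWarnings(
  merge_chunk_options(
    traditional = list(eval = "TRUE"),
    yaml        = list(eval = FALSE)
  )
)
opts$eval
#> [1] FALSE
```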

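Type Handling:

  - Traditional options are stored as character strings (for example "TRUE")
  - YAML options are stored with their parsed types (for example logical TRUE)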

Examples:

Traditional format chunk:

```{r example, eval=TRUE, echo=FALSE}
x <- 1:10
mean(x)
```

YAML format chunk:

```{r example}
#| eval: true
#| echo: false
x <- 1:10
mean(x)
```

Mixed format chunk (with conflict):

```{r example, eval=TRUE}
#| eval: false
#| message: false
x <- 1:10
mean(x)
```

In this case, eval: false (YAML) overrides eval=TRUE (traditional).

Programmatic creation:

# Traditional-style options
chunk_node_traditional = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = "TRUE", echo = "FALSE"),
  code = c("x <- 1:10", "mean(x)")
)

# YAML-style options with proper types
chunk_node_yaml = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = TRUE, echo = FALSE),
  code = c("x <- 1:10", "mean(x)")
)

chunk_node_yaml
#> <rmd_chunk>
#>  @ engine : chr "r"
#>  @ label  : chr "example"
#>  @ options:List of 2
#>  .. $ eval: logi TRUE
#>  .. $ echo: logi FALSE
#>  @ code   : chr [1:2] "x <- 1:10" "mean(x)"
#>  @ indent : chr ""
#>  @ n_ticks: int 3

Raw Output Chunks - rmd_raw_chunk

The rmd_raw_chunk node represents raw output chunks for specific formats.

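Properties:

  - format: character, the output format (for example "html")
  - code: character vector, one element per line of raw content
  - indent: character, indentation preceding the chunk ("" by default)
  - n_ticks: integer, number of backticks in the fence (3 by default)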

Example:

Raw text that would be parsed:

```{=html}
<div class='custom'>
  <p>Custom HTML content</p>
</div>
```

Programmatic creation:

raw_chunk_node = rmd_raw_chunk(
  format = "html",
  code = c(
    "<div class='custom'>", 
    "  <p>Custom HTML content</p>", 
    "</div>"
  )
)
raw_chunk_node
#> <rmd_raw_chunk>
#>  @ format : chr "html"
#>  @ code   : chr [1:3] "<div class='custom'>" "  <p>Custom HTML content</p>" ...
#>  @ indent : chr ""
#>  @ n_ticks: int 3

Fenced Code Blocks - rmd_code_block

The rmd_code_block node represents non-executable fenced code blocks.

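Properties:

  - id: character, the block ID (empty if none)
  - classes: character vector of classes (here the language, "python")
  - attr: character vector of further attributes (empty if none)
  - code: character vector, one element per line of code
  - indent: character, indentation preceding the block ("" by default)
  - n_ticks: integer, number of backticks in the fence (3 by default)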

Example:

Raw text that would be parsed:

```python
def hello():
    print('Hello, World!')
```

Programmatic creation:

code_block_node = rmd_code_block(
  classes = c("python"),
  code = c(
    "def hello():", 
    "    print('Hello, World!')"
  )
)
code_block_node
#> <rmd_code_block>
#>  @ id     : chr(0) 
#>  @ classes: chr "python"
#>  @ attr   : chr(0) 
#>  @ code   : chr [1:2] "def hello():" "    print('Hello, World!')"
#>  @ indent : chr ""
#>  @ n_ticks: int 3

Code Block Literals - rmd_code_block_literal

The rmd_code_block_literal node represents code blocks with literal attribute capture using the {{...}} syntax. This format preserves the raw attribute content exactly as written, making it ideal for displaying code chunk examples.

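Properties:

  - attr: character, the raw attribute text between the outer braces
  - code: character vector, one element per line of code
  - indent: character, indentation preceding the block ("" by default)
  - n_ticks: integer, number of backticks in the fence (3 by default)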

Example:

Raw text that would be parsed (header and body of the fenced block):

    {{r, echo=TRUE, eval=FALSE}}
    x <- 1:10
    mean(x)

Programmatic creation:

code_block_literal_node = rmd_code_block_literal(
  attr = "r, echo=TRUE, eval=FALSE",
  code = c(
    "x <- 1:10", 
    "mean(x)"
  )
)
code_block_literal_node
#> <rmd_code_block_literal>
#>  @ attr   : chr "r, echo=TRUE, eval=FALSE"
#>  @ code   : chr [1:2] "x <- 1:10" "mean(x)"
#>  @ indent : chr ""
#>  @ n_ticks: int 3

Nested Braces Support:

The literal format can handle nested braces in attributes: {{r, code='function() { return(1) }'}}

This captures the attribute as: "r, code='function() { return(1) }'"
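The capture can be pictured as removing only the outermost pair of double braces. The helper strip_literal_braces() below is hypothetical, written in base R for illustration, and is not the parser's implementation.

```r
# Hypothetical helper: drop only the outer "{{" and "}}" so that braces
# inside the attribute text are kept
strip_literal_braces = function(header) {
  sub("^\\{\\{(.*)\\}\\}$", "\\1", header)
}

strip_literal_braces("{{r, code='function() { return(1) }'}}")
#> [1] "r, code='function() { return(1) }'"
```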


Structural Elements

Fenced Divs - rmd_fenced_div_open & rmd_fenced_div_close

Fenced divs are represented as pairs of nodes in the linear AST sequence. The rmd_fenced_div_open node marks the beginning of a fenced div block, and the rmd_fenced_div_close node marks the end. Any content between these nodes is considered to be inside the fenced div.

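rmd_fenced_div_open Properties (as used in the example below):

  - classes: character vector of CSS classes
  - attr: named character vector of attributes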

rmd_fenced_div_close Properties: None (just a marker)

Example:

Raw text that would be parsed:

::: {.warning #important}
This content is inside the fenced div.

More content here.
:::

This would create a sequence of nodes:

  1. rmd_fenced_div_open with attributes
  2. rmd_markdown with “This content is inside the fenced div.”
  3. rmd_markdown with “More content here.”
  4. rmd_fenced_div_close

Programmatic creation:

# Create the opening node
fenced_div_open_node = rmd_fenced_div_open(
  classes = c(".warning"),
  attr = c(id = "important")
)

# Create the closing node
fenced_div_close_node = rmd_fenced_div_close()

# These would typically be combined with content nodes in an rmd_ast
ast_with_div = rmd_ast(list(
  fenced_div_open_node,
  rmd_markdown(
    lines = "This content is inside the fenced div."
  ),
  rmd_markdown(
    lines = "More content here."
  ),
  fenced_div_close_node
))
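Because divs are stored as open and close markers, membership has to be worked out by walking the sequence. The base R sketch below does this over a hand-written vector of node type names (types, an illustrative stand-in). In practice the names might come from a helper such as rmd_node_type(), which is an assumption here.

```r
# Node types in document order, written out by hand for this sketch
types = c(
  "rmd_fenced_div_open",
  "rmd_markdown",
  "rmd_markdown",
  "rmd_fenced_div_close"
)

# Running count of divs that have been opened but not yet closed
opened = cumsum(types == "rmd_fenced_div_open")
closed = cumsum(types == "rmd_fenced_div_close")
depth_after = opened - closed

# A content node is inside a div when at least one div is still open
inside_div = depth_after > 0 & !grepl("fenced_div", types)
inside_div
#> [1] FALSE  TRUE  TRUE FALSE

# Every opening node should have a matching close
stopifnot(tail(depth_after, 1) == 0, all(depth_after >= 0))
```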

Extracted Elements

The following classes represent elements that can be extracted from AST nodes through secondary parsing, rather than being direct nodes in the AST structure. These elements are found within markdown text and code content.

Inline Code - rmd_inline_code

The rmd_inline_code class represents inline code expressions extracted from markdown text.

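Properties (as set in the example below):

  - engine: character, the language engine (for example "r")
  - code: character, the inline expression
  - braced: logical, whether the engine is written in braces ({r} rather than r)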

Example:

Raw text containing inline code:

The result is `r 2 + 2`.

Programmatic creation:

# Create directly
inline_code_obj = rmd_inline_code(
  engine = "r",
  code = "2 + 2",
  braced = FALSE
)
inline_code_obj
#>  rmd_inline_code[-1,-1] `r 2 + 2`
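Secondary parsing of this kind can be pictured as pattern matching over markdown text. The base R sketch below pulls the engine and code out of a sentence with a regular expression. The pattern is illustrative only and is not how parsermd extracts inline code.

```r
text = "The result is `r 2 + 2`."

# Illustrative pattern: a backtick, an engine name, a space, then everything
# up to the closing backtick
pattern = "`([A-Za-z]+) ([^`]+)`"
m = regmatches(text, regexec(pattern, text))[[1]]

engine = m[2]
code = m[3]
engine
#> [1] "r"
code
#> [1] "2 + 2"
```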

Shortcode Function Calls - rmd_shortcode

The rmd_shortcode class represents Quarto shortcode function calls extracted from markdown content.

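Properties (as set in the example below):

  - func: character, the shortcode function name
  - args: character vector of arguments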

Example:

Raw text containing a shortcode:

{{< embed type=video src=example.mp4 >}}

Programmatic creation:

# Create directly
shortcode_obj = rmd_shortcode(
  func = "embed",
  args = c("type=video", "src=example.mp4")
)
shortcode_obj
#>  rmd_shortcode[-1,-1] {{< embed type=video src=example.mp4 >}}
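The func and args values correspond to the first token and the remaining tokens of the shortcode. A base R sketch of that split, for illustration only: it ignores quoted arguments that contain spaces and is not parsermd's extractor.

```r
raw = "{{< embed type=video src=example.mp4 >}}"

# Strip the delimiters, then split the remainder on whitespace
inner = trimws(sub("^\\{\\{<(.*)>\\}\\}$", "\\1", raw))
parts = strsplit(inner, "[[:space:]]+")[[1]]

func = parts[1]
args = parts[-1]  # the remaining tokens: "type=video" and "src=example.mp4"
func
#> [1] "embed"
```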

Spans - rmd_span

The rmd_span class represents inline span elements with attributes extracted from markdown text.

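Properties (as set in the example below):

  - text: character, the span text
  - id: character vector of IDs (for example "#key")
  - classes: character vector of CSS classes (for example ".highlight")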

Example:

Raw text containing a span:

[Important text]{.highlight #key}

Programmatic creation:

# Create directly
span_obj = rmd_span(
  text = "Important text",
  id = c("#key"),
  classes = c(".highlight")
)
span_obj
#>  rmd_span [Important text]{#key .highlight}
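The mapping from raw span text to text, id, and classes can be sketched with a regular expression in base R. This is an illustration under simplifying assumptions (one span, no nested brackets, no key=value attributes), not parsermd's extractor.

```r
raw = "[Important text]{.highlight #key}"

# Capture the bracketed text and the brace-delimited attribute string
m = regmatches(raw, regexec("^\\[(.*)\\]\\{(.*)\\}$", raw))[[1]]
text = m[2]
attrs = strsplit(trimws(m[3]), "[[:space:]]+")[[1]]

# Tokens starting with "#" are IDs, tokens starting with "." are classes
id = attrs[startsWith(attrs, "#")]
classes = attrs[startsWith(attrs, ".")]

text
#> [1] "Important text"
id
#> [1] "#key"
classes
#> [1] ".highlight"
```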

Extraction Functions

These utility functions extract the above elements from AST nodes: