Contents

1 Introduction

tidyGenR core functions cover all the steps necessary to determine sequence variants and genotypes from sequencing reads coming from multilocus amplicon libraries sequenced by high throughput sequencing technologies.

This vignette contains a standard workflow for paired-end data.

library(tidyGenR)

2 Input raw sequences

The raw and processed data used in this vignette come from amplicon libraries of Rattus baluensis (Camacho-Sanchez et al. 2026).

Naming of FASTQ must meet a given format.

# download test raw data (if not downloaded)
raw <-
  system.file("extdata", "raw", package = "tidyGenR")
freads <- list.files(raw,
    pattern = "1.fastq",
    full.names = TRUE)
rreads <- list.files(raw,
    pattern = "2.fastq",
    full.names = TRUE)

3 Demultiplex loci

Reads from different loci are mixed together in the same sample FASTQ. The known primer sequences at the ends of amplicons can be used to demultiplex reads by locus using demultiplex(). This function produces an executable for cutadapt. A data.frame with primer sequences can be used to feed demultiplex() with primer sequences.

# data.frame with primers
data("primers")
head(primers, 3)
#>    locus                   fw                   rv
#> 1 nfkbia GCCTCCAAACACACAGTCAT TGAGGAGAGCTATGACACGG
#> 2 chrna9 TTATCTGGGAGAGCGTGACC TTGGGAAARGATGAACCGGC
#> 3  rogdi  AGAARCCGGCTCACTACCC  GAGGCACAGCTTGTTGAGG
p_sh_out <- tempfile(fileext = ".sh")
# demultiplex reads by locus using 3 primer pairs
demultiplex(
    freads = freads,
    rreads = rreads,
    primers = primers[1:3, ],
    sh_out = p_sh_out,
    run = FALSE
)

demultiplex() creates a script as below:

readLines(p_sh_out)
#>  [1] "#!/bin/bash"                                                                                                                         
#>  [2] "cat /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/samples | while read sample"                                        
#>  [3] "do"                                                                                                                                  
#>  [4] "/usr/local/bin/cutadapt \\"                                                                                                          
#>  [5] "--discard-untrimmed --pair-adapters --no-indels  \\"                                                                                 
#>  [6] "--overlap 15 \\"                                                                                                                     
#>  [7] "-e 0.15 \\"                                                                                                                          
#>  [8] "-g file:/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/fw_primers \\"                                                  
#>  [9] "-G file:/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/rv_primers \\"                                                  
#> [10] "-o /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/$sample.{name}.1.fastq.gz \\"                                        
#> [11] "-p /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/$sample.{name}.2.fastq.gz \\"                                        
#> [12] "/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/raw/\"$sample\".1.fastq.gz \\"
#> [13] "/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/raw/\"$sample\".2.fastq.gz"   
#> [14] "done > /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/filee0fb1ce3180f.log"

If problems arise when running cutadapt run = TRUE, the script written to sh_out can be edited manually and run outside R. FASTQ with zero or few reads can be removed using remove_poor_fastq()

4 Truncate reads

Variants are determined on the basis of 1-nt resolution. Thus, reads to be compared need to have the same length and minimum quality stantards (eg. no ‘Ns’) (see DADA2 documentation, Callahan et al. 2016). Sequence quality can be checked with FASTQC, dada2::plotQualityProfile() or other QC checking software to guide truncation lengths. Truncation and minimal quality filtering are performed with trunc_amp(), a wrapper function of dada2::filterAndTrim(). Global or locus-specific truncation lengths can be supplied in a data.frame.

# Download per-locus demultiplexed FASTQ
dem <-
  system.file("extdata", "demultiplexed", package = "tidyGenR")
# truncate reads for one locus
p_trun <- file.path(tempdir(), "truncated")
one_locus <-
    trunc_amp(
        loci = "chrna9",
        mode_trun = "pe",
        in_dir = dem,
        fw_pattern = "1.fastq.gz",
        rv_pattern = "2.fastq.gz",
        trunc_fr = c(250, 180),
        max_ee = c(3, 3),
        outdir = p_trun
    )
#> 
#> Samples detected:
#>  BOR1061, BOR1063, BOR1069
#> 
#> Loci detected: 
#> chrna9, nfkbia, rogdi
#> 
#> Sequencing files matching patterns seem to conform the required naming format 'sample.locus.[1|2].fastq.gz'
#> 
#> Truncating chrna9
#> Creating output directory: /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/truncated
#> 
#> Forwards reads of chrna9 truncated at 250 nt.
#> 
#> Reverse reads of chrna9 truncated at 180 nt.
#> Filtered/truncated reads written to/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/truncated

It is possible to use locus-specific truncation lengths specified in a data.frame.

# dataframe with locus-specific truncation lengths
data("trunc_fr")

head(trunc_fr, 3)
#>    locus trunc_f trunc_r
#> 1 nfkbia     198     197
#> 2 chrna9     196     182
#> 3  rogdi     183     185
# truncate reads for all loci using locus-specific truncation lengths
all_loci <-
    trunc_amp(
        mode_trun = "pe",
        in_dir = dem,
        fw_pattern = "1.fastq.gz",
        rv_pattern = "2.fastq.gz",
        trunc_fr = trunc_fr,
        max_ee = c(3, 3),
        outdir = p_trun
    )

trunc_amp() writes truncated FASTQ to a directory and returns a list with in and out reads.

Truncated FASTQ:

head(list.files(p_trun))
#> [1] "BOR1061_chrna9_F_filt.fastq.gz" "BOR1061_chrna9_R_filt.fastq.gz"
#> [3] "BOR1061_nfkbia_F_filt.fastq.gz" "BOR1061_nfkbia_R_filt.fastq.gz"
#> [5] "BOR1061_rogdi_F_filt.fastq.gz"  "BOR1061_rogdi_R_filt.fastq.gz"

IN and OUT reads after trunc_amp():

head(all_loci, 3)
#> $chrna9
#>         reads.in reads.out
#> BOR1061       88        62
#> BOR1063       67        52
#> BOR1069       91        62
#> 
#> $nfkbia
#>         reads.in reads.out
#> BOR1061       92        46
#> BOR1063       94        44
#> BOR1069       93        52
#> 
#> $rogdi
#>         reads.in reads.out
#> BOR1061       91        50
#> BOR1063       93        60
#> BOR1069       93        55

5 Variant calling

Real sequence variants and their frequency are determined with variant_call(). It utilizes the ‘DADA2’ ASV determination pipeline, but for multiple loci. Determination of variants and potentially, merging paired-end reads, and removal of chimeras are performed altogether with variant_call(). Binned read qualities (i.e. from NovaSeq) require the use of error_function = loess_err_mod4 (details in ?loess_err_mod4()). Variants are returned in tidy format.

variants <- variant_call(in_dir = p_trun)
head(variants)
head(variants)
#> # A tibble: 6 × 7
#>   sample  locus  variant reads    nt md5                              sequence  
#>   <chr>   <fct>  <chr>   <int> <dbl> <chr>                            <chr>     
#> 1 BOR1061 chrna9 01         28   196 9598f07291c660d1a00c1497350381ad TGCAGTGTG…
#> 2 BOR1061 chrna9 02         34   196 a189edcac75e0853e8fac55df31acede TGCAGTGTG…
#> 3 BOR1061 nfkbia 1          46   198 cf753e1049c5f09f805df1918309a34b CGTAGGGCA…
#> 4 BOR1061 rogdi  1          50   183 6ddfd64d697d724f09d5f0ba1c909de6 CAGCCACCC…
#> 5 BOR1063 chrna9 01         52   196 9598f07291c660d1a00c1497350381ad TGCAGTGTG…
#> 6 BOR1063 nfkbia 1          44   198 cf753e1049c5f09f805df1918309a34b CGTAGGGCA…

Variants can be further filtered with filter_variants() based on frequency (maf) and depth (ad) thresholds.

6 Genotype

Genotypes can be determined from variants. So far, the implemented method for genotyping is based on the expected maximum number of alleles per sample and locus and the read threshold (ADt) for discriminating homozygotes from hemizygotes.

genotypes <-
    genotype(variants, ADt = 10, ploidy = 2)
head(genotypes)
#> # A tibble: 6 × 8
#>   sample  locus  allele allele_no reads    nt md5                       sequence
#>   <chr>   <chr>  <chr>      <int> <dbl> <dbl> <chr>                     <chr>   
#> 1 BOR1061 chrna9 01             1    28   196 9598f07291c660d1a00c1497… TGCAGTG…
#> 2 BOR1061 chrna9 02             2    34   196 a189edcac75e0853e8fac55d… TGCAGTG…
#> 3 BOR1061 nfkbia 1              1    23   198 cf753e1049c5f09f805df191… CGTAGGG…
#> 4 BOR1061 nfkbia 1              2    23   198 cf753e1049c5f09f805df191… CGTAGGG…
#> 5 BOR1061 rogdi  1              1    25   183 6ddfd64d697d724f09d5f0ba… CAGCCAC…
#> 6 BOR1061 rogdi  1              2    25   183 6ddfd64d697d724f09d5f0ba… CAGCCAC…