| Title: | Functional Shannon Entropy for Virome Mutational Analysis |
| Version: | 1.2 |
| Date: | 2026-02-09 |
| Description: | Estimates Shannon entropy, per gene and per genomic position, associated with non-synonymous mutation frequencies in viral populations, such as wastewater samples or quasispecies. By categorizing amino acids based on their physicochemical properties, the package determines whether a mutation is functionally disruptive or neutral. Provides normalized values (0-1 scale) to facilitate the direct comparison of different genomic positions or total functional entropy across multiple metagenomes. Designed to analyze mutational data using tabular 'Single Nucleotide Variant' (SNV) frequency tables generated by variant callers (e.g., 'iVar' or 'LoFreq'), operating independently of consensus sequence estimation and multiple sequence alignment. |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.1.0) |
| LazyData: | true |
| Suggests: | rmarkdown |
| Imports: | ggplot2, patchwork, beeswarm, knitr |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.3 |
| License: | MIT + file LICENSE |
| NeedsCompilation: | no |
| Packaged: | 2026-02-24 18:36:01 UTC; leandro |
| Author: | Leandro Roberto Jones |
| Maintainer: | Leandro Roberto Jones <lrj000@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-03 10:00:38 UTC |
Coerce entropyProfile to a Data Frame
Description
Function to extract summary information from an entropyProfile
object. This function is internally used for plotting.
Usage
## S3 method for class 'entropyProfile'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
Arguments
x |
An object of class |
row.names |
Please see |
optional |
Please see |
... |
Additional arguments passed to the function. |
Value
A data frame with tabular information on an entropy profile.
This information includes the name of the proteins presenting
mutations, the corresponding genomic positions, and the resulting
entropies in the metagenome.
Evaluates Entropy Hotspot
Description
Graphical and formal analyses of contiguous amino acids.
Usage
assessHotSpot(profile, boundaries, chartType = "boxplot")
Arguments
profile |
An object of class |
boundaries |
Numeric vector with the first and last genomic positions of the region to be evaluated. To be set interactively if not provided. |
chartType |
Chart type; either "boxplot", "stripchart" or "swarm". |
Details
The query stretch (e.g. a protein domain with neutralizing epitopes) is compared against the full set of proteins. Hot spot boundaries should be indicated relative to the reference genome used in variant calling.
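The comparison described above can be pictured with a small base R sketch. It assumes a Wilcoxon rank-sum test purely for illustration, since the test used by assessHotSpot is not stated here. The entropy_table data frame and its values are made up, and the column names position and entropy are assumptions.

```r
# Illustrative sketch only; not the package's implementation.
# Made-up position-wise entropies (hypothetical values).
entropy_table <- data.frame(
  position = c(21800, 22000, 22599, 22673, 22882,
               23040, 23500, 24000, 25000, 26000),
  entropy  = c(0.05, 0.12, 0.41, 0.35, 0.52,
               0.47, 0.08, 0.20, 0.15, 0.10)
)

# First and last genomic positions of the query stretch
boundaries <- c(22517, 23186)

# Flag positions that fall inside the query stretch
inside <- entropy_table$position >= boundaries[1] &
  entropy_table$position <= boundaries[2]

# Compare entropies inside the stretch against all other positions.
# Assumption: a two-sided Wilcoxon rank-sum test; it returns an htest object.
hotspot_test <- wilcox.test(entropy_table$entropy[inside],
                            entropy_table$entropy[!inside])
print(hotspot_test)
```

The result is an object of class htest, the same kind of object the Value section mentions, so it can be printed or queried (for example hotspot_test$p.value) in the usual way.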
Value
An htest object. This function is called primarily for its side
effects.
See Also
Examples
omicron <- getEntropySignature(wWater[wWater$wave == "third", ])
# Entropy hotspot at SARS-CoV-2 receptor binding domain
assessHotSpot(omicron, c(22517, 23186), chartType = "swarm")
Summarize variants and frequencies at a genome position
Description
This function is used internally by getEntropySignature().
It creates a vector (aminoAcids) listing the amino acids
observed in a virome at a particular position under analysis, including the
reference amino acid, another vector (frequencies) with the
corresponding frequencies, and returns them combined in a data frame.
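As a rough illustration of the summary described above, the following base R sketch builds the two vectors and combines them. The function name sketch_position_summary is hypothetical, the input rows are made up, and the rule that the reference amino acid receives the frequency not accounted for by the alternatives is an assumption rather than documented behaviour.

```r
# Hypothetical re-implementation for illustration; not the package's source.
sketch_position_summary <- function(variants,
                                    ref_aa = "ref_aa",
                                    alt_aa = "alt_aa",
                                    alt_aa_freq = "alt_aa_freq") {
  # All rows are assumed to describe the same protein position.
  # Sum frequencies per alternative amino acid (sorted alphabetically).
  alt_freqs <- tapply(variants[[alt_aa_freq]], variants[[alt_aa]], sum)
  # Assumption: the reference residue takes whatever frequency is left over.
  ref_freq <- max(0, 1 - sum(alt_freqs))
  data.frame(
    aminoAcids  = c(variants[[ref_aa]][1], names(alt_freqs)),
    frequencies = c(ref_freq, as.numeric(alt_freqs)),
    stringsAsFactors = FALSE
  )
}

# Made-up rows for a single locus
locus <- data.frame(ref_aa      = c("S", "S"),
                    alt_aa      = c("L", "F"),
                    alt_aa_freq = c(0.25, 0.125))
position_summary <- sketch_position_summary(locus)
print(position_summary)
```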
Usage
createPositionSummary(variants, ref_aa, alt_aa, alt_aa_freq)
Arguments
variants |
A data frame, similar to the |
ref_aa |
Name of the column that carries reference amino acids. |
alt_aa |
Name of the column carrying alternative amino acids observed in the metagenome. |
alt_aa_freq |
Name of the column giving the frequencies of alternative amino acids. |
Value
A data frame describing the variability (the different amino
acids and their frequencies) observed at a specific locus.
See Also
Build a structure representing amino acid categories.
Description
The function is used internally by fillPosition, which in
turn is an auxiliary function of getEntropySignature.
It creates a list with one element for each amino acid category, named
according to the categories used (e.g., "aliphatic", "aromatic", etc.).
Each element contains a set of amino acids identified by one-letter codes.
The list also includes an element containing an empty numeric vector, whose
names correspond to the labels of each category.
This vector is to be populated with the frequency of each category at a
given genomic position, by the fillPosition function.
Usage
createStorage(categories)
Arguments
categories |
A character string. Similar to the |
Value
A list with one element (character vector) for each amino acid
category and an element (empty named numeric vector) to be
loaded with the frequency in the metagenome of each amino acid category.
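A minimal sketch of such a storage list is given below. The function name sketch_storage and the element name frequencies are hypothetical. The six-class grouping follows the scheme commonly attributed to Mirny and Shakhnovich (1999), and the package's exact assignment of amino acids to classes may differ. The frequency slot is pre-filled with NA values here, which is a choice made for this sketch only.

```r
# Hypothetical sketch; not the package's source.
sketch_storage <- function(categories = "robust") {
  if (categories == "robust") {
    # Six physicochemical classes (assumed grouping)
    classes <- list(
      aliphatic = c("A", "V", "L", "I", "M", "C"),
      aromatic  = c("F", "W", "Y", "H"),
      polar     = c("S", "T", "N", "Q"),
      positive  = c("K", "R"),
      negative  = c("D", "E"),
      special   = c("G", "P")
    )
  } else {
    # "sensitive": one class per amino acid
    aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
    classes <- as.list(stats::setNames(aa, aa))
  }
  # Named numeric slot, one entry per category, to be filled later
  classes$frequencies <- stats::setNames(rep(NA_real_, length(classes)),
                                         names(classes))
  classes
}

storage <- sketch_storage("robust")
str(storage)
```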
See Also
Create (empty) object of class "entropyProfile"
Description
This function is intended primarily for internal use by
getEntropySignature.
Usage
entropyProfile(
polymorphisms,
position = "position",
linkage = "linkage",
ref = "ref",
alt = "alt",
protein = "protein",
aa_position = "aa_position",
ref_aa = "ref_aa",
alt_aa = "alt_aa",
alt_aa_freq = "alt_aa_freq",
entropies = NA_real_,
genome = mn908947.3
)
Arguments
polymorphisms |
A data frame. Please see Details and Examples in
documentation for |
position |
Name of the |
linkage |
Information on linked positions. |
ref |
Column name with reference bases. |
alt |
Column name with the alternative bases observed in the metagenome. |
protein |
Name of the column carrying protein names. |
aa_position |
Name of the column that indicates the protein positions of the mutated amino acids. |
ref_aa |
Name of the column that carries the reference amino acids. |
alt_aa |
Name of the column carrying alternative amino acids observed in the metagenome. |
alt_aa_freq |
Name of the column giving the frequencies of alternative amino acids in the metagenome. |
entropies |
|
genome |
A list providing CDS data and length of the reference genome. |
Details
The documentation for getEntropySignature details the type of
input needed to create a profile. entropyProfile uses the same parameters as
getEntropySignature, with the exception of categories and
entropies.
Value
An (empty) object of class entropyProfile.
See Also
Translate amino acid frequencies into category frequencies
Description
The function is used internally by getEntropySignature.
It creates a storage list with createStorage and loads into it the
frequency of each amino acid category, based on the data contained in a data
frame passed to the function (positionSummary parameter).
Usage
fillPosition(positionSummary, categories)
Arguments
positionSummary |
A data frame created by |
categories |
A character string indicating which category scheme to
use. Similar to the |
Value
A list with information on the frequencies of each amino acid
category observed in a virome in a specific locus. The list contains
a character vector for each amino acid category, and a named
numeric vector containing the frequency of each category in
the metagenome.
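The translation from amino acid frequencies to category frequencies can be sketched in base R as follows. The function name sketch_fill_position, the element name frequencies, the class assignments, and the input values are all assumptions made for illustration.

```r
# Hypothetical sketch; not the package's source.
sketch_fill_position <- function(position_summary, classes) {
  # For each category, add up the frequencies of its member amino acids
  category_freq <- vapply(
    classes,
    function(members) {
      sum(position_summary$frequencies[position_summary$aminoAcids %in% members])
    },
    numeric(1)
  )
  # Return the category definitions plus the filled frequency vector
  c(classes, list(frequencies = category_freq))
}

# Assumed six-class grouping
robust_classes <- list(
  aliphatic = c("A", "V", "L", "I", "M", "C"),
  aromatic  = c("F", "W", "Y", "H"),
  polar     = c("S", "T", "N", "Q"),
  positive  = c("K", "R"),
  negative  = c("D", "E"),
  special   = c("G", "P")
)

# Made-up summary for one locus: reference S plus alternatives L and F
position_summary <- data.frame(aminoAcids  = c("S", "L", "F"),
                               frequencies = c(0.625, 0.25, 0.125))

filled <- sketch_fill_position(position_summary, robust_classes)
print(filled$frequencies)
```

In this made-up case the serine frequency lands in the polar class, leucine in the aliphatic class, and phenylalanine in the aromatic class, so a change between classes contributes to the entropy while a change within a class would not.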
See Also
Infer Entropy Signature
Description
Calculates genome-wide Shannon entropies from SNV data.
Usage
getEntropySignature(
polymorphisms,
position = "position",
linkage = "linkage",
ref = "ref",
alt = "alt",
protein = "protein",
aa_position = "aa_position",
ref_aa = "ref_aa",
alt_aa = "alt_aa",
alt_aa_freq = "alt_aa_freq",
categories = "robust",
genome = mn908947.3
)
Arguments
polymorphisms |
A data frame. Please see Details and Examples. |
position |
Name of the |
linkage |
Information on linked positions. |
ref |
Column name with reference bases. |
alt |
Column name with the alternative bases observed in the metagenome. |
protein |
Name of the column carrying protein names. |
aa_position |
Name of the column that indicates the protein positions of the mutated amino acids. |
ref_aa |
Name of the column that carries the reference amino acids. |
alt_aa |
Name of the column carrying alternative amino acids observed in the metagenome. |
alt_aa_freq |
Name of the column giving the frequencies of alternative amino acids in the metagenome. |
categories |
Whether a class per amino acid should be used ("sensitive") or they should be grouped into aliphatic, aromatic, polar, positively charged, negatively charged, and special ("robust") (Mirny and Shakhnovich, 1999). |
genome |
A list providing CDS data and length of the reference genome. |
Details
You provide a data frame with SNV information, including reference
and alternative amino acids, their frequencies, and the corresponding positions
relative to a reference sequence.
This type of data can be generated by numerous programs and pipelines.
The objective is to assess the biological impact of nonsynonymous
variation within a viral population, such as an environmental sample (e.g.
wastewater) or a single infection (aka quasispecies).
Entropy is calculated within the metagenome and is therefore independent
of the reference sequence.
Some mutations may be part of the same codon.
This is to be indicated in the linkage column, providing a downstream
linked position, or the closest upstream position if there are no downstream
positions that are part of the same codon.
For example, in the wWater dataset, mutations T22673C and C22674T are linked
to each other and affect codon 371 of the S gene:
| | wave | position | linkage | ref | alt | protein | ... |
| ... | | | | | | | |
| 105 | third | 22599 | NA | G | A | S | ... |
| 106 | third | 22673 | 22674 | T | C | S | ... |
| 107 | third | 22674 | 22673 | C | T | S | ... |
| 108 | third | 22679 | NA | T | C | S | ... |
| ... | | | | | | | |
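The codon assignment in the example above can be checked with simple arithmetic. The sketch below assumes that the S coding sequence starts at genomic position 21563, as annotated in the GenBank entry for MN908947.3. The helper codon_of and the snv data frame are hypothetical and only reproduce the position and linkage columns shown above.

```r
# Illustrative check; codon_of() is a hypothetical helper, not a package function.
# Start of the S coding sequence in MN908947.3 (assumed from the GenBank annotation)
s_cds_start <- 21563

# 1-based codon index of a genomic position within a CDS
codon_of <- function(position, cds_start) (position - cds_start) %/% 3 + 1

# Position and linkage columns from the wWater excerpt
snv <- data.frame(position = c(22599, 22673, 22674, 22679),
                  linkage  = c(NA, 22674, 22673, NA))
snv$codon <- codon_of(snv$position, s_cds_start)

# Linked positions should map to the same codon as the row they belong to
linked <- !is.na(snv$linkage)
same_codon <- codon_of(snv$linkage[linked], s_cds_start) == snv$codon[linked]

print(snv)
print(all(same_codon))
```

A check of this kind can help catch linkage entries that point to a position outside the codon before the table is passed to getEntropySignature.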
The genome parameter is a list that provides data on the topology of
protein-coding regions in the genome and its length, used internally
primarily for graphical and summary purposes.
The package provides an example (mn908947.3) of how this
information is to be organized.
Value
An object of class entropyProfile. It contains a tidy,
summarized version of the SNV table, a data frame with
information on genome-wide entropy, a data frame with
information on each CDS and corresponding mutations observed in the
virome, and a list with CDS data and length of the reference
genome used in variant calling.
References
Mirny and Shakhnovich, 1999. J Mol Biol 291:177-196. doi:10.1006/jmbi.1999.2911.
Shannon, 1948. Bell System Technical Journal, 27:379-423. doi:10.1002/j.1538-7305.1948.tb01338.x.
Examples
# Entropy across the genome in ancestral lineages
ancestral <- getEntropySignature(wWater[wWater$wave == "first", ], categories = "sensitive")
# Inspect profile
plot(ancestral, chartType = "entroScan")
Calculate amino acid entropy resulting from SNVs present at a specific locus
Description
The function is used internally by getEntropySignature.
It calculates Shannon entropy from categories and frequencies passed to the
function.
Usage
getPosEntropy(variantPosition)
Arguments
variantPosition |
A list, created and passed to the function by
|
Value
A numeric value, corresponding to the entropy associated with
the SNVs present at a specific locus.
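For reference, Shannon entropy over category frequencies p_i is H = -sum(p_i * log2(p_i)). The sketch below computes it in base R. The function name sketch_pos_entropy is hypothetical, and dividing by log2 of the number of categories to obtain a 0-1 value is an assumption, since the normalization used by the package is not spelled out here.

```r
# Hypothetical sketch; not the package's source.
sketch_pos_entropy <- function(category_freq, normalize = TRUE) {
  # Categories with zero frequency contribute nothing to the sum
  p <- category_freq[category_freq > 0]
  # Rescale so the frequencies sum to one (a choice made for this sketch)
  p <- p / sum(p)
  h <- -sum(p * log2(p))
  # Assumption: normalize by the maximum entropy for this number of categories
  if (normalize) h <- h / log2(length(category_freq))
  h
}

# Made-up category frequencies over four categories
even_split <- sketch_pos_entropy(c(0.5, 0.5, 0, 0))
no_change  <- sketch_pos_entropy(c(1, 0, 0, 0))
uniform    <- sketch_pos_entropy(rep(0.25, 4))
print(c(even_split = even_split, no_change = no_change, uniform = uniform))
```

Under this assumed normalization a locus where every variant stays in one category scores 0, and a locus spread evenly over all categories scores 1.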
See Also
CDS topology and length of Wuhan-Hu-1 reference strain
Description
This type of data can be obtained from .gff (General Feature Format)
files using applications such as the rtracklayer package, or manually
from the corresponding entry in the GenBank database.
Usage
mn908947.3
Format
An object of class list of length 2.
Source
Nucleotide [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – . Accession No. MN908947.3, Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. Available from: https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3.
Plot entropy signatures
Description
Creates entropy charts along a genome.
Usage
## S3 method for class 'entropyProfile'
plot(x, chartType = "bp", ...)
Arguments
x |
Object of class |
chartType |
Whether to graph per-protein summaries ("bp"), per-protein stripcharts ("stripchart" / "swarm"), or position-wise entropy ("entroScan"). |
... |
Additional arguments passed to the function. |
Value
Unrendered gg/ggplot object produced by ggplot2. This
function is primarily called for its side effects.
Examples
ancestral <- getEntropySignature(wWater[wWater$wave == "first", ])
omicron <- getEntropySignature(wWater[wWater$wave == "third", ])
# Enhanced Spike entropy plus pervasive negative selection in Omicron
# sublineages
anc_plot <- plot(ancestral, chartType = "stripchart")
omi_plot <- plot(omicron, chartType = "stripchart")
patchwork::wrap_plots(anc_plot/omi_plot)
Print method for profileSummary objects
Description
This function formats and prints compact entropy profile summaries
(profileSummary objects), on the console.
Usage
## S3 method for class 'profileSummary'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments passed to the function. |
Value
Invisibly returns NULL. This function is used for its side
effect.
Print method for tidyMutations objects
Description
This function formats and prints compact mutation summaries
(tidyMutations objects), on the console.
Usage
## S3 method for class 'tidyMutations'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments passed to the function. |
Value
Invisibly returns NULL. Called for side effect.
Summarize mutations
Description
Displays SNVs, and corresponding protein mutations, at specific genomic positions.
Usage
showMutations(profile, positions)
Arguments
profile |
An object of class |
positions |
A vector with genome positions relative to the reference genome. |
Details
The user provides a list of genome positions and the function prints the mutations associated with them. The output format is "ref_res###alt_res / protein:ref_res###alt_res", where ref_res is the residue (either nucleotide or amino acid) in the reference strain, alt_res is the alternative residue in the metagenome, "###" is the position (either nucleotide or amino acid) where the mutation was observed, and "protein" is the name of the affected protein.
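The label format described above can be reproduced with sprintf. The helper format_mutation below is hypothetical and is not part of the package. The example values correspond to the nucleotide change G22599A, which falls in codon 346 of the S gene.

```r
# Hypothetical helper for illustration; not a package function.
format_mutation <- function(ref, position, alt,
                            protein, ref_aa, aa_position, alt_aa) {
  # "ref_res###alt_res / protein:ref_res###alt_res"
  sprintf("%s%d%s / %s:%s%d%s",
          ref, as.integer(position), alt,
          protein, ref_aa, as.integer(aa_position), alt_aa)
}

label <- format_mutation("G", 22599, "A", "S", "R", 346, "K")
print(label)
```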
Value
An object of class c("tidyMutations", "data.frame"),
containing summary information about user-supplied genomic
positions. This information includes the mutations themselves
relative to the reference genome, their positions within it, and the
corresponding abundances in the virome. Intended to be displayed by
print.tidyMutations.
See Also
Examples
# High entropy at the RBD in Omicron lineages
omicron <- getEntropySignature(wWater[wWater$wave == "third", ])
plot(omicron, chartType="stripchart")
# Identify the high-entropy positions
omicron$Entropy$position[ omicron$Entropy$entropy > 0.3 ]
#[1] 22882 22898 22917 23013 23040 23048 23055 23063
# Get a descriptive table
showMutations(omicron, c(22882, 22898, 22917, 23013, 23040, 23048, 23055, 23063))
Summarize entropy profile
Description
Prints a report about an entropy profile (an object of class "entropyProfile").
Usage
## S3 method for class 'entropyProfile'
summary(object, ...)
Arguments
object |
An object of class |
... |
Other parameters passed to the function. |
Value
An object of class c("profileSummary", "list") summarizing
an entropy profile. Intended to be displayed via
print.profileSummary.
Data from first and third COVID-19 waves in Trelew http://tools.wmflabs.org/geohack/geohack.php?language=es&pagename=Trelew&params=-43.253333333333_N_-65.309444444444_E_type:city
Description
SNVs inferred from Illumina (2 x 150) sequences from pooled ultra-pure virus concentrates representative of the 1st and 3rd COVID-19 waves in Trelew. Reads were mapped against the Wuhan-Hu-1 reference genome (MN908947.3) with bwa, and variants were called with iVar using a 3% frequency cutoff for minor variants. First wave cases were caused by ancestral strains, whereas third wave cases were mainly due to highly human-adapted Omicron sublineages.
Usage
wWater
Format
An object of class data.frame with 148 rows and 10 columns.
Source
Manrique, Julieta Marina, and Leandro Roberto Jones. 2025. A Cost-Effective Wastewater-Based Workflow for Community-Level Insights into SARS-CoV-2 Evolution. Unpublished 0 (0): 000-000.