Overview of MetaEntropy


Motivation

High-performance sequencing has exponentially increased the power of intrapopulation viral diversity study. Short-read methods are particularly popular due to their excellent cost-to-depth ratio and the availability of efficient algorithms, such as the Burrows-Wheeler Transform, to infer read homology within the framework of reference genomes.

MetaEntropy is designed to process the typical output of variant calling tools, consisting of information about the variants present in a population, relative to a reference genome, and their corresponding frequencies. It generates “informed” estimates of entropy at each genomic position, which, in addition to allele frequencies, take into account the physicochemical characteristics of the amino acids. The calculated values are normalized to a scale of 0 to 1, facilitating the comparison of different loci and different metagenomes.


Input data

Single nucleotide variants (SNVs) data must be provided in a data frame. This should contain columns indicating the reference genome positions that are polymorphic in the virome, the linkage relationships between them, the corresponding bases in the reference genome and in the virome, which protein and amino acid position each mutation affects, the corresponding amino acids in the reference strain and in the metagenome, and mutation frequencies. For example, the wWater dataset, which accompanies the package, has this information in the columns position, linkage, ref, alt, protein, aa_position, ref_aa, alt_aa and alt_aa_freq, which are the labels expected by default:

library(MetaEntropy)
wWater[105:108,-1]
#>     position linkage ref alt protein aa_position ref_aa alt_aa alt_aa_freq
#> 105    22599      NA   G   A       S         346      R      K     0.96569
#> 106    22673   22674   T   C       S         371      S      L     1.00000
#> 107    22674   22673   C   T       S         371      S      L     1.00000
#> 108    22679      NA   T   C       S         373      S      P     0.99959

“Linked” mutations are variants that are physically connected in the genome and affect a same codon.

There can also be positions with several different mutations:

wWater[50:51,-1]
#>    position linkage ref alt protein aa_position ref_aa alt_aa alt_aa_freq
#> 50    23429      NA   G   T       S         623      A      S     0.09884
#> 51    23429      NA   G   A       S         623      A      T     0.12706

These variants collectively contribute to the entropy at the affected residue.


Worked example: SARS-CoV-2 immune escape

Loading packages

First we load some packages used in this example:

lapply(c("ggplot2", "patchwork"), library, character.only = TRUE)

Conceptual framework and data

This example uses data from wastewater colected during the first and third COVID-19 waves in Argentina (BioProject PRJNA1183343; Manrique and Jones, n.d.). The first wave was driven by ancestral SARS-CoV-2 lineages, while the third was dominated by Omicron subvariants. Omicron is a highly adapted lineage to humans, including hypervariation in the receptor-binding domain (RBD) in response to immunity.

We will evidence the RBD relevance by an entropy analysis, and will identify key mutations on it.

We begin by building separate datasets for ancestral and Omicron lineages.

firstWave <- wWater[ wWater$wave == "first", ]
thirdWave <- wWater[ wWater$wave == "third", ]

A 3% minor variant cutoff was already applied to these data by the variant caller. Following the same logic, here we will also filter out minor variants whose genotype matches that of the reference genome:

firstWave <- firstWave[ firstWave$alt_aa_freq <= 0.97, ]
thirdWave <- thirdWave[ thirdWave$alt_aa_freq <= 0.97, ]

Entropy signatures

Now we infer the ancestral and Omicron signatures and compare them graphically.

ancestral <- getEntropySignature(firstWave)
omicron <-  getEntropySignature(thirdWave)
# Compare signatures
anc_plot <- plot(ancestral) + ggtitle("Ancestral")
omi_plot <- plot(omicron) + ggtitle("Omicron")
anc_plot / omi_plot

We see that entropy at the S gene is relatively high in the Omicron sublineages.

Omicron sublineages in more detail

Now we look at how entropy is distributed along genome positions in the Omicron sublineages.

plot(omicron, chartType = "entroScan")

There are a handful of S positions that are responsible for the unusually high entropy.

omicron$Entropy$position[ omicron$Entropy$entropy > 0.3 ]
#> [1] 22882 22898 22917 23013 23040 23048 23055 23063

Let’s look at the corresponding mutations:

showMutations(omicron, c(22882, 22898, 22917, 23013, 23040, 23048, 23055, 23063))
#> |   snv   | phenotype | abundance (%) |
#> |:-------:|:---------:|:-------------:|
#> | T22882G |  S:N440K  |      59       |
#> | G22898A |  S:G446S  |      60       |
#> | T22917G |  S:L452R  |      39       |
#> | A23013C |  S:E484A  |      66       |
#> | A23040G |  S:Q493R  |      66       |
#> | G23048A |  S:G496S  |      66       |
#> | A23055G |  S:Q498R  |      66       |
#> | A23063T |  S:N501Y  |      66       |

The literature tells us that these SNVs affect the RBD (Lan et al. 2020).

Viral RBDs are targeted by the immune system because they are vital for cell binding and entry. There is solid observational and experimental evidence for S:N440K, S:G446S, S:L452R, S:E484A, S:Q493R, S:G496S, S:Q498R and S:N501Y (Guigon et al. 2022; Liang et al. 2022; Liu et al. 2021; Motozono et al. 2022; Rani et al. 2021; Starr, Greaney, Addetia, et al. 2021; Starr, Greaney, Dingens, et al. 2021; Weisblum et al. 2020; Zhang et al. 2022).

# Search PUBMED for more studies on these mutations:
library(rentrez)
#
# create search phrase
mutations <- showMutations(omicron, c(22882, 22898, 22917, 23013, 23040, 23048, 23055, 23063))
searchPhrase <- paste0(gsub("S:", "", mutations$phenotype, fixed = TRUE), "[TIAB]")
searchPhrase <- paste0("SARS-CoV-2[TITLE] AND spike [TIAB] AND (", paste0(searchPhrase, collapse = " OR "), ")")
#
# search
mySearch <- entrez_search(db = "pubmed", term = searchPhrase, retmax = 1000)
#
# Number of studies
mySearch$count
[1] 560
#

We can do a formal analysis by assessHotSpot(). The RBD is located between genomic positions 22517 and 23186.

assessHotSpot(omicron, c(22517, 23186))

#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  entropy by hot_spot
#> Kruskal-Wallis chi-squared = 8.9066, df = 1, p-value = 0.002841
assessHotSpot(ancestral, c(22517, 23186))

#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  entropy by hot_spot
#> Kruskal-Wallis chi-squared = 0.0055779, df = 1, p-value = 0.9405


Integrating Biological Signal via Entropy

Entropy analysis categorizes observations and takes into account the corresponding frequencies in natural populations. This adds significant information to sequence data.

summary(ancestral)
#> 
#> 'Ancestral' entropy signature.
#> 
#> Summary of entropy and mutations by protein:
#> 
#> | Feature | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | SNVs/Kb | non-Syn/Kb |
#> |:-------:|:----:|:-------:|:------:|:----:|:-------:|:----:|:-------:|:----------:|
#> | allCDS  | 0.00 |  0.00   |  0.16  | 0.15 |  0.23   | 0.36 |  2.66   |    2.35    |
#> |  nsp1   | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  1.86   |    1.86    |
#> |  nsp2   | 0.00 |  0.12   |  0.19  | 0.16 |  0.23   | 0.25 |  3.14   |    3.14    |
#> |  nsp3   | 0.00 |  0.03   |  0.19  | 0.14 |  0.22   | 0.26 |  2.40   |    2.06    |
#> |  nsp4   | 0.36 |  0.36   |  0.36  | 0.36 |  0.36   | 0.36 |  0.67   |    0.67    |
#> |  nsp6   | 0.36 |  0.36   |  0.36  | 0.36 |  0.36   | 0.36 |  1.15   |    1.15    |
#> |  nsp8   | 0.35 |  0.35   |  0.35  | 0.35 |  0.35   | 0.35 |  1.69   |    1.69    |
#> |  nsp9   | 0.20 |  0.20   |  0.20  | 0.20 |  0.20   | 0.20 |  2.97   |    2.97    |
#> | nsp12b  | 0.00 |  0.15   |  0.25  | 0.20 |  0.30   | 0.31 |  1.45   |    1.08    |
#> |  nsp13  | 0.00 |  0.00   |  0.05  | 0.08 |  0.11   | 0.29 |  3.33   |    2.78    |
#> |  nsp15  | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  0.97   |    0.97    |
#> |  nsp16  | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  1.12   |    0.00    |
#> |    S    | 0.14 |  0.17   |  0.20  | 0.22 |  0.26   | 0.36 |  2.36   |    2.36    |
#> |  ORF3a  | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  1.21   |    1.21    |
#> |    M    | 0.25 |  0.25   |  0.25  | 0.25 |  0.25   | 0.25 |  1.50   |    1.50    |
#> |  ORF6   | 0.14 |  0.15   |  0.15  | 0.15 |  0.16   | 0.16 |  10.87  |   10.87    |
#> |  ORF7a  | 0.12 |  0.13   |  0.13  | 0.13 |  0.13   | 0.13 |  5.49   |    5.49    |
#> |  ORF8   | 0.00 |  0.05   |  0.10  | 0.10 |  0.15   | 0.20 |  5.49   |    5.49    |
#> |    N    | 0.00 |  0.00   |  0.09  | 0.10 |  0.19   | 0.26 |  11.13  |    8.74    |

In the summary above, we observe that ORF6 and N proteins have high mutational rates, both overall (SNV/Kb) and non-synonymous (non-Syn/Kb). However, the corresponding entropies do not stand out among those of other proteins.

This contrasts with S, which, despite having average mutation rates, exhibits entropies biased towards high values.

This is even more evident for Omicron sublineages:

summary(omicron)
#> 
#> 'Omicron' entropy signature.
#> 
#> Summary of entropy and mutations by protein:
#> 
#> | Feature | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | SNVs/Kb | non-Syn/Kb |
#> |:-------:|:----:|:-------:|:------:|:----:|:-------:|:----:|:-------:|:----------:|
#> | allCDS  | 0.00 |  0.00   |  0.08  | 0.13 |  0.33   | 0.35 |  1.66   |    1.42    |
#> |  nsp2   | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  1.05   |    0.00    |
#> |  nsp3   | 0.00 |  0.04   |  0.07  | 0.07 |  0.11   | 0.15 |  0.34   |    0.17    |
#> |  nsp10  | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  2.41   |    2.41    |
#> |  nsp13  | 0.00 |  0.04   |  0.08  | 0.08 |  0.12   | 0.16 |  1.11   |    1.11    |
#> |  nsp14  | 0.00 |  0.00   |  0.00  | 0.00 |  0.00   | 0.00 |  1.27   |    1.27    |
#> |    S    | 0.00 |  0.00   |  0.33  | 0.20 |  0.33   | 0.35 |  3.93   |    3.66    |
#> |  ORF3a  | 0.08 |  0.08   |  0.08  | 0.08 |  0.08   | 0.08 |  2.42   |    2.42    |
#> |    M    | 0.00 |  0.07   |  0.14  | 0.14 |  0.21   | 0.28 |  3.00   |    3.00    |


References

Guigon, Aurélie, Emmanuel Faure, Chloé Lemaire, Marie-Charlotte Chopin, Claire Tinez, Ady Assaf, Mouna Lazrek, et al. 2022. “Emergence of Q493r Mutation in Sars-Cov-2 Spike Protein During Bamlanivimab/Etesevimab Treatment and Resistance to Viral Clearance.” Journal of Infection 84 (2): 248–88. https://doi.org/10.1016/j.jinf.2021.08.033.

Lan, Jun, Jiwan Ge, Jinfang Yu, Sisi Shan, Huan Zhou, Shilong Fan, Qi Zhang, et al. 2020. “Structure of the Sars-Cov-2 Spike Receptor-Binding Domain Bound to the Ace2 Receptor.” Nature 581 (7807): 215–20. https://doi.org/10.1038/s41586-020-2180-5.

Liang, Ronghui, Zi-Wei Ye, Chon Phin Ong, Zhenzhi Qin, Yubin Xie, Yilan Fan, Kaiming Tang, et al. 2022. “The Spike Receptor-Binding Motif G496s Substitution Determines the Replication Fitness of Sars-Cov-2 Omicron Sublineage.” Emerging Microbes & Infections 11 (1): 2093–2101. https://doi.org/10.1080/22221751.2022.2111977.

Liu, Yang, Jianying Liu, Kenneth S. Plante, Jessica A. Plante, Xuping Xie, Xianwen Zhang, Zhiqiang Ku, et al. 2021. “The N501y Spike Substitution Enhances Sars-Cov-2 Infection and Transmission.” Nature 602 (7896): 294–99. https://doi.org/10.1038/s41586-021-04245-0.

Manrique, Julieta Marina, and Leandro Roberto Jones. n.d. “A Cost-Effective Wastewater-Based Workflow for Genomic Analysis of Sars-Cov-2 Evolution.” Under Review 0 (0): 000–000.

Motozono, Chihiro, Mako Toyoda, Toong Seng Tan, Hiroshi Hamana, Yoshihiko Goto, Yoshiki Aritsu, Yusuke Miyashita, et al. 2022. “The Sars-Cov-2 Omicron Ba.1 Spike G446s Mutation Potentiates Antiviral T-Cell Recognition.” Nature Communications 13 (1): 5440. https://doi.org/10.1038/s41467-022-33068-4.

Rani, Pallavali R., Mohamed Imran, Juturu V. Lakshmi, Bani Jolly, Abhinav Jain, Avileli Surekha, Vigneshwar Senthivel, et al. 2021. “Symptomatic Reinfection of Sars‐CoV‐2 with Spike Protein Variant N440k Associated with Immune Escape.” Journal of Medical Virology 93 (7): 4163–5. https://doi.org/10.1002/jmv.26997.

Starr, Tyler N., Allison J. Greaney, Amin Addetia, William W. Hannon, Manish C. Choudhary, Adam S. Dingens, Jonathan Z. Li, and Jesse D. Bloom. 2021. “Prospective Mapping of Viral Mutations That Escape Antibodies Used to Treat Covid-19.” Science 371 (6531): 850–54. https://doi.org/10.1126/science.abf9302.

Starr, Tyler N., Allison J. Greaney, Adam S. Dingens, and Jesse D. Bloom. 2021. “Complete Map of Sars-Cov-2 Rbd Mutations That Escape the Monoclonal Antibody Ly-Cov555 and Its Cocktail with Ly-Cov016.” Cell Reports Medicine 2 (4): 100255. https://doi.org/10.1016/j.xcrm.2021.100255.

Weisblum, Yiska, Fabian Schmidt, Fengwen Zhang, Justin DaSilva, Daniel Poston, Julio CC Lorenzi, Frauke Muecksch, et al. 2020. “Escape from Neutralizing Antibodies by Sars-Cov-2 Spike Protein Variants.” eLife 9 (October): e61312. https://doi.org/10.7554/elife.61312.

Zhang, Yanan, Ting Zhang, Yihui Fang, Jie Liu, Qinong Ye, and Lihua Ding. 2022. “SARS-Cov-2 Spike L452r Mutation Increases Omicron Variant Fusogenicity and Infectivity as Well as Host Glycolysis.” Signal Transduction and Targeted Therapy 7 (1): 76. https://doi.org/10.1038/s41392-022-00941-z.