Help for package topolow

Title:

Force-Directed Euclidean Embedding of Dissimilarity Data

Version:

2.0.1

Maintainer:

Omid Arhami <omid.arhami@uga.edu>

Description:

A robust implementation of Topolow algorithm. It embeds objects into a low-dimensional Euclidean space from a matrix of pairwise dissimilarities, even when the data do not satisfy metric or Euclidean axioms. The package is particularly well-suited for sparse, incomplete, and censored (thresholded) datasets such as antigenic relationships. The core is a physics-inspired, gradient-free optimization framework that models objects as particles in a physical system, where observed dissimilarities define spring rest lengths and unobserved pairs exert repulsive forces. The package also provides functions specific to antigenic mapping to transform cross-reactivity and binding affinity measurements into accurate spatial representations in a phenotype space. Key features include: * Robust Embedding from Sparse Data: Effectively creates complete and consistent maps (in optimal dimensions) even with high proportions of missing data (e.g., >95%). * Physics-Inspired Optimization: Models objects (e.g., antigens, landmarks) as particles connected by springs (for measured dissimilarities) and subject to repulsive forces (for missing dissimilarities), and simulates the physical system using laws of mechanics, reducing the need for complex gradient computations. * Automatic Dimensionality Detection: Employs a likelihood-based approach to determine the optimal number of dimensions for the embedding/map, avoiding distortions common in methods with fixed low dimensions. * Noise and Bias Reduction: Naturally mitigates experimental noise and bias through its network-based, error-dampening mechanism. * Antigenic Velocity Calculation (for antigenic data): Introduces and quantifies "antigenic velocity," a vector that describes the rate and direction of antigenic drift for each pathogen isolate. This can help identify cluster transitions and potential lineage replacements. * Broad Applicability: Analyzes data from various objects that their dissimilarity may be of interest, ranging from complex biological measurements such as continuous and relational phenotypes, antibody-antigen interactions, and protein folding to abstract concepts, such as customer perception of different brands. Methods are described in the context of bioinformatics applications in Arhami and Rohani (2025a) <doi:10.1093/bioinformatics/btaf372>, and mathematical proofs and Euclidean embedding details are in Arhami and Rohani (2025b) <doi:10.48550/arXiv.2508.01733>.

License:

BSD_3_clause + file LICENSE

Encoding:

UTF-8

RoxygenNote:

7.3.2

Imports:

future, lifecycle, ggplot2 (≥ 3.4.0), dplyr (≥ 1.1.0), data.table (≥ 1.14.0), reshape2 (≥ 1.4.4), stats, utils, parallel (≥ 4.1.0), filelock, lhs, rlang,

Suggests:

coda (≥ 0.19-4), Rtsne, ape, Racmacs (≥ 1.1.2), vegan, umap, igraph, rgl (≥ 1.0.0), scales, ggrepel, plotly (≥ 4.10.0), gridExtra, covr, knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

URL:

https://github.com/omid-arhami/topolow

BugReports:

https://github.com/omid-arhami/topolow/issues

LazyData:

true

Depends:

R (≥ 4.1.0)

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2025-08-30 23:40:10 UTC; omidarhami

Author:

Omid Arhami

[aut, cre, cph]

Repository:

CRAN

Date/Publication:

2025-08-31 00:20:02 UTC

Algorithm Comparison Helper Functions

Description

Helper functions for running RACMACS and Topolow during algorithm comparisons.

The topolow package provides a robust implementation of the Topolow algorithm. It is designed to embed objects into a low-dimensional Euclidean space from a matrix of pairwise dissimilarities, even when the data do not satisfy metric or Euclidean axioms. The package is particularly well-suited for sparse or incomplete datasets and includes methods for handling censored (thresholded) data. The package provides tools for processing antigenic assay data, and visualizing antigenic maps.

Details

The core of the package is a physics-inspired, gradient-free optimization framework. It models objects as particles in a physical system, where observed dissimilarities define spring rest lengths and unobserved pairs exert repulsive forces. Key features include:

Quantitative reconstruction of metric space from non-metric data.
Robustness against local optima, especially for sparse data, due to a stochastic pairwise optimization scheme.
A statistically grounded approach based on maximizing the likelihood under a Laplace error model.
Tools for parameter optimization, cross-validation, and convergence diagnostics.
Support for parallel processing
Cross-validation and error analysis
A comprehensive suite of visualization functions for network analysis and results.
Processing and visualization of antigenic maps

Main Functions

Euclidify: Wizard function to run all steps of the Topolow algorithm automatically
euclidean_embedding: Core embedding algorithm
initial_parameter_optimization: Find optimal parameters using Latin Hypercube Sampling.
run_adaptive_sampling: Refine parameter estimates with adaptive Monte Carlo sampling.

Output Files

Functions that generate output files (like parameter optimization results) will create subdirectories in a user-specified directory (via output_dir parameter)

The following subdirectories may be created:

model_parameters/: Contains optimization results and parameter evaluations
init_param_optimization/: Contains files and outputs when using initial_parameter_optimization

Citation

If you use this package, please cite the Bioinformatics paper: Omid Arhami, Pejman Rohani, Topolow: A mapping algorithm for antigenic cross-reactivity and binding affinity assays, Bioinformatics, 2025;, btaf372, https://doi.org/10.1093/bioinformatics/btaf372 doi:10.1093/bioinformatics/btaf372.

bibtex entry: title={Topolow: a mapping algorithm for antigenic cross-reactivity and binding affinity assays}, author={Arhami, Omid and Rohani, Pejman}, journal={Bioinformatics}, volume={41}, number={7}, pages={btaf372}, year={2025}, issn = {1367-4811}, doi = {10.1093/bioinformatics/btaf372}, url = {https://doi.org/10.1093/bioinformatics/btaf372}, eprint = {https://academic.oup.com/bioinformatics/article-pdf/41/7/btaf372/63582086/btaf372.pdf}, publisher={Oxford University Press}

And/or the preprint on mathematical properties: Omid Arhami, Pejman Rohani, Topolow: Force-Directed Euclidean Embedding of Dissimilarity Data with Robustness Against Non-Metricity and Sparsity, arXiv:2508.01733, https://doi.org/10.48550/arXiv.2508.01733 doi:10.48550/arXiv.2508.01733.

bibtex entry: title={Topolow: Force-Directed Euclidean Embedding of Dissimilarity Data with Robustness Against Non-Metricity and Sparsity}, author={Arhami, Omid and Rohani, Pejman}, year={2025}, doi = {10.48550/arXiv.2508.01733}, url = {https://arxiv.org/abs/2508.01733}, publisher={arXiv}

Author(s)

Maintainer: Omid Arhami omid.arhami@uga.edu (ORCID) [copyright holder]

Automatic Euclidean Embedding with Parameter Optimization

Description

A user-friendly wrapper function that automatically optimizes parameters and performs Euclidean embedding on a dissimilarity matrix. This function handles the entire workflow from parameter optimization to final embedding.

Usage

Euclidify(
  dissimilarity_matrix,
  output_dir,
  ndim_range = c(2, 10),
  k0_range = c(0.1, 20),
  cooling_rate_range = c(1e-04, 0.1),
  c_repulsion_range = c(1e-04, 1),
  n_initial_samples = 50,
  n_adaptive_samples = 150,
  max_cores = NULL,
  folds = 20,
  mapping_max_iter = 500,
  clean_intermediate = TRUE,
  verbose = "standard",
  fallback_to_defaults = FALSE,
  save_results = FALSE
)

Arguments

dissimilarity_matrix

Square symmetric dissimilarity matrix. Can contain NA values for missing measurements and threshold indicators (< or >).

output_dir

Character. Directory for saving optimization files and results. Required - no default.

ndim_range

Integer vector of length 2. Range for number of dimensions (minimum, maximum). Default: c(2, 10)

k0_range

Numeric vector of length 2. Range for initial spring constant (minimum, maximum). Default: c(0.1, 15)

cooling_rate_range

Numeric vector of length 2. Range for cooling rate (minimum, maximum). Default: c(0.001, 0.07)

c_repulsion_range

Numeric vector of length 2. Range for repulsion constant (minimum, maximum). Default: c(0.001, 0.4)

n_initial_samples

Integer. Number of samples for initial parameter optimization. Default: 100

n_adaptive_samples

Integer. Number of samples for adaptive refinement. Default: 250

max_cores

Integer. Maximum number of cores to use. Default: NULL (auto-detect)

folds

Integer. Number of cross-validation folds. Default: 20

mapping_max_iter

Integer. Maximum iterations for final embedding. Half this value is used for parameter search. Default: 1000

clean_intermediate

Logical. Whether to remove intermediate files. Default: TRUE

verbose

Character. Verbosity level: "off" (no output), "standard" (progress updates), or "full" (detailed output including from internal functions). Default: "standard"

fallback_to_defaults

Logical. Whether to use default parameters if optimization fails. Default: TRUE

save_results

Logical. Whether to save the final positions as CSV. Default: FALSE

Value

A list containing:

positions

Matrix of optimized coordinates

est_distances

Matrix of estimated distances

mae

Mean absolute error

optimal_params

List of optimal parameters found, including cross-validation MAE during optimization

optimization_summary

Summary of the optimization process

data_characteristics

Summary of input data characteristics

runtime

Total runtime in seconds

Examples

# Example 1: Basic usage with small matrix
test_data <- data.frame(
object = rep(paste0("Obj", 1:4), each = 4),
reference = rep(paste0("Ref", 1:4), 4),
score = sample(c(1, 2, 4, 8, 16, 32, 64, "<1", ">12"), 16, replace = TRUE)
)
dist_mat <- list_to_matrix(
  data = test_data,  # Pass the data frame, not file path
  object_col = "object",
  reference_col = "reference",
  value_col = "score",
  is_similarity = TRUE
)
## Not run: 
# Note: output_dir is required for actual use
result <- Euclidify(
  dissimilarity_matrix = dist_mat,
  output_dir = tempdir()  # Use temp directory for example
)
coordinates <- result$positions

## End(Not run)

# Example 2: Using custom parameter ranges
## Not run: 
result <- Euclidify(
  dissimilarity_matrix = dist_mat,
  output_dir = tempdir(),
  n_initial_samples = 10,
  n_adaptive_samples = 7,
  verbose = "off"
)

## End(Not run)

# Example 3: Handling missing data
dist_mat_missing <- dist_mat
dist_mat_missing[1, 3] <- dist_mat_missing[3, 1] <- NA
## Not run: 
result <- Euclidify(
  dissimilarity_matrix = dist_mat_missing,
  output_dir = tempdir(),
  n_initial_samples = 10,
  n_adaptive_samples = 7,
  verbose = "off"
)

## End(Not run)

# Example 4: Using threshold indicators
dist_mat_threshold <- dist_mat
dist_mat_threshold[1, 2] <- ">2"
dist_mat_threshold[2, 1] <- ">2"
## Not run: 
result <- Euclidify(
  dissimilarity_matrix = dist_mat_threshold,
  output_dir = tempdir(),
  n_initial_samples = 10,
  n_adaptive_samples = 7,
  verbose = "off"
)

## End(Not run)

# Example 5: Parallel processing with custom cores
## Not run: 
result <- Euclidify(
  dissimilarity_matrix = dist_mat,
  output_dir = tempdir(),
  max_cores = 4,
  n_adaptive_samples = 100,
  save_results = TRUE  # Save positions to CSV
)

## End(Not run)

Perform Adaptive Monte Carlo Sampling (Internal)

Description

Core implementation of the adaptive Monte Carlo sampling algorithm. This internal function explores the parameter space by updating the sampling distribution based on evaluated likelihoods. It is called by the main run_adaptive_sampling function.

Usage

adaptive_MC_sampling(
  samples_file,
  dissimilarity_matrix,
  iterations = 1,
  mapping_max_iter,
  relative_epsilon,
  folds = 20,
  num_cores = 1,
  scenario_name,
  verbose = FALSE
)

Arguments

samples_file

Path to the CSV file with samples for the current job.

dissimilarity_matrix

The dissimilarity matrix to be fitted.

iterations

Number of sampling iterations per job.

mapping_max_iter

Maximum optimization iterations for the embedding.

relative_epsilon

Convergence threshold for the optimization.

folds

Number of cross-validation folds.

num_cores

Number of cores for parallel processing.

scenario_name

Name for output files, used for context.

verbose

Logical. If TRUE, prints progress messages.

Value

A data.frame containing all samples (initial and newly generated) with their parameters and evaluated performance metrics. The data frame includes columns for the log-transformed parameters, Holdout_MAE, and NLL. Returns NULL if the results file was not created.

Analyze Network Structure

Description

Analyzes the connectivity of a dissimilarity matrix, returning node degrees and overall completeness.

Usage

analyze_network_structure(dissimilarity_matrix)

Arguments

dissimilarity_matrix

Square symmetric matrix of dissimilarities.

Value

A list containing the network analysis results:

adjacency

A logical matrix where TRUE indicates a measured dissimilarity.

connectivity

A data.frame with node-level metrics, including the completeness (degree) for each point.

summary

A list of overall network statistics, including n_points, n_measurements, and total completeness.

Examples

# Create a sample dissimilarity matrix
dist_mat <- matrix(runif(25), 5, 5)
rownames(dist_mat) <- colnames(dist_mat) <- paste0("Point", 1:5)
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
diag(dist_mat) <- 0
dist_mat[1, 3] <- NA; dist_mat[3, 1] <- NA

# Analyze the network structure
metrics <- analyze_network_structure(dist_mat)
print(metrics$summary$completeness)

Calculate MCMC-style Diagnostics for Sampling Chains

Description

Calculates standard MCMC-style convergence diagnostics for multiple chains from an optimization or sampling run. It computes the R-hat (potential scale reduction factor) and effective sample size (ESS) to help assess if the chains have converged to a stable distribution.

Usage

calculate_diagnostics(chain_files, mutual_size = 500)

Arguments

chain_files

Character vector. Paths to CSV files, where each file represents a chain of samples.

mutual_size

Integer. Number of samples to use from the end of each chain for calculations.

Value

A list object of class topolow_diagnostics containing convergence diagnostics for the MCMC chains.

rhat

A numeric vector of the R-hat (potential scale reduction factor) statistic for each parameter. Values close to 1 indicate convergence.

ess

A numeric vector of the effective sample size for each parameter.

chains

A list of data frames, where each data frame is a cleaned and trimmed MCMC chain.

param_names

A character vector of the parameter names being analyzed.

mutual_size

The integer number of samples used from the end of each chain for calculations.

Examples

# This example demonstrates how to use the function with temporary files.
# Create dummy chain files in a temporary directory
temp_dir <- tempdir()
chain_files <- character(3)
par_names <- c("log_N", "log_k0", "log_cooling_rate", "log_c_repulsion")
sample_data <- data.frame(
  log_N = rnorm(100), log_k0 = rnorm(100),
  log_cooling_rate = rnorm(100), log_c_repulsion = rnorm(100),
  NLL = runif(100), Holdout_MAE = runif(100)
)
for (i in 1:3) {
  chain_files[i] <- file.path(temp_dir, paste0("chain", i, ".csv"))
  write.csv(sample_data, chain_files[i], row.names = FALSE)
}

# Calculate diagnostics
diag_results <- calculate_diagnostics(chain_files, mutual_size = 50)
print(diag_results)

# Clean up the temporary files and directory
unlink(chain_files)
unlink(temp_dir, recursive = TRUE)

Calculate Prediction Interval for Dissimilarity Estimates

Description

Computes prediction intervals for the estimated dissimilarities based on residual variation between true and predicted values.

Usage

calculate_prediction_interval(
  dissimilarity_matrix,
  predicted_dissimilarity_matrix,
  confidence_level = 0.95
)

Arguments

dissimilarity_matrix

Matrix of true dissimilarities.

predicted_dissimilarity_matrix

Matrix of predicted dissimilarities.

confidence_level

The confidence level for the interval (default: 0.95).

Value

A single numeric value representing the margin of error for the prediction interval.

Calculate Weighted Marginal Distributions

Description

Calculates the marginal probability distribution for each model parameter. The distributions are weighted by the likelihood of each sample, making this useful for identifying the most probable parameter values from a set of Monte Carlo samples.

Usage

calculate_weighted_marginals(samples)

Arguments

samples

A data frame containing parameter samples (e.g., log_N, log_k0) and a negative log-likelihood column named NLL.

Details

This function uses the weighted_kde helper to perform kernel density estimation for each parameter, with weights derived from the normalized likelihoods of the samples.

Value

A named list where each element is a density object (a list with x and y components) corresponding to a model parameter.

x

Vector of parameter values

y

Vector of density estimates

Model Diagnostics and Convergence Testing Check Multivariate Gaussian Convergence

Description

Assesses the convergence of multivariate samples by monitoring the stability of the mean vector and covariance matrix over a sliding window. This is useful for checking if a set of parameter samples has stabilized.

Usage

check_gaussian_convergence(data, window_size = 300, tolerance = 0.01)

Arguments

data

Matrix or Data Frame. A matrix of samples where columns are parameters.

window_size

Integer. The size of the sliding window used to compute statistics.

tolerance

Numeric. The convergence threshold for the relative change in the mean and covariance.

Value

An object of class topolow_convergence containing diagnostics about the convergence of the multivariate samples. This list includes logical flags for convergence (converged, mean_converged, cov_converged) and the history of the mean and covariance changes.

Examples

# Create sample data for the example
chain_data <- as.data.frame(matrix(rnorm(500 * 4), ncol = 4))
colnames(chain_data) <- c("param1", "param2", "param3", "param4")

# Run the convergence check
conv_results <- check_gaussian_convergence(chain_data)
print(conv_results)

# The plot method for this object can be used to create convergence plots.
# plot(conv_results)

Clean Data by Removing MAD-based Outliers

Description

Removes outliers from numeric data using the Median Absolute Deviation (MAD) method. Outliers are replaced with NA values.

Usage

clean_data(x, k = 3, take_log = FALSE)

Arguments

x

Numeric vector to clean.

k

Numeric threshold for outlier detection (default: 3).

take_log

Logical. Deprecated parameter. Log transformation should be done before calling this function.

Value

A numeric vector of the same length as x, where detected outliers have been replaced with NA.

Examples

# Clean parameter values
params <- c(0.01, 0.012, 0.011, 0.1, 0.009, 0.011, 0.15)
clean_params <- clean_data(params)

Color Palettes

Description

Predefined color palettes optimized for visualization.

Usage

c25

Format

An object of class character of length 20.

Convert Coordinates to a Distance Matrix

Description

Calculates pairwise Euclidean distances between points in a coordinate space.

Usage

coordinates_to_matrix(positions)

Arguments

positions

Matrix or Data Frame of coordinates where rows are points and columns are dimensions.

Value

A symmetric matrix of pairwise Euclidean distances between points.

Create Base Theme

Description

Creates a ggplot2 theme based on aesthetic and layout configurations.

Usage

create_base_theme(aesthetic_config, layout_config)

Arguments

aesthetic_config

Aesthetic configuration object

layout_config

Layout configuration object

Value

ggplot2 theme object

Create Cross-Validation Folds for a Dissimilarity Matrix

Description

Creates k-fold cross-validation splits from a dissimilarity matrix while maintaining symmetry. Each fold in the output consists of a training matrix (with some values masked as NA) and a corresponding ground truth matrix for validation.

Usage

create_cv_folds(
  dissimilarity_matrix,
  ground_truth_matrix = NULL,
  n_folds = 10,
  random_seed = NULL
)

Arguments

dissimilarity_matrix

The input dissimilarity matrix, which may contain noise.

ground_truth_matrix

An optional, noise-free dissimilarity matrix to be used as the ground truth for evaluation. If NULL, the input dissimilarity_matrix is used as the truth.

n_folds

The integer number of folds to create.

random_seed

An optional integer to set the random seed for reproducibility.

Value

A list of length n_folds. Each element of the list is itself a list containing two matrices: truth (the ground truth for that fold) and train (the training matrix with NA values for validation).

Note

This function has breaking changes from previous versions:

Parameter truth_matrix renamed to dissimilarity_matrix
Parameter no_noise_truth renamed to ground_truth_matrix
Return structure now uses named elements ($truth, $train)

Examples

# Create a sample dissimilarity matrix
d_mat <- matrix(runif(100), 10, 10)
diag(d_mat) <- 0

# Create 5-fold cross-validation splits
folds <- create_cv_folds(d_mat, n_folds = 5, random_seed = 123)

Create Diagnostic Plots for Multiple Sampling Chains

Description

Creates trace and density plots for multiple sampling or optimization chains to help assess convergence and mixing. It displays parameter trajectories and their distributions across all chains.

Usage

create_diagnostic_plots(
  chain_files,
  mutual_size = 2000,
  output_file = "diagnostic_plots.png",
  output_dir,
  save_plot = FALSE,
  width = 3000,
  height = 3000,
  res = 300
)

Arguments

chain_files

A character vector of paths to CSV files, where each file contains data for one chain.

mutual_size

Integer. The number of samples to use from the end of each chain for plotting.

output_file

Character. The path for saving the plot. Required if save_plot is TRUE.

output_dir

Character. The directory for saving output files. Required if save_plot is TRUE.

save_plot

Logical. If TRUE, saves the plot to a file. Default: FALSE.

width, height, res

Numeric. The dimensions and resolution for the saved plot.

Value

A ggplot object of the combined plots.

Examples

# This example uses sample data files that would be included with the package.
chain_files <- c(
  system.file("extdata", "diag_chain1.csv", package = "topolow"),
  system.file("extdata", "diag_chain2.csv", package = "topolow"),
  system.file("extdata", "diag_chain3.csv", package = "topolow")
)

# Only run the example if the files are found
if (all(nzchar(chain_files))) {
  # Create diagnostic plot without saving to a file
  create_diagnostic_plots(chain_files, mutual_size = 50, save_plot = FALSE)
}

Main TopoLow algorithm implementation (DEPRECATED)

Description

create_topolow_map() was deprecated in version 2.0.0 and will be removed in a future version. Please use euclidean_embedding() instead, which provides the same functionality with improved performance and additional features.

Usage

create_topolow_map(
  distance_matrix,
  ndim,
  mapping_max_iter = 1000,
  k0,
  cooling_rate,
  c_repulsion,
  relative_epsilon = 1e-04,
  convergence_counter = 3,
  initial_positions = NULL,
  write_positions_to_csv = FALSE,
  output_dir,
  verbose = FALSE
)

Arguments

distance_matrix

Matrix. Square, symmetric distance matrix. Can contain NA values for missing measurements and character strings with < or > prefixes for thresholded measurements.

ndim

Integer. Number of dimensions for the embedding space.

mapping_max_iter

Integer. Maximum number of map optimization iterations.

k0

Numeric. Initial spring constant controlling spring forces.

cooling_rate

Numeric. Rate of spring constant decay per iteration (0 < cooling_rate < 1).

c_repulsion

Numeric. Repulsion constant controlling repulsive forces.

relative_epsilon

Numeric. Convergence threshold for relative change in error. Default is 1e-4.

convergence_counter

Integer. Number of iterations below threshold before declaring convergence. Default is 5.

initial_positions

Matrix or NULL. Optional starting coordinates. If NULL, random initialization is used. Matrix should have nrow = nrow(distance_matrix) and ncol = ndim.

write_positions_to_csv

Logical. Whether to save point positions to CSV file. Default is FALSE.

output_dir

Character. Directory to save CSV file. Required if write_positions_to_csv is TRUE.

verbose

Logical. Whether to print progress messages. Default is FALSE.

Details

This function has been superseded by euclidean_embedding(), which offers:

Enhanced matrix reordering for better optimization
Improved parameter validation with informative warnings
Consistent naming convention (dissimilarity vs distance)
Better documentation and examples

The core algorithm remains identical, ensuring your results will be equivalent. The main changes are:

Parameter name: distance_matrix –> dissimilarity_matrix
Function name: create_topolow_map() –> euclidean_embedding()

Value

A list object of class topolow. This list contains the results of the optimization and includes the following components:

positions: A matrix of the optimized point coordinates in the n-dimensional space.
est_distances: A matrix of the Euclidean distances between points in the final optimized configuration.
mae: The final Mean Absolute Error between the target distances and the estimated distances.
iter: The total number of iterations performed before the algorithm terminated.
parameters: A list containing the input parameters used for the optimization run.
convergence: A list containing the final convergence status, including a logical achieved flag and the final error value.

Examples


# Simple example (deprecated - use euclidean_embedding() instead)
dist_mat <- matrix(c(0, 2, 3, 2, 0, 4, 3, 4, 0), nrow=3)

# This will generate a deprecation warning
result <- create_topolow_map(
  dist_mat, 
  ndim = 2, 
  mapping_max_iter = 100,
  k0 = 1.0, 
  cooling_rate = 0.001, 
  c_repulsion = 0.01, 
  verbose = FALSE
)

# Recommended approach with new function:
result_new <- euclidean_embedding(
  dissimilarity_matrix = dist_mat,
  ndim = 2,
  mapping_max_iter = 100,
  k0 = 1.0,
  cooling_rate = 0.001,
  c_repulsion = 0.01,
  verbose = FALSE
)

Dengue Virus (DENV) Titer Data

Description

A dataset containing neutralization titer data for Dengue virus. This data can be used to create antigenic maps and explore the antigenic relationships between different DENV strains.

Usage

denv_data

Format

A data frame with the following columns:

virus_strain: Character, the name of the virus strain.
serum_strain: Character, the name of the serum strain.
titer: Character, the neutralization titer value. May include values like '<10' or '>1280'.
virusYear: Numeric, the year the virus was isolated.
serumYear: Numeric, the year the serum was collected.
cluster: Factor, the cluster or serotype assignment for the strains.
color: Character, a color associated with the cluster for plotting.

Source

Katzelnick, L.C., et al. (2019). An antigenically diverse, representative panel of dengue viruses for neutralizing antibody discovery and vaccine evaluation. eLife. doi:10.7554/eLife.42496

Detect Outliers Using Median Absolute Deviation

Description

Detects outliers in numeric data using the Median Absolute Deviation (MAD) method. This robust method is less sensitive to extreme values than standard deviation and works well for non-normally distributed data.

Usage

detect_outliers_mad(data, k = 3)

Arguments

data

Numeric vector of values to analyze

k

Numeric threshold for outlier detection (default: 3).

Details

The function calculates the median and MAD of the data and identifies points that are more than k MADs from the median as outliers.

Value

A list containing:

outlier_mask

Logical vector indicating outliers

stats

List containing:

median: Median of data
mad: Median absolute deviation
n_outliers: Number of outliers detected

#' @importFrom stats median mad

Error calculation and validation metrics for topolow Calculate Comprehensive Error Metrics

Description

Computes a comprehensive set of error metrics (in-sample, out-of-sample, completeness) between predicted and true dissimilarities for model evaluation.

Usage

error_calculator_comparison(
  predicted_dissimilarities,
  true_dissimilarities,
  input_dissimilarities = NULL
)

Arguments

predicted_dissimilarities

Matrix of predicted dissimilarities from the model.

true_dissimilarities

Matrix of true, ground-truth dissimilarities.

input_dissimilarities

Matrix of input dissimilarities, which may contain NAs and is used to identify the pattern of missing values for out-of-sample error calculation. Optional - if not provided, defaults to true_dissimilarities (no holdout set).

Details

Input requirements and constraints:

All input matrices must have matching dimensions.
Row and column names must be consistent across matrices.
NAs are allowed and handled appropriately.
Threshold indicators (< or >) in the input matrix are processed correctly.

When input_dissimilarities is provided, it represents the training data where some values have been set to NA to create a holdout set. This allows calculation of:

In-sample errors: for data available during training
Out-of-sample errors: for data held out during training

When input_dissimilarities is NULL (default), all errors are treated as in-sample since no data was held out.

Value

A list containing:

report_df

A data.frame with detailed error metrics for each point-pair, including InSampleError, OutSampleError, and their percentage-based counterparts.

Completeness

A single numeric value representing the completeness statistic, which is the fraction of validation points for which a prediction could be made.

Examples

# Example 1: Normal evaluation (no cross-validation)
true_mat <- matrix(c(0, 1, 2, 1, 0, 3, 2, 3, 0), 3, 3)
pred_mat <- true_mat + rnorm(9, 0, 0.1)  # Add some noise

# Evaluate all predictions (input_dissimilarities defaults to true_dissimilarities)
errors1 <- error_calculator_comparison(pred_mat, true_mat)

# Example 2: Cross-validation evaluation
input_mat <- true_mat
input_mat[1, 3] <- input_mat[3, 1] <- NA  # Create holdout set

# Evaluate with train/test split
errors2 <- error_calculator_comparison(pred_mat, true_mat, input_mat)

Main topolow algorithm implementation

Description

topolow (topological stochastic pairwise reconstruction for Euclidean embedding) optimizes point positions in an N-dimensional space to match a target dissimilarity matrix. The algorithm uses a physics-inspired approach with spring and repulsive forces to find optimal point configurations while handling missing and thresholded measurements.

Usage

euclidean_embedding(
  dissimilarity_matrix,
  ndim,
  mapping_max_iter = 1000,
  k0,
  cooling_rate,
  c_repulsion,
  relative_epsilon = 1e-04,
  convergence_counter = 5,
  initial_positions = NULL,
  write_positions_to_csv = FALSE,
  output_dir,
  verbose = FALSE
)

Arguments

dissimilarity_matrix

Matrix. A square, symmetric dissimilarity matrix. Can contain NA values for missing measurements and character strings with < or > prefixes for thresholded measurements.

ndim

Integer. Number of dimensions for the embedding space.

mapping_max_iter

Integer. Maximum number of map optimization iterations.

k0

Numeric. Initial spring constant controlling spring forces.

cooling_rate

Numeric. Rate of spring constant decay per iteration (0 < cooling_rate < 1).

c_repulsion

Numeric. Repulsion constant controlling repulsive forces.

relative_epsilon

Numeric. Convergence threshold for relative change in error. Default is 1e-4.

convergence_counter

Integer. Number of iterations below threshold before declaring convergence. Default is 5.

initial_positions

Matrix or NULL. Optional starting coordinates. If NULL, random initialization is used. Matrix should have nrow = nrow(dissimilarity_matrix) and ncol = ndim.

write_positions_to_csv

Logical. Whether to save point positions to a CSV file. Default is FALSE.

output_dir

Character. Directory to save the CSV file. Required if write_positions_to_csv is TRUE.

verbose

Logical. Whether to print progress messages. Default is FALSE.

Details

The algorithm iteratively updates point positions using:

Spring forces between points with measured dissimilarities.
Repulsive forces between points without measurements.
Conditional forces for thresholded measurements (< or >).
An adaptive spring constant that decays over iterations.
Convergence monitoring based on relative error change.
Automatic matrix reordering to optimize convergence. Consider if downstream analyses depend on specific point ordering: The order of points in the output is adjusted to put high-dissimilarity points in the opposing ends.

This function replaces the deprecated create_topolow_map(). The core algorithm is identical, but includes performance improvements and enhanced validation.

Value

A list object of class topolow. This list contains the results of the optimization and includes the following components:

positions: A matrix of the optimized point coordinates in the N-dimensional space.
est_distances: A matrix of the Euclidean distances between points in the final optimized configuration.
mae: The final Mean Absolute Error between the target dissimilarities and the estimated distances.
iter: The total number of iterations performed before the algorithm terminated.
parameters: A list containing the input parameters used for the optimization run.
convergence: A list containing the final convergence status, including a logical achieved flag and the final error value.

Examples

# Create a simple dissimilarity matrix
dist_mat <- matrix(c(0, 2, 3, 2, 0, 4, 3, 4, 0), nrow=3)

# Run topolow in 2D
result <- euclidean_embedding(
  dissimilarity_matrix = dist_mat,
  ndim = 2,
  mapping_max_iter = 100,
  k0 = 1.0,
  cooling_rate = 0.001,
  c_repulsion = 0.01,
  verbose = FALSE
)

# View results
head(result$positions)
print(result$mae)

# Example with thresholded measurements
thresh_mat <- matrix(c(0, ">2", 3, ">2", 0, "<5", 3, "<5", 0), nrow=3)
result_thresh <- euclidean_embedding(
  dissimilarity_matrix = thresh_mat,
  ndim = 2,
  mapping_max_iter = 50,
  k0 = 0.5,
  cooling_rate = 0.01,
  c_repulsion = 0.001
)

Example Antigenic Mapping Data

Description

HI titers of Influenza antigens and antisera published in Smith et al., 2004 were used to find the antigenic relationships and coordinates of the antigens. It can be used for mapping. The data captures how different influenza virus strains (antigens) react with antisera from infected individuals.

Usage

example_positions

Format

A data frame with 285 rows and 11 variables:

V1: First dimension coordinate from 5D mapping
V2: Second dimension coordinate from 5D mapping
V3: Third dimension coordinate from 5D mapping
V4: Fourth dimension coordinate from 5D mapping
V5: Fifth dimension coordinate from 5D mapping
name: Strain identifier
antigen: Logical; TRUE if point represents an antigen
antiserum: Logical; TRUE if point represents an antiserum
cluster: Factor indicating antigenic cluster assignment (A/H3N2 1968-2003)
color: Color assignment for visualization
year: Year of strain isolation

Source

Smith et al., 2004

Utility functions for the topolow package Extract Numeric Values from Mixed Data

Description

Extracts numeric values from data that may contain threshold indicators (e.g., "<10", ">1280") or regular numeric values.

Usage

extract_numeric_values(x)

Arguments

x

A vector that may contain numeric values, character strings with threshold indicators, or a mix of both.

Value

A numeric vector with threshold indicators converted to their numeric equivalents.

Examples

# Mixed data with threshold indicators
mixed_data <- c(10, 20, "<5", ">100", 50)
extract_numeric_values(mixed_data)

Generate New Parameter Samples Using KDE

Description

Generates new parameter samples using weighted kernel density estimation for each parameter independently. This is an internal helper function for the adaptive sampling process.

Usage

generate_kde_samples(samples, n, epsilon = 0)

Arguments

samples

A data frame of previous samples containing parameter columns and an "NLL" column.

n

The integer number of new samples to generate.

epsilon

A numeric probability (0-1) of sampling with a wider bandwidth to encourage exploration. Default is 0.

Value

A data frame containing n new parameter samples.

Create Grid Around Maximum Likelihood Estimate (Internal)

Description

Internal helper to generate a sequence of values for a parameter. The grid is centered on the parameter's Maximum Likelihood Estimate (MLE), which is found by calculating the mode of its weighted marginal distribution.

Usage

get_grid(samples, param, num_points, start_factor, end_factor)

Arguments

samples

Data frame of parameter samples with an NLL column.

param

Character name of the parameter column.

num_points

Integer number of points for the grid.

start_factor

Numeric factor for grid's lower boundary relative to MLE.

end_factor

Numeric factor for grid's upper boundary relative to MLE.

Value

A numeric vector of grid points.

Save ggplot with white background

Description

Wrapper around ggplot2::ggsave that ensures a white background by default.

Usage

ggsave_white_bg(..., bg = "white")

Arguments

...

Other arguments passed on to the graphics device function, as specified by device.

bg

Background colour. If NULL, uses the plot.background fill value from the plot theme.

Value

No return value, called for side effects.

H3N2 Influenza HI Assay Data from Smith et al. 2004

Description

Hemagglutination inhibition (HI) assay data for influenza A/H3N2 viruses spanning 35 years of evolution.

Usage

h3n2_data

Format

A data frame with the following variables:

virusStrain: Character. Virus strain identifier
serumStrain: Character. Antiserum strain identifier
titer: Numeric. HI assay titer value
virusYear: Numeric. Year virus was isolated
serumYear: Numeric. Year serum was collected
cluster: Factor. Antigenic cluster assignment
color: Character. Color code for visualization

Source

Smith et al. (2004) Science, 305(5682), 371-376.

HIV Neutralization Assay Data

Description

IC50 neutralization measurements between HIV viruses and antibodies.

Usage

hiv_titers

Format

A data frame with the following variables:

Antibody: Character. Antibody identifier
Virus: Character. Virus strain identifier
IC50: Numeric. IC50 neutralization value

Source

Los Alamos HIV Database (https://www.hiv.lanl.gov/)

HIV Virus Metadata

Description

Reference information for HIV virus strains used in neutralization assays.

Usage

hiv_viruses

Format

A data frame with the following variables:

Virus.name: Character. Virus strain identifier
Country: Character. Country of origin
Subtype: Character. HIV subtype
Year: Numeric. Year of isolation

Source

Los Alamos HIV Database (https://www.hiv.lanl.gov/)

Parameter Space Sampling and Optimization Functions for topolow

Description

Performs parameter optimization using Latin Hypercube Sampling (LHS) combined with k-fold cross-validation. Parameters are sampled from specified ranges using maximin LHS design to ensure good coverage of parameter space. Each parameter set is evaluated using k-fold cross-validation to assess prediction accuracy. To calculate one NLL per set of parameters, the function uses a pooled errors approach which combine all validation errors into one set, then calculate a single NLL. This approach has two main advantages: 1- It treats all validation errors equally, respecting the underlying error distribution assumption 2- It properly accounts for the total number of validation points

Note: As of version 2.0.0, this function returns log-transformed parameters directly, eliminating the need to call log_transform_parameters() separately.

Usage

initial_parameter_optimization(
  dissimilarity_matrix,
  mapping_max_iter = 1000,
  relative_epsilon,
  convergence_counter,
  scenario_name,
  N_min,
  N_max,
  k0_min,
  k0_max,
  c_repulsion_min,
  c_repulsion_max,
  cooling_rate_min,
  cooling_rate_max,
  num_samples = 20,
  max_cores = NULL,
  folds = 20,
  verbose = FALSE,
  write_files = FALSE,
  output_dir
)

Arguments

dissimilarity_matrix

Matrix. Input dissimilarity matrix. Must be square and symmetric.

mapping_max_iter

Integer. Maximum number of optimization iterations for each map.

relative_epsilon

Numeric. Convergence threshold for relative change in error.

convergence_counter

Integer. Number of iterations below threshold before declaring convergence.

scenario_name

Character. Name for output files and job identification.

N_min, N_max

Integer. Range for the number of dimensions parameter.

k0_min, k0_max

Numeric. Range for the initial spring constant parameter.

c_repulsion_min, c_repulsion_max

Numeric. Range for the repulsion constant parameter.

cooling_rate_min, cooling_rate_max

Numeric. Range for the cooling rate parameter.

num_samples

Integer. Number of LHS samples to generate. Default: 20.

max_cores

Integer. Maximum number of cores for parallel processing. Default: NULL (uses all but one).

folds

Integer. Number of cross-validation folds. Default: 20.

verbose

Logical. Whether to print progress messages. Default: FALSE.

write_files

Logical. Whether to save results to a CSV file. Default: FALSE.

output_dir

Character. Directory for output files. Required if write_files is TRUE.

Details

Initial Parameter Optimization using Latin Hypercube Sampling

The function performs these steps:

Generates LHS samples in the parameter space (original scale for sampling).
Creates k-fold splits of the input data.
For each parameter set, it trains the model on each fold's training set and evaluates on the validation set, calculating a pooled MAE and NLL across all folds.
Computations are run locally in parallel.
NEW: Automatically log-transforms the final results for direct use with adaptive sampling.

Value

A data.frame containing the log-transformed parameter sets and their performance metrics. Columns include: log_N, log_k0, log_cooling_rate, log_c_repulsion, Holdout_MAE, and NLL.

Note

Breaking Change in v2.0.0: This function now returns log-transformed parameters directly. The returned data frame has columns log_N, log_k0, log_cooling_rate, log_c_repulsion instead of the original scale parameters. This eliminates the need to call log_transform_parameters() separately before using run_adaptive_sampling().

Breaking Change in v2.0.0: The parameter distance_matrix has been renamed to dissimilarity_matrix. Please update your code accordingly.

Examples


# This example can exceed 5 seconds on some systems.
# 1. Create a simple synthetic dataset for the example
synth_coords <- matrix(rnorm(60), nrow = 20, ncol = 3)
dist_mat <- coordinates_to_matrix(synth_coords)

# 2. Run the optimization on the synthetic data
results <- initial_parameter_optimization(
  dissimilarity_matrix = dist_mat,
  mapping_max_iter = 100,
  relative_epsilon = 1e-3,
  convergence_counter = 2,
  scenario_name = "test_opt_synthetic",
  N_min = 2, N_max = 5,
  k0_min = 1, k0_max = 10,
  c_repulsion_min = 0.001, c_repulsion_max = 0.05,
  cooling_rate_min = 0.001, cooling_rate_max = 0.02,
  num_samples = 4,
  max_cores = 1,  # Avoid parallel processing in check environment
  verbose = FALSE
)

Evaluate a Parameter Set with Cross-Validation

Description

This internal function calculates the cross-validated likelihood for a given set of parameters. It splits the data into training and validation sets across multiple folds, fits the topolow model on each training set, and evaluates the error on the corresponding validation set.

Usage

likelihood_function(
  dissimilarity_matrix,
  mapping_max_iter,
  relative_epsilon,
  N,
  k0,
  cooling_rate,
  c_repulsion,
  folds = 20,
  num_cores = 1
)

Arguments

dissimilarity_matrix

The input dissimilarity matrix to fit.

mapping_max_iter

The maximum number of optimization iterations.

relative_epsilon

The convergence threshold for optimization.

N

The number of dimensions for the embedding.

k0

The initial spring constant.

cooling_rate

The spring constant decay rate.

c_repulsion

The repulsion constant.

folds

The number of cross-validation folds.

num_cores

The number of cores for parallel processing.

Details

To calculate a single Negative Log-Likelihood (NLL) value per parameter set, the function uses a "pooled errors" approach. It combines all out-of-sample errors from every fold into a single set before calculating the NLL and the overall Mean Absolute Error (MAE). This method respects the underlying error distribution and correctly accounts for the total number of validation points.

Value

A list containing the pooled Holdout_MAE and the NLL.

topolow Data Preprocessing Functions

Description

Converts data from long/list format (one measurement per row) to a symmetric dissimilarity matrix. The function handles both similarity and dissimilarity data, with optional conversion from similarity to dissimilarity.

Usage

list_to_matrix(
  data,
  object_col,
  reference_col,
  value_col,
  is_similarity = FALSE
)

Arguments

data

Data frame in long format with columns for objects, references, and values.

object_col

Character. Name of the column containing object identifiers.

reference_col

Character. Name of the column containing reference identifiers.

value_col

Character. Name of the column containing measurement values.

is_similarity

Logical. Whether values are similarities (TRUE) or dissimilarities (FALSE). If TRUE, similarities will be converted to dissimilarities by subtracting from the maximum value per reference. Default: FALSE.

Details

Convert List Format Data to Dissimilarity Matrix

The function expects data in long format with at least three columns:

A column for object names
A column for reference names
A column containing the (dis)similarity values

When is_similarity = TRUE, the function converts similarities to dissimilarities by subtracting each similarity value from the maximum similarity value within each reference group. Threshold indicators (< or >) are handled appropriately and inverted during similarity-to-dissimilarity conversion.

Value

A symmetric matrix of dissimilarities with row and column names corresponding to the union of unique objects and references in the data. NA values represent unmeasured pairs, and the diagonal is set to 0.

Examples

# Example with dissimilarity data
data_dissim <- data.frame(
  object = c("A", "B", "A", "C"),
  reference = c("X", "X", "Y", "Y"),
  dissimilarity = c(2.5, 1.8, 3.0, 4.2)
)

mat_dissim <- list_to_matrix(
  data = data_dissim,
  object_col = "object",
  reference_col = "reference",
  value_col = "dissimilarity",
  is_similarity = FALSE
)

# Example with similarity data (will be converted to dissimilarity)
data_sim <- data.frame(
  object = c("A", "B", "A", "C"),
  reference = c("X", "X", "Y", "Y"),
  similarity = c(7.5, 8.2, 7.0, 5.8)
)

mat_from_sim <- list_to_matrix(
  data = data_sim,
  object_col = "object",
  reference_col = "reference",
  value_col = "similarity",
  is_similarity = TRUE
)

Log Transform Parameter Samples

Description

Reads parameter samples from a CSV file and applies a log transformation to specified parameter columns (e.g., N, k0, cooling_rate, c_repulsion).

Note: As of version 2.0.0, this function is primarily for backward compatibility with existing parameter files. The initial_parameter_optimization() function now returns log-transformed parameters directly, eliminating the need for this separate transformation step in the normal workflow.

Usage

log_transform_parameters(samples_file, output_file = NULL)

Arguments

samples_file

Character. Path to the CSV file containing the parameter samples.

output_file

Character. Optional path to save the transformed data as a new CSV file.

Details

This function is maintained for users who have existing parameter files from older versions of the package or who need to work with parameter files that contain original-scale parameters. In the current workflow:

initial_parameter_optimization() –> returns log-transformed parameters directly
run_adaptive_sampling() –> works with log-transformed parameters
euclidean_embedding() –> works with original-scale parameters

If you are working with the current workflow (using Euclidify() or calling initial_parameter_optimization() directly), you typically do not need to call this function.

Value

A data.frame with the log-transformed parameters. If output_file is specified, the data frame is also written to a file and returned invisibly.

Note

Backward Compatibility Note: This function is maintained for compatibility with existing workflows and parameter files. For new workflows, consider using initial_parameter_optimization() which returns log-transformed parameters directly.

Examples

# This example uses a sample file included with the package.
sample_file <- system.file("extdata", "sample_params.csv", package = "topolow")

# Ensure the file exists before running the example
if (nzchar(sample_file)) {
  # Transform the data from the sample file and return as a data frame
  transformed_data <- log_transform_parameters(sample_file, output_file = NULL)

  # Display the first few rows of the transformed data
  print(head(transformed_data))
}

Create Interactive Plot

Description

Converts a static ggplot visualization to an interactive plotly visualization with customizable tooltips and interactive features.

Usage

make_interactive(plot, tooltip_vars = NULL)

Arguments

plot

ggplot object to convert

Vector of variable names to include in tooltips

Details

The function enhances static plots by adding:

Hover tooltips with data values
Zoom capabilities
Pan capabilities
Click interactions
Double-click to reset

If tooltip_vars is NULL, the function attempts to automatically determine relevant variables from the plot's mapping.

Value

A plotly object with interactive features.

Examples

if (interactive() && requireNamespace("plotly", quietly = TRUE)) {
# Create sample data and plot
data <- data.frame(
  V1 = rnorm(100), V2 = rnorm(100), name=1:100,
  antigen = rep(c(0,1), 50), antiserum = rep(c(1,0), 50),
  year = rep(2000:2009, each=10), cluster = rep(1:5, each=20)
)

# Create temporal plot
p1 <- plot_temporal_mapping(data, ndim=2)

# Make interactive with default tooltips
p1_interactive <- make_interactive(p1)

# Create cluster plot with custom tooltips
p2 <- plot_cluster_mapping(data, ndim=2)
p2_interactive <- make_interactive(p2,
  tooltip_vars = c("cluster", "year", "antigen")
)
}

Plot Aesthetic Configuration Class

Description

S3 class for configuring plot visual aesthetics including points, colors, labels and text elements.

Usage

new_aesthetic_config(
  point_size = 3.5,
  point_alpha = 0.8,
  point_shapes = c(antigen = 16, antiserum = 0),
  color_palette = c25,
  gradient_colors = list(low = "blue", high = "red"),
  show_labels = FALSE,
  show_title = FALSE,
  label_size = 3,
  title_size = 14,
  subtitle_size = 12,
  axis_title_size = 12,
  axis_text_size = 10,
  legend_text_size = 10,
  legend_title_size = 12,
  show_legend = TRUE,
  legend_position = "right",
  arrow_head_size = 0.2,
  arrow_alpha = 0.6
)

Arguments

point_size

Base point size

point_alpha

Point transparency

point_shapes

Named vector of shapes for different point types

color_palette

Color palette name or custom palette

gradient_colors

List with low and high colors for gradients

show_labels

Whether to show point labels

show_title

Whether to show plot title (default: FALSE)

label_size

Label text size

title_size

Title text size

subtitle_size

Subtitle text size

axis_title_size

Axis title text size

axis_text_size

Axis text size

legend_text_size

Legend text size

legend_title_size

Legend title text size

show_legend

Whether to show the legend

legend_position

Legend position ("none", "right", "left", "top", "bottom")

arrow_head_size

Size of the arrow head for velocity arrows (in cm)

arrow_alpha

Transparency of arrows (0 = invisible, 1 = fully opaque)

Value

An S3 object of class aesthetic_config, which is a list containing the specified configuration parameters for plot aesthetics.

Visualization functions for the topolow package Plot Annotation Configuration Class

Description

S3 class for configuring point annotations in plots, including labels, connecting lines, and visual properties.

Usage

new_annotation_config(
  notable_points = NULL,
  size = 4.9,
  color = "black",
  alpha = 0.9,
  fontface = "plain",
  box = FALSE,
  segment_size = 0.3,
  segment_alpha = 0.6,
  min_segment_length = 0,
  max_overlaps = Inf,
  outline_size = 0.4
)

Arguments

notable_points

Character vector of notable points to highlight

size

Numeric. Size of annotations for notable points

color

Character. Color of annotations for notable points

alpha

Numeric. Alpha transparency of annotations

fontface

Character. Font face of annotations ("plain", "bold", "italic", etc.)

box

Logical. Whether to draw a box around annotations

segment_size

Numeric. Size of segments connecting annotations to points

segment_alpha

Numeric. Alpha transparency of connecting segments

min_segment_length

Numeric. Minimum length of connecting segments

max_overlaps

Numeric. Maximum number of overlaps allowed for annotations

outline_size

Numeric. Size of the outline for annotations

Value

An S3 object of class annotation_config, which is a list containing the specified configuration parameters for plot annotations.

Dimension Reduction Configuration Class

Description

S3 class for configuring dimension reduction parameters including method selection and algorithm-specific parameters.

Usage

new_dim_reduction_config(
  method = "pca",
  n_components = 2,
  scale = TRUE,
  center = TRUE,
  pca_params = list(tol = sqrt(.Machine$double.eps), rank. = NULL),
  umap_params = list(n_neighbors = 15, min_dist = 0.1, metric = "euclidean", n_epochs =
    200),
  tsne_params = list(perplexity = 30, mapping_max_iter = 1000, theta = 0.5),
  compute_loadings = FALSE,
  random_state = NULL
)

Arguments

method

Dimension reduction method ("pca", "umap", "tsne")

n_components

Number of components to compute

scale

Scale the data before reduction

center

Center the data before reduction

pca_params

List of PCA-specific parameters

umap_params

List of UMAP-specific parameters

tsne_params

List of t-SNE-specific parameters

compute_loadings

Compute and return loadings

random_state

Random seed for reproducibility

Value

An S3 object of class dim_reduction_config, which is a list containing the specified configuration parameters for dimensionality reduction.

Plot Layout Configuration Class

Description

S3 class for configuring plot layout including dimensions, margins, grids and coordinate systems.

Usage

new_layout_config(
  width = 8,
  height = 8,
  dpi = 300,
  aspect_ratio = 1,
  show_grid = TRUE,
  grid_type = "major",
  grid_color = "grey80",
  grid_linetype = "dashed",
  show_axis = TRUE,
  axis_lines = TRUE,
  plot_margin = margin(1, 1, 1, 1, "cm"),
  coord_type = "fixed",
  background_color = "white",
  panel_background_color = "white",
  panel_border = TRUE,
  panel_border_color = "black",
  save_plot = FALSE,
  save_format = "png",
  reverse_x = 1,
  reverse_y = 1,
  x_limits = NULL,
  y_limits = NULL,
  arrow_plot_threshold = 0.1
)

Arguments

width

Plot width in inches

height

Plot height in inches

dpi

Plot resolution

aspect_ratio

Plot aspect ratio

show_grid

Show plot grid

grid_type

Grid type ("none", "major", "minor", "both")

grid_color

Grid color

grid_linetype

Grid line type

show_axis

Show axes

axis_lines

Show axis lines

plot_margin

Plot margins in cm

coord_type

Coordinate type ("fixed", "equal", "flip", "polar")

background_color

Plot background color

panel_background_color

Panel background color

panel_border

Show panel border

panel_border_color

Panel border color

save_plot

Logical. Whether to save the plot to a file.

save_format

Plot save format ("png", "pdf", "svg", "eps")

reverse_x

Numeric multiplier for x-axis direction (1 or -1)

reverse_y

Numeric multiplier for y-axis direction (1 or -1)

x_limits

Numeric vector of length 2 specifying c(min, max) for x-axis. If NULL, limits are set automatically.

y_limits

Numeric vector of length 2 specifying c(min, max) for y-axis. If NULL, limits are set automatically.

arrow_plot_threshold

Threshold for velocity arrows to be drawn in the same antigenic distance unit (default: 0.10)

Value

An S3 object of class layout_config, which is a list containing the specified configuration parameters for plot layout.

Parameter Sensitivity Analysis

Description

Analyzes the sensitivity of the model performance (measured by MAE) to changes in a single parameter. This function bins the parameter range to identify the minimum MAE for each bin, helping to understand how robust the model is to parameter choices.

Usage

parameter_sensitivity_analysis(
  param,
  samples,
  bins = 30,
  mae_col = "Holdout_MAE",
  threshold_pct = 5,
  min_samples = 1
)

Arguments

param

The character name of the parameter to analyze.

samples

A data frame containing parameter samples and performance metrics.

bins

The integer number of bins to divide the parameter range into.

mae_col

The character name of the column containing the Mean Absolute Error (MAE) values.

threshold_pct

A numeric percentage above the minimum MAE to define an acceptable performance threshold.

min_samples

The integer minimum number of samples required in a bin for it to be included in the analysis.

Details

The function performs these steps:

Cleans the input data using Median Absolute Deviation (MAD) to remove outliers.
Bins the parameter values into equal-width bins.
Calculates the minimum MAE within each bin to create an empirical performance curve.
Identifies a performance threshold based on a percentage above the global minimum MAE.
Returns an S3 object for plotting and further analysis.

Value

An object of class "parameter_sensitivity" containing:

param_values

Vector of parameter bin midpoints

min_mae

Vector of minimum MAE values per bin

param_name

Name of analyzed parameter

threshold

Threshold value (default: min. +5%)

min_value

Minimum MAE value across all bins

sample_counts

Number of samples per bin

Plot Parameter Sensitivity Analysis

Description

The S3 plot method for parameter_sensitivity objects. It creates a visualization showing how the model's performance (minimum MAE) changes across the range of a single parameter. A threshold line is included to indicate the region of acceptable performance.

Usage

## S3 method for class 'parameter_sensitivity'
plot(
  x,
  width = 3.5,
  height = 3.5,
  save_plot = FALSE,
  output_dir,
  y_limit_factor = NULL,
  ...
)

Arguments

x

A parameter_sensitivity object, typically from parameter_sensitivity_analysis().

width

The numeric width of the output plot in inches.

height

The numeric height of the output plot in inches.

save_plot

A logical indicating whether to save the plot to a file.

output_dir

A character string specifying the directory for output files. Required if save_plot is TRUE.

y_limit_factor

A numeric factor to set the upper y-axis limit as a percentage above the threshold value (e.g., 1.10 for 10% above). If NULL, scaling is automatic.

...

Additional arguments (not currently used).

Value

A ggplot object representing the sensitivity plot.

Plot Method for profile_likelihood Objects

Description

Creates a visualization of the profile likelihood for a parameter, showing the maximum likelihood estimates and the 95% confidence interval. It supports mathematical notation for parameter names for clearer plot labels.

Usage

## S3 method for class 'profile_likelihood'
plot(x, LL_max, width = 3.5, height = 3.5, save_plot = FALSE, output_dir, ...)

Arguments

x

A profile_likelihood object returned by profile_likelihood().

LL_max

The global maximum log-likelihood value from the entire sample set, used as the reference for calculating the confidence interval.

width, height

Numeric. The width and height of the output plot in inches.

save_plot

Logical. If TRUE, the plot is saved to a file.

output_dir

Character. The directory where the plot will be saved. Required if save_plot is TRUE.

...

Additional arguments passed to the plot function.

Details

The 95% confidence interval is determined using the likelihood ratio test, where the cutoff is based on the chi-squared distribution: LR(\theta_{ij}) = -2[log L_{max}(\theta_{ij}) - log L_{max}(\hat{\theta})]. The interval includes all parameter values \theta_{ij} for which LR(\theta_{ij}) \leq \chi^2_{1,0.05} \approx 3.84.

Value

A ggplot object representing the profile likelihood plot.

Examples


# This example can take more than 5 seconds to run.
# Create a sample data frame of MCMC samples
samples <- data.frame(
  log_N = log(runif(50, 2, 10)),
  log_k0 = log(runif(50, 1, 5)),
  log_cooling_rate = log(runif(50, 0.01, 0.1)),
  log_c_repulsion = log(runif(50, 0.1, 1)),
  NLL = runif(50, 20, 100)
)

# Calculate profile likelihood for the "log_N" parameter
pl_result <- profile_likelihood("log_N", samples, grid_size = 10)

# Provide the global maximum log-likelihood from the samples
LL_max <- max(-samples$NLL)

# The plot function requires the ggplot2 package
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(pl_result, LL_max, width = 4, height = 3)
}

Plot Method for topolow Convergence Diagnostics

Description

Creates visualizations of convergence diagnostics from a sampling run, including parameter mean trajectories and covariance matrix stability over iterations. This helps assess whether parameter estimation has converged.

Usage

## S3 method for class 'topolow_convergence'
plot(x, param_names = NULL, ...)

Arguments

x

A topolow_convergence object from check_gaussian_convergence().

param_names

Optional character vector of parameter names for plot titles. If NULL, names are taken from the input object.

...

Additional arguments (not currently used).

Details

The function generates two types of plots:

Parameter mean plots: Shows how the mean value for each parameter changes over iterations. Stabilization of these plots indicates convergence.
Covariance change plot: Shows the relative change in the Frobenius norm of the covariance matrix. A decreasing trend approaching zero indicates stable relationships between parameters.

Value

A grid of plots showing convergence metrics.

Examples

# Example with simulated data
chain_data <- data.frame(
  param1 = rnorm(1000, mean = 1.5, sd = 0.1),
  param2 = rnorm(1000, mean = -0.5, sd = 0.2)
)

# Check convergence
results <- check_gaussian_convergence(chain_data)

# Plot diagnostics
plot(results)

# With custom parameter names
plot(results, param_names = c("Parameter 1 (log)", "Parameter 2 (log)"))

Plot Method for topolow parameter estimation Diagnostics

Description

Creates trace and density plots for multiple chains to assess convergence and mixing. This is an S3 method that dispatches on topolow_diagnostics objects.

Usage

## S3 method for class 'topolow_diagnostics'
plot(
  x,
  output_dir,
  output_file = "topolow_param_diagnostics.png",
  save_plot = FALSE,
  ...
)

Arguments

x

A topolow_diagnostics object from calculate_diagnostics().

output_dir

Character. Directory for output files. Required if save_plot is TRUE.

output_file

Character path for saving the plot.

save_plot

Logical. Whether to save the plot.

...

Additional arguments passed to create_diagnostic_plots.

Value

A ggplot object of the combined plots.

Create 3D Visualization

Description

Creates an interactive or static 3D visualization using rgl. Supports both temporal and cluster-based coloring schemes with configurable point appearances and viewing options.

Usage

plot_3d_mapping(
  df,
  ndim,
  dim_config = new_dim_reduction_config(),
  aesthetic_config = new_aesthetic_config(),
  layout_config = new_layout_config(),
  interactive = TRUE,
  output_dir
)

Arguments

df

Data frame containing: - V1, V2, ... Vn: Coordinate columns - antigen: Binary indicator for antigen points - antiserum: Binary indicator for antiserum points - cluster: (Optional) Factor or integer cluster assignments - year: (Optional) Numeric year values for temporal coloring

ndim

Number of dimensions in input coordinates (must be >= 3)

dim_config

Dimension reduction configuration object

aesthetic_config

Aesthetic configuration object

layout_config

Layout configuration object

interactive

Logical; whether to create an interactive plot

output_dir

Character. Directory for output files. Required if interactive is FALSE.

Details

The function supports two main visualization modes:

Interactive mode: Creates a manipulatable 3D plot window
Static mode: Generates a static image from a fixed viewpoint

Color schemes are automatically selected based on available data:

If cluster data is present: Uses discrete colors per cluster
If year data is present: Uses continuous color gradient
Otherwise: Uses default point colors

For data with more than 3 dimensions, dimension reduction is applied first.

Note: This function requires the rgl package and OpenGL support. If rgl is not available, the function will return a 2D plot with a message explaining how to enable 3D visualization.

Value

Invisibly returns the rgl scene ID for further manipulation if rgl is available, or a 2D ggplot object as a fallback.

Examples

# Create sample data
set.seed(123)
data <- data.frame(
  V1 = rnorm(100), V2 = rnorm(100), V3 = rnorm(100), V4 = rnorm(100), name = 1:100,
  antigen = rep(c(0,1), 50), antiserum = rep(c(1,0), 50),
  cluster = rep(1:5, each=20), year = rep(2000:2009, each=10)
)

# Create a static plot and save to a temporary file
# This example requires an interactive session and the 'rgl' package.
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  temp_dir <- tempdir()
  # Basic interactive plot (will open a new window)
  if(interactive()) {
    plot_3d_mapping(data, ndim=4)
  }

# Custom configuration for temporal visualization
aesthetic_config <- new_aesthetic_config(
  point_size = 5,
  point_alpha = 0.8,
  gradient_colors = list(
    low = "blue",
    high = "red"
  )
)

layout_config <- new_layout_config(
  width = 12,
  height = 12,
  background_color = "black",
  show_axis = TRUE
)
  # Create customized static plot and save it
plot_3d_mapping(data, ndim=4,
  aesthetic_config = aesthetic_config,
  layout_config = layout_config,
  interactive = FALSE, output_dir = temp_dir
)
  list.files(temp_dir)
  unlink(temp_dir, recursive = TRUE)
}

Create Clustered Mapping Plots

Description

Antigenic Mapping and Antigenic Velocity Function. Creates a visualization of points colored by cluster assignment using dimension reduction, with optional antigenic velocity arrows. Points are colored by cluster with different shapes for antigens and antisera.

Usage

plot_cluster_mapping(
  df_coords,
  ndim,
  dim_config = new_dim_reduction_config(),
  aesthetic_config = new_aesthetic_config(),
  layout_config = new_layout_config(),
  annotation_config = new_annotation_config(),
  output_dir,
  show_shape_legend = TRUE,
  cluster_legend_title = "Cluster",
  draw_arrows = FALSE,
  annotate_arrows = TRUE,
  phylo_tree = NULL,
  sigma_t = NULL,
  sigma_x = NULL,
  clade_node_depth = NULL,
  show_one_arrow_per_cluster = FALSE,
  cluster_legend_order = NULL
)

Arguments

df_coords

Data frame containing: - V1, V2, ... Vn: Coordinate columns - antigen: Binary indicator for antigen points - antiserum: Binary indicator for antiserum points - cluster: Factor or integer cluster assignments

ndim

Number of dimensions in input coordinates

dim_config

Dimension reduction configuration object specifying method and parameters

aesthetic_config

Aesthetic configuration object controlling plot appearance

layout_config

Layout configuration object controlling plot dimensions and style. Use x_limits and y_limits in layout_config to set axis limits.

annotation_config

Annotation configuration object for labeling notable points

output_dir

Character. Directory for output files. Required if layout_config$save_plot is TRUE.

show_shape_legend

Logical. Whether to show the shape legend (default: TRUE)

cluster_legend_title

Character. Custom title for the cluster legend (default: "Cluster")

draw_arrows

logical; if TRUE, compute and draw antigenic drift vectors

annotate_arrows

logical; if TRUE, show names of the points having arrows

phylo_tree

Optional; phylo object in Newick format. Does not need to be rooted. If provided, used to compute antigenic velocity arrows.

sigma_t

Optional; numeric; bandwidth for the Gaussian kernel discounting on time in years or the time unit of the data. If NULL, uses Silverman's rule of thumb.

sigma_x

Optional; numeric; bandwidth for the Gaussian kernel discounting on antigenic distance in antigenic units. If NULL, uses Silverman's rule of thumb.

clade_node_depth

Optional; integer; number of levels of parent nodes to define clades. Antigens from different clades will be excluded from the calculation antigenic velocity arrows. (Default: Automatically calculated mode of leaf-to-backbone distance of the tree)

show_one_arrow_per_cluster

Shows only the largest antigenic velocity arrow in each cluster

cluster_legend_order

in case you prefer a certain order for clusters in the legend, provide a list with that order here; e.g., c("cluster 2", "cluster 1")

Details

The function performs these steps:

Validates input data structure and types
Applies dimension reduction if ndim > 2
Creates visualization with cluster-based coloring
Applies specified aesthetic and layout configurations
Applies custom axis limits if specified in layout_config

Different shapes distinguish between antigens and antisera points, while color represents cluster assignment. The color palette can be customized through the aesthetic_config.

Value

A ggplot object containing the cluster mapping visualization.

Examples

# Basic usage with default configurations
data <- data.frame(
  V1 = rnorm(100), V2 = rnorm(100), V3 = rnorm(100), name = 1:100,
  antigen = rep(c(0,1), 50), antiserum = rep(c(1,0), 50),
  cluster = rep(1:5, each=20)
)
p1 <- plot_cluster_mapping(data, ndim=3)

# Save plot to a temporary directory
temp_dir <- tempdir()
# Custom configurations with specific color palette and axis limits
aesthetic_config <- new_aesthetic_config(
  point_size = 4,
  point_alpha = 0.7,
  color_palette = c("red", "blue", "green", "purple", "orange"),
  show_labels = TRUE,
  label_size = 3
)

layout_config_save <- new_layout_config(save_plot = TRUE,
  width = 10,
  height = 8,
  coord_type = "fixed",
  show_grid = TRUE,
  grid_type = "major",
  x_limits = c(-10, 10),
  y_limits = c(-8, 8)
)

p_saved <- plot_cluster_mapping(data, ndim=3, 
  layout_config = layout_config_save, 
  aesthetic_config = aesthetic_config,
  output_dir = temp_dir
)

list.files(temp_dir)
unlink(temp_dir, recursive = TRUE)

Plot Network Structure

Description

Creates a visualization of the dissimilarity matrix as a network graph, showing data availability patterns and connectivity between points.

Usage

plot_network_structure(
  network_results,
  output_file = NULL,
  width = 3000,
  height = 3000,
  dpi = 300
)

Arguments

network_results

The list output from analyze_network_structure().

output_file

Character. An optional full path to save the plot. If NULL, the plot is not saved.

width

Numeric. Width in pixels for saved plot (default: 3000).

height

Numeric. Height in pixels for saved plot (default: 3000).

dpi

Numeric. Resolution in dots per inch (default: 300).

Value

A ggplot object representing the network graph.

Examples

# Create a sample dissimilarity matrix
adj_mat <- matrix(runif(25), 5, 5)
rownames(adj_mat) <- colnames(adj_mat) <- paste0("Point", 1:5)
adj_mat[lower.tri(adj_mat)] <- t(adj_mat)[lower.tri(adj_mat)]
diag(adj_mat) <- 0
net_analysis <- analyze_network_structure(adj_mat)

# Create and display the plot
plot_network_structure(net_analysis)

Create Temporal Mapping Plot

Description

Antigenic Mapping and Antigenic Velocity Function. Creates a visualization of points colored by time (year) using dimension reduction, with optional antigenic velocity arrows. Points are colored on a gradient scale based on their temporal values, with different shapes for antigens and antisera.

Usage

plot_temporal_mapping(
  df_coords,
  ndim,
  dim_config = new_dim_reduction_config(),
  aesthetic_config = new_aesthetic_config(),
  layout_config = new_layout_config(),
  annotation_config = new_annotation_config(),
  output_dir,
  show_shape_legend = TRUE,
  draw_arrows = FALSE,
  annotate_arrows = TRUE,
  phylo_tree = NULL,
  sigma_t = NULL,
  sigma_x = NULL,
  clade_node_depth = NULL
)

Arguments

df_coords

Data frame containing: - V1, V2, ... Vn: Coordinate columns - antigen: Binary indicator for antigen points - antiserum: Binary indicator for antiserum points - year: Numeric year values for temporal coloring

ndim

Number of dimensions in input coordinates

dim_config

Dimension reduction configuration object specifying method and parameters

aesthetic_config

Aesthetic configuration object controlling plot appearance

layout_config

Layout configuration object controlling plot dimensions and style. Use x_limits and y_limits in layout_config to set axis limits.

annotation_config

Annotation configuration object for labeling notable points

output_dir

Character. Directory for output files. Required if layout_config$save_plot is TRUE.

show_shape_legend

Logical. Whether to show the shape legend (default: TRUE)

draw_arrows

logical; if TRUE, compute and draw antigenic drift vectors

annotate_arrows

logical; if TRUE, show names of the points having arrows

phylo_tree

Optional; phylo object in Newick format. Does not need to be rooted. If provided, used to compute antigenic velocity arrows.

sigma_t

Optional; numeric; bandwidth for the Gaussian kernel discounting on time in years or the time unit of the data. If NULL, uses Silverman's rule of thumb.

sigma_x

Optional; numeric; bandwidth for the Gaussian kernel discounting on antigenic distancein antigenic units. If NULL, uses Silverman's rule of thumb.

clade_node_depth

Details

The function performs these steps:

Validates input data structure and types
Applies dimension reduction if ndim > 2
Creates visualization with temporal color gradient
Applies specified aesthetic and layout configurations
Applies custom axis limits if specified in layout_config

Different shapes distinguish between antigens and antisera points, while color represents temporal progression.

Value

A ggplot object containing the temporal mapping visualization.

Examples

# Basic usage with default configurations
data <- data.frame(
  V1 = rnorm(100), V2 = rnorm(100), V3 = rnorm(100), name = 1:100,
  antigen = rep(c(0,1), 50), antiserum = rep(c(1,0), 50),
  year = rep(2000:2009, each=10)
)
# Plot without saving
p1 <- plot_temporal_mapping(data, ndim=3)

# Save plot to a temporary directory
temp_dir <- tempdir()
layout_config_save <- new_layout_config(save_plot = TRUE,
                       x_limits = c(-10, 10),
                       y_limits = c(-8, 8))
p_saved <- plot_temporal_mapping(data, ndim = 3, layout_config = layout_config_save, 
                                 output_dir = temp_dir)
list.files(temp_dir) # Check that file was created
unlink(temp_dir, recursive = TRUE) # Clean up

Print Method for Parameter Sensitivity Objects

Description

The S3 print method for parameter_sensitivity objects. It displays a concise summary of the analysis results, including the parameter analyzed, the minimum error found, and the performance threshold.

Usage

## S3 method for class 'parameter_sensitivity'
print(x, ...)

Arguments

x

A parameter_sensitivity object.

...

Additional arguments passed to the print function (not currently used).

Value

Invisibly returns the original object. Called for its side effect of printing a summary to the console.

Print Method for profile_likelihood Objects

Description

Provides a concise summary of a profile_likelihood object.

Usage

## S3 method for class 'profile_likelihood'
print(x, ...)

Arguments

x

A profile_likelihood object.

...

Additional arguments passed to print.

Value

The original profile_likelihood object (invisibly). Called for its side effect of printing a summary to the console.

Print method for topolow objects

Description

Provides a concise display of key optimization results from euclidean_embedding.

Usage

## S3 method for class 'topolow'
print(x, ...)

Arguments

x

A topolow object returned by euclidean_embedding().

...

Additional arguments passed to print (not used).

Value

The original topolow object (invisibly). This function is called for its side effect of printing a summary to the console.

Examples

# Create a simple dissimilarity matrix and run the optimization
dist_mat <- matrix(c(0, 2, 3, 2, 0, 4, 3, 4, 0), nrow=3)
result <- euclidean_embedding(dist_mat, ndim=2, mapping_max_iter=50,
                           k0=1.0, cooling_rate=0.001, c_repulsion=0.1,
                           verbose = FALSE)
# Print the result object
print(result)

Print Method for topolow Convergence Diagnostics

Description

Print Method for topolow Convergence Diagnostics

Usage

## S3 method for class 'topolow_convergence'
print(x, ...)

Arguments

x

A topolow_convergence object.

...

Additional arguments passed to print.

Value

No return value; called for its side effect of printing a summary.

Print Method for topolow parameter estimation Diagnostics

Description

Print Method for topolow parameter estimation Diagnostics

Usage

## S3 method for class 'topolow_diagnostics'
print(x, ...)

Arguments

x

A topolow_diagnostics object.

...

Additional arguments passed to print.

Value

No return value; called for its side effect of printing a summary.

Process Raw Antigenic Assay Data

Description

Processes raw antigenic assay data from data frames into standardized long and matrix formats. Handles both similarity data (like titers, which need conversion to distances) and direct dissimilarity measurements like IC50. Preserves threshold indicators (<, >) and handles repeated measurements by averaging.

Usage

process_antigenic_data(
  data,
  antigen_col,
  serum_col,
  value_col,
  is_similarity = FALSE,
  metadata_cols = NULL,
  base = NULL,
  scale_factor = 1
)

Arguments

data

Data frame containing raw data.

antigen_col

Character. Name of column containing virus/antigen identifiers.

serum_col

Character. Name of column containing serum/antibody identifiers.

value_col

Character. Name of column containing measurements (titers or distances).

is_similarity

Logical. Whether values are measures of similarity such as titers or binding affinities (TRUE) or dissimilarities like IC50 (FALSE). Default: FALSE.

metadata_cols

Character vector. Names of additional columns to preserve.

base

Numeric. Base for logarithm transformation (default: 2 for similarities, e for dissimilarities).

scale_factor

Numeric. Scale factor for similarities. This is the base value that all other dilutions are multiples of. E.g., 10 for HI assay where titers are 10, 20, 40,... Default: 1.

Details

The function handles these key steps:

Validates input data and required columns
Transforms values to log scale
Converts similarities to distances using Smith's method if needed
Averages repeated measurements
Creates standardized long format
Creates symmetric distance matrix
Preserves metadata and threshold indicators
Preserves virusYear and serumYear columns if present

Input requirements and constraints:

Data frame must contain required columns
Column names must match specified parameters
Values can include threshold indicators (< or >)
Metadata columns must exist if specified
Allowed Year-related column names are "virusYear" and "serumYear"

Value

A list containing two elements:

long

A data.frame in long format with standardized columns, including the original identifiers, processed values, and calculated distances. Any specified metadata is also included.

matrix

A numeric matrix representing the processed symmetric distance matrix, with antigens and sera on columns and rows.

Examples

# Example 1: Processing HI titer data (similarities)
antigen_data <- data.frame(
  virus = c("A/H1N1/2009", "A/H1N1/2010", "A/H1N1/2011", "A/H1N1/2009", "A/H1N1/2010"),
  serum = c("anti-2009", "anti-2009", "anti-2009", "anti-2010", "anti-2010"),
  titer = c(1280, 640, "<40", 2560, 1280),  # Some below detection limit
  cluster = c("A", "A", "B", "A", "A"),
  color = c("red", "red", "blue", "red", "red")
)

# Process HI titer data (similarities -> distances)
results <- process_antigenic_data(
  data = antigen_data,
  antigen_col = "virus",
  serum_col = "serum", 
  value_col = "titer",
  is_similarity = TRUE,  # Titers are similarities
  metadata_cols = c("cluster", "color"),
  scale_factor = 10  # Base dilution factor
)

# View the long format data
print(results$long)
# View the distance matrix
print(results$matrix)

# Example 2: Processing IC50 data (already dissimilarities)
ic50_data <- data.frame(
  virus = c("HIV-1", "HIV-2", "HIV-3"),
  antibody = c("mAb1", "mAb1", "mAb2"),
  ic50 = c(0.05, ">10", 0.2)
)

results_ic50 <- process_antigenic_data(
  data = ic50_data,
  antigen_col = "virus",
  serum_col = "antibody",
  value_col = "ic50",
  is_similarity = FALSE  # IC50 values are dissimilarities
)

Profile Likelihood Analysis

Description

Calculates the profile likelihood for a given parameter by evaluating the conditional maximum likelihood across a grid of parameter values. This "empirical profile likelihood" estimates the likelihood surface based on samples from Monte Carlo simulations.

Usage

profile_likelihood(
  param,
  samples,
  grid_size = 40,
  bandwidth_factor = 0.05,
  start_factor = 0.5,
  end_factor = 1.5,
  min_samples = 5
)

Arguments

param

The character name of the parameter to analyze (e.g., "log_N").

samples

A data frame containing parameter samples and a log-likelihoods column named "NLL".

grid_size

The integer number of grid points for the analysis.

bandwidth_factor

A numeric factor for the local sample window size.

start_factor, end_factor

Numeric range multipliers for parameter grid (default: 0.5, 1.2)

min_samples

Integer minimum samples required for reliable estimate (default: 10)

Details

For each value in the parameter grid, the function:

Identifies nearby samples using a bandwidth window.
Calculates the conditional maximum likelihood from these samples.
Tracks sample counts to assess the reliability of the estimate.

Value

Object of class "profile_likelihood" containing:

param

Vector of parameter values

ll

Vector of log-likelihood values

param_name

Name of analyzed parameter

bandwidth

Bandwidth used for local windows

sample_counts

Number of samples per estimate

Examples

# Create a sample data frame of parameter samples
mcmc_samples <- data.frame(
  log_N = log(runif(50, 2, 10)),
  log_k0 = log(runif(50, 1, 5)),
  log_cooling_rate = log(runif(50, 0.01, 0.1)),
  log_c_repulsion = log(runif(50, 0.1, 1)),
  NLL = runif(50, 20, 100)
)

# Calculate profile likelihood for the "log_N" parameter
pl <- profile_likelihood("log_N", mcmc_samples,
                         grid_size = 10, # Smaller grid for a quick example
                         bandwidth_factor = 0.05)

# Print the results
print(pl)

S3 Constructor for Profile Likelihood Results

Description

Internal S3 constructor for storing results from profile_likelihood.

Usage

profile_likelihood_result(
  param_values,
  ll_values,
  param_name,
  bandwidth,
  sample_counts
)

Arguments

param_values

Vector of parameter values.

ll_values

Vector of log-likelihood values.

param_name

Name of the analyzed parameter.

bandwidth

Bandwidth used for local windows.

sample_counts

Number of samples per estimate.

Value

An object of class profile_likelihood.

Perform Dimension Reduction

Description

Applies configured dimension reduction method to input data.

Usage

reduce_dimensions(df, config)

Arguments

df

Data frame containing coordinate data

config

Dimension reduction configuration object

Value

Data frame with reduced dimensions

Performs adaptive Monte Carlo sampling

Description

Performs adaptive Monte Carlo sampling to explore and refine the parameter space, running locally in parallel. Samples are drawn adaptively based on previously evaluated likelihoods to focus sampling in high-likelihood regions. Results from all parallel jobs accumulate in a single output file.

Usage

run_adaptive_sampling(
  initial_samples_file,
  scenario_name,
  dissimilarity_matrix,
  max_cores = NULL,
  num_samples = 10,
  mapping_max_iter = 1000,
  relative_epsilon = 1e-04,
  folds = 20,
  output_dir,
  verbose = FALSE
)

Arguments

initial_samples_file

Character. Path to a CSV file containing initial samples.

scenario_name

Character. Name for the output files.

dissimilarity_matrix

Matrix. The input dissimilarity matrix.

max_cores

Integer. Number of cores to use for parallel execution. If NULL, uses all available cores minus 1.

num_samples

Integer. Number of new samples to generate via adaptive sampling.

mapping_max_iter

Integer. Maximum number of map optimization iterations.

relative_epsilon

Numeric. Convergence threshold for relative change in error. Default is 1e-4.

folds

Integer. Number of cross-validation folds.

output_dir

Character. Required directory for output files.

verbose

Logical. Whether to print progress messages. Default is FALSE.

Value

No return value. Called for its side effect of writing results to a CSV file in output_dir.

Examples


# 1. Locate the example initial samples file included with the package
# In a real scenario, this file would be from an 'initial_parameter_optimization' run.
initial_file <- system.file(
  "extdata", "initial_samples_example.csv",
  package = "topolow"
)

# 2. Create a temporary directory for the function's output
# This function requires a writable directory for its results.
temp_out_dir <- tempdir()

# 3. Create a sample dissimilarity matrix for the function to use
dissim_mat <- matrix(runif(100, 1, 10), 10, 10)
diag(dissim_mat) <- 0

# 4. Run the adaptive sampling only if the example file is found
if (nzchar(initial_file)) {
  run_adaptive_sampling(
    initial_samples_file = initial_file,
    scenario_name = "adaptive_test_example",
    dissimilarity_matrix = dissim_mat,
    output_dir = temp_out_dir,
    max_cores = 1,
    num_samples = 1,
    verbose = FALSE
  )

  # 5. Verify output files were created
  print("Output files from adaptive sampling:")
  print(list.files(temp_out_dir, recursive = TRUE))

  # 6. Clean up the temporary directory
  unlink(temp_out_dir, recursive = TRUE)
}

Save Plot to File

Description

Saves a plot (ggplot or rgl scene) to file with specified configuration. Supports multiple output formats and configurable dimensions.

Usage

save_plot(plot, filename, layout_config = new_layout_config(), output_dir)

Arguments

plot

ggplot or rgl scene object to save

filename

Output filename (with or without extension)

layout_config

Layout configuration object controlling output parameters

output_dir

Character. Directory for output files. This argument is required.

Details

Supported file formats:

PNG: Best for web and general use
PDF: Best for publication quality vector graphics
SVG: Best for web vector graphics
EPS: Best for publication quality vector graphics

The function will:

Auto-detect plot type (ggplot or rgl)
Use appropriate saving method
Apply layout configuration settings
Add file extension if not provided

Value

No return value, called for side effects (saves a plot to a file).

Examples

# The sole purpose of save_plot is to write a file, so its example must demonstrate this. 
# For CRAN tests we wrap the example in \donttest{} to avoid writing files.

# Create a temporary directory for saving all plots
temp_dir <- tempdir()

# --- Example 1: Basic ggplot save ---
# Create sample data with 3 dimensions to support both 2D and 3D plots
data <- data.frame(
  V1 = rnorm(10), V2 = rnorm(10), V3 = rnorm(10), name=1:10,
  antigen = rep(c(0,1), 5), antiserum = rep(c(1,0), 5),
  year = 2000:2009
)
p <- plot_temporal_mapping(data, ndim=2)
save_plot(p, "temporal_plot.png", output_dir = temp_dir)

# --- Example 2: Save with custom layout ---
layout_config <- new_layout_config(
  width = 12,
  height = 8,
  dpi = 600,
  save_format = "pdf"
)
save_plot(p, "high_res_plot.pdf", layout_config, output_dir = temp_dir)

# --- Verify files and clean up ---
list.files(temp_dir)
unlink(temp_dir, recursive = TRUE)

Scale Reduced Dimensions to Match Original Distances

Description

Helper function to scale reduced dimensions to better match original distances.

Usage

scale_to_original_distances(reduced_coords, orig_dist)

Arguments

reduced_coords

Matrix of reduced coordinates

orig_dist

Original distance matrix

Value

Scaled coordinate matrix

Plot Fitted vs. True Dissimilarities

Description

Creates diagnostic plots comparing fitted dissimilarities from a model against the true dissimilarities. It generates both a scatter plot with an identity line and prediction intervals, and a residuals plot.

Usage

scatterplot_fitted_vs_true(
  dissimilarity_matrix,
  p_dissimilarity_mat,
  scenario_name,
  ndim,
  save_plot = FALSE,
  output_dir,
  confidence_level = 0.95
)

Arguments

dissimilarity_matrix

Matrix of true dissimilarities.

p_dissimilarity_mat

Matrix of predicted/fitted dissimilarities.

scenario_name

Character string for output file naming. Used if save_plot is TRUE.

ndim

Integer number of dimensions used in the model.

save_plot

Logical. Whether to save plots to files. Default: FALSE.

output_dir

Character. Directory for output files. Required if save_plot is TRUE.

confidence_level

Numeric confidence level for prediction intervals (default: 0.95).

Value

A list containing the scatter_plot and residuals_plot ggplot objects.

Examples

# Create sample data
true_dist <- matrix(runif(100, 1, 10), 10, 10)
pred_dist <- true_dist + rnorm(100)

# Create plots without saving
plots <- scatterplot_fitted_vs_true(
  dissimilarity_matrix = true_dist,
  p_dissimilarity_mat = pred_dist,
  save_plot = FALSE
 )

# You can then display a plot, for instance:
# plots$scatter_plot

Summary method for topolow objects

Description

Provides a more detailed summary of the optimization results from euclidean_embedding, including parameters, convergence, and performance metrics.

Usage

## S3 method for class 'topolow'
summary(object, ...)

Arguments

object

A topolow object returned by euclidean_embedding().

...

Additional arguments passed to summary (not used).

Value

No return value. This function is called for its side effect of printing a detailed summary to the console.

Examples

# Create a simple dissimilarity matrix and run the optimization
dist_mat <- matrix(c(0, 2, 3, 2, 0, 4, 3, 4, 0), nrow=3)
result <- euclidean_embedding(dist_mat, ndim=2, mapping_max_iter=50,
                           k0=1.0, cooling_rate=0.001, c_repulsion=0.1,
                           verbose = FALSE)
# Summarize the result object
summary(result)

Convert distance matrix to assay panel format

Description

Convert distance matrix to assay panel format

Usage

symmetric_to_nonsymmetric_matrix(dist_matrix, selected_names)

Arguments

dist_matrix

Distance matrix

selected_names

Names of reference points

Value

A non-symmetric matrix in assay panel format, where rows are test antigens and columns are reference antigens.

Convert Table Format Data to Dissimilarity Matrix

Description

Converts data from table/matrix format (objects as rows, references as columns) to a symmetric dissimilarity matrix. The function creates a matrix where both rows and columns contain the union of all object and reference names.

Usage

table_to_matrix(data, is_similarity = FALSE)

Arguments

data

Matrix or data frame where rownames represent objects, columnnames represent references, and cells contain (dis)similarity values.

is_similarity

Details

The function takes a table where:

Rows represent objects
Columns represent references
Values represent (dis)similarities

It creates a symmetric matrix where both rows and columns contain the union of all object names (row names) and reference names (column names). The original measurements are preserved, and the matrix is made symmetric by filling both (i,j) and (j,i) positions with the same value.

When is_similarity = TRUE, similarities are converted to dissimilarities by subtracting each value from the maximum value in its respective column (reference). Threshold indicators (< or >) are handled and inverted during conversion.

Value

A symmetric matrix of dissimilarities with row and column names corresponding to the union of all object and reference names. NA values represent unmeasured pairs, and the diagonal is set to 0.

Examples

# Example with dissimilarity data in table format
dissim_table <- matrix(c(1.2, 2.1, 3.4, 1.8, 2.9, 4.1),
                      nrow = 2, ncol = 3,
                      dimnames = list(c("Obj1", "Obj2"),
                                    c("Ref1", "Ref2", "Ref3")))

mat_dissim <- table_to_matrix(dissim_table, is_similarity = FALSE)

# Example with similarity data (will be converted to dissimilarity)
sim_table <- matrix(c(8.8, 7.9, 6.6, 8.2, 7.1, 5.9),
                   nrow = 2, ncol = 3,
                   dimnames = list(c("Obj1", "Obj2"),
                                 c("Ref1", "Ref2", "Ref3")))

mat_from_sim <- table_to_matrix(sim_table, is_similarity = TRUE)

Convert Long Format Data to Distance Matrix

Description

Converts a dataset from long format to a symmetric distance matrix. The function handles antigenic cartography data where measurements may exist between antigens and antisera points. Row and column names can be optionally sorted by a time variable.

Usage

titers_list_to_matrix(
  data,
  chnames,
  chorder = NULL,
  rnames,
  rorder = NULL,
  values_column,
  rc = FALSE,
  sort = FALSE
)

Arguments

data

Data frame in long format

chnames

Character. Name of column holding the challenge point names.

chorder

Character. Optional name of column for challenge point ordering.

rnames

Character. Name of column holding reference point names.

rorder

Character. Optional name of column for reference point ordering.

values_column

Character. Name of column containing distance/difference values. It should be from the nature of "distance" (e.g., antigenic distance or IC50), not "similarity" (e.g., HI Titer.)

rc

Logical. If TRUE, reference points are treated as a subset of challenge points. If FALSE, they are treated as distinct sets. Default is FALSE.

sort

Logical. Whether to sort rows/columns by chorder/rorder. Default FALSE.

Details

The function expects data in long format with at least three columns:

A column for challenge point names
A column for reference point names
A column containing the distance/difference values

Optionally, ordering columns can be provided to sort the output matrix. The 'rc' parameter determines how to handle shared names between references and challenges.

Value

A symmetric matrix of distances with row and column names corresponding to the unique points in the data. NA values represent unmeasured pairs.

Examples

data <- data.frame(
  antigen = c("A", "B", "A"),
  serum = c("X", "X", "Y"), 
  distance = c(2.5, 1.8, 3.0),
  year = c(2000, 2001, 2000)
)

# Basic conversion
mat <- titers_list_to_matrix(data, 
                     chnames = "antigen",
                     rnames = "serum",
                     values_column = "distance")
                     
# With sorting by year
mat_sorted <- titers_list_to_matrix(data,
                            chnames = "antigen",
                            chorder = "year",
                            rnames = "serum", 
                            rorder = "year",
                            values_column = "distance",
                            sort = TRUE)

Validate Input Data Frame

Description

Validates input data frame for visualization functions, checking required columns and data types.

Usage

validate_topolow_df(
  df,
  ndim,
  require_clusters = FALSE,
  require_temporal = FALSE
)

Arguments

df

Data frame to validate

ndim

Number of dimensions expected in coordinate columns. Names of coordinate columns must start with a "V".

require_clusters

Whether cluster column is required

require_temporal

Whether year column is required

Value

Validated data frame or throws error if invalid

TopoLow Core Functions Vectorized Processing of Dissimilarity Matrix for Convergence Error Calculations

Description

Efficiently processes elements of the dissimilarity matrix for calculating convergence error using pre-processed numeric representations of thresholds. This optimized version eliminates expensive string operations during optimization.

Usage

vectorized_process_distance_matrix(
  distances_numeric,
  threshold_mask,
  p_dist_mat
)

Arguments

distances_numeric

Numeric matrix. The numeric dissimilarity values (without threshold indicators)

threshold_mask

Integer matrix. Codes representing threshold types: 1 for "greater than" (>), -1 for "less than" (<), or 0 for exact values

p_dist_mat

Numeric matrix. The calculated distance matrix to compare against

Details

This function handles threshold logic for convergence error calculation by using pre-processed numeric matrices:

For "greater than" thresholds (threshold_mask = 1): Returns the numeric value if the calculated distance is less than the threshold, otherwise returns NA
For "less than" thresholds (threshold_mask = -1): Returns the numeric value if the calculated distance is greater than the threshold, otherwise returns NA
For regular values (threshold_mask = 0): Returns the numeric value

This function operates on entire matrices at once using vectorized operations, which is significantly faster than processing each element individually.

Value

Numeric matrix with processed distance values. Elements where threshold conditions are not satisfied will contain NA.

Weighted Kernel Density Estimation

Description

Performs weighted kernel density estimation for univariate data. This is useful for analyzing parameter distributions where each sample has an associated importance weight (e.g., a likelihood).

Usage

weighted_kde(x, weights, n = 512, from = min(x), to = max(x))

Arguments

x

A numeric vector of samples.

weights

A numeric vector of weights corresponding to each sample in x.

n

The integer number of points at which to evaluate the density.

from, to

The range over which to evaluate the density.

Value

A list containing the evaluation points (x) and the estimated density values (y).

Algorithm Comparison Helper Functions

Description

Details

Main Functions

Output Files

Citation

Author(s)

See Also

Automatic Euclidean Embedding with Parameter Optimization

Description

Usage

Arguments

Value

Examples

Perform Adaptive Monte Carlo Sampling (Internal)

Description

Usage

Arguments

Value

Analyze Network Structure

Description

Usage

Arguments

Value

Examples

Calculate MCMC-style Diagnostics for Sampling Chains

Description

Usage

Arguments

Value

Examples

Calculate Prediction Interval for Dissimilarity Estimates

Description

Usage

Arguments

Value

Calculate Weighted Marginal Distributions

Description

Usage

Arguments

Details

Value

Model Diagnostics and Convergence Testing Check Multivariate Gaussian Convergence

Description

Usage

Arguments

Value

Examples

Clean Data by Removing MAD-based Outliers

Description

Usage

Arguments

Value

See Also

Examples

Color Palettes

Description

Usage

Format

Convert Coordinates to a Distance Matrix

Description

Usage

Arguments

Value

Create Base Theme

Description

Usage

Arguments

Value

Create Cross-Validation Folds for a Dissimilarity Matrix

Description

Usage

Arguments

Value

Note

Examples

Create Diagnostic Plots for Multiple Sampling Chains

Description

Usage

Arguments