SportMiner is an R package for mining, analyzing, and visualizing scientific literature in sport science. It provides an end-to-end workflow that runs from Scopus retrieval and text preprocessing through topic modeling, model comparison, and visualization.
This vignette demonstrates the core functionality of SportMiner through a practical example.
Before using SportMiner, you need a Scopus API key, which you can obtain by registering at the Elsevier Developer Portal.
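One common way to keep a key out of scripts is to store it in an environment variable and read it at run time. The sketch below uses base R only; the variable name `SCOPUS_API_KEY` and the helper `get_scopus_key()` are assumptions made for illustration, so check the package documentation for the name and setup function it actually expects.

```r
# Hypothetical variable name; the package may expect a different one.
key_var <- "SCOPUS_API_KEY"

# For a one-off session the key can be set in code (placeholder value shown).
# For regular use, a line in ~/.Renviron keeps the key out of scripts.
Sys.setenv(SCOPUS_API_KEY = "your_api_key_here")

# Hypothetical helper: read the key and fail early if it is missing.
get_scopus_key <- function(var = key_var) {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found in environment variable ", var, call. = FALSE)
  }
  key
}

api_key <- get_scopus_key()
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API mid-search.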
Let’s search for papers on talent identification in sport science that use principal component analysis or cluster analysis.
# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)
# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)
# View the data structure
head(papers[, c("title", "year", "author_keywords")])

Convert the raw abstracts into a clean, stemmed word count format.
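To make the preprocessing step concrete, here is a minimal base-R sketch of turning abstracts into per-document word counts. It is not SportMiner's implementation: the toy abstracts, the short stop-word list, and the `doc_id`/`term`/`n` column names are assumptions, and stemming is left out (a stemmer such as the one in the SnowballC package would be applied to the tokens before counting).

```r
# Toy abstracts standing in for the abstracts returned by the search.
abstracts <- c(
  doc1 = "Talent identification in youth soccer: a cluster analysis.",
  doc2 = "Principal component analysis of athlete performance data."
)

# A tiny illustrative stop-word list; real pipelines use a much longer one.
stop_words <- c("a", "in", "of", "the", "and")

tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)   # replace punctuation and digits with spaces
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]     # drop empty strings from leading spaces
  tokens[!tokens %in% stop_words]
}

# One row per (document, term) pair with its count.
word_counts <- do.call(rbind, lapply(names(abstracts), function(id) {
  counts <- table(tokenize(abstracts[[id]]))
  data.frame(doc_id = id, term = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}))
```

The long format (one row per document-term pair) is convenient because it maps directly onto the non-zero cells of a document-term matrix.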
Transform the word counts into a sparse matrix suitable for topic modeling.
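As an illustration of what a sparse document-term matrix looks like, the sketch below builds one from long-format word counts with `Matrix::sparseMatrix()`. The toy data, the column names, and the object name `dtm_sketch` are assumptions; the matrix class that SportMiner itself returns may differ.

```r
library(Matrix)

# Toy word counts in long format (hypothetical column names).
word_counts <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  term   = c("talent", "cluster", "cluster", "athlet", "talent"),
  n      = c(2, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

docs  <- unique(word_counts$doc_id)
terms <- sort(unique(word_counts$term))

# Rows are documents, columns are terms; only non-zero cells are stored.
dtm_sketch <- sparseMatrix(
  i = match(word_counts$doc_id, docs),
  j = match(word_counts$term, terms),
  x = word_counts$n,
  dims = c(length(docs), length(terms)),
  dimnames = list(docs, terms)
)
```

Most cells of a real document-term matrix are zero, so storing only the non-zero entries keeps memory use manageable as the vocabulary grows.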
Use coherence-based selection to find the best number of topics.
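For intuition about what a coherence score measures, the sketch below implements the UMass coherence of a single topic's top terms on a toy corpus. This is one common coherence measure and is an assumption here: the measure SportMiner uses internally may be a different one. Under this kind of selection, each candidate number of topics is scored by the average coherence of its topics, and the highest-scoring candidate is preferred.

```r
# Binary document-term incidence for a toy corpus (rows = documents).
incidence <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("talent", "youth", "injury"))
)

# UMass coherence for an ordered vector of a topic's top terms:
# the sum over term pairs of log((D(w_m, w_l) + 1) / D(w_l)),
# where w_l is ranked before w_m and D() counts documents.
umass_coherence <- function(top_terms, incidence) {
  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      co_docs <- sum(incidence[, top_terms[m]] * incidence[, top_terms[l]])
      l_docs  <- sum(incidence[, top_terms[l]])
      score <- score + log((co_docs + 1) / l_docs)
    }
  }
  score
}

# Terms that appear together score higher than terms that never co-occur.
coherent   <- umass_coherence(c("talent", "youth"), incidence)
incoherent <- umass_coherence(c("youth", "injury"), incidence)
```

Scores are at most slightly above zero and become more negative as the top terms co-occur less often, so values closer to zero indicate a more coherent topic.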
Fit an LDA model using the optimal k.
Visualize how author keywords co-occur across papers.
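The counts behind a keyword co-occurrence network can be sketched in base R as follows. The toy keyword strings and the `;` separator are assumptions about how the `author_keywords` column is formatted; this is not the package's own implementation.

```r
# Toy author-keyword strings; the ";" separator is an assumption about the data.
author_keywords <- c(
  "talent identification; soccer; PCA",
  "soccer; PCA",
  "talent identification; cluster analysis"
)

# Split each paper's keywords and normalise case and whitespace.
keyword_list <- lapply(strsplit(author_keywords, ";"), function(k) {
  unique(tolower(trimws(k)))
})

vocab <- sort(unique(unlist(keyword_list)))

# Paper-by-keyword incidence matrix (1 if the paper lists the keyword).
incidence <- t(sapply(keyword_list, function(k) as.integer(vocab %in% k)))
colnames(incidence) <- vocab

# Keyword-by-keyword co-occurrence counts; the diagonal holds keyword frequencies.
cooccurrence <- crossprod(incidence)
```

The off-diagonal entries of `cooccurrence` can serve as edge weights in a network plot, with the diagonal giving node sizes.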
Compare LDA, STM, and CTM to find the best-performing model.
# Run comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)
# View metrics
print(comparison$metrics)
# Get recommendation
print(paste("Recommended model:", comparison$recommendation))
# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

All plotting functions use the custom theme_sportminer() theme, but you can customize further.
library(ggplot2)
# Create a plot with custom theme settings
p <- sm_plot_topic_frequency(lda_model, dtm)
# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on 100 papers from Scopus (2011-2025)"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)

API Rate Limits: Scopus has rate limits. Use max_count wisely and add delays between large queries.
Reproducibility: Always set a seed when running topic models, for example through the seed argument of sm_compare_models().
Hyperparameter Tuning: Experiment with
min_term_freq and max_term_freq in
sm_create_dtm() to balance vocabulary size and model
performance.
Model Selection: Don’t rely solely on coherence. Inspect the top terms for each topic to ensure interpretability.
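The advice above on spacing out large queries can be sketched with a small wrapper around `Sys.sleep()`. The helper `run_with_delay()` and the stand-in `fake_fetch()` are hypothetical names introduced for illustration; in practice the `fetch` argument would be a function that calls your actual search routine, and a suitable delay depends on the limits attached to your API key.

```r
# Run `fetch` once per query, pausing between calls.
# `fetch` is a placeholder for whatever retrieval function you use.
run_with_delay <- function(queries, fetch, delay_sec = 1) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[[i]] <- fetch(queries[[i]])
    if (i < length(queries)) Sys.sleep(delay_sec)  # no pause after the last call
  }
  results
}

# Stand-in fetcher so the sketch runs without network access.
fake_fetch <- function(q) nchar(q)

out <- run_with_delay(c("TITLE(soccer)", "TITLE(rugby)"), fake_fetch, delay_sec = 0.1)
```

Keeping the delay in one place makes it easy to lengthen the pause if the API starts rejecting requests.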