The goal of pubmedR is to gather metadata about publications, grants and
clinical trials from the PubMed database using NCBI REST APIs.
https://github.com/massimoaria/pubmedR
Latest version: 1.0.0, 2026-04-15
by Massimo Aria
Full Professor in Social Statistics
PhD in Computational Statistics
Laboratory and Research Group STAD Statistics, Technology, Data Analysis
Department of Economics and Statistics
University of Naples Federico II
email aria@unina.it
You can install the development version of pubmedR from GitHub with:
install.packages("devtools")
devtools::install_github("massimoaria/pubmedR")
You can install the released version of pubmedR from CRAN with:
install.packages("pubmedR")
library(pubmedR)
By default, access to the NCBI API is free and does not strictly require an API key. Without a key, NCBI limits users to 3 requests per second; registered users with an API key are allowed up to 10 requests per second.
To obtain a key, register for a My NCBI account (https://account.ncbi.nlm.nih.gov/) and generate one from the Account Settings page (https://account.ncbi.nlm.nih.gov/settings/).
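The rate limits above translate directly into a minimum spacing between requests. The following is a minimal base R sketch of client-side throttling; min_interval() and throttled_lapply() are hypothetical helpers written for illustration, not pubmedR functions, and nothing here describes how pubmedR itself paces its calls.

```r
# Hypothetical helpers (not part of pubmedR).
# Minimum number of seconds between requests under the NCBI limits:
# 3 requests per second without a key, 10 with one.
min_interval <- function(api_key = NULL) {
  has_key <- !is.null(api_key) && nzchar(api_key)
  rate <- if (has_key) 10 else 3
  1 / rate
}

# Apply f to each element of x, sleeping so that consecutive calls
# start at least min_interval() seconds apart.
throttled_lapply <- function(x, f, api_key = NULL) {
  wait <- min_interval(api_key)
  lapply(x, function(item) {
    started <- Sys.time()
    out <- f(item)
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed < wait) Sys.sleep(wait - elapsed)
    out
  })
}

# Stand-in for a real request function
doubled <- throttled_lapply(1:2, function(i) i * 2)
```

In a real script, f would be the function that performs one HTTP request.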
You can pass the key explicitly via the api_key argument
of any function, or - preferably - set it once as an environment
variable. pubmedR will automatically pick it up from
PUBMED_API_KEY or ENTREZ_KEY:
# option 1: pass it explicitly
api_key <- "your API key"
# option 2: set it once per session (or in ~/.Renviron)
Sys.setenv(PUBMED_API_KEY = "your API key")
# no key
api_key <- NULL
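If you want the same fallback behaviour in your own scripts, the lookup can be sketched in a few lines of base R. resolve_api_key() is a hypothetical helper, and the precedence shown (explicit argument, then PUBMED_API_KEY, then ENTREZ_KEY) is an assumption based on the order in which the variables are listed above, not a statement about pubmedR's internals.

```r
# Hypothetical helper (not part of pubmedR).
# Returns the first non-empty key found, or NULL if none is available.
resolve_api_key <- function(api_key = NULL) {
  # An explicitly supplied key wins
  if (!is.null(api_key) && nzchar(api_key)) return(api_key)
  # Assumed precedence among environment variables
  for (var in c("PUBMED_API_KEY", "ENTREZ_KEY")) {
    value <- Sys.getenv(var, unset = "")
    if (nzchar(value)) return(value)
  }
  NULL
}

key <- resolve_api_key()  # NULL when no key is configured
```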
Imagine we want to download a metadata collection of journal articles that (1) use bibliometric approaches, (2) were published between 2000 and 2020, and (3) are written in English.
Since version 0.1.0, pubmedR offers two equivalent workflows:
a step-by-step workflow (pmQueryBuild → pmQueryTotalCount → pmApiRequest → pmApi2df), which gives you full control over each stage;
pmCollect(), a convenience wrapper that chains the whole pipeline in a single call and optionally enriches the result with citation data.
pmCollect()
For most use cases, pmCollect() is the fastest way to go
from a query to a bibliometrix-ready data frame. It builds the query,
checks the total count, downloads the records, converts the XML into a
data frame, and (optionally) adds citation counts and cited
references.
library(pubmedR)
M <- pmCollect(
  terms = "bibliometric*",
  fields = "Title/Abstract",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020"),
  limit = 2000,
  api_key = NULL
)
# Query: (bibliometric*[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000:2020[DP]
#
# Total records found: 2921
# Records to download: 2000
You can also pass a raw PubMed query string directly:
M <- pmCollect(
  query = "bibliometric*[Title/Abstract] AND english[LA] AND 2023:2024[DP]",
  limit = 200
)
Set enrich = TRUE to add citation counts
(TC) and cited references (CR) via
pmEnrichCitations(). Note that enrichment adds two API
calls per record, so it is best used on smaller collections:
M <- pmCollect(
  terms = "bibliometric*",
  date_range = c("2023", "2024"),
  limit = 50,
  enrich = TRUE
)
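To get a feel for why enrichment suits smaller collections, the sketch below combines the two-calls-per-record figure with the NCBI rate limits into a rough lower bound on running time. estimate_enrich_seconds() is a hypothetical helper, not a pubmedR function, and it assumes the rate limit is the only bottleneck (no network latency, retries or batching).

```r
# Hypothetical helper (not part of pubmedR).
# Lower bound, in seconds, on the time needed to enrich n_records.
estimate_enrich_seconds <- function(n_records, has_key = FALSE) {
  calls_per_record <- 2            # figure stated in the text
  rate <- if (has_key) 10 else 3   # NCBI requests per second
  n_records * calls_per_record / rate
}

small <- estimate_enrich_seconds(50)                    # 100 calls at 3 per second
large <- estimate_enrich_seconds(2000, has_key = TRUE)  # 4000 calls at 10 per second
```

Under these assumptions, 50 records need at least about 33 seconds without a key, while 2000 records need at least 400 seconds even with one.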
If you prefer fine-grained control - for example to inspect the translated query before downloading - you can run the pipeline one stage at a time.
Instead of writing the Entrez query string by hand, you can use
pmQueryBuild() to compose it programmatically from
parameters:
query <- pmQueryBuild(
  terms = "bibliometric*",
  fields = "Title/Abstract",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020")
)
query
# [1] "(bibliometric*[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000:2020[DP]"
Of course, you can still write the query manually if you prefer:
query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
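To make the mapping from parameters to Entrez syntax concrete, here is a small base R sketch that reproduces the query string shown above. build_entrez_query() is a hypothetical stand-in, not the implementation of pmQueryBuild(); in particular, joining several terms with OR is an assumption made only for this example.

```r
# Hypothetical stand-in (not pmQueryBuild() itself).
# Composes an Entrez query from structured parameters using the
# PubMed field tags [LA] (language), [PT] (publication type), [DP] (date).
build_entrez_query <- function(terms, fields = NULL, language = NULL,
                               pub_type = NULL, date_range = NULL) {
  # Attach the search field to each term, if one was given
  core <- if (is.null(fields)) terms else paste0(terms, "[", fields, "]")
  # Assumption: multiple terms are alternatives, joined with OR
  parts <- paste0("(", paste(core, collapse = " OR "), ")")
  if (!is.null(language)) parts <- c(parts, paste0(language, "[LA]"))
  if (!is.null(pub_type)) parts <- c(parts, paste0(pub_type, "[PT]"))
  if (!is.null(date_range)) {
    parts <- c(parts, paste0(date_range[1], ":", date_range[2], "[DP]"))
  }
  paste(parts, collapse = " AND ")
}

q <- build_entrez_query(
  terms = "bibliometric*",
  fields = "Title/Abstract",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020")
)
```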
Use pmQueryTotalCount() to see how many records PubMed
would return, along with the automatically translated query:
res <- pmQueryTotalCount(query = query, api_key = api_key)
res$total_count
# [1] 2921
res$query_translation
# [1] "(bibliometric[Title/Abstract] OR bibliometrica[Title/Abstract] OR
# ... OR bibliometricstrade[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000[PDAT] : 2020[PDAT]"
You can now download the whole collection (or a subset, by lowering
limit):
D <- pmApiRequest(query = query, limit = res$total_count, api_key = api_key)
# Documents 200 of 2921
# Documents 400 of 2921
# ...
# Documents 2921 of 2921
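The progress messages advance in steps of 200, which suggests the records are fetched in batches. The sketch below shows how such a batch plan could be computed. batch_plan() is a hypothetical helper, the batch size of 200 is inferred from the messages rather than documented, and retstart and retmax are the paging parameters of NCBI's E-utilities.

```r
# Hypothetical helper (not part of pubmedR).
# Returns one row per batch with the offset (retstart) and size (retmax).
batch_plan <- function(total, limit = total, batch_size = 200L) {
  n <- min(total, limit)  # never request more than what exists
  if (n <= 0) {
    return(data.frame(retstart = integer(0), retmax = integer(0)))
  }
  starts <- seq(0L, n - 1L, by = batch_size)
  data.frame(retstart = starts, retmax = pmin(batch_size, n - starts))
}

plan <- batch_plan(total = 2921)
nrow(plan)     # number of batches
tail(plan, 1)  # the last, partial batch
```

For 2921 records this yields 15 batches, the last one starting at offset 2800 and holding the remaining 121 records.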
pmApiRequest() returns a list with the following elements:
data - the XML-structured list containing the bibliographic metadata collection downloaded from PubMed.
query - the original query submitted by the user.
query_translation - the query as translated and executed by NCBI’s Automatic Term Mapping system.
records_downloaded - number of records actually downloaded.
total_count - total number of records matching the query.
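Because the returned list carries both records_downloaded and total_count, it is easy to check whether a download was complete. The sketch below uses a mock object whose element names follow the list above but whose values are invented. download_coverage() is a hypothetical helper, not a pubmedR function.

```r
# Mock object: element names as documented above, values invented
D_mock <- list(
  data = list(),
  query = "bibliometric*[Title/Abstract]",
  query_translation = "bibliometric*[Title/Abstract]",
  records_downloaded = 2000,
  total_count = 2921
)

# Hypothetical helper (not part of pubmedR).
# Reports whether all matching records were downloaded and how many are missing.
download_coverage <- function(D) {
  stopifnot(all(c("records_downloaded", "total_count") %in% names(D)))
  list(
    complete = D$records_downloaded >= D$total_count,
    missing  = max(D$total_count - D$records_downloaded, 0)
  )
}

coverage <- download_coverage(D_mock)
coverage$complete
coverage$missing
```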
Finally, transform the XML-structured object D into a
data frame where rows are documents and columns are field tags
compatible with the bibliometrix R package (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).
M <- pmApi2df(D, format = "bibliometrix")
str(M)
# 'data.frame': 2918 obs. of 27 variables:
# $ AU : chr ...
# $ AF : chr ...
# $ TI : chr ...
# $ SO : chr ...
# $ LA : chr ...
# $ DT : chr ...
# $ DE : chr ...
# $ AB : chr ...
# $ C1 : chr ...
# $ TC : num ...
# $ PY : num ...
# $ DI : chr ...
# $ PMID : chr ...
# ...
Setting format = "raw" returns the data frame with all
fields in their native PubMed form instead of the bibliometrix-style
field tags.
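Once the data are in a tagged data frame, ordinary base R is enough for quick checks. The sketch below works on a toy data frame with invented values. It assumes, following the usual bibliometrix convention, that multiple values within a field (for example the authors in AU) are separated by ";".

```r
# Toy data frame with a few bibliometrix-style tags; all values are invented
M_toy <- data.frame(
  AU = c("ARIA M;CUCCURULLO C", "SMITH J"),  # authors
  PY = c(2017, 2019),                        # publication year
  TC = c(10, 3),                             # times cited
  stringsAsFactors = FALSE
)

# Split the multi-valued AU field (assumed ";" separator) and count authors
authors <- strsplit(M_toy$AU, ";", fixed = TRUE)
n_authors <- lengths(authors)

# Documents per publication year
docs_per_year <- table(M_toy$PY)
```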
If you already know which articles you want, you can bypass the query
step and download records directly by their PubMed identifiers with
pmFetchById():
pmids <- c("34813985", "34813456", "34812345")
D <- pmFetchById(pmids = pmids, api_key = api_key)
M <- pmApi2df(D)
The returned object follows the same structure as
pmApiRequest(), so it can be fed into
pmApi2df() exactly the same way.
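When identifiers come from a spreadsheet or another tool, it can help to tidy them before fetching. The sketch below is one way to do that in base R. clean_pmids() is a hypothetical helper, not a pubmedR function, and it only checks that each identifier consists of digits, not that it exists in PubMed.

```r
# Hypothetical helper (not part of pubmedR).
# Trims whitespace, drops entries that are not purely numeric,
# and removes duplicates while keeping the original order.
clean_pmids <- function(pmids) {
  pmids <- trimws(as.character(pmids))
  valid <- grepl("^[0-9]+$", pmids)
  if (any(!valid)) {
    warning("Dropping malformed PMIDs: ",
            paste(pmids[!valid], collapse = ", "))
  }
  unique(pmids[valid])
}

# "PMC12345" is dropped with a warning; the duplicate is removed
pmids <- suppressWarnings(
  clean_pmids(c("34813985", " 34813456", "34813985", "PMC12345"))
)
```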
pubmedR exposes three helpers to retrieve citation information via NCBI’s E-Link service (based on PubMed Central):
pmCitedBy(pmid) - returns the PMIDs of articles citing the given article;
pmReferences(pmid) - returns the PMIDs of articles referenced by the given article;
pmEnrichCitations(df) - adds a TC (times cited) column and a CR (cited references) column to a pubmedR data frame.
Example:
cites <- pmCitedBy(pmid = "25824007")
cites$count
cites$cited_by
refs <- pmReferences(pmid = "25824007")
refs$count
refs$references
# Add citation counts and references to an existing data frame
M_enriched <- pmEnrichCitations(M, api_key = api_key)
Note: citation data in PubMed comes from PubMed Central and is less comprehensive than commercial databases such as Web of Science or Scopus.
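Conceptually, enrichment amounts to turning per-record lists of citing and referenced PMIDs into two columns. The sketch below illustrates that idea with invented PMIDs. It is not the implementation of pmEnrichCitations(), and the ";" separator used for CR is an assumption.

```r
# Invented per-record results, one list element per document
cited_by_list   <- list(c("111", "222", "333"), character(0))
references_list <- list(c("444", "555"), c("666"))

M_toy <- data.frame(PMID = c("1", "2"), stringsAsFactors = FALSE)

# TC: number of citing articles found for each record
M_toy$TC <- lengths(cited_by_list)

# CR: referenced PMIDs collapsed into one string (assumed ";" separator)
M_toy$CR <- vapply(references_list, paste, character(1), collapse = ";")
```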
Once you have a data frame M, you can use
bibliometrix for descriptive and network analyses (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).
install.packages("bibliometrix")
library(bibliometrix)
results <- biblioAnalysis(M)
summary(results)
# Main Information about data
#
# Documents 2918
# Sources (Journals, Books, etc.) 1275
# Keywords Plus (ID) 2245
# Author's Keywords (DE) 4212
# Period 2000 - 2020
# ...