pubmedR

An R package for gathering bibliographic data from PubMed.

 

The goal of pubmedR is to gather metadata about publications, grants and
clinical trials from the PubMed database using NCBI REST APIs.

 

https://github.com/massimoaria/pubmedR

Latest version: 1.0.0, 2026-04-15

 

by Massimo Aria

Full Professor in Social Statistics

PhD in Computational Statistics

Laboratory and Research Group STAD Statistics, Technology, Data Analysis

Department of Economics and Statistics

University of Naples Federico II

https://www.massimoaria.com

 

Installation

You can install the development version of pubmedR from GitHub with:

install.packages("devtools")
devtools::install_github("massimoaria/pubmedR")

You can install the released version of pubmedR from CRAN with:

install.packages("pubmedR")

 

Load the package

library(pubmedR)

 

NCBI API key

By default, access to the NCBI API is free and does not strictly require an API key. Without a key, NCBI limits users to 3 requests per second; registered users with an API key are allowed up to 10 requests per second.

To obtain a key, register for a My NCBI account (https://account.ncbi.nlm.nih.gov/) and generate one from the Account Settings page (https://account.ncbi.nlm.nih.gov/settings/).

You can pass the key explicitly via the api_key argument of any function, or - preferably - set it once as an environment variable. pubmedR will automatically pick it up from PUBMED_API_KEY or ENTREZ_KEY:

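# Option 1: store the key in a variable and pass it via the api_key argument
api_key <- "your API key"

# Option 2: set it once per session (or add PUBMED_API_KEY to ~/.Renviron)
Sys.setenv(PUBMED_API_KEY = "your API key")

# Option 3: no key (requests are limited to 3 per second)
api_key <- NULL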
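
The lookup described above can be mimicked in base R. The sketch below is not pubmedR's internal code: the helper name resolve_api_key() and the precedence (explicit argument first, then PUBMED_API_KEY, then ENTREZ_KEY) are assumptions made for illustration.

```r
# Hypothetical helper (not part of pubmedR): resolve an NCBI API key.
# Assumed precedence: explicit argument > PUBMED_API_KEY > ENTREZ_KEY.
resolve_api_key <- function(api_key = NULL) {
  if (!is.null(api_key) && nzchar(api_key)) {
    return(api_key)
  }
  for (var in c("PUBMED_API_KEY", "ENTREZ_KEY")) {
    value <- Sys.getenv(var, unset = "")
    if (nzchar(value)) {
      return(value)
    }
  }
  NULL  # no key available: requests fall under the keyless rate limit
}

key <- resolve_api_key()
if (is.null(key)) {
  message("No API key found; NCBI allows 3 requests per second.")
}
```

A helper like this makes it easy to check, before a long download, whether a key is actually visible to the current R session.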

 

A brief example

Imagine we want to download a metadata collection of journal articles that (1) use bibliometric approaches, (2) were published between 2000 and 2020, and (3) are written in English.

Since version 0.1.0, pubmedR offers two equivalent workflows:

 

One-step workflow: pmCollect()

For most use cases, pmCollect() is the fastest way to go from a query to a bibliometrix-ready data frame. It builds the query, checks the total count, downloads the records, converts the XML into a data frame, and (optionally) adds citation counts and cited references.

library(pubmedR)

M <- pmCollect(
  terms      = "bibliometric*",
  fields     = "Title/Abstract",
  language   = "english",
  pub_type   = "Journal Article",
  date_range = c("2000", "2020"),
  limit      = 2000,
  api_key    = NULL
)

# Query: (bibliometric*[Title/Abstract]) AND english[LA] AND
#        Journal Article[PT] AND 2000:2020[DP]
#
# Total records found: 2921
# Records to download: 2000

You can also pass a raw PubMed query string directly:

M <- pmCollect(
  query   = "bibliometric*[Title/Abstract] AND english[LA] AND 2023:2024[DP]",
  limit   = 200
)

Set enrich = TRUE to add citation counts (TC) and cited references (CR) via pmEnrichCitations(). Note that enrichment adds two API calls per record, so it is best used on smaller collections:

M <- pmCollect(
  terms      = "bibliometric*",
  date_range = c("2023", "2024"),
  limit      = 50,
  enrich     = TRUE
)
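
As a rough guide to what "smaller collections" means, the cost of enrichment can be estimated from the figures given in this README: two extra calls per record, and 3 or 10 requests per second depending on whether an API key is used. The helper estimate_enrich_seconds() below is hypothetical, and its result is only a lower bound because it ignores network latency and server response time.

```r
# Hypothetical helper (not part of pubmedR): lower bound on enrichment time.
estimate_enrich_seconds <- function(n_records, has_key = FALSE) {
  calls_per_record <- 2            # two API calls per record, as stated above
  rate <- if (has_key) 10 else 3   # NCBI requests per second
  n_records * calls_per_record / rate
}

estimate_enrich_seconds(50)                    # 50 records, no key
estimate_enrich_seconds(2000, has_key = TRUE)  # 2000 records, with key
```

Under these assumptions, 50 records need at least about 33 seconds without a key, while 2000 records need at least 400 seconds even with one.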

 

Step-by-step workflow

If you prefer fine-grained control - for example to inspect the translated query before downloading - you can run the pipeline one stage at a time.

Step 1: Build the query

Instead of writing the Entrez query string by hand, you can use pmQueryBuild() to compose it programmatically from parameters:

query <- pmQueryBuild(
  terms      = "bibliometric*",
  fields     = "Title/Abstract",
  language   = "english",
  pub_type   = "Journal Article",
  date_range = c("2000", "2020")
)

query
# [1] "(bibliometric*[Title/Abstract]) AND english[LA] AND
#      Journal Article[PT] AND 2000:2020[DP]"

Of course, you can still write the query manually if you prefer:

query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
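
To make the structure of the generated string concrete, the following base R sketch assembles the same query from its parts. It is an illustration only: build_query_sketch() is a hypothetical function, and pmQueryBuild() may handle cases (multiple terms, other filters) that this sketch does not.

```r
# Hypothetical re-implementation for illustration; not pubmedR code.
build_query_sketch <- function(terms, fields, language, pub_type, date_range) {
  parts <- c(
    sprintf("(%s[%s])", terms, fields),                  # search term + field tag
    sprintf("%s[LA]", language),                         # language filter
    sprintf("%s[PT]", pub_type),                         # publication type filter
    sprintf("%s:%s[DP]", date_range[1], date_range[2])   # publication date range
  )
  paste(parts, collapse = " AND ")
}

query_sketch <- build_query_sketch(
  terms      = "bibliometric*",
  fields     = "Title/Abstract",
  language   = "english",
  pub_type   = "Journal Article",
  date_range = c("2000", "2020")
)
query_sketch
```

Each filter is a value followed by a field tag in square brackets, and the filters are joined with AND.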

 

Step 2: Check the effectiveness of the query

Use pmQueryTotalCount() to see how many records PubMed would return, along with the automatically translated query:

res <- pmQueryTotalCount(query = query, api_key = api_key)

res$total_count
# [1] 2921

res$query_translation
# [1] "(bibliometric[Title/Abstract] OR bibliometrica[Title/Abstract] OR
#      ... OR bibliometricstrade[Title/Abstract]) AND english[LA] AND
#      Journal Article[PT] AND 2000[PDAT] : 2020[PDAT]"
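
The count is typically used to decide whether to refine the query or to cap the download. The sketch below uses a mock list with the two fields shown above, so it runs without network access; res_mock and max_records are illustrative names, and the threshold is arbitrary rather than a pubmedR default.

```r
# Mock of the object returned in Step 2 (count copied from the output above).
res_mock <- list(
  total_count       = 2921,
  query_translation = "..."  # elided here, as in the output above
)

max_records <- 2000  # hypothetical cap chosen for this example

if (res_mock$total_count == 0) {
  stop("The query matched no records; consider broadening it.")
}

# Download everything if the collection is small, otherwise stop at the cap
limit <- min(res_mock$total_count, max_records)
limit
```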

 

Step 3: Download the collection of document metadata

You can now download the whole collection (or a subset, by lowering limit):

D <- pmApiRequest(query = query, limit = res$total_count, api_key = api_key)

# Documents  200  of  2921
# Documents  400  of  2921
# ...
# Documents  2921  of  2921
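
The progress messages suggest that records arrive in batches of 200. Assuming that batch size (it is inferred from the output above, not from the package documentation), the offsets of such a paginated download can be computed as in the sketch below; batch_starts and batch_ends are illustrative names.

```r
total_count <- 2921
batch_size  <- 200   # assumption inferred from the progress messages

# Zero-based offset of the first record in each batch
batch_starts <- seq(0, total_count - 1, by = batch_size)

# Cumulative number of records retrieved after each batch
batch_ends <- pmin(batch_starts + batch_size, total_count)

length(batch_starts)  # number of requests needed
tail(batch_ends, 1)   # equals total_count once the last batch arrives
```

With these numbers the download takes 15 requests, the last of which returns the remaining 121 records.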

pmApiRequest() returns a list with the following elements:

  • data - the XML-structured list containing the bibliographic metadata collection downloaded from PubMed.
  • query - the original query submitted by the user.
  • query_translation - the query as translated and executed by NCBI’s Automatic Term Mapping system.
  • records_downloaded - number of records actually downloaded.
  • total_count - total number of records matching the query.
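
A quick check on these elements tells you whether the download was truncated by limit. The sketch below builds a mock list with the element names listed above so that it runs offline; D_mock and its values are invented for illustration.

```r
# Mock object using the element names listed above (values are illustrative).
D_mock <- list(
  data               = list(),
  query              = "bibliometric*[Title/Abstract]",
  query_translation  = "...",
  records_downloaded = 2000,
  total_count        = 2921
)

# Was the collection truncated by `limit`?
is_partial <- D_mock$records_downloaded < D_mock$total_count
if (is_partial) {
  message(sprintf(
    "Downloaded %d of %d matching records.",
    D_mock$records_downloaded, D_mock$total_count
  ))
}
```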

 

Step 4: Convert the XML object into a data frame

Finally, transform the XML-structured object D into a data frame where rows are documents and columns are field tags compatible with the bibliometrix R package (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).

M <- pmApi2df(D, format = "bibliometrix")

str(M)
# 'data.frame':   2918 obs. of  27 variables:
#  $ AU    : chr  ...
#  $ AF    : chr  ...
#  $ TI    : chr  ...
#  $ SO    : chr  ...
#  $ LA    : chr  ...
#  $ DT    : chr  ...
#  $ DE    : chr  ...
#  $ AB    : chr  ...
#  $ C1    : chr  ...
#  $ TC    : num  ...
#  $ PY    : num  ...
#  $ DI    : chr  ...
#  $ PMID  : chr  ...
#  ...

Setting format = "raw" returns the data frame with all fields in their native PubMed form instead of the bibliometrix-style field tags.
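
Because the result is an ordinary data frame, base R is enough for a first look. The sketch below uses a small mock data frame with invented values and a few of the field tags shown above; it assumes that author names in AU are separated by ";", as is usual in bibliometrix-style data.

```r
# Mock data frame with a few bibliometrix-style columns (invented values).
M_mock <- data.frame(
  AU   = c("ROSSI M;BIANCHI L", "SMITH J", "ROSSI M"),
  TI   = c("TITLE A", "TITLE B", "TITLE C"),
  PY   = c(2018, 2019, 2019),
  PMID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Documents per publication year
docs_per_year <- table(M_mock$PY)
docs_per_year

# Number of authors per document, assuming ";" separates names in AU
n_authors <- lengths(strsplit(M_mock$AU, ";", fixed = TRUE))
n_authors
```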

 

Fetching records by PMID

If you already know which articles you want, you can bypass the query step and download records directly by their PubMed identifiers with pmFetchById():

pmids <- c("34813985", "34813456", "34812345")
D <- pmFetchById(pmids = pmids, api_key = api_key)
M <- pmApi2df(D)

The returned object has the same structure as the output of pmApiRequest(), so it can be passed to pmApi2df() in exactly the same way.
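
PMID lists collected by hand or exported from other tools often contain duplicates, stray whitespace or non-PubMed identifiers. A minimal clean-up sketch, assuming the identifiers are supplied as character strings as in the example above; clean_pmids() is a hypothetical helper, not a pubmedR function.

```r
# Hypothetical helper (not part of pubmedR): tidy a character vector of PMIDs.
clean_pmids <- function(pmids) {
  pmids <- trimws(as.character(pmids))
  pmids <- pmids[grepl("^[0-9]+$", pmids)]  # keep digit-only identifiers; drops NA
  unique(pmids)                             # remove duplicates, keep first-seen order
}

pmids_clean <- clean_pmids(
  c("34813985", " 34813456", "34813985", "PMC123", NA)
)
pmids_clean
```

The cleaned vector can then be used as the pmids argument shown above.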

 

Citation enrichment

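pubmedR exposes three helpers to retrieve citation information via NCBI’s E-Link service (based on PubMed Central):

  • pmCitedBy() - the articles that cite a given PMID.
  • pmReferences() - the references cited by a given PMID.
  • pmEnrichCitations() - adds citation counts (TC) and cited references (CR) to an existing data frame.

Example: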

cites <- pmCitedBy(pmid = "25824007")
cites$count
cites$cited_by

refs <- pmReferences(pmid = "25824007")
refs$count
refs$references

# Add citation counts and references to an existing data frame
M_enriched <- pmEnrichCitations(M, api_key = api_key)

Note: citation data in PubMed comes from PubMed Central and is less comprehensive than commercial databases such as Web of Science or Scopus.
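
Given this limited coverage, it can be useful to check how many records actually received a citation count before ranking them. The sketch below works on a mock data frame with invented values; it assumes that records without citation data carry NA in TC, which may differ from what pmEnrichCitations() actually returns.

```r
# Mock of an enriched data frame (invented values); TC = citation count.
M_enriched_mock <- data.frame(
  PMID = c("1", "2", "3"),
  TC   = c(4, NA, 12),
  stringsAsFactors = FALSE
)

# Share of records that have a citation count
coverage <- mean(!is.na(M_enriched_mock$TC))
coverage

# Most cited records first; records without a count go last
ranked <- M_enriched_mock[
  order(M_enriched_mock$TC, decreasing = TRUE, na.last = TRUE),
]
ranked$PMID
```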

 

An overview of the collection using bibliometrix

Once you have a data frame M, you can use bibliometrix for descriptive and network analyses (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).

install.packages("bibliometrix")
library(bibliometrix)

results <- biblioAnalysis(M)
summary(results)

# Main Information about data
#
#  Documents                             2918
#  Sources (Journals, Books, etc.)       1275
#  Keywords Plus (ID)                    2245
#  Author's Keywords (DE)                4212
#  Period                                2000 - 2020
#  ...