ApplyPolygenicScore provides utilities for applying a polygenic score model to genetic data.
This guide contains step-by-step instructions for how best to take advantage of the functions in this package to facilitate this process and transition smoothly into downstream analyses.
The application of a polygenic score model to genetic data requires two basic inputs:
Optionally, a third data type, phenotype data, can be provided to several ApplyPolygenicScore functions.
VCF file requirements:
PGS weight file requirements:
import.pgs.weight.file
to
ensure proper formattingCHROM
, POS
,
effect_allele
, beta
and optionally
ID
, other_allele
, and
allelefrequency_effect
.ApplyPolygenicScore provides functions for the importation of input data into R, and the verification of these data for compatibility with other functions in the package.
Use the import.vcf
function to import VCF data into R.
This function is a wrapper around the vcfR
package that
ensures the VCF data is in the format expected by other package
functions.
The example below imports a VCF file from GIAB, included in the package as an example file.
vcf.file <- system.file(
'extdata',
'HG001_GIAB.vcf.gz',
package = 'ApplyPolygenicScore',
mustWork = TRUE
)
vcf.data <- import.vcf(
vcf.path = vcf.file,
long.format = TRUE,
info.fields = NULL,
format.fields = NULL,
verbose = TRUE
)
Scanning file to determine attributes.
File attributes:
meta lines: 234
header_line: 235
variant count: 63
column count: 11
Meta line 234 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 63
Character matrix gt cols: 11
skip: 0
nrows: 63
row_num: 0
Processed variant: 63
All variants processed
Extracting gt element DP
Extracting gt element GQ
Extracting gt element ADALL
Extracting gt element AD
Extracting gt element GT
Extracting gt element PS
List of 2
$ split.wide.vcf.matrices:List of 2
..$ genotyped.alleles: chr [1:63, 1:2] "A/A" "T/C" "G/T" "T/T" ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:63] "chr1_87734095" "chr1_111721652" "chr1_154895579" "chr1_184222582" ...
.. .. ..$ : chr [1:2] "HG001" "2:HG001"
..$ vcf.fixed.fields :Classes 'data.table' and 'data.frame': 63 obs. of 6 variables:
.. ..$ CHROM : chr [1:63] "chr1" "chr1" "chr1" "chr1" ...
.. ..$ POS : chr [1:63] "87734095" "111721652" "154895579" "184222582" ...
.. ..$ ID : chr [1:63] NA NA NA NA ...
.. ..$ REF : chr [1:63] "T" "T" "G" "C" ...
.. ..$ ALT : chr [1:63] "A" "C" "T" "T" ...
.. ..$ allele.matrix.row.index: int [1:63] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, ".internal.selfref")=<externalptr>
$ combined.long.vcf.df :List of 2
..$ dat : tibble [126 × 31] (S3: tbl_df/tbl/data.frame)
.. ..$ CHROM : chr [1:126] "chr1" "chr1" "chr1" "chr1" ...
.. ..$ POS : int [1:126] 87734095 87734095 111721652 111721652 154895579 154895579 184222582 184222582 219521561 219521561 ...
.. ..$ ID : chr [1:126] NA NA NA NA ...
.. ..$ REF : chr [1:126] "T" "T" "T" "T" ...
.. ..$ ALT : chr [1:126] "A" "A" "C" "C" ...
.. ..$ QUAL : num [1:126] 50 50 50 50 50 50 50 50 50 50 ...
.. ..$ FILTER : chr [1:126] "PASS" "PASS" "PASS" "PASS" ...
.. ..$ DPSum : int [1:126] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ platforms : int [1:126] 5 5 5 5 5 5 5 5 5 5 ...
.. ..$ platformnames : chr [1:126] "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" ...
.. ..$ platformbias : chr [1:126] NA NA NA NA ...
.. ..$ datasets : int [1:126] 5 5 5 5 5 5 5 5 5 5 ...
.. ..$ datasetnames : chr [1:126] "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" ...
.. ..$ datasetsmissingcall : chr [1:126] "IonExome" "IonExome" "IonExome" "IonExome" ...
.. ..$ callsets : int [1:126] 7 7 7 7 7 7 7 7 7 7 ...
.. ..$ callsetnames : chr [1:126] "HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK,SolidSE75GATKHC" ...
.. ..$ varType : chr [1:126] NA NA NA NA ...
.. ..$ filt : chr [1:126] NA NA NA NA ...
.. ..$ callable : chr [1:126] "CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_C"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_C"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE"| __truncated__ ...
.. ..$ difficultregion : chr [1:126] NA NA NA NA ...
.. ..$ arbitrated : chr [1:126] NA NA NA NA ...
.. ..$ callsetwiththisuniqgenopassing : chr [1:126] NA NA NA NA ...
.. ..$ callsetwithotheruniqgenopassing: chr [1:126] NA NA NA NA ...
.. ..$ Indiv : chr [1:126] "HG001" "2:HG001" "HG001" "2:HG001" ...
.. ..$ gt_DP : int [1:126] 762 762 658 658 966 966 711 711 924 924 ...
.. ..$ gt_GQ : int [1:126] 371 371 1463 1463 1278 1278 395 395 425 425 ...
.. ..$ gt_ADALL : chr [1:126] "0,276" "0,276" "70,109" "70,109" ...
.. ..$ gt_AD : chr [1:126] "54,380" "54,380" "154,209" "154,209" ...
.. ..$ gt_GT : chr [1:126] "1/1" "1/1" "0/1" "0/1" ...
.. ..$ gt_PS : int [1:126] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ gt_GT_alleles : chr [1:126] "A/A" "A/A" "T/C" "T/C" ...
..$ meta: tibble [22 × 5] (S3: tbl_df/tbl/data.frame)
.. ..$ Tag : chr [1:22] "INFO" "INFO" "INFO" "INFO" ...
.. ..$ ID : chr [1:22] "DPSum" "platforms" "platformnames" "platformbias" ...
.. ..$ Number : chr [1:22] "1" "1" "." "." ...
.. ..$ Type : chr [1:22] "Integer" "Integer" "String" "String" ...
.. ..$ Description: chr [1:22] "Total read depth summed across all datasets, excluding MQ0 reads" "Number of different platforms for which at least one callset called this genotype, whether filtered or not" "Names of platforms for which at least one callset called this genotype, whether filtered or not" "Names of platforms that have reads containing a variant at this location, but the high-confidence call is homoz"| __truncated__ ...
As we can see, import.vcf
returns a list with two
elements: split.wide.vcf.matrices
and
combined.long.vcf.df
. These two objects represent two ways
of representing the imported data.
The “wide” format is the most memory-efficient. It splits the VCF
data into two objects: a table of per-variant data with only essential
columns (CHROM, POS, ID, REF, ALT) and a matrix of genotypes where each
row corresponds to a variant and each column corresponds to an
individual. This format allows for efficient storage and manipulation of
genotype data, especially for large VCF files. The key limitation of
this format is that it does not allow the importation of information
(other than genotype) unique to a sample-variant pair, such as that
stored in the FORMAT
field of the VCF file.
The “long” format is optional and will only be returned if
long.format = TRUE
is specified. It allows for the
importation of all information in the VCF file, including the
FORMAT
field, however it requires a lot more memory to
store in R. This is because a row is created for each sample-variant
pair, resulting in a data frame with many more rows than the wide
format. For most applications, the wide format is sufficient and
preferred for its memory efficiency.
Use the import.pgs.weight.file
function to import PGS
weight files formatted according to PGS Catalog specifications. This
function parses the header that is typically provided by PGS catalog
files and differentiates between harmonized and not harmonized data
columns.
The example below imports a PGS weight file from the PGS Catalog, included in the package as an example file.
pgs.file <- system.file(
'extdata',
'PGS003378_hmPOS_GRCh38.txt.gz',
package = 'ApplyPolygenicScore',
mustWork = TRUE
)
pgs.data <- import.pgs.weight.file(
pgs.weight.path = pgs.file,
use.harmonized.data = TRUE
)
str(pgs.data)
List of 2
$ file.metadata :'data.frame': 15 obs. of 2 variables:
..$ key : chr [1:15] "format_version" "pgs_id" "pgs_name" "trait_reported" ...
..$ value: chr [1:15] "2.0" "PGS003378" "PSA_PGS_128" "Prostate-specific antigen (PSA) levels" ...
$ pgs.weight.data:'data.frame': 128 obs. of 16 variables:
..$ rsID : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
..$ chr_name : chr [1:128] "1" "1" "1" "1" ...
..$ chr_position : int [1:128] 88199778 112264274 205636334 51237409 219694903 205632217 154868055 184191716 219954267 60759747 ...
..$ effect_allele : chr [1:128] "T" "C" "C" "C" ...
..$ other_allele : chr [1:128] "A" "T" "A" "G" ...
..$ effect_weight : num [1:128] 0.0333 0.0317 0.0317 0.0264 0.0254 ...
..$ locus_name : chr [1:128] "1:88199778:T:A" "1:112264274:T:C" "1:205636334:A:C" "1:51237409:G:C" ...
..$ hm_source : chr [1:128] "ENSEMBL" "ENSEMBL" "ENSEMBL" "ENSEMBL" ...
..$ hm_rsID : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
..$ hm_chr : chr [1:128] "1" "1" "1" "1" ...
..$ hm_pos : int [1:128] 87734095 111721652 205667206 50771737 219521561 205663089 154895579 184222582 219780925 60532612 ...
..$ hm_inferOtherAllele: logi [1:128] NA NA NA NA NA NA ...
..$ ID : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
..$ CHROM : chr [1:128] "1" "1" "1" "1" ...
..$ POS : int [1:128] 87734095 111721652 205667206 50771737 219521561 205663089 154895579 184222582 219780925 60532612 ...
..$ beta : num [1:128] 0.0333 0.0317 0.0317 0.0264 0.0254 ...
The output of the import.pgs.weight.file
function is a
list with one data frame containing the SNP coordinates and weights for
the polygenic score, with standardized column names expected by other
package functions, and a second data frame containing the header
information from the PGS catalog file.
Providing phenotype data is an optional feature of several
ApplyPolygenicScore functions. Phenotype data must be imported as a data
frame and must contain a column named Indiv
corresponding
to the individual IDs in the VCF file.
# Isolating the individual IDs from the VCF data
vcf.individuals <- colnames(vcf.data$split.wide.vcf.matrices$genotyped.alleles)
# Simulating phenotype data
set.seed(123)
phenotype.data <- data.frame(
Indiv = vcf.individuals,
continuous.phenotype = rnorm(length(vcf.individuals)),
binary.phenotype = rbinom(length(vcf.individuals), 1, 0.5)
)
head(phenotype.data)
Indiv continuous.phenotype binary.phenotype
1 HG001 -0.5604756 1
2 2:HG001 -0.2301775 0
VCF files can be very large. Sometimes they are too large to be imported into R. In these cases, it is useful to first filter the VCF file to just the variants that are included in the PGS you wish to calculate and reduce file size. This is best done using command line tools designed for VCF file manipulation. For filtering, they typically require a BED file containing the coordinates of the variants you wish to keep. To simplify this process, ApplyPolygenicScore provides functions for converting PGS weight files to BED-formatted coordinate files.
BED format requires the following first three columns: chromosome name, start position, and end position. PGS weight files only contain the chromosome name and end position of each variant, so must be reformatted with an additional column for the start position, and with the correct column order.
Note: When writing BED formatted data frames to files, make sure to use tab-separated values and not include row names. Additionally, most tools do not accept BED files with column names. If you wish to maintain a header, you may need to add a comment character to the first line of the file:
# chr start end
Use the convert.pgs.to.bed
function to convert a PGS
weight file to a BED-formatted coordinate data frame.
pgs.coordinate.info <- pgs.data$pgs.weight.data
pgs.bed.format <- convert.pgs.to.bed(
pgs.weight.data = pgs.coordinate.info,
chr.prefix = TRUE,
numeric.sex.chr = FALSE,
slop = 10
)
head(pgs.bed.format)
chr start end rsID chr_name chr_position effect_allele
1 chr1 87734084 87734105 rs12131120 1 88199778 T
2 chr1 111721641 111721662 rs2076591 1 112264274 C
3 chr1 205667195 205667216 rs4951018 1 205636334 C
4 chr1 50771726 50771747 rs12569177 1 51237409 C
5 chr1 219521550 219521571 rs12034581 1 219694903 C
6 chr1 205663078 205663099 . 1 205632217 G
other_allele effect_weight
1 A 0.0333
2 T 0.0317
3 A 0.0317
4 G 0.0264
5 A 0.0254
6 GGCCGACAGCCCTTCTGCTGGCTCGGTGGGGCCCAGC 0.1628
locus_name hm_source hm_rsID hm_chr hm_pos
1 1:88199778:T:A ENSEMBL rs12131120 1 87734095
2 1:112264274:T:C ENSEMBL rs2076591 1 111721652
3 1:205636334:A:C ENSEMBL rs4951018 1 205667206
4 1:51237409:G:C ENSEMBL rs12569177 1 50771737
5 1:219694903:C:A ENSEMBL rs12034581 1 219521561
6 SLC45A3_p.Leu223_Ala234del liftover 1 205663089
hm_inferOtherAllele ID beta
1 NA rs12131120 0.0333
2 NA rs2076591 0.0317
3 NA rs4951018 0.0317
4 NA rs12569177 0.0264
5 NA rs12034581 0.0254
6 NA 0.1628
Coordinate files must match the coordinate style used in the VCF file
you wish to filter. The convert.pgs.to.bed
function
provides options for formatting chromosome names, as these can vary
between human genome reference files that originate from different
databases.
chr.prefix = TRUE
to add ‘chr’ to the chromosome
name (UCSC style)chr.prefix = FALSE
to remove ‘chr’ from the
chromosome name (Ensembl style)numeric.sex.chr = FALSE
to format the X and Y
chromosomes as ‘X’ and ‘Y’ respectivelynumeric.sex.chr = TRUE
to format the X and Y
chromosomes as ‘23’ and ‘24’ respectivelyThe slop
option imitates bedtools
nomenclature for adding base pairs to the start and end of a set of
coordinates. slop = 10
adds 10 base pairs to the start and
end of each variant coordinate.
Here is an example of genomic coordinates in BED file format for a variant on chromosome 1 at the 20th base pair.
No slop:
chr | start | end |
---|---|---|
chr1 | 19 | 20 |
With slop of 10 base pairs:
chr | start | end |
---|---|---|
chr1 | 9 | 30 |
What if you want to apply multiple polygenic scores to the same VCF
file? Instead of filtering the VCF file multiple times, you can use the
combine.pgs.bed
function to merge multiple BED-formatted
data frames into a single set of coordinates, and filter your VCF just
once for the union of all variants in multiple PGSs.
# convert your PGS weight files with no added slop
pgs.bed1 <- convert.pgs.to.bed(pgs.coordinate.info, slop = 0)
# simulating a second PGS with all coordinates shifted by 20 base pairs.
pgs.bed2 <- pgs.bed1
pgs.bed2$start <- pgs.bed2$start + 20
pgs.bed2$end <- pgs.bed2$end + 20
# Input must be a named list
pgs.bed.list <- list(PGS1 = pgs.bed1, PGS2 = pgs.bed2)
merged.pgs.bed <- combine.pgs.bed(
pgs.bed.list = pgs.bed.list,
add.annotation.data = TRUE,
annotation.column.index = which(colnames(pgs.bed1) == 'rsID') # keep information from the rsID column during merge
)
str(merged.pgs.bed)
'data.frame': 256 obs. of 4 variables:
$ chr : chr "chr1" "chr1" "chr1" "chr1" ...
$ start : num 5.08e+07 5.08e+07 8.77e+07 8.77e+07 1.12e+08 ...
$ end : num 5.08e+07 5.08e+07 8.77e+07 8.77e+07 1.12e+08 ...
$ annotation: chr "rs12569177|PGS1" "rs12569177|PGS2" "rs12131120|PGS1" "rs12131120|PGS2" ...
combine.pgs.bed
automatically annotates each interval
(in a new column) with the name of the origin PGS. It provides the
option of adding information from one additional column from the inputs
to the annotation in the merged output. In the cases of overlapping
intervals (e.g. the same variant is included in multiple PGSs),
overlapping annotations are concatenated with a comma.
Both the import.vcf
and
import.pgs.weight.file
functions perform some basic
validation of the input data during import. For example, PGS files are
checked for duplicate variants, and VCF files are checked for unmerged
multiallelic sites. If you have not used the native import functions,
you may wish to check that your data is compatible with the polygenic
score application functions in this package.
The apply.polygenic.score
function performs extensive
input validation prior to starting the PGS application process. It can
be configured to just run the validation step to check your inputs
first.
# By default, apply.polygenic.score expects the vcf data to be in wide format.
apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
vcf.long.format = FALSE,
pgs.weight.data = pgs.data$pgs.weight.data,
phenotype.data = phenotype.data,
validate.inputs.only = TRUE
)
Input data passed validation
[1] TRUE
# If you have imported a VCF file in long format, you can specify that with the vcf.long.format argument.
apply.polygenic.score(
vcf.data = vcf.data$combined.long.vcf.df$dat,
vcf.long.format = TRUE,
pgs.weight.data = pgs.data$pgs.weight.data,
phenotype.data = phenotype.data,
validate.inputs.only = TRUE
)
Input data passed validation
[1] TRUE
The main function of ApplyPolygenicScore is
apply.polygenic.score
. It comes with some bells and
whistles, but the basic usage is simple. Provide the VCF data and the
PGS weight data, and specify how you wish the algorithm to handle
missing genotypes.
pgs.results <- apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
pgs.weight.data = pgs.data$pgs.weight.data,
correct.strand.flips = FALSE, # no strand flip check to avoid warnings
missing.genotype.method = 'none'
)
Warning in combine.vcf.with.pgs(vcf.data = vcf.data$vcf.fixed.fields,
pgs.weight.data = pgs.weight.data): PGS is missing 66 SNPs from VCF
List of 2
$ pgs.output :'data.frame': 2 obs. of 9 variables:
..$ Indiv : chr [1:2] "HG001" "2:HG001"
..$ PGS : num [1:2] 1.98 1.98
..$ percentile : num [1:2] 1 1
..$ decile : int [1:2] 10 10
..$ quartile : int [1:2] 4 4
..$ n.pgm.sites : int [1:2] 128 128
..$ n.missing.genotypes : num [1:2] 66 66
..$ percent.missing.genotypes: num [1:2] 0.52 0.52
..$ n.non.missing.alleles : num [1:2] 124 124
$ regression.output: NULL
We see several things. First, a warning was printed by
combine.vcf.with.pgs
. This is a function called by
apply.polygenic.score
to merge vcf and pgs data by genomic
coordinates. The warning indicates that some variants from the PGS
weight data were not found in the VCF data. These variants were not
called in any individual. apply.polygenic.score
does not
apply any weights to these variants, which effectively assumes their
dosage as homozygous reference for all individuals. Checkout our
discussion on missing
genotype data for more details on how missing variants come to
be.
Next, we see the output of the function. The output is a list with
two elements: pgs.output
, and
regression.output
. regression.output
is empty
because we did not provide the optional phenotype.data
,
more on that later. pgs.output
is a data frame with the
individual IDs from the VCF and the calculated polygenic score in the
PGS
column for each individual. As we will see later, each
missing genotype method creates a uniquely named column of PGS
values.
Next, the percentile
, decile
, and
quartile
columns report the respective percentile
information for each individual’s PGS among the distribution of the
entire cohort in the VCF data. These values look a bit strange in this
example because our example
VCF contains only two individuals which have identical genotypes. In
a real-world scenario, these values would be more informative.
The next two columns report the number and percentage of missing PGS
genotypes for each individual. This number should be equal to or greater
than the number reported in the warning message from
combine.vcf.with.pgs
.
The final column reports the number of non-missing alleles for each
individual. This value is equivalent to the number of non-missing
variants multiplied by 2 (2 alleles per genotype in a diploid genome).
It is used to normalize the PGS values for missing genotypes when the
normalize
missing genotype method is used.
combine.vcf.with.pgs
A quick aside about this function. While it is internal to
apply.polygenic.score
, it is also available for use on its
own. The output includes a list of the variants that were not found in
the VCF data, which may be useful for troubleshooting missing genotype
data.
merged.vcf.pgs.data <- combine.vcf.with.pgs(
vcf.data = vcf.data$split.wide.vcf.matrices$vcf.fixed,
pgs.weight.data = pgs.data$pgs.weight.data
)
Warning in combine.vcf.with.pgs(vcf.data =
vcf.data$split.wide.vcf.matrices$vcf.fixed, : PGS is missing 66 SNPs from VCF
[1] "merged.vcf.with.pgs.data" "missing.snp.data"
Key: <ID.pgs>
ID.pgs CHROM.pgs POS.pgs rsID chr_name chr_position
<char> <char> <int> <char> <char> <int>
1: chr1 205663089 . 1 205632217
2: chr17 7890049 rs34009884 17 7793367
3: chr19 50846585 rs113920094 19 51349841
4: chr19 50857142 rs374546878 19 51360398
5: chr19 50875019 rs112103380 19 51378275
6: chr6 128846912 rs73583119 6 129168057
After PGS and VCF data are merged by coordinates, you may also wish
to check that the alleles in the VCF data variant record match the
corresponding alleles in the PGS weight data. By convention, the
other_allele
(non-effect allele) in the PGS weight data
should match the reference (REF) allele in the VCF data, and the
effect_allele
should match the alternate (ALT) allele in
the VCF data. Checkout our discussion on allele
matching for more details on the various ways in which this matching
may fail.
apply.polygenic.score
can perform an allele match check
(internally using assess.pgs.vcf.allele.match
). This
functionality can be customized with the following arguments:
correct.strand.flips
to automatically correct
mismatches caused by strand flips by flipping the PGS weight data
alleles.remove.ambiguous.allele.matches
to remove variants with
mismatched alleles that cannot be corrected by flipping. The variant
will be treated as a cohort-wide missing variant and handled according
to missing variant rules described below.max.strand.flips
to specify the number of unambiguous
strand flips that needs to be present for removal of ambiguous strand
flips to occur (convenient for cases where you have reason to believe
that all ambiguous strand flips are actually effect size flips)remove.mismatched.indels
to remove variants with
mismatched INDEL alleles. These variants will be treated as cohort-wide
missing variants and handled according to missing variant rules
described below.# Most strict allele match handling
strict.allele.match.result <- apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
pgs.weight.data = pgs.data$pgs.weight.data,
missing.genotype.method = 'none',
correct.strand.flips = TRUE,
remove.ambiguous.allele.matches = TRUE,
remove.mismatched.indels = TRUE
);
# Less strict allele match handling
less.strict.allele.match.result <- apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
pgs.weight.data = pgs.data$pgs.weight.data,
missing.genotype.method = 'none',
correct.strand.flips = TRUE,
remove.ambiguous.allele.matches = TRUE,
max.strand.flips = 1, # don't remove ambiguous allele matches unless there is at least one unambiguous strand flip
remove.mismatched.indels = FALSE
);
Let’s explore all the different options for handling missing genotype
data. The missing.genotype.method
argument can be set to
one of the following options: none
, normalize
,
and mean.dosage
. Any combination of these methods can be
applied in the same call to apply.polygenic.score
,
producing the corresponding number of polygenic score output columns.
For details on how each of these methods work, head back to our missing
genotype data discussion post.
Briefly:
none
ignores missing genotypes, effectively assuming
them to be homozygous reference callsnormalize
computes the score the same was as
none
, then divides the score by the number of non-missing
allelesmean.dosage
replaces the dosage of genotypes missing in
some but not all individuals with the mean dosage of the variant across
the cohort; genotypes missing in all individuals are assumed homozygous
referenceall.missing.methods.pgs.results <- apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
pgs.weight.data = pgs.data$pgs.weight.data,
correct.strand.flips = FALSE, # no strand flip check to avoid warnings
missing.genotype.method = c('none', 'normalize', 'mean.dosage')
)
Warning in combine.vcf.with.pgs(vcf.data = vcf.data$vcf.fixed.fields,
pgs.weight.data = pgs.weight.data): PGS is missing 66 SNPs from VCF
Indiv PGS PGS.with.normalized.missing PGS.with.replaced.missing
1 HG001 1.9761 0.01593629 1.9761
2 2:HG001 1.9761 0.01593629 1.9761
percentile decile quartile n.pgm.sites n.missing.genotypes
1 1 10 4 128 66
2 1 10 4 128 66
percent.missing.genotypes n.non.missing.alleles
1 0.52 124
2 0.52 124
We can see that the output now contains three columns of PGS values, one for each missing genotype method.
missing genotype method | column name |
---|---|
none |
PGS |
normalize |
PGS.with.normalized.missing |
mean.dosage |
PGS.with.replaced.missing |
The PGS.with.normalized.missing
contains the
PGS
values divided by 124: the number of non-missing
alleles in the cohort. As we can see from the
n.missing.genotypes
column, the number of missing variants
is 66. The number of variants in the example PGS weight file is 128, so
the number of non-missing genotypes is 128 - 66 = 62; multiply by two
for a diploid genome (2 alleles per genotype): 62 * 2 = 124.
The PGS.with.replaced.missing
column is identical to the
PGS
column in this case, since all missing genotypes were
missing from all individuals in the cohort, all such dosages were
assumed as homozygous reference, and no change from the
none
method was made.
What if you have a list of population allele frequencies for the
variants in your PGS weight file? Surely, when implementing the
mean.dosage
method, values derived from a massive public
dataset would be more accurate than inferring the population dosage from
your cohort! No problem, simply add these frequencies to your pgs weight
file, in PGS Catalog format, and use the
use.external.effect.allele.frequency
argument to tell
apply.polygenic.score
to use them instead.
Curious to see how your PGS values distribute among quintiles? No
problem. Specify a custom number of percentiles with
n.percentiles
.
custom.percentiles.pgs.results <- apply.polygenic.score(
vcf.data = vcf.data$split.wide.vcf.matrices,
pgs.weight.data = pgs.data$pgs.weight.data,
correct.strand.flips = FALSE, # no strand flip check to avoid warnings
n.percentiles = 5
)
Warning in combine.vcf.with.pgs(vcf.data = vcf.data$vcf.fixed.fields,
pgs.weight.data = pgs.weight.data): PGS is missing 66 SNPs from VCF
Indiv PGS.with.replaced.missing percentile decile quartile percentile.5
1 HG001 1.9761 1 10 4 5
2 2:HG001 1.9761 1 10 4 5
n.pgm.sites n.missing.genotypes percent.missing.genotypes
1 128 66 0.52
2 128 66 0.52
n.non.missing.alleles
1 124
2 124
Let’s explore the extra features provided for handling phenotype data.
Since these features are much more useful in larger cohorts, let’s start by loading a larger VCF file. The following example file contains fake genotype data for 10 fake individuals, along with some simulated phenotype data and a fake set of PGS weights.
phenotype.test.data.path <- system.file(
'extdata',
'phenotype.test.data.Rda',
package = 'ApplyPolygenicScore',
mustWork = TRUE
)
load(phenotype.test.data.path)
str(phenotype.test.data)
List of 3
$ vcf.data :'data.frame': 100 obs. of 7 variables:
..$ CHROM : chr [1:100] "chr1" "chr1" "chr1" "chr1" ...
..$ POS : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
..$ ID : chr [1:100] "rs1" "rs1" "rs1" "rs1" ...
..$ REF : chr [1:100] "T" "T" "T" "T" ...
..$ ALT : chr [1:100] "A" "A" "A" "A" ...
..$ Indiv : chr [1:100] "sample1" "sample2" "sample3" "sample4" ...
..$ gt_GT_alleles: chr [1:100] "T/T" "T/T" "T/T" "T/T" ...
$ pgs.weight.data:'data.frame': 10 obs. of 6 variables:
..$ CHROM : chr [1:10] "chr1" "chr2" "chr3" "chr4" ...
..$ POS : int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ ID : chr [1:10] "rs1" "rs2" "rs3" "rs4" ...
..$ effect_allele: chr [1:10] "A" "A" "A" "A" ...
..$ other_allele : chr [1:10] "T" "T" "T" "T" ...
..$ beta : num [1:10] -0.432 1.69 1.228 0.276 -1.049 ...
$ phenotype.data :'data.frame': 10 obs. of 5 variables:
..$ Indiv : chr [1:10] "sample1" "sample2" "sample3" "sample4" ...
..$ continuous.phenotype : num [1:10] -0.468 -0.773 2.15 -1.334 0.496 ...
..$ binary.phenotype : int [1:10] 1 1 0 0 0 0 1 0 0 0
..$ binary.factor.phenotype: Factor w/ 2 levels "control","case": 2 2 1 1 1 1 2 1 1 1
..$ categorical.phenotype : Factor w/ 5 levels "a","b","c","d",..: 3 1 2 1 2 4 5 5 3 1
When provided with phenotype data, apply.polygenic.score
merges the phenotype data with the PGS output.
pgs.results.with.phenotype <- apply.polygenic.score(
vcf.data = phenotype.test.data$vcf.data,
vcf.long.format = TRUE,
pgs.weight.data = phenotype.test.data$pgs.weight.data,
phenotype.data = phenotype.test.data$phenotype.data
)
head(pgs.results.with.phenotype$pgs.output)
Indiv PGS.with.replaced.missing percentile decile quartile n.pgm.sites
1 sample1 -1.5698448 0.2 2 1 10
2 sample10 0.6369117 0.7 7 3 10
3 sample2 1.8992260 0.9 9 4 10
4 sample3 -0.7625591 0.5 5 2 10
5 sample4 -1.0417387 0.4 4 2 10
6 sample5 -2.1401365 0.1 1 1 10
n.missing.genotypes percent.missing.genotypes n.non.missing.alleles
1 0 0 20
2 0 0 20
3 0 0 20
4 0 0 20
5 0 0 20
6 0 0 20
continuous.phenotype binary.phenotype binary.factor.phenotype
1 -0.4682005 1 case
2 -0.1524106 0 control
3 -0.7729782 1 case
4 2.1499193 0 control
5 -1.3343536 0 control
6 0.4958705 0 control
categorical.phenotype
1 c
2 a
3 a
4 b
5 a
6 b
Here we see that the pgs.output
data frame now contains
the phenotype data columns in addition to the PGS columns.
When provided with a vector of
phenotype.analysis.columns
,
apply.polygenic.score
automatically detects the data type
of each column (continuous or binary), and performs a regression (linear
or logistic) between the PGS and each phenotype column. This is
particularly useful to quickly see how well your PGS predicts the trait
it is intended to predict, or whether it is associated with any other
variables of interest.
pgs.results.with.phenotype.analysis <- apply.polygenic.score(
vcf.data = phenotype.test.data$vcf.data,
vcf.long.format = TRUE,
pgs.weight.data = phenotype.test.data$pgs.weight.data,
phenotype.data = phenotype.test.data$phenotype.data,
phenotype.analysis.columns = c('continuous.phenotype', 'binary.phenotype')
)
head(pgs.results.with.phenotype.analysis$regression.output)
phenotype model beta se p.value
1 continuous.phenotype linear.regression -0.01401100 0.2058113 0.9473952
2 binary.phenotype logistic.regression -0.02360666 0.4185203 0.9550191
r.squared AUC
1 0.0005789728 NA
2 NA 0.5238095
We finally see a data frame in the regression.output
element of the output list. This data frame contains the results of the
regression analysis for each compatible phenotype. An accuracy metric in
the form of an r-squared value is computed for linear regression, and an
Area Under the (Receiver Operator) Curve (AUC) value is computed for
logistic regression.
If multiple missing genotype methods are being used at once,
producing multiple PGS columns, the regression analysis is only
performed on one of these and defaults to the first method specified in
missing.genotype.method
. If you wish to use one of the
other PGSs for analysis, the source PGS column for regression (and
percentile) analysis can be specified using the
analysis.source.pgs
argument.
ApplyPolygenicScore provides four plotting functions designed to
operate on the output of apply.polygenic.score
:
create.pgs.density.plot
create.pgs.boxplot
create.pgs.with.continuous.phenotype.plot
create.pgs.rank.plot
These visualizations are intended to provide quick quality assessments of PGS in a cohort. They really shine when combined with phenotype data to produce at-a-glance summaries of the relationship between PGS and phenotypes.
Plotting functions are built on the BoutrosLab.plotting.general
package and produce lattice multipanel plot objects. Each function can
be provided with an output.directory
, a
filename.prefix
, and a file.extension
to write
the plots to a file. A filename will be generated using the current date
and the provided prefix, and the plot will be written in the format
specified by the extension.
Note: in the examples below, all plots are written to a temporary directory from which they are read and displayed inline in the vignette.
Plotting functions automatically detect and plot data from the
standardized PGS output columns produced by
apply.polygenic.score
: PGS
,
PGS.with.normalized.missing
, and
PGS.with.replaced.missing
. If you are modifying and
renaming your PGS outputs, or wish to plot data generated elsewhere, you
can also specify a custom PGS column to plot using the
pgs.columns
argument. Plots are automatically titled with
the name of the PGS column being plotted, and the titles can be “tidied
up” to remove period characters and replace them with spaces using the
tidy.titles
argument.
Multipanel plot dimensions can be controlled with width
and height
arguments, which are interpreted as inches. If
your plots are looking squished, try increasing these values. Axis and
title label font sizes can be controlled with xaxis.cex
,
yaxis.cex
, and titles.cex
arguments. Border
padding can be controlled with border.padding
and is
applied to all four sides of the plot at once.
create.pgs.density.plot
plots the PGS distribution in
the cohort.
create.pgs.density.plot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-basic',
file.extension = 'png',
width = 5,
height = 5
)
NULL
If provided with phenotype column names, this function will plot individual PGS density curves for each category in a categorical phenotype.
create.pgs.density.plot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-phenotype',
file.extension = 'png',
tidy.titles = TRUE,
phenotype.columns = c('binary.factor.phenotype', 'categorical.phenotype', 'continuous.phenotype'),
width = 5,
height = 7,
yaxes.cex = 1.2
)
NULL
create.pgs.boxplot
plots a boxplot of the PGS
distribution in the cohort, optionally grouped by phenotype categories.
By default, individual scores are plotted as points over the boxplot in
a stripplot style, and this option can be disabled with the
add.stripplot = FALSE
argument.
create.pgs.boxplot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-basic',
file.extension = 'png',
width = 4,
height = 4
)
NULL
If provided with phenotype column names, this function will plot individual boxplots for each category in a categorical phenotype.
create.pgs.boxplot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-phenotype',
file.extension = 'png',
tidy.titles = TRUE,
add.stripplot = FALSE,
width = 5,
height = 11,
alpha = 0.9,
phenotype.columns = c('binary.factor.phenotype', 'categorical.phenotype', 'continuous.phenotype'),
titles.cex = 1.2
)
NULL
create.pgs.with.continuous.phenotype.plot
plots a
scatterplot between the PGS and a continuous phenotype, and reports a
correlation. Non-continuous phenotypes are ignored. This function can
only be used if phenotype data are provided.
create.pgs.with.continuous.phenotype.plot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-correlation',
file.extension = 'png',
tidy.titles = TRUE,
phenotype.columns = c('continuous.phenotype', 'categorical.phenotype'),
width = 5,
height = 5
)
NULL
Unsurprisingly, our randomly generated phenotype data are not correlated with our fake PGS.
create.pgs.rank.plot
plots several panels of
per-individual data, each in the order of increasing PGS percentile
rank.
At the top, a barplot of missing PGS variant genotype counts or percents for each individual (only if at least one individual is missing a variant). Next, a barplot of the percentile rank of each individual. Next, a covariate bar of the PGS quartile and decile of each individual. Next, a covariate bar of categories for each provided categorical phenotype in each individual. Finally, a heatmap of continuous phenotype values for each provided continuous phenotype in each individual.
# Introducing some missing genotypes to demonstrate the missing genotype barplot
pgs.results.with.phenotype.analysis$pgs.output$n.missing.genotypes <- rep(c(0, 1, 0, 2, 1), 2)
create.pgs.rank.plot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-basic',
file.extension = 'png',
width = 5,
height = 6
)
NULL
create.pgs.rank.plot(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
output.dir = temp.dir,
filename.prefix = 'vignette-example-phenotype',
file.extension = 'png',
phenotype.columns = c('binary.factor.phenotype', 'binary.phenotype', 'categorical.phenotype', 'continuous.phenotype'),
width = 6,
height = 8
)
[1] "springgreen3"
[1] "darkturquoise"
[1] "deepskyblue"
NULL
This function comes with a handful of additional arguments to control the appearance of the plot.
missing.genotype.style
to toggle between displaying
counts and percentages.categorical.palette
to specify a custom color
palette for categorical phenotype bars.binary.pallette
to specify a custom color palette
for binary and continuous phenotype bars.Another feature of ApplyPolygenicScore is the ability to perform a
case-control style analysis to assess the accuracy of a PGS in
predicting a binary phenotype. The
analyze.pgs.binary.predictiveness
function takes the output
of apply.polygenic.score
, fits a logistic regression to
predict the specified phenotype, plots a receiver-operator curve, and
returns the corresponding area under the curve (AUC) value.
pgs.case.control.analysis <- analyze.pgs.binary.predictiveness(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
pgs.columns = 'PGS.with.replaced.missing',
phenotype.column = 'binary.phenotype',
output.dir = temp.dir,
filename.prefix = 'vignette-example-basic',
file.extension = 'png',
width = 5,
height = 5,
titles.cex = 1.2
);
Warning in analyze.pgs.binary.predictiveness(pgs.data =
pgs.results.with.phenotype.analysis$pgs.output, : Phenotype column
'binary.phenotype' is not a factor. Attempting to convert to factor.
If you wish to assess the predictiveness of a PGS in predicting a continuous phenotype, you can binarize the continuous phenotype by specifying a threshold that will split the continuous values into two categories.
pgs.case.control.analysis.continuous <- analyze.pgs.binary.predictiveness(
pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
pgs.columns = 'PGS.with.replaced.missing',
phenotype.column = 'continuous.phenotype',
phenotype.type = 'continuous',
cutoff.threshold = 0, # binarize at 0
output.dir = temp.dir,
filename.prefix = 'vignette-example-continuous',
file.extension = 'png',
width = 5,
height = 5,
titles.cex = 1.2
);