Here we’ll examine an example application of the widyr package, particularly the pairwise_cor and pairwise_dist functions. We’ll use the data on United Nations General Assembly voting from the unvotes package:
library(dplyr)
library(unvotes)
un_votes
## # A tibble: 738,764 x 4
##     rcid country                  country_code vote 
##    <int> <chr>                    <chr>        <fct>
##  1     3 United States of America US           yes  
##  2     3 Canada                   CA           no   
##  3     3 Cuba                     CU           yes  
##  4     3 Haiti                    HT           yes  
##  5     3 Dominican Republic       DO           yes  
##  6     3 Mexico                   MX           yes  
##  7     3 Guatemala                GT           yes  
##  8     3 Honduras                 HN           yes  
##  9     3 El Salvador              SV           yes  
## 10     3 Nicaragua                NI           yes  
## # ... with 738,754 more rows
This dataset has one row for each country for each roll call vote. We’re interested in finding pairs of countries that tended to vote similarly.
Notice that the vote column is a factor, with levels (in order) “yes”, “abstain”, and “no”:
levels(un_votes$vote)
## [1] "yes"     "abstain" "no"
Converting the factor to its numeric codes (yes = 1, abstain = 2, no = 3) lets us measure country-to-country agreement across roll call votes with the pairwise_cor function.
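To make the idea concrete, here is a minimal base R sketch of the "widen, then correlate" approach on a tiny made-up dataset. The data frame `toy_votes` and its countries A, B, and C are hypothetical, and this illustrates the concept only; it is not widyr's internal implementation.

```r
# Toy long-format votes (hypothetical data, not taken from un_votes)
toy_votes <- data.frame(
  rcid    = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  country = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B", "C"),
  vote    = factor(
    c("yes", "yes", "no", "no", "no", "abstain", "abstain", "yes",
      "yes", "yes", "no"),
    levels = c("yes", "abstain", "no")
  ),
  stringsAsFactors = FALSE
)

# as.numeric() on a factor returns the level index: yes = 1, abstain = 2, no = 3
toy_votes$vote_num <- as.numeric(toy_votes$vote)

# Widen to one row per roll call and one column per country.
# Country C did not vote on roll call 2, so that cell is NA.
wide <- tapply(
  toy_votes$vote_num,
  list(toy_votes$rcid, toy_votes$country),
  FUN = function(x) x[1]
)

# Correlate every pair of columns, using only the roll calls
# on which both countries in the pair cast a vote
toy_cor <- cor(wide, use = "pairwise.complete.obs")
toy_cor
```

In this toy example A and B always vote the same way, so their correlation is 1, while C votes opposite to A on every roll call they share, giving -1. Note that the numeric coding treats "abstain" as halfway between "yes" and "no", which is a modelling assumption rather than a property of the data.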
library(widyr)
cors <- un_votes %>%
  mutate(vote = as.numeric(vote)) %>%
  pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)
cors
## # A tibble: 39,800 x 3
##    item1          item2          correlation
##    <chr>          <chr>                <dbl>
##  1 Slovakia       Czech Republic       0.989
##  2 Czech Republic Slovakia             0.989
##  3 Lithuania      Estonia              0.971
##  4 Estonia        Lithuania            0.971
##  5 Lithuania      Latvia               0.970
##  6 Latvia         Lithuania            0.970
##  7 Germany        Liechtenstein        0.968
##  8 Liechtenstein  Germany              0.968
##  9 Slovakia       Slovenia             0.966
## 10 Slovenia       Slovakia             0.966
## # ... with 39,790 more rows
We could, for example, find the countries that the US is most and least in agreement with:
US_cors <- cors %>%
  filter(item1 == "United States of America")
# Most in agreement
US_cors
## # A tibble: 199 x 3
##    item1                    item2                                                correlation
##    <chr>                    <chr>                                                      <dbl>
##  1 United States of America United Kingdom of Great Britain and Northern Ireland       0.576
##  2 United States of America Canada                                                     0.559
##  3 United States of America Israel                                                     0.540
##  4 United States of America Netherlands                                                0.515
##  5 United States of America Luxembourg                                                 0.505
##  6 United States of America Australia                                                  0.502
##  7 United States of America Belgium                                                    0.496
##  8 United States of America Italy                                                      0.467
##  9 United States of America New Zealand                                                0.458
## 10 United States of America Japan                                                      0.458
## # ... with 189 more rows
# Least in agreement
US_cors %>%
  arrange(correlation)
## # A tibble: 199 x 3
##    item1                    item2                correlation
##    <chr>                    <chr>                      <dbl>
##  1 United States of America Belarus                   -0.358
##  2 United States of America Czechoslovakia            -0.330
##  3 United States of America Cuba                      -0.306
##  4 United States of America Russian Federation        -0.301
##  5 United States of America Egypt                     -0.247
##  6 United States of America India                     -0.243
##  7 United States of America Syrian Arab Republic      -0.238
##  8 United States of America Afghanistan               -0.229
##  9 United States of America Ukraine                   -0.225
## 10 United States of America Yemen Arab Republic       -0.224
## # ... with 189 more rows
This can be particularly useful when visualized on a map.
library(maps)
library(fuzzyjoin)
library(countrycode)
library(ggplot2)
world_data <- map_data("world") %>%
  regex_full_join(iso3166, by = c("region" = "mapname")) %>%
  filter(region != "Antarctica")
US_cors %>%
  mutate(a2 = countrycode(item2, "country.name", "iso2c")) %>%
  full_join(world_data, by = "a2") %>%
  ggplot(aes(long, lat, group = group, fill = correlation)) +
  geom_polygon(color = "gray", linewidth = .1) +
  scale_fill_gradient2() +
  coord_quickmap() +
  theme_void() +
  labs(title = "Correlation of each country's UN votes with the United States",
       subtitle = "Blue indicates agreement, red indicates disagreement",
       fill = "Correlation w/ US")
Another useful kind of visualization is a network plot, which can be created with Thomas Pedersen’s ggraph package. We can filter for pairs of countries with correlations above a particular threshold.
library(ggraph)
library(igraph)
cors_filtered <- cors %>%
  filter(correlation > .6)
continents <- tibble(country = unique(un_votes$country)) %>%
  filter(country %in% cors_filtered$item1 |
         country %in% cors_filtered$item2) %>%
  mutate(continent = countrycode(country, "country.name", "continent"))
set.seed(2017)
cors_filtered %>%
  graph_from_data_frame(vertices = continents) %>%
  ggraph() +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(aes(color = continent), size = 3) +
  geom_node_text(aes(label = name), check_overlap = TRUE, vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Network of countries with correlated United Nations votes")
Choosing the threshold for filtering correlations (or other measures of similarity) typically requires some trial and error. Setting too high a threshold will make a graph too sparse, while too low a threshold will make a graph too crowded.
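One way to shorten the trial and error is to tabulate how many pairs and countries survive at a few candidate thresholds before plotting anything. The sketch below does this in base R on a small made-up table; `toy_cors`, `thresholds`, and `summarize_threshold` are hypothetical names, and the table only mimics the shape of `cors` (each pair listed in both directions).

```r
# Hypothetical correlation table shaped like `cors`:
# item1, item2, correlation, with every pair listed in both directions
toy_cors <- data.frame(
  item1 = c("A", "B", "A", "C", "B", "C", "C", "D", "A", "D"),
  item2 = c("B", "A", "C", "A", "C", "B", "D", "C", "D", "A"),
  correlation = c(0.9, 0.9, 0.7, 0.7, 0.65, 0.65, 0.5, 0.5, 0.2, 0.2),
  stringsAsFactors = FALSE
)

# Candidate thresholds to compare
thresholds <- c(0.4, 0.6, 0.8)

# For one threshold, count the surviving pairs (edges) and countries (nodes)
summarize_threshold <- function(threshold) {
  kept <- toy_cors[toy_cors$correlation > threshold, ]
  data.frame(
    threshold   = threshold,
    n_pairs     = nrow(kept) / 2,  # each pair appears twice in the table
    n_countries = length(unique(c(kept$item1, kept$item2)))
  )
}

threshold_summary <- do.call(rbind, lapply(thresholds, summarize_threshold))
threshold_summary
```

A threshold at which the country count collapses is likely to give a sparse graph, while one at which the pair count is many times the country count is likely to look crowded. The same function could be pointed at `cors` in place of `toy_cors`, assuming it keeps the item1, item2, and correlation columns shown above.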