In this document, we will introduce the algorithms underlying the coop package [1], and offer some benchmarks. In order to recreate the benchmarks here, one needs a compiler that supports OpenMP [2] and a high-performance BLAS library [3]. See the other coop package vignette Introducing coop: Fast Covariance, Correlation, and Cosine Operations [4] for details.
We do not bother to go into details for the covariance and correlation algorithms, because they are obvious and uninteresting.
Of the three operations, only cosine similarity currently has a sparse implementation. The short reason why is that the other two operations require centering and/or scaling.
To better understand the problem, consider the \(20\times 5\) matrix whose first row is a row of ones, and all other rows consist entirely of zeros:
x <- matrix(0, 20, 5)
x[1, ] <- 1
The original matrix is, obviously, 95% sparse:
coop::sparsity(x)
## [1] 0.95
But if we center the data, it becomes 100% dense:
coop::sparsity(scale(x, TRUE, FALSE))
## [1] 0
For dense implementations, the performance should scale well, and the non-BLAS components will use multiple threads (if your compiler supports OpenMP) when the matrix has more than 1000 columns. Additionally, we try to use vector operations (using OpenMP’s simd construct) for additional performance, but this requires a compiler that supports a relatively modern OpenMP standard.
Given an \(m\times n\) matrix \(A\) (input) and an \(n\times n\) matrix \(C\) (preallocated output):
Compute C = t(A) %*% A using a symmetric rank-k update (the _syrk BLAS function).
The total number of floating point operations is:
The algorithmic complexity is \(O(mn^2)\), and is dominated by the symmetric rank-k update. The storage complexity, ignoring the required allocation of outputs (namely the \(C\) matrix), is \(O(1)\).
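A minimal base-R sketch of the dense matrix case may help make the cost concrete. It assumes the crossproduct is followed by dividing each entry by the product of the corresponding column norms (the square roots of the diagonal). The helper name dense_cosine is hypothetical, and crossprod() stands in for a direct _syrk call; this is an illustration, not the package's compiled code:

```r
# Hypothetical helper: cosine similarity between the columns of a dense matrix.
dense_cosine <- function(A) {
  C <- crossprod(A)     # t(A) %*% A, the O(m n^2) step
  d <- sqrt(diag(C))    # Euclidean norm of each column
  C / tcrossprod(d)     # C[i, j] / (d[i] * d[j]), an O(n^2) pass
}

# Columns (1, 0, 2) and (0, 1, 2): dot product 4, both norms sqrt(5).
A <- matrix(c(1, 0, 2, 0, 1, 2), nrow = 3)
S <- dense_cosine(A)
```

The normalization touches only the \(n\times n\) output, which is consistent with the crossproduct dominating the runtime.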
Given two \(n\)-length vectors \(x\) and \(y\) (inputs):
Compute crossprod = t(x) %*% y (using the _gemm BLAS function).
Compute the squared norms of x and y (using the _syrk BLAS function).
Divide crossprod from 1 by the square root of the product of the norms from 2.
The total number of floating point operations is:
The algorithmic complexity is \(O(n)\). The storage complexity is \(O(1)\).
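For the vector case, the computation can be sketched in a few lines of base R. The helper name vec_cosine is hypothetical, and sum() is used in place of BLAS calls, so this shows the arithmetic rather than the package's implementation:

```r
# Hypothetical helper: cosine similarity of two equal-length numeric vectors.
vec_cosine <- function(x, y) {
  stopifnot(length(x) == length(y))
  cp <- sum(x * y)                  # crossproduct t(x) %*% y
  cp / sqrt(sum(x^2) * sum(y^2))    # divide by the product of the norms
}

v1 <- vec_cosine(c(1, 0), c(1, 1))        # angle of 45 degrees
v2 <- vec_cosine(c(1, 2, 3), c(2, 4, 6))  # parallel vectors
```

Each of the three sums is a single pass over the data, which matches the \(O(n)\) complexity stated above.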
Given an \(m\times n\) sparse matrix \(A\) stored in COO (coordinate) format with row/column indices \(i\) and \(j\), sorted by column first and then by row, and corresponding data vector \(a\) (inputs), and given a preallocated \(n\times n\) dense matrix \(C\) (output):
… NaN (for compatibility with dense routines). Go to 2.
… i>j of a (call it y), find its first and final position in the COO storage.
… epsilon=1e-10 for us):
The worst case runtime complexity occurs when the matrix is dense but stored as a sparse matrix, and is \(O(mn^2)\), the same as in the dense case. However, this will cause serious cache thrashing, and the performance will be abysmal.
The function stores the \(j\)’th column data and its row indices in temporary storage for better cache access patterns. In the best case, this requires 12 KiB of additional storage: 8 KiB for the data and 4 KiB for the indices. In the worst case (an all-dense column), this balloons to \(12m\) bytes. The storage complexity is therefore \(O(1)\) in the best case and \(O(m)\) in the worst case.
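The general idea of a column-pair cosine on column-sorted COO data can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: coo_cosine is a hypothetical helper, indices are assumed 1-based, the per-column slices are precomputed with split() instead of being located by scanning, and an all-zero column is assumed to yield NaN in its row and column, following the note about compatibility with dense routines.

```r
# Hypothetical helper: cosine similarity between the n columns of a COO matrix.
# i, j: 1-based row/column indices sorted by column, then row; a: nonzero values.
coo_cosine <- function(i, j, a, n) {
  C <- matrix(0, n, n)
  cols <- factor(j, levels = seq_len(n))   # keeps empty columns as empty slices
  rows_by_col <- split(i, cols)
  vals_by_col <- split(a, cols)
  norms <- vapply(vals_by_col, function(v) sqrt(sum(v^2)), numeric(1))

  for (p in seq_len(n)) {
    for (q in p:n) {
      if (norms[p] == 0 || norms[q] == 0) {
        val <- NaN                          # all-zero column: undefined cosine
      } else {
        # Only rows present in both columns contribute to the dot product.
        hit <- match(rows_by_col[[p]], rows_by_col[[q]], nomatch = 0L)
        dot <- sum(vals_by_col[[p]][hit > 0L] * vals_by_col[[q]][hit])
        val <- dot / (norms[p] * norms[q])
      }
      C[p, q] <- C[q, p] <- val
    }
  }
  C
}

# Small example: column 3 is empty, columns 1 and 2 share only row 3.
x <- matrix(0, 6, 4)
x[c(1, 3), 1] <- c(1, 2)
x[c(3, 5), 2] <- c(4, 1)
x[2, 4] <- 3
idx <- which(x != 0, arr.ind = TRUE)        # column-major, so sorted by column
C <- coo_cosine(idx[, "row"], idx[, "col"], x[idx], ncol(x))
```

Because each pair of columns only visits stored entries, the work shrinks with the number of nonzeros, while a fully dense input degenerates to the same \(O(mn^2)\) cost as the dense routine.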
The source code for all benchmarks presented here can be found in the source tree of this package under inst/benchmarks/, or in the binary installation under benchmarks/.
All benchmarks were performed using:
Throughout the benchmarks, we will use the following packages and data:
library(rbenchmark)
reps <- 100
cols <- c("test", "replications", "elapsed", "relative")
Compared to the version in the lsa package (as of 27-Oct-2015), this implementation performs quite well:
m <- 2000
n <- 200
x <- matrix(rnorm(m*n), m, n)
benchmark(coop::cosine(x), lsa::cosine(x), columns=cols, replications=reps)
##              test replications elapsed relative
## 1 coop::cosine(x)          100   0.177    1.000
## 2  lsa::cosine(x)          100 113.543  641.486
Here the two perform nearly identically:
n <- 1000000
x <- rnorm(n)
y <- rnorm(n)
benchmark(coop::cosine(x, y), lsa::cosine(x, y), columns=cols, replications=reps)
##                 test replications elapsed relative
## 1 coop::cosine(x, y)          100   0.757    1.000
## 2  lsa::cosine(x, y)          100   0.768    1.015
Benchmarking sparse matrix methods can be more challenging than benchmarking dense ones for a variety of reasons, chief among them that the level of sparsity can have an enormous impact on performance.
We present two cases here of varying levels of sparsity. First, we will generate a 0.1% dense / 99.9% sparse matrix:
m <- 6000
n <- 250
dense <- coop:::dense_stored_sparse_mat(m, n, .001)
sparse <- slam::as.simple_triplet_matrix(dense)
This gives us a fairly dramatic difference in storage:
memuse::memuse(dense)
## 11.444 MiB
memuse::memuse(sparse)
## 24.445 KiB
So the dense matrix needs roughly 479 times as much storage for the exact same data. In such very sparse cases, the sparse implementation will perform quite nicely:
benchmark(dense=coop::cosine(dense), sparse=coop::cosine(sparse), columns=cols, replications=reps)
##     test replications elapsed relative
## 1  dense          100   0.712    3.082
## 2 sparse          100   0.231    1.000
Note that this is a 3-fold speedup over our already highly optimized dense implementation. This is quite nice, especially considering the sparse implementation uses only one thread and limited vectorization, while the dense one uses 4 threads and vectorization. However, as the matrix becomes more dense (and it doesn’t take much), dense methods begin to perform better:
dense <- coop:::dense_stored_sparse_mat(m, n, .01)
sparse <- slam::as.simple_triplet_matrix(dense)
memuse::memuse(dense)
## 11.444 MiB
memuse::memuse(sparse)
## 235.383 KiB
benchmark(dense=coop::cosine(dense), sparse=coop::cosine(sparse), columns=cols, replications=reps)
##     test replications elapsed relative
## 1  dense          100   0.707    1.000
## 2 sparse          100   2.076    2.936
While the sparse implementation performs significantly worse than the dense one for this level of sparsity and data size, note that the memory usage for the dense case is greater than that of the sparse by a factor of roughly 50.
It is hard to give perfect advice for when to use a dense or sparse method, but a general rule of thumb is that if you have more than 5% non-zero data, definitely use dense methods. For 1-5%, there is a memory/runtime tradeoff worth considering; if you can comfortably store the matrix densely, then by all means use dense methods. For data <1% dense, sparse methods will generally have better runtime performance than dense methods.
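To weigh the memory side of this tradeoff, a rough back-of-the-envelope estimate can be computed directly. The helpers below are hypothetical and assume 8-byte doubles for values and 4-byte integers for the row and column indices of a triplet representation, ignoring any per-object overhead, so they should be read as approximations:

```r
# Hypothetical helpers: approximate storage in bytes, ignoring object overhead.
dense_bytes <- function(m, n) {
  8 * m * n                  # one double per entry
}
triplet_bytes <- function(m, n, density) {
  nnz <- density * m * n     # expected number of stored entries
  nnz * (8 + 4 + 4)          # value + row index + column index
}

m <- 6000
n <- 250
dense_mib   <- dense_bytes(m, n) / 1024^2          # about 11.44 MiB
sparse1_kib <- triplet_bytes(m, n, 0.001) / 1024   # about 23.4 KiB
sparse2_kib <- triplet_bytes(m, n, 0.01) / 1024    # about 234.4 KiB
```

These estimates land close to the memuse figures reported in the benchmarks above, with the small remainder plausibly due to object overhead.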
[1] D. Schmidt, Coop: Fast correlation, covariance, and cosine similarity. 2016.
[2] OpenMP Architecture Review Board, “OpenMP application program interface version 4.0.” July 2013.
[3] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra subprograms for Fortran usage,” ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp. 308–323, 1979.
[4] D. Schmidt, Introducing coop: Fast covariance, correlation, and cosine operations. 2016.