houba provides manipulation of large data through memory-mapped files, supporting vectors, matrices, and arrays. This makes it possible to work with large datasets while keeping them on disk.
houba defines three S4 classes:
- mvector for memory-mapped vectors
- mmatrix for memory-mapped matrices
- marray for memory-mapped arrays
Currently, it supports the float, double, integer and char data types.
houba lets you extract sub-vectors or sub-matrices and make assignments. It also performs component-wise arithmetic operations (currently no matrix arithmetic). In-place arithmetic operations are supported. The rowSums, colSums, rowMeans and colMeans methods are defined for memory-mapped matrices.
A minimal compatibility with the bigmemory package is provided through descriptor files.
NOTE 1 A current limitation of houba is that it relies on R integers for indices, so vectors of length larger than 2,147,483,647 cannot be manipulated. The same limitation applies to the dimensions of matrices and arrays.
NOTE 2 houba relies on the C++ header-only library mio by vimpunk, which is released under the MIT License: https://github.com/vimpunk/mio.
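The index limit mentioned in NOTE 1 comes from R's 32-bit integer type. The following base R sketch (it does not use houba) shows the largest representable integer and what happens one step beyond it:

```r
# Largest value an R integer can hold (32-bit signed)
max_int <- .Machine$integer.max
max_int

# Going past the limit in integer arithmetic yields NA, with a warning
overflow <- suppressWarnings(max_int + 1L)
is.na(overflow)

# Doubles can represent larger whole numbers, but they are not integers
beyond <- as.double(max_int) + 1
beyond
```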
To create zero-filled objects, associated with new files, use mvector, mmatrix and marray.
Here we create a memory-mapped vector of length 100, associated with a temporary file:
A <- mvector(datatype = "double", length = 100)
A
## A mvector of length 100
## data type: double
## File: /tmp/Rtmppw0bVM/mmatrix209eaecdac51d
## --- excerpt
## [1] 0 0 0 0 0
We can specify the filename for the backing file. Here we create a memory-mapped matrix:
filename <- file.path(tempdir(), "integers120")
B <- mmatrix(datatype = "integer", nrow = 12, ncol = 10, filename = filename)
B
## A mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
Similarly, marray("float", c(10, 20, 3)) creates a 10 by 20 by 3 array.
The methods as.mvector, as.mmatrix and as.marray create a file holding the content of an R object.
# Convert regular R objects to memory-mapped objects
a <- matrix(1:20, 4, 5)
A <- as.mmatrix(a, datatype = "float")
A
## A mmatrix with 4 rows and 5 cols
## data type: float
## File: /tmp/Rtmppw0bVM/mmatrix209eae526be3f1
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
If datatype is not provided, the method will use integer or double, depending on the type of the R object.
v <- 1:10
V <- as.mvector(v)
V
## A mvector of length 10
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae70a74bd6
## --- excerpt
## [1] 1 2 3 4 5
These methods also accept a filename argument.
You can recover an R object using as.vector, as.matrix and as.array:
as.vector(V)
## [1] 1 2 3 4 5 6 7 8 9 10
An existing file can be mapped, as long as it has the right size. Here we use the file mapped in B, created above.
C <- mvector("int", 120, filename)
C
## A read-only mvector of length 120
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [1] 0 0 0 0 0
Providing an incompatible size will raise an error.
D <- mvector("int", 100, filename)
## Error: The file size doesn't match the matrix size
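The size check can be reasoned about with simple arithmetic. Assuming the backing file holds only the raw values with no header (an assumption, not something stated above), 120 four-byte integers occupy 480 bytes. The base R sketch below writes such a file with writeBin and checks its size; the file name is hypothetical and houba itself is not used:

```r
n <- 120L
bytes_per_integer <- 4L  # size of a 32-bit integer

# Hypothetical raw file holding n zero integers, written without any header
raw_file <- tempfile("raw_integers")
writeBin(integer(n), raw_file, size = bytes_per_integer)

# The file size is the number of elements times the element size
expected_size <- n * bytes_per_integer
size_matches <- file.size(raw_file) == expected_size
size_matches

# A length of 100 would call for a different file size
wrong_length_matches <- 100L * bytes_per_integer == file.size(raw_file)
wrong_length_matches
```

Under that assumption, mapping the same file as 100 integers cannot succeed, since the sizes disagree.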
The mvector C is read-only; this is the default when mapping an existing file. You can change this by passing the argument readonly = FALSE to mvector.
As C and B map the same file, modifying one object should modify the other:
B[1:4] <- 1:4
C
## A read-only mvector of length 120
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [1] 1 2 3 4 0
However, this may not always work, depending on your system, or when a file is mapped from several R sessions. The function flush makes sure all changes are written to disk:
B[1:4] <- 2:5
flush(B)
C
## A read-only mvector of length 120
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [1] 2 3 4 5 0
Descriptor files aim to provide a minimal compatibility with the bigmemory package.
To create a descriptor file associated with a mapped file, use descriptor.file. We illustrate it here on the matrix B created above.
B
## A mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 0 0 0 0
## [2,] 3 0 0 0 0
## [3,] 4 0 0 0 0
## [4,] 5 0 0 0 0
## [5,] 0 0 0 0 0
dsc <- descriptor.file(B)
## Warning in mk.descriptor.file(object@file, object@dim[1], object@dim[2], : Creating
## a descriptor file for an object stored in tmp directory
Descriptor files can be read with read.descriptor:
D <- read.descriptor(dsc)
D
## A read-only mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM//integers120
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 0 0 0 0
## [2,] 3 0 0 0 0
## [3,] 4 0 0 0 0
## [4,] 5 0 0 0 0
## [5,] 0 0 0 0 0
The descriptor files created by houba can be read with the package bigmemory.
We first load the package and read the descriptor file:
library(bigmemory)
desc <- dget(dsc)
We then attach the file:
bm <- attach.big.matrix(desc)
The resulting object maps the same data file:
bm[, 1]
## [1] 2 3 4 5 0 0 0 0 0 0 0 0
Note that although houba can create descriptor files for marrays, these won’t be accepted by bigmemory, which doesn’t handle arrays.
When restoring data from a previous session, pointers to external objects are broken, making the objects unusable. If the underlying data file still exists, you can use restore to overcome the problem.
Here we simulate this behaviour on the matrix B, using save.image.
B
## A mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 0 0 0 0
## [2,] 3 0 0 0 0
## [3,] 4 0 0 0 0
## [4,] 5 0 0 0 0
## [5,] 0 0 0 0 0
rdata_file <- tempfile(fileext = ".rda")
save.image(rdata_file)
Now we erase B:
rm(B)
And we load the saved image:
load(rdata_file)
B
## A mmatrix with a broken external ptr ! Try using restore()
The pointer in B is broken, but can be restored as follows:
B <- restore(B)
B
## A mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/integers120
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 0 0 0 0
## [2,] 3 0 0 0 0
## [3,] 4 0 0 0 0
## [4,] 5 0 0 0 0
## [5,] 0 0 0 0 0
You can create a copy with copy. This will also create a new file.
C <- copy(B)
C
## A mmatrix with 12 rows and 10 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae4d8ad073
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 0 0 0 0
## [2,] 3 0 0 0 0
## [3,] 4 0 0 0 0
## [4,] 5 0 0 0 0
## [5,] 0 0 0 0 0
This function has a filename argument. In particular, it can be used to save data that are stored in a temporary file.
The dimensions of an object can be accessed through dim.
a <- matrix(1:12, 3, 4)
A <- as.mmatrix(a)
A
## A mmatrix with 3 rows and 4 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7
## --- excerpt
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
dim(A)
## [1] 3 4
You can change the dimensions:
dim(A) <- c(4, 3)
A
## A mmatrix with 4 rows and 3 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7
## --- excerpt
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
Setting the dimensions to NULL creates a mvector:
dim(A) <- NULL
A
## A mvector of length 12
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7
## --- excerpt
## [1] 1 2 3 4 5
Similarly, you can obtain an marray:
dim(A) <- c(2,2,3)
A
## A marray with dimensions 2 2 3
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7
You can access elements of a memory-mapped object just as you would with regular R objects.
Let us create a memory-mapped matrix:
a <- matrix(sample(0:99, 2500, TRUE), 50, 50)
A <- as.mmatrix(a)
Accessing a single element:
A[1,1]
## [1] 1
Accessing a row:
A[1,]
## [1] 1 73 97 81 34 34 11 4 37 29 9 96 95 3 55 52 48 37 4 48 56 83 79 2 22 95 94
## [28] 81 91 55 58 90 11 88 89 75 40 77 68 8 53 10 70 33 88 19 52 67 98 99
The result here is an R object. This behaviour actually depends on the size of the result: by default, an R object is returned if the result’s size is less than one million, and a memory-mapped object is returned otherwise.
This can be changed through the option max.size, as follows:
houba(max.size = 20)
## $max.size
## [1] 20
And now, accessing the first row returns a new memory-mapped object:
A[1,]
## A mmatrix with 1 rows and 50 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae7743c7a6
## --- excerpt
## [1] 1 73 97 81 34
Again, you can use R syntax to assign values:
A[1,1] <- 0
A[2,] <- 10
A
## A mmatrix with 50 rows and 50 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae5f63f5b3
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 73 97 81 34
## [2,] 10 10 10 10 10
## [3,] 7 73 44 6 8
## [4,] 66 64 27 7 71
## [5,] 24 58 93 65 2
Assignment with another memory-mapped object is also possible:
V <- as.mvector(1:50, "int")
A[3,] <- V
A
## A mmatrix with 50 rows and 50 cols
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae5f63f5b3
## --- excerpt
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 73 97 81 34
## [2,] 10 10 10 10 10
## [3,] 1 2 3 4 5
## [4,] 66 64 27 7 71
## [5,] 24 58 93 65 2
There is no type promotion. Assigning a floating point value to an integer object will cast it to integer:
A[1,1] <- pi
A[1,1]
## [1] 3
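The value 3 above is consistent with truncation towards zero. As a point of comparison only, base R's as.integer behaves this way; whether houba's internal cast treats every case (negative values, for instance) identically is an assumption here:

```r
# Base R conversion from double to integer drops the fractional part
pi_as_int <- as.integer(pi)
pi_as_int

# Truncation is towards zero, not rounding to the nearest integer
truncated <- as.integer(c(2.9, -2.9))
truncated
```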
Arithmetic operations are available with the usual R syntax.
a <- matrix(sample.int(16), 4, 4)
A <- as.mmatrix(a, datatype = "float")
A <- 1 + 2*A
A
## A mmatrix with 4 rows and 4 cols
## data type: float
## File: /tmp/Rtmppw0bVM/mmatrix209eae52a8e8c4
## --- excerpt
## [,1] [,2] [,3] [,4]
## [1,] 33 19 27 5
## [2,] 31 9 25 23
## [3,] 29 11 21 15
## [4,] 17 3 13 7
Memory-mapped objects can be used for both operands:
B <- A + 2
C <- A / B
C
## A mmatrix with 4 rows and 4 cols
## data type: float
## File: /tmp/Rtmppw0bVM/mmatrix209eae7fcdb445
## --- excerpt
## [,1] [,2] [,3] [,4]
## [1,] 0.9428571 0.9047619 0.9310345 0.7142857
## [2,] 0.9393939 0.8181818 0.9259259 0.9200000
## [3,] 0.9354839 0.8461539 0.9130435 0.8823529
## [4,] 0.8947368 0.6000000 0.8666667 0.7777778
There is no type promotion. If the two operands have different types, the type of the result is the type of the left operand.
Let’s create two vectors with types float and integer:
A <- as.mvector(seq(0, 1, length = 11), datatype = "float")
B <- as.mvector(0:10, datatype = "integer")
Now A + B has type float:
A + B
## A mvector of length 11
## data type: float
## File: /tmp/Rtmppw0bVM/mmatrix209eae6051408
## --- excerpt
## [1] 0.0 1.1 2.2 3.3 4.4
and B + A has type integer:
B + A
## A mvector of length 11
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae6795eb37
## --- excerpt
## [1] 0 1 2 3 4
We can modify the data without creating copies:
V <- as.mvector(1:20, "float")
W <- as.mvector(sample.int(20))
inplace.sum(V, 1) # Add 1 to all elements
inplace.prod(V, W) # Multiply elements of V by elements of W
inplace.minus(V, c(1,2)) # Subtract c(1,2) from all elements (recycling)
inplace.div(V, 4) # Divide all elements by 4
inplace.opposite(V) # Take opposite of all elements
inplace.inverse(V) # Take reciprocal of all elements
V
## A mvector of length 20
## data type: float
## File: /tmp/Rtmppw0bVM/mmatrix209eae62424780
## --- excerpt
## [1] -0.10810811 -0.08163265 -0.06779661 -0.30769232 -0.04210526
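Taken together, the chain of in-place operations above computes -4 / ((V + 1) * W - c(1, 2)) element by element. The base R sketch below reproduces it on regular vectors, without houba. The values of w are hypothetical: they were picked so that the results agree with the first four entries of the excerpt printed above, up to single-precision rounding.

```r
v <- as.numeric(1:4)
w <- c(19, 17, 15, 3)  # hypothetical values, chosen to match the excerpt

v <- v + 1          # counterpart of inplace.sum(V, 1)
v <- v * w          # counterpart of inplace.prod(V, W)
v <- v - c(1, 2)    # counterpart of inplace.minus(V, c(1,2)), recycled
v <- v / 4          # counterpart of inplace.div(V, 4)
v <- -v             # counterpart of inplace.opposite(V)
v <- 1 / v          # counterpart of inplace.inverse(V)
v

# Same result from the closed-form expression
closed_form <- -4 / ((1:4 + 1) * w - c(1, 2))
same <- isTRUE(all.equal(v, closed_form))
same
```

Unlike the in-place functions, each base R step here allocates a new vector; avoiding those copies is the point of the in-place versions.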
houba provides analogs to rowSums, rowMeans, colSums, colMeans, and apply for memory-mapped matrices (but not for memory-mapped arrays).
a <- matrix(sample.int(100), 10, 10)
A <- as.mmatrix(a)

# Row sums and means
rowSums(A)
## [1] 570 519 545 415 503 541 344 445 598 570
rowMeans(A)
## [1] 57.0 51.9 54.5 41.5 50.3 54.1 34.4 44.5 59.8 57.0
Here the result is an R object, because its size does not exceed the value of the option max.size. Otherwise, it will be a memory-mapped object:
houba(max.size = 5)
## $max.size
## [1] 5
rowSums(A)
## A mvector of length 10
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae253fd3be
## --- excerpt
## [1] 570 519 545 415 503
The apply method will extract rows or columns to R objects. Again, the type of the result depends on the max.size option. If the size of the result is larger than max.size, a memory-mapped object is returned:
houba(max.size = 5)
## $max.size
## [1] 5
apply(A, 1, sd)
## A mvector of length 10
## data type: double
## File: /tmp/Rtmppw0bVM/mmatrix209eae5adedfe8
## --- excerpt
## [1] 26.91963 29.75623 27.64155 36.13324 22.04566
The data type of this object will be double or integer, depending on the values returned by the function. For example, the sum function will return integers:
apply(A, 1, sum)
## A mvector of length 10
## data type: integer
## File: /tmp/Rtmppw0bVM/mmatrix209eae70957650
## --- excerpt
## [1] 570 519 545 415 503
And if the size of the result is smaller than max.size, an R object is returned:
houba(max.size = 1e6)
## $max.size
## [1] 1e+06
apply(A, 1, sd)
## [1] 26.91963 29.75623 27.64155 36.13324 22.04566 30.06456 32.67415 27.80587 31.73081
## [10] 26.43230
You may e-mail the author for bug reports, feature requests, or contributions. The source of the package is on GitHub.
Houba, hop!