R - Generating a very large matrix of string combinations using combn() and the bigmemory package
I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export it to a CSV file.
I'm running R on EC2 on an m1.large instance with 64-bit Ubuntu. When I use combn(x, 3) I get an out-of-memory error:

    Error: cannot allocate vector of size 9.0 Gb

The resulting matrix would have C(1344, 3) = 403,716,544 rows and 3 columns, which is the transpose of the result of combn().
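The row count and the size in the error message can be checked with a few lines of arithmetic. This is a rough sketch: the 8 bytes per cell is an assumption (one pointer per string in a character matrix on 64-bit R), and it ignores the storage for the strings themselves.

```r
# Number of unordered triples drawn from 1,344 items.
n_rows <- choose(1344, 3)
print(format(n_rows, big.mark = ","))   # "403,716,544"

# Assumption: each cell of a character matrix costs one 8-byte pointer.
n_cells <- n_rows * 3
gib <- n_cells * 8 / 2^30
print(round(gib, 1))                    # roughly 9, in line with the 9.0 Gb in the error
```

Under that assumption the full result needs about 9 GB in one contiguous allocation, which is why combn(x, 3) fails before anything can be handed to bigmemory.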
I thought of using the bigmemory package to create a file-backed big.matrix so that I can assign the results of combn() to it. I can create a preallocated big matrix:
    library(bigmemory)
    x <- as.character(1:1344)
    combos <- 403716544
    test <- filebacked.big.matrix(nrow = combos, ncol = 3,
                                  init = 0, backingfile = "test.matrix")
But when I try to assign the values:

    test <- combn(x, 3)

I still get the same error:

    Error: cannot allocate vector of size 9.0 Gb

I even tried coercing the result of combn(x, 3), but I think that because combn() itself returns an error, the big.matrix function doesn't work either:

    test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
    Error: cannot allocate vector of size 9.0 Gb
    Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") :
      error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'

Is there a way to combine these two functions to get what I need? Are there other ways of achieving this? Thanks.
You could first find all 2-way combinations, and then combine each of them with a third value, writing the result to file every time. This takes a lot less memory:
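    combn.mod <- function(x, fname) {
      # All 2-way combinations, kept as a list of pairs
      tmp <- combn(x, 2, simplify = FALSE)
      n <- length(x)
      # The last two values never need to be the added third element
      for (i in x[-c(n, n - 1)]) {
        # Drop all combinations that contain value i
        id <- which(!unlist(lapply(tmp, function(t) i %in% t)))
        tmp <- tmp[id]
        # Add i to all other combinations and write them to file
        out <- do.call(rbind, lapply(tmp, c, i))
        write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
      }
    }

    combn.mod(x, "f:/tmp/test.txt")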
This is not as general as Joshua's answer, though; it is written specifically for your case. I guess it is faster, again for this particular case, but I didn't make the comparison. The function works on my computer using a little over 50 MB (roughly estimated) when applied to your x.

EDIT

On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be asking for the correct answer to the wrong question here...
PROOF OF CONCEPT:

I replaced the write line with tt[[i]] <- out, added tt <- list() before the loop, and added return(tt) after it. Then:
    > do.call(rbind, combn.mod(letters[1:5]))
          [,1] [,2] [,3]
     [1,] "b"  "c"  "a"
     [2,] "b"  "d"  "a"
     [3,] "b"  "e"  "a"
     [4,] "c"  "d"  "a"
     [5,] "c"  "e"  "a"
     [6,] "d"  "e"  "a"
     [7,] "c"  "d"  "b"
     [8,] "c"  "e"  "b"
     [9,] "d"  "e"  "b"
    [10,] "d"  "e"  "c"
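The modification described above can be sketched as a standalone function, which makes it easy to check the approach against combn() on a small input. The name combn.mod.list and the helper canon are hypothetical and not part of the original answer. Using vapply instead of unlist(lapply(...)) is a small substitution of my own. The sketch assumes x is a character vector, so that tt[[i]] indexes by name.

```r
# Hypothetical list-returning variant of combn.mod, for checking only.
# Assumes x is a character vector (tt[[i]] then indexes by name).
combn.mod.list <- function(x) {
  tmp <- combn(x, 2, simplify = FALSE)   # all pairs, as a list
  n <- length(x)
  tt <- list()
  for (i in x[-c(n, n - 1)]) {
    # Keep only the pairs that do not contain i
    keep <- !vapply(tmp, function(pair) i %in% pair, logical(1))
    tmp <- tmp[keep]
    # Append i as the third element and store instead of writing to file
    tt[[i]] <- do.call(rbind, lapply(tmp, c, i))
  }
  tt
}

res <- do.call(rbind, combn.mod.list(letters[1:5]))

# Hypothetical helper: order-independent representation of each row,
# so the result can be compared with combn() as a set of triples.
canon <- function(m) sort(apply(m, 1, function(r) paste(sort(r), collapse = "-")))
same_triples <- identical(canon(res), canon(t(combn(letters[1:5], 3))))
print(nrow(res))       # 10
print(same_triples)    # TRUE
```

Each triple appears exactly once, although the order within a row differs from what combn() produces, which does not matter when order is irrelevant.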