Saturday 28 October 2017

Passing data-frames from Java to R

Recently we needed to implement a scenario in which Java code incrementally hands data over to R. The data is accumulated R-side into one "uber" data-frame, which is then processed, e.g.:
 
    process.uber.data.frame <- function(uber.data.frame) {
        # process the uber.data.frame here
    }

In terms of technology, on the Java side we had rJava and REngine. On the R side, in addition to R ver. 3.4.x, we had dplyr.
R has extensive capabilities for converting CSV files into data-frames, and the approach we took leverages those:
0. While there are more chunks available:
    1. Java-side: write chunk of data into a csv file
    2. Java-side: notify R of the csv file
    3. R-side: read the csv file and append it to the uber data-frame
    4. Java-side: repeat from 0
Once Java-side has consumed all the data:
5. Java-side: invoke uber data-frame processing on R-side
To support this logic the Java pseudo code looks as follows:
    while (moreChunksAvailable) {
        String path2csv = write2csv( getNextChunk() );
        // quote the path so that R receives it as a string literal
        rEngine.parseAndEval( String.format("process.chunk('%s')", path2csv) );
    }

    rEngine.parseAndEval(  "process.done.all.chunks()"   );
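
The write2csv call in the loop above is left undefined. A minimal sketch of what it could look like follows, using only the Java standard library. The class name ChunkCsvWriter, the header/rows representation of a chunk and the processChunkCall helper are assumptions made for illustration, not part of the original code. The sketch writes a semicolon-separated file, since that is the default separator of read.csv2, and it does not escape fields that themselves contain semicolons or quotes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ChunkCsvWriter {

    // Hypothetical helper: writes one chunk (a header plus rows) to a temporary
    // semicolon-separated file. read.csv2 defaults to sep=";" and dec=",", so
    // numeric values would need a decimal comma (or dec="." on the R side).
    static Path write2csv(List<String> header, List<List<String>> rows) throws IOException {
        Path csv = Files.createTempFile("chunk-", ".csv");
        try (BufferedWriter out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            out.write(String.join(";", header));
            out.newLine();
            for (List<String> row : rows) {
                out.write(String.join(";", row));
                out.newLine();
            }
        }
        return csv;
    }

    // Hypothetical helper: builds the text of the R call. Backslashes are turned
    // into forward slashes (R accepts them in Windows paths) and single quotes
    // are escaped, so the path survives as an R string literal.
    static String processChunkCall(Path csv) {
        String p = csv.toAbsolutePath().toString()
                .replace("\\", "/")
                .replace("'", "\\'");
        return "process.chunk('" + p + "')";
    }

    public static void main(String[] args) throws IOException {
        Path csv = write2csv(
                List.of("id", "value"),
                List.of(List.of("1", "a"), List.of("2", "b")));
        // The resulting string is what would be passed to rEngine.parseAndEval(...)
        System.out.println(processChunkCall(csv));
        Files.delete(csv);
    }
}
```

In the real loop the temporary file should only be deleted after R has finished reading it, i.e. after parseAndEval returns.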

On the R-side we have:

    library(dplyr)

    process.chunk <- function( path2csv ) {
        chunk.df <- read.csv2(path2csv, ....)

        # chunk specific logic here

        if (!exists("uber.data.frame")) {
            assign("uber.data.frame",  chunk.df,   envir = .GlobalEnv)
        } else {
            assign("uber.data.frame", bind_rows(uber.data.frame, chunk.df),  envir = .GlobalEnv)
        }

        # partial uber data-frame logic here
    } 


    process.done.all.chunks <- function() {
        process.uber.data.frame( uber.data.frame )    
    }


    process.uber.data.frame <- function(uber.data.frame) {
        # process the uber.data.frame
    }

A note on dplyr and bind_rows: R has many ways of adding rows to an existing data-frame, with rbind probably being the simplest. In our case we found dplyr::bind_rows to be much more memory efficient than rbind.