Saturday 28 October 2017

Passing data-frames from Java to R

Recently we needed to implement a scenario where Java code was to, incrementally, hand over data to R, this data was to be accumulated R-side into one "uber" data-frame which was then processed, e.g:
 
    process.uber.data.frame <- function(uber.data.frame) {
        # process the uber.data.frame here
}
In terms of technology on Java side we had rJava and REngine. On R side, in addition to R ver. 3.4.x, we had dplyr.
R has extensive capabilities of converting CSV files into data-frames, and the approach we took takes leverages those:
0. While there are more chunks available:
    1. Java-side: write chunk of data into a csv file
    2. Java-side: notify R of the csv file
    3. R-side: read the csv file and append it to the uber data-frame
    4. Java-side: repeat from 0
Once Java-side has consumed all the data:
5. Java-side: invoke uber data-frame processing on R-side
To support this logic the Java pseudo code looks as follows:
    while (moreChunksAvailable) {  
        path2csv = write2csv ( getNextChunk() )
        rEngine.parseAndEval(  String.format("process.chunk( %s )", path2csv)   );
    }

    rEngine.parseAndEval(  "process.done.all.chunks()"   );

On the R-side we have:

    library(dplyr)

    process.chunk <- function( path2csv ) {
        chunk.df <- read.csv2(path2csv, ....)

        # chunk specific logic here

        if !exists("uber.data.frame") {
            assign("uber.data.frame",  chunk.df,   envir = .GlobalEnv)
        } else {
            assign("uber.data.frame", bind_rows(uber.data.frame, chunk.df),  envir = .GlobalEnv)
        }

        # partial uber data-frame logic here
    } 


    process.done.all.chunks <- function {
        process.uber.data.frame( uber.data.frame )    
    }


    process.uber.data.frame <- function(uber.data.frame) {
        # process the uber.data.frame
    }
A note on dplyr and bind_rows. R has many ways of adding rows to an existing data-frame with rbind probably being the simplest. We found dplyr.bind_rows to be much more memory efficient than rbind.

Tuesday 1 March 2016

All Armstrong numbers calculator

Briefly, an Armstrong number  "is a number that is the sum of its own digits each raised to the power of the number of digits". For example `8208` is an Armstrong number because 8208 = 8^4 + 2^4 + 0^4 + 8^4. We will say that `8208` is a level-4 A-number and call the power sum of its digits, an aSum. Single digit numbers are obviously (level-1) Armstrong numbers , all the A-numbers up to level 4 are:
    1 2 3 4 5 6 7 8 9 
    153 370 371 407 
    1634 8208 9474
There is a finite number of A-numbers, this is because the magnitude of a number grows quicker then it's aSum so that after certain point, the aSum falls behind. The largest A-number is a level-39ner.
This code calculates all A-numbers up to level-39 with the following timings *:
  
  Level-15 0  sec.   41 numbers
  Level-18 2  sec.   46 numbers
  Level-20 6  sec.   51 numbers
  Level-21 11 sec.   53 numbers
  Level-22 15 sec    53 numbers
  Level-23 21 sec.   58 numbers
  Level-25 46 sec    66 numbers
  Level-39 3518 sec. (58 min) 88 numbers

* i7-4790K CPU @ 4.00GHz

Thursday 18 February 2016

Synchronized vs ReentrantLock

Java Concurrency in Practice (2006) has a section titled "Choosing Between Synchronized and ReentrantLock". As far a performance in concerned we have "The performance of ReentrantLock appears to dominate that of intrinsic locking, winning slightly on Java 6 and dramatically on Java 5.0" as well as "Future performance improvements are likely to favor synchronized over ReentrantLock".

Here is a quick throughput benchmark comparing the two on the current Java 8

The setup
Java version:
  > java -version                                                 
  java version "1.8.0_71"                                         
  Java(TM) SE Runtime Environment (build 1.8.0_71-b15)            
  Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode) 
wmic cpu output (partial):
Name                                      NumberOfCores
Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz  4              
In the JMH benchmark attached bellow, a shared counter is incremented by multiple threads. Before incrementing, a thread acquires a lock using either synchronized or a ReentrantLock. It is executed with N = [1..4]:
> java -jar target\benchmarks.jar concurrent -t N -wi 10 -i 20 -rf csv -tu ms -si true  
Here is the complete source (Gist link):
package concurrent;

/**
 * java -jar target\benchmarks.jar concurrent -t N -wi 10 -i 20 -rf csv -tu ms -si true
 */

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

@State(Scope.Benchmark)
public class ReentrantLockVsSynchronized {

    @Param({"ReentrantLock", "Synchronized"})
    private String lockMode;

    @State(Scope.Benchmark)
    public static class Counter {
        private static long value;
    }

    @State(Scope.Benchmark)
    public static class JvmLock {
        private static Object lock = new Object();
    }

    @State(Scope.Benchmark)
    public static class RLock {
        private static Lock lock = new ReentrantLock();
    }

    @Benchmark
    public void measure(JvmLock jvmLock, RLock rLock, Counter counter) {
        switch (lockMode) {
            case "Synchronized":
                doSynchronized(jvmLock, counter);
            case "ReentrantLock":
                doReentrantLock(rLock, counter);
        }
    }

    public void doReentrantLock(RLock rLock, Counter counter) {
        rLock.lock.lock();
        try {
            Counter.value++;
        } finally {
            rLock.lock.unlock();
        }
    }

    public void doSynchronized(JvmLock jvmLock, Counter counter) {
        synchronized (jvmLock.lock) {
            Counter.value++;
        }
    }
}


The results
In these tests, the throughput of the ReentrantLock is two to three times higher than that of synchronized. There is also a funny dip when N = 2, this was observed before here .

Friday 22 January 2016

MongoDb - inserting documents throughput

Code comparing various approaches to inserting documents into MongoDB. The sources along with a more detailed description are on GitHub. Here is little chart showing the throughput of each approach.


More details in README.md

Monday 28 December 2015

Take that Turing and Church!!!



Halting problem is undecidable ala Dr Seuss:   

No program can say what another will do. 
Now, I won’t just assert that, I’ll prove it to you:
I will prove that although you might work til you drop,
you can’t predict whether a program will stop.

Imagine we have a procedure called P
that will snoop in the source code of programs to see
there aren’t infinite loops that go round and around;
and P prints the word “Fine!” if no looping is found.

You feed in your code, and the input it needs,
and then P takes them both and it studies and reads
and computes whether things will all end as they should
(as opposed to going loopy the way that they could).

Well, the truth is that P cannot possibly be,
because if you wrote it and gave it to me,
I could use it to set up a logical bind
that would shatter your reason and scramble your mind.

Here’s the trick I would use – and it’s simple to do.
I’d define a procedure – we’ll name the thing Q –
that would take any program and call P (of course!)
to tell if it looped, by reading the source;

And if so, Q would simply print “Loop!” and then stop;
but if no, Q would go right back to the top, 
and start off again, looping endlessly back,
til the universe dies and is frozen and black.

And this program called Q wouldn’t stay on the shelf;
I would run it, and (fiendishly) feed it itself.
What behaviour results when I do this with Q?
When it reads its own source, just what will it do?

If P warns of loops, Q will print “Loop!” and quit;
yet P is supposed to speak truly of it.
So if Q’s going to quit, then P should say, “Fine!” –
which will make Q go back to its very first line!

No matter what P would have done, Q will scoop it:
Q uses P’s output to make P look stupid.
If P gets things right then it lies in its tooth;
and if it speaks falsely, it’s telling the truth!

I’ve created a paradox, neat as can be –
and simply by using your putative P.
When you assumed P you stepped into a snare;
Your assumptions have led you right into my lair.

So, how to escape from this logical mess?
I don’t have to tell you; I’m sure you can guess.
By reductio, there cannot possibly be
a procedure that acts like the mythical P.

You can never discover mechanical means
for predicting the acts of computing machines.
It’s something that cannot be done. So we users
must find our own bugs; our computers are losers!



Geoffrey K. Pullum

Monday 23 November 2015

Currying in Kotlin

As an example consider summing over a range of integers applying a provided function on each. Something like
fun sigma(range: IntRange, f: (Int) -> Int): Int {
    var result = 0
    for (i in range) result += f(i)
    return result
}
and we have
println(sigma(1..5) { x -> x  })    // prints: 15
println(sigma(1..5) { x -> x * x }) // prints: 55
We would like to to fix f() so that we could define:
val sigmaOfSum = sigma { x -> x }
val sigmaOfProduct = sigma { x -> x * x }
without committing to a range, such that later on, we can invoke
println(sigmaOfSum(1..5))     // prints: 15
println(sigmaOfProduct(1..5)) // prints: 55
This can be done in straight Kotlin:
fun sigma(f: (Int) -> Int): (IntRange) -> Int {
    fun applyF(range: IntRange): Int {
        var result = 0
        for (i in range) result += f(i)
        return result
    }
    return ::applyF
}
Here, the (higher order) function sigma takes f() as a parameter and returns a function that applies f() on the provided range.
A more down to earth example; suppose we want to perform a DB lookup. In principal, we need specify a key/query and the th DB instance but we want to break these two things apart in the spirit of the first example:
fun lookUp(db: Map): (String) -> String? {
    fun dbLookup(key: String): String? {
        return db.get(key)
    }
    return ::dbLookup
}
now the following works:
val customerLookup = lookUp(mapOf("1" to "Customer1", "2" to "Customer2"))
val productLookup = lookUp(mapOf("1" to "Product1", "2" to "Product2"))

println(customerLookup("1")) // prints Customer1
println(productLookup("1"))  // prints Product1
Our lookUp() function is quite simple and can be collapsed to:
fun lookUp(map: Map): (String) -> String? = { map.get(it) }   

Friday 20 November 2015

Class delegating in Kotlin

The Kotlin language documentation is somewhat patchy and the class delegating functionality is under documented. To see what it does it's best to look at an example:
class Customer()
class Product()

interface CustomersFinder {
    fun findCustomers(query: String): Set<Customer>
}

interface ProductFinder {
    fun findProducts(query: String): Set<Product>
}
Now, suppose we want to have a MegaFinder class that is capable of querying both Customers and Products. The standard way to do this is

  1. pass instances of the finders to the MegaFinder
  2. have the MegaFinder stash those instances in private properties, and
  3. call on them as needed

All that is in the realm of boilerplate stuff and Kotlin has a concise alternative. This is how it looks like:
class MegaFinder(customerFinder: CustomerFinder, productFinder: ProductFinder)
 : CustomerFinder by customerFinder, ProductFinder by productFinder 
{
    fun loadAll() {
        val customers = findCustomers("*:*")
        val products = findProducts("*:*")
        //...
    }
}
What this says is that MegaFinder implements the finder interfaces by delegating to the designated instances. It is more terse then the traditional approach with two caveats that I see
  1. The delegating approach requires the MegaFinder to implement the finder interfaces. This is not necessarily bad, but the freedom not to do that is not available
  2. When reading the code it is not immediately obvious where the findCustomers(...) and findProducts(...) are implemented - it is almost is if you need some sort of IDE to figure this out :-)