Saturday, September 17, 2011

RCaller: Support for sequential commands with a single process

I think this revision will be the foundation of version 2.1. RCaller is known to be slow, but it is meant to be the easiest way of calling R from Java.

Finally, I have implemented the method runAndReturnResultOnline() for running sequential commands in a single process. What does this mean? Let me explain with an example:

Suppose that you want to perform a simulation study to measure the success of your new procedure. For this, you decide to draw random numbers from a distribution, calculate something in R and handle the results in Java. Without this method, RCaller creates a separate Rscript process for every single iteration, which causes too many operating system calls.
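To make the scenario concrete, here is a minimal sketch of how the R commands for each iteration could be generated on the Java side. The class name SimulationCommands, the helper commandsFor, the seed, the sample size and the R variable names x and m are all assumptions made for illustration; none of them is part of RCaller.

```java
import java.util.ArrayList;
import java.util.List;

class SimulationCommands {

    // Builds the R commands for one iteration of a hypothetical simulation.
    // The seed, the sample size and the names x and m are illustrative only.
    static List<String> commandsFor(int iteration, int sampleSize) {
        List<String> code = new ArrayList<>();
        code.add("set.seed(" + iteration + ")");     // reproducible draw per iteration
        code.add("x <- rnorm(" + sampleSize + ")");  // random numbers from a normal distribution
        code.add("m <- mean(x)");                    // the statistic to hand back to Java
        return code;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            // With RCaller, each line would be passed to addRCode(...) and the
            // result fetched with runAndReturnResultOnline("m").
            for (String line : commandsFor(i, 100)) {
                System.out.println(line);
            }
        }
    }
}
```

Every iteration only produces a few lines of R code, so starting a new process for each of them is where most of the time goes.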

The latest release of RCaller includes this method. Let's have a look at the test file:


  @Test
  public void onlineCalculationTest() {
    RCaller rcaller = new RCaller();
    rcaller.setRExecutable("/usr/bin/R");
    rcaller.cleanRCode();
    rcaller.addRCode("a<-1:10");
    rcaller.runAndReturnResultOnline("a");
    assertEquals(rcaller.getParser().getAsIntArray("a")[0], 1);

    rcaller.cleanRCode();
    rcaller.addRCode("b<-1:10");
    rcaller.addRCode("m<-mean(b)");
    rcaller.runAndReturnResultOnline("m");
    assertEquals(rcaller.getParser().getAsDoubleArray("m")[0], 5.5, 0.000001);

    rcaller.cleanRCode();
    rcaller.addRCode("a<-1:99");
    rcaller.addRCode("k<-median(a)");
    rcaller.runAndReturnResultOnline("k");
    assertEquals(rcaller.getParser().getAsDoubleArray("k")[0], 50.0, 0.000001);
  }
 

In the first stage, we create an integer vector and get its first element. In the second, we create the same integer vector under a different name and calculate its arithmetic mean. In the last, we recreate the vector a, now holding 1 to 99, and get its median, which is 50.
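The expected values in the assertions can be checked without R. The following is a small plain-Java sketch, not RCaller code; the class ExpectedValues and its two helpers are names I made up for this check.

```java
import java.util.stream.IntStream;

class ExpectedValues {

    // Arithmetic mean of the integers from..to (inclusive), like mean(from:to) in R.
    static double mean(int from, int to) {
        return IntStream.rangeClosed(from, to).average().orElse(Double.NaN);
    }

    // Median of the integers from..to (inclusive), like median(from:to) in R.
    static double median(int from, int to) {
        int[] v = IntStream.rangeClosed(from, to).toArray(); // already in ascending order
        int n = v.length;
        return n % 2 == 1 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(mean(1, 10));   // prints 5.5
        System.out.println(median(1, 99)); // prints 50.0
    }
}
```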

This example uses the same RCaller object throughout. In the first stage, the R process (the executable is /usr/bin/R on my Ubuntu Linux) is started once. In the second stage the same R process is used and no new process is created. In this stage, the vector a is still alive and accessible. In the last stage, b is still alive and a is recreated. So this example does not make R open and close three times, but only once.

This modification speeds up RCaller, but it can still be considered slow.
However, it is still easy to use and much faster than the previous implementation.

Have Fun!


5 comments:

  1. Hello World :) ,

    For a few weeks I have been doing scientific calculations with R at work. For my project I had to carry out robust regression calculations. After the calculations showed positive results, the tricky part began, because I had to evaluate other datasets systematically in the same way. As a Java programmer I thought it wouldn't be so difficult to automate the R calculations from Java, but it was very painful until I found the RCaller solution. I know very well how valuable your solution is. It needs no special environment settings; just add the RCaller library and here we go! Congratulations and thank you very much.

    Greetings :)

  2. Thank you for using RCaller!

    All the best!

  3. Hello,

    We are trying to run an R script using RCaller but are running into an exception. The exception and the code are below.

    Exception:
    Parse error at line 1, column 1. Encountered: [
    at bsh.Parser.generateParseException(Unknown Source)
    at bsh.Parser.jj_consume_token(Unknown Source)
    at bsh.Parser.Line(Unknown Source)
    at bsh.Interpreter.Line(Unknown Source)
    at bsh.Interpreter.eval(Unknown Source)
    at bsh.Interpreter.eval(Unknown Source)
    at bsh.Interpreter.eval(Unknown Source)
    at rcaller.StreamReader.run(RCaller.java:79)



    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import rcaller.RCaller;

    public class Rcaller1 {

        public static void main(String[] args) throws Exception {
            RCaller caller = new RCaller();
            caller.setRScriptExecutableFile(
                    "D:\\Satish\\C\\recon\\risk\\R\\R-2.14.2\\bin\\i386\\Rscript");
            caller.RunRCode(
                    readFile("D:\\Satish\\C\\recon\\risk\\R\\R-2.14.2\\bin\\i386\\function.R"),
                    false, false);
        }

        private static String readFile(String file) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            String line = null;
            StringBuilder stringBuilder = new StringBuilder();
            String ls = System.getProperty("line.separator");
            while ((line = reader.readLine()) != null) {
                stringBuilder.append(line);
                stringBuilder.append(ls);
            }
            String strg = stringBuilder.toString();
            //System.out.println(strg);
            return strg;
        }

    }


    function.R
    twosam <- function(y1, y2) {
      n1 <- length(y1); n2 <- length(y2)
      yb1 <- mean(y1); yb2 <- mean(y2)
      s1 <- var(y1); s2 <- var(y2)
      s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
      tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
      tst
    }
    male <- c(60,61,62,63,64,65)
    female <- c(50,51,52,53,54,55)
    tstat <- twosam(male,female)
    tstat


    (2) How do we pass values to the vectors male and female from the Java program instead of hardcoding them in the R script? How do we read the return value "tstat" from the Java program? Is it possible to invoke an R function (which is in the R script file) from the Java program?


    (3) What is the difference between rcaller.setRExecutable("") and rcaller.setRScriptExecutableFile("")?

    (4) Performance differences between:
    RCaller - uses a separate process per invocation, which is slow. Multithreading and
    concurrency - should be okay, as each invocation from RCaller is a separate
    process.

    Rserve - uses socket connections - is it multithreaded and thread safe? Does it
    require a pool of connections? If there are more hits from the Java client, does it
    queue the requests?

    JRI - does it require the R script to be converted to a DLL/shared library and accessed
    through JNI from Java? It is not multithreaded and thread safe - correct?


    Thanks

  4. Thanks for building RCaller - it's an awesome tool that has made my life a heck of a lot easier.

    I had a question for you about managing the processes (Rterm.exe) that are created and kept alive as you use runAndReturnResultOnline. If I use RCaller in the following way:

    for (int i = 0; i < 100; i++) {
    RCaller rcaller = new RCaller();
    rcaller.setRExecutable("/usr/bin/R");
    rcaller.cleanRCode();
    rcaller.addRCode("a<-1:9");
    rcaller.runAndReturnResultOnline("a");
    assertEquals(rcaller.getParser().getAsIntArray("a")[0], 1);

    rcaller.cleanRCode();
    rcaller.addRCode("k<-median(a)");
    rcaller.runAndReturnResultOnline("k");
    assertEquals(rcaller.getParser().getAsDoubleArray("k")[0], 5.0, 0.000001);
    }

    on each iteration a separate process is spawned, correctly siloing the different instances of RCaller. However, those processes stay alive and stack up - by the end of the loop, you will have 100 different Rterm.exe processes alive and kicking.

    The documentation in the source implies that you should call rcaller.stopStreamConsumers() to stop the consumers of the process so that the OS will know it's OK to kill the idle process. However, I tried this in my environment (Windows 8, Java 1.7.0_25, RCaller 2.1.1-SNAPSHOT) and the processes still remain alive. I'm finding I have to forcefully kill the processes (Runtime.getRuntime().exec("taskkill /F /IM Rterm.exe"))

    Is this a common problem for others using the runAndReturnResultOnline method?

    Replies
    1. Hi,
      I'm curious if you've found a solution to this issue regarding the termination of RTerm.exe processes. I'm seeing the same thing as well. I too found the documentation on stopStreamConsumers(), hoping that this would allow RCaller to terminate the connection. However, no luck with this.

      In my case, however, the processes terminate when the application JVM exits. So they persist only for the JVM session.


Thanks