
## Tuesday, May 24, 2016

### RCaller 3.0 is released!

RCaller 3.0 is released with new features.

Visit http://mhsatman.com/rcaller-3-0 for the source code, compiled binaries, other downloads and the blog post.

Hope you enjoy the project!

## Thursday, March 19, 2015

### Why is R awesome?

For some people it is magic; others hate its notation (maybe you!). It has some weird rules, and maybe it is just a programming language like any other (that is also my opinion). Like other programming languages, R has good and bad properties, but I can say it is the best candidate for the toolbox of a statistician or a researcher who works on data analysis.

In this blog post, I collect 8 (from 0 to 7) nice properties of R. As a lecturer and researcher, I have found that many students understand statistical concepts better when I demonstrate them, and let the students work with them, using Monte Carlo simulations. In R, we can write compact code to demonstrate these concepts, which would be more difficult to implement in another programming language. R is not a simple toy either: we can always extend our knowledge and programming skills, and write better code by calling external code written in "real" programming languages (an old joke about real men using C).

So, why is R awesome?

0. Syntax of Algol Family

R has a weird assignment operator, but the rest of its syntax is similar to Algol family languages such as C, C++, Java and C#. R also has a facility similar to operator overloading (it is not exactly operator overloading): a single character or a sequence of characters enclosed in percent signs can be used as the name of a function, which then acts as a binary operator:

```> '%_%' <- function(a,b){
+    return(exp(a+b))
+ }
> 5 %_% 2
[1] 1096.633```

1. Vectors are primitive data types

In other members of the Algol family, a vector is a composite type written with an opening and a closing bracket: an array of primitives in C/C++ and an object in Java. In contrast, vectors are primitive in R, and binary operators are directly applicable to vectors and matrices. For example, estimating least squares coefficients is a single-line expression in R:

```> assign("x",cbind(1,1:30))
> assign("y",3+3*x[,2]+rnorm(30))
> solve(t(x) %*% x) %*% t(x) %*% y
[,1]
[1,] 2.858916
[2,] 3.003787
```
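As a quick check of the one-line estimator, the sketch below compares it with two other ways of solving the same least squares problem. The seed and the variable names beta_normal, beta_solve and beta_lm are illustrative assumptions, so the numbers differ from the output above.

```r
# Sketch: three ways of computing the same least squares coefficients.
set.seed(10)                                  # arbitrary seed for reproducibility
x <- cbind(1, 1:30)                           # design matrix with an intercept column
y <- 3 + 3 * x[, 2] + rnorm(30)               # true intercept 3, true slope 3

# Normal equations, written exactly as in the post.
beta_normal <- solve(t(x) %*% x) %*% t(x) %*% y

# Same linear system solved without forming the inverse explicitly.
beta_solve <- solve(crossprod(x), crossprod(x, y))

# R's built-in model fitting function.
beta_lm <- coef(lm(y ~ x[, 2]))

print(cbind(normal = as.vector(beta_normal),
            solve  = as.vector(beta_solve),
            lm     = unname(beta_lm)))
```

All three columns should agree up to floating point error; lm is usually the more convenient choice in practice because it also returns standard errors and diagnostics.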

This example shows the difference between a scalar and a vector:

```
> assign("x", c(1,2,3))
> assign("a", 5)
> typeof(x)
[1] "double"
> typeof(a)
[1] "double"
> class(x)
[1] "numeric"
> class(a)
[1] "numeric"
```

No difference!
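A short sketch can push this a little further: a "scalar" is simply a vector of length one, so it can be indexed and recycled like any other vector. The variable names below are illustrative.

```r
# Sketch: a scalar in R is a vector of length one.
x <- c(1, 2, 3)
a <- 5

lengths_seen <- c(length(x), length(a))   # lengths of the two objects
a_is_vector  <- is.vector(a)              # a scalar passes the vector test
same_as_c    <- identical(a, c(5))        # 5 and c(5) are the same object
product      <- x * a                     # a is recycled over the elements of x

print(lengths_seen)
print(product)
```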

2. Theorems come alive in minutes

Suppose that X is a random variable that follows an exponential distribution with rate = 5.
The sum or mean of randomly selected samples of size N approximately follows a normal distribution when N is large. This is the Central Limit Theorem illustrated with an example. Theorems are theorems, but you may want to see a fast demonstration (and perhaps an informal proof, for educational purposes only) and write a quick application. Writing code like this takes minutes if you use R.

```> assign("nsamp", 5000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
>
> for (i in 1:nsamp){
+     sums[i] <- sum(rexp(n = n, rate = theta))
+ }
> hist(sums)```
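Besides looking at the histogram, the simulated sums can be compared with what the Central Limit Theorem predicts. The sketch below assumes the same settings as above; the seed and the names theoretical_mean, theoretical_sd and within_one_sd are illustrative.

```r
# Sketch: compare simulated sums of exponential draws with CLT predictions.
set.seed(1)                      # arbitrary seed, only for reproducibility
nsamp <- 5000                    # number of simulated samples
n     <- 100                     # size of each sample
theta <- 5.0                     # rate of the exponential distribution

sums <- replicate(nsamp, sum(rexp(n = n, rate = theta)))

theoretical_mean <- n / theta        # E[sum] = n * (1 / rate)
theoretical_sd   <- sqrt(n) / theta  # sd[sum] = sqrt(n) * (1 / rate)

cat("mean:", mean(sums), "expected:", theoretical_mean, "\n")
cat("sd:  ", sd(sums),   "expected:", theoretical_sd, "\n")

# Share of sums within one standard deviation of the mean;
# this is about 0.68 for a normal distribution.
within_one_sd <- mean(abs(sums - theoretical_mean) <= theoretical_sd)
cat("within one sd:", within_one_sd, "\n")
```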

3. There is always a second plan for faster code

Now suppose that we are drawing 50,000 samples randomly using the code above. What would be the computation time?

```> assign("nsamp", 50000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
>
> s <- system.time(
+     for (i in 1:nsamp){
+         sums[i] <- sum(rexp(n = n, rate = theta))
+     }
+ )
>
> print(s)
user  system elapsed
0.582   0.000   0.572 ```

Drawing 50,000 samples of size 100 takes 0.582 seconds. Is that fast enough? Let's try to write it in C++!

```
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector CalculateRandomSums(int m, int n) {
    NumericVector result(m);
    int i;
    for (i = 0; i < m; i++){
        result[i] = sum(rexp(n, 5.0));
    }
    return(result);
}
```

After compiling the code with Rcpp, we can call the function CalculateRandomSums() from R.

```
> s <- system.time(
+     vect <- CalculateRandomSums(50000,100)
+ )
> print(s)
user  system elapsed
0.185   0.000   0.184
```

Now our R code is about 3.15 times slower than the code written in C++ (0.582 / 0.185 in user time).
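Before moving to C++, it may also be worth trying a vectorized formulation in plain R. The sketch below is not part of the original benchmark: it draws all random numbers in a single call and sums them row by row. The seed and the names loop_time, vec_time, sums_loop and sums_vec are illustrative, and the timings depend on the machine.

```r
# Sketch: loop-based vs. vectorized sums of exponential draws in plain R.
set.seed(42)                     # arbitrary seed, only for reproducibility
nsamp <- 50000                   # number of samples
n     <- 100                     # size of each sample
theta <- 5.0                     # rate of the exponential distribution

# Loop version, as in the post.
loop_time <- system.time({
  sums_loop <- numeric(nsamp)
  for (i in seq_len(nsamp)) {
    sums_loop[i] <- sum(rexp(n = n, rate = theta))
  }
})

# Vectorized version: one call to rexp, one matrix row per sample.
vec_time <- system.time({
  draws    <- matrix(rexp(nsamp * n, rate = theta), nrow = nsamp)
  sums_vec <- rowSums(draws)
})

print(rbind(loop = loop_time[1:3], vectorized = vec_time[1:3]))
```

The vectorized version trades memory for speed, since it holds all nsamp * n draws at once; whether that is acceptable depends on the problem size.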

4. Interaction with C/C++/Fortran is enjoyable

Since a large part of R itself is written in C, old C libraries can easily be migrated by writing wrapper functions that use SEXP data types. Rcpp masks these routines in a clever way. Fortran code is also linkable. Interaction with other languages makes it possible to use old libraries in R and to write faster new ones. It is also possible to create instances of R in C and C++ applications.
For an enjoyable example, have a look at section 3, "There is always a second plan for faster code".
The R package eive includes a small portion of C++ code and is a compact example of calling C++ functions from within R. Accessing C++ objects from R is also possible thanks to Rcpp; the post "Accessing C++ objects from R using Rcpp" gives an explanation and an example.

5. Interaction with Java

Calling Java from R (rJava) and calling R from Java (JRI, RCaller) are both possible. Renjin follows a different concept, as it is an R interpreter written in Java (another way of calling R from Java, huh?). A detailed comparison of these methods is given in this documentation and this.

6. Sophisticated variable scoping

In R, functions have their own variable scopes, and accessing variables at the top level is still possible. In addition, variable scoping is handled by list-like objects called environments, and user-defined environments can be created anywhere in the code. For detailed information, visit Environment in R.
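The scoping rules can be illustrated with a small sketch. All names below (counter, increment_local, increment_global, e) are illustrative; the sketch only uses base R functions.

```r
# Sketch: function scopes and user-created environments.
counter <- 0                       # variable at the top level

increment_local <- function() {
  counter <- counter + 1           # creates a local variable; the outer one is unchanged
  counter
}

increment_global <- function() {
  counter <<- counter + 1          # modifies the variable found in the enclosing scope
  counter
}

local_result  <- increment_local()
after_local   <- counter           # still 0
global_result <- increment_global()
after_global  <- counter           # now 1

# A user-created environment behaves like a named container of variables.
e <- new.env()
assign("a", 10, envir = e)
e$b <- c(1, 2, 3)
env_names <- sort(ls(e))           # names of the variables stored in e
env_list  <- as.list(e)            # the environment converted to a regular list
```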

7. Optional Object Oriented Programming (O-OOP)

R functions take the values of variables as parameters rather than their addresses. If a large vector (say, with 100,000 elements) is passed to a function, the function conceptually works on its own copy of the vector (R postpones the actual copying until the vector is modified inside the function). After the body of the function is executed, the copy is marked as free for later garbage collection. As C/C++ programmers know, passing objects by address rather than by value is a good way of using less memory and spending less computation time. Instances of reference classes in R are passed to functions by address, in a way similar to passing C++ references and Java objects to functions and methods:

```
Person <- setRefClass(
  Class = "Person",
  fields = c("name","surname","email"),
  methods = list(

    initialize = function(name, surname, email){
      .self$name <- name
      .self$surname <- surname
      .self$email <- email
    },

    setName = function(name){
      .self$name <- name
    },

    setSurname = function(surname){
      .self$surname <- surname
    },

    setEMail = function (email){
      .self$email <- email
    },

    toString = function (){
      return(paste(name, " ", surname, " ", email))
    }

  ) # End of methods
) # End of class

p <- Person$new("John","Brown","brown@server.org")
print(p$toString())
```

The output is

`[1] "John   Brown   brown@server.org"`

Java and C++ programmers probably like this notation!
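The difference between value and reference semantics can be made visible with a small sketch. The class Account and the functions add_to_vector and add_to_account are hypothetical names introduced only for this illustration.

```r
# Sketch: value semantics for vectors vs. reference semantics for
# Reference Class objects.
library(methods)

Account <- setRefClass(
  "Account",
  fields  = list(balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      balance <<- balance + amount   # modifies the object in place
      invisible(.self)
    }
  )
)

add_to_vector <- function(v) {
  v[1] <- v[1] + 100                 # changes only the function's own copy
  v
}

add_to_account <- function(acc) {
  acc$deposit(100)                   # changes the caller's object
  invisible(NULL)
}

v <- c(1, 2, 3)
add_to_vector(v)
vector_first <- v[1]                 # the caller's vector is unchanged

acc <- Account$new(balance = 1)
add_to_account(acc)
account_balance <- acc$balance       # the caller's object has been modified
```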

## Monday, March 16, 2015

### Compact Genetic Algorithms with R

The Compact Genetic Algorithm (CGA) is a member of the family of Genetic Algorithms (GAs) and also of Estimation of Distribution Algorithms (EDAs). It is called compact because it is based on a single chromosome (a vector of bit probabilities) rather than a population of chromosomes.

For detailed information, see the research papers [1] and [2], which present a complete and a brief description, respectively.

In this blog post, we give an example of using compact genetic algorithms on the ONEMAX function. The ONEMAX function takes n bits as parameters and returns the number of ones as an integer. Since it has a single optimum, reached when all of the bits are equal to 1, it is called ONEMAX.

First of all, we load the R package eive which includes the wrapped C++ function cga.

`> require("eive")`

The next step is to define the ONEMAX function. Note that it returns the negated number of ones, so smaller values are better:

```> ONEMAX <- function (x){
+     return(-sum(x))
+ }```

Now we write the main part, optimization with cga:

```
> result <- cga(chsize = 10 , popsize = 100 , evalFunc = ONEMAX)
> result
[1] 1 1 1 1 1 1 1 1 1 1
```

The result is a vector in which the bits are all equal to 1.

The most important issue in this example is speed: the algorithm is implemented in C++ and wrapped using Rcpp to be called from within R.

Here is an example with 1000 bits and the time consumed by the cga function call:

```
> system.time(
+     result <- cga(chsize = 1000,popsize = 100,evalFunc = ONEMAX)
+ )
user  system elapsed
0.443   0.000   0.433
> ONEMAX(result)
[1] -994
```

This seems considerably fast: 994 of the 1000 bits are found to be 1 in 0.433 seconds. Let's increase the population size from 100 to 200:

```
> system.time(
+     result <- cga(chsize = 1000,popsize = 200,evalFunc = ONEMAX)
+ )
user  system elapsed
0.891   0.000   0.866
> print (ONEMAX(result))
[1] -1000
```

After increasing the population size from 100 to 200, the time consumed doubles to 0.866 seconds. But this time, 1000 of the 1000 bits are 1, and the optimal solution is reached.
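To make the idea of a single probability vector more concrete, here is a minimal pure R sketch of a compact genetic algorithm for minimization, following the general scheme described in [1]. It is not the implementation in eive: the function cga_sketch, its maxiter argument, the tie-breaking rule and the convergence tolerance are assumptions made for this illustration.

```r
# Sketch of a compact genetic algorithm (minimization).
# cga_sketch is a hypothetical helper, not the eive implementation.
cga_sketch <- function(chsize, popsize, evalFunc, maxiter = 100000) {
  p    <- rep(0.5, chsize)              # probability of a 1 at each bit
  step <- 1 / popsize                   # update size of the probability vector
  for (iter in seq_len(maxiter)) {
    # Sample two candidate chromosomes from the probability vector.
    a <- as.numeric(runif(chsize) < p)
    b <- as.numeric(runif(chsize) < p)
    # The candidate with the smaller cost wins the tournament.
    if (evalFunc(a) <= evalFunc(b)) {
      winner <- a; loser <- b
    } else {
      winner <- b; loser <- a
    }
    # Shift probabilities toward the winner where the candidates differ.
    differs <- winner != loser
    p[differs] <- p[differs] + ifelse(winner[differs] == 1, step, -step)
    p <- pmin(pmax(p, 0), 1)
    # Stop when every probability is (numerically) 0 or 1.
    if (all(pmin(p, 1 - p) < 1e-9)) break
  }
  as.numeric(p > 0.5)
}

ONEMAX <- function(x) -sum(x)           # negated count of ones, as in the post

set.seed(123)                           # arbitrary seed for reproducibility
result <- cga_sketch(chsize = 10, popsize = 100, evalFunc = ONEMAX)
print(result)
print(ONEMAX(result))
```

The popsize argument only controls the step size 1 / popsize, which is how the algorithm mimics a population without storing one: a larger value gives smaller, more careful updates at the cost of more iterations.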

[1] Harik, Georges R., Fernando G. Lobo, and David E. Goldberg. "The compact genetic algorithm." Evolutionary Computation, IEEE Transactions on 3.4 (1999): 287-297.

[2] Satman, M. Hakan, and Erkin Diyarbakirlioglu. "Reducing errors-in-variables bias in linear regression using compact genetic algorithms." Journal of Statistical Computation and Simulation ahead-of-print (2014): 1-20.

### Accessing C++ objects from R using Rcpp

Rcpp (Seamless R and C++ Integration) is an R package that provides an easy way of combining C++ and R code. Since R is an interpreted language, a large piece of code would probably run at least 2 times slower than its counterpart written in C++. Speed is often the main concern; however, another important reason for using C++ is to access an existing native library from R.

In this blog post, we give an example of accessing a C++ class from within R using Rcpp. The C++ class is named MyClass and has two private variables of type double. It also has getter and setter methods for its private fields.

MyClass is defined as shown below:

```
#include <Rcpp.h>
#include <iostream>

using namespace Rcpp;
using namespace std;

class MyClass {
private:
    double a, b;

public:
    MyClass(double a, double b);
    ~MyClass();
    void setA(double a);
    void setB(double b);
    double getA();
    double getB();
};
```

MyClass has two private variables of type double, a and b, a constructor, a destructor, and getter and setter methods for a and b. The implementation of MyClass is given below:

```
MyClass::MyClass(double a, double b){
    this->a = a;
    this->b = b;
}

MyClass::~MyClass(){
    cout << "Destructor called" << std::endl;
}

void MyClass::setA (double a){
    this->a = a;
}

void MyClass::setB (double b){
    this->b = b;
}

double MyClass::getA(){
    return(this->a);
}

double MyClass::getB(){
    return(this->b);
}
```

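MyClass is nearly minimal. Since it is a C++ class, it is not directly accessible from R. In this example, we write some wrapper functions that create instances of MyClass and return their addresses in memory for later function calls. In other words, on the R side we keep the addresses of C++ objects in order to access them.

```
// Forward declaration, since class_create calls class_print
void class_print(long addr);

// [[Rcpp::export]]
long class_create(double a, double b){
    MyClass *m =  new MyClass(a,b);
    class_print((long) m);
    return((long)m);
}
```

The function class_create is a C++ function with a special comment that Rcpp processes before compiling. After compilation, an R wrapper function named class_create is created to call its C++ counterpart. This function creates an instance of MyClass with the given double values and returns the address of the created object as a long integer. Here are the other wrapper functions. Their names follow the calls in the R session of this post; the parameter name addr and the pointer casts are reconstructed:

```
// [[Rcpp::export]]
void class_print(long addr){
    MyClass *m = (MyClass*) addr;   // convert the stored address back to a pointer
    cout << "a = " << m->getA() << " b = " << m->getB() << "\n";
}

// [[Rcpp::export]]
void class_destroy(long addr){
    MyClass *m = (MyClass*) addr;
    delete m;
}

// [[Rcpp::export]]
void class_set_a(long addr, double a){
    MyClass *m = (MyClass*) addr;
    m->setA(a);
}

// [[Rcpp::export]]
void class_set_b(long addr, double b){
    MyClass *m = (MyClass*) addr;
    m->setB(b);
}

// [[Rcpp::export]]
double class_get_a(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getA());
}

// [[Rcpp::export]]
double class_get_b(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getB());
}
```
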
Suppose the whole code is saved in a file named classcall.cpp. On the R side, this code can be compiled and tested as shown below:

```
> require("Rcpp")
> Rcpp::sourceCpp('rprojects/classcall.cpp')

> myobj <- class_create(3.14, 7.8)
a = 3.14 b = 7.8
> myobj
[1] 104078752

> class_set_a(myobj,100)
> class_set_b(myobj,500)
> class_print(myobj)
a = 100 b = 500
> class_get_a(myobj)
[1] 100
> class_get_b(myobj)
[1] 500
> class_destroy(myobj)
Destructor called
```

## Saturday, March 14, 2015

### SQLite with R - The sqldf package

R's data sorting functions sort and order, the data filtering function which, the vector indexing operator [], the vector and matrix manipulation functions cbind and rbind, and other functions and keywords make data analysis easy in many situations. SQL (Structured Query Language) is used for storing, adding, removing, sorting and filtering data that is saved permanently on disk or held in memory.

The R package sqldf builds an SQLite database from an R data.frame object. A data.frame is a matrix with richer properties in R. In this blog post, we present a basic introduction to the sqldf package and its use in R.

First of all, the package can be installed by typing:

```
> install.packages("sqldf")
```

After installing the package, it can be loaded by typing:

```
> require("sqldf")
```

Now let's create two vectors of length 100 and combine them into a data frame:

```
> assign("x", rnorm(100))
> assign("y", rnorm(100))
> assign("mydata", as.data.frame(cbind(x,y)))
```

We can see the first 6 rows:

```
> head(mydata)
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977
4  0.3049607 0.6055087
5  2.3773249 1.1800434
6  0.4641791 1.7143130
```

Let's perform some SQL statements on this data frame using sqldf.

Averages of x and y

```
> sqldf("select avg(x), avg(y) from mydata")
     avg(x)   avg(y)
1 0.0790934 0.220756
```

Number of cases

```
> sqldf("select count(x), count(y) from mydata")
  count(x) count(y)
1      100      100
```

First Three Cases

```
> sqldf("select x,y from mydata limit 3")
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977
```

Minimum and Maximum Values

```
> sqldf("select min(x),max(x),min(y),max(y) from mydata")
     min(x)   max(x)   min(y)   max(y)
1 -2.155768 2.377325 -1.75477 2.531869
```

First 3 Cases of Ordered Data

```
> sqldf("select x,y from mydata order by x limit 3")
          x         y
1 -2.155768 0.6614813
2 -1.935766 0.2784369
3 -1.837502 0.1073177
> sqldf("select x,y from mydata order by y limit 3")
          x         y
1 0.7665811 -1.754770
2 0.3373319 -1.736727
3 0.6199159 -1.335649
```
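For comparison, the same questions can be answered with the base R functions mentioned at the beginning of this post. The sketch below works on freshly simulated data, so its values differ from the output above; the seed and the result names are illustrative.

```r
# Sketch: base R counterparts of the SQL queries above.
set.seed(2015)                                      # arbitrary seed
mydata <- data.frame(x = rnorm(100), y = rnorm(100))

averages    <- colMeans(mydata)                     # select avg(x), avg(y)
n_cases     <- nrow(mydata)                         # select count(x), count(y)
first_three <- head(mydata, 3)                      # select x,y ... limit 3
ranges      <- sapply(mydata, range)                # min and max of each column
by_x        <- head(mydata[order(mydata$x), ], 3)   # order by x limit 3
by_y        <- head(mydata[order(mydata$y), ], 3)   # order by y limit 3

print(averages)
print(ranges)
print(by_x)
```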

Insert into

sqldf does not alter the data frame. After inserting a new case, a new data.frame is created and returned. In the example below, sqldf takes a vector of two SQL statements as its parameter, and the result is accessible under the name main.mydata rather than mydata:

```
> tail (sqldf(
+ c(
+ "insert into mydata values (6,7)"
+ ,
+ "select * from main.mydata"
+ )
+ )
+ )
              x          y
96   1.58024523  1.3937920
97  -1.79352203  0.2105787
98   0.02632872 -1.0567890
99  -0.60934162 -0.1359667
100  1.43393159 -0.9396326
101  6.00000000  7.0000000
```

Delete

```
> sqldf(
+ c(
+ "delete from mydata where x < 0 or y < 0"
+ ,
+ "select * from main.mydata"
+ )
+ )
            x          y
1  0.19136277 0.15789771
2  0.30496074 0.60550873
3  2.37732485 1.18004342
4  0.46417906 1.71431305
5  1.16290585 1.17154756
6  0.49335335 0.19904607
7  1.45769371 0.08291387
8  0.78473338 1.07769098
9  0.69043300 1.35040512
10 1.47893118 1.01057351
.....
```
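The insert and delete statements also have simple base R counterparts that, like sqldf, return a new data frame and leave the original untouched. The sketch below uses simulated data and illustrative names (inserted, remaining).

```r
# Sketch: base R counterparts of the insert and delete statements.
set.seed(2015)                                      # arbitrary seed
mydata <- data.frame(x = rnorm(100), y = rnorm(100))

# insert into mydata values (6,7): rbind returns a new data frame
inserted <- rbind(mydata, data.frame(x = 6, y = 7))

# delete from mydata where x < 0 or y < 0: keep the complementary rows
remaining <- mydata[!(mydata$x < 0 | mydata$y < 0), ]

print(tail(inserted, 3))
print(head(remaining, 3))
```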

### Handling all variables in a workspace in R with RCaller

It is known that R assigns a value to a variable name using the assignment operator <-, which corresponds to the assign function.

RCaller handles results as list objects. Since R environments are list-like objects, they can easily be converted to R lists (see the previous blog post on R lists).

Here is an example of using RCaller to get all variables that are created at run time on the R side.

```
package rcallerenvironments;

import rcaller.RCaller;
import rcaller.RCode;

public class RCallerEnvironments {

    public static void main(String[] args) {
        RCaller rcaller = new RCaller();
        RCode code = new RCode();
        rcaller.setRscriptExecutable("/usr/bin/Rscript");

        rcaller.setRCode(code);

        rcaller.runAndReturnResult("allvars");

        System.out.println(rcaller.getParser().getNames());
        try {
            System.out.println(rcaller.getParser().getXMLFileAsString());
        } catch (Exception e) {
            System.out.println("Error in accessing XML");
        }
    }

}
```

The output is

As seen in the output, the created variables avector, a, b and d are returned to the Java side in a single call, without any manual translation.
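On the R side, collecting every variable of an environment into one list takes a single call. The sketch below is an assumption about how this can be done, not the code sent by the Java program: only the variable names avector, a, b, d and allvars come from this post, while the values and the helper environment workspace are hypothetical.

```r
# Sketch: gather all variables of an environment into a single list.
workspace <- new.env()             # stands in for the R session's workspace

# Hypothetical values; only the variable names come from the post.
evalq({
  a <- 1
  b <- 2
  d <- a + b
  avector <- c(a, b, d)
}, envir = workspace)

allvars <- as.list(workspace)      # a plain named list of all variables
print(sort(names(allvars)))
```

In an interactive session the same idea applies to the global environment, for example with as.list(globalenv()).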

## Friday, March 13, 2015

We are happy to announce that version 2.5 of RCaller, our 'easy to use' library for calling R from Java, is now available for download. Developers can get the compiled jar file at

https://github.com/jbytecode/rcaller/releases/tag/2.5

This release does not extend the main functionality of the library, but it adds some handy functions for performing calculations and for later development of the library.

What is new:

* A BibTeX entry of the official document is added for citing RCaller in projects or papers

* The RealMatrix class is implemented. Matrix operations are performed in a more 'Java-ish' style

* RService is implemented for developing wrapper functions

Where to start?

* Read the web page on RCaller http://mhsatman.com/tag/rcaller/
* Read blog entries in http://stdioe.blogspot.com.tr/search/label/rcaller
* Have a look at the source tree in https://github.com/jbytecode/rcaller

Have a nice try!

### Migration of RCaller and Fuzuli Projects to GitHub

Google announced that it is shutting down its code hosting service Google Code, where our two projects, RCaller and the Fuzuli Programming Language, are hosted.

We have therefore migrated our projects to the popular code hosting site GitHub.

Source code of these projects will no longer be committed to the Google Code site. Please check the new repositories.

GitHub pages are listed below:

RCaller:

https://github.com/jbytecode/rcaller

Fuzuli Project:

https://github.com/jbytecode/fuzuli

## Monday, March 9, 2015

### Nearest-Neighbor Clustering using RCaller - A library for Calling R from Java

RCaller is a library for calling R from Java. A previous blog post provides the latest downloadable jar and the documentation. The latest news can always be followed using the RCaller label on the Practical Code Solutions blog.

A blog post on performing a k-means clustering analysis using RCaller is also available on this blog.

In the code below, two double arrays, x and y, are created on the Java side. These variables are then passed to R. On the R side, the distance matrix d is calculated. The R function hclust performs the main calculations. Finally, the calculated heights of the clustering tree and a dendrogram plot are returned to Java. The source code, the output text and the returned plot are presented here:

```
package kmeansrcaller;

import java.io.File;
import rcaller.RCaller;
import rcaller.RCode;

    public static void main(String[] args) {
        RCaller caller = new RCaller();
        RCode code = new RCode();
        File dendrogram = null;

        double[] x = new double[]{1, 2, 3, 4, 5, 10, 20, 30, 40, 50};
        double[] y = new double[]{2, 4, 6, 8, 10, 20, 40, 60, 80, 100};

        try {
            dendrogram = code.startPlot();
            code.endPlot();
        } catch (Exception e) {
            System.out.println("Plot Error: " + e.toString());
        }

        caller.setRCode(code);

        caller.setRscriptExecutable("/usr/bin/Rscript");

        caller.runAndReturnResult("h");
        System.out.println(caller.getParser().getNames());

        if (dendrogram != null) {
            code.showPlot(dendrogram);
        }

        double[] heights = caller.getParser().getAsDoubleArray("height");
        for (int i = 0; i < heights.length; i++) {
            System.out.println("Height " + i + " = " + heights[i]);
        }
    }
}
```

The output is

```
[merge, height, order, method, call, dist_method]
Height 0 = 2.23606797749979
Height 1 = 2.23606797749979
Height 2 = 2.23606797749979
Height 3 = 2.23606797749979
Height 4 = 11.1803398874989
Height 5 = 22.3606797749979
Height 6 = 22.3606797749979
Height 7 = 22.3606797749979
Height 8 = 22.3606797749979
```
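The R side of this analysis can be sketched in a few lines. The code below is an assumption about what the R statements may look like, not the code sent by the Java program: the names d and h follow the description above, and method = "single" (nearest-neighbor linkage, as in the title of this post) is chosen because it reproduces the reported heights.

```r
# Sketch: nearest-neighbor (single linkage) clustering of the same data in R.
x <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
y <- c(2, 4, 6, 8, 10, 20, 40, 60, 80, 100)

d <- dist(cbind(x, y))               # Euclidean distance matrix
h <- hclust(d, method = "single")    # assumed linkage method

print(names(h))                      # components of the result object
print(h$height)                      # merge heights of the clustering tree
# plot(h) would draw the dendrogram
```

With single linkage, each merge height is the distance between the two closest points of the merged groups, which is why the values repeat for the evenly spaced observations.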

The screen shot of the plotted graphics is here:

## Saturday, March 7, 2015

### K-means clustering with RCaller - A library for calling R from Java

Here is an example of RCaller, a library for calling R from Java.

In the code below, we create two variables x and y. The k-means clustering function kmeans is applied to the data matrix that consists of x and y. The result is then reported in Java.

```
package kmeansrcaller;

import rcaller.RCaller;
import rcaller.RCode;

public class KMeansRCaller {

    public static void main(String[] args) {
        RCaller caller = new RCaller();
        RCode code = new RCode();

        double[] x = new double[]{1, 2, 3, 4, 5, 10, 20, 30, 40, 50};
        double[] y = new double[]{2, 4, 6, 8, 10, 20, 40, 60, 80, 100};

        caller.setRCode(code);

        caller.setRscriptExecutable("/usr/bin/Rscript");

        caller.runAndReturnResult("result");
        System.out.println(caller.getParser().getNames());

        int[] clusters = caller.getParser().getAsIntArray("cluster");
        double[][] centers = caller.getParser().getAsDoubleMatrix("centers");
        double[] totalSumOfSquares = caller.getParser().getAsDoubleArray("totss");
        // RCaller automatically replaces dots with underlines in variable names
        // So the parameter tot.withinss is accessible as tot_withinss
        double[] totalWithinSumOfSquares = caller.getParser().getAsDoubleArray("tot_withinss");
        double[] totalBetweenSumOfSquares = caller.getParser().getAsDoubleArray("betweenss");

        for (int i = 0; i < clusters.length; i++) {
            System.out.println("Observation " + i + " is in cluster " + clusters[i]);
        }

        System.out.println("Cluster Centers:");
        for (int i = 0; i < centers.length; i++) {
            for (int j = 0; j < centers[0].length; j++) {
                System.out.print(centers[i][j] + " ");
            }
            System.out.println();
        }

        System.out.println("Total Within Sum of Squares: " + totalWithinSumOfSquares[0]);
        System.out.println("Total Between Sum of Squares: " + totalBetweenSumOfSquares[0]);
        System.out.println("Total Sum of Squares: " + totalSumOfSquares[0]);
    }

}
```

The output is

```
[cluster, centers, totss, withinss, tot_withinss, betweenss, size, iter, ifault]
Observation 0 is in cluster 2
Observation 1 is in cluster 2
Observation 2 is in cluster 2
Observation 3 is in cluster 2
Observation 4 is in cluster 2
Observation 5 is in cluster 2
Observation 6 is in cluster 2
Observation 7 is in cluster 1
Observation 8 is in cluster 1
Observation 9 is in cluster 1
Cluster Centers:
40.0 6.42857142857143
80.0 12.8571428571429
Total Within Sum of Squares: 2328.57142857143
Total Between Sum of Squares: 11833.9285714286
Total Sum of Squares: 14162.5
```
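The R side of this example can be sketched as follows. This is an assumption about what the R statements may look like, not the code sent by the Java program: the name result follows the Java code, the matrix name m is illustrative, and the starting centers are fixed (last and first observation) only to make the run deterministic, whereas kmeans normally picks them at random.

```r
# Sketch: k-means clustering of the same data directly in R.
x <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
y <- c(2, 4, 6, 8, 10, 20, 40, 60, 80, 100)
m <- cbind(x, y)                                # data matrix with columns x and y

# Two clusters, started from the last and the first observation.
result <- kmeans(m, centers = m[c(10, 1), ])

print(result$cluster)                           # cluster label of each observation
print(result$centers)                           # one row per cluster center
print(c(totss        = result$totss,
        tot.withinss = result$tot.withinss,
        betweenss    = result$betweenss))
```

Note that R returns the centers with one row per cluster, while the Java loop above prints the matrix it receives row by row, so the two layouts may appear transposed.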