Saturday, March 14, 2015

SQLite with R - The sqldf package


R 's data sorting functions sort and order, the data filtering function which, vector accessing operators [], vector and matrix manipulation functions cbind and rbind, and other functions and keywords make data analysis easy in much situations. SQL (Structered Querying Language) is used for storing, adding, removing, sorting and filtering the data in which saved on a disk permenantly or memory.

The R package sqldf builds a SQLite database using an R data.frame object. A data.frame is a matrix with richer properties in R.  In this blog post, we present a basic introduction of sqldf package and its use in R.

First of all, the package can be installed by typing:

> install.packages("dftable")

After installing the package, it can be got ready to use by typing:

> require("dftable")
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI


Now lets create two vectors with length of 100:

> assign("x", rnorm(100))
> assign("y", rnorm(100))
> assign("mydata", as.data.frame(cbind(x,y)))

We can see first 6 rows:

> head(mydata)
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977
4  0.3049607 0.6055087
5  2.3773249 1.1800434
6  0.4641791 1.7143130

Let's perform some SQL statements on this data frame using sqldf

Averages of x and y

> sqldf("select avg(x), avg(y) from mydata")
     avg(x)   avg(y)
1 0.0790934 0.220756


Number of cases

> sqldf("select count(x), count(y) from mydata")
  count(x) count(y)
1      100      100


First Three Cases

> sqldf("select x,y from mydata limit 3")
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977


Minimum and Maximum Values

> sqldf("select min(x),max(x),min(y),max(y) from mydata")
     min(x)   max(x)   min(y)   max(y)
1 -2.155768 2.377325 -1.75477 2.531869


First 3 Cases of Ordered Data

> sqldf("select x,y from mydata order by x limit 3")
          x         y
1 -2.155768 0.6614813
2 -1.935766 0.2784369
3 -1.837502 0.1073177
> sqldf("select x,y from mydata order by y limit 3")
          x         y
1 0.7665811 -1.754770
2 0.3373319 -1.736727
3 0.6199159 -1.335649


Insert into 

dftable does not alter the data frame. After inserting a new case, a new data.frame is created and returned. In the example below, sqldf takes a vector of two sql statements as parameters and the result is in accessable with the name main.mydata rather than mydata

> tail (sqldf( 
+ c(
+ "insert into mydata values (6,7)"
+ ,
+ "select * from main.mydata"
+ )
+ )
+ )
              x          y
96   1.58024523  1.3937920
97  -1.79352203  0.2105787
98   0.02632872 -1.0567890
99  -0.60934162 -0.1359667
100  1.43393159 -0.9396326
101  6.00000000  7.0000000


Delete

> sqldf( 
+ c(
+ "delete from mydata where x < 0 or y < 0"
+ ,
+ "select * from main.mydata"
+ )
+ )
            x          y
1  0.19136277 0.15789771
2  0.30496074 0.60550873
3  2.37732485 1.18004342
4  0.46417906 1.71431305
5  1.16290585 1.17154756
6  0.49335335 0.19904607
7  1.45769371 0.08291387
8  0.78473338 1.07769098
9  0.69043300 1.35040512
10 1.47893118 1.01057351
.....


Have a nice read!


No comments:

Post a Comment

Thanks