R 's data sorting functions sort and order, the data filtering function which, vector accessing operators [], vector and matrix manipulation functions cbind and rbind, and other functions and keywords make data analysis easy in much situations. SQL (Structered Querying Language) is used for storing, adding, removing, sorting and filtering the data in which saved on a disk permenantly or memory.
The R package sqldf builds a SQLite database using an R data.frame object. A data.frame is a matrix with richer properties in R. In this blog post, we present a basic introduction of sqldf package and its use in R.
First of all, the package can be installed by typing:
> install.packages("dftable")
After installing the package, it can be got ready to use by typing:
> require("dftable")
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI
Now lets create two vectors with length of 100:
> assign("x", rnorm(100))
> assign("y", rnorm(100))
> assign("mydata", as.data.frame(cbind(x,y)))
We can see first 6 rows:
> head(mydata)
x y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3 0.1913628 0.1578977
4 0.3049607 0.6055087
5 2.3773249 1.1800434
6 0.4641791 1.7143130
Let's perform some SQL statements on this data frame using sqldf:
Averages of x and y
> sqldf("select avg(x), avg(y) from mydata")
avg(x) avg(y)
1 0.0790934 0.220756
Number of cases
> sqldf("select count(x), count(y) from mydata")
count(x) count(y)
1 100 100
First Three Cases
> sqldf("select x,y from mydata limit 3")
x y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3 0.1913628 0.1578977
Minimum and Maximum Values
> sqldf("select min(x),max(x),min(y),max(y) from mydata")
min(x) max(x) min(y) max(y)
1 -2.155768 2.377325 -1.75477 2.531869
First 3 Cases of Ordered Data
> sqldf("select x,y from mydata order by x limit 3")
x y
1 -2.155768 0.6614813
2 -1.935766 0.2784369
3 -1.837502 0.1073177
> sqldf("select x,y from mydata order by y limit 3")
x y
1 0.7665811 -1.754770
2 0.3373319 -1.736727
3 0.6199159 -1.335649
Insert into
dftable does not alter the data frame. After inserting a new case, a new data.frame is created and returned. In the example below, sqldf takes a vector of two sql statements as parameters and the result is in accessable with the name main.mydata rather than mydata.
> tail (sqldf(
+ c(
+ "insert into mydata values (6,7)"
+ ,
+ "select * from main.mydata"
+ )
+ )
+ )
x y
96 1.58024523 1.3937920
97 -1.79352203 0.2105787
98 0.02632872 -1.0567890
99 -0.60934162 -0.1359667
100 1.43393159 -0.9396326
101 6.00000000 7.0000000
Delete
> sqldf(
+ c(
+ "delete from mydata where x < 0 or y < 0"
+ ,
+ "select * from main.mydata"
+ )
+ )
x y
1 0.19136277 0.15789771
2 0.30496074 0.60550873
3 2.37732485 1.18004342
4 0.46417906 1.71431305
5 1.16290585 1.17154756
6 0.49335335 0.19904607
7 1.45769371 0.08291387
8 0.78473338 1.07769098
9 0.69043300 1.35040512
10 1.47893118 1.01057351
.....
Have a nice read!