Saturday, February 28, 2015

Levenshtein Distance in R

Edit distance and Levenshtein distance are nonparametric distance measures that not like well known metric distance measures such as Euclidean or Mahalanobis distances in some persfectives.

Levenshtein distance is a measure of how many characters should be replaced or moved to get two strings same.

In the example below, a string text is asked from the user in console mode. Then the input string is compared to colour names defined in R.  Similar colour names are then reported:

user.string <- readline("Enter a word: ")
wordlist <- colours()
dists <- adist(user.string, wordlist)
mindist <- min(dists)
best.ones <- which(dists == mindist)

for (index in best.ones){
    cat("Did you mean: ", wordlist[index],"\n")

Here is the results:

Enter a word: turtoise
Did you mean:  turquoise 

Enter a word: turtle
Did you mean:  purple 

Enter a word: night blue
Did you mean:  lightblue 

Enter a word: parliament
Did you mean:  darkmagenta 

Enter a word: marooon
Did you mean:  maroon

Have a nice read

Friday, February 27, 2015

Frequency table of characters in a string in R

R's string manipulating functions includes splitting a string. One can think that parsing a string or extracting its characters into an array and generating the frequency information may be used in a language detection system.

Here is a example on a text that is captured from the Oracle - History of Java site. The code below defines a large string. The string is then parsed into its characters. After calculating the frequencies of each single character (including numbers, commas and dots) a histogram is saved in a file.

# Defining string 
s <- "Since 1995, Java has changed our world and our expectations. Today, with technology such a part of our daily lives, we take it for granted that we can be connected and access applications and content anywhere, anytime. Because of Java, we expect digital devices to be smarter, more functional, and way more entertaining. In the early 90s, extending the power of network computing to the activities of everyday life was a radical vision. In 1991, a small group of Sun engineers called the \"Green Team\" believed that the next wave in computing was the union of digital consumer devices and computers. Led by James Gosling, the team worked around the clock and created the programming language that would revolutionize our world – Java. The Green Team demonstrated their new language with an interactive, handheld home-entertainment controller that was originally targeted at the digital cable television industry. Unfortunately, the concept was much too advanced for the them at the time. But it was just right for the Internet, which was just starting to take off. In 1995, the team announced that the Netscape Navigator Internet browser would incorporate Java technology.Today, Java not only permeates the Internet, but also is the invisible force behind many of the applications and devices that power our day-to-day lives. From mobile phones to handheld devices, games and navigation systems to e-business solutions, Java is everywhere!"

# First converting to lower case
# then splitting by characters.
# strsplit return a list, we are unlisting to a vector.
chars <- unlist(strsplit(tolower(s), ""))

# Generating frequency table
freqs <- table(chars)

# Generating plot into a file
hist(freqs,include.lowest=TRUE, breaks=46,freq=TRUE,labels=rownames(freqs))

The generated output is

We translate the text used in our example to Spanish using Google Translate site. The code is shown below:

# Defining string 
s <- "Desde 1995, Java ha cambiado nuestro mundo y nuestras expectativas. Hoy en día, con la tecnología de una parte de nuestra vida cotidia    na tal, damos por sentado que se puede conectar y acceder a las aplicaciones y contenido en cualquier lugar ya cualquier hora. Debido a Java    , esperamos que los dispositivos digitales para ser más inteligente, más funcional, y de manera más entretenida. A principios de los años 90    , que se extiende el poder de la computación en red para las actividades de la vida cotidiana era una visión radical. En 1991, un pequeño gr    upo de ingenieros de Sun llamado \"Green Team\" cree que la próxima ola de la informática fue la unión de los dispositivos digitales de cons    umo y ordenadores. Dirigido por James Gosling, el equipo trabajó durante todo el día y creó el lenguaje de programación que revolucionaría e    l mundo - Java. El Equipo Verde demostró su nuevo idioma con una mano controlador interactivo, el entretenimiento en casa que fue dirigido o    riginalmente a la industria de la televisión digital por cable. Por desgracia, el concepto fue demasiado avanzado para el ellos en el moment    o. Pero fue justo para Internet, que estaba empezando a despegar. En 1995, el equipo anunció que el navegador de Internet Netscape Navigator     incorporaría Java technology.Today, Java no sólo impregna el Internet, pero también es la fuerza invisible detrás de muchas de las aplicaci    ones y dispositivos que alimentan nuestra vida del día a día. Desde teléfonos móviles para dispositivos de mano, juegos y sistemas de navega    ción para e-business soluciones, Java está en todas partes!"

# First converting to lower case
# then splitting by characters.
# strsplit return a list, we are unlisting to a vector.
chars <- unlist(strsplit(tolower(s), ""))

# Generating frequency table
freqs <- table(chars)

# Generating plot into a file
hist(freqs,include.lowest=TRUE, breaks=46,freq=TRUE,labels=rownames(freqs))

The generated plot for the Spanish translation is:

Note that some characters such as space, dots and commas can be replaced from the string s using a code similar to this:

news <- gsub(pattern=c(" "), "", x=s)

The code above removes all spaces from the string s and a new string variable news holds the modified string. Original variable remains same. 

Have a nice read!

Environments in R

An environment is a special term in R, but its concept is used in many interpreters of programming languages. The term of variable scope is directly related with environments. An environment in R encapsulates a couple of variables or objects which itself is encapsulated by a special environment called global environment.

After setting a variable to a value out of any function and class, the default holder of this variable is the global environment.

Suppose we set t to 10 by

t <- 10

and this is the same as writing

assign(x="t", value=12, envir=.GlobalEnv)

and the value of t is now 12:

> t <- 10
> assign(x="t", value=12, envir=.GlobalEnv)
> t
[1] 12

Instead of using the global environment, we can create new environments and attach them to parent environments. Suppose we create a new environment as below:

> my.env <- new.env()
> assign(x="t", value="20", envir=my.env)
> t
[1] 12

Value of t is still 12 because we create an other t variable which is encapsulated by the environment my.env.

Variables in environments are accessable using the assign and the get functions.

> get(x="t", envir=my.env)
[1] "20"
> get(x="t", envir=.GlobalEnv)
[1] 12

As we can see, values of the variables with same name are different, because they are encapsulated by separate environments.

exists() function returns TRUE if an object exists in an environment else returns FALSE. Examining existence of an object is crucial in some cases. 

> exists(x="t", envir=.GlobalEnv)
[1] TRUE
> exists(x="t", envir=my.env)
[1] TRUE
> exists(x="a", envir=my.env)

 is.environment() function returns TRUE if an object is an environment else returns FALSE. 

> is.environment(.GlobalEnv)
[1] TRUE
> is.environment(my.env)
[1] TRUE
> is.environment(t)

Finally, environments are simply lists and a list can be converted to an environment easly. 

> my.list <- list (a=3, b=7)
> my.env <- as.environment(my.list)
> get("a", envir=my.env)
[1] 3
> get("b", envir=my.env)
[1] 7

The inverse process is converting an environment to a list: 

> as.list(.GlobalEnv)
[1] 12


[1] 3

[1] 7

Happy R days :)

Wednesday, February 18, 2015

Fast and robust estimation of regression coefficients with R

Outliers are aberrant observations that do not fit the remaining of the data, well. In regression analysis, outliers should not be distant from the remaining part, that is, if an observation is distant from the unknown regression object (a line in two dimensional space, a plane in three dimensional space, a hyper-plane in more dimensional space, etc.) it is said to be an outlier. If the observation is distant from the regression object by its independent variables, it is called bad leverage. If an observation is distant by its dependent variables, it is said to be regression outlier. If it is distant by both of the dimensions, it can be a good leverage, which generally reduces the standard errors of estimates. Bad leverages may result a big difference in estimated coefficients and they are accepted as more dangerous in the statistics literature.

Since an outlier may change the partial coefficients of regression, examining the residuals of a non-robust estimator results wrong conclusions. An outlier may change one or more regression coefficients and hide itself with a relatively small residual. This effect is called masking. This change in coefficients can get a clean observation distant from the regression object with higher residual. This effect is called swamping. A successful robust estimator should minimize these two effects to estimate regression coefficients in more precision.

The medmad function in R package galts can be used for robust estimation of regression coefficients. This package is hosted in the CRAN servers and can be installed in R terminal by typing


Once the package is installed, its content can be used by typing


and the functions and help files can be ready to use after typing an enter key.  Here is a complete example of generating a regression data, contaminating some observations and estimating the robust regression coefficients:

The output is

(Intercept)          x1          x2 
4.979828          4.993914    4.985901 

and the medmad function returns in 0.25 seconds in an Intel i5 computer with 8 GBs ram installed.

in which the parameters are near to 5 as the data is generated before. The details of this algorithm can be found in the paper

Satman, Mehmet Hakan. "A New Algorithm for Detecting Outliers in Linear Regression." International Journal of Statistics and Probability 2.3 (2013): p101.

which is avaliable at site


Have a nice detect!