
Wednesday, February 18, 2015

Fast and robust estimation of regression coefficients with R

Outliers are aberrant observations that do not fit the rest of the data well. In regression analysis, an observation is said to be an outlier if it lies far from the unknown regression object (a line in two-dimensional space, a plane in three-dimensional space, a hyperplane in higher dimensions, etc.). An observation that is extreme in its independent variables is called a leverage point. If a leverage point also lies far from the regression object, it is a bad leverage point. If it lies close to the regression object, it is a good leverage point, which generally reduces the standard errors of the estimates. An observation that is distant only in its dependent variable is called a regression outlier. Bad leverage points may change the estimated coefficients substantially and are regarded as more dangerous in the statistics literature.

Since an outlier may change the partial coefficients of a regression, examining the residuals of a non-robust estimator can lead to wrong conclusions. An outlier may change one or more regression coefficients and hide itself behind a relatively small residual. This effect is called masking. The same change in coefficients can push a clean observation away from the fitted regression object and give it a large residual. This effect is called swamping. A successful robust estimator should minimize these two effects in order to estimate the regression coefficients more precisely.
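Masking and swamping can be seen in a small simulation. The sketch below is illustrative only: the seed, sample size, and contamination scheme are assumptions chosen for the demonstration, and it uses ordinary least squares through base R's lm, not the medmad algorithm.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 100
x <- rnorm(n)
y <- 5 + 5 * x + rnorm(n)          # clean model: intercept 5, slope 5

# Assumed contamination: move 20 points far out in x while their y values
# stay near 0, so they do not follow the clean line (bad leverage points).
is_bad <- seq_len(n) > 80
x[is_bad] <- rnorm(sum(is_bad), mean = 10)
y[is_bad] <- rnorm(sum(is_bad), mean = 0)

fit_all   <- lm(y ~ x)                     # least squares on all points
fit_clean <- lm(y ~ x, subset = !is_bad)   # least squares on clean points only

# Coefficients with and without the contaminated points.
print(rbind(all = coef(fit_all), clean = coef(fit_clean)))

# Mean absolute standardized residual from the contaminated fit, by group.
std_res <- abs(rstandard(fit_all))
print(tapply(std_res, ifelse(is_bad, "contaminated", "clean"), mean))
```

In this setup the contaminated points pull the fitted line toward themselves, so their residuals tend to be smaller than those of many clean points. The small residuals of the contaminated points illustrate masking, and the inflated residuals of clean points illustrate swamping.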

The medmad function in the R package galts can be used for robust estimation of regression coefficients. The package is hosted on CRAN and can be installed from the R console by typing

install.packages("galts")

Once the package is installed, its content can be used by typing

require("galts")

and the functions and help files are ready to use after pressing Enter. Here is a complete example of generating regression data, contaminating some observations, and estimating the robust regression coefficients:
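A minimal sketch of such an experiment might look like the following. The seed, sample size, and contamination scheme are assumptions, and so is the formula-only call form medmad(y ~ x1 + x2), which should be checked against the package help (?medmad). The numbers it prints will therefore differ from the output reported in this post. The robust fit is attempted only if galts is installed.

```r
set.seed(2015)                        # arbitrary seed, for reproducibility only

n  <- 1000                            # assumed sample size
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 5 * x1 + 5 * x2 + rnorm(n)  # all true coefficients equal 5

# Assumed contamination: shift x1 for 10% of the observations and leave y
# unchanged, which turns those observations into bad leverage points.
contaminated <- sample(n, size = n / 10)
x1[contaminated] <- x1[contaminated] + 20

# Non-robust reference fit: ordinary least squares on the contaminated data.
ols <- coef(lm(y ~ x1 + x2))
print(ols)

# Robust fit. ASSUMPTION: medmad accepts a model formula as its first
# argument; the call is skipped if galts is missing or the call fails.
robust <- NULL
if (requireNamespace("galts", quietly = TRUE)) {
  robust <- tryCatch(galts::medmad(y ~ x1 + x2), error = function(e) NULL)
}
if (!is.null(robust)) print(robust)
```

The least squares coefficient of x1 is strongly distorted by the shifted observations, which is the behavior a robust estimator is meant to avoid.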

The output is

(Intercept)          x1          x2
   4.979828    4.993914    4.985901

in which the estimated parameters are close to 5, the value used to generate the data. The medmad function returns in 0.25 seconds on an Intel i5 computer with 8 GB of RAM.

The details of this algorithm can be found in the paper

Satman, Mehmet Hakan. "A New Algorithm for Detecting Outliers in Linear Regression." International Journal of Statistics and Probability 2.3 (2013): p101.

which is available at

http://www.ccsenet.org/journal/index.php/ijsp/article/view/28207

and