Monday, April 6, 2009

Meeting 10 - Fast Mining of Distance-Based Outliers in High-Dimensional Datasets

This week, Adnan will present "Fast Mining of Distance-Based Outliers in High-Dimensional Datasets" by Amol Ghoting (IBM), Srinivasan Parthasarathy (Ohio State University), and Matthew Eric Otey (Google Inc.).

Abstract: Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. Existing algorithms for mining distance-based outliers do not scale to large, high-dimensional data sets. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional data sets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art, often by an order of magnitude.

Keywords: Outlier detection, high-dimensional data sets, approximate k-nearest neighbors, clustering.
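
For anyone who wants a concrete picture before the meeting, here is a minimal sketch of the general idea behind distance-based outliers. It is not RBRP. It assumes one common formulation, in which each point is scored by the distance to its k-th nearest neighbor and the n highest-scoring points are reported as outliers. It uses a naive quadratic scan, which is the kind of cost that fast algorithms try to avoid. The function names, parameters, and sample data are made up for illustration and do not come from the paper.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Naive O(n^2) scan over all pairs; illustrative only, not the RBRP algorithm.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical toy data: four tightly clustered points and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))
```

Because every point is compared against every other point, the running time of this sketch grows quadratically with the number of points, which is impractical for large data sets.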
