Monday, January 17, 2011

Visualizing clusters of high-dimensional data

I wanted a visualization to demonstrate that distance-based methods should have a hard time separating the classes in some data sets (say, the UCI libras data set) but not in others (say, the UCI iris data set). S.R. suggested:

  1. Use multidimensional scaling (MDS) to plot the points in 2 dimensions.
  2. Draw lines between points and their nearest neighbors in the original space.

If the lines are an overlapping mess, the points don't cluster well.


Steps to do this in MATLAB:


  1. M=dlmread('my.data', '\t') % read tab-delimited data into memory

  2. M=transpose(M) % transpose so that columns are examples and rows are variables (features)

  3. D=dist(M) % compute pairwise distances between the columns as a square matrix

  4. G=mdscale(D, 2) % find a configuration of the points in 2 dimensions


G is now an N-by-2 matrix (N = number of samples), with one row of 2-D coordinates per point. What remains is plotting the lines that connect each point to its nearest neighbor. You can do that in MATLAB, but I don't know a particularly fast way.
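
One possibility that avoids a per-point loop, as a minimal sketch rather than a tested recipe: find each point's nearest neighbor from D, then draw every segment with a single plot call by separating the segments with NaN. To keep the snippet runnable on its own it builds random stand-in data X (hypothetical, rows are examples) and computes D with squareform(pdist(X)) instead of dist; with real data you would reuse the D and G from the steps above.

```matlab
% Stand-in inputs so the sketch runs by itself (assumption: toy random data).
% In practice, skip these three lines and reuse D and G from the steps above.
X = [randn(20, 5); randn(20, 5) + 4];   % 40 examples, 5 features, two loose groups
D = squareform(pdist(X));               % square matrix of pairwise Euclidean distances
G = mdscale(D, 2);                      % N-by-2 coordinates from MDS

N = size(D, 1);

% Nearest neighbor of each point in the ORIGINAL space.
Dnn = D;
Dnn(logical(eye(N))) = Inf;             % ignore each point's zero distance to itself
[~, nn] = min(Dnn, [], 2);              % nn(i) is the index of point i's nearest neighbor

% One segment per point: start, end, then NaN to break the line.
xs = [G(:, 1)'; G(nn, 1)'; nan(1, N)];
ys = [G(:, 2)'; G(nn, 2)'; nan(1, N)];

plot(xs(:), ys(:), 'b-');               % all N segments as a single line object
hold on;
plot(G(:, 1), G(:, 2), 'k.');           % the embedded points themselves
hold off;
```

Because plot skips NaN values, the whole set of segments is one graphics object, which tends to be much quicker than calling line or plot once per point. If the classes are known, the points could be colored by label to make the "overlapping mess" easier to judge.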
