notolog: Visualizing clusters of high-dimensional data

Monday, January 17, 2011

Visualizing clusters of high-dimensional data

I wanted a visualization to demonstrate that distance-based methods should have a hard time separating the classes in some data sets (say, the libras UCI data set) but not in others (say, the iris UCI data set). S.R. suggested:

Use multidimensional scaling (MDS; see Wikipedia) to plot the points in 2 dimensions.
Draw lines between points and their nearest neighbors in the original space.

If the lines are an overlapping mess, the points don't cluster well.

Steps to do this in MATLAB:

M=dlmread('my.data', '\t'); % read tab-delimited data into memory (trailing ; suppresses printing the matrix)

M=transpose(M); % transpose so that columns are examples and rows are variables (features)

D=dist(M); % compute pairwise distances between the columns into a square N-by-N matrix

G=mdscale(D, 2); % find points in 2 dimensions whose distances approximate D

G is now an Nx2 matrix (N=number of samples), with one row of 2-D coordinates per sample. The rest of the process involves plotting the lines that connect each point to its nearest neighbor. You can do that in MATLAB, but I don't know a particularly fast way.
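One reasonably quick option is to find every nearest neighbor with a single min over the distance matrix and then draw all the segments in one plot call, separating them with NaNs. This is only a rough sketch under assumptions: the random matrix X is hypothetical stand-in data so the snippet runs on its own, and it uses squareform(pdist(X)) from the Statistics Toolbox in place of dist because X has examples in rows. With real data you would skip those lines and reuse D and G from the steps above.

```matlab
% Sketch: connect each point to its nearest neighbor in the ORIGINAL space.
rng(0);                          % reproducible stand-in data
X = rand(30, 5);                 % hypothetical data: 30 examples (rows), 5 features
D = squareform(pdist(X));        % N-by-N Euclidean distances between rows
G = mdscale(D, 2);               % N-by-2 MDS embedding

N = size(D, 1);
Dnn = D;
Dnn(1:N+1:end) = Inf;            % mask the diagonal so a point is not its own neighbor
[~, nn] = min(Dnn, [], 2);       % nn(i) = index of the nearest neighbor of point i

% Each column is one segment: start point, end point, then NaN to break the line.
xs = [G(:,1)'; G(nn,1)'; nan(1, N)];
ys = [G(:,2)'; G(nn,2)'; nan(1, N)];

figure; hold on;
plot(xs(:), ys(:), '-', 'Color', [0.6 0.6 0.6]);  % all neighbor links in one call
plot(G(:,1), G(:,2), 'k.', 'MarkerSize', 12);     % the embedded points
hold off; axis equal;
```

The neighbors are chosen from D (the original space), not from distances in G, so long lines crossing the plot mark points whose true nearest neighbor ended up far away in the 2-D picture.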
