I wanted a visualization to demonstrate that distance-based methods should have a hard time distinguishing the classes in some data sets (say, the UCI libras data set) but not in others (say, the UCI iris data set). S.R. suggested:
- Use multidimensional scaling (MDS, see Wikipedia) to plot the points in 2 dimensions.
- Draw lines between each point and its nearest neighbor in the original space.
If the lines are an overlapping mess, the points don't cluster well.
Steps to do this in MATLAB:
- M = dlmread('my.data', '\t'); % read tab-delimited data into memory
- M = transpose(M); % transpose so that columns are examples and rows are variables (features)
- D = dist(M); % compute pairwise distances between the columns as a square NxN matrix
- G = mdscale(D, 2); % find points in 2 dimensions whose distances approximate D
G is now an Nx2 matrix (N = number of samples), one row of 2-D coordinates per sample. The rest of the process involves plotting the lines that connect neighbors. You can do that in MATLAB, but I don't know a particularly fast way.
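For what it's worth, here is a minimal sketch of one way to draw the neighbor lines without a loop. It is not necessarily the fastest approach. It assumes each point is connected only to its single nearest neighbor, and it uses random placeholder data in place of 'my.data' so that it runs on its own. The names Dn, nn, X and Y are made up for illustration. The trick is that plot breaks a line wherever it meets a NaN, so all segments can go into one call.

```matlab
% Placeholder data so the sketch runs on its own: 4 features x 30 examples.
% Replace with your own M (columns are examples, rows are features).
M = rand(4, 30);
D = dist(M);               % NxN distance matrix, as in the steps above
G = mdscale(D, 2);         % Nx2 coordinates from MDS

N = size(D, 1);
Dn = D;                    % work on a copy so D stays intact
Dn(1:N+1:end) = Inf;       % mask the diagonal so a point is never its own neighbor
[~, nn] = min(Dn, [], 2);  % nn(i) = nearest neighbor of point i in the ORIGINAL space

% Row 1 holds segment starts, row 2 segment ends, row 3 NaNs.
% Flattening column by column gives start, end, NaN, start, end, NaN, ...
% and each NaN makes plot lift the pen between segments.
X = [G(:,1)'; G(nn,1)'; nan(1, N)];
Y = [G(:,2)'; G(nn,2)'; nan(1, N)];

plot(X(:), Y(:), '-', G(:,1), G(:,2), 'o');
axis equal;                % keep distances comparable in both directions
```

If the classes are well separated by distance, the segments should mostly stay short and local; long segments criss-crossing the plot are the "overlapping mess" described above.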