1.0.0 Documentation



Running HiSee

HiSee can be run either as a standalone application or as a component in another application (see component mode). To run HiSee as a standalone application, double-click on "run.bat" in the Hisee/bin directory or, on Macintosh or Linux, open a terminal window, go to the Hisee directory, and enter:

sh bin/run.sh

HiSee will then open in its own window, with a default dataset pre-loaded. Note that HiSee is a Java application and as such requires that Java be installed on your system. Java is installed by default in most recent versions of Macintosh OS X and Linux. For Windows, if you have not done so already, you will have to obtain a copy of the Java Runtime Environment, version 1.4 or later. It can be downloaded here.

 


Overview

The HiSee window can be divided into four main parts: the Menu bar, the Tool bar, the Display, and the Status bar.

Menu commands in the menu bar are described below.

The tool bar buttons are the ones you will be most likely to use in practice as you explore a dataset. If you are new to HiSee, try changing the projection algorithm using the main drop-down list and, while in "Sammon," try pressing play and "randomize" repeatedly to get a sense of how the algorithm works. For details on each projection method, see the Projection Methods section; for details on the toolbar, including how to link different projections, see the Toolbar buttons section. Note that when HiSee is incorporated into another program, a Component mode indicator appears on the left side of the Tool bar (not shown above). This allows the user to monitor and control the flow of data into HiSee.

The status bar shows the number of points in the data set for the high dimensional object and the number of dimensions of the space containing the high dimensional object. In the illustration there are 819 data points in the high dimensional data set and the high dimensional space has 3 dimensions. The Status bar can be turned off in the Graphics preferences menu. Other information, in particular an error-rate for the Sammon map, can also be shown here. See the Preferences section below.

The display contains the image of the projected object. The user can pan and zoom this image by dragging the mouse while holding the left and right buttons down, respectively (see the Display section below) and, by clicking on individual points, can see what they correspond to in the high dimensional dataset.


Projection Methods

HiSee projects data in two ways: (1) by applying algorithms to the entire dataset at once, and (2) by dynamically adding new data points to a dataset based on its current configuration. The second method is described in the Adding New Data-Points section. The first method, the main one, is described here. Note that between (1) and (2) and the ability to perturb the data, quite a number of possibilities are open to the user. One can open an initial dataset using PCA, for example, then add new points using triangulation, and finally use the resulting configuration of points as an initial condition for the Sammon mapping algorithm.
Coordinate Projection: This is perhaps the simplest possible projection technique. If one has a list of data points with 40 components each, coordinate projection to two dimensions simply ignores all but two of these components, which are then used to display the data in two-space. In HiSee users can select which two components are used and view the object from several viewpoints. The program can also automatically select the two most variant dimensions to project along.
Principal Component Analysis (PCA): PCA builds on coordinate projection by making use of the "principal axes" of the dataset. The principal axes of an object are the directions in space about which the object is most balanced or evenly spaced. PCA selects the two principal axes along which the dataset is the most spread out and projects the data onto these two axes.
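
As an illustration of the idea behind PCA, here is a minimal Python sketch. It is not HiSee's own code (HiSee is written in Java and uses the JAMA linear algebra package); the function name pca_project and the use of power iteration with deflation to find the two leading principal axes are assumptions made to keep the example free of external libraries.

```python
import math

def pca_project(points, iterations=200):
    """Project points onto the two principal axes with the largest spread.

    Illustrative sketch only: the axes are found by power iteration on the
    covariance matrix, followed by deflation to obtain the second axis.
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]

    axes = []
    for _axis in range(2):
        # Fixed, non-symmetric starting vector, normalized to unit length.
        v = [1.0 / (k + 1) for k in range(dim)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no spread left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(dim) for j in range(dim))
        axes.append(v)
        # Deflation: remove the spread along the axis just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]

    # Coordinates of each centered point along the two axes.
    return [[sum(c[k] * axis[k] for k in range(dim)) for axis in axes]
            for c in centered]
```

The sign of each axis is arbitrary, so a projection produced this way may appear mirrored relative to another PCA implementation.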

Sammon map: The Sammon map is an iterative technique for making interpoint distances in the low-dimensional projection as similar as possible to the interpoint distances in the high-dimensional object. Two points close together in the high-dimensional space should appear close together in the projection, while two points far apart in the high-dimensional space should appear far apart in the visible projection. By minimizing an error function between the high- and low-dimensional sets of interpoint distances, the Sammon map does its best to preserve these distances in the projection. This iterative procedure can be watched in HiSee by loading a dataset and pressing the "play" button on the interface. Note that when Sammon mapping is invoked, points which overlap (what constitutes "overlap" can be specified in the Sammon preferences) are perturbed, since overlapping points cause the algorithm to divide by zero.
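
The following Python sketch shows one way the Sammon error function and a single iteration could look. It is a simplified illustration rather than HiSee's implementation: it uses plain gradient descent with a fixed step size (Sammon's original method uses a second-order update), the function names are hypothetical, and it assumes no two points coincide in either dataset.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sammon_stress(high, low):
    """Sammon error: sum of (D - d)^2 / D over all pairs, divided by sum of D."""
    n = len(high)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(dist(high[i], high[j]) for i, j in pairs)
    error = 0.0
    for i, j in pairs:
        big_d = dist(high[i], high[j])   # distance in the high-dimensional set
        small_d = dist(low[i], low[j])   # distance in the projection
        error += (big_d - small_d) ** 2 / big_d
    return error / total

def sammon_step(high, low, step_size):
    """Return new 2-D positions after one gradient-descent step on the stress."""
    n = len(high)
    total = sum(dist(high[i], high[j])
                for i in range(n) for j in range(i + 1, n))
    new_low = []
    for i in range(n):
        grad = [0.0, 0.0]
        for j in range(n):
            if j == i:
                continue
            big_d = dist(high[i], high[j])
            small_d = dist(low[i], low[j])  # zero here would divide by zero
            coeff = -2.0 / total * (big_d - small_d) / (big_d * small_d)
            grad[0] += coeff * (low[i][0] - low[j][0])
            grad[1] += coeff * (low[i][1] - low[j][1])
        new_low.append([low[i][0] - step_size * grad[0],
                        low[i][1] - step_size * grad[1]])
    return new_low
```

Calling sammon_step repeatedly corresponds to pressing "play"; the division by the low-dimensional distance is where coincident points would cause a divide-by-zero. The step size in this sketch need not be on the same scale as the step size in HiSee's preferences.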

For more background on these projection algorithms see the About section of the HiSee web-page.


Menus

 
The projection method menu allows the user to select the projection method to be used by HiSee. It displays the currently selected projection method. See the Preferences section below on this page for information about how to control the parameters of the projection methods. The different projections can be "chained" in interesting ways; the results of PCA, for example, can be used as an initial condition for the Sammon map, simply by selecting PCA before Sammon.
 

The preferences menu allows users to customize HiSee in various ways. Note that the "Projection Preferences" brings up a different dialog box depending on which projection method is selected. For more information on using these dialogs see the preferences section below.

Checking the "autoscale" box causes the HiSee program to automatically resize the image of the projected object in the Display. This can make it easier to see the projected image. Unchecking the autoscale box disengages the autozoom feature and allows users to zoom in on data (see the display section below).


HiSee uses three types of data files: high, low, and combined. A data file can consist entirely of the coordinates for the high dimensional data set, entirely of coordinates for the low dimensional data set, or it can combine both the coordinates for the high dimensional data set and the coordinates for the projected low dimensional data set. Some of these options are counter-intuitive (why save high-dimensional data that have just been loaded?), but every form of persistence turns out to be useful in certain contexts, as discussed below in the File format section. It is also possible to add high-dimensional data, in which case one adds data points to a dataset which has already been loaded and projected. This can be useful when one wants to test or make use of add-data methods.


Toolbar buttons

Button Action performed
  Iterate indefinitely (For projection methods that use iteration).
The triangle indicates that the program is not currently iterating the projection algorithm. Pressing this button will cause the program to iterate the algorithm an indefinite number of steps and will change the icon into a square "stop" sign. If pressing the button does not change the triangle into a square then the current projection method does not use iteration.
  Stop Iteration   (For projection methods that use iteration).
The square indicates that the program is currently iterating the projection algorithm. Pressing this button will cause the program to stop iterating the algorithm. The button will change to a triangle to indicate that it is no longer iterating the algorithm. If you do not see the square turn into a triangle be patient. It just means that the program is still completing the last few iterations.
  Iterate One Step (For projection methods that use iteration).
Pressing this button will cause the projection algorithm to iterate a single step. 
  Erase data
This button clears both the high dimensional data set and low dimensional data set from the program. Useful for clearing out data that have accumulated when the program is run in component mode.
  Perturb data
This button randomly perturbs the points in the low dimensional set. Useful for bumping the Sammon map out of local minima, and for exploring different possible projections of a given dataset under the Sammon map.


Preference Dialog Boxes

Only add points if at least this far from any other point. When datasets are initially loaded, or when data are added to an existing dataset (e.g. when HiSee is run in component mode), we want to ignore repeated points. Even if a new point is not exactly the same as some other point in the set, it may be "close enough" to be considered the same point. This field allows one to set a tolerance level for deciding whether two points are the same. If "2" is specified in this field, for example, then any new point within a radius of 2 of some existing point will not be added to the dataset. Note: Repeated points are allowed in the low-dimensional dataset; this field only applies to the high-dimensional data.
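
The tolerance check described above can be sketched in a few lines of Python. The function name add_if_distinct is hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import math

def add_if_distinct(dataset, new_point, tolerance):
    """Append new_point to dataset unless it is closer than `tolerance`
    to some point already present. Returns True if the point was added."""
    for existing in dataset:
        gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(existing, new_point)))
        if gap < tolerance:
            return False  # treated as a repeat of an existing point
    dataset.append(list(new_point))
    return True
```

With a tolerance of 2, a new point at distance 1.4 from an existing point is rejected, while one at distance 3 from every existing point is added.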

Degree to which to perturb overlapping low-dimensional points is the distance the program will move coincident low dimensional points before running the Sammon mapping algorithm. It must do this because overlapping low-dimensional points will cause the Sammon map to divide by zero (this is observed on-screen as the disappearance of all data or the contraction of data to a small point).
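
A minimal Python sketch of this pre-processing step follows. The name perturb_overlaps, the single pass over all pairs, and the choice of a uniformly random direction are assumptions made for illustration; HiSee may separate overlapping points differently.

```python
import math
import random

def perturb_overlaps(low, overlap, amount, rng=random):
    """Return a copy of the 2-D points in which, for every pair closer than
    or equal to `overlap`, the second point is moved a distance `amount`
    in a random direction."""
    result = [list(p) for p in low]
    for i in range(len(result)):
        for j in range(i + 1, len(result)):
            gap = math.hypot(result[i][0] - result[j][0],
                             result[i][1] - result[j][1])
            if gap <= overlap:
                angle = rng.uniform(0.0, 2.0 * math.pi)
                result[j][0] += amount * math.cos(angle)
                result[j][1] += amount * math.sin(angle)
    return result
```

For the separation to be effective, `amount` should be larger than `overlap`.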

The method for adding new data points combo-box allows users to choose how new points will be added to the dataset (in component mode). For more on these methods see below.

Checking Show Error causes the program to output the value of the Sammon error function to the status bar while it iterates the Sammon mapping algorithm. 

Checking Show the Status Bar places the number of points in the data set for the high dimensional object and the number of dimensions of the space containing the high dimensional object in the Status bar.

The default color of the points in the projected image is green. Checking the Color the data points box colors the data points sequentially red, orange, yellow, green, blue, violet, red. The points are colored to indicate the order in which they were listed in the high dimensional data set.  

Minimum Point Size is the relative size of the dots in the display which represent the points in the projected image.

Number of iterations between Graphics update is the number of times the projection algorithm will iterate before updating the projected image in the Display.

Margin size controls the size of the projected image in the display when autoscale is on.

Coordinate Projection Preferences

First and second dimension to project control which dimensions of the high-dimensional data are projected to the horizontal and vertical axis of the Display.

If Automatically use most variant dimensions is selected then the program selects the two most variant axes of the high dimensional dataset for coordinate projection.
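
As a sketch of what selecting the most variant dimensions involves, the Python below computes the variance of each coordinate and projects along the two largest. The function names are hypothetical and the code is an illustration, not HiSee's implementation.

```python
def most_variant_dimensions(points):
    """Return the indices of the two coordinates with the largest variance."""
    n, dim = len(points), len(points[0])
    variances = []
    for k in range(dim):
        mean = sum(p[k] for p in points) / n
        variances.append(sum((p[k] - mean) ** 2 for p in points) / n)
    ranked = sorted(range(dim), key=lambda k: variances[k], reverse=True)
    return ranked[0], ranked[1]

def coordinate_projection(points, first, second):
    """Keep only the two chosen coordinates of every point."""
    return [[p[first], p[second]] for p in points]
```

For example, coordinate_projection(points, *most_variant_dimensions(points)) discards every coordinate except the two along which the data vary most.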

Sammon Mapping Preferences

This dialog allows the user to set the step size used at each iteration by the Sammon map. The bigger the step size, the faster the projection algorithm will run, but if the step size is too large the projected image will explode. One can generally experiment with different step sizes to find a suitable one. If iteration is progressing very slowly, one can try something large, like 100, 300, or even 1000. If the dataset "explodes" (in which case everything in the display may contract to a point), press the randomize button to start over.

TIP: A step size of a little less than 1 is good for objects with about a dozen points while a step size in the hundreds is good for objects with hundreds of points.

 


Adding New Data-Points

In some cases it is useful to be able to add new points to an existing dataset without running the projection method on the whole dataset again. Methods exist for quickly adding new data points based on those that have already been projected. This can be particularly useful when HiSee is used as a component in an application which generates new data in real-time. These methods will also be invoked if new data are added using the menu command, "Add High-dimensional data." Note that these methods work best when a certain amount of data has already been collected and projected using, for example, PCA or the Sammon map, for new points are added based on the positions of current points.

Nearest Neighbor subspace: The nearest-neighbor subspace method (1) takes each new point and determines the three points in the current data set which are closest to it, (2) finds the projection of the new point into the two-dimensional subspace which contains these three nearest neighbors in the high-dimensional space, (3) uses the three nearest neighbors and their corresponding points in the low dimensional dataset to find an affine map which approximates the full projection method (whichever one is currently being used), and (4) applies that affine map to the new datapoint.
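
Steps (1) through (4) can be sketched in Python as follows. This is an interpretation of the description above rather than HiSee's code: the function name is hypothetical, the projection in step (2) is computed by least squares, and the three nearest neighbors are assumed not to be collinear.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_neighbor_subspace(high, low, x):
    """Place new high-dimensional point x in 2-D using its 3 nearest neighbors."""
    # (1) The three points of the current dataset closest to x.
    order = sorted(range(len(high)), key=lambda i: dist(high[i], x))
    i0, i1, i2 = order[:3]

    # (2) Project x onto the plane through the three neighbors:
    #     x is approximated by h0 + a*e1 + b*e2 (least squares).
    e1 = [p - q for p, q in zip(high[i1], high[i0])]
    e2 = [p - q for p, q in zip(high[i2], high[i0])]
    r = [p - q for p, q in zip(x, high[i0])]
    g11, g12, g22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    det = g11 * g22 - g12 * g12
    if det == 0.0:
        raise ValueError("nearest neighbors are collinear")
    b1, b2 = dot(e1, r), dot(e2, r)
    a = (g22 * b1 - g12 * b2) / det
    b = (g11 * b2 - g12 * b1) / det

    # (3)+(4) The affine map sending each neighbor to its 2-D image sends
    #         h0 + a*e1 + b*e2 to l0 + a*(l1 - l0) + b*(l2 - l0).
    l0, l1, l2 = low[i0], low[i1], low[i2]
    return [l0[k] + a * (l1[k] - l0[k]) + b * (l2[k] - l0[k]) for k in range(2)]
```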

Triangulate: The Triangulate method takes each new point and determines the two points in the current data set which are closest to it. It may or may not be possible to place the projected image of the new point so that its distances from the projected images of its two nearest neighbors are the same as in the high-dimensional space. When it is possible, there will almost always be two such places. The choice of which to use is made using the distance from the new point to the point in the current data set which is third closest to it. When it is not possible to project the new point so that its distances to its two nearest neighbors are preserved, the projected image of the new point is placed on the line connecting the projected images of its two nearest neighbors. In this case the position of the projected image of the new point on this line is determined by the relative sizes of the distances between the new point and its two nearest neighbors in the current data set.
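
The geometry of the Triangulate method amounts to intersecting two circles in the plane. The Python sketch below is one reading of the description, not HiSee's code; the function name is hypothetical, and the fallback position (dividing the segment between the two neighbors in the ratio of the two high-dimensional distances) is an assumption about what "relative sizes" means.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triangulate(high, low, x):
    """Place new high-dimensional point x in 2-D from its nearest neighbors."""
    order = sorted(range(len(high)), key=lambda i: dist(high[i], x))
    a, b, c = order[:3]
    r1, r2, r3 = dist(high[a], x), dist(high[b], x), dist(high[c], x)
    p, q = low[a], low[b]
    d = dist(p, q)

    # Circles of radius r1 about p and r2 about q intersect only when
    # |r1 - r2| <= d <= r1 + r2.
    if d > 0.0 and abs(r1 - r2) <= d <= r1 + r2:
        ux, uy = (q[0] - p[0]) / d, (q[1] - p[1]) / d   # unit vector p -> q
        t = (r1 * r1 - r2 * r2 + d * d) / (2.0 * d)     # offset along p -> q
        h = math.sqrt(max(r1 * r1 - t * t, 0.0))        # offset across it
        base = (p[0] + t * ux, p[1] + t * uy)
        candidates = [[base[0] - h * uy, base[1] + h * ux],
                      [base[0] + h * uy, base[1] - h * ux]]
        # Pick the candidate that best preserves the distance to the
        # third-nearest neighbor.
        return min(candidates, key=lambda y: abs(dist(y, low[c]) - r3))

    # Otherwise place the point on the line through p and q (assumed rule).
    total = r1 + r2
    w = r1 / total if total > 0.0 else 0.0
    return [p[0] + w * (q[0] - p[0]), p[1] + w * (q[1] - p[1])]
```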

Off: Data points are not added using any special algorithm. Rather, when new data points arrive, the current projection algorithm is re-run on the entire updated dataset (if the current projection algorithm is an iterative algorithm like the Sammon map then coordinate projection is used by default). PCA tends to be useful in refresh mode, for it is relatively fast but also takes into consideration the entire dataset. For faster performance coordinate projection with "automatically select most variant dimensions" can also be used.


Component Mode

HiSee is open source Java code and can be incorporated into other programs. When HiSee is used as a component of a larger program, a Component mode indicator appears on the left side of the Tool bar. This indicator shows whether HiSee is receiving data from the program it is embedded in (using some add data method). By clicking on the indicator the user can open or close the HiSee component to new data.

This indicates that the HiSee component is currently open to receiving data from the rest of the program. Clicking on the indicator will close the HiSee component to receiving data from the rest of the program.

This indicates that the HiSee component is currently closed to receiving data from the rest of the program. Clicking on the indicator will open the HiSee component to receiving data from the rest of the program.


Display (Panning and Zooming)

The HiSee display is currently 2-dimensional. It is based on the Piccolo zoomable user interface (ZUI), which allows users to pan and zoom graphical data. When autoscale in the preference menu is turned off, you can pan the visible data by left-dragging (dragging the mouse while holding the left mouse button down), and you can zoom in or out on data by right-dragging (dragging the mouse while holding the right button down). You can also click on data points to reveal which high-dimensional points they correspond to (results will appear in the terminal window).


Keyboard Commands

Key   Action performed

H     Dump the high-dimensional dataset to the terminal window.

L     Dump the low-dimensional dataset to the terminal window.

 


 

File format

The data files used by HiSee contain lists of Cartesian coordinates for the positions of the points in the data sets. The file format is known as "Comma Separated Values" or CSV, which is recognized by a wide range of programs, including Microsoft Excel (though you may have to rename the files with the suffix ".csv"). The coordinates for each point are placed on their own line in the file and are separated by commas. The coordinates can be further separated by white space characters for easier reading by a user, but this is not necessary. For example, here are the contents of a data file for an equilateral triangle in three-dimensional space:

1, 0, 0
0, 1, 0
0, 0, 1

Default file: Note that there is a default file, named "default.hi" in the Hisee/data directory, which is opened by default when the program launches, and projected using PCA. You can set this default file yourself simply by copying a file to the Hisee/data directory and renaming it "default.hi."

HiSee uses three kinds of files: files containing high-dimensional data, files containing low-dimensional data, and combined files containing both kinds of data. Each kind of file is automatically saved with a suffix. When opening files, HiSee filters out all files except those with the relevant suffix.

Suffix Kind of File

.hi

High-dimensional data. ".hi" files contain the high-dimensional data to be studied. It is useful to save high-dimensional data at the end of a session when the program is used in component mode and new data are constantly being added. One can also add high-dimensional data to a dataset using the Add High Dimensional Data menu command.

.low

Low-dimensional data. If a low dimensional dataset is opened, it must have the same number of datapoints as the high-dimensional dataset. It is sometimes useful to save several different low dimensional data files corresponding to different projections of the same data. It is also useful to save low-dimensional data to analyze a particular projection in more detail.

.comb

Combined data. Both high- and low-dimensional data; the "results" of a HiSee session. It is useful to save combined data when a high-dimensional dataset has taken a long time to project (using the Sammon map, for example), and the user wants to quickly open it later and begin where he or she left off. This data format is also useful for demonstrations.

Format for combined data files. When the combined high- and low-dimensional data sets for a figure are saved, the file begins with the high-dimensional coordinates, followed by two blank lines, and ends with the low-dimensional coordinates. For example, here are the contents of a data file for an equilateral triangle in three-dimensional space which has been projected down to two dimensions:

1, 0, 0
0, 1, 0
0, 0, 1


1.1220084679281466, 0.12200846792814646
0.12200846792814646, 1.1220084679281466
-0.24401693585629225, -0.24401693585629225

Comments: Lines in a data file which begin with a "#" character are comments and are ignored by HiSee when reading the file. When HiSee creates a new data file it adds comments at the beginning of the file to record the name of the file. If the file contains combined high- and low-dimensional data, HiSee adds the comment "Combined". For example:

#
# Combined File: C_triangle.csv
#

1, 0, 0
0, 1, 0
0, 0, 1


1.1220084679281466, 0.12200846792814646
0.12200846792814646, 1.1220084679281466
-0.24401693585629225, -0.24401693585629225
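
For readers who want to process these files outside HiSee, the following Python sketch reads the format described in this section. It is not part of HiSee; the function name parse_hisee_text is hypothetical, and the sketch treats any run of blank lines as a separator between coordinate blocks instead of requiring exactly two.

```python
def parse_hisee_text(text):
    """Parse the contents of a HiSee data file.

    Returns a list of coordinate blocks: one block for a high- or
    low-dimensional file, two blocks (high, then low) for a combined file.
    Lines starting with '#' are comments and are skipped.
    """
    blocks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue                      # comment line
        if not stripped:
            if current:                   # a blank line closes the block
                blocks.append(current)
                current = []
            continue
        # One point per line; float() tolerates spaces around each value.
        current.append([float(value) for value in stripped.split(",")])
    if current:
        blocks.append(current)
    return blocks
```

Applied to the combined example above, the function returns two blocks: three points with three coordinates each, followed by three points with two coordinates each.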


Credits

HiSee was written by Scott Hotton and Jeff Yoshimi. Help was provided by Matthew Lloyd. Thanks also to Patricia Churchland and Paul Churchland.

This software uses the Piccolo Zoomable User Interface library from the University of Maryland, a CSV utility by Stephen Ostermiller, and the JAMA linear algebra package.