Introduction to Topological Data Analysis

Overview

Ayasdi Iris is a tool for visually analyzing many different kinds of data sets. It creates images, or maps, which represent the data in a conceptually useful way, and lets the user interact with them to better understand the nature of the data. The image is not obtained by the standard method of projecting onto two or three coordinates and viewing a scatterplot. Rather, it uses intrinsic properties of the data in question and produces a combinatorial representation which can then be laid out in a convenient form.

To give an idea of how this is done, we draw an analogy between what we do and the way one uses actual, physical maps of regions on the surface of the earth. We emphasize that the data is in no sense required to be geographical; rather, we create abstract images which reflect the structure of the data. Here are four things one does with a physical map.

1. One looks at the map to obtain a global summary of the region one is interested in. For example, a look at the map of the United States would show that it breaks up into three noncontiguous pieces, namely the continental 48 states, Alaska, and Hawaii. As another example, if one had a map of North and South America together, one could find that it is only possible to travel from the Atlantic Ocean to the Pacific by going south of Cape Horn, through the Northwest Passage, or through the Panama Canal.
2. One could query a particular region of the map for towns contained in the region. For example, one could select Rhode Island as the region, and ask for the cities in that state.
3. One could list one or more towns and ask where they belong in the map.
4. One could ask for higher resolution maps to see more detail.

These are all tasks which can be useful in the study of data. It is frequently important to obtain an overall structural understanding of one's data set. In the study of diabetes, for example, a key feature of the disease is that it breaks into two essentially different forms, namely the type I and type II varieties. This can be reflected in the image or map of a data set of diabetes patients, as we will see below. If we have a region in the image of the data set of patients, we can ask what points it contains, and what characterizes these data points. In the case of diabetes data, one finds that the type I and type II varieties are roughly localized in separate regions of the image, and one is able to characterize the different regions using a metabolic signature. If one already knows that one's data breaks up into separate pieces, one can view where the subsets lie in the image. It will turn out that the Iris images include a notion of resolution, and that one can produce images at various resolution values.

This tutorial will show you how to create images of your data, how to interpret them, how to search within them, and how to place given data points in the image. We will begin by showing you how the method works, without describing in detail how to construct the images. After we have completed this general description, we will tell you how to construct images of your own data sets.

Copyright 2013 Ayasdi Inc.

Geometry of Data

One of the important ideas in using Iris is to think about data sets geometrically. Let's illustrate what this means. Many kinds of data are given as data matrices. For example, suppose we have this data matrix:

Notice that in every row, there is an identifier and a pair of numbers, or coordinates. We know that each pair of numbers can be represented as a point in the plane. Here is a picture of what happens when we plot each of these ten rows in the plane.

Because we can represent rows in a data matrix this way, it is helpful to think of them as points, and we often will refer to each of the rows of the matrix as points or data points. Notice that if we had a data matrix with three data columns instead of two, then we could plot the rows in 3-dimensional space, and we would also have reason to refer to the rows as points. If we had four columns, then the points would lie in a space we cannot see (we can only see things in 3 dimensions), but the notions of geometry have an extension to sets of rows with any number of columns. The distance from one point to another still makes sense, since there is a simple algebraic formula for it. We will think of data matrices as sets of points in a space which we can't necessarily visualize.

Representing the Geometry

The idea behind Iris is that it allows us to represent a data set in a way which is very easy to understand, even if the data set sits in dimensions much higher than we can visualize. Here is a picture of a data set which we already can visualize, with which we will illustrate the representation of data which we use.

We will represent the data using a version of the well known notion of Venn diagrams. The idea is that we will find collections of data points which may overlap, and draw a diagram which encodes the information about which sets overlap and which don't. In the picture below, the collections are drawn as yellow discs, and the sets in question are the sets of data points contained in each of the balls. Each yellow ball represents a set of points in the data set.

The final step is to create a diagram in which each collection is represented by a node, and where nodes corresponding to overlapping collections are connected by an edge. The diagram is drawn below. Each yellow ball in the diagram above corresponds to exactly one of the red nodes in the picture below.
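The node-and-edge construction just described can be sketched in a few lines of Python. This is only an illustration, not Iris's implementation: the function name `overlap_graph` and the point identifiers are hypothetical, and the collections are assumed to be given already as sets of data point identifiers.

```python
from itertools import combinations

def overlap_graph(collections):
    """Return (nodes, edges): one node per collection, one edge per overlapping pair."""
    nodes = list(range(len(collections)))
    edges = [
        (i, j)
        for i, j in combinations(nodes, 2)
        if collections[i] & collections[j]  # the two collections share a data point
    ]
    return nodes, edges

# Hypothetical example: four collections of data point identifiers.
collections = [
    {"p1", "p2", "p3"},
    {"p3", "p4"},
    {"p4", "p5", "p6"},
    {"p7", "p8"},
]
nodes, edges = overlap_graph(collections)
print(nodes)  # [0, 1, 2, 3]
print(edges)  # [(0, 1), (1, 2)]
```

Only the overlap pattern is recorded. Where a node is drawn, and how far apart two nodes appear, is a layout decision made afterwards.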

Features in the Geometry


Our visual system allows us to identify certain kinds of interesting features in geometric objects. Two very important classes of features are flares and loops. They are illustrated in the following pictures. Features like these are reflected in the images produced by Iris. The pictures below illustrate this idea.

Flares reflected in Iris image

Loops reflected in Iris image

An important point to notice is that the placement of the nodes in the Iris image, and the distances between the nodes, do not reflect anything intrinsic about the data set.

It is also important to understand what happens when Iris is applied to unstructured data, by which we will mean data which is normally distributed in dimensions greater than 2. In this case, the Iris output for both density and centrality lenses looks like this.

The coloring is done by the density filter values, so a red, high density core is represented by the leftmost node, with a gradual transition to a blue, low density region. If we color by the centrality lens, the colors are reversed, giving this picture.

The red node represents the high eccentricity region, i.e. the one consisting of points far from the center. If in studying your data set you obtain this kind of output, it could mean that the data is unstructured, but it could also result from many other possibilities. Here are a few.

1. The resolution and the gain might have been chosen too low. Consider increasing them independently and looking at the higher resolution images they produce.
2. It might be that the filter values are too skewed, in the sense that the features occur in too small a range of filter values. In this case, you could try equalizing the filter.
3. The lens or lenses you have chosen may not be appropriate for the problem. Experiment with different choices to see what happens.
4. The metric chosen may not be appropriate. For example, correlation is often appropriate when Euclidean is not. Experiment with different choices of metric as well.

Straight lines are not always uninteresting, though. If you obtain the picture below, it is an indication of one dimensional structure in your data set.

Although the output is a straight line, the coloring by the lens is not the one obtained from unstructured data; it is the output one would find on data drawn from a one-dimensional normal distribution. In this case, it is colored by a centrality lens. If the data has more than a single coordinate, this is an interesting observation about it.

Using Geometric Features to Understand Your Data Set


Let's look at a real data set, and see how the geometric features we see are interpretable in terms of understanding the decomposition of the data set into interesting groups. The picture below comes from the Miller-Reaven diabetes study performed at Stanford during the 1970s. The data matrix contains 145 rows, each of which has 5 entries, corresponding to various metabolic variables and a normalized notion of size. They are also classified by whether the patient is normal or near normal, or has type I or type II diabetes. The image was constructed using a density lens.

Miller-Reaven Diabetes Data


The coloring of the image reflects the value of the density lens, so red nodes contain high density data points and the blue nodes contain points of low density. The size reflects the number of data points contained in the node. We see three flares, two of which are blue at the extremity and one of which contains red and orange nodes.

Geometric Feature: Flares


This suggests that we explore the data points in the nodes at the extreme ends of the flares. We will later show in detail how one performs the exploration. The results are that the patients in the orange extreme node are essentially normal, one of the blue extreme nodes consists of type I diabetes patients, and the other blue extreme node consists of type II diabetes patients. Further examination of the nodes shows that nearly all the points contained in the flare labelled A are normal or near normal, that flare B contains almost exclusively patients with type I diabetes, and that flare C contains almost exclusively patients with type II diabetes. In fact, it appears likely that flare A actually represents the core of the data set, with the blue flares representing the less central portions of the data. Finding flares in the Iris image is often an important way to understand the different portions of the data set. A typical way to approach any data set is to create the Iris image, and then examine the endpoints of each flare in order to determine what distinguishes the flares from one another. This can be done by hand, but one can also use the Explain feature, which will be described below.

To illustrate the notion of resolution, here is an Iris image of the same diabetes data set, but at a higher resolution.

Higher Resolution image of Miller-Reaven Set

You will notice that the rough structure (two blue flares and a red/orange flare) is still preserved, but the blue flares are longer than the flares in the lower resolution image. The red/orange group now shows additional detail; in fact, it appears to split into two shorter flares. The fact that the blue flares contain the type I and type II patients is also preserved. The presence (and lengthening) of the blue flares is corroborative evidence that these flares are genuine features of the data set, and not artifacts. The structure in the red flare is less obvious to interpret, and since it does not persist when the resolution is lowered, it is potentially an artifact.

Geometric Feature: Loops

One can also use the presence of loops to analyze a data set. Suppose that we study employment statistics within a state. We are able to estimate the total employment each month, and we are also able to estimate the new hirings and layoffs which occur within a given month. Suppose we carry out this data collection over a period of 10 years, so that we have 10 x 12 = 120 data points, each of which consists of two numbers, the total employment E and the difference D = (new hires - layoffs) for the given month. E will always be positive, but D may be positive or negative. Suppose that we now create an
Iris image based on the quantity E used as a lens. The image might look something like the one below. We see that we have a red node on the right (high employment) and a blue node on the left (low employment) and then values in between. Looking at the picture, we see that there are two yellow nodes A and B, for which the employment values are roughly comparable, but which are somehow distinct.

We can then explore the members of nodes A and B, and find that in the case of A, the variable D is significantly positive and that in the case of B, it is significantly negative. This means that node A represents data points occurring in a period of increasing employment, and that node B represents points occurring in a period of declining employment. The loop in the data suggests the presence of a business cycle, in which we move from low employment up to high employment and back down to low employment.
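To see how a loop can emerge from a lens, here is a minimal sketch in the spirit of the Mapper construction from topological data analysis. It is not Iris's actual algorithm, and everything in it is an assumption made for illustration: the synthetic employment numbers (with D rescaled to be comparable to E), the parameter names `n_intervals`, `overlap` and `eps`, the use of overlapping intervals of lens values, and the simple "join points closer than `eps`" grouping inside each interval.

```python
import math
from itertools import combinations

def components(items, linked):
    """Group items into connected components; linked(a, b) says whether a and b are joined."""
    groups, unvisited = [], list(items)
    while unvisited:
        stack = [unvisited.pop()]
        group = set(stack)
        while stack:
            a = stack.pop()
            near = [b for b in unvisited if linked(a, b)]
            for b in near:
                unvisited.remove(b)
            group.update(near)
            stack.extend(near)
        groups.append(group)
    return groups

def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Cover the lens range with overlapping intervals, group the points in each, link overlaps."""
    lo, hi = min(lens), max(lens)
    length = (hi - lo) / n_intervals
    pad = overlap * length  # how far each interval is extended on both sides
    nodes = []
    for i in range(n_intervals):
        a, b = lo + i * length - pad, lo + (i + 1) * length + pad
        members = [k for k, value in enumerate(lens) if a <= value <= b]
        close = lambda p, q: math.dist(points[p], points[q]) <= eps
        nodes.extend(components(members, close))  # each group of point indices is a node
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2) if nodes[i] & nodes[j]]
    return nodes, edges

def loop_count(nodes, edges):
    """Number of independent loops: edges - nodes + connected components."""
    edge_set = set(edges)
    linked = lambda i, j: (i, j) in edge_set or (j, i) in edge_set
    return len(edges) - len(nodes) + len(components(range(len(nodes)), linked))

# Synthetic data: one business cycle over 120 months (hypothetical numbers).
theta = [2 * math.pi * t / 120 for t in range(120)]
E = [100 - 10 * math.cos(a) for a in theta]  # total employment
D = [10 * math.sin(a) for a in theta]        # rescaled (new hires - layoffs)
points = list(zip(E, D))

nodes, edges = mapper_graph(points, lens=E, n_intervals=5, overlap=0.25, eps=1.0)
print(len(nodes), len(edges), loop_count(nodes, edges))  # 8 8 1

for node in nodes:
    mean_E = sum(E[k] for k in node) / len(node)
    mean_D = sum(D[k] for k in node) / len(node)
    print(round(mean_E, 1), round(mean_D, 1))
```

In the printed summary, the intermediate lens intervals each contribute two nodes with comparable mean E but mean D of opposite sign, which is the situation of nodes A and B. The lowest and highest intervals contribute one node each, closing the loop.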

Metrics
The choice of metric is a key choice one makes in the analysis of one's data, and it will affect the Iris output. A typical consideration which is applicable to the choice of metric comes up in gene expression profiling, where one finds that one is often not so concerned with absolute gene expression levels as with the relative expression values among the different genes or probes. In this case, one would not want to study so much absolute values, but rather some kind of correlation or angle (in high
dimensions) between the vectors representing the data points. Of course, one should also experiment with the different choices of metric on one's data set to see what the different outputs are and what kind of information about the data they yield.

Euclidean

The Euclidean distance applies to data sets which come in the form of a data matrix, i.e. as a set of N-tuples of numbers. It generalizes the usual distance formula studied in 2 and 3 dimensional space. It is used when you believe that your coordinates are comparable in value. For example, if the coordinates are all observations of the same phenomenon at different time points, then this distance is very appropriate. However, suppose that we have a data set in which one coordinate is a person's weight (in kilograms) and another is his/her height (in meters). Because the height values are numerically much smaller than the corresponding weight values, using the Euclidean metric will underemphasize the effect of height relative to weight.

Normalized Euclidean

This metric is a variant of the Euclidean metric in which the scale choices are accounted for. For each variable, one finds its mean and standard deviation, and rescales the value of the coordinate around its mean by dividing by the standard deviation of the set of values taken by the coordinate.

Cosine and Angle

This distance also applies to data matrices, where the points correspond to N-tuples of numbers for some (fixed) N. The distance between two vectors is computed as 1 - cos(θ), where θ is the angle between the two vectors. This cosine is computed using the dot product of the vectors.

This metric should be used when one does not want to distinguish between two vectors which differ in absolute magnitude but are collinear. A similar metric is the angle metric, in which the distance is simply the value θ of the angle between the two vectors.

Correlation

This is the Pearson correlation distance. It is available for data matrices, i.e. for collections of N-tuples for fixed N. For each data point, we subtract from each coordinate the average of all the coordinates of that point, to create a new data set. The distance is then the cosine distance between the new (mean centered) points. This is an excellent distance when one suspects that similar phenomena may be represented by different mean values.

Hamming

This is a distance used for the analysis of Boolean data. This means that the data is given as vectors of {0,1} values. The distance between two such Boolean vectors is the number of coordinates in which they differ.
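The metrics in this section can be written out directly from their descriptions. The sketch below uses only the Python standard library. The function names are hypothetical, the normalization is assumed to use the population standard deviation (the text does not say which variant is meant), and vectors are assumed to be nonzero and columns non-constant so that no division by zero occurs.

```python
import math

def euclidean(u, v):
    """Usual distance formula, extended to any number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def standardize(rows):
    """Rescale every column around its mean by its standard deviation.

    Assumption: population standard deviation, and no constant columns.
    Normalized Euclidean distance is then euclidean() on the rescaled rows.
    """
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def cosine_distance(u, v):
    """1 - cos(theta), with the cosine obtained from the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def angle_distance(u, v):
    """The angle theta itself, in radians."""
    cos_theta = 1.0 - cosine_distance(u, v)
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error

def correlation_distance(u, v):
    """Cosine distance after subtracting each point's own coordinate average."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])

def hamming(u, v):
    """Number of coordinates in which two Boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

# Hypothetical rows: weight in kilograms, height in meters.
people = [[60.0, 1.60], [80.0, 1.80], [100.0, 1.70]]
scaled = standardize(people)
print(euclidean(people[0], people[1]))  # dominated by the weight column
print(euclidean(scaled[0], scaled[1]))  # both columns now contribute

print(euclidean([0, 0], [3, 4]))                       # 5.0
print(cosine_distance([1, 2, 3], [2, 4, 6]))           # about 0: collinear vectors
print(correlation_distance([1, 2, 3], [11, 12, 13]))   # about 0: same shape, different mean
print(hamming([0, 1, 1, 0], [1, 1, 0, 0]))             # 2
```

The two collinear vectors differ in magnitude, so their Euclidean distance is large while their cosine distance is essentially zero. Likewise, the two vectors with different mean values are far apart in the Euclidean sense but indistinguishable under the correlation distance.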

Lenses
Lenses are used together with the metric to construct the Iris output. A lens is a real valued function on the data set. As such, it could be any user-defined function, but Iris provides a number of such functions which are computed automatically from the metric. They are chosen to be functions which are of importance in the statistical analysis of data. One can use two or more lenses simultaneously in the analysis of each data set. As with the choice of metric, it is useful to carry out a great deal of experimentation with different choices of lens or sets of lenses, as well as resolution and gain choices for the lenses.

Density

This lens is the result of applying a Gaussian kernel density estimator to a Euclidean data set. Estimating probability density functions based on finite samples is one of the fundamental tasks in statistics. This lens should be among the first ones tried, and can of course be used in conjunction with the other lenses provided.

Centrality

Measures of how central or peripheral a point is in a data set are also important quantities in the study of a data set. There are a number of such measures which come under the heading of data depth. We choose two such measures, which we refer to as L-infinity and L1 centrality. The first one associates to each point x the maximal distance from x to any other data point in the data set, and the second computes the average value of the distances from x to all the other data points. For normally distributed variables, these measures are strongly correlated with density, but for other distributions, for instance bimodal distributions, they are not. Note that this lens does not require the data set to be Euclidean.

Variance

This lens is computed for Euclidean data sets as the sum of the squares of the differences of each of the coordinates from the mean value of each of the coordinates. This is typically useful only when the
different coordinates are comparable, for example if they are obtained by measuring the same quantities at different times or different locations.

Resolution and Gain

For each lens you choose, you will also choose two numbers, resolution and gain. Both will affect your Iris output in a significant way. The intuition is that increasing the resolution will construct an Iris output which is in a sense a higher resolution version of the output. For example, it will contain a larger number of nodes. As one increases the resolution, there is a certain point at which the output begins to be very fragmented, and most of the connections are lost. One way to overcome this is by increasing the gain. Increasing the gain will increase the number of edges in the output. Experimentation with various different choices of the resolution and gain often allows one to obtain a substantially improved image of the data set, which captures more information about it.
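To close the section on lenses, the density and centrality lenses described earlier can be sketched directly from their definitions. This is an illustration only, not the code Iris runs. The function names and the `bandwidth` parameter are hypothetical, the density estimate is left unnormalized on the assumption that only relative values matter for coloring, and the variance lens is omitted because its description admits more than one reading.

```python
import math

def linf_centrality(points, dist):
    """For each point, the maximal distance to any other data point."""
    return [max(dist(p, q) for q in points) for p in points]

def l1_centrality(points, dist):
    """For each point, the average distance to all the other data points."""
    others = len(points) - 1  # dist(p, p) is 0, so it adds nothing to the sum
    return [sum(dist(p, q) for q in points) / others for p in points]

def gaussian_density(points, bandwidth):
    """Unnormalized Gaussian kernel density estimate at each data point (Euclidean data)."""
    return [
        sum(math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points)
        for p in points
    ]

# Hypothetical one-column data set: three nearby points and one far away.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]

print(linf_centrality(points, math.dist))  # [10.0, 9.0, 8.0, 10.0]
print(l1_centrality(points, math.dist))
print(gaussian_density(points, bandwidth=1.0))
```

The centrality functions take the distance as an argument because, as noted above, they do not require Euclidean data; any of the metrics from the previous section could be passed instead of `math.dist`. In this small example the isolated point at 10.0 receives a large L1 centrality value and the lowest density, consistent with the reversed coloring seen when switching between the two kinds of lens.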
