Node-Based Resilience Measure Clustering

Information about the experiments used in "Robust Graph-theoretic Clustering Approaches Using Node-Based Resilience Measures", presented at IEEE Conference on Data Mining, December 2016. 

The complete code is available on GitHub.

File Types:

The programming uses several different file types, which are explained here.

Points file

This is a file that contains data points.  For example, here is a few lines of the iris file, taken from the UCI Machine Learning Repository.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica

 Per the UCI readme, each line represents one example of a particular iris flower.  Each line contains 4 decimal numbers and one label.  The decimal numbers represent measurements taken of various parts of the iris, such as sepal and petal length and width.  The labels represent the type of iris.  The type of iris can be referred to as the ground truth of the dataset.  The idea is that the numbers, if clustered correctly, will form tighter groups within one cluster label.  For example, because there are 4 measurements, data points representing specimens of Iris-setosa will be closer to each other in four-dimensional-space than data points representing Iris-virginica specimens. 

Our format for points files is tab or space separated, and contains only the data points, not the labels.  For the iris file above, the point file would look like this:

5.1  3.5  1.4  0.2
4.9  3.0  1.4  0.2
4.7  3.2  1.3  0.2
4.6  3.1  1.5  0.2
7.0  3.2  4.7  1.4
6.4  3.2  4.5  1.5
6.9  3.1  4.9  1.5
5.5  2.3  4.0  1.3
6.3  3.3  6.0  2.5
5.8  2.7  5.1  1.9
7.1  3.0  5.9  2.1

The file extension of the point file is not relevant.  Usually, we use .txt.

Label (or Data) File

Data files contain the ground truth information for a point set.  We give these files a .data extension.  Generally, the last column of the data files is used as the ground truth.  For the iris dataset shown above, the data file could look like this:

5.1  3.5  1.4  0.2  1
4.9  3.0  1.4  0.2  1
4.7  3.2  1.3  0.2  1
4.6  3.1  1.5  0.2  1 
7.0  3.2  4.7  1.4  2
6.4  3.2  4.5  1.5  2
6.9  3.1  4.9  1.5  2
5.5  2.3  4.0  1.3  2
6.3  3.3  6.0  2.5  3 
5.8  2.7  5.1  1.9  3
7.1  3.0  5.9  2.1  3

At minimum, it contains the ground truth labels, like this:

1
1
1
1
2
2
2
2
3
3
3

Graph file

The graph file represents the graph that has been created from a points file.  It has a .graph extension.  The first line of a graph file has the word "weighted" or "unweighted", indicating a weighted or unweighted graph.  The first column of a graph file is always the vertex number, starting with zero.  The following columns represent edges, and the weights of the preceding edge.  For example:

weighted
0 54 10.3928052036012 45 22.3407475255418 48 24.7602322283132 
1 0 31.2650123940484 8 6.78638342565464 9 7.83291771947082 48 13.1407648179244 
2 52 12.2969915019894 26 13.1676193748149 50 36.7105734087606 
3 5 30.0919690282972 10 31.1594175170204 31 35.8096216120752 14 68.0764636273066 
4 68 17.7286801539201 168 20.7765252147706 174 22.4215588218125 
5 3 30.0919690282972 10 60.485243654961 31 65.3977094400102 
6 27 6.23935092778087 11 10.3446749586442 30 10.7227002196275 58 13.383822323985 
7 16 15.3305511968748 58 16.6335714745812 
8 1 6.78638342565464 9 3.27007645170568 29 10.3394583997422 
9 1 7.83291771947082 8 3.27007645170568 29 10.5494454830574 
... etc

In the above example, vertex 0 has an edge to vertex 54 with a weight of 10.3928052036012, an edge to vertex 45 with a weight of 22.3407475255418, and an edge to vertex 48 with a weight of 24.7602322283132 .

 An unweighted graph looks the same, except that the weights are ommitted:

unweighted
0 54 45 48  
1  0 8 9 48 
2 52 26 50  
3 5 10 31 14  
4 68 168 174  
5 3 10 31 
6 27 11 30 58  
7 16 58  
8 1 9 29 
9 1 8 29  
... etc

As before, the numbers are separated by spaces or tabs.

GML

In some cases, we can use graphs in GML format, which is a standardized format.  GML is described here.

Cluster file

The result of clustering a graph is the creation of a .cluster file. An example is shown below.  The first line of the cluster file tells first if the clustering results from a points file or a graph file.  It then gives the location and name of the file. In the example below, the clustering is done from a Graph file (see format above).  The exact file is iris_Euclidean_KNN_25.graph.

The second line gives the number of clusters, and then each cluster is listed on two lines.  The first line gives the number of elements in the cluster, and the second line lists them. The clustering shown has 3 clusters, with 50, 65 and 35 members.  This cluster file shows that two heirarchical steps were done in clustering.  In the first step, one node, node 98 was removed.  This created 2 clusters.  For the second step, the cluster (of the two) with the minimum VAT was divided by removing 20 nodes. 

Graph C:\Users\John\clustProject\iris\iris_Euclidean_KNN_25.graph
Clusters 3
50
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
65
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 101 106 113 114 119 121 123 126 127 133 134 138 142 146 149 
35
100 102 103 104 105 107 108 109 110 111 112 115 116 117 118 120 122 124 125 128 129 130 131 132 135 136 137 139 140 141 143 144 145 147 148 
Meta HVatClust
All calculated VATs:
0.0196078431372549 
Vat: MinVat=0.0196078431372549
Removed Count:1
98
All calculated VATs:
1.26666666666667 0.833333333333333 
Vat: MinVat=0.833333333333333
Removed Count:20
103,147,116,137,128,111,110,123,149,127,141,139,108,145,52,77,132,134,70,114