advisor : dr. hsu graduate : jian-lin kuo author : silvia nittel kelvin t.leung

25
1 Intelligent Database Systems Lab Advisor Dr. Hsu Graduate Jian-Lin Kuo Author Silvia Nittel Kelvin T.Leung Amy B raverman 國國國國國國國國 National Yunlin University of Science and Technology Scaling Clustering Algorithm for Massive Data Sets using Data Streams

Upload: dante

Post on 16-Mar-2016

75 views

Category:

Documents


0 download

DESCRIPTION

國立雲林科技大學 National Yunlin University of Science and Technology. Scaling Clustering Algorithm for Massive Data Sets using Data Streams. Advisor : Dr. Hsu Graduate : Jian-Lin Kuo Author : Silvia Nittel Kelvin T.Leung Amy Braverman. Outline. N.Y.U.S.T. I.M. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

1Intelligent Database Systems Lab

Advisor : Dr. HsuGraduate : Jian-Lin KuoAuthor : Silvia Nittel

Kelvin T.Leung Amy Braverman

國立雲林科技大學National Yunlin University of Science and Technology

Scaling Clustering Algorithm for Massive Data Sets using Data Streams

Page 2: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

2Intelligent Database Systems Lab

Outline Motivation Objective Introduction Literature review Implementing K-means Using Data Stram Space & Time complexity Parallelizing Partial/Merge K-means Experimental Evaluation Conclusions

N.Y.U.S.T.I.M.

Page 3: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

3Intelligent Database Systems Lab

Motivation Computing data mining algorithms such as clustering tec

hniques on massive data sets is still not feasible nor efficient today.

To cluster massive data sets or subsets, overall exection time and scalability are important issues.

N.Y.U.S.T.I.M.

Page 4: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

4Intelligent Database Systems Lab

Objective It achieves an overall high performance computation for

massive data.

N.Y.U.S.T.I.M.

Page 5: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

5Intelligent Database Systems Lab

Introduction To improved data distribution and analysis, we substitute data sets with compressed conterparts. We partition the data set into 1 degree * 1 degree grid cell to improve data distribution and analysis.

N.Y.U.S.T.I.M.

Page 6: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

6Intelligent Database Systems Lab

Literature review K-means algorithm 1. Initialization: Select a set of k initial cluster centroids ramdomly, i.e. , 1 ≤ j ≤ k. 2. Distance Calculation: For each data point , 1 ≤ i ≤ n, computes its Euclidean distance to each centroid. 3. Centroid Recalculation: For each 1 ≤ j ≤ k, computing that is the actual mean of the cluster is the new centroid. ( where )

4. Convergence Condition: Repeat (2) to (3) until convergence criteria is met. The convergence criteria is || MSE(n-1) – MSE(n) || ≤ 1* 10-9

N.Y.U.S.T.I.M.

m j

X i

C j

|C|/ C x jjxj

j

Page 7: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

7Intelligent Database Systems Lab

Literature review (cont.)

Computing K-means via a serial algorithm 1. scanning a grid cell at a time, 2. compressing it, and 3. scanning the next grid cell. All data points of one grid cell are kept in memory

N.Y.U.S.T.I.M.

C j

Page 8: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

8Intelligent Database Systems Lab

Literature review (cont.)

The quality of the clustering process is indicated by the error function E which is defined as

In this case , - memory complexity O(N) - time complexity is O(GRINK)

N.Y.U.S.T.I.M.

C || x - k || Ek x

2 K ,1 k

Page 9: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

9Intelligent Database Systems Lab

Literature review (cont.)

Parallel implementations of K-means - Method A is a naive way of parallelizing k-means is to assign the clustering of one grid cell each to a processor. - Method B is to assign each run of k-means on one grid cell using one set of initial, randomly chosen k seeds to a processor.

N.Y.U.S.T.I.M.

RiC j

Page 10: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

10Intelligent Database Systems Lab

Literature review (cont.)

- Method C is divide the grid cell into disjunct subsets (clusters) assigened to different slaves by choosing a set of initial centroids. - It reduces both the computational and the memory bottleneck.

N.Y.U.S.T.I.M.

Page 11: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

11Intelligent Database Systems Lab

Implementing K-means Using Data Stream It consists of the following steps: - Scan the temporal-spatial grid cells. 1) We assumed that the data had been scanned once, and sorted into one degree latitude and one degree longitude grid buckets used as data input. 2) All data points belonging to the grid cell have to be available.

N.Y.U.S.T.I.M.

Page 12: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

12Intelligent Database Systems Lab

Implementing K-means Using Data Stream (cont.)

- Partial k-means on a subset of data points. 1) Instead of storing all data points v1,…,vn of a grid cell Cs in memory divide the data of Cs into p partitions P1,…PP. 2) All data points v1,…,vm of partition Pj can be stored into available memory. 3) Selecting a set of random k seeds for a partition Pj until the convergence criteria is met, and repeating for several sets of random k-seeds. 4) The partial k-means produces a set of weighted centroids cij is included in Pj {(c1j ,w1j), (c2j ,w2j),..., (ckj ,wkj)}.

N.Y.U.S.T.I.M.

Page 13: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

13Intelligent Database Systems Lab

Implementing K-means Using Data Stream (cont.)

- Merge k-means the results of step 2. 1) It performs another k-means using the set of all centroids that were computed in the partial k-means for all partitions P1,…PP

2) Given a set S of M D-dimensional centroids {(c1 ,w1), (c2 ,w2),...(cm , wm)} where M is the sum of centroids of P1,…PP.

N.Y.U.S.T.I.M.

Page 14: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

14Intelligent Database Systems Lab

Implementing K-means Using Data Stream (cont.)

- Merge K-means algorithm: 1) Initialization: Select the subset of k initial cluster centroids zi (the weight wi of zi is one of the k largest weights in S.). 2) Distance Calculation: For each data point ci, 1 < i < m, compute its Euclidean distance to each centroid zj, 1 < i < k, and then find the closest cluster centroid. 3) Centroid Recalculation: For each 1 ≤ j ≤ k, computing the actual, weighted mean of the cluster Cj that is the new centroid. ( where ) 4) Convergence Condition: Repeat (2) to (3) until convergence criteria is met; e.g. is || MSE(n-1) – MSE(n) || ≤ 1* 10-9

N.Y.U.S.T.I.M.

|C|/w* C x jijxj

Page 15: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

15Intelligent Database Systems Lab

Implementing K-means Using Data Stream (cont.)

N.Y.U.S.T.I.M.

Page 16: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

16Intelligent Database Systems Lab

Space & Time complexity Partial k-means vs. Serial k-means

where N is the number of data points, K is the number of centroids, I is the number of iterations to converge, O ( N ’ p ) = O ( N ) in the space complexity ( p is the number of partitions), and O ( N ’ K I ’ p ) << O ( N K I ) in the time complexity.

N.Y.U.S.T.I.M.

Space TimeSerial K-means O ( N ) O ( NKI )Partial K-means O ( N ’ ) O ( N ’KI’ )

Page 17: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

17Intelligent Database Systems Lab

Space & Time complexity (cont.)

Merge k-means

where K is the number of weighted centroid from each partition, p is the number of partitions, and I is the number of iterations to converge.

N.Y.U.S.T.I.M.

Space Time

Merge K-means O ( K p ) O ( I 2 K p)

Page 18: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

18Intelligent Database Systems Lab

Parallelizing Partial/Merge K-means Several options for parallelization can be considered. - Option1 is to clone the partial k-means to as many machines as possible, and compute all k-means algorithms on the data partitions in parallel, and merge the results on one of the machines. - Option2 is to send a data partition to several machines at the same time, and perform partial k-means with a different set of initial seeds on each machine in parallel. - Option3 is to break up the partial k-means into several finer grained operators.

N.Y.U.S.T.I.M.

Page 19: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

19Intelligent Database Systems Lab

Experimental Evaluation The goal of the experimental evaluation is to - compare the scalability of the partial/merge k-means. ( 5 split/10 split case), - speed-up of the processing if the partial k-means operators are parallelized, and run on different machines. - the achieved quality of the clustering with a serial k-means that clusters all data points in the same iteration. - analyze the quality of the merge k-means operator with regard to the size, and number of data partitions.

N.Y.U.S.T.I.M.

Page 20: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

20Intelligent Database Systems Lab

Experimental Evaluation (cont.)

Experiment Environment - Conquest version that was implemented using JDK 1.3.1, - four Dell Optiplex GX260 PCs which is equipped with a 2.8 GHz Intel Pentium IV processor, 1 GB of RAM, a 80 GB hard disk, and Netgear GS508T GigaSwith.

N.Y.U.S.T.I.M.

Page 21: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

21Intelligent Database Systems Lab

Experimental Evaluation (cont.)

Data Sets - EOS MISR data set, - 1 deg * 1 deg grid cell with the following characteristics: 1) the number of data points per grid cell between 250, 2500, 5000, 20,000, 50,000, 75,000 points, 2) six attributes for each data point, 3) a fixed k for all configurations ( k = 40 )

N.Y.U.S.T.I.M.

Page 22: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

22Intelligent Database Systems Lab

Experimental Evaluation (cont.)

The computation time for the serial k-means is increasing exponentially with the number of data points per grid cell.

The overall execution time of the partial/merge k-means in most cases is significantly lower.

N.Y.U.S.T.I.M.

Overall execution time,

serial v.s partial / merge K-means

Page 23: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

23Intelligent Database Systems Lab

Experimental Evaluation (cont.)

Comparing 10-split vs. 5-split vs. serial

N.Y.U.S.T.I.M.

Page 24: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

24Intelligent Database Systems Lab

Experimental Evaluation (cont.)N.Y.U.S.T.

I.M.

Minimum mean square error,

serial vs. 5-split vs.10-splitPartial K-means processing time,

5-split vs.10-split

Page 25: Advisor  : Dr. Hsu Graduate : Jian-Lin Kuo Author    : Silvia Nittel         Kelvin T.Leung

25Intelligent Database Systems Lab

Conclusions The partial/merge stream-based k-means - is simpler to find an appropriate cluster representation. - provides a highly scalable, parallel approach, efficiency, and a significantly higher clustering quality.

N.Y.U.S.T.I.M.