Page 1: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Intelligent Database Systems Lab

Advisor: Dr. Hsu

Graduate: Chien-Shing Chen

Authors: Byoung-Kee Yi, N. D. Sidiropoulos, Theodore Johnson

國立雲林科技大學 National Yunlin University of Science and Technology

Online Data Mining for Co-Evolving Time Sequences

Proceedings of the 16th International Conference on Data Engineering, 29 Feb.-3 March 2000

Page 2: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Outline

Motivation

Objective

Introduction

Related Work

MUSCLES

Experimental Results

Conclusions

Personal Opinion

Review

N.Y.U.S.T.

I.M.

Page 3: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Motivation

The data of interest comprise multiple sequences that each evolve over time.

These sequences are not independent; in fact, they frequently exhibit high correlations.

We want to study the entire set of sequences as a whole, where the number of sequences in the set can be very large.

Page 4: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Objective

We develop a fast method to analyze such co-evolving time sequences jointly, to allow:

Estimation/forecasting of missing/delayed/future values

Discovering correlations (quantitative data mining)

Outlier detection

Page 5: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Introduction

Much useful information is lost if each sequence is analyzed individually.

Our method predicts the latest (“current”) value of one sequence, given all the past information about that sequence and all the past and current information about the other sequences.

Missing values, correlation detection, Selective MUSCLES.

Page 6: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 7: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Related Work

The traditional, highly successful methodology for forecasting is the so-called Box-Jenkins methodology, or ARIMA (Auto-Regressive Integrated Moving Average). ARIMA models a linear dependency of the future value on the past values.

DeCoste: linear regression and neural networks (NN) for multivariate time sequences; limited to outlier detection, and does not scale well to large sets of dynamically growing time sequences.

Goldin and Kanellakis: DFT coefficients, for sub-pattern matching.

Page 8: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

MUSCLES

MUSCLES (MUlti-SequenCe LEast Squares): our proposed solution is to set up the problem as a multivariate linear regression.

Page 9: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 10: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

MUSCLES

We try to estimate the current value of one sequence as a linear combination of the values of the same and the other time sequences within a window of size w.
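As a minimal sketch of how the regression variables could be laid out: to estimate the current value of one sequence out of k, with window size w, the independent variables are the w past values of that sequence plus the current and w past values of every other sequence, which gives k(w+1) - 1 variables. The function name `window_variables` and the list-of-lists layout are assumptions for illustration, not taken from the slides.

```python
def window_variables(seqs, target, t, w):
    """Independent variables for estimating seqs[target][t].

    seqs   : list of k equal-length sequences (lists of floats)
    target : index of the sequence whose value at time t is estimated
    t      : time index of the value to estimate (requires t >= w)
    w      : window size
    """
    if t < w:
        raise ValueError("need at least w past samples")
    row = []
    for i, s in enumerate(seqs):
        # The target sequence contributes only past values (lags 1..w);
        # every other sequence also contributes its current value (lag 0).
        first_lag = 1 if i == target else 0
        for lag in range(first_lag, w + 1):
            row.append(s[t - lag])
    return row


# Three toy sequences (k = 3), window w = 2.
seqs = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [0.5, 0.4, 0.3, 0.2, 0.1],
]
row = window_variables(seqs, target=0, t=4, w=2)
print(len(row))  # k * (w + 1) - 1 = 8 variables
print(row)
```

Stacking one such row per time tick gives the matrix of independent variables for the regression.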

Page 11: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

MUSCLES-delayed sequence

For a sample s[t] from a time sequence s = (s[1], ..., s[N]), the delay operator D_d delays it by d time steps, that is, D_d(s[t]) = s[t-d].

The estimation problem is then set up as a multi-variate regression.

Page 12: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

ARIMA models for time series

Y_t = Y_{t-1} + ε_t

Sample series (t: Y_t):

1: 15.00, 2: 16.00, 3: 20.00, 4: 22.00, 5: 25.00, 6: 23.00, 7: 21.00, 8: 20.00, 9: 18.00, 10: 16.00, 11: 15.00, 12: 13.00, 13: 16.00, 14: 17.00
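To make the autoregressive idea behind ARIMA concrete, here is a minimal sketch that fits a first-order model Y_t ≈ c + phi * Y_{t-1} to the 14 sample values above by ordinary least squares. This covers only the autoregressive part of ARIMA, chosen for illustration; the names `fit_ar1`, `c` and `phi` are assumptions, not from the slides.

```python
def fit_ar1(y):
    """Least-squares fit of y[t] ~ c + phi * y[t-1]; returns (c, phi)."""
    x = y[:-1]          # predictor: previous value
    z = y[1:]           # response: current value
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    phi = sxz / sxx
    c = mean_z - phi * mean_x
    return c, phi


series = [15.0, 16.0, 20.0, 22.0, 25.0, 23.0, 21.0,
          20.0, 18.0, 16.0, 15.0, 13.0, 16.0, 17.0]
c, phi = fit_ar1(series)
forecast = c + phi * series[-1]   # one-step-ahead forecast for t = 15
print(f"c = {c:.3f}, phi = {phi:.3f}, forecast = {forecast:.2f}")
```

A model of this kind uses only the sequence's own past, which is the limitation the multi-sequence approach is meant to address.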

Page 13: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Each column of the matrix X consists of the sample values of the corresponding independent variable, and each row holds the observations made at one time t. y is the vector of desired values.

Page 14: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

With this setup, the optimal regression coefficients are given by a = (X^T X)^{-1} (X^T y).
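A minimal NumPy sketch of the batch least-squares solution in its standard normal-equations form, a = (X^T X)^{-1} (X^T y). The synthetic data and the names `X`, `y`, `a_true` and `a_hat` are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N, v = 200, 3                        # N samples, v independent variables
X = rng.normal(size=(N, v))          # one row per time tick, one column per variable
a_true = np.array([2.0, -1.0, 0.5])  # coefficients used to generate the toy data
y = X @ a_true + 0.01 * rng.normal(size=N)

# Normal equations: solve (X^T X) a = X^T y instead of forming the inverse.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same problem in a numerically safer way.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(a_hat)
```

Both calls need the whole matrix X in memory, so their cost grows with the number of samples N.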

Page 15: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Although the batch least-squares solution gives the best regression coefficients, it is very inefficient in terms of both storage requirements and computation time.

It needs O(N * v) storage for the matrix X.

The number of samples N is not fixed and can grow indefinitely.

With limited main memory, the computation may require a quadratic number of disk I/O operations.

Page 16: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

The matrix (X^T X)^{-1} can be incrementally computed from its previous value as each new sample arrives. This method is called Recursive Least Squares (RLS), and its computation cost is reduced to O(v^2) per new sample.
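A minimal sketch of a recursive least-squares update, assuming the standard matrix-inversion-lemma form. The class name `RLS`, the starting value G = I / delta and the synthetic data are assumptions for illustration, not details from the slides.

```python
import numpy as np


class RLS:
    """Recursive least squares for y ~ x . a with v variables."""

    def __init__(self, v, delta=1e-3):
        self.a = np.zeros(v)            # current coefficient estimate
        self.G = np.eye(v) / delta      # stands in for (X^T X)^{-1}; large at start

    def update(self, x, y):
        """Fold in one sample (x, y) in O(v^2) time, without storing X."""
        Gx = self.G @ x
        # Matrix inversion lemma: rank-one update of the inverse (G is symmetric).
        self.G -= np.outer(Gx, Gx) / (1.0 + x @ Gx)
        # Correct the coefficients by the prediction error on the new sample.
        self.a -= (self.G @ x) * (x @ self.a - y)
        return self.a


rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.05 * rng.normal(size=500)

rls = RLS(v=4)
for x_t, y_t in zip(X, y):
    rls.update(x_t, y_t)

a_batch, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(rls.a - a_batch)))  # small: both solve nearly the same problem
```

Only the v-by-v matrix G and the coefficient vector are kept between updates, so storage does not depend on the number of samples seen.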

Page 17: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Forgetting factor

When the correlations among sequences change over time, formulae derived from old observed values will no longer be correct.

It turns out that MUSCLES can be slightly modified to “forget” older samples gracefully. We call the method Exponentially Forgetting MUSCLES.

Let λ (0 < λ ≤ 1) be the forgetting factor, which determines how fast the effect of older samples fades away.
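A minimal sketch of exponential forgetting in the simplest setting, a single coefficient (y ≈ a * x), where a sample that is n steps old is weighted by lam ** n. The function name `forgetting_rls_1d`, the starting value `g0` and the toy data are assumptions for illustration; the slides do not give the update equations.

```python
def forgetting_rls_1d(xs, ys, lam, g0=1e3):
    """Scalar RLS for y ~ a * x with forgetting factor lam (0 < lam <= 1)."""
    a, g = 0.0, g0                      # g is the inverse of the weighted sum of x^2
    for x, y in zip(xs, ys):
        g = g / (lam + x * x * g)       # old information is discounted by lam
        a = a + g * x * (y - a * x)     # correct by the prediction error
    return a


# The true coefficient switches from 2.0 to -1.0 halfway through.
xs = [1.0 + (t % 5) for t in range(400)]
ys = [(2.0 if t < 200 else -1.0) * x for t, x in enumerate(xs)]

a_no_forget = forgetting_rls_1d(xs, ys, lam=1.0)   # all samples weigh the same
a_forget = forgetting_rls_1d(xs, ys, lam=0.9)      # recent samples dominate
print(a_no_forget, a_forget)
```

With lam = 1.0 the estimate ends up between the old and new coefficients, while with lam = 0.9 it follows the new one; smaller values forget faster but make the estimate noisier on real data.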

Page 18: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

MUSCLES-missing value

Let one value, s_i[t], be missing. Make the best guess for s_i[t], given all the information available.

Then, at time t, one is immediately able to reconstruct the missing or delayed value, irrespective of which sequence i it belongs to.

Page 19: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Correlation detection: A high absolute value for a regression coefficient means that the corresponding variable is highly correlated with the dependent variable (the current status of a sequence), and is therefore valuable for estimating a missing value.
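Regression coefficients are only comparable across variables when the variables are on the same scale. The sketch below assumes each variable is first standardized to zero mean and unit variance (an assumption made here; the slide does not state the normalization) and then ranks the variables by the absolute value of their coefficients. All names and the synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x1 = rng.normal(scale=100.0, size=N)   # large scale, strongly related to y
x2 = rng.normal(scale=0.01, size=N)    # tiny scale, weakly related to y
x3 = rng.normal(scale=1.0, size=N)     # unrelated to y
y = 0.05 * x1 + 20.0 * x2 + rng.normal(scale=0.5, size=N)


def standardize(a):
    """Shift to zero mean and rescale to unit variance."""
    return (a - a.mean()) / a.std()


X = np.column_stack([x1, x2, x3])
raw, *_ = np.linalg.lstsq(X, y, rcond=None)

Z = np.column_stack([standardize(col) for col in X.T])
std_coef, *_ = np.linalg.lstsq(Z, standardize(y), rcond=None)

order = np.argsort(-np.abs(std_coef))  # most informative variable first
print("raw coefficients:         ", raw)
print("standardized coefficients:", std_coef)
print("ranking by |coefficient|: ", order)
```

On the raw data the largest coefficient belongs to x2 only because of its tiny scale; after standardization x1, which carries most of the signal, ranks first.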

Page 20: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

On-line outlier detection: We assume that the estimation error follows a Gaussian distribution with standard deviation σ. Then 95% of the probability mass lies within 2σ of the mean, so a sample that is more than 2σ away from its estimate can be labeled an outlier.

Corrupted data and back-casting: If a value is corrupted or suspected, we can treat it as “delayed” and forecast it.
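A minimal sketch of on-line outlier flagging under the Gaussian-error assumption. Here sigma is tracked as the running root-mean-square of past estimation errors; the class name `OutlierDetector`, the `warmup` period and the rule of not updating sigma with flagged samples are assumptions for illustration, not details from the slides.

```python
import math


class OutlierDetector:
    """Flags a sample when its estimation error exceeds threshold * sigma."""

    def __init__(self, threshold=2.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup      # samples to observe before flagging anything
        self.n = 0
        self.sum_sq = 0.0

    def check(self, actual, estimate):
        err = actual - estimate
        is_outlier = False
        if self.n >= self.warmup:
            sigma = math.sqrt(self.sum_sq / self.n)
            is_outlier = abs(err) > self.threshold * sigma
        if not is_outlier:
            # Only normal samples update sigma, so outliers do not inflate it.
            self.n += 1
            self.sum_sq += err * err
        return is_outlier


# Estimates are a constant 10.0; actual values wiggle by 0.1 with one spike.
actuals = [10.0 + (0.1 if t % 2 else -0.1) for t in range(30)]
actuals[20] = 12.0
detector = OutlierDetector()
flags = [detector.check(a, 10.0) for a in actuals]
print([t for t, f in enumerate(flags) if f])  # prints [20]
```

In practice the `estimate` argument would come from the regression model's prediction for the same time tick.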

Page 21: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Experimental Set-up

Currency: exchange rates of k=6 currencies (HKD, JPY, USD, DEM, FRF, GBP) with respect to CAD; N=2561 daily observations for each currency.

Modem: k=14, N=1500 time-ticks.

Internet: k=4 data streams (e.g., connect time, traffic, and errors in packets); N=980 observations for each data stream.

Page 22: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 23: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 24: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Correlation detection - visualization

The correlation coefficient picks the single best predictor for a given sequence, where high absolute values show strong correlations.

We can turn it into a dissimilarity function and apply FastMap to obtain a low-dimensional scatter plot of our sequences.
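A minimal sketch of turning correlation coefficients into a dissimilarity matrix. The slide does not say which function is used; 1 - |rho| below is one plausible choice, assumed for illustration, and FastMap itself is not implemented here. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dissimilarity_matrix(seqs):
    """d[i][j] = 1 - |rho(i, j)|: 0 for perfectly (anti-)correlated pairs."""
    k = len(seqs)
    return [[1.0 - abs(pearson(seqs[i], seqs[j])) for j in range(k)]
            for i in range(k)]


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 8.0, 6.0, 4.0, 2.0]      # perfectly anti-correlated with a
c = [2.0, 1.0, 2.0, 1.0, 2.0]       # uncorrelated with a
d = dissimilarity_matrix([a, b, c])
for row in d:
    print([round(value, 3) for value in row])
```

A matrix of this kind is the input a distance-based embedding method needs in order to place strongly correlated sequences close together in the plot.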

Page 25: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 26: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Scaling-up: Selective MUSCLES

We propose to do some preprocessing of a training set, to find a promising subset of sequences, and to apply MUSCLES only to those promising ones.

Given v independent variables x1, x2, x3, x4, ..., xv and a dependent variable y with N samples each, find the best b (< v) independent variables to minimize the mean-square error.

Page 27: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Scaling-up: Selective MUSCLES

Given v independent variables x1, x2, x3, ..., xv and a dependent variable y with N samples each, find the best b (< v) independent variables to minimize the mean-square error (the expected estimation error, EEE) of the estimate of y for the given samples.

Page 28: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Scaling-up: Selective MUSCLES

Given a dependent variable y and v independent variables with unit variance, the best single variable to keep to minimize EEE(S) is the one with the highest absolute correlation coefficient with y.

We propose to use a greedy algorithm. At each step, we select the independent variable that, together with the variables already selected, minimizes the EEE for the dependent variable y.
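The greedy selection can be sketched as follows. This naive version refits a least-squares model for every candidate variable, so it shows the selection logic rather than an efficient implementation; the function names `eee` and `greedy_select` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def eee(X, y, cols):
    """Mean-square error when y is regressed on the chosen columns of X."""
    a, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    residual = y - X[:, cols] @ a
    return float(residual @ residual) / len(y)


def greedy_select(X, y, b):
    """Pick b columns of X, one per step, each giving the lowest error so far."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(b):
        # Try every unused variable together with those already selected.
        best = min(remaining, key=lambda j: eee(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))                      # v = 8 candidate variables
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=500)
chosen = greedy_select(X, y, b=2)
print(chosen, eee(X, y, chosen))
```

In this toy data only columns 2 and 5 carry signal, so a subset of size b = 2 already reaches an error close to the noise level.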

Page 29: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 30: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Scaling-up: Selective MUSCLES

The bottleneck of the algorithm is clearly the computation of EEE. The greedy search computes EEE approximately v * b times (up to v candidates at each of the b steps), so the overall complexity is roughly v * b times the average cost of one EEE computation.

Page 31: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Experiments-speed

We measure how much faster the Selective MUSCLES method is than MUSCLES, and at what cost in accuracy.

Figure 4 shows the speed-accuracy trade-off of Selective MUSCLES.

b = 3~5 best-picked variables suffice for accurate estimation.

Selective MUSCLES is very effective.

Page 32: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 33: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Page 34: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Conclusions

Builds analytical models for co-evolving time sequences.

Discovers correlations.

Forecasts missing/delayed values.

Adapts to changing correlations among time sequences.

Needs less storage and fewer I/O operations.

A more robust method called Least Median of Squares is promising.

Page 35: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Personal Opinion

This method can be used in our lab.

Page 36: Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Byoung-Kee Yi

Review

MUSCLES: missing value, correlation detection, outlier detection, corrupted data and back-casting

Experiments: accuracy, correlation detection (visualization)

Scaling-up (Selective MUSCLES): speed-accuracy trade-off