IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12,

NO. 5,

SEPTEMBER/OCTOBER 2000

675

Querying Time Series Data Based on Similarity Davood Rafiei and Alberto O. Mendelzon AbstractÐWe study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can efficiently be computed without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive to that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for simultaneously handling multiple transformations at a time, and give experimental results on the performance of the generalized algorithm. Index TermsÐSimilarity queries, time series retrieval, indexing time series, Fourier transform.

æ 1

T

INTRODUCTION

IME-SERIES data are of growing importance in many new database applications, such as data mining or data warehousing. A time series is a sequence of real numbers, each number representing a value at a time point. For example, the sequence could represent stock or commodity prices, sales, exchange rates, weather data, biomedical measurements, etc. We are often interested in similarity queries on time-series data [3], [2]. For example, we may want to find stocks that behave in approximately the same way (or approximately the opposite way, for hedging); or products that had similar selling patterns during the last year; or years when the temperature patterns in two regions of the world were similar. In queries of this type, approximate, rather than exact, matching is required. A simple approach to determine a possible similarity between two time series is to compute the Euclidean distance (or any other distance, such as the city-block distance) between the two series, and call the two series similar if their distance is less than some user-defined threshold. However, the notion of similarity is often more complex and cannot be expressed using a distance function. The meaning of a similarity comparison may vary from one data domain to another data domain and different users may have different perceptions of a similarity comparison; for example, one may consider two stocks similar if they have almost the same price fluctuations, even though one stock

. D. Rafiei is with the Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2H1. E-mail: [email protected] . A.O. Mendelzon is with the Department of Computer Science, University of Toronto, Toronto, ON, Canada, M5S 3H5. E-mail: [email protected] Manuscript received 21 Sept. 1999; revised 13 July 2000; accepted 18 July 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112605.

might sell for twice as much as the other. Consider the following motivating examples. Example 1.1. Consider time series s~1 36; 38; 40; 38; 42; 38; 36; 36; 37; 38; 39; 38; 40; 38; 37 and s~2 40; 37; 37; 42; 41; 35; 40; 35; 34; 42; 38; 35; 45; 36; 34; for example, corresponding to the closing prices of two stocks. Looking at Fig. 1a and 1b, the sequences do not appear very similar. This is justified by the high Euclidean distance D ~ s1 ; s~2 11:92 between them. However, if we look at the three-day moving averages of the two sequences (Fig. 1c and 1d), they do look quite similar. The Euclidean distance between the three-day moving averages of the two sequences is 0:47. Moving averages are widely used in stock data analysis (for example, see [6]). Their primary use is to smooth out short term fluctuations and depict the underlying trend of a stock. The computation is simple; the l-day moving average of a sequence ~ s v1 ; . . . ; vn is computed as follows: The mean is computed for an l-day-wide window placed over the end of the sequence; this will give the moving average for day n ÿ bl=2c; the subsequent values are obtained by stepping the window through the beginning of the sequence, one day at a time. This will produce a moving average of length n ÿ l 1. We use a slightly different version of a moving average which is easier to compute in our framework. We circulate the window to the end of the sequence when it reaches the beginning. This gives us a moving average of length n. It turns out that when the length of the window is small enough compared to the length of the sequence, which is usually the case in practice, both averages are almost the same.

1041-4347/00/$10.00 ß 2000 IEEE

676

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 1. (a) Time series s~1 36; 38; 40; 38; 42; 38; 36; 36; 37; 38; 39; 38; 40; 38; 37, (b) times series s~2 40; 37; 37; 42; 41; 35; 40; 35; 34; 42; 38; 35; 45; 36; 34, (c) the three-day moving average of s~1 , and (d) the three-day moving average of s~2 .

Example 1.2. Consider two time series ~ s 20; 20; 21; 21; 20; 20; 23; 23 and ~ p 20; 21; 20; 23 that are sampled with different frequencies (Fig. 2). For example, ~ s could be the closing price of a stock taken every day, and ~ p could be the closing price of another stock taken every other day. A typical query is ªis ~ p similar to ~ s ?º The sequence ~ s is twice as long as ~ p, so they cannot be compared directly. The Euclidean distance between ~ p and any subsequence of length four of ~ s is more than 1:41. If the time dimension of ~ p is scaled by 2, i.e., every value ªvj º is replaced by ªvj ; vj ,º the resulting sequence will be identical to ~ s. This operation is a special kind of time warping (for example, see [24]), often called time scaling, which can be expressed within our framework. We propose a class of transformations that can be used in a query language to express similarity in a fairly general way, handling a wide variety of comparisons including cases like the two examples above. We show that the query evaluation for similarity comparisons can be efficiently implemented on top of a point-based access method, such as R-tree [11], without a prior knowledge of transformations. For example, we demonstrate that an index structure for a moving average can be constructed on the fly from an existing index (built on the original sequences) and the

query evaluation engine can benefit from the new index in the same way as it does from the original index with no extra disk overhead. To the best of our knowledge, this is the first indexing method that can handle moving average and time scaling in the context of similarity queries. To apply this framework to time series data, we need to make some choices on how we compare two sequences and also on how we do indexing. We have chosen the Euclidean distance to compare two sequences mainly because: 1) it easily corresponds to the cross correlation (13); 2) it is frequently used [1], [8]; 3) it remains the same under orthonormal transforms.1 There are several multidimensional indexes (such as R-tree family [11], [4], [25], the k-dB-tree [13], or the grid file [16]) that can be used to do indexing on sequences. All these indexes reduce to sequential scanning in higher dimensions due to a phenomenon known as the curse of dimensionality. A solution to avoid this problem is to choose a few important features, those that can approximately differentiate one sequence from another, and only keep those features in the index. The approximation introduces a few false hits when sequences are being compared. However, the false hits can be easily removed in a postprocessing step. For feature extraction, we use the Discrete Fourier Transform (DFT) to map sequences into the frequency domain and only keep the first few DFT coefficients for each sequence. The reason for choosing DFT is mainly because: 1.

2. 3.

Fig. 2. (a) Time series ~ s 20; 20; 21; 21; 20; 20; 23; 23 and (b) ~ p 20; 21; 20; 23.

For a large class of time series (often referred as color noise data), it concentrates the energy in a few lower frequency coefficients so those coefficients can make a key for indexing purposes; It is an orthonormal transform, so the Euclidean distance is unchanged under the DFT; The class of transformations proposed in this paper (see Section 2) can express an interesting class of operations if applied in the frequency domain.

1. A transformation, denoted with matrix M, is orthonormal if M T :M I, where M T denotes the transpose of matrix M and I represents the identity matrix.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

In general, one can use any orthonormal transform, such as the Discrete Cosine Transform (DCT), the Harr transform, the wavelet transform, etc. For a broad class of feature selection methods see, for example, the online bibliography maintained at the National Research Council Canada site [15]. The organization of the rest of the paper is as follows: In the rest of the current section, we provide some background material on the discrete Fourier transform and also review the related work. Our definition of similarity queries is discussed in Section 2. In Section 3, we use R-tree indexes and develop algorithms for efficiently evaluating queries expressed within our framework. In Section 4, we briefly describe how our algorithms can be implemented using the grid file and the k-d-B-tree. Section 5 presents experimental performance results. We conclude in Section 6.

1.1 The Discrete Fourier Transform In this section, we briefly review the Discrete Fourier Transform and its properties. Let a time sequence be a finite duration signal ~ x xt for t 0; 1; ; n ÿ 1. The DFT of ~ x, ~ is given by denoted by X, nÿ1 ÿi2tf 1 X xt e n f 0; 1; ; n ÿ 1; 1 Xf p n t0 p where i ÿ1 is the imaginary unit. Throughout this paper, unless it is stated otherwise, we use small letters for sequences in the time domain and capital letters for sequences in the frequency domain. The inverse Fourier ~ gives the original signal, i.e., transform of X nÿ1 i2tf 1 X Xf e n xt p n f0

t 0; 1; ; n ÿ 1:

2

Some books define the DFT with no constant in front and the inverse DFT with constant 1=n in front or vice versa. p Following some earlier conventions [1], [8], we have 1= n in front of both (1) and (2) for it simplifies the upcoming Parseval relation without changing any properties of the DFT. The energy of signal ~ x is given by the expression E ~ x

nÿ1 X

j xt j2 :

3

t0

The convolution of two signals ~ x and ~ y is given by Conv ~ x; ~ yj

nÿ1 X

xk yjÿk

j 0; 1; ; n ÿ 1;

4

k0

where j ÿ k is computed modulo n. This convolution is usually called circular convolution. Equations (3) and (4) are unchanged in the frequency domain. The following properties of DFT can be found in any signal processing textbook (for example, see [17]). The symbol , denotes a DFT pair. ~ and ~ Linearity: If ~ x,X y , Y~, then ~ bY~ a~ x b~ y , aX

5

677

for arbitrary constants a and b. ~ and ~ Convolution-Multiplication: If ~ x,X y , Y~, then ~ Y~; conv ~ x; ~ y , X

6

~ Y~ is the element-to-element multiplication of two where X ~ and Y~. vectors X ~ for a real-valued sequence ~ Symmetry: If ~ x,X x of length n, then j Xnÿf jj Xf j

for f 1; . . . ; n ÿ 1

7

and ~ then Parseval's Relation: If ~ x , X, ~ E ~ x E X:

8

Using Parseval's relation and the linearity, it is easy to show that the Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain. ~ ÿ Y~ D2 X; ~ Y~: x; ~ y E ~ x ÿ~ y E X D2 ~

9

1.2 Related Work An indexing technique for the fast retrieval of similar time sequences is proposed by Agrawal et al. [1]. The idea is to use Discrete Fourier Transform (DFT) to map time sequences (stored in a database) into the frequency domain. Keeping only the first k Fourier coefficients, each sequence becomes a point in a k-dimensional feature space. To allow a fast retrieval, the authors keep the first k Fourier coefficients of a sequence in an R-tree index. An extension of this technique for subsequence matching is proposed by Faloutsos et al. [8]. None of the aforementioned work allows the expression of a query that uses some transformations in its expression of similarity. Goldin et al. [9] show that the similarity retrieval will be roughly invariant to simple translations and scales if sequences are normalized before being stored in the index. The authors store in the index both the translation and the scale factors, in addition to normalized sequences, and also allow those factors to be queried using range predicates. In our earlier work [18], [21], we have shown that the last few Fourier coefficients of a sequence (those corresponding to lower negative frequencies) are as important as the first few coefficients due to the symmetry property of DFT for real-valued sequences. We have also shown that this observation can be used to reduce the size of the search rectangle in the indexing method of Agrawal et al. [1] and its extensions [8], [9] without really storing the last few Fourier coefficients in the index. This leads to a search time improvement of more than a factor of 2. In this paper, we use this improved version of the index for efficiently evaluating queries that use transformations in their expressions of similarities. Jagadish et al. [12] developed a domain-independent framework to pose similarity queries on a database. The framework has three components: a pattern language P, a transformation rule language T, and a query language L. An expression in P specifies a set of data objects. An object A is considered similar to an object B, if B can be reduced to it by a sequence of transformations defined in T. The query language proposed in the paper is an extension of relational

678

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

calculus with predicates that test whether an object A can be transformed into a member of the set of objects described by expression e using the transformation t, at a cost bounded by c. As a specialization of this work to real-valued sequences [7], the same authors describe how the search can be performed over sequence signatures instead of the original sequences. Our work here can also be seen as a specialized variation of this general framework where transformations are restricted to affine transformations. There is other work on time series data. Yi et al. [29] use time warping as a distance function and present algorithms for retrieving similar time sequences under this function. Chu et al. [5] show how similar sequences can be efficiently searched irrespective of differences in simple translations and scales. A hierarchical sequence matching algorithm based on the correlation between sequences is proposed by Li et al. [14]. Agrawal et al. [3] describes a pattern language, called SDL, to encode queries about ªshapesº found in time sequences. The language allows a kind of blurry matching where the user specifies the overall shape instead of the specific details, but it does not support any operations or transformations on sequences. A method for approximately representing sequences in terms of some functions and processing queries over such a representation is described by Shatkay and Zdonik [28]. A query language for time series data in the stock market domain was developed by Roth [22]. The language is built on top of CORAL [23] and every query is translated into a sequence of CORAL rules. Seshadri et al. [27] develop a data model and a query language for sequences in general, but do not mention similarity matching as a query language operator.

2

SIMILARITY QUERIES

Time series often have differences that need to be removed or reduced before comparing them to each other. One way to remove those differences is to apply some transformations to them. We are interested in a set of transformations which fulfills the following two requirements: 1) it expresses a large class of useful transformations on time series and 2) queries expressed using this set of transformations can be efficiently processed. The set of affine transformations seems to be a good candidate since its members can express any combination of four important transformations in space: translation, scaling, rotation, and shear. Furthermore, any affine transformation preserves two basic features namely collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). To fulfill the second requirement, we limit our transformations to translation and scaling (along all dimensions) only. Specifically, we define a transformation in an n-dimensional space, denoted by ~ a; ~ b, as a pair of vectors where ~ a specifies a stretch and ~ b represents a translation. We will show that this class can express a large class of useful transformations on time series; at the same time, queries expressed using this class can be efficiently processed.

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Definition 1. The transformation ~ a; ~ b applied to a point ~ x in an n-dimensional space maps ~ x to ~ a~ x~ b. Transformations may be associated with costs. Given a set of transformations T , and the cost of applying each transformation, a measure of distance (dissimilarity) between two time series can be defined as follows: 8 D0 ~ x; ~ y > > > > x; ~ y < mint2T cost t D t ~ D ~ x; ~ y min mint2T cost t D ~ 10 x; t ~ y > > cost t cost t min > t ;t 2T 1 2 1 2 > : x; t2 ~ y; D t1 ~ where D0 ~ x; ~ y is a base distance (such as the Euclidean distance) between ~ x and ~ y. Next, we propose an extension of multidimensional similarity queries where similarity is defined on the basis of both a distance function and a set of transformations.

2.1 Queries with Single Transformations Transformations can be seen as a way to remove certain variations before aligning two sequences. Although many kinds of variations may be present in each sequence, we consider only those that can be removed using the limited form of affine transformations on the Fourier series representation of the sequence (Fig. 3). We first consider queries that use a single transformation to remove such variations. To gain some insight into this class of queries, we give some examples from real stock data. The data was obtained from the FTP site: ªftp.ai.mit.edu/pub/stocks/ results/.º Example. 2.1. Fig. 4 shows the daily closing price for The Bombay Co. (BBA) starting from October 25, 1994 for 128 days, and that for Zweig Total Return Fund Inc. (ZTR) starting from July 20, 1995 for 128 days. The Euclidean distance between the two series is 16.16. The mean for BBA is 9.51, and the mean for ZTR is 8.64. If we do not care about this variation in our comparison, we can vertically shift the mean of both series to zero, i.e., subtract the mean of each series from the everyday closing price. Now the Euclidean distance between the two sequences reduces to 12.78. The closing price of ZTR fluctuates in a smaller range than that of BBA; the standard deviation of ZTR is 0.10 while the standard deviation of BBA is 1.18. Again, if we do not care about the scale in our comparison, we can scale both series, for example, by the inverse of their standard deviation. The resulting series are often called the normal forms of the original series [9]. Fig. 4 shows that the Euclidean distance between the normal forms of the two series is still 11.10; ZTR is more volatile than BBA. To remove this variation, we need to smooth out short term fluctuations, for example, by computing the 20-day moving average of the two series. The Euclidean distance between the two series after applying the 20-day moving average drops to 2.75. Example 2.2. Fig. 5 shows in normal form the daily closing prices of stocks of Pacific Gas and Electric Co. (PCG) and Plum Creek Timber Co. (PCL) both starting from November 2, 1994 for 128 days. The Euclidean distance

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

679

Fig. 3. Applying transformations to time series.

Fig. 4. From left to right, top to bottom: The daily closing price of The Bombay Co. (BBA) starting from ªOctober 25, 1994º for 128 days, the daily closing price of Zweig Total Return Fund Inc. (ZTR) starting from ªJuly 20, 1995º for 128 days, the two stocks put together, both vertically translated, both vertically scaled, and both smoothed using the 20-day moving average (D denotes the Euclidean distance).

between the two normalized time series is 11.34. One way to compare the change rates of two stocks is to compare their ªmomenta,º which are obtained for every stock by subtracting the price at time t from the price at time t+1 (or, in general, t+n for some n). The Euclidean distance between the two momenta is 13.01. The series representing the price of PCG has a spike on February 7 while the series of PCL has a spike on February 8. If we horizontally shift the series of PCL one day to the left, then the spikes will overlap. The Euclidean distance between the two normalized time series after the horizontal shift drops to 8.91. The horizontal shifting reduces the distance between the momenta of the two series to 5.62. The momentum of a time series describes the rate at which its value (such as the price in the preceding example) is rising or falling and it is seen as a measure of strength behind upward or downward movements. On the other hand, shifting a sequence horizontally before comparing it to another sequence removes any possible delay between

the two sequences which can arise, for example, in the stock market domain because of the different reactions of two stocks to the same piece of news or recording errors. As in the examples, we often specify a sequence of transformations to be applied to a time series. Given transformations t1 a~1 ; b~1 and t2 a~2 ; b~2 , for example, respectively, corresponding to ªone-day shiftº and ª10-day moving averageº, suppose we want to apply t1 followed by ~ We can t2 , which we denote by t2 t1 , to sequence X. construct the new transformation as follows: ~ a~2 a~1 X ~ b~1 b~2 t2 t1 X ~ a~2 b~1 b~2 : a~2 a~1 X

11

Transformation t2 t1 equivalently can be expressed as t3 a~3 ; b~3 , where a~3 a~2 a~1 and b~3 a~2 b~1 b~2 . We now show how we can express some of these transformations in our framework. Suppose we want to express the three-day moving average in our transformation ~ and language. Let us denote the Fourier transform of ~ s by S 1 1 1 ~3 3 ; 3 ; 3 ; 0; 0; 0; 0; 0; 0; the Fourier transform of m

680

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 5. The daily closing price of Pacific Gas and Electric Co. (PCG) and that of Plum Creek Timber Co. (PCL), both starting from February 11, 1994 for 128 days, represented in normal forms and their momenta.

~3 . Now, consider the transformation 0; 0; 0; 0; 0; 0 by M ~3 ; ~ 0, where ~ 0 is a zero vector of the same size tmavg3 M ~ i.e., ~3 . If we apply the transformation tmavg3 to S, as M ~ S ~ M ~ M ~3 ~ ~3 ; 0S tmavg3 S

1.

we get the three-day moving average of ~ s in the frequency domain. If we transform the right hand side back to the time domain using the convolution-multiplication relation (6), ~ which is the three-day moving average we get conv ~ s; m3 of ~ s in the time domain. In general, the m-day moving average of a series of a; ~ 0, where length n can be expressed by tmavg ~ ~ a w1 ; w2 ; ; wm ; 0; 0; ; 0 |{z} m |{z}

standard deviation are stored in a relation. This is mainly because of efficiency, as is noted by Goldin and Kanellakis [9], and the following two attractive properties of the normal form which are not mentioned by these authors.

12

n

and ~ 0 is a zero vector of size n. Transformation tmavg may be applied several times to get successive moving averages. The weights w1 ; ; wm are not necessarily equal. For trend prediction purposes, for example, the weights at the end are usually chosen to be higher than those at the beginning. Whereas for normal smoothing purposes, weights are equal, or those at the center are larger than those at the endpoints. Both momentum and horizontal shifting can also be formulated as affine transformations over the Fourier representation of a sequence. See Appendices B and C for details. Another transformation of special interest is normalization, which can be applied to time series before storing them in an index. Given a time series of mean and standard deviation , the normal form of the series is obtained by first vertically shifting the series by and then scaling it by 1=. The equivalent transformation can be ! expressed in the frequency domain as 1=; ÿ=; 0; 0; . . .. Although it is not required by the algorithms given in this paper, we assume time sequences are normalized and for every sequence, its normal form along with its mean and

2.

It minimizes the Euclidean distance with respect to ~ ÿ sx ; Y~ ÿ sy has its minimum translation, i.e., D X when sx and sy , respectively, are chosen to be the means of ~ x and ~ y .2 The Euclidean distance between two normalized sequences is directly related to their cross-correlation.3 The cross-correlation is a statistic measure to find out if two random variables (such as the crops harvested and the rain level) are linearly correlated. A value of cross-correlation near 0 often indicates that the variables are independent, while a value near 1 or -1 indicates a strong positive or negative correlation. ~ Y~ 2 n ÿ 1 1 ÿ X; ~ Y~: D2 X;

13

Equation (13) can be derived by expanding the Euclidean distance formula and replacing the mean and the standard deviation, respectively, by 0 and 1 in both the Euclidean distance and the crosscorrelation formulas. The second property can be quite useful in formulating similarity queries or translating one query to another. Since the Euclidean distance between two sequences can range from zero to infinity, it is usually difficult to specify a threshold for this distance. Instead, we can specify a threshold for cross-correlation which is between -1 and 1 and plug it into (13) to find a threshold for the Euclidean distance. Using (13), we can also translate any expression that uses the cross-correlation in a query to one that uses the Euclidean distance or vice versa. 2. This can be verified by taking the first derivatives of D w.r.t. sx and sy and equating them to zero. ~ Y~ ÿX ~ :Y~ ~ Y~ X 3. X; ~ ~ X

Y

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

681

Fig. 6. On the top from left to right, daily closing of Dow Jones 65 Composite Volume (COMPX) index, NYSE Volume (NYV) index and both put together, normalized and smooth using a nine-day moving average. On the bottom from the left to right, again daily closing of COMPV index, NYSE Declining Issues (DECL) index and both put together, normalized and smoothed using 19-day moving average.

Transformations can also be defined to stretch the time dimension (Example 1.2). Details are given elsewhere [20].

2.2 Queries with Multiple Transformations It is often desirable to specify a set of transformations in a query. This is particularly useful if there are different ways of removing variations and one is not sure which one should be used. Given a distance function D and a set of transformations denoted by T , the semantics of D T ~ x; ~ y for arbitrary data points ~ x and ~ y is defined as follows: D T ~ x; ~ y minfD t ~ x; ~ y j t 2 T g:

14

Example 2.3 Fig. 6 shows daily closings of three indices: Dow Jones 65 Composite Volume (COMPV), NYSE Volume (NYV), and NYSE Declining Issues (DECL). It is difficult to see any similarity between these sequences. The Euclidean distance between closes of COMPV and NYV is 2,873 and that between COMPV and DECL is 12,939. On the other hand, if we normalize the closes of COMPV and NYV and compare their nine-day moving averages, they look very similar. The Euclidean distance between the nine-day moving averages of the normalized closes of COMPV and NYV is less than 3. Similarly, if we normalize the closes of COMPV and DECL and compare their 19-day moving averages, they also look quite similar. In fact, ª19-day moving averageº is the shortest moving average that reduces the Euclidean distance between normalized closes of COMPV and DECL to less than 3. To identify these similarities, a query may specify a set of moving averages to be applied to sequences. Although it may seem that if two sequences are similar w.r.t. n-day moving average, they should be similar w.r.t. n 1-day moving average, this is not true in general. For a counter example, see Lemmas 6 and 7 in Appendix A. Similar to the way we did for single transformations, we can compose two sets of transformations. Given two transformation sets T1 and T2 , for example, respectively,

corresponding to ªs-day shiftº for s 0; . . . ; 10 and ªm-day moving averageº for m 1; . . . ; 40, we can construct transformation set T3 T2 T1 , which corresponds to a ªs-day shiftº followed by an ªm-day moving averageº for all possible values of s and m, as follows: T3 ft3 t2 t1 j t1 2 T1 ; t2 2 T2 g;

15

where t2 t1 is defined by (11). Using (11) and (15), we can simplify a query by replacing any expression that uses a sequence of transformations with one that uses only a single or a set of transformations. We can process the resulting query using the techniques proposed in Section 3.2.

2.3 Safety of Transformations Time series data can be easily stored in a multidimensional index and be efficiently retrieved using simple similarity queries. However, to the best of our knowledge, no prior work has been done on using these access methods for efficiently evaluating queries that use transformations in their expression of similarity. To achieve this goal, we define a notion of safety for transformations. We show in Section 3 that any query that only uses safe transformations in their expression of similarity can be efficiently supported using multidimensional indexes. Definition 2. An affine transformation is safe if it maps every jaxis-parallel line segment in Rn ; Rn , for j 1; . . . ; n, to a jaxis-parallel line segment. Lemma 1. An affine transformation t is safe iff it can be expressed as t ~ a; ~ b, where ~ a; ~ b 2 Rn . Proof. For the clarity of the presentation, we give the proof in a two-dimensional space. The extension of the result to an arbitrary n-dimensional space is straightforward. (if) Suppose t can be expressed as t ax ; ay ; bx ; by . Let x1 ; y1 ; x2 ; y2 be an arbitrary axis-parallel line segment. If the line segment is parallel to the x axis, i.e., y1 y2 , then it will remain parallel to the x axis after the

682

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

transformation because ay :y1 by ay :y2 by . Similarly, if the line segment is parallel to the y axis, then it will remain parallel to the y axis after the transformation. Thus, t is safe. (only if) Suppose t is safe. We show that t can be expressed as ax ; ay ; bx ; by . Let x1 ; y1 ; x2 ; y2 be an arbitrary axis-parallel line segment and x01 ; y01 ; x02 ; y02 be its image under t. We can find ax ; ay ; bx ; by such that x01 ax :x1 bx ; y01 ay :y1 by ; x02 ax :x2 bx ; y02 ay :y2 by :

For instance, if the original line segment is parallel to the x axis, then y1 y2 , y01 y02 and there will be three equations with four unknown variables to be solved. The rest is straightforward. u t Since we use the Fourier series representation of time series data as our features, and a Fourier coefficient, in general, is a complex number, we need to do a mapping before we can show the safety of transformations. If we decompose a complex number into its real and imaginary components, then we can represent a vector of k Fourier coefficients with a point in R2k . We denote this mapping from the complex plane to the real plane by Mrect . Alternatively, we can decompose a complex number into its components in the polar coordinate system where a complex number is represented by a magnitude and a phase angle. We denote this mapping from the complex plane to the real plane by Mpol . We use Re x, Im x, Abs x, and Angle x to denote, respectively, the real, the imaginary, the magnitude, and the phase angle of a complex number x. We now show that the transformations described for time series data in previous sections are safe. Definition 3. Let ~ a and ~ b be vectors of complex numbers and a~0 0 ~ and b be respectively their images under a mapping M from the complex plane to the real plane. Transformation t ~ a; ~ b 0 0 0 ~ ~ is safe w.r.t. M if t a ; b is safe. Lemma 2. Let ~ a be a vector of real numbers, and ~ b be a vector of complex numbers; the transformation t ~ a; ~ b is safe w.r.t. Mrect . Proof. Without loss of generality, we assume the real and the imaginary components of the complex coordinate j (for j 1; ; k) are respectively mapped to the coordinates 2j ÿ 1 and 2j of the real plane. Suppose ~ z is a kdimensional vector of complex numbers and ~ x, a 2kdimensional vector, is its image under Mrect . We have zj x2jÿ1 x2j i for j 1; ; k. If we apply transformaz ~ a ~ z~ b. We can rewrite this tion t to ~ z, we get ~ z0 t ~ as follows: z0j aj x2jÿ1 x2j i Re bj Im bj i aj x2jÿ1 Re bj aj x2j Im bj i for j 1; ; k. If we map the resulting vector to a point x0 using Mrect , we get x02jÿ1 aj x2jÿ1 Re bj and x02j aj x2j Im bj for j 1; ; k. This transforma~ where c2jÿ1 c2j aj , c; d, tion can be rewritten as t0 ~ c d2jÿ1 Re bj , and d2j Im bj for j 1; ; k. Since ~ and d~ are vectors of real numbers, the rest follows from Lemma 1. u t

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

On the other hand, Lemma 2 does not hold if ~ a is chosen to be a vector of complex numbers. For example, consider the axis-parallel line segment made by the two points p ÿ5 2i and q ÿ5 ÿ 5i. If we apply the transformation t 2 ÿ 3i; 0 to the line segment, i.e., we multiply the two endpoints by 2 ÿ 3i, the resulting line segment made by t p ÿ4 19i and t q ÿ25 5i is not axis-parallel anymore! Lemma 3. Let ~ a be a vector of complex numbers, and ~ b be a zero vector (~ b ~ 0); the transformation t ~ a; ~ b is safe w.r.t. Mpol . Proof. Without loss of generality, we assume the magnitude and the phase angle of the complex coordinate j (for j 1; ; k) are respectively mapped to the coordinates 2j ÿ 1 and 2j of the real plane. Suppose~ z is a k-dimensional vector of complex numbers and ~ x is its image under Mpol . We have zj x2jÿ1 ex2j i for j 1; ; k. If we apply z ~ a ~ z~ b. We can transformation t to ~ z, we get ~ z0 t ~ rewrite this as follows: z0j Abs aj eAngle aj i x2jÿ1 ex2j i 0 Abs aj x2jÿ1 e x2j Angle aj i for j 1; ; k. If we map the resulting vector to a point x0 using Mpol , we get x02jÿ1 Abs aj x2jÿ1 and x02j x2j Angle aj for j 1; ; k. This transformation can ~ where c2jÿ1 Abs aj , d2jÿ1 0, c; d, be rewritten as t0 ~ c and d~ c2j 1, and d2j Angle aj for j 1; ; k. Since ~ are vectors of real numbers, the rest follows from Lemma 1. u t In the next section, we develop efficient algorithms for evaluating queries that only use safe transformations in their expression of similarity.

3

QUERY EVALUATION USING R-TREE INDEXES

Given a set of time series data, an index can be constructed as follows ([1], [21]): Find the DFT of each sequence and keep the first few DFT coefficients as the sequence features. If we only keep the first k DFT coefficients, sequences will become points in a k-dimensional space. These points can be organized in a multidimensional index such as the R-tree family [11], [4], [25], the k-d-B-tree [13], or the grid file [16]. In this section, we describe the query evaluation for R-tree indexes. The same techniques can be extended to other index structures. We do that for the k-d-B-tree and the grid file in Section 4. An R-tree can be seen as a natural extension of B-trees for more than one dimension. Nonleaf nodes contain entries of the form (MBR, ptr), where MBR is a Minimum Bounding Rectangle of all the entries in a descendent node, and ptr is a pointer to the descendent node. Bounding rectangles at the same tree level may overlap. Leaf nodes for point data sets contain a set of data points or pointers to full data records.

3.1

Evaluation of Queries with Single Transformations Given an R-tree index I on a data set S, and any safe transformation t, we can construct an R-tree index I 0 for t S ft ~ x j ~ x 2 Sg as follows: Algorithm 1. For every node n

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

683

n MBR1 ; ptr1 ; ; MBRm ; ptrm in I, construct node n0 in I 0 such that n0 t n MBR01 ; ptr01 ; ; MBRm ; ptr0m ; where MBR0i is the rectangle obtained by applying t to the endpoints of MBRi and ptr0i is a pointer to the node t ni if ni is the node pointed by ptri . If n is a leaf, ptr0i ( ptri ) is a pointer to a data tuple or null. Since MBRi is an axisparallel rectangle and t is a safe transformation, MBR0i is an axis-parallel rectangle due to the definition of a safe transformation. Thus, the new node n0 is a valid R-tree node. The construction stops when every node in I is mapped to a node in I 0 . The original index I can be constructed in many different ways, each with a different performance, depending on the order of insertions and the way splits and overlaps are handled. Our experiments show that the new index I 0 has a similar performance to that of the original index. The main observation here is that for a given index I and transformation t, index I 0 can be built on the fly without having much impact on the performance of the search. This allows us to use one index for many possible transformations. As our running example, consider the following proximity query: Query 1. Given a query point ~ q, a safe transformation t and a threshold , find all points ~ x in the data set such that the Euclidean distance D t ~ x;~ q < . A naive evaluation of this query requires reading the whole data set, applying t to every data point and choosing every point ~ x such that D t ~ x;~ q < . This is a costly process. A better approach is to use Algorithm 1 to construct a new index for transformed data points. This new index can be built on the fly during the search operation as follows: Algorithm 2. Suppose an R-tree index is constructed on the first k DFT coefficients of sequences, and suppose the root node is denoted by N. 1.

Preprocessing:

2.

Transform t and ~ q into the frequency domain if they are in the time domain. Let us denote the qk , first k Fourier coefficients of t and ~ q by tk and ~ respectively. qk . A search b. Build a search rectangle qrect for ~ rectangle is the minimum bounding rectangle that contains all points within the Euclidean distance of ~ q. Building a search rectangle is straightforward in the rectangular coordinate s y s t e m ; i t i s s i m p l y qi ÿ ; qi f o r i 1; ; 2k. The minimum bounding rectangle for a complex number mej in the polar coordinate system is demonstrated in Fig. 7. The magnitude is in the range from m ÿ to m , and the angle is in the range from ÿ arcsin m to arcsin m . Search: a.

Fig. 7. Minimum bounding rectangle in the polar coordinate system.

a.

3.

If N is not a leaf, apply t to every (rectangle) entry of N and check if the resulting rectangle overlaps qrect . For all overlapping entries, call Search on the index whose root node is pointed to by the overlapping entry. b. If N is a leaf, apply t to every (point) entry of N and check if the resulting point overlaps qrect . If so, the entry is a candidate. Postprocessing:

For every candidate point ~ x, check its full database record to determine if the Euclidean distance between t ~ x and ~ q is at most . If so, the entry is in the answer set. The algorithm is guaranteed not to miss any qualifying sequence (see Appendix A for a proof). Similarly, all-pairs queries and nearest-neighbor queries can be efficiently processed using the index. For an all-pairs query, if we do a spatial join using the index, the only change in the spatialjoin algorithm will be to replace the join predicate evaluation with one that applies the desired transformation to the objects used in the join predicate. For example, the join predicate evaluation for ai \ bj 6 ; can be easily changed to that for t ai \ t bj 6 ;, where t is a transformation and ai and bj are members of two spatial sets (such as two MBRs). For a nearest-neighbor query, the search starts from the root and proceeds down the tree. As we go down the tree, we apply t to all entries of the node we visit. This change can be easily encoded in existing nearest- neighbor search algorithms, such as those proposed by Roussopoulos et al. [19] and Seidl et al. [26]. a.

3.2

Evaluation of Queries with Multiple Transformations Given an R-tree index on a data set S, consider evaluating the following proximity query: Query 2. ªGiven a query time series ~ q and a set T of safe transformations, find every time series ~ s 2 S and transformation t 2 T such that the Euclidean distance D t ~ s; t ~ q < .º As a specific example, T could be the set of m-day moving averages for m 2 f1 . . . 40g and we may want to find all stocks that have an m-day moving average similar to that of IBM. One way to process this query is to scan the data set sequentially and for every sequence~ s 2 S and transformation

684

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 8. From left to right, the second DFT coefficient of m-day moving averages for m 1; . . . ; 40 and a data rectangle before and after being transformed.

t 2 T find out if the distance predicate evaluates to true. The cost of this algorithm is one scan of the data set and computing the distance predicate j T j j S j times. Another approach is for every transformation t 2 T , use Algorithm 2 to retrieve all sequences within the proximity of the query sequence. The union of the results retrieved gives the query answer. We call this algorithm ST-index, where ST stands for ªa Single Transformation at a time.º The cost of this algorithm is j T j times the cost of traversing the index. The third approach is to group transformations together and scan the index once for every group. We call this algorithm MT-index, where MT stands for ªMultiple Transformations at a time.º There are two issues that need to be resolved here: first, how shall transformations be grouped together; second, how can a group of transformations be efficiently processed. We first address the second issue. Since a transformation is a pair of n-dimensional vectors, it is a point in a 2n-dimensional space. We assume both vectors are of real numbers. If not, one can rewrite any safe transformation in terms of a pair of real vectors (due to Lemma 1). Given a group of transformations, we can construct a minimum bounding rectangle (MBR) for all transformations in the group. Having an R-tree index built over sequences, we can apply the transformation rectangle to entries of the index and construct a new index on the fly. To apply a transformation rectangle to a data rectangle, we decompose the 2n-dimensional transformation rectangle into two n-dimensional MBRs, one corresponding to ~ a which we denote by mult-MBR, and the other corresponding to ~ b which we denote by add-MBR. Given mult-MBR: < M1 l; M1 h; . . . > , add-MBR: < A1 l; A1 h; . . . > and data rectangle X: < X1 l; X1 h; . . . > , the result of applying multMBR and add-MBR to rectangle X is rectangle Y: < Y1 l; Y1 h; . . . > , where Yi l Ai l min Mi l Xi l; Mi l Xi h; Mi h Xi l; Mi h Xi h Yi h Ai h max Mi l Xi l; Mi l Xi h; Mi h Xi l; Mi h Xi h for all dimensions i. As an example, consider the points of m-day moving average for m 1; . . . ; 40 and their MBR. The result of applying the MBR of these transformations to a data rectangle is shown in Fig. 8.

We now address the next issue, i.e., how shall transformations be grouped together. Consider the extreme case where all transformations in T are grouped together. If a few transformations spread all over the space, the minimum bounding rectangle of transformations will cover a large area. This MBR, when applied to a data rectangle, can easily make the data rectangle intersect the query region. This can reduce the filtering power of the index dramatically. On the other hand, if the number of groups and as a result the number of MBRs goes up, the area of each MBR gets smaller. Thus, the filtering power of each MBR increases, but the same index needs to be traversed several times. In the worst case, the number of MBRs is the same as the number of transformations, i.e., every MBR includes only one transformation point. In such a case, both ST-index and MT-index perform exactly the same. Now, the question is how we should optimally choose MBRs for a given set of transformations such that the total cost (in terms of the number of disk accesses) becomes minimum. One solution is to estimate the cost for any possible set of MBRs and choose the set with minimum cost. A first attempt in estimating the cost for a given set of MBRs is to use the total area of MBRs. However, the total area is minimum if every MBR includes only one transformation point, i.e., the ST-index algorithm is used. Another approach for estimating the cost of a given set of MBRs is to apply MBRs for a fixed data rectangle, say a unit square, then compute the total area of the resulting data rectangles. Due to this estimation, the best performance should be obtained using only one transformation rectangle. However, our experiments showed that using one transformation rectangle did not necessarily give the best performance. The worst performance for MT-index, which is close to that of ST-index, is when we pack two clusters of transformations into one rectangle. A solution to avoid this problem is to use a cluster detection algorithm (such as CURE [10]) and avoid packing two clusters into one rectangle. We can now write the search algorithm for Query 2 as follows: Algorithm 3. Given an R-tree index which is built on the first k Fourier coefficients of time series and whose root is N, a safe transformation set T , a threshold , and a search sequence ~ q, use the index to find all sequences that become within distance of ~ q after being transformed by a member of T .

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

1.

Decompose T into sets T1 ; T2 ; . . . using a clustering algorithm. 2. Do the following steps (Steps 3 to 6) for every set Ti : 3. Build an MBR for points in Ti and project it into a mult-MBR and an add-MBR as described earlier. 4. If N is not a leaf, apply the mult-MBR and the addMBR to every (rectangle) entry of N using (16) and check if the resulting rectangle intersects qrect . For every intersecting entry, go to Step 4 and do this step on the index rooted at the node of the intersecting entry. 5. If N is a leaf, apply the mult-MBR and the add-MBR to every (point) entry of N and check if the resulting rectangle intersects qrect . If so, the entry is a candidate. 6. For every candidate entry, retrieve its full database record, apply all transformations in Ti to the sequence, and determine transformations that reduce the Euclidean distance between the data sequence and the query sequence to less than . This algorithm is guaranteed not to miss any qualifying sequence. The proof is similar to the one given in Appendix A for Algorithm 2. We can develop similar algorithms for efficiently processing spatial join and nearest-neighbor queries. In a spatial join query, we apply the transformation MBR to all data items used in the join predicate before computing the predicate. Similarly in a nearest-neighbor query, as we walk down the tree, we apply the transformation MBR to all entries of the node we visit. This change can be easily incorporated into the existing nearest-neighbor search algorithms on the R-tree. So far, we have made no assumption on any possible ordering among transformations. Next, we show how one can exploit such an ordering during query evaluation.

3.3 Orderings of Transformations In this section, we define a notion of ordering between transformations and show that this notion can be quite useful in guiding the search more effectively. Definition 4. Let T be a set of transformations on a vector space S (e.g., Rn ) and D be a distance function on pairs of points in S; we call < T ; > an ordering of T w.r.t. D if 8~ vi ; v~j 2 S; 8tk ; tl 2 T , vi ; tl ~ vj D tk ~ vi ; tk ~ vj : tl tk ) D tl ~ Once an ordering is established among transformations, we can use this ordering to guide the search more cleverly. To give an example, consider Query 2 with the transformation set T ft2 ; . . . ; t100 g on Rn , where ti means ªscale by factor i.º An ordering of T w.r.t. the Euclidean distance is: tl tk if l < k (see Appendix A for a proof). To find all transformations that make a data sequence become similar to the query sequence, we do not need to apply all transformations to sequences. Instead, we need to find the ti with the largest i (i.e., the ti that corresponds to the largest scale factor) that makes the distance predicate true. One way to find ti is to do a binary search on the ordered list of transformations. Definition 4 implies that the distance predicate is true for every transformation tj , where j < i.

685

Given an ordered set T of transformations on the Fourier series representation of time series w.r.t. the Euclidean distance, we can use the binary search technique in all three algorithms presented in Section 3.2. In the case of the sequential scan method, we still need to scan the whole data set S. However, the number of transformations to be checked or the number of sequence comparisons drops to j S j log j T j . The ordering assumption reduces the number of index traversals for STindex to log j T j . In the case of MT-index, suppose T is decomposed into transformation groups T1 ; T2 ; . . . ; Tm . If the ordering is preserved among transformation groups, then the number of transformation groups to be checked and as a result the number of index traversals reduces to log m, whereas the number of transformations to be checked inside each group Ti (out of the log m groups tested) reduces to log j Ti j . However, if the ordering is not preserved among transformation groups, then the number of index traversals will remain the same and the Pm total number of comparisons reduces to i1 log j Ti j . There are useful transformations on time series that are not ordered w.r.t. the Euclidean distance. For example, no ordering is possible for a set of moving averages on Rn w.r.t. the Euclidean distance. See Appendix A for a proof.

4

QUERY EVALUATION USING OTHER INDEX STRUCTURES

The algorithms presented in Section 3 can be easily extended to other point-based access methods. To this end, we briefly examine two other index structures, namely the grid file and the k-d-B-tree. Compared to the R-tree which partitions the data points, these two access methods partition the embedding space that contains the data points. In this section, we show how these indexes can be used in efficiently evaluating our similarity queries.

4.1 Use of the Grid File The grid file imposes an n-dimensional grid decomposition of space. A grid directory maps one or more cells of the grid into a data bucket. The grid, represented by n onedimensional arrays called scales, is kept in main memory while the directory and data buckets are stored on disk. Given a set of time series data, a grid file can be constructed as follows: find the DFT of each sequence and keep the first k DFT coefficients in a 2k-dimensional grid file with data buckets keeping the full sequences. If a query requires a safe transformation t ~ a; ~ b to be applied to the data set, a new grid file can be constructed on the fly by ~i Xi1 ; Xi2 ; . . . along replacing every scale array X ~0 ai :Xi1 bi ; ai :Xi2 dimension i with scale array X i x with t ~ x. The result is still bi ; . . . and every data record ~ guaranteed to give a valid grid file, due to the safety of t. Transformations may also be grouped together and simultaneously applied to the grid file. This can be done by: 1) building an MBR for each group of transformations, 2) applying each side of the MBR to the scale array associated with that side, and 3) applying all transformations inside the MBR to the qualifying sequences.

686

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 9. Time per query varying the number of sequences.

4.2 Use of the K-D-B-Tree The k-d-B-tree imposes an irregular (and not unique) decomposition of space. A leaf node contains a set of data records or pointers to data records. An interior node contains a set of pairs (region, ptr) with region corresponding to the disjoint union of regions represented by the node pointed to by ptr if ptr points to an interior node; otherwise, it is a bounding rectangle for a set of points. Given a k-d-B-tree on a set of time series data, and a query that requires a safe transformation t to be applied to data sequences, a new k-d-B-tree can be constructed on the fly by applying t to entries of the index (which are either rectangles or points) as the search proceeds. Due to the safety of t, the result of applying t to a k-d-B-tree is guaranteed to give a valid k-d-B-tree. Similarly, transformations may be grouped together and simultaneously applied to the k-d-B-tree. The algorithm is very similar to that of the R-tree.

5

EXPERIMENTAL RESULTS

We implemented our methods on top of Norbert Beckmann's Version 2 implementation of the R -tree [4]. We ran experiments on both real stock prices data obtained from the FTP site ªftp.ai.mit.edu/pub/stocks/results/º and synthetic data. The stock prices database consisted of 1,068 stocks and for each stock, its daily closing prices for 128 days. To test our algorithms on a larger data set and also to simulate the experiments reported in the literature for the purpose of comparison, we also did some experiments on synthetic data. Each synthetic sequence was in the form of ~ x xt , where xt xtÿ1 zt and zt was a randomly generated number in the range ÿ500; 500. For every time series, we first transformed it to the normal form for reasons described in Section 2.1, and then we found its Fourier coefficients. Since the mean of a normal form series is zero by definition, the first Fourier coefficient is always zero, so we can discard it. For every

sequence, we stored the magnitudes and the angles of the second and the third DFT coefficients in the index. All our experiments were conducted on a 168MHZ Ultrasparc station. For our experiments on proximity queries, we ran each experiment at least 100 times. Each time we chose a random query sequence from the data set and searched for all other sequences within distance of the query sequence. The execution time was averaged over these runs. Our all-pairs queries were spatial self-join queries where we searched the data set for all sequence pairs within distance of each other. We report our experiments in two parts: 1. 2.

evaluating queries with single transformations (e.g., Query 1), evaluating queries with multiple transformations (e.g., Query 2).

5.1 Evaluating Queries with Single Transformations In this section, first we show that the overhead of applying single transformations to the index is negligible. We then compare our method to sequential scanning. 5.1.1 Overhead on the Index for Proximity Queries Our first experiment was on synthetic data. Fig. 9 compares the execution time for two kinds of queries: 1) a proximity query using transformations and 2) a proximity query that uses no transformations. We kept the sequence length fixed to 128 while we varied the number of sequences from 500 to 12,000. In order to have a precise comparison, the identity ~~ ~ X ~ 0 was chosen such that Ti X transformation Ti I; ~ (I~ is a vector of 1s and ~ for all sequences X 0 is a vector of 0s). This made the two queries produce the same results. Instead of setting the distance threshold directly in our proximity queries, we set the correlation to 0:85, plugged it in (13) and found a value for . This threshold resulted in an average output size ranging from 1 to 69. As Fig. 9 shows there is not much difference between the two curves. The minor difference is due to the CPU time spent for vector

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

687

Fig. 10. Time per query varying the sequence length.

Fig. 11. Time per query varying the number of sequences.

multiplication which is unavoidable. The number of disk accesses is the same in both cases. For a nonidentity transformation, the CPU time spent for vector multiplication is still the same, but the number of disk accesses can be more or less depending on whether the transformation inflates or deflates the index rectangles. In the next experiment, we kept the number of sequences fixed to 1,000 while we varied the length of the sequences from 64 to 1,024. The experiment was on proximity queries over synthetic data. The correlation threshold was again set to 0:85; this resulted in an average output size ranging from 1 to 6. We used the identity transformation again for the reason described in the previous experiment. To increase the accuracy of our measurements, we ran each experiment 1,000 times. As demonstrated in Fig. 10, the result was the same. Thus, the index traversal for similarity queries does not deteriorate the performance of the index.

5.1.2 Comparison with Sequential Scanning for Proximity Queries Figs. 11 and 12 compare the execution time of our approach to sequential scanning for proximity queries. In Fig. 11, the sequence length was fixed to 128 and the number of sequences varied from 500 to 12,000. In Fig. 12, the number of sequences was fixed to 1,000 and the sequence length varied from 64 to 1,024. The experiment was again on synthetic data. The correlation threshold was set to 0:85 and the identity transformation was also used. To have a good implementation of the sequential scanning algorithm, the transformation is applied to both data and query sequences during the distance computation; this means both sequences are scanned at most once when they are in the main memory. We also stop the distance computation process as soon as the distance exceeds . Both graphs show the superiority of our algorithm.

688

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 12. Time per query varying the sequence length.

5.1.3 Comparison for All-Pairs Queries Our last experiment in this part was the spatial self-join. We used the following methods: 1.

Scan the relation of time sequences sequentially, and compare every sequence ~ s to all other sequences in the relation; the transformation tmavg20 is applied to every sequence during the comparison. The distance computation is stopped as soon as the distance exceeds . 2. Scan the relation of Fourier coefficients sequentially, and for every time series build a search rectangle and pose it as a proximity query to the index after applying tmavg20 to both the index and the search rectangle. 3. Do the spatial self-join as described in b without applying any transformation. The experiment ran on a relation of stock prices data that had 1,068 time sequences and the length of each sequence was 128. The correlation threshold was set to 0:9895 and the distance threshold was computed accordingly. The result of the test is shown in Table 1. Method b compared to its competitor method a is 6 to 7 times faster. Methods b and c compute different queries, but the comparison between them confirms that the overhead on the index is very small.

5.2

Evaluating Queries with Multiple Transformations In this section, we first compare the performance of the MTindex to that of the ST-index and sequential scan. To do the comparison, we made the choice of packing all transformations into one rectangle though it did not necessarily give us the best possible performance of MT-index. In the subsequent section, we show the effect of grouping transformations on the performance of MT-index. In all our experiments over proximity queries, we set the correlation threshold fixed to 0.96 and used (13) to find a value for the Euclidean distance threshold.

5.2.1 Comparing MT-index to ST-index and Sequential Scan Fig. 13 shows the running time of Query 2 using three algorithms sequential-scan, ST-index, and MT-index. The transformations were a set of moving averages ranging from 10-day moving average to 25-day moving average with their number fixed to 16. The experiment ran on synthetic sequences of length 128 with their number varying from 500 to 12,000. The average output size was 7 or more depending on the number of input sequences. The figure shows that the MT-index performs better than both the ST-index and sequential-scan. Fig. 14 shows the running time of Query 2, again using three algorithms sequential-scan, ST-index, and MT-index. In the experiment, we set the number of sequences fixed to 1,068 but we varied the number of transformations from 1 to 30. The transformations were a set of moving averages ranging from a five-day moving average to a 34-day moving average. The experiment ran on real stock price data. The average output size was 11 or more depending on the number of transformations. The figure shows that the MT-index outperforms both the ST-index and sequential-scan. The next experiment was on a spatial join query which was expressed as follows: Query 3. ªGiven a set of transformations denoted by T, find every pair s~1 and s~2 of time series and every t 2 T such that s2 0:99.º the correlation t ~ s1 ; t ~

TABLE 1 The Result of the Join

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

689

Fig. 13. Time per query varying the number of sequences.

Fig. 14. Time per query varying the number of transformations.

The transformations were again a set of moving averages ranging from a five-day moving average to a 34-day moving average. We varied the number of transformations from 1 to 30. The experiment ran on real stock prices data which consisted of 1,068 sequences of length 128. The average output size was at least 7. Fig. 15 shows the running time of Query 3 using three algorithms: sequential-scan, STindex, and MT-index. Both ST-index and MT-index perform better than sequential-scan. As we increase the number of transformations, the MT-index algorithm performs better than the ST-index until the number of transformations gets 30. At this point, the running time for both is the same.

5.2.2 Multiple Transformation Rectangles In this section, we show that grouping all transformations in one rectangle does not necessarily give us the best possible performance. To show this, we ran Query 2 using

the MT-index algorithm on real stock price data, but varied the number of transformations per MBR from one to its maximum. The transformation set consisted of m-day moving averages for m 6; . . . ; 29. We equally partitioned subsequent transformations and built an MBR for each partition. As is shown in Fig. 16, despite the fact that collecting all transformations in one rectangle resulted in the minimum number of disk accesses, it did not necessarily give us the best running time. We later added the inverted version of each transformation, which was obtained by multiplying every coefficient by -1, to the transformation set. This created two clusters in a multidimensional space. Again, we equally partitioned subsequent transformations and built an MBR for each partition. We varied the number of transformations per MBR from one to 48 which was the size of the

690

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 15. Time per query varying the number of transformations.

Fig. 16. Both the running time and the number of disk accesses varying by the number of transformations per MBRs.

transformation set. As is shown in Fig. 17, the running time shows bumps when we pack one third or all of the transformations in a rectangle. The same bumps are also observed in the number of disk accesses. This is due to the fact that, in these two cases, two separate clusters (or parts of them) are placed in one transformation rectangle. These experiments show that as we start packing transformations into rectangles, we see a major performance improvement up to a certain point (six to eight transformations per rectangle here). The performance after this point either stays the same or goes down. The worst performance for MT-index, which was even worse than ST-index, was when we packed two clusters of transformations into one rectangle. A solution to avoid this problem is to use a cluster detection algorithm in advance and avoid packing more than one cluster to a rectangle.

6

CONCLUSIONS

We have proposed a class of transformations that can be used in a query language to express similarity between time series in a general way. This class allows the expression of several practically important notions of similarity and queries using this class can be efficiently implemented on top of point-based multidimensional indexes. Our contributions can be summarized as follows: . . .

Formulation of an interesting class of transformations, including the moving average in our transformation language. Development of a safety condition for transformations. Implementation of similarity matching under these transformations on top of the R-tree index.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

691

Fig. 17. Both the running time and the number of disk accesses varying by the number of transformations per MBRs.

.

Development of a new algorithm that applies multiple transformations specified in a query to a set of sequences in one scan of the R-tree index built on those sequences. . Development of a notion of ordering among transformations and using this notion in efficiently guiding the search. We evaluated our methods over both real stock prices and synthetic data. In the case of queries with single transformations, the experiments show that the execution time of our method is almost the same as that of accessing the index with no transformations; our method has much better performance than sequential scanning, and the

kÿ1 X

2

j a f x f b f ÿ qf j

f0

!12

nÿ1 X

2

j af xf bf ÿ qf j

!12

f0

16 for k n. Thus, sequence ~ x must be returned by the proximity query on the index. This is a contradiction. t u Lemma 5. Let T ft1 ; t2 ; . . . ; tm g be a set of transformations on Rn , where ti means ªscale by factor iº; an ordering of T w.r.t. the Euclidean distance is defined as tl tk if l < k. Proof. Let ti and tj be two arbitrary transformations in T and suppose, without loss of generality, i < j. Let ~ x and ~ y x; ~ y denotes their be two arbitrary points in Rn and D ~

performance gets better as both the number and the length

Euclidean distance. Since D ~ x; ~ y is a positive number,

of sequences increase. In the case of queries with multiple

we can multiply it to both sides of inequality i < j. This

transformations, the experiments show that the given

will give us

algorithm for handling multiple transformations outperforms both sequential scanning and the index traversal using one transformation at a time.

sequence. Proof. Suppose ~ x is a qualifying data sequence, i.e., D t ~ x;~ q < , but ~ x is not returned by Algorithm 2. If we represent t by ~ a; ~ b, we can write nÿ1 X f0

However,

!0:5

k0

Lemma 4. Algorithm 2 is guaranteed not to miss any qualifying

j a f x f b f ÿ qf j

2

!12

17

but we have nÿ1 X xk ÿ yk 2 i:D ~ x; ~ y i:

APPENDIX A SOME PROOFS

D ~ a~ x~ b;~ q

i:D ~ x; ~ y < j:D ~ x; ~ y;

nÿ1 X

18

!0:5 2

i:xk ÿ i:yk

D i:~ x; i:~ y:

k0

Equations (17) and (18) imply D i:~ x; i:~ y < D j:~ x; j:~ y. t u Lemma 6. Let T denote a set of (circular) moving averages on Rn and D denote the Euclidean distance; no ordering is possible for transformation set T w.r.t. D.

:

Proof. We prove this lemma by contradiction. Suppose there is an ordering among members of T w.r.t. D. Consider the following sequences:

692

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

s~1 10; 12; 10; 12 s~2 10; 11; 12; 11 s~3 11; 11; 11; 11:

.

s3 D mv2 ~ s2 ; mv2 ~ s3 0:33: 0:87 > D mv3 ~ s2 ; mv3 ~

mv2 ~ s2 10:5; 10:5; 11:5; 11:5; mv2 ~ s3 11; 11; 11; 11; mv3 ~ s1 10:67; 11:33; 10:67; 11:33;

.

mv3 ~ s2 11; 10:67; 11; 11:33; mv3 ~ s3 11; 11; 11; 11:

Case 2: mv3 mv2 By Definition 4, sj D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~

There are two possible orderings between mv2 and mv3: Case 1: mv2 mv3 By Definition 4,

for all pairs s~i and s~j . However, this does not hold for s~1 and s~3 ; s3 D mv3 ~ s1 ; mv3 ~ s3 0: 0:47 > D mv2 ~ s1 ; mv2 ~

sj D mv3 ~ si ; mv3 ~ sj D mv2 ~ si ; mv2 ~ for all pairs s~i and s~j . However, this does not hold for s~2 and s~3 ; s3 D mv2 ~ s2 ; mv2 ~ 1 > D mv3 ~ s2 ; mv3 ~ s3 0:75:

There are no other cases, so the proof is complete.

APPENDIX B EXPRESSING MOMENTUM

Case 2: mv3 mv2 By Definition 4, sj D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~ for all pairs s~i and s~j . However, this does not hold for s~1 and s~3 ; s3 D mv3 ~ s1 ; mv3 ~ s3 0: 0:66 > D mv2 ~ s1 ; mv2 ~

There are no other cases, so the proof is complete.

Case 1: mv2 mv3 By Definition 4,

for all pairs s~i and s~j . However, this does not hold for s~2 and s~3 ;

mv2 ~ s1 11; 11; 11; 11;

.

SEPTEMBER/OCTOBER 2000

D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~ sj

If we denote the circular two-day moving average by mv2 and the circular three-day moving average by mv3, we can write

.

VOL. 12, NO. 5,

u t

AS A

u t

TRANSFORMATION

~ 1; ÿ1; 0; . . . ; 0 be a vector of length n and ~ Let m x be a ~ time series of the same length. Let us denote the DFT of m ~ and the DFT of ~ ~ The convolution of ~ by M x by X. x and m, ~ ~ gives the momentum of ~ conv ~ x; m, x. Since a convolution in the time domain corresponds to a multiplication in the ~ and X ~ gives the frequency domain, the product of M momentum in the frequency domain. If we use the polar ~ and M, ~ representation for complex numbers and map X 0 0 ~ ~ respectively, to real vectors X and M such that Mj 0 0 0 iM2j1 0 iX2j1 e and Xj X2j e , we will have M2j 0

0

0 0 :X2j ei X2j1 M2j1 Mj :Xj M2j

Lemma 7. Let T denotes a set of noncircular moving averages on Rn and D denotes the Euclidean distance; no ordering is possible for transformation set T w.r.t. D.

for j 0; . . . ; n ÿ 1. Thus, we can express the momentum operation as an affine transformation of the form ~ a; ~ b, 0 0 where a2j M2j , b2j 0, a2j1 1 and b2j1 M2j1 .

Proof. The proof is similar to that of Lemma 6. Suppose there is an ordering among members of T w.r.t. D. If we denote the noncircular two-day moving averages by mv2 and the noncircular three-day moving averages by mv3, we can write

APPENDIX C EXPRESSING TIME SHIFT

mv2 ~ s1 11; 11; 11; mv2 ~ s2 10:5; 11:5; 11:5; mv2 ~ s3 11; 11; 11; mv3 ~ s1 10:67; 11:33; mv3 ~ s2 11; 11:33; mv3 ~ s3 11; 11; where sequences s~1 , s~2 and s~3 are those given in Lemma 6. There are two possible orderings between mv2 and mv3:

AS A

TRANSFORMATION

Suppose we want to shift sequence ~ x x0 ; x1 ; . . . ; xnÿ1 one day to the right. If we inserted a zero at the beginning, the result after the shift would be ~ x0 0; x0 ; x1 ; . . . ; xnÿ1 which is a sequence of length n 1. Using (1), we can write the DFT of ~ x0 as follows: ! nÿ1 nÿ1 ÿi2 t1f ÿi2f ÿi2tf 1 X 1 X 0 xt e n1 e n1 p xt e n1 ; Xf p n 1 t0 n 1 t0 where f 0; . . . ; n. Time sequences are usually long, i.e., n is a large number, so we can easily replace n 1 inside the parentheses by n without much affecting the equation.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

Now, the expression inside the parentheses becomes Xf , the fth DFT coefficient of ~ x, and we can write ÿi2f

Xf0 e n1 Xf : This gives the first n Fourier coefficients of ~ x0 . If we use the polar representation for complex numbers, we can express the shift operation as an affine transformation of the ÿ2 1 form ~ 1; 0; ÿ2 0 n1 ; 0; n1 ; . . .. We can still do time shift even if ~ x is not a long sequence. The trick is to pad at least as many zeros as the amount of the shift at the end of the sequence. Now, we can forget the overflow zeros generated by the shift and consider the shifted sequence the same size as the original sequence.

ACKNOWLEDGMENTS This research was supported by Communications and Information Technology Ontario and the Natural Sciences and Engineering Research Council. This work was done when Davood Rafiei was a graduate student in the Computer Science Department at the University of Toronto.

REFERENCES [1]

[2]

[3] [4]

[5] [6] [7] [8]

[9]

[10] [11] [12] [13]

R. Agrawal, C. Faloutsos, and A. Swami, ªEfficient Similarity Search in Sequence Databases,º Proc. Fourth Int'l Conf. Foundations of Data Organizations and Algorithms (FODO '93), pp. 69±84, Oct. 1993. R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim, ªFast Similarity Search in the Presence of Noise, Scaling, and Translation in TimeSeries Databases,º Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 490±501, Sept. 1995. R. Agrawal, G. Psaila, E.L. Wimmers, and M. Zait, ªQuerying Shapes of Histories,º Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 502±514, Sept. 1995. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, ªThe R* Tree: An Efficient and Robust Index Method for Points and Rectangles,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '90), pp. 322±331, May 1990. K.K.W. Chu and M.H. Wong, ªFast Time Series Searching with Scaling and Shifting,º Proc. ACM Symp. Principles of Database Systems (PODS '99), pp. 237±248, 1999. R.D. Edwards and J. Magee, Technical Analysis of Stock Trends. Springfield, Mass., 1969. C. Faloutsos, H.V. Jagadish, A.O. Mendelzon, and T. Milo, ªA Signature Technique for Similarity-Based Queries,º Proc. Compression and Complexity of Sequences (SEQUENCES '97), June 1997. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, ªFast Subsequence Matching in Time-Series Databases,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '94), pp. 419± 429, May 1994. D.Q. Goldin and P.C. Kanellakis, ªOn Similarity Queries for TimeSeries Data: Constraint Specification and Implementation,º Proc. First Int'l Conf. Principles and Practice of Constraint Programming, pp. 137±153, Sept. 1995. S. Guha, R. Rastogi, and K. Shim, ªCURE: An Efficient Clustering Algorithm for Large Databases,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 73±84, June 1998. A. Guttman, ªR-Trees: A Dynamic Index Structure for Spatial Searching,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '84), pp. 47±57, June 1984. H.V. Jagadish, A.O. Mendelzon, and T. Milo, ªSimilarity-Based Queries,º Proc. 14th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS '95), pp. 36±45, May 1995. D.B. Lomet and B. Salzberg, ªThe HB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance,º ACM Trans. Database Systems, vol. 15, no. 4, pp. 625±658, Dec. 1990.

693

[14] C.S. Li, P.S. Yu, and V. Castelli, ªHierarchyscan: A Hierarchical Similarity Search Algorith for Databases of Long Sequences,º Proc. 12th Int'l. Conf. Data Eng. (ICDE '96), pp. 546±553, Feb. 1996. [15] NRC-CNRC, Feature Selection Bibliography. http://ai.iit.nrc.ca/ bibliographies/feature-selection.html. [16] J. Nievergelt, H. Hinterberger, and K.C. Sevcik, ªThe Grid File: An Adaptable, Symmetric Multikey File Structure,º ACM Trans. Database Systems, vol. 9, no. 1, pp. 38±71, Mar. 1984. [17] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1975. [18] D. Rafiei, ªFourier-Transform Based Techniques in Efficient Retrieval of Similar Time Sequences,º PhD thesis, Univ. of Toronto, 1998. [19] N. Roussopoulos, S. Kelley, and F. Vincent, ªNearest Neighbor Queries,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 71±79, May 1995. [20] D. Rafiei and A. Mendelzon, ªSimilarity-Based Queries for Time Series Data,º Proc. ACM SIGMOD International Conf. Management of Data (SIGMOD '97), pp. 13±24, May 1997. [21] D. Rafiei and A. Mendelzon, ªEfficient Retrieval of Similar Time Sequences Using DFT,º Proc. Fifth Int'l Conf. Foundations of Data Organizations and Algorithms (FODO '98), pp. 249±257, Nov. 1998. [22] W.G. Roth, ªMIMSY: A System for Analyzing Time Series Data in the Stock Market Domain,º master's thesis, Univ. of Wisconsin, Madison, 1993. [23] R. Ramakrishnan and D. Srivastava, ªCORAL: Control, Relations, and Logic,º Proc. 18th Int'l Conf. Very Large Data Bases (VLDB '92), pp. 238±250, Aug. 1992. [24] D. Sankoff and J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983. [25] K.C. Sevcik and N. Koudas, ªFilter Trees for Managing Spatial Data Over a Range of Size Granularities,º Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB '96), pp. 16±27, Sept. 1996. [26] T. Seidl and H.-P. Kriegel, ªOptimal Multi-step Nearest Neighbour Search,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 154±165, 1998. [27] P. Seshadri, M. Livny, and R. Ramakrishnan, ªSequence Query Processing,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '94), pp. 430±441, May 1994. [28] H. Shatkay and S. Zdonik, ªApproximate Queries and Representations for Large Data Sequences,º Proc. 12th Int'l Conf. Data Eng. (ICDE '96), pp. 536±545, Mar. 1996. [29] B.-K. Yi, H.V. Jagadish, and C. Faloutsos, ªEfficient Retrieval of Similar Time Sequences Under Time Warping,º Proc. 14th Int'l Conf. Data Eng. (ICDE '98), pp. 201±208, Feb. 1998. Davood Rafiei received his undergraduate degree in computer engineering at Sharif University of Technology, Tehran, in 1990, his master's degree in computer science from the University of Waterloo in 1995, and his PhD degree in computer science from the University of Toronto in 1999. He was a postdoctoral fellow at the University of Toronto prior to joining to the Department of Computing Science at the University of Alberta as an assistant professor in 2000. His research interests include database systems, querying and indexing nontraditional data such as time series and images, and information retrieval on the Web. Alberto O. Mendelzon received his undergraduate degree from the University of Buenos Aires and his master's and PhD degrees from Princeton University. He was a postdoctoral fellow at the IBM T.J. Watson Research Center. Since 1980, he has been with the University of Toronto where he is now a professor of Computer Science and a member of the Computer Systems Research Group. He has spent time as a visiting scientist at the NTT Basic Research Laboratories in Japan, the IASI in Rome, the IBM Toronto Lab, and AT&T Bell Labs Research in New Jersey. His research interests are in database systems, database query languages, Webbased information systems, and information integration.

VOL. 12,

NO. 5,

SEPTEMBER/OCTOBER 2000

675

Querying Time Series Data Based on Similarity Davood Rafiei and Alberto O. Mendelzon AbstractÐWe study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can efficiently be computed without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive to that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for simultaneously handling multiple transformations at a time, and give experimental results on the performance of the generalized algorithm. Index TermsÐSimilarity queries, time series retrieval, indexing time series, Fourier transform.

æ 1

T

INTRODUCTION

IME-SERIES data are of growing importance in many new database applications, such as data mining or data warehousing. A time series is a sequence of real numbers, each number representing a value at a time point. For example, the sequence could represent stock or commodity prices, sales, exchange rates, weather data, biomedical measurements, etc. We are often interested in similarity queries on time-series data [3], [2]. For example, we may want to find stocks that behave in approximately the same way (or approximately the opposite way, for hedging); or products that had similar selling patterns during the last year; or years when the temperature patterns in two regions of the world were similar. In queries of this type, approximate, rather than exact, matching is required. A simple approach to determine a possible similarity between two time series is to compute the Euclidean distance (or any other distance, such as the city-block distance) between the two series, and call the two series similar if their distance is less than some user-defined threshold. However, the notion of similarity is often more complex and cannot be expressed using a distance function. The meaning of a similarity comparison may vary from one data domain to another data domain and different users may have different perceptions of a similarity comparison; for example, one may consider two stocks similar if they have almost the same price fluctuations, even though one stock

. D. Rafiei is with the Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2H1. E-mail: [email protected] . A.O. Mendelzon is with the Department of Computer Science, University of Toronto, Toronto, ON, Canada, M5S 3H5. E-mail: [email protected] Manuscript received 21 Sept. 1999; revised 13 July 2000; accepted 18 July 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112605.

might sell for twice as much as the other. Consider the following motivating examples. Example 1.1. Consider time series s~1 36; 38; 40; 38; 42; 38; 36; 36; 37; 38; 39; 38; 40; 38; 37 and s~2 40; 37; 37; 42; 41; 35; 40; 35; 34; 42; 38; 35; 45; 36; 34; for example, corresponding to the closing prices of two stocks. Looking at Fig. 1a and 1b, the sequences do not appear very similar. This is justified by the high Euclidean distance D ~ s1 ; s~2 11:92 between them. However, if we look at the three-day moving averages of the two sequences (Fig. 1c and 1d), they do look quite similar. The Euclidean distance between the three-day moving averages of the two sequences is 0:47. Moving averages are widely used in stock data analysis (for example, see [6]). Their primary use is to smooth out short term fluctuations and depict the underlying trend of a stock. The computation is simple; the l-day moving average of a sequence ~ s v1 ; . . . ; vn is computed as follows: The mean is computed for an l-day-wide window placed over the end of the sequence; this will give the moving average for day n ÿ bl=2c; the subsequent values are obtained by stepping the window through the beginning of the sequence, one day at a time. This will produce a moving average of length n ÿ l 1. We use a slightly different version of a moving average which is easier to compute in our framework. We circulate the window to the end of the sequence when it reaches the beginning. This gives us a moving average of length n. It turns out that when the length of the window is small enough compared to the length of the sequence, which is usually the case in practice, both averages are almost the same.

1041-4347/00/$10.00 ß 2000 IEEE

676

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 1. (a) Time series s~1 36; 38; 40; 38; 42; 38; 36; 36; 37; 38; 39; 38; 40; 38; 37, (b) times series s~2 40; 37; 37; 42; 41; 35; 40; 35; 34; 42; 38; 35; 45; 36; 34, (c) the three-day moving average of s~1 , and (d) the three-day moving average of s~2 .

Example 1.2. Consider two time series ~ s 20; 20; 21; 21; 20; 20; 23; 23 and ~ p 20; 21; 20; 23 that are sampled with different frequencies (Fig. 2). For example, ~ s could be the closing price of a stock taken every day, and ~ p could be the closing price of another stock taken every other day. A typical query is ªis ~ p similar to ~ s ?º The sequence ~ s is twice as long as ~ p, so they cannot be compared directly. The Euclidean distance between ~ p and any subsequence of length four of ~ s is more than 1:41. If the time dimension of ~ p is scaled by 2, i.e., every value ªvj º is replaced by ªvj ; vj ,º the resulting sequence will be identical to ~ s. This operation is a special kind of time warping (for example, see [24]), often called time scaling, which can be expressed within our framework. We propose a class of transformations that can be used in a query language to express similarity in a fairly general way, handling a wide variety of comparisons including cases like the two examples above. We show that the query evaluation for similarity comparisons can be efficiently implemented on top of a point-based access method, such as R-tree [11], without a prior knowledge of transformations. For example, we demonstrate that an index structure for a moving average can be constructed on the fly from an existing index (built on the original sequences) and the

query evaluation engine can benefit from the new index in the same way as it does from the original index with no extra disk overhead. To the best of our knowledge, this is the first indexing method that can handle moving average and time scaling in the context of similarity queries. To apply this framework to time series data, we need to make some choices on how we compare two sequences and also on how we do indexing. We have chosen the Euclidean distance to compare two sequences mainly because: 1) it easily corresponds to the cross correlation (13); 2) it is frequently used [1], [8]; 3) it remains the same under orthonormal transforms.1 There are several multidimensional indexes (such as R-tree family [11], [4], [25], the k-dB-tree [13], or the grid file [16]) that can be used to do indexing on sequences. All these indexes reduce to sequential scanning in higher dimensions due to a phenomenon known as the curse of dimensionality. A solution to avoid this problem is to choose a few important features, those that can approximately differentiate one sequence from another, and only keep those features in the index. The approximation introduces a few false hits when sequences are being compared. However, the false hits can be easily removed in a postprocessing step. For feature extraction, we use the Discrete Fourier Transform (DFT) to map sequences into the frequency domain and only keep the first few DFT coefficients for each sequence. The reason for choosing DFT is mainly because: 1.

2. 3.

Fig. 2. (a) Time series ~ s 20; 20; 21; 21; 20; 20; 23; 23 and (b) ~ p 20; 21; 20; 23.

For a large class of time series (often referred as color noise data), it concentrates the energy in a few lower frequency coefficients so those coefficients can make a key for indexing purposes; It is an orthonormal transform, so the Euclidean distance is unchanged under the DFT; The class of transformations proposed in this paper (see Section 2) can express an interesting class of operations if applied in the frequency domain.

1. A transformation, denoted with matrix M, is orthonormal if M T :M I, where M T denotes the transpose of matrix M and I represents the identity matrix.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

In general, one can use any orthonormal transform, such as the Discrete Cosine Transform (DCT), the Harr transform, the wavelet transform, etc. For a broad class of feature selection methods see, for example, the online bibliography maintained at the National Research Council Canada site [15]. The organization of the rest of the paper is as follows: In the rest of the current section, we provide some background material on the discrete Fourier transform and also review the related work. Our definition of similarity queries is discussed in Section 2. In Section 3, we use R-tree indexes and develop algorithms for efficiently evaluating queries expressed within our framework. In Section 4, we briefly describe how our algorithms can be implemented using the grid file and the k-d-B-tree. Section 5 presents experimental performance results. We conclude in Section 6.

1.1 The Discrete Fourier Transform In this section, we briefly review the Discrete Fourier Transform and its properties. Let a time sequence be a finite duration signal ~ x xt for t 0; 1; ; n ÿ 1. The DFT of ~ x, ~ is given by denoted by X, nÿ1 ÿi2tf 1 X xt e n f 0; 1; ; n ÿ 1; 1 Xf p n t0 p where i ÿ1 is the imaginary unit. Throughout this paper, unless it is stated otherwise, we use small letters for sequences in the time domain and capital letters for sequences in the frequency domain. The inverse Fourier ~ gives the original signal, i.e., transform of X nÿ1 i2tf 1 X Xf e n xt p n f0

t 0; 1; ; n ÿ 1:

2

Some books define the DFT with no constant in front and the inverse DFT with constant 1=n in front or vice versa. p Following some earlier conventions [1], [8], we have 1= n in front of both (1) and (2) for it simplifies the upcoming Parseval relation without changing any properties of the DFT. The energy of signal ~ x is given by the expression E ~ x

nÿ1 X

j xt j2 :

3

t0

The convolution of two signals ~ x and ~ y is given by Conv ~ x; ~ yj

nÿ1 X

xk yjÿk

j 0; 1; ; n ÿ 1;

4

k0

where j ÿ k is computed modulo n. This convolution is usually called circular convolution. Equations (3) and (4) are unchanged in the frequency domain. The following properties of DFT can be found in any signal processing textbook (for example, see [17]). The symbol , denotes a DFT pair. ~ and ~ Linearity: If ~ x,X y , Y~, then ~ bY~ a~ x b~ y , aX

5

677

for arbitrary constants a and b. ~ and ~ Convolution-Multiplication: If ~ x,X y , Y~, then ~ Y~; conv ~ x; ~ y , X

6

~ Y~ is the element-to-element multiplication of two where X ~ and Y~. vectors X ~ for a real-valued sequence ~ Symmetry: If ~ x,X x of length n, then j Xnÿf jj Xf j

for f 1; . . . ; n ÿ 1

7

and ~ then Parseval's Relation: If ~ x , X, ~ E ~ x E X:

8

Using Parseval's relation and the linearity, it is easy to show that the Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain. ~ ÿ Y~ D2 X; ~ Y~: x; ~ y E ~ x ÿ~ y E X D2 ~

9

1.2 Related Work An indexing technique for the fast retrieval of similar time sequences is proposed by Agrawal et al. [1]. The idea is to use Discrete Fourier Transform (DFT) to map time sequences (stored in a database) into the frequency domain. Keeping only the first k Fourier coefficients, each sequence becomes a point in a k-dimensional feature space. To allow a fast retrieval, the authors keep the first k Fourier coefficients of a sequence in an R-tree index. An extension of this technique for subsequence matching is proposed by Faloutsos et al. [8]. None of the aforementioned work allows the expression of a query that uses some transformations in its expression of similarity. Goldin et al. [9] show that the similarity retrieval will be roughly invariant to simple translations and scales if sequences are normalized before being stored in the index. The authors store in the index both the translation and the scale factors, in addition to normalized sequences, and also allow those factors to be queried using range predicates. In our earlier work [18], [21], we have shown that the last few Fourier coefficients of a sequence (those corresponding to lower negative frequencies) are as important as the first few coefficients due to the symmetry property of DFT for real-valued sequences. We have also shown that this observation can be used to reduce the size of the search rectangle in the indexing method of Agrawal et al. [1] and its extensions [8], [9] without really storing the last few Fourier coefficients in the index. This leads to a search time improvement of more than a factor of 2. In this paper, we use this improved version of the index for efficiently evaluating queries that use transformations in their expressions of similarities. Jagadish et al. [12] developed a domain-independent framework to pose similarity queries on a database. The framework has three components: a pattern language P, a transformation rule language T, and a query language L. An expression in P specifies a set of data objects. An object A is considered similar to an object B, if B can be reduced to it by a sequence of transformations defined in T. The query language proposed in the paper is an extension of relational

678

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

calculus with predicates that test whether an object A can be transformed into a member of the set of objects described by expression e using the transformation t, at a cost bounded by c. As a specialization of this work to real-valued sequences [7], the same authors describe how the search can be performed over sequence signatures instead of the original sequences. Our work here can also be seen as a specialized variation of this general framework where transformations are restricted to affine transformations. There is other work on time series data. Yi et al. [29] use time warping as a distance function and present algorithms for retrieving similar time sequences under this function. Chu et al. [5] show how similar sequences can be efficiently searched irrespective of differences in simple translations and scales. A hierarchical sequence matching algorithm based on the correlation between sequences is proposed by Li et al. [14]. Agrawal et al. [3] describes a pattern language, called SDL, to encode queries about ªshapesº found in time sequences. The language allows a kind of blurry matching where the user specifies the overall shape instead of the specific details, but it does not support any operations or transformations on sequences. A method for approximately representing sequences in terms of some functions and processing queries over such a representation is described by Shatkay and Zdonik [28]. A query language for time series data in the stock market domain was developed by Roth [22]. The language is built on top of CORAL [23] and every query is translated into a sequence of CORAL rules. Seshadri et al. [27] develop a data model and a query language for sequences in general, but do not mention similarity matching as a query language operator.

2

SIMILARITY QUERIES

Time series often have differences that need to be removed or reduced before comparing them to each other. One way to remove those differences is to apply some transformations to them. We are interested in a set of transformations which fulfills the following two requirements: 1) it expresses a large class of useful transformations on time series and 2) queries expressed using this set of transformations can be efficiently processed. The set of affine transformations seems to be a good candidate since its members can express any combination of four important transformations in space: translation, scaling, rotation, and shear. Furthermore, any affine transformation preserves two basic features namely collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). To fulfill the second requirement, we limit our transformations to translation and scaling (along all dimensions) only. Specifically, we define a transformation in an n-dimensional space, denoted by ~ a; ~ b, as a pair of vectors where ~ a specifies a stretch and ~ b represents a translation. We will show that this class can express a large class of useful transformations on time series; at the same time, queries expressed using this class can be efficiently processed.

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Definition 1. The transformation ~ a; ~ b applied to a point ~ x in an n-dimensional space maps ~ x to ~ a~ x~ b. Transformations may be associated with costs. Given a set of transformations T , and the cost of applying each transformation, a measure of distance (dissimilarity) between two time series can be defined as follows: 8 D0 ~ x; ~ y > > > > x; ~ y < mint2T cost t D t ~ D ~ x; ~ y min mint2T cost t D ~ 10 x; t ~ y > > cost t cost t min > t ;t 2T 1 2 1 2 > : x; t2 ~ y; D t1 ~ where D0 ~ x; ~ y is a base distance (such as the Euclidean distance) between ~ x and ~ y. Next, we propose an extension of multidimensional similarity queries where similarity is defined on the basis of both a distance function and a set of transformations.

2.1 Queries with Single Transformations Transformations can be seen as a way to remove certain variations before aligning two sequences. Although many kinds of variations may be present in each sequence, we consider only those that can be removed using the limited form of affine transformations on the Fourier series representation of the sequence (Fig. 3). We first consider queries that use a single transformation to remove such variations. To gain some insight into this class of queries, we give some examples from real stock data. The data was obtained from the FTP site: ªftp.ai.mit.edu/pub/stocks/ results/.º Example. 2.1. Fig. 4 shows the daily closing price for The Bombay Co. (BBA) starting from October 25, 1994 for 128 days, and that for Zweig Total Return Fund Inc. (ZTR) starting from July 20, 1995 for 128 days. The Euclidean distance between the two series is 16.16. The mean for BBA is 9.51, and the mean for ZTR is 8.64. If we do not care about this variation in our comparison, we can vertically shift the mean of both series to zero, i.e., subtract the mean of each series from the everyday closing price. Now the Euclidean distance between the two sequences reduces to 12.78. The closing price of ZTR fluctuates in a smaller range than that of BBA; the standard deviation of ZTR is 0.10 while the standard deviation of BBA is 1.18. Again, if we do not care about the scale in our comparison, we can scale both series, for example, by the inverse of their standard deviation. The resulting series are often called the normal forms of the original series [9]. Fig. 4 shows that the Euclidean distance between the normal forms of the two series is still 11.10; ZTR is more volatile than BBA. To remove this variation, we need to smooth out short term fluctuations, for example, by computing the 20-day moving average of the two series. The Euclidean distance between the two series after applying the 20-day moving average drops to 2.75. Example 2.2. Fig. 5 shows in normal form the daily closing prices of stocks of Pacific Gas and Electric Co. (PCG) and Plum Creek Timber Co. (PCL) both starting from November 2, 1994 for 128 days. The Euclidean distance

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

679

Fig. 3. Applying transformations to time series.

Fig. 4. From left to right, top to bottom: The daily closing price of The Bombay Co. (BBA) starting from ªOctober 25, 1994º for 128 days, the daily closing price of Zweig Total Return Fund Inc. (ZTR) starting from ªJuly 20, 1995º for 128 days, the two stocks put together, both vertically translated, both vertically scaled, and both smoothed using the 20-day moving average (D denotes the Euclidean distance).

between the two normalized time series is 11.34. One way to compare the change rates of two stocks is to compare their ªmomenta,º which are obtained for every stock by subtracting the price at time t from the price at time t+1 (or, in general, t+n for some n). The Euclidean distance between the two momenta is 13.01. The series representing the price of PCG has a spike on February 7 while the series of PCL has a spike on February 8. If we horizontally shift the series of PCL one day to the left, then the spikes will overlap. The Euclidean distance between the two normalized time series after the horizontal shift drops to 8.91. The horizontal shifting reduces the distance between the momenta of the two series to 5.62. The momentum of a time series describes the rate at which its value (such as the price in the preceding example) is rising or falling and it is seen as a measure of strength behind upward or downward movements. On the other hand, shifting a sequence horizontally before comparing it to another sequence removes any possible delay between

the two sequences which can arise, for example, in the stock market domain because of the different reactions of two stocks to the same piece of news or recording errors. As in the examples, we often specify a sequence of transformations to be applied to a time series. Given transformations t1 a~1 ; b~1 and t2 a~2 ; b~2 , for example, respectively, corresponding to ªone-day shiftº and ª10-day moving averageº, suppose we want to apply t1 followed by ~ We can t2 , which we denote by t2 t1 , to sequence X. construct the new transformation as follows: ~ a~2 a~1 X ~ b~1 b~2 t2 t1 X ~ a~2 b~1 b~2 : a~2 a~1 X

11

Transformation t2 t1 equivalently can be expressed as t3 a~3 ; b~3 , where a~3 a~2 a~1 and b~3 a~2 b~1 b~2 . We now show how we can express some of these transformations in our framework. Suppose we want to express the three-day moving average in our transformation ~ and language. Let us denote the Fourier transform of ~ s by S 1 1 1 ~3 3 ; 3 ; 3 ; 0; 0; 0; 0; 0; 0; the Fourier transform of m

680

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 5. The daily closing price of Pacific Gas and Electric Co. (PCG) and that of Plum Creek Timber Co. (PCL), both starting from February 11, 1994 for 128 days, represented in normal forms and their momenta.

~3 . Now, consider the transformation 0; 0; 0; 0; 0; 0 by M ~3 ; ~ 0, where ~ 0 is a zero vector of the same size tmavg3 M ~ i.e., ~3 . If we apply the transformation tmavg3 to S, as M ~ S ~ M ~ M ~3 ~ ~3 ; 0S tmavg3 S

1.

we get the three-day moving average of ~ s in the frequency domain. If we transform the right hand side back to the time domain using the convolution-multiplication relation (6), ~ which is the three-day moving average we get conv ~ s; m3 of ~ s in the time domain. In general, the m-day moving average of a series of a; ~ 0, where length n can be expressed by tmavg ~ ~ a w1 ; w2 ; ; wm ; 0; 0; ; 0 |{z} m |{z}

standard deviation are stored in a relation. This is mainly because of efficiency, as is noted by Goldin and Kanellakis [9], and the following two attractive properties of the normal form which are not mentioned by these authors.

12

n

and ~ 0 is a zero vector of size n. Transformation tmavg may be applied several times to get successive moving averages. The weights w1 ; ; wm are not necessarily equal. For trend prediction purposes, for example, the weights at the end are usually chosen to be higher than those at the beginning. Whereas for normal smoothing purposes, weights are equal, or those at the center are larger than those at the endpoints. Both momentum and horizontal shifting can also be formulated as affine transformations over the Fourier representation of a sequence. See Appendices B and C for details. Another transformation of special interest is normalization, which can be applied to time series before storing them in an index. Given a time series of mean and standard deviation , the normal form of the series is obtained by first vertically shifting the series by and then scaling it by 1=. The equivalent transformation can be ! expressed in the frequency domain as 1=; ÿ=; 0; 0; . . .. Although it is not required by the algorithms given in this paper, we assume time sequences are normalized and for every sequence, its normal form along with its mean and

2.

It minimizes the Euclidean distance with respect to ~ ÿ sx ; Y~ ÿ sy has its minimum translation, i.e., D X when sx and sy , respectively, are chosen to be the means of ~ x and ~ y .2 The Euclidean distance between two normalized sequences is directly related to their cross-correlation.3 The cross-correlation is a statistic measure to find out if two random variables (such as the crops harvested and the rain level) are linearly correlated. A value of cross-correlation near 0 often indicates that the variables are independent, while a value near 1 or -1 indicates a strong positive or negative correlation. ~ Y~ 2 n ÿ 1 1 ÿ X; ~ Y~: D2 X;

13

Equation (13) can be derived by expanding the Euclidean distance formula and replacing the mean and the standard deviation, respectively, by 0 and 1 in both the Euclidean distance and the crosscorrelation formulas. The second property can be quite useful in formulating similarity queries or translating one query to another. Since the Euclidean distance between two sequences can range from zero to infinity, it is usually difficult to specify a threshold for this distance. Instead, we can specify a threshold for cross-correlation which is between -1 and 1 and plug it into (13) to find a threshold for the Euclidean distance. Using (13), we can also translate any expression that uses the cross-correlation in a query to one that uses the Euclidean distance or vice versa. 2. This can be verified by taking the first derivatives of D w.r.t. sx and sy and equating them to zero. ~ Y~ ÿX ~ :Y~ ~ Y~ X 3. X; ~ ~ X

Y

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

681

Fig. 6. On the top from left to right, daily closing of Dow Jones 65 Composite Volume (COMPX) index, NYSE Volume (NYV) index and both put together, normalized and smooth using a nine-day moving average. On the bottom from the left to right, again daily closing of COMPV index, NYSE Declining Issues (DECL) index and both put together, normalized and smoothed using 19-day moving average.

Transformations can also be defined to stretch the time dimension (Example 1.2). Details are given elsewhere [20].

2.2 Queries with Multiple Transformations It is often desirable to specify a set of transformations in a query. This is particularly useful if there are different ways of removing variations and one is not sure which one should be used. Given a distance function D and a set of transformations denoted by T , the semantics of D T ~ x; ~ y for arbitrary data points ~ x and ~ y is defined as follows: D T ~ x; ~ y minfD t ~ x; ~ y j t 2 T g:

14

Example 2.3 Fig. 6 shows daily closings of three indices: Dow Jones 65 Composite Volume (COMPV), NYSE Volume (NYV), and NYSE Declining Issues (DECL). It is difficult to see any similarity between these sequences. The Euclidean distance between closes of COMPV and NYV is 2,873 and that between COMPV and DECL is 12,939. On the other hand, if we normalize the closes of COMPV and NYV and compare their nine-day moving averages, they look very similar. The Euclidean distance between the nine-day moving averages of the normalized closes of COMPV and NYV is less than 3. Similarly, if we normalize the closes of COMPV and DECL and compare their 19-day moving averages, they also look quite similar. In fact, ª19-day moving averageº is the shortest moving average that reduces the Euclidean distance between normalized closes of COMPV and DECL to less than 3. To identify these similarities, a query may specify a set of moving averages to be applied to sequences. Although it may seem that if two sequences are similar w.r.t. n-day moving average, they should be similar w.r.t. n 1-day moving average, this is not true in general. For a counter example, see Lemmas 6 and 7 in Appendix A. Similar to the way we did for single transformations, we can compose two sets of transformations. Given two transformation sets T1 and T2 , for example, respectively,

corresponding to ªs-day shiftº for s 0; . . . ; 10 and ªm-day moving averageº for m 1; . . . ; 40, we can construct transformation set T3 T2 T1 , which corresponds to a ªs-day shiftº followed by an ªm-day moving averageº for all possible values of s and m, as follows: T3 ft3 t2 t1 j t1 2 T1 ; t2 2 T2 g;

15

where t2 t1 is defined by (11). Using (11) and (15), we can simplify a query by replacing any expression that uses a sequence of transformations with one that uses only a single or a set of transformations. We can process the resulting query using the techniques proposed in Section 3.2.

2.3 Safety of Transformations Time series data can be easily stored in a multidimensional index and be efficiently retrieved using simple similarity queries. However, to the best of our knowledge, no prior work has been done on using these access methods for efficiently evaluating queries that use transformations in their expression of similarity. To achieve this goal, we define a notion of safety for transformations. We show in Section 3 that any query that only uses safe transformations in their expression of similarity can be efficiently supported using multidimensional indexes. Definition 2. An affine transformation is safe if it maps every jaxis-parallel line segment in Rn ; Rn , for j 1; . . . ; n, to a jaxis-parallel line segment. Lemma 1. An affine transformation t is safe iff it can be expressed as t ~ a; ~ b, where ~ a; ~ b 2 Rn . Proof. For the clarity of the presentation, we give the proof in a two-dimensional space. The extension of the result to an arbitrary n-dimensional space is straightforward. (if) Suppose t can be expressed as t ax ; ay ; bx ; by . Let x1 ; y1 ; x2 ; y2 be an arbitrary axis-parallel line segment. If the line segment is parallel to the x axis, i.e., y1 y2 , then it will remain parallel to the x axis after the

682

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

transformation because ay :y1 by ay :y2 by . Similarly, if the line segment is parallel to the y axis, then it will remain parallel to the y axis after the transformation. Thus, t is safe. (only if) Suppose t is safe. We show that t can be expressed as ax ; ay ; bx ; by . Let x1 ; y1 ; x2 ; y2 be an arbitrary axis-parallel line segment and x01 ; y01 ; x02 ; y02 be its image under t. We can find ax ; ay ; bx ; by such that x01 ax :x1 bx ; y01 ay :y1 by ; x02 ax :x2 bx ; y02 ay :y2 by :

For instance, if the original line segment is parallel to the x axis, then y1 y2 , y01 y02 and there will be three equations with four unknown variables to be solved. The rest is straightforward. u t Since we use the Fourier series representation of time series data as our features, and a Fourier coefficient, in general, is a complex number, we need to do a mapping before we can show the safety of transformations. If we decompose a complex number into its real and imaginary components, then we can represent a vector of k Fourier coefficients with a point in R2k . We denote this mapping from the complex plane to the real plane by Mrect . Alternatively, we can decompose a complex number into its components in the polar coordinate system where a complex number is represented by a magnitude and a phase angle. We denote this mapping from the complex plane to the real plane by Mpol . We use Re x, Im x, Abs x, and Angle x to denote, respectively, the real, the imaginary, the magnitude, and the phase angle of a complex number x. We now show that the transformations described for time series data in previous sections are safe. Definition 3. Let ~ a and ~ b be vectors of complex numbers and a~0 0 ~ and b be respectively their images under a mapping M from the complex plane to the real plane. Transformation t ~ a; ~ b 0 0 0 ~ ~ is safe w.r.t. M if t a ; b is safe. Lemma 2. Let ~ a be a vector of real numbers, and ~ b be a vector of complex numbers; the transformation t ~ a; ~ b is safe w.r.t. Mrect . Proof. Without loss of generality, we assume the real and the imaginary components of the complex coordinate j (for j 1; ; k) are respectively mapped to the coordinates 2j ÿ 1 and 2j of the real plane. Suppose ~ z is a kdimensional vector of complex numbers and ~ x, a 2kdimensional vector, is its image under Mrect . We have zj x2jÿ1 x2j i for j 1; ; k. If we apply transformaz ~ a ~ z~ b. We can rewrite this tion t to ~ z, we get ~ z0 t ~ as follows: z0j aj x2jÿ1 x2j i Re bj Im bj i aj x2jÿ1 Re bj aj x2j Im bj i for j 1; ; k. If we map the resulting vector to a point x0 using Mrect , we get x02jÿ1 aj x2jÿ1 Re bj and x02j aj x2j Im bj for j 1; ; k. This transforma~ where c2jÿ1 c2j aj , c; d, tion can be rewritten as t0 ~ c d2jÿ1 Re bj , and d2j Im bj for j 1; ; k. Since ~ and d~ are vectors of real numbers, the rest follows from Lemma 1. u t

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

On the other hand, Lemma 2 does not hold if ~ a is chosen to be a vector of complex numbers. For example, consider the axis-parallel line segment made by the two points p ÿ5 2i and q ÿ5 ÿ 5i. If we apply the transformation t 2 ÿ 3i; 0 to the line segment, i.e., we multiply the two endpoints by 2 ÿ 3i, the resulting line segment made by t p ÿ4 19i and t q ÿ25 5i is not axis-parallel anymore! Lemma 3. Let ~ a be a vector of complex numbers, and ~ b be a zero vector (~ b ~ 0); the transformation t ~ a; ~ b is safe w.r.t. Mpol . Proof. Without loss of generality, we assume the magnitude and the phase angle of the complex coordinate j (for j 1; ; k) are respectively mapped to the coordinates 2j ÿ 1 and 2j of the real plane. Suppose~ z is a k-dimensional vector of complex numbers and ~ x is its image under Mpol . We have zj x2jÿ1 ex2j i for j 1; ; k. If we apply z ~ a ~ z~ b. We can transformation t to ~ z, we get ~ z0 t ~ rewrite this as follows: z0j Abs aj eAngle aj i x2jÿ1 ex2j i 0 Abs aj x2jÿ1 e x2j Angle aj i for j 1; ; k. If we map the resulting vector to a point x0 using Mpol , we get x02jÿ1 Abs aj x2jÿ1 and x02j x2j Angle aj for j 1; ; k. This transformation can ~ where c2jÿ1 Abs aj , d2jÿ1 0, c; d, be rewritten as t0 ~ c and d~ c2j 1, and d2j Angle aj for j 1; ; k. Since ~ are vectors of real numbers, the rest follows from Lemma 1. u t In the next section, we develop efficient algorithms for evaluating queries that only use safe transformations in their expression of similarity.

3

QUERY EVALUATION USING R-TREE INDEXES

Given a set of time series data, an index can be constructed as follows ([1], [21]): Find the DFT of each sequence and keep the first few DFT coefficients as the sequence features. If we only keep the first k DFT coefficients, sequences will become points in a k-dimensional space. These points can be organized in a multidimensional index such as the R-tree family [11], [4], [25], the k-d-B-tree [13], or the grid file [16]. In this section, we describe the query evaluation for R-tree indexes. The same techniques can be extended to other index structures. We do that for the k-d-B-tree and the grid file in Section 4. An R-tree can be seen as a natural extension of B-trees for more than one dimension. Nonleaf nodes contain entries of the form (MBR, ptr), where MBR is a Minimum Bounding Rectangle of all the entries in a descendent node, and ptr is a pointer to the descendent node. Bounding rectangles at the same tree level may overlap. Leaf nodes for point data sets contain a set of data points or pointers to full data records.

3.1

Evaluation of Queries with Single Transformations Given an R-tree index I on a data set S, and any safe transformation t, we can construct an R-tree index I 0 for t S ft ~ x j ~ x 2 Sg as follows: Algorithm 1. For every node n

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

683

n MBR1 ; ptr1 ; ; MBRm ; ptrm in I, construct node n0 in I 0 such that n0 t n MBR01 ; ptr01 ; ; MBRm ; ptr0m ; where MBR0i is the rectangle obtained by applying t to the endpoints of MBRi and ptr0i is a pointer to the node t ni if ni is the node pointed by ptri . If n is a leaf, ptr0i ( ptri ) is a pointer to a data tuple or null. Since MBRi is an axisparallel rectangle and t is a safe transformation, MBR0i is an axis-parallel rectangle due to the definition of a safe transformation. Thus, the new node n0 is a valid R-tree node. The construction stops when every node in I is mapped to a node in I 0 . The original index I can be constructed in many different ways, each with a different performance, depending on the order of insertions and the way splits and overlaps are handled. Our experiments show that the new index I 0 has a similar performance to that of the original index. The main observation here is that for a given index I and transformation t, index I 0 can be built on the fly without having much impact on the performance of the search. This allows us to use one index for many possible transformations. As our running example, consider the following proximity query: Query 1. Given a query point ~ q, a safe transformation t and a threshold , find all points ~ x in the data set such that the Euclidean distance D t ~ x;~ q < . A naive evaluation of this query requires reading the whole data set, applying t to every data point and choosing every point ~ x such that D t ~ x;~ q < . This is a costly process. A better approach is to use Algorithm 1 to construct a new index for transformed data points. This new index can be built on the fly during the search operation as follows: Algorithm 2. Suppose an R-tree index is constructed on the first k DFT coefficients of sequences, and suppose the root node is denoted by N. 1.

Preprocessing:

2.

Transform t and ~ q into the frequency domain if they are in the time domain. Let us denote the qk , first k Fourier coefficients of t and ~ q by tk and ~ respectively. qk . A search b. Build a search rectangle qrect for ~ rectangle is the minimum bounding rectangle that contains all points within the Euclidean distance of ~ q. Building a search rectangle is straightforward in the rectangular coordinate s y s t e m ; i t i s s i m p l y qi ÿ ; qi f o r i 1; ; 2k. The minimum bounding rectangle for a complex number mej in the polar coordinate system is demonstrated in Fig. 7. The magnitude is in the range from m ÿ to m , and the angle is in the range from ÿ arcsin m to arcsin m . Search: a.

Fig. 7. Minimum bounding rectangle in the polar coordinate system.

a.

3.

If N is not a leaf, apply t to every (rectangle) entry of N and check if the resulting rectangle overlaps qrect . For all overlapping entries, call Search on the index whose root node is pointed to by the overlapping entry. b. If N is a leaf, apply t to every (point) entry of N and check if the resulting point overlaps qrect . If so, the entry is a candidate. Postprocessing:

For every candidate point ~ x, check its full database record to determine if the Euclidean distance between t ~ x and ~ q is at most . If so, the entry is in the answer set. The algorithm is guaranteed not to miss any qualifying sequence (see Appendix A for a proof). Similarly, all-pairs queries and nearest-neighbor queries can be efficiently processed using the index. For an all-pairs query, if we do a spatial join using the index, the only change in the spatialjoin algorithm will be to replace the join predicate evaluation with one that applies the desired transformation to the objects used in the join predicate. For example, the join predicate evaluation for ai \ bj 6 ; can be easily changed to that for t ai \ t bj 6 ;, where t is a transformation and ai and bj are members of two spatial sets (such as two MBRs). For a nearest-neighbor query, the search starts from the root and proceeds down the tree. As we go down the tree, we apply t to all entries of the node we visit. This change can be easily encoded in existing nearest- neighbor search algorithms, such as those proposed by Roussopoulos et al. [19] and Seidl et al. [26]. a.

3.2

Evaluation of Queries with Multiple Transformations Given an R-tree index on a data set S, consider evaluating the following proximity query: Query 2. ªGiven a query time series ~ q and a set T of safe transformations, find every time series ~ s 2 S and transformation t 2 T such that the Euclidean distance D t ~ s; t ~ q < .º As a specific example, T could be the set of m-day moving averages for m 2 f1 . . . 40g and we may want to find all stocks that have an m-day moving average similar to that of IBM. One way to process this query is to scan the data set sequentially and for every sequence~ s 2 S and transformation

684

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 8. From left to right, the second DFT coefficient of m-day moving averages for m 1; . . . ; 40 and a data rectangle before and after being transformed.

t 2 T find out if the distance predicate evaluates to true. The cost of this algorithm is one scan of the data set and computing the distance predicate j T j j S j times. Another approach is for every transformation t 2 T , use Algorithm 2 to retrieve all sequences within the proximity of the query sequence. The union of the results retrieved gives the query answer. We call this algorithm ST-index, where ST stands for ªa Single Transformation at a time.º The cost of this algorithm is j T j times the cost of traversing the index. The third approach is to group transformations together and scan the index once for every group. We call this algorithm MT-index, where MT stands for ªMultiple Transformations at a time.º There are two issues that need to be resolved here: first, how shall transformations be grouped together; second, how can a group of transformations be efficiently processed. We first address the second issue. Since a transformation is a pair of n-dimensional vectors, it is a point in a 2n-dimensional space. We assume both vectors are of real numbers. If not, one can rewrite any safe transformation in terms of a pair of real vectors (due to Lemma 1). Given a group of transformations, we can construct a minimum bounding rectangle (MBR) for all transformations in the group. Having an R-tree index built over sequences, we can apply the transformation rectangle to entries of the index and construct a new index on the fly. To apply a transformation rectangle to a data rectangle, we decompose the 2n-dimensional transformation rectangle into two n-dimensional MBRs, one corresponding to ~ a which we denote by mult-MBR, and the other corresponding to ~ b which we denote by add-MBR. Given mult-MBR: < M1 l; M1 h; . . . > , add-MBR: < A1 l; A1 h; . . . > and data rectangle X: < X1 l; X1 h; . . . > , the result of applying multMBR and add-MBR to rectangle X is rectangle Y: < Y1 l; Y1 h; . . . > , where Yi l Ai l min Mi l Xi l; Mi l Xi h; Mi h Xi l; Mi h Xi h Yi h Ai h max Mi l Xi l; Mi l Xi h; Mi h Xi l; Mi h Xi h for all dimensions i. As an example, consider the points of m-day moving average for m 1; . . . ; 40 and their MBR. The result of applying the MBR of these transformations to a data rectangle is shown in Fig. 8.

We now address the next issue, i.e., how shall transformations be grouped together. Consider the extreme case where all transformations in T are grouped together. If a few transformations spread all over the space, the minimum bounding rectangle of transformations will cover a large area. This MBR, when applied to a data rectangle, can easily make the data rectangle intersect the query region. This can reduce the filtering power of the index dramatically. On the other hand, if the number of groups and as a result the number of MBRs goes up, the area of each MBR gets smaller. Thus, the filtering power of each MBR increases, but the same index needs to be traversed several times. In the worst case, the number of MBRs is the same as the number of transformations, i.e., every MBR includes only one transformation point. In such a case, both ST-index and MT-index perform exactly the same. Now, the question is how we should optimally choose MBRs for a given set of transformations such that the total cost (in terms of the number of disk accesses) becomes minimum. One solution is to estimate the cost for any possible set of MBRs and choose the set with minimum cost. A first attempt in estimating the cost for a given set of MBRs is to use the total area of MBRs. However, the total area is minimum if every MBR includes only one transformation point, i.e., the ST-index algorithm is used. Another approach for estimating the cost of a given set of MBRs is to apply MBRs for a fixed data rectangle, say a unit square, then compute the total area of the resulting data rectangles. Due to this estimation, the best performance should be obtained using only one transformation rectangle. However, our experiments showed that using one transformation rectangle did not necessarily give the best performance. The worst performance for MT-index, which is close to that of ST-index, is when we pack two clusters of transformations into one rectangle. A solution to avoid this problem is to use a cluster detection algorithm (such as CURE [10]) and avoid packing two clusters into one rectangle. We can now write the search algorithm for Query 2 as follows: Algorithm 3. Given an R-tree index which is built on the first k Fourier coefficients of time series and whose root is N, a safe transformation set T , a threshold , and a search sequence ~ q, use the index to find all sequences that become within distance of ~ q after being transformed by a member of T .

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

1.

Decompose T into sets T1 ; T2 ; . . . using a clustering algorithm. 2. Do the following steps (Steps 3 to 6) for every set Ti : 3. Build an MBR for points in Ti and project it into a mult-MBR and an add-MBR as described earlier. 4. If N is not a leaf, apply the mult-MBR and the addMBR to every (rectangle) entry of N using (16) and check if the resulting rectangle intersects qrect . For every intersecting entry, go to Step 4 and do this step on the index rooted at the node of the intersecting entry. 5. If N is a leaf, apply the mult-MBR and the add-MBR to every (point) entry of N and check if the resulting rectangle intersects qrect . If so, the entry is a candidate. 6. For every candidate entry, retrieve its full database record, apply all transformations in Ti to the sequence, and determine transformations that reduce the Euclidean distance between the data sequence and the query sequence to less than . This algorithm is guaranteed not to miss any qualifying sequence. The proof is similar to the one given in Appendix A for Algorithm 2. We can develop similar algorithms for efficiently processing spatial join and nearest-neighbor queries. In a spatial join query, we apply the transformation MBR to all data items used in the join predicate before computing the predicate. Similarly in a nearest-neighbor query, as we walk down the tree, we apply the transformation MBR to all entries of the node we visit. This change can be easily incorporated into the existing nearest-neighbor search algorithms on the R-tree. So far, we have made no assumption on any possible ordering among transformations. Next, we show how one can exploit such an ordering during query evaluation.

3.3 Orderings of Transformations In this section, we define a notion of ordering between transformations and show that this notion can be quite useful in guiding the search more effectively. Definition 4. Let T be a set of transformations on a vector space S (e.g., Rn ) and D be a distance function on pairs of points in S; we call < T ; > an ordering of T w.r.t. D if 8~ vi ; v~j 2 S; 8tk ; tl 2 T , vi ; tl ~ vj D tk ~ vi ; tk ~ vj : tl tk ) D tl ~ Once an ordering is established among transformations, we can use this ordering to guide the search more cleverly. To give an example, consider Query 2 with the transformation set T ft2 ; . . . ; t100 g on Rn , where ti means ªscale by factor i.º An ordering of T w.r.t. the Euclidean distance is: tl tk if l < k (see Appendix A for a proof). To find all transformations that make a data sequence become similar to the query sequence, we do not need to apply all transformations to sequences. Instead, we need to find the ti with the largest i (i.e., the ti that corresponds to the largest scale factor) that makes the distance predicate true. One way to find ti is to do a binary search on the ordered list of transformations. Definition 4 implies that the distance predicate is true for every transformation tj , where j < i.

685

Given an ordered set T of transformations on the Fourier series representation of time series w.r.t. the Euclidean distance, we can use the binary search technique in all three algorithms presented in Section 3.2. In the case of the sequential scan method, we still need to scan the whole data set S. However, the number of transformations to be checked or the number of sequence comparisons drops to j S j log j T j . The ordering assumption reduces the number of index traversals for STindex to log j T j . In the case of MT-index, suppose T is decomposed into transformation groups T1 ; T2 ; . . . ; Tm . If the ordering is preserved among transformation groups, then the number of transformation groups to be checked and as a result the number of index traversals reduces to log m, whereas the number of transformations to be checked inside each group Ti (out of the log m groups tested) reduces to log j Ti j . However, if the ordering is not preserved among transformation groups, then the number of index traversals will remain the same and the Pm total number of comparisons reduces to i1 log j Ti j . There are useful transformations on time series that are not ordered w.r.t. the Euclidean distance. For example, no ordering is possible for a set of moving averages on Rn w.r.t. the Euclidean distance. See Appendix A for a proof.

4

QUERY EVALUATION USING OTHER INDEX STRUCTURES

The algorithms presented in Section 3 can be easily extended to other point-based access methods. To this end, we briefly examine two other index structures, namely the grid file and the k-d-B-tree. Compared to the R-tree which partitions the data points, these two access methods partition the embedding space that contains the data points. In this section, we show how these indexes can be used in efficiently evaluating our similarity queries.

4.1 Use of the Grid File The grid file imposes an n-dimensional grid decomposition of space. A grid directory maps one or more cells of the grid into a data bucket. The grid, represented by n onedimensional arrays called scales, is kept in main memory while the directory and data buckets are stored on disk. Given a set of time series data, a grid file can be constructed as follows: find the DFT of each sequence and keep the first k DFT coefficients in a 2k-dimensional grid file with data buckets keeping the full sequences. If a query requires a safe transformation t ~ a; ~ b to be applied to the data set, a new grid file can be constructed on the fly by ~i Xi1 ; Xi2 ; . . . along replacing every scale array X ~0 ai :Xi1 bi ; ai :Xi2 dimension i with scale array X i x with t ~ x. The result is still bi ; . . . and every data record ~ guaranteed to give a valid grid file, due to the safety of t. Transformations may also be grouped together and simultaneously applied to the grid file. This can be done by: 1) building an MBR for each group of transformations, 2) applying each side of the MBR to the scale array associated with that side, and 3) applying all transformations inside the MBR to the qualifying sequences.

686

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 9. Time per query varying the number of sequences.

4.2 Use of the K-D-B-Tree The k-d-B-tree imposes an irregular (and not unique) decomposition of space. A leaf node contains a set of data records or pointers to data records. An interior node contains a set of pairs (region, ptr) with region corresponding to the disjoint union of regions represented by the node pointed to by ptr if ptr points to an interior node; otherwise, it is a bounding rectangle for a set of points. Given a k-d-B-tree on a set of time series data, and a query that requires a safe transformation t to be applied to data sequences, a new k-d-B-tree can be constructed on the fly by applying t to entries of the index (which are either rectangles or points) as the search proceeds. Due to the safety of t, the result of applying t to a k-d-B-tree is guaranteed to give a valid k-d-B-tree. Similarly, transformations may be grouped together and simultaneously applied to the k-d-B-tree. The algorithm is very similar to that of the R-tree.

5

EXPERIMENTAL RESULTS

We implemented our methods on top of Norbert Beckmann's Version 2 implementation of the R -tree [4]. We ran experiments on both real stock prices data obtained from the FTP site ªftp.ai.mit.edu/pub/stocks/results/º and synthetic data. The stock prices database consisted of 1,068 stocks and for each stock, its daily closing prices for 128 days. To test our algorithms on a larger data set and also to simulate the experiments reported in the literature for the purpose of comparison, we also did some experiments on synthetic data. Each synthetic sequence was in the form of ~ x xt , where xt xtÿ1 zt and zt was a randomly generated number in the range ÿ500; 500. For every time series, we first transformed it to the normal form for reasons described in Section 2.1, and then we found its Fourier coefficients. Since the mean of a normal form series is zero by definition, the first Fourier coefficient is always zero, so we can discard it. For every

sequence, we stored the magnitudes and the angles of the second and the third DFT coefficients in the index. All our experiments were conducted on a 168MHZ Ultrasparc station. For our experiments on proximity queries, we ran each experiment at least 100 times. Each time we chose a random query sequence from the data set and searched for all other sequences within distance of the query sequence. The execution time was averaged over these runs. Our all-pairs queries were spatial self-join queries where we searched the data set for all sequence pairs within distance of each other. We report our experiments in two parts: 1. 2.

evaluating queries with single transformations (e.g., Query 1), evaluating queries with multiple transformations (e.g., Query 2).

5.1 Evaluating Queries with Single Transformations In this section, first we show that the overhead of applying single transformations to the index is negligible. We then compare our method to sequential scanning. 5.1.1 Overhead on the Index for Proximity Queries Our first experiment was on synthetic data. Fig. 9 compares the execution time for two kinds of queries: 1) a proximity query using transformations and 2) a proximity query that uses no transformations. We kept the sequence length fixed to 128 while we varied the number of sequences from 500 to 12,000. In order to have a precise comparison, the identity ~~ ~ X ~ 0 was chosen such that Ti X transformation Ti I; ~ (I~ is a vector of 1s and ~ for all sequences X 0 is a vector of 0s). This made the two queries produce the same results. Instead of setting the distance threshold directly in our proximity queries, we set the correlation to 0:85, plugged it in (13) and found a value for . This threshold resulted in an average output size ranging from 1 to 69. As Fig. 9 shows there is not much difference between the two curves. The minor difference is due to the CPU time spent for vector

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

687

Fig. 10. Time per query varying the sequence length.

Fig. 11. Time per query varying the number of sequences.

multiplication which is unavoidable. The number of disk accesses is the same in both cases. For a nonidentity transformation, the CPU time spent for vector multiplication is still the same, but the number of disk accesses can be more or less depending on whether the transformation inflates or deflates the index rectangles. In the next experiment, we kept the number of sequences fixed to 1,000 while we varied the length of the sequences from 64 to 1,024. The experiment was on proximity queries over synthetic data. The correlation threshold was again set to 0:85; this resulted in an average output size ranging from 1 to 6. We used the identity transformation again for the reason described in the previous experiment. To increase the accuracy of our measurements, we ran each experiment 1,000 times. As demonstrated in Fig. 10, the result was the same. Thus, the index traversal for similarity queries does not deteriorate the performance of the index.

5.1.2 Comparison with Sequential Scanning for Proximity Queries Figs. 11 and 12 compare the execution time of our approach to sequential scanning for proximity queries. In Fig. 11, the sequence length was fixed to 128 and the number of sequences varied from 500 to 12,000. In Fig. 12, the number of sequences was fixed to 1,000 and the sequence length varied from 64 to 1,024. The experiment was again on synthetic data. The correlation threshold was set to 0:85 and the identity transformation was also used. To have a good implementation of the sequential scanning algorithm, the transformation is applied to both data and query sequences during the distance computation; this means both sequences are scanned at most once when they are in the main memory. We also stop the distance computation process as soon as the distance exceeds . Both graphs show the superiority of our algorithm.

688

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 12. Time per query varying the sequence length.

5.1.3 Comparison for All-Pairs Queries Our last experiment in this part was the spatial self-join. We used the following methods: 1.

Scan the relation of time sequences sequentially, and compare every sequence ~ s to all other sequences in the relation; the transformation tmavg20 is applied to every sequence during the comparison. The distance computation is stopped as soon as the distance exceeds . 2. Scan the relation of Fourier coefficients sequentially, and for every time series build a search rectangle and pose it as a proximity query to the index after applying tmavg20 to both the index and the search rectangle. 3. Do the spatial self-join as described in b without applying any transformation. The experiment ran on a relation of stock prices data that had 1,068 time sequences and the length of each sequence was 128. The correlation threshold was set to 0:9895 and the distance threshold was computed accordingly. The result of the test is shown in Table 1. Method b compared to its competitor method a is 6 to 7 times faster. Methods b and c compute different queries, but the comparison between them confirms that the overhead on the index is very small.

5.2

Evaluating Queries with Multiple Transformations In this section, we first compare the performance of the MTindex to that of the ST-index and sequential scan. To do the comparison, we made the choice of packing all transformations into one rectangle though it did not necessarily give us the best possible performance of MT-index. In the subsequent section, we show the effect of grouping transformations on the performance of MT-index. In all our experiments over proximity queries, we set the correlation threshold fixed to 0.96 and used (13) to find a value for the Euclidean distance threshold.

5.2.1 Comparing MT-index to ST-index and Sequential Scan Fig. 13 shows the running time of Query 2 using three algorithms sequential-scan, ST-index, and MT-index. The transformations were a set of moving averages ranging from 10-day moving average to 25-day moving average with their number fixed to 16. The experiment ran on synthetic sequences of length 128 with their number varying from 500 to 12,000. The average output size was 7 or more depending on the number of input sequences. The figure shows that the MT-index performs better than both the ST-index and sequential-scan. Fig. 14 shows the running time of Query 2, again using three algorithms sequential-scan, ST-index, and MT-index. In the experiment, we set the number of sequences fixed to 1,068 but we varied the number of transformations from 1 to 30. The transformations were a set of moving averages ranging from a five-day moving average to a 34-day moving average. The experiment ran on real stock price data. The average output size was 11 or more depending on the number of transformations. The figure shows that the MT-index outperforms both the ST-index and sequential-scan. The next experiment was on a spatial join query which was expressed as follows: Query 3. ªGiven a set of transformations denoted by T, find every pair s~1 and s~2 of time series and every t 2 T such that s2 0:99.º the correlation t ~ s1 ; t ~

TABLE 1 The Result of the Join

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

689

Fig. 13. Time per query varying the number of sequences.

Fig. 14. Time per query varying the number of transformations.

The transformations were again a set of moving averages ranging from a five-day moving average to a 34-day moving average. We varied the number of transformations from 1 to 30. The experiment ran on real stock prices data which consisted of 1,068 sequences of length 128. The average output size was at least 7. Fig. 15 shows the running time of Query 3 using three algorithms: sequential-scan, STindex, and MT-index. Both ST-index and MT-index perform better than sequential-scan. As we increase the number of transformations, the MT-index algorithm performs better than the ST-index until the number of transformations gets 30. At this point, the running time for both is the same.

5.2.2 Multiple Transformation Rectangles In this section, we show that grouping all transformations in one rectangle does not necessarily give us the best possible performance. To show this, we ran Query 2 using

the MT-index algorithm on real stock price data, but varied the number of transformations per MBR from one to its maximum. The transformation set consisted of m-day moving averages for m 6; . . . ; 29. We equally partitioned subsequent transformations and built an MBR for each partition. As is shown in Fig. 16, despite the fact that collecting all transformations in one rectangle resulted in the minimum number of disk accesses, it did not necessarily give us the best running time. We later added the inverted version of each transformation, which was obtained by multiplying every coefficient by -1, to the transformation set. This created two clusters in a multidimensional space. Again, we equally partitioned subsequent transformations and built an MBR for each partition. We varied the number of transformations per MBR from one to 48 which was the size of the

690

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

VOL. 12, NO. 5,

SEPTEMBER/OCTOBER 2000

Fig. 15. Time per query varying the number of transformations.

Fig. 16. Both the running time and the number of disk accesses varying by the number of transformations per MBRs.

transformation set. As is shown in Fig. 17, the running time shows bumps when we pack one third or all of the transformations in a rectangle. The same bumps are also observed in the number of disk accesses. This is due to the fact that, in these two cases, two separate clusters (or parts of them) are placed in one transformation rectangle. These experiments show that as we start packing transformations into rectangles, we see a major performance improvement up to a certain point (six to eight transformations per rectangle here). The performance after this point either stays the same or goes down. The worst performance for MT-index, which was even worse than ST-index, was when we packed two clusters of transformations into one rectangle. A solution to avoid this problem is to use a cluster detection algorithm in advance and avoid packing more than one cluster to a rectangle.

6

CONCLUSIONS

We have proposed a class of transformations that can be used in a query language to express similarity between time series in a general way. This class allows the expression of several practically important notions of similarity and queries using this class can be efficiently implemented on top of point-based multidimensional indexes. Our contributions can be summarized as follows: . . .

Formulation of an interesting class of transformations, including the moving average in our transformation language. Development of a safety condition for transformations. Implementation of similarity matching under these transformations on top of the R-tree index.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

691

Fig. 17. Both the running time and the number of disk accesses varying by the number of transformations per MBRs.

.

Development of a new algorithm that applies multiple transformations specified in a query to a set of sequences in one scan of the R-tree index built on those sequences. . Development of a notion of ordering among transformations and using this notion in efficiently guiding the search. We evaluated our methods over both real stock prices and synthetic data. In the case of queries with single transformations, the experiments show that the execution time of our method is almost the same as that of accessing the index with no transformations; our method has much better performance than sequential scanning, and the

kÿ1 X

2

j a f x f b f ÿ qf j

f0

!12

nÿ1 X

2

j af xf bf ÿ qf j

!12

f0

16 for k n. Thus, sequence ~ x must be returned by the proximity query on the index. This is a contradiction. t u Lemma 5. Let T ft1 ; t2 ; . . . ; tm g be a set of transformations on Rn , where ti means ªscale by factor iº; an ordering of T w.r.t. the Euclidean distance is defined as tl tk if l < k. Proof. Let ti and tj be two arbitrary transformations in T and suppose, without loss of generality, i < j. Let ~ x and ~ y x; ~ y denotes their be two arbitrary points in Rn and D ~

performance gets better as both the number and the length

Euclidean distance. Since D ~ x; ~ y is a positive number,

of sequences increase. In the case of queries with multiple

we can multiply it to both sides of inequality i < j. This

transformations, the experiments show that the given

will give us

algorithm for handling multiple transformations outperforms both sequential scanning and the index traversal using one transformation at a time.

sequence. Proof. Suppose ~ x is a qualifying data sequence, i.e., D t ~ x;~ q < , but ~ x is not returned by Algorithm 2. If we represent t by ~ a; ~ b, we can write nÿ1 X f0

However,

!0:5

k0

Lemma 4. Algorithm 2 is guaranteed not to miss any qualifying

j a f x f b f ÿ qf j

2

!12

17

but we have nÿ1 X xk ÿ yk 2 i:D ~ x; ~ y i:

APPENDIX A SOME PROOFS

D ~ a~ x~ b;~ q

i:D ~ x; ~ y < j:D ~ x; ~ y;

nÿ1 X

18

!0:5 2

i:xk ÿ i:yk

D i:~ x; i:~ y:

k0

Equations (17) and (18) imply D i:~ x; i:~ y < D j:~ x; j:~ y. t u Lemma 6. Let T denote a set of (circular) moving averages on Rn and D denote the Euclidean distance; no ordering is possible for transformation set T w.r.t. D.

:

Proof. We prove this lemma by contradiction. Suppose there is an ordering among members of T w.r.t. D. Consider the following sequences:

692

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,

s~1 10; 12; 10; 12 s~2 10; 11; 12; 11 s~3 11; 11; 11; 11:

.

s3 D mv2 ~ s2 ; mv2 ~ s3 0:33: 0:87 > D mv3 ~ s2 ; mv3 ~

mv2 ~ s2 10:5; 10:5; 11:5; 11:5; mv2 ~ s3 11; 11; 11; 11; mv3 ~ s1 10:67; 11:33; 10:67; 11:33;

.

mv3 ~ s2 11; 10:67; 11; 11:33; mv3 ~ s3 11; 11; 11; 11:

Case 2: mv3 mv2 By Definition 4, sj D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~

There are two possible orderings between mv2 and mv3: Case 1: mv2 mv3 By Definition 4,

for all pairs s~i and s~j . However, this does not hold for s~1 and s~3 ; s3 D mv3 ~ s1 ; mv3 ~ s3 0: 0:47 > D mv2 ~ s1 ; mv2 ~

sj D mv3 ~ si ; mv3 ~ sj D mv2 ~ si ; mv2 ~ for all pairs s~i and s~j . However, this does not hold for s~2 and s~3 ; s3 D mv2 ~ s2 ; mv2 ~ 1 > D mv3 ~ s2 ; mv3 ~ s3 0:75:

There are no other cases, so the proof is complete.

APPENDIX B EXPRESSING MOMENTUM

Case 2: mv3 mv2 By Definition 4, sj D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~ for all pairs s~i and s~j . However, this does not hold for s~1 and s~3 ; s3 D mv3 ~ s1 ; mv3 ~ s3 0: 0:66 > D mv2 ~ s1 ; mv2 ~

There are no other cases, so the proof is complete.

Case 1: mv2 mv3 By Definition 4,

for all pairs s~i and s~j . However, this does not hold for s~2 and s~3 ;

mv2 ~ s1 11; 11; 11; 11;

.

SEPTEMBER/OCTOBER 2000

D mv2 ~ si ; mv2 ~ sj D mv3 ~ si ; mv3 ~ sj

If we denote the circular two-day moving average by mv2 and the circular three-day moving average by mv3, we can write

.

VOL. 12, NO. 5,

u t

AS A

u t

TRANSFORMATION

~ 1; ÿ1; 0; . . . ; 0 be a vector of length n and ~ Let m x be a ~ time series of the same length. Let us denote the DFT of m ~ and the DFT of ~ ~ The convolution of ~ by M x by X. x and m, ~ ~ gives the momentum of ~ conv ~ x; m, x. Since a convolution in the time domain corresponds to a multiplication in the ~ and X ~ gives the frequency domain, the product of M momentum in the frequency domain. If we use the polar ~ and M, ~ representation for complex numbers and map X 0 0 ~ ~ respectively, to real vectors X and M such that Mj 0 0 0 iM2j1 0 iX2j1 e and Xj X2j e , we will have M2j 0

0

0 0 :X2j ei X2j1 M2j1 Mj :Xj M2j

Lemma 7. Let T denotes a set of noncircular moving averages on Rn and D denotes the Euclidean distance; no ordering is possible for transformation set T w.r.t. D.

for j 0; . . . ; n ÿ 1. Thus, we can express the momentum operation as an affine transformation of the form ~ a; ~ b, 0 0 where a2j M2j , b2j 0, a2j1 1 and b2j1 M2j1 .

Proof. The proof is similar to that of Lemma 6. Suppose there is an ordering among members of T w.r.t. D. If we denote the noncircular two-day moving averages by mv2 and the noncircular three-day moving averages by mv3, we can write

APPENDIX C EXPRESSING TIME SHIFT

mv2 ~ s1 11; 11; 11; mv2 ~ s2 10:5; 11:5; 11:5; mv2 ~ s3 11; 11; 11; mv3 ~ s1 10:67; 11:33; mv3 ~ s2 11; 11:33; mv3 ~ s3 11; 11; where sequences s~1 , s~2 and s~3 are those given in Lemma 6. There are two possible orderings between mv2 and mv3:

AS A

TRANSFORMATION

Suppose we want to shift sequence ~ x x0 ; x1 ; . . . ; xnÿ1 one day to the right. If we inserted a zero at the beginning, the result after the shift would be ~ x0 0; x0 ; x1 ; . . . ; xnÿ1 which is a sequence of length n 1. Using (1), we can write the DFT of ~ x0 as follows: ! nÿ1 nÿ1 ÿi2 t1f ÿi2f ÿi2tf 1 X 1 X 0 xt e n1 e n1 p xt e n1 ; Xf p n 1 t0 n 1 t0 where f 0; . . . ; n. Time sequences are usually long, i.e., n is a large number, so we can easily replace n 1 inside the parentheses by n without much affecting the equation.

RAFIEI AND MENDELZON : QUERYING TIME SERIES DATA BASED ON SIMILARITY

Now, the expression inside the parentheses becomes Xf , the fth DFT coefficient of ~ x, and we can write ÿi2f

Xf0 e n1 Xf : This gives the first n Fourier coefficients of ~ x0 . If we use the polar representation for complex numbers, we can express the shift operation as an affine transformation of the ÿ2 1 form ~ 1; 0; ÿ2 0 n1 ; 0; n1 ; . . .. We can still do time shift even if ~ x is not a long sequence. The trick is to pad at least as many zeros as the amount of the shift at the end of the sequence. Now, we can forget the overflow zeros generated by the shift and consider the shifted sequence the same size as the original sequence.

ACKNOWLEDGMENTS This research was supported by Communications and Information Technology Ontario and the Natural Sciences and Engineering Research Council. This work was done when Davood Rafiei was a graduate student in the Computer Science Department at the University of Toronto.

REFERENCES [1]

[2]

[3] [4]

[5] [6] [7] [8]

[9]

[10] [11] [12] [13]

R. Agrawal, C. Faloutsos, and A. Swami, ªEfficient Similarity Search in Sequence Databases,º Proc. Fourth Int'l Conf. Foundations of Data Organizations and Algorithms (FODO '93), pp. 69±84, Oct. 1993. R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim, ªFast Similarity Search in the Presence of Noise, Scaling, and Translation in TimeSeries Databases,º Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 490±501, Sept. 1995. R. Agrawal, G. Psaila, E.L. Wimmers, and M. Zait, ªQuerying Shapes of Histories,º Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 502±514, Sept. 1995. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, ªThe R* Tree: An Efficient and Robust Index Method for Points and Rectangles,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '90), pp. 322±331, May 1990. K.K.W. Chu and M.H. Wong, ªFast Time Series Searching with Scaling and Shifting,º Proc. ACM Symp. Principles of Database Systems (PODS '99), pp. 237±248, 1999. R.D. Edwards and J. Magee, Technical Analysis of Stock Trends. Springfield, Mass., 1969. C. Faloutsos, H.V. Jagadish, A.O. Mendelzon, and T. Milo, ªA Signature Technique for Similarity-Based Queries,º Proc. Compression and Complexity of Sequences (SEQUENCES '97), June 1997. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, ªFast Subsequence Matching in Time-Series Databases,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '94), pp. 419± 429, May 1994. D.Q. Goldin and P.C. Kanellakis, ªOn Similarity Queries for TimeSeries Data: Constraint Specification and Implementation,º Proc. First Int'l Conf. Principles and Practice of Constraint Programming, pp. 137±153, Sept. 1995. S. Guha, R. Rastogi, and K. Shim, ªCURE: An Efficient Clustering Algorithm for Large Databases,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 73±84, June 1998. A. Guttman, ªR-Trees: A Dynamic Index Structure for Spatial Searching,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '84), pp. 47±57, June 1984. H.V. Jagadish, A.O. Mendelzon, and T. Milo, ªSimilarity-Based Queries,º Proc. 14th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS '95), pp. 36±45, May 1995. D.B. Lomet and B. Salzberg, ªThe HB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance,º ACM Trans. Database Systems, vol. 15, no. 4, pp. 625±658, Dec. 1990.

693

[14] C.S. Li, P.S. Yu, and V. Castelli, ªHierarchyscan: A Hierarchical Similarity Search Algorith for Databases of Long Sequences,º Proc. 12th Int'l. Conf. Data Eng. (ICDE '96), pp. 546±553, Feb. 1996. [15] NRC-CNRC, Feature Selection Bibliography. http://ai.iit.nrc.ca/ bibliographies/feature-selection.html. [16] J. Nievergelt, H. Hinterberger, and K.C. Sevcik, ªThe Grid File: An Adaptable, Symmetric Multikey File Structure,º ACM Trans. Database Systems, vol. 9, no. 1, pp. 38±71, Mar. 1984. [17] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1975. [18] D. Rafiei, ªFourier-Transform Based Techniques in Efficient Retrieval of Similar Time Sequences,º PhD thesis, Univ. of Toronto, 1998. [19] N. Roussopoulos, S. Kelley, and F. Vincent, ªNearest Neighbor Queries,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 71±79, May 1995. [20] D. Rafiei and A. Mendelzon, ªSimilarity-Based Queries for Time Series Data,º Proc. ACM SIGMOD International Conf. Management of Data (SIGMOD '97), pp. 13±24, May 1997. [21] D. Rafiei and A. Mendelzon, ªEfficient Retrieval of Similar Time Sequences Using DFT,º Proc. Fifth Int'l Conf. Foundations of Data Organizations and Algorithms (FODO '98), pp. 249±257, Nov. 1998. [22] W.G. Roth, ªMIMSY: A System for Analyzing Time Series Data in the Stock Market Domain,º master's thesis, Univ. of Wisconsin, Madison, 1993. [23] R. Ramakrishnan and D. Srivastava, ªCORAL: Control, Relations, and Logic,º Proc. 18th Int'l Conf. Very Large Data Bases (VLDB '92), pp. 238±250, Aug. 1992. [24] D. Sankoff and J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983. [25] K.C. Sevcik and N. Koudas, ªFilter Trees for Managing Spatial Data Over a Range of Size Granularities,º Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB '96), pp. 16±27, Sept. 1996. [26] T. Seidl and H.-P. Kriegel, ªOptimal Multi-step Nearest Neighbour Search,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 154±165, 1998. [27] P. Seshadri, M. Livny, and R. Ramakrishnan, ªSequence Query Processing,º Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '94), pp. 430±441, May 1994. [28] H. Shatkay and S. Zdonik, ªApproximate Queries and Representations for Large Data Sequences,º Proc. 12th Int'l Conf. Data Eng. (ICDE '96), pp. 536±545, Mar. 1996. [29] B.-K. Yi, H.V. Jagadish, and C. Faloutsos, ªEfficient Retrieval of Similar Time Sequences Under Time Warping,º Proc. 14th Int'l Conf. Data Eng. (ICDE '98), pp. 201±208, Feb. 1998. Davood Rafiei received his undergraduate degree in computer engineering at Sharif University of Technology, Tehran, in 1990, his master's degree in computer science from the University of Waterloo in 1995, and his PhD degree in computer science from the University of Toronto in 1999. He was a postdoctoral fellow at the University of Toronto prior to joining to the Department of Computing Science at the University of Alberta as an assistant professor in 2000. His research interests include database systems, querying and indexing nontraditional data such as time series and images, and information retrieval on the Web. Alberto O. Mendelzon received his undergraduate degree from the University of Buenos Aires and his master's and PhD degrees from Princeton University. He was a postdoctoral fellow at the IBM T.J. Watson Research Center. Since 1980, he has been with the University of Toronto where he is now a professor of Computer Science and a member of the Computer Systems Research Group. He has spent time as a visiting scientist at the NTT Basic Research Laboratories in Japan, the IASI in Rome, the IBM Toronto Lab, and AT&T Bell Labs Research in New Jersey. His research interests are in database systems, database query languages, Webbased information systems, and information integration.