Multivariate Clustering and Data Compression Multivariate Clustering and Data Compression



1. A.B. Nobel and A.D. Wyner, "A Recurrence theorem for dependent processes with applications to data compression", IEEE Transactions on Information Theory, vol.38, pp.1561-1564, 1992.

Summary: Wyner and Ziv established a recurrence theorem for finite alphabet Markov processes and applied this result to the analysis of a finite memory version of the Lempel Ziv compression algorithm. In this paper it is shown that their result holds under the (essentially weaker) hypothesis that the process is beta-mixing, with no conditions on the mixing rate, or alpha-mixing with coefficients decaying faster than k-a for some a > 1. The new proof is considerably simpler than the original.



3. A.B. Nobel, "Vanishing distortion and shrinking cells", IEEE Transactions on Information Theory, vol.42, pp.1303-1305, 1996.

Summary: A multivariate clustering scheme assigns each vector in Rd to one of a finite number of representative vectors of the same dimension. The set of vectors assigned to a particular representative is called a cell of the clustering scheme. Suppose that a sequence of clustering schemes with convex cells is applied to a fixed random vector whose distribution is absolutely continuous. It is shown that if the expected squared distance between X and its representative vector tends to zero, then the diameter of the cell containing X tends to zero in probability.



3. A.B. Nobel and R.A. Olshen, "Termination and continuity of greedy growing for tree-structured vector quantizers", IEEE Transactions on Information Theory, vol.42, pp.191-205, 1996.

Summary: Tree structured multivariate clustering schemes are widely used by engineers as a means of compressing vector-valued data. In practice these schemes are constructed from data by so called greedy algorithms, which produce a tree one node at a time in a recursive, stepwise optimal fashion. One popular algorithm recursively splits that terminal node whose division offers the greatest increase in performance per increase in complexity. The termination and structural consistency of this algorithm is established when the available data come from a stationary ergodic source. In addition, it is shown that the algorithm may not terminate if the support of the data is unbounded.



4. A.B. Nobel, "Recursive partitioning to reduce distortion", IEEE Transactions on Information Theory, vol.43, pp.1122-1133, 1997.

Summary: This paper studies recursive growing algorithms for tree structured clustering schemes. Of interest are schemes that, at each stage of their operation, split the terminal node of the current tree whose division offers the greatest increase in performance. It is shown that the unsupervised application of the algorithms to ergodic data yields clustering schemes whose loss tends to zero with increasing sample size. The analysis is applicable to procedures, employing both axis-parallel and oblique splitting, that can be efficiently implemented on a digital computer.