
TRANSCRIPT

Page 1:

Aravindan Vijayaraghavan
CMU / Northwestern University

Smoothed Analysis of Tensor Decompositions and Learning

based on joint works with
Aditya Bhaskara (Google Research)
Moses Charikar (Princeton)
Ankur Moitra (MIT)

Page 2:

Factor analysis

Assumption: the d x d matrix M has a "simple explanation": explain the observations using few unobserved variables.

• M is a sum of "few" rank-one matrices (k < d):

M = a_1 \otimes b_1 + a_2 \otimes b_2 + \dots + a_k \otimes b_k

Qn [Spearman]. Can we find the ``desired'' explanation?
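To make the setup concrete, here is a minimal numpy sketch (not from the slides; sizes are illustrative) of a matrix with such a "simple explanation" as a sum of k rank-one terms.

```python
# Illustrative sketch: a d x d matrix with a "simple explanation" as a
# sum of k < d rank-one terms.  Sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3
A = rng.standard_normal((d, k))    # columns a_1, ..., a_k
B = rng.standard_normal((d, k))    # columns b_1, ..., b_k

# M = a_1 b_1^T + ... + a_k b_k^T  (equivalently A @ B.T)
M = sum(np.outer(A[:, i], B[:, i]) for i in range(k))
assert np.allclose(M, A @ B.T)
print(np.linalg.matrix_rank(M))    # k, not d
```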

Page 3:

The rotation problem

Any suitable "rotation" of the vectors gives a different decomposition:

A B^T = (A Q)(B Q)^T   for any orthogonal Q (since Q Q^T = I)

Often difficult to find the "desired" decomposition.

M = a_1 \otimes b_1 + a_2 \otimes b_2 + \dots + a_k \otimes b_k
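A small numpy sketch of the rotation problem (illustrative, not from the slides): any orthogonal Q yields different factors with exactly the same product M.

```python
# The rotation problem: the factors A, B are not identifiable from M alone.
import numpy as np

rng = np.random.default_rng(1)
d, k = 20, 3
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))
M = A @ B.T

Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random rotation, Q Q^T = I
A2, B2 = A @ Q, B @ Q                              # different factors ...
print(np.allclose(M, A2 @ B2.T))                   # ... same matrix M
```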

Page 4:

Tensors: multi-dimensional arrays

• A d x d x ... x d multi-dimensional array is a tensor of order t, or a t-tensor.

• Tensors represent higher-order correlations, partial derivatives, etc.

• A tensor is a collection of matrix (or smaller tensor) slices.

Page 5:

3-way factor analysis

A tensor can be written as a sum of few rank-one tensors. For 3-tensors:

T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i

Rank(T) = smallest k such that T can be written as a sum of k rank-1 tensors.

• The rank of a 3-tensor can be much larger than d (up to about d^2; up to about d^{t-1} for a t-tensor).

Thm [Harshman'70, Kruskal'77]. Rank-k decompositions for 3-tensors (and higher orders) are unique under mild conditions.

3-way decompositions overcome the rotation problem!
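A minimal sketch of building a rank-k 3-tensor from its factor matrices (illustrative sizes), using np.einsum for the sum of outer products.

```python
# A rank-k 3-tensor T = sum_i a_i ⊗ b_i ⊗ c_i, built with np.einsum.
import numpy as np

rng = np.random.default_rng(2)
d, k = 10, 4
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))
C = rng.standard_normal((d, k))

# T[x, y, z] = sum_i A[x, i] * B[y, i] * C[z, i]
T = np.einsum('xi,yi,zi->xyz', A, B, C)
print(T.shape)         # (d, d, d)
```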

Page 6:

Learning Probabilistic Models: Parameter Estimation

Question: Can given data be "explained" by a simple probabilistic model?
Examples: mixtures of Gaussians for clustering points, HMMs for speech recognition, multiview models.

Learning goal: Can the parameters of the model be learned from polynomially many samples generated by the model?

• The EM algorithm is used in practice, but it converges only to local optima.
• Known algorithms with guarantees have exponential time and sample complexity.

Page 7:

Mixtures of (axis-aligned) Gaussians

Probabilistic model for clustering points x in d dimensions (R^d).

Parameters:
• Mixing weights w_1, ..., w_k
• Gaussian i: mean \mu_i, covariance \Sigma_i (diagonal)

Learning problem: Given many sample points, find the parameters {(w_i, \mu_i, \Sigma_i)}.

• Known algorithms use samples and time exponential in k [FOS'06, MV'10]
• Lower bound exponential in k [MV'10] in the worst case

Aim: guarantees in realistic settings

Page 8:

Method of Moments and Tensor Decompositions

Step 1. Compute a tensor whose decomposition encodes the model parameters.
Step 2. Find the decomposition (and hence the parameters).

For a mixture of Gaussians, the d x d x d tensor of third moments E[x_i x_j x_k] (after standard corrections for the Gaussian covariance) is

T = \sum_{i=1}^{k} w_i \, \mu_i \otimes \mu_i \otimes \mu_i

• Uniqueness of the decomposition => the parameters w_i, \mu_i are recovered.
• An algorithm for the decomposition => efficient learning.

[Chang] [Allman, Matias, Rhodes] [Anandkumar, Ge, Hsu, Kakade, Telgarsky]
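A hedged sketch of step 1 for a mixture of axis-aligned Gaussians: estimate the third-moment tensor from samples and compare it with \sum_i w_i \mu_i^{\otimes 3}. Entries with repeated indices need a covariance correction that is omitted here; names, seeds and sizes are illustrative.

```python
# Step 1 of the method of moments (sketch): the empirical third-moment
# tensor of a mixture of axis-aligned Gaussians.  For distinct indices
# (i, j, k) it matches sum_m w_m mu_m ⊗ mu_m ⊗ mu_m; entries with repeated
# indices need a covariance correction that is omitted here.
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 8, 3, 200_000
w = np.array([0.5, 0.3, 0.2])                 # mixing weights
mu = rng.standard_normal((k, d))              # means mu_1, ..., mu_k
sigma = 0.3 * np.ones((k, d))                 # diagonal (axis-aligned) covariances

z = rng.choice(k, size=n, p=w)                # latent cluster labels
x = mu[z] + sigma[z] * rng.standard_normal((n, d))

T_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n        # empirical moment tensor
T_true = np.einsum('m,mi,mj,mk->ijk', w, mu, mu, mu)   # sum_m w_m mu_m^{⊗3}

# entries with distinct indices approximately agree (up to sampling error)
print(T_hat[0, 1, 2], T_true[0, 1, 2])
```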

Page 9:

What is known about Tensor Decompositions?

Thm [Kruskal'77]. Rank-k decompositions of 3-tensors are unique (non-algorithmic) when the Kruskal ranks of the factor matrices satisfy k_A + k_B + k_C >= 2k + 2.

Thm [Jennrich via Harshman'70]. Finds the unique rank-k decomposition of a 3-tensor when k <= d.
• The uniqueness proof is algorithmic!
• Called the full-rank case. No symmetry or orthogonality needed.
• Rediscovered in [Leurgans et al. 1993], [Chang 1996].

Thm [De Lathauwer, Castaing, Cardoso'07]. Algorithm for 4-tensors of rank k, generically, for k up to roughly d^2.

Thm [Chiantini, Ottaviani'12]. Uniqueness (non-algorithmic) for 3-tensors of rank k well beyond d, generically.

Page 10:

Robustness to Errors

Beware: sampling error. We only see an empirical estimate of the moment tensor; with polynomially many samples, the error is inverse polynomial.

Are uniqueness and algorithms resilient to noise of 1/poly(d, k)?

Thm [BCV'14]. Robust version of Kruskal's uniqueness theorem (non-algorithmic), tolerating inverse-polynomial error.

Thm. Jennrich's polynomial-time algorithm for tensor decompositions is robust up to inverse-polynomial error.

Open problem: Robust versions of the generic results [De Lathauwer et al.]?

Page 11:

Algorithms for Tensor Decompositions

Polynomial-time algorithms when rank k <= d [Jennrich]

NP-hard in the worst case when the rank exceeds the dimension [Hastad, Hillar-Lim]

This talk: overcome worst-case intractability using Smoothed Analysis.

Polynomial-time algorithms* for robust tensor decompositions for rank k >> d (the rank can be any polynomial in the dimension).
*Algorithms recover the factors up to inverse-polynomial error.

Page 12:

Implications for Learning

Efficient learning when the number of clusters/topics k <= dimension d [Chang 96, Mossel-Roch 06, Anandkumar et al. 09-14]:
• Learning phylogenetic trees [Chang, MR]
• Axis-aligned Gaussians [HK]
• Parse trees [ACHKSZ, BHD, B, SC, PSX, LIPPX]
• HMMs [AHK, DKZ, SBSGS]
• Single topic models [AHK], LDA [AFHKL]
• ICA [GVX] ...
• Overlapping communities [AGHK] ...

``Full rank'' or ``non-degenerate'' setting: known only in restricted cases where the number of clusters <= the number of dimensions.

Page 13:

Overcomplete Learning Setting

Number of clusters / topics / states k >> dimension d (e.g., in computer vision and speech).

Previous algorithms do not work when k > d!

Need polynomial-time decomposition of tensors of rank k >> d.

Page 14:

Smoothed Analysis [Spielman & Teng 2000]

Smoothed analysis guarantees:
• A small random perturbation of the input makes instances easy.
• Worst-case instances are isolated.
• Best polynomial-time guarantees in the absence of any worst-case guarantees.

Classic example: the simplex algorithm solves (perturbed) LPs efficiently, which explains its behaviour in practice.

Page 15:

Today's talk: Smoothed Analysis for Learning [BCMV STOC'14]

• First smoothed-analysis treatment of unsupervised learning.

Thm. Polynomial-time algorithms for learning mixtures of axis-aligned Gaussians, multiview models, etc., even in ``overcomplete settings''.

based on

Thm. Polynomial-time algorithms for tensor decompositions in the smoothed-analysis setting.

Page 16:

Smoothed Analysis for Learning

Learning setting (e.g., mixtures of Gaussians):
• Worst-case instances: means in pathological configurations.
• In the real world, the means are not in adversarial configurations!
• What if the means are perturbed slightly: \mu_i -> \tilde{\mu}_i?

Generally, the parameters of the model are perturbed slightly.

Page 17:

Smoothed Analysis for Tensor Decompositions

1. The adversary chooses a tensor

T (of size d x d x ... x d) = \sum_{i=1}^{k} a_i^{(1)} \otimes a_i^{(2)} \otimes \dots \otimes a_i^{(t)}

2. \tilde{T} is a random \rho-perturbation of T, i.e., each factor vector gets an independent (Gaussian) random vector of length about \rho added to it:

\tilde{T} = \sum_{i=1}^{k} \tilde{a}_i^{(1)} \otimes \tilde{a}_i^{(2)} \otimes \dots \otimes \tilde{a}_i^{(t)} + noise

3. Input: \tilde{T}. Analyse the algorithm on \tilde{T}.

The factors of the decomposition are perturbed.
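A small sketch of this input model (illustrative sizes and seed; using coordinate-wise standard deviation \rho/\sqrt{d} as one way to realize a perturbation of length about \rho).

```python
# Smoothed-analysis input model (sketch): adversarial factor matrices whose
# columns each receive an independent Gaussian perturbation of length ~ rho.
import numpy as np

rng = np.random.default_rng(4)
d, k, rho = 10, 15, 0.01

factors = []
for _ in range(3):                           # order-3 example; the slide allows order t
    A = rng.standard_normal((d, k))          # stands in for the adversary's choice
    A_tilde = A + (rho / np.sqrt(d)) * rng.standard_normal((d, k))
    factors.append(A_tilde)

T_tilde = np.einsum('xi,yi,zi->xyz', *factors)   # the perturbed tensor the algorithm sees
print(T_tilde.shape)
```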

Page 18:

Algorithmic Guarantees

Thm [BCMV'14]. Polynomial-time algorithm, w.h.p., for decomposing an order-t tensor (d dimensions in each mode) in the smoothed-analysis model, with rank k as large as a polynomial in d (for a suitable constant order t).

Running time and sample complexity are polynomial (in d, k, 1/\rho and the inverse accuracy).

Guarantees are for order-t tensors in d dimensions (each mode), with rank of the t-tensor = k (the number of clusters). Previous algorithms required k <= d; the smoothed-case algorithms allow k >> d.

Corollary. Polynomial-time algorithms (smoothed analysis) for mixtures of axis-aligned Gaussians, multiview models, etc., even in the overcomplete setting, i.e., with the number of clusters as large as d^C for any constant C, w.h.p.

Page 19:

Interpreting Smoothed Analysis Guarantees

Time and sample complexity are polynomial (in d, k and 1/\rho).

Works with probability 1 - exp(-poly(d)).
• Exponentially small failure probability (for constant order t).

Smooth interpolation between worst case and average case:
• \rho -> 0: worst case.
• \rho large: almost random vectors.
• Can handle \rho inverse-polynomial in d.

Page 20:

Algorithm Details

Page 21:

Algorithm Outline

1. An algorithm for 3-tensors in the ``full rank setting'' (k <= d).
   [Jennrich 70] A simple (robust) algorithm for a 3-tensor T.
   • Any algorithm for full-rank (non-orthogonal) tensors suffices.

2. Handle higher-order tensors using ``tensoring / flattening''.
   • This is what handles the overcomplete setting.

Recall: T = \sum_{i=1}^{k} A_i \otimes B_i \otimes C_i, with factor vectors A_i, B_i, C_i in R^d.

Aim: Recover A, B, C.

Page 22:

Blast from the Past

Recall: T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i.  Aim: recover A, B, C.

[Jennrich via Harshman 70] Algorithm for 3-tensors when:
• A, B are full rank (rank = k)
• C has Kruskal rank >= 2 (no two parallel columns)
• Reduces to matrix eigen-decompositions.

Qn. Is this algorithm robust to errors? Yes! Needs perturbation bounds for eigenvectors [Stewart-Sun].

Thm. Efficiently decompose T and recover the factors up to small error when 1) A, B have min-singular-value >= 1/poly(d), and 2) C doesn't have parallel columns.

Page 23:

Slices of tensors

Consider a rank-1 tensor a \otimes b \otimes c: its s'th slice (along the middle mode) is b(s) \cdot (a \otimes c).

For T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i, the s'th slice is \sum_{i=1}^{k} b_i(s) \cdot (a_i \otimes c_i).

All slices have a common diagonalization!

Random combination of slices with weights w: \sum_{i=1}^{k} \langle b_i, w \rangle \cdot (a_i \otimes c_i).
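A quick numerical check of the slice identity above (illustrative sizes): combining the slices along the b-mode with weights w gives A diag(B^T w) C^T.

```python
# Check: a random combination of the slices of T along the b-mode equals
# sum_i <b_i, w> (a_i ⊗ c_i) = A diag(B^T w) C^T.
import numpy as np

rng = np.random.default_rng(5)
d, k = 10, 6
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('xi,yi,zi->xyz', A, B, C)      # T[x, y, z] = sum_i a_i(x) b_i(y) c_i(z)

w = rng.standard_normal(d)
M_w = np.einsum('xyz,y->xz', T, w)           # combine slices along the b-mode
print(np.allclose(M_w, A @ np.diag(B.T @ w) @ C.T))   # True
```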

Page 24:

Simultaneous diagonalization

Two matrices with a common diagonalization: M_1 = A D_1 C^T and M_2 = A D_2 C^T, with D_1, D_2 diagonal.

If 1) A, C are invertible, and
   2) the diagonal entries of D_1 D_2^{-1} are non-zero and pairwise unequal,

then we can find A (and C) by matrix diagonalization: the eigenvectors of M_1 M_2^{-1} are the columns of A.

Page 25:

Decomposition algorithm [Jennrich]

T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i

Algorithm:
1. Take a random combination of the slices along one mode, with weights w_1, to get M_1.
2. Take another random combination of the slices along the same mode, with weights w_2, to get M_2.
3. Find the eigen-decomposition of M_1 M_2^{-1} to get A. Similarly for B and C.

Thm. Efficiently decompose T and recover A, B, C up to inverse-polynomial error (in Frobenius norm) when
1) A, B are full rank, i.e., min-singular-value >= 1/poly(d), and
2) C doesn't have parallel columns (in a robust sense).
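The following is a hedged numpy sketch of this algorithm, not the authors' implementation: it assumes A and B have full column rank and C has no parallel columns, uses pseudo-inverses in place of inverses, and recovers C by solving a linear system in the flattened tensor.

```python
# Jennrich's algorithm via simultaneous diagonalization (sketch).
import numpy as np

def jennrich(T, k):
    """Decompose a d x d x d tensor T of rank k <= d into factor matrices."""
    d = T.shape[0]
    rng = np.random.default_rng(0)
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)

    # Random combinations of slices along the c-mode:
    #   M1 = A diag(C^T w1) B^T,   M2 = A diag(C^T w2) B^T
    M1 = np.einsum('xyz,z->xy', T, w1)
    M2 = np.einsum('xyz,z->xy', T, w2)

    # Eigenvectors of M1 M2^+ are the columns of A (eigenvalues are the
    # diagonal ratios); eigenvectors of M1^T (M2^T)^+ are the columns of B.
    def top_eigvecs(M):
        vals, vecs = np.linalg.eig(M)
        order = np.argsort(-np.abs(vals))[:k]
        return np.real(vals[order]), np.real(vecs[:, order])

    la, A_hat = top_eigvecs(M1 @ np.linalg.pinv(M2))
    lb, B_hat = top_eigvecs(M1.T @ np.linalg.pinv(M2.T))

    # Pair up the columns of A and B that share the same eigenvalue.
    A_hat = A_hat[:, np.argsort(la)]
    B_hat = B_hat[:, np.argsort(lb)]

    # Solve a linear system for C: the d^2 x d flattening of T equals
    # (A Khatri-Rao B) C^T.
    KR = np.einsum('xi,yi->xyi', A_hat, B_hat).reshape(d * d, k)
    C_hat = np.linalg.lstsq(KR, T.reshape(d * d, d), rcond=None)[0].T
    return A_hat, B_hat, C_hat

# sanity check on a random rank-k tensor (sizes are illustrative)
rng = np.random.default_rng(6)
d, k = 12, 8
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('xi,yi,zi->xyz', A, B, C)
A_hat, B_hat, C_hat = jennrich(T, k)
print(np.allclose(np.einsum('xi,yi,zi->xyz', A_hat, B_hat, C_hat), T))  # True
```

This full-rank routine is the building block that the flattening step below plugs into.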

Page 26:

Overcomplete Case: Techniques

Page 27:

Mapping to Higher Dimensions

How do we handle the case rank k >> d (or even vectors with "many" linear dependencies)?

Idea: a map f from R^d to R^{d^2} takes the parameter / factor vectors a_1, ..., a_k of the factor matrix A to f(a_1), ..., f(a_k) such that:
1. the tensor corresponding to the mapped factors is computable using the data, and
2. f(a_1), ..., f(a_k) are linearly independent (with a lower bound on the min singular value).

• Reminiscent of kernels in SVMs.

Page 28:

A mapping to higher dimensions

Outer products / tensor products: map a -> a \otimes a (from R^d to R^{d^2}), and more generally a -> a^{\otimes t}.

Basic intuition:
1. a \otimes a has d^2 dimensions.
2. For non-parallel unit vectors, the distance between the mapped vectors is comparable to (or larger than) the original distance.
• The tensor with the mapped vectors as factors is a flattening of a higher-order moment tensor.

Qn: are these mapped vectors linearly independent? Is the ``essential dimension'' now d^2?
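A tiny sketch of the lift (illustrative sizes): with k > d vectors in R^d, the lifted vectors a_i \otimes a_i in R^{d^2} can again be linearly independent.

```python
# The lift f(a) = a ⊗ a sends R^d to R^{d^2}; k > d lifted vectors can be
# linearly independent even though the original vectors cannot be.
import numpy as np

rng = np.random.default_rng(7)
d, k = 5, 12                                  # k > d: dependent in R^d ...
A = rng.standard_normal((d, k))
F = np.column_stack([np.kron(A[:, i], A[:, i]) for i in range(k)])   # d^2 x k

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(F))   # 5, then 12 (generically)
```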

Page 29:

Bad cases: beyond worst-case analysis

Can we hope for the "dimension" to multiply "typically"?

Bad example: take U, V (both d x k) and let Z have columns z_i = u_i \otimes v_i.
• Every d vectors of U and of V are linearly independent (U, V have rank d).
• But some (roughly 2d) vectors of Z are linearly dependent!

Lem. The dimension (Kruskal rank) under tensoring is only additive in the worst case, not multiplicative.

So the strategy does not work in the worst case. But such bad examples are pathological and hard to construct!

Page 30:

Product vectors and linear structure

Map a_i -> \tilde{a}_i \otimes \tilde{a}_i, where \tilde{a}_i is a random \rho-perturbation of a_i.

• It is easy to compute a tensor with the mapped vectors as factors / parameters (a ``flattening'' of the 3t-order moment tensor).
• The new factor matrix is full rank, using smoothed analysis:

Theorem. For any d x k matrix A whose columns are randomly \rho-perturbed, the matrix with columns \tilde{a}_i \otimes \tilde{a}_i has minimum singular value at least inverse-polynomial in d (and polynomial in \rho), with probability 1 - exp(-poly(d)), provided k is sufficiently smaller than d^2.
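A hedged numerical illustration of this theorem (not a proof; sizes and \rho are illustrative): start from a degenerate factor matrix whose columns all lie in a 2-dimensional subspace, so the lifted vectors are linearly dependent, then perturb the columns and compare the smallest singular value of the lifted matrix.

```python
# Smallest singular value of the lifted matrix, before and after a small
# random perturbation of the factor columns.
import numpy as np

def lift(A):
    """Columns a_i ⊗ a_i (length d^2), stacked into a d^2 x k matrix."""
    return np.column_stack([np.kron(a, a) for a in A.T])

rng = np.random.default_rng(8)
d, k, rho = 10, 12, 0.01
basis = rng.standard_normal((d, 2))
A = basis @ rng.standard_normal((2, k))                 # degenerate: 2-dim column span
A_tilde = A + (rho / np.sqrt(d)) * rng.standard_normal((d, k))

print(np.linalg.svd(lift(A), compute_uv=False)[-1])       # ~0: dependent columns
print(np.linalg.svd(lift(A_tilde), compute_uv=False)[-1]) # small but strictly positive
```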

Page 31:

Proof sketch (t = 2)

Main issue: the perturbation happens before the product.
• The claim is easy if the columns are perturbed after the tensor product (simple anti-concentration bounds).

Technical component: show that the perturbed product vectors behave like random vectors in R^{d^2}.

Prop. For any d x k matrix U, the matrix whose columns are the perturbed product vectors has inverse-polynomial minimum singular value with probability 1 - exp(-poly(d)).

Difficulties:
• only about d coordinates of fresh randomness per column, in d^2 dimensions
• block dependencies

Page 32:

Projections of product vectors

Question. Given any vectors a, b and their Gaussian \rho-perturbations \tilde{a}, \tilde{b}, does \tilde{a} \otimes \tilde{b} have a non-negligible projection onto any given large-dimensional subspace S of R^{d^2}, with high probability?

Easy case: a \rho-perturbation applied directly to the product vector in R^{d^2} has a non-negligible projection onto S w.h.p.; anti-concentration for polynomials gives such a bound, but only with probability 1 - 1/poly.

Much tougher for a product of perturbations! (inherent block structure)
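A numerical illustration of this question (illustrative sizes; not a proof): choose S orthogonal to a \otimes b, the worst case for the unperturbed product vector, and compare the projections of a \otimes b and \tilde{a} \otimes \tilde{b}.

```python
# Projection of a perturbed product vector onto a subspace S of dim d^2/2
# that was chosen orthogonal to the unperturbed product vector.
import numpy as np

rng = np.random.default_rng(9)
d, rho = 20, 0.01
a, b = rng.standard_normal(d), rng.standard_normal(d)
a_t = a + (rho / np.sqrt(d)) * rng.standard_normal(d)
b_t = b + (rho / np.sqrt(d)) * rng.standard_normal(d)

v = np.kron(a, b)                              # the unperturbed product vector
R = rng.standard_normal((d * d, d * d // 2))
R -= np.outer(v, v @ R) / (v @ v)              # make every column orthogonal to v
S, _ = np.linalg.qr(R)                         # orthonormal basis of S, S ⟂ v

print(np.linalg.norm(S.T @ np.kron(a, b)))      # ~0 by construction
print(np.linalg.norm(S.T @ np.kron(a_t, b_t)))  # noticeably larger, w.h.p. over the perturbation
```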

Page 33:

Projections of product vectors

Question (restated). Given vectors a, b and their Gaussian \rho-perturbations, does \tilde{a} \otimes \tilde{b} have a non-negligible projection onto a given subspace S of dimension about d^2/2, with high probability?

Let \Pi_S be the projection matrix onto S, and group its columns into d blocks of d columns each. Then

\Pi_S (\tilde{a} \otimes \tilde{b}) = M \tilde{a},

where column j of M is the product of the j'th block of \Pi_S with \tilde{b}; so M is a (dim S) x d matrix that depends on \tilde{b}.

Page 34:

Two steps of the proof

1. W.h.p. (over the perturbation of b), the matrix M has at least about sqrt(d) large singular values. (We will show this next.)

2. If M has many large singular values, then w.h.p. (over the perturbation of a), M \tilde{a} has a large projection onto S. (This follows easily by analyzing the projection of a perturbed vector onto a dimension-k space.)

Page 35:

Structure in any subspace S

Suppose the first sqrt(d) "blocks" of S were orthogonal. Restricted to the corresponding columns, entry (i, j) of M is \langle v_{ij}, \tilde{b} \rangle for vectors v_{ij} \in R^d.

• With orthogonal v_{ij}, this is a translated i.i.d. Gaussian matrix!
• Such a matrix has many big singular values.

Page 36:

Finding structure in any subspace S

Main claim: every subspace of this dimension contains vectors v_1, v_2, ..., v_{sqrt(d)} with such a structure.

Property: the picked blocks (d-dimensional vectors) each have a "reasonable" component orthogonal to the span of the rest.

The earlier argument goes through even with blocks that are not fully orthogonal!

Page 37:

Main claim (sketch)

Idea: obtain "good" columns one by one.
• Show there exists a block with many linearly independent "choices".
• Fix some choices and argue that the same property still holds, and repeat.
• Uses a delicate inductive argument; it crucially uses the fact that we have a subspace of large dimension.

Generalization: a similar result holds for higher-order products, which implies the main result.

Page 38:

Summary

• Smoothed analysis for learning probabilistic models.
• Polynomial-time algorithms in overcomplete settings: guarantees for order-t tensors in d dimensions (each mode) with rank k (the number of clusters) well beyond d in the smoothed case, versus k <= d for previous algorithms.
• Flattening gets beyond full-rank conditions: plug into existing results on spectral learning of probabilistic models.

Page 39:

Future Directions

Smoothed analysis for other learning problems?

Better guarantees using higher-order moments
• Better bounds w.r.t. the smallest singular value?

Better robustness to errors
• Modelling errors?
• Tensor decomposition algorithms that are more robust to errors?
• Promising: [Barak-Kelner-Steurer'14] using the Lasserre hierarchy.

Better dependence on rank k vs. dimension d (especially for 3-tensors)
• Next talk by Anandkumar: random / incoherent decompositions.

Page 40:

Thank You!

Questions?