一些聚類（clustering）演算法的總結

02-03

本著此專欄既要有技術類的乾貨，又要有吹牛逼類的水貨原則，我今天來嘗試著總結一下當下比較流行的聚類（clustering）/分類（classification）的演算法，供大家參考與討論。由於我念的是洋書，很多專業辭彙一下子轉不過彎來，所以這篇文章中間會夾雜一些英文，請各位看官諒解。

K-mean

簡介：K-mean演算法的目標是把n個observation放到k個聚類（cluster）中間去，使得每一個observation都被放到離它最近的那個聚類（cluster）中去，這裡「最近」是用這個observation跟相對應的聚類（cluster）的平均值（mean）的距離（distance）來衡量的。

演算法：Objective function:

$operatornamewithlimits{argmin}_{C} sum_{i=1}^{k}{sum_{xin C_i}{left|left| vec{x} - vec{mu_i} right|right|} } n$

where $C={C_1, C_2, ..., C_k}$ is a set of clusters. $vec{x}n$ is a $ntimes m$ matrix, each row represents an observation and each column represents one dimension in the feature space. $vec{mu_i}$ is a vector of length $m$ . It is the mean of points in cluster $C_i$ . If an observation has more than one cluster which is the "nearest", choose only one of them.

用人話說就是：把每一個observation assign到合適的cluster中間，使得所有observation到它所在cluster的中心（centroid)的距離之和最小。（卧槽，我居然一句話把它說完了！）

實現：常見的K-means演算法都是用迭代的方法，其中最有名的要數Lloyds algorithm啦。這個演算法主要分成兩個步驟，一個是Assignment step，另一個是Update step。同過反覆交替執行這兩個步驟，最終是演算法收斂，從而得到穩定的結果。演算法的實現步驟如下：

Initialization: 在所有的observation中任意取K(number of clusters)個，把它們作為相對應cluster的中心。 $m_1^0,...,m_K^0$ ，其中 $m_j^t$ 表示第 $j$ 個cluster在第 $t$ 次迭代之後的中心。

Assignment step: 把每一個observation放到離它「最近」的cluster中間，「最近」與否取決於它到cluster中心的距離（Euclidean distance）。數學表達如下：

$C_i^t = { x_p : |x_p -m_i^t|^2 leq |x_p -m_j^t|^2 forall j, 1leq jleq k}$

where $x_p$ is one observation and $C_i^t$ is the cluster $i$ after the $t^{th}$ iteration. If $x_p$ has more than one the nearest cluster, choose only one to assign.

Update step: 經過上一步，所有observation都被重新assign到各個cluster裡面，所以我們需要update每個cluster的中心（centroid）。數學表達如下：

$m_i^{t+1} = frac{1}{|C_i^t|} sum_{x_j in C_i^t}{x_j}$

where $|C_t^t|$ is the number of observations in cluster $C_i^t$ .

重複Assignment step 和 Update step 多次直到 $C_i^t$ 不再隨 $t$ 變化，演算法收斂。選取不同的初始值，重複以上步驟，從而得到穩定的結果。

優劣：跟所有演算法一樣，K-means也有它的優點跟缺點。

優點：1. 邏輯簡單，便於實現

2. 對於某些數據，有著較好的表現

缺點：1. Computationally expensive，因為他要計算所有observation到所有centroid之間的距離

2. K的選擇需要對數據有比較深刻的了解

3. 「距離」的選擇直接影響結果。這裡我們用的Euclidean distance，假設每個feature是equal weight的，但如果每個feature的重要程度不一樣的話，equal weight就變得不可取了。這裡牽扯到distance metric learning的內容，我不展開了。

應用：由於K-means簡單易行，而且相對其他更好的演算法來說更加容易實現，所以它經常被用來做pre-clustering。在使用更加高級的演算法之前，把比較相似observation放在一起。

Hierarchical clsutering

簡介：Hierarchical clustering 演算法是一種試圖建立hierarchy of cluster的演算法。它有兩種策略，一種是 Agglomerative，另一種是 Divisive。 Agglomerative 是一種「bottom up」 approach，它一開始假設每一個observation都是一個獨立的cluster，然後通過 merge 每一層相鄰的兩個 cluster 最終達到所有的最高層，所有的observation都屬於同一個cluster。於此相反，Divisive 是一種「top down" approach，它一開始假設所有的observation都屬於同一個cluster，然後通過分離每個cluster中間比較不相似的observation刀下一層，之道所有的observation都屬於不同的cluster。一般來說，人們用Dendrogram來可視化Hierarchical clusters。

Cluster dissimilarity：為了決定哪些cluster被合成一個（Agglomerative），或者一個cluster被怎麼分成小的cluster（Divisive），人們需要一個指標來衡量兩個集合的observation 的差異程度（dissimilarity）。一般來說，這個指標由兩個組成部分，一個叫做metric，它衡量兩個observation之間的距離，另一個叫做linkage，它衡量兩個集合（set）間的差異程度（dissimilarity），它是一個pairwise distances of observations in the sets的函數。Metric跟Linkage的選取對最後cluster出來的結果有很大的影響，下面我列幾個常用的Metric和Linkage：

Metric

$Euclidean distance = sqrt{sum_{i}{(a_i - b_i)^2} }$

$Squared Euclidean distance = sum_{i}{|a_i - b_i|}$

$Maximum distance = operatornamewithlimits{max}_{i} |a_i - b_i|$

$Manhattan distance = sum_{i}{|a_i - b_i|}$

$Mahalanobis distance = sqrt{(vec{a} - vec{b}^{T}S^{-1}(vec{a} - vec{b})}$ , where $S$ is the Covariance matrix

Linkage

$Maximum linkage = max{d(a,b) : a in A, b in B}$

$Minimum linkage = min{d(a,b):a in A, b in B}$

$Centroid linkage = ||c_i - c_j||$ , where $c_i$ and $c_j$ are centroids of cluster $i$ and $j$ .

$Average linkage = frac{1}{|A|cdot |B|} sum_{a in A}{ sum_{b in B}{d(a,b)} }$

$Minimum energy linkage = 2 mathbb{E} ||A - B|| - mathbb{E} ||A - A|| - mathbb{E}||B - B||$ , where $mathbb{E}||A - B||$ is the average distance between all the pairs with one observation from set A and the other from set B. $mathbb{E}||A - A||$ is the average distance between all the pairs with both observations from set A.

實現：這個實現其實也沒有什麼好講的，如果是 Agglomerative 的話就從每一個 observation 開始，把linkage最短的兩個 cluster 合併，知道只剩一個大 cluster 為止；如果是 Divisive 的話就從一個大的 cluster 開始，每次把一個 cluster 分成兩個，使得他們的 linkage 最大，直到每個 observation 都是一個 cluster 為止。最終的結果取決使用者在哪裡停止 Agglomerate / Divide。

優劣：

優點

不用像 K-means 那樣假設總共的 cluster 的數量

缺點

沒有任何目標函數
最終的 cluster 結果需要主觀獲得
Metric 和 Linkage 的選擇沒有統一標準

應用：Search result clustering, collection or subsets of collection, information retrieval.

Latent Dirichlet allocation （LDA）

這個演算法我得主要用英文寫了，專業辭彙太多了，而且我也不知道對應的中文怎麼說，請大家多多海涵啊。

簡介：這個演算法的三大發明者都是機器學習領域大名鼎鼎的男神，Latent Dirichlet allocation論文的一作是 David M. Blei，二作是 Andrew Ng，三作是 Michael I. Jordan。其中Andrew Ng現在更是百度的首席科學家。嚴格說起來，LDA也算是 hierarchical clustering 的一種，是一個 hierarchical Bayesian model。當時這個模型的初衷是用來識別文章的主題（topic）的。

演算法：

首先來一發定義。

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}. 具體一點，可以把word想像成文章裡面的一個單詞，vocabulary就相當於一個字典，裡面包含了所有的word。Thus, using superscripts to denote components, the vth word in the vocabulary is represented by a V-vector w such that $w^v=1$ and $w^u=0$ for $une v$ . 這樣的話，第v個word可以用一個長度為 V 的向量 w 來表示， w 除了第 v 項是1之外，別的都為0.
A document is a sequence of N words denoted by $textbf{w} = (w_1, w_2, ..., w_N)$ , where $w_n$ is the nth word in the sequence. 一個document就是一串word，它用一個矩陣 $textbf{w}$ 表示， $textbf{w}$ 中的每一列都是一個word，用上面提到的向量來表示。
A corpus is a collection of M documents denoted by $bf{D} = {w_1, w_2, ..., w_M}$ . M個document就組成了一個corpus。

來一張大名鼎鼎的圖。

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. 它的主要思路是：每一個document都是又一些我們觀察不到的topic的隨機組合，因為topic是不能直接觀察到的，我們用words的一個概率分布來表示topic。

LDA模型有著以下假設：

$theta sim Dir(alpha)$

對 $N$ 個 $word$ 中的每一個 $w_n$ :

它談論的 $topic$ $z_n sim Multinomial(theta)n$
每一個 $word$ 都來自於一個 multinomial probability distribution conditioned on the topic $z_n$ ,概率為 $p(w_n | z_n, beta)$

其中，我們定義topic的數量為k，所以 $theta$ 是一個k維向量， $z$ 也是一個k維向量。 $beta$ 就是我們之前提到的字典，它是一個 $k times V$ 的矩陣，每一行是一個topic，每一列是一個word。 $beta_{ij} = p(w^j=1|z^i=1)$ ， $beta$ 中的每一項代表的是給定一個topic，任意一個word出現的概率。N是每一篇document中word的數量。

根據Dirichlet distribution的定義，我們有：

$p(theta|alpha)=frac{Gamma(sum_{i=1}^{k}{alpha_i} )}{prod_{i=1}^{k}{Gamma(alpha_i)}}theta_1^{alpha_1-1} dots theta_k^{alpha_k-1}$

where $theta_i geq 0, sum_{i=1}^{k}{theta_i=1}$ and $Gamma(x)$ is the Gamma function.

參照上面那張圖，given the parameters $alpha$ and $beta$ , the joint distribution of a topic mixture $theta$ , a set of N topics z, and a set of N words $textbf{w}$ is given by:

$p({bf theta}, textbf{z}, textbf{w}|alpha, beta)=p(theta|alpha)prod_{n=1}^{N}p(z_n|theta)p(w_n|z_n,beta)$

對 $theta$ 積分並且對 $z$ 求和，我們就得到了the marginal distribution of a document: $p(textbf{w}|alpha, beta) = int_{}^{} p(theta|alpha)(prod_{n=1}^{N} sum_{z_n}^{}{p(z_n|theta)p(w_n|z_n,beta)} )dtheta$

上面是一個document的概率，對於一個corpus，我們只要吧所有document的概率相乘就可以了： $p(D|alpha, beta) = prod_{d=1}^{M}int_{}^{}p(theta_d|alpha)(prod_{n=1}^{N_d} sum_{z_{dn}}^{}{p(z_{dn}|theta_d)p(w_{dn}|z_{dn},beta} ) dtheta_d$

回到前面的公式，我們有 $p({bf theta}, textbf{z}, textbf{w}|alpha, beta)$ ，其中 $theta$ 和 $textbf{z}$ 都是hidden variable，他們的posterior distribution given a document 如下： $p(theta, textbf{z}|textbf{w}, alpha, beta)=frac{p(theta, textbf{z}, textbf{w}|alpha, beta)}{p(textbf{w}|alpha, beta)}$

由於這個概率分布是intractable的，所以我們需要用一些approximation，這裡我們介紹一下variational approximation。這個趨近方法主要用的是Jensens inequality來得到原來的函數的下限。簡單來說，我們從一坨下限（a family of lower bounds）中間，通過優化的方法試圖找到一個最趨近的tractable的下限，來代表原來的函數。

要得到這麼一坨下限，我們可以把原來的圖像模型（graphic model）稍作改動，新的模型由下圖表示：

原來非常麻煩的計算來自於 $theta$ 和 $beta$ 之間的聯繫，在新的圖像模型中間，我們去掉了 $theta$ , $textbf{z}$ 和 $textbf{w}$ 之間的聯繫，以及 $textbf{w}$ 這個節點（node）。然後我們就會得到下面這個variational distribution：

$q(theta, textbf{z} | gamma, phi ) = q(theta|gamma)prod_{n=1}^{N}q(z_n|phi_n)$

其中 $gamma$ 是Dirichlet parameter， $(phi_1, dots , phi_n)$ 是multinomial parameters。

經過一系列推導，我們的得出可以通過解下面的優化問題來找到一個下限來趨近log likelihood：

$(gamma^*, phi^*) = operatornamewithlimits{argmin}_{(gamma, phi)}D(q(theta, textbf{z} | gamma, phi) || p(theta, textbf{z} | textbf{w}, alpha, beta))$

在得到最優的 $gamma$ 和 $(phi_1, dots , phi_n)$ 之後，我們最想要得到的是 $alpha$ 和 $beta$ 的值，它們可以通過解以下優化問題來得到： $l(alpha, beta) = sum_{d=1}^{M}{log p(textbf{w}_d | alpha, beta)}$

我們之前討論過，由於 $p(textbf{w}_d | alpha, beta)$ 並不是tractable的，我們用一個tractable的下限來逼近，從而得到最優的 $alpha$ 和 $beta$ 。我們可以用Expectation-Maximization演算法來實現：

(E-step) 對每一個document，找到最優的 ${gamma^*_d, phi^*_d:d in D}$ 。
(M-step) 最大化那個log likelihood的下限，從而得到最優的 $alpha$ 和 $beta$ 。

當字典V太大時，我們會有嚴重的sparsity的問題，所以我們需要用到一些smoothing的技巧，za

優劣：LDA能夠方便的應用於未見的文章上，但是它並沒有考慮文章中單詞（word）的順序。

評論中@mafia提到，「LDA不太適合短文本吧」，這是一個非常好的補充！LDA在段文本上的效果的確沒有長文本好，除非短文本全部都是關鍵字哈！

應用：topic modelling, discrete feature clustering.