What are the highlights worth watching at NIPS 2015?

11-25

The Optimization workshop that NIPS holds every year is still worth attending:
OPT 2015: Optimization for Machine Learning
Speakers include:
1. Jorge Nocedal http://users.iems.northwestern.edu/~nocedal/
2. Guanghui Lan http://www.ise.ufl.edu/glan/
3. Elad Hazan http://www.cs.princeton.edu/~ehazan/

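I happen to have been writing up summaries of the NIPS 2015 Deep Learning Symposium papers on my WeChat public account, so I will simply repost them here.

The conclusion first. The papers I recommend are: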

《Character-aware Neural Language Models》. Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush.

《A Neural Algorithm Of Artistic Style》. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge.

《Skip-thought vectors》. Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, et al.

《Teaching machines to read and comprehend》. Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, et al.

《Visualizing and understanding recurrent networks》. Andrej Karpathy, Justin Johnson, Li Fei-Fei.

《Spatial Transformer Networks》. Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu.

《Deep Generative Image Models Using A Laplacian Pyramid Of Adversarial Networks》. Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus.

《Early stopping is nonparametric variational inference》. Dougal Maclaurin, David Duvenaud, Ryan P. Adams.

《Dropout as a Bayesian approximation: Representing model uncertainty in deep learning》. Yarin Gal, Zoubin Ghahramani.

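The full notes are copied below (not only the recommended papers but the others as well). For better formatting, you can read the original posts on the public account: Essentials | NIPS 2015 Deep Learning Symposium (Part 1)

Essentials | NIPS 2015 Deep Learning Symposium (Part 2)

http://weixin.qq.com/r/AElSSi7E8HOPrWop9xwS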

Character-aware Neural Language Models

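This paper was previously posted on arXiv and has now been formally accepted to AAAI 2016. Recommendation: 5 stars. It is a very good paper combining NLP and DL, and it is dense with information. Their model has two parts: character-level input is fed to a CNN, and the CNN output is fed to an RNNLM, but the final prediction is still word-level.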

In this work, we propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as an input to a recurrent neural network language model (RNNLM).

Only one convolution + pooling (max-over-time) layer is used, and the authors report that stacking several conv+pooling layers (as in CNNs for sentence modeling) did not improve results. Whereas a conventional NLM takes word embeddings as inputs, our model instead takes the output from a single-layer character-level CNN with max-over-time pooling. Besides the question of stacking, a second difference from sentence-modeling CNNs is that the convolution here is narrow rather than wide.
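To make "narrow convolution plus max-over-time pooling" concrete, here is a minimal sketch in plain Python. The function name `char_cnn_word_vector`, the toy embeddings, and the choice of tanh are illustrative assumptions, not the authors' code; the real model also pads words with start/end symbols, which is omitted here.

```python
import math


def char_cnn_word_vector(word, char_emb, filters, biases):
    """Embed one word with a single narrow convolution and max-over-time pooling.

    char_emb: dict mapping a character to a list of d floats (hypothetical toy table).
    filters:  list of filters; each filter is a list of w rows with d weights each.
    biases:   one float per filter.
    """
    chars = [char_emb[c] for c in word]
    features = []
    for filt, bias in zip(filters, biases):
        width = len(filt)
        if len(chars) < width:
            raise ValueError("word is shorter than the filter width")
        responses = []
        # Narrow convolution: the filter only visits positions where it fits
        # entirely inside the word, so there are len(word) - width + 1 outputs.
        for start in range(len(chars) - width + 1):
            total = bias
            for k in range(width):
                total += sum(f * c for f, c in zip(filt[k], chars[start + k]))
            responses.append(math.tanh(total))
        # Max-over-time pooling keeps a single number per filter,
        # so the word vector has one dimension per filter.
        features.append(max(responses))
    return features
```

Each filter behaves like a detector for one character n-gram, and the pooled vector is what would be handed to the next layer of the language model.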

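The unusual part is that the pooling output is not fed directly into the LSTM; it first passes through a module adapted from Highway Networks (HW-Net). The authors verify experimentally that results get worse without this module. The HW-Net module only helps the character-level model; for word-level input it brings no improvement. HW-Net acts as another nonlinear hidden layer, playing a role similar to the MLP (multilayer perceptron) in other models: strengthening the interaction between features. In this model, because conv+pooling is only a single layer, not much interaction is modeled, hence this extra interaction layer. But the authors also say they tried stacked CNNs without improvement, so can we infer that the interaction provided by highway networks differs from that of stacked CNNs?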

Similar to the adaptive memory cells in LSTM networks, HW-Net allows for training of deep networks by adaptively carrying some dimensions of the input directly to the output.

Applying HW-Net to the CharCNN has the following interpretation: since each output is essentially detecting a character n-gram (where n equals the width of the filter), HW-Net allows some character n-grams to be combined to build new features (dimensions where transform ≈ 1), while allowing other character n-grams to remain "as-is" (dimensions where carry ≈ 1).
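The transform/carry behaviour described in the quote can be sketched as a single highway layer. This is a minimal illustration assuming the usual formulation y = t * g(W_h x + b_h) + (1 - t) * x with a sigmoid transform gate t and a ReLU for g; the helper names are hypothetical and the weights are plain nested lists.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: mix a nonlinear transform of x with x itself."""
    # Candidate features built from combinations of the inputs (ReLU here).
    h = [max(0.0, v + b) for v, b in zip(matvec(w_h, x), b_h)]
    # Transform gate in (0, 1); the carry gate is 1 - t.
    t = [sigmoid(v + b) for v, b in zip(matvec(w_t, x), b_t)]
    # Where t is near 1 the new feature is used; where t is near 0
    # the input dimension is carried through unchanged.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

With a strongly negative gate bias the layer reduces to the identity, which is the "as-is" case in the quote; with a strongly positive bias it behaves like an ordinary nonlinear layer.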

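Finally, in the experimental conclusions, the authors state that the highway layer is very important for character-aware compositional models and unimportant for word-level ones. Whether or not it is added shows up clearly in which words end up as neighbors of a given word in the learned representation space (see Table 5). The result is strikingly obvious.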

Before the highway layers the representations seem to solely rely on surface forms—for example the nearest neighbors of you are your, young, four, youth, which are close to you in terms of edit distance. The highway layers however, seem to enable encoding of semantic features that are not discernable from orthography alone. After highway layers the nearest neighbor of you is we, which is orthographically distinct from you. Another example is while and though— these words are far apart edit distance-wise yet the composition model is able to place them near each other.

The reason HW-Net does not help at the word level is: dimensions of word embeddings do not (a priori) encode features that benefit from nonlinear, hierarchical composition availed by highway layers. Finally, the Related Work section of this paper is also well worth reading.

Character-level Convolutional Networks for Text Classification

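This paper's predecessor is 《Text Understanding from Scratch》, whose experimental results were so outstanding that it caused a stir and drew excessive attention on Weibo at the time. It was eventually found that the test and training sets overlapped heavily, and the paper was temporarily withdrawn. After revision it was accepted to NIPS'15. The style of this paper feels very un-NLP: from the terminology to the modeling approach to the writing, it reads like pure Deep Learning people taking on an NLP task.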

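The model in the paper is built entirely on ConvNets. In the authors' own words, "This article is the first to apply ConvNets only on characters". It is very conventional: temporal convolution, temporal max-pooling (max-over-time), and dropout among the last three fully connected layers. Note that the nonlinear function they use is the rectifier/thresholding h(x) = max{0, x}, which makes the layers resemble ReLUs.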

In terms of details, the model is very "simple" and does no variable-length convolution handling. That is, once an input text chunk exceeds their predefined length, the remaining characters are discarded. The input is a fixed-length character embedding sequence. Another detail: although they do not use a recurrent model such as an RNN for encoding/decoding, they still reverse the input: "The character quantization order is backward so that the latest reading on characters is always placed near the begin of the output, making it easy for fully connected layers to associate weights with the latest reading".
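The fixed-length, backward-ordered input can be sketched as follows. This is a hedged illustration: the function name `quantize`, the one-hot encoding, the lowercasing, and the placement of zero padding at the end are assumptions made for the example, not details confirmed by the notes above.

```python
def quantize(text, alphabet, length):
    """Turn text into a fixed-length list of one-hot rows, in backward order."""
    index = {char: i for i, char in enumerate(alphabet)}
    # Characters beyond the fixed length are simply dropped,
    # then the kept characters are reversed.
    kept = list(text.lower()[:length])[::-1]
    frames = []
    for char in kept:
        row = [0] * len(alphabet)
        if char in index:
            row[index[char]] = 1  # characters outside the alphabet stay all-zero
        frames.append(row)
    # Pad short inputs with all-zero rows so every input has the same shape.
    while len(frames) < length:
        frames.append([0] * len(alphabet))
    return frames
```

For example, `quantize("abcd", "abc", 3)` keeps only "abc" and encodes it as "c", "b", "a".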

For the experimental setup they considered distinguishing upper and lower case, and found that distinguishing them is often worse than not. We report experiments on this choice and observed that it usually (but not always) gives worse results when such distinction is made. One possible explanation might be that semantics do not change with different letter cases, therefore there is a benefit of regularization. They also use data augmentation, replacing words with synonyms to create additional training examples. The details are in Section 2.4.

A Neural Algorithm Of Artistic Style

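This is probably the most famous paper at this Symposium. The work is known as neural art: it uses Deep Neural Networks to render works in a specific style (non-photorealistic rendering). For example, the high-contrast, clearly visible brushstroke style of Van Gogh's 《Starry Night》 can be rendered onto all kinds of landscape photographs.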

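The work is mainly based on CNNs. Its core is to model separate representations for the content and the style of an image, and the core of the core is the modeling of style. Because of the CNN hierarchy, in content modeling the higher-level a content representation is, the more general it tends to be and the harder it is to reconstruct from; on the other hand, the style representation at higher levels is less likely to be "misled" by the local information of the "image content".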

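Because content and style are hard to separate completely, the authors take this into account in the network design. The style representation is not built on the content representation of one particular layer; instead, there is a style representation at every layer of the CNN. Style is modeled by exploiting "invariance". The underlying assumption is that no matter what you are painting, or which local region, your style should be preserved locally; the features that do not change are the style. So the style representation is built from the correlations between the filter responses within each layer.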
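The "correlation between filters" idea is usually written as a Gram matrix over the filter responses of one layer. The sketch below assumes each filter response has been flattened into a list of M values; the function names are hypothetical, and the normalization 1 / (4 N² M²) follows the per-layer style loss as I recall it from the paper.

```python
def gram_matrix(feature_maps):
    """feature_maps: N filter responses, each flattened to M spatial values."""
    n = len(feature_maps)
    return [
        [sum(a * b for a, b in zip(feature_maps[i], feature_maps[j])) for j in range(n)]
        for i in range(n)
    ]


def style_loss(generated_maps, reference_maps):
    """Squared difference between the two Gram matrices for one layer."""
    g = gram_matrix(generated_maps)
    a = gram_matrix(reference_maps)
    n = len(generated_maps)
    m = len(generated_maps[0])
    squared = sum((g[i][j] - a[i][j]) ** 2 for i in range(n) for j in range(n))
    return squared / (4.0 * n ** 2 * m ** 2)
```

Because the Gram matrix sums over spatial positions, shuffling where things appear in the image does not change it, which matches the intuition that style is whatever stays the same regardless of what is drawn where.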

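There is already a lot of open-source code and many online applications for this work, and it has been integrated into various NN frameworks. If you are interested, go find one and play with it.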

Listen, attend and spell

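This paper is actually quite simple. The core idea is to use a listener-speller encoder-decoder architecture for speech recognition (speech utterances -> characters). The listener (encoder) uses a pyramidal RNN, which is somewhat unusual; the authors argue that the pyramidal RNN speeds things up significantly for this task.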

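The pyramidal RNN is in essence a hierarchical Bi-LSTM (pBLSTM). As in a CNN, the high-level features (at the top of the "pyramid") are fewer and more "condensed". Passing these fewer features to the decoder reduces the time the decoder spends processing them, improves its ability to extract the relevant information, and speeds up inference of the encoder-decoder as a whole.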
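The length reduction in the pyramid can be illustrated without any recurrent machinery. The sketch below shows only the step where each layer concatenates pairs of consecutive frames before passing them up; the BLSTM that would process each concatenated frame is deliberately left out, and dropping a trailing odd frame is an assumption of this example.

```python
def pyramid_reduce(frames):
    """Concatenate each pair of consecutive frames, halving the sequence length."""
    if len(frames) % 2 == 1:
        frames = frames[:-1]  # assumption: drop a trailing unpaired frame
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


def pyramid_lengths(num_frames, num_layers):
    """Sequence length seen by the decoder after each pyramid layer."""
    lengths = []
    for _ in range(num_layers):
        num_frames //= 2
        lengths.append(num_frames)
    return lengths
```

With three pyramid layers, 800 input frames shrink to 100 time steps, so the attention in the decoder has eight times fewer positions to look over.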

On the other side, the speller uses an attention mechanism, whose benefit is clear: it prevents overfitting.

Without the attention mechanism, the model overfits the training data significantly, in spite of our large training set of three million utterances - it memorizes the training transcripts without paying attention to the acoustics. Without the pyramid structure in the encoder side, our model converges too slowly - even after a month of training, the error rates were significantly higher than the errors we report here.

Skip-Thought Vectors

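Already a fairly well-known work. The model is an RNN-RNN encoder-decoder; both RNNs use GRUs to "approximate" the stronger performance of LSTMs. The encoder is a single RNN; the decoder stage has two (one for the previous sentence and one for the next sentence, which also means it cannot predict several preceding or following sentences).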

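Such a model keeps an encoding for each sentence. This encoding is very useful; it is called the skip-thought vector and is used as a feature extractor for downstream tasks. Note the so-called unattached arrows in Figure 1: in the decoder stage, they correspond to the constraint that each word's probability is conditioned on the previous word and the previous hidden state. Also, because the decoder is an RNN, it can be used for generation (some examples are given at the end of the paper).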

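Another contribution of this paper is vocabulary mapping. Because of the complexity of the RNN, and because the authors still want to learn word embeddings jointly, they make a trade-off: learn word embeddings for part of the vocabulary (words in the training vocabulary), and for words that never appear, learn a mapping from word embeddings pre-trained with word2vec. The idea follows the 2013 paper by Mikolov et al. on word similarity for MT: learn the mapping with an un-regularized L2 loss.

In the experiments, they use this method to expand a 20K vocabulary to 930K.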

In our experiments we consider 8 tasks: semantic-relatedness, paraphrase detection, image-sentence ranking and 5 standard classification benchmarks. In these experiments, we extract skip-thought vectors and train linear models to evaluate the representations directly, without any additional fine-tuning. As it turns out, skip-thoughts yield generic representations that perform robustly across all tasks considered.

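First, they have three kinds of feature vectors: uni-skip, bi-skip, and combine-skip, corresponding to a unidirectional encoder, a bidirectional encoder, and the combination of the two. In the paper, uni-skip and bi-skip are each 2400-dimensional, and combine-skip, their concatenation, is 4800-dimensional. Different tasks may use different feature indicators; for example, for two skip-thought vectors u and v, the component-wise product and the absolute difference are taken as two features and fed to a linear classifier (logistic regression).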
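As a small illustration of how a pair of sentence vectors becomes classifier input, the sketch below builds the two pair features as I recall them from the paper: the component-wise product and the absolute difference, concatenated. The function name `pair_features` is hypothetical, and the downstream linear classifier is not shown.

```python
def pair_features(u, v):
    """Combine two sentence vectors into one feature vector for a linear model."""
    if len(u) != len(v):
        raise ValueError("both vectors must have the same dimension")
    product = [a * b for a, b in zip(u, v)]         # where the two vectors agree
    abs_diff = [abs(a - b) for a, b in zip(u, v)]   # how far apart they are
    return product + abs_diff
```

Both features are symmetric in u and v, so swapping the two sentences gives the same input, which suits tasks such as paraphrase detection and semantic relatedness.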

Ask me anything: Dynamic memory networks for natural language processing

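This paper was also posted on arXiv quite early, and it was already being cited in ACL 2015 papers. It comes from Richard Socher's MetaMind team. The main idea is to use a dynamic memory network (DMN) framework for QA (and even for Understanding Natural Language).

The framework consists of several modules and can be trained end-to-end. The core module is the Episodic Memory module, which performs iterative semantic + reasoning processing. DMN first takes the raw input (the question) from the input module, generates a question representation, and sends it to the semantic memory module; the semantic module then passes the question representation together with an explicit knowledge base (only envisioned) to the core Episodic Memory module. The Episodic Memory module first retrieves the facts and concepts involved in the question, then reasons step by step to obtain an answer representation. Because several facts and questions may be involved, an attention mechanism is also used here. Finally, the Answer Module uses the answer representation it receives to generate an actual answer.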

Teaching machines to read and comprehend

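This paper has two main contributions: one is the use and improvement of attention-based models, the other is the construction of a supervised, document-query based dataset. Although it is described as being for machine comprehension, it still does not go beyond the scope of QA: given a document and a query (a document-query pair), answer with an entity. I will not cover the dataset here. Let us look at their attention-based models.

The paper proposes three new models in total, of which only the last two ((a) and (b) in the figure) are attention-based. The input is always a document-query pair. The authors try two mechanisms: in one, the document is fed in short segments (split at punctuation), with one short segment plus the query counting as one input; in the other, the whole document is fed in first and then the query, an approach considered to lose the prompting role of the query.

The two attention-based models are (a) the Attentive Reader and (b) the Impatient Reader. (a) is very easy to understand: the left side of (a) is just the standard attention mechanism structure.

Now look at (b), the Impatient Reader. This model is interesting, and somewhat similar to the non-NLP paper I want to discuss below. Here is my reading: the model emphasizes "rereading". Each query has many tokens; the query is fed in token by token, and for every query token (no longer for every query) the document is read once more; then the next token, and another pass: reread.

I see this reread mechanism as a process of "gradually" acquiring (understanding) the document, just as when we read a difficult article: if one pass is not enough, we read it twice, three times. The motivation is good, but if it is only used to predict a single token (the answer), I do not think it serves that motivation. Just my personal view.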

Towards AI-complete question answering: A set of prerequisite toy tasks

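Like 《Ask Me Anything》, this work was also posted on arXiv very early. The dataset is cited not only by 《Ask Me Anything》 but also by many ACL 2015 papers and later QA work. The paper mainly introduces their AI-related QA dataset; because it was made by the Facebook team and has 20 types of questions, the dataset later came to be abbreviated as FB20.

According to the work from Jianfeng Gao's team submitted to ICLR 2016, which I introduced last time, the hardest of these 20 question types are the positional reasoning and path finding tasks.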

We achieve near-perfect accuracy on all categories, including positional reasoning and pathfinding that have proved difficult for all previous approaches due to the special two-dimensional relationships identified from this study.

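If you want to learn about work on this dataset, see the following papers:

1. 《Learning Answer-Entailing Structures for Machine Comprehension》. Mrinmaya Sachan, Kumar Dubey, Eric Xing, Matthew Richardson. ACL 2015. From CMU, Prof. Eric Xing's group. This paper is not NN-based, and the math is fairly simple. I think it has two highlights: it assumes an intermediate hypothesis, and on the math side it brings in multi-task learning and uses a feature map technique to "reduce" the multi-task problem back to the original problem. They first use the Question and Answer to learn a hypothesis. This hypothesis is a kind of latent variable, and can also be viewed as an embedded fact, if we take question + answer to jointly describe a fact/truth/event. Based on this hypothesis, they then match the relevant words in the original paragraph/text. See Figure 1 for details. I find this quite interesting, because it reminds me of encoding and decoding. The Question + Answer combination is one expression of the doc, and the doc itself is another expression. These two expressions are the results of two representations, so what is the real thing in between? What is the so-called complete information? A hypothesis built by combining them directly like this must also have lost some information. I think Machine Translation/Conversation work is doing something similar now: do not map one-to-one directly; have an invisible "hypothesis" in the middle. As for the second point, multi-task: they use the 20 FB20 categories to split the task into 20 subtasks, which turns it into a multi-task problem. Then they use the feature map technique (Evgeniou 2004) to convert the multi-task problem back into the original problem. I also find this quite interesting. Of course there are already many solutions for multi-task learning; this is just a simple and effective one that suits their model.

2. 《Machine Comprehension with Discourse Relations》. Karthik Narasimhan and Regina Barzilay. ACL 2015. From MIT CSAIL. Open source. A very neat paper, and not NN-based. Its selling point is discourse information + less human annotation, so their model can use discourse relations (relations between sentences, learned, not annotated) to improve machine comprehension performance. Specifically, they first use parsing and other methods to select the single sentence most relevant to the question (Model 1) or several sentences (Model 2 and Model 3), build relations in the process, and finally predict. The ideas are the simplest ones of discriminative models: find hidden variables and multiply probabilities. If you are interested in this paper, I recommend Section 3.1, which discusses the four categories of features they consider potentially relevant for this task.

3. 《Reasoning in Vector Space: An Exploratory Study of Question Answering》. In submission to ICLR 2016. The paper comes from the Microsoft team of Jianfeng Gao and Xiaodong He. It is a fairly detailed analysis of, and piece of work on, the Facebook 20 tasks (FB20). The analysis refers to the fact that previous reasoning work on FB20 was basically end-to-end, so error cases could not be analyzed clearly: it was unclear whether the underlying semantics were modeled poorly or the reasoning process itself went wrong. To further improve performance on these tasks, the authors break the end-to-end pipeline apart and use tensor product representations (TPR), combined with some common-sense inference (for example, east and west are opposite directions), raising accuracy on FB20 to an almost perfect level.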

Visualizing and understanding recurrent networks

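The author is Andrej Karpathy, the Stanford student who wrote the blog post 《The Unreasonable Effectiveness of Recurrent Neural Networks》. Andrej Karpathy is also a star student of Prof. Fei-Fei Li.

This work was submitted to arXiv a few months ago, was updated a few days ago, and was submitted to ICLR 2016; in content it is an extension of the blog post. Through controlled experiments combined with visualization, it "quantitatively" shows why char-LSTMs are powerful and whether they can really model long-term dependencies, as is often cited/claimed. Its conclusion agrees with Yoav Goldberg's earlier article clarifying what is surprising about char-LSTMs: the strength of a char-LSTM lies not in generating char sequences that look decent, but in its ability to retrieve clearly long-distance information such as brackets and quotes.

Through visualization of cell activations, gate activation statistics, and error type/case analysis, they show that many LSTM cells do "correspond to" and are "responsible for" certain character positions, and that the LSTM does greatly reduce error cases on long-distance information such as brackets and quotes compared with an n-gram character language model.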

End-to-end memory networks

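This paper and Neural Turing Machine are in fact precursors of many works with similar ideas; next time I will compare the related ones together. The motivation of this line of work is how to use a large body of external memory in Neural Networks.

In this work, they explore several options. First, single-layer or multi-layer; second, how to transform the feature space. If the output of such an end-to-end memory network is split into two kinds, it can be mapped onto a typical RNN: dividing the output into internal output and external output, these correspond respectively to the memory and the predicted label in an RNN.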

Grid Long-Short Term Memory

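Overall, the contribution of this paper is a framework that is more flexible and has higher computational capability. To understand the paper, you first need to understand three concepts: grid/block, stacked, and depth. (1) A Grid/Block is a component obtained by modifying the LSTM mechanism. This component can be multi-dimensional, which determines along how many directions it propagates. Every dimension has memory and hidden cells. A 1-dimensional Grid LSTM is very similar to the Highway Networks mentioned above. (2) Stacked means the same as in stacked LSTMs: connecting outputs to inputs. But stacking does not change the dimension of the Grid LSTM. A stacked 2D Grid LSTM is still 2D, not 3D. Visually, it simply tiles the blocks/squares across space (extending along every dimension). (3) Depth, by contrast, does add a dimension. Going deep inside a block means adding layers. A block made up of several layers is a Grid LSTM that many layers deep.

With only 1D/2D, Grid LSTM shows no particularly large advantage. But in the multidimensional case, it handles the vanishing gradient problem better than the traditional multidimensional LSTM. The reason is that the traditional multidimensional LSTM, when computing the memory cell of each layer, pools together the gate information of every dimension, which is clearly problematic. Grid LSTM does not do this: it computes the memory cell separately for each dimension. Each grid block has N incoming memory cells and hidden cells, and N outgoing memory cells and hidden cells, where N is the number of dimensions. What the Grid LSTM shares is one large hidden vector H. This preserves both interaction and information flow.

The paper also has some interesting applications later on, recasting the MT task as a 3D Grid LSTM problem, in which two dimensions are bidirectional (forward and backward reading and writing, as in a bi-LSTM) and the third dimension is depth. The results are quite good.

Perhaps the value of the framework proposed in this paper is that it makes LSTM variants a bit more systematic; as for how much performance it really gains, I remain rather skeptical.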

Spatial Transformer Networks

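Work from Google DeepMind. The main point is that although CNNs have always been claimed to do spatially invariant feature extraction, this invariance is very limited. First, max-pooling in a CNN operates only over a very small, rigid region (2×2 pixels); second, even when stacked, the network has to be very deep to get invariant features over a larger region; third, compared with attention, which can only extract the relevant features, what we need are features with wider coverage that are more canonical. To this end they propose a new, fully self-contained transformation module that can be inserted anywhere in a network to extract invariant image features flexibly and efficiently.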

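Concretely, the module is called the Spatial Transformer and consists of three parts: the Localization Network, the Grid generator, and the Sampler. The Localization Network is very flexible; it can be seen as a very general network that takes a feature map and produces the parameters of the transformation to apply to that map. It is therefore not restricted to a particular kind of network, but it must end with a regression layer, because the transformation parameters have to be output to the next part, the Grid generator. The Grid generator is arguably the core of the Spatial Transformer. It generates a kind of "mask" used to "cut out" part of the image (Photoshop style). The Grid generator defines the transformer function, and this function determines whether invariant features can be extracted well. A regular grid is like a square, untilted mask; an affine grid lets the mask be "warped", so that features consistent with that warp are extracted. In this work, just six parameters cover cropping, translation, rotation, scale, and skew, which is quite powerful. The final Sampler is easy to understand: it is what actually cuts the "image" out.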
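To see how six parameters can cover those transformations, here is a minimal sketch of an affine grid generator. It assumes coordinates normalized to [-1, 1] and a 2×3 matrix `theta` as the six parameters; the function name is hypothetical, and the Sampler (for example bilinear interpolation at the returned coordinates) is not included.

```python
def affine_grid(theta, height, width):
    """Map every output pixel to a source coordinate (x_s, y_s).

    theta: 2x3 nested list [[a, b, t_x], [c, d, t_y]], the six parameters.
    height, width: output grid size, each at least 2.
    """
    grid = []
    for i in range(height):
        y_t = -1.0 + 2.0 * i / (height - 1)  # normalized target row
        row = []
        for j in range(width):
            x_t = -1.0 + 2.0 * j / (width - 1)  # normalized target column
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid
```

The identity matrix `[[1, 0, 0], [0, 1, 0]]` gives the regular grid; scaling the diagonal to 0.5 samples only the central region (a crop/zoom), and the last column shifts the window (translation).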

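This work has many advantages: (1) it is a self-contained module that can be added anywhere in a network, in any number, without changing the original network; (2) it is differentiable, so it can be trained end-to-end directly; (3) the differentiable module is simple and fast, so it does not slow down the original network; (4) compared with pooling and attention mechanisms, the invariant features it extracts are more general.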

Semi-Supervised Learning with Ladder Networks

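This paper does not have many novel ideas; it mainly changes Ladder Networks from a purely unsupervised fashion to a semi-supervised one. Ladder Networks essentially add skip-connections between each layer of a stacked autoencoder and the corresponding decoded reconstruction, so it is as if there were a ladder between the encoder and decoder, hence the name. The modification in this paper is that, on top of the Ladder Network, Gaussian noise is added to every layer of the encoder, while the decoder is kept noise-free. The noisy part is used for the unsupervised autoencoder loss, and the noise-free part provides the supervised loss.

But the experimental results of this paper are truly outstanding. On MNIST it reaches an extremely low error rate of 1.13%. This supports, to some extent, the improvements from semi-supervised learning. However, for now this kind of semi-supervised learning does not seem to be applied quite properly, because in this work the validation set still uses all 10K labels rather than a small number of labels. On this point, I personally think it is a bit like cheating.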

Neural Turing Machines

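Neural Turing Machines (NTM) is probably the best-known work in the whole DL Symposium. There are five or six works related to it (for example, two others in this same Symposium, 《Large-scale simple question answering with memory networks》 and 《End-to-end memory networks》); I will write about them separately when I get the chance. Here I only cover the original NTM. As I understand it, NTM has two main motivations: (1) although neural networks provide good hidden-unit computation for modeling internal memory, in real life we sometimes need the assistance of, and interaction with, external memory (these are two different things: memory networks, which are very similar to NTM, only have the assistance but not the interaction, whereas NTM has the interaction); (2) RNNs, as an excellent kind of neural network, are in fact Turing-complete (this has been proven). If so, can one be designed as a Turing machine? These two aims led to NTM.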

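An NTM consists of a Controller, Read+Write Heads, and External Memory; the Controller is just an NN. In other words, what an NTM has beyond an ordinary NN is the read/write heads and the interaction with external storage (memory networks have no read/write heads). As I see it, if the Controller of an NTM is compared to a computer's CPU, then the memory is the computer's RAM and the hidden states are the CPU's registers. The Read+Write Heads are very important. They implement content-based/location-based operations and can thereby mimic the effect of Focus/Attention, so content addressing can be used to look up similar data (content-based). After content addressing, interpolation provides a gating mechanism, and the convolutional shift provides location-based addressing. With these modules, an NTM can emulate a Turing machine and implement some algorithms. Moreover, the NTM is end-to-end differentiable.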
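The three addressing steps named above (content addressing, interpolation, convolutional shift) can be sketched in a few lines. This is a simplified illustration with hypothetical function names: the key strength `beta`, the gate `g`, and the shift distribution are passed in by hand rather than emitted by a controller, and the final sharpening step of the original model is omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def content_weights(memory, key, beta):
    """Softmax over cosine similarity between the key and each memory row."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(w_content, w_previous, g):
    """Gate between the new content-based weights and the previous weights."""
    return [g * c + (1.0 - g) * p for c, p in zip(w_content, w_previous)]


def shift(weights, shift_probs):
    """Circular convolution; shift_probs maps an offset to its probability."""
    n = len(weights)
    return [
        sum(weights[(i - offset) % n] * p for offset, p in shift_probs.items())
        for i in range(n)
    ]
```

A head that wants to "find the row that looks like this key, then step one slot to the right" would use a large `beta`, `g` close to 1, and a shift distribution concentrated on offset +1. Every step is a smooth function of its inputs, which is what keeps the whole machine differentiable.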

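From NTM's two motivations follow its two goals: (1) NTM is meant to enhance the learning ability of RNNs, so it should be able to solve problems as RNNs do; (2) NTM emulates a Turing machine, so can it learn internal algorithms? Based on these two goals, the work designs many tasks, such as copy and priority sort, and compares three architectures side by side: NTM with LSTM, NTM with feedforward, and standard LSTM.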

Deep Generative Image Models Using A Laplacian Pyramid Of Adversarial Networks

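Although this work is not as widely known, it has already been widely cited and improved upon. It is again a collaboration between NYU and the Facebook AI team (many of the papers selected for this DL Symposium come from them). In spirit it is similar to Google DeepMind's DRAW, which I have recommended many times before: when generating an image, do not force the model to get there in one step; let it proceed step by step.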

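The model in this work is called Laplacian Generative Adversarial Networks (LAPGAN), and consists of conditional GANs and a Laplacian pyramid structure. As for the former, a conditional GAN is a modification of the GAN, and a GAN is a framework made up of a generative model (G) that produces samples and a discriminative model (D) that distinguishes the samples produced by G from real training data. A conditional GAN adds additional information on top of this, such as the sample class/label. As for the latter, the Laplacian pyramid is a hierarchical image representation that mainly captures the differences between different scales of an image. The exact formulas are in Equations (3)-(4). This work combines the two, turning the GAN into a hierarchical, multi-scale framework.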
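The Laplacian pyramid itself is easy to sketch on a 1D signal. The code below is a simplified stand-in for the image version: it assumes pairwise averaging for downsampling and value repetition for upsampling (the paper uses blur-and-decimate and a smoothing upsampler), and it needs the signal length to be divisible by 2 raised to the number of levels. The function names are hypothetical.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the length by repeating each value."""
    return [value for value in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return [residual_0, ..., residual_{levels-1}, coarsest_signal]."""
    pyramid = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        # Each level stores only what the coarser scale cannot explain.
        pyramid.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Start from the coarsest signal and add the residuals back, coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = [a + b for a, b in zip(upsample(current), residual)]
    return current
```

In a LAPGAN-style setup, my understanding is that a separate conditional generator at each level would be asked to produce the residual given the upsampled coarser image, so `reconstruct` mirrors the sampling procedure with generated residuals in place of the stored ones.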

As I understand it, LAPGAN has two benefits: (1) it is unsupervised, which is the strength of GANs; training can proceed step by step with adversarial networks directly from the finest/highest-scale level of the image; (2) like DRAW, the core idea of LAPGAN is to "decompose" the generation process into a gradual "refinement", which reduces how much the network has to memorize at each step and thus improves the network's capacity and scalability. Conversely, such a network also has a drawback: it gives up the global features and representation of the image and has no probability for an image as a whole, so evaluation requires some special techniques (such as the Gaussian Parzen window used in this paper).

Breaking the generation into successive refinements is the key idea in this work. Note that we give up any "global" notion of fidelity; we never make any attempt to train a network to discriminate between the output of a cascade and a real image and instead focus on making each step plausible. Furthermore, the independent training of each pyramid level has the advantage that it is far more difficult for the model to memorize training examples – a hazard when high capacity deep networks are used.

Natural Neural Networks

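The motivation of this paper is also very fundamental: optimization methods based on point gradients, such as SGD, are increasingly inadequate for ever more complex NN architectures. On the other hand, distribution gradient methods still leave much room for exploration. After all, the distribution can be tracked throughout optimization (see another paper covered today, 《Early stopping is nonparametric variational inference》). Computing the distribution gradient requires the KL divergence as a measure and the Fisher matrix. However, the Fisher matrix is very expensive to compute (the matrix is large and an inverse is involved, among other things), so past work that tried to use the Fisher matrix did not scale well.

Based on the idea that the distribution gradient can (perhaps) help improve convergence efficiency, this work studies the properties of the Fisher matrix. Through assumptions and experiments, it ends up designing an NN based on a particular Fisher matrix (with some constraints imposed on the Fisher matrix and some interactions ignored). Under this NN, their optimization algorithm is very similar to the better-known Mirror Descent.

I think a very direct contribution of this work is that some past NN tricks, such as batch normalization (before the non-linearity) and zero-mean activations, can be given some theoretical explanation within this framework. It counts as a step forward for theoretical Deep Learning.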

Early stopping is nonparametric variational inference

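I highly recommend this paper; it is an optimization-related work. The starting point is that besides optimizing the training loss, we can also optimize the marginal likelihood. This has many advantages. First, we no longer need tricks that rely on a validation set (such as early stopping); we can use a marginal likelihood estimator directly to evaluate performance.

So how is this achieved? The paper gives the optimization process a Bayesian interpretation: every step of optimization "generates" a distribution, so the whole optimization process produces a sequence of distributions. From a Bayesian point of view, this sequence can be seen as successive samples drawn from some true posterior distribution, and the number of samples N, that is, the number of optimization iterations, can be seen as a variational parameter. With this interpretation, the authors go on to explain the early stopping trick as optimization of a variational lower bound, and ensembling random initializations can be seen as ensembling various independent variational samples.

The above is the first contribution of the paper (and its title). Beyond that, the paper uses this interpretation to construct a marginal likelihood estimator, and uses the estimator to choose when to stop training, to select model capacity, and to select model hyperparameters.

I recommend this paper not because the optimization approach it gives is better than SGD and other methods that optimize the training loss, but for two reasons: (1) it raises many thoughts about optimization. For example, of the two "metrics", training loss and marginal likelihood, which should we "trust" more? Does a higher variational lower bound really mean a more accurate model? How should we interpret it when it disagrees with the validation error/test error? These questions are interesting. (2) I personally find the interpretation of the distribution sequence during optimization very useful. There is more and more work on variational sequence learning, but it is constrained by the optimization methods; this work is an inspiration there too.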

Dropout as a Bayesian approximation: Representing model uncertainty in deep learning

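This paper explains why dropout works from a Bayesian perspective. In 2013 others also tried to explain dropout, but from the angle of sparse regularization, which has certain limitations. This work is more general and more thought-provoking.

First, the authors show that dropout is theoretically equivalent to a Bayesian approximation of a Gaussian Process. The proof is very simple, and you can take a look. My feeling is that this interpretation is quite similar to dropout as noise regularization, since the approximation also introduces noise; their interpretation is just more mathematical.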

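Then, with this interpretation, model uncertainty can be obtained from an NN that uses dropout. This uncertainty is the authors' real motivation (and, of course, that of the Bayesian school). For example, today's NNs have a softmax layer that outputs a prediction, say a predicted label; the output is only the probability of that label and does not include how certain the network is about its own prediction. Imagine training a network only on images of dogs and then asking it to predict on images that are all cats: some of its predicted probabilities may well be fairly high, when the uncertainty should really be higher. Earlier frameworks could not properly provide this kind of uncertainty as output. Now, with the interpretation of dropout as approximation, this uncertainty can be obtained from the NN through moment-matching.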
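In practice the moment-matching amounts to keeping dropout switched on at prediction time, running several stochastic forward passes, and looking at the mean and spread of the outputs. The sketch below does this for a toy single linear unit; the function name and the toy model are assumptions for illustration, and the extra noise-precision term that the paper adds to the predictive variance is left out.

```python
import random


def mc_dropout_predict(x, weights, p_drop, n_samples, rng):
    """Estimate the predictive mean and variance from stochastic forward passes."""
    predictions = []
    for _ in range(n_samples):
        total = 0.0
        for w, xi in zip(weights, x):
            # Dropout stays active at test time: each input is dropped
            # with probability p_drop and rescaled otherwise.
            if rng.random() >= p_drop:
                total += w * xi / (1.0 - p_drop)
        predictions.append(total)
    mean = sum(predictions) / n_samples
    variance = sum((p - mean) ** 2 for p in predictions) / n_samples
    return mean, variance
```

A call such as `mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.5, 200, random.Random(0))` returns a mean together with a non-zero variance; the variance is the extra "how sure am I" signal that a single deterministic forward pass does not provide.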

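Once obtained, this uncertainty can be used in regression, classification, and even reinforcement learning tasks. Judging from the experimental results, adding uncertainty improves all of these tasks.

It also stands to reason that such a Bayesian interpretation helps improve model interpretability, which is another very strong motivation. Finally, if you find the paper dry, you can find the related slides on the author's homepage; they are very vivid. The homepage also has a blog post he wrote that lays out his motivation in detail. Two other papers in this DL Symposium, 《Stochastic backpropagation and approximate inference in deep generative models》 and 《Scalable Bayesian optimization using deep neural networks》, are also very similar to this work, so I will not introduce them separately.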

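I was about to take notes anyway, so I might as well post them on Zhihu (no reposting).
I went to every poster session and scanned all the papers I could more or less follow. Many of them left a deep impression on me, and I learned a lot from talking with the authors. I am not particularly interested in theory (though I do look at the rough conclusions), so below I list, by category, what I found most interesting:

Optimization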

Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent
UT Austin
This work exploits the sparsity of LP solutions to develop a fast algorithm for approximately solving LPs.

Statistics

A Linear-Time Particle Gibbs Sampler for Infinite Hidden Markov Models
University of Cambridge
Exploits the sparsity of the posterior distribution over hidden states to design an accelerated particle Gibbs sampler.

Estimating Mixture Models via Mixtures of Polynomials
Stanford
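A very elegant idea: treat the parameters of the mixture model as an empirical measure and recover the parameters by first estimating its moments, which to some extent avoids the identifiability problem of the traditional EM approach.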

Moment matching for LDA and discrete ICA
INRIA/ENS
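Arguably the second algorithm that uses moment matching to solve LDA, with better results than the previous one. Moment-based methods of this kind avoid the problems of traditional methods, which either depend too heavily on initialization (variational inference) or converge too slowly (MCMC).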

Extending Gossip Algorithms to Distributed Estimation of U-statistics
ParisTech
I also like this work a lot. Traditional gossip algorithms can only compute mean statistics; the authors extend them to U-statistics. Worth a look.

Fast and Accurate Inference of Plackett–Luce Models
EPFL
The PL model is widely used in recommender systems, and the method in this paper applies to more flexible ranking tuple data.

Learning

Learning with Group Invariant Features: A Kernel Perspective
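Automatically learns an action-invariant kernel through customized actions, though the authors do not answer how the template function should be designed for a specific problem (Gaussian sampling is used for now).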

Principal Differences Analysis: Interpretable Characterization of Differences between Distributions
MIT
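A new framework that can be used for feature selection, with much better results than traditional methods; well worth a look. The drawback is that the optimization problem is non-convex. After talking with the authors, I found that this work could inspire many very interesting extensions.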

Distributionally Robust Logistic Regression
EPFL
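This work feels like moving a great weight with little force: something that starts out especially complicated turns into something especially simple once the pieces are combined, which really surprised me. But I still need to check the derivation.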

End-to-end Learning of Latent Dirichlet Allocation by Mirror-Descent Back Propagation
Microsoft Research, Redmond
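This work treats mean field inference as a chain and optimizes the LDA parameters through BP, which feels a bit magical. But the model itself is somewhat too complicated. It is worth mentioning that this model is not the traditional generative LDA but a discriminative one, intended for classification.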

Learning with a Wasserstein Loss
MIT
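A very simple idea that works very well. Work like this is something you come across by luck rather than by searching. The authors give a generalization bound (though it does not seem to be of much practical use).

Neural networks
This year there were many works combining neural networks with other fields. Most of them take the strategy of replacing a linear operator in a traditional framework with a neural-network-based nonlinear operator. I am often somewhat blind to this kind of idea, so I did not pay much attention. There were also many works designing neural net architectures, which I personally do not care much about either.

Other papers to be added (once I go through the poster photos I took).
Finally, this year's Optimization workshop was really great; I learned a lot from the speakers' talks.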

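1. Compared with last year, many of the papers I saw this year had RNN among their keywords.
2. Attendance was close to 4,000 (almost double last year's). Every day's poster session was packed; in the time spent queuing to see a poster you could skim the whole paper.
3. Attention models are attracting more and more researchers; whether location-based, content-based, or a combination of the two, they produced dazzling results on vision and speech tasks.
4. If you are interested, look up the schedule on http://nips.cc. In each day's poster session, roughly the first 20 papers are more application-oriented (usually dominated by neural networks), and the later ones are more theoretical.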

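The top-voted answer offers rich deep learning content, so let me add something less academic. On the first day of the main conference, the two program chairs showed the breakdown by topic of the papers accepted to this NIPS. Deep learning ranked first, accounting for about 11% of the accepted papers (403), while among submissions (about 1,900) deep learning accounted for 9%. Given deep learning's outstanding performance in image recognition, speech recognition, and natural language processing in recent years, this share is high but not surprising. On the other hand, the remaining 89% are other kinds of topics, which shows that NIPS still covers many different research directions.

Below are some points that are not deep learning, or not entirely deep learning.

1. Probabilistic programming, a point strongly promoted by Zoubin Ghahramani in his lecture on probabilistic machine learning. It encourages turning the probability-based paradigm of the past few decades of machine learning research into tools. For example, the open-source Stan is trying to build a general-purpose inference engine: the user only needs to focus on building the model structure and hands the probabilistic inference problem over to the general engine to get the desired result. As this idea matures, it will make using, debugging, and analyzing all kinds of probabilistic models easier. Researchers can concentrate on the algorithms themselves (MCMC sampling, for example, is itself a very active research area), while industry users who care more about results can simply use the tool as a black box.

2. A machine that passes a visual Turing test. Joshua Tenenbaum, professor of cognitive science at MIT, presented in the Brains, Minds and Machines Symposium his group's paper published this year in Science, Human-level concept learning through probabilistic program induction. Roughly, the experiment provides a set of graphical symbols, has humans and a computer each draw similar symbols, and then asks another group of participants to tell which set was produced by the computer. The result was that more than 3/4 of the participants could not tell, which to some extent means the computer passed a Turing test.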

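3. Deep reinforcement learning. In the Brains, Minds and Machines Symposium, Google DeepMind co-founder Demis Hassabis presented the paper they published in Nature earlier this year on training an AI to play Atari games (like the ones you played on a Famicom). Given only the game screen and the score as input, the computer has to "teach itself" to play. The same deep-learning-based model architecture was applied across a set of 49 different Atari 2600 games, reaching a level comparable to a professional human player. This touches on many interesting problems, including computer vision, machine learning, artificial intelligence, and optimal control. This year's NIPS had reinforcement learning in at least three different parts of the program, all of them very popular: from Richard Sutton's lecture, to the Symposium, to the workshop, every session was packed.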

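4. Another feeling from attending a conference like NIPS is being infinitely close to the big names in Euclidean space, even though in feature space we may still be a world apart. For example, on the last day of the conference Yoshua Bengio, wearing dark red jeans, stood in front of me in the buffet line, and I watched David Blei having a great time playing with a demo.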

Following "You Again" Schmidhuber around to watch the arguments, of course.

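There were quite a lot of LSTM papers this year. Here are two photos from the venue: one is the poster of Samy Bengio (Yoshua Bengio's brother), about neural networks; the other follows up on the picture someone posted above, a slide from Robert Tibshirani's invited talk. Something for all ML researchers to reflect on together.

This paper is quite interesting: 《A Neural Algorithm Of Artistic Style》. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

I previously took a course on non-photorealistic rendering and found stylization quite interesting, so when I saw this paper I paid a bit of attention to it. Their work is called neural art and can be regarded as combining DL and CG. Specifically, given a reference image, say Van Gogh's The Starry Night, its brushstrokes, colors, contrast, and other characteristics are rendered onto a photograph or picture to produce an image in the same style as the reference work, that is, stylization.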

The work can be found on http://arxiv.org:

Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images.

It uses a CNN; the key points are the two parts, Content Reconstructions and Style Reconstructions.

Experimental results: (figures not included)

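Many people are working in this area now; if you are interested, this paper is a good reference.

Addendum:
This morning I saw that a professor in a group chat had shared two slides from the NIPS venue, from Prof. Robert Tibshirani of Stanford.
On the differences between Statistics and Machine Learning: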

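Thanks to @熊風大V for the invite... I am not qualified to write about highlights T_T
But I heard a professor from Nanjing University say there were many newcomers this year. Does that count as a highlight?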

The ILSVRC competition results on the 10th