A Brief Analysis of the Differentiable Neural Computer (DNC), a Powerful RNN

CNN and RNN models perform well across a wide range of pattern-recognition tasks. RNNs in particular can handle sequence-modelling tasks such as translation, handwriting generation and speech recognition by encoding "word states" into memory vectors (hidden states). In practice, however, the LSTM, the most widely used RNN variant, requires a large amount of computation to handle long-range dependencies and still performs poorly on them. To address this, several neural networks with external memory mechanisms have been proposed; the best known is DeepMind's Differentiable Neural Computer (DNC).

By combining a trainable neural network controller with a readable and writable external memory, the Differentiable Neural Computer (DNC) is a hybrid learning architecture: it can learn algorithms and parameters like a neural network, while processing complex streams of data like a computer. In the paper published in Nature, the DNC is presented as a special recurrent neural network equipped with an (untrainable) external memory. At every time step t, the trainable controller exchanges information with the external memory based on the information flow from step t-1, and the final prediction is determined by a linear combination of the controller's output and the information read from memory.

The DNC stores memories as vectors: each row of the memory matrix corresponds to a different memory. Through the interface parameters, the controller drives one write head and several read heads (each read head is a linear combination of addressing mechanisms; the architecture places no constraint on the number of read heads) to interact with the external memory. In the memory matrix M ∈ ℝ^{N×W}, each row vector holds one memory, and the N rows bound how many memories the matrix can hold at once. At each time step, the DNC concatenates the read vectors from the previous step with the current external input to form the generalized external input (the counterpart of the per-step input of a conventional LSTM). After processing, the controller emits a hidden state, from which an output vector and an interface vector are generated. The interface vector drives the read and write heads through the read/write mechanisms to interact with the memory matrix: it produces the write information for this step, updates the matrix, and then obtains this step's read information. The read information is linearly combined with the output vector to produce the final output vector for this step.
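To make this per-step dataflow concrete, here is a minimal, self-contained Python sketch of one DNC step. It is an illustration only, not the paper's or the repository's implementation: the controller is a plain tanh RNN cell rather than an LSTM, the interface vector is reduced to a write vector plus a single write gate, and the read/write weightings are uniform placeholders instead of the content, allocation and temporal addressing described later.

import numpy as np

# Illustrative sizes and names only (hypothetical, not the repository's API):
X, H, N, W, R, Y = 6, 8, 4, 3, 2, 5   # input, hidden, memory rows, word width, read heads, output

rng = np.random.default_rng(0)
W_h = rng.normal(size=(H, X + R * W))   # controller weights (plain tanh RNN cell, not an LSTM)
W_y = rng.normal(size=(Y, H))           # hidden state -> output vector u_t
W_xi = rng.normal(size=(W + 1, H))      # hidden state -> toy interface vector (write vector + write gate)
W_r = rng.normal(size=(Y, R * W))       # read vectors -> contribution to the final output

def dnc_step(x_t, r_prev, M_prev):
    """One time step: controller -> interface -> write to memory -> read from memory -> final output."""
    chi = np.concatenate([x_t] + list(r_prev))            # generalized input [x_t; r_{t-1}^1; ...; r_{t-1}^R]
    h = np.tanh(W_h @ chi)                                 # controller hidden state
    u_t, xi = W_y @ h, W_xi @ h                            # output vector and toy interface vector
    v, write_gate = xi[:W], 1.0 / (1.0 + np.exp(-xi[W]))   # write vector and a single write gate in [0, 1]
    w_write = np.full(N, write_gate / N)                   # placeholder uniform write weighting over rows
    M = M_prev * (1.0 - np.outer(w_write, np.ones(W))) + np.outer(w_write, v)   # erase, then write
    r_t = [M.T @ np.full(N, 1.0 / N) for _ in range(R)]    # placeholder uniform read weightings
    y_t = u_t + W_r @ np.concatenate(r_t)                  # combine controller output with the read vectors
    return y_t, r_t, M

y, r, M = dnc_step(rng.normal(size=X), [np.zeros(W)] * R, np.zeros((N, W)))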

Updating the external memory matrix

M_t = M_{t-1} \circ (E - \mathbf{w}_t^w \mathbf{e}_t^\intercal) + \mathbf{w}_t^w \mathbf{v}_t^\intercal \\
\mathbf{r}_t^i = M_t^\intercal \mathbf{w}_t^{r,i}

Here E ∈ ℝ^{N×W} is the all-ones matrix; \mathbf{w}_t^w ∈ ℝ^N is the write head's normalized weighting distribution; \mathbf{e}_t ∈ ℝ^W is the erase vector, whose entries lie in [0, 1]; and \mathbf{v}_t ∈ ℝ^W is the write vector, i.e. the new memory content for this step. Note that the read/write weightings express relative strengths across rows of the memory matrix, not the memory vectors themselves. Reading the equation from left to right: erase first, then write. After the memory matrix has been updated, each read head i extracts its read vector \mathbf{r}_t^i from the new memory; these read vectors are linearly combined into this step's final output and are also fed back as part of the external input at the next step.
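As a concrete illustration of this erase-then-write update and the subsequent read, here is a NumPy sketch with made-up small dimensions (not the repository code):

import numpy as np

N, W = 4, 3                               # memory rows and word width (illustrative sizes)
M_prev = np.random.rand(N, W)             # M_{t-1}
w_write = np.array([0.7, 0.2, 0.1, 0.0])  # write weighting over rows (normalized)
e = np.random.rand(W)                     # erase vector, entries in [0, 1]
v = np.random.rand(W)                     # write vector, i.e. the new memory content

# Erase first, then write: M_t = M_{t-1} o (E - w^w e^T) + w^w v^T
E = np.ones((N, W))
M = M_prev * (E - np.outer(w_write, e)) + np.outer(w_write, v)

# A read head with weighting w^{r,i} then reads: r_t^i = M_t^T w^{r,i}
w_read = np.array([0.1, 0.6, 0.3, 0.0])
r = M.T @ w_read                          # read vector of width W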

Intuitively, the DNC published in Nature amounts to attaching an external memory to an LSTM to mitigate the LSTM's forgetting problem; the paper builds a very smooth, fully differentiable system for reading from and writing to that external memory. As shown in the figure below, at time step t the controller receives the input vector x_t, the read vectors from the previous step, and its own previous state (the LSTM hidden state). After processing, it emits two vectors: the output vector u_t and the interface vector \xi_t. u_t is linearly combined with the read vectors r_t to form the final output vector y_t, while \xi_t interacts with the memory matrix M through the memory addressing mechanisms to produce the new read vectors r_t.

By emitting the interface vector, the controller governs how the write weighting and the read weightings in the access module are updated, and thereby controls the memory matrix and obtains the external memory vectors read by the read heads. The write weighting is determined by content addressing based on cosine similarity together with a usage measure of each memory row derived from how recently and how often rows have been written and read (this usage information decides the order in which rows get overwritten by new memories). The read weightings are updated by cosine-similarity content addressing together with the relative order in which memory rows were written. As the figure below shows, at each time step t the controller takes the read vectors from step t-1 and the current external input as its input stream; after its computation it emits the interface vector, which updates the write head of the external memory, and the write head then performs the erase and write operations that update the memory (the green operations in the figure). The new read heads then read from the updated memory matrix to obtain the read vectors for step t. These read vectors and the controller output vector are mapped through a linear transformation to the output at step t, and the read vectors are also passed on to step t+1 as the new short-term memory. The final DNC output is thus a linear combination of the step-t read vectors and the controller output, much like the interplay of long-term and short-term memory in the Search of Associative Memory model of the human brain: the output at every time step is generated by a linear combination of short-term and long-term memory.

\mathbf{y}_t = \mathbf{u}_t + W_r [\mathbf{r}_t^1; \ldots; \mathbf{r}_t^R]
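A minimal NumPy sketch of this readout, with illustrative shapes and random placeholder values rather than trained weights:

import numpy as np

R, W, Y = 2, 3, 5                              # read heads, word width, output size (illustrative)
u_t = np.random.rand(Y)                        # controller output vector u_t
r_t = [np.random.rand(W) for _ in range(R)]    # read vectors r_t^1 ... r_t^R
W_r = np.random.rand(Y, R * W)                 # trainable readout matrix

y_t = u_t + W_r @ np.concatenate(r_t)          # y_t = u_t + W_r [r_t^1; ...; r_t^R]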

Code implementation

References:

Hybrid computing using a neural network with dynamic external memory

Mostafa-Samir/DNC-tensorflow

Appendix:

Controller

The paper uses a standard LSTM as the controller. Note that if you take the LSTM equations as commonly written in industry, drop the batch_size dimension and set hidden_num = 1, you obtain exactly the formulation below.

i_t^l = \sigma(W_i^l[\chi_t; h_{t-1}^l; h_t^{l-1}] + b_i^l) \\
f_t^l = \sigma(W_f^l[\chi_t; h_{t-1}^l; h_t^{l-1}] + b_f^l) \\
s_t^l = f_t^l s_{t-1}^l + i_t^l \tanh(W_s^l[\chi_t; h_{t-1}^l; h_t^{l-1}] + b_s^l) \\
o_t^l = \sigma(W_o^l[\chi_t; h_{t-1}^l; h_t^{l-1}] + b_o^l) \\
h_t^l = o_t^l \tanh(s_t^l) \\
\chi_t = [\mathbf{x}_t; \mathbf{r}_{t-1}^1; \ldots; \mathbf{r}_{t-1}^R]

"""inputs: [batch_size, in_width]* read_vectors_prev: [batch_size, memory_W * read_head_num]hidden_prev: [batch_size, hidden_num]hidden_lower: [batch_size, hidden_num]state_prev: [batch_size, hidden_num]weights_input: [in_width, hidden_num]weights_read_vectors_prev: [memory_W * read_head_num, hidden_num]weights_hidden_prev: [hidden_num, hidden_num]weights_hidden_lower: [hidden_num, hidden_num]biaes: [hidden_num] input_gate: [batch_size, hidden_num]forget_gate: [batch_size, hidden_num]state: [batch_size. hidden_num]output_gate: [batch_size, hidden_num]hidden: [batch_size, hidden_num]""" input_gate = tf.nn.sigmoid(tf.matmul(inputs, weights["input_gate_inputs"]) + tf.matmul(read_vectors_prev, weights["input_gate_read_vectors_prev"]) + tf.matmul(hidden_prev, weights["input_gate_hidden_prev"]) + tf.matmul(hidden_lower, weights["input_gate_hidden_lower"]) + weights["input_gate_bias"]) forget_gate = tf.nn.sigmoid(tf.matmul(inputs, weights["forget_gate_inputs"]) + tf.matmul(read_vectors_prev, weights["forget_gate_read_vectors_prev"]) + tf.matmul(hidden_prev, weights["forget_gate_hidden_prev"]) + tf.matmul(hidden_lower, weights["forget_gate_hidden_lower"]) + weights["forget_gate_bias"]) state = tf.multiply(forget_gate, state_prev) + tf.multiply(input_gate, tf.tanh(tf.matmul(inputs, weights["state_inputs"]) + tf.matmul(read_vectors_prev, weights["state_read_vectors_prev"]) + tf.matmul(hidden_prev, weights["state_hidden_prev"]) + tf.matmul(hidden_lower, weights["state_hidden_lower"]) + weights["state_bias"]))output_gate = tf.nn.sigmoid(tf.matmul(inputs, weights["output_gate_inputs"]) + tf.matmul(read_vectors_prev, weights["output_gate_read_vectors_prev"]) + tf.matmul(hidden_prev, weights["output_gate_hidden_prev"]) + tf.matmul(hidden_lower, weights["output_gate_hidden_lower"]) + weights["output_gate_bias"]) hidden = tf.multiply(output_gate, state)

The flow of information between the stacked LSTM layers is shown in the figure below:

# input layer
hidden_L1, state_L1 = self.element(inputs, read_vectors_prev, hidden_L1_prev, hidden_L1_lower, state_L1_prev, weights_L1)
# primary thinking layer
hidden_L2_lower = hidden_L1
hidden_L2, state_L2 = self.element(inputs, read_vectors_prev, hidden_L2_prev, hidden_L2_lower, state_L2_prev, weights_L2)
# advanced thinking layer
hidden_L3_lower = hidden_L2
hidden_L3, state_L3 = self.element(inputs, read_vectors_prev, hidden_L3_prev, hidden_L3_lower, state_L3_prev, weights_L3)
# output layer
hidden_L4_lower = hidden_L3
hidden_L4, state_L4 = self.element(inputs, read_vectors_prev, hidden_L4_prev, hidden_L4_lower, state_L4_prev, weights_L4)

hiddens = tf.concat([hidden_L1, hidden_L2, hidden_L3, hidden_L4], axis=1)
output_vector = tf.matmul(hiddens, weight_output)
interface_vector = tf.matmul(hiddens, weight_interface)

At each time step, the controller's final output vector and interface vector are generated from the concatenation of the hidden state vectors of all layers.


\mathbf{u}_t = W_y [h_t^1; \ldots; h_t^L] \\
\boldsymbol{\xi}_t = W_{\xi} [h_t^1; \ldots; h_t^L]

Interface parameters that control the external memory

The oneplus (1 + softplus), sigmoid and softmax functions are used to constrain the value ranges of the interface components, so that the quantities driving the read/write addressing mechanisms of the external memory stay within their expected ranges: gates in [0, 1], key strengths in [1, ∞), and the read modes forming a normalized distribution.

# Parse the interface vector emitted by the controller
"""
interface_vector:  [batch_size, W*R + 5R + 3W + 3]
* read_keys:       [batch_size, memory_W * read_head_num] -> [batch_size, memory_W, read_head_num]
read_strengths:    [batch_size, read_head_num]
* write_key:       [batch_size, memory_W] -> [batch_size, memory_W, write_head_num=1]
write_strength:    [batch_size, write_head_num=1]
erase_vector:      [batch_size, memory_W]
write_vector:      [batch_size, memory_W]
free_gates:        [batch_size, read_head_num]
allocation_gate:   [batch_size, 1]
write_gate:        [batch_size, 1]
* read_modes:      [batch_size, 3 * read_head_num] -> [batch_size, 3, read_head_num]
"""
R, W = self.read_head_num, self.memory_W
splits = np.cumsum([0, W*R, R, W, 1, W, W, R, 1, 1, 3*R])
tmp_list = [interface_vector[:, splits[i]:splits[i+1]] for i in range(len(splits)-1)]

read_keys = tf.reshape(tmp_list[0], shape=[-1, self.memory_W, self.read_head_num])
read_strengths = 1 + tf.nn.softplus(tf.reshape(tmp_list[1], shape=[-1, self.read_head_num]))
write_key = tf.reshape(tmp_list[2], shape=[-1, self.memory_W, 1])
write_strength = 1 + tf.nn.softplus(tf.reshape(tmp_list[3], shape=[-1, 1]))
erase_vector = tf.nn.sigmoid(tf.reshape(tmp_list[4], shape=[-1, self.memory_W]))  # erase vector constrained to [0, 1]
write_vector = tf.reshape(tmp_list[5], shape=[-1, self.memory_W])
free_gates = tf.nn.sigmoid(tf.reshape(tmp_list[6], shape=[-1, self.read_head_num]))
allocation_gate = tf.nn.sigmoid(tmp_list[7])
write_gate = tf.nn.sigmoid(tmp_list[8])
read_modes = tf.nn.softmax(tf.reshape(tmp_list[9], shape=[-1, 3, self.read_head_num]), dim=1)  # softmax over the 3 modes per head

Reading and writing to memory

Note that at each time step the memory matrix is first updated according to the write head, and only then are the new read vectors returned to the read heads.

Update the write weighting -> update the memory matrix -> update the read weightings -> compute the new read vectors.

# Memory addressing -> write weighting
"""
memory_matrix_prev:    [batch_size, memory_N, memory_W]
write_key:             [batch_size, memory_W, write_head_num=1]
write_strength:        [batch_size, write_head_num=1]
free_gates:            [batch_size, read_head_num]
read_weightings_prev:  [batch_size, memory_N, read_head_num]
write_weighting_prev:  [batch_size, memory_N]
usage_vector_prev:     [batch_size, memory_N]
allocation_gate:       [batch_size, 1]
write_gate:            [batch_size, 1]
write_weighting:       [batch_size, memory_N]
usage_vector:          [batch_size, memory_N]
"""
write_weighting, usage_vector = self._Write_weighting(memory_matrix_prev, write_key, write_strength,
                                                      free_gates, read_weightings_prev,
                                                      write_weighting_prev, usage_vector_prev,
                                                      allocation_gate, write_gate)

# Reading and writing to memory -> update memory
"""
write_weighting:        [batch_size, memory_N]
* write_weighting_fit:  [batch_size, memory_N] -> [batch_size, memory_N, 1]
erase_vector:           [batch_size, memory_W] -> [batch_size, 1, memory_W]
write_vector:           [batch_size, memory_W] -> [batch_size, 1, memory_W]
memory_matrix_prev:     [batch_size, memory_N, memory_W]
"""
write_weighting_fit = tf.expand_dims(write_weighting, axis=2)
erase_vector = tf.expand_dims(erase_vector, axis=1)
write_vector = tf.expand_dims(write_vector, axis=1)
erase_operation = tf.multiply(memory_matrix_prev, (1 - tf.matmul(write_weighting_fit, erase_vector)))
write_operation = tf.matmul(write_weighting_fit, write_vector)
memory_matrix = erase_operation + write_operation

# Memory addressing -> read weightings
"""
memory_matrix:              [batch_size, memory_N, memory_W]
read_keys:                  [batch_size, memory_W, read_head_num]
read_strengths:             [batch_size, read_head_num]
precedence_weighting_prev:  [batch_size, memory_N]
link_matrix_prev:           [batch_size, memory_N, memory_N]
read_modes:                 [batch_size, 3, read_head_num]
read_weightings:            [batch_size, memory_N, read_head_num]
precedence_weighting:       [batch_size, memory_N]
link_matrix:                [batch_size, memory_N, memory_N]
read_vectors:               [batch_size, memory_W, read_head_num]
"""
read_weighting, precedence_weighting, link_matrix = self._Read_weighting(memory_matrix, read_keys, read_strengths,
                                                                         write_weighting, precedence_weighting_prev,
                                                                         link_matrix_prev, read_weightings_prev,
                                                                         read_modes)
read_vectors = tf.matmul(memory_matrix, read_weighting, adjoint_a=True)

Memory addressing

Content-based addressing

The cosine of the angle between two vectors measures how similar they are: the smaller the angle, the more similar the vectors. A cosine value close to 1 means the angle is close to 0 degrees, i.e. the two vectors are very similar; this is what "cosine similarity" means. Content addressing uses this similarity to define a normalized probability distribution over the rows (the N locations) of the memory matrix. In the code below, the cosine similarity is computed in two steps using tf.nn.l2_normalize.

\mathcal{C}(M, \mathbf{k}, \beta)[i] = \frac{\exp\{\mathcal{D}(\mathbf{k}, M[i,\cdot])\,\beta\}}{\sum_j \exp\{\mathcal{D}(\mathbf{k}, M[j,\cdot])\,\beta\}} \quad\quad \mathrm{where} \quad\quad \mathcal{D}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}

"""memory_matrix: [batch_size, memory_N, memory_W]lookup_key: [batch_size, memory_W, read_head_num or write_head_num]key_strength: [batch_size, read_head_num or write_head_num] -> [batch_size, 1, read_head_num or write_head_num] cosine_similarity: [batch_size, memory_N, read_head_num or write_head_num]content_weighting: [batch_size, memory_N, read_head_num or write_head_num]"""cosine_similarity = tf.matmul(tf.nn.l2_normalize(memory_matrix, dim=2), tf.nn.l2_normalize(lookup_key, dim=1))content_weighting = tf.nn.softmax(tf.multiply(cosine_similarity, tf.expand_dims(key_strength, axis=1)), dim=1)

Dynamic memory allocation

A differentiable "free list" is maintained by adding and removing addresses from a linked list of available memory locations, allowing the controller to free and allocate memory as needed. The retention_vector combines the feedback of all read heads to determine to what degree each of the memory_N rows of memory_matrix should be retained rather than freed. The usage_vector describes how used each of the N memory locations is; its values lie in [0, 1], writing to a location increases its usage value, and the usage value can only be decreased by the free_gates releasing locations. The free list is obtained by sorting locations in ascending order of usage_vector, which determines the order in which locations are released when new content is written. Finally, the write head uses the resulting allocation_weighting to decide where new memories are written.

The allocation weighting is first computed over the usage-sorted order, and tf.TensorArray is then used to scatter the values back to the original row order.

Note: given the initial values, the results of these formulas naturally stay within the boundary conditions of the simplex \Delta_N. The paper notes that the sorting indices appear not to contribute to learning, so the discontinuities caused by the sort indices are ignored when computing gradients (no gradient is propagated through the sorting operation).

# Dynamic memory allocation: retention_vector combines the feedback of all read heads to decide
# how strongly each of the memory_N rows of memory_matrix is retained (i.e. not freed).
"""
* free_gates:          [batch_size, read_head_num] -> [batch_size, 1, read_head_num]
read_weightings_prev:  [batch_size, memory_N, read_head_num]
retention_vector:      [batch_size, memory_N]
write_weighting_prev:  [batch_size, memory_N]
usage_vector_prev:     [batch_size, memory_N]
usage_vector:          [batch_size, memory_N]
allocation_weighting:  [batch_size, memory_N]
"""
free_gates = tf.expand_dims(free_gates, axis=1)
retention_vector = tf.reduce_prod(1 - tf.multiply(read_weightings_prev, free_gates), axis=2)
usage_vector = tf.multiply((usage_vector_prev + write_weighting_prev
                            - tf.multiply(usage_vector_prev, write_weighting_prev)), retention_vector)
values, indices = tf.nn.top_k(usage_vector, k=self.memory_N)  # sorted in descending order
values_sorted = tf.reverse(values, axis=[1])    # values in ascending order
indices_sorted = tf.reverse(indices, axis=[1])  # indices in ascending order
allocation_weighting = tf.multiply((1 - values_sorted), tf.cumprod(values_sorted, axis=1, exclusive=True))
# The allocation weighting above is computed in usage-sorted order, following the paper's indexing.
# It must be scattered back to the original row order of usage_vector (i.e. the normal order of the
# memory matrix); a tf.TensorArray is used to undo the sort so that each original location gets its
# allocation value.
indices_flat = tf.reshape(tensor=indices_sorted + self.flat_shift, shape=[-1])
flat_array = tf.TensorArray(dtype=tf.float32, size=self.batch_size * self.memory_N)
scatter = flat_array.scatter(indices=indices_flat, value=tf.reshape(allocation_weighting, shape=[-1]))
allocation_sort_undo = scatter.stack()
allocation_weighting = tf.reshape(allocation_sort_undo, shape=[self.batch_size, self.memory_N])
allocation_weighting = tf.stop_gradient(allocation_weighting)
return allocation_weighting, usage_vector

Temporal memory linkage

A temporal link matrix L tracks the order in which memory locations were written, i.e. how new or old each memory is. The N × N matrix L represents, after every new write, the relative recency between memory locations. The precedence weighting records the degree to which each row of the memory matrix was the most recently written location; together with the write weighting it determines how the link matrix is updated, which in turn gives the read heads their forward and backward temporal addressing over rows.

"""write_weighting: [batch_size, memory_N]precedence_weighting_prev: [batch_size, memory_N]precedence_weighting: [batch_size, memory_N]* write_weighting: [batch_size, memory_N] -> write_weighting: [batch_size, memory_N, 1]tmp: [batch_size, memory_N, memory_N]link_marix_prev: [batch_size, memory_N, memory_N]left: [batch_size, memory_N, memory_N]* precedence_weighting_fit: [batch_size, memory_N] -> precedence_weighting: [batch_size, 1, memory_N] 維度匹配right: [batch_size, memory_N, memory_N]link_marix: [batch_size, memory_N, memory_N]read_weightings_prev: [batch_size, memory_N, read_head_num]forward_weighting: [batch_size, memory_N, read_head_num]backward_weighting: [batch_size, memory_N, read_head_num]"""precedence_weighting = tf.multiply(precedence_weighting_prev, (1 - tf.reduce_sum(write_weighting, axis=1, keep_dims=True))) + write_weightingwrite_weighting = tf.expand_dims(write_weighting, axis=2)tmp = tf.tile(write_weighting, multiples=[1,1,self.memory_N])left = tf.multiply(link_matrix_prev, (1 - tmp - tf.transpose(tmp,perm=[0,2,1])))precedence_weighting_fit = tf.expand_dims(precedence_weighting_prev, axis=1) # 維度匹配right = tf.matmul(write_weighting, precedence_weighting_fit) link_matrix = tf.multiply((left+right), self.mask_eye) # 掩碼處理對角線消除forward_weighting = tf.matmul(link_matrix, read_weightings_prev)backward_weighting = tf.matmul(link_matrix, read_weightings_prev, transpose_a=True)(https://github.com/Mostafa-Samir/DNC-tensorflow
