How should one prepare for a machine learning engineer interview?

This question has been added to the Zhihu roundtable "Machine Learning: Learning for Practice"; more discussions on "machine learning" topics are welcome there.

I have been interviewing for machine learning / data mining engineer positions at a few companies and feel I was not well enough prepared. I would like to know what questions are typically asked and which areas are examined.
Related question: If you were the interviewer, how would you judge a candidate's depth in deep learning? - Zhihu


Machine learning interviews mainly cover three areas:
1. Algorithms and theoretical foundations
2. Engineering implementation and coding ability
3. Business understanding and depth of thinking

1. For theory, I recommend the classic book "Statistical Learning Methods" (《統計學習方法》, Li Hang). It may not be the most comprehensive, but it distills the essence into a thin volume and is well suited to a final sprint before an interview.

The key points, in my view:
The core steps of statistical learning are model, strategy and algorithm. You should have a deep understanding of logistic regression, SVM, decision trees, KNN and the various clustering methods, and be able to write from memory the pseudocode of each algorithm's core iterative step, the objective function it optimizes and its dual form.

I know little about non-statistical learning; I have worked on complex networks, but that is a deep topic and unlikely to come up in an interview.

On the mathematics side, you should thoroughly understand matrix transformations, especially everything related to eigenvalues.

On the algorithms side, you should deeply understand the common optimization methods: gradient descent, Newton's method, and the various stochastic search algorithms (genetic, ant colony, and so on). "Deeply understand" means, for example, knowing that gradient descent approximates the local landscape with a plane while Newton's method approximates it with a curved (quadratic) surface, and so on.
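
As a minimal illustration of that "plane versus curved surface" intuition (my own sketch, not part of the original answer), compare the two update rules on a simple one-dimensional function:

```python
# Gradient descent uses only the slope (first-order / "plane" approximation);
# Newton's method also uses the curvature (second-order / "curved surface" approximation).
# Example function: f(x) = x^4 - 3x^3 + 2, minimised at x = 2.25.

def f_prime(x):
    return 4 * x**3 - 9 * x**2

def f_double_prime(x):
    return 12 * x**2 - 18 * x

def gradient_descent(x0, lr=0.01, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)                      # step against the local slope
    return x

def newton(x0, steps=10):
    x = x0
    for _ in range(steps):
        x -= f_prime(x) / f_double_prime(x)       # curvature-scaled step
    return x

print(gradient_descent(3.0), newton(3.0))         # both approach 2.25; Newton needs far fewer steps
```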

2. Engineering implementation and coding ability
From an engineering point of view, machine learning generally boils down to a search problem over some data structure.

You should understand which data structure and search method each of the algorithms listed in part 1 maps to: for example, the KD-tree behind KNN, how to design a data structure for graph-structured data, how to express an algorithm in map-reduce form, and so on.
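
As a quick, hedged sketch of the KNN-plus-KD-tree pairing mentioned above (my own illustration using scikit-learn; the original answer names no library):

```python
# Nearest-neighbour search with a KD-tree, the data structure usually paired with KNN.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # 1,000 training points in 3-D
tree = KDTree(X, leaf_size=40)        # built once up front

query = rng.normal(size=(1, 3))
dist, idx = tree.query(query, k=5)    # 5 nearest neighbours without a full linear scan
print(idx, dist)
```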

Generally speaking, you should either be able to write C and use MPI, or know Hadoop; in practice almost everything is implemented on one of these two platforms. At the very least, learn some Python.

3. It is rather disappointing to have to tell you that, although the interview will mainly test points 1 and 2, in actual work the sophistication of the algorithm accounts for probably less than 30% of the real business result. The algorithm does have to be fast enough: an offline algorithm should ideally finish within 4 hours; I have not worked on real-time algorithms, but the requirements there are presumably stricter.

Most machine learning scenarios are search, advertising, spam filtering, security, recommender systems, and so on. A deep understanding of the business accounts for more than 70% of the quality of the resulting system. Without having worked on a real project you cannot possibly appreciate this. I once built a recommender system with no fancy algorithmic improvements at all; the innovation was mainly in the business logic, and it directly produced a very noticeable lift in CTR (I cannot disclose the exact number, but it was clearly significant). If you have worked on real projects, be sure to bring them up and make the interviewer aware of them; that is by far the biggest plus.

One final example: in Alibaba's internal machine learning competition, countless people ten thousand times stronger than me took part. The eventual champion used no fancy algorithm at all; the win came from a deep understanding of the data and the business plus extremely meticulous feature tuning on top of a very basic algorithm. So nothing beats actually getting your hands dirty on a few production projects.


I have interviewed for data mining/analysis and machine learning engineer positions at 5-6 internet companies and was asked the following questions.

The principle of SVM, and SVM kernels
K-means, and how to implement k-means on Hadoop
The difference between naive Bayes and logistic regression
The principle and derivation of LDA
For ad click-through-rate prediction, what data and what algorithms would you use
In recommender systems, which scenarios suit nearest-neighbour methods and which suit matrix factorization
How to predict user churn (game-company data-mining teams love this one)
What data should be collected during the design of a game
How to mine as much information as possible from login logs

My answers fell into a few categories: some I managed to complete, barely, with hints from the interviewer; some I answered partially, with hints, but not well; and for some the interviewer gave neither hints nor feedback, so I answered but have no idea how well.

One thing I really regret is only now sitting down to summarize. One question, predicting game-player churn, came up twice, and both times I just said it is a classification problem. Recently it occurred to me to look it up, and there are two key points: class imbalance and time-series analysis. I found it had been a joint project between a university professor and RenRen Games; I then checked that professor's publications but found no related paper. Perhaps the company did not allow publication.

These questions have two characteristics. They are basic and simple, because complex algorithms are rarely used in practice: they are hard to control and demand more theory. They also emphasize practical engineering ability; I was often asked which algorithms I had implemented myself. And some questions map very closely onto real business problems.

If I could prepare again, I would focus on the following.
First, computer science fundamentals and algorithms; these are routinely tested. Some companies test them lightly, others in full.
For the machine learning part, you need solid theory plus hands-on implementation of the algorithms yourself. Hadoop, MPI, and the currently popular Spark should all be plus points. Another thing is to get exposure to real data-analysis systems: the papers I read at school were mostly about algorithms and rarely about applied systems. You can get this through an internship or by reading more application-oriented papers.


PS: I am the original poster, answering my own question.
PS2: The companies I interviewed with are all domestic internet and game companies.


The best way is to look at how others prepared, and use their interview write-ups to reflect on your own preparation.

For fresh graduates interviewing for "machine learning" related positions in campus recruitment, Niumei (the Niuke editor) has compiled a batch of interview write-ups for reference:

Selected interview write-ups from the 2018 campus recruitment season

1. Cainiao image/graphics algorithms, internal referral, first round

2. Machine learning algorithm interview write-up

3. Baidu first-round interview

4. JD Cloud algorithm engineer first-round write-up

5. JD algorithm engineer first-round write-up

6. JD Cloud machine learning interview write-up

7. JD AI & Big Data department write-up

8. Huawei "excellent recruitment" machine learning write-up

9. A large collection of write-ups (from Niuke and from word of mouth)

10. Huawei 814 overseas-student session, R&D track (big data, technology research)

11. Pinduoduo + Maoyan + JD Cloud + Cheetah Mobile machine learning interviews

12. Meituan Maoyan Movies machine learning engineer first-round questions

13. 360 Chinese Academy of Sciences session: Big Data Center data development engineer write-up

14. Vipshop phone interview, algorithm engineer

15. Envision Energy write-up (phone interview), big data position

Selected interview write-ups from the 2017 campus recruitment season

1. A spring-recruitment summary from a struggling data-track student

2. Tencent basic research write-up; only got the call and the offer on 4-27

3. 2017 Tencent basic research intern write-up, offer received

4. Toutiao big data, first round

5. NetEase Mail data mining and Didi data engineer, failed both

6. Baidu internship interview (three rounds, search ranking)

7. Feedback from four rounds of Alibaba machine learning interviews

8. Machine learning internship write-ups (Ant Financial, WeChat, Amazon US, Perfect World)

9. Cainiao Network first-round write-up, data mining position

10. Baidu write-up: in memory of failing Baidu five times

11. Baidu machine learning position, rounds one to three

12. Nanjing Meituan interview, machine learning position

13. Baidu first-round write-up

14. A fresh Meituan write-up, machine learning position

15. China Merchants Bank Tech, first phone round

16. Baidu Beijing, machine learning / data mining, early batch

17. Some reflections on interviewing

18. TAL three-round onsite experience, data mining position, Wuhan

19. Referral interview for JD Finance algorithm engineer

20. Didi and Tencent algorithm-position write-ups

21. Alibaba, Baidu, Tencent and Huawei write-ups (offers from all)

22. Data mining / data engineering write-up

23. 2016 interview summary (C++, machine learning), hoping it helps

24. Autumn-recruitment experience of a non-CS master's student for ML / data-mining positions

25. A failed first round at Meituan, machine learning

26. Baidu machine learning & data mining

27. Autumn-recruitment experience of a non-CS master's student for ML / data-mining positions

28. Meituan write-up (interviewed late... posted late)

29. Baidu machine learning & data mining

30. [Data mining write-ups] Tencent + Baidu + Huawei (SP offers from all)

That is all.


You can refer to my article "Those Machine Learning Interview Things" (機器學習面試的那些事兒); I will add more when I have time.

Suppose we have done a spam-classifier project. To build the classifier, we first cleaned and preprocessed the data, for example handling missing values and normalizing features. After obtaining the initial feature vectors, we used PCA for feature selection. With the selected features and the corresponding data, we trained a random forest classifier as our spam classifier. For a project like this, there are several angles worth digging into and preparing.


1. Project introduction

How to introduce a project you have done to the interviewer is a very basic, very common, yet highly skill-dependent question. First, the introduction should not be long-winded; aim to sketch the skeleton of the project in a few short sentences. Second, data-science projects usually combine business with technology, so you should highlight both the technical difficulties you solved and the techniques you applied, while also covering the business impact the project produced.


2. Model introduction

This is likewise one of the most common kinds of machine learning interview questions, usually phrased as "introduce your favourite model" or "introduce one of the models used in the project". As with the project introduction, keep the model introduction concise: in the fewest possible sentences, explain what principle the model uses to achieve what goal. The Wikipedia definition of random forests provides a very good template to learn from:


Random forests are a notion of the general technique of random decision forests that are an ensemble learning method (what kind of method) for classification, regression and other tasks (what problems it solves), that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees (the basic principle).


However, most model descriptions on Wikipedia are written in a rather bookish register and should not be dropped into an interview verbatim; turn them into something more conversational.


3. Strengths and weaknesses of the model

A model's strengths and weaknesses are closely tied to the model introduction, and the two can be prepared together. For example, after explaining what a random forest is, you can go straight into its advantages, such as: a. it performs well and achieves fairly high accuracy on many datasets; b. it is not prone to overfitting; c. it yields a ranking of variable importance; d. it handles both categorical and continuous features and needs no normalization; e. it copes well with missing data; f. it is easy to parallelize; and so on. Connecting theory with practice is also an excellent angle, for example how these advantages of random forests showed up in the spam-classifier project; such connections demonstrate much more convincingly how well the candidate understands and commands the model.
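
As a small illustrative sketch (mine, not the answerer's) tying two of these advantages, the built-in importance ranking and the lack of any need for feature scaling, to a spam-style setup in scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced "spam vs. ham" data stands in for the real project data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)                              # raw features, no normalization needed

print("test accuracy:", clf.score(X_test, y_test))
print("most important features:", clf.feature_importances_.argsort()[::-1][:5])
```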


4. Model principles and technical details

Model introduction and model pros/cons are conceptual questions that test whether the candidate knows a given model; one step further is the examination of the model's principles and technical details, such as model assumptions, the objective function, the optimization procedure, convergence of the algorithm, and so on. Knowing not only the what but also the why is the next level of expectation for a candidate.


For example, among the advantages we mentioned that a random forest can rank variable importance. Accordingly, we should be able to explain how it produces this ranking and what the common importance measures are, such as the change in OOB misclassification rate, or the change in information gain at split points. The questioning does not end there: from these two measures new questions can be derived. For the OOB-based measure: what is OOB, how is it computed in a random forest, and what are its pros and cons? For information gain: what is information gain, how is it computed, what is entropy, what is the Gini index, how do they differ, and how does that difference affect tree building?
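
For the entropy-versus-Gini part of that chain of questions, here is a minimal sketch (my own, for illustration only) of the two impurity measures:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# class distribution [5/8, 3/8]
print(entropy([5/8, 3/8]))   # ~0.954 bits
print(gini([5/8, 3/8]))      # ~0.469
```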


As another example, part of the data in the spam-classifier project was missing, and since a random forest can handle missing data we made full use of that property while modeling. Related questions might then be: why does a random forest have this advantage, and how does it train and predict in the presence of missing data?


5. Horizontal comparison of models

Questions about principles and technical details probe the depth of machine learning knowledge; their counterpart is breadth. Breadth is examined in two main ways: comparing different algorithms theoretically, for example their model assumptions and optimization methods; and comparing them in the context of concrete cases, which requires not only knowing each model's principles and details but also connecting abstract theory with concrete practice.


In the spam-classifier project, a random forest served as the final classifier. The interviewer may ask: why a random forest rather than some other model, say naive Bayes or a support vector machine? Generally, a candidate can compare along two axes, mathematical theory and engineering. Theoretically, the characteristics the data exhibits and the assumptions each model rests on are good entry points; on the engineering side, ease of implementation and ease of scaling are points worth raising.


6. Open-ended questions

Besides the depth and breadth of machine learning knowledge, open-ended questions also come up frequently, and for beginners they are the hardest to prepare for. On the one hand they rarely appear in textbooks and there is no fixed question list; on the other hand they have no standard answers and are often distilled from past experience. The only real preparation is to think, summarize and accumulate continuously in everyday work and study; last-minute cramming rarely helps.


Returning once more to the spam-classifier project, several open-ended questions could be asked. For example: 1. some of the email data is missing; how would you normally handle missing data? 2. spam classification is an imbalanced-classification problem; how should we model it? 3. PCA was used here for feature selection; what other feature-selection methods are there?
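
For the imbalanced-data question, one common answer is to reweight the minority class; here is a minimal scikit-learn sketch (my own illustration, not part of the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with 5% positives stands in for the spam class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)  # upweight the rare class
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))             # judge by minority recall, not accuracy
```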


7. Preparation materials

In preparing for machine learning I mainly used the following materials:

A. Stanford CS229 Machine Learning.

B. CMU 10-701 Introduction to Machine Learning.

C. The Elements of Statistical Learning. By Trevor Hastie, Robert Tibshirani and Jerome Friedman.

D. Pattern Recognition and Machine Learning. By Christopher Bishop.


A and B are the course slides of the Stanford and CMU machine learning classes; they cover the commonly used algorithms, which you should strive to master. C and D are classics among classics, of moderate difficulty, neither excessively theoretical in content nor obscure in language; they are the true path to building solid machine learning foundations.


A side note on materials: I used to be an avid collector, forever enriching my archive and worrying about missing any useful resource. Only later did I realize that with study materials, quality beats quantity: what sits in your head is knowledge; what sits on your hard drive is just documents.


Just to get the ball rolling: I am a master's student, and half a month ago I interviewed for a machine learning intern position at a company and came back empty-handed. One interviewer questioned me on machine learning for a whole hour, and I suspect the people who actually get in go through several more such rounds.

First, introduce your own research; the interviewer asks about random details.
I interviewed for recommendation, so I was asked about the strengths and weaknesses of the various collaborative-filtering methods.
Then I mentioned that I had worked with LDA, so I was asked for the definition and properties of the Dirichlet distribution, why it is conjugate to the multinomial distribution, and, along the way, what a conjugate prior is.
One very interesting question: how does the Top-N recommendation problem in real applications differ from the rating-prediction problem studied in academia?
I was asked how to implement ItemCF in engineering terms, how to make it work on big data, and then pressed for engineering optimizations. I did not answer this well: I first proposed a MapReduce formulation, he asked whether it could be made faster, and I froze. In the end the interviewer told me not to analyze only from the algorithmic angle but from the system-design angle, using memory to cut down the volume of data pushed through MapReduce. (Then again, maybe I had already lost the moment I said MapReduce.)
Finally a basic concept question: what is a discriminative model and what is a generative model.

I recall a senior who joined Baidu's advertising business sharing his interview experience; as far as I remember nothing deep was asked, Markov-family models perhaps?

There is actually not that much pure machine learning work in industry today. Companies are not academia and have no ambition to pad publication counts; what matters is that an algorithm is implementable and can run efficiently, scalably and in a distributed fashion.
So in my view the core of a machine learning engineer is still the engineer part: the ability to analyze practical application problems and to implement algorithms matters most, and the machine learning algorithms themselves are not the focus.

PS1: Full-time staff at the various research labs (Microsoft, Yahoo, IBM and the like) are a different matter.
PS2: I have interviewed at only one company; the rest is inferred from things I have heard and from algorithm engineers' tech talks, so this is a one-sided view, offered merely to start the discussion.
PS3: Mainland China only.
PS4: OP, please share your own interview experiences too.


Having read the other answers, I would definitely flunk these interviews... To borrow a line from @Filestorm from back in the day (I do not remember the exact wording): what a skilled hand finishes in one day may take us three; but what we finish in a year, the skilled hand will never finish. It seems the answerers above are all hiring skilled hands...


I have recorded some walkthrough videos; sharing them here.
The full list is here: http://www.bittiger.io/blog/post/r7s58dHkzLjCm7agm

  1. Data science (data analysis + machine learning)

    1. Getting started

      1. How to get started with machine learning

      2. What data scientists do at a company

      3. The taxonomy of machine learning

      4. Should you switch into machine learning

    2. Applications

      1. Ads and search

        1. How search advertising works, highlights

        2. How search advertising works, full version

      2. Deep learning

        1. Applications of large-scale deep learning

        2. Deep learning and self-driving cars / robots

        3. Google's explorations in machine learning

      3. Recommender systems

        1. How to build a good recommender system

        2. The App Store recommender system (1)/(2)

      4. Airbnb big-data prediction

        1. Airbnb machine learning in practice

        2. Airbnb big-data prediction (1)/(2)/(3)/(4)/(5)/(6)

      5. Image question answering

        1. Deep learning for image question answering in practice

      6. User analytics

        1. User analytics with R in practice

I have interviewed for machine learning engineer positions at seven companies, including big companies like BAT, mature startups like Toutiao, and early-stage startups like SenseTime.

Machine learning engineer interviews mainly examine how well you know machine learning, the relevant projects you have done, and your coding ability.

On machine learning I have been asked about: LR, SVM, the PR curve, the naive Bayes assumption, ensemble methods, which feature a decision-tree node should split on, the principle of GBDT, the principle of random forests, PCA and LDA for dimensionality reduction, writing out the k-means and GMM formulas, feature-selection methods, the differences between CNNs and RNNs, the distance metrics you know, the loss functions you know, and reservoir sampling. See https://zhuanlan.zhihu.com/p/30420494 for details.
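
Of the topics listed, reservoir sampling is the one usually asked as a short coding exercise; a minimal sketch (my own illustration):

```python
import random

def reservoir_sample(stream, k):
    """Keep k items uniformly at random from a stream of unknown length, in O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # item i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10**6), 5))
```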

If your résumé got through, your projects are probably quite similar to what the interviewer's team works on, so the interviewer will usually understand your project well. They will ask about what they see as its hard parts and how you solved them, for example class imbalance or how negative samples were chosen. Before the interview, go over your own projects carefully and think about what you would ask if you were the interviewer.

Last comes basic coding ability. DP problems: knapsack, LCS, LIS, edit distance, longest palindromic substring; linked-list problems: reversal (recursive and iterative), cycle detection; tree problems: preorder/inorder/postorder traversal, mirror check, whether a tree is complete or full, tree depth, LCA; sorting problems: quicksort, heapsort, merge sort, top-k largest, and their time complexities; implementing atoi; and so on. See this summary for details: https://zhuanlan.zhihu.com/p/29731623.


Easy data-structure and algorithm problems + derivations of the common machine learning algorithms + model-tuning details + business understanding


While writing this answer I am also taking the chance to organize the interview questions I have collected (note that these lean towards Data Scientist roles; the questions and answers are written in separate blocks, every question has a fairly reliable source, and the answers are updated from time to time):

I. From ANALYTICS VIDHYA: 40 Interview Questions asked at Startups in Machine Learning / Data Science (these questions do not ask you to write out concrete derivations; they mainly test how machine learning is applied in practice. I have picked out a few of the more interesting ones.)

The Questions:

Q1 On how to reduce the dimensionality of a dataset when memory is limited (an interesting question; the article's answer is very detailed and links to further explanations of the methods it mentions. Well done!)

You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Q2 Your classification model on a cancer-detection dataset reaches 96% accuracy, yet "you are not truly happy". Why, and what can you do about it? (a basic question about performance metrics)

You are given a data set on cancer detection. You』ve build a classification model and achieved an accuracy of 96%. Why shouldn』t you be happy with your model performance? What can you do about it?

Q3 How to deal with multicollinearity in a regression model

After analyzing the model, your manager has informed that your regression model is suffering from multicollinearity. How would you check if he』s true? Without losing any information, can you still build a better model?

Q4 How to select important features

While working on a data set, how do you select important variables? Explain your methods.

Q5 Which cross-validation technique should be used on a time-series dataset?

What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?

The Answers:

A1 (key points were bolded in the original; same below)

Processing a high dimensional data on a limited memory machine is a strenuous task, your interviewer would be fully aware of that. Following are the methods you can use to tackle such situation:

1.Since we have lower RAM, we should close all other applications in our machine, including the web browser, so that most of the memory can be put to use.

2.We can randomly sample the data set. This means, we can create a smaller data set, let』s say, having 1000 variables and 300000 rows and do the computations.

3.To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we』ll use correlation. For categorical variables, we』ll use chi-square test.

4.Also, we can use PCA and pick the components which can explain the maximum variance in the data set.

5.Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.

6.Building a linear model using Stochastic Gradient Descent is also helpful.

7.We can also apply our business understanding to estimate which all predictors can impact the response variable. But, this is an intuitive approach, failing to identify useful predictors might result in significant loss of information.

Note: For points 4 & 5, make sure you read about online learning algorithms & Stochastic Gradient Descent. These are advanced methods.

A2

If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might only be predicting the majority class correctly, while our class of interest is the minority class (4%), i.e. the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) & F measure to determine the class-wise performance of the classifier. If the minority-class performance is found to be poor, we can undertake the following steps:
1. We can use undersampling, oversampling or SMOTE to make the data balanced.
(Note: I had not come across SMOTE before; it stands for Synthetic Minority Over-sampling Technique, i.e. analyse the minority-class samples and synthesize new artificial minority samples to add to the dataset.)
2. We can alter the prediction threshold value by doing probability caliberation and finding a optimal threshold using AUC-ROC curve.
3. We can assign weight to classes such that the minority classes gets larger weight.
4. We can also use anomaly detection.
Know more: Imbalanced Classification

A3

To check multicollinearity, we can create a correlation matrix to identify & remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can calculate VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.
But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise to a correlated variable so that the variables become different from each other. But adding noise might affect the prediction accuracy, hence this approach should be used carefully.
Know more: Regression
(Note: for more on multicollinearity, see also the blog post "A simple understanding of multicollinearity".)
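
A minimal sketch of the correlation-matrix and VIF checks described above (my own illustration; it assumes statsmodels' variance_inflation_factor, which does not add an intercept for you):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr().round(2))                                  # pairwise correlations
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))    # a large VIF flags multicollinearity
```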

A4

1. Remove the correlated variables prior to selecting important variables
2. Use linear regression and select variables based on p values
3. Use Forward Selection, Backward Selection, Stepwise Selection
4. Use Random Forest, Xgboost and plot variable importance chart
5. Use Lasso Regression
6. Measure information gain for the available set of features and select top n features accordingly.

A5

Neither.
In a time series problem, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward-chaining strategy with 5 folds as shown below:
fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
where 1,2,3,4,5,6 represents 「year」.
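
The same forward-chaining scheme is available off the shelf; here is a minimal sketch (my own addition) using scikit-learn's TimeSeriesSplit, where the training window always precedes the test window in time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)                    # 60 time-ordered samples
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("train:", train_idx[[0, -1]], "test:", test_idx[[0, -1]])
```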

II. From Quora, "20 questions to detect a fake data scientist":

1 What is the life cycle of a data science project?

2 How do you measure yield (over base line) resulting from a new or refined algorithm or architecture?

3 What is cross-validation? How to do it right?

4 Is it better to design robust or accurate algorithms?

5 Have you written production code? Prototyped an algorithm? Created a proof of concept?

6 What is the biggest data set you have worked with, in terms of training set size, and in terms of having your algorithm implemented in production mode to process billions of transactions per day / month / year?

7 Name a few famous APIs (for instance Google search). How would you create one?

8 How to efficiently scrape web data, or collect tons of tweets?

9 How to optimize algorithms (parallel processing and/or faster algorithm: provide examples for both)

10 Examples of NoSQL architecture?

11 How do you clean data?

12 How do you define / select metrics? Have you designed and used compound metrics?

13 Examples of bad and good visualizations?

14 Have you been involved - as an adviser or architect - in the design of dashboard or alarm systems?

15 How frequently an algorithm must be updated? What about lookup tables in real-time systems?

16 Provide examples of machine-to-machine communication.

17 Provide examples where you automated a repetitive analytical task.

18 How do you assess the statistical significance of an insight?

19 How to turn unstructured data into structured data?

20 How to very efficiently cluster 100 billion web pages, for instance with a tagging or indexing algorithm?

(One final, more open-ended question was added at the end:)
If you were interviewing a data scientist, what questions would you ask her?

III. From ANALYTICS VIDHYA (45 questions focused specifically on regression): 45 questions to test a Data Scientist on Regression (Skill test - Regression Solution)

IV. From Data Science Central (100+ Common Data Science Interview Questions): 100+ Common Data Science Interview Questions

It covers questions in the following six modules, so the coverage is quite broad:

1. Statistics
2. Programming:General,Big Data,Python,R,SQL
3. Modeling
4. Behavioral
5. Culture Fit
6. Problem-Solving

V. From ANALYTICS VIDHYA: a 30-question written test on NLP (I will not paste the questions here as I did for the last set, or this answer would get unbearably long; here is the introduction instead): 30 Questions to test a data scientist on Natural Language Processing [Solution: Skilltest – NLP]

Introduction

Humans are social animals and language is our primary tool to communicate with the society. But, what if machines could understand our language and then act accordingly? Natural Language Processing (NLP) is the science of teaching machines how to understand the language we humans speak and write.
We recently launched an NLP skill test on which a total of 817 people registered. This skill test was designed to test your knowledge of Natural Language Processing. If you are one of those who missed out on this skill test, here are the questions and solutions.
Here are the leaderboard ranking for all the participants.

These are the leaderboard scores of the participants in this test; after finishing it you can compare your own result against them.

VI. From ANALYTICS VIDHYA (45 questions on tree-based algorithms, covering decision trees, random forests and XGBoost): 45 questions to test Data Scientists on Tree Based Algorithms (Decision tree, Random Forests, XGBoost)

VII. From ANALYTICS VIDHYA (25 questions related to image processing, not deep-learning oriented): https://www.analyticsvidhya.com/blog/2017/10/image-skilltest/

Introduction

Extracting useful information from unstructured data has always been a topic of huge interest in the research community. One such example of unstructured data is an image, and analysis of image data has applications in various aspects of business.
This skilltest is specially designed for you to test your knowledge on the knowledge on how to handle image data, with an emphasis on image processing. More than 300 people registered for the test. If you are one of those who missed out on this skill test, here are the questions and solutions.

VIII. Last, from ANALYTICS VIDHYA (these 40 questions are more of a machine-learning written test, but still well worth a read): 40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017] (That is all for now; I will add more when I have time. If these questions gave you something, an upvote would be appreciated.)

1) Which of the following statement is true in following case?

A) Feature F1 is an example of nominal variable.
B) Feature F1 is an example of ordinal variable.
C) It doesn』t belong to any of the above category.
D) Both of these

Solution: (B)

Ordinal variables are variables whose categories have some order. For example, grade A should be considered a higher grade than grade B.

2) Which of the following is an example of a deterministic algorithm?

A) PCA

B) K-Means

C) None of the above

Solution: (A)

A deterministic algorithm is that in which output does not change on different runs. PCA would give the same result if we run again, but not k-means.

3) [True or False] A Pearson correlation between two variables is zero but, still their values can still be related to each other.

A) TRUE

B) FALSE

Solution: (A)

Consider Y = X². They are not only associated; one is a deterministic function of the other, and yet the Pearson correlation between them is 0.
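
A two-line check of that claim (my own illustration):

```python
import numpy as np

x = np.arange(-100, 101)
y = x ** 2                          # perfectly determined by x, yet no linear trend
print(np.corrcoef(x, y)[0, 1])      # ~0
```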

4) Which of the following statement(s) is / are true for Gradient Decent (GD) and Stochastic Gradient Decent (SGD)?

  1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
  2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
  3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (A)

In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.

5) Which of the following hyper parameter(s), when increased may cause random forest to over fit the data?

  1. Number of Trees
  2. Depth of Tree
  3. Learning Rate

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (B)

Usually, increasing the depth of the trees causes overfitting. Learning rate is not a hyperparameter of random forests, and increasing the number of trees does not lead to overfitting.

6) Imagine, you are working with 「Analytics Vidhya」 and you want to develop a machine learning algorithm which predicts the number of views on the articles.

Your analysis is based on features like author name, number of articles written by the same author on Analytics Vidhya in past and a few other features. Which of the following evaluation metric would you choose in that case?

  1. Mean Square Error
  2. Accuracy
  3. F1 Score

A) Only 1

B) Only 2

C) Only 3

D) 1 and 3

E) 2 and 3

F) 1 and 2

Solution:(A)

The number of views of an article is a continuous target variable, so this falls under regression. Hence mean squared error is the appropriate evaluation metric.

7) Given below are three images (1,2,3). Which of the following option is correct for these images?


A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.

B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.

C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.

D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

Solution: (D)

The range of SIGMOID function is [0,1].

The range of the tanh function is [-1,1].

The range of the RELU function is [0, infinity].

So Option D is the right answer.

8) Below are the 8 actual values of target variable in the train file.

[0,0,0,1,1,1,1,1]

What is the entropy of the target variable?

A) -(5/8 log(5/8) + 3/8 log(3/8))

B) 5/8 log(5/8) + 3/8 log(3/8)

C) 3/8 log(5/8) + 5/8 log(3/8)

D) 5/8 log(3/8) – 3/8 log(5/8)

Solution: (A)

The formula for entropy is H = −Σᵢ pᵢ log(pᵢ), applied here with class probabilities 5/8 and 3/8.

So the answer is A.

9) Let』s say, you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data.

You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges you may face if you have applied OHE on a categorical variable of train dataset?

A) All categories of categorical variable are not present in the test dataset.

B) Frequency distribution of categories is different in train as compared to the test dataset.

C) Train and Test always have same distribution.

D) Both A and B

E) None of these

Solution: (D)

Both are true. OHE will fail to encode categories that are present in the test set but not in the training set, so that is one of the main challenges of applying OHE. The challenge in option B is also real: you need to be more careful when applying OHE if the frequency distribution differs between train and test.

10) Skip gram model is one of the best models used in Word2vec algorithm for words embedding. Which one of the following models depict the skip gram model?

A) A

B) B

C) Both A and B

D) None of these

Solution: (B)

Both models (model1 and model2) are used in the Word2vec algorithm. Model1 represents the CBOW model, whereas Model2 represents the skip-gram model.

11) Let』s say, you are using activation function X in hidden layers of neural network. At a particular neuron for any given input, you get the output as 「-0.0001」. Which of the following activation function could X represent?

A) ReLU

B) tanh

C) SIGMOID

D) None of these

Solution: (B)

The function is tanh, because its output range is (-1, 1); it is the only option here that can output a small negative value such as -0.0001.

12) [True or False] LogLoss evaluation metric can have negative values.

A) TRUE
B) FALSE

Solution: (B)

Log loss cannot have negative values.

13) Which of the following statements is/are true about 「Type-1」 and 「Type-2」 errors?

  1. Type1 is known as false positive and Type2 is known as false negative.
  2. Type1 is known as false negative and Type2 is known as false positive.
  3. Type1 error occurs when we reject a null hypothesis when it is actually true.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 1 and 3

F) 2 and 3

Solution: (E)

In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a 「false positive」), while a type II error is incorrectly retaining a false null hypothesis (a 「false negative」).

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects?

  1. Stemming
  2. Stop word removal
  3. Object Standardization

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) 1,2 and 3

Solution: (D)

Stemming is a rudimentary rule-based process of stripping the suffixes (「ing」, 「ly」, 「es」, 「s」 etc) from a word.

Stop words are those words which will have not relevant to the context of the data for example is/am/are.

Object Standardization is also one of the good way to pre-process the text.

15) Suppose you want to project high dimensional data into lower dimensions. The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let』s say you have applied both algorithms respectively on data 「X」 and you got the datasets 「X_projected_PCA」 , 「X_projected_tSNE」.

Which of the following statements is true for 「X_projected_PCA」 「X_projected_tSNE」 ?

A) X_projected_PCA will have interpretation in the nearest neighbour space.

B) X_projected_tSNE will have interpretation in the nearest neighbour space.

C) Both will have interpretation in the nearest neighbour space.

D) None of them will have interpretation in the nearest neighbour space.

Solution: (B)

The t-SNE algorithm considers nearest-neighbour points when reducing the dimensionality of the data, so after applying t-SNE the reduced dimensions retain an interpretation in nearest-neighbour space. For PCA this is not the case.

Context: 16-17

Given below are three scatter plots for two features (Image 1, 2 3 from left to right).

16) In the above images, which of the following is/are example of multi-collinear features?

A) Features in Image 1

B) Features in Image 2

C) Features in Image 3

D) Features in Image 1 2

E) Features in Image 2 3

F) Features in Image 3 1

Solution: (D)

In Image 1 the features have a high positive correlation, whereas in Image 2 they have a high negative correlation, so in both images the feature pair is an example of multicollinear features.

17) In previous question, suppose you have identified multi-collinear features. Which of the following action(s) would you perform next?

  1. Remove both collinear variables.
  2. Instead of removing both variables, we can remove only one variable.
  3. Removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression.

A) Only 1

B)Only 2

C) Only 3

D) Either 1 or 3

E) Either 2 or 3

Solution: (E)

You cannot remove both features, because removing both would lose all of their information; you should either remove only one of them or use a regularization method such as L1 or L2.

18) Adding a non-important feature to a linear regression model may result in.

  1. Increase in R-square
  2. Decrease in R-square

A) Only 1 is correct

B) Only 2 is correct

C) Either 1 or 2

D) None of these

Solution: (A)

After adding a feature to the feature space, R-squared always increases (or at least never decreases), regardless of whether the feature is important.

19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively.

Now, you have added 2 to all values of X (i.e. the new values become X+2), subtracted 2 from all values of Y (i.e. the new values are Y-2), and Z remains the same. The new coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3?

A) D1 = C1, D2 < C2, D3 > C3

B) D1 = C1, D2 > C2, D3 > C3

C) D1 = C1, D2 > C2, D3 < C3

D) D1 = C1, D2 < C2, D3 < C3

E) D1 = C1, D2 = C2, D3 = C3

F) Cannot be determined

Solution: (E)

Correlation between the features won』t change if you add or subtract a value in the features.

20) Imagine, you are solving a classification problems with highly imbalanced class. The majority class is observed 99% of times in the training data.

Your model has 99% accuracy after taking the predictions on test data. Which of the following is true in such a case?

  1. Accuracy metric is not a good idea for imbalanced class problems.
  2. Accuracy metric is a good idea for imbalanced class problems.
  3. Precision and recall metrics are good for imbalanced class problems.
  4. Precision and recall metrics aren』t good for imbalanced class problems.

A) 1 and 3

B) 1 and 4

C) 2 and 3

D) 2 and 4

Solution: (A)

Refer the question number 4 from in this article.

21) In ensemble learning, you aggregate the predictions for weak learners, so that an ensemble of these models will give a better prediction than prediction of individual models.

Which of the following statements is / are true for weak learners used in ensemble model?

  1. They don』t usually overfit.
  2. They have high bias, so they cannot solve complex learning problems
  3. They usually overfit.

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) Only 1

E) Only 2

F) None of the above

Solution: (A)

Weak learners are only confident about a particular part of the problem, so they usually do not overfit, which means weak learners have low variance and high bias.

22) Which of the following options is/are true for K-fold cross-validation?

  1. Increase in K will result in higher time required to cross validate the result.
  2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K.
  3. If K=N, then it is called Leave one out cross validation, where N is the number of observations.

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1,2 and 3

Solution: (D)

A larger k means less bias towards overestimating the true expected error (as the training folds get closer to the whole dataset) and a higher running time (as you approach the limit case of Leave-One-Out CV). We also need to consider the variance across the k folds' accuracies when selecting k.

Question Context 23-24

Cross-validation is an important step in machine learning for hyper parameter tuning. Let』s say you are tuning a hyper-parameter 「max_depth」 for GBM by selecting it from 10 different depth values (values are greater than 2) for tree based model using 5-fold cross validation.

Training the algorithm (with max_depth 2) on 4 folds takes 10 seconds, and prediction on the remaining fold takes 2 seconds.

Note: Ignore hardware dependencies from the equation.

23) Which of the following option is true for overall execution time for 5-fold cross validation with 10 different values of 「max_depth」?

A) Less than 100 seconds

B) 100 – 300 seconds

C) 300 – 600 seconds

D) More than or equal to 600 seconds

E) None of the above

F) Can't estimate

Solution: (D)

Each iteration at depth 2 in 5-fold cross validation takes 10 seconds for training and 2 seconds for testing, so 5 folds take 12*5 = 60 seconds. Since we are searching over 10 depth values, the algorithm would take 60*10 = 600 seconds. But training and testing at depths greater than 2 take longer than at depth 2, so the overall time will be greater than 600 seconds.

24) In previous question, if you train the same algorithm for tuning 2 hyper parameters say 「max_depth」 and 「learning_rate」.

You want to select the right value against 「max_depth」 (from given 10 depth values) and learning rate (from given 5 different learning rates). In such cases, which of the following will represent the overall time?

A) 1000-1500 second

B) 1500-3000 Second

C) More than or equal to 3000 Second

D) None of these

Solution: (D)

Same as question number 23.

25) Given below is a scenario for training error TE and Validation error VE for a machine learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

Which value of H will you choose based on the above table?

A) 1

B) 2

C) 3

D) 4

E) 5

Solution: (D)

Looking at the table, option D seems the best

26) What would you do in PCA to get the same projection as SVD?

A) Transform data to zero mean

B) Transform data to zero median

C) Not possible

D) None of these

Solution: (A)

When the data has a zero mean vector, PCA gives the same projections as SVD; otherwise you have to centre the data before taking the SVD.

Question Context 27-28

Assume there is a black box algorithm, which takes training data with multiple observations (t1, t2, t3,…….. tn) and a new observation (q1). The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.

You can also think that this black box algorithm is same as 1-NN (1-nearest neighbor).

27) It is possible to construct a k-NN classification algorithm based on this black box alone.

Note: Where n (number of training observations) is very large compared to k.

A) TRUE

B) FALSE

Solution: (A)

In the first step, you pass an observation (q1) to the black-box algorithm, which returns the nearest observation and its class.

In the second step, you remove that nearest observation from the training data and query with (q1) again. The black box returns the next nearest observation and its class.

You repeat this procedure k times.

28) Instead of using the 1-NN black box we want to use the j-NN (j>1) algorithm as black box. Which of the following option is correct for finding k-NN using j-NN?

  1. J must be a proper factor of k
  2. J > k
  3. Not possible

A) 1

B) 2

C) 3

Solution: (A)

Same as question number 27

29) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare Pearson correlation coefficients between variables of each scatterplot.

Which of the following is in the right order?

  1. 1<2<3<4
  2. 1>2>3>4
  3. 7<6<5<4
  4. 7>6>5>4

A) 1 and 3

B) 2 and 3

C) 1 and 4

D) 2 and 4

Solution: (B)

From image 1 to 4 the correlation is decreasing (in absolute value), while from image 4 to 7 the magnitude of the correlation is increasing but the values are negative (for example, 0, -0.3, -0.7, -0.99).

30) You can evaluate the performance of a binary class classification problem using different metrics such as accuracy, log-loss, F-Score. Let』s say, you are using the log-loss function as evaluation metric.

Which of the following option is / are true for interpretation of log-loss as an evaluation metric?

  1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
  2. For a particular observation, the classifier assigns a very small probability for the correct class then the corresponding contribution to the log-loss will be very large.
  3. Lower the log-loss, the better is the model.

A) 1 and 3

B) 2 and 3

C) 1 and 2

D) 1,2 and 3

Solution: (D)

Options are self-explanatory.

Question 31-32

Below are five samples given in the dataset.

Note: Visual distance between the points in the image represents the actual distance.

31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest neighbor)?

A) 0

B) 0.4

C) 0.8

D) 1

Solution: (C)

In leave-one-out cross validation, we select (n-1) observations for training and 1 observation for validation. Treat each point in turn as the validation point and find its 3 nearest neighbours; repeating this for all points, every positive-class point in the figure is classified correctly, but the negative-class points are misclassified. Hence you get 80% accuracy.

32) Which of the following value of K will have least leave-one-out cross validation accuracy?

A) 1NN

B) 3NN

C) 4NN

D) All have same leave one out error

Solution: (A)

With 1-NN, every point ends up misclassified under leave-one-out (its nearest neighbour always belongs to the other class in this dataset), which means you get 0% accuracy.

33) Suppose you are given the below data and you want to apply a logistic regression model for classifying it in two given classes.

You are using logistic regression with L1 regularization.

Where C is the regularization parameter and w1 w2 are the coefficients of x1 and x2.

Which of the following option is correct when you increase the value of C from zero to a very large value?

A) First w2 becomes zero and then w1 becomes zero

B) First w1 becomes zero and then w2 becomes zero

C) Both becomes zero at the same time

D) Both cannot be zero even after very large value of C

Solution: (B)

Looking at the image, classification can be performed efficiently using x2 alone, so w1 will become zero first. As the regularization gets stronger, w2 will also move closer and closer to zero.

34) Suppose we have a dataset which can be trained with 100% accuracy with help of a decision tree of depth 6. Now consider the points below and choose the option based on these points.

Note: All other hyper parameters are same and other factors are not affected.

  1. Depth 4 will have high bias and low variance
  2. Depth 4 will have low bias and low variance

A) Only 1

B) Only 2

C) Both 1 and 2

D) None of the above

Solution: (A)

Fitting a decision tree of depth 4 to data that needs depth 6 for 100% accuracy will most likely underfit, and underfitting means high bias and low variance.

35) Which of the following options can be used to get global minima in k-Means Algorithm?

  1. Try to run algorithm for different centroid initialization
  2. Adjust number of iterations
  3. Find out the optimal number of clusters

A) 2 and 3

B) 1 and 3

C) 1 and 2

D) All of above

Solution: (D)

All of the option can be tuned to find the global minima.

36) Imagine you are working on a project which is a binary classification problem. You trained a model on training dataset and get the below confusion matrix on validation dataset.

Based on the above confusion matrix, choose which option(s) below will give you correct predictions?

  1. Accuracy is ~0.91
  2. Misclassification rate is ~ 0.91
  3. False positive rate is ~0.95
  4. True positive rate is ~0.95

A) 1 and 3

B) 2 and 4

C) 1 and 4

D) 2 and 3

Solution: (C)

The Accuracy (correct classification) is (50+100)/165 which is nearly equal to 0.91.

The true positive rate measures how often the positive class is predicted correctly, so it is 100/105 = 0.95; it is also known as "Sensitivity" or "Recall".

37) For which of the following hyperparameters, higher value is better for decision tree algorithm?

  1. Number of samples used for split
  2. Depth of tree
  3. Samples for leaf

A)1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

E) Can』t say

Solution: (E)

For all three options A, B and C, it is not necessary that if you increase the value of parameter the performance may increase. For example, if we have a very high value of depth of tree, the resulting tree may overfit the data, and would not generalize well. On the other hand, if we have a very low value, the tree may underfit the data. So, we can』t say for sure that 「higher is better」.

Context 38-39

Imagine, you have a 28 * 28 image and you run a 3 * 3 convolution neural network on it with the input depth of 3 and output depth of 8.

Note: Stride is 1 and you are using same padding.

38) What is the dimension of output feature map when you are using the given parameters.

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (A)

The formula for calculating output size is

output size = (N – F)/S + 1

where, N is input size, F is filter size and S is stride.

Read this article to get a better understanding.
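
A small sketch of that formula with the padding term included (my own addition; the stride-2 case in the second call is a hypothetical example, since the parameter image for question 39 is not shown here):

```python
def conv_output_size(n, f, s, p):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3, 1, 1))   # "same" padding for a 3x3 filter, stride 1 -> 28
print(conv_output_size(28, 3, 2, 0))   # no padding, stride 2 -> 13
```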

39) What is the dimensions of output feature map when you are using following parameters.

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (B)

Same as above

40) Suppose, we were plotting the visualization for different values of C (Penalty parameter) in SVM algorithm. Due to some reason, we forgot to tag the C values with visualizations. In that case, which of the following option best explains the C values for the images below (1,2,3 left to right, so C values are C1 for image1, C2 for image2 and C3 for image3 ) in case of rbf kernel.

A) C1 = C2 = C3

B) C1 > C2 > C3

C) C1 < C2 < C3

D) None of these

Solution: (C)

Penalty parameter C of the error term. It also controls the trade-off between smooth decision boundary and classifying the training points correctly. For large values of C, the optimization will choose a smaller-margin hyperplane. Read more here.


Thanks for the invite.

Machine learning engineer is indeed one of the hotter positions on the 100offer platform, but it is also a demanding one. Here I have organized the interview insights that Chen Cun, who works in this field, shared in our recent Zhihu Live.

The sharing covers the following parts:

  1. The interview process
  2. Traits of strong candidates
  3. Interview techniques
  4. Algorithm interview questions

1. The interview process

The interviewer usually opens with the project experience on the candidate's résumé. The points examined are the difficulty of the project and the candidate's main contribution to it. From these the interviewer can gauge the candidate's real ability and technical level, and also predict how the candidate would behave after joining the company. Some technical knowledge from the project is probed along the way. After that, the questions are extended: some candidates talk about their own project very fluently, which may simply mean it was well rehearsed, but become lost when given a new problem, even though the new problem could be solved by transferring the methods from their own project. Many candidates fail right there.

2. Traits of strong candidates

(1) Solid fundamentals

Their fundamentals are very solid: a high ranking at school, good papers and projects, precise answers on the basic algorithms, and code written without mistakes.

(2) Sharp thinking

Whether a candidate thinks "sharply" shows up in two ways. First, the candidate communicates well with the interviewer and the conversation is pleasant. Second, the candidate quickly grasps what the interviewer is testing and demonstrates strong problem-analysis skills. When facing a hard problem, the candidate keeps communicating with the interviewer and they work towards the solution step by step together.

(3) The right attitude

Finally, the right attitude leaves a good impression. Do not exaggerate or fabricate anything on the résumé, and communicate modestly, neither servile nor arrogant.

3. Machine learning interview techniques

(1) Keep the résumé concise

First, the résumé must be concise; do not pile up unimportant material.

(2) Project descriptions should pinpoint the technical difficulties and include concrete evaluation numbers

A good project description in a résumé pinpoints the technical difficulties and includes precise evaluation numbers. For example: "Using a model built with TensorFlow on the public MNIST dataset, I reached 96% accuracy." This sounds professional and makes the project more credible.

(3) Answer the interviewer's questions precisely; do not over-explain

Answer the interviewer's questions precisely; do not ramble, and do not offer excessive elaboration or justification. Over-explaining either means you did not understand what was being asked and are saying irrelevant things, or it means you are nervous and lack confidence. So answer carefully and concisely, and save everyone's time.

4. Algorithm interview questions

In the algorithm part of the interview, a good rhythm is the following:

First, restate the problem to clarify it and to show that you have understood it correctly.

Keep communicating with the interviewer throughout. Do not treat them as an interviewer; treat them as a future teammate: discuss the problem and the approach together, then produce a basic runnable version, even if the algorithm and its efficiency look terrible, and only then iterate on it with the interviewer. This is also the basic working mode on the job: ship a basic runnable version first, then improve it iteratively.

The worst thing you can do in an interview is to work in silence, starting to write code the moment you get the problem without saying a word, only to discover twenty minutes later that the approach is wrong or the problem was misread. That is a very bad outcome.


Related Live content:

How an NLP algorithm engineer learns, grows and gains real-world experience

The path of advancement for machine learning engineers | Zhihu Live notes

Related in-depth content:

We talked to the technical leaders of 4 big-data companies about the opportunities and choices for algorithm and data-mining engineers

Judging from technical leaders' hiring needs, how do you move into the currently scarce big-data roles?


For this question, teacher Zhou Kaituo has already given an excellent summary and reading list, so I suggest reading his answer first. What I offer here are the concrete questions you may actually be asked (programming, basic algorithms, machine learning algorithms). Most of these ninety common algorithm-engineer questions are collected from various sites; I hope you can read some patterns out of them. Partial reference answers are attached at the end.

"Machine learning / big data job responsibilities and preparation (repost)": 機器學習面試的職責和面試問題(轉) - 知乎專欄

"A summary of the pros and cons of classic machine learning algorithms": 機器學習經典演算法優缺點總結 - 知乎專欄

For an introduction to recommender systems, see the column article: 什麼是推薦系統(個性化內容分發)? - 知乎專欄

For an introduction to user profiling, see the column article: 比你更了解你,淺談用戶畫像 - 知乎專欄

You are also welcome to join my Zhihu Lives:

"Those recommendation-algorithm things": 知乎 Live - 全新的實時問答

"Those recommendation-algorithm things (2): the details": 知乎 Live - 全新的實時問答

And stay tuned for the follow-up articles.

1. The difference between struct and class, and which you prefer to use
2. Pros and cons of kNN, naive Bayes and SVM; the core idea of naive Bayes; have you considered the case where the attributes are not mutually independent
3. 1 billion integers, 1 GB of memory, an O(n) algorithm: count the numbers that appear exactly once.
4. SVM for non-linear classification; the role of the kernel function
5. Sorting massive amounts of data
6. Do you normalize the data in your projects? Which machine learning algorithm does not need normalization?
7. Given two arrays, compute their difference
8. Open-ended question: every entity has different attributes, and you have attribute data for many entities; how do you decide whether two entities are the same thing?
9. Write a program implementing binary search, giving both recursive and non-recursive versions, and analyse the time complexity.
10. Implement the reversal of a singly linked list in C/C++.
11. Read a file in Python; write the code
12. In Python, compute the mean and variance of a file with N rows and one number per row; write the code
13. In C++, compute the cosine similarity of two one-dimensional arrays; write the code
14. SVM in detail: support vectors, the concept of geometric margin, how the Lagrangian is used to obtain the separating hyperplane, non-linear classification
15. From massive data, find the 100 numbers that occur most frequently.

16. Reverse a string, written by hand

17. Quicksort, written by hand

18. KNN (classification and regression), CART (regression trees use the squared-error minimization criterion, classification trees the Gini-index minimization criterion), logistic regression (derivation), GBDT (fit a regression tree to the negative gradient of the loss function evaluated at the current model, used as an approximation of the residual in the boosting-tree algorithm), random forest (bagging + CART)

19. Non-recursive preorder traversal of a binary tree; copying one string into another (besides overlapping addresses, also check that the destination has enough space; handle the corner cases thoroughly)

20. A probability puzzle: a display of 6 LED digits; find the readings that are still a valid input after the whole display is rotated 180 degrees (just enumerate the cases exhaustively)
21. Given a scenario, test how well you know machine learning algorithms and their common use cases (keep your thinking broad; I got stuck on a single idea)

22. Given an array, if there exist two numbers whose sum equals a third number in the array, find the largest such third number (say x + y = c)

23. What is the difference between clustering and classification? Classification knows the class labels in advance; clustering does not.

24. Quicksort; how to turn a binary search tree into a doubly linked list as efficiently as possible; retrieve the smallest element of a stack in constant time

25. Neural networks; the derivation of pLSI; float-to-string conversion; deciding whether one tree is a subtree of another.

26. Write out the optimization formulation of SVM and derive SVM

27. Filling numbers into an n*n matrix in spiral order; references are easy to find online
28. The linked-list cycle problem: where is the first node of the cycle?
29. Several sorting algorithms, which you must be able to write out

30. Derive the SVM kernel transformation using the Lagrangian

31. What kinds of trees are there among data structures?

32. Recommender systems

33. Output a circulant matrix; I over-complicated this one and used recursion, but a simple loop is enough
34. Reverse a string; an original problem from Jian Zhi Offer (Coding Interviews)
35. Determine the starting position of a cycle in a linked list

36. The "top K out of N numbers" problem; I explained the heap solution, which went over well, then was asked how the O(N log k) approach could be optimized further; think in the direction of parallelism

37. How do you guarantee that among the 60 people in a class two share a birthday? This sounded odd: (1) why 60 people? (2) why "guarantee"? In any case it is just probability, so compute it.
38. How do you decide whether a string is an e-mail address, e.g. vzcxn@sdf.gre? A finite-state automaton; I was then asked to draw the state-transition diagram.
39. The space complexity of quicksort: I answered O(n). Of merge sort: I answered O(n). He asked me to think harder; I wondered whether the constant factor really could be dropped, and revised my answers to O(n) for quicksort and O(2n) for merge sort.
40. Sort 10^10 64-bit numbers with 100 MB of memory. Luckily a teammate had taught me a median-finding trick the day before. Do it with file operations, quicksort-style: pick a pivot and count how many numbers are larger and how many are smaller than it; whenever a partition can be sorted within 100 MB, sort it, otherwise keep partitioning until it can be.
41. What do the two parameters in main(argc, argv[]) mean
42. The KMP algorithm
43. The elevator problem
44. A word problem testing hash algorithms

45. Maximum subarray sum, using both dynamic programming and divide and conquer; how to work out the time complexity of each
46. Write the code for binary search

47. Count the occurrences of the characters appearing in a string, ignoring case; the string may contain other characters.
48. One 2 GB file contains userid,username pairs and one 3 GB file contains username,userpassword pairs.
Required output: userid,userpassword, on an 8-core CPU with 2 GB of memory

49. Bayesian probability; convolution

50. Finding the common ancestor of two nodes in a binary tree

51. Find the two root-to-node paths, then find the last common node on them.

52. SVM kernel functions; the problem of merging two files

53. B+ and B- trees, red-black trees; write out a sorting algorithm

54. Determine whether two linked lists intersect.

55. Merge sort; copying a linked list with random pointers; and so on

56. Breadth-first and depth-first traversal of a tree

57. The difference between L1 and L2

58. Generative versus discriminative models

59. Hidden Markov models

60. SVM: Chinese word segmentation

61. Association analysis; Apriori

62. Pros and cons of the various algorithms; model-tuning details

63. Feature-extraction methods (the absence of a keyword is also a feature)

64. Stable versus unstable sorts

65. The difference between the RBF kernel and the Gaussian kernel

66. Implement logistic regression in Python

67. ROC and AUC

68. The initial centroids of K-means

69. The difference between deep learning and machine learning; between data mining and artificial intelligence; between the test set and the training set; the concrete steps of the k-means, FCM and SVM algorithms; how to improve the k-means algorithm

70. Non-recursive preorder traversal of a binary tree; work through the non-recursive implementations of preorder, inorder and postorder traversal, and trying several different approaches will each teach you something new.

71. Typical applications and limitations of deep CNNs, deep RNNs and RBMs; go read Hinton's lecture notes and papers

72. What clustering methods are there?

73. How to decide whether a linked list contains a cycle? Answer: traverse it with two pointers, one fast and one slow.

74. How regularization works (L1 and L2)

75. PCA

76. How a university canteen could apply data-mining knowledge

77. Which models overfit easily, and how to choose a model

78. What is fuzzy clustering; also partitioning clustering, hierarchical clustering, and so on

79. Longest increasing subsequence; find the common median of two sorted arrays of the same size

80. Parallel computing; compression algorithms

81. SVD; LDA

82. The difference between naive Bayes and logistic regression
83. The principle and derivation of LDA
84. For ad click-through-rate prediction, what data and what algorithms would you use
85. In recommender systems, which scenarios suit nearest-neighbour methods and which suit matrix factorization
86. How to predict user churn (game-company data-mining teams love this one)
87. What data should be collected during the design of a game
88. How to mine as much information as possible from login logs

89. The core steps of statistical learning: model, strategy and algorithm. You should have a deep understanding of logistic regression, SVM, decision trees, KNN and the various clustering methods, and be able to write from memory the pseudocode of each algorithm's core iterative step, the objective function it optimizes and its dual form.

90. Gradient descent, Newton's method, and the various stochastic search algorithms (genetic, ant colony, and so on)


Reference answers for some of the questions

1. struct has more restrictions compared with class; see the link below for details.

http://angeson1987.blog.163.com/blog/static/1625900902010728035209/

2. The core idea of naive Bayes is to obtain the posterior from the prior; minimizing the expected risk leads to maximizing the posterior probability, so the output is the value that maximizes the posterior (the individual probabilities and priors come from Bayesian estimation, i.e. maximum-likelihood estimation with Laplace smoothing). The features must be mutually independent.


3. Option 1: split the data and process it in a distributed fashion. Option 2: give each possible number a small state with three values, where 01 means "appeared exactly once"; scan the data within the 1-billion range updating the states, then output the numbers whose final state is 01.
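
A minimal sketch of "option 2" above (my own illustration): two bits of state per integer value, 00 = unseen, 01 = seen once, 10 = seen more than once, so values below 10^9 need roughly 250 MB of state.

```python
def count_unique_once(numbers, max_value=10**9):
    states = bytearray((max_value * 2 + 7) // 8)          # 2 bits per possible value
    for n in numbers:
        byte, shift = (2 * n) // 8, (2 * n) % 8
        s = (states[byte] >> shift) & 0b11
        if s < 2:                                          # saturate at "seen twice or more"
            states[byte] = (states[byte] & ~(0b11 << shift)) | ((s + 1) << shift)
    return [n for n in range(max_value)
            if ((states[(2 * n) // 8] >> ((2 * n) % 8)) & 0b11) == 1]

print(count_unique_once([1, 2, 2, 3, 3, 3, 7], max_value=10))   # -> [1, 7]
```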

4. It handles non-linear classification.

5. Bit manipulation.

6. A question about scale and units: normalization speeds up iterative optimization (gradient descent) and improves accuracy (KNN).

7. Hands-on. The algorithm flow:

Take the first not-yet-compared element array1(i) from array 1 and compare it with array2(j) (where j > i and j ...

1. If array 2 contains an element equal to array1(i), swap array2(j) with array2(i), increment i, and move to the next iteration.

2. If no element equal to array1(i) is found by the end of array 2, swap array1(i) with the last not-yet-compared element array1(k), do not increment i, and move to the next iteration.

8. Override the equals method and compare the objects' attributes field by field.

9. Hands-on.

10. Hands-on. To reverse the singly linked list, first make the head node's next point to node 2, then make node 1's next point to node 3, and finally make node 2's next point to node 1; after this first swap the order becomes Header - node 2 - node 1 - node 3 - node 4 - NULL. Repeat the same swap to move node 3 in front of node 2, then move node 4 in front of node 3, and the reversal is done. With the idea clear, write the code: each step moves the node that comes right after the original first node.
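
A minimal iterative version of that pointer-rewiring idea (my own sketch, in Python rather than the C/C++ asked for in the question):

```python
class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def reverse(head):
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next   # re-point one node per step
    return prev

head = Node(1, Node(2, Node(3, Node(4))))               # 1 -> 2 -> 3 -> 4
node = reverse(head)
while node:
    print(node.value, end=" ")                          # 4 3 2 1
    node = node.next
```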

11. 12. 13. 14. Hands-on.

15. Handling massive-data problems almost always comes down to the following; see the link for details:

http://blog.csdn.net/flyqwang/article/details/7395866:

divide and conquer / hash partitioning + hash counting + heap/quick/merge sort;

two-level bucket partitioning;

Bloom filter / Bitmap;

trie / database / inverted index;

external sorting;

distributed processing with Hadoop/MapReduce.

16. Hands-on. View the original array as ab and the target as ba: first reverse the sub-array a on its own to get a' (a' denotes the reversal of a), likewise reverse b to get a'b', and finally reverse the whole of a'b' once to get (a'b')', which is exactly the desired result ba.
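
A short sketch of that triple-reversal trick (my own illustration):

```python
def rotate_left(s, k):
    a, b = s[:k], s[k:]
    return (a[::-1] + b[::-1])[::-1]    # (a'b')' == ba

print(rotate_left("abcdefg", 2))        # -> "cdefgab"
```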

17. 18. Hands-on.

19. Hands-on: http://ocaicai.iteye.com/blog/1047397

22. Sort first, then scan the array: for each candidate element, check whether it equals the sum of two other elements using a pair of pointers, moving the left pointer forward when the sum is too small and the right pointer back when it is too large.
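
A minimal sketch of that sort-plus-two-pointers approach for question 22 (my own illustration):

```python
def largest_sum_of_two(nums):
    a = sorted(nums)
    for k in range(len(a) - 1, 1, -1):          # try the largest candidates first
        lo, hi = 0, k - 1
        while lo < hi:
            s = a[lo] + a[hi]
            if s == a[k]:
                return a[k]
            if s < a[k]:
                lo += 1                          # sum too small: advance the left pointer
            else:
                hi -= 1                          # sum too large: retreat the right pointer
    return None

print(largest_sum_of_two([3, 9, 4, 1, 5, 12, 8]))   # -> 12 (3 + 9)
```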

23. Classification has its classes defined in advance and the number of classes is fixed. Clustering methods include K-means, K-medoids, CLARANS, BIRCH, CLIQUE, DBSCAN, and so on.

24. Hands-on: http://m.blog.csdn.net/blog/wkupaochuan/8912622,

http://blog.163.com/kevinlee_2010/blog/static/169820820201092091554523/

25. Hands-on. The subtree problem takes two steps:

find a root with the same value (by traversal);

check whether the two nodes "contain" each other (recursively: same value, same left child, same right child).

26. Hands-on.

27. http://www.docin.com/p-19876385.html

28. http://blog.csdn.net/kerryfish/article/details/24043099

29. 30. Hands-on.

31. Binary search tree (binary sort tree), balanced binary tree (AVL tree), red-black tree, B-tree, B+ tree, trie (dictionary tree), suffix tree, generalized suffix tree. See the link below for details.

http://www.cnblogs.com/dong008259/archive/2011/11/22/2255361.html

32. http://baike.baidu.com/link?url=ECsYE4xe1gguMd3R5X4x5V7eQX54NkFp0PJ0FYbAvgJIFPDiaCdD_PuftDAYZTuzH0EuIobF1vDa2Vx2rj6Dda

33. Hands-on. 34. See 16; hands-on. 35. See 28.

36. http://www.douban.com/note/275544555/

37. One minus the probability that all 60 people have different birthdays ≈ 100%.

38. http://zhidao.baidu.com/link?url=DAnewo2j-Jz2u3WwyhFb4kYpnI3QZzfBqsQXdzVG9R061hcBTCUu01WXtoX5T89SmiiMJ_eWIkXOAAe1lhDFM0S7OPjnL_zTEX3Mm1ARc-a

39. Space complexity: quicksort is O(log n) on average and O(n) in the worst case (for the recursion stack); merge sort needs O(n) auxiliary space.

40. http://blog.csdn.net/guyulongcs/article/details/7520467

41. args holds the Java command-line arguments; when running a Java program from the command line as "java ClassName args", this array receives those parameters. You can also call another class's main method directly from your own main, since main is static and can be invoked as ClassName.main, passing in a String[] (or nothing) at the call site.

42. http://www.cnblogs.com/goagent/archive/2013/05/16/3068442.html

43. http://www.cnblogs.com/BeyondAnyTime/archive/2012/06/06/2538764.html

44. http://www.360doc.com/content/13/0409/14/10384031_277138819.shtml

45. http://blog.csdn.net/wwj_748/article/details/8919838

46. 47. Hands-on.

48. http://freewxy.iteye.com/blog/737576

49. http://book.51cto.com/art/201205/338050.htm

http://www.guokr.com/post/342476/

50. http://blog.csdn.net/zcsylj/article/details/6532787

http://www.cnblogs.com/chlde/archive/2012/10/26/2741380.html

51. 52. 53. Hands-on.

54. http://m.blog.csdn.net/blog/yangmm2048/44924997

55. Hands-on.

56. http://driftcloudy.iteye.com/blog/782873

57. Hands-on.

58. Generative models learn P(X, Y) first and then derive P(Y|X); discriminative models learn P(Y|X) directly.

59. Hands-on.

60. Use LDA to extract features, then classify with an SVM.

61. 62. 63. Hands-on.

64. If two equal elements a1 and a2 keep their relative order after sorting, the sort is stable. Stable sorts include insertion sort, bubble sort and merge sort.

65. They are the same thing.

66. http://blog.csdn.net/zouxy09/article/details/20319673

67. http://www.tuicool.com/articles/q6zYrq

68. http://www.cnki.com.cn/Article/CJFDTotal-DNZS200832067.htm

69. Hands-on.

70. See 19.

71. Hands-on.

72. K-means, K-medoids, CLARANS, BIRCH, CLIQUE, DBSCAN, and so on.

73. http://m.blog.csdn.net/blog/lavor_zl/42784247

74. 75. 76. 77. Hands-on.

78. http://blog.csdn.net/xiahouzuoxin/article/details/7748823

http://www.cnblogs.com/guolei/p/3899509.html

79. http://blog.csdn.net/zcsylj/article/details/6802062

http://www.cnblogs.com/davidluo/articles/k-smallest-element-of-two-sorted-array.html

80. http://www.doc88.com/p-1621945750499.html

81. Hands-on.

82. http://m.blog.csdn.net/blog/muye5/19409615

83. Hands-on.

84. http://bbs.pinggu.org/thread-3182029-1-1.html

85. http://www.doc88.com/p-3961053026557.html

86. http://www.docin.com/p-1204742211.html

87.

88.http://www.docin.com/p-118297971.html

89. 90. Hands-on.


I have interviewed for data mining at Baidu and Alibaba.

Algorithms: SVM and the EM algorithm, both requiring full derivations and proofs.
Because I had worked on text topic models, there were many questions about LDA.
The rest was fairly beside the point.


Help old ladies across the street and accumulate karma.

A few takeaways first, then my personal experience.
1. You must know the basic models
K-means you must know cold; for KNN the principle is enough, you will not be asked to code it; read up on SVM, MapReduce and PageRank and use them when discussing problems (this is not about faith, the interviewer knows them all).

2. The high-level models are a matter of faith; do not waste time on them
LDA, for example. I used to believe in the Reverend Bayes and found it impressive. Only after doing it did I learn how deep the pit is: it is fine for papers and for massaging dubious data, but in real industrial use the results are quite unstable. HMMs can help with sentence analysis, but elsewhere the gains are not significant. Most of the time a word net plus tf-idf is enough, and if not, adding some hardcoded rules beats the so-called models in both quality and performance.
Put another way, the interviewer does not expect you to know such advanced models, just as they cannot reject you for low intelligence because you cannot play the violin.

3. Karma, confidence, and taking the initiative in the interview
In my own experience (an observation; details at the end), I got an offer when I knew little but had good karma, and failed when I knew the material but the karma was off. (Predict from that as you will.)
My more successful interview tactic is to watch for the interviewer's cooldown and then fire the big moves: when they sink into the résumé, proactively offer to highlight something, and show a live online project if you have one; after finishing the code, proactively talk about possible improvements. Lead the interviewer away and they will not drag you into the traps they have dug.

4. A feel for data
For example, estimating data sizes and running times.

I have been through the following cases; judge for yourself whether they match.
1. 2010, Beijing, fresh out of school. My applications for PhD programs abroad failed. A close friend pointed me to an ML lab at T University for seasoning. I prepared for two weeks and realized the only thing I really understood was Bayes' formula (the rest I only recognized by their acronyms). The appointment had been made, so I went anyway with gritted teeth.
It turned out the professor just glanced at my résumé and asked when I could start. =_=

2. A year and a half later, doing a master's in the US. Under heavy financial pressure I started looking for internships before the summer (genuinely hard, because at the time ML or Data Scientist openings were all for full-time hires or for PhDs introduced by their advisors). Nobody responded to my résumé; knowing iOS or Android seemed far more marketable. I finally got a start-up interview, proactively showed the projects I had done, they were quite satisfied, then had me write PageRank, which I banged out in 30 minutes and that was it (looking up formulas on Wikipedia was allowed). The boss did not understand / did not believe in LDA; it was SVM for everything anyway.

3. A classmate got an eBay phone screen and dragged me in to pass notes: probability basics, linear regression, coding string manipulation, all fundamentals.

4. Later I interviewed with Yelp. In the second phone round I coded too fast, which left him time for a second follow-up question about vote counting; I proposed two normalization schemes, neither complete, and they passed on me.

5. Facebook never got back to me, presumably because I had twice exceeded the rate limit while crawling data from the Graph API and had an ID banned (their own documentation was not clear). Google asked standard programmer questions, little to do with ML. I interviewed for Test; after coding they asked me to produce my own test data, I produced too much, got frowned upon, and was cut.

6. Then an onsite at Bing. A "super day" with six interviewers: algorithms in the earlier sessions, query analysis over a bowl of wonton noodles at the end. No offer in the end, because they had absorbed 25,000 employees from a certain N-company and used up the hiring quota. (Supposedly a report was even filed with the EVP and not approved. Oh well.)

Final advice:
Accumulate karma.
Trust your own mathematical and logical intuition, not the Model.
Believing in the Reverend Bayes and obtaining eternal life are relatively independent events.
So "believe in Bayes and receive eternal life" is, within Bayes' own axioms, at best a paradox.


On what gets tested, the answers above are already very detailed, so let me just paste a few write-ups of my data-mining-intern interviews, all fresh blood and tears from 2015. (The site runs on a 100-yuan overseas virtual host with no optimization whatsoever, so the pages open very slowly; apologies.)

Baidu data-mining intern engineer, first and second onsite rounds (Shenzhen)
Alibaba phone interview, round 2 summary (data mining, Tmall business unit)
Baidu NLP phone interview summary

How to prepare? My personal experience:

1. Coding and algorithms: the basic algorithms (such as quicksort, which must be at your fingertips) + Jian Zhi Offer (Coding Interviews; interviews often feature similar problems) + LeetCode (a supplement to Coding Interviews that builds hands-on ability)
2. Machine learning: Li Hang's "Statistical Learning Methods" (reading it three times is not excessive!) + Coursera Stanford "Machine Learning" (very accessible, but it does not tell you the why) + Coursera National Taiwan University "Machine Learning Techniques" (detailed derivations of SVM, ensembles and other models, with their strengths and weaknesses)
3. Go over your projects in detail: which algorithms you used, why you used them, their pros and cons, and so on. If you lack project experience, take part in the Tmall big-data competition or Kaggle.
4. "How to instantly crush 99% of the massive-data-processing interview questions" (almost every interview has one massive-data question).

PS: I finally received a data-mining-intern offer from an internet company. So happy!


1. Decision-tree models matter. Nobody will ask you about something as simple as ID3, perhaps not even CART, but they will ask about GBDT and random forests.

2. SVM will definitely be asked.

3. Only a deep understanding lets you explain it clearly to the interviewer.


It comes down to steady accumulation. Below are the resources I have collected...

Machine learning interview preparation (continuously updated): a portal of excellent blog posts and collected resources - 做一枚優秀的程序猿 - CSDN.NET blog


Knowing logistic regression is enough; skim the rest. Be solid at coding, and know in detail how features are extracted and processed. That is enough; there is not that much mumbo-jumbo to it.

It mostly comes down to practice. To prepare, review some C++ and get familiar with your own projects.


I recently saw a related answer on Quora, so I am reposting it here.
The link: the quora.com page
Anyone doing machine learning can read English without trouble, so I will not translate it.

----- The original answer follows -----
First, here is my list of all skills I might want to see for this position:

Academic

  • CS coursework
  • Stats and linear algebra
  • Some ML coursework, covering at least
    • regression
    • classification
    • clustering
    • recommendation
    • graphical models

Data Collection Tools

  • Hadoop-based tools like Flume / Sqoop
  • Text munging languages like Python, or maybe Perl
  • Basic SQL

Data Modeling Tools

  • A library like scipy / numpy or Weka
  • A tool like R (or commercial equivalents like SAS, SPSS)

Model Serving Tools

  • (Ideally) some familiarity with PMML
  • Basic knowledge of a NoSQL store
  • Systems language skill, like Java

Business Smarts

  • Communication skills
  • Some facility with a visualization tool, even if gnuplot or Excel
  • Domain knowledge relevant to my business

You certainly don"t need all of that. In fact, for an internship, you can"t be expected to have most of it. I assume you are in school, so I would expect you to have much of the academic background, and would like to see that you have some of the tool skills. I would not expect business skills, but believe me, communication skills are a big differentiator.


So what to focus on? First, academics. If I were interviewing you I would probably ask about this as a filter. If you're not able to explain the very basics, like what linear regression does, that means there's a big lack of either knowledge or communication skills. So I would feel comfortable with the very basics. I'd ask you to explain one moderately advanced algorithm and why it works, of your choice. Same reasoning, if you can't pick something out of everything you know to explain reasonably, probably not going to proceed.

Unfortunately I do think a lot of interviews focus too much on the math and algorithms like it was an exam. I would not want to work at places that think that's the important thing. I personally would want to see that you're smart and communicate well and know the basics. Chances are that whatever math is relevant to my business is something you'll need to learn (more) anyway.


I know you"re asking about tools though. The tools that are relevant really depend on the kind of place you"re applying. A classic research department is going to focus mostly on modeling tools. Since you can"t get SAS / SPSS easily, focus on R and Weka as a skill.

At the other end of the spectrum, say, a small startup, the requirement is broader and shallower. They won't need you to know R. They will need you to quickly understand a business problem and put together a production-ready system to solve it. So it's much more about data collection, munging, a little modeling, and then integration. For that I would make sure you know how to get data out of a DB or log files, into a modeling tool, and then how to transform a model into some code someone could put in a web server. So: basic SQL, Python or Java, and whatever DB / web serving tools the company uses.


Kaggle is great practice although it will not "test" your data collection skills or the serving side of things. But it will challenge you to understand a business problem, munge real data and model it. I would look favorably on an intern who had taken the time to solve a Kaggle problem and done reasonably well.


Know the common models and algorithms; for at least one or two of them you should be able to derive them, state where they apply, implement them yourself, and have applied them in practice.
Also, have the temperament of an algorithm engineer: the inclination and ability to think deeply about anything.
Supposedly, if you are applying for a junior engineer role at a big company, grinding problems helps a tiny little bit, and only a tiny little bit.

