How do you prepare for a machine learning engineer interview?

1. Algorithms and theoretical foundations
2. Engineering implementation and coding ability
3. Business understanding and depth of thinking

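1. On the theory side, I recommend the classic book 《統計學習方法》 (Statistical Learning Methods). It may not be the most comprehensive, but it distills the essentials best, and it is thin enough for last-minute cramming before an interview.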





2. Engineering implementation and coding ability



3. I'm sorry to disappoint you, but although machine learning interviews mainly test points 1 and 2




The difference between naive Bayes and logistic regression





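PS: I am the original poster, answering my own question.
PS2: All of my interviews were with domestic (Chinese) internet and game companies.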


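For fresh graduates interviewing for "machine learning" positions in campus recruiting, Niumei (牛妹) has compiled a batch of interview write-ups for reference: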


1. Cainiao, image and graphics algorithms, internal referral, first-round interview






7. JD.com AI and Big Data Department interview write-up


















9. Cainiao Network first-round interview write-up, data mining position







16. Baidu (Beijing), machine learning / data mining, early-batch recruitment














30. [Data mining interview write-up] Tencent + Baidu + Huawei (SP offers from all three)




1. Project overview


2. Model overview


Random forests, or random decision forests, are an ensemble learning method (what kind of method) for classification, regression and other tasks (what problem it solves) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees (basic principle).


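3. Strengths and weaknesses of the model

The strengths and weaknesses of a model are closely tied to its overview, so the two questions can be prepared together. For example, having just explained what a random forest is, you can go straight on to its advantages: a. it performs well on many data sets, with fairly high accuracy; b. it is not prone to overfitting; c. it yields a ranking of variable importance; d. it handles both discrete and continuous data, with no need for normalization; e. it handles missing data well; f. it is easy to parallelize; and so on. Tying theory to practice is also a very good angle, for instance how these advantages showed up in the spam classifier project. Making that connection better demonstrates the candidate's understanding and command of the model.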

4. Model principles and related technical details




5. Side-by-side comparison of models



6. Open-ended questions


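Returning once more to the spam classifier project, there are several open-ended questions that could be asked about it. For example: 1. The email data has missing values; how is missing data usually handled? 2. Spam classification is a classification problem on an imbalanced data set; how should we model this kind of problem? 3. In the project, PCA was used for feature selection; apart from that, what other methods can be used for feature selection?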

7. Preparation materials


A. Stanford CS229 Machine Learning.

B. CMU 10-701 Introduction to Machine Learning.

C. The Elements of Statistical Learning. By Trevor Hastie, Robert Tibshirani and Jerome Friedman.

D. Pattern Recognition and Machine Learning. By Christopher Bishop.




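Then I said I had worked with LDA, so they asked me for the definition and properties of the Dirichlet distribution, why it is conjugate to the multinomial distribution, and, while they were at it, what a conjugate distribution is.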






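  1. Data science (data analysis + machine learning)

    1. Getting started

      1. How to get started with machine learning

      2. What data scientists do at a company

      3. Categories of machine learning

      4. Whether to switch careers to machine learning

    2. Applications

      1. Ad search

        1. How search advertising works internally: condensed version

        2. How search advertising works internally: full version

      2. Deep learning

        1. Applications of large-scale deep learning

        2. Deep learning and self-driving cars / robots

        3. Google's explorations in machine learning

      3. Recommender systems

        1. How to build a good recommender system

        2. Recommender systems at the App Store (1)/(2)

      4. Airbnb big-data prediction

        1. Airbnb machine learning in practice

        2. Airbnb big-data prediction (1)/(2)/(3)/(4)/(5)/(6)

      5. Visual question answering

        1. Deep learning in practice: visual question answering

      6. User analysis

        1. User analysis in practice with R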



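The machine learning questions I was asked covered roughly: LR, SVM, PR curves, the assumption behind naive Bayes, ensemble methods, which feature a decision tree node uses for splitting, how GBDT works, how random forests work, how PCA and LDA reduce dimensionality, writing out the formulas for k-means and GMM, what feature selection methods exist, the difference between CNNs and RNNs, the distance metrics you know, the loss functions you know, and reservoir sampling. For details, see https://zhuanlan.zhihu.com/p/30420494.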
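Reservoir sampling is the item in that list most likely to be asked as a coding question. A minimal sketch of the classic single-pass version follows; the function name `reservoir_sample` and the optional `rng` argument are illustrative choices, not taken from the linked post.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 5, random.Random(0))
print(sample)
```

The stream is read once and only k items are ever held in memory, which is the property interviewers usually want you to explain.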




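While writing this answer I am also taking the opportunity to sort through some interview questions I have collected over time (these lean toward Data Scientist roles; questions and answers are written in separate blocks; every question has a reasonably trustworthy source; the answers are updated from time to time):

I. From ANALYTICS VIDHYA: 40 Interview Questions asked at Startups in Machine Learning / Data Science (these questions do not ask you to write out derivations; they mainly test how machine learning is applied in practice. I have picked a few of the more interesting ones to post here.)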


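Q1 How to reduce the dimensionality of data under limited memory (an interesting question; the article's answer is very detailed and links to fuller explanations of the methods it mentions. Kudos!)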

You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Q2 Your classification model reaches 96% accuracy on a cancer detection data set, but "you are not truly happy". Why? What should you do? (A basic question about performance metrics.)

You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

Q3 How to handle multicollinearity in a regression model

After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he's right? Without losing any information, can you still build a better model?

Q4 How to select important features

While working on a data set, how do you select important variables? Explain your methods.

Q5 Which cross-validation method should be used on a time series data set?

What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?


A1 (key points in bold, same below)

Processing high-dimensional data on a machine with limited memory is a strenuous task, and your interviewer will be fully aware of that. The following methods can be used to tackle such a situation:

1. Since we have limited RAM, we should close all other applications on the machine, including the web browser, so that most of the memory can be put to use.

2. We can randomly sample the data set. That is, we can create a smaller data set, say with 1000 variables and 300000 rows, and do the computations on it.

3. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we'll use correlation. For categorical variables, we'll use the chi-square test.

4. We can also use PCA and pick the components that explain the maximum variance in the data set.

5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.

6. Building a linear model using Stochastic Gradient Descent is also helpful.

7. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach, and failing to identify useful predictors might result in significant loss of information.

Note: For points 5 and 6, make sure you read about online learning algorithms and Stochastic Gradient Descent. These are advanced methods.


A2

If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because the 96% (as given) might only reflect correct predictions of the majority class, whereas our class of interest is the minority class (4%): the people who actually got diagnosed with cancer. Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can take the following steps:
1. We can use undersampling, oversampling or SMOTE to make the data balanced.
2. We can alter the prediction threshold by doing probability calibration and finding an optimal threshold using the AUC-ROC curve.
3. We can assign weights to classes such that the minority class gets a larger weight.
4. We can also use anomaly detection.
Know more: Imbalanced Classification
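To make the point about accuracy concrete, here is a small sketch with made-up labels: 100 patients, 4 of whom have cancer, and a degenerate model that always predicts "no cancer". The data and the helper `binary_metrics` are illustrative assumptions, not part of the quoted answer.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR) and specificity (TNR) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical data: 4% positive class, model predicts the majority class only.
y_true = [1] * 4 + [0] * 96
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The model scores 96% accuracy while detecting none of the positive cases, which is why class-wise metrics are needed here.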


A3

To check multicollinearity, we can create a correlation matrix to identify and remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can calculate VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. We can also use tolerance as an indicator of multicollinearity.
But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. We can also add some random noise to the correlated variables so that they become different from each other. But adding noise might affect the prediction accuracy, so this approach should be used carefully.
Know more: Regression
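As a rough illustration of the VIF check, consider the simplest case of only two predictors. Regressing one on the other gives R² equal to the squared Pearson correlation r², so VIF = 1 / (1 - r²). The sketch below uses that special case with made-up data; with more predictors you would regress each predictor on all the others instead.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2).

    Undefined (division by zero) if the predictors are perfectly collinear.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: x2 nearly duplicates x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]

print(vif_two_predictors(x1, x2))  # well above 10: serious multicollinearity
print(vif_two_predictors(x1, x3))  # close to 1: no multicollinearity
```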


A4

1. Remove the correlated variables prior to selecting important variables
2. Use linear regression and select variables based on p values
3. Use Forward Selection, Backward Selection, Stepwise Selection
4. Use Random Forest, Xgboost and plot variable importance chart
5. Use Lasso Regression
6. Measure information gain for the available set of features and select top n features accordingly.


A5

In a time series problem, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5 folds, as shown below:
fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
where 1,2,3,4,5,6 represent "year".
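The forward chaining scheme is easy to generate programmatically. A minimal sketch, assuming the periods are already in chronological order (the helper name `forward_chaining_folds` is made up for illustration):

```python
def forward_chaining_folds(periods):
    """Return (train, test) pairs where each test period follows its training periods."""
    return [(periods[:i], [periods[i]]) for i in range(1, len(periods))]


folds = forward_chaining_folds([1, 2, 3, 4, 5, 6])
for n, (train, test) in enumerate(folds, start=1):
    print(f"fold {n} : training {train}, test {test}")
```

Every fold tests only on data that comes after its training window, so the model is never validated on the past.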

II. From Quora, "20 questions to detect a fake data scientist":

1 What is the life cycle of a data science project?

2 How do you measure yield (over base line) resulting from a new or refined algorithm or architecture?

3 What is cross-validation? How to do it right?

4 Is it better to design robust or accurate algorithms?

5 Have you written production code? Prototyped an algorithm? Created a proof of concept?

6 What is the biggest data set you have worked with, in terms of training set size, and in terms of having your algorithm implemented in production mode to process billions of transactions per day / month / year?

7 Name a few famous APIs (for instance Google search). How would you create one?

8 How to efficiently scrape web data, or collect tons of tweets?

9 How to optimize algorithms (parallel processing and/or faster algorithm: provide examples for both)

10 Examples of NoSQL architecture?

11 How do you clean data?

12 How do you define / select metrics? Have you designed and used compound metrics?

13 Examples of bad and good visualizations?

14 Have you been involved - as an adviser or architect - in the design of dashboard or alarm systems?

15 How frequently an algorithm must be updated? What about lookup tables in real-time systems?

16 Provide examples of machine-to-machine communication.

17 Provide examples where you automated a repetitive analytical task.

18 How do you assess the statistical significance of an insight?

19 How to turn unstructured data into structured data?

20 How to very efficiently cluster 100 billion web pages, for instance with a tagging or indexing algorithm?

If you were interviewing a data scientist, what questions would you ask her?

III. From ANALYTICS VIDHYA (45 questions specifically on regression): 45 questions to test a Data Scientist on Regression (Skill test - Regression Solution)

IV. From Data Science Central (100+ Common Data Science Interview Questions): 100+ Common Data Science Interview Questions


1. Statistics
2. Programming:General,Big Data,Python,R,SQL
3. Modeling
4. Behavioral
5. Culture Fit
6. Problem-Solving

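V. From ANALYTICS VIDHYA: 30 written-test questions on NLP (unlike the last set below, I am not pasting the questions here, otherwise this answer would get so long that you would lose patience; here is just the introduction): 30 Questions to test a data scientist on Natural Language Processing [Solution: Skilltest – NLP]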


Humans are social animals and language is our primary tool to communicate with the society. But, what if machines could understand our language and then act accordingly? Natural Language Processing (NLP) is the science of teaching machines how to understand the language we humans speak and write.
We recently launched an NLP skill test on which a total of 817 people registered. This skill test was designed to test your knowledge of Natural Language Processing. If you are one of those who missed out on this skill test, here are the questions and solutions.
Here are the leaderboard rankings for all the participants.


VI. From ANALYTICS VIDHYA (45 questions on tree-based algorithms, including decision trees, random forests and XGBoost): 45 questions to test Data Scientists on Tree Based Algorithms (Decision tree, Random Forests, XGBoost)

VII. From ANALYTICS VIDHYA (25 questions on image processing, not deep-learning oriented): https://www.analyticsvidhya.com/blog/2017/10/image-skilltest/


Extracting useful information from unstructured data has always been a topic of huge interest in the research community. One such example of unstructured data is an image, and analysis of image data has applications in various aspects of business.
This skilltest is specially designed to test your knowledge of how to handle image data, with an emphasis on image processing. More than 300 people registered for the test. If you are one of those who missed out on this skill test, here are the questions and solutions.

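VIII. Finally, from ANALYTICS VIDHYA (these 40 questions are more of a written machine learning test, but they are well worth reading): 40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017] (That's all for now; I will add more when I have time. If these questions were helpful, please give this answer an upvote.)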

1) Which of the following statement is true in following case?

A) Feature F1 is an example of nominal variable.
B) Feature F1 is an example of ordinal variable.
C) It doesn't belong to any of the above categories.
D) Both of these

Solution: (B)

Ordinal variables are variables whose categories have some order. For example, grade A should be considered a higher grade than grade B.

2) Which of the following is an example of a deterministic algorithm?


A) PCA

B) K-Means

C) None of the above

Solution: (A)

A deterministic algorithm is one whose output does not change across runs. PCA gives the same result every time it is run, whereas k-means, which starts from randomly initialized centroids, may not.

3) [True or False] The Pearson correlation between two variables can be zero even though their values are still related to each other.

A) TRUE

B) FALSE

Solution: (A)

For example, take $Y = X^2$ with X distributed symmetrically about zero. Not only are the two associated, one is a function of the other, yet the Pearson correlation between them is 0.
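A quick numerical check of this claim, using a small made-up sample that is symmetric about zero (the `pearson` helper is written out by hand so the sketch needs no third-party library):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# X is symmetric about zero and Y is an exact function of X.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [v ** 2 for v in xs]

r = pearson(xs, ys)
print(r)
```

The symmetry matters: for a sample such as 0, 1, 2, 3 the same function gives a strongly positive correlation.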

4) Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?

  1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
  2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
  3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (A)

In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.

5) Which of the following hyper parameter(s), when increased may cause random forest to over fit the data?

  1. Number of Trees
  2. Depth of Tree
  3. Learning Rate

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (B)

Usually, increasing the depth of the trees will cause overfitting. Learning rate is not a hyperparameter in random forest. Increasing the number of trees does not cause overfitting; averaging over more trees generally reduces variance.

6) Imagine you are working with "Analytics Vidhya" and you want to develop a machine learning algorithm which predicts the number of views on the articles.

Your analysis is based on features like author name, number of articles written by the same author on Analytics Vidhya in past and a few other features. Which of the following evaluation metric would you choose in that case?

  1. Mean Square Error
  2. Accuracy
  3. F1 Score

A) Only 1

B) Only 2

C) Only 3

D) 1 and 3

E) 2 and 3

F) 1 and 2


Solution: (A)

The number of views of an article is a continuous target variable, so this is a regression problem. Mean squared error is therefore the appropriate evaluation metric.

7) Given below are three images (1,2,3). Which of the following option is correct for these images?




A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.

B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.

C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.

D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

Solution: (D)

The range of the SIGMOID function is (0, 1).

The range of the tanh function is (-1, 1).

The range of the ReLU function is [0, infinity).

So option D is the right answer.

8) Below are the 8 actual values of target variable in the train file.


What is the entropy of the target variable?

A) -(5/8 log(5/8) + 3/8 log(3/8))

B) 5/8 log(5/8) + 3/8 log(3/8)

C) 3/8 log(5/8) + 5/8 log(3/8)

D) 5/8 log(3/8) – 3/8 log(5/8)

Solution: (A)

The formula for entropy is

$$H = -\sum_{i} p_i \log p_i$$

With class proportions of 5/8 and 3/8, this matches option A.
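The eight target values are not reproduced in this copy of the question, but the answer options imply a 5/3 split between two classes. Under that assumption (the `labels` list below is hypothetical), the entropy can be checked numerically, here using base-2 logarithms:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Hypothetical target column: 5 values of one class, 3 of the other.
labels = [1, 1, 1, 1, 1, 0, 0, 0]

h = entropy(labels)
print(round(h, 4))
```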

9) Let's say you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data.

You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges may you face if you have applied OHE on a categorical variable of the train dataset?

A) All categories of categorical variable are not present in the test dataset.

B) Frequency distribution of categories is different in train as compared to the test dataset.

C) Train and Test always have same distribution.

D) Both A and B

E) None of these

Solution: (D)

Both are true. OHE will fail to encode categories that are present in test but not in train, so this is one of the main challenges when applying OHE. The challenge given in option B is also real: you need to be more careful when applying OHE if the frequency distribution is not the same in train and test.
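The unseen-category problem can be shown with a hand-rolled encoder. This is only a sketch with made-up colour data; mapping an unknown category to an all-zero vector is one possible policy among several (raising an error or using a dedicated "other" column are alternatives).

```python
def fit_categories(values):
    """Learn the category vocabulary from the training column only."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category unseen in training becomes all zeros."""
    return [1 if value == c else 0 for c in categories]


train_col = ["red", "green", "blue", "green"]
test_col = ["green", "purple"]  # "purple" never appears in train

categories = fit_categories(train_col)
encoded = [one_hot(v, categories) for v in test_col]
print(categories)
print(encoded)
```

The all-zero row for "purple" carries no information about that category, which is exactly the challenge described in option A.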

10) Skip gram model is one of the best models used in Word2vec algorithm for words embedding. Which one of the following models depict the skip gram model?

A) A

B) B

C) Both A and B

D) None of these

Solution: (B)

Both models (model 1 and model 2) are used in the Word2vec algorithm. Model 1 represents the CBOW model, whereas model 2 represents the skip-gram model.

11) Let's say you are using activation function X in the hidden layers of a neural network. At a particular neuron, for a given input, you get the output "-0.0001". Which of the following activation functions could X represent?


B) tanh


D) None of these

Solution: (B)

The function is tanh, because its output range is (-1, 1) and so it can produce a negative value.

12) [True or False] LogLoss evaluation metric can have negative values.

A) TRUE

B) FALSE

Solution: (B)

Log loss cannot have negative values.

13) Which of the following statements is/are true about "Type-1" and "Type-2" errors?

  1. Type1 is known as false positive and Type2 is known as false negative.
  2. Type1 is known as false negative and Type2 is known as false positive.
  3. Type1 error occurs when we reject a null hypothesis when it is actually true.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 1 and 3

F) 2 and 3

Solution: (E)

In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a type II error is incorrectly retaining a false null hypothesis (a "false negative").

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects?

  1. Stemming
  2. Stop word removal
  3. Object Standardization

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) 1,2 and 3

Solution: (D)

Stemming is a rudimentary rule-based process of stripping the suffixes ("ing", "ly", "es", "s" etc.) from a word.

Stop words are words that are not relevant to the context of the data, for example is/am/are.

Object standardization is also a good way to pre-process the text.

15) Suppose you want to project high dimensional data into lower dimensions. The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let's say you have applied both algorithms respectively on data "X" and you got the datasets "X_projected_PCA" and "X_projected_tSNE".

Which of the following statements is true for "X_projected_PCA" and "X_projected_tSNE"?

A) X_projected_PCA will have interpretation in the nearest neighbour space.

B) X_projected_tSNE will have interpretation in the nearest neighbour space.

C) Both will have interpretation in the nearest neighbour space.

D) None of them will have interpretation in the nearest neighbour space.

Solution: (B)

The t-SNE algorithm considers nearest neighbour points when reducing the dimensionality of the data. So, after using t-SNE, we can expect the reduced dimensions to also have an interpretation in nearest neighbour space. That is not the case for PCA.

Context: 16-17

Given below are three scatter plots for two features (Images 1, 2 and 3 from left to right).

16) In the above images, which of the following is/are example of multi-collinear features?

A) Features in Image 1

B) Features in Image 2

C) Features in Image 3

D) Features in Images 1 and 2

E) Features in Images 2 and 3

F) Features in Images 3 and 1

Solution: (D)

In Image 1 the features have a high positive correlation, whereas in Image 2 they have a high negative correlation, so in both images the pair of features is an example of multicollinear features.

17) In previous question, suppose you have identified multi-collinear features. Which of the following action(s) would you perform next?

  1. Remove both collinear variables.
  2. Instead of removing both variables, we can remove only one variable.
  3. Removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression.

A) Only 1

B) Only 2

C) Only 3

D) Either 1 or 3

E) Either 2 or 3

Solution: (E)

You cannot remove both features, because doing so would lose all of the information. You should either remove only one feature or use a regularization method such as L1 or L2.

18) Adding a non-important feature to a linear regression model may result in.

  1. Increase in R-square
  2. Decrease in R-square

A) Only 1 is correct

B) Only 2 is correct

C) Either 1 or 2

D) None of these

Solution: (A)

After adding a feature to the feature space, whether it is important or unimportant, the R-squared on the training data never decreases, and in practice it almost always increases.

19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 and C3 respectively.

Now, you have added 2 to all values of X (i.e. new values become X+2), subtracted 2 from all values of Y (i.e. new values are Y-2), and Z remains the same. The new coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 and D3 respectively. How do the values of D1, D2 and D3 relate to C1, C2 and C3?

A) D1 = C1, D2 < C2, D3 > C3

B) D1 = C1, D2 > C2, D3 > C3

C) D1 = C1, D2 > C2, D3 < C3

D) D1 = C1, D2 < C2, D3 < C3

E) D1 = C1, D2 = C2, D3 = C3

F) Cannot be determined

Solution: (E)

Correlation between the features won't change if you add a constant to, or subtract a constant from, the features.
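This shift invariance is easy to verify numerically. The sketch below uses arbitrary made-up values for X, Y and Z and applies the same shifts as in the question (+2 to X, -2 to Y):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical data for the three variables.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 9.0]
z = [5.0, 3.0, 8.0, 1.0, 4.0]

x_new = [v + 2 for v in x]
y_new = [v - 2 for v in y]

c1, c2, c3 = pearson(x, y), pearson(y, z), pearson(x, z)
d1, d2, d3 = pearson(x_new, y_new), pearson(y_new, z), pearson(x_new, z)
print((c1, c2, c3))
print((d1, d2, d3))
```

Subtracting the mean inside the formula cancels any constant shift, so each D equals the corresponding C.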

20) Imagine, you are solving a classification problems with highly imbalanced class. The majority class is observed 99% of times in the training data.

Your model has 99% accuracy after taking the predictions on test data. Which of the following is true in such a case?

  1. Accuracy metric is not a good idea for imbalanced class problems.
  2. Accuracy metric is a good idea for imbalanced class problems.
  3. Precision and recall metrics are good for imbalanced class problems.
  4. Precision and recall metrics aren't good for imbalanced class problems.

A) 1 and 3

B) 1 and 4

C) 2 and 3

D) 2 and 4

Solution: (A)

Refer to question number 4 in this article.

21) In ensemble learning, you aggregate the predictions for weak learners, so that an ensemble of these models will give a better prediction than prediction of individual models.

Which of the following statements is / are true for weak learners used in ensemble model?

  1. They don't usually overfit.
  2. They have high bias, so they cannot solve complex learning problems
  3. They usually overfit.

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) Only 1

E) Only 2

F) None of the above

Solution: (A)

Weak learners are sure about a particular part of a problem. So, they usually don't overfit, which means that weak learners have low variance and high bias.

22) Which of the following options is/are true for K-fold cross-validation?

  1. Increase in K will result in higher time required to cross validate the result.
  2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K.
  3. If K=N, then it is called Leave one out cross validation, where N is the number of observations.

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1,2 and 3

Solution: (D)

Larger k value means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: Leave-One-Out CV). We also need to consider the variance between the k folds accuracy while selecting the k.

Question Context 23-24

Cross-validation is an important step in machine learning for hyperparameter tuning. Let's say you are tuning a hyperparameter "max_depth" for GBM by selecting it from 10 different depth values (values are greater than 2) for a tree based model using 5-fold cross validation.

Time taken by an algorithm for training (on a model with max_depth 2) on 4 folds is 10 seconds, and for the prediction on the remaining 1 fold is 2 seconds.

Note: Ignore hardware dependencies from the equation.

23) Which of the following options is true for the overall execution time of 5-fold cross validation with 10 different values of "max_depth"?

A) Less than 100 seconds

B) 100 – 300 seconds

C) 300 – 600 seconds

D) More than or equal to 600 seconds

E) None of the above

F) Can't estimate

Solution: (D)

Each iteration for depth "2" in 5-fold cross validation will take 10 seconds for training and 2 seconds for testing. So, 5 folds will take 12*5 = 60 seconds. Since we are searching over 10 depth values, the algorithm would take 60*10 = 600 seconds. But training and testing a model at a depth greater than 2 takes more time than at depth "2", so the overall time would be greater than 600 seconds.
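The arithmetic in this explanation can be written as a small helper. It computes only a lower bound, under the question's assumption that every candidate costs at least as much as a depth-2 model; the function name is made up for illustration.

```python
def cv_lower_bound_seconds(train_s, predict_s, n_folds, n_candidates):
    """Lower bound on total cross-validation time for a hyperparameter search.

    Assumes each fold costs at least train_s + predict_s seconds for every
    candidate value, as stated for the depth-2 baseline in the question.
    """
    return (train_s + predict_s) * n_folds * n_candidates


lower_bound = cv_lower_bound_seconds(train_s=10, predict_s=2, n_folds=5, n_candidates=10)
print(lower_bound)
```

Because all candidate depths are greater than 2, the real running time exceeds this bound.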

24) In the previous question, suppose you train the same algorithm while tuning 2 hyperparameters, say "max_depth" and "learning_rate".

You want to select the right value for "max_depth" (from the given 10 depth values) and learning rate (from the given 5 different learning rates). In such cases, which of the following will represent the overall time?

A) 1000-1500 second

B) 1500-3000 Second

C) More than or equal to 3000 Second

D) None of these

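Solution: (C)

By the same reasoning as question number 23: each of the 10*5 = 50 combinations takes at least 60 seconds, so the overall time is at least 50*60 = 3000 seconds.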

25) Given below is a scenario for training error TE and Validation error VE for a machine learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

Which value of H will you choose based on the above table?

A) 1

B) 2

C) 3

D) 4

E) 5

Solution: (D)

Looking at the table, option D seems the best

26) What would you do in PCA to get the same projection as SVD?

A) Transform data to zero mean

B) Transform data to zero median

C) Not possible

D) None of these

Solution: (A)

When the data has a zero mean vector PCA will have same projections as SVD, otherwise you have to centre the data first before taking SVD.

Question Context 27-28

Assume there is a black box algorithm, which takes training data with multiple observations (t1, t2, t3,…….. tn) and a new observation (q1). The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.

You can also think that this black box algorithm is same as 1-NN (1-nearest neighbor).

27) It is possible to construct a k-NN classification algorithm based on this black box alone.

Note: n (the number of training observations) is very large compared to k.

A) TRUE

B) FALSE

Solution: (A)

In first step, you pass an observation (q1) in the black box algorithm so this algorithm would return a nearest observation and its class.

In the second step, you throw the nearest observation out of the train data and input the observation (q1) again. The black box algorithm will again return the nearest observation and its class.

You need to repeat this procedure k times
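The procedure above can be sketched in a few lines. The black box here is a stand-in written for illustration (1-D numeric points with absolute distance, and made-up training data); only the repeated query-and-remove loop reflects the construction described in the answer.

```python
from collections import Counter


def one_nn_black_box(train, query):
    """Stand-in black box: return the (point, label) pair nearest to query."""
    return min(train, key=lambda item: abs(item[0] - query))


def knn_from_black_box(train, query, k):
    """Build k-NN by calling the 1-NN black box k times, removing each hit."""
    remaining = list(train)
    votes = []
    for _ in range(k):
        nearest = one_nn_black_box(remaining, query)
        votes.append(nearest[1])
        remaining.remove(nearest)  # so the next call returns the next-nearest point
    return Counter(votes).most_common(1)[0][0]


# Hypothetical 1-D training set of (value, class label) pairs.
train = [(1.0, "a"), (1.5, "a"), (2.1, "b"), (5.0, "b"), (6.0, "b")]

pred_3nn = knn_from_black_box(train, 2.0, 3)
pred_1nn = knn_from_black_box(train, 2.0, 1)
print(pred_3nn, pred_1nn)
```

With k = 1 the nearest point (2.1, "b") decides alone; with k = 3 the two "a" neighbours outvote it.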

28) Instead of using the 1-NN black box, we want to use a j-NN (j > 1) algorithm as the black box. Which of the following options is correct for finding k-NN using j-NN?

  1. J must be a proper factor of k
  2. J > k
  3. Not possible

A) 1

B) 2

C) 3

Solution: (A)

Same as question number 27

29) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare Pearson correlation coefficients between variables of each scatterplot.

Which of the following is in the right order?

  1. 1<2<3<4
  2. 1>2>3>4
  3. 7<6<5<4
  4. 7>6>5>4

A) 1 and 3

B) 2 and 3

C) 1 and 4

D) 2 and 4

Solution: (B)

From image 1 to 4 the correlation is decreasing (in absolute value). From image 4 to 7 the absolute value of the correlation is increasing, but the values are negative (for example, 0, -0.3, -0.7, -0.99).

30) You can evaluate the performance of a binary class classification problem using different metrics such as accuracy, log-loss, F-Score. Let's say you are using the log-loss function as the evaluation metric.

Which of the following option is / are true for interpretation of log-loss as an evaluation metric?

  1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
  2. For a particular observation, the classifier assigns a very small probability for the correct class then the corresponding contribution to the log-loss will be very large.
  3. Lower the log-loss, the better is the model.

A) 1 and 3

B) 2 and 3

C) 1 and 2

D) 1,2 and 3

Solution: (D)

Options are self-explanatory.
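For readers who want to see the three statements numerically, here is a minimal binary log-loss sketch. The clipping constant `eps` is an implementation choice made here to avoid log(0), and the probabilities are made up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# The true class is 1 in both cases; only the predicted probability differs.
confident_right = log_loss([1], [0.99])
confident_wrong = log_loss([1], [0.01])
print(confident_right, confident_wrong)
```

The confident wrong prediction costs several hundred times more than the confident right one, and neither value is negative.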

Question 31-32

Below are five samples given in the dataset.

Note: Visual distance between the points in the image represents the actual distance.

31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest neighbor)?

A) 0

B) 0.4

C) 0.8

D) 1

Solution: (C)

In Leave-One-Out cross validation, we select (n-1) observations for training and 1 observation for validation. Consider each point in turn as the validation point and find the 3 nearest points to it. If you repeat this procedure for all points, you will get the correct classification for all positive-class points in the above figure, but the negative class will be misclassified. Hence you will get 80% accuracy.

32) Which of the following value of K will have least leave-one-out cross validation accuracy?

A) 1NN

B) 3NN

C) 4NN

D) All have same leave one out error

Solution: (A)

Each point will always be misclassified in 1-NN, which means that you will get 0% accuracy.

33) Suppose you are given the below data and you want to apply a logistic regression model for classifying it in two given classes.

You are using logistic regression with L1 regularization.

Where C is the regularization parameter and w1 and w2 are the coefficients of x1 and x2.

Which of the following option is correct when you increase the value of C from zero to a very large value?

A) First w2 becomes zero and then w1 becomes zero

B) First w1 becomes zero and then w2 becomes zero

C) Both become zero at the same time

D) Both cannot be zero even after a very large value of C

Solution: (B)

By looking at the image, we see that even using just x2, we can perform the classification efficiently. So at first w1 will become 0. As the regularization parameter increases further, w2 will come closer and closer to 0.

34) Suppose we have a dataset which can be trained with 100% accuracy with help of a decision tree of depth 6. Now consider the points below and choose the option based on these points.

Note: All other hyper parameters are same and other factors are not affected.

  1. Depth 4 will have high bias and low variance
  2. Depth 4 will have low bias and low variance

A) Only 1

B) Only 2

C) Both 1 and 2

D) None of the above

Solution: (A)

If you fit a decision tree of depth 4 on such data, it is more likely to underfit the data. In the case of underfitting you will have high bias and low variance.

35) Which of the following options can be used to get global minima in k-Means Algorithm?

  1. Try to run algorithm for different centroid initialization
  2. Adjust number of iterations
  3. Find out the optimal number of clusters

A) 2 and 3

B) 1 and 3

C) 1 and 2

D) All of above

Solution: (D)

All of these options can be tuned in the search for the global minimum.

36) Imagine you are working on a project which is a binary classification problem. You trained a model on training dataset and get the below confusion matrix on validation dataset.

Based on the above confusion matrix, choose which option(s) below will give you correct predictions?

  1. Accuracy is ~0.91
  2. Misclassification rate is ~ 0.91
  3. False positive rate is ~0.95
  4. True positive rate is ~0.95

A) 1 and 3

B) 2 and 4

C) 1 and 4

D) 2 and 3

Solution: (C)

The Accuracy (correct classification) is (50+100)/165 which is nearly equal to 0.91.

The true positive rate is the fraction of actual positives that are predicted as positive, so it is 100/105 ≈ 0.95. It is also known as "sensitivity" or "recall".
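The arithmetic in the solution to question 36 can be checked with a few lines of code. The confusion matrix image is not available here, so the counts below are a reconstruction: TN = 50, TP = 100 and the totals 165 and 105 come from the solution text, while FN = 5 and FP = 10 are derived from those totals and should be read as assumptions. The function name `rates` is hypothetical.

```python
def rates(tn, fp, fn, tp):
    """Basic rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),
    }


# FN and FP are inferred from the totals quoted in the solution (assumption).
metrics = rates(tn=50, fp=10, fn=5, tp=100)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, statements 1 and 4 hold, while the misclassification rate and the false positive rate come out far below the values claimed in statements 2 and 3.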

37) For which of the following hyperparameters, higher value is better for decision tree algorithm?

  1. Number of samples used for split
  2. Depth of tree
  3. Samples for leaf

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

E) Can't say

Solution: (E)

For none of the three hyperparameters is it guaranteed that increasing the value improves performance. For example, if the depth of the tree is very high, the resulting tree may overfit the data and would not generalize well. On the other hand, if it is very low, the tree may underfit the data. So we can't say for sure that "higher is better".

Context 38-39

Imagine you have a 28 * 28 image and you run a convolutional layer with 3 * 3 filters on it, with an input depth of 3 and an output depth of 8.

Note: Stride is 1 and you are using same padding.

38) What are the dimensions of the output feature map when you are using the given parameters?

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (A)

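The general formula for calculating the output size is

output size = (N - F + 2P)/S + 1

where N is the input size, F is the filter size, P is the padding added on each side and S is the stride. With same padding (P = 1 for a 3 * 3 filter) and a stride of 1, the output keeps the 28 * 28 spatial size, and its depth equals the output depth of 8.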

39) What are the dimensions of the output feature map when you are using the following parameters?

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (B)

Same as above
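The output sizes in questions 38 and 39 can be reproduced with the padded form of the output-size formula, (N - F + 2P)/S + 1 with the division rounded down. This is a small sketch under stated assumptions: the function name `conv_output_size` is hypothetical, and because the parameters for question 39 are missing from the text, stride 2 with no padding is only a guess that happens to reproduce answer (B).

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n - f + 2 * padding) // stride + 1


# Question 38: stride 1, same padding (one pixel per side for a 3 * 3 filter).
same_padding = conv_output_size(28, 3, stride=1, padding=1)

# Question 39: assumed stride 2 and no padding (not stated in the text).
assumed_q39 = conv_output_size(28, 3, stride=2, padding=0)

print(same_padding, assumed_q39)
```

The depth of the output does not depend on this formula; it equals the number of filters, which is 8 in both questions.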

40) Suppose we were plotting the visualization for different values of C (penalty parameter) in the SVM algorithm. For some reason, we forgot to tag the C values with the visualizations. In that case, which of the following options best explains the C values for the images below (1, 2, 3 from left to right, so the C values are C1 for image 1, C2 for image 2 and C3 for image 3) in the case of an rbf kernel?

A) C1 = C2 = C3

B) C1 > C2 > C3

C) C1 < C2 < C3

D) None of these

Solution: (C)

C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. For large values of C, the optimization will choose a smaller-margin hyperplane.




  1. The interview process
  2. Traits of strong candidates
  3. Interview techniques
  4. A selection of algorithm interview questions














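In a good résumé, each project description should state the project's technical difficulties precisely and include accurate evaluation figures. For example: "Using TensorFlow, I reached 96% accuracy on the public MNIST dataset." This reads as professional and makes the project more credible.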









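Learning, growth, and hands-on experience of an NLP algorithm engineer

The path to becoming an advanced machine learning engineer | Zhihu Live notes


We talked with the tech leads of 4 big data companies about the opportunities and choices open to algorithm and data mining engineers

Judging from tech leads' hiring needs, how do you move into the big data roles that are currently in short supply?


"Responsibilities and preparation for machine learning and big data roles (repost)": Responsibilities and interview questions for machine learning roles (repost) - Zhihu column

"A summary of the pros and cons of classic machine learning algorithms" - Zhihu column

For an introduction to recommender systems, see the column article: What is a recommender system (personalized content distribution)? - Zhihu column

For an introduction to user profiling, see the column article: Knowing you better than you do: a brief look at user profiles - Zhihu column


"Recommendation algorithms in a nutshell": Zhihu Live

"Recommendation algorithms in a nutshell (2): the details": Zhihu Live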






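19. Non-recursive pre-order traversal of a binary tree; copying one string to another (besides overlapping string addresses, also check whether the destination has enough space, and consider every abnormal case)

20. A probability problem: given 6 LED tubes, find the cases where the whole display is still a valid input after being rotated 180° (just enumerate all cases)
21. Given a scenario, show how well you understand machine learning algorithms and their common use cases (keep your thinking broad; I got stuck on one particular approach)

22. Given an array, if two of its numbers sum to a third, find the largest third number that satisfies this condition (written as x + y = c)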
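One possible approach to question 22 is to sort the array, try each element as the candidate c from largest to smallest, and search for x + y = c with two pointers. The sketch below is only one way to do it, in O(n^2) time; the function name `largest_sum_element` is hypothetical, and it assumes x, y and c must be three distinct positions in the array.

```python
def largest_sum_element(nums):
    """Return the largest c with c = x + y for two other elements, else None."""
    ordered = sorted(nums)
    for k in range(len(ordered) - 1, -1, -1):  # candidates from largest down
        target = ordered[k]
        lo, hi = 0, len(ordered) - 1
        while lo < hi:
            if lo == k:          # skip the candidate's own position
                lo += 1
                continue
            if hi == k:
                hi -= 1
                continue
            pair_sum = ordered[lo] + ordered[hi]
            if pair_sum == target:
                return target
            if pair_sum < target:
                lo += 1
            else:
                hi -= 1
    return None


print(largest_sum_element([1, 2, 3, 5, 9]))
```

Scanning candidates in descending order means the first match found is already the largest valid c.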






30. Derive the SVM kernel transformation using Lagrange multipliers







48. One 2 GB file contains userid,username and another 3 GB file contains username,userpassword.
Requirement: output userid,userpassword using an 8-core CPU and 2 GB of memory





53. B+ trees, B-trees and red-black trees; you are asked to write out a sorting algorithm


















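71. Typical applications and limitations of deep CNNs, deep RNNs and RBMs; go and read Hinton's lecture notes and papers

72. What clustering methods are there?

73. Determine whether a linked list contains a cycle. Answer: traverse the list with two pointers, one fast and one slow.

74. What is regularization about (L1 and L2)?


76. How could a school cafeteria apply data mining?

77. Which models overfit easily, and how do you choose a model?

78. What is fuzzy clustering? Also partitional clustering, hierarchical clustering, and so on

79. Longest increasing subsequence; finding the common median of two sorted arrays of equal size

80. Parallel computing, compression algorithms


82. The difference between naive Bayes and logistic regression

89. The core components of statistical learning are model, strategy and algorithm. You should understand logistic regression, SVM, decision trees, KNN and the various clustering methods deeply, and be able to write from memory the pseudocode for the core recursive step of each algorithm, the objective function it optimizes, and its dual form.

90. Gradient descent, Newton's method, and various stochastic search algorithms (genetic algorithms, ant colony optimization, and so on)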
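The fast and slow pointer answer given for question 73 above can be sketched in a few lines. This is a minimal illustration rather than a reference solution; the `Node` class and the function name `has_cycle` are assumptions introduced for the example.

```python
class Node:
    """A singly linked list node (hypothetical helper for this sketch)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def has_cycle(head):
    """Return True if following .next from head ever loops back."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next          # advances one node per step
        fast = fast.next.next     # advances two nodes per step
        if slow is fast:          # the pointers can only meet inside a cycle
            return True
    return False                  # fast reached the end, so there is no cycle


# Build a -> b -> c -> a, which contains a cycle.
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next, c.next = b, c, a
print(has_cycle(a))
```

The check uses constant extra memory, which is the usual reason for preferring it over recording visited nodes in a set.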




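2. The core idea of naive Bayes is to use prior probabilities to obtain posterior probabilities. Minimizing the expected risk leads to maximizing the posterior probability, so the output is the class that maximizes the posterior (the conditional and prior probabilities are obtained by Bayesian estimation, that is, maximum likelihood estimation with Laplace smoothing added). The features are assumed to be conditionally independent of one another given the class.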
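The description above can be made concrete with a small categorical naive Bayes classifier that uses Laplace smoothing and picks the class with the largest posterior. This is a sketch under stated assumptions: the function names `train_naive_bayes` and `predict`, the toy data, and the smoothing constant `alpha=1.0` are all chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels, alpha=1.0):
    """Count class and per-feature value frequencies for categorical data."""
    class_counts = Counter(labels)
    n_features = len(samples[0])
    # Distinct values observed for each feature position.
    feature_values = [{x[j] for x in samples} for j in range(n_features)]
    # counts[(label, j)][value] = number of samples of that class with that value.
    counts = defaultdict(Counter)
    for x, y in zip(samples, labels):
        for j, value in enumerate(x):
            counts[(y, j)][value] += 1
    return class_counts, feature_values, counts, len(labels), alpha


def predict(model, x):
    """Return the class that maximizes the smoothed log posterior."""
    class_counts, feature_values, counts, n, alpha = model
    n_classes = len(class_counts)
    best_label, best_score = None, -math.inf
    for label, class_count in class_counts.items():
        # Smoothed prior: (count + alpha) / (N + K * alpha).
        score = math.log((class_count + alpha) / (n + alpha * n_classes))
        for j, value in enumerate(x):
            n_values = len(feature_values[j])
            # Smoothed conditional: (count + alpha) / (class count + S_j * alpha).
            numerator = counts[(label, j)][value] + alpha
            score += math.log(numerator / (class_count + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(samples, labels)
print(predict(model, ("sunny", "hot")))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when there are many features.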




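5. Bit manipulation


7. Hands-on; the algorithm proceeds as follows:


1. If an element array2(j) equal to array1(i) is found in array 2, swap array2(j) with array2(i), increment i, and move on to the next iteration.

2. If array 2 is scanned to the end without finding an element equal to array1(i), swap array1(i) with array1(k), the last element not yet compared, leave i unchanged, and move on to the next iteration.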



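10. Hands-on: the original singly linked list is Header-node 1-node 2-node 3-node 4-NULL. To reverse it, first point the header's next field at node 2, then point node 1's next field at node 3, and finally point node 2's next field at node 1. That completes the first swap, and the order becomes Header-node 2-node 1-node 3-node 4-NULL. Then apply the same swap to move node 3 in front of node 2, and again to move node 4 in front of node 3, and the reversal is complete. In short, each step takes the node that follows the original first node and places it directly after the header.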
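The header-based reversal described in item 10 can be sketched as follows. This is a minimal sketch, not the code from the original answer: the `Node` class and the function name `reverse_with_header` are assumptions, and the header is modelled as a dummy node whose value is ignored.

```python
class Node:
    """A singly linked list node (hypothetical helper for this sketch)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def reverse_with_header(header):
    """Reverse, in place, the list that hangs off a dummy header node."""
    first = header.next              # original first node; ends up as the tail
    if first is None:
        return header
    while first.next is not None:
        moved = first.next           # the node right after the original first node
        first.next = moved.next      # unlink it from its current position
        moved.next = header.next     # it now points at the current front node
        header.next = moved          # and the header points at it
    return header


# Header -> 1 -> 2 -> 3 -> 4 -> None
header = Node(None, Node(1, Node(2, Node(3, Node(4)))))
reverse_with_header(header)

values = []
node = header.next
while node is not None:
    values.append(node.value)
    node = node.next
print(values)
```

Each loop iteration performs one of the swaps described in the text, so a list of n nodes needs n - 1 iterations and no extra storage.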

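15. Massive data problems boil down to the following approaches (see the link for details):


Divide and conquer / hash mapping + hash counting + heap, quick or merge sort;


Bloom filter / Bitmap;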
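The first pattern in item 15 (hash mapping, then hash counting, then a heap) can be sketched for a "top k most frequent items" question. This is an in-memory illustration under stated assumptions: in a real massive-data setting each partition would be a separate file on disk processed one at a time, and the function name `top_k_frequent` and the partition count are hypothetical.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k, n_partitions=4):
    """Return the k most frequent items as (count, item) pairs, largest first."""
    # 1. Hash mapping: equal items always land in the same partition,
    #    so each item's full count lives in exactly one partition.
    partitions = [[] for _ in range(n_partitions)]
    for item in items:
        partitions[hash(item) % n_partitions].append(item)

    # 2. Hash counting per partition, 3. a min-heap bounded to k entries.
    heap = []
    for part in partitions:
        for item, count in Counter(part).items():
            if len(heap) < k:
                heapq.heappush(heap, (count, item))
            elif (count, item) > heap[0]:
                heapq.heapreplace(heap, (count, item))
    return sorted(heap, reverse=True)


items = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"] * 4
print(top_k_frequent(items, 2))
```

Only one partition's counts and the k-entry heap need to be held in memory at a time, which is what makes the pattern fit a tight memory budget.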




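16. Hands-on: treat the original array as ab, which must be turned into ba. First reverse the sub-array a on its own to get a'b (a' denotes a reversed). Likewise reverse b on its own to get a'b'. Finally reverse the whole of a'b' to get (a'b')', which is the desired result ba.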
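The three-reversal trick in item 16 can be sketched as an in-place left rotation. This is a minimal sketch; the function names `reverse_range` and `rotate_left` are assumptions, and `k` stands for the length of the prefix a.

```python
def reverse_range(arr, lo, hi):
    """Reverse arr[lo..hi] in place (inclusive bounds)."""
    while lo < hi:
        arr[lo], arr[hi] = arr[hi], arr[lo]
        lo += 1
        hi -= 1


def rotate_left(arr, k):
    """Turn ab into ba in place, where a is the first k elements."""
    n = len(arr)
    if n == 0:
        return arr
    k %= n
    reverse_range(arr, 0, k - 1)   # a  -> a'
    reverse_range(arr, k, n - 1)   # b  -> b'
    reverse_range(arr, 0, n - 1)   # (a'b')' = ba
    return arr


print(rotate_left([1, 2, 3, 4, 5], 2))
```

The rotation runs in linear time and needs no auxiliary array, which is usually the point of the interview question.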


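19. Hands-on: http://ocaicai.iteye.com/blog/1047397


23. In classification the classes are defined in advance and their number is fixed. Clustering algorithms include k-means, k-medoids, CLARANS, BIRCH, CLIQUE, DBSCAN and others.

24. Hands-on: http://m.blog.csdn.net/blog/wkupaochuan/8912622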






27. http://www.docin.com/p-19876385.html

28. http://blog.csdn.net/kerryfish/article/details/24043099


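31. Binary search trees (binary sort trees), balanced binary trees (AVL trees), red-black trees, B-trees, B+ trees, tries, suffix trees and generalized suffix trees. See the link below for details.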


32. http://baike.baidu.com/link?url=ECsYE4xe1gguMd3R5X4x5V7eQX54NkFp0PJ0FYbAvgJIFPDiaCdD_PuftDAYZTuzH0EuIobF1vDa2Vx2rj6Dda

33. Hands-on. 34. See 16; hands-on. 35. See 28.

36. http://www.douban.com/note/275544555/


38. http://zhidao.baidu.com/link?url=DAnewo2j-Jz2u3WwyhFb4kYpnI3QZzfBqsQXdzVG9R061hcBTCUu01WXtoX5T89SmiiMJ_eWIkXOAAe1lhDFM0S7OPjnL_zTEX3Mm1ARc-a

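39. Space complexity: quicksort needs O(log n) stack space on average (O(n) in the worst case), while merge sort needs O(n) auxiliary space.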

40. http://blog.csdn.net/guyulongcs/article/details/7520467

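41. args holds Java's command-line arguments. When we run a Java program from the command line, we use "java ClassName arguments", and the args array receives those arguments. You can also call another class's main method directly from inside a main method: because main is a static method, it can be invoked as ClassName.main, passing a String[] array at the call site (an empty array if no arguments are needed).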

42. http://www.cnblogs.com/goagent/archive/2013/05/16/3068442.html

43. http://www.cnblogs.com/BeyondAnyTime/archive/2012/06/06/2538764.html

44. http://www.360doc.com/content/13/0409/14/10384031_277138819.shtml

45. http://blog.csdn.net/wwj_748/article/details/8919838


48. http://freewxy.iteye.com/blog/737576

49. http://book.51cto.com/art/201205/338050.htm


50. http://blog.csdn.net/zcsylj/article/details/6532787


51, 52, 53. Hands-on.

54. http://m.blog.csdn.net/blog/yangmm2048/44924997


56. http://driftcloudy.iteye.com/blog/782873








66. http://blog.csdn.net/zouxy09/article/details/20319673

67. http://www.tuicool.com/articles/q6zYrq

68. http://www.cnki.com.cn/Article/CJFDTotal-DNZS200832067.htm




72. k-means, k-medoids, CLARANS, BIRCH, CLIQUE, DBSCAN and others

73. Hands-on: http://m.blog.csdn.net/blog/lavor_zl/42784247

78. http://blog.csdn.net/xiahouzuoxin/article/details/7748823


79. http://blog.csdn.net/zcsylj/article/details/6802062


80. http://www.doc88.com/p-1621945750499.html


82. http://m.blog.csdn.net/blog/muye5/19409615


84. http://bbs.pinggu.org/thread-3182029-1-1.html

85. http://www.doc88.com/p-3961053026557.html

86. http://www.docin.com/p-1204742211.html





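The algorithms covered included SVM and EM, and both had to be derived and proved.
Because I had worked on text topic modeling, I was asked a lot about LDA.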


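1. Know the basic models
You must understand k-means. For KNN, know the principle; there is no need to code it. Read up on SVM, MR and PageRank, and reach for these methods when discussing a problem (this is not a matter of faith; every interviewer understands them).

2. High-level models are a matter of faith, so don't waste time on them
Take LDA. I used to believe in Bayes and thought it was impressive. Only after using it did I learn how deep the pit is: it is fine for publishing papers on carefully picked data, but in real industry use the results are very unstable. HMM can help optimize sentence analysis, but elsewhere its benefit is not significant either. Most of the time, building a WordNet and computing TF-IDF is enough. If it is not, add some hardcoded rules of your own; the results beat the so-called models, and performance is better too.

3. Character, confidence, and an interview strategy of turning the tables
My most successful trick is to watch for the interviewer's quiet moments and use them. For example, when they go silent reading your résumé, offer to highlight something, and if you can, show off a project that is live in production. After finishing your code, volunteer ideas for improving it and lead the interviewer along, so that you are not dragged into the pit they have dug.

4. Data sensitivity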

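1. 2010, Beijing, freshly graduated. My applications to PhD programs abroad failed. A friend pointed me to an ML lab at T University to gain experience. I started reading two weeks in advance and felt that the only thing I really understood was Bayes' formula (the rest I knew only as familiar-looking acronyms). With the appointment made, I had no choice but to brace myself and go.

2. A year and a half later, doing a master's in the US. Money was very tight, so I started looking for internships before the summer. It was really hard, because those hiring for ML or Data Scientist roles took only full-time staff or PhD students referred by their advisors. Nobody responded to my résumé, and people who knew iOS or Android seemed more in demand. I finally found a start-up. In the interview I proactively showed my earlier projects, they were quite satisfied, and they asked me to write PageRank, which I finished in 30 minutes (looking up the formula on the wiki was allowed). The boss did not understand or trust LDA, so everything was done with SVM.

3. A classmate got an eBay phone interview and dragged me along to pass notes. They asked about probability basics and linear regression, and had him code string processing. All fundamentals.

4. Later I interviewed at Yelp. In the second phone round I wrote the code too fast, which left the interviewer time to ask a second follow-up question about vote statistics. I proposed two normalization schemes, neither of them quite complete, and they turned me down.

5. FB never responded, probably because my ID had been banned after I twice exceeded the limit while pulling data from Graph (their own documentation was not clear about it). Google asked only basic coding questions that had little to do with ML. Because I interviewed for Test, they had me produce test data myself after coding; I produced too much, was looked down on, and got rejected.

6. Interviewed at Bing. A Super Day with 6 interviewers: the earlier ones asked about algorithms, and the later ones asked about query analysis over wonton noodles. No offer in the end, because taking on 25,000 employees from N had used up the hiring quota. (Reportedly they even filed a request with the EVP, which was not approved. Oh well.)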





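1. Coding and algorithms: basic algorithms (quicksort and the like, which you need to master) + 劍指Offer (interviews often feature similar problems) + LeetCode (a supplement to 劍指Offer that builds hands-on ability)
2. Machine learning: Li Hang's 《統計學習方法》 (reading it three times is not too many!) + Coursera Stanford "Machine Learning" (covers the basics well but does not explain the why) + Coursera National Taiwan University 《機器學習高級技法》 (derives SVM, ensembles and other models in detail, with their pros and cons)
3. Recall the projects you have done in detail: which algorithms were used, why they were chosen, what their strengths and weaknesses are, and so on. If you have no project experience, you can enter the Tmall big data competition and Kaggle competitions.
4. The article "How to quickly crack 99% of massive-data processing interview questions" (almost every interview includes one massive-data question)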


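1 Decision tree models matter a lot. You won't be asked about simple ones like ID3, or even CART, but you will be asked about GBDT and random forests.

2 SVM is certain to come up.

3 You need a deep understanding to explain things clearly to the interviewer.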


Machine learning interview preparation (continuously updated): links to excellent blog posts and a collection of good resources (CSDN blog)



The link is here: a quora.com page

First, here is my list of all skills I might want to see for this position:


  • CS coursework
  • Stats and linear algebra
  • Some ML coursework, covering at least
    • regression
    • classification
    • clustering
    • recommendation
    • graphical models

Data Collection Tools

  • Hadoop-based tools like Flume / Sqoop
  • Text munging languages like Python, or maybe Perl
  • Basic SQL

Data Modeling Tools

  • A library like scipy / numpy or Weka
  • A tool like R (or commercial equivalents like SAS, SPSS)

Model Serving Tools

  • (Ideally) some familiarity with PMML
  • Basic knowledge of a NoSQL store
  • Systems language skill, like Java

Business Smarts

  • Communication skills
  • Some facility with a visualization tool, even if gnuplot or Excel
  • Domain knowledge relevant to my business

You certainly don't need all of that. In fact, for an internship, you can't be expected to have most of it. I assume you are in school, so I would expect you to have much of the academic background, and would like to see that you have some of the tool skills. I would not expect business skills, but believe me, communication skills are a big differentiator.

So what to focus on? First, academics. If I were interviewing you I would probably ask about this as a filter. If you're not able to explain the very basics, like what linear regression does, that means there's a big lack of either knowledge or communication skills. So I would feel comfortable with the very basics. I'd ask you to explain one moderately advanced algorithm and why it works, of your choice. Same reasoning, if you can't pick something out of everything you know to explain reasonably, probably not going to proceed.

Unfortunately I do think a lot of interviews focus too much on the math and algorithms like it was an exam. I would not want to work at places that think that's the important thing. I personally would want to see that you're smart and communicate well and know the basics. Chances are that whatever math is relevant to my business is something you'll need to learn (more) anyway.

I know you're asking about tools though. The tools that are relevant really depend on the kind of place you're applying. A classic research department is going to focus mostly on modeling tools. Since you can't get SAS / SPSS easily, focus on R and Weka as a skill.

At the other end of the spectrum, say, a small startup, the requirement is broader and shallower. They won't need you to know R. They will need you to quickly understand a business problem and put together a production-ready system to solve it. So it's much more about data collection, munging, a little modeling, and then integration. For that I would make sure you know how to get data out of a DB or log files, into a modeling tool, and then how to transform a model into some code someone could put in a web server. So: basic SQL, Python or Java, and whatever DB / web serving tools the company uses.

Kaggle is great practice although it will not "test" your data collection skills or the serving side of things. But it will challenge you to understand a business problem, munge real data and model it. I would look favorably on an intern who had taken the time to solve a Kaggle problem and done reasonably well.



