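Machine Learning with Python scikit-learn, Chinese Documentation (6): Supervised learning: predicting an output variable from high-dimensional observations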

01-19

For Chinese documentation that closely matches the official documentation, see: Python machine learning package - scikit-learn 0.20.2 documentation

scikit-learn v0.20.1

This Chinese documentation was translated by Antares of the artificial intelligence community!

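Supervised learning: predicting an output variable from high-dimensional observations

Nearest neighbor and the curse of dimensionality

k-Nearest neighbors classifier

The curse of dimensionality

Linear model: from regression to sparsity

Linear regression

Shrinkage

Sparsity

Classification

Support vector machines (SVMs)

Linear SVMs

Using kernels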

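Supervised learning: predicting an output variable from high-dimensional observations

The problem solved in supervised learning

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called "target" or "labels". Most often, y is a 1-D array of length n_samples.

All supervised estimators in scikit-learn implement two methods: fit(X, y), which fits the model on the training data, and predict(X), which, given unlabeled observations X, returns the predicted labels y.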
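As a minimal sketch of this shared fit/predict interface, the snippet below trains a nearest-neighbor classifier on a tiny toy dataset. The arrays and the choice of n_neighbors=1 are made up purely for illustration and are not part of the tutorial data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data invented for illustration: 4 samples, 2 features
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])  # 1-D array of length n_samples

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)  # learn from labeled observations

X_new = np.array([[0.05, 0.1], [1.0, 0.9]])
y_pred = clf.predict(X_new)  # predict labels for unlabeled observations
print(y_pred)
```

Any other supervised estimator could be substituted for KNeighborsClassifier here without changing the two calls.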

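Vocabulary: classification and regression

If the prediction task is to classify the observations into a finite set of labels, in other words to "name" the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.

When doing classification in scikit-learn, y is usually a vector of integers or strings.

Note: See the Introduction to machine learning with scikit-learn Tutorial for a quick run-through on the basic machine learning vocabulary used within scikit-learn.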

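Nearest neighbor and the curse of dimensionality

Classifying irises:

…/…/images/sphx_glr_plot_iris_dataset_001.png

The iris dataset is a classification task that consists in identifying 3 different types of irises from the length and width of their petals and sepals: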

import numpy as np

from sklearn import datasets

iris = datasets.load_iris()

iris_X = iris.data

iris_y = iris.target

np.unique(iris_y)

array([0, 1, 2])

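k-Nearest neighbors classifier

The simplest possible classifier for this problem is the nearest neighbor: given a new observation X_test, find in the training set the observation closest to it. Because the class labels of the training observations are known, the label of that nearest neighbor can be used as the label of X_test. (See the Nearest Neighbors section of the online scikit-learn documentation for more information about this type of classifier.)

Training set and testing set

While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator, as this would give a misleadingly optimistic result. This is why datasets are often split into train and test data. During evaluation, the test data is entirely new to the estimator, so the measured performance can be trusted.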
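As a hedged alternative to a manual random permutation, scikit-learn's train_test_split helper can produce such a split. This is a sketch rather than the method used in this tutorial; the 10% test size, the random_state value and the use of stratify are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.1,         # hold out 10% of the samples
    random_state=0,        # fixed seed so the split is reproducible
    stratify=iris.target,  # keep the class proportions in both parts
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```

Stratifying matters for datasets such as iris whose samples are ordered by class, because it guarantees that every class appears in the test set.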

KNN (k nearest neighbors) classification example:

…/…/images/sphx_glr_plot_classification_001.png

# Split iris data in train and test data

# A random permutation, to split the data randomly

np.random.seed(0)

indices = np.random.permutation(len(iris_X))

iris_X_train = iris_X[indices[:-10]]

iris_y_train = iris_y[indices[:-10]]

iris_X_test = iris_X[indices[-10:]]

iris_y_test = iris_y[indices[-10:]]

# Create and fit a nearest-neighbor classifier

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(iris_X_train, iris_y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

knn.predict(iris_X_test)

array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])

iris_y_test

array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
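Comparing the two arrays above by eye, the predictions differ from the true labels in one of the ten test samples. The sketch below repeats the same split and computes that comparison as an accuracy, both by hand and with the estimator's score method; the shortened variable names are mine.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Same random split as above: last 10 permuted samples form the test set
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

knn = KNeighborsClassifier().fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = np.mean(y_pred == y_test)  # fraction of correct predictions
print(accuracy, knn.score(X_test, y_test))  # score() reports the same quantity
```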

The curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value $d$, which depends on the problem. In one dimension, this requires on average $n \sim 1/d$ points. In the context of the above $k$-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with $n$ training observations, then new data will be no further away than $1/n$. Therefore, the nearest neighbor decision rule will be efficient as soon as $1/n$ is small compared to the scale of between-class feature variations.

If the number of features is $p$, you now require $n \sim 1/d^p$ points. Let's say that we require 10 points in one dimension: now $10^p$ points are required in $p$ dimensions to pave the $[0, 1]$ space. As $p$ becomes large, the number of training points required for a good estimator grows exponentially.

For example, if each point is just a single number (8 bytes), then an effective $k$-NN estimator in a paltry $p \sim 20$ dimensions would require more training data than the current estimated size of the entire internet (±1000 Exabytes or so).

This is called the curse of dimensionality, a core problem that much of machine learning theory and many algorithms have to address.
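The exponential growth can be checked with a few lines of arithmetic. This sketch assumes, as in the text, 10 points per dimension and 8 bytes per point; the list of dimensions and the decimal definition of an exabyte (10^18 bytes) are illustrative choices.

```python
# Back-of-the-envelope arithmetic for the growth described above
points_per_dim = 10   # points required along one dimension
bytes_per_point = 8   # each point stored as a single 8-byte number
exabyte = 10 ** 18    # bytes in one exabyte (decimal convention)

sizes_eb = {}
for p in (1, 2, 5, 10, 20):
    n_points = points_per_dim ** p  # points needed to pave p dimensions
    sizes_eb[p] = n_points * bytes_per_point / exabyte
    print(f"p={p:2d}: {n_points:.0e} points, about {sizes_eb[p]:.1e} EB")
```

Under these assumptions, 20 dimensions already call for hundreds of exabytes, the same order of magnitude as the internet-size estimate quoted in the text.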

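Linear model: from regression to sparsity

Diabetes dataset

The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, together with an indication of disease progression after one year: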

diabetes = datasets.load_diabetes()

diabetes_X_train = diabetes.data[:-20]

diabetes_X_test = diabetes.data[-20:]

diabetes_y_train = diabetes.target[:-20]

diabetes_y_test = diabetes.target[-20:]

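The task at hand is to predict disease progression from the physiological variables.

Linear regression

LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters so as to make the sum of the squared residuals of the model as small as possible.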

…/…/images/sphx_glr_plot_ols_001.png

Linear model: $y = X\beta + \epsilon$

$X$: data (training data)

$y$: target variable

$\beta$: coefficients

$\epsilon$: observation noise

from sklearn import linear_model

regr = linear_model.LinearRegression()

regr.fit(diabetes_X_train, diabetes_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)

print(regr.coef_)

[ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937

492.81458798 102.84845219 184.60648906 743.51961675 76.09517222]

# The mean square error

np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2)

2004.56760268...

# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and y.

regr.score(diabetes_X_test, diabetes_y_test)

0.5850753022690…
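For a regressor, score returns the coefficient of determination R², that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on the same split to make the definition concrete; the variable names ss_res and ss_tot are mine.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, X_test = diabetes.data[:-20], diabetes.data[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, regr.score(X_test, y_test))  # the two values agree
```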

Shrinkage

If there are few data points per dimension, noise in the observations induces high variance:

…/…/images/sphx_glr_plot_ols_ridge_variance_001.png

X = np.c_[ .5, 1].T

y = [.5, 1]

test = np.c_[ 0, 2].T

regr = linear_model.LinearRegression()

import matplotlib.pyplot as plt

plt.figure()

np.random.seed(0)

for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:

…/…/images/sphx_glr_plot_ols_ridge_variance_002.png

regr = linear_model.Ridge(alpha=.1)

plt.figure()

np.random.seed(0)

for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

This is an example of the bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance.
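One way to see the shrinkage at work is to watch the size of the coefficient vector as alpha grows. The following sketch fits Ridge on the diabetes training data for a few alpha values picked only for illustration and prints the Euclidean norm of the coefficients.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]

norms = []
for alpha in (1e-4, 1e-2, 1.0, 100.0):
    ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
    norms.append(np.linalg.norm(ridge.coef_))  # size of the coefficient vector
    print(f"alpha={alpha:g}: ||coef|| = {norms[-1]:.1f}")
```

The norm decreases as alpha increases: the coefficients are pulled toward zero, which makes the fit less sensitive to noise at the price of more bias.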

We can choose alpha to minimize the left-out error, this time using the diabetes dataset rather than our synthetic data:

from __future__ import print_function

alphas = np.logspace(-4, -1, 6)

print([regr.set_params(alpha=alpha)
           .fit(diabetes_X_train, diabetes_y_train)
           .score(diabetes_X_test, diabetes_y_test)
       for alpha in alphas])

[0.5851110683883…, 0.5852073015444…, 0.5854677540698…,

0.5855512036503…, 0.5830717085554…, 0.57058999437…]

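Sparsity

Fitting only features 1 and 2

[Figures: diabetes_ols_1, diabetes_ols_3, diabetes_ols_2]

Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition for such a high-dimensional representation, but it may be useful to keep in mind that it would be a fairly empty space.

We can see that, although feature 2 has a strong coefficient in the full model, it conveys little information about y when considered together with feature 1.

To improve the conditioning of the problem (i.e. to mitigate the curse of dimensionality), it would be interesting to select only the informative features and discard the non-informative ones, for example by setting feature 2 to 0. Ridge regression decreases the contribution of the non-informative features but does not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients of the linear model to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models.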

regr = linear_model.Lasso()

scores = [regr.set_params(alpha=alpha)
              .fit(diabetes_X_train, diabetes_y_train)
              .score(diabetes_X_test, diabetes_y_test)
          for alpha in alphas]

best_alpha = alphas[scores.index(max(scores))]

regr.alpha = best_alpha

regr.fit(diabetes_X_train, diabetes_y_train)

Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

print(regr.coef_)

[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982 -0.

-187.19554705 69.38229038 508.66011217 71.84239008]
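To contrast the two penalties directly, the sketch below fits Lasso and Ridge with the same alpha (the best_alpha value reported in the output above) and counts how many coefficients are exactly zero. Reusing that alpha for Ridge is an illustrative choice, not something the tutorial does.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]

alpha = 0.025118864315095794  # best_alpha value reported in the output above

lasso = linear_model.Lasso(alpha=alpha).fit(X_train, y_train)
ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # coefficients removed by Lasso
n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # Ridge only shrinks them
print("zero coefficients - Lasso:", n_zero_lasso, "Ridge:", n_zero_ridge)
```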

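Different algorithms for the same problem

Different algorithms can be used to solve the same mathematical problem. For instance, the Lasso object in scikit-learn solves the lasso regression problem using a coordinate descent method, which is efficient on large datasets. However, scikit-learn also provides the LassoLars object, which uses the LARS algorithm and is very efficient for problems in which the estimated weight vector is very sparse (i.e. problems with very few observations).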

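Classification

…/…/images/sphx_glr_plot_logistic_001.png

For classification, as in the iris labeling task, linear regression is not the right approach, because it gives too much weight to data far from the decision frontier. A linear approach that does work is to fit a sigmoid function, or logistic function: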

log = linear_model.LogisticRegression(solver='lbfgs', C=1e5,
                                      multi_class='multinomial')

log.fit(iris_X_train, iris_y_train)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2', random_state=None,
                   solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

This is known as LogisticRegression.
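The logistic (sigmoid) function itself is simple to write down: it maps any real number z to 1 / (1 + exp(-z)), a value between 0 and 1. The helper below is a small illustration written for this note, not a scikit-learn function.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # small for very negative z, 0.5 at z = 0, close to 1 for large z
```

Because the curve saturates on both sides, points far from the decision frontier contribute little, unlike in plain linear regression.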

…/…/images/sphx_glr_plot_iris_logistic_001.png

Multiclass classification

If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic to make the final decision.
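A one-versus-all scheme can be sketched with scikit-learn's OneVsRestClassifier wrapper, which fits one binary classifier per class and predicts the class whose classifier is most confident. Wrapping LogisticRegression with the liblinear solver is an illustrative choice made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = datasets.load_iris()

# One binary "this class versus the rest" classifier is trained per class
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr.fit(iris.data, iris.target)

n_binary = len(ovr.estimators_)
print(n_binary)                    # number of underlying binary classifiers
print(ovr.predict(iris.data[:3]))  # final decision for the first three samples
```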

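Shrinkage and sparsity with logistic regression

The C parameter controls the amount of regularization in the LogisticRegression object: a large value of C results in less regularization. penalty="l2" gives shrinkage (the coefficients get smaller but do not become sparse), while penalty="l1" gives sparsity (some coefficients become exactly zero).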
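The effect of the two penalties can be observed by counting exactly-zero coefficients. The sketch below uses the digits dataset, the liblinear solver (which accepts both penalties) and C=0.1; all three are illustrative assumptions, written against the 0.20-era API used in this document.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
X = digits.data / digits.data.max()
y = digits.target

sparsity = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
    clf.fit(X, y)
    # Percentage of coefficients that are exactly zero
    sparsity[penalty] = np.mean(clf.coef_ == 0) * 100
    print(f"{penalty}: {sparsity[penalty]:.1f}% zero coefficients")
```

Lowering C strengthens the regularization, so the l1 model is expected to become sparser still.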

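Exercise

Try classifying the digits dataset with nearest neighbors and with a linear model. Leave out the last 10% as a test set and measure prediction performance on these observations.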

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()

X_digits = digits.data / digits.data.max()

y_digits = digits.target

Solution: …/…/auto_examples/exercises/plot_digits_classification_exercise.py

Support vector machines (SVMs)

Linear SVMs

Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the C parameter: a small value of C means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value of C means the margin is calculated using only the observations close to the separating line (less regularization).

[Figures: Unregularized SVM (svm_margin_unreg) | Regularized SVM (default) (svm_margin_reg)]
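The role of C can be made tangible by counting support vectors, the observations that actually define the margin. This sketch uses two iris classes and the first two features with three C values; the data subset and the C values are illustrative choices, not the data behind the figures.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# Keep two classes and two features so the problem is a single separating line
mask = iris.target != 0
X, y = iris.data[mask][:, :2], iris.target[mask]

n_support = {}
for C in (0.01, 1.0, 100.0):
    svc = svm.SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(svc.n_support_.sum())  # total number of support vectors
    print(f"C={C:g}: {n_support[C]} support vectors")
```

A small C (more regularization) lets many observations take part in defining the margin, while a large C relies on fewer of them.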

Example:

Plot different SVM classifiers in the iris dataset

SVMs can be used for regression, with SVR (Support Vector Regression), or for classification, with SVC (Support Vector Classification).

from sklearn import svm

svc = svm.SVC(kernel='linear')

svc.fit(iris_X_train, iris_y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

Warning: Normalizing data

For many estimators, including SVMs, having datasets with unit standard deviation for each feature is important to get good predictions.
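One common way to obtain unit standard deviation per feature is StandardScaler, optionally chained with the classifier in a pipeline so that the same scaling is applied at prediction time. This is a sketch of that approach; the tutorial itself does not prescribe a particular scaler.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Standardize each feature to zero mean and unit standard deviation
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# The pipeline learns the scaling on the data passed to fit and reuses it in predict
model = make_pipeline(StandardScaler(), svm.SVC(kernel="linear"))
model.fit(X, y)
print(model.predict(X[:3]))
```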

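Using kernels

Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick, which can be seen as creating a decision energy by positioning kernels on observations: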

Linear kernel

[Figure: svm_kernel_linear]

svc = svm.SVC(kernel='linear')

Polynomial kernel

[Figure: svm_kernel_poly]

svc = svm.SVC(kernel='poly',
              degree=3)

# degree: polynomial degree

RBF kernel (Radial Basis Function)

[Figure: svm_kernel_rbf]

svc = svm.SVC(kernel='rbf')

# gamma: inverse of size of radial kernel
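To compare the three kernels on real data, the sketch below fits each one on the iris split used earlier in this document and reports the test accuracy. Setting gamma="scale" explicitly is my choice, made so that the result does not depend on the library's default; with only 10 test samples, the scores are a rough indication at best.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()

# Same random split as in the nearest-neighbor example
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # degree is only used by the polynomial kernel
    svc = svm.SVC(kernel=kernel, degree=3, gamma="scale")
    scores[kernel] = svc.fit(X_train, y_train).score(X_test, y_test)
    print(kernel, scores[kernel])
```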

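Interactive example

See the SVM GUI to download svm_gui.py; add data points of both classes with the left and right mouse buttons, fit the model, and change the parameters and the data.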

…/…/images/sphx_glr_plot_iris_dataset_001.png

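Exercise

Try classifying classes 1 and 2 from the iris dataset with SVMs, using only the first two features. Leave out 10% of each class as a test set and measure prediction performance on these observations.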

Warning: the classes are ordered, do not leave out the last 10%, you would be testing on only one class.

Hint: You can use the decision_function method on a grid to get intuitions.

iris = datasets.load_iris()

X = iris.data

y = iris.target

X = X[y != 0, :2]

y = y[y != 0]
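The hint about decision_function can be sketched as follows: evaluate the fitted classifier on a regular grid spanning the two features and inspect the sign of the result. This is only the grid evaluation, not a full solution; the 50x50 resolution and the linear kernel are illustrative choices, and the train/test split of the exercise is deliberately left out.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

clf = svm.SVC(kernel="linear").fit(X, y)

# Regular grid covering the observed range of the two features
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 50),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 50),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# Signed score per grid point: the sign tells which side of the boundary it is on
Z = clf.decision_function(grid).reshape(xx.shape)
print(Z.shape, Z.min(), Z.max())
```

Passing xx, yy and Z to a contour plot, for instance matplotlib's plt.contourf, then shows the decision regions and the separating line where Z equals 0.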

Solution: …/…/auto_examples/exercises/plot_iris_exercise.py

© 2007 - 2018, scikit-learn developers (BSD License).

---------------------

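Author: ScorpioDoctor

Source: CSDN

Original: Machine Learning with Python scikit-learn, Chinese Documentation (6): Supervised learning: predicting an output variable from high-dimensional observations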