Version check
In [1]:
import sys
print('python version : ', sys.version)
python version : 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
In [2]:
import pandas as pd
print('pandas version : ', pd.__version__)
pandas version : 1.2.4
In [3]:
import matplotlib
print('matplotlib version : ', matplotlib.__version__)
matplotlib version : 3.3.4
In [4]:
import numpy as np
print('numpy version : ', np.__version__)
numpy version : 1.20.1
In [5]:
import scipy as sp
print('scipy version : ', sp.__version__)
scipy version : 1.6.2
In [6]:
import sklearn
print('scikit-learn version : ', sklearn.__version__)
scikit-learn version : 0.24.1
1~2. Loading the data
In [16]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
In [32]:
print('iris dataset key : \n', iris_dataset.keys())
iris dataset key : dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
In [18]:
print(iris_dataset['DESCR'][:])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
In [30]:
print('target names (the values we want to predict): \n', iris_dataset['target_names'])
target names (the values we want to predict): 
 ['setosa' 'versicolor' 'virginica']
In [28]:
print('feature names: \n', iris_dataset['feature_names'])
feature names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [31]:
print('actual data (first five rows):\n', iris_dataset['data'][:5])
print('\nactual data shape: ', iris_dataset['data'].shape)
print('\nactual data type: ', type(iris_dataset['data']))
actual data (first five rows):
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

actual data shape:  (150, 4)

actual data type:  <class 'numpy.ndarray'>
In [35]:
print('target values: \n', iris_dataset['target'])
target values: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
In [33]:
print('target shape: ', iris_dataset['target'].shape)
target shape:  (150,)
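As an aside that is not part of the original notebook: since scikit-learn 0.23 the loader can also return pandas objects directly, which avoids building a DataFrame by hand as done in step 4 below. A minimal sketch, assuming the scikit-learn 0.24.1 shown above:

# alternative loading path (assumes scikit-learn >= 0.23)
iris_frame = load_iris(as_frame=True)
print(iris_frame.data.head())             # the four features as a pandas DataFrame
print(iris_frame.target.value_counts())   # the 0/1/2 labels as a pandas Series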
3 Building the model
- Split the data into a training set and a hold-out set (the test set)
- Use scikit-learn's train_test_split function
In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], test_size = 0.3, random_state=5)
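Because the iris classes are perfectly balanced (50/50/50), a plain random split is usually fine, but train_test_split also accepts a stratify argument that preserves the class proportions in both splits. A small variation on the call above, shown only as a sketch (the _s variables are hypothetical names for this example):

# stratified variant: each split keeps the 1/3-1/3-1/3 class balance
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.3, random_state=5, stratify=iris_dataset['target'])
print(np.bincount(y_train_s), np.bincount(y_test_s))   # per-class counts in each split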
In [38]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
X_train shape:  (105, 4)
y_train shape:  (105,)
In [39]:
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_test shape:  (45, 4)
y_test shape:  (45,)
4 Inspecting the data
In [40]:
iris_dataframe = pd.DataFrame(X_train, columns = iris_dataset.feature_names)
print(iris_dataframe)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  6.2               2.8                4.8               1.8
1                  5.9               3.0                4.2               1.5
2                  6.7               3.3                5.7               2.1
3                  7.7               3.8                6.7               2.2
4                  5.4               3.4                1.7               0.2
..                 ...               ...                ...               ...
100                4.4               2.9                1.4               0.2
101                6.1               2.8                4.7               1.2
102                6.7               3.3                5.7               2.5
103                7.7               2.6                6.9               2.3
104                5.7               2.8                4.1               1.3

[105 rows x 4 columns]
In [48]:
!pip install mglearn
import mglearn
Requirement already satisfied: mglearn in c:\users\bbeee\anaconda3\lib\site-packages (0.1.9)
Requirement already satisfied: joblib in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.0.1)
Requirement already satisfied: scikit-learn in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (0.24.1)
Requirement already satisfied: matplotlib in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (3.3.4)
Requirement already satisfied: cycler in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (0.10.0)
Requirement already satisfied: pandas in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.2.4)
Requirement already satisfied: numpy in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.20.1)
Requirement already satisfied: imageio in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (2.9.0)
Requirement already satisfied: pillow in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (8.2.0)
Requirement already satisfied: six in c:\users\bbeee\anaconda3\lib\site-packages (from cycler->mglearn) (1.15.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (1.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (2.4.7)
Requirement already satisfied: pytz>=2017.3 in c:\users\bbeee\anaconda3\lib\site-packages (from pandas->mglearn) (2021.1)
Requirement already satisfied: scipy>=0.19.1 in c:\users\bbeee\anaconda3\lib\site-packages (from scikit-learn->mglearn) (1.6.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\bbeee\anaconda3\lib\site-packages (from scikit-learn->mglearn) (2.1.0)
In [79]:
pd.plotting.scatter_matrix(iris_dataframe, c = y_train, figsize=(15,15), marker='o',
hist_kwds={'bins' : 20}, s = 60, alpha = .8, cmap = mglearn.cm3)
Out[79]:
4 x 4 array of AxesSubplot objects: the pairwise scatter matrix of sepal length/width and petal length/width, colored by y_train, with histograms on the diagonal.
5 KNN (k-nearest neighbors)
- The model is built by simply storing the training data.
- To predict a new data point, the algorithm finds the training point closest to the new point and assigns that training point's label to the new data point.
- More generally, it finds the 'k' nearest neighbors of the new data point and uses the most frequent class among them as the prediction (see the sketch after this list).
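To make the bullet points above concrete, here is a minimal NumPy sketch of the same idea. knn_predict_one is a hypothetical helper for illustration only, not scikit-learn's implementation:

def knn_predict_one(x_query, X_tr, y_tr, k=1):
    # hypothetical helper: brute-force k-nearest-neighbors vote for a single query point
    dists = np.sqrt(((X_tr - x_query) ** 2).sum(axis=1))   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
    return np.bincount(y_tr[nearest]).argmax()             # majority vote over their labels

With k=1 this simply copies the label of the single closest training point, which is what the KNeighborsClassifier(n_neighbors=1) used below does.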
In [81]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)  # the knn object encapsulates the algorithm that builds the model from the training data and makes predictions for new data points
# in the case of KNeighborsClassifier, that just means storing the training data itself
In [82]:
# build the model from the training set
# using the fit method of the knn object
knn.fit(X_train, y_train)
# default parameters of the fitted estimator:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
Out[82]:
KNeighborsClassifier(n_neighbors=1)
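n_neighbors=1 is just the choice used in this walkthrough. If you want to compare several values of k, cross-validation on the training set is one common approach; a sketch, not part of the original notebook:

from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, round(scores.mean(), 3))   # mean accuracy over the 5 folds for each k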
6 Making a prediction
- Give arbitrary sepal and petal widths (width) and lengths (length) and predict the species of the iris.
- Classes: setosa (0), versicolor (1), virginica (2)
In [93]:
# the measurements must be a NumPy array: 1 sample with 4 features, so build a 1 x 4 array
X_new = np.array([[1,1,3,3]])
print('X_new.shape', X_new.shape)
# use the predict method of the knn object to predict the species
prediction = knn.predict(X_new)
print("Prediction: ", prediction)
print("Predicted target name: ", iris_dataset['target_names'][prediction])
X_new.shape (1, 4)
Prediction:  [1]
Predicted target name:  ['versicolor']
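KNeighborsClassifier also provides predict_proba, which reports the fraction of the k neighbors that belong to each class; with k=1 it is just a one-hot vote, but it becomes more informative for larger k. A quick look using the same X_new:

print(knn.predict_proba(X_new))   # with k=1 this should be a one-hot row such as [[0. 1. 0.]]
print(knn.classes_)               # the class labels the columns correspond to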
7 Evaluating the model
- Evaluate the model's performance by computing its accuracy on the test set: the fraction of irises whose species is predicted correctly.
In [96]:
y_pred = knn.predict(X_test)
print("test set์ ๋ํ ์์ธก๊ฐ : \n", y_pred)
print('\n ํ
์คํธ ์ธํธ์ ์ ํ๋ : {:.2f}'.format(np.mean(y_pred == y_test)))
test set์ ๋ํ ์์ธก๊ฐ : [1 2 2 0 2 1 0 2 0 1 1 1 2 2 0 0 2 2 0 0 1 2 0 2 1 2 1 1 1 2 0 1 1 0 1 0 0 2 0 2 2 1 0 0 1] ํ ์คํธ ์ธํธ์ ์ ํ๋ : 0.93
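The same number can also be obtained from the estimator's score method, which computes the mean accuracy on the given test data:

print('Test set accuracy: {:.2f}'.format(knn.score(X_test, y_test)))   # should match the value above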