Version check
In [1]:
import sys
print('python version : ', sys.version)
python version : 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
In [2]:
import pandas as pd
print('pandas version : ', pd.__version__)
pandas version : 1.2.4
In [3]:
import matplotlib
print('matplotlib version : ', matplotlib.__version__)
matplotlib version : 3.3.4
In [4]:
import numpy as np
print('numpy version : ', np.__version__)
numpy version : 1.20.1
In [5]:
import scipy as sp
print('scipy version : ', sp.__version__)
scipy version : 1.6.2
In [6]:
import sklearn
print('scikit-learn version : ', sklearn.__version__)
scikit-learn version : 0.24.1
1~2. Loading the data
In [16]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
In [32]:
print('iris dataset key : \n', iris_dataset.keys())
iris dataset key : dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
In [18]:
print(iris_dataset['DESCR'][:])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
In [30]:
print('target names (the values we want to predict): \n', iris_dataset['target_names'])
target names (the values we want to predict): 
 ['setosa' 'versicolor' 'virginica']
In [28]:
print('feature names: \n', iris_dataset['feature_names'])
feature names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [31]:
print('actual data (first five rows):\n', iris_dataset['data'][:5])
print('\nactual data shape: ', iris_dataset['data'].shape)
print('\nactual data type: ', type(iris_dataset['data']))
actual data (first five rows):
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

actual data shape:  (150, 4)

actual data type:  <class 'numpy.ndarray'>
In [35]:
print('target values: \n', iris_dataset['target'])
target values: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
In [33]:
print('target shape: ', iris_dataset['target'].shape)
target shape:  (150,)
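As an aside that is not part of the original notebook: since scikit-learn 0.23 the loader can also return pandas objects directly, which avoids building a DataFrame by hand as done in step 4 below. A minimal sketch, assuming the scikit-learn 0.24.1 shown above:

# alternative loading path (assumes scikit-learn >= 0.23)
iris_frame = load_iris(as_frame=True)
print(iris_frame.data.head())             # the four features as a pandas DataFrame
print(iris_frame.target.value_counts())   # the 0/1/2 labels as a pandas Series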
3 Building the model
- Split the data into a training set and a hold-out set (the test set)
- Use scikit-learn's train_test_split function
In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], test_size = 0.3, random_state=5)
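Because the iris classes are perfectly balanced (50/50/50), a plain random split is usually fine, but train_test_split also accepts a stratify argument that preserves the class proportions in both splits. A small variation on the call above, shown only as a sketch (the _s variables are hypothetical names for this example):

# stratified variant: each split keeps the 1/3-1/3-1/3 class balance
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.3, random_state=5, stratify=iris_dataset['target'])
print(np.bincount(y_train_s), np.bincount(y_test_s))   # per-class counts in each split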
In [38]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
X_train shape:  (105, 4)
y_train shape:  (105,)
In [39]:
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_test shape:  (45, 4)
y_test shape:  (45,)
4 Inspecting the data
In [40]:
iris_dataframe = pd.DataFrame(X_train, columns = iris_dataset.feature_names)
print(iris_dataframe)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  6.2               2.8                4.8               1.8
1                  5.9               3.0                4.2               1.5
2                  6.7               3.3                5.7               2.1
3                  7.7               3.8                6.7               2.2
4                  5.4               3.4                1.7               0.2
..                 ...               ...                ...               ...
100                4.4               2.9                1.4               0.2
101                6.1               2.8                4.7               1.2
102                6.7               3.3                5.7               2.5
103                7.7               2.6                6.9               2.3
104                5.7               2.8                4.1               1.3

[105 rows x 4 columns]
In [48]:
!pip install mglearn
import mglearn
Requirement already satisfied: mglearn in c:\users\bbeee\anaconda3\lib\site-packages (0.1.9)
Requirement already satisfied: joblib in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.0.1)
Requirement already satisfied: scikit-learn in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (0.24.1)
Requirement already satisfied: matplotlib in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (3.3.4)
Requirement already satisfied: cycler in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (0.10.0)
Requirement already satisfied: pandas in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.2.4)
Requirement already satisfied: numpy in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (1.20.1)
Requirement already satisfied: imageio in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (2.9.0)
Requirement already satisfied: pillow in c:\users\bbeee\anaconda3\lib\site-packages (from mglearn) (8.2.0)
Requirement already satisfied: six in c:\users\bbeee\anaconda3\lib\site-packages (from cycler->mglearn) (1.15.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (1.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\bbeee\anaconda3\lib\site-packages (from matplotlib->mglearn) (2.4.7)
Requirement already satisfied: pytz>=2017.3 in c:\users\bbeee\anaconda3\lib\site-packages (from pandas->mglearn) (2021.1)
Requirement already satisfied: scipy>=0.19.1 in c:\users\bbeee\anaconda3\lib\site-packages (from scikit-learn->mglearn) (1.6.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\bbeee\anaconda3\lib\site-packages (from scikit-learn->mglearn) (2.1.0)
In [79]:
pd.plotting.scatter_matrix(iris_dataframe, c = y_train, figsize=(15,15), marker='o',
hist_kwds={'bins' : 20}, s = 60, alpha = .8, cmap = mglearn.cm3)
Out[79]:
4 x 4 array of AxesSubplot objects: the pairwise scatter matrix of sepal length/width and petal length/width, colored by y_train, with histograms on the diagonal.
5 KNN (k-nearest neighbors)
- The model is built by simply storing the training data.
- To predict a new data point, the algorithm finds the training point closest to the new point and assigns that training point's label to the new data point.
- More generally, it finds the 'k' nearest neighbors of the new data point and uses the most frequent class among them as the prediction (see the sketch after this list).
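To make the bullet points above concrete, here is a minimal NumPy sketch of the same idea. knn_predict_one is a hypothetical helper for illustration only, not scikit-learn's implementation:

def knn_predict_one(x_query, X_tr, y_tr, k=1):
    # hypothetical helper: brute-force k-nearest-neighbors vote for a single query point
    dists = np.sqrt(((X_tr - x_query) ** 2).sum(axis=1))   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
    return np.bincount(y_tr[nearest]).argmax()             # majority vote over their labels

With k=1 this simply copies the label of the single closest training point, which is what the KNeighborsClassifier(n_neighbors=1) used below does.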
In [81]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)  # the knn object encapsulates the algorithm that builds the model from the training data and makes predictions for new data points
# in the case of KNeighborsClassifier, that just means storing the training data itself
In [82]:
# build the model from the training set
# using the fit method of the knn object
knn.fit(X_train, y_train)
# default parameters of the fitted estimator:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
Out[82]:
KNeighborsClassifier(n_neighbors=1)
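n_neighbors=1 is just the choice used in this walkthrough. If you want to compare several values of k, cross-validation on the training set is one common approach; a sketch, not part of the original notebook:

from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, round(scores.mean(), 3))   # mean accuracy over the 5 folds for each k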
6 Making a prediction
- Give arbitrary sepal and petal widths (width) and lengths (length) and predict the species of the iris.
- Classes: setosa (0), versicolor (1), virginica (2)
In [93]:
# the measurements must be a NumPy array: 1 sample with 4 features, so build a 1 x 4 array
X_new = np.array([[1,1,3,3]])
print('X_new.shape', X_new.shape)
# use the predict method of the knn object to predict the species
prediction = knn.predict(X_new)
print("Prediction: ", prediction)
print("Predicted target name: ", iris_dataset['target_names'][prediction])
X_new.shape (1, 4)
Prediction:  [1]
Predicted target name:  ['versicolor']
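KNeighborsClassifier also provides predict_proba, which reports the fraction of the k neighbors that belong to each class; with k=1 it is just a one-hot vote, but it becomes more informative for larger k. A quick look using the same X_new:

print(knn.predict_proba(X_new))   # with k=1 this should be a one-hot row such as [[0. 1. 0.]]
print(knn.classes_)               # the class labels the columns correspond to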
7 Evaluating the model
- Evaluate the model's performance by computing its accuracy on the test set: the fraction of irises whose species is predicted correctly.
In [96]:
y_pred = knn.predict(X_test)
print("test set์ ๋ํ ์์ธก๊ฐ : \n", y_pred)
print('\n ํ
์คํธ ์ธํธ์ ์ ํ๋ : {:.2f}'.format(np.mean(y_pred == y_test)))
test set์ ๋ํ ์์ธก๊ฐ : [1 2 2 0 2 1 0 2 0 1 1 1 2 2 0 0 2 2 0 0 1 2 0 2 1 2 1 1 1 2 0 1 1 0 1 0 0 2 0 2 2 1 0 0 1] ํ ์คํธ ์ธํธ์ ์ ํ๋ : 0.93
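The same number can also be obtained from the estimator's score method, which computes the mean accuracy on the given test data:

print('Test set accuracy: {:.2f}'.format(knn.score(X_test, y_test)))   # should match the value above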