The TF-IDF method

TF-IDF

Term Frequency - Inverse Document Frequency

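Usually shortened to TF-IDF, this is a weighting scheme that uses term frequency and inverse document frequency to weight each word in a DTM (document-term matrix) according to how important it is.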

It is mainly used to compute the similarity between documents.

Document similarity here means the similarity a search system uses to decide how relevant each search result is.
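
To make the search-ranking idea concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer and cosine_similarity. The three documents and the query are made-up toy data, not taken from the dataset used later in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock prices fell sharply today",
]
query = "cat on the mat"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # learn vocabulary and idf from the documents
query_vector = vectorizer.transform([query])   # reuse the same vocabulary for the query

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Document indices ordered from most to least similar
ranking = scores.argsort()[::-1].tolist()
for idx in ranking:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

The document that shares the most (and the rarest) words with the query comes first, and a document with no words in common gets a score of zero.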

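tf (term frequency) is how often a word occurs in a given document, and df (document frequency) is the number of documents in which the word appears. A word that shows up in fewer documents carries more valuable information.

The more documents a word appears in, the more generic it is. Such common words should be down-weighted even when their tf is high, otherwise the analysis does not work properly.

So, to reflect how rare a word is across documents, tf is multiplied by idf (the inverse of df). In practice idf is usually log-scaled rather than a plain reciprocal.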
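
The weighting can be worked through by hand. The sketch below uses plain Python and the common textbook form idf = log(N / df); the tiny tokenized corpus is made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data)
docs = [
    ["movie", "fun", "movie"],
    ["movie", "boring"],
    ["movie", "actor", "fun"],
]
n_docs = len(docs)

# df: number of documents that contain each word at least once
df = Counter()
for doc in docs:
    df.update(set(doc))

def tf_idf(term, doc):
    """Raw term count in the document times log(N / df)."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

print("movie in doc 0 :", round(tf_idf("movie", docs[0]), 3))
print("fun in doc 0   :", round(tf_idf("fun", docs[0]), 3))
print("boring in doc 1:", round(tf_idf("boring", docs[1]), 3))
```

"movie" appears in every document, so its idf is log(1) = 0 and its weight vanishes even though it has the highest tf in the first document. "boring" appears in only one document and gets the largest weight.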

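TF-IDF is a good fit when you want to find the topic of a given document.

CountVectorizer, by contrast, is used to find the main character of a text or its most frequently repeated keywords.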
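
As a rough illustration of the keyword-counting use case, the sketch below sums CountVectorizer counts over a few made-up sentences and picks the most frequent word. The sentences and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy text (hypothetical data)
sentences = [
    "Alice met Bob",
    "Alice thanked Bob and Alice left",
    "Carol called Alice",
]

counter = CountVectorizer()
counts = counter.fit_transform(sentences)   # sparse matrix: sentences x words

# Total occurrences of each word across all sentences
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps word -> column index, so sort the words by that index
words = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

top_word, top_count = max(zip(words, totals), key=lambda pair: pair[1])
print(top_word, top_count)
```

Note that CountVectorizer lowercases text by default, so the names come back in lowercase.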

Raw data file: ratings_test.txt (4.67 MB)

# Run TF-IDF

import pandas as pd

text_train = pd.read_csv('ratings_train.txt', delimiter='\t')
text_test = pd.read_csv('ratings_test.txt',delimiter='\t')

text_train.dropna(inplace=True)
text_test.dropna(inplace=True)

X_train = text_train['document']
y_train = text_train['label']

X_test = text_test['document']
y_test = text_test['label']

# Notebook shell command; outside a notebook, run "pip install konlpy" in a terminal
!pip install konlpy
from konlpy.tag import Okt
okt = Okt()

def myToken(text):
    # Tokenize Korean text with Okt and keep only the nouns
    return okt.nouns(text)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import make_pipeline

pipe_model = make_pipeline(TfidfVectorizer(tokenizer=myToken), LogisticRegression())

pipe_model.fit(X_train, y_train)
### This line raised an error: score() returns a float, which cannot be added to a str
print(f"pipe model score : {pipe_model.score(X_test, y_test)}")

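# Inspect the pipeline
cv = pipe_model.steps[0][1]    # the fitted TfidfVectorizer
print(len(cv.vocabulary_))     # vocabulary size

logi = pipe_model.steps[1][1]  # the fitted LogisticRegression
word_weights = logi.coef_      # one weight per vocabulary word

# Row 0 holds the words, row 1 their column indices
df = pd.DataFrame([cv.vocabulary_.keys(),
                    cv.vocabulary_.values()])

print(df.head())

# Transpose so that each row is one (word, index) pair
df = df.T

print(df)

# Sort by column index so the rows line up with the order of the coefficients
df_sorted = df.sort_values(by=1)
print(df_sorted)

df_sorted['coef'] = word_weights.reshape(-1)
print(df_sorted)

# Ascending sort: most negative weights first, most positive last
df_sorted.sort_values(by='coef', inplace=True)
print(df_sorted)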
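
The alignment step is the part that is easy to get wrong: vocabulary_ maps each word to a column index, and coef_ is ordered by that index. The sketch below repeats the same idea on a tiny made-up English dataset with the default tokenizer, so it runs without KoNLPy. The texts, labels, and names such as toy_model are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews (hypothetical data): 1 = positive, 0 = negative
texts = [
    "great fun movie",
    "fun and great acting",
    "boring slow movie",
    "slow and boring plot",
]
labels = [1, 1, 0, 0]

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(texts, labels)

vec = toy_model.steps[0][1]   # fitted vectorizer
clf = toy_model.steps[1][1]   # fitted classifier

# Sort the words by their column index so they match the order of coef_
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

table = pd.DataFrame({"word": words, "coef": clf.coef_.ravel()})
table = table.sort_values("coef")   # most negative first, most positive last
print(table)
```

Building the table from a dict of equal-length columns avoids the transpose step and keeps each word next to its own weight.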

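Data visualization

# Visualization
# Top 15 negative words (most negative coefficients)
bad_word = df_sorted.head(15)
# Top 15 positive words (most positive coefficients)
good_word = df_sorted.tail(15)

The top 15 negative words and the top 15 positive words are pulled out and plotted together.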

top15 = pd.concat([bad_word, good_word])

import matplotlib.pyplot as plt

from matplotlib import font_manager, rc

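# Settings for rendering Korean text: Malgun Gothic font (Windows path, raw string so the backslashes are kept literally)
font_name = font_manager.FontProperties(fname=r"C:\Windows\Fonts\malgun.ttf").get_name()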
rc('font',family=font_name)

plt.figure(figsize=(15,5))
plt.title('Top 15 positive/negative words', fontsize=20)
plt.bar(top15[0],top15['coef'])
plt.xticks(rotation = 90)
plt.show()

Comparing results with a plain CountVectorizer
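
One way such a comparison could be set up is to fit the same classifier behind each vectorizer and score both on the same test set. The sketch below does this on a tiny made-up English dataset, so the scores it prints say nothing about the movie-review data. With the data from this post you would pass X_train, y_train, X_test, y_test and tokenizer=myToken instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical): 1 = positive, 0 = negative
train_texts = [
    "great fun movie",
    "fun and great acting",
    "boring slow movie",
    "slow and boring plot",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great acting", "boring plot"]
test_labels = [1, 0]

results = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # Same classifier each time; only the text features change
    model = make_pipeline(vectorizer, LogisticRegression())
    model.fit(train_texts, train_labels)
    results[name] = model.score(test_texts, test_labels)

print(results)
```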
