Dev.Op
Yollow 📚
    The TF-IDF method

    2021. 11. 12. 14:27

    TF-IDF

    Term Frequency - Inverse Document Frequency

    TF-IDF, as it is usually abbreviated, weights each word in a DTM (document-term matrix) by how important it is, using the word's term frequency and its inverse document frequency.

    It is mainly used to measure document similarity.

    Document similarity here means the similarity a search system uses to rank how relevant each search result is.
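As a rough illustration of document similarity, the sketch below scores a query against two documents using cosine similarity. Cosine similarity is an assumption here (a common choice, but the post does not name a specific measure), and the query, documents, and the `cosine_similarity` helper are all hypothetical. Raw word counts stand in for the weights to keep it short; TF-IDF weights would slot into the same dictionaries.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy data: a query and two documents as bags of words.
query = Counter("electric car battery".split())
doc_a = Counter("electric car battery range test".split())
doc_b = Counter("python pandas dataframe tutorial".split())

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a, sim_b)
```

A search system would rank `doc_a` above `doc_b` here, since it shares every query term while `doc_b` shares none.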

     

    tf is how often a word occurs within a given document, while df is the number of documents in which the word appears. The fewer documents a word appears in, the more informative it is.

    A word that appears in many documents is a general word, and such common words should be down-weighted even when their tf is large, or the analysis will be skewed.

    So, to reflect how rare a word is across documents, TF-IDF multiplies tf by idf, the inverse of df (usually log-scaled in practice).
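The tf, df, and idf quantities described above can be computed by hand on a small corpus. This is a minimal sketch on a hypothetical, pre-tokenized corpus, and it assumes the plain form idf = log(N / df). scikit-learn's `TfidfVectorizer` defaults to a smoothed variant and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["movie", "great", "movie"],
    ["movie", "boring"],
    ["movie", "great", "acting"],
]

n_docs = len(docs)

# df: number of documents that contain each word (counted once per document).
doc_freq = Counter(word for doc in docs for word in set(doc))

# idf: plain log-scaled inverse of df (an assumption, see the note above).
idf = {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def tf_idf(doc):
    """Return {word: tf * idf} for one tokenized document."""
    tf = Counter(doc)
    return {word: freq * idf[word] for word, freq in tf.items()}


weights = tf_idf(docs[0])
print(weights)
```

In the first document "movie" occurs twice, but because it appears in every document its idf is log(3/3) = 0, so its weight drops to zero. "great" occurs once and keeps a positive weight.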

     

    This makes TF-IDF a good fit when you want to find the topic of a given document.

    CountVectorizer, in contrast, is used to find the main character of a text or the keywords that are repeated most often.
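The contrast between raw counts and TF-IDF can be seen on a toy example. The sketch below is not scikit-learn code: `top_by_count` and `top_by_tfidf` are hypothetical helpers, the reviews are made up, and idf is again assumed to be the plain log(N / df).

```python
import math
from collections import Counter

# Hypothetical tokenized reviews; "movie" is frequent in every one of them.
docs = [
    ["movie", "movie", "movie", "heist", "heist"],
    ["movie", "movie", "romance"],
    ["movie", "comedy"],
]


def top_by_count(doc):
    """Most repeated word in one document (what raw counts highlight)."""
    return Counter(doc).most_common(1)[0][0]


def top_by_tfidf(doc, corpus):
    """Word with the highest tf * log(N / df) in one document."""
    n = len(corpus)
    counts = Counter(doc)

    def score(word):
        doc_freq = sum(1 for d in corpus if word in d)
        return counts[word] * math.log(n / doc_freq)

    return max(counts, key=score)


count_pick = top_by_count(docs[0])
tfidf_pick = top_by_tfidf(docs[0], docs)
print(count_pick, tfidf_pick)
```

Counting picks the word that is simply repeated most, while TF-IDF picks the word that sets this document apart from the others.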

     
    Raw file download: ratings_test.txt (4.67MB)
     
     
    # Run TF-IDF
    
    import pandas as pd
    
    text_train = pd.read_csv('ratings_train.txt', delimiter='\t')
    text_test = pd.read_csv('ratings_test.txt',delimiter='\t')
    
    text_train.dropna(inplace=True)
    text_test.dropna(inplace=True)
    
    X_train = text_train['document']
    y_train = text_train['label']
    
    X_test = text_test['document']
    y_test = text_test['label']
    
    # Install KoNLPy first (in a notebook cell: !pip install konlpy)
    from konlpy.tag import Okt
    okt = Okt()
    
    def myToken(text):
        return okt.nouns(text)
                                           
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    from sklearn.pipeline import make_pipeline
    
    pipe_model = make_pipeline(TfidfVectorizer(tokenizer=myToken), LogisticRegression())
    
    pipe_model.fit(X_train, y_train)
    # The original line raised a TypeError here: a str cannot be concatenated with the float that score() returns
    print("pipe model score : " + str(pipe_model.score(X_test, y_test)))
    
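    # Inspect the fitted pipeline
    cv = pipe_model.steps[0][1]  # the TfidfVectorizer step
    print(len(cv.vocabulary_))   # vocabulary size
    
    logi = pipe_model.steps[1][1]  # the LogisticRegression step
    word_weights = logi.coef_      # one coefficient per vocabulary word for a binary label
    
    # Row 0: words, row 1: their column indices in the TF-IDF matrix
    df = pd.DataFrame([list(cv.vocabulary_.keys()),
                        list(cv.vocabulary_.values())])
    
    print(df.head())
    
    # Transpose so that each row is one (word, index) pair
    df = df.T
    
    print(df.head())
    
    # Sort by column index so the rows line up with the coefficient order
    df_sorted = df.sort_values(by=1)
    print(df_sorted)
    
    df_sorted['coef'] = word_weights.reshape(-1)
    print(df_sorted)
    
    # Sort by coefficient: most negative words first, most positive last
    df_sorted.sort_values(by='coef', inplace=True)
    print(df_sorted)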
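The inspection code works because each word's vocabulary index is also its position in the coefficient array. Here is a library-free sketch of that alignment. The `vocabulary` dict and `coefficients` list are hypothetical stand-ins for `cv.vocabulary_` and `logi.coef_[0]`, with made-up values.

```python
# Hypothetical stand-ins for the vectorizer vocabulary and the model coefficients.
vocabulary = {"great": 2, "boring": 0, "movie": 3, "fun": 1}
coefficients = [-1.8, 1.1, 2.3, 0.05]  # position i belongs to column index i

# Order the words by column index so they line up with the coefficients.
words_by_index = sorted(vocabulary, key=vocabulary.get)
word_coefs = list(zip(words_by_index, coefficients))

# Sort by coefficient: most negative first, most positive last.
ranked = sorted(word_coefs, key=lambda pair: pair[1])
print(ranked)
```

Skipping the sort by index would attach coefficients to the wrong words, because the vocabulary dictionary is not ordered by column index.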


    Data visualization

    # Visualization
    # Top 15 negative words (lowest coefficients)
    bad_word = df_sorted.head(15)
    # Top 15 positive words (highest coefficients)
    good_word = df_sorted.tail(15)

    Next, we plot the top 15 negative words and the top 15 positive words together.

    top15 = pd.concat([bad_word, good_word])
    
    import matplotlib.pyplot as plt
    
    from matplotlib import font_manager, rc
    
    # Font setup for rendering Korean text (Malgun Gothic); the raw string keeps the backslashes in the Windows path literal
    font_name = font_manager.FontProperties(fname=r"C:\Windows\Fonts\malgun.ttf").get_name()
    rc('font',family=font_name)
    
    plt.figure(figsize=(15,5))
    plt.title('Top 15 positive/negative words', fontsize=20)
    plt.bar(top15[0],top15['coef'])
    plt.xticks(rotation = 90)
    plt.show()

    Comparing the results with a plain CountVectorizer

     
