[python] crawling customizing

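General code and characteristics of Python crawling

Crawling without displaying the web browser (headless)

Open a new window as a second tab

Blank

Create a module dedicated to scraping

One window is blank, the other is the target window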
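The blank-window/target-window setup can be sketched roughly as below. This is only an illustration: the helper name `open_target_in_new_tab` is made up for this note, and it assumes `driver` is an already created Selenium WebDriver whose first tab is left as it is while the target page loads in the second tab.

```python
def open_target_in_new_tab(driver, url):
    """Open `url` in a new tab and return that tab's window handle."""
    # Remember the tabs that exist before opening a new one.
    before = set(driver.window_handles)
    # window.open() with about:blank creates an empty second tab.
    driver.execute_script("window.open('about:blank');")
    # Pick the handle that was not there before (handle order is not guaranteed).
    new_handle = next(h for h in driver.window_handles if h not in before)
    driver.switch_to.window(new_handle)
    # Load the target page in the new tab; the first tab is untouched.
    driver.get(url)
    return new_handle
```

To go back to the first tab afterwards, pass its handle to `driver.switch_to.window()` again.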

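When scraping by URL

If the text contains brackets, cut out only that part / use XPath when fetching only a specific number * the total count

Repeat over the URL with an F TYPE (presumably an f-string)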
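A small sketch of the two ideas above, assuming that "F TYPE" means an f-string and that "brackets" means round parentheses. The URL, the `page` query parameter, and both helper names are hypothetical and only there for illustration.

```python
import re


def page_urls(base_url, last_page):
    """Build one URL per page number with an f-string."""
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]


def text_in_brackets(text):
    """Return the part inside the first (...) pair, or None if there is none."""
    match = re.search(r"\(([^)]*)\)", text)
    return match.group(1) if match else None


urls = page_urls("https://example.com/list", 3)
print(urls[-1])                           # https://example.com/list?page=3
print(text_in_brackets("Reviews (128)"))  # 128
```

Each URL in `urls` can then be passed to `driver.get()` in a loop.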

Comparing two Naver login methods

1. Using send_keys()

driver.implicitly_wait(3)
driver.get('https://nid.naver.com/nidlogin.login')
# Enter the ID and password.
driver.find_element_by_name('id').send_keys('naver_id')
driver.find_element_by_name('pw').send_keys('mypassword1234')

# Simple, straightforward code
# send_keys(Keys.PAGE_DOWN) works through key events
# Limitation: .send_keys() can only be called on an element you have already located
# Main advantage: no JavaScript is needed

label.send_keys(Keys.PAGE_DOWN)

2. execute_script()

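- Uses the execute_script('JavaScript code') method.

- Because JavaScript code can be passed as the argument, you can control the browser screen.

- Besides scrolling, other dynamic events can be written this way.

Move the scroll position on screen: scrollTo(x, y), scrollTo(x, y + number)
Scroll to the very bottom of the page: scrollTo(0, document.body.scrollHeight)
Wait for the page to load after moving the screen: time.sleep(seconds)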

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
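
The loop above never ends on a page that keeps growing. A bounded variant might look like the sketch below; the function name `scroll_to_bottom` and the `max_rounds` limit are additions for this note, and `driver` is assumed to be an existing Selenium WebDriver.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=30):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazily loaded content time to arrive.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```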

Installation files

1. Check the Chrome version

Enter the following in the address bar:

chrome://version

2. Download the driver

https://chromedriver.chromium.org/downloads

ChromeDriver - WebDriver for Chrome - Downloads

Download the build for your OS; for Windows, use the 32-bit build:

chromedriver_win32.zip

3. Install the libraries in VS Code

! pip install selenium

! pip install bs4

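4. Unzip the archive and note where chromedriver is located

This path has to be specified later when creating the Selenium object.

(That is what allows Python to control the Chrome browser through chromedriver.)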
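
A rough sketch of passing that path to Selenium. The helper names `resolve_chromedriver` and `make_driver` are made up for this note, and the folder you pass in is whatever location you unzipped to. The `Service` and `Options` classes are the Selenium 4 way of supplying the driver path and the headless flag; on Selenium 3 the path was passed directly to `webdriver.Chrome(...)` instead.

```python
from pathlib import Path


def resolve_chromedriver(folder, name="chromedriver"):
    """Return the driver executable inside `folder`, or None if it is missing."""
    for candidate in (Path(folder) / name, Path(folder) / f"{name}.exe"):
        if candidate.is_file():
            return candidate
    return None


def make_driver(driver_path, headless=False):
    """Create a Chrome driver from an explicit chromedriver path (Selenium 4 style)."""
    # Imported lazily so resolve_chromedriver() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless")
    service = Service(executable_path=str(driver_path))
    return webdriver.Chrome(service=service, options=options)
```

Checking the path first gives a clearer error than letting the browser start-up fail.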

5. ~~To run headless (not run yet)~~

https://beomi.github.io/2017/02/27/HowToMakeWebCrawler-With-Selenium/

Building your own web crawler (3): an unbeatable crawler with Selenium - Beomi's Tech blog

Library documentation

1. shutil

https://docs.python.org/ko/3/library/shutil.html

shutil — High-level file operations — Python 3.10.0 documentation

2. The selenium library

- Includes several other modules

https://greeksharifa.github.io/references/2020/10/30/python-selenium-usage/

Python, Machine & Deep Learning

from selenium import webdriver # needed to launch the browser
from selenium.webdriver.common.keys import Keys # needed to send key events

import time # needed to pause between scrolls

webdriver lets you choose a browser such as Firefox, Chrome, or IE.
The Keys class provides the keyboard keys (e.g. return, f1, alt, cmd, shift ...)

Function	Description
webdriver.Chrome("c:/...")	Use the chrome driver installed at the given location
implicitly_wait(5)	Implicit wait: keep looking for an element for up to 5 seconds before a lookup fails
get('http://url.com')	Open the URL
page_source	Get the full HTML of the currently rendered page
find_element_by_name('...')	Access a single element on the page by name
find_element_by_id('HTML_id')	Access by id
find_element_by_xpath('xpath')	Access by XPath
find_element_by_css_selector('...')	Access by CSS selector
find_element_by_class_name('...')	Access by class name
find_element_by_tag_name('...')	Access by tag name
close	Close the browser window that was in use
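
The find_element_by_* helpers in the table belong to the Selenium 3 API; from Selenium 4.3 on they were removed in favor of `find_element(By.<STRATEGY>, value)`. The sketch below maps the old names onto the locator strings that the `By` constants stand for. `LEGACY_TO_STRATEGY` and `find_element_compat` are hypothetical names for this note, not part of Selenium.

```python
# Locator strategy strings, equal to By.NAME, By.ID, By.XPATH, and so on.
LEGACY_TO_STRATEGY = {
    "find_element_by_name": "name",
    "find_element_by_id": "id",
    "find_element_by_xpath": "xpath",
    "find_element_by_css_selector": "css selector",
    "find_element_by_class_name": "class name",
    "find_element_by_tag_name": "tag name",
}


def find_element_compat(driver, legacy_name, value):
    """Look up an element with the Selenium 4 call for a legacy helper name."""
    strategy = LEGACY_TO_STRATEGY[legacy_name]
    return driver.find_element(strategy, value)
```

For example, `find_element_compat(driver, "find_element_by_name", "id")` does the same as `driver.find_element(By.NAME, "id")`.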

How to use XPath

https://wkdtjsgur100.github.io/selenium-xpath/

(python) Crawling with XPath in selenium

3. BeautifulSoup4: HTML parser

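Extracts strings at the desired position/format from the given HTML.
It is mostly used together with Requests, but it can also be used with Selenium.

Crawling examples

1. How to crawl image files from the Billboard chart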

from bs4 import BeautifulSoup # after triggering events with Selenium, tidy up the elements with BeautifulSoup (Selenium alone would also work)
from selenium import webdriver # driver module that can open the browser
from selenium.webdriver.common.keys import Keys # Keys module that helps with key events

import csv, re, time # re is needed for re.compile() below

# Save to CSV
output = 'test_crawling_02.csv'
csv_open = open(output, 'w+', encoding='utf-8', newline='') # newline='' avoids blank rows on Windows
csv_writer = csv.writer(csv_open)
csv_writer.writerow(('index','title','artist', 'image_url'))


# Crawl URL, fetch the HTML
orig_url = 'https://www.billboard.com/charts/billboard-200' # site to crawl
driver = webdriver.Chrome('/Users/hwang/dev/chromedriver') # use the Chrome browser
driver.implicitly_wait(10) # a larger value seems to read the page better
driver.get(orig_url) # load the page at the given URL
body = driver.find_element_by_css_selector('body') # get body so that send_keys() can be used
for i in range(15): # somewhere between 11 and 20 presses reaches the very bottom
    body.send_keys(Keys.PAGE_DOWN) # press the Page Down key on each pass
    time.sleep(0.1) # wait for the page to load; unclear why a larger value reads worse


# Prepare to use BeautifulSoup
html = driver.page_source # get the HTML as a string
driver.close() # close the Chrome window

# Use BeautifulSoup
soup = BeautifulSoup(html,'html.parser')
top_200_list = soup.find_all( 'li', {'class' : re.compile('chart-list__element')} )

def check_image_url(li):
    # Guard against an entry whose image is empty
    style = li.find('span', {'class':'chart-element__image'})['style']
    if style != 'display: inline-block;':
        return style.split('"')[1]
    return None

for li in top_200_list[:100]:
    # * What is the difference between re.compile() and class_=''? ..
    index     = li.find('span', {'class':'chart-element__rank__number'}).text
    title     = li.find('span', {'class':'chart-element__information__song'}).text
    artist    = li.find('span', {'class':'chart-element__information__artist'}).text
    image_url = check_image_url(li)

    # Save the row to the CSV
    csv_writer.writerow((index, title, artist, image_url))

csv_open.close() # flush the rows to disk

2. How to press Enter

# First method

sample = browser.find_element_by_css_selector('a')
sample.send_keys('\n')

# Sends Enter to that link/command so that it is activated

# Second method

sample = browser.find_element_by_css_selector('a')
browser.execute_script("arguments[0].click();", sample)

# Runs a JavaScript command

3. Partial code

4. Regular expressions
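
As a quick illustration of the re.sub approach covered by the links in this section, here is a minimal sketch that strips tags from a parsed fragment. The helper name `strip_tags` and the sample HTML are made up for this note, and the pattern is a simple one that would not cope with every kind of markup.

```python
import re


def strip_tags(html_fragment):
    """Remove anything that looks like <...> and trim surrounding whitespace."""
    return re.sub(r"<[^>]+>", "", html_fragment).strip()


print(strip_tags('<td class="ProdName"> Blue <b>Mug</b> </td>'))  # Blue Mug
```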

https://ponyozzang.tistory.com/335

How to replace strings with Python regular expressions (re.sub), with examples

https://lovelydiary.tistory.com/17

Python crawler) Removing unneeded tags after parsing data (re.sub, tag removal)
