[python] Crawling Instagram (Instagram Crawling)

First, for how to crawl a page that does not rely on JavaScript, see:

https://10000sukk.tistory.com/3

[python] Crawling Musinsa (Crawling)


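Instagram is a JavaScript-rendered page, so it has to be crawled in a way that suits that.

My approach is admittedly crude: load more of the page, crawl it, load more, crawl again, and keep repeating.

New content is loaded with selenium by scrolling the page down.

After each scroll, if the page height (document.body.scrollHeight) has changed, the scroll loaded more content; if it has not, we have reached the end of the page.

How to scroll down: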

while True:
    time.sleep(SCROLL_PAUASE_TIME)
    # scroll down
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUASE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # height unchanged: try once more before giving up
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUASE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        else:
            last_height = new_height
            continue

This is how the page gets scrolled down!
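As a rough sketch, the same stop condition can be wrapped in a generator, so the caller can parse the page after every scroll that actually loaded something. The names scroll_steps, max_retries and FakeDriver are my own and not part of selenium; FakeDriver is a hypothetical stand-in that only imitates execute_script so the loop can be tried without a browser.

```python
import time


def scroll_steps(driver, pause=1.5, max_retries=2):
    """Yield the new page height after each scroll that loaded more content.

    Stops once the height stays the same for max_retries scrolls in a row.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    retries = 0
    while retries < max_retries:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            retries += 1  # nothing new yet: retry before giving up
        else:
            retries = 0
            last_height = new_height
            yield new_height  # caller can re-parse the page here


class FakeDriver:
    """Hypothetical stand-in for a selenium driver (illustration only)."""

    def __init__(self, heights):
        self.heights = heights  # page height after each successive scroll
        self.index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # each scroll moves to the next height until none are left
            self.index = min(self.index + 1, len(self.heights) - 1)
            return None
        return self.heights[self.index]


heights_seen = list(scroll_steps(FakeDriver([1000, 2000, 3000]), pause=0))
print(heights_seen)
```

With a real driver the loop would look like `for _ in scroll_steps(driver): ...`, with the parsing done inside the loop body.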

After that, the page is parsed again with BeautifulSoup and the new content is picked up.

Let's try it!

As in the figure above, we first grab the tags one row at a time.

list1 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})

Then, after scrolling down and parsing again, we get

list2 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})

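As in the figure above, list1 fetched earlier and list2 fetched now partly overlap. Removing this overlap seems to be the key!

About 11 to 13 rows overlap. So I simply checked list2 from index 10 onward against list1, and crawled a row only if it was not already in list1.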
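One possible alternative to the fixed index 10, sketched here and not what the code in this post does: remember every image URL that has already been handled in a set and skip anything seen before. This assumes the src of an image stays the same between two parses, which I have not verified for Instagram. The function name new_image_urls and the sample URLs are made up for illustration.

```python
def new_image_urls(image_urls, seen):
    """Return the URLs not seen before, in order, and record them in seen."""
    fresh = []
    for url in image_urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = set()
# hypothetical src values from two consecutive parses that partly overlap
first_pass = new_image_urls(["a.jpg", "b.jpg", "c.jpg"], seen)
second_pass = new_image_urls(["b.jpg", "c.jpg", "d.jpg"], seen)
print(first_pass)   # ['a.jpg', 'b.jpg', 'c.jpg']
print(second_pass)  # ['d.jpg']
```

A set lookup does not depend on how many rows overlap, so there is no need to guess the overlap size in advance.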

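The two points above are the core of crawling Instagram; adding just these two to the Musinsa crawler is enough.

To repeat:

1. Scroll down with selenium.

2. After scrolling, account for the overlap when you parse again -> measure by experiment how much overlaps and check only as much as needed, and the crawl gets faster.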

Additionally, from the image information

I only needed entries like these, so I used a regular expression:

alt = re.compile('.*사람 1명 이상.*사람들이 서 있음.*')

if alt.match(str(j.attrs['alt'])) != None:

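That is the form I used. First, pass the regular expression for the text you want to find to re.compile(...).

Then the match function checks whether the string in the parentheses matches the regular expression stored in the variable alt. If it matches, a match object is returned; if it does not, None is returned. In other words, the result is != None on a match and None otherwise. Note that match only matches from the start of the string, which is why the pattern begins with .*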
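Here is a small sketch of that check on its own, using the same pattern as above. The two alt strings are made-up samples, not real Instagram output. It also shows the difference between match, which starts at the beginning of the string, and search, which looks anywhere in it.

```python
import re

# same pattern as in the post: "사람 1명 이상" followed later by "사람들이 서 있음"
alt = re.compile('.*사람 1명 이상.*사람들이 서 있음.*')

# hypothetical alt strings, made up for illustration
samples = [
    '이미지: 사람 1명 이상, 사람들이 서 있음, 실외',
    '이미지: 음식',
]

# keep only the strings for which match() returns a match object
matched = [s for s in samples if alt.match(s) is not None]
print(matched)

# match() anchors at the start of the string; search() scans the whole string
start_only = re.compile('사람 1명 이상')
match_is_none = start_only.match(samples[0]) is None
search_found = start_only.search(samples[0]) is not None
print(match_is_none, search_found)
```

With search the leading and trailing .* would not be needed, so the pattern could be shortened to '사람 1명 이상.*사람들이 서 있음'.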

If there is a better way, please let me know! This is still very rough!!!

Code.

from urllib.parse import quote_plus  # percent-encodes the hashtag for use in a URL

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import re
import urllib.request


def notMatch(items, a):
    # return 1 if a is not in items, 0 if it is
    for i in items:
        if a == i:
            return 0
    return 1


baseUrl1 = 'https://www.instagram.com/explore/tags/'
baseUrl2 = '/?hl=ko'
plusUrl = input('Enter the hashtag to crawl: ')
url = baseUrl1 + quote_plus(plusUrl) + baseUrl2

driver = webdriver.Chrome()
driver.get(url)


time.sleep(3)  # wait 3 seconds after loading before starting the analysis
reallist = []
list1 = []  # old list
list2 = []  # new list
# find_element_by_xpath was removed in Selenium 4; find_element(By.XPATH, ...) replaces it
totalCount = driver.find_element(By.XPATH, '//*[@id="react-root"]/section/main/header/div[2]/div[1]/div[2]/span/span').text
print("Total posts: " + totalCount)
fileNumber = 1

# Korean alt text: "사람 1명 이상" (one or more people) followed by "사람들이 서 있음" (people standing)
alt = re.compile('.*사람 1명 이상.*사람들이 서 있음.*')
pageString = driver.page_source
soup = BeautifulSoup(pageString, features='html.parser')
list1 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})
for i in list1:
    temp = i.find_all(name='img')
    for j in temp:
        try:
            if alt.match(str(j.attrs['alt'])) is not None:
                urllib.request.urlretrieve(j.attrs['src'], "C:\\Users\\PycharmProjects\\untitled7\\img\\인스타\\" + str(plusUrl) + "(" + str(fileNumber) + ")" + ".jpg")
                fileNumber += 1
        except Exception:
            # skip images with a missing alt/src or a failed download
            continue



SCROLL_PAUASE_TIME = 1.5
n = 1



while True:
    time.sleep(SCROLL_PAUASE_TIME)
    # scroll down
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUASE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # height unchanged: try once more before giving up
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUASE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        else:
            last_height = new_height
            continue

    time.sleep(SCROLL_PAUASE_TIME)
    # parse again
    pageString = driver.page_source
    soup = BeautifulSoup(pageString, features='html.parser')
    list2 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})

    # skip the first 10 rows, which are assumed to overlap with list1
    for i in range(10, len(list2)):
        try:
            if notMatch(list1, list2[i]) == 1:
                temp = list2[i].find_all('img')
                for j in temp:
                    try:
                        if alt.match(str(j.attrs['alt'])) is not None:
                            urllib.request.urlretrieve(j.attrs['src'], "C:\\Users\\PycharmProjects\\untitled7\\img\\인스타\\" + str(plusUrl) + "(" + str(fileNumber) + ")" + ".jpg")
                            fileNumber += 1
                    except Exception:
                        # skip images with a missing alt/src or a failed download
                        continue
        except Exception:
            continue

    list1 = list2