๊ธฐํƒ€(๊ฐœ๋ฐœ)/ํฌ๋กค๋ง(Crawling)

[python] ์ธ์Šคํƒ€๊ทธ๋žจ ํฌ๋กค๋ง ํ•˜๊ธฐ(instagram Crawling)

 

 

๋จผ์ € ์ž๋ฐ”์Šคํฌ๋ฆฝํŠธ๊ฐ€ ์•„๋‹๋•Œ ํฌ๋กค๋ง ํ•˜๋Š”๋ฒ•

https://10000sukk.tistory.com/3 

 

[python]๋ฌด์‹ ์‚ฌ ํฌ๋กค๋ง ํ•˜๊ธฐ Crawling

๋จผ์ € url์„ ๋ฐ›๋Š”๋‹ค baseUrl = 'https://store.musinsa.com/app/product/search?search_type=1&q=' baseUrl1 = '&page=' plusUrl = input('๊ฒ€์ƒ‰ํ•  ์˜ท์„ ์ž…๋ ฅํ•˜์‹œ์˜ค: ') pageNum =1 url = baseUrl + quote_plus(plus..

10000sukk.tistory.com

์ธ์Šคํƒ€๋Š” ์ž๋ฐ”์Šคํฌ๋ฆฝํŠธ ํŽ˜์ด์ง€, ์ฆ‰, ๊ทธ์—๋งž๋Š” ๋ฐฉ์‹์œผ๋กœ ํฌ๋กค๋ง ์š”๊ตฌ๋ฉ๋‹ˆ๋‹ค.

 

 

 ์ €๋Š” ๋ฏธํกํ•˜์ง€๋งŒ ํŽ˜์ด์ง€๋ฅผ ํฌ๋กค๋ง ํ•˜๊ธฐ์œ„ํ•ด ์ƒˆ๋กœ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ํฌ๋กค๋ง ํ•˜๊ณ  -> ์ƒˆ๋กœ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ํฌ๋กค๋งํ•˜๊ณ  -> ์ƒˆ๋กœ ๋ถˆ๋Ÿฌ....-> (๋ฐ˜๋ณต)

ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

 

 

์ƒˆ๋กœ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๋ฐฉ๋ฒ•์€ selenium์„ ์ด์šฉํ•˜๋Š”๋ฐ

์Šคํฌ๋กค์„ ๋‚ด๋ฆฌ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํŽ˜์ด์ง€์˜ ์Šคํฌ๋กค์„ ๋‚ด๋ ค์„œ ๋งŒ์•ฝ ์Šคํฌ๋กค ์œ„์น˜๊ฐ€ ๋ณ€ํ–ˆ์œผ๋ฉด ๋‚ด๋ ค์ง„๊ฑฐ๊ณ  ์•ˆ๋‚ด๋ ค ๊ฐ“์œผ๋ฉด ํŽ˜์ด์ง€์˜ ๋์ด๊ฒ ์ฃ ?

 

์Šคํฌ๋กค ๋‚ด๋ฆฌ๋Š” ๋ฐฉ๋ฒ•:

 

while True:
    time.sleep(SCROLL_PAUSE_TIME)
    # scroll down one page-height
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # the height didn't grow; retry once in case content was still loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        else:
            last_height = new_height
            continue

 

์œผ๋กœ ์Šคํฌ๋กค์„ ๋‚ด๋ ค์ค๋‹ˆ๋‹ค!

 

๊ทธ๋‹ค์Œ์— ํŽ˜์ด์ง€๋ฅผ BeautifulSoup์„ ์ด์šฉํ•ด์„œ ์ƒˆ๋กœ ํŒŒ์‹ฑํ•˜๊ณ  ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค...

 

ํ•œ๋ฒˆ ํ•ด๋ด…์‹œ๋‹ค..!

 

 

์œ„์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๋จผ์ € ํ•œ ํ–‰ ๋‹จ์œ„๋กœ tag๋ฅผ ๊ฐ€์ ธ์˜ฌ๊ฒƒ์ด๋‹ค....

list1 = soup.find_all(name='div', attrs={'class''Nnq7C weEfm'})

 

๊ทธ๋ฆฌ๊ณ  ์Šคํฌ๋กค์„ ๋‚ด๋ ค์„œ ๋‹ค์‹œ ํŒŒ์‹ฑํ•˜๊ณ  ๊ฐ€์ ธ์˜ค๋ฉด

 

 

list2 = soup.find_all(name='div', attrs={'class''Nnq7C weEfm'})

 

์œ„์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ์•„๊นŒ ๊ฐ€์ ธ์˜จ  list1๊ณผ ์ง€๊ธˆ ๊ฐ€์ ธ์˜จ list2๊ฐ€ ์ผ๋ถ€๋ถ„ ๊ฒน์น˜๊ฒŒ ๋œ๋‹ค... ์ด ๊ฒน์น˜๋Š” ๋ถ€๋ถ„์„ ์ œ๊ฑฐํ•ด ์ฃผ๋Š”๊ฒŒ ํ•ต์‹ฌ์ธ๊ฒƒ๊ฐ™๋‹ค!

 

11~13 ๊ฐœ ์ •๋„๊ฐ€ ๊ฒน์นœ๋‹ค. ๊ทธ๋ž˜์„œ ๋‚˜๋Š” ๊ทธ๋ƒฅ list2์˜ index 10๋ฒˆ ๋ถ€ํ„ฐ๋งŒ list1์— ํฌํ•จ ๋˜์—ˆ๋Š”์ง€ ๊ฒ€์‚ฌ ํ•˜๋ฉด์„œ ๋งŒ์•ฝ ํฌํ•จ ์•ˆ๋ฌ์œผ๋ฉด ํฌ๋กค๋ง์„ ํ•ด์ฃผ์—ˆ๋‹ค.

 

์•ž์„œ ๋งํ•œ ๋‘๊ฐ€์ง€ ์ ์ด ์ธ์Šคํƒ€ ํฌ๋กค๋ง์˜ ํ•ต์‹ฌ ๋ถ€๋ถ„์ด๋ฉฐ ๋ฌด์‹ ์‚ฌ ํฌ๋กค๋ง์—์„œ ์ด ๋‘๊ฐ€์ง€ ์ ๋งŒ ์ถ”๊ฐ€ ํ•˜๋ฉด ๋˜๋Š”๊ฒƒ์ด๋‹ค...

๋‹ค์‹œ ๋งํ•˜๊ฒ ๋‹ค.

 

1.  selenium์œผ๋กœ ์Šคํฌ๋กค์„ ๋‚ด๋ ค์ค€๋‹ค

 

2. ์Šคํฌ๋กค์„ ๋‚ด๋ฆฐํ›„์— ๋‹ค์‹œ ํŒŒ์‹ฑํ–ˆ์„๋•Œ ๊ฒน์น˜๋Š” ๋ถ€๋ถ„์„ ๊ณ ๋ ค ํ•ด์ค˜๋ผ -> ์–ผ๋งˆ๋งŒํผ ๊ฒน์น˜๋Š”์ง€ ์ง์ ‘ ์‹คํ—˜ํ•ด๋ณด๊ณ  ์œ ๋„๋ฆฌ ์žˆ๊ฒŒ ๊ฒ€์‚ฌํ•˜๋ฉด ํฌ๋กค๋ง ์†๋„ ํ–ฅ์ƒ ๊ฐ€๋Šฅํ•จ...

 

 

์ถ”๊ฐ€๋กœ ์ €๋Š” ์ด๋ฏธ์ง€ ์ •๋ณด์ค‘์—์„œ 

 

์ด๋Ÿฐ ๊ฒƒ๋“ค๋งŒ ํ•„์š”ํ–ˆ๊ธฐ์—... ์ •๊ทœ ํ‘œํ˜„์‹์„ ์ด์šฉํ•ด์„œ....

 

alt = re.compile('.*์‚ฌ๋žŒ 1๋ช… ์ด์ƒ.*์‚ฌ๋žŒ๋“ค์ด ์„œ ์žˆ์Œ.*')

 

if alt.match(str(j.attrs['alt'])) is not None:

 

์ด๋Ÿฐ ํ˜•์‹์œผ๋กœ ํ–ˆ์Šต๋‹ˆ๋‹ค... ๋จผ์ € re.compile(์ฐพ๊ณ ์‹ถ์€ ๋ฌธ์ž์˜ ์ •๊ทœํ‘œํ˜„์‹) ์œผ๋กœ ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ๋ฌธ์ž ํ˜•์‹์˜ ์ •๊ทœํ‘œํ˜„์‹์„ ๋„ฃ๊ณ . 

๊ทธ ๋‹ค์Œ์— match๋ผ๋Š” ํ•จ์ˆ˜๋กœ ๋‚ด๊ฐ€ ๋ฐฉ๊ธˆ ์ •๊ทœํ‘œํ˜„์‹์„ ์ €์žฅํ•œ ๋ณ€์ˆ˜ alt์™€ ๊ด„ํ˜ธ์† ๋ฌธ์ž๊ฐ€ ์ผ์น˜๋˜๋Š”์ง€ ๊ฒ€์‚ฌํ•ด์„œ ์ผ์น˜๋˜๋ฉด ~~~~ ๋ฅผ ๋ฐ˜ํ™˜ํ–ˆ์Šต๋‹ˆ๋‹ค... ์ฆ‰, ์ผ์น˜๋˜๋ฉด != None์ด๊ณ  ์ผ์น˜๋˜์ง€ ์•Š์œผ๋ฉด None์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๋” ๋‚˜์€ ๋ฐฉ๋ฒ• ์žˆ์œผ๋ฉด ์•Œ๋ ค์ฃผ์„ธ์š”! ๋งŽ์ด ๋ฏธํกํ•ฉ๋‹ˆ๋‹ค!!!

 

์ฝ”๋“œ.

 

from urllib.parse import quote_plus  # URL-encodes the hashtag (e.g. Korean -> percent-encoding)

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import re
import urllib.request


def notMatch(rows, a):
    # 1 if `a` does not appear in `rows` (not crawled yet), 0 if it does
    return 0 if a in rows else 1


baseUrl1 = 'https://www.instagram.com/explore/tags/'
baseUrl2 = '/?hl=ko'
plusUrl = input('Enter the hashtag to crawl: ')
url = baseUrl1 + quote_plus(plusUrl) + baseUrl2

driver = webdriver.Chrome()
driver.get(url)


time.sleep(3)  # wait 3 seconds after the page loads before starting to parse
list1 = []  # rows from the previous parse
list2 = []  # rows from the newest parse
# total post count shown in the hashtag page header
totalCount = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/header/div[2]/div[1]/div[2]/span/span').text
print("Total posts: " + totalCount)
fileNumber = 1

alt = re.compile('.*์‚ฌ๋žŒ 1๋ช… ์ด์ƒ.*์‚ฌ๋žŒ๋“ค์ด ์„œ ์žˆ์Œ.*')
pageString = driver.page_source
soup = BeautifulSoup(pageString, features='html.parser')
list1 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})
for i in list1:
    temp = i.find_all(name='img')
    for j in temp:
        try:
            # save the image only when its alt text matches the pattern
            if alt.match(str(j.attrs['alt'])) is not None:
                urllib.request.urlretrieve(j.attrs['src'], "C:\\Users\\PycharmProjects\\untitled7\\img\\인스타\\" + str(plusUrl) + "(" + str(fileNumber) + ")" + ".jpg")
                fileNumber += 1
        except:
            continue



SCROLL_PAUSE_TIME = 1.5



while True:
    time.sleep(SCROLL_PAUSE_TIME)
    # scroll down one page-height
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # height didn't grow; retry once in case content was still loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        else:
            last_height = new_height
            continue

    time.sleep(SCROLL_PAUSE_TIME)
    # parse the page again after the scroll
    pageString = driver.page_source
    soup = BeautifulSoup(pageString, features='html.parser')
    list2 = soup.find_all(name='div', attrs={'class': 'Nnq7C weEfm'})

    # rows before index 10 are almost always repeats of the previous parse
    for i in range(10, len(list2)):
        try:
            if notMatch(list1, list2[i]) == 1:
                temp = list2[i].find_all('img')
                for j in temp:
                    try:
                        if alt.match(str(j.attrs['alt'])) is not None:
                            urllib.request.urlretrieve(j.attrs['src'], "C:\\Users\\PycharmProjects\\untitled7\\img\\인스타\\" + str(plusUrl) + "(" + str(fileNumber) + ")" + ".jpg")
                            fileNumber += 1
                    except:
                        continue
        except:
            continue

    list1 = list2