๊ธฐํƒ€(๊ฐœ๋ฐœ)/ํฌ๋กค๋ง(Crawling)

[python]๋ฌด์‹ ์‚ฌ ํฌ๋กค๋ง ํ•˜๊ธฐ Crawling

 

๋จผ์ € url์„ ๋ฐ›๋Š”๋‹ค

 

baseUrl = 'https://store.musinsa.com/app/product/search?search_type=1&q='

baseUrl1 = '&page='

plusUrl = input('๊ฒ€์ƒ‰ํ•  ์˜ท์„ ์ž…๋ ฅํ•˜์‹œ์˜ค: ')

pageNum =1

url = baseUrl + quote_plus(plusUrl) + baseUrl1 + str(pageNum)

 

quote_plus๋Š” ํŠน์ˆ˜๋ฌธ์ž๋‚˜ ๋‹ค๋ฅธ ํ˜•์‹์˜ ๋ฌธ์ž๋ฅผ ์•„์Šคํ‚ค ์ฝ”๋“œ๋กœ ๋ณ€ํ™˜ํ•ด์ฃผ๊ณ  ๊ณต๋ฐฑ์„ '+'๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ์ฐธ๊ณ ๋กœ, quote()๋Š” ๊ณต๋ฐฑ์„ '%20'์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

์ด๋ฅผ ์ด์ œ from selenium import webdriver ์„ ์ด์šฉํ•ด์„œ webdriver.Chrome()์œผ๋กœ ์—ด์ˆ˜๊ฐ€ ์žˆ๋Š”๊ฒƒ์ด๋‹ค. ์ด๊ฒŒ ๋ฌด์Šจ ๋ง์ด๋ƒ๋ฉด selenium์„ ์ด์šฉํ•ด์„œ ํŽ˜์ด์ง€๋ฅผ ์—ฌ๋Š” ๊ฒƒ์ธ๋ฐ....์ „์ œ ์กฐ๊ฑด์ด Chrome ๋“œ๋ผ์ด๋ฒ„๊ฐ€ ์„ค์น˜๊ฐ€ ๋˜์—ˆ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๊ทธ๋ž˜์„œ Chrome() ์— ๊ด„ํ˜ธ ์•ˆ์— ๋“œ๋ผ์ด๋ฒ„๊ฐ€ ์„ค์น˜ ๋˜์–ด ์žˆ๋Š” ์ฃผ์†Œ๋ฅผ ๋„ฃ์–ด์ค€๋‹ค.

 ํ•„์ž๋Š” ๋“œ๋ผ์ด๋ฒ„์˜ ์ €์žฅ ์œ„์น˜๊ฐ€ ํ”„๋กœ์ ํŠธ ํด๋”์— ๋ฐ”๋กœ ์œ„์น˜ํ•˜๊ธฐ์— ๋„ฃ์–ด์ค„ ํ•„์š”๊ฐ€ ์—†์—ˆ๋‹ค.

 

    driver.get(url)  # load the page at url

    time.sleep(3)  # wait a few seconds after loading before starting the analysis

time์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” import time์ด ํ•„์š”ํ•˜๋‹ค.

 

๋ถ„์„์„ ํ• ๋•Œ์—๋Š” BeautifulSoup์„ ์ด์šฉํ•œ๋‹ค.

from bs4 import BeautifulSoup์„ ์œ„์— ์„ ์–ธํ•ด์ฃผ๊ณ 

 

    pageString = driver.page_source

    soup = BeautifulSoup(pageString, features="html.parser")

๋กœ ํŽ˜์ด์ง€๋ฅผ ํŒŒ์‹ฑํ•ด์ค€๋‹ค. ์ด๋Š” BeautifulSoup์—์„œ ์ œ๊ณตํ•ด์ฃผ๋Š” ๊ธฐ๋Šฅ์ด๋‹ค.

 

 

๊ทธ๋ฆฌ๊ณ  ์—ฌ๊ธฐ๊นŒ์ง€๋Š” ๋ชจ๋“  ํŽ˜์ด์ง€๋ฅผ ํŒŒ์‹ฑํ• ๋•Œ ๋™์ผ ํ•  ๊ฒƒ์ด๋‹ค. ์—ฌ๊ธฐ์„œ ๋ถ€ํ„ฐ ์ง์ ‘ ํŽ˜์ด์ง€๋ฅผ ๋ณด๋ฉด์„œ ๋ถ„์„์„ ํ•ด์•ผํ•˜๋Š”๋ฐ...

 

์œ„์˜ ๋งจํˆฌ๋งจ ์ด๋ฏธ์ง€๋“ค์„ ํฌ๋กค๋ง ํ•˜๊ณ  ์‹ถ๋‹ค. 

๊ทธ๋Ÿฌ๋ฉด ์ € ์ด๋ฏธ์ง€๊ฐ€ ์–ด๋Š tag์— ์žˆ๋Š”์ง€๋ฅผ ์ฐพ๊ณ  ๋˜ ๊ฑฐ๊ธฐ tag ์•ˆ์—์„œ ๊ฐ๊ฐ ์ด๋ฏธ์ง€ ์ •๋ณด์— ํ•ด๋‹นํ•˜๋Š” ํ•˜์œ„ tag์†์˜ attribute ๊ฐ’์„ ์ฐพ๊ธฐ๋งŒ ํ•˜๋ฉด ๋œ๋‹ค... ๋ง์€ ์‰ฝ๋‹ค..

 

1. ๋จผ์ € ์ € ํŒŒ๋ž€ ํ…Œ๋‘๋ฆฌ๋ฅผ ์ฐพ๋Š”๋‹ค

   :์ € ํŒŒ๋ž€ ํ…Œ๋‘๋ฆฌ๋Š” ํ•˜๋‚˜๋งŒ ๊ฐ€์ ธ์˜ค๋ฉด ๋˜๋‹ˆ๊นŒ     

result1 = soup.find(name = 'ul', attrs ={'class':'snap-article-list boxed-article-list article

                                                                    list center list goods_small_media8'})

๋กœ ๊ฐ€์ ธ์˜จ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด result1์—๋Š” ์ € ํ…Œ๋‘๋ฆฌ์— ํ•ด๋‹นํ•˜๋Š” HTML ์ฝ”๋“œ๊ฐ€ ์ €์žฅ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

2. ํŒŒ๋ž€ ํ…Œ๋‘๋ฆฌ์† ์ดˆ๋ก ์ด๋ฏธ์ง€ ๋“ค์„ ๊ฐ€์ ธ์˜ค์ž!

    :result2 = result1.find_all(name ="img")์„ ์‚ฌ์šฉํ•˜์—ฌ์„œ ๊ฐ๊ฐ์˜ img ํƒœ๊ทธ๋ฅผ ์ €์žฅํ•˜๋Š”๋ฐ ์ด๋•Œ result2๊ฐ€ ๋ฐฐ์—ด ํ˜•์‹์œผ๋กœ ์ €์žฅ๋˜๊ฒŒ ๋œ๋‹ค.
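A minimal illustration, with invented markup, that find_all returns a list-like ResultSet you can index and count:

```python
from bs4 import BeautifulSoup

# invented product-list markup
html = '<ul><li><img src="a.jpg"></li><li><img src="b.jpg"></li></ul>'
result1 = BeautifulSoup(html, 'html.parser').find('ul')
result2 = result1.find_all(name='img')  # ResultSet, behaves like a list
print(len(result2))       # 2
print(result2[0]['src'])  # a.jpg
```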

 

3. ๋‚ด๊ฐ€ ๊ฐ€์ ธ์˜ค๊ณ  ์‹ถ์€ attribute ๊ฐ’์„ ์ฐพ์ž

    : ์ฒ˜์Œ์— ์ด๋ฏธ์ง€ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ค๊ธฐ ์œ„ํ•ด์„œ src์†์„ฑ์„ ๊ฐ€์ ธ์™“๋Š”๋ฐ ๊ทธ๋Ÿฌ๋‹ˆ๊นŒ ๋งจํˆฌ๋งจ ์ด๋ฏธ์ง€ ๋ง๊ณ ๋„ ์ €๊ธฐ ๋ฐ‘์— ํ•˜ํŠธํ‘œ์‹œ ์ด๋ฏธ์ง€ ๋˜ํ•œ ๊ฐ€์ ธ์™€ ๋ฒ„๋ ธ๋‹ค... ๊ทธ๋ž˜์„œ ์˜ท ์ด๋ฏธ์ง€ ์ •๋ณด๋งŒ์„ ํ‘œํ˜„ํ•˜๋Š” data-original ์†์„ฑ์„ ๊ฐ€์ ธ์˜ค๋Š”๊ฒŒ ๋” ๋‚ซ๋‹ค๊ณ  ํŒ๋‹จํ•˜์—ฌ์„œ ์•„๋ž˜ ์ฝ”๋“œ ๋Œ€๋กœ ์ˆ˜ํ–‰ํ•ด ์คฌ๋‹ค...(์ด ์†์„ฑ ์ •๋ณด์™€ ํƒœ๊ทธ ์ด๋ฆ„ ๊ฐ™์€ ๊ฒƒ์€ ํŽ˜์ด์ง€ ๋งˆ๋‹ค ๋‹ค๋ฅด๋‹ˆ ๊ฐ์ž ํ•˜๊ธฐ์ „์— ๋ถ„์„ํ•ด๋ณด์„ธ์š”)

 

    for i in result2:

        try:

            image = i.attrs['data-original']

            reallink.append(image)

        except:

            continue

 

์„ ์ด์šฉํ•ด์„œ ์ด๋ฏธ์ง€ url์„ ๊ฐ๊ฐ reallink์— ๋„ฃ๋Š”๋‹ค. 

 

 

์ด์ œ ๊ฐ€์ ธ์˜จ url์„ ๋‚ด pc์— ์ €์žฅํ•˜๊ธฐ ์œ„ํ•ด์„œ

 

urllib.request.urlretrieve(์ด๋ฏธ์ง€ url , ์ €์žฅํ•˜๊ณ ์‹ถ์€ pc ๊ฒฝ๋กœ) 

 

 

 

 

์ฝ”๋“œ

 

import urllib.request

from urllib.parse import quote_plus  # percent-encodes the search term for the URL
from bs4 import BeautifulSoup
from selenium import webdriver
import time


#ํ•ด๋‹น ํŽ˜์ด์ง€๋ฅผ ํฌ๋กค๋ง i๊ฐ€ ํŽ˜์ด์ง€๋ฒˆํ˜ธ
def musinsaCrawling(pageNum):
    baseUrl = 'https://store.musinsa.com/app/product/search?search_type=1&q='
    baseUrl1 = '&page='
    url = baseUrl + quote_plus(plusUrl) + baseUrl1 + str(pageNum)
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)  # wait 3 seconds after loading before starting the analysis

    pageString = driver.page_source
    soup = BeautifulSoup(pageString, features="html.parser")

    result1 = soup.find(name = 'ul', attrs ={'class':'snap-article-list boxed-article-list article-list center list goods_small_media8'})
    result2 = result1.find_all(name = "img")

    for i in result2:
        try:
            image = i.attrs['data-original']
            reallink.append(image)
        except:
            continue

    print(reallink)
    driver.close()


plusUrl = input('Enter the item to search for: ')
reallink = []


baseUrl = 'https://store.musinsa.com/app/product/search?search_type=1&q='
baseUrl1 = '&page=1'
url = baseUrl + quote_plus(plusUrl) + baseUrl1
driver = webdriver.Chrome()
driver.get(url)

time.sleep(3)  # wait 3 seconds after loading before starting the analysis


pageString = driver.page_source
soup = BeautifulSoup(pageString, features="html.parser")
pageNum = int((soup.find("span",{"class" : "totalPagingNum"})).text)
print(pageNum)
driver.close()



for i in range(1,pageNum+1):
    musinsaCrawling(i)


print(reallink)


if plusUrl == '스몰로고맨투맨':
    title = 'small logo sweatshirt'
elif plusUrl == '빅로고맨투맨':
    title = 'big logo sweatshirt'
elif plusUrl == '스트라이프맨투맨':
    title = 'stripe sweatshirt'
else:
    title = plusUrl  # fall back to the search term so title is always defined


print('title = ' + title)
n = 1
for i in range(0,len(reallink)):
    urllib.request.urlretrieve( "http:"+reallink[i],"C:\\Users\\PycharmProjects\\untitled7\\stripes sweat shirt\\" +title + "("+str(n)+")"+".jpg")
    n +=1