[Python Libraries] Scraping Alba Cheonguk with requests and Beautiful Soup


Grafana 2022. 7. 18. 17:35

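Have you ever opened a part-time job or recruiting site and found the whole thing tedious?

Take a look at Albamon's Super Brand job listings. There are a lot of them.

if you == developer:

Scrape the web with Python and pull out only the data you actually want.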

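Web scraping

It means scraping (collecting) information that lives on the web.

Every Python beginner ends up learning it, and it is also a very powerful capability.

Perhaps this is what Python is really for.

How could you not learn it?

There are two libraries we need to know.

They are requests and Beautiful Soup.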

1. requests

Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.

GitHub - urllib3/urllib3: Python HTTP library with thread-safe connection pooling, file post support, user friendly, and more.


Syntax you will use often if you work with Python.

import requests

# 'URL' is a placeholder; auth takes a (user, password) tuple for HTTP Basic Auth
r = requests.get('URL', auth=('user', 'pass'))
r.status_code
# 200
r.headers['content-type']
# 'application/json; charset=utf8'
r.encoding
# 'utf-8'
r.text
# '{"type":"user"....'
r.json()
# {'disk_usage': ....}
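The quoted description above says there is no need to add query strings to a URL by hand. Here is a minimal sketch of that idea. It builds a request without sending it, so it runs offline. The host `example.com` and the parameter names `keyword` and `page` are assumptions made up for illustration, not anything from a real job site.

```python
import requests

# Build (but do not send) a GET request so the final URL can be inspected.
# The host and the parameter names are illustrative assumptions.
req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"keyword": "cafe", "page": 2},
)
prepared = req.prepare()

# requests encodes the params dict into the query string for us
print(prepared.url)
```

Passing the same `params` dict to `requests.get` would send the request to that URL.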

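requests is the basic package for handling requests to the web. ~~(HTTP for Humans)~~

This is the one thing you need to know: it can fetch a page's contents.

In other words, you can pull that page's content and data over to your side and reshape it however you like.

Once you have fetched the page through its URL, you can extract data from it.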
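Fetching a page by URL might look something like the sketch below. The helper name `fetch_html` and the 10-second timeout are my own assumptions, not part of requests. The calls it relies on (`requests.get`, `raise_for_status()`, `.text`) are standard requests API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper for illustration)."""
    # timeout keeps the call from hanging forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out):
# html = fetch_html("http://www.alba.co.kr/")
```

Checking the status before reading `.text` means an error page is less likely to be mistaken for real data.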

2. Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/


from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>

Once the document is loaded like this, you can use the features below.

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
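The snippets above use `html_doc` without ever defining it, so they cannot be run as they stand. Below is a self-contained sketch with a shortened stand-in document. The HTML string is a trimmed-down assumption based on the output shown above, not the full document from the Beautiful Soup docs.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for html_doc (assumed, reconstructed from the output above)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the href attribute of every <a> tag
links = [a.get("href") for a in soup.find_all("a")]
# Collect the visible text of every <a> tag
names = [a.get_text() for a in soup.find_all("a")]

print(soup.title.string)
print(links)
print(names)
```

`find_all` returns a list of tags, so a list comprehension is a handy way to turn it into plain Python data.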

Getting it running is simple.

import requests
from bs4 import BeautifulSoup

indeed_result = requests.get("http://www.alba.co.kr/")

# print(indeed_result.text)  # if the page text is printed, it worked

indeed_soup = BeautifulSoup(indeed_result.text, "html.parser")

# print(indeed_soup)  # if the parsed document is printed, it worked
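From here, the next step would be picking out only the fields you care about. The sketch below shows one way that might look. Note that the markup is entirely hypothetical: the `job` and `pay` class names and the sample rows are invented for illustration. The real page structure has to be checked in the browser's developer tools first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag names, class names, and rows are invented for illustration
sample_html = """
<ul class="jobs">
  <li class="job"><a href="/job/1">Cafe staff</a><span class="pay">10,000</span></li>
  <li class="job"><a href="/job/2">Bookstore clerk</a><span class="pay">9,860</span></li>
</ul>
"""


def parse_jobs(html):
    """Turn each (hypothetical) <li class="job"> into a plain dict."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for item in soup.find_all("li", class_="job"):
        link = item.find("a")
        pay = item.find("span", class_="pay")
        jobs.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "pay": pay.get_text(strip=True),
        })
    return jobs


jobs = parse_jobs(sample_html)
for job in jobs:
    print(job)
```

Keeping the parsing in its own function makes it easy to swap the sample string for `indeed_result.text` once the real selectors are known.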
