In this tutorial, we will show how to scrape an HTML page and extract useful text from it using Python and Beautiful Soup.
1. Import libraries
import requests
from bs4 import BeautifulSoup
2. Scrape a URL using Python
We can use the Python requests package to fetch an HTML page and get its text content.
# Create a variable with the URL
url = 'https://www.cocyer.com'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
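The request above assumes the server answers quickly and successfully. A slightly more defensive version might look like the sketch below; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original steps.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices for this sketch.
    """
    # A timeout keeps the call from hanging forever on an unresponsive server
    r = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    return r.text
```

You could then write `html_content = fetch_html('https://www.cocyer.com')` in place of the three lines above.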
3. Use Beautiful Soup to parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")
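To see what the parser does without making a network request, you can feed it a small HTML string. The `sample_html` document and the `sample_soup` name below are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for this illustration
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# "html.parser" is the parser from Python's standard library, so no extra install is needed
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The first matching tag can be reached as an attribute of the soup object
print(sample_soup.title.string)  # text inside <title>
print(sample_soup.h1.name)       # tag name of the first <h1>
print(sample_soup.p.get_text())  # text of the first <p>
```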
4. Extract some information from soup
HTML title
# View the title tag of the soup object
soup.title
You will get:
<title>Cocyer.com</title>
If you only want the text inside <title>, you can use this code:
soup.title.string
You will get: Cocyer.com
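Note that soup.title is None when a page has no <title> tag, so soup.title.string would raise an AttributeError on such a page. A minimal sketch of a safer lookup follows; the helper name `get_title` and its `default` parameter are assumptions made for this example.

```python
from bs4 import BeautifulSoup


def get_title(soup, default=""):
    """Return the stripped <title> text, or `default` if it is missing or empty."""
    # soup.title is None when the document has no <title> tag
    if soup.title is None:
        return default
    # .string is None for an empty tag or one with several children,
    # so get_text() is used to collect whatever text is there
    text = soup.title.get_text(strip=True)
    return text or default


print(get_title(BeautifulSoup("<title>Cocyer.com</title>", "html.parser")))
print(get_title(BeautifulSoup("<p>No title here</p>", "html.parser"), default="(untitled)"))
```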
Get all paragraphs and their text
px = soup.find_all('p')
for p in px:
    print(p.text)
You can also get all h1, h2, or other elements with soup.find_all().
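As a short sketch of that idea, find_all() accepts a list of tag names and an optional limit. The markup and the `demo_soup` name below are invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up markup used only for this illustration
demo_html = """
<h1>Main title</h1>
<p>Intro text.</p>
<h2>Section A</h2>
<h2>Section B</h2>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Passing a list matches any of the listed tag names, in document order
headings = [(h.name, h.get_text(strip=True)) for h in demo_soup.find_all(["h1", "h2"])]
for name, text in headings:
    print(name, text)

# limit stops the search after the given number of matches
first_h2 = demo_soup.find_all("h2", limit=1)
```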
Extract all links in this HTML page
You can refer to this tutorial:
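For a quick start on this page, here is a minimal sketch of link extraction with find_all(). The sample markup and the `base_url`, `links_html`, and `links_soup` names are assumptions for illustration; relative links are resolved with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for this illustration
base_url = "https://www.cocyer.com"
links_html = """
<a href="/about">About</a>
<a href="https://example.com/page">External</a>
<a name="anchor-without-href">No link</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin turns relative paths into absolute URLs and leaves absolute ones unchanged
links = [urljoin(base_url, a["href"]) for a in links_soup.find_all("a", href=True)]
print(links)
```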