A Step-by-Step Guide to Web Scraping for Beginners - Software Developer's Tour

In this article, I would like to introduce the subject of web scraping.

Before we start, we have to consider web scraping on two levels: legal and ethical.

Let’s tackle the legality of web scraping first.

Fundamentally, the law depends on the country you live in. You should keep in mind:

  • Copyright (authors’ rights)
  • Terms of Service
  • General Data Protection Regulation (GDPR, European Union)
  • Computer Fraud and Abuse Act (CFAA, USA)

Beyond the regulations, there is the ethical side: you have to make sure that you do not harm the target’s infrastructure, for example by sending so many requests that you effectively DDoS the target’s site.
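One simple way to avoid overloading a site is to pause between requests. The sketch below is only an illustration and not part of the original script: the Throttle class, its parameters, and the page names are hypothetical, and the half-second delay is an arbitrary assumption that you should adapt to the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request_at = None

    def wait(self):
        """Block until min_delay_seconds have passed since the previous call."""
        now = self._clock()
        if self._last_request_at is not None:
            remaining = self.min_delay_seconds - (now - self._last_request_at)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request_at = self._clock()


if __name__ == "__main__":
    throttle = Throttle(min_delay_seconds=0.5)  # assumed delay, tune per site
    for page in ["page-1", "page-2", "page-3"]:  # hypothetical page names
        throttle.wait()
        print("fetching", page)  # replace with the real request
```

Calling throttle.wait() before every request keeps the request rate bounded no matter how fast the rest of the scraper runs.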

Let’s put this into practice

We want to get all titles from the blog section.

When we inspect the blog section, we can see that each title is an a tag inside a div with the class blog-page-title.
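To make that structure concrete, here is a small sketch that uses only Python's standard html.parser module. The sample_html fragment, the link targets, and the BlogTitleParser class are hypothetical; the real page markup may differ in detail, so treat this as an illustration of what "an a tag inside a div with the class blog-page-title" means rather than a replacement for inspecting the page yourself.

```python
from html.parser import HTMLParser


class BlogTitleParser(HTMLParser):
    """Collect the text of every a tag nested inside a div.blog-page-title."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._div_stack = []   # one bool per open div: is it a blog-page-title div?
        self._in_link = False
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self._div_stack.append("blog-page-title" in classes)
        elif tag == "a" and any(self._div_stack):
            # The link has at least one blog-page-title div as an ancestor.
            self._in_link = True
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()
        elif tag == "a" and self._in_link:
            self.titles.append("".join(self._link_text).strip())
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self._link_text.append(data)


# Hypothetical markup that mimics the structure described above.
sample_html = """
<div class="blog-page-title"><a href="/blog/first">First post</a></div>
<div class="blog-page-title"><a href="/blog/second">Second post</a></div>
<div class="sidebar"><a href="/about">About</a></div>
"""

parser = BlogTitleParser()
parser.feed(sample_html)
print(parser.titles)  # ['First post', 'Second post']
```

The link in the sidebar div is skipped because none of its ancestors carries the blog-page-title class, which is the same filtering that the CSS selector div.blog-page-title a performs.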

The steps we need to take to accomplish our task:

  1. Go to the website.
  2. Wait until the page has loaded a div with the class “blog-page-title” that contains an a tag.
  3. Find all a tags inside divs with that class.
  4. Print the text of these a tags.
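The complete script looks like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://pawelmajewski.com/blog"
div_class = "blog-page-title"
# CSS selector: every a tag inside a div with the class blog-page-title
selector = f"div.{div_class} a"

driver = webdriver.Chrome()
try:
    # 1. Go to the website.
    driver.get(url)

    # 2. Wait up to 3 seconds until at least one matching a tag is present.
    wait = WebDriverWait(driver, 3)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))

    # 3. Find all matching a tags.
    a_tags = driver.find_elements(By.CSS_SELECTOR, selector)

    # 4. Print the text of each a tag.
    for a_tag in a_tags:
        print(a_tag.text)
finally:
    # Close the browser even if the wait times out or another error occurs.
    driver.quit()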
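The explicit wait in step 2 is the least obvious part of the script, so here is a rough, library-free sketch of the idea behind it: call a condition repeatedly until it returns something truthy or the timeout expires. This is not Selenium's implementation; wait_until, page_is_ready, and the chosen intervals are hypothetical names and values used only for illustration.

```python
import time


def wait_until(condition, timeout_seconds, poll_interval_seconds=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
        time.sleep(poll_interval_seconds)


attempts = []


def page_is_ready():
    # Hypothetical stand-in for "the a tag is present"; succeeds on the third check.
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_is_ready, timeout_seconds=3, poll_interval_seconds=0.1)
print(len(attempts))  # 3
```

Waiting on a condition instead of sleeping for a fixed time means the scraper continues as soon as the content appears and fails with a clear error if it never does.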

I recommend the Selenium library both for writing your own web scrapers and for other browser automation, such as integration tests.