Welcome to the Web Scraping Course

There is an overwhelming amount of information available to us on webpages, and we cannot possibly parse through all of it in our lifetime. This is where web scraping comes in: we can write programs that automatically retrieve the most important information on websites for us. In this web scraping course, we will learn the basics of web scraping using Python and retrieve information from various websites.

Installation Instructions

First, make sure pip (Python's package manager) is installed for your operating system.

Then run "pip install beautifulsoup4 requests" in your terminal (PowerShell on Windows, Terminal on macOS; if you are using another operating system, you probably know what you are doing). This installs BeautifulSoup along with the requests library that we will use to download webpages.

Introduction to HTML Tags

HTML tags are the building blocks of every website that you see on the internet. Tags usually come in pairs, like the <p> and </p> tags below. The first tag is known as the 'start tag' and the second as the 'end tag'; the end tag is the same as the start tag but with a forward slash (/) before the tag name.

<!DOCTYPE html>

<html>
<p>This is a paragraph</p>
</html>

Two important things to note are that:

  1. When a web browser loads an HTML document, it doesn't display the tags; instead, it uses them to identify how the webpage is meant to be structured and displays the elements accordingly.
  2. Only the information between the <body> and </body> tags will be displayed by the web browser (the above HTML code would not display anything on a webpage; see the corrected example below).
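
For example, wrapping the paragraph in <body> tags makes it display:

<!DOCTYPE html>

<html>
<body>
<p>This is a paragraph</p>
</body>
</html>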

HTML tags can also carry a bit more complexity to make customizing the content easier. Such tags are structured like this:

<!DOCTYPE html>

<html>
<img src="http://www.imagewebsite/image">
</html>

Attributes provide additional information to the browser about how the content should be displayed. The tag above follows the pattern <tagname attribute="attribute value">. In this example, the tag name img tells HTML to display an image, src is the attribute naming the image's source, and http://www.imagewebsite/image is the value of the src attribute: the location the image is loaded from.
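
Another common example is the link (anchor) tag, where the href attribute tells the browser where the link should point:

<a href="https://www.example.com">This is a link</a>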

BeautifulSoup

BeautifulSoup is the main library we will be using to scrape webpages. In order to retrieve webpages, however, we will need to use the requests library. Let's import both of those libraries at the top of our Python file.

from bs4 import BeautifulSoup
import requests

For the scope of this course, we will not delve into the details of how retrieving the webpage works. All we need to know is how to transform a URL into a BeautifulSoup object we can parse through. First, however, let's look at the source of the webpage we will be parsing: https://hackonnect.github.io/scrape1. The website is very simple, nothing like what real websites are actually like, but it serves as a very good starting point for learning web scraping.

Now that we've had a look through the structure of the website, we can create a structural representation of our webpage's source file and store it in a variable called "soup". This variable contains a nested data structure that we can navigate through and retrieve information from.

website = requests.get('https://hackonnect.github.io/scrape1') # download the raw page
soup = BeautifulSoup(website.content, 'html.parser') # parse the HTML into a soup

print(soup)

As you can see from the output of the print statement, the soup prints as one very long string containing the raw HTML of our website. The output isn't particularly readable. However, we can use soup.prettify() to generate a readable, indented string representation of the soup.

print(soup.prettify())

We can find an element in our soup by referencing it by its HTML tag name, which gives us its HTML code. For example, if we want to get the title of our HTML, we can simply do:

print(soup.title)

This should yield the following output:

<title>Practice Web Scraping with Python</title>

We can also get the name of our HTML element by appending .name to the end. In this case, the following statement should print out "title":

print(soup.title.name)

Similarly, we can append .string to the end in order to get the contents of our HTML element.

print(soup.title.string)

The method of finding HTML elements we just introduced only gives you the first element that matches the query. Our HTML page has multiple paragraphs, but the following will only print out the HTML code of our first paragraph:

print(soup.p)

In order to find multiple paragraphs, we have to use find_all(). This gives us a list of HTML elements that match the criterion we search for. For the argument of find_all(), we put the name of the HTML tag in quotation marks.

print(soup.find_all('p'))

Let's save this in a variable called paragraphs:

paragraphs = soup.find_all('p')

Remember HTML attributes? We can look up an attribute of a particular HTML element by indexing the element with the attribute we want. For example, if we want to find the value of the class attribute of the second paragraph, we can type out the following:

print(paragraphs[1]['class'])

As you can see, we get a list of the attribute values. We can just index the first element in order to get what we want:

print(paragraphs[1]['class'][0])

Of course, for elements with multiple attribute values, like the class values of the third paragraph, we should not index the list, as that would discard some of the information.

print(paragraphs[2]['class'])

There's also a way we can find every single attribute of an element:

print(paragraphs[3].attrs)

In order to find an HTML element by its id, we can use find_all(id=''). This would give us a list of HTML elements with a specific id. However, because ids are unique, we can also simply use find(id=''). The find function, similar to how we added .title and .p to soup, returns the first element that matches the criterion inside the brackets. For everything else apart from finding the id, find_all() is usually the better choice. The following expressions will both yield the same output:

print(soup.find(id='unique'))
print(soup.find_all(id='unique')[0])

One caveat: you cannot search for an attribute called "name" with a keyword argument, because find_all() reserves the name keyword for the tag name. soup.find_all(name='h1') is just another way of writing soup.find_all('h1'):

print(soup.find_all(name='h1'))

To match an attribute literally called "name", or to match several attributes at once, pass them in a dictionary through the attrs parameter:

print(soup.find_all(attrs={'name': 'h1', 'id': 'h1'}))

Now, navigate to Exercise 1 and see if you are able to complete it.

Nested Elements

Now let's look at another webpage to learn how to navigate through nested elements. Have a look at https://hackonnect.github.io/scrape2 and inspect the hierarchical structure of the divs. Open up a new document and save the soup of the HTML in a soup variable:

from bs4 import BeautifulSoup
import requests

website = requests.get('https://hackonnect.github.io/scrape2')
soup = BeautifulSoup(website.content, 'html.parser')

Let's look at a shortcut for finding all the divs with a class of "sibling":

print(soup.find_all('div', 'sibling'))

We simply put the name of the class after the HTML element tag name. In order to navigate the hierarchical structure of the divs better, let's save the parent div encompassing everything as a new variable called "parent".

parent = soup.div

We can list out all the children of this parent div using .contents, which gives us everything directly inside the parent div. Let's save the contents in a variable called "children":

children = parent.contents

As you can see, we have successfully saved all the children of the parent div in a variable. Note that children is a plain Python list, and each item in it is itself a piece of the soup (a tag or a string), so the same nested structure carries on recursively inside each child. To find the number of children the parent div has, we can simply take the length of children:

print(len(children))

We get an answer of 9. However, if you navigate back to the page we are using, we can only count 4 orange divs! To get a better understanding of why this is the case, let's print out children and see what it actually contains.

print(children)

In the printed output, we can see a lot of '\n's. These are line break characters, showing us where the HTML source contains a new line. Unfortunately, they are not particularly useful to us and should be removed. As a quick exercise, try to remove all of the line breaks from children (see Exercise 2 for a solution). If you run the following code afterwards, the output should be 4:

print(len(children))

To find the parent of each child div, we can simply append .parent onto the end of the child. Here is an example of how it works:

print(children[0].parent)

Apart from navigating up and down the hierarchical structure, we can also find siblings of a particular div. Let's try using next_sibling in order to find the next sibling:

for child in children:
  print(child.next_sibling)

This is not what we were expecting, is it? The reason is that removing the '\n' strings from our list did not remove them from the document tree: in the tree, every div is still immediately followed by a newline, so .next_sibling gives us '\n' for each div. We can see the divs and the newlines alternating if we iterate over the unfiltered children, this time using the parent.children generator:

children = parent.children # Note that we did not filter out any of the '\n' strings this time.
print(children)
for child in children:
  print(child.next_sibling)

Now the sibling divs actually show up in the output. If we want a generator containing the next siblings of an element directly, we can use .next_siblings. Remember that anything that returns a generator, such as .children, does NOT give you a list: it's a list_iterator / generator that should only be used for iteration.

for next_sibling in parent.div.next_siblings:
  print('I\'m another sibling!')
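
If you need to index the children or loop over them more than once, convert the iterator into a list first:

children = list(parent.children) # materialize the iterator into a list
print(len(children)) # len() and indexing now work; this prints 9 again, since the list is unfiltered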

Take a few minutes to experiment with moving through this hierarchical, nested structure.

HTML Selectors

Now that we are equipped with the knowledge to navigate nested HTML structures, let's look at some of the selectors we can use while web scraping. Once again, let's explore the structure of the website, this time https://hackonnect.github.io/scrape3:

from bs4 import BeautifulSoup
import requests

website = requests.get('https://hackonnect.github.io/scrape3')
soup = BeautifulSoup(website.content, 'html.parser')

Sometimes, we need to find the third, fourth, seventh or even forty-second element. To do this, we need to first know how CSS selectors work with BeautifulSoup. We can use select() in order to select HTML elements using CSS selectors. The following code does the exact same thing as soup.find_all('p'), except it uses CSS selectors. If possible, always use find_all() as it is significantly faster.

print(soup.select('p'))

Here are a few CSS selectors that you might need to know:

Selecting every p inside a div:

print(soup.select('div p'))

Selecting every p that is directly under a div:

print(soup.select('div > p'))

Finding everything with a class of example:

print(soup.select('.example'))

Finding the third p:

print(soup.select('p:nth-of-type(3)'))

Similar to how find_all() has a counterpart in find(), select() has a counterpart that only selects the first element that meets the criterion called select_one():

print(soup.select_one('p:nth-of-type(3)'))

Sometimes, other HTML elements will get in the way of our code. Let's take a look at the last paragraph, which has a class of "broken". This paragraph is, quite literally, broken. If you try to extract its text by appending .string to the end of your Python statement, you'll find that it doesn't work: .string returns None whenever an element contains more than one child. Consequently, we need to remove the extra elements inside the paragraph.
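
You can see the problem for yourself:

print(soup.select_one('.broken').string) # prints None because the paragraph contains a span as well as text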

If you use inspect element, you'll realise that there is a span inside the paragraph causing the problem. We can remove the span like this:

soup.select_one('.broken').span.decompose()

Decomposing an element removes and destroys it. If we want to keep the removed element in a variable instead, we can use extract():

new_variable = soup.select_one('.broken').span.extract()

The code above won't work now because we have already decomposed the span. If you haven't removed the span yet, extract() can be a very useful tool.
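
Either way, once the span is gone, .string works again on the broken paragraph. A quick sketch, assuming a fresh soup where the span has not been decomposed yet:

span = soup.select_one('.broken').span.extract() # detach the span but keep it
print(soup.select_one('.broken').string) # .string now returns the paragraph's text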

Now, go ahead and attempt to solve Exercise 3.

Exercise 1

1. Find the class of the last paragraph on https://hackonnect.github.io/scrape1 (the same page used in the BeautifulSoup section of the course).

2. There's actually a hidden paragraph with an id of "hidden" on the same page! Find out what it says.

Exercise 2

Exercise 2 is located within the Nested Elements section of the course. There are many different ways to remove all the line breaks, including iterating through the soup structure and removing anything that is equal to '\n'. Here is a short solution that achieves the same effect:

children = list(filter(lambda x: x != '\n', children))
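
The same filtering can also be written as a list comprehension:

children = [child for child in children if child != '\n']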

Exercise 3

Wikipedia is one of our most used websites. Whether you'd like to admit it or not, we all use Wikipedia to get information. Let's say we want to get the relative atomic mass of every known element. Using Wikipedia and BeautifulSoup, we can easily retrieve this information. Try to do it on your own before asking for help; you might not have enough time to finish it during the session, so feel free to carry on working on it in your free time. Computer science can be fun!

Hints:

Try to look for a webpage that has all the values you need.

Use inspect element to find the correct element.

You might want to loop through something several times.
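
If you get stuck, here is one possible starting point. This is only a sketch: it assumes Wikipedia's "List of chemical elements" page and that the data sits in the first table on the page, both of which you should verify with inspect element.

from bs4 import BeautifulSoup
import requests

website = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(website.content, 'html.parser')

# Print the cells of every table row; finding which cell holds the
# relative atomic mass is left to you.
for row in soup.find('table').find_all('tr'):
  cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
  print(cells)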