Brain Pickings: Generate a List of Articles

Brain Pickings is one of my favorite blogs. Maria Popova (the author) has written a myriad of literary essays, covering topics ranging from science to children’s books. Every post is informative and written with style.

One feature missing from her website is the ability to quickly view all article titles without scrolling through the content. For this reason, I have written a small script that prints a list of article titles and their URLs for a given number of pages.

Technologies used:

  • Python
  • Requests
  • BeautifulSoup
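Both libraries can be installed from PyPI with pip:

pip install requests beautifulsoup4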

Understanding the Script

Start by importing the necessary libraries.

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Create a few helper functions.

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content type of the response is some kind of HTML/XML, return
    the raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() so a missing Content-Type header does not raise a KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and content_type.find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
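As a quick sanity check, you can call simple_get on the front page and confirm that something came back (a minimal sketch; the byte count will vary):

raw = simple_get('https://www.brainpickings.org/')
if raw is not None:
    print(len(raw))  # number of bytes fetched
else:
    print('Request failed or the response was not HTML.')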

Specify the number of pages to iterate through, and print the results to the console.

# Specify the number of pages to go through
r = 2

for page in range(1, r + 1):

    url = 'https://www.brainpickings.org/page/' + str(page)
    raw_html = simple_get(url)
    if raw_html is None:
        # simple_get returns None on failure, so skip this page
        continue

    html = BeautifulSoup(raw_html, 'html.parser')

    # Each article title sits in an <h1 class="entry-title"> element
    titles = html.find_all("h1", {"class": "entry-title"})

    print('--------------------------')
    print('\n')
    print('PAGE ', page)
    print('\n')
    for title in titles:
        print(title.text)
        print('##########')
        print('LINK: ', title.a['href'])
        print()
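If you would rather keep the list around than just read it in the console, a small variation can write the same titles and links to a text file. This is only a sketch, and the filename articles.txt is my own choice:

# Hypothetical extension: save the results to a file instead of printing
with open('articles.txt', 'w', encoding='utf-8') as f:
    for page in range(1, r + 1):
        raw_html = simple_get('https://www.brainpickings.org/page/' + str(page))
        if raw_html is None:
            continue
        html = BeautifulSoup(raw_html, 'html.parser')
        for title in html.find_all("h1", {"class": "entry-title"}):
            f.write('{0}\n{1}\n\n'.format(title.text.strip(), title.a['href']))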

The Output