To speed things up, we are also going to use Python's Threading module, which leads to blazing fast web scraping.

To get started, install the requests and beautifulsoup4 modules with pip.

pip install requests
pip install beautifulsoup4

The requests module is how we are going to actually make a network request to Medium.com and fetch its data, and the beautifulsoup4 module is what will parse this data for us and return the data we actually care about. (Oh, and I am going to call the BeautifulSoup module bs4 from now on.)
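To make that division of labor concrete, here is a tiny sketch of the two modules working together; it is separate from our script, and the page and tag are just placeholders:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")         # requests fetches the raw HTML
soup = BeautifulSoup(response.content, "html.parser")  # bs4 parses it
print(soup.find("h1").text)                            # so we can pull out what we care about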
Next, create the get_medium_articles.py script and import everything.

Medium keeps a date-based archive page for every publication, so all we have to do is figure out yesterday's date (that is where the datetime module comes in), add it to the publication URL, and we are off and running fetching data.
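For example, using an illustrative date of February 10th, 2021, the finished archive URL for Towards Data Science would look like this:

https://towardsdatascience.com/archive/2021/02/10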
The heart of the script is get_articles, which will take in a publication and URL and gather the data we want! We will collect everything into a global dictionary called data. First, let's fetch the date: we will need the datetime.now() function, which returns the current date and time, subtract one day from it with datetime.timedelta to land on yesterday, and then use the strftime method on the date to properly format it in a way Medium will understand. So we set the start of the get_articles function to:

yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
date = str(yesterday.strftime("%Y/%m/%d"))
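If you run those lines in a REPL (with datetime imported), you can see what the formatting produces; the date below is from an illustrative run:

>>> import datetime
>>> yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
>>> yesterday.strftime("%Y/%m/%d")
'2021/02/10'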
Next we make the actual request by calling requests.get() and passing the URL into it. This will return a response object, and the first thing we want to do is call raise_for_status() on it: we want to try to crash as early as possible when coding, because a crash close to the problem is much easier to trace. Then we take .content on the response object; this raw HTML is what we are going to feed bs4 next!

Some of you may have looked ahead and asked, "what does allow_redirects=False do?". Good job if you caught that, but let me explain it. When a publication has no articles for the date we request, Medium redirects the archive URL to a different page, and requests follows such redirects by default because of allow_redirects=True. By turning redirects off we get the redirect response itself, the article list stays empty, and we skip that publication instead of scraping the wrong page.
. class="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"
and this is what bs4 is going to use to find every article on the page.BeautifulSoup
object and pass in the page's contents and for the second parameter, we will pass in html.parser
which tells bs4 to parse HTML.find()
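If find_all() is new to you, here is the same idea on a toy document; the HTML and the class name are made up purely for illustration:

from bs4 import BeautifulSoup

html = '<div class="post">one</div><div class="post">two</div>'
soup = BeautifulSoup(html, "html.parser")
posts = soup.find_all("div", class_="post")  # every div with class "post"
print(len(posts))     # 2
print(posts[0].text)  # one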
For each preview we grab the title with the find() method, validate it, and retrieve the text string. Getting the article's URL comes down to finding all of the a tags (links), picking the appropriate one, getting its href, removing the tracking parameters from the URL, and then keeping the result.
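That parameter removal is just a string split on "?"; here it is on a made-up article URL:

url = "https://medium.com/some-pub/some-article-1a2b3c?source=collection_archive"
print(url.split("?")[0])  # https://medium.com/some-pub/some-article-1a2b3c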
Once scraping is done, everything goes through write_to_desktop(data). I plan on writing all our data into a markdown file on my desktop, but you can do anything you want with this, such as using a simple text file or writing to a separate drive.

Now we just need to fill out the main() function and actually see some results! All of the threading logic will live in the main() function, so this doesn't get too complicated.
Make sure the threading module is imported at the top of the script. For every publication we create a threading.Thread, making the target our get_articles function; the values in args are the arguments that will be passed to get_articles. We start each thread as soon as it is created, and then join every thread before writing, because otherwise the write_to_desktop() function would get called before our data is ready.
Putting it all together, here is the full script:

import requests, datetime, threading
from bs4 import BeautifulSoup

urls = {
    "Towards Data Science": "https://towardsdatascience.com/archive/",
    "Personal Growth": "https://medium.com/personal-growth/archive/",
    "Better Programming": "https://betterprogramming.pub/archive/",
}

data = {}


def get_articles(publication, url):
    # Build yesterday's archive URL, e.g. .../archive/2021/02/10.
    yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
    date = str(yesterday.strftime("%Y/%m/%d"))
    url += date
    print(f"Checking {publication}...")
    response = requests.get(url, allow_redirects=False)
    try:
        response.raise_for_status()
    except Exception:
        print(f"Invalid URL At {url}")
        return  # bail out instead of parsing an error page
    page = response.content
    soup = BeautifulSoup(page, "html.parser")
    articles = soup.find_all(
        "div",
        class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls",
    )
    if len(articles) > 0:
        print(f"Fetching Articles from {url}")
        amount_of_articles = min(3, len(articles))
        for i in range(amount_of_articles):
            title = articles[i].find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            # The fourth link in a preview points at the article itself;
            # everything after the "?" is tracking parameters we strip off.
            article_url = articles[i].find_all("a")[3]["href"].split("?")[0]
            article = {
                "title": title,
                "article_url": article_url,
            }
            # Each thread writes under its own publication key.
            if not data.get(publication):
                data[publication] = [article]
            else:
                data[publication].append(article)


def write_to_desktop(data):
    # Replace PATH_TO_DESKTOP with the actual path to your desktop.
    with open("PATH_TO_DESKTOP/articles.md", "a") as file:
        out = ""
        for publication, articles in data.items():
            out += f"### ***{publication}***\n"
            for article in articles:
                out += f"#### [{article['title']}]({article['article_url']})\n\n"
            out += "---\n\n"
        file.write(out)


def main():
    threads = []
    for publication, url in urls.items():
        # One thread per publication, all fetching in parallel.
        thread = threading.Thread(target=get_articles, args=[publication, url])
        threads.append(thread)
        thread.start()
    # Wait for every thread to finish before writing the results.
    for thread in threads:
        thread.join()
    write_to_desktop(data)


main()
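After a successful run, articles.md ends up looking something like this (the titles and links are placeholders):

### ***Towards Data Science***
#### [A Placeholder Article Title](https://towardsdatascience.com/a-placeholder-article)

#### [Another Placeholder Title](https://towardsdatascience.com/another-placeholder)

---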