Check the request headers in Dev Tools -> Network -> Fetch/XHR. On the left side you'll see a bunch of requests sent from/to the server; when you click on one of those requests, the right side shows the response via the Preview tab.

Make requests with delays using the time.sleep method:

from time import sleep
sleep(0.05) # 50 milliseconds of sleep
sleep(0.5) # half a second of sleep
sleep(3) # 3 seconds of sleep
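Instead of a fixed interval, you can also randomize the delay a little between requests. A minimal sketch, where the example.com URLs and the 2-6 second range are placeholders you'd adjust for the target site:

import random
from time import sleep

import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholder URLs

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # wait a random amount of time before the next request instead of a fixed interval
    sleep(random.uniform(2, 6))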
In Ruby it's an identical process, using the sleep method as well:

# Called without an argument, sleep() will sleep forever
sleep(0.5) # half a second
sleep(4.minutes) # durations like 4.minutes need ActiveSupport (Rails)
# or longer..
sleep(2.hours)
sleep(3.days)
Passing a user-agent does not guarantee that your request won't be declined or blocked. A user-agent is needed to act as a "real" user visit, which is also known as user-agent spoofing: a bot or browser sends a fake user-agent string to announce itself as a different client.

In the requests library the default user-agent is python-requests, and websites understand that it's a bot and might block the request to protect themselves from overload if a lot of requests are being sent.

The user-agent string follows this syntax:

User-Agent: <product> / <product-version> <comment>

Here's how to pass a custom user-agent into the request headers with requests:

headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# add request headers to request
requests.get("YOUR_URL", headers=headers)
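To see which user-agent actually goes out with a request, you can hit an echo service such as httpbin.org/headers (used here only for illustration) and compare the default value with the spoofed one:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# default user-agent, e.g. python-requests/2.x
print(requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"])
# spoofed user-agent from the headers dict above
print(requests.get("https://httpbin.org/headers", headers=headers).json()["headers"]["User-Agent"])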
With the Ruby HTTParty gem it's an identical process:

headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# add request headers to request
HTTParty.get("YOUR_URL", headers:headers)
Here's what happens with the requests library when there's no user-agent passed into the request headers. This problem is very common on StackOverflow. The example below will try to get the stock price:

import requests, lxml
from bs4 import BeautifulSoup
params = {
"q": "Nasdaq composite",
"hl": "en",
}
soup = BeautifulSoup(requests.get('https://www.google.com/search', params=params).text, 'lxml')
print(soup.select_one('[jsname=vWLAgc]').text)
The output will be an AttributeError, because the response contains different HTML with different selectors:

print(soup.select_one('[jsname=vWLAgc]').text)
AttributeError: 'NoneType' object has no attribute 'text'
If you print the soup object or the response from requests.get(), you'll see that it's HTML full of <script> tags, or HTML that contains some sort of an error.

Adding a user-agent to the request headers fixes it:

import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "Nasdaq composite",
"hl": "en",
}
soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')
print(soup.select_one('[jsname=vWLAgc]').text)
# 15,363.52
Another option is to rotate user-agents. You can collect them in a list() or save them to a .txt file. The code below picks a random user-agent from the list() using random.choice():
import requests, random
user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
for _ in user_agent_list:
    # Pick a random user-agent from the list
    user_agent = random.choice(user_agent_list)
    # Set the request headers
    headers = {'User-Agent': user_agent}
    requests.get('YOUR_URL', headers=headers)
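If you keep user-agents in a .txt file instead of a hardcoded list(), the rotation works the same way. A short sketch, assuming a user_agents.txt file (a placeholder name) with one user-agent string per line:

import random
import requests

# read user-agents from a file, one per line
with open("user_agents.txt") as f:
    user_agent_list = [line.strip() for line in f if line.strip()]

headers = {"User-Agent": random.choice(user_agent_list)}
requests.get("YOUR_URL", headers=headers)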
Sometimes passing only a user-agent isn't enough. You can pass additional headers. For example:

Accept: <MIME_type>/<MIME_subtype>; Accept: <MIME_type>/*; Accept: */*
Accept-Language: <language>; Accept-Language: *
Content-Type: text/html; image/png
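Putting it together, the additional headers go into the same headers dict as the user-agent. A minimal sketch, where the Accept and Accept-Language values are just plausible browser-like defaults:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

requests.get("YOUR_URL", headers=headers)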
You can also make headers persist across requests with requests.Session():

session = requests.Session()
session.auth = ('user', 'pass')
session.headers.update({'x-test': 'true'})
# both 'x-test' and 'x-test2' are sent
session.get('https://httpbin.org/headers', headers={'x-test2': 'true'})
Note that method-level parameters such as cookies will not be persisted across requests, even when using a session:

session = requests.Session()
response = session.get('https://httpbin.org/cookies', cookies={'from-my': 'browser'})
print(response.text)
# '{"cookies": {"from-my": "browser"}}'
response = session.get('https://httpbin.org/cookies')
print(response.text)
# '{"cookies": {}}'
You can also send headers in the same order a real browser sends them. To see that order, check DevTools -> Network -> Click on the URL -> Headers.

from collections import OrderedDict
import requests
session = requests.Session()
session.headers = OrderedDict([
    ('Connection', 'keep-alive'),
    ('Accept-Encoding', 'gzip,deflate'),
    ('Origin', 'example.com'),
    ('User-Agent', 'Mozilla/5.0 ...'),
])
# other code ...
custom_headers = OrderedDict([('One', '1'), ('Two', '2')])
req = requests.Request('GET', 'https://httpbin.org/get', headers=custom_headers)
prep = session.prepare_request(req)
print(*prep.headers.items(), sep='\n')
# prints:
'''
('Connection', 'keep-alive')
('Accept-Encoding', 'gzip,deflate')
('Origin', 'example.com')
('User-Agent', 'Mozilla/5.0 ...')
('One', '1')
('Two', '2')
'''
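prepare_request only builds the request; to actually fire it, you pass the prepared object to session.send(). A self-contained sketch against httpbin.org/get (an echo endpoint used only for illustration):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

# build and prepare the request, then send it through the session
req = requests.Request("GET", "https://httpbin.org/get", headers={"One": "1"})
prep = session.prepare_request(req)

response = session.send(prep)
print(response.json()["headers"])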
Another way to reduce the chance of being blocked is to route requests through proxies. With the requests library:

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
In Ruby you can add proxies with the HTTParty gem, as in the snippet below:

http_proxy = {
  http_proxyaddr: "PROXY_ADDRESS",
  http_proxyport: "PROXY_PORT"
}
HTTParty.get("YOUR_URL", http_proxy)
Or the http.rb gem to add proxies:

HTTP.via("proxy-hostname.local", 8080)
.get("http://example.com/resource")
HTTP.via("proxy-hostname.local", 8080, "username", "password")
.get("http://example.com/resource")
As with user-agents, you can collect proxies in a list() or save them to a .txt file to save memory, iterate over them while making requests to see what the results would be, and then move on to different types of proxies if the result is not what you were looking for.
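A minimal sketch of that idea, rotating through a proxy list with random.choice() the same way as user-agents; the proxy addresses here are placeholders:

import random
import requests

# placeholder proxies; replace with your own list or load them from a .txt file
proxy_list = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
]

proxy = random.choice(proxy_list)
proxies = {"http": proxy, "https": proxy}

try:
    response = requests.get("YOUR_URL", proxies=proxies, timeout=10)
    print(response.status_code)
except requests.exceptions.RequestException as e:
    # move on to another proxy if this one fails or gets blocked
    print(f"Proxy {proxy} failed: {e}")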