The key to all scraping
Responsible web scraping is a very important part of your strategy. Do not downplay or underestimate it!
If you’re scraping a site and hit it too fast, or otherwise make it obvious that it’s being scraped, you’ll probably find your servers become blocked. Those servers are now ‘burned’ and useless to you. This is a problem, as proxy server providers will only replace servers occasionally, or not at all. It’s therefore vital that any web scraping is handled responsibly through good human emulation settings.
What are human emulation settings?
The key here is in the title. You need to emulate human behaviour, which helps hide your queries among genuine queries from real humans. The aim is to pull information from the site efficiently, scraping as quickly as possible without becoming blocked. This is the main principle behind responsible web scraping.
Each site has its own tolerance to scraping, so the settings you need depend entirely on the site you’re targeting. Sites often strengthen their defences over time too, so check the settings you use regularly.
Responsible web scraping and software
If you’re building your own software, it’s very straightforward to put human emulation settings in place. If you’re using off-the-shelf software you’ve bought in, however, you may have little control over those settings. It’s important to research what settings are available before you commit, otherwise the software might not work out for you.
Basics of human emulation settings and responsible web scraping
The speed of your queries when scraping
Each site will have a limit on how many queries from one IP address are acceptable over a period of time. Go too fast and your servers become blocked. Go too slowly and you’re not making the most of your servers. Somewhere in between is the sweet spot you’re looking for. The gap between each query is called a delay, and delays are usually measured per server. So if the delay is one query per minute, then thirty servers can run at a combined total of one query every two seconds.
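As a rough illustration of that arithmetic, here’s a minimal Python sketch. The one-minute delay and the jitter range are example figures only, not recommendations for any particular site:

```python
import random

def combined_interval(per_server_delay_s: float, num_servers: int) -> float:
    """Average gap between queries across the whole pool, assuming the
    servers' schedules are evenly staggered."""
    return per_server_delay_s / num_servers

def jittered_delay(base_delay_s: float) -> float:
    """Randomise each delay a little: perfectly regular gaps between
    queries are themselves a bot signal."""
    return base_delay_s * random.uniform(0.8, 1.2)

# Thirty servers, each limited to one query per minute:
combined_interval(60, 30)  # -> 2.0 seconds between queries overall
```

In practice you’d sleep for `jittered_delay(...)` seconds between each server’s queries rather than using a fixed interval.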
Realistic rest periods
Real humans sleep. Of course we do. This means that if you’re sending queries from an IP address 24/7, it’s obvious those queries are automated. Rest periods of at least six hours a day per server IP address are usually advised. To be extra safe, it’s best to match these rest periods to night time in the country the servers are based in.
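A rest period can be as simple as a daily “do not run” check per server. This sketch assumes a 23:00–05:00 window; in practice you’d pick the hours to match night time in each server’s local country:

```python
from datetime import datetime, time

# 23:00-05:00 is an example window -- a six-hour nightly pause.
REST_START = time(23, 0)
REST_END = time(5, 0)

def in_rest_window(now: datetime) -> bool:
    """True if this server should be idle. The window wraps past
    midnight, so it's an OR of two ranges."""
    t = now.time()
    return t >= REST_START or t < REST_END
```

A scheduler would simply skip any server for which `in_rest_window` returns true, using each server’s local time.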
User-Agents and web scraping
Each query you send to a website must have a user agent attached to it. Using just one for all of your requests will quickly stand out, so you must gather a list and rotate your user agent with every request. The user agents should always be from current versions of browsers, because browsers now update automatically. Any old user agents are very, very easy for a site to spot, and they can block you. It’s also advisable that user agents are from popular operating systems; try to avoid Linux and similar OSes that have a comparatively low number of users. Again, this is to help lose your queries in the crowd.
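A simple rotation might look like this in Python. The user agent strings below are illustrative examples only; a real list should be refreshed regularly so it always matches current browser versions on popular operating systems:

```python
import random

# Example pool -- keep this list up to date with current browser versions.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def rotating_headers() -> dict:
    """Headers for the next request, with a freshly picked user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Each request then takes its headers from `rotating_headers()` rather than reusing one fixed string.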
Accessing the URL correctly
If the information you need isn’t normally accessed directly, and a real human would have to make a number of clicks on the site to reach it, then you need to do the same. Going straight to the target URL can look unnatural and raise a red flag. You also need to check the URL each time you paginate for any parameters that are added or removed, and copy those exactly as they appear. Keep the same IP address as you move through those pages too: switching IP addresses in the middle of a query will look unnatural.
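Putting that together, a pagination loop might look like this sketch. Here `fetch_page` and `next_page_params` are hypothetical helpers: the first performs the request through a fixed session (and therefore a fixed proxy IP), and the second extracts whatever parameters the site itself adds to the next-page link:

```python
import urllib.parse

def paginate(session, start_url, fetch_page, next_page_params, max_pages=10):
    """Walk pagination like a person would: same session (same IP)
    throughout, building each next URL from the site's own parameters."""
    url = start_url
    pages = []
    for _ in range(max_pages):
        page = fetch_page(session, url)   # same session -> same IP
        pages.append(page)
        params = next_page_params(page)   # e.g. {"page": "3", "sort": "asc"}
        if params is None:
            break
        # Copy the parameters exactly as the site presents them.
        base = url.split("?")[0]
        url = base + "?" + urllib.parse.urlencode(params)
    return pages
```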
Cookies and cache
It may be necessary to use a session cookie, but always, always, always delete cookies and empty your cache after you complete each query. Not doing so can let the site chain together the IP addresses you’re using: they work out which servers are yours, and they can then block all of the servers you have.
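Using Python’s standard library as an example, one way to guarantee this is to give every query its own cookie jar, so nothing can carry over between queries routed through different proxy IPs:

```python
import urllib.request
from http.cookiejar import CookieJar

def make_clean_opener():
    """A fresh opener with an empty cookie jar for one query."""
    jar = CookieJar()  # starts empty -- no cookies carried over
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar

def fetch_clean(url: str) -> bytes:
    opener, jar = make_clean_opener()
    try:
        with opener.open(url, timeout=30) as resp:
            return resp.read()
    finally:
        jar.clear()  # empty the cookie jar again after the query
```

The same idea applies in any HTTP library: scope session state to a single query, then discard it.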
Using IP refreshes/replacements responsibly
Many server providers do offer the ability to refresh your server IP addresses, either after a certain amount of time or on-demand. The reasoning behind this is to replace any bad or poorly performing servers you have on your list.
This sounds entirely reasonable, but it’s not necessarily conducive to responsible web scraping. If a customer knows they’re going to get a fresh list of servers soon, they may speed up their scraping, trying to get as much from the servers as they can, which can lead to the servers becoming burned. After all, they’re getting a new list soon, so why look after the old one until the end?
What happens to those servers then? They get recycled and sent to a new customer. That can lead to a bad experience, as the customer who receives the new server list is hoping to get ones that aren’t burned.
I said this at the start, and I’ll say it at the end because it’s so vital: responsible web scraping is a very important part of your strategy. Do not downplay or underestimate it!
Like many things in life, there’s no one big secret or a magic key to unlock a door that will make life easy for you. Getting hold of your data effectively and efficiently is down to many different elements all coming together and working in harmony. You also need to keep one step ahead of your target site’s detection algorithms, so don’t neglect any basic maintenance and check your scraping techniques and methods often.
Don’t forget, if you need any assistance we can offer a free consultancy service.