Web scraping is a huge business and it can be a very expensive one. Obtaining the data you need may require a big outlay and getting it right can make all the difference between being competitive and being left behind. Contrary to popular belief, data centre proxies still have a huge part to play.
To get that information and scrape successfully, you need a proxy server solution. The prevailing wisdom is that residential proxies are better than data centre proxies, but this isn’t always the case. Far from it. Data centre proxies are often a better option than residential proxies. I’m going to tell you why, and how to use them effectively.
Are residential proxies really the future?
Residential proxies have been the hottest ticket in the proxy server market for a while now. To be fair, there’s some justification for this as it’s long been assumed that residential proxies are a general improvement over data centre proxies.
The main benefits of residential proxies are:
- Generally, there’s a huge pool of proxies available to you.
- Increased anonymity, because target websites struggle to tell automated requests apart from genuine users when both come from residential IP addresses.
- This, in turn, reduces the chances of blocked queries, improves your efficiency and reduces the cost to you.
What’s the catch with residential proxies?
Despite the buzz, there are some downsides to using residential proxies over data centre proxies. Primarily, they do cost a lot more to run, with a minimum outlay of potentially hundreds of dollars a month. That could put off companies or projects with a smaller budget.
This is partly because there is still a buzz about them, so you’re paying a premium for their popularity on top of the real benefit of a lower chance of blocks.
The other downside is that residential proxies are almost always dynamic and you can’t hold on to an IP address. This means there are many tasks that residential proxies are not suitable for, such as:
- Paginating through a site until you reach the information you need.
- Managing social media accounts.
- Holding a web session whenever you need to log into a site.
It is true that some residential proxy providers are now offering static residential IP addresses. However, these are very expensive and data centre proxies could still prove a better option.
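As a rough illustration of why a fixed IP address matters for the tasks listed above, the sketch below uses Python’s standard library to send every request through one proxy while keeping cookies between requests. The proxy address and the function name are placeholders for illustration, not any real provider’s details.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical static proxy (203.0.113.0/24 is reserved for documentation).
STATIC_PROXY = "http://203.0.113.10:8080"


def build_sticky_opener(proxy_url):
    """Return an opener that sends all traffic via one proxy and keeps cookies."""
    cookies = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPCookieProcessor(cookies),
    )
    return opener, cookies


opener, cookies = build_sticky_opener(STATIC_PROXY)
# opener.open("https://example.com/login") would now leave from the same IP
# on every call, so a login or a multi-page crawl looks like one visitor.
```

With a dynamic proxy the exit IP could change between those calls, which is the situation the login and pagination tasks above cannot tolerate.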
What data centre proxies have to offer
Despite falling behind in the popularity stakes, data centre proxies still form the backbone of the vast majority of web scraping performed worldwide. They fulfil the most basic requirement of any proxy server, shielding your own IP address, and are available in every corner of the globe. This gives you the possibility of obtaining data from remote locations and blocked sites.
That’s all great, but nothing that residential proxies can’t offer. When it comes to using data centre proxies, the main benefits they bring are:
- Lower cost than residential proxies.
- Data centre proxies can be static. You can hold on to an IP address if you need to.
- You can choose the exact number you need.
This last benefit, choosing the exact number you need, is possibly the greatest benefit of data centre proxies.
Residential proxies are often sold with the main benefit being the vast number of IP addresses in hundreds of locations, which is fantastic if you have heavy-duty requirements demanding that level of service.
However, if your requirements are far lower, paying a premium to access a huge number of IP addresses is very much overkill. Take into account the higher entry-level cost that residential proxies demand, as opposed to a decent number of data centre proxies that can cost less than $100, and data centre proxies are very definitely a great option.
How to use data centre proxies effectively and efficiently
Now you’ve breathed a sigh of relief that data centre proxies are fine to use and don’t cost the earth, you need to check how many you need. This can be a very tricky process as there are many, many variables you need to take into account, along with a great deal of trial and error you’ll need to go through.
How to really scrape anonymously
You might hear a lot about how to scrape anonymously, but the truth of the matter is that there is no such thing. Don’t forget, each request to a website comes loaded with information, and there can be lots of it. Much of it is contained in the HTTP header fields, and the rest can be read or inferred from the connection itself. Feel free to check them out, but only if you’re feeling masochistic. Some examples of the information sent are:
- Your IP address and ISP.
- Your physical location (though this isn’t as accurate as crime dramas will have you believe).
- Which language the requester wants the information sent back in.
- Any authentication credentials, such as a username and password.
- Your user-agent, which helps the site present information in the correct format for your device.
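To see a couple of these fields for yourself, here is a minimal sketch that builds a request with Python’s standard library and prints the headers it would carry, without sending anything. The URL and the header values are illustrative assumptions.

```python
import urllib.request

# Illustrative values only; the URL and header contents are assumptions.
request = urllib.request.Request(
    "https://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept-Language": "en-GB,en;q=0.8",
    },
)

# Nothing is sent over the network here; we only inspect what would be sent.
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Note that your IP address is not one of these headers. The site sees it from the connection itself, and that is the part a proxy changes.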
So how do you scrape ‘anonymously’? Meaning, how do you scrape vast amounts of data without getting blocked. The answer is to lose yourself in the crowd.
How to scrape like a pro
Now you know that you can’t truly scrape by sneaking in and grabbing information, you need to learn how to blend into the crowd. The answer is to act like a human. But not just that: you need to act like many different humans. This is called human emulation and, unsurprisingly, you need the correct human emulation settings.
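One way to picture acting like many different humans is to share your requests across your proxies in turn and vary the pause between them. The following is only a sketch: the proxy addresses, the 40 to 80 second pause range and the function name are assumptions made for illustration, not recommended settings.

```python
import itertools
import random

# Hypothetical proxy pool (documentation-only address ranges).
PROXIES = ["198.51.100.7:8080", "203.0.113.21:8080", "192.0.2.45:8080"]


def plan_requests(urls, proxies, min_pause=40.0, max_pause=80.0, rng=None):
    """Pair each URL with a proxy (round-robin) and a randomised pause in seconds."""
    rng = rng or random.Random()
    rotation = itertools.cycle(proxies)
    return [(url, next(rotation), rng.uniform(min_pause, max_pause)) for url in urls]


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
for url, proxy, pause in plan_requests(urls, PROXIES):
    print(f"{url} via {proxy}, then wait {pause:.0f}s")
```

Because the pauses are random rather than fixed, no single IP address produces the perfectly regular rhythm that gives an automated client away.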
Data centre proxies and human emulation settings
Remember a few paragraphs back I mentioned working out exactly how many IP addresses you need? This depends on just three basic elements, which are all equally important:
How long you have to complete your scrape.
In essence, the more time you have to run a scrape, the fewer servers you will need and the less it will cost you to run. For example, if you run a scrape once a month, then look at spreading your queries out over the whole month. This can massively reduce the number of servers you need and the cost to you.
How many page requests you need to make.
You need to be careful here as wasting queries can build up and make you very inefficient. The number of queries you make needs to be kept to the bare minimum, but without taking shortcuts. If you can head to a deep URL straight away without problems then do so.
For example, if you can access URLs such as www.gym.com/exercise/cardio/treadmills without the website picking up on you then do so. However, you may find that you need to start with the homepage and then paginate through to the final page.
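To make the difference in request count concrete, here is a small sketch that lists the pages you would request if you had to walk in from the homepage. It assumes an https:// scheme (the example URL is given without one) and that every level of the path is a real page, which will not hold for every site. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def click_path(deep_url):
    """List the pages from the homepage down to deep_url, one per path segment."""
    parts = urlsplit(deep_url)
    segments = [s for s in parts.path.split("/") if s]
    pages = [urlunsplit((parts.scheme, parts.netloc, "/", "", ""))]
    for depth in range(1, len(segments) + 1):
        path = "/" + "/".join(segments[:depth])
        pages.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return pages


# The https:// scheme is an assumption; the article gives the URL without one.
for page in click_path("https://www.gym.com/exercise/cardio/treadmills"):
    print(page)
```

Going straight to the deep URL costs one request. Walking in from the homepage costs four for the same page, and that multiplier applies to every page you need.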
How sensitive your target site is to scraping.
With some sites, like Google and Amazon, you can safely make around one query per minute, per IP address. Other sites may be more sensitive. The only way you can work out the sensitivity of a site is through trial and error. Getting this right is vital. Go too fast and you get blocked. You want to aim for a ‘sweet spot’ where you can scrape just a touch below the rate at which the site picks up on you.
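The three elements above combine into simple arithmetic: divide the total number of page requests by the number of requests a single IP address can safely make in the time available. The sketch below shows that calculation. The job size of 100,000 requests is a made-up example, and the function name is hypothetical.

```python
import math


def proxies_needed(total_requests, window_hours, requests_per_minute_per_ip=1.0):
    """Estimate how many IPs keep each one at or below the tolerated rate."""
    capacity_per_ip = window_hours * 60 * requests_per_minute_per_ip
    return math.ceil(total_requests / capacity_per_ip)


# Hypothetical job: 100,000 page requests at 1 request per minute per IP.
print(proxies_needed(100_000, window_hours=24))       # one day      -> 70
print(proxies_needed(100_000, window_hours=24 * 30))  # over 30 days -> 3
```

The same job needs 70 IP addresses if it must finish in a day but only 3 if it can be spread over 30 days, which is why the time you have available matters so much to the cost.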
Detailed human emulation settings
Don’t go any further without checking out our guide to responsible web scraping. There we go into further detail on how human emulation settings are key to successful scraping.
Downsides to using data centre proxies
Are there any downsides? Yes, of course, but a lot of this comes down to finding a reliable provider.
Residential IP addresses are easily identifiable as originating from an ISP, so requests from them are immediately harder to distinguish from those of real humans. As mentioned above, information is sent across with every query, and data centre IPs are easily identifiable as originating from a data centre. Websites know that these are often used for web scraping and automated queries.
That being said, many organisations and businesses use IPs sourced from data centres for their everyday use. So there’s no clear-cut distinction between an automated request and a genuine one. Websites simply don’t know and make their own determination as to how tolerant they want to be.
In and of itself this difference between who ‘owns’ the IP address means nothing. But if a site is especially sensitive to being scraped, then it may view requests from data centre IPs negatively and place more restrictions on them.
However, you can minimise this hurdle by combining the widest possible range of C-classes with good human emulation settings. This helps spread your queries across many different data centres, providers and ISPs. That helps to stop your requests tripping the site’s detection algorithms so you can stay under its radar.
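If you want a quick check of how widely a proxy list is spread, the sketch below counts how many addresses fall into each block. It takes a ‘C-class’ to mean a /24 block, which is an assumption about how the term is being used here. The addresses are placeholders from documentation-only ranges, and the function name is hypothetical.

```python
import ipaddress
from collections import Counter


def subnet_spread(addresses):
    """Count how many proxy IPs fall in each /24 block."""
    blocks = (ipaddress.ip_network(f"{addr}/24", strict=False) for addr in addresses)
    return Counter(str(block) for block in blocks)


# Hypothetical proxy list (documentation-only address ranges).
proxy_ips = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "203.0.113.9"]
for block, count in subnet_spread(proxy_ips).items():
    print(block, count)
```

A list where most addresses share a handful of blocks is less diverse than its size suggests, since a site that restricts one block affects every address in it.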
Beware proxy refreshes
Another very important point is the quality of the servers. This isn’t just a matter of hardware, but also of whether they’ve been handled poorly in the past and abused to the point where they’re blocked or, even worse, blacklisted.
Many data centre providers offer to refresh your proxies, either on-demand or after a set amount of time. They may advertise these as ‘fresh’ servers. While this sounds good on the surface, you need to be cautious when proceeding as the ‘new’ servers you receive will only be new to you; there is no such thing as an unused, or virgin, IPv4 address. Servers may also not be checked against any blocks for your specific use.
Some sites may say that replacing your server pool often is a good way to avoid detection. And while it might be a good idea to occasionally replace your server list, it’s far more important to have great human emulation settings in place from day one. If you’re scraping correctly, you should never need to replace your server list. This is a very important topic, so if you haven’t checked out our guide to scraping responsibly go ahead and do so.
So there you have it. Yes, residential proxies are brilliant and are well suited to many tasks where you need to obtain publicly available data very quickly. However, there are some cracks in their armour: they aren’t suitable for every task, nor are they always worth the extra cost.
You might achieve the same objectives with data centre proxies for a fraction of the cost of residential proxies. In that case, if all other considerations are equal, then go for data centre proxies.
Data centre proxies are also easier to obtain, which is great if you need smaller numbers for low-volume tasks. You can also control individual IP addresses and hold on to them. This makes them more useful for a wider range of tasks.
If you need any assistance choosing a provider, feel free to take advantage of our free consultancy service. We’ll help you find the provider that best matches your requirements.