Web Scraping Basics

Which Proxy Provider, 27th April 2020 (updated 2nd July 2021)

What is web scraping?

Web scraping is possibly the most common reason people need proxy servers, so getting the hang of web scraping basics is a must.

Essentially, web scraping is an automated method of extracting specific information from a website. Web scraping software will automatically request multiple pages from the website one by one and pull the data from them.

This information is then provided in a neat and tidy format, ready to save in a database.
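As a rough illustration of that request-and-extract loop, here is a minimal sketch using only the Python standard library. It assumes, purely for the example, that the data you want sits in elements marked with class="price"; the class name, the PriceParser helper and the row layout are all hypothetical, and a real site will need its own extraction rules.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PriceParser(HTMLParser):
    """Collects the first text chunk inside each element with class="price"."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the target data lives in elements with class="price".
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False


def extract_prices(html):
    """Pull the price strings out of one page of HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def scrape(urls):
    """Request each page in turn and return one tidy row per price found."""
    rows = []
    for url in urls:
        with urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        for price in extract_prices(html):
            rows.append({"url": url, "price": price})
    return rows
```

The rows returned by scrape() are plain dictionaries, so they can be written to whatever database or file format you prefer.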

The data required is usually for competitive research, such as tracking prices on Amazon or Booking. It could also be for checking the position of a website on a SERP (search engine results page) for specific keywords.

Whatever your requirement, finding the perfect provider requires proper planning and managing your expectations. The best way to do that is to make sure you have a good grounding in the basics of web scraping.

What is a proxy server?

Ask Google to define the word “proxy” and the response will be:

The authority to represent someone else.

However, we’re interested in computers and data, not people, so how is this statement relevant?

A proxy server acts as an intermediary between a client (your device) and a website. A simple explanation of how it works is to imagine a proxy server as a middle man. The proxy server collects your requests, fetches the results from the website on your behalf and returns the data to you.

A more in-depth explanation of proxy servers and how they work is available on the Wikipedia entry.
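To make the middle-man idea concrete, here is a minimal sketch of sending a request through a proxy with Python's standard urllib. The proxy address is a placeholder, not a real server, and the function names are made up for this example; your provider will supply the actual host, port and any credentials.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; replace it with the one your provider gives you.
PROXY_URL = "http://proxy.example.com:8080"


def make_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL):
    """Ask the proxy to fetch url on our behalf and return the raw response body."""
    opener = make_proxy_opener(proxy_url)
    with opener.open(url, timeout=10) as response:
        return response.read()
```

With this in place, the website receives the request from the proxy's IP address rather than from your own device.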

Using proxy servers might seem like adding an unnecessary layer of complication, but proxies can be very helpful and are used for several purposes. Their use mainly falls under two categories: web scraping and changing your location.

Web Scraping Basics

Your project may need to scrape a website and retrieve lots of data in a short space of time. However, scraping a website very quickly from just your own IP address will soon result in blocked requests. There are different reasons why a website might frown upon scraping, and sites can react in different ways. Some of them are:

There’s a fine line between what a website views as normal human behaviour and what it views as a possible DDoS attack (when a site receives more requests than its hardware can handle and the site freezes). The website might view your scraping as an attack, block you and might even blacklist you.

Many websites, especially social media, receive funding from targeted ads, and they ask you to log in after viewing a few pages. If you don’t log in, they can’t target you with ads and therefore can’t make money from you (or your data).

Search engines receive funding through ads too. They also rely on data they gather from real users as there’s a link between advertising revenue and genuine users. This data helps to enhance and develop their services. In turn, this makes them more appealing so they can then generate more revenue from advertisers.

The information that the website provides could well have cost it a lot of time and money to gather. Others might use this data without permission and without compensating the original site. It’s natural for the site to want to protect this data from anyone who wishes to obtain it quickly and easily and use it for their own purposes at far less cost than collating it themselves.

The website will incur a cost for the bandwidth and requested data returned to you. So they may look to throttle the number of queries you can make.

The way around these limitations is to spread your queries across many different proxy servers. This is especially important if you’re using datacentre servers. Exactly what resources you will need is too in-depth to go into here, as it depends on lots of factors, so make sure to discuss it in detail with any prospective provider. You can also take a look at our page offering advice about responsible scraping.
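One simple way to spread queries across a pool is round-robin rotation, where each request goes to the next proxy in the list and the list wraps around. The sketch below shows only that assignment step; the proxy addresses and the assign_proxies name are hypothetical, and real providers often handle rotation for you.

```python
from itertools import cycle

# Hypothetical proxy pool; a real pool would come from your provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in the pool, wrapping around at the end."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    return [(url, next(pool)) for url in urls]
```

Each (url, proxy) pair can then be fetched through its assigned proxy, so no single IP address carries the whole load.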

Changing your location

Every public IP address can be traced to an approximate physical location. Websites use these details to tailor the information they show, or even to deny access to the site altogether.

Some examples of how changing your location can help are:

Search engine results pages (aka SERPs). If you’re running an SEO campaign, it’s vital to check results from the correct location. Otherwise, the results you get won’t match the area, and you’ll just waste lots of time and money. Proxy servers can help you get results not only from the correct country, but may also help you get hyper-local results from specific areas. Note that if you need many results, then you will also need to include web scraping techniques to avoid blocks.

A proxy server based in a different location can help you check geo-specific information. A great example is geo-targeted ads so you can make sure that these are displayed correctly. Checking this provides the correct customer experience and ensures no waste of any advertising budget.

Sometimes a website blocks you purely because of your location. This block is because the content on the site is ‘geo-restricted’. An example would be the BBC’s iPlayer, which is only available to residents of the UK. Proxy servers help to circumvent these blocks as the website sees the physical location of the proxy server, not your own. They force your target website to treat you as if you were in the same place as the proxy.

Using a proxy server to mask your regular IP address is also how proxy servers anonymise your requests. The website just sees the IP address and location of the server, not you.

If you’re looking for servers just to appear as if you’re from another location to check ads or access geo-blocked sites, then you need a VPN solution. We don’t cover those as yet, but there are plenty of options out there.

Requirements, restrictions, outcome

Before looking for a provider, it’s vital to detail the problem you’re looking to solve. This is the first step towards finding the provider that best meets your needs. If you don’t know what you’re looking for, how can the provider be expected to help you?

Many proxy server providers don’t limit themselves to providing proxies and also offer dedicated solutions catering for specific needs, such as scraping Google or Amazon. It’s crucial to give them as much information as possible about your requirements. Being honest and open with them from the start can save you a tremendous amount of time and money further down the line.

It’s all in the preparation

Your target sites

It sounds obvious, but you need to list all of the sites you wish to access. Some sites may be sensitive to scraping and need extra care and attention. This care will mean strict adherence to human emulation settings; otherwise, the servers or service you use could become blocked or even blacklisted.

Where the servers need to be based

Your servers may need to be based in a specific location to get the correct results. If you’re scraping Google SERPs, for example, you need servers based in the same country as the top-level domain (TLD) you are querying: google.com will need servers based in the US, and google.de will need servers based in Germany. Other sites, such as Amazon, could be happy to let you query one country’s TLD with servers from another country.
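The TLD-to-country rule can be sketched as a small lookup. The mapping below contains only the two examples from the text, and the proxy pool and function names are hypothetical; extend both to match the domains and servers you actually use.

```python
from urllib.parse import urlsplit

# Illustrative mapping only, taken from the two examples above.
TLD_COUNTRY = {"com": "US", "de": "DE"}

# Hypothetical proxy pool grouped by country code.
PROXIES_BY_COUNTRY = {
    "US": ["http://us1.proxy.example.com:8080"],
    "DE": ["http://de1.proxy.example.com:8080"],
}


def country_for(url):
    """Return the country code matching the URL's TLD, or None if it is unmapped."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_COUNTRY.get(tld)


def proxies_for(url):
    """Return the proxies located in the country that matches the URL's TLD."""
    return PROXIES_BY_COUNTRY.get(country_for(url), [])
```

An empty list from proxies_for() is a signal that you have no servers in the right country for that domain, which is worth raising with your provider before you start.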

The software you will use

There are hundreds of software programs out there to help you get your results. It’s also quite simple to build your own scraper. However, not all software is efficient or runs with proper human emulation settings, so you must research any software before purchasing to make sure it’s suitable. For example, some software may only use one user agent, which may also be out of date. Failing to adhere to proper human emulation settings in this way makes your queries very easy to spot, and your target site can then block the servers.
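If you do build your own scraper, varying the user agent is straightforward. The sketch below picks one at random for each request; the strings in USER_AGENTS are made-up placeholders, not real browser identifiers, so in practice you would keep a list of current ones. Note that the user agent is only one of several human emulation settings.

```python
import random
from urllib.request import Request

# Placeholder user agent strings for illustration; use current, real ones in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def build_request(url, user_agents=USER_AGENTS):
    """Build a request whose User-Agent header is chosen at random from the pool."""
    return Request(url, headers={"User-Agent": random.choice(user_agents)})
```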

Number of queries/time restrictions

The volume of queries you can make over a set period depends on the combination of human emulation settings you need to use and the number of servers you have at your disposal. Your provider should help and advise you but, simply put, the more servers you have, the quicker you can run your queries and complete your scrape.
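A back-of-the-envelope estimate shows how the numbers interact. The sketch below assumes a deliberately simple model: every server works in parallel and sends one query every delay_per_query seconds. The function name and the model are illustrative only; real timings depend on your provider and target site.

```python
import math


def estimate_scrape_seconds(total_queries, num_servers, delay_per_query):
    """Estimate the run time when queries are split evenly across parallel servers.

    Assumes each server sends one query every delay_per_query seconds.
    """
    queries_per_server = math.ceil(total_queries / num_servers)
    return queries_per_server * delay_per_query
```

Under this model, 10,000 queries at one query every 5 seconds take 5,000 seconds on 10 servers and 1,000 seconds on 50 servers.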

Cost considerations

I appreciate you may not want to share how much you are ready to spend. You want to keep that information to yourself as a negotiation tactic. However, if you can provide a rough figure, the provider can either rule in or rule out specific services and help you manage your expectations. This clarity upfront can save valuable time otherwise spent on testing a solution that isn’t affordable or viable.