Sergey Zhuk has a new post on his site showing how to use the ReactPHP library to automatically scrape content from a site that doesn't provide another means of getting at its data.
Have you ever needed to grab some data from a site that doesn’t provide a public API? To solve this problem we can use web scraping and pull the required information out of the HTML. Of course, we could manually extract the required data from a website, but this process can become very tedious. So, it is more efficient to automate it with a scraper.
Well, in this tutorial we are going to scrape cat images from <a href="https://www.pexels.com/">Pexels</a>. This website provides high-quality and completely free stock photos. They have a public API, but it has a limit of 200 requests per hour.
He shows how to use ReactPHP to make concurrent requests asynchronously (rather than one after another) using the buzz-react HTTP client. He builds out the main Scraper class that pulls in the content for each page requested. With that in place, he then shows how he extracted the image URLs from the pages via the Symfony DomCrawler component. He finishes up by showing how to take these images and save them locally as the asynchronous process runs.
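To get a feel for the approach, here is a minimal sketch of that flow, assuming the clue/buzz-react and symfony/dom-crawler packages have been installed via Composer. The page URLs and the CSS selector below are illustrative placeholders, not values taken from the tutorial itself:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

$loop = React\EventLoop\Factory::create();
$client = new Browser($loop);

// Hypothetical search-result pages; each get() returns a promise,
// so the requests run concurrently instead of one after another.
$pages = [
    'https://www.pexels.com/search/cat/?page=1',
    'https://www.pexels.com/search/cat/?page=2',
];

foreach ($pages as $page) {
    $client->get($page)->then(
        function (ResponseInterface $response) {
            $crawler = new Crawler((string) $response->getBody());
            // 'img.photo-item__img' is a guessed selector for demo purposes.
            $crawler->filter('img.photo-item__img')->each(function (Crawler $node) {
                echo $node->attr('src'), PHP_EOL;
            });
        },
        function (Exception $error) {
            echo 'Request failed: ', $error->getMessage(), PHP_EOL;
        }
    );
}

$loop->run();
```

Saving an image locally would follow the same pattern: issue another asynchronous `get()` for each extracted URL and write the response body to disk (for example with `file_put_contents`) inside the fulfillment callback.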