Sergey Zhuk:
Fast Web Scraping With ReactPHP
Feb 12, 2018 @ 10:55:42

Sergey Zhuk has a new ReactPHP-related post on his site today showing how to use the library to scrape content from the web quickly by making use of the package's asynchronous abilities.

Almost every PHP developer has ever parsed some data from the Web. Often we need some data, which is available only on some website and we want to pull this data and save it somewhere. It looks like we open a browser, walk through the links and copy data that we need. But the same thing can be automated via script. In this tutorial, I will show you the way how you can increase the speed of your parser making requests asynchronously.

In his example he creates a scraper that goes to a movie's page on the IMDb website and extracts the title, description, release date and the list of genres it falls into. Instead of writing a blocking script that can only fetch one page at a time, he uses ReactPHP to speed things up, passing it a list of pages to fetch concurrently. He starts by walking through the setup of the package and the creation of the browser instance. He then includes the code to make the request and crawl the contents of the result for the data. The post ends with the full code for the client and a way to add a timeout in case a request takes too long or fails.
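The extraction step described above can be sketched in plain PHP. This is a minimal illustration, not the code from the post: it uses PHP's bundled DOM extension rather than whatever crawler the post uses, it leaves out the ReactPHP request handling entirely, and the markup, class names and the `extractMovie` function are assumptions made up for the example.

```php
<?php
// Hypothetical markup for illustration only; the class names are
// assumptions and do not reflect IMDb's real page structure.
$html = <<<'HTML'
<html><body>
  <h1 class="title">Example Movie</h1>
  <p class="summary">A short description.</p>
  <span class="release">12 February 2018</span>
  <ul class="genres"><li>Drama</li><li>Comedy</li></ul>
</body></html>
HTML;

/**
 * Pull the title, description, release date and genres out of one page.
 */
function extractMovie(string $html): array
{
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Helper: text content of the first node matching an XPath query.
    $text = function (string $query) use ($xpath): string {
        return trim($xpath->evaluate("string($query)"));
    };

    $genres = [];
    foreach ($xpath->query('//ul[@class="genres"]/li') as $node) {
        $genres[] = trim($node->textContent);
    }

    return [
        'title'       => $text('//h1[@class="title"]'),
        'description' => $text('//p[@class="summary"]'),
        'released'    => $text('//span[@class="release"]'),
        'genres'      => $genres,
    ];
}

print_r(extractMovie($html));
```

In an asynchronous client, a function like this would typically run in the callback that receives each response body, so every page is parsed as soon as it arrives instead of waiting for the others.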

tagged: scraping reactphp tutorial imdb movie crawl dom

Link: http://sergeyzhuk.me/2018/02/12/fast-webscraping-with-reactphp/

Run Geek Radio:
Episode 009 – Crawling Before We Can Walk
Oct 15, 2015 @ 13:43:15

The Run Geek Radio podcast, hosted by PHP community member Adam Culp, has posted its latest episode - Episode 009: Crawling Before We Can Walk.

So many startups attempt to skip the crawling stages and the MVP (minimum viable product) as they push to become successful. Just as many developers attempt to skip the vital stages of learning and forge ahead to create bugs, security holes, and poor code. Adam Culp, the host of Run Geek Radio, talks about how important it is to crawl before we can walk.

He also talks about the ZendCon and SunshinePHP conferences (he's an organizer for both) and gives an update on some of his own speaking and running happenings. You can listen to this latest episode using either the in-page audio player or by downloading the mp3 of the show. You can also subscribe to the feed to get info about future episodes as they're released.

tagged: rungeekradio ep9 podcast adamculp crawl walk startup learning zendcon sunshinephp

Link: https://rungeekradio.com/episode-009-crawling-before-we-can-walk/

SitePoint PHP Blog:
Crawling and Searching Entire Domains with Diffbot
Jul 02, 2015 @ 09:41:39

The SitePoint PHP blog has a new tutorial posted, the first part in a new series, showing you how to create a "powerful custom search engine" with the help of the Diffbot service. In this first part they help you get everything you need set up (including a VM to run it from).

In this tutorial, I’ll show you how to build a custom SitePoint search engine that far outdoes anything WordPress could ever put out. We’ll be using Diffbot as a service to extract structured data from SitePoint automatically, and this matching API client to do both the searching and crawling. I’ll also be using my trusty Homestead Improved environment for a clean project, so I can experiment in a VM that’s dedicated to this project and this project alone.

He walks you through each step of the process, first creating the "crawljob" script and then executing it to gather the results. He also shows how to display this information via a simple GUI when searches are performed. A Diffbot PHP client library makes creating the crawljob simpler and lets you configure things like the maximum number of items to crawl, patterns to match and which URLs to follow on the pages. Running the script creates the job, which is then executed immediately. The same library makes searching the data simpler too, using a "search" method along with some special tagging and returning a JSON result with the matching records.

tagged: crawl domain diffbot search engine part1 series tutorial

Link: http://www.sitepoint.com/crawling-searching-entire-domains-diffbot/

Debuggable Blog:
Crawl Google, they do the same to you
Jun 11, 2008 @ 10:23:07

On the Debuggable blog, Felix Geisendörfer has posted some code (thought up by Marc Grabanski) that goes through Google and finds the pages it has indexed for your site. The goal is to check whether the migration of a site was successful.

Just get a list of all pages Google has indexed from your site and then use that as your basis for checking if your migration worked or not. This is very convenient because you do not have to know all your own URLs yourself, and you'll only get the relevant ones (if they are not in Google they are unlikely to have traffic).

The code is included as well as an example usage. He also points out FixtureShell for more command-line CakePHP examples.
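The checking step from the quote can be sketched as follows. This is not the CakePHP shell from the post: the `findBrokenUrls` function and the injected `$fetchStatus` callable are hypothetical names introduced here, and the list of indexed URLs is assumed to have been gathered already.

```php
<?php
/**
 * Return the indexed URLs that do not answer with a 2xx or 3xx status.
 *
 * $fetchStatus is a hypothetical callable (URL in, HTTP status code out),
 * injected so the check can be exercised without network access.
 */
function findBrokenUrls(array $indexedUrls, callable $fetchStatus): array
{
    $broken = [];
    foreach ($indexedUrls as $url) {
        $status = $fetchStatus($url);
        if ($status < 200 || $status >= 400) {
            $broken[$url] = $status;
        }
    }
    return $broken;
}

// Stand-in for a real HTTP client: pretend one old URL was lost in the move.
$fakeStatus = function (string $url): int {
    return $url === 'http://example.com/old-post' ? 404 : 200;
};

$broken = findBrokenUrls(
    ['http://example.com/', 'http://example.com/old-post'],
    $fakeStatus
);
print_r($broken);
```

Against a live site, the stand-in would be swapped for a callable that performs a real request and returns the response's status code.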

tagged: crawl google migration success link cakephp framework