The SitePoint PHP blog has posted a new tutorial, the first part of a new series, showing you how to create a "powerful custom search engine" with the help of the Diffbot service. This first part helps you set up everything you need, including a VM to run it in.
In this tutorial, I'll show you how to build a custom SitePoint search engine that far outdoes anything WordPress could ever put out. We'll be using Diffbot as a service to extract structured data from SitePoint automatically, and this matching API client to do both the searching and crawling. I'll also be using my trusty Homestead Improved environment for a clean project, so I can experiment in a VM that's dedicated to this project and this project alone.
He walks you through each step of the process, first creating the "crawljob" script and then executing it to gather the results. He also shows how to display this information in a simple GUI when searches are performed. A Diffbot PHP client library makes creating the crawljob simpler and lets you configure things like the maximum number of items to crawl, the patterns to match and which URLs to follow on the pages. Running the script creates the job, which is then executed immediately. The same library makes searching the data simpler too, using a "search" method along with some special tagging, and returning a JSON result with the matching records.
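To make the crawljob settings described above a little more concrete, here is a rough sketch in Python rather than the PHP client used in the tutorial. It only builds the request parameters and the search URL and makes no network calls. The endpoint URLs and the parameter names (`seeds`, `maxToCrawl`, `urlCrawlPattern`, `urlProcessPattern`, `col`, `query`, `num`) reflect Diffbot's HTTP API as best recalled and should be treated as assumptions to check against the current Diffbot documentation. The helper names, token, job name, patterns and author name are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed Diffbot v3 endpoints; verify against the official docs before use.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"
SEARCH_ENDPOINT = "https://api.diffbot.com/v3/search"


def build_crawljob_params(token, name, seeds, max_to_crawl,
                          crawl_patterns, process_patterns):
    """Return the form parameters for creating a crawljob (hypothetical helper)."""
    return {
        "token": token,
        "name": name,
        # Assumption: multiple seed URLs are separated by whitespace.
        "seeds": " ".join(seeds),
        # Assumption: the extraction API applied to each processed page.
        "apiUrl": "https://api.diffbot.com/v3/article",
        # Cap on how many pages the crawler may visit.
        "maxToCrawl": str(max_to_crawl),
        # Assumption: patterns are joined with "||".
        "urlCrawlPattern": "||".join(crawl_patterns),      # which URLs to follow
        "urlProcessPattern": "||".join(process_patterns),  # which pages to extract
    }


def build_search_url(token, collection, query, num=20):
    """Return a search URL for a crawled collection (hypothetical helper)."""
    params = {"token": token, "col": collection, "query": query, "num": str(num)}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    job = build_crawljob_params(
        token="YOUR_TOKEN",                  # placeholder, not a real token
        name="sitepoint_search",             # hypothetical job name
        seeds=["https://www.sitepoint.com"],
        max_to_crawl=1000,
        crawl_patterns=["/php/"],            # hypothetical pattern
        process_patterns=["/php/"],
    )
    print(CRAWL_ENDPOINT, job)

    # Field-scoped query syntax is an assumption about the "special tagging".
    url = build_search_url("YOUR_TOKEN", "sitepoint_search", 'author:"Jane Doe"')
    print(parse_qs(urlsplit(url).query)["query"])
```

Keeping parameter construction separate from the HTTP call makes the configuration easy to inspect before any job is created, which matters because, as the post notes, a job starts executing as soon as it is created.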