Zend Framework Blog:
Scrape Screens with zend-dom
Feb 28, 2017 @ 22:46:27

The Zend Framework blog has posted another tutorial on one of the components that make up the framework. In this latest post Matthew Weier O'Phinney covers the zend-dom component and how to use it to scrape content from remote sources.

Even in this day-and-age of readily available APIs and RSS/Atom feeds, many sites offer none of them. How do you get at the data in those cases? Through the ancient internet art of screen scraping.

The problem then becomes: how do you get at the data you need in a pile of HTML soup? You could use regular expressions or any of the various string functions in PHP. All of these are easily subject to error, though, and often require some convoluted code to get at the data of interest.

[...] zend-dom provides CSS selector capabilities for PHP, via the Zend\Dom\Query class. [...] While it does not implement the full spectrum of CSS selectors, it does provide enough to generally allow you to get at the information you need within a page.

He gives an example of it in use, showing how to grab a navigation list from the Zend Framework documentation site (a list of items in a <ul> tag). He also suggests other uses for the tool, including application testing, where it lets you verify page content without hard-coding specific strings.
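To give a feel for the component, here's a minimal sketch along the same lines; the URL and the 'ul.nav li a' selector are assumptions for illustration, not the post's actual code:

```php
use Zend\Dom\Query;

// Fetch the page markup; any HTTP client will do
$html = file_get_contents('https://framework.zend.com/'); // illustrative URL

// Query the document with a CSS selector; 'ul.nav li a' is a
// guess at the navigation markup, not taken from the post
$dom     = new Query($html);
$results = $dom->execute('ul.nav li a');

foreach ($results as $node) {
    // Each result is a DOMElement
    printf("%s => %s\n", trim($node->textContent), $node->getAttribute('href'));
}
```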

tagged: zendframework zenddom scrape content html dom xml tutorial

Link: https://framework.zend.com/blog/2017-02-28-zend-dom.html

SitePoint PHP Blog:
Image Scraping with Symfony's DomCrawler
Mar 31, 2014 @ 14:06:43

On the SitePoint PHP blog today there's a new post showing how to use Symfony's DomCrawler component, one of the standalone components that make up the framework, to scrape content (mostly images) from a remote website.

A photographer friend of mine implored me to find and download images of picture frames from the internet. I eventually landed on a web page that had a number of them available for free but there was a problem: a link to download all the images together wasn't present. I didn't want to go through the stress of downloading the images individually, so I wrote this PHP class to find, download and zip all images found on the website.

He talks briefly about how the class works and then walks through the code, explaining chunk by chunk what each part does over the lifecycle of the request. The end result is a Zip archive of all the images found on the remote site, packaged up for easy transport.
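As a loose approximation of that flow (the target URL is a placeholder, and this is a sketch of the same idea rather than the article's actual class):

```php
use Symfony\Component\DomCrawler\Crawler;

$url  = 'http://example.com/picture-frames'; // placeholder target page
$html = file_get_contents($url);

// Collect the src of every img tag on the page
$crawler = new Crawler($html);
$sources = $crawler->filter('img')->each(function (Crawler $img) {
    return $img->attr('src');
});

// Download each image and pack it into a Zip archive
$zip = new ZipArchive();
$zip->open('images.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE);

foreach ($sources as $src) {
    // Resolve relative paths against the page URL before fetching
    $absolute = parse_url($src, PHP_URL_SCHEME) ? $src : rtrim($url, '/') . '/' . ltrim($src, '/');
    $zip->addFromString(basename($absolute), file_get_contents($absolute));
}

$zip->close();
```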

tagged: domcrawler symfony framework component tutorial image scrape

Link: http://www.sitepoint.com/image-scraping-symfonys-domcrawler/

Matthew Turland's Blog:
Gotcha on Scraping .NET Applications with PHP and cURL
Jul 01, 2010 @ 13:51:36

New on his blog today Matthew Turland has posted about a "gotcha" he came across when working with cURL to pull down information (scrape content) from a remote .NET application.

I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I’d share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. [...] The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook.

The problem was something cURL does automatically: when CURLOPT_POSTFIELDS is given an array, it switches the request's Content-Type header to multipart/form-data. Running the fields through http_build_query first produces a urlencoded string instead, so the request goes out with the application/x-www-form-urlencoded header a standard form submission would use.
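In code, the difference looks something like this (the endpoint and field names are placeholders):

```php
$fields = [
    'username' => 'demo',
    'comment'  => 'hello',
];

$ch = curl_init('https://example.com/form.aspx'); // placeholder endpoint
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);

// Passing the array directly makes cURL send multipart/form-data:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);

// Encoding it first forces application/x-www-form-urlencoded,
// which is what a normal browser form submission produces:
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));

$response = curl_exec($ch);
curl_close($ch);
```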

tagged: net application scrape content gotcha curl

Link:

Juozas Kaziukenas' Blog:
Scraping login requiring websites with cURL
Feb 24, 2009 @ 14:44:43

Many sites keep content protected behind a login, making it difficult to pull into a script. Juozas Kaziukenas has created an option to help you past this hurdle: a PHP class (built on cURL) that POSTs the login data to the site and holds on to the resulting session ID.

But how you are going to do all this work with cookies and session id? Luckily, PHP has cURL extension which simplifies connecting to remote addresses, using cookies, staying in one session, POSTing data, etc. It’s really powerful library, which basically allows you to use all HTTP headers functionality. For secure pages crawling, I’ve created very simple Secure_Crawler class.

The class uses the built-in cURL functionality to send the POST information (in this case a username and password, though it can easily be changed to whatever the form requires) and provides a get() method for fetching other pages once you're logged in.
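A minimal sketch of the same idea, using a cookie jar to persist the session (this is an approximation, not Juozas' original class, and the field names depend entirely on the target form):

```php
class Secure_Crawler
{
    private $ch;

    public function __construct($loginUrl, array $credentials)
    {
        $jar = tempnam(sys_get_temp_dir(), 'cookies');

        $this->ch = curl_init();
        curl_setopt_array($this->ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_COOKIEJAR      => $jar, // stores the session cookie
            CURLOPT_COOKIEFILE     => $jar, // sends it on later requests
        ]);

        // POST the login form once; the session id lands in the jar
        curl_setopt($this->ch, CURLOPT_URL, $loginUrl);
        curl_setopt($this->ch, CURLOPT_POST, true);
        curl_setopt($this->ch, CURLOPT_POSTFIELDS, http_build_query($credentials));
        curl_exec($this->ch);
    }

    public function get($url)
    {
        curl_setopt($this->ch, CURLOPT_URL, $url);
        curl_setopt($this->ch, CURLOPT_HTTPGET, true);
        return curl_exec($this->ch);
    }
}

// Usage with placeholder URLs and form fields:
$crawler = new Secure_Crawler('https://example.com/login', [
    'username' => 'demo',
    'password' => 'secret',
]);
$page = $crawler->get('https://example.com/members/profile');
```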

tagged: login require scrape curl secure crawler tutorial username password

Link:

Hasin Hayder's Blog:
Making a jobsite using PHP
Jan 24, 2008 @ 20:41:38

Hasin Hayder has started a new project that he documents in a recent blog entry: the creation of a jobs website in PHP.

I was involved in making a job site few days ago. During the development, I have studied how easily anyone can develop a job site using PHP (language independent in true sense) . So I decide to write a blog post about my experience and here it goes. But note that this article is not about scaling or balancing the load on your site during heavy traffic, heh heh.

He comments on the startup process for this type of site and suggests something to consider for your own careers site: pulling job content from other sites, either by screen scraping or by using one of the job search APIs that are out there.

tagged: job website scrape content popular api

Link:

Jonathan Street's Blog:
When scraping content from the web don't make it obvious
Nov 07, 2007 @ 17:26:00

Jonathan Street has a tip for those developers who have no choice but to scrape content from a remote site - don't make it obvious. He also includes a suggestion on how to make it a little less obvious.

A couple of hours ago I was playing around scraping some content from a website. All was going well until suddenly I couldn't get my script to fetch meaningful content. [...] The first thing I did was stop visiting the site for 15 minutes or so and then increase the time between requests. It briefly worked again but quickly stopped.

One simple change to the user agent string in his php.ini made the problem evaporate, pointing to user agent filtering on the remote side. His hint covers two methods for changing the user agent your scripts send - one in plain PHP and one in cURL. An even better solution might be a rotating array that alternates between four or five strings to make the requests look more varied.
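Both techniques, plus the rotating-array idea, look roughly like this (the agent strings are illustrative values, not recommendations):

```php
// A small pool of user agent strings to rotate through (illustrative)
$agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
];
$agent = $agents[array_rand($agents)];

// Plain PHP: affects file_get_contents() and other stream requests
ini_set('user_agent', $agent);
$html = file_get_contents('http://example.com/');

// cURL: set the agent on the handle instead
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$html = curl_exec($ch);
curl_close($ch);
```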

tagged: scrape content remote server useragent filter modify phpini

Link:

MakeBeta Blog:
Scraping Links With PHP
Aug 15, 2007 @ 17:08:00

From Justin Laing over at Merchant OS there's a new tutorial on creating a simple link scraper with the help of PHP and the cURL extension.

In this tutorial you will learn how to build a PHP script that scrapes links from any web page. You learn how to use cURL, call PHP DOM functions, use XPath and store the links in MySQL.

You'll need PHP5 and the cURL extension enabled on your web server to make it all work, but the code is all there, ready to cut and paste. The application grabs the page with cURL (including the option to fake your user agent), parses the HTML with the DOM and XPath functionality to grab the links, and uses MySQL functions to store them in your database.
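The overall flow looks something like the sketch below; it swaps in PDO for the storage step (the original tutorial used PHP's MySQL functions), and the connection details and table name are made up:

```php
// Fetch the page with cURL, optionally faking the user agent
$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; LinkScraper)');
$html = curl_exec($ch);
curl_close($ch);

// Parse the markup with the DOM; the @ hides warnings from sloppy HTML
$doc = new DOMDocument();
@$doc->loadHTML($html);

// Pull every href attribute with an XPath query
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//a/@href');

// Store the links; hypothetical credentials and table name
$pdo  = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO links (url) VALUES (?)');

foreach ($hrefs as $href) {
    $stmt->execute([$href->nodeValue]);
}
```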

tagged: scrape link curl dom xpath mysql tutorial

Link:

WaxJelly Blog:
The easiest way to scrape details from a MySpace profile page with PHP
Mar 20, 2007 @ 15:41:00

From the WaxJelly blog today comes a handy bit of code for anyone looking to scrape details from just about any MySpace page, quick and easy.

It's amazing how just a little optimization on the part of myspace makes crawling their site so much easier. We're going to scrape the user detail (name, age, sex, etc..) from a profile, using the header info...

The script grabs the contents of the given URL, loops through the markup, pulls out the meta tag information, and uses that as a key to grab the rest of the user's details (including name, age, city, state, etc).
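A rough sketch of the meta tag half of that pattern (the URL is a placeholder, and the attribute names are general guesses rather than MySpace's actual markup):

```php
$html = file_get_contents('http://example.com/some-profile'); // placeholder URL

$doc = new DOMDocument();
@$doc->loadHTML($html); // silence warnings from malformed markup

// Collect name/content pairs from the page's meta tags
$meta = [];
foreach ($doc->getElementsByTagName('meta') as $tag) {
    $name = $tag->getAttribute('name');
    if ($name !== '') {
        $meta[$name] = $tag->getAttribute('content');
    }
}

// 'description' is a common meta tag; the rest depends on the site
echo isset($meta['description']) ? $meta['description'] : 'no description found';
```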

tagged: scrape myspace details meta city state country name

Link:

