
Zend Framework Blog:
Scrape Screens with zend-dom
Feb 28, 2017 @ 16:46:27

The Zend Framework blog has posted another tutorial focusing on the use of one of the components that makes up the framework. In this latest tutorial Matthew Weier O'Phinney focuses on the zend-dom component and how to use it for scraping content from remote sources.

Even in this day-and-age of readily available APIs and RSS/Atom feeds, many sites offer none of them. How do you get at the data in those cases? Through the ancient internet art of screen scraping.

The problem then becomes: how do you get at the data you need in a pile of HTML soup? You could use regular expressions or any of the various string functions in PHP. All of these are easily subject to error, though, and often require some convoluted code to get at the data of interest.

[...] zend-dom provides CSS selector capabilities for PHP, via the Zend\Dom\Query class. [...] While it does not implement the full spectrum of CSS selectors, it does provide enough to generally allow you to get at the information you need within a page.

He gives an example of it in use, showing how to grab a navigation list from the Zend Framework documentation site (a list of items in a <ul> tag). He also suggests other uses for the tool, such as testing your application by checking the content of a page without having to hard-code specific strings.

tagged: zendframework zenddom scrape content html dom xml tutorial

Link: https://framework.zend.com/blog/2017-02-28-zend-dom.html

James Morris' Blog:
Parsing HTML with DOMDocument and DOMXPath::Query
Jun 27, 2012 @ 10:19:35

In the latest post to his blog James Morris looks at using DOMXPath's query() method to locate pieces of data in an HTML document.

The other day I needed to do some html scraping to trim out some repeated data stuck inside nested divs and produce a simplified array of said data. My first port of call was SimpleXML which I have used many times. However this time, the son of a bitch just wouldn’t work with me and kept on throwing up parsing errors. I lost my patience with it and decided to give DomDocument and DOMXpath a go which I’d heard of but never used.

He includes example code (and a sample document) showing how to extract content from an HTML structure - grabbing each of the images from inside a div and associating them with their description content.
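
As a rough illustration of the approach described above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The markup, the "gallery" id and the "item" and "desc" class names are assumptions made for this example and are not taken from the original post.

```php
<?php
// Hypothetical markup: the id and class names are assumptions for illustration.
$html = <<<HTML
<div id="gallery">
  <div class="item"><img src="/img/a.png" alt="A"><p class="desc">First image</p></div>
  <div class="item"><img src="/img/b.png" alt="B"><p class="desc">Second image</p></div>
</div>
HTML;

$doc = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting PHP warnings,
// which is useful when the HTML being scraped is not well formed.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$items = [];

foreach ($xpath->query('//div[@id="gallery"]/div[@class="item"]') as $item) {
    // Passing $item as the second argument makes the query relative to it.
    $img  = $xpath->query('./img', $item)->item(0);
    $desc = $xpath->query('./p[@class="desc"]', $item)->item(0);

    if ($img instanceof DOMElement && $desc !== null) {
        $items[] = [
            'src'         => $img->getAttribute('src'),
            'description' => trim($desc->textContent),
        ];
    }
}

print_r($items);
```

Querying relative to each container node keeps every image paired with the description that sits beside it, rather than relying on two separate lists lining up.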

tagged: dom domdocument domxpath xpath tutorial html


PHP DOM: Using XPath
Jun 26, 2012 @ 08:16:08

On PHPMaster.com today there's a new tutorial showing you how to use the XPath support that's built into PHP's DOM extension to query your XML.

In a recent article I discussed PHP’s implementation of the DOM and introduced various functions to pull data from and manipulate an XML structure. I also briefly mentioned XPath, but didn’t have much space to discuss it. In this article, we’ll look closer at XPath, how it functions, and how it is implemented in PHP. You’ll find that XPath can greatly reduce the amount of code you have to write to query and filter XML data, and will often yield better performance as well.

They start with some basic XPath queries that follow a simple path and locate the record for a specific book. There's also an example comparing XPath with the "find" methods in the DOM extension (like getElementsByTagName). Near the end there's a section on using functions in XPath, including how to call native PHP functions from inside your XPath queries.
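
A small sketch of both ideas, a plain path query and a PHP function called from XPath, might look like the following. The sample XML and its element names are assumptions for this example, not the tutorial's own data.

```php
<?php
// Hypothetical sample data; element and attribute names are assumptions.
$xml = <<<XML
<library>
  <book isbn="111"><title>PHP Basics</title></book>
  <book isbn="222"><title>Advanced XPath</title></book>
</library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Plain path query: select one book by its attribute value.
$book = $xpath->query('/library/book[@isbn="222"]')->item(0);

// Expose PHP functions to XPath under the "php" prefix.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Restrict the callable functions to the ones actually needed.
$xpath->registerPhpFunctions('strtolower');

// XPath 1.0 has no lower-case() function, so borrow PHP's strtolower()
// to make a case-insensitive "contains" test on the title.
$matches = $xpath->query(
    '//book[contains(php:functionString("strtolower", title), "xpath")]'
);

foreach ($matches as $match) {
    echo $match->getAttribute('isbn'), PHP_EOL;
}
```

Passing a function name to registerPhpFunctions() limits what an expression can call, which matters if any part of the query comes from user input.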

tagged: xpath tutorial dom introduction


PHP DOM: Working with XML
Jun 08, 2012 @ 08:27:45

On PHPMaster.com there's a new tutorial posted about using XML in PHP, an introduction to using the DOM functionality in PHP to work with your XML content.

SimpleXML allows you to quickly and easily work with XML documents, and in the majority of cases SimpleXML is sufficient. But if you’re working with XML in any serious capacity, you’ll eventually need a feature that isn’t supported by SimpleXML, and that’s where the PHP DOM (Document Object Model) comes in.

He starts with a brief introduction to XML and DTDs, including an example of each (defining the sample book information he uses in the rest of the tutorial). He helps you create a simple class that takes in the XML content, handles construction and destruction of the object, and uses it to find, add and delete a book by things like ISBN or genre.
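
The find, add and delete operations can be sketched with plain functions instead of a class. This is a simplified illustration only: the findBook() helper, the element names and the sample values are assumptions, not the tutorial's code.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML(
    '<library><book isbn="111"><title>PHP Basics</title>'
    . '<genre>programming</genre></book></library>'
);

// Hypothetical helper: return the <book> with the given isbn, or null.
function findBook(DOMDocument $doc, string $isbn): ?DOMElement
{
    foreach ($doc->getElementsByTagName('book') as $book) {
        if ($book->getAttribute('isbn') === $isbn) {
            return $book;
        }
    }
    return null;
}

// Add: build the new element tree, then attach it to the root element.
$book = $doc->createElement('book');
$book->setAttribute('isbn', '222');
$book->appendChild($doc->createElement('title', 'XML in Depth'));
$book->appendChild($doc->createElement('genre', 'reference'));
$doc->documentElement->appendChild($book);

// Delete: a node is always removed through its parent.
$old = findBook($doc, '111');
if ($old !== null) {
    $old->parentNode->removeChild($old);
}

echo $doc->saveXML();
```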

tagged: dom tutorial introduction xml


PHP Simple HTML DOM Parser: Editing HTML Elements in PHP
Sep 08, 2011 @ 10:06:07

On PHPBuilder.com today there's a new tutorial from Vojislav Janjic about using a simple DOM parser in PHP, the Simple HTML DOM Parser, to edit markup even if it isn't valid by W3C standards.

Simple HTML DOM parser is a PHP 5+ class which helps you manipulate HTML elements. The class is not limited to valid HTML; it can also work with HTML code that did not pass W3C validation. Document objects can be found using selectors, similar to those in jQuery. You can find elements by ids, classes, tags, and much more. DOM elements can also be added, deleted or altered.

They help you get started using the parser, passing in the HTML content to be handled (either directly via a string or loading a file) and locating elements in the document either by ID, class or tag. Selectors similar to those in CSS are available. Finally, they show how to find an object and update its contents, either by adding more HTML inside or by appending a new object after it.

tagged: simple html dom parse tutorial selector find replace edit


Parsing XML with the DOM Extension for PHP 5
Oct 28, 2010 @ 14:47:56

On PHPBuilder.com there's a new tutorial from Octavia Anghel about using the DOM extension to parse XML in a PHP5 application. The PHP5 extension makes working with XML messages and documents simpler than the older PHP4 DOM functionality did.

DOM (Document Object Model) is a W3C standard based on a set of interfaces, which can be used to represent an XML or HTML document as a tree of objects. A DOM tree defines the logical structure of documents and the way a document is accessed and manipulated. Using DOM, developers create and build XML or HTML documents, navigate their structures, and add, modify, or delete elements and content. The DOM can be used with any programming language, but in this article we will use the DOM extension for PHP 5. This extension is part of the PHP core and doesn't need any installation.

They include both a sample XML file to parse and the code you'll need to pull it in and make a basic DOM object out of it. Also included is some code showing how to pull out certain pieces of information, recurse through a set of XML values, add new nodes to the structure, remove a node and more.
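
As one possible sketch of recursing through a document and then adding and removing nodes, consider the following. The catalog structure, the walk() helper and the sample values are assumptions for this example.

```php
<?php
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // skip whitespace-only text nodes
$doc->loadXML(
    '<catalog><cd id="1"><title>Blue</title><year>1971</year></cd></catalog>'
);

// Hypothetical helper: recursively collect "path => text" for leaf elements.
function walk(DOMNode $node, string $path, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // ignore text nodes, comments and so on
        }
        $childPath = $path . '/' . $child->nodeName;
        if ($child->getElementsByTagName('*')->length === 0) {
            $out[$childPath] = $child->textContent; // leaf element
        } else {
            walk($child, $childPath, $out);         // descend further
        }
    }
}

$values = [];
walk($doc->documentElement, '/catalog', $values);
print_r($values);

// Add a new node to the first <cd>, then remove its <year> child.
$cd = $doc->getElementsByTagName('cd')->item(0);
$cd->appendChild($doc->createElement('artist', 'Example Artist'));
$cd->removeChild($cd->getElementsByTagName('year')->item(0));

echo $doc->saveXML();
```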

tagged: parse xml dom extension tutorial


Practical PHPUnit: Testing XML generation
Sep 17, 2010 @ 13:51:02

On the Qafoo blog today there's a new post from Tobias Schlitt about a method you can use to unit test methods that generate XML without messing with a lot of extra overhead just to test the results.

Testing classes which generate XML can be a cumbersome work. At least, if you don't know the right tricks to make your life easier. In this article, I will throw some light upon different approaches and show you, how XML generation can be tested quite easily using XPath.

He includes a sample class, qaPersonVisitor, with methods that build a simple XML document from first and last name data as a DOM element. He sets up a basic test case that creates a simple person - including gender and date of birth - and offers a few different suggestions on handling the check (in PHPUnit tests):

  • the naive way of rebuilding the DOM object and assert that they are equal
  • testing the resulting XML from the DOM object against a pre-generated XML document
  • matching the contents via CSS selectors
  • using the tag matching assertions
  • using XPath in a custom assertion (with short and long uses of it included)
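
The last option can be sketched without PHPUnit as a plain helper that evaluates an XPath expression and compares the result. The buildPersonXml() and xpathMatches() functions below are hypothetical stand-ins, not the post's qaPersonVisitor class or its custom assertion.

```php
<?php
// Hypothetical generator standing in for the class under test.
function buildPersonXml(string $first, string $last): DOMDocument
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $person = $doc->appendChild($doc->createElement('person'));
    $person->appendChild($doc->createElement('firstName', $first));
    $person->appendChild($doc->createElement('lastName', $last));
    return $doc;
}

// Hypothetical helper: evaluate an XPath expression and compare strictly.
// evaluate() returns a string for string(), a float for count(), and so on.
function xpathMatches(DOMDocument $doc, string $expression, $expected): bool
{
    $xpath = new DOMXPath($doc);
    return $xpath->evaluate($expression) === $expected;
}

$doc = buildPersonXml('Ada', 'Lovelace');

var_dump(xpathMatches($doc, 'string(/person/firstName)', 'Ada'));
var_dump(xpathMatches($doc, 'count(/person/*)', 2.0));
```

Inside a PHPUnit test case the boolean result would typically be wrapped in an assertion, so that a failing expression is reported along with the XPath that was checked.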

tagged: unittest phpunit xml generation xpath dom


Thomas Weinert's Blog:
Using PHP DOM With XPath
Apr 13, 2010 @ 13:18:32

Thomas Weinert has a recent post to his blog showing how to use one of the more powerful XML-handling features that PHP's DOM extension includes - XPath.

Often I hear people say "We use SimpleXML, because DOM is so noisy and complex". Well, I don't think so. This article explains how you can parse a XML (an Atom feed) using the PHP DOM extension. No other libraries are involved.

In his example he loads an external feed (his own) into a DOM object, blocks any errors with a few handy functions and creates a DOMXPath object on the DOM object to get ready for his queries. He shows how to make searches for titles, subtitles, looping over attributes and an element list returned from one of the first queries. A full code listing is also provided to show how it all fits together.
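
A condensed sketch of that flow follows. It assumes an inline sample feed rather than a remote one so that it runs on its own; the feed content is invented for illustration and the "atom" prefix is an arbitrary choice.

```php
<?php
// Hypothetical inline feed; the original post loads a remote Atom feed.
$atom = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <subtitle>Just a sample</subtitle>
  <entry>
    <title>First post</title>
    <link rel="alternate" href="http://example.com/1"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link rel="alternate" href="http://example.com/2"/>
  </entry>
</feed>
XML;

$doc = new DOMDocument();

// Keep libxml errors out of the output while loading.
$previous = libxml_use_internal_errors(true);
$doc->loadXML($atom);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// Atom elements live in a namespace, so a prefix must be bound for queries.
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

// evaluate() with string() returns a scalar instead of a node list.
$feedTitle = $xpath->evaluate('string(/atom:feed/atom:title)');

$entries = [];
foreach ($xpath->query('/atom:feed/atom:entry') as $entry) {
    $entries[] = [
        'title' => $xpath->evaluate('string(atom:title)', $entry),
        'href'  => $xpath->evaluate(
            'string(atom:link[@rel="alternate"]/@href)',
            $entry
        ),
    ];
}

echo $feedTitle, PHP_EOL;
print_r($entries);
```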

tagged: dom xpath domxpath tutorial search atom


Matthew Turland's Blog:
Renaming a DOMNode in PHP
Feb 10, 2010 @ 09:16:58

Matthew Turland has a new post to his blog sharing a handy trick if you've ever looked for a way to use the DOM functionality in PHP to rename a certain node in an XML document. Since the nodeName property is read-only, some trickery is required.

A recent work assignment had me using PHP to pull HTML data into a DOMDocument instance and renaming some elements, such as b to strong or i to em. As it turns out, renaming elements using the DOM extension is rather tedious.

His method doesn't update the existing node in place. Instead, it replicates the attributes and child nodes of the node you're targeting on a new node and pushes that back into the document with a call to replaceChild on the parent.
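
One way to sketch that technique is shown below. The renameElement() function is a hypothetical helper written for this example, not the code from the post, and it ignores namespaces for brevity.

```php
<?php
// Hypothetical helper: "rename" an element by replacing it with a new one.
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);

    // Copy the attributes over to the new element.
    foreach ($old->attributes as $attribute) {
        $new->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // Move the child nodes; appendChild() detaches each one from $old.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }

    // Swap the new element into the old one's position.
    $old->parentNode->replaceChild($new, $old);

    return $new;
}

$doc = new DOMDocument();
$doc->loadXML('<p>Some <b class="x">bold <i>and italic</i></b> text</p>');

// The node list is live, so take a snapshot before changing the tree.
foreach (iterator_to_array($doc->getElementsByTagName('b')) as $b) {
    renameElement($b, 'strong');
}

echo $doc->saveXML($doc->documentElement), PHP_EOL;
```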

tagged: rename dom xml node tutorial


Zend Developer Zone:
PHP DOM XML extension encoding processing
Sep 02, 2009 @ 09:48:18

On the Zend Developer Zone today Alexander Veremyev shares some helpful hints he discovered about the DOM XML extension for PHP that could come in handy when working with different character encodings.

I recently worked with PHP's DOM XML extension while working on Zend Framework's Zend_Search_Lucene HTML highlighting capabilities, and uncovered some undocumented features and issues with the extension in regards to character encoding. The information contained in this article should also apply to other libxml-based DOM implementations, as PHP's DOM extension simply wraps that library.

There are five tips he shares:

  • Internal document encoding is always UTF-8
  • Input data is always treated as UTF-8
  • Text nodes and CDATA are stored as UTF-8 without transformations
  • Document encoding does not affect loading behavior
  • Save/dumping operations and encoding

He describes each of the points and includes some sample code and XML to parse to help illustrate each.
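
The first and last points can be illustrated with a short sketch, assuming a tiny ISO-8859-1 document invented for this example. It does not attempt to reproduce the article's own samples or its findings about loading behavior.

```php
<?php
// "caf\xE9" is "café" in ISO-8859-1, where é is the single byte 0xE9.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\xE9</word>";

$doc = new DOMDocument();
$doc->loadXML($latin1);

// Values read through the DOM API come back as UTF-8,
// where é is the two-byte sequence 0xC3 0xA9.
$value = $doc->documentElement->textContent;

// Saving serializes in the document's declared encoding...
$asLatin1 = $doc->saveXML();

// ...which can be changed on the document before saving again.
$doc->encoding = 'UTF-8';
$asUtf8 = $doc->saveXML();

echo bin2hex($value), PHP_EOL;
echo $doc->encoding, PHP_EOL;
```

Printing the value with bin2hex() makes the byte-level difference visible regardless of the terminal's own character set.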

tagged: tutorial dom extension character encoding