LiTWol @ Oleg Terenchuk

Processing large or multiple related XML files efficiently in PHP

Submitted by litwol on Tue, 10/11/2011 - 20:33


I had to learn the hard way that a VPS with 512MB of RAM cannot handle 70MB worth of plain-text XML files: a fairly simple importer script consumed more than a gigabyte of memory and sent my VPS into swap hell. What did I do wrong, and how can I improve things?


Measuring RAM consumption in PHP

My first pitfall was a complete inability to accurately measure PHP memory consumption at run time. I made the mistake of blindly trusting the documentation for memory_get_usage, which is supposed to tell you how much memory your script is using (see the docs for memory_get_peak_usage while you are at it). Later, after my server had gone into swap, I found out that these functions report inaccurately when dealing with XML objects (specifically SimpleXMLElement). While these functions were reporting 500KB-1MB, my operating system was reporting over 650MB of RAM consumed (mileage will vary depending on the size of your plain-text XML file).

I used a crude method to observe memory usage. Run the following command in a terminal before executing the PHP script: top -b -d .001 | grep php (raise the update interval from .001 to .01 if you want to reduce CPU impact). The column labeled RES tells a much more accurate story of your script's memory consumption over its lifetime. Feel free to save the entire output to a text file for later inspection to see how your script consumes memory over time (graph it, perhaps?). If your next (or previous) assignment has to do with XML parsing, I challenge you to present a better memory assessment method. Unfortunately, after considerable time on Google and various forums, I found nothing that would accurately show memory consumption over the script's execution lifecycle.

I have prepared an example script which demonstrates the wrong PHP memory usage reporting. See it at XML memory consumption test example one, and make sure to see the actual results by running it.
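
One way to watch the gap from inside the script, rather than from a second terminal, is to read the resident set size that the Linux kernel reports in /proc/self/status and print it next to memory_get_usage(). The following is only a minimal sketch, assuming a Linux host; the helper names parse_vm_rss_kb(), current_rss_kb() and report_memory() are hypothetical and not part of PHP.

```php
<?php
// Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.
// Returns null when the field is missing.
function parse_vm_rss_kb(string $status): ?int {
  if (preg_match('/^VmRSS:\s+(\d+)\s+kB$/m', $status, $matches)) {
    return (int) $matches[1];
  }
  return null;
}

// Resident memory of the current process as seen by the kernel.
// Assumption: Linux only; returns null where /proc is not available.
function current_rss_kb(): ?int {
  $status = @file_get_contents('/proc/self/status');
  return $status === false ? null : parse_vm_rss_kb($status);
}

// Print PHP's own figure and the kernel's figure side by side.
function report_memory(string $label): void {
  $rss = current_rss_kb();
  printf(
    "%s: memory_get_usage=%d kB, VmRSS=%s kB\n",
    $label,
    intdiv(memory_get_usage(true), 1024),
    $rss === null ? 'n/a' : (string) $rss
  );
}

report_memory('start');
```

Calling report_memory() once before and once after loading a large XML file should show whether the two figures drift apart on your system.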


Efficiently handling single XML file items

Now that we can measure memory usage (accurately?), what exactly was wrong with my XML importer's approach? Naively, I used SimpleXMLElement to read the entire XML file into memory before processing individual items. As described in the previous section, this becomes fatal quickly when dealing with large(r) XML files. A more efficient approach is to use the XMLReader class. The documentation for XMLReader is some of the worst I've seen. Perhaps it's the nature of the beast (XML)? Who has time to read the specification at http://www.w3.org/TR/REC-xml/ ?

Having rewritten my script to use XMLReader to parse the XML file and pull out individual XML items, my script looked something like this:

<?php
$XMLReader = new XMLReader;
$xml_file_path = '[Path to existing plain text XML file ]';
if (!$XMLReader->open($xml_file_path)) {
  exit("Unable to open $xml_file_path\n");
}

// Move to the first "[item name]" node in the file.
while ($XMLReader->read() && $XMLReader->name !== "[item name]");
// Now that we're at the right depth, hop to the next "[item name]" until the end of tree/file.
while ($XMLReader->name === "[item name]") {
  $node = new SimpleXMLElement($XMLReader->readOuterXML());
  // *** Do something of interest with $node, this is the item we have been looking for. ***
  // Skip to the next node of interest.
  $XMLReader->next("[item name]");
}
// Release the file handle once every item has been processed.
$XMLReader->close();

Please note that I am using placeholders in several places in the script. For example, "[item name]" can be an item called "artist" in an XML file with the following content:

<artists>
  <artist>
    <name>Justin</name>
    <genre>pop</genre>
  </artist>
  <artist>
    <name>Christina</name>
    <genre>pop</genre>
  </artist>
</artists>

You may have noticed that I am cheating a little. After all the trash talk about SimpleXMLElement, I still end up using it in my final script. This is intentional, and it works because instead of loading the entire XML file into memory with SimpleXMLElement, I now load a single XML item per iteration. That is much more efficient, with relatively constant memory consumption of several MB at most. A HUGE improvement over the first implementation, which consumed anywhere from 650MB to over 1GB for a mere 70MB XML file.
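
The same loop can be wrapped in a generator so the calling code simply iterates over items. This is a sketch under assumptions, not the script I actually ran: the function name stream_items() is hypothetical, and the demo writes the artists sample from above to a temporary file so that it runs on its own.

```php
<?php
// Yield each <$item_name> element of the file as its own SimpleXMLElement,
// so only one item is held in memory at a time.
function stream_items(string $path, string $item_name): Generator {
  $reader = new XMLReader();
  if (!$reader->open($path)) {
    throw new RuntimeException("Cannot open XML file: $path");
  }
  try {
    // Advance to the first element with the requested name.
    while ($reader->read()) {
      if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $item_name) {
        break;
      }
    }
    // Hop from one matching element to the next until none are left.
    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $item_name) {
      yield new SimpleXMLElement($reader->readOuterXml());
      if (!$reader->next($item_name)) {
        break;
      }
    }
  } finally {
    // Runs when iteration finishes or the generator is destroyed early.
    $reader->close();
  }
}

// Demo: the artists sample, written to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
  $path,
  '<artists><artist><name>Justin</name><genre>pop</genre></artist>'
  . '<artist><name>Christina</name><genre>pop</genre></artist></artists>'
);

$names = array();
foreach (stream_items($path, 'artist') as $artist) {
  $names[] = (string) $artist->name;
}
unlink($path);
print_r($names);
```

Keeping the XMLReader bookkeeping inside the generator means the processing code never sees more than the current item.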


Efficiently handling objects spread across multiple XML files

My project was complicated a little further by the fact that I had to process four normalized XML files to reconstruct a single complete object before storing it in a database, which created even higher memory demand.

What do you do when a single entity is the product of individual XML items pulled from multiple XML files? My usual solution in this case was to read the XML files into arrays indexed by a unique item ID and then combine the item pieces from the different arrays before storing them in a database as a single object. As described above, this solution is not ideal, and in many cases it is impossible to execute due to RAM limitations. Nowadays I find it highly arrogant of me to have thought I could implement data indexing and storage better than systems created solely for that purpose, such as MySQL or MongoDB. Not too long ago I would have considered involving mammoth software like a database server in a seemingly simple script to be unnecessary complexity. An amateur mistake, as I will demonstrate below. My preference was to use MongoDB for its ability to store objects of arbitrary structure without a schema (unlike SQL solutions, which IMHO create truly unnecessary overhead).

Essentially, what the above says is: instead of loading all XML items into memory for denormalization, load individual XML items from their respective files and store them in a database of your choice (my choice was MongoDB), indexed by a unique identifier field. Once ALL plain-text XML files have been transferred into the database, you can begin retrieving items individually to reconstruct denormalized objects. Once you have finished reconstructing all objects, delete the database tables used to store the XML data temporarily.

So how is this different from just storing the content of the entire XML file in RAM during script execution? Quite simply, when your XML items are stored in a database, most of the storage burden is placed on your hard drive, and only the appropriate indexes are retained in RAM for faster item retrieval. In my case I had to deal with 4 normalized files, which I transferred into 4 database collections (a "collection" in MongoDB terminology is the equivalent of a "table" in an SQL database), and then retrieved one item from each collection to reconstruct a denormalized object before placing it back into the database. In PHP it looks something like this:

<?php
// START EXTRACTING XML DATA FROM PLAIN TEXT FILES AND PLACING THEM IN DATABASE.
$files = array(
  'file1.xml',
  'file2.xml',
  'file3.xml',
  'file4.xml',
);

$m = new Mongo(); // Connect. Note: Mongo is the class from the legacy PHP "mongo" driver.
$db = $m->selectDB("example"); // Choose the appropriate database.
$collection = $db->selectCollection("temporary_xml_store_collection"); // Choose the appropriate collection. We will store our XML (converted to array) data here.
$collection->ensureIndex(array('unique_id' => 1)); // Index the field we will use later to look pieces up.
foreach ($files as $file) {
  $XMLReader = new XMLReader;
  $xml_file_path = $file;
  $XMLReader->open($xml_file_path);

  // Move to the first "[item name]" node in the file.
  while ($XMLReader->read() && $XMLReader->name !== "[item name]");
  // Now that we're at the right depth, hop to the next "[item name]" until the end of tree/file.
  while ($XMLReader->name === "[item name]") {
    $node = new SimpleXMLElement($XMLReader->readOuterXML());
    // Convert from SimpleXMLElement object into array structure.
    // A JSON round trip is one common way to do this; attributes end up under an "@attributes" key.
    $arr_node = json_decode(json_encode($node), true);
    // Make sure that the unique ID is set, this is what we will use later to retrieve individual objects.
    $to_store = array(
      'unique_id' => $arr_node['unique_id'], // Being extra obvious for the sake of example.
      'data'      => $arr_node,
      'type'      => str_replace('.', '_', $file), // Convert a string like "file1.xml" into "file1_xml", used later.
    );
    // Save data in database.
    $collection->save($to_store);
    // Skip to the next node of interest.
    $XMLReader->next("[item name]");
  }
  $XMLReader->close();
}
// FINISHED EXTRACTING XML DATA FROM PLAIN TEXT FILES AND PLACING THEM IN DATABASE.

// START RETRIEVING NORMALIZED ITEMS FROM DATABASE AND RECONSTRUCTING DENORMALIZED OBJECTS.
// Now that the content of all XML files resides in the database, we can start retrieving related items to denormalize them.
// Note that $collection has already been initialized above and connected to the appropriate database. We need a new collection for denormalized objects.
$denormalized_collection = $db->selectCollection('denormalized_items');
$results = $collection->find(array('type' => 'file1_xml')); // In this example I use file1_xml content as the initial skeleton for the final object.
while ($results->hasNext()) {
  $item = $results->getNext();
  // Now fetch the rest of the items, skipping the skeleton item itself.
  $pieces = $collection->find(array(
    'unique_id' => $item['unique_id'],
    'type'      => array('$ne' => 'file1_xml'),
  )); // This will fetch N items to be pieced back into a single denormalized object.
  foreach ($pieces as $piece) {
    // Do something here to reconstruct the denormalized object.
    $item['pieces'][] = $piece; // Just a silly example.
  }

  // And now that we have denormalized an item, store it in the database.
  $denormalized_collection->save($item);
}
// THE END.
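
To try the stage-then-denormalize flow without a MongoDB server, the same idea can be sketched against an in-memory SQLite database through PDO. SQLite is swapped in here purely for illustration, in place of the MongoDB I actually used; the table name staged_items, the sample rows, and the helper denormalize() are all hypothetical, and the sketch assumes the pdo_sqlite extension is available.

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE staged_items (unique_id TEXT, type TEXT, data TEXT)');
$pdo->exec('CREATE INDEX idx_staged_unique_id ON staged_items (unique_id)');

// Stage the normalized pieces, as the XMLReader loop would do for each file.
$insert = $pdo->prepare('INSERT INTO staged_items (unique_id, type, data) VALUES (?, ?, ?)');
$staged = array(
  array('a1', 'file1_xml', array('name' => 'Justin')),
  array('a1', 'file2_xml', array('genre' => 'pop')),
  array('a2', 'file1_xml', array('name' => 'Christina')),
  array('a2', 'file2_xml', array('genre' => 'pop')),
);
foreach ($staged as $row) {
  $insert->execute(array($row[0], $row[1], json_encode($row[2])));
}

// Yield one rebuilt object per skeleton row, keyed by unique_id.
// Rows are fetched one at a time, so memory use stays flat.
function denormalize(PDO $pdo, string $skeleton_type): Generator {
  $skeletons = $pdo->prepare(
    'SELECT unique_id, data FROM staged_items WHERE type = ? ORDER BY unique_id'
  );
  $pieces = $pdo->prepare(
    'SELECT data FROM staged_items WHERE unique_id = ? AND type <> ? ORDER BY type'
  );
  $skeletons->execute(array($skeleton_type));
  while ($row = $skeletons->fetch(PDO::FETCH_ASSOC)) {
    $item = json_decode($row['data'], true);
    $pieces->execute(array($row['unique_id'], $skeleton_type));
    while (($piece = $pieces->fetchColumn()) !== false) {
      // Merge each remaining piece into the skeleton.
      $item = array_merge($item, json_decode($piece, true));
    }
    yield $row['unique_id'] => $item;
  }
}

// Collected into an array only to keep the demo short.
$objects = iterator_to_array(denormalize($pdo, 'file1_xml'));
print_r($objects);
```

In a real run each yielded object would be written straight back to the database instead of being collected in an array.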

My final implementation is actually split into 2 separate execution steps (2 scripts which execute one after another, rather than a single large script). The first step reads the XML files and loads the items into the database, and the second step denormalizes the items in the database. I separate the steps so that I can execute the second one multiple times in parallel, further increasing the number of items processed per unit of time.

In my final implementation I was able to decrease memory consumption from over a gigabyte of RAM down to 35MB, as well as increase the number of items denormalized from 30-40 per second to over 300 per second.
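
The description above leaves open how parallel runs of the second step avoid working on the same item twice. One possible approach, offered only as an assumption and not as what my scripts do, is to give each run a worker index and let it handle only the IDs whose hash falls into its slot. The function name belongs_to_worker() and the command-line arguments are hypothetical.

```php
<?php
// Decide whether the given worker owns the given item ID.
// The mask keeps the checksum non-negative on 32-bit builds of PHP too.
function belongs_to_worker(string $unique_id, int $worker, int $total_workers): bool {
  return (crc32($unique_id) & 0x7fffffff) % $total_workers === $worker;
}

// Worker index and worker count come from the command line, e.g.
//   php step2_script.php 0 3
// Both default to a single worker when no arguments are given.
$worker = isset($argv[1]) ? (int) $argv[1] : 0;
$total_workers = isset($argv[2]) ? max(1, (int) $argv[2]) : 1;

// Sample IDs standing in for the unique_id values stored in the database.
$ids = array('a1', 'a2', 'a3', 'a4', 'a5');
$mine = array_values(array_filter($ids, function ($id) use ($worker, $total_workers) {
  return belongs_to_worker($id, $worker, $total_workers);
}));
printf("worker %d of %d handles %d item(s)\n", $worker, $total_workers, count($mine));
```

Because every ID maps to exactly one slot, the runs need no coordination with each other beyond agreeing on the total number of workers.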


So what have I learned from this?

  1. PHP memory reporting is horrible at best. Use this trick to observe memory consumption in real time (when testing your script in a dev environment of course, not in production) by running this command in a terminal: top -b -d .001 | grep php
  2. Do not be afraid to use the power of a real database system, even in the simplest scripts/projects.
  3. Parallelize your scripts to increase throughput by breaking the full execution line into multiple specialized scripts which can be run multiple times in parallel (e.g. run "php step2_script.php" two or three times at the same time, each in a different terminal window).
Tags:
  • php
  • xml

LiTWoL © Oleg Terenchuk - Hosted on Linode.com 512