The problem
We have a complex XML file with many different node types and so on, supplied by a third party via a URL reference.
There’s an initial dump, and then an incremental update every couple of weeks, with additions, deletions and updates.
What I’d like to be able to do is to drive a symfony app with this data.
Some thoughts so far
I’m a symfony noob.
Getting the data
I could manually transform the XML into YAML to create the schema and the initial data load, and then build the app. But I don’t want to have to do this manually for the regular updates. Ideally, I’d like to use the symfony ORM to handle all the transactions, rather than a completely separate (outside symfony) script.
… and some images
The XML also contains URIs for remotely hosted images, and I’d like to be able to automagically pull these down to the local app file system and rewrite the references.
Anything that will help us to get started in the right direction would be greatly appreciated.
So I don’t use symfony but do use Rails, so there may be something similar (Pake?). In Rails I would create a Rake task that did this (sketched in PHP below). The Rake task:
– Loads the Rails environment (so it gets access to all of the ORM goodness)
– Uses an XML library to read in the source XML file
– Loops through each record (or whatever the unit is) and looks for updates/additions/deletions
– Uses the ORM methods to update your data
– Pulls down any new images using whatever HTTP library PHP supplies
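Since the app in question is symfony (PHP), here is a rough PHP sketch of that loop. The feed layout (a <record> element with id/action attributes and title/image children), the model classes (CatalogItem, CatalogItemPeer), and the web/uploads path are made-up placeholders to show the shape, not the third party's actual schema:

<?php
// Loop over the feed and let the ORM handle additions, updates and deletions.
// CatalogItem / CatalogItemPeer are hypothetical Propel-style model classes.
$xml = simplexml_load_file('incremental-update.xml');

foreach ($xml->record as $record) {
    $id   = (string) $record['id'];
    $item = CatalogItemPeer::retrieveByPK($id);   // primary-key lookup (assumed model)

    if ((string) $record['action'] === 'delete') {
        if ($item) {
            $item->delete();
        }
        continue;
    }

    if (!$item) {                                 // addition vs. update
        $item = new CatalogItem();
        $item->setId($id);
    }
    $item->setTitle((string) $record->title);

    // Pull the remotely hosted image down and store a local path instead of the URI.
    $imageUri = (string) $record->image;
    if ($imageUri !== '') {
        $localPath = 'web/uploads/'.basename($imageUri);
        file_put_contents($localPath, file_get_contents($imageUri));
        $item->setImagePath($localPath);
    }

    $item->save();
}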
Ah, if you’d asked that question while working with local… :) But you asked for it…
Symfony (or Rails or web framework X) is not your solution IMO. The general field is called ETL – http://en.wikipedia.org/wiki/Extract,_transform,_load – although be warned: that can mean anything from hacking up a few Bash scripts to “enterprise solutions” that push you into data warehousing. There are a ton of open source ETL frameworks and tools out there in pretty much any language (except PHP, in fact). A couple you _might_ want to look at are http://activewarehouse.rubyforge.org/etl/ and http://kettle.pentaho.org/. But that’s probably overkill – easier to hack something up yourself.
Here’s how I’d go about it…
First – assume stupidity at the far end, delivering you the data. Seriously. Assume they’re the crappiest coders imaginable. This will help you get your end right, and if you turn out to be wrong, you’ll be pleasantly surprised :)
Next, the data volume – how big is that initial dump? Anything up to ~1GB is probably easiest to handle with a single process/thread and something like Perl, Python, Ruby or PHP (on the command line); just keep it simple.
More than 1GB and you probably want to parallelize the processing in some way – Amazon’s Elastic Map Reduce would be really smart here – see http://www.youtube.com/watch?v=iMOzC835H4I for example. Just remember – threads are evil…
I’ll stick to the “then an incremental update every couple of weeks, with additions, deletions and updates” part.
What sucks about anything that involves incremental updates, especially deletions, is what happens when something gets lost – and it inevitably will (remember how dumb they are at the far end). What’s the impact? Could be the data isn’t that important, so who cares? For small data volumes, the easiest thing is to get them to send everything every time and just process the whole lot. But let’s assume the incremental updates are non-negotiable…
…then the strategy is about keeping the initial dump in some intermediate form, applying the updates to it, and then updating a database from there.
To do that, each update (and the original dump) is going to correspond to a single (SQLite) database file.
Using a stream-based XML parser (e.g. http://www.php.net/sax or http://www.php.net/xmlreader) that isn’t going to pull the whole file into memory, insert all the records from the initial dump into an SQLite DB.
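A minimal sketch of that streaming load, using XMLReader plus PDO with the SQLite driver. The <record> element name, the file names and the two-column table are assumptions about the feed; the point is that only one record fragment is ever held in memory:

<?php
// Stream the initial dump into an SQLite file without loading it all into memory.
$reader = new XMLReader();
$reader->open('initial-dump.xml');      // local file or the third party's URL

$db = new PDO('sqlite:feed-v1.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)');
$insert = $db->prepare('INSERT OR REPLACE INTO records (id, xml) VALUES (?, ?)');

$db->beginTransaction();
while ($reader->read()) {
    // Assumes each item in the feed is a <record id="..."> element.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'record') {
        $insert->execute(array($reader->getAttribute('id'), $reader->readOuterXml()));
    }
}
$db->commit();
$reader->close();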
Then, every time you receive an update, you make a copy of the last-updated SQLite database file and apply the changes to the new copy. This gives you a “versioned” history of updates, plus it allows you to continue reading from the old version while you’re still updating the new one, then switch when you’re ready (using something like ln -s and mv so it’s atomic).
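In PHP that copy-and-switch might look something like this. The paths and version numbers are invented; the one real trick is that rename() over the existing symlink is atomic on a POSIX filesystem, which is the same effect as the ln -s / mv combination:

<?php
// Hypothetical layout: one SQLite file per received update.
$previous = '/data/feed/feed-v41.sqlite';
$next     = '/data/feed/feed-v42.sqlite';

copy($previous, $next);        // start the new version from the last one
chmod($previous, 0444);        // the old version is read-only from here on

// ... apply this update's additions/updates/deletions to $next ...

// Atomically repoint the "current" symlink the app reads from:
// build the link under a temporary name, then rename() it over the old link.
symlink($next, '/data/feed/current.sqlite.tmp');
rename('/data/feed/current.sqlite.tmp', '/data/feed/current.sqlite');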
Finally, you load that file into Symfony for display on the web. If you want to actually modify the data from within Symfony, you probably need to “lock” the DB for short periods while you do the updates, i.e. once you’ve made a copy, no more updates to the old version – make it read-only until the new version is ready.
Finally, those images – well, that’s what we built the “binarypool” for – http://github.com/localch/binarypool / http://www.viddler.com/explore/cocaman/videos/4/ – although that only _really_ starts to make sense once you’ve parallelized the processing of the data and/or have multiple data sources.
Thanks for your help, Glen!
It looks like symfony has something similar – batch files and tasks that have access to the symfony application environment (see here).
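If you’re on symfony 1.x, such a batch file is roughly a PHP script that pulls in the project configuration so the ORM is available from the command line. The file name and the 'frontend'/'prod' application and environment names below are placeholders for whatever your project uses:

<?php
// batch/import_feed.php -- hypothetical name; adjust 'frontend'/'prod' to your app.
require_once dirname(__FILE__).'/../config/ProjectConfiguration.class.php';

$configuration = ProjectConfiguration::getApplicationConfiguration('frontend', 'prod', true);
new sfDatabaseManager($configuration);

// The generated model classes are now autoloaded and connected,
// so the import/update loop sketched earlier can run in here.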
Following your process above looks to be the way to go – I’ll code it up and try it out :)
Thanks!
Thanks Harry,
The initial dump is a 6MB XML file, so not very large at all. The incrementals will be much smaller.
I like the idea of creating individual databases for the dump and for staging the new stuff from the incrementals, and then switching when ready.
This gives me a good bit to go on, so I’ll get cracking and let you know how it turns out. Thanks heaps!