Generating Large XML Data Files

Just recently I needed some large example XML data files to prepare for some blog content. Now, I thought it would be relatively easy to find 1GB-plus files for download but, as it turns out… not so. What I did have, though, was a lot of big databases, so I decided to generate my own large data set from one of these instead. It's not complex at all, but there are some provisos you need to be aware of.

  1. Make sure that you actually have enough disk space! I wrote the necessary code for this, set it running and then had visions of bricking my machine with a full drive. Luckily this was not the case in this instance…
  2. You will temporarily need to enable ‘xp_cmdshell’ which is of course disabled by default within SQL Server, and rightly so. You could ‘have pens out and everything’ with this enabled!
  3. I know, I know! Who really uses XML nowadays? Well, my answer to that is generally people who have been given a dump of an application database by vendors who wish to inflict as much pain as possible on (generally) ex-clients.

So the basic premise of this is that we will use the ‘FOR XML’ keywords to turn a table full of data into an XML representation of that data. Once we have that data we will utilise that command-line legend of the SQL world, ‘bcp’ (Bulk Copy Program), to output the data to a physical file on the drive.

Now, as I have already said, xp_cmdshell is not enabled by default, so we need to enable it using sp_configure. The thing is, it's such a dangerous (in the wrong hands) piece of functionality that Microsoft have hidden it away, so you also have to enable the advanced options first. The following piece of script, assuming you have permissions to do so, enables xp_cmdshell (note that each sp_configure change needs a RECONFIGURE to take effect):

sp_configure 'show advanced options', 1
RECONFIGURE

sp_configure 'xp_cmdshell', 1
RECONFIGURE
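
If you want to confirm the switch actually flipped, you can query sys.configurations — a quick sanity check, not part of the original script:

```sql
-- Sanity check: value_in_use should read 1 once RECONFIGURE has run.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('show advanced options', 'xp_cmdshell');
```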

Getting the data into XML format is trivially simple:

SELECT * FROM DSCAgilityDW.dbo.JobFacts AS F FOR XML AUTO, ELEMENTS, TYPE

As you can see, we have used FOR XML here to instruct the query engine to return the data as XML. Let's just take a second to describe the extra pieces of syntax we find in there:

  1. The AUTO keyword ensures that the data is represented as a nested tree.
  2. The ELEMENTS keyword ensures that the XML generated outputs each field as a sub-element rather than as mucky XML attributes.
  3. The TYPE keyword. Now at this point reality bites, as it so often does. I had to add this parameter to ensure that the resultant documents don't have randomly added carriage returns introduced by the BCP process. The TYPE keyword forces the FOR XML construct to return the data as typed XML (not as varchar data) and without any embedded carriage returns. HOWEVER… this is not without consequence, as I can now only create files of less than 2GB. This restriction comes from the maximum size of the XML data type.
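
To make the AUTO and ELEMENTS behaviour concrete, here is a minimal sketch against a hypothetical two-column table (the table name, columns and data are illustrative only, not from the JobFacts table above):

```sql
-- Hypothetical table for illustration only.
CREATE TABLE #Jobs (JobId INT, JobName VARCHAR(50));
INSERT INTO #Jobs VALUES (1, 'Nightly load');

-- Attribute-centric (AUTO without ELEMENTS): columns become attributes.
SELECT JobId, JobName FROM #Jobs AS J FOR XML AUTO, TYPE;
-- <J JobId="1" JobName="Nightly load" />

-- Element-centric (AUTO with ELEMENTS): columns become child elements.
SELECT JobId, JobName FROM #Jobs AS J FOR XML AUTO, ELEMENTS, TYPE;
-- <J><JobId>1</JobId><JobName>Nightly load</JobName></J>

DROP TABLE #Jobs;
```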

So if your database results in files larger than 2GB you will need to look at processing the data in batches; maybe I could extend this blog at a later date… If you'd like to find out a little more about what you can do with the FOR XML command you can look here. Given that we now have our SQL statement we can take this and append it into our Bulk Copy statement as below.
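
As an aside, if you do hit that 2GB ceiling, one way to batch the export is to page through the table and write one file per page — a rough sketch only, which assumes the table has a sortable key (JobId here is an assumption) and that the loop bound is derived from your own row counts:

```sql
-- Sketch: export in 1,000,000-row pages, one XML file per page.
-- Assumes xp_cmdshell is enabled and JobId is a sortable key (illustrative).
DECLARE @batch INT = 0, @sql VARCHAR(2000);
WHILE @batch < 5  -- loop bound illustrative; derive it from COUNT(*) in practice
BEGIN
    SET @sql = 'bcp "SELECT * FROM DSCAgilityDW.dbo.JobFacts AS F '
             + 'ORDER BY F.JobId OFFSET ' + CAST(@batch * 1000000 AS VARCHAR(20))
             + ' ROWS FETCH NEXT 1000000 ROWS ONLY '
             + 'FOR XML AUTO, ELEMENTS, TYPE" queryout "C:\XMLBigData_'
             + CAST(@batch AS VARCHAR(10)) + '.xml" -c -T';
    EXEC xp_cmdshell @sql;
    SET @batch += 1;
END
```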

EXEC xp_cmdshell 'bcp "SELECT * FROM DSCAgilityDW.dbo.JobFacts AS F FOR XML AUTO, ELEMENTS, TYPE" queryout "C:\XMLBigData.xml" -c -T'

As you can see, this instructs the bcp process to run using the XML query we created above and output the result to a data file called XMLBigData.xml located in the root of the C: drive. The -c parameter instructs the BCP engine to use the char type for the data rather than prompting the user for a format, whilst -T is used to connect using a trusted connection. Other parameters you may need can be found here should you need them.
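
Once the file exists, a quick round-trip check I find handy (and, again, only feasible under the same 2GB XML limit) is to load it straight back in and cast it — the CAST fails if bcp produced anything other than well-formed XML:

```sql
-- Round-trip check: the CAST raises an error if the file is not well-formed XML.
SELECT CAST(BulkColumn AS XML) AS Doc
FROM OPENROWSET(BULK 'C:\XMLBigData.xml', SINGLE_BLOB) AS f;
```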

Finally, we clean things up by switching off xp_cmdshell and hiding it from prying eyes.

sp_configure 'xp_cmdshell', 0
RECONFIGURE

sp_configure 'show advanced options', 0
RECONFIGURE

This was exactly what I needed to generate my large XML files. I'm off now to crash some servers and generally cause all sorts of drive-space trouble. Hopefully you will too, although with that 2GB limit…… 😦