Upsert [sic] with SAP Data Services? Don’t cry though….

So, recently I find myself heavily involved with SAP Data Services and data migrations. I know what you’re thinking… that’s not real code! And you’re damn straight, it isn’t, but I have always leaned heavily towards the data end of the spectrum rather than the UX end, which is filled with sorcery, witchcraft and, more importantly, users who are like little pixies with no respect for the sanctity of code; data is data though, and it doesn’t press buttons marked ‘Nuclear Button, Do Not Press’ just for kicks. I thus don’t complain too much (ahem) when ETL jobs come my way. One of the biggest challenges with ETL in the SAP world is to keep everything down at the database level; it’s very easy to slip up and watch your execution times rattle up to unusable very quickly, all because you used a feature for which there is no direct database equivalent, and at times it is frustrating how the simplest functions are not covered; bygones.

The idea for this blog came about when I was trying to establish how to delete (archive) a load of data, which is of course easy if you use the Map Operation transform; the issue with that, of course, is that none of it gets pushed down to the database server. To my knowledge there is currently no way of achieving this (without recourse to the SQL transforms, which are of course frowned upon). I needed to think outside the box a little. Luckily, as I always use intermediary tables to compile my data prior to completion, the answer was fairly simple:-

  1. Compile data as normal in intermediary table
  2. Mark any candidate records in intermediary table as ‘Archived’ using a database field
  3. Omit ‘Archived’ records when inserting records from intermediary table into main target table
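In plain SQL terms, the last two steps amount to something like the sketch below. The predicate in step 2 really comes from the joins built in the job, and the FactSales column names here are hypothetical:

-- Step 2: flag an archive candidate in the intermediary table
-- (illustrative predicate; the real one is derived from joins)
UPDATE FactSalesIntermediary
SET    Archived = 1
WHERE  SID = 10002;

-- Step 3: load only the live rows into the main target table
INSERT INTO FactSales (SID, SaleAmount)
SELECT SID, SaleAmount
FROM   FactSalesIntermediary
WHERE  Archived = 0;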

And that led me into using a feature that (until you know the toolset a little better) appears to be missing, namely how to execute an Update statement (using FULL pushdown) against existing data. Oh sure, again it’s very easy to just add a Map Operation and convert a flow of data into an ‘Update’, but this of course results in an RBAR (row by agonising row) operation, as the BODS engine will iterate over each row of data and then change its op code so that it is an ‘update’ not an ‘insert’. The end result is that the BODS engine creates individual update statements and we quickly have a non-performant data transformation; furthermore each row of data will compound the problem, and before very long we will have a drawing board to which we will have returned.

I must admit that how to achieve Updates with a SQL pushdown did escape my sights for a little while before I realised that the guys from SAP have allowed for it. There’s a nice feature called ‘Merge’ that was introduced in SQL Server 2008 (and I’m certain that other database providers also allow for it) which lets you specify, within one SQL statement, how to match a series of records from a data source with another series of rows from a database table; you specify how to join these disparate and distinct sets of data and then, in the case of a match or a non-match from that join, specify how to update/insert the data. For me the SQL is a little verbose for everyday use and I instead prefer to break it into distinct operations, but then I’m a little bit old school…. The real bonus about this merge methodology is that, with very little effort, SAP have encapsulated this functionality to offer a true database-level Upsert mechanism in Data Services Designer. Of course the normal restrictions apply in that you must be operating against, and joining on, data that has a unique key, but that is relevant to the database provider and not to Data Services itself.
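Just to give a flavour of the syntax, here is a minimal sketch against two hypothetical tables (this is not the statement Data Services generates, which we will see shortly):

MERGE INTO TargetCustomers AS t
USING SourceCustomers AS s
    ON t.CustomerID = s.CustomerID
WHEN MATCHED THEN
    -- the row exists in the target, so update it
    UPDATE SET t.CustomerName = s.CustomerName
WHEN NOT MATCHED THEN
    -- the row is new, so insert it
    INSERT (CustomerID, CustomerName)
    VALUES (s.CustomerID, s.CustomerName);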

So let’s take a look at how to drive this functionality. Quite simply, we start with our simple transformation as shown below:-


As you can see we are using the same FactSalesIntermediary table for our source data and target data; in this instance we are inner joining to the other tables to establish which data can be logically archived off. In our scenario, as FactSalesIntermediary is the driver table, we can be sure that we will only ever be updating records, and so in this instance the Insert part of the upsert methodology is not actually used at all (this is actually a reasonably normal use case, as using upsert is the only effective way in SAP Data Services Designer to push an update down entirely to the database server). If we take a look at the configuration of the table that we are updating you should see the following:-


Note that ColumnComparison is set to ‘Compare by Name’ and that the only two fields we are interested in are the SID (which is our unique key) and the archive flag, which indicates that we are to archive off the record with the current SID. If you think about this from a purely SQL point of view, this is enough information to compile the following:-

UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10002

The rest of the fields, as they are not inputs, will of course be ignored. You should also note the other two advanced feature settings that have been highlighted, namely ‘Auto Correct Load’ and ‘Allow Merge or Upsert’; selecting both of these allows the Data Services engine to create an Upsert SQL statement (instead of the normal SQL ‘Insert’ statement) that can be pushed directly down to the database, thus avoiding any RBAR processing. Let’s take a look at the SQL that is generated:-



[Screenshot of the generated SQL; the join condition reads: ON ((s.SID = n.SID))]
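Pieced back together from the breakdown below, the generated statement looks roughly like this (a sketch only; the source query and column list the engine actually emits will be fuller):

MERGE INTO FactSalesIntermediary AS s
USING (
    -- the source query built in our Query transform; the real one
    -- inner joins to the other tables to find the archive candidates
    SELECT SID, 1 AS Archived
    FROM   FactSalesIntermediary
) AS n
ON ((s.SID = n.SID))
WHEN MATCHED THEN
    UPDATE SET s.Archived = n.Archived
WHEN NOT MATCHED THEN
    INSERT (SID, Archived)
    VALUES (n.SID, n.Archived);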
So let’s break it down, part by part:-

  1. As you can see, the first part of the SQL creates a Merge statement, which sets up the whole syntactical structure and also aliases the table we wish to update to ‘S’.
  2. We then have the SELECT statement that we built in our query, detailing which source data to use and aliasing all of this to a virtual table called ‘N’.
  3. We then specify how these two aliased sources join together, in this instance joining ‘S’ and ‘N’ by the SID fields.
  4. We then have the nuts and bolts of the upsert itself; we start with the WHEN MATCHED clause, which effectively gives us our ‘UPDATE’ statement.
  5. And finally we have the WHEN NOT MATCHED clause, which in this scenario we actually ignore but in other circumstances would give us our INSERT statement.

And that really is all there is to it. When the example is run I see that 150 records are updated and none are inserted, which is what I expect to see. When I then populate the main FactSales table I can ignore all of the ‘archived’ records.

Upsert? Please don’t cry…..


Coping with Large Dimensional Workflows

I have recently been working on a third tranche of ETL work for a client, which is basically a reasonably complex data warehouse. It’s been a great project and I’ve loved getting down and dirty with their data, although there have been some frustrating elements to the project which, as I have not worked on such a complex linear project before, I did not foresee as being an issue. This blog is a discussion of the main issue, for which I will attempt to give a working strategy that can be used in preference.

To lay out the issue that I encountered, it is necessary to detail first what we are trying to achieve. We have a large application database that is being constantly updated by a large number of systems and processes. Overnight we extract any data that has changed and apply/add these changes to the data warehouse for reporting purposes. We are only interested in the raw data, not the identifying keys within the source database, as we will create our own keys within the data warehouse. We then import all of this data into a staging table, from where we sieve the data and create one fact table with foreign key relationships linking it to many, many dimensional data tables. So far, uncontroversial…


Where the difficulties arise is in exactly how you implement this, and moreover how much data we are talking about. Traditionally I would take this approach to a simple migration:-

  1. Import Source Data into Staging Area
  2. Create all missing Dimensional Data (from staging data columns)
  3. Create Fact Table, mapping in Dimensional Foreign Keys using lookup components as I go. See pic below.


Step 3 is the issue here; in a small or medium solution it works without issue, but when you start getting north of 50 dimension fields in a fact table DTS becomes very unresponsive, especially when loading/saving the ‘Merge Join’ transformation. In my instance I have about 100 of these dimensions and it’s starting to creak badly. Not to mention… if you need to make changes to a field in the source, you find that you have to propagate this metadata down through every single merge join component by opening it (it will inform you that it is incorrect and ask if you would like to fix the issues, to which you say yes), which is time consuming! At least you do NOT need to save each merge, which is a godsend, as with save times of about a minute and a half for each component that’s 3 hours out of your day for no good reason.

So how do you get around this issue? Well, alarmingly simply really, although I have yet to prove this in the wild… The root of the problem is using a merge component to keep the data together (which is necessary as we’re loading from one table and saving to another in my instance). What we need to do is eliminate this costly merge, and the way we do that is by resolving the foreign key fields in the staging data table before just doing a wholesale dump of the data into the facts table. The workflow will be thus (a SQL sketch of steps 4 and 5 follows the list):-

  1. Load all of the staging data using one SQL statement (as we did previously).
  2. Split the data as we did previously using a Multicast (one for each column that we are updating)
  3. Lookup each column using a lookup component (as we did previously)
  4. Update the ForeignKey field for the column we are looking up within the StagingTable using an OLEDB command.
  5. Use a straight data dump to move the data from the Staging Table to the Facts table applying any filters that you would normally have done
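For concreteness, the statement behind the OLEDB command in step 4 is just a parameterised update along these lines, with the two parameters mapped from the lookup output (the table and column names here are hypothetical):

-- Run once per row by the OLEDB command; '?' marks the mapped parameters
UPDATE StagingSales
SET    ProductKey = ?
WHERE  StagingID = ?

Step 5 then collapses to a single set-based statement:

-- Wholesale dump from staging into the fact table, filtering as normal
INSERT INTO FactSales (ProductKey, SaleAmount)
SELECT ProductKey, SaleAmount
FROM   StagingSales
WHERE  ProductKey IS NOT NULL;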


Where the warehouse is large this should make the whole editing and refactoring process for a given job 100 times more pleasurable and performant than before. Sadly it is a little bit late in this iteration for me to refactor my DTS package to work like this, but I will certainly be adopting this methodology in the future for all DTS work, as the improvements in productivity are immeasurable. I also think that the setup costs are a lot less too, as you do not need to worry about synchronising the name of the output column with its real name in the Lookup component.

As I said, I have yet to test this in anger but hopefully it will act as a heads up for anybody embarking on such an enterprise.

Grouping multiline SQL data using XML

A friend of mine recently asked for help with a bit of SQL he was working on. True, he was in a faraway land lording it up and I could have ignored him on general principles, but the way I look at it is this… if he’s out there and in work he’s not over here making a nuisance of himself (only joking George). So I decided to help, selfless creature that I am!

He had a data-set, in this instance nurse information, whereby each nurse could have multiple registrations. What he wanted to achieve was to show a distinct list of nurses with all of their registrations on one line, comma separated. I remembered a couple of approaches that he could have taken, but I thought that for this instance the best suited would be to use an undocumented feature of the FOR XML statement. So we started with the following table of nurses and their registrations:-
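Illustratively, a stand-in for that table (these names and rows are hypothetical, chosen only to match the column names used in the SQL further below):

CREATE TABLE NurseRegistrations
(
    NurseID   INT,
    NurseName VARCHAR(50),
    RegNo     VARCHAR(20)
);

INSERT INTO NurseRegistrations (NurseID, NurseName, RegNo)
VALUES (1, 'Adams', 'RN001'),
       (1, 'Adams', 'RM014'),
       (2, 'Baker', 'RN007');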


We then used a SQL statement along these lines:-
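This is a reconstruction from the explanation that follows, assuming the NurseRegistrations layout sketched above:

SELECT DISTINCT
       N.NurseID,
       N.NurseName,
       -- build a space-separated list of registrations, then
       -- trim it and swap the spaces for commas
       REPLACE(RTRIM((
           SELECT NR.RegNo AS 'data()'
           FROM   NurseRegistrations AS NR
           WHERE  NR.NurseID = N.NurseID
           FOR XML PATH('')
       )), ' ', ',') AS Registrations
FROM   NurseRegistrations AS N;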


So, a brief explanation. We are using a sub-select here with FOR XML PATH and an empty nodeset (XML PATH('')). This is the hacky part, as calling XML PATH with an empty nodeset is not strictly supported and as such is an unintended action. The AS 'data()' statement is required as this removes the XML tags that would otherwise be generated. As you can see, the outer select retrieves distinct nurses and the sub-select joins to that distinct Nurse ID. The FOR XML PATH creates a space-separated list of strings, which is great; this means that we can perform a trim to remove the trailing spaces and then comma separate them. The final thing to do is replace all embedded spaces with a comma character. Dead simple, although this does rely on your data NOT having embedded spaces itself. If it does, I would suggest appending two spaces to the inner select like so:-

SELECT RegNo + '  ' AS 'data()'   FROM NurseRegistrations AS NR

This way the trim functions will still work, and all you need to do is change your replace to use two spaces instead of one.

My original SQL yielded the following results.


Now this solution does offend me in some ways… the fact that I have had to resort to RBAR really grates on me, but today I find myself with little time, and purities of perfect SQL will have to be put on hold. I am sure, though, that I can improve this given enough time to investigate further, and when I do you can be assured I will update this blog.

In the meantime, I hope it helps you……. George! And stop moaning about the weather!

Improve Readability using ROW_NUMBER()

What did I ever do without the T-SQL ROW_NUMBER() functionality? I do remember, way back in days of yore (when I had youth, less weight and even less sleep), wishing that with straight SQL I could achieve a row count within my data. I seem to remember we cobbled something together but it was less than satisfactory. Code, it seems, is always about compromise…

Just recently I have been set some work that involved grouping and sorting, and in addition I needed to be able to make the SQL code seemingly intelligent about where we were within the data. What we had was an Intake amount that could apply to ‘n’ records that were being detailed. We were ordering the data in such a way that only the first record should really output the Total Intake amount (as this applied to several underlying rows also). It was not helpful to look at 3 rows where the total Intake appeared to be 100 for each, where in fact it was 100 for the 3 rows.

To state the problem, I started with a database structured thus: a Campuses table holding all of the campuses where a student could attend a course; next, the courses themselves (this is just a sample so the data is not normalised! Excuse my faux pas…); and finally the mapping table that brings this data all together.

So to start with I began with a simple SQL select to join all of this data together sensibly, returning only the data that a human (if we could only find one) wanted to see. Nothing onerous here: for each student we are bringing back their course, level and capacity on a 1-to-1 basis. Not very useful, as what I would actually like to see is all of the courses and a count of the students underlying them.

So we change things around a little, grouping on the Course, Level and Capacity (as well as Campus) and then taking the DISTINCT (MAX from a summary point of view) TotalIntake figures and applying a count of the underlying students. Now we’re cooking; we’re starting to see the data that I wanted. However, if I were to scan-read the first three lines presented, I might think that for an intake of 300 students we had an uptake of only 44. The truth is that the 3 x 100 intake is actually 1 intake record split across the three campuses… so what I’d like to do is show the Intake only once. This means that I will of course have to order the results so that intakes are all together, but we’ve already done that.

So without further ado, welcome to the magic that is PARTITION. I refactored the SQL and it now looks a lot better; it is now very evident that my total intake for the first three records is 100. Just to explain what is happening: the PARTITION clause forms a partition within the data so that, for as long as a given group is in use, a row number count is incremented (restarting at 1 for each new group). In the above instance records 1, 2 and 3 would have the row numbers 1, 2 and 3 respectively, as they are all English Literature, Full Time and at A Level; what sets them apart is that they are each on a different campus. We thus, using this row number, elect to output the TotalIntake only for the first record in a group.

As you can see we’re nearly there; just one thing grates on me a little. I am more interested in the records with greater numbers of students, so I’d like to order them within the partition such that the campuses with the most students are reported upon first. For this we alter the ORDER BY within the partition itself: in general we need to keep the same order as the partition itself, but we add a COUNT(*) DESC clause that will force the campuses with the most students to appear first (within a group). The final statement came out roughly as below. I hope you find yourself a use for this kind of location awareness within a dataset.
I don’t often use it, but it’s comforting to know that it’s there when you need it.