Upsert [sic] with SAP Data Services? Don’t cry though….

So, recently I find myself heavily involved with SAP Data Services and data migrations. I know what you’re thinking… that’s not real code! And you’re damn straight, it isn’t, but I have always leaned heavily towards the data end of the spectrum rather than the UX end, which is filled with sorcery, witchcraft and, more importantly, users who are like little pixies with no respect for the sanctity of code; data is data, though, and it doesn’t press buttons marked ‘Nuclear Button, Do Not Press’ just for kicks. I thus don’t complain too much (ahem) when ETL jobs come my way. One of the biggest challenges with ETL in the SAP world is keeping everything down at the database level. It’s very easy to slip up and watch your execution times rattle up to unusable very quickly, all because you used a feature for which there is no direct database equivalent, and at times it is frustrating how even the simplest functions are not covered; bygones.

The idea for this blog came about when I was trying to establish how to delete (archive) a load of data, which is of course easy if you use the Map Operation transform; the issue with that, of course, is that none of it gets pushed down to the database server. To my knowledge there is currently no way of achieving this (without recourse to the SQL transforms, which are of course frowned upon). I needed to think outside the box a little. Luckily, as I always use intermediary tables to compile my data prior to completion, the answer was fairly simple:-

  1. Compile data as normal in intermediary table
  2. Mark any candidate records in intermediary table as ‘Archived’ using a database field
  3. Omit ‘Archived’ records when inserting records from intermediary table into main target table
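Thought of purely in SQL terms, the three steps amount to something like the sketch below (the StagingSales and ArchiveCandidates table names are invented for illustration; FactSalesIntermediary, FactSales, SID and Archived are the names used in the example that follows):

```sql
-- 1. Compile data as normal into the intermediary table
INSERT INTO FactSalesIntermediary (SID, SaleDate, Amount, Archived)
SELECT s.SID, s.SaleDate, s.Amount, 0
FROM   StagingSales AS s;

-- 2. Mark any candidate records as archived using a database field
UPDATE FactSalesIntermediary
SET    Archived = 1
WHERE  SID IN (SELECT SID FROM ArchiveCandidates);

-- 3. Omit archived records when loading the main target table
INSERT INTO FactSales (SID, SaleDate, Amount)
SELECT SID, SaleDate, Amount
FROM   FactSalesIntermediary
WHERE  Archived = 0;
```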

And that led me into using a feature that (until you know the toolset a little better) appears to be missing, namely how to execute an Update statement (using full pushdown) against existing data. Oh sure, again it’s very easy to just add a Map Operation and convert a flow of data into an ‘Update’, but this of course results in an RBAR operation, as the BODS engine will iterate over each row of data and change its op code so that it is an ‘update’ not an ‘insert’. The end result is that the BODS engine then creates individual update statements and we quickly have a non-performant data transformation; furthermore, each row of data will compound the problem and before very long we will have a drawing board to which we will have returned.

I must admit that how to achieve updates with SQL pushdown escaped my sights for a little while before I realised that the guys from SAP have allowed for it. There’s a nice feature called ‘Merge’ that was introduced in SQL Server 2008 (and I’m certain that other database providers also allow for it) which lets you specify, within one SQL statement, how to match a series of records from a data source with another series of rows in a database table; you then specify how to join these disparate and distinct sets of data and, in the case of a match or a non-match from that join, how to update/insert the data. For me the SQL is a little verbose for everyday use and I instead prefer to break it into distinct operations, but then I’m a little bit old school…. The real bonus about this merge methodology is that, with very little effort, SAP have encapsulated this functionality to offer a true database-level Upsert mechanism in Data Services Designer. Of course the normal restrictions apply in that you must be operating against, and joining on, data that has a unique key, but then that is down to the database provider and not to Data Services itself.
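As a minimal standalone sketch of the T-SQL syntax (the customer tables here are made up purely for illustration):

```sql
-- Upsert in one statement: update rows whose key matches, insert the rest.
MERGE TargetCustomers AS t
USING SourceCustomers AS s
   ON (t.CustomerID = s.CustomerID)   -- join on the unique key
WHEN MATCHED THEN                     -- key exists: update in place
    UPDATE SET t.Name = s.Name
WHEN NOT MATCHED THEN                 -- key missing: insert a new row
    INSERT (CustomerID, Name) VALUES (s.CustomerID, s.Name);
```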

So let’s take a look at how to drive this functionality. Quite simply, we start with our simple transformation as shown below:-


As you can see, we are using the same FactSalesIntermediary table for our source data and target data; in this instance we are inner joining to the other tables to establish which data can be logically archived off. In our scenario, as the FactSalesIntermediary is the driver table, we can be sure that we will only ever be updating records, and so in this instance the Insert part of the upsert methodology is not actually used at all (this is actually a reasonably normal use case, as using upsert is the only effective way in SAP Data Services Designer to push an update down entirely to the database server). If we take a look at the configuration of the table that we are updating you should see the following:-


Note that ColumnComparison is set to ‘Compare by Name’ and that the only two fields that we are interested in are the SID (which is our unique key) and the archive flag, which details the fact that we are to archive off the record with the current SID. If you think about this from a purely SQL point of view, this is enough information to compile the following:-

UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10002

The rest of the fields, as they are not inputs, will of course be ignored. You should also note the other two advanced feature settings that have been highlighted, namely ‘Auto Correct Load’ and ‘Allow Merge or Upsert’; selecting both of these allows the Data Services engine to create an Upsert SQL statement (instead of the normal SQL ‘Insert’ statement) that can be pushed directly down to the database, thus avoiding any RBAR processing. Let’s take a look at the SQL that is generated:-



[Screenshot of the generated MERGE statement; only the join line has survived:]

ON ((s.SID = n.SID))



So let’s break it down (each colour representing a ‘part’):-

  1. As you can see, the first part of the SQL creates a MERGE statement, which sets up the whole syntactical structure and also aliases the table we wish to update to ‘S’.
  2. We then have the SELECT statement that we built in our query, detailing which source data to use and aliasing all of this to a virtual table called ‘N’.
  3. We then specify how these two aliased sources join together, in this instance joining ‘S’ and ‘N’ on the SID fields.
  4. We then have the nuts and bolts of the upsert itself; we start with the WHEN MATCHED clause, which effectively gives us our ‘UPDATE’ statement.
  5. And finally we have the WHEN NOT MATCHED clause, which for this scenario we actually ignore but in other circumstances would give us our INSERT statement.
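Stitched back together, the generated statement would read roughly like this (a reconstruction from the parts described above, with the source SELECT abbreviated rather than reproduced):

```sql
MERGE FactSalesIntermediary s                  -- part 1: target aliased to S
USING (SELECT SID, 1 AS Archived FROM ...) n   -- part 2: source query aliased to N
   ON ((s.SID = n.SID))                        -- part 3: join on the unique key
WHEN MATCHED THEN                              -- part 4: the UPDATE
    UPDATE SET s.Archived = n.Archived
WHEN NOT MATCHED THEN                          -- part 5: the (here unused) INSERT
    INSERT (SID, Archived) VALUES (n.SID, n.Archived);
```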

And that really is all there is to it. When the example is run I see that 150 records are updated and none are inserted, which is what I expect to see. I can then, when I populate the main FactSales table, ignore all of the ‘archived’ records.

Upsert? Please don’t cry…..


Man of Distinct(ion)

It’s not often that I learn something new about the fundamentals of SQL. Sure, they add to the syntax all the time and you learn something about what you can do with these additions, but fundamentals, that’s basic stuff – the kind of stuff you rattle off a thousand times a day without even thinking about it. I’m talking about stuff I should have known, though it seems that I am not alone, so I thought I would share this tidbit. This knowledge is to do with the DISTINCT keyword, of all things.

If I run the following piece of SQL I get a large number of results, covering all of the intakes available across all courses for all campuses:-

SELECT C.[Description], O.Description, O.TotalIntake
FROM Campus AS C
    INNER JOIN Intake AS I ON C.CampusID = I.CampusID
    INNER JOIN Courses AS O ON I.IntakeID = O.Id

The results look a little like this:-


But what if I want to return a count of all of the course types at a campus and the number of seats available over the whole intake? The obvious thing is to group by campus and then count the number of courses, like this:-

SELECT C.[Description], COUNT(O.Description), SUM(O.TotalIntake)
FROM Campus AS C
    INNER JOIN Intake AS I ON C.CampusID = I.CampusID
    INNER JOIN Courses AS O ON I.IntakeID = O.Id
GROUP BY C.[Description]

Wait a minute, that’s not right though, is it? I don’t have 84 different course types at my Shrewsbury Campus. It turns out that getting a straight count of the course types is not quite as simple as that, that is unless you know what I now know…

That DISTINCT may be used within a COUNT function.

It’s obvious, sure, but I really had no idea that this was supported by SQL. So I refactored the SQL to look like this:-

SELECT C.[Description], COUNT(DISTINCT O.Description), SUM(O.TotalIntake)
FROM Campus AS C
    INNER JOIN Intake AS I ON C.CampusID = I.CampusID
    INNER JOIN Courses AS O ON I.IntakeID = O.Id
GROUP BY C.[Description]


And there it is, we have distinct course types at each campus. I guess I’ve probably over-engineered solutions in the past because I never knew about this little nugget of amazingness. Go forth and be awesome.
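To see the difference in miniature, here is a toy query with invented data: one campus offering three intakes covering only two distinct courses. COUNT(Description) counts every joined row, while COUNT(DISTINCT Description) counts unique course types.

```sql
-- AllRows comes back as 3, CourseTypes as 2.
SELECT COUNT(Description)          AS AllRows,
       COUNT(DISTINCT Description) AS CourseTypes
FROM (VALUES ('Maths'), ('Maths'), ('Physics')) AS T(Description);
```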

Grouping multiline SQL data using XML

A friend of mine recently asked for help with a bit of SQL he was working on. True, he was in a far-away land lording it up and I could have ignored him on general principles, but the way I look at it is this… if he’s out there and in work he’s not over here making a nuisance of himself (only joking, George). So I decided to help, selfless creature that I am!

He had a data-set, in this instance nurse information, whereby each nurse could have multiple registrations. What he wanted to achieve was to show a distinct list of nurses with all of their registrations on one line, comma-separated. I remembered a couple of approaches that he could have taken, but I thought that for this instance the best suited would be to use an undocumented feature of the FOR XML statement. So we started with the following table of nurses and their registrations:-


We then used the following SQL statement
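(The screenshot of that statement hasn’t survived, so here is a reconstruction; the Nurses table name and NurseName column are my assumptions, while NurseRegistrations and RegNo come from the snippet further down:)

```sql
-- For each distinct nurse, build a space-separated list of registrations
-- via FOR XML PATH(''), then swap the separating spaces for commas.
SELECT DISTINCT
       N.NurseID,
       N.NurseName,
       REPLACE(RTRIM((
           SELECT RegNo AS 'data()'          -- data() strips the XML tags
           FROM   NurseRegistrations AS NR
           WHERE  NR.NurseID = N.NurseID     -- correlate to the outer nurse
           FOR XML PATH('')                  -- empty nodeset: no wrapper element
       )), ' ', ',') AS Registrations
FROM Nurses AS N
```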


So, a brief explanation. We are using a sub-select here with FOR XML PATH and an empty nodeset (FOR XML PATH('')). This is the hacky part, as calling FOR XML PATH with an empty nodeset is not strictly supported and as such is an unintended action. The AS 'data()' alias is required as this removes the XML tags that would otherwise be generated. As you can see, the outer select retrieves distinct nurses and the sub-select joins to that distinct Nurse ID. The FOR XML PATH creates a space-separated list of strings, which is great; this means that we can perform a trim to remove all of the trailing spaces before we comma-separate them. The final thing to do is replace all embedded spaces with a comma character. Dead simple, although this does rely on your data NOT having embedded spaces itself. If it does, I would suggest appending two spaces to the inner select like so:-

SELECT RegNo + '  ' AS 'data()'   FROM NurseRegistrations AS NR

This way the trim functions will still work, and all you need to do is change your replace to use two spaces instead of one.

My original SQL yielded the following results.


Now, this solution does offend me in some ways… the fact that I have had to resort to RBAR really grates on me, but today I find myself with little time and purities of perfect SQL will have to be put on hold. I am sure, though, that I can improve this given enough time to investigate further, and when I do you can be assured I will update this blog.

In the meantime, I hope it helps you……. George! And stop moaning about the weather!

Improve Readability using ROW_NUMBER()

What did I ever do without the T-SQL ROW_NUMBER() functionality? I do remember, way back in days of yore (when I had youth, less weight and even less sleep), wishing that with straight SQL I could achieve a row count within my data. I seem to remember we cobbled something together, but it was less than satisfactory. Code, it seems, is always about compromise…

Just recently I was set some work that involved grouping and sorting, and in addition I needed to be able to make the SQL code seemingly intelligent about where we were within the data. What we had was an intake amount that could apply to ‘n’ records being detailed. We were ordering the data in such a way that only the first record should really output the Total Intake amount (as this applied to several underlying rows). It was not helpful to look at 3 rows where the Total Intake appeared to be 100 for each when in fact it was 100 for the 3 rows together.

To state the problem, I started with a database structured thus: the campuses are all of the campuses where a student could attend a course; next we had the courses themselves (this is just a sample, so the data is not normalised! Excuse my faux pas…); and finally the mapping table that brings this data all together.

So to start with, I began with a simple SQL select to join all of this data together, sensibly returning only the data that a human (if we could only find one) wanted to see. Nothing onerous here: for each student we are bringing back their course, level and capacity on a 1-to-1 basis. Not very useful, as what I would actually like to see is all of the courses and a count of the students underlying them. So we change things around a little, grouping on the Course, Level and Capacity (as well as Campus) and then taking the DISTINCT (MAX, from a summary point of view) TotalIntake figures and applying a count of the underlying students.

Now we’re cooking, we’re starting to see the data that I wanted. However, if I was to scan-read that data from the first three lines presented I might think that for an intake of 300 students we had an uptake of only 44. The truth is that the 3 x 100 intake is actually 1 intake record split across the three campuses… so what I’d like to do is show the intake only once.

This means that I will of course have to order the results so that intakes are all together, but we’ve already done that. So, without further ado, welcome to the magic that is PARTITION. I refactored the SQL, and this now looks a lot better; it is now very evident that my total intake for the first three records is 100. So, just to explain what is happening: the PARTITION clause forms a partition within the data so that, whenever a new group is determined, a row number count is incremented for such time as that group is in use (restarting at 1 for each group). In the above instance records 1, 2 and 3 would have the row numbers 1, 2 and 3 respectively, as they are all English Literature, Full Time and at A Level; what sets them apart is that they are each on a different campus. We thus, using this row number, elect to output the TotalIntake only for the first record in a group.

As you can see, we’re nearly there, with just one thing that grates me a little. I am more interested in the records with greater numbers of students, so I’d like to order them within the partition such that the campuses with the most students are reported upon first. For this we alter the ORDER BY within the partition itself. In general we need to keep the same order as the partition, but we add a COUNT(*) DESC clause that will force the campuses with the most students to appear first (within a group). I hope you find yourself a use for this kind of location awareness within a dataset. I don’t often use it, but it’s comforting to know that it is there when you need it.
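A sketch of the final statement (the screenshots are gone; the Mode and [Level] column names, and the CASE used to blank the repeated intake, are my assumptions about the shape of the query):

```sql
-- Number each campus within its Course/Mode/Level group, busiest campus
-- first; only row 1 of each group shows the TotalIntake figure.
SELECT C.[Description] AS Campus,
       O.[Description] AS Course,
       COUNT(*)        AS Students,
       CASE WHEN ROW_NUMBER() OVER (
                 PARTITION BY O.[Description], O.Mode, O.[Level]
                 ORDER BY     O.[Description], O.Mode, O.[Level],
                              COUNT(*) DESC) = 1
            THEN O.TotalIntake
       END AS TotalIntake
FROM Campus AS C
    INNER JOIN Intake  AS I ON C.CampusID = I.CampusID
    INNER JOIN Courses AS O ON I.IntakeID = O.Id
GROUP BY C.[Description], O.[Description], O.Mode, O.[Level], O.TotalIntake
```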