So, recently I find myself heavily involved with SAP Data Services and data migrations. I know what you’re thinking… that’s not real code! And you’re damn straight… it isn’t, but I have always leaned heavily towards the data end of the spectrum rather than the UX end, which is filled with sorcery, witchcraft and, more importantly, users who are like little pixies with no respect for the sanctity of code; data is data, though, and it doesn’t press buttons marked ‘Nuclear Button, Do Not Press’ just for kicks. I thus don’t complain too much (ahem) when ETL jobs come my way. One of the biggest challenges with ETL in the SAP world is keeping everything down at the database level. It’s very easy to slip up and watch your execution times rattle up to unusable very quickly, all because you used a feature for which there is no direct database equivalent, and at times it is frustrating how the simplest functions are not covered; bygones.
The idea for this blog came about when I was trying to establish how to delete (archive) a load of data. That is of course easy if you use the Map Operation transform; the issue is that none of it gets pushed down to the database server, and to my knowledge there is currently no way of achieving that (without recourse to SQL transforms, which are of course frowned upon). I needed to think outside the box a little. Luckily, as I always use intermediary tables to compile my data prior to completion, the answer was fairly simple:-
- Compile data as normal in an intermediary table
- Mark any candidate records in the intermediary table as ‘Archived’ using a database field (see the sketch just after this list)
- Omit ‘Archived’ records when inserting records from the intermediary table into the main target table
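In plain SQL, that middle step is simply a set-based update against the intermediary table, which is exactly what we want the database server to end up executing (a minimal sketch; ArchiveCandidates is a hypothetical stand-in for however the candidate SIDs are derived, which in my case is via the joins shown later in this post):-

-- Step 2 of the approach: flag candidate rows, set-based, entirely on the server
-- ArchiveCandidates is hypothetical; the real candidate set comes from a series of joins
UPDATE FactSalesIntermediary
SET    Archived = 1
WHERE  SID IN (SELECT SID FROM ArchiveCandidates);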
And that led me into using a feature that (until you know the toolset a little better) appears to be missing, namely how to execute an Update statement (using FULL pushdown) against existing data. Oh sure, again it’s very easy to just add a Map Operation and convert a flow of data into an ‘Update’, but this of course results in a RBAR (row by agonising row) operation, as the BODS engine will iterate over each row of data and then change its op code so that it is an ‘update’ not an ‘insert’. The end result is that the BODS engine creates individual update statements and we quickly have a non-performant data transformation; furthermore each row of data will compound the problem and before very long we will have a drawing board to which we will have returned.
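To make the cost concrete, the row-by-row route amounts to something of this shape, one statement (and one round trip) per record rather than one statement for the whole set (illustrative only, not the engine’s literal output):-

-- RBAR: one update per row flowing through the Map Operation
UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10001;
UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10002;
UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10003;
-- ... and so on for every candidate row, so the cost grows linearly with the data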
I must admit that how to achieve Updates with a SQL pushdown did escape my sights for a little while before I realised that the guys from SAP have allowed for it. There’s a nice feature called ‘Merge’ that was introduced in SQL Server 2008 (and I’m certain that other database providers also allow for it) which lets you specify, within one SQL statement, how to match a series of records from a data source against the rows of a target table; you specify how to join these disparate and distinct sets of data and then, for a match or a non-match from that join, how to update or insert the data. For me the SQL is a little verbose for everyday use and I instead prefer to break it into distinct operations, but then I’m a little bit old school…. The real bonus about this merge methodology is that with very little effort SAP have encapsulated this functionality to offer a true database level Upsert mechanism in Data Services Designer. Of course the normal restrictions apply in that you must be operating against and joining on data that has a unique key, but then that is down to the database provider and not to Data Services itself.
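Stripped right back, the MERGE syntax (as per SQL Server 2008 onwards) looks like this; the table and column names here are placeholders:-

-- Generic upsert skeleton: one statement covering both the update and the insert cases
MERGE INTO TargetTable AS t                      -- the table being upserted into
USING (SELECT SourceKey, SourceValue
       FROM   SourceTable) AS s                  -- the incoming set of rows
   ON (t.TargetKey = s.SourceKey)                -- how the two sets are matched
WHEN MATCHED THEN
    UPDATE SET t.TargetValue = s.SourceValue     -- rows that join: update in place
WHEN NOT MATCHED THEN
    INSERT (TargetKey, TargetValue)              -- rows that do not join: insert as new
    VALUES (s.SourceKey, s.SourceValue);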
So let’s take a look at how to drive this functionality. Quite simply, we start with our transformation as shown below:-
As you can see we are using the same FactSalesIntermediary table for both our source data and our target data; in this instance we are inner joining to the other tables to establish which data can be logically archived off. In our scenario, as FactSalesIntermediary is the driver table, we can be sure that we will only ever be updating records, and so in this instance the Insert part of the upsert methodology is not actually used at all (this is a reasonably normal use case, as upsert is the only effective way in SAP Data Services Designer to push an update down entirely to the database server). If we take a look at the configuration of the table that we are updating you should see the following:-
Note that ColumnComparison is set to ‘Compare by Name’ and that the only two fields we are interested in are the SID (which is our unique key) and the Archived flag, which records the fact that the record with the current SID is to be archived off. If you think about this from a purely SQL point of view, this is enough information to compile the following:-
UPDATE FactSalesIntermediary SET Archived = 1 WHERE SID = 10002
Since the rest of the fields are not inputs they will of course be ignored. You should also note the other two advanced feature settings that have been highlighted, namely ‘Auto Correct Load’ and ‘Allow Merge or Upsert’; selecting both of these allows the Data Services engine to create an upsert (MERGE) SQL statement, instead of the normal SQL ‘Insert’ statement, that can be pushed directly down to the database, thus avoiding any RBAR processing. Let’s take a look at the SQL that is generated:-
MERGE INTO "DBO"."FACTSALESINTERMEDIARY" s USING (SELECT DISTINCT "FACTSALESINTERMEDIARY"."SID" , 1 FROM (("DBO"."STAGINGSAPDELETIONS" "STAGINGSAPDELETIONS" INNER JOIN "DBO"."DIMSAPDOCUMENTNUMBER" "DIMSAPDOCUMENTNUMBER" ON ( "STAGINGSAPDELETIONS"."SAPDOCUMENTNUMBER" = "DIMSAPDOCUMENTNUMBER"."SAPDOCUMENTNUMBER" )) INNER JOIN "DBO"."FACTSALESINTERMEDIARY" "FACTSALESINTERMEDIARY" ON ( "DIMSAPDOCUMENTNUMBER"."SID" = "FACTSALESINTERMEDIARY"."DIMSAPDOCUMENTFK" )) INNER JOIN "DBO"."DIMSTATUS" "DIMSTATUS" ON ( "DIMSTATUS"."SID" = "FACTSALESINTERMEDIARY"."DIMTRANSACTIONTYPEFK" ) AND ( "STAGINGSAPDELETIONS"."STATUS" = "DIMSTATUS"."STATUS" ) ) n (SID , ARCHIVED) ON ((s.SID = n.SID)) WHEN MATCHED THEN UPDATE SET s."ARCHIVED" = n.ARCHIVED WHEN NOT MATCHED THEN INSERT ("ARCHIVED" ) VALUES (n.ARCHIVED);
So let’s break it down into its constituent parts:-
- As you can see, the first part of the SQL opens the MERGE statement, which sets up the whole syntactical structure and aliases the table we wish to update to ‘s’.
- We then have the SELECT statement that we built in our query, detailing which source data to use, all of which is aliased to a virtual table called ‘n’.
- We then specify how these two aliased sources join together, in this instance joining ‘s’ and ‘n’ on their SID fields.
- We then have the nuts and bolts of the upsert itself; we start with the WHEN MATCHED clause, which effectively gives us our ‘UPDATE’ statement.
- And finally we have the WHEN NOT MATCHED clause, which in this scenario is never triggered but in other circumstances would give us our INSERT statement.
And that really is all there is to it. When the example is run I see that 150 records are updated and none are inserted, which is exactly what I expect to see. I can then ignore all of the ‘Archived’ records when I populate the main FactSales table.
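That final load is then just the normal insert with the flag filtered out (a sketch; the FactSales column list here is illustrative):-

-- Populate the main fact table, omitting anything flagged as archived
INSERT INTO FactSales (SID, DimSAPDocumentFK, DimTransactionTypeFK)   -- illustrative column list
SELECT SID, DimSAPDocumentFK, DimTransactionTypeFK
FROM   FactSalesIntermediary
WHERE  Archived = 0;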
Upsert? Please don’t cry…..