So there I was on a Friday (I’m lying it was a Thursday actually) minding my own business and trying to implement a like clause in SAP Data Services thinking… It really shouldn’t be this hard! But it was… It turns out that the double pipe characters (||) which are the usual concatenation expressions within Data Services aren’t valid in a join expression. This was slightly problematical as I needed to query with a like as well as with exact matches. Faced with my like clause problems I would see the day quickly slipping away from me. The choices were as follows:-
1. Move all of the logic for my transform out to a Stored Procedure . I have no doubt that this would work but would have resulted in much reworking as my data transformations were reasonably complex and there were many of them. In addition it always feel like a cheat too as you end up losing the Data Lineage which to be honest had already gone to hell in a hand cart in this instance!
2. Not use a like, nor give a fig and just give up and run away to the South of France for the rest of my life. In all honesty, I'm struggling to find a downside here...
3. Add a new derived which performed the concatenation and created a new field call FuzzyData. I could then use the phrase OR FieldToCompare LIKE FuzzyData. Theretically this could work but I could see this extra query step potentially preventing my pushdowns from working and with the size of this data set that would be completely unworkable.
Now if you’re anything like me you stopped reading at point 2, made for the airport and was enjoying a cheeky red before embarkation; actually scratch that. I think that a red is probably way too classy and haughty to be considered cheeky. But I digress…. If you are also like me I would guess that Mr Lloyds also probably had you forcibly ejected from the aeroplane United Airlines style; Something about repaying those airline tickets size dents in ‘his’ money
So lets state the problem, we have a set of spreadsheet data that contains data returned from a client pertaining to products that they have ordered which are defective in some way; this data has been moved into a staging table within our data warehouse to allow for performant querying. There is of course much data within this table but the main identification field is the barcode which is always in the format ‘nnnnnnnn’ n being a numeric value of the barcode as exposed to the client.
Spreadsheet Barcodes 07987621 00234562 09812344
Also within our data warehouse is a DimProducts dimension table which has loaded the full inventory from our own LOB database. The barcodes within this data are generally in exactly the same format although there are a few that have 11 characters instead of the more usual 8 (format ‘nnnnnn-nnn’). The client however will only ever see the (first) 8 characters of any barcode including these anomalies.
Dimension Barcodes 07987621 00234562 09812344-908
The difficulty is that we wish to map the data contained within the ‘spreadsheet’ to the data within the ‘LOB database’ so that for every defective complaint sent out we can associate it to the product within our inventory. We can then contact manufacturers etc. Remember that in the main there will be a direct match between these two sources, it will only be ‘edge cases’ where the ‘spreadsheet’ data will match on the first 8 characters of our 11 character barcodes. As you can see from the sample barcodes above if we were to perform the below join we would only match on the first two records
'Spreadsheet.Barcode = Database.Barcode'
The final barcode would not be resolved as 09812344 is not the same as 09812344-908. In order to match against this we would need a join like this:-
'Spreadsheet.Barcode LIKE Database.Barcode || '%''
The percentage is of course required as SQL server requires wildcard characters to utilise fuzzy match on a LIKE statement otherwise it is essentially just an equal to statement. Nice and simple so far but just try and use that phrase inside an SAP data flow join. Not very helpful errors ensue and the afternoon disappears in much the same way that my salary does… alarmingly fast.
As I alluded to earlier in this blog post the easiest way to address this is to split your processing into two joined queries, in the first we select only the fields we are interested in from our spreadsheet staging table and derive a new field entitled FuzzyBarcode. This field uses the following formula for its text
Spreadsheet.Barcode || '%'
This removes the need for monkeying around with pipe characters in the joins at a later stage. Next off in the second query we bring in the data from the first query and join that to the data within our products table using the barcode join as shown below:-
'Dimension.Barcode LIKE Spreadsheet.FuzzyBarcode'
This data can then be output into your fact table and as you should see something magical happens. Not only does your data resolve in the manner that you wished the execution of this workflow (so long as you have not done anything else too complex) gets entirely pushed down to the database leaving you with a statement Similar to this pseudo SQL
INSERT INTO FactComplaints (DimProductFK,RejectionReason) SELECT D.Sid,S.RejectionReason FROM Spreadsheet AS S INNER JOIN DimProducts AS D ON D.Barcode LIKE S.FuzzyBarcode
Perfect, a like clause within SAP BODS and everything gets pushed down to the database for maximum efficiency.