Make That Switch From (The) Oracle


Just recently we came across a client who was running their data warehouse on an Oracle 10g installation and wanted to come home to the world of SQL Server. As I am personally not a fan of Oracle at all, I couldn’t fault their logic one bit. I had never, however, experienced moving an SAP DS transform from one database server to another for the same datastore, and so I thought I’d look into it a little more. Let’s begin by stating the problem a little more clearly and throwing up some caveats about what this posting is not.

We have an existing Oracle database which is currently populated by use of an SAP Data Services transform. To make things easy, from the point of view of this blog, our output table is a template table, as this negates the need for us to physically create any new tables; SAP takes care of this for us. In the real world you are very likely not to be using template tables, and so you would need to take care of creating the new data tables manually. So in this example we will be working with a very simple transform which takes data from an input file and uploads it directly into an Oracle data table. We will be taking this transform and porting it to SQL Server with the least possible effort on our part, not just because we’re lazy, but also because least effort means least chance of breaking our transform.

So let’s take a look at our very simple transform:-


As you can see, there really is nothing complicated here: we have an input file which is pretty much inserted unadulterated into our ‘THISISATEST’ Oracle database table. Note, let’s not talk about case sensitivity in this blog…. Oracle. As you can see, the table resides within the POC datastore, so without any further ado let’s take a look at our datastore. We right click on our POC datastore and select the ‘Edit’ menu item:-


So let’s take a look at what we currently have defined. A datastore named POC which appears to be defined as an Oracle 10g datastore. Note that the datastore type is NOT editable and so we cannot just change from one data store type to another.


So let’s go advanced and edit the configuration by using the ‘Edit’ button towards the bottom of the screen. You are presented with a screen similar to the following where our Oracle configuration is defined:-


What we need to do in this instance is create a brand new configuration for our SQL Server database which will utilise exactly the same datastore objects, thus allowing us to make the change from one provider to another with practically no work. Let’s then ‘Create’ a new configuration by using the button indicated in green above. This will invoke the ‘Create Configuration’ wizard, which is the only way you have of defining a database type for a configuration.


In this instance, I have elected to use an ODBC connection to connect to SQL Server (which you can’t see) and create a new Configuration called somewhat originally ‘SqlServerConfig’. Once you elect to create this configuration the previous configurations screen will show the following:-


Wow.. Ain’t that simple. I now enter my various login credentials which have been obfuscated for the sake of this blog and we are left with two configurations that can be utilised by the transformation. But we want to switch to using SQL Server rather than Oracle so let’s make one small change that will effect this easily:-


All we do is change the ‘Default Configuration’ for the ‘SQLServerConfig’ to True which will mean that this will always be used in preference to the Oracle configuration unless of course you specifically override this behaviour. You can now go ahead and run your transformation and you should see, so long as you made no errors, that the ‘THISISATEST’ table will now have been created within the SQL Server database rather than the Oracle database. And that really is the crux of migrating from Oracle to SQL Server, or indeed from any one database to any other.


Rank and File…

Just recently I was asked to make some changes to an existing data warehouse; we needed to add Tender data (that is, how a transaction was paid for) to existing transactional data. The difficulty with this was that we did not want to change the grain of the data, which was at transactional level, and often transactions are paid for with a mix of Tenders… (cash, debit card, discount coupons, cheques, credit cards). The data was to be presented to me exactly as it had been captured at the tills; that is, where multiple payment types were used to pay for an item there would be multiple tender records, each specifying the amount tendered and the tender type. The data would be presented in a table that looked a little like this:-


We were then asked to report the tender types used for each transaction, and in instances where multiple payment types were tendered we were to show ONLY the largest Tender made. As an example, where a transaction totalling £1000 was made, £700 of which was in cash with the residue being made via a debit card, the transaction should be reported as a ‘Cash’ transaction. Simple enough, right? As there was a lot of data which I would be looking at, I decided that I didn’t want any processing in memory, and so I wanted to find a way to give me this data in one operation at the database level, which should give the best performance. How best to do this? SQL being SQL there are many ways to achieve this, but in this instance let’s use my latest favourite function (I know, I know; you’re not supposed to have favourites….), the RANK function used in association with PARTITION BY.

So what are these functions? Well, despite using them both quite regularly, I find myself not actually being able to describe very readily what they actually do, so I had a little bit of research on my hands. RANK () is a ranking function which, used with the OVER clause (a ‘windowed function’), allows you to number rows within groups of the resultset WITHOUT recourse to the GROUP BY clause which, of course, operates against and dictates the shape of the entire query. PARTITION BY is the part of the OVER clause that defines those groups. Put simply, they allow you to create aggregated data within an existing standard SQL query. The RANK () function ranks data using a numeric counter which restarts for each PARTITION specified. That sounds complicated, even to me, so let’s just dive in, take a look and hopefully it will become more apparent. Let’s start by looking at the core SQL that we need to run, and it really is very simple:-

SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount 
FROM TenderEntryData AS TEE
ORDER BY TransactionNumber,StoreCode,Amount DESC

This requires no explanation other than we will return every single TenderEntryData record including those for transactions where more than one payment type was tendered. Let’s move on from this and add the Ranking and Partition function and then describe what is happening:-

SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount,
RANK() OVER (PARTITION BY TransactionNumber, StoreCode
             ORDER BY (SELECT NULL)) -- RANK() requires an ORDER BY; (SELECT NULL) ranks arbitrarily
AS Ranking
FROM TenderEntryData AS TEE
ORDER BY TransactionNumber,StoreCode,Amount DESC

So we have added the whole of the RANK clause here. One key thing to note is that Transaction Numbers are unique only to stores; they are not guaranteed unique across the whole enterprise. For this reason, we must partition our data by both Transaction Number and Store Code. The rank function will thus currently arbitrarily rank our rows of data, grouping by unique transactions at a given store. This is a great start, but we wanted to see the larger Tender Entries first, and so we need to tinker with the SQL a little, thus:-

SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount,
RANK() OVER (PARTITION BY TransactionNumber, StoreCode
             ORDER BY Amount DESC)
AS Ranking
FROM TenderEntryData AS TE
ORDER BY TransactionNumber,StoreCode,Amount DESC

To make things more readable I have split the RANK statement over two lines. The OVER clause has now been extended to include an ORDER BY, which means that we should now have all of the larger amounts ranked ‘1’ whilst secondary amounts will be ranked ‘2’, etc. There is, however, still an issue here: where we have a transaction with multiple tenders but the amounts are equal, the data will be ranked arbitrarily according to how the database elects to order them. This is not in itself a problem at all; I would however just like to be explicit in picking the first (according to ID) in such instances, so I will add a second clause to the ORDER BY. This should help us to avoid any strange behavioural issues later:-

SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount,
RANK() OVER (PARTITION BY TransactionNumber, StoreCode
             ORDER BY Amount DESC, ID)
AS Ranking
FROM TenderEntryData AS TE
ORDER BY TransactionNumber,StoreCode,Amount DESC

Let’s have a look at that data now:-


Nearly there… So we now have ranking data for each payment type made against unique transactions at a store. What remains is to filter this data down, as we ONLY wish to see the largest amount, which is now easy as the largest amount will always have a Ranking value of 1. What we can’t however do is add the windowed function into a WHERE clause, as this is invalid SQL. What we instead need to do is wrap the whole SQL we just created into a sub-query and then perform the filter, a little like this:-

SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount
FROM (
    SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount,
    RANK() OVER (PARTITION BY TransactionNumber, StoreCode
                 ORDER BY Amount DESC, ID)
    AS Ranking
    FROM TenderEntryData AS TE
) AS TenderData
WHERE Ranking = 1

This SQL yields the following resultset:-

And we’re done. Lovely, simple and very performant SQL. As you can see, the ORDER BY statement had to be removed, as an ORDER BY is not valid inside a sub-query. I’ve only scratched the surface of what is actually achievable using the partitioning and window functions; watch this space for more awesomeness in the future.
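If you want to play with the finished query without spinning up SQL Server, it can be sketched end to end with Python’s built-in sqlite3 module (window functions need SQLite 3.25 or later; the table and its three rows are invented purely for this demonstration):

```python
import sqlite3

# In-memory database with a miniature TenderEntryData table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TenderEntryData (
    ID INTEGER, TenderDescription TEXT, StoreCode TEXT,
    TransactionNumber INTEGER, Amount REAL);
INSERT INTO TenderEntryData VALUES
    (1, 'Cash',       'S01', 1000, 700.0),
    (2, 'Debit Card', 'S01', 1000, 300.0),
    (3, 'Cash',       'S02', 1000,  50.0);  -- same number, different store
""")

rows = conn.execute("""
    SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount
    FROM (
        SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount,
               RANK() OVER (PARTITION BY TransactionNumber, StoreCode
                            ORDER BY Amount DESC, ID) AS Ranking
        FROM TenderEntryData
    ) AS TenderData
    WHERE Ranking = 1
    ORDER BY StoreCode  -- ordering the OUTER query is perfectly legal
""").fetchall()

# Only the largest tender per (TransactionNumber, StoreCode) survives.
print(rows)
```

Running it shows one ‘Cash’ row per store, the £700 tender winning over the £300 debit card at store S01.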

Why Do You Have To Be So Sensitive?


For this latest blog I have my soon to be ex-colleague (Splitter!) Luke ‘Maximo Park’ Johnson to thank. He asked me a question with regard to SAP Data Services which has always bugged me but that I have never managed to resolve. A quick google search on my part confirmed my previous line with regard to this issue and I left it at that. Sometime later he sent me a link, which I investigated, containing a nugget that had thus far eluded me. So I investigated, and his nugget became a fully fledged gold bar. I therefore decided to blog this in order that others may benefit from this knowledge…. and so that I won’t forget!

So, what exactly was this problem? The question he asked me was ‘Why does Data Services create all of its tables within the datastore with an upper case name?’ This has been my biggest gripe by far (apart from the god awful SQL that gets generated!) because, being everyday human beings, we use shortcuts for everything, consciously and subconsciously. Let’s step back a little to step forward…. When I create a database table dealing with Product Derivations I have a few ways I can do this; I could use the following names:-

Product Derivations

Product_Derivations

ProductDerivations

It’s very much a matter of taste of course, but as I have an abundance of taste I would always use the last of these options! Why? Well, option 1 with its embedded space means that when it comes to using SQL (which is an environment I live in extensively) you would have to use [] or “” around every table name, which is just hideous and more work:-

SELECT * FROM “Product Derivations”


SELECT * FROM [Product Derivations]

Eugh… So option 2 is definitely less work as regards these extra characters, but I left visual basic behind a long time ago. To be perfectly honest it’s just amateurish and doesn’t really aid readability; all those silly little underscores to me just serve to break the message up:-

SELECT * FROM Product_Derivations

No, I would use the final Camel Case option every time; it says exactly what it needs to, no more and no less. Data Services, however, gets in the way of this as out of the box (and seemingly until today) it creates all Template Tables fully upper case, which is just a bit of a car crash… Wait, I hear you say, surely it doesn’t matter? But actually, it really really does… It all goes back to those shortcuts that we subconsciously take. I have read a LOT of words in my lifetime, many many millions I suspect, and I know for a fact that not one of us reads every single letter in a word before arriving at ‘the word’, unless we are just learning that word or training our brains like children do. In general terms we just don’t work like that: we shortcut using context, and we also shortcut using word shapes. The word ‘balloon’ looks a lot different to the word ‘battlestation’, and we see that shape pretty much instantly. When I am therefore looking through a list of potentially hundreds of database tables, all beginning with P, my eyes are scanning for all of those that start with a shape like ‘Product’, and once that shape has been recognised the rest of ‘the word’ will also be matched, meaning that matching the words ‘ProductDerivation’ is a far easier job than it otherwise would have been.


What happens though if we take the case sensitivity out of the equation?


I’m sure you’ll all agree much harder to read…

And if we go to the opposite extreme and do what Data Services does and upper case everything, things go from bad to shocking! It is practically unreadable at speed, as we don’t really work in upper case.


So, the question is how do I make Data Services behave and give me the tables EXACTLY as I define them? Well, the answer is very simple: we can make one configuration change and normal service is resumed. We start by locating the DSConfig.txt file, which is located within the Data Services/conf folder. On my machine this was within this directory:-

C:\Program Files (x86)\SAPBusinessObjects\DataServices\conf

We then open this file up and look for the [AL_Engine] Section heading which will look like this:-


AlEnginePathName =…..

You then need to add a new setting called ODBCAlwaysCaseSensitive and give this a value of ‘TRUE’

Your file should now look a little like this:-


AlEnginePathName =…..

ODBCAlwaysCaseSensitive = TRUE

And that’s it as far as changes are concerned. We just save the changes before stopping and restarting the SAP Data Services windows service, available from the Windows Services Manager dialog. Lo and behold, you’re all done. What you should now find is that upon running your data transformations, any template tables that you have defined will, upon closer examination within SQL Server, have been created using the casing you specified at design time. I don’t know about you, but for me this will now be the first thing I change in any new Data Services installation….


There Is Nothing Like A Dame…


So there I was on a Friday (I’m lying, it was a Thursday actually) minding my own business and trying to implement a like clause in SAP Data Services, thinking… it really shouldn’t be this hard! But it was. It turns out that the double pipe characters (||), which are the usual concatenation operator within Data Services, aren’t valid in the join expression. This was slightly problematic as I needed to query with a like as well as with exact matches. Faced with my like clause problems, I could see the day quickly slipping away from me. The choices were as follows:-

  1. Move all of the logic for my transform out to a Stored Procedure. I have no doubt that this would work, but it would have resulted in much reworking as my data transformations were reasonably complex and there were many of them. In addition, it always feels like a cheat too, as you end up losing the Data Lineage, which to be honest had already gone to hell in a handcart in this instance!
  2. Not use a like, nor give a fig, and just give up and run away to the South of France for the rest of my life. In all honesty, I'm struggling to find a downside here...
  3. Add a new derived column which performs the concatenation and creates a new field called FuzzyData. I could then use the phrase OR FieldToCompare LIKE FuzzyData. Theoretically this could work, but I could see this extra query step potentially preventing my pushdowns from working, and with the size of this data set that would be completely unworkable.

Now if you’re anything like me you stopped reading at point 2, made for the airport and were enjoying a cheeky red before embarkation; actually, scratch that. I think that a red is probably way too classy and haughty to be considered cheeky. But I digress…. If you are also like me, I would guess that Mr Lloyds also probably had you forcibly ejected from the aeroplane United Airlines style; something about better repaying those airline-ticket-sized dents in ‘his’ money.

So let’s state the problem: we have a set of spreadsheet data, returned from a client, pertaining to products that they have ordered which are defective in some way; this data has been moved into a staging table within our data warehouse to allow for performant querying. There is of course much data within this table, but the main identification field is the barcode, which is always in the format ‘nnnnnnnn’, n being a numeric value of the barcode as exposed to the client.

Spreadsheet Barcodes


Also within our data warehouse is a DimProducts dimension table which has loaded the full inventory from our own LOB database. The barcodes within this data are generally in exactly the same format, although there are a few that are longer than the usual 8 characters (format ‘nnnnnnnn-nnn’). The client however will only ever see the (first) 8 characters of any barcode, including these anomalies.

Dimension Barcodes


The difficulty is that we wish to map the data contained within the ‘spreadsheet’ to the data within the ‘LOB database’ so that for every defective complaint sent out, we can associate it to the product within our inventory. We can then contact manufacturers etc. Remember that in the main there will be a direct match between these two sources; it will only be ‘edge cases’ where the ‘spreadsheet’ data will match on the first 8 characters of our longer barcodes. As you can see from the sample barcodes above, if we were to perform the below join we would only match on the first two records:-

Spreadsheet.Barcode = Database.Barcode

The final barcode would not be resolved as 09812344 is not the same as 09812344-908. In order to match against this we would need a join like this:-

Database.Barcode LIKE Spreadsheet.Barcode || '%'

The percentage sign is of course required, as SQL Server requires wildcard characters for a LIKE statement to perform a fuzzy match; otherwise it is essentially just an equality test. Nice and simple so far, but just try and use that phrase inside an SAP data flow join. Not very helpful errors ensue, and the afternoon disappears in much the same way that my salary does… alarmingly fast.

As I alluded to earlier in this blog post the easiest way to address this is to split your processing into two joined queries, in the first we select only the fields we are interested in from our spreadsheet staging table and derive a new field entitled FuzzyBarcode. This field uses the following formula for its text

Spreadsheet.Barcode || '%'

This removes the need for monkeying around with pipe characters in the joins at a later stage. Next, in the second query, we bring in the data from the first query and join that to the data within our products table using the barcode join as shown below:-

Dimension.Barcode LIKE Spreadsheet.FuzzyBarcode

This data can then be output into your fact table, and as you should see, something magical happens. Not only does your data resolve in the manner that you wished, the execution of this workflow (so long as you have not done anything else too complex) gets entirely pushed down to the database, leaving you with a statement similar to this pseudo SQL:-

INSERT INTO FactComplaints (DimProductFK,RejectionReason)
SELECT D.Sid,S.RejectionReason 
FROM Spreadsheet AS S INNER JOIN DimProducts AS D
ON D.Barcode LIKE S.FuzzyBarcode

Perfect, a like clause within SAP BODS and everything gets pushed down to the database for maximum efficiency.
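The fuzzy barcode join itself is easy to try out in miniature with Python’s sqlite3 module (note SQLite also uses || for concatenation; the tables and rows below are invented for the demonstration):

```python
import sqlite3

# Tiny stand-ins for the spreadsheet staging table and DimProducts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Spreadsheet (Barcode TEXT, RejectionReason TEXT);
CREATE TABLE DimProducts (Sid INTEGER, Barcode TEXT);
INSERT INTO Spreadsheet VALUES ('01234567', 'Bruised'), ('09812344', 'Mouldy');
INSERT INTO DimProducts VALUES (1, '01234567'), (2, '09812344-908');
""")

rows = conn.execute("""
    SELECT D.Sid, S.RejectionReason
    FROM (SELECT Barcode || '%' AS FuzzyBarcode, RejectionReason
          FROM Spreadsheet) AS S         -- first query: derive FuzzyBarcode
    INNER JOIN DimProducts AS D          -- second query: the LIKE join
        ON D.Barcode LIKE S.FuzzyBarcode
    ORDER BY D.Sid
""").fetchall()

# Both the exact match AND the longer '09812344-908' barcode resolve.
print(rows)
```

Both spreadsheet rows find their product, including the one whose dimension barcode carries the extra suffix.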

Selectively Executing Jobs with SAP Data Services


Time…. It’s constantly against us in every working day, and it matters not how productive you manage to be, somebody will always rain on your parade with ‘just one more job’ to eat away at your life blood. So here’s the deal… I’ll tell you how to save a little bit of time during your day but…. Shhhhhhhhh, keep quiet about it! Maybe you could use the time you will doubtless save to start planning your takeover of the world; let’s face it, you could hardly make more of a hash of it than the current crop of lunatics are, could you? Agreed? Good, now without wasting more time let’s improve the world….

When it comes to Data Services one can spend an awful lot of time rerunning the same data imports over and over again, and the cost added up over time can be considerable. Sure, we can use the ‘Bypass’ functionality, but as anyone with more than a passing familiarity with this can testify, it can be more hassle than it’s worth; and of course once you start moving jobs between repositories everything goes further south than The Duke Boys on a day trip to the Alamo. In a chance conversation with my colleague Angus Menter (if he hasn’t lived there, it probably doesn’t exist) the frustrations of a particular user were relayed to me and I was asked how I coped with this. The answer is in fact relatively simple once you understand it, but there is a little bit of setting up.

So let’s state the problem: I have a small but complex job that moves data from a line of business database to a data warehouse for reporting purposes. This is rather simplistically accomplished in 3 workflows:-

  1. Destage Data (Move data from source database to destination database for reasons of performance)
  2. Create Dimension Table
  3. Create Fact Tables (using data created in step 2)


Now imagine that I had one massive view to move over in Stage 1, which I then needed to split into over 500 dimension tables in step 2, before creating one fact table referencing all of these dimension tables in step 3. Imagine now that Stage 1 takes 15 minutes to upload… It doesn’t take a genius to see that when it comes to designing, testing and running stage 2, that initial 15 minute upload is going to get very old very fast, as indeed will I. Ideally we would probably only want to run this initial stage once a day during the development process, if that. This we would accomplish by changing our main workflow to use ‘Conditional Workflows’ overarching our 3 main workflows.

We start by removing the standard workflows from our main canvas (don’t worry, you won’t lose them) and replacing them with three conditional workflows named in a similar manner to the deleted workflows as shown in the following image.


We then open up each conditional workflow in turn and, using the local object library’s ‘Workflow’ tab, we drag our previous normal workflow (named the same) into the ‘IF’ window of the conditional workflow. What this will mean is that if the logical statement (which we will write later) evaluates to true then we SHOULD run this workflow; otherwise we will run whatever is in the ‘ELSE’ window, in this instance…. nothing at all. Your 3 conditional workflows should look a little like the following 3 images.


Great, so that’s really coming along, but of course we need to manage the logic regarding whether or not these jobs run. How do we do this? Well, there are a number of ways that it could possibly be achieved, but I think for me the easiest and most flexible way is to use ‘Substitution Parameter Configurations’. For those of you not familiar with these, it is essentially a set of parameters that are assigned at run time and fed through to the executing job. The job may then query these values to find out ‘how it should run’. Let’s start off by defining our parameters; we want three, one for each workflow, so that we can control whether or not each runs. We thus open the Substitution Parameter Configurations dialog from the tools menu as shown below.


We will then be furnished with a dialog like the following where we can define our three parameters, I’m calling them:-

  1. $$DestageData
  2. $$LoadDimensions
  3. $$LoadFactsData


And I think that their usage is fairly obvious from their names. We set them all to have the value YES, indicating that as things stand they SHOULD run. We then press OK and dismiss the dialog. We then return to each of the Conditional workflows in turn and set the ‘IF’ expression using the ellipsis button surrounded with a green square toward the top right of the following dialog. We then flip to the variables tab, expand the Substitution Parameters item and drag our ‘$$DestageData’ parameter into the formula editor before finishing the formula as shown below:-


Note how both the parameter name and the value MUST be enclosed in single quotes as they are text values and not numeric.  We then do this for each of the three conditional workflows as shown below:-
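For reference, the finished ‘IF’ expression takes a shape a little like the sketch below; treat this as a sketch only, as the exact reference produced when you drag the parameter in may differ slightly on your installation, but the single quotes around both sides are the important part:

```
'[$$DestageData]' = 'YES'
```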


Great strides! We’re nearly there. In fact, let’s just test that what we have created works as intended. If you save and run your job, you should see that it does indeed function as it did previously, but of course we are STILL running all of the jobs.

We could leave it at this point if we wished, and omit the running of a stage simply by opening and changing the Parameter Configurations, which is perfectly acceptable. However, we will go a little further and, with a little bit more set up, eliminate the manual editing entirely. We start by once again loading up the Substitution Parameter Configuration dialog and renaming our Configuration to something a bit more obvious, in this instance ‘Run All’, and as you can see all of the stages are set to run.


We then create an exact duplicate of this configuration using the indicated button which we rename to RunAllExceptDestage. Yep as you’ve probably guessed we then change the ‘$$DestageData’ value to ‘NO’ indicating that it should NOT run.


Brilliant, so let’s run it, right? Hold on Tiger… we’re not there yet.

We then need to utilise the System Configurations as shown below:-


These are amazingly useful things that allow you to create configurations with associated substitution parameters and data stores, you could thus work against a live database, or a test one or run in a different manner entirely. In our instance we don’t have different data stores and so we are only really interested in changing the substitution parameter configurations that are used. We thus create two System Configurations and entitle them rather imaginatively ‘RunAll’ and ‘RunAllExceptDestage’. Each configuration then uses different substitution parameters meaning that one will execute the full process, whilst the second will omit the ‘Destaging’ operation.


Nice and simple, we save those changes and now upon executing our job we get a new dropdown to pick from:-


From this we select ‘RunAllExceptDestage’. Once executed you should see that we no longer run the Destage operation and instead jump straight to the Dimensions job instead.


Amazing eh? Enjoy your executions……




SAP Data Services – Importing Image Data

Quite recently I was working on an SAP Data Services project which involved importing a great deal of data about fruits, including photographs of the different varieties. There was a large amount of work to do, and I must admit that whenever I thought about this photographic element of the import I developed a kind of coder’s twitch; I would hit google, notice the total lack of suggestion as to how this could be achieved, and I would then push it very deliberately to the bottom of my pile. This is not normally how I work; I generally tend to do the hardest things first, make stunning progress and then realise how much I had underestimated the complexities of the ‘simpler tasks’! But this was image data, which is probably my least favourite thing; I like data… Eventually there came a point when I had no pile left under which to push this job any longer, and so I had to apply myself to the job in hand.

Let’s start by stating the problem which, let’s face it, could have been much worse. We were given a spreadsheet containing many different varieties of fruits along with an awful lot of data, most of which is, to be frank, completely irrelevant to this blog. This document also contained a relative path for each variety of fruit which described the location of a jpg file, the picture itself. My task was to get all of this data into a database for the reporting services to report against. So let’s take a simplistic look at the data:-

As you can see, we have a Variety column which describes the name of the fruit variety, along with a Product column which describes the type of fruit. Row 1 is thus a variety of ‘Mango’ called ‘Amelie’, whilst Row 2 is a variety of ‘Passion Fruit’ called ‘Edulis’. For each of these varieties we also have a ‘Photo’ column which in this instance stores a relative path (from the context of the location of the document itself). The base path of this document is in this instance:-


Helpfully, the first row we look at in our earlier screen capture has a null value, but the next row contains a valid relative path of:-

Exotic photos\Edulis.JPG

Using this relative path and the known location of the document itself we can thus build a path that details the location of the image relating to the Edulis Passion Fruit:-

K:\ImportFiles\Shared\Photos\Exotic photos\Edulis.JPG

Nice and simple. So we have the path; now how do we go about translating that path into BLOB data representing the actual image itself? This is where I struggled to find information. I scanned the function library looking for a ‘LoadImage’ method or similar before hitting the internet itself, and after turning over some pretty scabby stones I found a forum post that was pretty illegible, though it did contain a morsel of gold dust. It turns out there is no specific load function within the SAP library, which I had already ascertained. There was, however, a fairly simple, but equally obtuse, method for loading image data which involves writing the path name to a text file using a certain format and then opening and importing that data using a standard workflow. Let’s investigate just how that works. We start by processing our spreadsheet data into a format that can be written to our text file; the transformation I wrote looks like this:-

As you can see when you examine this first query I am merely removing any leading/trailing blanks and ensuring that null values are replaced with an empty string. Nothing tricky there nor even necessarily relevant to you.

The next operation is where the real work is done.

For the purposes of this demonstration we are only really interested in the ‘Photo’ field where as you can see we are building a string that looks a lot like this:-

<<K:\ImportFiles\Shared\Photos\Exotic photos\Edulis.JPG>>

The appending of the angle brackets will later tell SAP data services to treat the data contained within them as the path for an image that should be loaded. You needn’t worry yourself with the replace function that I have had to use, this is peculiar to my data alone. In reality your code will look more like this:-

'<<K:\ImportFiles\Shared\Photos\' || Query.PHOTO || '>>'

This data can then be written out to a staging file. My file is formatted like this; note how the Photo field we are really interested in has been defined as a blob.
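The path-wrapping step above can be sketched in a few lines of Python (a hypothetical illustration only: the base path mirrors the example earlier, and wrap_photo_path is my own made-up name, not anything from the SAP library):

```python
# Base path of the spreadsheet document, as in the worked example above.
BASE = "K:\\ImportFiles\\Shared\\Photos\\"

def wrap_photo_path(photo: str) -> str:
    # Wrap the full path in << >> markers; it is these markers that tell
    # Data Services to load the file at that path as blob data.
    return "<<" + BASE + photo + ">>"

print(wrap_photo_path("Exotic photos\\Edulis.JPG"))
# <<K:\ImportFiles\Shared\Photos\Exotic photos\Edulis.JPG>>
```

One such wrapped string per row is what ends up in the staging file’s Photo column.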

We then need to import the data, we thus create a new data flow looking much like this one using our new staging file type as the source.

The query component, with regard to the Photo field itself, simply maps the data across. In my instance I wanted to remove some blanks in another field, which is why the query is even necessary; the data is then simply written into a blob field in the target database.

And that my friends, is mission accomplished…. This blog will self destruct in 10 seconds.


Lost in Time? Reach For The Time Tables


So… fairly recently, whilst working on a regular client's data, they expressed an interest in displaying times for use in their dimensional database. We are all too familiar with using date tables in a dimensional manner, but times are much less used, at least in my experience. It is with this in mind, and with an hour to kill due to your average ‘Microsoft 71% progress bar’, that I decided to blog exactly how I went about doing this, as I was pleased with the results and the simplicity of the solution. Apologies, this solution uses an older version of SSIS, but the methodology can be tailored to the ETL tool of your choice. First let's take a look at the issue. We are all familiar with a date table where each day of the year between x and y is represented by a single individual record. The table below shows how that would work, starting from the 1st January 1900 (I’m nothing if not thorough!). Thus, instead of using a datetime column in our fact table to (for instance) detail a ‘Sail Date’, we can instead use the SID value to refer to a record within the DimCalendars table. The advantage of this is both speed and the fact that we can dice the date within this calendar table in many different ways (Week, Period, Month, Year, FinancialPeriods etc.)


Thus we could also do the same with a DimCalendarTimes table that would expose every second of the day from 00:00:00 to 23:59:59, allowing us to report on every single time that could potentially be encountered within a dataset. In our instance this data will look something like this, starting from the top of the day:-



As you can see, TimeToTheSecond reports every single second within the day (in order, which can come in useful when dealing with push down in SAP Data Services… bygones). TimeToTheMinute only changes with every whole minute, so that 00:01:00.000000 is the value used for every second from 00:01:00.000000 to 00:01:59.000000. This allows you to report on all events at, say, 13:53 without worrying about the seconds value. Taking this one stage further, we also have a TimeToTheHour field which only changes once every hour, allowing us to report on every event in the 10:00pm window without caring what the second or minute values were. In addition we have two extra fields: SID, being the unique identifier, and TimeString, a special field that we join to in order to properly resolve the foreign key values in the first instance; the primary use of this field is that it allows us to properly resolve NULL time values.
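To make those three granularities concrete, here is a small Python sketch (purely illustrative, this is not how the table itself is built) that derives all three values for a given second of the day:

```python
from datetime import time

def time_row(seconds_after_midnight: int):
    """Derive the three DimCalendarTimes granularities for one second of the day."""
    s = seconds_after_midnight
    to_second = time(s // 3600, (s % 3600) // 60, s % 60)   # full precision
    to_minute = time(s // 3600, (s % 3600) // 60, 0)        # seconds zeroed
    to_hour   = time(s // 3600, 0, 0)                       # minutes and seconds zeroed
    return to_second, to_minute, to_hour

second, minute, hour = time_row(50039)  # the 50,040th second of the day
print(second, minute, hour)
# → 13:53:59 13:53:00 13:00:00
```

All 86,400 rows of the table are just this calculation applied to the numbers 0 through 86,399.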


Below is displayed the tail of our DimCalendarTimes table along with that special null record. Note how the TimeToTheMinute for all except the nulls is set to 23:59:00 and the TimeToTheHour is 23:00:00.


You can of course dice this in many other ways (shift patterns, for instance), but this is what we required. So how was this achieved, and how complex was it? Well, the answer is really not that complex at all. In fact it took just two SQL statements, the first to put in all of those times:-

INSERT INTO DimCalendarTimes (TimeToTheSecond, TimeToTheMinute, TimeToTheHour)
SELECT
 CAST(DATEADD(second, Data.SecondsAfterMidNight, 0) AS TIME) AS UniqueSeconds,
 CAST(DATEADD(minute, (Data.SecondsAfterMidNight / 60), 0) AS TIME) AS UniqueMinutes,
 CAST(DATEADD(hour, (Data.SecondsAfterMidNight / 3600), 0) AS TIME) AS UniqueHours
FROM (SELECT TOP (86400)
       (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS SecondsAfterMidNight
      FROM sys.sysobjects AS t1 CROSS JOIN sys.sysobjects AS t2) AS Data

And the second to add that NULL value record:-

INSERT INTO DimCalendarTimes (TimeToTheSecond, TimeToTheMinute, TimeToTheHour)
SELECT NULL AS UniqueSeconds,
       NULL AS UniqueMinutes,
       NULL AS UniqueHours

Let's dissect that first SQL statement a little. We start off with the simple INSERT statement that determines which fields we are inserting:-

INSERT INTO DimCalendarTimes (TimeToTheSecond, TimeToTheMinute, TimeToTheHour)

Nice and simple. We then need to generate the values themselves, which we do by generating one record for each second in the day, there being 86,400 seconds in a day. The 86,400 rows are generated, as is usual with tally tables, by cross joining two large tables together (in SQL Server we generally use the sys objects tables) and generating a unique row number for each record:-
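The tally trick is easy to sanity check outside SQL Server. Here is a rough Python equivalent, with two 300-row ranges standing in for sys.sysobjects (the real tables just need enough rows between them for the cross join to exceed 86,400):

```python
from itertools import islice, product

table1 = range(300)   # stand-ins for sys.sysobjects; only the row count matters
table2 = range(300)   # 300 * 300 = 90,000 rows, comfortably more than 86,400

# Cross join the two "tables", number every row from 0, keep the first 86,400.
seconds_after_midnight = [n for n, _ in enumerate(islice(product(table1, table2), 86400))]

print(len(seconds_after_midnight), seconds_after_midnight[0], seconds_after_midnight[-1])
# → 86400 0 86399
```

The `product` call plays the part of the CROSS JOIN and `enumerate` the part of ROW_NUMBER(), yielding exactly the 0 to 86,399 sequence the SQL produces.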

(SELECT TOP (86400)
   (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS SecondsAfterMidNight
 FROM sys.sysobjects AS t1 CROSS JOIN sys.sysobjects AS t2
) AS Data

This SQL statement will simply yield 86,400 rows numbered from 0 to 86,399. We then simply need to generate the time values using the following simple pieces of date addition:-

 CAST(DATEADD(second,Data.SecondsAfterMidNight,0) AS TIME) 
     As UniqueSeconds, 
 CAST(DATEADD(minute,(Data.SecondsAfterMidNight/60),0) AS TIME) 
     As UniqueMinutes, 
 CAST(DATEADD(hour,(Data.SecondsAfterMidNight/3600),0) AS TIME) 
     As UniqueHours

The first column simply adds the row count value (the number of seconds) to the seed value of 0 and casts the result to a TIME object. This generates a true time starting at 00:00:00 right through to 23:59:59. The second and third columns do much the same, except that they first divide the seconds count by 60 and 3600 respectively (integer division, so the remainder is discarded) before adding the resulting whole minutes or hours, giving the time to the minute and the time to the hour. And that really is the crux of it. We of course also generate our null value record, as shown earlier, to allow us to resolve data rows with no time data (dimensional databases should ALWAYS resolve to a foreign key, for reasons of performance).

Resolving the times in our staging data into the foreign key is then as easy as the following pseudo SQL, which deals with both valid times and NULL times:-

FROM (SELECT CAST(S.TimeValue AS nvarchar(30)) AS CastTime, *
      FROM StagingData AS S
) AS S
INNER JOIN (SELECT SID, TimeString
            FROM DimCalendarTimes
) AS D
       ON COALESCE(LTRIM(RTRIM(UPPER(S.CastTime))), '--')
        = COALESCE(LTRIM(RTRIM(UPPER(D.TimeString))), '--')
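In case the COALESCE trick is unfamiliar: the idea is simply to substitute a sentinel value ('--') for NULL on both sides of the join, so that null staging times match the dimension's null record. A minimal Python sketch of the same idea (the SIDs here are made up purely for illustration):

```python
# Hypothetical slice of DimCalendarTimes: TimeString → SID, plus the null record.
dim_calendar_times = {"13:53:59": 50040, "23:59:59": 86400, None: -1}

SENTINEL = "--"  # stands in for NULL on both sides of the join

def resolve_time_sid(staging_time):
    """Mimic the COALESCE join: map valid times and NULLs alike onto a SID."""
    lookup = {(SENTINEL if k is None else k.strip().upper()): sid
              for k, sid in dim_calendar_times.items()}
    key = SENTINEL if staging_time is None else staging_time.strip().upper()
    return lookup[key]

print(resolve_time_sid("13:53:59"), resolve_time_sid(None))
# → 50040 -1
```

The trims and upper-casing mirror the LTRIM/RTRIM/UPPER wrapping in the join, making the match tolerant of stray whitespace in the staging data.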

Perfectly simple. I hope that this escape into the dimensions of time has been both useful and informative. For me, it has certainly proved a distraction… I hear that 76% is the new 71%, and besides, I feel sure that ‘Windows moon on a stick’ will be worth the tedium.