Coping with Large Dimensional Workflows

I have recently been working on a third tranche of ETL work for a client, essentially a reasonably complex data warehouse. It has been a great project and I’ve loved getting down and dirty with their data, although there have been some frustrating elements which, never having worked on such a complex linear project before, I did not foresee. This post is a discussion of the main issue, along with a working strategy that I would now use in preference.

To lay out the issue I encountered, it is first necessary to detail what we are trying to achieve. We have a large application database that is constantly updated daily by a large number of systems and processes. Overnight we extract any data that has changed and apply these changes to the data warehouse for reporting purposes. We are only interested in the raw data, not the identifying keys within the source database, as we will create our own keys within the data warehouse. We import all of this data into a staging table, from which we sieve the data and create one fact table with foreign key relationships linking it to many, many dimension tables. So far, uncontroversial…

[Diagram: the fact table and its dimension tables]
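
As a rough sketch of what the overnight extract amounts to (all table and column names here – SourceOrders, Staging_Orders, LastModified and so on – are hypothetical, and in reality this is driven by SSIS components rather than hand-written SQL), the incremental load into staging looks something like this:

DECLARE @LastRunTime datetime;
SELECT @LastRunTime = MAX(LoadedAt) FROM Staging_LoadLog;   -- hypothetical load log

INSERT INTO Staging_Orders (CustomerName, ProductName, OrderDate, Quantity, LoadedAt)
SELECT s.CustomerName,   -- raw descriptive data only...
       s.ProductName,    -- ...the source system's own keys are deliberately ignored
       s.OrderDate,
       s.Quantity,
       GETDATE()
FROM SourceOrders AS s
WHERE s.LastModified >= @LastRunTime;   -- only rows changed since the last run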

Where the difficulties arise is in exactly how you implement this, and moreover how much data we are talking about. Traditionally I would take this approach to a simple migration:-

  1. Import Source Data into Staging Area
  2. Create all missing Dimensional Data (from staging data columns)
  3. Create the Fact Table, mapping in the dimensional foreign keys using Lookup components as I go. See the pic and the SQL sketch below.

[Screenshot: data flow with Lookup and Merge Join components]
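
To make steps 2 and 3 concrete, here is a set-based sketch of what the package achieves for a single hypothetical dimension (DimCustomer); in the package itself this is done with Lookup and Merge Join components rather than SQL, so treat the table and column names as illustrative only:

-- Step 2: add any dimension members that appear in staging but are not yet in the dimension
INSERT INTO DimCustomer (CustomerName)
SELECT DISTINCT s.CustomerName
FROM StagingTable AS s
WHERE s.CustomerName IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM DimCustomer AS d WHERE d.CustomerName = s.CustomerName);

-- Step 3: build each fact row, resolving the dimension value to its surrogate key
INSERT INTO FactTable (DimCustomerFK, Quantity)
SELECT d.SID, s.Quantity
FROM StagingTable AS s
LEFT JOIN DimCustomer AS d ON d.CustomerName = s.CustomerName;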

Step 3 is the issue here; in a small or medium solution it works without issue, but once you get north of 50 dimension fields in a fact table DTS becomes very unresponsive, especially when loading or saving the ‘Merge Join’ transformations. In my instance I have about 100 of these dimensions and it’s starting to creak badly. Not to mention that if you need to make a change to a field in the source, you have to propagate that metadata change down through every single Merge Join component by opening it (it will inform you that it is incorrect and ask if you would like to fix the issues, to which you say yes), which is time consuming. At least you do NOT need to save each merge, which is a godsend: with save times of about a minute and a half per component, that would be 3 hours out of your day for no good reason.

So how do you get around this issue? Alarmingly simply, really, although I have yet to prove this in the wild… The root of the problem is using a Merge Join component to keep the data together (which is necessary as we’re loading from one table and saving to another in my instance). What we need to do is eliminate this costly merge, and the way we do that is by resolving the foreign key fields in the staging data table before doing a wholesale dump of the data into the fact table. The workflow will be thus:-

  1. Load all of the staging data using one SQL statement (as we did previously).
  2. Split the data as we did previously using a Multicast (one branch for each column that we are updating).
  3. Look up each column using a Lookup component (as we did previously).
  4. Update the foreign key field for the column we are looking up within the StagingTable using an OLE DB Command.
  5. Use a straight data dump to move the data from the staging table to the fact table, applying any filters that you would normally have done (a SQL sketch of steps 4 and 5 follows the diagram below).

[Diagram: revised workflow resolving foreign keys in the staging table]
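
As a rough guide to what steps 4 and 5 amount to, the OLE DB Command in step 4 issues something along the lines of the per-row UPDATE below, and step 5 is then a straight insert from staging into the fact table. The column names and the StagingID key are hypothetical; the ? parameters are the values fed in from the Lookup output:

-- Step 4: each OLE DB Command resolves one foreign key column, e.g. for a customer dimension
UPDATE StagingTable
SET DimCustomerFK = ?      -- surrogate ID returned by the Lookup
WHERE StagingID = ?;       -- hypothetical key identifying the staging row

-- Step 5: once every FK column is populated, dump staging straight into the fact table
INSERT INTO FactTable (DimCustomerFK, DimProductFK, Quantity)
SELECT DimCustomerFK, DimProductFK, Quantity
FROM StagingTable;         -- plus whatever filters you would normally apply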

Where the warehouse is large this should make the whole editing and refactoring process for a given job a hundred times more pleasurable and performant than before. Sadly it is a little bit late in this iteration for me to refactor my DTS package to work like this, but I will certainly be adopting this methodology in the future for all DTS work, as the improvements in productivity are immeasurable. I also think that the setup costs are a lot lower too, as you do not need to worry about synchronising the name of the output column with its real name in the Lookup component.

As I said, I have yet to test this in anger but hopefully it will act as a heads up for anybody embarking on such an enterprise.

Lookup Date Values in DTS package

I recently came to extend some DTS packages that we have implemented for a client to include a LOT of new date-styled dimensions. You know the kind of thing… We have a date table with all of the possible dates over a 400-year span, which we all figure should suffice; if I’m still coding beyond the end of that range I’m not entirely sure that DTS will be the framework of choice. Hell, it’s already VERY superseded even as we speak! What I did not appreciate was that date lookups would be problematic in DTS.

To state the problem we have the ‘DimCalendarDates’ Dimension table defined thus:-

  • Sid (int) – Surrogate ID
  • CalendarDate (date) – The date that this record pertains to.
  • FirstDayInMonth (date) – The first date within this month.
  • Year (int) – The year that this record appears within

For the purposes of this blog we are only really interested in the CalendarDate field, which is the field that we join to, and the SID field, which is the surrogate ID that we populate the facts table with instead of a date. Our example fact table ‘DataFacts’ is defined thus (a DDL sketch of both tables follows the field list):-

  • SID (int)- Surrogate ID Field
  • DimStartDateFK (int) – Links to the DimCalendarDates table on the SID field
  • DimEndDateFK (int) – Links to the DimCalendarDates table on the SID field
  • DimArchiveDateFK (int) – Links to the DimCalendarDates table on the SID field
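
For reference, a minimal DDL sketch of the two tables as described above would look something like this (the IDENTITY and PRIMARY KEY settings, and the nullability, are assumptions on my part):

CREATE TABLE DimCalendarDates (
    Sid             int IDENTITY(1,1) PRIMARY KEY,   -- surrogate ID
    CalendarDate    date NULL,        -- the date this record pertains to
    FirstDayInMonth date NULL,        -- the first date within this month
    [Year]          int NULL          -- the year this record appears within
);

CREATE TABLE DataFacts (
    SID              int IDENTITY(1,1) PRIMARY KEY,  -- surrogate ID
    DimStartDateFK   int NULL,        -- links to DimCalendarDates.Sid
    DimEndDateFK     int NULL,        -- links to DimCalendarDates.Sid
    DimArchiveDateFK int NULL         -- links to DimCalendarDates.Sid
);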

I tend to populate my tables in these easy-to-follow steps:-

  1. Load data from source tables into staging table.
  2. Populate dimension tables with new dimension data found in staging table.
  3. Load staging data into dataset.
  4. Using a Lookup DTS component, look up the SID for each dimension field required and persist it in the dataset.
  5. Update/Insert the facts table data.

OK, step 1 we are not bothered about in this instance as it is irrelevant to the blog, and as we are using a predefined and pre-populated calendar table we need not worry about step 2 either. Step 3 is where we load our data, so in the instance above we would use a Get component as shown:-

[Screenshot: the source component issuing the staging query]

And issue the following SQL statement within it:-

SELECT SID, StartDate, EndDate, ArchiveDate 
FROM DataStaging

Simple. Now we perform the lookup; we configure the component thus:-

[Screenshots: Lookup component configuration]

And then we merge this data into the dataset, taking all of the data from the incoming staged data and appending the lookup value, as the image below shows:-

[Screenshot: merging the lookup value back into the data flow]

Our finished transformation looks a little like this:-

[Screenshot: the finished transformation]

Brilliant, except for one small thing: it doesn’t work… At least not correctly. Quite a few, but crucially not all, of the lookup matches will fail to find a corresponding date even though the value is in the underlying DimCalendarDates table. The problem is to do with Siberian winds, I think; that is to say, I don’t know what the issue is and I care even less. Basically it seems that DTS lookups work best against textual data, and so the answer is indeed simple: treat the date fields as strings. But no-one wants to store date fields as text, right? So here’s the two-fold approach I took:-

  1. Add a new formula (computed) field to the DimCalendarDates table to express the date as a string value
  2. Change the lookup expression to convert the lookup value to a string representation of the date
  3. In both instances I needed to take care of NULLs as well (I have a NULL date within my DimCalendarDates table)

OK, so first off we create the formula to express the date field as a string value. I applied the following update to the DimCalendarDates table:-

ALTER TABLE DimCalendarDates ADD
 [CalendarDateAsString]  AS (CONVERT([nvarchar],[CalendarDate],(101))) PERSISTED

Nice and simple, and for performance reasons I also persist the string field so that it is worked out only on updates, not on ALL reads.
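
A quick sanity check along these lines (the sample date is arbitrary) shows what the persisted column will contain, and why the NULL calendar row needs a little thought:

SELECT CalendarDate, CalendarDateAsString
FROM DimCalendarDates
WHERE CalendarDate = '20150630' OR CalendarDate IS NULL;
-- 2015-06-30 comes back as '06/30/2015' (style 101 is mm/dd/yyyy)
-- the NULL date row yields a NULL CalendarDateAsString, hence the need to handle NULLs on the lookup side too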

The conversion of the lookups is a little more involved:-

[Screenshots: the modified Lookup configuration]

In essence, what we have done is change the query to compare the CalendarDateAsString formula expression to a textual representation of the date that we are looking up (represented in this instance by a question mark). Note that the inner SQL query on the Advanced tab must match exactly the source as defined on the Connection tab. Now the lookups should behave exactly as one would have expected.
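
For reference, the modified lookup query ends up along these lines; the exact wrapper SSIS generates may differ slightly, and the comment describes one possible way of catering for the NULL date row mentioned earlier rather than the only way:

select * from
(SELECT SID, CalendarDateAsString FROM DimCalendarDates) [refTable]
where [refTable].[CalendarDateAsString] = CONVERT(nvarchar, ?, 101)
-- to let a missing date resolve to the NULL calendar member, both sides of the
-- comparison could be wrapped in ISNULL(..., ''), for example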

You live and learn, well…. you live anyway.

How to keep your SSIS package ‘DRY’

So, you all know the deal… When it comes to coding there is one rule that all half-decent programmers aspire to. Nope, it’s not ‘Never toss a dwarf’, which, whilst being reasonably good advice, is probably not quite relevant to normal weekly challenges at the coal face of code. It is this: keep your code DRY.

And we do, we really do. We all go to huge lengths, using interfaces and the like, to ensure that the same operations can be leveraged against differing but similar kinds of objects. However, when it comes to using SSIS for ETL work it always seems that all of these principles go flying out of the door; at least it seems to for me. SSIS is not best suited to DRY, and I have lost count of the number of times I have defined workflows like this:-

[Screenshot: a data flow with one import branch per Excel file]

So, my challenge was to import ‘n’ identically formatted Excel files (with different data) into the same database table after adding a few snippets of data. In the screenshot above I have only shown two of the Excel files to import; the client wanted 12 today, but up to 30 at some point in the future. Hmmmmm. So we discussed the matter further and it then became obvious that sometimes, just sometimes, they would provide only one file or maybe even none. It quickly became apparent that my clumsy transformation as defined above would not have worked, for all of the following reasons:-

  1. Defining 30 import files would be a lot of work, and they would then need to be merged together.
  2. Defining 30 files and then having none of them, or only one of them, resolve would make the transformation fall over at run time with metadata issues.
  3. What if the client suddenly decided that they want 1000 files? I am NOT going to set all of those up manually.
  4. I would also have to define one connection for each file…
  5. There’s no flexibility: if they needed to support 30 files one week and ten the following, as things stand we would need two different packages (or routines within the package).
  6. Most importantly, it violates DRY principles. If I ever needed to change the way the SQL was selecting data from the files I would need to make that change in 30 transformations, or even more!

So, I had a drawing board to get back to… Now, as a seasoned SSIS user I must admit that I have never looked too deeply into some of the control flow components, topmost amongst these being the Foreach Loop Container.

[Screenshot: the Foreach Loop Container]

With a little bit of research I realised that this was exactly what I needed, so without further ado I decided to implement a solution for my challenge using this component. I started, of course, by defining a new Excel connection (and ONLY one!). I set it to a file that represents the data that would be loaded at run time for each file processed; remember that all of my import files are identical in format.

[Screenshot: the Excel connection]

I would also need a couple of variables: one to store the directory I am interrogating and one to store the name of the file currently being processed in the Foreach Loop. These were defined like so:-

[Screenshots: the currentFilepath and fileName variable definitions]

Now I could start to play with my Foreach Loop. I dragged a new component onto my Control Flow surface and then added a script component (more on that later) and my own workflow. The result was this little transformation process:-

[Screenshot: the Foreach Loop containing the script component and data flow]

I then configured the Foreach Loop using the ‘Collection’ option, thus:-

[Screenshot: the Foreach Loop Collection configuration]

As you can see, I have created a Foreach File enumerator, which determines exactly what sort of behaviour this component will perform. Note how I have then created an expression (using the Expressions editor dialog) that sets ‘Directory’ to the value stored in my ‘currentFilepath’ variable; I could of course have just hard-coded the value using the Browse dialog further down the screen. In my instance I only want to use ‘xls’ files (Office 2003 format Excel files), so using the ‘Files’ field I specify only *.xls files. Finally, I want all of the file paths returned to me to be fully qualified path names, and so that is what I have selected.

Then I needed to configure just what I should do with the value returned. In my instance I wanted to read the value (the fully qualified file name) into a variable for later use. We set that up using the ‘Variable Mappings’ page as below:-

[Screenshot: the Variable Mappings configuration]

As you can see, we are setting the output result (with files, index 0 is the ONLY option) into our ‘fileName’ variable. Once this has been done we have one job left to do: we need to programmatically change the connection string of our ‘Current Financial’ connection to point to the file we are currently processing. This is what we use the script component for, as mentioned earlier. So firstly we need to tell the script component what user variables it has access to:-

[Screenshot: the script component’s variable configuration]

Finally, we can now get to the scripting, which is simplicity itself:-

[Screenshot: the script code]

As you can see, all we really do is obtain references to the connection object and the ‘fileName’ variable (both by name). We then create a valid connection string using our new file name, and finally we set this new connection string into the ConnectionString property of the connection object.

That’s it, we’re done, and running the package we can see that it doesn’t matter whether there are zero or a million files within the directory: it will process them (in no particular order) and add them to our staging table ready for full adoption onto the system. I’m pleased with this solution; it’s DRY’d my transformations up nicely and given me a flexibility I would not have been able to achieve any other way.