Just recently I was asked to make some changes to an existing data warehouse, we needed to add Tender data (that is, how a transaction was paid for) to existing transactional data. The difficulty with this was that we did not want to change the grain of the data which was at transactional level and often transactions are paid for with a mix of Tenders… (cash, debit card, discount coupons, cheques, credit cards). The data was to be presented to me exactly as it had been presented at the tills, that is where multiple payment types were used to pay for an item there would be multiple tender records each specifying the amount tendered and the tender type. The data would be presented in a table that looked a little like this:-
We were then asked to report the tender types used for each transaction and in instances where multiple payment types were tendered we were to show ONLY the largest Tender made. As an example where a transaction totalling £1000 was made £700 of which was in cash with the residue being made via a debit card, the transaction should be reported as a ‘Cash’ transaction. Simple enough, right. As there was a lot of data which I would be looking at I decided that I didn’t want any processing in memory and so I wanted to find a way to give me this data in one operation at the database level which should give the best performance. How best to do this? SQL being SQL there are many ways to achieve this but in this instance, let’s use my latest favourite function (I know, I know; you’re not supposed to have favourites….), the Partition function used in association with the Rank function.
So what are these functions, well despite using them both quite regularly I find myself not actually being able to describe very readily what the functions actually do so I had a little bit of research on my hands; The RANK and PARTITION BY () functions are in effect aggregate functions that when used in conjunction with the OVER ‘Windowed Function’ allow you to aggregate data within the resultset WITHOUT recourse to the GROUP BY clause which of course, operates against and dictates the shape of the entire query. Put simply they allow you to create aggregated data within an existing standard SQL query. The RANK () function ranks data using a numeric counter which is grouped by the PARTITION specified. That sounds complicated, even to me, so let’s just dive in take a look and hopefully, it will become more apparent. Let’s start by looking at the core SQL that we need to run and it really is very simple:-
SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount FROM TenderEntryData AS TEE ORDER BY TransactionNumber,StoreCode,Amount DESC
This requires no explanation other than we will return every single TenderEntryData record including those for transactions where more than one payment type was tendered. Let’s move on from this and add the Ranking and Partition function and then describe what is happening:-
SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount, RANK() OVER (PARTITION BY TransactionNumber,StoreCode) AS Ranking FROM TenderEntryData AS TEE ORDER BY TransactionNumber,StoreCode,Amount DESC
So we have added the whole of the second line here, one key thing to note is that Transaction Numbers are unique only to stores, they are not guaranteed unique across the whole enterprise. For this reason, we must partition our data by both Transaction Number and Store Code. The rank function will thus currently arbitrarily rank our rows of data grouping by unique transactions at a given store. This is a great start but we wanted to see the larger Tender Entries first and so we need to tinker with the SQL a little thus:-
SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount, RANK() OVER (PARTITION BY TransactionNumber,StoreCode ORDER BY Amount DESC) AS Ranking FROM TenderEntryData AS TE ORDER BY TransactionNumber,StoreCode,Amount DESC
To make things more readable I have split the statement onto line 2,3 and 4 now. Line 4 has now been extended to include an ORDER BY statement which means that we should now have all of the larger amounts ranked ‘1’ whilst secondary amounts will be ranked ‘2’ etc. There is however still an issue here, where we have a transaction with multiple tenders but the amounts are equal the data will be ranked arbitrarily according to how the database elects to order them. This is not in itself a problem at all, I would however just like to be explicit in picking the first (according to ID) in such instances so I will add a second clause to the ORDER BY. This should help us to avoid any strange behavioural issues later:-
SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount, RANK() OVER (PARTITION BY TransactionNumber,StoreCode ORDER BY Amount DESC,ID DESC) AS Ranking FROM TenderEntryData AS TE ORDER BY TransactionNumber,StoreCode,Amount DESC
Let’s have a look at that data now:-
Nearly there… So we now have ranking data for each payment type made against unique transactions at a store. What remains is to filter this data down as we ONLY wish to see the largest amount which is now easy as the largest amount will always have a Ranking Value of 1. What we cant however do is add the ‘windowed function’ specified into a where clause as this is invalid SQL. What we instead need to do is wrap the whole SQL we just created into a sub-query and then perform the filter, a little like this:-
SELECT * FROM ( SELECT ID, TenderDescription, StoreCode, TransactionNumber, Amount, RANK() OVER (PARTITION BY TransactionNumber,StoreCode ORDER BY Amount DESC,ID DESC) AS Ranking FROM TenderEntryData AS TE
ORDER BY TransactionNumber,StoreCode,Amount DESC) AS TenderData WHERE Ranking = 1
This SQL yields the following resultset:-
And we’re done. Lovely, simple and very performant SQL. As you can see the order by statement had to be removed as ORDER BY’s are not valid SQL in a sub-query. I’ve only scratched the surface as to what is actually achievable using the partitioning and windows functions, watch this space for more awesomeness in the future.