It is surprisingly rare for IT people (in general) to appreciate that data warehousing requires a lot of disk space. Non-IT people, on the other hand, are more appreciative. They understand that if we collect data from many different places and put it in one place, then naturally it will require a lot of disk space.
But IT people are quite the opposite: they can appreciate that email and file servers are disk-hungry, but they cannot understand why data warehousing requires tons of disk space. Many IT people still compare data warehousing with other database applications, i.e. a few GB should be enough.
They cannot understand, for example, why the data warehouse is 500 GB if the source system is only 50 GB. Today was the 4th time I received such astonishment. So tonight I would like to write down the reasons why data warehousing requires a lot more disk than a transactional system does. I will try to write for an audience without a data warehousing background, using plain IT language and avoiding DW terms. So for readers who are from a DW background, please bear with me.
- In data warehousing, we make a lot of copies of the source system. This is called “snapshotting”. Say the main transaction table in the source system contains 2 million rows. In the DW, we store all 2 million rows, every single day (or week, or month). So in a year, we will be storing 365 x 2 = 730 million rows. In the DW we store the data in a more compact form (using integer keys), so if the source system row is, say, 500 bytes, in the DW the row width is probably only 250 bytes. And we don’t store every single column from the source system, only some of them, so the row width in the DW is probably only 100 bytes, one-fifth of its original size. So if the transaction table in the source system is 1 GB, the daily ‘snapshot’ in the warehouse is only 200 MB. Multiplied by 365, that’s about 70 GB a year.
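As a rough sketch, a daily snapshot load might look like this (all table and column names here are hypothetical, for illustration only):

```sql
-- Hypothetical daily snapshot load: copy today's state of the source
-- transaction table into the warehouse, stamped with the snapshot date.
-- Integer surrogate keys replace the wider natural keys, and only the
-- columns we need come across, which is why the snapshot row is narrower.
INSERT INTO dbo.FactAccountBalanceSnapshot
    (snapshot_date, account_key, customer_key, balance_amount)
SELECT CAST(GETDATE() AS date),       -- today's snapshot date
       a.account_key,
       c.customer_key,
       s.balance
FROM   stage.AccountBalance AS s
JOIN   dbo.DimAccount  AS a ON a.account_number = s.account_number
JOIN   dbo.DimCustomer AS c ON c.customer_id    = s.customer_id;
```

Run every night, each execution adds another full copy of the 2 million rows, which is where the 730 million rows a year come from.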
- In data warehousing, we are very concerned with query performance. We use a lot of indexing to provide good query performance, and indexes take a lot of space. In a transactional system, the indexes are between 20% and 75% of the table/data size. In data warehousing, the indexes are between 0.5 and 3 times the size of the tables. So if your DW database is 100 GB without indexes, the indexes would probably take between 50 and 300 GB: 50 GB if you try to be a minimalist, i.e. you only index where absolutely necessary (the primary reasons usually being ETL/load time and disk constraints), and 300 GB if you try to be a perfectionist, i.e. you index every single dimension key column in the fact table*, every single attribute used in the queries, etc.
*I don’t think I can keep it DW jargon-free now
Note that in SQL Server we don’t have a bitmap index like in Oracle, hence more space is required to achieve the same effect with indexing.
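To make that concrete, a “perfectionist” indexing approach on a fact table might look like this sketch (table, column and index names are hypothetical):

```sql
-- One nonclustered index per dimension key column of the fact table,
-- plus a covering index for a common query pattern.
-- On a fact table with millions of rows, each of these adds real disk space.
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey    ON dbo.FactSales (date_key);
CREATE NONCLUSTERED INDEX IX_FactSales_ProductKey ON dbo.FactSales (product_key);
CREATE NONCLUSTERED INDEX IX_FactSales_StoreKey   ON dbo.FactSales (store_key);

-- Covering index: includes the measure so the query never touches the table.
CREATE NONCLUSTERED INDEX IX_FactSales_DateProduct
    ON dbo.FactSales (date_key, product_key)
    INCLUDE (sales_amount);
```

The minimalist would keep only one or two of these; the perfectionist keeps them all, and the index space grows toward that 3-times-the-table figure.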
- In transactional systems, everything is normalised, so tables are relatively small, as there is no redundant data. In the warehouse, everything is denormalised, so tables are bigger, sometimes a lot bigger. This is because a denormalised data model has redundant data, i.e. the same value is repeated many times.
- In data warehousing, we create materialised views. “Materialised view” is an Oracle term (well, DB2 too); in SQL Server it’s called an indexed view. The idea of a materialised view is to run a query against a table and store the result of that query on disk, so that when the real query comes, the response is lightning fast. For example, a report submits this query, which is a typical query in DW, called a “star join”:
select something, something, sum(something)
from table1 join table2 on … join table3 on …
where … and … and … group by … having …
and the query takes 2 minutes. If we store the result of this query on disk, SQL Server doesn’t have to do the calculation again. The original query might read, say, 100 million rows, but the result may be only 1,000 rows. So while the original query takes 2 minutes, reading the materialised view is practically instant, because the materialised view contains the exact data that the query wants, i.e. the 1,000 rows.
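In SQL Server, materialising a query result looks roughly like this (the schema and names are illustrative, not from a real system):

```sql
-- An indexed view must be schema-bound, and an aggregate view
-- must include COUNT_BIG(*).
CREATE VIEW dbo.vSalesByProductMonth
WITH SCHEMABINDING
AS
SELECT  f.product_key,
        f.month_key,
        SUM(f.sales_amount) AS total_sales,
        COUNT_BIG(*)        AS row_count
FROM    dbo.FactSales AS f
GROUP BY f.product_key, f.month_key;
GO

-- This unique clustered index is what actually stores ("materialises")
-- the aggregated result on disk, and takes disk space accordingly.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByProductMonth
ON dbo.vSalesByProductMonth (product_key, month_key);
```

The stored result is tiny compared to the fact table it summarises, but it is yet another structure occupying disk, and a big warehouse typically has many of them.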
- In data warehousing, we like to partition. Partitioning is basically slicing a table into (say) 100 chunks and spreading them across (say) 5 disks. Say the estimated database size is 100 GB. If we don’t partition, we say to SQL Server: create database1 on disk1, size = 100 GB. When we partition across 5 disks, we need to create a filegroup (FG) on each of these disks. When creating these FGs, we can’t size each FG at 100 GB / 5 = 20 GB. We need to specify more, say 30 GB, because they don’t grow at the same pace: FG3 may be full whilst FG4 is still half empty, because the FGs contain different tables that grow at different rates. This results in more disk space than if we don’t partition the tables, e.g. 30 x 5 = 150 GB instead of 100 GB.
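A sketch of that layout in T-SQL (database name, drive letters, sizes and boundary values are all hypothetical):

```sql
-- One filegroup per disk, each sized larger than 100 GB / 5,
-- because the filegroups fill up at different rates.
ALTER DATABASE DW1 ADD FILEGROUP FG1;
ALTER DATABASE DW1 ADD FILE
    (NAME = DW1_FG1, FILENAME = 'E:\DW1_FG1.ndf', SIZE = 30GB)
    TO FILEGROUP FG1;
-- ... repeat for FG2 on F:, FG3 on G:, FG4 on H:, FG5 on I: ...

-- Slice the fact table by month: 4 boundary values = 5 partitions,
-- mapped onto the 5 filegroups (i.e. the 5 disks).
CREATE PARTITION FUNCTION pfMonth (int)
AS RANGE RIGHT FOR VALUES (200901, 200902, 200903, 200904);

CREATE PARTITION SCHEME psMonth
AS PARTITION pfMonth TO (FG1, FG2, FG3, FG4, FG5);
```

The over-sizing of each filegroup is exactly where the extra 50 GB in the 150-vs-100 example goes.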
- The operation of partitioning itself takes some space. It’s not a lot, but still something. I’ll illustrate with 2 operations: truncating by switching, and sliding window maintenance. In Oracle (from version 8, I think) we have a “truncate partition” command. In SQL Server we don’t, so we have to prepare an empty partition, then swap it with the partition that we want to truncate. This technique takes seconds, whereas a “delete” takes minutes (30-60 minutes is common).
The sliding window is a technique to maintain a rolling window, say the last 24 months. On the first (working) day of every month, we drop the oldest month (month 24) and add the current month. The data for each month is stored in a different partition, so at the beginning of every month we “switch out” the oldest partition and “switch in” the current month’s partition.
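A sketch of both operations, assuming a hypothetical fact table FactSales partitioned monthly on a function pfMonth and scheme psMonth, with an empty table FactSales_Switch of identical structure on the same filegroup:

```sql
-- "Truncate by switching": move the oldest partition's rows into the
-- empty staging table. This is a metadata-only operation, so it takes
-- seconds rather than the 30-60 minutes a DELETE would take.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Switch;
TRUNCATE TABLE dbo.FactSales_Switch;   -- truncating a plain table is instant

-- Sliding window: merge away the emptied oldest range, then split the
-- top of the range to create a partition for the new month.
ALTER PARTITION FUNCTION pfMonth() MERGE RANGE (200901);
ALTER PARTITION SCHEME psMonth NEXT USED FG1;
ALTER PARTITION FUNCTION pfMonth() SPLIT RANGE (200906);
```

The empty switch table and the spare partition at the top of the range are the “not a lot but still something” space this technique costs.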
- In data warehousing, we do a lot of backups. Apart from the main warehouse database and the data marts, we also back up the stage databases, standing data database, metadata database, ETL database, and ODS database. We also keep the source system backups. And we don’t store only 1 copy of the backup (last night’s); we store 5-10 copies, sometimes 20. For example, we keep a week’s worth of staging backups on disk (beyond this is usually on tape). This is to enable us to reload the DW if we had 2-3 days of incremental load failure, by redoing the load for those failed days. It is common for the DW backup size to be 5 times the size of the DW databases. If the total of all DW databases is 800 GB, the backups are probably 2-4 TB. Fortunately, they are on RAID 0, rather than RAID 10. RAID 10 requires twice as much disk as RAID 0. More about RAID on point 10 below.
- In data warehousing, we need a big space for development and testing. Dev and test environments are in TB, unlike in transactional systems where they are usually in GB. It is common for the dev environment to be 5 times the production size (the disk, not memory or CPU), and test is also 5 times prod. This is because: a) quite often we have several projects running simultaneously, and each project requires disk space similar to prod, and b) unlike testing a transactional system, where a minimal amount of data will do, testing a data warehouse often requires the full set of data, because the analyses/reports need a good amount of data. So if the total size of our DW databases is 500 GB, we can expect the development space to be 2.5 TB and the test/UAT space to be 2.5 TB.
- Log expansion at load time. When a warehouse is being loaded at night, the log expands. Yes, it’s true that we use a lot of bulk loading. Yes, it’s true that we set the databases to simple recovery mode. But in reality, at loading time the DW still expands. It’s just the nature of a log-based RDBMS, and SQL Server is no exception. So if one of your marts is, say, 30 GB when it’s first deployed, expect that during the normal daily load it will expand up to 50 GB, i.e. the data files (MDF & NDF) will stay relatively the same, but the log file (LDF) will expand during the load. Of course, after the DB maintenance plan is executed (backup log etc.), the log file size is reduced, but you’ve got to allow some space for expansion during loading. By the way, “backup log with truncate_only” is discontinued in SQL Server 2008.
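A minimal sketch of the post-load log housekeeping (the database name is hypothetical; from SQL Server 2008, switching to simple recovery replaces the old TRUNCATE_ONLY option):

```sql
-- Keep the mart in simple recovery so bulk loads are minimally logged.
ALTER DATABASE MartSales SET RECOVERY SIMPLE;

-- After the nightly load, reclaim the space the log grabbed during the load.
-- Under simple recovery the log truncates on checkpoint, but the file
-- itself still needs shrinking if it ballooned.
USE MartSales;
DBCC SHRINKFILE (MartSales_log, 1024);   -- target size in MB
```

Even with this housekeeping, the log must be allowed room to grow every night, which is the 30-GB-deployed, 50-GB-in-practice gap described above.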
- In data warehousing, we like RAID 10 (that’s 1+0, not 0+1). This is because we are very concerned about performance. In transactional systems, tables contain thousands of rows, sometimes hundreds of thousands. In data warehousing, the main tables (they are called “fact tables”) contain millions of rows, sometimes billions. When you have a table with 2 billion rows in it, you want to a) place it on 10 different disks to spread the IO (see point 5 above about partitioning), and b) put it on RAID 10 rather than 1 or 5 or 0. I’ll skip the technical details as my target reader is non-DW people: to make 1 TB of space, RAID 1 requires 2 TB of disks. RAID 5 requires 1.05 TB of disks (I’m simplifying here). RAID 10 requires 2 TB of disks. RAID 0, which I mentioned on point 7, requires 1 TB of disks. In data warehousing, we need RAID 10 because it gives the best query performance and the best load performance.
- Not only do we need RAID 10, we need many small sets of RAID 10. If you have 100 disks of 146 GB each (formatted capacity 137 GB), you don’t put all of them in 1 big RAID 10 array. No, that’s not good for performance, because then we can’t create many filegroups on different spindles for partitioning. Instead, we take 4 disks and make 1 RAID 10 of 274 GB. Take another 4 disks and make another 274 GB RAID 10. We will end up with 25 RAID 10s. This is good because now we can spread the IO when that 2-billion-row partitioned fact table is being queried.
I hope the above points have made you aware of the disk space requirements in data warehousing. And this is general, really: not only on the SQL Server platform, but on all other platforms. Whether you build your DW on Sybase, Informix or DB2, you will generally hit the same issues.
One more thing before I end this writing: exponential growth. It is the nature of data warehouses to grow exponentially. If you predict that the space you just bought should be enough for the next 3 years, expect it to be full within 1.5 years.