The short, no background, off-the-cuff, question is “when you need to copy a large portion of data from your legacy system to your SQL Server for use by web-based, intranet, desktop, and/or mobile applications, reports or reporting services, and other future sources, what is the best way to structure and import the data so that those sources all have access to the data, yet have queries return data quickly and efficiently, regardless of the volume of data?” (…whew!)
For those willing to read a little background info, the following is my situation and what I am running into…
I have a fairly good-sized SQL Server project ahead of me and I am trying to research the best way to do it, but I swear it seems the more I dig into the topic, the more questions I have that I am not finding answers to. I ask for feedback, suggestions, what you would do or have done, and so forth. I’ll try to be as brief as possible, but I need to create a clear as possible picture to ensure accurate feedback.
As I have mentioned in past questions, we here have an IBM iSeries AS/400 as our core (legacy) system which runs a custom designed application. This application consists of dozens and dozens of files (tables in SQL Server land) which hold the program data, user data, business data, and so on. There are two “halves” to the core system, which match our two lines of business. One half is the business line we have been in for some time where the other was just implemented this year, so the numbers of records in the data files are sized accordingly.
This year we built websites for each line of business that, in short, allow access to some data for our customers – nothing different than any other company. We do this by copying from the AS/400 over to our SQL Server 2000 a handful of files that contain the information we want to display on the web. Not all data in each file is needed on the web so the tables that are created on the SQL Server are truncated (column-wise) versions of that on the 400. This helps increase performance for searches and other data-retrieval.
This has worked great for us, but there was very little – to be honest – that was being provided on the web, requiring very little of the core business data be copied to the SQL Server. However, we would like to continue expanding this idea of copying data from the 400 to the SQL Server to allow more web-based functionality and services, as well as provide the means to conduct internal “data mining” and running of reports, all of which taking place off of the 400 to reduce system impact during business hours. This desire, both by me as well as those in my department and the “higher-ups”, is what has me in unknown territory.
The unknown territory for me is this concept of a “Data Warehouse”, which is the buzz-term I am hearing by my boss and higher up the chain. This term does describe what it seem we are on the verge of needing – a single location that contains all the data, replicated from our legacy system, where we can run our websites, reports, intranet, and son on and so forth. But the question I am faced, never having done this before, is “I know what I need, but how do I build it?”
Right now, we have a “main” database in the SQL Server for each business line, where the core tables reside, and I use DTS packages to refresh the data each night. From there we have a couple other databases to track who our registered web site users are, what services are available on-line, or other “meta-data” type information, but all queries or stored procedures that return or need core business data tap into one of those two “main” databases.
It is here that a few issues all come together as the base of my confusion on how to most appropriately design the overall structure to fit our needs. As I mentioned, the tables in the SQL Server are smaller (column-wise) versions of their AS/400 brethren to help increase performance. Yet, as the number of services increase (web-searches, reports, etc.) the number of columns required also goes up, and very quickly these tables could grow to 100+ columns. The queries that used to run quick, when the tables involved were each at most a couple dozen columns, would begin to bog down under the sheer volume of information merged in JOIN’s, sub-queries, or other methods. The indexes created against the “core” tables where designed for current needs, but as new applications, reports, etc. are created, they each need data in different ways causing the indexes to increase which I have read will quickly cause efficiency to dramatically decrease.
I thought perhaps a viable solution might be going to the opposite direction – that of carving the core data off into chunks of that which is and where needed. The idea I had would be to create a SQL Server database for each application or service, where I could copy business data from the main databases into tables within each application database, which could be indexed independently, and would allow processing and querying to be dispersed among many smaller databases instead of the core databases being continuously hammered. Secondly, this would open up the door for me to copy over from the AS/400 the huge tables, and not worry about indexing or how to optimize for querying. The main databases would, in effect, become a repository, or hand-off point for the core data to be dispersed where needed.
Sounded great, but when attempting to implement it, I ran into two major issues: 1) time, and 2) disk space.
The second issue is the lesser of the two, that of disk space. I knew this would be coming since I was copying data, I’m not that green. But it was an eye-opener just how much space is being used. For the more pressing issue, time, I mean that where before the DTS package may have taken an hour to copy over the data, the whole process now takes an hour and a half for the DTS on the expanded table, and then another hour for a second DTS package to “carve” up the data into the tables into the application database, index those tables, and shrink the data/log files of the database. And this was only one application database – I am afraid to think how long it would take were we to have 10 application databases!
After running into issues with both the centralized and decentralized structures, I thought perhaps the answer could be an amalgamation of the two – using a combination of tables and views to break the data apart. However, this introduced a problem when trying to refresh the data each night. The core tables are so huge that in order to accurately refresh them and maintain data integrity, we simply truncate and refill. But when using views on a table, truncate is no longer an option, only a delete can be performed, yet this causes all the deletes to be logged exponentially increasing the size of the transaction log, not to mention taking much more time to execute. And then as I was testing out various views, I discovered that many of the quires I would need to use to build the view have structures in them (CASE for instance) that are not allowed when trying to index the view.
So I am at a crossroads here. One path is the structure we have, a centralized core databases that the majority of querying runs out of that works for now but I can see becoming over bloated and inefficient very soon. Another, the decentralized structure with the data needed by each application in its own database to improve efficiency, but requires much more disk space and takes a fold increase in time to disperse per each database. A third is using some of each, but adding views reduces disk space but adds complexity to the over all structure, and complexity when trying to update/refresh the data each night.
Researching information on Data Warehouses really has not been much of a help as the information I find is of a high-level nature – discussing Dimensional Models (star, snowflake, etc.), dimensional tables, hierarchies, OLAP, OLTP, and more. Much of it I have not been exposed to before, others only in passing, very little do I understand in any clarity. Having only attended one training course in SQL Server (course 2073), the r4est of my experience being OTJ, responses I have received here, or other sources, you could say I feel a little overwhelmed here.
So, after all my rambling above, I hope that I may find some here that are willing to lend some expertise, some lessons learned, suggestions, solid yet easy to understand resources, etc. to help me get my hands around this situation and make an intelligent decision on how best to proceed.
Your help is greatly appreciated. Thanks!
-- Andrew
My new philosophy!
Life's journey is not to arrive at the grave safely in a well preserved body, but rather to skid in sideways, totally worn out, shouting "...holy ***...what a ride!"