AZGroups.com

Bringing together Arizona Technology Users And Enhancing the Careers of Developers
Welcome to AZGroups.com Sign in | Join | Help
in Search

Request data replication/warehouse advice

Last post 09-05-2006, 9:30 PM by DinPHX. 1 replies.
Sort Posts: Previous Next
  •  09-05-2006, 1:31 PM 578

    Request data replication/warehouse advice

    The short, no background, off-the-cuff, question is “when you need to copy a large portion of data from your legacy system to your SQL Server for use by web-based, intranet, desktop, and/or mobile applications, reports or reporting services, and other future sources, what is the best way to structure and import the data so that those sources all have access to the data, yet have queries return data quickly and efficiently, regardless of the volume of data?”  (…whew!)

    For those willing to read a little background info, the following is my situation and what I am running into…

    I have a fairly good-sized SQL Server project ahead of me and I am trying to research the best way to do it, but I swear it seems the more I dig into the topic, the more questions I have that I am not finding answers to.  I ask for feedback, suggestions, what you would do or have done, and so forth.  I’ll try to be as brief as possible, but I need to create a clear as possible picture to ensure accurate feedback.

    As I have mentioned in past questions, we here have an IBM iSeries AS/400 as our core (legacy) system which runs a custom designed application.  This application consists of dozens and dozens of files (tables in SQL Server land) which hold the program data, user data, business data, and so on.  There are two “halves” to the core system, which match our two lines of business.  One half is the business line we have been in for some time where the other was just implemented this year, so the numbers of records in the data files are sized accordingly.

    This year we built websites for each line of business that, in short, allow access to some data for our customers – nothing different than any other company.  We do this by copying from the AS/400 over to our SQL Server 2000 a handful of files that contain the information we want to display on the web.  Not all data in each file is needed on the web so the tables that are created on the SQL Server are truncated (column-wise) versions of that on the 400.  This helps increase performance for searches and other data-retrieval.

    This has worked great for us, but there was very little – to be honest – that was being provided on the web, requiring very little of the core business data be copied to the SQL Server.  However, we would like to continue expanding this idea of copying data from the 400 to the SQL Server to allow more web-based functionality and services, as well as provide the means to conduct internal “data mining” and running of reports, all of which taking place off of the 400 to reduce system impact during business hours.  This desire, both by me as well as those in my department and the “higher-ups”, is what has me in unknown territory.

    The unknown territory for me is this concept of a “Data Warehouse”, which is the buzz-term I am hearing by my boss and higher up the chain.  This term does describe what it seem we are on the verge of needing – a single location that contains all the data, replicated from our legacy system, where we can run our websites, reports, intranet, and son on and so forth.  But the question I am faced, never having done this before, is “I know what I need, but how do I build it?”

    Right now, we have a “main” database in the SQL Server for each business line, where the core tables reside, and I use DTS packages to refresh the data each night.  From there we have a couple other databases to track who our registered web site users are, what services are available on-line, or other “meta-data” type information, but all queries or stored procedures that return or need core business data tap into one of those two “main” databases.

    It is here that a few issues all come together as the base of my confusion on how to most appropriately design the overall structure to fit our needs.  As I mentioned, the tables in the SQL Server are smaller (column-wise) versions of their AS/400 brethren to help increase performance.  Yet, as the number of services increase (web-searches, reports, etc.) the number of columns required also goes up, and very quickly these tables could grow to 100+ columns.  The queries that used to run quick, when the tables involved were each at most a couple dozen columns, would begin to bog down under the sheer volume of information merged in JOIN’s, sub-queries, or other methods.  The indexes created against the “core” tables where designed for current needs, but as new applications, reports, etc. are created, they each need data in different ways causing the indexes to increase which I have read will quickly cause efficiency to dramatically decrease.

    I thought perhaps a viable solution might be going to the opposite direction – that of carving the core data off into chunks of that which is and where needed.  The idea I had would be to create a SQL Server database for each application or service, where I could copy business data from the main databases into tables within each application database, which could be indexed independently, and would allow processing and querying to be dispersed among many smaller databases instead of the core databases being continuously hammered.  Secondly, this would open up the door for me to copy over from the AS/400 the huge tables, and not worry about indexing or how to optimize for querying.  The main databases would, in effect, become a repository, or hand-off point for the core data to be dispersed where needed.

    Sounded great, but when attempting to implement it, I ran into two major issues: 1) time, and 2) disk space.

    The second issue is the lesser of the two, that of disk space.  I knew this would be coming since I was copying data, I’m not that green.  But it was an eye-opener just how much space is being used.  For the more pressing issue, time, I mean that where before the DTS package may have taken an hour to copy over the data, the whole process now takes an hour and a half for the DTS on the expanded table, and then another hour for a second DTS package to “carve” up the data into the tables into the application database, index those tables, and shrink the data/log files of the database.  And this was only one application database – I am afraid to think how long it would take were we to have 10 application databases!

    After running into issues with both the centralized and decentralized structures, I thought perhaps the answer could be an amalgamation of the two – using a combination of tables and views to break the data apart.  However, this introduced a problem when trying to refresh the data each night.  The core tables are so huge that in order to accurately refresh them and maintain data integrity, we simply truncate and refill.  But when using views on a table, truncate is no longer an option, only a delete can be performed, yet this causes all the deletes to be logged exponentially increasing the size of the transaction log, not to mention taking much more time to execute.  And then as I was testing out various views, I discovered that many of the quires I would need to use to build the view have structures in them (CASE for instance) that are not allowed when trying to index the view.

     So I am at a crossroads here.  One path is the structure we have, a centralized core databases that the majority of querying runs out of that works for now but I can see becoming over bloated and inefficient very soon. Another, the decentralized structure with the data needed by each application in its own database to improve efficiency, but requires much more disk space and takes a fold increase in time to disperse per each database.  A third is using some of each, but adding views reduces disk space but adds complexity to the over all structure, and complexity when trying to update/refresh the data each night.

    Researching information on Data Warehouses really has not been much of a help as the information I find is of a high-level nature – discussing Dimensional Models (star, snowflake, etc.), dimensional tables, hierarchies, OLAP, OLTP, and more.  Much of it I have not been exposed to before, others only in passing, very little do I understand in any clarity.  Having only attended one training course in SQL Server (course 2073), the r4est of my experience being OTJ, responses I have received here, or other sources, you could say I feel a little overwhelmed here.

    So, after all my rambling above, I hope that I may find some here that are willing to lend some expertise, some lessons learned, suggestions, solid yet easy to understand resources, etc. to help me get my hands around this situation and make an intelligent decision on how best to proceed.

    Your help is greatly appreciated.  Thanks!

    -- Andrew


    My new philosophy!
    Life's journey is not to arrive at the grave safely in a well preserved body, but rather to skid in sideways, totally worn out, shouting "...holy ***...what a ride!"
  •  09-05-2006, 9:30 PM 580 in reply to 578

    Re: Request data replication/warehouse advice

    Hey a well thought out post.

    First on the issue of space.

    There are a couple of things here:

    1. Typing. When you get the data from the legacy system, how do the types map? Are you picking the smallest types that you need?

    2. Rows. If you need 2 years worth of data, are you only pulling 2 years worth or 3?

    3. Columns. You pretty much covered this one.

    Okay, onto the big one, OLTP vs. OLAP.

    I am gonna generalize abit on some stuff here, so if anyone else read this that is up on databases, give me a break. :)

    Okay, the TP stands for transactional or transaction processing. So insert, update and delete.

    AP is analytical processing, ie select.

    But wait, you're probably saying to yourself, "I've seen plenty of db's that were properly normalized that did all the standard CRUD stuff and we ran reports and everything was cool."

    Well yes and no.

    If you have a bunch of queries wherein the where statements use the same columns, you might make an index, right? (what is termed a covering index)

    Okay, well think of the joins you are having problems with in a primarily read scenario as the same situation.

    Those joins that you know and love are now causing you some strife. They take time to resolve.

    All this is kind of like the monkeys at the zoo randomly throwing poo though.

    What I mean is, everything in the database world is dependent on usage. You have mentioned a couple of different usages for these db's (reports, web, etc). Good. Now start looking at the major usage of those tables.

    What you'll find (and this is assuming that you are just doing reads) is that you will DEnormalize stuff heading towards the OLAP end of things.

    So instead of joining to the states table to get a state name or whatnot, you might look up the stateID and copy that to a new column in the customers table when you import/export the data via DTS.

    Then you have eliminated that join.

    Does it sound backwards if you are used to normalizing the hell out of everything? Yeah.

    But remember, everything about the db is about performance related to usage.

    That is why IMHO the dynamic DAL layer stuff is a fad. Because if I look into the usage and find that one proc is like 50% of the bottleneck, that's where I'll spend my time optimizing stuff. The dynamic wireup stuff is way, way behind in that area (from what I've seen).

    I'll pm/email you my contact info, feel free to drop me a line, I'll give you a hand.

    It's refreshing to see someone WANTING to understand databases instead of wanting to just ignore that they use a RDBMS and pretend that they use flat files.

  • View as RSS news feed in XML
    Powered by Community Server, by Telligent Systems