LarryETL


I will be receiving an incremental file that will contain "I" inserts, "U" updates, and "D" deletes as an action code that I will load into a transform table. If I conditionally split on the action code, I will lose the FIFO order, and the data in the transform sequence could be applied out of order, causing an error. If I use 3 separate data flows, I can apply the incremental doing the inserts first, then the updates, and then the deletes. I think this would work, but I would have to read the transform table three times.

I am looking for some feedback on what is the best, most efficient way to handle what I am trying to accomplish.

Thanks

Larry




Re: Incremental process based on Action Code

JayH


I think reading the table three times is negligible compared to executing individual statements. If you wanted to only read it once, yet serialize the statements, you could use a script to execute the statements instead of splitting them to different OLE DB commands, but I wouldn't do that.

I don't know if it will work, but the most efficient way is to join between your target table and the transform table for the INSERT, UPDATE, and DELETE statements and not bring the statements through SSIS at all.
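
For illustration, here is a minimal sketch of what those set-based statements could look like. The table and column names (dbo.TargetTable, dbo.TransformTable, BusinessKey, ActionCode, Col1, Col2) are placeholders, not from the thread, and as discussed later this only behaves correctly if each key appears at most once per batch.

Code Snippet

-- Inserts: 'I' rows that do not already exist in the target
INSERT INTO dbo.TargetTable (BusinessKey, Col1, Col2)
SELECT t.BusinessKey, t.Col1, t.Col2
FROM dbo.TransformTable AS t
WHERE t.ActionCode = 'I'
  AND NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS d
                  WHERE d.BusinessKey = t.BusinessKey);

-- Updates: join the target to the 'U' rows
UPDATE d
SET d.Col1 = t.Col1,
    d.Col2 = t.Col2
FROM dbo.TargetTable AS d
JOIN dbo.TransformTable AS t ON t.BusinessKey = d.BusinessKey
WHERE t.ActionCode = 'U';

-- Deletes: remove target rows that have a 'D' row in the transform table
DELETE d
FROM dbo.TargetTable AS d
JOIN dbo.TransformTable AS t ON t.BusinessKey = d.BusinessKey
WHERE t.ActionCode = 'D';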






Re: Incremental process based on Action Code

Phil Brammer

JayH wrote:
I think reading the table three times is negligible compared to executing individual statements. If you wanted to only read it once, yet serialize the statements, you could use a script to execute the statements instead of splitting them to different OLE DB commands, but I wouldn't do that.

I don't know if it will work, but the most efficient way is to join between your target table and the transform table for the INSERT, UPDATE, and DELETE statements and not bring the statements through SSIS at all.

How about a compromise? Load the file to a staging table while splitting off the inserts to an OLE DB Destination using fast load. Then, back in the control flow, execute two "Execute SQL" tasks. The first updates the destination table by joining to the staging table. The second deletes from the destination, also by joining to the staging table.
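
As a rough sketch of those two "Execute SQL" tasks (dbo.Destination, dbo.Staging, and the column names are placeholders, assuming the staging table keeps the action code):

Code Snippet

-- Execute SQL task 1: apply the 'U' rows by joining to staging
UPDATE d
SET d.Col1 = s.Col1,
    d.Col2 = s.Col2
FROM dbo.Destination AS d
JOIN dbo.Staging AS s ON s.BusinessKey = d.BusinessKey
WHERE s.ActionCode = 'U';

-- Execute SQL task 2: apply the 'D' rows by joining to staging
DELETE d
FROM dbo.Destination AS d
JOIN dbo.Staging AS s ON s.BusinessKey = d.BusinessKey
WHERE s.ActionCode = 'D';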







Re: Incremental process based on Action Code

Jamie Thomson

LarryETL wrote:

I will be receiving an incremental file that will contain "I" inserts, "U" updates, and "D" deletes as an action code that I will load into a transform table. If I conditionally split on the action code, I will lose the FIFO order, and the data in the transform sequence could be applied out of order, causing an error. If I use 3 separate data flows, I can apply the incremental doing the inserts first, then the updates, and then the deletes. I think this would work, but I would have to read the transform table three times.

I am looking for some feedback on what is the best, most efficient way to handle what I am trying to accomplish.

Thanks

Larry

Why is there a necessity to load into a table first? With SSIS you don't have to do that.

Having said that, once the data is in a table you don't need to use data-flows at all - which is what Jay suggested. If you DO want to use data-flows then the use of raw files will negate the need to read the table three times.

In my opinion, none of these solutions will achieve FIFO processing though. Of course it's up to you whether that matters or not.

-Jamie






Re: Incremental process based on Action Code

LarryETL

Thanks for all your suggestions.

I think I need to provide some additional information to help in coming up with the right decision.

1) The input is an Oracle dmp file from a mainframe that I source into transform tables in SQL Server after importing into a local version of Oracle.

This Oracle dump file is created using triggers in Oracle, therefore order is extremely important. Example: if a record gets inserted, then updated, then deleted, and then inserted again, applying those trigger-generated changes out of order would be a problem and I would lose my record.

2) There are 70 tables now, but over 300 in total may be needed in the future.

I believe SSIS is needed to keep the sanity for the control flow and logging.

3) The tables are huge: 25-50 million historical records with 40-50 columns each. The incremental is monthly and contains 100,000-200,000 records.

4) The data on the SQL side is over-indexed (all columns have non-clustered indexes), so regardless of which method is chosen, all the non-key indexes need to be dropped and rebuilt.

FYI, it usually takes about 4-5 hours to rebuild (see the sketch of the disable/rebuild statements below).

This will be running on a 64-bit ETL engine and a 64-bit database set up in an n-tier architecture: 4 CPUs and 32 GB of memory for the ETL server, and 4 CPUs and 8 GB of memory for the database server.
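
Assuming the index handling is done in T-SQL, a minimal sketch of the disable/rebuild step around the load could look like this (dbo.TargetTable and IX_TargetTable_Col1 are placeholder names, not from the thread):

Code Snippet

-- Disable each non-clustered index before the load (repeat per index);
-- leave the clustered index alone or the table becomes inaccessible.
ALTER INDEX IX_TargetTable_Col1 ON dbo.TargetTable DISABLE;

-- ...apply the monthly incremental here...

-- Rebuild afterwards; REBUILD also re-enables the disabled indexes.
ALTER INDEX ALL ON dbo.TargetTable REBUILD;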





Re: Incremental process based on Action Code

Jamie Thomson

The main point here is that you want to process the records in the order that they occur. So, I don't think doing all the inserts followed by all the updates followed by all the deletes (or whatever order you do them in) will work.

I think you are going to have to issue a SQL statement for every row in the batch. So yes, this will mean 100000-200000 SQL statements being fired. It sounds as though your processing time window (of one month) is big enough for this to not be a problem.

The OLE DB Command is perfect for this by the way.
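
To make that concrete, one hedged sketch (not from the thread) is a small stored procedure that branches on the action code, called once per row from a single OLE DB Command whose SqlCommand is EXEC dbo.usp_ApplyChange ?, ?, ?, ?. All names, columns, and types below are illustrative. Because every row goes through the same command, no conditional split is needed and the pipeline order is preserved.

Code Snippet

-- Hypothetical per-row procedure; names and columns are placeholders.
CREATE PROCEDURE dbo.usp_ApplyChange
    @ActionCode  CHAR(1),
    @BusinessKey VARCHAR(50),
    @Col1        VARCHAR(100),
    @Col2        INT
AS
BEGIN
    SET NOCOUNT ON;

    IF @ActionCode = 'I'
        INSERT INTO dbo.TargetTable (BusinessKey, Col1, Col2)
        VALUES (@BusinessKey, @Col1, @Col2);
    ELSE IF @ActionCode = 'U'
        UPDATE dbo.TargetTable
        SET Col1 = @Col1, Col2 = @Col2
        WHERE BusinessKey = @BusinessKey;
    ELSE IF @ActionCode = 'D'
        DELETE FROM dbo.TargetTable
        WHERE BusinessKey = @BusinessKey;
END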

-Jamie






Re: Incremental process based on Action Code

JayH

LarryETL wrote:

1) The input is an Oracle dmp file from a mainframe that I source into transform tables in SQL Server after importing into a local version of Oracle.

This Oracle dump file is created using triggers in Oracle, therefore order is extremely important. Example: if a record gets inserted, then updated, then deleted, and then inserted again, applying those trigger-generated changes out of order would be a problem and I would lose my record.

2) There are 70 tables now, but over 300 in total may be needed in the future.

I believe SSIS is needed to keep the sanity for the control flow and logging.

3) The tables are huge: 25-50 million historical records with 40-50 columns each. The incremental is monthly and contains 100,000-200,000 records.

4) The data on the SQL side is over-indexed (all columns have non-clustered indexes), so regardless of which method is chosen, all the non-key indexes need to be dropped and rebuilt.

FYI, it usually takes about 4-5 hours to rebuild.



Ugh. 4-5 hours sounds optimistic. Is there any way to partition those 40-50 million records by table, so you could do them in parallel? Otherwise, I think that you're stuck doing the insert/update/delete statements sequentially. You could either execute the statement in a script component as I previously mentioned, or you could pass the statement to sp_executesql using the OLE DB Command.
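
For the sp_executesql route, a minimal sketch (assuming a script component has already built the full statement into a Unicode pipeline column, here called SqlStatementText, which is a placeholder name): the OLE DB Command's SqlCommand property would simply be

Code Snippet

EXEC sp_executesql ?
-- the single parameter maps to the SqlStatementText column;
-- sp_executesql expects the statement as NVARCHAR, so the column
-- should be DT_WSTR/DT_NTEXT in the pipeline.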




Re: Incremental process based on Action Code

JayH

JayH wrote:

Ugh. 4-5 hours sounds optimistic. Is there any way to partition those 40-50 million records by table, so you could do them in parallel? Otherwise, I think that you're stuck doing the insert/update/delete statements sequentially. You could either execute the statement in a script component as I previously mentioned, or you could pass the statement to sp_executesql using the OLE DB Command.


Upon further reflection I see that the 40-50 million is per table, so no further partitioning will be possible. I've also realized that we're not receiving fully formed statements, but likely only values in columns with a flag to indicate whether the row was inserted, updated, or deleted.

I see Jamie's comment that the OLE DB Command is perfect, but I'd be interested in how he's come to that conclusion. We can't have three OLE DB Commands (one each for inserts, updates, and deletes) and route our rows to them with a conditional split, since we can't serialize the commands that way.

I think a script to build a SQL statement is necessary. The statement can either be executed from the script or from an OLE DB Command with sp_executesql.




Re: Incremental process based on Action Code

JayH

JayH wrote:

Upon further reflection I see that the 40-50 million is per table, so no further partitioning will be possible. I've also realized that we're not receiving fully formed statements, but likely only values in columns with a flag to indicate whether the row was inserted, updated, or deleted.


Hi. Me replying to myself again. If we do have columns with values, then we could partition the statements by the keys as long as the keys don't change. This would allow an arbitrary degree of parallelization.
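
As a hedged illustration of splitting by key (the column names and the stream count of 4 are placeholders), each parallel stream could read its own slice of the transform table, something like:

Code Snippet

-- Stream 0 of 4; the other streams use ... % 4 = 1, 2, 3.
-- Keys must not change within the batch for this to be safe.
SELECT ActionCode, BusinessKey, Col1, Col2, ActionDate
FROM dbo.TransformTable
WHERE ABS(CHECKSUM(BusinessKey) % 4) = 0
ORDER BY ActionDate;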

Hopefully with data this size, the SQL Server tables are also partitioned across multiple file groups. You would of course want your key partitioning to match SQL Server's.




Re: Incremental process based on Action Code

LarryETL

If I were to disregard the action code and use the lookup for detecting changes that you guys always recommend, would this keep the order? Would there be a significant performance difference?

Thanks,

Larry





Re: Incremental process based on Action Code

Jamie Thomson

JayH wrote:


I see Jamie's comment that the OLE DB Command is perfect, but I'd be interested in how he's come to that conclusion. We can't have three OLE DB Commands (one each for inserts, updates, and deletes) and route our rows to them with a conditional split, since we can't serialize the commands that way.


The reason I say that is that these records need to be processed in the order that they were written to the source. Hence we need to execute a record at a time. This is inherently what the OLE DB Command does - hence my comment above. Perhaps "perfect" was the wrong word but hopefully you get my meaning.

I absolutely agree that the OP cannot split the data into three chunks i.e. insertions, deletions, updates.

-Jamie






Re: Incremental process based on Action Code

JayH

LarryETL wrote:

If I were to disregard the action code and use the lookup for detecting changes that you guys always recommend, would this keep the order? Would there be a significant performance difference?

Thanks,

Larry



That depends on how many of the records from Oracle don't represent changes in SQL. I'm beginning to think of this as a type of Oracle->SQL replication. There could be a big potential performance difference if you can identify a significant number of those 40-50 million records that don't have to be applied. I wouldn't think there would be many. The strategy would also depend on having enough memory on the SSIS machine that you could fully cache the target table in the lookup, which seems highly doubtful.

As for keeping order, they shouldn't ever get out of order, but if it's a concern, you can issue a row number using a script component or this one.
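
If the rows are staged first, a T-SQL alternative to the script component (column names below are placeholders) is to make the intended sequence explicit with ROW_NUMBER:

Code Snippet

-- Assign an explicit ordering column based on when the change occurred.
SELECT ROW_NUMBER() OVER (ORDER BY ActionDate) AS RowNum,
       ActionCode, BusinessKey, Col1, Col2
FROM dbo.TransformTable;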





Re: Incremental process based on Action Code

Jamie Thomson

LarryETL wrote:

If I were to disregard the action code and use the lookup for detecting changes that you guys always recommend, would this keep the order? Would there be a significant performance difference?

Thanks,

Larry

I don't really understand how LOOKUP could be employed in your scenario. Simply, you have to apply the operations in the correct order.

Let me give an example and you tell me if I've understood the problem correctly. You have the following rows in your source:

UniqueKey | Address | Age

Jamie | London | 29

Helen | New York | 30

and they came about because of the following changes:

ActionCode | UniqueKey | Attribute1 | Attribute2 | ActionDate

I | Jamie |Birmingham | 28 | 2007-01-01

U | Jamie | London | 28 | 2007-01-02

U | Jamie | London | 29 | 2007-01-03

I | Helen | Boston | 29 | 2007-01-01

D | Helen | [n/a] | [n/a] | 2007-01-10

I | Helen | New York | 30 | 2007-01-15

It is important that you apply those changes to your destination in order of ActionDate - or else you won't have the same result. If you were (for example) to apply all the deletes, then the inserts, then the updates, you would end up with two records for Helen and the wrong values for Jamie.

Using the OLE DB Command will allow you to apply the changes in the same order that they occurred.

Does that make sense?






Re: Incremental process based on Action Code

LarryETL

The only reason Oracle is being used on my side is that I need to get the data out of the .dmp file. Apparently an import is the only way, since the source originates on version 8. I just have a copy of Oracle Personal Edition 10g running on a 32-bit Windows OS, which seems to be working fine for command-line imports. So I don't think I would want to do replication, or even that I could with my version, especially with over 2,000 indexes on the database.

Thanks,

Larry





Re: Incremental process based on Action Code

LarryETL

Jamie,

Your data scenario is exactly right.

I think that the OLE DB Command suggested by you is my only choice, since the order is the most important thing to ensure the integrity of the data.

I have used the SCD in addition to the merge join methods for handling non-action-code data, but I am not sure how to use the OLE DB Command method and the flow of the different action codes.

Jamie Thomson wrote:

I think you are going to have to issue a SQL statement for every row in the batch. So yes, this will mean 100000-200000 SQL statements being fired. It sounds as though your processing time window (of one month) is big enough for this to not be a problem.

The OLE DB Command is perfect for this by the way.

I am not sure exactly how to pull off the OLE DB Command.

How do I do this while keeping the order?

Say I read the first record from my input in the data flow and it is an insert.

Do I now conditionally split to three different OLE DB Commands based on the action code? Is this a blocking process that will wait for completion before moving to the next record? Can you clarify?

Thanks,

Larry