LuCarD


I am using the Konesans Checksum transformation ( http://www.sqlis.com/21.aspx ) to detect changes in my big (many columns, type 2 SCD) dimensional table.

But I am running into collossions Sad

The checksum transformation, sometimes misses a small change in the record, for instance when a certain flag is set or unset. Is there a more robust checksum generator Of any other suggestions on to solve this

thx





Re: Checksum transformation is not detecting my change

DarrenSQLIS


More robust or an exact comparison All checksums risk collisions Sad, and using a 32-bit int as checksum does mean we have that risk. About the same as the CRC32 used T-SQL CHECKSUM. It is always a trade-off I'm afraid. Hash functions such as MD5 will have less risk of collision, but are generally 32 bytes worth of string rather than 4 bytes of integer. MD5 and similar functions are available in teh framework, so are quite easy to implement in a Script Component.





Re: Checksum transformation is not detecting my change

Phil Brammer

DarrenSQLIS wrote:
More robust or an exact comparison All checksums risk collisions , and using a 32-bit int as checksum does mean we have that risk. About the same as the CRC32 used T-SQL CHECKSUM. It is always a trade-off I'm afraid. Hash functions such as MD5 will have less risk of collision, but are generally 32 bytes worth of string rather than 4 bytes of integer. MD5 and similar functions are available in teh framework, so are quite easy to implement in a Script Component.

Yeah, I'd like to see an option to choose between the current checksum method an MD5 method in the Checksum Task. Wink







Re: Checksum transformation is not detecting my change

LuCarD

I was thinking of using the MD5 checksum. It is more robust and collision are rare. The hash is used in serveral dimenision tables so I don't think it will be very big impact on the total size of the database.

But I haven't done much work with the Script Task.So I am a bit reluctant to use this.

How can you walk through everycolumn of the dateflow, withouth having to specify the columns by name





Re: Checksum transformation is not detecting my change

DarrenSQLIS

LuCarD wrote:
How can you walk through everycolumn of the dateflow, withouth having to specify the columns by name
- You write code! They will be named by the natire of the wrapper in the script component, just because that is the way it works. A custom component you can work by index/ID which you can discover, but the user still selects columns.

Phil, a combined MD5 / Checksum will not happen because the outpt column type would be different and allowing a switch that would change the data type of the output column would be a really nasty downstream experience breaking other components too, hence it was never done. It has been considered through!






Re: Checksum transformation is not detecting my change

jaegd

If you're using SQL 2005 as a source, you can use the HashBytes function, which will compute cryptographic hashes (sometimes known as checksums or digests).

MD2,MD4,MD5, and SHA-1 are supported. MD2,MD4, and MD5 are all 128 bit checksums (16 bytes in length), while SHA-1 is 160 bits (or 20 bytes).

For example, you could use the following to "fingerprint" records in the AdventureWorks database.

However, one of the main advantages of computing checksums in the pipeline as opposed to at the source, is that you'll often want checksums to be computed post-transformation.





Re: Checksum transformation is not detecting my change

LuCarD

This will not work, because I am using the Checksum to determine if a record needs an update or not. And I need to do this at the end of the dataflow pipeline.

I am working on a checksum component of my own. Which will work similar as the Konesans checksum but uses md5 as calculation of the checksum.

When it is ready I will make it publicly available.




Re: Checksum transformation is not detecting my change

Chris Brook

I'm using a similar technique in my ETL process...

Are you looking up on both the hash and the business key By constraining on the business key as well, you greatly decrease the risk of a collision. This doesn't help distinguish changed rows from new rows though, but it does get you part of the way.

Still, I'll keep my eyes open for your MD5 component!





Re: Checksum transformation is not detecting my change

LuCarD

I have created a simple MD5 component. Which works similar to the checksum component. Use at your own risk Smile

http://ssis-md5.lucard.nl/

The source code is in the zip file, and is created by Visual Studio 2005 Express.

If you don;t want to use the source code you can use the dll. Installation is very easy

  1. Copy the DLL to c:\Program Files\Microsoft SQL Server\90\DTS\PipelineComponents, or if you changed you default location you have to search for this directory.
  2. gacutil /i MD5_1.dll
  3. Add the component in SSIS designer by rightclicking the toolbox. And selecting choose items.

Some side notes:

Not all datatypes are supported. Most strings, date, int's and numeric data are supported. I will update the list of supported data types on a need basis.

I am not a C# programmer ! And I just created this by using code by example.

If you have any comments, remarks, hints or swears you can send it to me Smile

And I say again:

USE AT YOUR OWN RISK!





Re: Checksum transformation is not detecting my change

cborden

while this topic is a bit old & talks about some workarounds... I discovered a similar problem using Konesan's checksum component. In my case, the source table has right around 100 columns and making slight changes to some of the varchar fields resulted in the same checksum value.

Then I changed also changed another field from NULL to "test" yet got the same exact checksum. Doing the same on my own machine resulted in a different checksum, even for tiny changes.... so I knew something was wrong..

the culprit 64-bit runtime. The production server is a 64-bit machine, mine is not. Changing it to use the 32bit runtime (Project Properties, Config. properties, Debugging, Run64BitRunTime: False) appears to have solved the problem.

So now all I need to do is change my job to use the 32bit dtexec located here:

\Program Files\Microsoft SQL Server (x86)\90\DTS\Binn\dtexec.exe

and I'm golden Smile

good luck!

-chris