ydbn


I need to parse some text files and load them into a database. The files are mostly in CSV or fixed-width column format, and the column names (first line) may vary in wording (e.g. different abbreviations), order, extra columns, etc.

Is it a good idea to do this using the Script Task in SSIS? How does it compare to Perl on performance? Or are there any other tools good for this?



Re: Parse text file (csv, fix width), Perl or something newer?

MatthewRoche


If you can define in advance which files match which format (what columns in what order, etc.) then this is trivial to do with SSIS, without needing any sort of script or custom code. This should get you pointed in the right direction:

  • The SSIS Data Flow Task provides the core capabilities you need. Within the data flow:
    • Use the Flat File Source component to pull data from the flat file
    • Use the OLE DB Destination component (there are others as well; this is often the best choice) to load data into your target database
    • Use whatever transformations are necessary (if any) to reshape the data between the source and the destination
  • While in the control flow:
    • Use the Foreach Loop container to loop through all of the flat files that match a given format
    • Within the Foreach Loop, use a Data Flow task to perform the work described in the bullets above for the current file that the loop is processing
  • Have a different package (DTSX file) for each source file format
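For readers weighing this against a script-based approach, the control-flow pattern above (one loop per known format, one load per file) can be sketched in plain Python. The folder layout, table name, and column names here are hypothetical, and a real SSIS package would configure this graphically rather than in code:

```python
import csv
import glob
import sqlite3

def load_format_a(folder, conn):
    """Loop over all files matching one known format and load each one."""
    total = 0
    for path in glob.glob(f"{folder}/*.csv"):     # the "Foreach Loop" container
        with open(path, newline="") as f:
            reader = csv.DictReader(f)            # the "Flat File Source"
            rows = [(r["CustomerNo"], r["Amount"], r["OrderID"]) for r in reader]
        conn.executemany(                         # the "OLE DB Destination"
            "INSERT INTO orders VALUES (?, ?, ?)", rows)
        total += len(rows)
    conn.commit()
    return total
```

A separate function (or, in SSIS, a separate package) would exist for each source format, since the column list is fixed per format.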

There is an excellent set of tutorials in SQL Server Books Online that walks you through 90% of what is described above. If you are new to SSIS, you should definitely take the time to walk through this material before you start.

Concerning performance, I have never seen an SSIS-to-Perl comparison, but I regularly achieve performance on the order of millions of rows per minute processed when loading data warehouses, even when there are dozens of complex transformations taking place in the data flow. I do not think you will be disappointed with SSIS performance if you take the time to learn how the various tools (and their settings) work.







Re: Parse text file (csv, fix width), Perl or something newer?

ydbn

The format of the files is not predictable - the column order changes, the column names change (e.g. sometimes full words, sometimes abbreviations), etc. I don't think the formats can be pre-categorized, so some sort of parsing is necessary.

I just wonder if there is any good tool for a semi-manual ETL process - the Java world may have something like Octopus. I'm trying to avoid writing a Perl script.





Re: Parse text file (csv, fix width), Perl or something newer?

MatthewRoche

To be honest with you, I don't see how you can hope to have any sort of reliable ETL process with random file formats, regardless of the technology selected. I've also never worked in a business/project environment where there was not some way to predict the file format. If I were in your situation I would be looking for a way to redefine the problem (such as rearchitecting the source system) instead of attempting to solve something arbitrarily complex with technology.




Re: Parse text file (csv, fix width), Perl or something newer?

jwelch

In general, I concur with Matthew on this. If you can make the file formats consistent - that would be much better in the long run. If you can't, there are a number of resources on using scripts to parse flat files that have been posted to the forum. A few related blog posts:

http://blogs.conchango.com/jamiethomson/archive/2006/07/14/4226.aspx

http://agilebi.com/cs/blogs/jwelch/archive/2007/05/08/handling-flat-files-with-varying-numbers-of-columns.aspx






Re: Parse text file (csv, fix width), Perl or something newer?

ydbn

The files are downloaded from the web so I cannot control the format. The format is consistent within each individual file, and there are still some rules even across different files. For example, one file may have columns:
CustomerNo, Amount, OrderID, .....
while another file has
OrderNo, ..(some extra columns).., AMT, ..(some extra columns).., CustID, ...
and a third has
Amnt, ....., CID, OrderNumber, ...

We used to write a Perl script to parse the files and generate a set of CSV files with fixed column names. The Perl script keeps a table of aliases for the column names, so that CustomerNo is matched to CustID and CID, etc.
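The alias-table approach described above can be sketched in a few lines of Python (shown here instead of Perl purely for illustration). The alias entries and canonical names are examples taken from this thread, not a complete mapping:

```python
import csv
import io

# Lower-cased header variants mapped to the canonical column name.
ALIASES = {
    "customerno": "CustomerNo", "custid": "CustomerNo", "cid": "CustomerNo",
    "amount": "Amount", "amt": "Amount", "amnt": "Amount",
    "orderid": "OrderID", "orderno": "OrderID", "ordernumber": "OrderID",
}

def normalize(src, dst):
    """Rewrite a CSV stream with canonical column names, keeping only
    the columns the alias table knows about, in a fixed order."""
    reader = csv.DictReader(src)
    # Map each canonical name to the header actually used in this file.
    found = {ALIASES[h.lower()]: h
             for h in reader.fieldnames if h.lower() in ALIASES}
    canon = ["CustomerNo", "Amount", "OrderID"]
    writer = csv.DictWriter(dst, fieldnames=canon)
    writer.writeheader()
    for row in reader:
        writer.writerow({c: row[found[c]] for c in canon if c in found})
    return dst
```

The output files then all share one header, so a single downstream loader can handle every source.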

Is there any way to handle this in SSIS?




Re: Parse text file (csv, fix width), Perl or something newer?

MatthewRoche

If you're looking for a way in SSIS to replace the column renaming functionality that you previously built for yourself using Perl, the answer is no - SSIS does not provide any built-in mechanism to do this.

But... SSIS can be part of a solution to handle these files nonetheless:

  1. The original solution I proposed matches your scenario, according to your most recent post: "The format is consistent for each individual file." Build one package per file format, and the problem goes away.
  2. You can write a "pre-processor" in any language to standardize the column headers and sequence, and then use a single SSIS package to process the standard file format that has been output.