sensfan

Hi,

I have a subroutine that parses a huge HTML file and removes unnecessary code generated by MS Word after I've done a "Save as Web Page". I'm using a lot of regular expressions to remove and reformat the code the way I need it, and I've recently been getting a System.OutOfMemoryException while processing this subroutine. I'm using VB.Net 2005 Express.

Is there a way around this? Should I use a separate function for each pattern, or is it just my code that's sloppy? When I run this on a particular file, it crashes at this line:

pattern = "(<img[^>]*)(>)"

returntext = Regex.Replace(returntext, pattern, "$1 /$2", RegexOptions.Singleline)

Any help is greatly appreciated.

Thanks

Rob

Here's the subroutine:

Private Function CleanHTML(ByVal html As String) As String

Dim returntext As String = html

Dim pattern As String = String.Empty

'remove everything within the head tags

pattern = "<head>.*</head>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.Singleline Or RegexOptions.IgnoreCase)

'remove page numbers in TOC

pattern = "(?si)<span[^>]*?display:none[^>]*?>\d{1,3}</span>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.Singleline)

'this removes any dots and page numbers at the end of each line in the TOC

pattern = "(?si)<span[^>]*?display:none[^>]*?><span[^>]*>\.* *</span></span>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.Singleline)

'remove certain tags

pattern = "<[/]?(html|body|font|xml|del|ins|[ovwxp]:\w+)[^>]*?>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'preserve bordered windowtext boxes

pattern = "style='border:solid windowtext .75pt;padding:1.0pt 0in 1.0pt 0in'"

returntext = Regex.Replace(returntext, pattern, "class=FormView", RegexOptions.IgnoreCase)

'remove unwanted attributes within tags but preserve text between tags

pattern = "<([^>]*)(?:lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^>]+)([^>]*)>"

returntext = Regex.Replace(returntext, pattern, "<$1$2>", RegexOptions.IgnoreCase)

'remove unwanted attributes within tags but preserve text between tags (run it again to catch all)

pattern = "<([^>]*)(?:lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^>]+)([^>]*)>"

returntext = Regex.Replace(returntext, pattern, "<$1$2>", RegexOptions.IgnoreCase)

'remove all MS comments

pattern = "<!--\[if[^>]*>.*?<!\[endif\]-->"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.Singleline)

'remove MS if and endif structures

pattern = "<!\[if[^>]*>"

returntext = Regex.Replace(returntext, pattern, "")

returntext = Replace(returntext, "<![endif]>", "")

returntext = Replace(returntext, "<span >", "<span>")

returntext = Replace(returntext, "class=msoChangeProp", "")

'remove empty <a> tags

pattern = "<a[^>]*>\s*</a>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove <br clear=all> tags

pattern = "<br[\n|\s]+clear=all[\s]*>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove empty <i> tags (or contains space)

pattern = "<i[^>]*></i>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove empty <b> tags (or contains space)

pattern = "<b[^>]*></b>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove empty <p> tags (or contains space)

pattern = "<p[^>]*></p>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove unnecessary class

pattern = "\n class=MsoHyperlink"

returntext = Regex.Replace(returntext, pattern, String.Empty)

pattern = "\svalign=\w+"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'returntext = Replace(returntext, " nowrap", "")

returntext = Replace(returntext, "<br>", "<br />")

pattern = "<br[^>]*>"

returntext = Regex.Replace(returntext, pattern, "<br />", RegexOptions.Singleline)

pattern = "(<hr[^>]*)(>)"

returntext = Regex.Replace(returntext, pattern, "$1 /$2", RegexOptions.Singleline)

'cleanup the <p > tags

returntext = Replace(returntext, "<p >", "<p>")

'replace code for hyphen with actual hyphen

returntext = Replace(returntext, "&#8209;", "-")

'replace image paths

'this will return something like: TOMS/MyTomName_files/images/image001.gif

pattern = "(<img.* src="")(\.*/[^/]*)(/[^>]*>)"

returntext = Regex.Replace(returntext, pattern, "$1images$3", RegexOptions.Singleline)

'for the xml document

returntext = Replace(returntext, "&nbsp;", " ")

'put double quotes around all attributes

pattern = "(\s\w+=)(\w+)"

'pattern = "[class|align|width|height|border|valign|cellpadding|cellspacing]+=\w+"

returntext = Regex.Replace(returntext, pattern, "$1""$2""", RegexOptions.IgnoreCase)

'CRASHES HERE

'add a closing / to the img tag to make it XML compliant

pattern = "(<img[^>]*)(>)"

returntext = Regex.Replace(returntext, pattern, "$1 /$2", RegexOptions.Singleline)

For x As Integer = 1 To 10

returntext = Replace(returntext, "<span></span>", String.Empty)

'remove empty <span> tags

pattern = "<span\s*></span>"

returntext = Regex.Replace(returntext, pattern, "", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

Next

If cbLeftAlign.Checked Then

'remove all indent associated with table of contents

pattern = " class=""MsoToc[\d]*"""

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove all indent associated with content

pattern = " class=""MsoList[\d]*"""

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

pattern = " class=""List[\d]*"""

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'remove all MsoNormal classes

pattern = " class=""MsoNormal"""

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'set Note classes

pattern = " class=""Note[\w]"""

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

'align <p>,<h1> and <div> tags to the left

pattern = "<p align=""center[ ]?"">"

returntext = Regex.Replace(returntext, pattern, "<p align=""left"">", RegexOptions.IgnoreCase)

pattern = "<h1 align=""center""[ ]?>"

returntext = Regex.Replace(returntext, pattern, "<h1 align=""left"">", RegexOptions.IgnoreCase)

pattern = "<div align=""center[ ]?"">"

returntext = Regex.Replace(returntext, pattern, "<div align=""left"">", RegexOptions.IgnoreCase)

pattern = "class=""FigureCaption"">"

returntext = Regex.Replace(returntext, pattern, "class=""FigureCaptionLeftAlign"">", RegexOptions.IgnoreCase)

End If

If cbStrikethrough.Checked Then

pattern = "<strike>.*?</strike>"

returntext = Regex.Replace(returntext, pattern, String.Empty, RegexOptions.IgnoreCase)

returntext = Replace(returntext, "<s>", "")

returntext = Replace(returntext, "</s>", "")

End If

If cbChangedText.Checked Then

returntext = Replace(returntext, "class=""msoIns""", "")

End If

Return returntext

End Function



Re: Regular Expressions Using too many regular expressions

OmegaMan

How big are the files you are processing? Also, IgnoreCase may be costing you more than you think. I am sure that Office returns either upper or lower case; identify which it is and remove that option. See "Want faster regular expressions? Maybe you should think about that IgnoreCase option..." for why that is a costly operation in unintended ways.
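
One way to act on this suggestion, sketched under assumptions: check once whether the file contains any upper-case tag names, and ask for IgnoreCase only if it does. The names HasUpperCaseTags and OptionsFor are made up for illustration and are not part of the original subroutine. The check only looks at the first letter of each tag name, so attribute names and values would need their own check.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IgnoreCaseDemo
    ' Hypothetical helper: True if any opening or closing tag name
    ' starts with an upper-case ASCII letter.
    Function HasUpperCaseTags(ByVal html As String) As Boolean
        Return Regex.IsMatch(html, "</?[A-Z]")
    End Function

    ' Hypothetical helper: request IgnoreCase only when the input needs it.
    Function OptionsFor(ByVal html As String) As RegexOptions
        If HasUpperCaseTags(html) Then
            Return RegexOptions.IgnoreCase
        End If
        Return RegexOptions.None
    End Function

    Sub Main()
        Dim html As String = "<p></p><b>kept</b>"
        ' Same empty-<p> pattern as in CleanHTML, but with options chosen per file.
        Dim cleaned As String = Regex.Replace(html, "<p[^>]*></p>", String.Empty, OptionsFor(html))
        Console.WriteLine(cleaned)
    End Sub
End Module
```

Computing the options once per file and passing them to every Regex.Replace call keeps the rest of the subroutine unchanged.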





Re: Regular Expressions Using too many regular expressions

sensfan

Thanks Omegaman,

I tried removing IgnoreCase but it's still happening.

The files are huge. The Word document is about 10 MB, but when I "Save as Web Page" in Word it becomes a 30 MB HTML file, which is what I'm parsing.

I think I'm going to have to split up the file somehow.





Re: Regular Expressions Using too many regular expressions

A New Entry in IT

I would suggest a divide-and-rule approach.

Divide the entire file into small chunks in memory, reading a limited number of bytes at a time and processing them. This might help.

Also, you could consider using threading.
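
A minimal sketch of the chunking idea, assuming it is acceptable to cut the text only right after a closing </p> tag so that no paragraph-level match is split across chunks. CleanChunk, CleanInChunks and minChunkSize are hypothetical names. Patterns that span large regions (the <head> block, the MS conditional comments) would still have to run on the whole string before splitting, and whether this lowers peak memory enough is not something this thread confirms.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module ChunkedCleanDemo
    ' Hypothetical stand-in for the per-chunk cleanup; a real version
    ' would apply the full list of patterns here.
    Function CleanChunk(ByVal chunk As String) As String
        Return Regex.Replace(chunk, "<p[^>]*></p>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    ' Cuts html after the first "</p>" found at least minChunkSize characters
    ' into each chunk, cleans every chunk, and joins the results.
    ' Assumes minChunkSize is positive.
    Function CleanInChunks(ByVal html As String, ByVal minChunkSize As Integer) As String
        Dim result As New StringBuilder(html.Length)
        Dim start As Integer = 0
        While start < html.Length
            Dim chunkEnd As Integer = html.Length
            If start + minChunkSize < html.Length Then
                Dim tagPos As Integer = html.IndexOf("</p>", start + minChunkSize, StringComparison.OrdinalIgnoreCase)
                If tagPos >= 0 Then
                    chunkEnd = tagPos + 4 ' include the "</p>" itself
                End If
            End If
            result.Append(CleanChunk(html.Substring(start, chunkEnd - start)))
            start = chunkEnd
        End While
        Return result.ToString()
    End Function

    Sub Main()
        Console.WriteLine(CleanInChunks("<p>a</p><p></p><p>b</p>", 5))
    End Sub
End Module
```

If no </p> is found after the minimum size, the rest of the text is processed as one final chunk.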






Re: Regular Expressions Using too many regular expressions

OmegaMan

sensfan wrote:

Thanks Omegaman,

I tried removing IgnoreCase but it's still happening.

The files are huge. The Word document is about 10 MB, but when I "Save as Web Page" in Word it becomes a 30 MB HTML file, which is what I'm parsing.

I think I'm going to have to split up the file somehow.



Give me a list of the replace patterns (without the VB code) that you currently use and I will try them. I have done similar operations with multiple regexes on similar files without problems.

Use this format in your response and place it in a code block (using the {} button on the editor)

Code Snippet

Pattern Replace

"<Pattern>.*" "$1 $2"
"<Pattern2>.*" "<$2>"







Re: Regular Expressions Using too many regular expressions

sensfan

Here it is.

Code Snippet

Pattern Replace

"<head>.*</head>" String.Empty

"<[/]?(html|body|font|xml|del|ins|[ovwxp]:\w+)[^>]*?>" String.Empty

"<([^>]*)(?:lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^>]+)([^>]*)>" "<$1$2>"

"<!--\[if[^>]*>.*?<!\[endif\]-->" String.Empty

"<!\[if[^>]*>" String.Empty

"<a[^>]*>\s*</a>" String.Empty

"<br[\n|\s]+clear=all[\s]*>" String.Empty

"<i[^>]*></i>" String.Empty

"<b[^>]*></b>" String.Empty

"<p[^>]*></p>" String.Empty

"\svalign=\w+" String.Empty

"<br[^>]*>" "<br />"

"(<hr[^>]*)(>)" "$1 /$2"

"(<img.* src="")(\.*/[^/]*)(/[^>]*>)" "$1images$3"

"(\s\w+=)(\w+)" "$1""$2"""

"(<img[^>]*)(>)" "$1 /$2"

"<span\s*></span>" String.Empty

That's about it...I have a few replace() statements within this function as well that do not need regular expressions.

Also, the document that this fails on is a French HTML document. I don't know if that has anything to do with it; it shouldn't matter, because my other, smaller French HTML documents convert without a problem.

Thanks





Re: Regular Expressions Using too many regular expressions

sensfan

Sorry Omegaman. The formatting didn't come out so good.



Re: Regular Expressions Using too many regular expressions

OmegaMan

sensfan wrote:
Sorry Omegaman. The formatting didn't come out so good.

I got it, no worries. I am replacing the single and double quotes with

\x22 for "
\x27 for '

which makes it easier to use in the patterns instead of "".
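
As a small illustration of that substitution (a sketch, not the code actually used in the test): \x22 is the regex hex escape for a double quote, so a VB pattern string needs no doubled "" inside it. RemoveMsoNormal is a hypothetical name, and the pattern is the MsoNormal one from the original subroutine. As far as I know, the escape is only recognized in the pattern, not in the replacement string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuoteEscapeDemo
    ' Hypothetical helper: strips class="MsoNormal" using \x22 in place of
    ' the doubled "" that a VB string literal would otherwise need.
    Function RemoveMsoNormal(ByVal html As String) As String
        Return Regex.Replace(html, " class=\x22MsoNormal\x22", String.Empty)
    End Function

    Sub Main()
        ' The input literal still needs "" because it is plain VB text, not a pattern.
        Console.WriteLine(RemoveMsoNormal("<p class=""MsoNormal"">Text</p>"))
    End Sub
End Module
```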





Re: Regular Expressions Using too many regular expressions

sensfan

Thanks OmegaMan, I tried this and it worked once, but then I got the exception again. I must be getting closer.



Re: Regular Expressions Using too many regular expressions

sensfan

Ya, I thought about that, but what if a chunk ended in the middle of a match for one of my patterns? Then that match wouldn't be processed.

I've never used threading before...I'll take a look at that option.

Thanks





Re: Regular Expressions Using too many regular expressions

OmegaMan

I am not seeing the problem. I created a 12 MB file saved from Word, loaded it into a string, and then ran the replaces you mentioned. Here are my results:


Start data size #12769323
Regex #0 Time (00:00:00.0316380) data size #12769323
Regex #1 Time (00:00:00.4550438) data size #12704692
Regex #2 Time (00:00:03.6901181) data size #6086260
Regex #3 Time (00:00:00.0062393) data size #6086260
Regex #4 Time (00:00:00.0426995) data size #6006770
Regex #5 Time (00:00:00.0238588) data size #6006770
Regex #6 Time (00:00:00.0134938) data size #6006770
Regex #7 Time (00:00:00.0223835) data size #6006770
Regex #8 Time (00:00:00.0327946) data size #6006770
Regex #9 Time (00:00:00.0587169) data size #6006770
Regex #10 Time (00:00:00.7821401) data size #5892097
Regex #11 Time (00:00:00.0479189) data size #5881513
Regex #12 Time (00:00:00.0518113) data size #5888073
Regex #13 Time (00:00:00.0092799) data size #5888073
Regex #14 Time (00:00:01.3090867) data size #6436857
Regex #15 Time (00:00:00.0100719) data size #6436857
Regex #16 Time (00:00:00.0614737) data size #6436756
Done: data size #6436756
Done

I am able to run it many times; the private bytes never get above 400 MB and the memory is collected by the GC without a problem.

Whatever it is, it is something else that is giving you problems and not the regex....
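
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a timing loop that prints output in the same shape as the results above. It is not the harness that produced those numbers. RunTimed is a hypothetical name, the size printed is the string length in characters (an assumption about what "data size" meant), and every pattern runs with RegexOptions.Singleline for simplicity, whereas the original subroutine uses different options per pattern.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module RegexTimingDemo
    ' Applies each pattern/replacement pair in order and prints the elapsed
    ' time and the resulting string length after every step.
    Function RunTimed(ByVal text As String, ByVal patterns() As String, ByVal replacements() As String) As String
        Console.WriteLine("Start data size #" & text.Length)
        For i As Integer = 0 To patterns.Length - 1
            Dim watch As Stopwatch = Stopwatch.StartNew()
            text = Regex.Replace(text, patterns(i), replacements(i), RegexOptions.Singleline)
            watch.Stop()
            Console.WriteLine("Regex #" & i & " Time (" & watch.Elapsed.ToString() & ") data size #" & text.Length)
        Next
        Console.WriteLine("Done: data size #" & text.Length)
        Return text
    End Function

    Sub Main()
        ' Two of the patterns from the list, run against a tiny sample.
        Dim patterns() As String = {"<head>.*</head>", "<br[^>]*>"}
        Dim replacements() As String = {String.Empty, "<br />"}
        RunTimed("<head><title>x</title></head><p>a<br>b</p>", patterns, replacements)
    End Sub
End Module
```

Watching which step makes the size or the time jump is a quick way to see which pattern deserves a closer look.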





Re: Regular Expressions Using too many regular expressions

sensfan

Ya, I don't have a problem with a 12 mb file but the file I'm processing is a 30 mb file and that's the one giving me the problem.



Re: Regular Expressions Using too many regular expressions

OmegaMan

sensfan wrote:
Ya, I don't have a problem with a 12 mb file but the file I'm processing is a 30 mb file and that's the one giving me the problem.


Doh! I was thinking 10...I will triple it and try it again tonight.





Re: Regular Expressions Using too many regular expressions

OmegaMan

Ok, here are the results:

Start data size #52690524
Regex 0 Time (00:00:00.0652323) data size #52690524
Regex 1 Time (00:00:01.9986221) data size #52402607
Regex 2 Time (00:00:16.9111993) data size #24857174
Regex 3 Time (00:00:00.0263516) data size #24857174
Regex 4 Time (00:00:00.1885993) data size #24525990
Regex 5 Time (00:00:00.0979660) data size #24525990
Regex 6 Time (00:00:00.0551296) data size #24525990
Regex 7 Time (00:00:00.0898915) data size #24525990
Regex 8 Time (00:00:00.1331160) data size #24525990
Regex 9 Time (00:00:00.2463950) data size #24525990
Regex 10 Time (00:00:03.2042784) data size #24060057
Regex 11 Time (00:00:00.1560097) data size #24017049
Regex 12 Time (00:00:00.1934167) data size #24043681
Regex 13 Time (00:00:00.0381788) data size #24043681
Regex 14 Time (00:00:05.4292897) data size #26325833
Regex 15 Time (00:00:00.0412426) data size #26325833
Regex 16 Time (00:00:00.2905833) data size #26325732
Done: data size #26325732
Done

I bumped the size up to 50 MB. (Note: don't open an .mht file in Word 2007; it drops it like a rock! I had this problem in 2003 and thought they would have fixed it. Oh well.) Then I ran the test.

The results show that it took longer, but that it worked.

Now the private bytes memory footprint shot well over 1 GB, so I wonder if your machine just doesn't have the horsepower and is now paging memory to disk. I am running on a dual core (2.16 GHz) with 2 GB of memory.







Re: Regular Expressions Using too many regular expressions

sensfan

I'm running a Pentium 4 2.8 gig processor with 1.5 gigs of ram.

I seem to be able to process the file every time I start the application, but if I try to process it a second time (and so on) without closing and re-opening the application, I get the System.OutOfMemoryException. If I close the application and start over, it will process the file the first time again.

I tried doubling the file size to 50 mb and it won't even process it at all...I get the same error.

Must be that my computer is not powerful enough.