nitm

hi.

I'm taking a Word document as input and need to output an XML file of that document.

I'm writing a .NET application (C#) for that purpose, and by using the Word application I can extract the XML of the document (WordProcessingML), which is great but not enough...

I need to add some "custom" tags to that XML file; one of those tags should be the "page" tag.

i.e.:


<nitm:page>
<w:p>
<w:pPr>
....
</nitm:page>

The problem is that I have no idea how to figure out where one page ends and another starts.

I searched for the answer and the only thing that I understood from that (and please correct me if I'm wrong) is that a new "wx:sect" will be added when the author used "Insert ==> Break ==> Page Break" in the document.

That isn't good enough, since the page will also break if the text overflows the current one...

One solution that I can think of is to "travel" the Word document (dynamically), and each time the application reaches a new page it will look up the location in the XML file and add a "page" tag.

This solution should do the trick (and again, please correct me if I'm wrong) but I don't like it one bit! It's ugly and clumsy and I'm looking for a more elegant solution...

Any ideas?

thanks, nitzan.



Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

If you are looking for a page break, they are defined as paragraphs of this type:

<w:p w:rsidR="00422402" w:rsidRDefault="00422402">
  <w:r>
    <w:br w:type="page"/><!--It is -->
  </w:r>
</w:p>
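
A minimal sketch of how such explicit breaks could be located programmatically. It is written in Python with only the standard library; the sample XML and the function name explicit_page_breaks are made up for illustration, and the namespace URI is the Word 2007 (Open XML) one. Word 2003 XML declares the w prefix with a different URI, so the constant would need to change for those files.

```python
import xml.etree.ElementTree as ET

# Assumption: Word 2007 (Open XML) main namespace. Word 2003 XML uses
# "http://schemas.microsoft.com/office/word/2003/wordml" instead.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: one explicit page break and one plain line break.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>first page</w:t></w:r></w:p>
  <w:p><w:r><w:br w:type="page"/></w:r></w:p>
  <w:p><w:r><w:t>line one</w:t><w:br/><w:t>line two</w:t></w:r></w:p>
</w:body>"""


def explicit_page_breaks(root):
    """Return every w:br element whose w:type attribute is "page"."""
    return [
        br
        for br in root.iter(f"{{{W}}}br")
        if br.get(f"{{{W}}}type") == "page"
    ]


root = ET.fromstring(SAMPLE)
print(len(explicit_page_breaks(root)))
```

Filtering on w:type matters because a bare w:br is only a line break. This finds breaks the author inserted by hand, not the ones Word creates when text overflows a page.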






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

Thanks, I didn't notice that...

But it looks like it's a result of the "Insert => Break => Page" action in Word. As I said, I need to know where each and every page of the document ends (and the one after it begins), not just page breaks.

For example, let's say I start a new document and write something long. One page won't be enough, so the text will "overflow" to a second page and so on. I want/need to know where each and every page ends.

thanks.





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

Mmmmm, I'll look through some documents to see if Word caches this information somewhere, but it depends heavily on fonts, font size, page size, margins, etc.

I can't look right now, but I will. Tell me if you find something else.






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

Yes, I'm aware that what I'm asking for depends on a lot of page and font properties, but the XML files I'm working with come from documents that won't be changed in the future. For me everything is set and defined, so all of the page/font properties are fixed.

I searched a lot for an answer to this problem and could not find a thing.

Thanks a lot for your help. If you do find something I will be thrilled to learn about it, since it will save me a lot of dirty work.

thanks again, nitzan.





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

Back here ;) thinking ....

If I get the page breaks (something like XPath "//w:br"), I also get the paragraphs where they are, because each paragraph is an ancestor of its page break. It can be done with XPath or some other XML navigation.

Does getting the paragraph change anything for you?
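
As a rough illustration of that navigation, here is a sketch in Python (standard library only) that walks up from each page-type w:br to its enclosing w:p. The sample XML and the helper name paragraphs_with_page_break are hypothetical, and the Word 2007 namespace URI is an assumption; Word 2003 XML uses a different URI for the w prefix.

```python
import xml.etree.ElementTree as ET

# Assumption: Word 2007 (Open XML) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: only the second paragraph holds a page break.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>some text</w:t></w:r></w:p>
  <w:p><w:r><w:br w:type="page"/></w:r></w:p>
  <w:p><w:r><w:t>more text</w:t></w:r></w:p>
</w:body>"""


def paragraphs_with_page_break(root):
    """Return the w:p elements that contain a page-type w:br."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    found = []
    for br in root.iter(f"{{{W}}}br"):
        if br.get(f"{{{W}}}type") != "page":
            continue
        node = br
        # Climb until the enclosing paragraph (or the root) is reached.
        while node is not None and node.tag != f"{{{W}}}p":
            node = parent.get(node)
        if node is not None and node not in found:
            found.append(node)
    return found


root = ET.fromstring(SAMPLE)
hits = paragraphs_with_page_break(root)
print([list(root).index(p) for p in hits])
```

In an XPath 1.0 engine the same idea would be an ancestor or parent step from the break; the parent map is used here only because ElementTree supports a limited XPath subset.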






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

Shiguang Dong

Maybe <w:lastRenderedPageBreak/> is what you are looking for.
Per the Ecma Open XML spec:
2.3.3.13 lastRenderedPageBreak (Position of Last Calculated Page Break)
This element specifies that this position delimited the end of a page when this document was last saved by an
application which paginates its content.
[Guidance: This element shall be used by applications to specify the locations of page breaks within a document
when it is saved as WordprocessingML, in order to allow other applications (e.g. assistive software) to utilize this
information when reading the document. end guidance]
[Example: Consider a run which consists of the text This is the end of the page, where the word end
was the last word on a page. If the application saving this file had paginated this content, that information may
be saved with the file as follows:
<w:r>
<w:t>This is the end</w:t>
<w:lastRenderedPageBreak/>
<w:t xml:space="preserve"> of the page</w:t>
</w:r>
The lastRenderedPageBreak element indicates that there was a page break resulting from pagination of this
content, which occurred between the word end and the word of. end example]
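
To make the idea concrete, here is a hedged sketch in Python (standard library only) that uses these markers to work out on which page each paragraph starts and ends. The sample XML reuses the run from the spec example; the function name page_spans is made up. It assumes the Word 2007 namespace, counts only w:lastRenderedPageBreak (explicit w:br page breaks are deliberately not counted, on the assumption that the saving application also wrote a rendered-break marker after them), and assumes paragraphs are not nested inside one another.

```python
import xml.etree.ElementTree as ET

# Assumption: Word 2007 (Open XML) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample built around the run from the spec example.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Intro paragraph</w:t></w:r></w:p>
  <w:p><w:r>
    <w:t>This is the end</w:t>
    <w:lastRenderedPageBreak/>
    <w:t xml:space="preserve"> of the page</w:t>
  </w:r></w:p>
  <w:p><w:r><w:t>Next paragraph</w:t></w:r></w:p>
</w:body>"""


def page_spans(root):
    """Return (first_page, last_page) for each w:p, in document order."""
    spans = []
    page = 1
    for p in root.iter(f"{{{W}}}p"):
        first = page
        # Every rendered break inside this paragraph starts a new page.
        page += sum(1 for _ in p.iter(f"{{{W}}}lastRenderedPageBreak"))
        spans.append((first, page))
    return spans


print(page_spans(ET.fromstring(SAMPLE)))
# prints [(1, 1), (1, 2), (2, 2)]
```

Note that the second paragraph spans two pages, so a wrapper element such as a per-page tag cannot simply enclose whole paragraphs; a paragraph that straddles a break would have to be split or assigned to one side.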




Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

This sounds like what I'm looking for. The only problem is that I searched the XML document for this tag and it does not appear there even once...

Maybe there's a special way to save the document as an XML file with those page break tags?

thanks, nitzan





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

I'm only guessing: perhaps older Office versions don't generate this cache.

Is that the case, or do documents saved with Word 2007 also leave out the cache tag?

And, again, did you try to locate the last paragraph before the page break mark?






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

The documents I'm using are generated with Word 2003. I might need to make the application support Word 2007 documents, but that's for the future; right now it should support 2003 documents only.

I located the last paragraph before each page break in the XML file, but there's nothing in there that might suggest a page break.





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

Do you need an XPath expression to get the paragraphs with a page break?






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

Thanks, but the XPath is not the problem here. It's just that there's nothing to look for with the XPath; there are no page break paragraphs.





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

Did you look for [//w:br]?






Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

nitm

hi.

Yes, I did look for that, but I found only 6 of those in a document that has 36 pages. That's not what I'm looking for; I need to know where each page stops.

thanks!





Re: Microsoft SDK for Open XML Formats WordProcessingML and page breaks

MauricioG

I understand you now: you are looking for ALL page breaks (not only those inserted by the user), including the breaks managed by Microsoft Word (if you change the font this will change, if you make it bold, and so on).

I did this:

1) I saved a .doc file as .docx using Word 2007

2) I looked for the <w:lastRenderedPageBreak/> tag you were told about earlier

3) It was exactly in the places you are looking for
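
The check in the steps above can be sketched as follows, in Python with only the standard library. A .docx file is a zip package, and this sketch assumes the main document part lives at word/document.xml, which is the conventional location but is strictly determined by the package relationships. The function names are made up, and make_stand_in_package builds a tiny in-memory stand-in rather than a valid Word file, only so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: Word 2007 (Open XML) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_rendered_breaks(docx_file):
    """Count w:lastRenderedPageBreak elements in a .docx package.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the main part is stored at word/document.xml.
        root = ET.fromstring(package.read("word/document.xml"))
    return sum(1 for _ in root.iter(f"{{{W}}}lastRenderedPageBreak"))


def make_stand_in_package():
    """Build a minimal zip holding only word/document.xml (not a real .docx)."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:p><w:r><w:t>page one</w:t></w:r></w:p>"
        "<w:p><w:r><w:lastRenderedPageBreak/><w:t>page two</w:t></w:r></w:p>"
        "<w:p><w:r><w:lastRenderedPageBreak/><w:t>page three</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


breaks = count_rendered_breaks(make_stand_in_package())
print(breaks + 1)  # rough page estimate: one more page than rendered breaks
```

For a real file, passing its path to count_rendered_breaks and comparing the result with the page count Word reports would show whether the markers were written when the document was saved.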

What is the reason it is not there in your .docx files?

The way you convert them?