Srikanth kannan

Hi everyone,

I have an XML file. My requirement is to find duplicate nodes among the child nodes of a parent node. Duplicate means that the value and the subchildren of both nodes are the same.

If that's the case, then one of the nodes should be deleted.

Another case: the value can be the same but the subchildren of the two nodes might be different. In that case, the subchildren of one node should be appended to the subchildren of the other, and the former should be removed.

Kindly tell me the way to do it.

Thanks in advance.

Re: XML and the .NET Framework How to find duplicate node amoung the child nodes?


Your specification leaves some questions. Does order matter? What if the value appears twice in either file? What should happen to nodes that are in one file and not the other?

I'm assuming the files aren't that large so you can load them both into memory. If they are large then you'll have to stream the data you're interested in and order suddenly becomes a useful requirement. Otherwise you'll have to potentially stream the files multiple times to build up the lists of values. I'm also assuming that you are generating a third file that contains the merging of the first two without duplicates. Finally I'm assuming that nodes are all at a fixed (predefined) level (the root) rather than trying to merge two arbitrary XML files.

There are a couple of different ways to do this. Here's one option when you can load the entire document into memory.

Load the first file (first) into an XML document object.

For each root node (root) in the second file (second)
    matchedNode = FindNode(root, first.RootNodes)
    If matchedNode
        MergeNodes(root, matchedNode)
    End If
End For

node FindNode ( nodeToFind, nodeList )
    For each node in nodeList
        If node.Value == nodeToFind.Value    (pseudocode, not real C#)
            Return node
        End If
    End For
    Return Nothing

MergeNodes ( nodeSource, nodeTarget )
    For each child node in nodeSource
        matchedNode = FindNode(child, nodeTarget.ChildNodes)
        If Not matchedNode
            Append a copy of child to nodeTarget
        End If
    End For

Of course there are a lot of little issues you'll need to deal with, like comparing values that might not be quite the same (like strings). You'll also need to handle the actual copying of the node data, because once a node is associated with a document you can't just assign it to a different document. You'll end up having to create a new node and copy the values (including children) from the original node.
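To make the outline above concrete, here is a minimal sketch. It is written in Python with xml.etree.ElementTree rather than .NET, purely so it can be run as-is; the structure should carry over to XmlDocument. The names node_key, find_node, merge_nodes and merge_documents are made up for illustration, and the matching rule (same tag and same trimmed text) is an assumption you would replace with your own comparison. The deepcopy call stands in for the "create a new node and copy the values" step; in .NET, XmlDocument.ImportNode(node, true) serves that purpose.

```python
import copy
import xml.etree.ElementTree as ET


def node_key(node):
    # Assumption: two nodes "match" when tag and trimmed text are equal.
    return (node.tag, (node.text or "").strip())


def find_node(node_to_find, node_list):
    """Return the first node in node_list that matches node_to_find, else None."""
    wanted = node_key(node_to_find)
    for node in node_list:
        if node_key(node) == wanted:
            return node
    return None


def merge_nodes(source, target):
    """Copy into target every child of source that target does not already have."""
    for child in source:
        if find_node(child, list(target)) is None:
            # Deep copy (children included) so the source tree stays untouched.
            target.append(copy.deepcopy(child))


def merge_documents(first_root, second_root):
    """Merge matching top-level nodes of second_root into first_root."""
    for node in second_root:
        matched = find_node(node, list(first_root))
        if matched is not None:
            merge_nodes(node, matched)
    return first_root


# Hypothetical sample data, not taken from the original question.
first = ET.fromstring("<root><a>1<x/></a><b>2</b></root>")
second = ET.fromstring("<root><a>1<x/><y/></a><c>3</c></root>")
merged = merge_documents(first, second)
print(ET.tostring(merged, encoding="unicode"))
```

Like the outline, this only merges one level below the matched node, and nodes that exist only in the second document (c in the sample) are ignored. Whether that is right depends on the answers to the questions above.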

I did something along these lines a while back but with memory and streaming requirements and a far more complex comparison algorithm. Expect to put some serious time into this code as it is not as simple as it first appears.

Michael Taylor - 10/25/07

Re: XML and the .NET Framework How to find duplicate node amoung the child nodes?

Srikanth kannan


Order doesn't matter, and the file is not so large, so loading it into memory as a whole is not a problem.

There is only one XML file, and in it there should not be any duplicates among the child nodes of any parent node.

If there are duplicate names among the child nodes, then:

a) if only the name is the same and the child nodes of these duplicate nodes are different, append the child nodes of one duplicate node to the other and remove the former duplicate node.

b) if both the name and the child nodes (InnerXml) of the duplicate nodes are the same, remove one of the duplicate nodes.

Does that make sense?

Re: XML and the .NET Framework How to find duplicate node amoung the child nodes?


Then the approach I mentioned should work. The fact that you are using a single file shouldn't matter. Just load it up twice and process as normal.

There are probably better alternatives, but they'll get complicated really fast. For example, you could probably use XPath to get the list of nodes that match certain criteria (specifically those that happen to match another node). In this case you could then do a merge. However, this requires that you enumerate the child nodes, and you'll have some difficulties since you might be deleting some of the nodes being enumerated.
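The delete-while-enumerating problem is usually sidestepped by looping over a snapshot of the child list. The sketch below shows only that one idea, not the XPath query itself. It uses Python's xml.etree.ElementTree so it can be run directly, and it treats "same tag name" as the duplicate test, which is an assumption made to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
root = ET.fromstring("<root><a/><b/><a/><a/></root>")

seen = set()
# list(root) takes a snapshot, so removing from root does not disturb the loop.
for child in list(root):
    if child.tag in seen:
        root.remove(child)  # removes this exact element (identity, not value)
    else:
        seen.add(child.tag)

remaining = [c.tag for c in root]
print(remaining)
```

The same applies in .NET: copy the nodes you plan to delete into a separate list first, then remove them in a second pass.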

An alternative approach (thinking off the top of my head) would be to use a dictionary to match up duplicates. You would enumerate all nodes of the root. For each node you would calculate its unique identifier using whatever algorithm is appropriate (name-value, etc). You then look in a dictionary to see if the entry already exists. If it doesn't then you insert it. If it does then you have a duplicate node. You take all the child nodes of the duplicate and you associate them with the original node. After you have processed all root nodes then you repeat the process on the child nodes of each root node. Continue until you run out of child nodes. In this manner you eliminate duplicates at each level of the XML file by merging child nodes. Child nodes are cleaned up as the sublevel is processed. This algorithm is a little tricky so I don't know that I recommend it unless the original algorithm just doesn't work well. This algorithm would also probably work better if you streamed the XML file across rather than using the document model. You'll have to play around with it to get it to work right though.
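Here is a rough sketch of that dictionary idea for the single-file case, again in Python with xml.etree.ElementTree so it can be run as-is. The names node_id and dedupe_children are made up for illustration, and the identifier (tag name plus trimmed text) is an assumption; swap in whatever name-value rule fits your data. It works on the in-memory tree rather than streaming.

```python
import xml.etree.ElementTree as ET


def node_id(node):
    # Assumption: tag name plus trimmed text identifies a node.
    return (node.tag, (node.text or "").strip())


def dedupe_children(parent):
    """Merge duplicate children of parent, then recurse into the survivors."""
    originals = {}
    for child in list(parent):  # snapshot, because parent is modified below
        key = node_id(child)
        original = originals.get(key)
        if original is None:
            originals[key] = child
        else:
            # Duplicate: hand its children to the original, then drop it.
            original.extend(list(child))
            parent.remove(child)
    for child in parent:
        # Identical subchildren picked up by the merge are removed here.
        dedupe_children(child)


# Hypothetical sample data covering both cases from the question.
doc = ET.fromstring(
    "<root>"
    "<item>A<x>1</x></item>"
    "<item>A<x>1</x></item>"
    "<item>A<y>2</y></item>"
    "<item>B</item>"
    "</root>"
)
dedupe_children(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

Both cases from the question go through the same code path: the duplicate's children are always appended to the original, and the recursive pass then removes any that turn out to be identical. That avoids comparing InnerXml up front, at the cost of some temporary duplication in the tree.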

Michael Taylor - 10/25/07