I am trying to parse the Stack Overflow dump file (Posts.xml, 17 GB). It is of the form:
<posts>
<row Id="15228715" PostTypeId="1" />
.
<row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>
I have to 'group' each question with its answers. Basically: find a question (PostTypeId="1"), find its answers (other rows whose ParentId matches the question's Id), and store the result in a database.
I tried doing this using QueryPath (DOM), but it kept exiting with status 139 (a segmentation fault). My guess is that the file is too large for my PC to handle, even with a huge swap.
I considered XMLReader, but as I see it, the program would have to read through the file many times (find a question, look for its answers, repeat), so it does not seem viable. Am I wrong?
Is there any other way to do this?
Help!
This is a one-time parse.
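To make the single-pass idea concrete, here is a rough sketch of what I have in mind, written in Python rather than PHP purely for illustration. It assumes every row has Id and PostTypeId attributes and that answers carry ParentId, as in the sample above. The table layout and the function names (load_posts, answers_for) are made up for this example. The file is streamed once, each row goes into SQLite, and the grouping is left to an indexed lookup afterwards instead of re-reading the XML.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_posts(xml_source, conn):
    """Stream <row> elements once and store Id, PostTypeId, ParentId in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, post_type INTEGER, parent_id INTEGER)"
    )
    count = 0
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of <posts>
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            parent = elem.get("ParentId")  # absent on questions
            conn.execute(
                "INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
                (
                    int(elem.get("Id")),
                    int(elem.get("PostTypeId")),
                    int(parent) if parent is not None else None,
                ),
            )
            count += 1
            root.clear()  # drop processed rows so memory stays flat
    # Index built after loading so lookups by parent are cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_parent ON posts(parent_id)")
    conn.commit()
    return count


def answers_for(conn, question_id):
    """Return the ids of all answers (PostTypeId 2) to one question."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE post_type = 2 AND parent_id = ? ORDER BY id",
        (question_id,),
    )
    return [r[0] for r in rows]
```

With something like this, the answers may appear anywhere in the file relative to their question, because the matching happens in the database and not during the read.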