I have several large XML files that won't parse because of an unrecognised character. The error looks like this:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 528370, column 153
On smaller files I am also seeing this, but can open the file with a text editor and fix the issue. However, my text editor won't read the large files.
I hacked together a Python script to print the offending line, and from it I can see what appears to be a Unicode encoding problem: the “μ” (for micro[metres]) is encoded as xb5, where I think it should be x00B5. There are several of these on the same line.
I found that the only way to read that line was in binary mode. Anything else wouldn't read it (i.e. the Unicode decoder rejected it).
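To make the mismatch concrete, here is a small sketch of how the micro sign (code point U+00B5) is represented in Latin-1 and in UTF-8. The sample line `raw_line` is hypothetical and not taken from the real file; the point is only that a lone 0xB5 byte is what Latin-1 produces, whereas UTF-8 needs the two bytes 0xC2 0xB5, which would explain why a strict UTF-8 decode fails.

```python
# U+00B5 (MICRO SIGN) in two encodings
micro = "\u00b5"

latin1_bytes = micro.encode("latin-1")  # one byte: 0xB5
utf8_bytes = micro.encode("utf-8")      # two bytes: 0xC2 0xB5

# Hypothetical sample line containing a lone 0xB5 byte
raw_line = b"<size>10 \xb5m</size>"

# Strict UTF-8 decoding rejects the lone 0xB5 byte
try:
    raw_line.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding the same bytes as Latin-1 recovers the intended text
fixed_text = raw_line.decode("latin-1")
print(latin1_bytes, utf8_bytes, decoded_ok, fixed_text)
```

If this is what is happening, the file is probably Latin-1 (or Windows-1252) data, or at least contains stray bytes from such an encoding, while the parser is treating it as UTF-8.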
I could not find a method to read that line, fix it, and then write back just that line.
So, in a desperate bid to get around the large file size, I thought I could split the file up line by line, edit the part containing the error, and then stitch the parts back together. Each file in the split holds 524,288 (1024*512) lines.
This obviously breaks the XML in the individual files - but not a problem if I stitch them back together in the right order. I can't parse the file into smaller XML elements, because, as above, ElementTree chokes on the encoding.
So, here is my script to split the file on a line basis:
import contextlib

file_large = 'thefile.rdf'
l = 1024*512 # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large, 'rb'))
    for i, line in enumerate(fd_in):
        if not i % l:
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write('{}'.format(line))
This works well enough and quickly enough, except that it writes out the binary line as a string, so that when you read the file in a text editor you get 500k lines on a single line and the text reading like this:
…b'<dcterms:contributor>
'b'<rdf:Description>
'b'<rdfs:label>University of Durham</rdfs:label>
'b'<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Organization" />
'b'</rdf:Description>
'b…
This seems to indicate that the script reads each line as bytes and then writes out the string representation of those bytes. I tried changing the last two lines to:
            fd_out = stack.enter_context(open(file_split, 'w+b'))
        fd_out.write('{}'.format(bytearray(line)))
But then I get a Python error:
TypeError: a bytes-like object is required, not 'str'
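For context on that error: `str.format` always returns a `str`, while a file opened in binary mode accepts only bytes-like objects. Since each line read from a file opened with `'rb'` is already a `bytes` object, it can be written unchanged. Below is a minimal sketch of the splitter along those lines; the function name `split_by_lines` and its parameters are made up for illustration, and it closes each part before opening the next rather than keeping them all open in an `ExitStack`.

```python
def split_by_lines(path, lines_per_file):
    """Split `path` into numbered parts, copying each line as raw bytes."""
    parts = []
    fd_out = None
    try:
        with open(path, 'rb') as fd_in:
            for i, line in enumerate(fd_in):
                if i % lines_per_file == 0:
                    # Start a new part; close the previous one first
                    if fd_out is not None:
                        fd_out.close()
                    part = '{}.{}'.format(path, i // lines_per_file)
                    fd_out = open(part, 'wb')
                    parts.append(part)
                # `line` is already bytes, so no str.format is needed
                fd_out.write(line)
    finally:
        if fd_out is not None:
            fd_out.close()
    return parts
```

Called as `split_by_lines('thefile.rdf', 1024*512)`, this should produce `thefile.rdf.0`, `thefile.rdf.1`, and so on, each byte-for-byte identical to the corresponding slice of the input, so concatenating the parts in order gives back the original file.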
I would therefore appreciate some pointers on how either to solve the binary write issue or, better, to correct the large XML file in situ.
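On the in-situ idea, one caveat: a lone 0xB5 byte becomes two bytes (0xC2 0xB5) in UTF-8, so the line grows and cannot simply be overwritten in place. A possible alternative, sketched below, is to stream the file once into a new file and repair only the lines that fail to decode as UTF-8. The function name `reencode_to_utf8` and the `fallback` parameter are hypothetical, and the sketch assumes the bad lines are Latin-1; a line that mixes valid UTF-8 multi-byte sequences with stray Latin-1 bytes would not be repaired correctly by this approach.

```python
def reencode_to_utf8(src, dst, fallback='latin-1'):
    """Copy src to dst line by line, re-encoding lines that are not valid UTF-8.

    Returns the number of lines that were repaired.
    """
    repaired = 0
    with open(src, 'rb') as fd_in, open(dst, 'wb') as fd_out:
        for raw in fd_in:
            try:
                raw.decode('utf-8')  # already valid: copy unchanged
            except UnicodeDecodeError:
                # Assumption: the offending line is Latin-1 encoded
                raw = raw.decode(fallback).encode('utf-8')
                repaired += 1
            fd_out.write(raw)
    return repaired
```

Only one line is held in memory at a time, so the file size should not matter, and the returned count gives a quick check of how many lines were touched.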
Thanks
question from:
https://stackoverflow.com/questions/66050276/how-to-write-file-to-binary-or-edit-single-line-in-large-file-python