The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word.
Unfortunately, the same mechanism is also used to enhance the appearance of adjacent glyphs: in some letter combinations, for a good appearance and reading experience, the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.
Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.
You use SimpleTextExtractionStrategy as the text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText 5.x GitHub develop branch):
    float spacing = lastEnd.subtract(start).length();
    if (spacing > renderInfo.getSingleSpaceWidth()/2f)
    {
        result.append(' ');
    }
Thus, a gap which is more than half as wide as the current width of a space character is translated into a space character.
This generally sounds sensible. For documents which only use horizontal shifts to separate words, though, the current width of the actual space character may not be a good measure for the heuristics.
So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.
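To illustrate what "manipulating" the heuristics could look like, here is a minimal, self-contained sketch of the same gap test with the threshold made configurable. It deliberately does not use the iText API: the Chunk class, the join method, and the factor parameter are hypothetical names invented for this illustration, and positions are reduced to a single x coordinate. In a copied strategy you would apply the same idea where the original compares the spacing against getSingleSpaceWidth()/2f.

```java
import java.util.ArrayList;
import java.util.List;

class GapHeuristicSketch {

    /** Hypothetical stand-in for one rendered text chunk; not an iText class. */
    static final class Chunk {
        final String text;
        final float startX;           // x position where the chunk starts
        final float endX;             // x position where the chunk ends
        final float singleSpaceWidth; // width of a space in the current font and size

        Chunk(String text, float startX, float endX, float singleSpaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.singleSpaceWidth = singleSpaceWidth;
        }
    }

    /**
     * Joins chunks of one text line. A space is inserted whenever the gap
     * between the previous chunk and the current one is larger than
     * factor * singleSpaceWidth. With factor = 0.5f this mirrors the test
     * quoted above; smaller factors detect narrower gaps as spaces.
     */
    static String join(List<Chunk> chunks, float factor) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk current : chunks) {
            if (last != null) {
                float spacing = Math.abs(current.startX - last.endX);
                if (spacing > current.singleSpaceWidth * factor) {
                    result.append(' ');
                }
            }
            result.append(current.text);
            last = current;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // Two words separated by a shift of 2 units; a space glyph would be 5 units wide.
        chunks.add(new Chunk("Hello", 0f, 25f, 5f));
        chunks.add(new Chunk("World", 27f, 52f, 5f));

        System.out.println(join(chunks, 0.5f)); // prints "HelloWorld"
        System.out.println(join(chunks, 0.3f)); // prints "Hello World"
    }
}
```

The trade-off is the one described above: a lower factor recovers spaces in documents that use narrow shifts between words, but it also makes it more likely that a mere kerning adjustment is taken for a space. A suitable value has to be found by trying it on the documents in question.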
If you supply a sample PDF for your issue, we might have some ideas to help.