I'm developing an Android app that exchanges PDF text highlights via database objects with similar apps using other PDF parsers in Flash and iOS. To that end I need to be able to extract the following information about a selection:
The highlighted text
The pixel coordinates of the highlight rectangles
The start and end indices of the selected text.
So far so good: I have figured out how to extract all three. However, the character indices (start, end) don't match those produced by the other two PDF parsers. Specifically, the parser in the Radaee component seems to count extra characters: except for the first few characters in the sequence, ObjsGetCharIndex() returns a value greater than the character's actual position in the text stream. I can't think of a way to correct for this, and without matching indices this app won't be compatible with the other apps.
I can attach a single page example PDF and a log of the character sequencing that I expect to see, if useful.
Character index sequencing counts extra characters
6 years 2 months ago #7654
This is not a bug.
Different PDF libraries may produce different results when extracting text.
The algorithm for extracting text is not (and need not be) defined in the PDF reference.
Instead, you can save the highlight to the database as rectangles, obtained via [PDFAnnot getMarkupRects];
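
To illustrate the idea of exchanging rectangles rather than character indices, here is a minimal Java sketch. It assumes each app can obtain the highlight rectangles in page pixels along with the rendered page size, and that all apps agree on a top-left origin; the class and method names (NormalizedRect, fromPixels, toPixels) are hypothetical and not part of the Radaee API. Storing each rectangle as fractions of the page size keeps the database values independent of zoom level and screen density.

```java
// Hypothetical helper, not part of any PDF library: stores a highlight
// rectangle as fractions of the page width and height (0.0 to 1.0).
public final class NormalizedRect {
    public final double left, top, right, bottom;

    public NormalizedRect(double left, double top, double right, double bottom) {
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    // Convert a pixel rectangle into page-relative fractions.
    // pageWidth and pageHeight are the rendered page size in the same pixel units.
    public static NormalizedRect fromPixels(double left, double top,
                                            double right, double bottom,
                                            double pageWidth, double pageHeight) {
        if (pageWidth <= 0 || pageHeight <= 0) {
            throw new IllegalArgumentException("page size must be positive");
        }
        return new NormalizedRect(left / pageWidth, top / pageHeight,
                                  right / pageWidth, bottom / pageHeight);
    }

    // Convert back to pixels for a page rendered at a possibly different size.
    // Returns {left, top, right, bottom}.
    public double[] toPixels(double pageWidth, double pageHeight) {
        return new double[] {
            left * pageWidth, top * pageHeight,
            right * pageWidth, bottom * pageHeight
        };
    }
}
```

An app on another platform would read the four fractions from the database and call the equivalent of toPixels with its own rendered page size. If one of the parsers reports coordinates with a bottom-left origin, the vertical values would need to be flipped before normalizing.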