RadaeePDF.com :: Topic: Extracting text (1/2)

Questions about Android development and PDF

TOPIC: Extracting text

Extracting text 5 years 3 months ago #192

  • dmpost
Hi,
When I extract non-English text, I occasionally get wrong (Chinese-looking) symbols in place of some characters.
How can I fix this?

Example:

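// Load the page's text objects before querying them.
page.ObjsStart();
// Number of characters available on the page.
int pageLength = page.ObjsGetCharCount();
// Extract the text for the whole character range.
String tempStr = page.ObjsGetString(0, pageLength);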

Results I get:

섄됵ксей Голощапов
nроrра�ирование
дn茐1 мо6иnьных
舠' у섐Ё⑀оиств
Сан섎т-Петербург

Results I expect:

Алексей Голощапов
програмирование
для мобильных
устройств
Санкт-Петербург

I understand that this could be caused by poor OCR text recognition,
but the text looks fine when the page is rendered as an image or opened in PDF viewers.
Is it possible to force a specific encoding, or something similar?
The administrator has disabled public write access.

Re: Extracting text 5 years 3 months ago #193

  • support
  • Administrator
If your PDFs were produced by an OCR system, many characters may have been recognized incorrectly: they look right graphically but carry the wrong text representation.
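
One way to check whether an extracted string contains such misrecognized characters is to flag every code point that falls outside the script you expect. The following is only a minimal sketch, not part of the RadaeePDF API: the class name `ScriptCheck` and the method `unexpectedCodePoints` are hypothetical, and it assumes a runtime where `Character.UnicodeScript` and `String.codePoints()` are available (Java 8, or Android API level 24 and later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of any PDF library.
final class ScriptCheck {
    /**
     * Returns the code points of {@code text} whose Unicode script is not in
     * {@code allowed}. Punctuation, digits, and spaces (script COMMON) and
     * combining marks (script INHERITED) are accepted, except for U+FFFD,
     * the replacement character, which is always reported.
     */
    static List<Integer> unexpectedCodePoints(String text,
                                              Set<Character.UnicodeScript> allowed) {
        List<Integer> unexpected = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (cp == 0xFFFD || (!neutral && !allowed.contains(script))) {
                unexpected.add(cp);
            }
        });
        return unexpected;
    }
}
```

For the Russian sample in this thread, you could pass the string returned by `ObjsGetString` together with `EnumSet.of(Character.UnicodeScript.CYRILLIC)`. A non-empty result suggests the text layer itself is wrong, which forcing an encoding would not repair.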

Even in some digitally produced documents, certain characters are composed from two or three glyphs, with adjusted kerning and spacing, purely to get the right graphical rendering.

I think the only way to get good text is to complement the extraction process with a thesaurus and a dictionary.
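
As a rough illustration of the dictionary idea, the sketch below replaces an extracted word with the closest dictionary entry when the two differ by only a few characters. It is a minimal example under stated assumptions: the class `DictionaryRepair`, its methods, and the `maxDistance` threshold are hypothetical names chosen here, the word list is something you would have to supply yourself, and plain Levenshtein distance stands in for whatever matching a real spell checker would use.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of any PDF library.
final class DictionaryRepair {
    /** Levenshtein edit distance between two strings, computed over UTF-16 chars. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /**
     * Returns the dictionary word closest to {@code word} if its edit distance
     * is at most {@code maxDistance}; otherwise returns {@code word} unchanged.
     */
    static String repair(String word, List<String> dictionary, int maxDistance) {
        String best = word;
        int bestDistance = maxDistance + 1;
        for (String candidate : dictionary) {
            int distance = editDistance(word, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

A small `maxDistance` (1 or 2) keeps the correction conservative: a word with one misrecognized character is mapped back to its dictionary form, while a word that matches nothing closely is left as extracted rather than replaced by a guess.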

By the way, could you send us one of the documents in which you are experiencing this issue?

Re: Extracting text 5 years 3 months ago #194

  • dmpost

Re: Extracting text 5 years 3 months ago #196

  • radaee
  • Moderator
OK, we found some bugs.
They will be fixed in a future version.

Re: Extracting text 2 years 9 months ago #8426

Urgent!
Is there any news on this issue?
I have a PDF with Portuguese text, and I cannot extract the text with the correct characters.

Thanks!

Re: Extracting text 2 years 9 months ago #8427

  • support
  • Administrator
This is a very old thread. The specific issue discussed here was caused by OCR recognition, not by character encoding.
If you have a file that encodes text incorrectly, please send us a copy and tell us what text you expect the extraction to produce.