Questions about Android development and PDF

TOPIC:

Extracting text 11 years 8 months ago #192

  • dmpost
  • New Member
Hi,
when I extract non-English text, I occasionally get wrong (Chinese-looking) symbols.
How can I fix this?

Example:

page.ObjsStart();
int pageLength = page.ObjsGetCharCount();
String tempStr = page.ObjsGetString(0, pageLength);

Results I get:

섄됵ксей Голощапов
nроrра�ирование
дn茐1 мо6иnьных
舠' у섐Ё⑀оиств
Сан섎т-Петербург

Results I expect:

Алексей Голощапов
програмирование
для мобильных
устройств
Санкт-Петербург

I understand that this may be caused by bad OCR text recognition,
but in images and in PDF viewers the text looks OK.
Is it possible to force a particular encoding, or something similar?


Re: Extracting text 11 years 8 months ago #193

  • support
  • Administrator
If your PDFs were produced by an OCR system, many characters may have been recognized incorrectly: they look right graphically but carry the wrong text representation.

Even in some digitally produced documents, certain characters are composed from two or three glyphs with adjusted kerning and spacing, purely to get the right graphical rendering.

I think the only way to get good text is to complement the extraction process with a thesaurus and dictionary.

By the way, could you send us one of the documents that shows this issue?
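
As a rough first step before any dictionary lookup, extracted text can be screened for characters from scripts that should not appear in the document at all. The sketch below is only an illustration in plain Java: the class name `ExtractedTextCheck`, the method `countUnexpected`, and the choice of expected scripts (Cyrillic and Latin) are assumptions for this example and are not part of the PDF library. It relies on `Character.UnicodeScript`, which is available in Java 7+ (on Android, API level 24+).

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

public class ExtractedTextCheck {
    // Assumption: the document is Russian, so only these scripts are expected.
    // COMMON covers digits, spaces and punctuation; INHERITED covers combining marks.
    static final Set<UnicodeScript> EXPECTED = EnumSet.of(
            UnicodeScript.CYRILLIC, UnicodeScript.LATIN,
            UnicodeScript.COMMON, UnicodeScript.INHERITED);

    /** Counts code points that are U+FFFD or belong to a script outside EXPECTED. */
    static int countUnexpected(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // U+FFFD (replacement character) is in the COMMON script, so test it explicitly.
            if (cp == 0xFFFD || !EXPECTED.contains(UnicodeScript.of(cp))) {
                count++;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return count;
    }

    public static void main(String[] args) {
        // Clean line versus a line containing a stray Hangul syllable.
        System.out.println(countUnexpected("Санкт-Петербург"));
        System.out.println(countUnexpected("Сан섎т-Петербург"));
    }
}
```

A check like this only flags lines for review. It does not catch look-alike substitutions within the expected scripts (for example a Latin "n" standing in for a Cyrillic letter), which is where a dictionary would still be needed.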


Re: Extracting text 11 years 8 months ago #194

  • dmpost
  • New Member


Re: Extracting text 11 years 8 months ago #196

  • radaee
  • Moderator
OK, some bugs were found.
They will be fixed in the future.


Re: Extracting text 9 years 2 months ago #8426

  • pedro.pinheiro
  • New Member
Urgent!
Any news on this issue?
I have a PDF with Portuguese text, and I can't extract the text with the correct characters.

Thanks!


Re: Extracting text 9 years 2 months ago #8427

  • support
  • Administrator
This is a very old thread. That specific issue was caused by OCR recognition, not by character encoding.
If you have a file that encodes text incorrectly, please send us a copy and tell us what you expect the extracted text to be.
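
When describing what was extracted versus what was expected, it may help to include the Unicode code points of both strings, since garbled characters often do not survive copy and paste. The helper below is a minimal sketch in plain Java; the class name `CodePointDump` and the method `dump` are hypothetical names chosen for this example and are not part of the PDF library.

```java
public class CodePointDump {
    /** Returns the code points of the text as space-separated "U+XXXX" values. */
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pass the extracted string and the expected string, then compare the two lines.
        for (String arg : args) {
            System.out.println(dump(arg));
        }
    }
}
```

Comparing the two dumps side by side shows exactly which positions differ and which code points were produced in their place.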

