TOPIC:

Extracting text 11 years 8 months ago #192

dmpost
Topic Author
Offline
New Member
Posts: 2
Thank you received: 0

Hi,
When I extract non-English text, I get wrong (Chinese-looking) symbols from time to time.
How can I fix it?

Example:

page.ObjsStart();
int pageLength = page.ObjsGetCharCount();
String tempStr = page.ObjsGetString(0, pageLength);

Results I get:

섄됵ксей Голощапов
nроrра�ирование
дn茐1 мо6иnьных
舠' у섐Ё⑀оиств
Сан섎т-Петербург

Results I expect:

Алексей Голощапов
програмирование
для мобильных
устройств
Санкт-Петербург

I understand that this could be caused by bad OCR text recognition,
but the text looks OK in images and in PDF viewers.
Is it possible to force a particular encoding or something similar?


Re: Extracting text 11 years 8 months ago #193

support
Offline
Administrator
Posts: 690
Thank you received: 59

If your PDFs were produced by an OCR system, many characters may have been recognized incorrectly: they render correctly on the page but carry the wrong text representation.

Even in some digitally produced documents, certain characters are drawn by combining two or three glyphs with adjusted kerning and spacing, purely to get the right graphical rendering.

I think the only way to get good text is to complement the extraction process with a thesaurus and dictionary.

By the way, could you send us one of the documents with which you are experiencing this issue?
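
As a rough illustration of the post-processing idea above, here is a minimal sketch that flags extracted words that are probably mis-recognized, so they can later be checked against a dictionary. It is not part of the RadaeePDF API: the class ScriptCheck and its methods are hypothetical names, and the rule set (Cyrillic and Latin are the only expected scripts, and a single word should not mix the two) is an assumption that fits the Russian sample in this thread. It only detects suspicious words; it does not correct them.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a RadaeePDF class.
final class ScriptCheck {
    // Assumption: the document is Russian with occasional Latin words.
    // COMMON covers digits and punctuation, INHERITED covers combining marks.
    static final Set<UnicodeScript> EXPECTED = EnumSet.of(
            UnicodeScript.CYRILLIC, UnicodeScript.LATIN,
            UnicodeScript.COMMON, UnicodeScript.INHERITED);

    // Returns true if the word contains U+FFFD, a character from an
    // unexpected script, or a mix of Cyrillic and Latin letters.
    static boolean isSuspicious(String word) {
        boolean sawCyrillic = false;
        boolean sawLatin = false;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0xFFFD) {
                return true; // replacement character left by a failed decode
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (!EXPECTED.contains(script)) {
                return true; // e.g. Hangul or Han inside Russian text
            }
            if (script == UnicodeScript.CYRILLIC) sawCyrillic = true;
            if (script == UnicodeScript.LATIN) sawLatin = true;
        }
        return sawCyrillic && sawLatin;
    }

    // Splits the extracted text on whitespace and collects suspicious words.
    static List<String> flagged(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && isSuspicious(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

The string returned by page.ObjsGetString(0, pageLength) could be passed to ScriptCheck.flagged(...), and the flagged words then looked up in a dictionary or reviewed by hand. A legitimate word that really does mix scripts would be a false positive under these rules.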


Re: Extracting text 11 years 8 months ago #194

dmpost
Topic Author
Offline
New Member
Posts: 2
Thank you received: 0

Here is the link to the PDF: www.dropbox.com/s/5ejxf9pxjj9i9vk/%D0%93...D0%B2%20-%202011.pdf

Re: Extracting text 11 years 8 months ago #196

radaee
Offline
Moderator
Posts: 1123
Thank you received: 73

OK, some bugs were found. This will be fixed in the future.

Re: Extracting text 9 years 2 months ago #8426

pedro.pinheiro
Offline
New Member
Posts: 3
Thank you received: 0

Urgent! Any news about this case? I have a PDF with Portuguese text from which I can't extract the text with the correct characters. Thanks!

Re: Extracting text 9 years 2 months ago #8427

support
Offline
Administrator
Posts: 690
Thank you received: 59

This is a very old thread. That specific issue was caused by OCR recognition, not by character encoding. If you have a file that encodes text incorrectly, please send us a copy and tell us what you expect to get when extracting the text.
