Microsoft Windows Phone 8.1 support ends (13 Jul 2017)


Questions about Android development and PDF

Building an index of words contained in a PDF

IP: 95.248.179.53 5 years 4 months ago - 5 years 4 months ago #15349 by simone.p
Is it possible to extract all the text words contained in a PDF?
The purpose is to build an index with the number of occurrences of each word in each document (I have many documents to examine) and, if possible, their positions.
Last edit: 5 years 4 months ago by simone.p.
IP: 111.196.244.255 5 years 4 months ago #15350 by radaee
The position (x, y) of a character on a page is tied to its char index within that page, and the char index is counted in UTF-16.
That means you may need to record the page number and the char index in your database for each word you index.

Also make sure the char index used with Page.ObjsGetString() never changes (use the special version).
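As a rough illustration of the bookkeeping described above, here is a minimal in-memory sketch that stores a page number and a char index for every occurrence of a word. The names `WordIndex`, `Hit`, `add`, `count`, and `hitsFor` are hypothetical and are not part of the Radaee SDK; a real application would probably persist these records in a database instead of a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical index: word -> list of (page number, UTF-16 char index). */
class WordIndex {
    /** One occurrence of a word on a page. */
    static final class Hit {
        final int pageNo;     // zero-based page number
        final int charIndex;  // char index of the first character within the page

        Hit(int pageNo, int charIndex) {
            this.pageNo = pageNo;
            this.charIndex = charIndex;
        }
    }

    // TreeMap keeps the words sorted, which is convenient when printing an index.
    private final Map<String, List<Hit>> hits = new TreeMap<>();

    /** Records one occurrence of a word. */
    void add(String word, int pageNo, int charIndex) {
        hits.computeIfAbsent(word, k -> new ArrayList<>()).add(new Hit(pageNo, charIndex));
    }

    /** Number of recorded occurrences of a word (0 if never seen). */
    int count(String word) {
        List<Hit> list = hits.get(word);
        return list == null ? 0 : list.size();
    }

    /** All recorded occurrences of a word, in insertion order. */
    List<Hit> hitsFor(String word) {
        return hits.getOrDefault(word, Collections.emptyList());
    }
}
```

With one such index per document, `count(word)` gives the number of occurrences and each `Hit` carries what is needed to look up the position later.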
IP: 217.141.78.166 5 years 4 months ago #15351 by simone.p
Sorry, it's not clear to me how to do it. I have the following:
Code:
com.radaee.pdf.Global.Init(this)
val m_doc = com.radaee.pdf.Document()
m_doc.Open("my/file/path.pdf", "")
// "until" excludes the upper bound, so the loop stops at the last page
for (x in 0 until m_doc.GetPageCount()) {
    val page = m_doc.GetPage(x)
    page.ObjsStart()
    // what do I do here in order to extract all words in the PDF
}
IP: 111.196.244.255 5 years 4 months ago #15352 by radaee
To extract all the text on a page, you can:
Code:
page.ObjsStart();
int ccnt = page.ObjsGetCharCount(); // char count in UTF-16
String scontent = page.ObjsGetString(0, ccnt); // string object is UTF-8 encoded
// TODO: convert the Java string value to UTF-16 text.
You will then need to build your own search engine, and it should search the text in UTF-16.
Once you have the UTF-16 char index of a match in the text, you can locate the position of the searched key with:

float[] rect = page.ObjsGetCharRect(char_index);
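To make the indexing step more concrete, here is a minimal sketch of splitting a page's text into words while remembering the UTF-16 offset at which each word starts. It works on a plain Java string, whose offsets are counted in UTF-16 code units. The names `PageTokenizer` and `wordStarts` are hypothetical and not part of the Radaee SDK, and the sketch assumes that offsets into the string returned by `page.ObjsGetString(0, ccnt)` correspond to the SDK's char indices, which should be verified against the SDK before relying on it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps each word to the UTF-16 offsets where it starts. */
class PageTokenizer {
    static Map<String, List<Integer>> wordStarts(String text) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int cp = text.codePointAt(i);
            if (!Character.isLetterOrDigit(cp)) {
                // Skip separators (spaces, punctuation, line breaks).
                i += Character.charCount(cp);
                continue;
            }
            int start = i; // UTF-16 offset of the first character of the word
            while (i < n) {
                cp = text.codePointAt(i);
                if (!Character.isLetterOrDigit(cp)) {
                    break;
                }
                i += Character.charCount(cp);
            }
            // Lowercase only the key; the offset still refers to the original text.
            String word = text.substring(start, i).toLowerCase(Locale.ROOT);
            out.computeIfAbsent(word, k -> new ArrayList<>()).add(start);
        }
        return out;
    }
}
```

The size of each list is the number of occurrences of that word on the page, and, under the assumption above, each stored offset is the kind of `char_index` that would be handed to `ObjsGetCharRect` to find where the word sits on the page.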