
Building an index of words contained in a PDF 4 years 7 months ago #15349

simone.p Topic Author Offline New Member Posts: 16 Thank you received: 0	Is it possible to extract all the words contained in a PDF? The purpose is to build an index with the number of occurrences of each word in each document (I have many documents to examine) and, if possible, their positions.

Building an index of words contained in a PDF 4 years 7 months ago #15350

radaee Offline Moderator Posts: 1123 Thank you received: 73	A character's position (x, y) on a page is tied to its char index on that page, and char indexes are counted in UTF-16 units. That means you may need to record the page number and char index in your database for each occurrence you find. Also make sure that the char indexes for Page.ObjsGetString() never change (use the special version).
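
As a rough illustration of the "page number plus char index" record described above, here is a minimal in-memory sketch in Java. The class and method names (WordIndex, Posting, add, count, occurrences) are hypothetical and not part of the Radaee SDK; a real application would probably persist the same fields in a database table instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory index; none of these names come from the Radaee SDK.
class WordIndex {
    /** One occurrence of a word: 0-based page number and UTF-16 char index on that page. */
    static final class Posting {
        final int pageNo;
        final int charIndex;

        Posting(int pageNo, int charIndex) {
            this.pageNo = pageNo;
            this.charIndex = charIndex;
        }
    }

    // word -> every place it was found, in insertion order
    private final Map<String, List<Posting>> postings = new TreeMap<>();

    /** Records one occurrence of the word. */
    void add(String word, int pageNo, int charIndex) {
        postings.computeIfAbsent(word, k -> new ArrayList<>())
                .add(new Posting(pageNo, charIndex));
    }

    /** Number of recorded occurrences of the word (0 if never seen). */
    int count(String word) {
        return occurrences(word).size();
    }

    /** All recorded occurrences of the word, or an empty list. */
    List<Posting> occurrences(String word) {
        return postings.getOrDefault(word, Collections.emptyList());
    }
}
```

Keeping one such index per document (or adding a document id to Posting) would give the per-document occurrence counts asked for in the first post.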

Building an index of words contained in a PDF 4 years 7 months ago #15351

simone.p
Topic Author
Offline
New Member
Posts: 16
Thank you received: 0

Sorry, it's not clear to me how to do it. I have the following:

   com.radaee.pdf.Global.Init(this)
   val m_doc = com.radaee.pdf.Document()
   m_doc.Open("my/file/path.pdf", "")
   // "until" excludes GetPageCount(); "0..GetPageCount()" would ask for one page too many
   for (x in 0 until m_doc.GetPageCount()){
     val page = m_doc.GetPage(x)
     page.ObjsStart()
     // what do I do here in order to extract all words in the PDF
   }


Building an index of words contained in a PDF 4 years 7 months ago #15352

radaee
Offline
Moderator
Posts: 1123
Thank you received: 73

To extract all the text on a page, you can:

page.ObjsStart();
int ccnt = page.ObjsGetCharCount(); // char count, in UTF-16 units
String scontent = page.ObjsGetString(0, ccnt); // text of the whole page
// TODO: make sure the text you search is indexed in UTF-16 units (Java String indexes already are).
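
One possible way to turn the page text into words with their char indexes is sketched below. It relies only on the standard Java library: String indexes in Java count UTF-16 code units, so the start offsets it records should line up with the char indexes discussed in this thread. PageTokenizer and wordOffsets are hypothetical names, and treating every run of letters or digits as a word (lower-cased) is an assumption that may not suit every language.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of the Radaee SDK.
class PageTokenizer {
    /** Maps each lower-cased word to the UTF-16 start indexes of its occurrences in pageText. */
    static Map<String, List<Integer>> wordOffsets(String pageText) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int n = pageText.length();
        int i = 0;
        while (i < n) {
            int cp = pageText.codePointAt(i);
            if (!Character.isLetterOrDigit(cp)) {
                // Skip separators; charCount is 2 for surrogate pairs, 1 otherwise.
                i += Character.charCount(cp);
                continue;
            }
            int start = i;
            // Consume the whole run of letters and digits.
            while (i < n) {
                cp = pageText.codePointAt(i);
                if (!Character.isLetterOrDigit(cp)) {
                    break;
                }
                i += Character.charCount(cp);
            }
            String word = pageText.substring(start, i).toLowerCase(Locale.ROOT);
            result.computeIfAbsent(word, k -> new ArrayList<>()).add(start);
        }
        return result;
    }
}
```

Calling PageTokenizer.wordOffsets(scontent) for each page gives both pieces of information from the original question: the size of each list is the occurrence count, and each entry is a start index on that page.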

You will then need to build your own search engine, and it should search the text in UTF-16.
Once you have the UTF-16 char index of a match, you can locate its position on the page with:

float[] rect = page.ObjsGetCharRect(char_index);
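
Since the call above gives the rectangle of a single char, a word's position would be the union of the rectangles of its chars. The sketch below shows only that union step in plain Java. It assumes each rectangle is a four-element array [x0, y0, x1, y1], which is an assumption about the layout and not something confirmed in this thread, and RectUnion is a hypothetical name.

```java
// Hypothetical helper; assumes rectangles are float[4] arrays laid out as [x0, y0, x1, y1].
class RectUnion {
    /** Returns the smallest [x0, y0, x1, y1] rectangle that covers every input rectangle. */
    static float[] union(float[][] rects) {
        if (rects.length == 0) {
            throw new IllegalArgumentException("no rectangles");
        }
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (float[] r : rects) {
            // min/max over both corners, so the corner order does not matter
            minX = Math.min(minX, Math.min(r[0], r[2]));
            minY = Math.min(minY, Math.min(r[1], r[3]));
            maxX = Math.max(maxX, Math.max(r[0], r[2]));
            maxY = Math.max(maxY, Math.max(r[1], r[3]));
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

For a word that starts at char index i and is n chars long, you would collect the rectangles for indexes i to i + n - 1 and pass them to RectUnion.union.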

