Questions about Android development and PDF

TOPIC:

Building an index of words contained in a PDF 3 years 4 months ago #15349

  • simone.p (Topic Author)
Is it possible to extract all the words contained in a PDF?
The purpose is to build an index with the number of occurrences of each word in each document (I have many documents to examine) and, if possible, their positions.



Building an index of words contained in a PDF 3 years 4 months ago #15350

  • radaee (Moderator)
A character's position (x, y) on a page is tied to its char index within that page, and the char index is counted in UTF-16 encoding.
That means you may need to record the page number and the char index in your database for each word you find.

Also make sure that the char index used by Page.ObjsGetString() never changes (use the special version).
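
As a rough illustration of "record the page number and char index for each found word", here is a minimal in-memory sketch in Java. The `WordIndex` and `Posting` names are hypothetical helpers, not part of the Radaee SDK, and a real index would probably live in a database rather than a `HashMap`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; it is not part of the Radaee SDK.
class WordIndex {
    /** One occurrence of a word: document, page number, and char index on that page. */
    static final class Posting {
        final String docPath;
        final int pageNo;
        final int charIndex;

        Posting(String docPath, int pageNo, int charIndex) {
            this.docPath = docPath;
            this.pageNo = pageNo;
            this.charIndex = charIndex;
        }
    }

    // word -> every place it was seen, in insertion order
    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Records one occurrence of a word. */
    void add(String word, String docPath, int pageNo, int charIndex) {
        postings.computeIfAbsent(word, k -> new ArrayList<>())
                .add(new Posting(docPath, pageNo, charIndex));
    }

    /** Returns all recorded occurrences of a word (empty list if none). */
    List<Posting> find(String word) {
        return postings.getOrDefault(word, Collections.<Posting>emptyList());
    }

    /** Counts the occurrences of a word within one document. */
    int count(String word, String docPath) {
        int n = 0;
        for (Posting p : find(word)) {
            if (p.docPath.equals(docPath)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.add("pdf", "a.pdf", 0, 12);
        index.add("pdf", "a.pdf", 3, 40);
        index.add("pdf", "b.pdf", 1, 7);
        System.out.println("pdf in a.pdf: " + index.count("pdf", "a.pdf"));
    }
}
```

Keeping the page number next to the char index matters because, as described above, the char index is only meaningful relative to its own page.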


Building an index of words contained in a PDF 3 years 4 months ago #15351

  • simone.p (Topic Author)
Sorry, it's not clear to me how to do it. I have the following:
   com.radaee.pdf.Global.Init(this)
   val m_doc = com.radaee.pdf.Document()
   m_doc.Open("my/file/path.pdf", "")
   // "until" excludes the upper bound; 0..GetPageCount() would request one page too many
   for (x in 0 until m_doc.GetPageCount()) {
     val page = m_doc.GetPage(x)
     page.ObjsStart()
     // what do I do here in order to extract all words in the PDF?
   }


Building an index of words contained in a PDF 3 years 4 months ago #15352

  • radaee (Moderator)
To extract all the text on a page, you can do:
page.ObjsStart();
int ccnt = page.ObjsGetCharCount();//char count in UTF16.
String scontent = page.ObjsGetString(0, ccnt);//string object is UTF8 encoding.
//todo: convert java string value to UTF16 texts.
You should then build your search engine so that it searches the text in UTF-16.
Once you have the UTF-16 char index of a match in the text, you can locate the position of the searched key with:

float[] rect = page.ObjsGetCharRect(char_index);
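
To connect this back to the original question, the following is a minimal sketch of splitting a page's text into words while keeping each word's start offset. `PageTokenizer` and `Token` are hypothetical names, and treating a word as a run of letters or digits is a simplifying assumption. Java reports `Matcher.start()` in `char` units (UTF-16 code units); whether that offset can be passed directly to `page.ObjsGetCharRect(...)` depends on the string from `ObjsGetString(0, ccnt)` having one Java `char` per page char index, which is assumed here and should be verified against the SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; it is not part of the Radaee SDK.
class PageTokenizer {
    /** A lower-cased word plus the offset (in Java chars) where it starts in the page text. */
    static final class Token {
        final String word;
        final int charIndex;

        Token(String word, int charIndex) {
            this.word = word;
            this.charIndex = charIndex;
        }
    }

    // Assumption: a "word" is any run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Splits page text into words, remembering where each one starts. */
    static List<Token> tokenize(String pageText) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(pageText);
        while (m.find()) {
            // m.start() is an index into pageText, counted in UTF-16 code units.
            tokens.add(new Token(m.group().toLowerCase(Locale.ROOT), m.start()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stand-in for the string that page.ObjsGetString(0, ccnt) would return.
        String pageText = "PDF index: build an index of the PDF.";
        for (Token t : tokenize(pageText)) {
            System.out.println(t.charIndex + "\t" + t.word);
        }
    }
}
```

Each token's word, together with the page number and `charIndex`, is what would be stored per occurrence; counting the stored entries per word and document then gives the occurrence totals.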

