curly
Full Member
Posts: 161
PDF
Post by curly on Feb 13, 2021 8:39:24 GMT -5
I am aware that the original data cannot be retrieved from a PDF created by scanning a paper document. I want to find specific bits of data in a PDF created directly from the data I am looking for. Is that possible?
Post by tsh73 on Feb 13, 2021 9:21:39 GMT -5
That doesn't say much.
Basically, if you have text from which some program made a PDF, you either can get the source text back or you cannot. You just have to try it. It all depends on that other program. You see, the aim of PDF is to make the document LOOK right; getting the text back out was never a goal. So programs that make PDFs sometimes do the weirdest things, like mixing glyphs (letters) from several fonts or assigning arbitrary character codes to letters, even different ones in different strings!
The only approach that supposedly works 100% of the time is OCR: treat the PDF as a bitmap (or print the PDF to one) and recognize the text as if it were scanned from paper.
Post by irvbingham on Feb 13, 2021 10:44:20 GMT -5
If a PDF document was created by a program (not scanned), then programs like Acrobat Reader can usually save the contents in several different formats, including plain text. You can also open the original PDF in a text editor, since its structure is largely ASCII, but as tsh73 points out, it also contains formatting commands embedded in the text stream that can make it difficult to extract bits of the text. Some PDF files also compress their streams or remap the font character codes, so the extracted text may not be readable.
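To illustrate what "formatting commands embedded in the text stream" looks like, here is a minimal Python sketch (Python is used only for illustration; it is not from the thread). It pulls the literal strings out of `(…) Tj` show-text operations in a page content stream. It assumes the stream is uncompressed and the font uses a plain single-byte encoding; the function name and the sample stream are hypothetical. On a compressed stream or one with remapped character codes it will find nothing useful.

```python
import re
from typing import List

# A literal string operand followed by the Tj (show text) operator,
# e.g. "(Hello) Tj". Backslash-escaped characters inside the string are allowed.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def show_text_strings(content: bytes) -> List[str]:
    """Return the strings shown by simple '(...) Tj' operations.

    Assumption: 'content' is an UNCOMPRESSED content stream and the
    character codes map directly to Latin-1 text. Neither is guaranteed.
    """
    found = []
    for match in TJ_PATTERN.finditer(content):
        raw = match.group(1)
        # Undo the escapes for parentheses and backslash only.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        found.append(raw.decode("latin-1"))
    return found


# Hypothetical sample: positioning and font commands surround the text.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Invoice total: 42.50) Tj ET\n"
    b"BT 72 700 Td (Paid \\(cash\\)) Tj ET"
)

if __name__ == "__main__":
    for line in show_text_strings(sample):
        print(line)
```

Real files often use other text operators and hex strings as well, so this only shows why the raw stream is awkward to read, not how to parse a PDF in general.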
Post by curly on Feb 13, 2021 12:39:06 GMT -5
Thank you. I was hoping to read it in one character at a time and try to detect words.
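The "one character at a time" idea can be sketched as below, in Python for illustration only. It assumes the input is already plain text (for example, text exported from the PDF), not the raw PDF bytes. The function names are hypothetical, and a "word" is simply taken to be a run of letters and digits.

```python
from typing import Iterable, List


def words_from_chars(chars: Iterable[str]) -> List[str]:
    """Build words by reading one character at a time.

    A word is a run of letters/digits; anything else ends the current word.
    """
    words = []
    current = []
    for ch in chars:
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:  # flush the last word if the text does not end on a separator
        words.append("".join(current))
    return words


def word_positions(text: str, target: str) -> List[int]:
    """Return the 0-based word indexes where 'target' occurs (case-insensitive)."""
    wanted = target.lower()
    return [i for i, w in enumerate(words_from_chars(text)) if w.lower() == wanted]


if __name__ == "__main__":
    text = "Total due: 42 USD\nPaid"
    print(words_from_chars(text))
    print(word_positions(text, "paid"))
```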
Post by tsh73 on Feb 13, 2021 12:51:53 GMT -5
It really depends on the program that created the PDF. Just open it as a text file, in WordPad maybe, and see if you can read anything meaningful.
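The same "open it and look" check can be done with a few lines of code. The Python sketch below (illustration only; the function names and the minimum run length are assumptions) reads the file as bytes, lists the runs of printable ASCII, and notes whether the file mentions `/FlateDecode`, the filter name PDFs use for compressed streams. If it does, the page text is probably not readable as-is.

```python
import re
from typing import Dict, List


def printable_runs(data: bytes, min_len: int = 6) -> List[str]:
    """Return runs of at least 'min_len' printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def quick_look(path: str, max_runs: int = 20) -> Dict[str, object]:
    """Summarize what a human would see when opening the file as text."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "has_pdf_header": data.startswith(b"%PDF-"),
        # Presence of this name suggests compressed streams.
        "mentions_flate": b"/FlateDecode" in data,
        "runs": printable_runs(data)[:max_runs],
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(quick_look(sys.argv[1]))
```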
Post by BeeTrap on Feb 13, 2021 13:57:53 GMT -5
Recently I used Sumatra PDF to convert scanned pages from a Word Search book to text. I then used the text to create ".lex" files, still just text, for a modified version of Janet Terra's Word Search code. I have modified it to produce 30 x 30 grids with 30 words instead of 20 x 20 with 20 words. Anyway, so far, using Sumatra PDF has been the easiest route for me. Even Nitro Pro did not do as well. This may not be exactly what you need, but it is a start.
Post by tsh73 on Feb 13, 2021 14:57:06 GMT -5
Hello BeeTrap,
I think it really could be useful. Could you provide some details? I went to the Sumatra PDF site and did not find how it could be used to get text from a PDF.
Post by gidiom2 on Feb 13, 2021 16:31:41 GMT -5
curly, Master PDF Editor (Linux and Windows versions), available as a free demo, can export to a text file, which can then be parsed with LibertyBASIC. The demo version adds a watermark to edited PDFs, but the text export seems clean. Directly parsing a PDF with LB is almost certainly a no-go, if that was your aim.
Post by irvbingham on Feb 13, 2021 22:15:18 GMT -5
Those watermarks (from the Master PDF Editor demo) are inserted with a few simple PostScript-style commands, so they appear on displayed and printed copies. They can also be removed easily by editing the PDF with a text editor, which I might have done a few times in a pinch.
Post by BeeTrap on Feb 14, 2021 8:38:14 GMT -5
For tsh73: I went to the "hamburger menu" icon at the top left, chose "File" from the drop-down menu, and then chose "Save As". I chose "txt" as the output type in the save dialog. There is probably a much more elegant way of doing this, but Sumatra converted a two-column PDF to single-column text, which was great. I did have to manually delete the text "grid" that was at the top of the Word Search page, but I was still pleased with the output.
Post by gidiom2 on Feb 14, 2021 11:36:36 GMT -5
There is a free pdftotext command-line converter available for both Windows and Linux (I have only tested the Linux version). The Windows version is a .exe, so it could presumably be called from within LibertyBASIC to produce a .txt file suitable for parsing (not tested). See xpdfreader/pdftotext for details. I found that pdftotext was already installed on my Linux installation, and I believe it is used by LibreOffice.
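Calling pdftotext from another program might look roughly like the Python sketch below (illustration only; the same idea should carry over to any language that can launch an external program). It assumes `pdftotext` is on the PATH and is invoked as `pdftotext [options] input.pdf output.txt`; the helper names and file names are hypothetical. The `-layout` option asks pdftotext to keep the original physical layout, which is not always what you want for multi-column pages.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list for a pdftotext run."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def convert(pdf_path: str, txt_path: str, keep_layout: bool = False) -> bool:
    """Run pdftotext if it is installed; return False if it is not found."""
    if shutil.which("pdftotext") is None:
        return False
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    if convert("input.pdf", "output.txt"):
        print("Text written to output.txt")
    else:
        print("pdftotext was not found on this system")
```

The resulting .txt file can then be read and searched like any other text file.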