Author Topic: Corrupted or NO-Text PDFs

fabiano.querceto

  • Newbie
  • *
  • Posts: 1
Corrupted or NO-Text PDFs
« on: February 04, 2014, 09:02:20 AM »
Hi all, I'm trying out the software and it seems really good.
Very often I find myself working on a large number of PDF files extracted from hard disk images.
Many of these files come from deleted-file recovery, so many of them
are, in effect, corrupted and can't be rendered by PDF Explorer or Acrobat Reader.
There are also PDFs that are made of just images and contain no text, so my question is:

Given a folder/subfolder structure with hundreds or thousands of PDF files,
can you think of a way (maybe scripting) to use PDF Explorer to catch the ones that are corrupted
and the ones that contain no text?

Thanks in advance

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Corrupted or NO-Text PDFs
« Reply #1 on: February 04, 2014, 05:01:59 PM »
Quote
Given a folder/subfolder structure with hundreds or thousands of PDF files,
can you think of a way (maybe scripting) to use PDF Explorer to catch the ones that are corrupted
and the ones that contain no text?
You are posting in the PDF-ShellTools section of the forum, and asking about PDF Explorer?  Anyway, both programs can be used to do that.

To find potentially corrupted files:
- With PDF Explorer you just need to fill the grid with all the files you want to check, and sort it by the number of pages. All the PDFs showing zero pages are potentially corrupted.
- With PDF-ShellTools you can use the find duplicates tool. From Windows Explorer, just select and right-click the root folders where you have the PDFs, and call the tool. The default match criteria should work, but a match criterion set to number of pages will group all the files by page count. If the first duplicates group found is for zero pages, that usually means the files in it could not be parsed, so they are potentially corrupted.
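
For anyone who prefers to see the grouping idea spelled out, here is a minimal sketch of it in plain JavaScript. The functions `groupByPageCount` and `potentiallyCorrupted`, the `{ name, pages }` record shape, and the sample data are all made up for illustration; none of them are part of PDF Explorer or PDF-ShellTools.

```javascript
// Hypothetical helpers, not part of PDF Explorer or PDF-ShellTools.
// files: array of { name: string, pages: number } records (assumed shape).
function groupByPageCount(files) {
    var groups = {};
    for (var i = 0; i < files.length; i++) {
        var key = String(files[i].pages);
        if (!groups.hasOwnProperty(key)) {
            groups[key] = [];
        }
        groups[key].push(files[i].name);
    }
    return groups;
}

// Returns the names in the zero-page group, i.e. the files to inspect first.
function potentiallyCorrupted(files) {
    var groups = groupByPageCount(files);
    return groups.hasOwnProperty('0') ? groups['0'] : [];
}

// Sample data, invented for illustration only.
var sample = [
    { name: 'a.pdf', pages: 12 },
    { name: 'b.pdf', pages: 0 },
    { name: 'c.pdf', pages: 12 },
    { name: 'd.pdf', pages: 0 }
];
var suspects = potentiallyCorrupted(sample);
```

Inside a My Scripts script, the records could presumably be built from each file's Filename and Pages.Count. Whether Pages.Count reports zero for an unparsable file in the scripting interface is an assumption worth checking on a known-bad file first.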

To find PDFs with no text, the best way is indeed a My Scripts script, and the same script can be used from both tools.
Something like this:
Code:
// Collect the names of all selected PDFs that have no extractable text
var NoTextFilesList = [];
for (var i = 0; i < pdfe.SelectedFiles.Count; i++) {

    var PDFfile = pdfe.SelectedFiles(i);
    pdfe.Echo('Checking file: ' + PDFfile.Filename);

    // Stop at the first page that yields any text
    var hastext = false;
    for (var p = 0; p < PDFfile.Pages.Count; p++) {
        if (PDFfile.Pages(p).Text.length > 0) {
            hastext = true;
            break;
        }
    }
    if (!hastext) {
        NoTextFilesList.push(PDFfile.Filename);
    }
}

// Write the list to a text file in the temp folder and open it
if (NoTextFilesList.length > 0) {
    var WshShell = WScript.CreateObject("WScript.Shell");
    var OutputFileName = WshShell.ExpandEnvironmentStrings("%TEMP%") + "\\notextspdfs.txt";
    var fso = new ActiveXObject("Scripting.FileSystemObject");
    // CreateTextFile(filename, overwrite, unicode): replace any old list, write as Unicode
    var OutputFile = fso.CreateTextFile(OutputFileName, true, true);
    OutputFile.Write(NoTextFilesList.join("\r\n"));
    OutputFile.Close();
    // Quote the path in case %TEMP% contains spaces
    WshShell.Run('"' + OutputFileName + '"');
    pdfe.Echo('Done. ' + NoTextFilesList.length + ' PDF(s) contain no text');
} else {
    pdfe.Echo('Done. All the parsed PDFs have text.');
}
This script checks all the passed PDFs for extractable text and creates a text file listing the ones that have none, which it then opens in the program associated with .txt files (usually Notepad).
To run it from PDFE, just fill the grid with all the files to check and run the script.
To run it from PDF-ShellTools, and because you have a folder tree structure, you can use the Windows Explorer top right search box to search for *.pdf, so you can select all the PDFs of the current folder, and sub-folders, at once.
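
One possible refinement of the text check, offered as a sketch only: image-only PDFs may yield nothing but spaces, line breaks, or form feeds, and a plain length test would count that as text. The helper below treats whitespace-only pages as empty. The name `hasExtractableText` and the idea of passing the page texts as an array of strings are assumptions made for illustration, not part of the scripting interface.

```javascript
// Hypothetical helper: true if any page text contains a non-whitespace character.
// pageTexts: array of strings, one per page (assumed shape).
function hasExtractableText(pageTexts) {
    for (var p = 0; p < pageTexts.length; p++) {
        var text = pageTexts[p];
        // /\S/ matches any single non-whitespace character
        if (typeof text === 'string' && /\S/.test(text)) {
            return true;
        }
    }
    return false;
}

// Invented sample inputs for illustration.
var scanned = hasExtractableText(['', ' \r\n\t', '\f']);   // whitespace only
var mixed = hasExtractableText(['', 'Invoice 42']);        // second page has text
var empty = hasExtractableText([]);                        // no pages at all
```

In the script above, the same regular expression test could replace the `.Text.length > 0` comparison. Note that a file with no pages also comes out as "no text", so it will show up in both the no-text list and the potentially corrupted group.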