Topic: Script to count how many colour pages in PDF?

nightslayer23

  • Newbie
  • *
  • Posts: 98
Script to count how many colour pages in PDF?
« on: May 11, 2017, 08:46:11 AM »
Hi all, I need a script similar to RapidPDFCount, which uses a DLL script to count how many pages of a PDF are colour and how many are black & white.

Is there a way to get a script for PDF Shell Tools to do the same thing, and then display the counts in a custom column?

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #1 on: May 12, 2017, 04:17:18 AM »
That functionality is not directly available from the scripts API, but we can create a script that automates the ImageMagick tool to get that info.
The idea is to render each PDF page and analyze the resulting bitmaps for color content.

I ran some tests, and here is a sample script that creates a .csv file with "Color Pages Count" and "BW/Gray Pages Count" columns. It renders each PDF page, converts the resulting bitmaps to the HSI colorspace and computes the mean value of the saturation channel. A page is considered color if this value is higher than 0, and BW/Gray otherwise. You may adjust this threshold to your needs.

Code: [Select]
// Add format function to the String prototype
// First, checks if it isn't implemented yet.
if (!String.prototype.format) {
    String.prototype.format = function() {
        var args = arguments;
        return this.replace(/{(\d+)}/g, function(match, number) {
            return typeof args[number] != 'undefined' ? args[number] : match;
        });
    };
}

var imo = new ActiveXObject("ImageMagickObject.MagickImage.1");
var fso = new ActiveXObject("Scripting.FileSystemObject");

var tmpfolder = fso.GetSpecialFolder(2 /*TemporaryFolder*/ );
var InfoFilename = tmpfolder + '\\PagesInfo.txt';
var CSVOutputFileName = tmpfolder + '\\' + fso.GetTempName();
var CSVOutputFile = fso.CreateTextFile(CSVOutputFileName, true, true);

//write header line to the csv file
CSVOutputFile.WriteLine('Filename,Status,"Pages Count","Color Pages Count","BW/Gray Pages Count"');

for (var i = 0; i < pdfe.SelectedFiles.Count; i++) {
    var file = pdfe.SelectedFiles(i);
    pdfe.echo('Processing ' + file.filename);
    try {
        //use imagemagick to render each pdf page, convert the result image colorspace
        //to HSI and output "1" if the mean of the saturation values is higher
        //than 0 (the page has color), and "0" if 0 (no color in the page)
        imo.convert(file.filename, "-colorspace", "HSI", "-format", "%[fx:mean.g>0?1:0]", "info:" + InfoFilename);

        //read the result info file, that contains a "0" or "1" for each page
        //in the PDF. E.g. 0110, for a 4 pages PDF with pages 1 and 4 being bw/gray
        //and 2 and 3 with color.
        var f = fso.GetFile(InfoFilename);
        var fts = f.OpenAsTextStream();
        var info = fts.ReadAll();
        fts.Close();
        f.Delete();

        //Count the number of "1"
        var ColorPagesCount = info.split('1').length - 1;
        //Count the number of "0"       
        var BWPagesCount = info.split('0').length - 1;

        pdfe.echo(file.filename + ': Color Pages Count = ' + ColorPagesCount + ', BW/Gray Pages Count = ' + BWPagesCount, 0, 2);
        CSVOutputFile.WriteLine('"{0}",OK,{1},{2},{3}'.format(file.filename, file.NumPages, ColorPagesCount, BWPagesCount));
    } catch (e) {
        pdfe.echo(file.filename + ': Error (' + e.message + ')', 0xff0000, 2);
        CSVOutputFile.WriteLine('"{0}",Failed'.format(file.filename));
    }
}
CSVOutputFile.Close();
//declare the dialog variable locally instead of creating an implicit global
var dialog = pdfe.SaveDialog;
dialog.DefaultExt = '.csv';
dialog.filter = 'CSV (*.csv)|*.csv';
dialog.Options = '[ofOverwritePrompt]';
dialog.Filename = fso.GetParentFolderName(file.filename) + '\\PDFsInfo.csv';
if (dialog.execute) {
    if (fso.FileExists(dialog.Filename)) fso.DeleteFile(dialog.Filename);
    fso.MoveFile(CSVOutputFileName, dialog.Filename);
    var WshShell = WScript.CreateObject("WScript.Shell");
    WshShell.Run(dialog.Filename);
} else {
    fso.DeleteFile(CSVOutputFileName);
}
To test it, import the attached .myscript file into the PDF-ShellTools My Scripts. You will get a script named "Number of Color and BW/Gray pages" that you can invoke for all the selected PDF files from the Windows shell context menu for PDF files, under the PDF-ShellTools>My Scripts submenu.
The script needs the 32-bit version of the ImageMagick tool installed. I've tested with ImageMagick-7.0.5-5-Q16-x86-dll.exe. While installing, make sure you select the "Install ImageMagickObject OLE Control for VBScript,..." option on the "additional tasks" page of the installer.
ImageMagick also needs the Ghostscript tool installed to handle the PDF format.

If the script is performing as needed we can change it to put the info into custom metadata properties, as you suggested.
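
The threshold adjustment mentioned above could be sketched roughly as follows. This is only an illustration: buildColorTestFormat and countPageFlags are hypothetical helper names (not part of PDF-ShellTools or ImageMagick), and 0.01 is an arbitrary example threshold. The first helper builds the same -format expression the script passes to imo.convert, with the saturation threshold as a parameter. The second counts the "1" and "0" flags while ignoring any other characters, such as line breaks, that may end up in the info file.

```javascript
// Hypothetical helpers, not part of the PDF-ShellTools or ImageMagick APIs.

// Build the -format expression used by the script, with an adjustable
// saturation threshold (a plain decimal between 0 and 1) instead of 0.
function buildColorTestFormat(threshold) {
    return '%[fx:mean.g>' + threshold + '?1:0]';
}

// Count the "1" (color) and "0" (BW/gray) flags in the info text,
// skipping any other character (whitespace, line breaks, ...).
function countPageFlags(info) {
    var counts = { color: 0, bwGray: 0 };
    for (var i = 0; i < info.length; i++) {
        var c = info.charAt(i);
        if (c === '1') {
            counts.color++;
        } else if (c === '0') {
            counts.bwGray++;
        }
    }
    return counts;
}
```

In the script, the format string would replace the literal "%[fx:mean.g>0?1:0]" argument, e.g. buildColorTestFormat(0.01), so that pages with only a trace of saturation (such as scanner noise) still count as BW/Gray. The right value depends on the documents and has to be found by testing.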

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #2 on: May 16, 2017, 12:29:14 AM »
This works!

Thank you so much!

Could we also have an extension of this script that displays the output results as custom columns in Windows Explorer?

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #3 on: May 16, 2017, 01:31:59 AM »
Here is a variant of the above script that saves the result in 2 custom metadata properties. Calculating these values is slow, so storing them makes the result immediately available to the shell the next time this information is needed, e.g. for easy management of the PDFs, without running the script again. And because these custom metadata properties are saved in the PDF file itself, there is no risk of information loss if the files are moved to another disk or sent to someone else.
Code: [Select]
var imo = new ActiveXObject("ImageMagickObject.MagickImage.1");
var fso = new ActiveXObject("Scripting.FileSystemObject");

var tmpfolder = fso.GetSpecialFolder(2 /*TemporaryFolder*/ );
var InfoFilename = tmpfolder + '\\PagesInfo.txt';

var ProgressBar = pdfe.ProgressBar;
ProgressBar.max = pdfe.SelectedFiles.Count;

for (var i = 0; i < pdfe.SelectedFiles.Count; i++) {
    ProgressBar.position = i + 1;
    var file = pdfe.SelectedFiles(i);
    var FileMetadata = file.Metadata;

    //Bypass already processed files.
    if (FileMetadata.ColorPagesCount && FileMetadata.BWGrayPagesCount) {
        pdfe.echo(file.filename + ': Color Pages Count = ' + FileMetadata.ColorPagesCount + ', BW/Gray Pages Count = ' + FileMetadata.BWGrayPagesCount);
        pdfe.echo(' [Already set]', 0xFF, 1);
        continue;
    }

    pdfe.echo('Processing ' + file.filename + ' (' + file.NumPages + ' pages)');
    try {
        //use imagemagick to render each pdf page, convert the result image colorspace
        //to HSI and output "1" if the mean of the saturation values is higher
        //than 0 (the page has color), and "0" if 0 (no color in the page) 
        imo.convert(file.filename, "-colorspace", "HSI", "-format", "%[fx:mean.g>0?1:0]", "info:" + InfoFilename);

        //read the result info file, that contains a "0" or "1" for each page
        //in the PDF. E.g. 0110, for a 4 pages PDF with pages 1 and 4 being bw/gray
        //and 2 and 3 with color.
        var f = fso.GetFile(InfoFilename);
        var fts = f.OpenAsTextStream();
        var info = fts.ReadAll();
        fts.Close();
        f.Delete();

        //Count the number of "1"
        var ColorPagesCount = info.split('1').length - 1;
        //Count the number of "0"       
        var BWGrayPagesCount = info.split('0').length - 1;

        pdfe.echo(file.filename + ': Color Pages Count = ' + ColorPagesCount + ', BW/Gray Pages Count = ' + BWGrayPagesCount, 0, 2);

        if (FileMetadata.ColorPagesCount !== ColorPagesCount.toString() || FileMetadata.BWGrayPagesCount !== BWGrayPagesCount.toString()) {
            FileMetadata.ColorPagesCount = ColorPagesCount;
            FileMetadata.BWGrayPagesCount = BWGrayPagesCount;
            if (FileMetadata.CommitChanges()) {
                pdfe.echo(' [OK]', 0x006400, 1);
            } else {
                pdfe.echo(' [Setting metadata failed]', 0xFF0000, 1);
            }
        } else {
            pdfe.echo(' [Already set]', 0xFF, 1);
        }

    } catch (e) {
        pdfe.echo(file.filename + ' : ', 0, 2);
        pdfe.echo(e.name + ' ( ' + e.message + ' )', 0xff0000, 1);
    }
}

pdfe.echo('Done');

As with the page size script, there is the need to define the custom properties that will hold the values. This script expects two custom properties, named "ColorPagesCount" and "BWGrayPagesCount", as shown in the attached screenshot.

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #4 on: May 17, 2017, 02:42:13 AM »
What about another function similar to this, but instead of telling you colour or black & white, it tells you the ink coverage on each page and then displays that as a value in Explorer? It would be basically the same, just looking at a percentage of 0-33%, 34-66% or 67-100% and displaying that as Line, Medium or High.

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #5 on: May 18, 2017, 12:46:24 AM »
How is the page ink coverage calculated?
Something like the percentage of the CMYK colors? This gives 4 values, i.e. the percentage of cyan, magenta, yellow and black. Are these 4 values what you want to calculate?

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #6 on: May 18, 2017, 03:09:10 AM »
Basically yes. Whatever ISN'T white space.

Let's say the page was split in half: half white space, half purple.

I would need it to spit out a percentage of 50% and equate that to Medium coverage, because it falls within the 34-66% range.

But given that CMYK values mix together, and two of them have to print to produce, say, purple, it would need to know not to add 50% magenta to 50% cyan to make up that purple. That would give 50+50 = 100%, which is false because the page isn't 100% covered.
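
To make the distinction concrete, here is a minimal sketch of coverage measured as the share of non-white pixels rather than as a sum of channel percentages. Everything in it is an assumption for illustration: pixels are represented as hypothetical [c, m, y, k] arrays with values from 0 to 100, and coveragePercent and coverageLabel are made-up names. The labels and ranges are the ones proposed in this thread.

```javascript
// Illustrative only: each pixel is a hypothetical [c, m, y, k] array (0..100).

// A pixel counts as "inked" when any channel is above zero, so a purple
// pixel made of two inks is still counted once, not twice.
function coveragePercent(pixels) {
    var inked = 0;
    for (var i = 0; i < pixels.length; i++) {
        var p = pixels[i];
        if (p[0] > 0 || p[1] > 0 || p[2] > 0 || p[3] > 0) {
            inked++;
        }
    }
    return pixels.length ? Math.round(100 * inked / pixels.length) : 0;
}

// Map a 0..100 percentage to the ranges proposed in the thread:
// 0-33 "Line", 34-66 "Medium", 67-100 "High".
function coverageLabel(percent) {
    if (percent <= 33) return 'Line';
    if (percent <= 66) return 'Medium';
    return 'High';
}

// Half white, half purple page (reduced to 4 pixels for the example).
var page = [[0, 0, 0, 0], [0, 0, 0, 0], [50, 50, 0, 0], [50, 50, 0, 0]];
var percent = coveragePercent(page); // 50
var label = coverageLabel(percent);  // "Medium"
```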

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #7 on: May 18, 2017, 03:13:01 AM »
Also, the original code that spat out a csv file was a handy part of the function. Could that come back into the second script?
The csv file also didn't appear to put the colour and black & white values in separate columns in Excel for easy sum totalling.

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #8 on: May 19, 2017, 04:49:53 PM »
Quote
whatever ISN'T white space..
It's not easy to find an ImageMagick set of commands that calculates this properly for all situations, especially since I'm not an expert on this subject, but thresholding the image to convert all non-white pixels to black, and then calculating the percentage of black pixels on the page, seems to give good results.
Code: [Select]
var imo = new ActiveXObject("ImageMagickObject.MagickImage.1");
var fso = new ActiveXObject("Scripting.FileSystemObject");

var tmpfolder = fso.GetSpecialFolder(2 /*TemporaryFolder*/ );
var InfoFilename = tmpfolder + '\\PagesInfo.txt';

var ProgressBar = pdfe.ProgressBar;
ProgressBar.max = pdfe.SelectedFiles.Count;

for (var i = 0; i < pdfe.SelectedFiles.Count; i++) {
    ProgressBar.position = i + 1;
    var file = pdfe.SelectedFiles(i);
    var FileMetadata = file.Metadata;

    //Bypass already processed files.
    if (FileMetadata.InkCoverage) {
        pdfe.echo(file.filename + ': Ink coverage = ' + FileMetadata.InkCoverage);
        pdfe.echo(' [Already set]', 0xFF, 1);
        continue;
    }

    pdfe.echo('Processing ' + file.filename + ' (' + file.NumPages + ' pages)');
    try {
        //use imagemagick to render each pdf page, convert all non-white colors to black
        //and calculate the average of black pixels, that correspond to the percentage of non-white area.

        //imo.convert(file.filename, "-fuzz","1%","-fill","white","opaque","white","-fill","black","+opaque","white","-format", "%[fx:100-mean*100]\n", "info:" + InfoFilename);               
        imo.convert(file.filename, "-colorspace", "gray", "-auto-level", "-threshold", "99%", "-format", "%[fx:100-mean*100]\n", "info:" + InfoFilename);

        //read the result info file, that contains a line of ink coverage percentage value for each page.
        var f = fso.GetFile(InfoFilename);
        var fts = f.OpenAsTextStream();
        var PagesInkCoverage = fts.ReadAll().split('\n');
        fts.Close();
        f.Delete();

        //calculate the document total ink coverage by averaging the by page values.
        var InkCoverage = 0;
        for (var index = 0, len = PagesInkCoverage.length - 1; index < len; index++) {
            InkCoverage += Number(PagesInkCoverage[index]);
        }
        InkCoverage = Math.round((InkCoverage / (len ? len : 1)));

        pdfe.echo(file.filename + ': Ink coverage=' + InkCoverage + '%', 0, 2);

        if (FileMetadata.InkCoverage !== InkCoverage.toString()) {
            FileMetadata.InkCoverage = InkCoverage;
            if (FileMetadata.CommitChanges()) {
                pdfe.echo(' [OK]', 0x006400, 1);
            } else {
                pdfe.echo(' [Setting metadata failed]', 0xFF0000, 1);
            }
        } else {
            pdfe.echo(' [Already set]', 0xFF, 1);
        }

    } catch (e) {
        pdfe.echo(file.filename + ' : ', 0, 2);
        pdfe.echo(e.name + ' ( ' + e.message + ' )', 0xff0000, 1);
    }
}

pdfe.echo('Done');
This script expects a custom property named InkCoverage. To show this ink coverage percentage in the shell as ranges named "Line", "Medium" or "High", the custom property needs to be configured as depicted in the attached screenshots.

If it's not giving the expected results, it's better to ask in an ImageMagick forum how to calculate this, and then we can update the script with a better set of image processing/analysis commands.
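
The averaging step in the script relies on the info file ending with exactly one line break per page. A slightly more defensive version could look like the sketch below. It is an assumption-laden illustration rather than a drop-in replacement: averageCoverage is a hypothetical name, and the input is the text read from the info file, with one percentage per line.

```javascript
// Hypothetical helper: average the per-page coverage values found in the
// info text, accepting both "\n" and "\r\n" line endings and skipping
// blank or non-numeric lines.
function averageCoverage(infoText) {
    var lines = infoText.split(/\r?\n/);
    var sum = 0;
    var pages = 0;
    for (var i = 0; i < lines.length; i++) {
        var value = parseFloat(lines[i]);
        if (!isNaN(value)) {
            sum += value;
            pages++;
        }
    }
    // Avoid dividing by zero when no page value could be read.
    return pages ? Math.round(sum / pages) : 0;
}
```

For example, averageCoverage('10\n30\n') gives 20, and an empty string gives 0 instead of failing.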

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #9 on: May 19, 2017, 05:02:54 PM »
Quote
also, the original code that spat out a csv file was a handy part of the function.. could that come back into the second script?
That's something you can try yourself. ;) You just need to copy some lines from the first script.
But it's probably better to use the included "Export metadata to MS Excel or OpenOffice Calc" script for that purpose. After the PDFs are processed, you can run this script anytime to fill the spreadsheet with all the defined properties of the selected PDFs.

Quote
the csv file also didn't appear to put the colour and black&white values as separate columns in excel for easy sum totalling.
The last time I checked, it included these values as independent columns. If you are using MS Excel, you need to open the csv manually. The script automatically opens the generated csv, but Excel fails to properly import the data this way. It works fine with OpenOffice Calc, which is what I use.
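
If you do copy the CSV lines from the first script, one detail worth handling is a filename that contains a comma or a double quote, since that would shift the columns. Below is a minimal sketch of the usual CSV quoting rule (wrap the field in double quotes and double any quote inside it). The names csvField and csvRow are hypothetical, and this does not by itself change how Excel opens the file.

```javascript
// Hypothetical helpers for writing one CSV line.

// Quote a field only when it contains a comma, a double quote or a line
// break; embedded double quotes are doubled, as is conventional for CSV.
function csvField(value) {
    var text = String(value);
    if (/[",\r\n]/.test(text)) {
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

// Join already-computed values into a single CSV row.
function csvRow(fields) {
    var out = [];
    for (var i = 0; i < fields.length; i++) {
        out.push(csvField(fields[i]));
    }
    return out.join(',');
}
```

In the first script, a call such as CSVOutputFile.WriteLine(csvRow([file.filename, 'OK', file.NumPages, ColorPagesCount, BWPagesCount])) could then take the place of the format-based line.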

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #10 on: May 23, 2017, 02:17:31 AM »
I had an issue displaying the ranges. I had to set the Data Type to Integer, 64-bit, and then it worked?

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #11 on: May 23, 2017, 11:36:21 PM »
Yes, I forgot to mention this. I've edited my post to add the screenshot depicting the needed data type configuration. I have it set to Integer 16 bit, but any integer data type can hold the 0..100 value.

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #12 on: May 24, 2017, 12:12:13 AM »
:) all good

nightslayer23

  • Newbie
  • *
  • Posts: 98
Re: Script to count how many colour pages in PDF?
« Reply #13 on: July 19, 2017, 12:29:43 AM »

Is there a way for the file to be flattened first before doing this conversion? Some work perfectly, but others come out at a really high percentage when they aren't technically going to print that way. I figured it was looking at other layers or some other hidden info and converting that to bw too.

I did a test saving one to jpg, converting it back to pdf, then running the tool again, which gave me an accurate result. However, the process of converting one to jpg and back to pdf was quite slow. I need to comb over hundreds of files at once with this tool to get a fast result. So would it be possible in code to first flatten layers before running the check?

I actually batch flattened layers in Acrobat and it didn't solve the issue. I had to batch flatten AND convert everything to CMYK to get it to work.

In the optimizer tool, can CMYK and RGB colour spaces be added somehow? Acrobat is just way too slow at doing these steps. Your tool is much faster!

RTT

  • Administrator
  • *****
  • Posts: 918
Re: Script to count how many colour pages in PDF?
« Reply #14 on: July 19, 2017, 03:56:14 AM »
Quote
Is there a way for the file to be flattened first before doing this conversion? Some work perfectly, but others come out at a really high percentage when they aren't technically going to print that way. I figured it was looking at other layers or some other hidden info and converting that to bw too.
Are these problem PDF layers set to be visible in the PDF reader (screen mode) and hidden when printed? If that's the case, edit the delegates.xml file in the folder where ImageMagick is installed, and change the line "<delegate decode="ps:alpha" stealth="True" command="&quot;@PSDelegate@&quot; -q -dQUIET -..." to include the -dPrinted parameter.

Quote
So would it be possible in code to first flatten layers before running the check?
When ImageMagick calls Ghostscript to convert each of the PDF pages to an image, which it then uses to run the color check, it is effectively flattening the PDF. If the issue is not the one mentioned above (layers set to be hidden only when the PDF is printed) and even hidden layers are being rendered, then that's an issue with Ghostscript.
If you have Acrobat, I suppose the script can automate it to flatten the PDF layers to a temporary PDF file and then run the check on that PDF.

Quote
I actually batch flattened layers in acrobat and it didn't solve the issue.. I had to batch flatten AND convert everything to CMYK to get it to work.
Can't opine without a sample file.

Quote
In the optimizer tool, can CMYK and RGB colour spaces be added somehow?
I don't understand your question. Please explain it in more detail.