I'm attempting to detect duplicates in our internal image hosting service. It currently holds about 65K images, a fair number of which I suspect are duplicates.
For about 5 years we have been using ImageMagick to open images as users upload them, validate them, and store them in our database.
My algorithm is as follows:
1. Open each image, one at a time.
2. Perform a mogrify->strip to remove all comment/EXIF data.
3. If the image is a PNG, strip the date:create and date:modify properties, which unexpectedly started being added around 2009 (images stored before then do not have this problem).
4. Store the image.
5. Compute the MD5 of the image and store it for a later deduping sweep.
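For reference, the later deduping sweep could look roughly like the sketch below. It uses only Digest::MD5 and assumes the normalized image bytes are already available in a hash keyed by image id. The function name find_duplicates and the sample ids are made up for illustration and are not part of our real code.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Group image blobs by MD5 digest.
# $blobs is a hash ref mapping a (hypothetical) image id to its stored bytes.
# Returns an array ref of groups; each group lists ids with identical bytes.
sub find_duplicates {
    my ($blobs) = @_;
    my %ids_by_digest;
    for my $id (sort keys %$blobs) {
        push @{ $ids_by_digest{ md5_hex($blobs->{$id}) } }, $id;
    }
    # Keep only digests shared by more than one image.
    my @groups = grep { @$_ > 1 } values %ids_by_digest;
    return [ sort { $a->[0] cmp $b->[0] } @groups ];
}

# Tiny demonstration with stand-in byte strings instead of real images.
my %blobs = (
    'a.png' => "same bytes",
    'b.png' => "other bytes",
    'c.png' => "same bytes",
);
for my $group (@{ find_duplicates(\%blobs) }) {
    print join(', ', @$group), "\n";    # prints "a.png, c.png"
}
```

This only finds byte-identical blobs, which is why every image has to go through the same normalization before it is hashed.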
For lossy formats, cycled images (those uploaded, downloaded, and re-uploaded) will not be detected as duplicates by this approach, but there isn't much I can do about that at this point.
Here is my problem: I wrote the code for the steps above and deployed it, then found that the PNG images still had date:create/date:modify even though the API calls were made. I believe something is broken, because when I swapped the order of the mogrify strip and the date stripping, the output changed in an unexpected way.
Here is my code, with the MD5 of the processed image printed at various points:
With strip first, then the date properties cleared:
warn md5_hex($im->ImageToBlob()); # 2985ceb411ffc2ca80e845c09f389160
# strip out unique attrs from the image that might mess up the final file
$im->Mogrify('strip');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
$im->Set( 'date:modify' => '');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
$im->Set( 'date:create' => '');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
With the date properties cleared first, then strip:
warn md5_hex($im->ImageToBlob()); # 5f6c94c736f6614a17449bdc6710fd96
$im->Set( 'date:modify' => '');
warn md5_hex($im->ImageToBlob()); # b5f0d9a8df86ff58ed1b345acc533b78
$im->Set( 'date:create' => '');
warn md5_hex($im->ImageToBlob()); # 9278c7386812b311593b5143331ced52
# strip out unique attrs from the image that might mess up the final file
$im->Mogrify('strip');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
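One way to check whether the date properties really end up in the PNG output is to list the text chunks in the blob directly. The sketch below is only a diagnostic under one assumption: that ImageMagick writes these properties as uncompressed tEXt chunks, which I have not confirmed for every version. The helper name png_text_keywords is made up. It relies only on the standard PNG layout: an 8-byte signature, then chunks of 4-byte length, 4-byte type, data, and 4-byte CRC.

```perl
use strict;
use warnings;

# Return the keywords of all tEXt chunks found in a PNG blob.
# Only uncompressed tEXt chunks are inspected; zTXt/iTXt are ignored.
sub png_text_keywords {
    my ($blob) = @_;
    my $signature = "\x89PNG\r\n\x1a\n";
    die "not a PNG\n" unless substr($blob, 0, 8) eq $signature;

    my @keywords;
    my $pos = 8;
    while ($pos + 8 <= length $blob) {
        # Chunk header: big-endian data length, then 4-character type.
        my ($len, $type) = unpack 'N a4', substr($blob, $pos, 8);
        if ($type eq 'tEXt') {
            # tEXt data is "keyword\0text"; keep the keyword only.
            my ($keyword) = split /\0/, substr($blob, $pos + 8, $len), 2;
            push @keywords, $keyword if defined $keyword;
        }
        last if $type eq 'IEND';
        $pos += 12 + $len;    # header (8) + data + CRC (4)
    }
    return @keywords;
}
```

Calling png_text_keywords($im->ImageToBlob()) after each step in the listings above should show whether date:create or date:modify is still present at that point.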
As a side note, I hope others looking for how to clear date:modify and date:create can find this post, as the documentation is VERY confusing about how to do this. I'm still not sure the above is the correct procedure.
Thanks for reading this far. I have been working on this for over a week!