Posted by virantha on Sun 30 March 2014

Detecting color vs greyscale and blank pages during scanning

Recently, I needed a way to detect if a color scanned image was actually color or just contained greyscale/BW content. I also wanted to detect if an image was "blank", or just a continuous shade of one color. I couldn't really find a simple solution on the web, so I pieced together my own using ImageMagick, as described below. I am not an image processing person by any means, but these quick hacks worked well in practice for my needs.

Detecting color vs greyscale

There are a lot of questions on forums about this (1, 2), but no solution seemed to work for me. I basically needed to take a scanned color image, and detect if the original source was just a B&W document that I could down-convert into a dithered 1-bit document to save space/processing time.

The solution I came up with was similar to this post in the ImageMagick documentation

Another technique is to do a direct 'best fit' of a 3 dimensional line to all the colors (or a simplified Color Matrix of metrics) in the image. The error of the fit (generally average of the squares of the errors) gives you a very good indication about how well the image fits to that line.

The observation is that in a pure greyscale image, every color individually has RGB components that are equal (e.g. RGB #222222), so a plot of its colors in 3D space should lie on a single line going through the origin. A monochromatic tint (like sepia) will similarly lie on a line with a different slope/intercept, and any shade variations will show up as errors off that line.

Finding this best-linear-fit in 3D space is overkill for my problem. I don't need the line itself, and I only need to deal with very small tints from grey (the intercept of the line will be very close to the origin). So I just need to know whether some simple error metric, correlated with how far my image's colors deviate from the best-fit line, is within some threshold. Here's the solution I came up with, which I've verified works well when scanning different types of paper documents:

1. Quantize the number of colors in the image to some small number like 8.
2. Quantize to 8-bits in each component.
3. Generate a frequency table of those colors (optional, I didn't really end up using this)
4. Take each of the top N colors, and calculate the mean of the pairwise differences between its RGB components.
5. If the mean for any color is larger than some threshold (I used 20), then classify this image as color; if it is less, then I can safely assume the original image was greyscale or black and white.
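As a quick worked example of steps 4 and 5 (the helper name `mean_rgb_diff` is mine, just for illustration):

```
def mean_rgb_diff(r, g, b):
    # Mean of the pairwise absolute differences between the RGB
    # components: shades of grey score near zero, saturated colors high.
    return (abs(r - g) + abs(r - b) + abs(g - b)) / 3.0

# A pure grey is exactly 0:
print(mean_rgb_diff(0x22, 0x22, 0x22))   # 0.0

# A muted pink (143,118,123) scores about 16.7, under the threshold
# of 20, so an image containing only colors like it would still be
# classified as greyscale:
print(mean_rgb_diff(143, 118, 123))
```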

I use ImageMagick for the first 3 actions, using the following command:

```
convert IMAGE -colors 8 -depth 8 -format %c histogram:info:-
```

This generates output like the following (small differences may arise depending on the image format):

```
  10831: ( 24, 26, 26,255) #181A1A srgba(24,26,26,1)
4836: ( 55, 87, 79,255) #37574F srgba(55,87,79,1)
6564: ( 77,138,121,255) #4D8A79 srgba(77,138,121,1)
4997: ( 86, 96, 93,255) #56605D srgba(86,96,93,1)
7005: ( 92,153,139,255) #5C998B srgba(92,153,139,1)
2479: (143,118,123,255) #8F767B srgba(143,118,123,1)
8870: (169,176,170,255) #A9B0AA srgba(169,176,170,1)
442906: (254,254,254,255) #FEFEFE srgba(254,254,254,1)
1053: (  0,  0,  0,255) #000000 black
484081: (255,255,255,255) #FFFFFF white
```

And here's my Python code for parsing it (you can find this being used in scanpdf). Note that I just use all the colors instead of limiting them to the top N.

```
cmd = "convert %s -colors 8 -depth 8 -format %%c histogram:info:-" % filename
out = self.cmd(cmd)
mLine = re.compile(r"""\s*(?P<count>\d+):\s*\(\s*(?P<R>\d+),\s*(?P<G>\d+),\s*(?P<B>\d+).+""")
colors = []
for line in out.splitlines():
    matchLine = mLine.search(line)
    if matchLine:
        color = [int(x) for x in (matchLine.group('count'),
                                  matchLine.group('R'),
                                  matchLine.group('G'),
                                  matchLine.group('B'),
                                  )]
        colors.append(color)
# Sort by pixel count, most frequent color first
colors.sort(reverse=True, key=lambda x: x[0])
is_color = False
logging.debug(colors)
for count, r, g, b in colors:
    # Calculate the mean of the pairwise differences between the RGB
    # components.  Shades of grey will be very close to zero in this metric...
    diff = float(sum([abs(r - g),
                      abs(r - b),
                      abs(g - b),
                      ])) / 3
    if diff > 20:
        is_color = True
return is_color
```
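To sanity-check the threshold logic without invoking ImageMagick, the same parsing and classification can be repackaged as a standalone function and fed a captured histogram string. `classify_histogram` is my own name for this sketch; it is not part of scanpdf:

```
import re

def classify_histogram(histogram_text, threshold=20):
    # True if any histogram color deviates from grey by more than
    # `threshold` (mean of pairwise RGB differences).
    mLine = re.compile(r"\s*(?P<count>\d+):\s*\(\s*(?P<R>\d+),\s*(?P<G>\d+),\s*(?P<B>\d+).+")
    for line in histogram_text.splitlines():
        m = mLine.search(line)
        if not m:
            continue
        r, g, b = int(m.group('R')), int(m.group('G')), int(m.group('B'))
        if (abs(r - g) + abs(r - b) + abs(g - b)) / 3.0 > threshold:
            return True
    return False

# A few lines from the sample output above; (77,138,121) is clearly a color:
sample = """\
  10831: ( 24, 26, 26,255) #181A1A srgba(24,26,26,1)
   6564: ( 77,138,121,255) #4D8A79 srgba(77,138,121,1)
 484081: (255,255,255,255) #FFFFFF white
"""
print(classify_histogram(sample))     # True

grey_only = "  1053: (  0,  0,  0,255) #000000 black"
print(classify_histogram(grey_only))  # False
```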

Detecting blank pages

For this, I originally started off with this post and used the following command line:

```
convert %s -shave 1%x1% -format "%[fx:mean]" info:
```

For a B&W image (with one channel), this returns the average pixel value after trimming off the edges of the image. So if the mean is greater than 0.97, I would say the page was blank. Unfortunately, this doesn't work for color images, since the individual channels will carry information.
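For the one-channel case, a minimal sketch of wrapping that command from Python (the function names are mine, and this assumes ImageMagick's convert is on the path):

```
import subprocess

def mean_is_blank(mean_value, threshold=0.97):
    # A page is considered blank when its mean pixel value (0.0-1.0,
    # as reported by %[fx:mean]) is close to pure white.
    return mean_value > threshold

def bw_page_is_blank(filename):
    # Shave 1% off each edge to drop scanner border artifacts, then
    # ask ImageMagick for the mean pixel value of what remains.
    out = subprocess.check_output(
        ['convert', filename, '-shave', '1%x1%',
         '-format', '%[fx:mean]', 'info:'])
    return mean_is_blank(float(out))
```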

My final idea was to use the image statistics in each channel, specifically the standard deviation, to identify images with content. A higher standard deviation means more variation in a particular RGB channel, or in other words, more features in the image. So I just look at the standard deviation of each channel, and if it's higher than a threshold, I mark the page as non-blank. I use the ImageMagick identify -verbose tool to get this value:

```
def is_blank(self, filename):
    c = 'identify -verbose %s' % filename
    result = self.cmd(c)
    mStdDev = re.compile(r"""\s*standard deviation:\s*\d+\.\d+\s*\((?P<percent>\d+\.\d+)\).*""")
    for line in result.splitlines():
        match = mStdDev.search(line)
        if match:
            stdev = float(match.group('percent'))
            if stdev > 0.1:
                return False
    return True
```
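The standard-deviation check can be exercised against a captured fragment of identify -verbose output; the function name and the sample numbers below are made up for illustration:

```
import re

def has_content(identify_output, threshold=0.1):
    # True if any channel's normalized standard deviation (the
    # parenthesized 0.0-1.0 value) exceeds the threshold, i.e. the
    # page is not blank.
    mStdDev = re.compile(r"\s*standard deviation:\s*\d+\.\d+\s*\((?P<percent>\d+\.\d+)\)")
    for line in identify_output.splitlines():
        m = mStdDev.search(line)
        if m and float(m.group('percent')) > threshold:
            return True
    return False

# Illustrative (made-up) identify -verbose fragments:
busy_page = "      standard deviation: 58.1097 (0.2279)"
blank_page = "      standard deviation: 2.5652 (0.0101)"
print(has_content(busy_page))   # True
print(has_content(blank_page))  # False
```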

Obviously, any noise (high frequency variation) will also contribute to a higher stdev, so at some point in the future, I might run some denoise filters first, but for now I can live with the false positive non-blanks.

That's it for this post. Again, you can see how I used this in my own pdf scanning script, ScanPDF.

© Virantha Ekanayake. Built using Pelican. Modified svbhack theme, based on theme by Carey Metcalfe