Image Preparation for Google Vision API
Google Vision API is a powerful tool for extracting text from an image. It produces state-of-the-art results for OCR tasks. However, as with any tool, you need to shape the input image to fit its expectations.
To extract invoice information, MLReader uses Google Vision API to obtain the raw text. Some undocumented factors, such as aspect ratio, affect the result. Since they are undocumented, the following is based on trial-and-error experiments. Your mileage may vary.
The original invoice is a PDF file. MLReader uses ImageMagick to convert it into PNG files at a resolution of 300 dpi. For testing purposes, we generated three PNG files.
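MLReader's exact invocation is not reproduced here; a minimal sketch of the conversion step, assuming ImageMagick's `convert` binary is on the PATH and using hypothetical file names:

```python
import subprocess

def convert_cmd(pdf_path, png_path, dpi=300):
    """Build the ImageMagick command to rasterize a PDF to PNG."""
    # -density must precede the input file so the PDF is rasterized
    # at the requested resolution instead of ImageMagick's 72-dpi default.
    return ["convert", "-density", str(dpi), pdf_path, png_path]

def pdf_to_png(pdf_path, png_path, dpi=300):
    """Run the conversion; raises CalledProcessError on failure."""
    subprocess.run(convert_cmd(pdf_path, png_path, dpi), check=True)
```

On newer ImageMagick 7 installations the binary is named `magick` rather than `convert`; adjust the first list element accordingly.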
The first was generated on macOS with dimensions of 2480 x 3508. This file has an aspect ratio, i.e. width / height, of 0.71.
The second was generated on CentOS with the same dimensions of 2480 x 3508, so it also has an aspect ratio of 0.71.
The third was generated on CentOS, but cropped to 2480 x 3306. This file has an aspect ratio of 0.75.
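The ratios above are simply width divided by height, rounded to two decimal places:

```python
def aspect_ratio(width, height):
    """Width-to-height ratio, rounded to two decimal places."""
    return round(width / height, 2)

print(aspect_ratio(2480, 3508))  # 0.71 (uncropped macOS and CentOS images)
print(aspect_ratio(2480, 3306))  # 0.75 (cropped CentOS image)
```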
These three files are shown below. Even with the same dimensions, the files generated on macOS and CentOS have different file sizes. This is most likely due to the different fonts installed on these systems.
macos_r300_2480_3508.png centos_r300_2480_3508.png centos_r300_2480_3306.png
After sending these files to Google Vision API, we got the following results:
For the first one, i.e. macos_r300_2480_3508.png:
"Slicedinvoices\nInvoice\nFrom:\nDEMO - Sliced Invoices\n ....
For the second one, i.e. centos_r300_2480_3508.png:
"Invoice\nProm:\nDEMO - Sliced Invoices\n ...
For the third one, i.e. centos_r300_2480_3306.png:
"Slicedlnvoices\nnvoice\nFrom:\nDEMO Sliced Invoices\n ...
As you can see, they produced different OCR extractions for the beginning of this invoice:
The image generated on macOS produced the best result. It extracted all text correctly, except for the lowercase 'i' in 'Slicedinvoices'.
The image generated on CentOS, without cropping, produced the worst result:
It missed the logo portion, i.e. 'SlicedInvoices'.
It misread 'From' as 'Prom'.
The cropped CentOS image falls in the middle:
It retained the logo, although it produced a lowercase 'l' instead of an uppercase 'I' in the first word, 'SlicedInvoices'.
It recognized 'From' correctly.
It missed the 'I' of the second word, 'Invoice'.
Again, there is no documentation about what Google Vision API expects or does with these images. The following is based purely on observation. If any visitor can shed light on this, please leave a message.
It seems Google Vision API expects a certain aspect ratio between width and height. When it sees an out-of-range image, it may resize and/or crop the image. Depending on the font, this can result in unrecognized or misinterpreted text.
Although ImageMagick on macOS produces the best result, macOS is unlikely to be the operating system in your production environment. If your production runs on CentOS, it is better to generate an image that minimizes the chance that Google Vision API has to perform additional transformations. The optimal ratio we found is 0.75.
To achieve that ratio, it is better to crop than to resize the image. Simple resizing can distort the fonts. All OCR systems are trained on certain font types, and Google Vision API is no exception.
For invoice images, if one has to choose, it is usually better to crop off the bottom rather than the top in order to reach the 0.75 ratio.
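The bottom-crop computation can be sketched as follows (the function name is ours; the resulting left/upper/right/lower box is the format accepted by, e.g., Pillow's `Image.crop`):

```python
def bottom_crop_box(width, height, target_ratio=0.75):
    """Return a crop box (left, upper, right, lower) that trims the
    bottom of a width x height image so width / new_height reaches
    target_ratio. An image already at or above the ratio is kept whole."""
    new_height = min(height, int(width / target_ratio))
    return (0, 0, width, new_height)

print(bottom_crop_box(2480, 3508))  # (0, 0, 2480, 3306)
```

Applied to the uncropped 2480 x 3508 file, this yields exactly the 2480 x 3306 dimensions of the third test image above.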