Ingredient data

Hi, I just discovered this wonderful tool and so far I'm finding it very interesting. I decided to look up the ingredient data for 2 random products, and both had issues.

Barcode 5010003064744
Hovis Seed Sensations Bread
The ingredient data had a lot of spelling mistakes and underscore characters that did not belong there. When ingredients are created and/or edited, is there not some form of automatic spelling or punctuation check in place?
Some examples of the spelling mistakes: CaramelisedSuaar, MattedBadevFlour.

It seems that anyone can make an edit. Are there any checks in place to ensure the person editing the product is not providing incorrect information?

Some product information can be viewed online via other sources, such as food retailer websites. Is it possible to provide a URL so the data can be checked against food retailer data?

I find this solution fascinating, but I am concerned about the quality of the data I have looked up so far.

Any thoughts and/or advice on this are welcome.

I think the underscores are for things like bold, which is often used in ingredient lists.

I don’t think there is anything automatic about spelling and punctuation, but there is an “index of correctness” based on the number of ingredients that exist in our taxonomy.
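To illustrate the idea, here is a minimal sketch of such a score as the share of listed ingredients that are found in a taxonomy. The taxonomy contents, the function names, and the formula itself are assumptions made for the example; the thread does not say how the real index is computed.

```python
# Hypothetical taxonomy for illustration only; the real one is far larger.
KNOWN_INGREDIENTS = {
    "wheat flour", "water", "caramelised sugar",
    "malted barley flour", "yeast", "salt",
}

def split_ingredients(text: str) -> list[str]:
    """Split a comma-separated ingredient list and normalise case and spacing."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]

def known_ratio(text: str, taxonomy: set[str] = KNOWN_INGREDIENTS) -> float:
    """Return the fraction of listed ingredients that are found in the taxonomy."""
    items = split_ingredients(text)
    if not items:
        return 0.0
    return sum(item in taxonomy for item in items) / len(items)
```

With the misspelled entries from the first post, `known_ratio("Wheat Flour, Water, CaramelisedSuaar, MattedBadevFlour")` gives 0.5, because only two of the four entries are recognised.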

Like other collaborative projects, we tend to trust the user, but we do require an account to edit product details. Also, every edit is saved and can be rolled back should it be necessary.

Thank you for your response. However, using underscores to highlight characters is not an industry standard. They just make the ingredient information look incorrect to most users.
To highlight allergens you would need HTML or Rich Text. And if the underscores were meant to highlight the allergens, that is even more of an issue, as many of the allergens in this example were not highlighted before I corrected the ingredients.

Might it be worth adding a basic spell-check function, just to stop words like “CaramelisedSuaar, MattedBadevFlour” appearing in the data?

AFAIK we try to find allergens in the ingredients taxonomy, and we automatically apply that type of markup. The mobile app and the website should display them as bold. We cannot use HTML or RTF, because the ingredient list is something users should be able to update without knowing either of them.
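As a rough sketch of how a client could turn that markup into bold text, assuming allergens are wrapped in single underscores such as `_wheat_`. The function name and the HTML output are assumptions for illustration, not the actual app or website code.

```python
import html
import re

# Assumption: an allergen is any run of non-underscore characters wrapped in single underscores.
ALLERGEN_MARKUP = re.compile(r"_([^_]+)_")

def render_allergens(text: str) -> str:
    """Escape the text for HTML, then turn _allergen_ spans into <b>allergen</b>."""
    escaped = html.escape(text)
    return ALLERGEN_MARKUP.sub(r"<b>\1</b>", escaped)
```

For example, `render_allergens("_Wheat_ flour, water, _barley_ malt")` returns `<b>Wheat</b> flour, water, <b>barley</b> malt`, while the stored text stays editable as plain text.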

Obviously, if the ingredients are spelled wrong there is no way of finding them in the taxonomy, and the only thing we do is mark the ingredients as unknown (you can see that by opening the details of the ingredients analysis).

As for the spelling, I don’t think creating rules for every possible spelling mistake would be feasible, but I’ll let @stephane reply to that part.

I think the ingredient list was produced by OCR from the picture and never corrected afterwards.

It could also be that!

Indeed we used to have very bad OCR errors about 5 years ago, but the quality of OCR has gone up tremendously, so even just re-running OCR on the same photos from 5 years ago gives much better results.

We thought of introducing spellchecks or spell suggestions, and some experiments were made, but currently it’s not deployed.
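For what it's worth, a spell suggestion can be sketched with nothing but the Python standard library. This is not the experiment mentioned above; the vocabulary and the `suggest` helper are made up for the example, and a real vocabulary would come from the ingredients taxonomy.

```python
import difflib
from typing import Optional

# Hypothetical vocabulary for illustration only.
VOCABULARY = ["sugar", "salt", "water", "flour", "barley", "malted", "caramelised"]

def suggest(word: str, vocabulary: list[str] = VOCABULARY, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest vocabulary word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Here `suggest("Suaar")` returns `"sugar"` and `suggest("Matted")` returns `"malted"`. A badly mangled word can still fall below the cutoff and get no suggestion, so this would only ever be a hint for a human editor.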

According to the (old?) documentation, the images are OCRed via Google Cloud Vision, probably without post-processing. So things like hyphens have to be removed manually, even though they could easily be removed by running the output through a few regexes and then checking it against a wordlist.
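A minimal sketch of that kind of post-processing, assuming the OCR output is plain text. The wordlist and the `dehyphenate` helper are hypothetical; a word split by a hyphen and a line break is rejoined only when the joined form is a known word.

```python
import re

# Hypothetical wordlist; in practice a dictionary or the ingredients taxonomy.
WORDLIST = {"caramelised", "sugar", "wholemeal", "flour"}

# Two word fragments separated by a hyphen plus whitespace (including a line break).
BROKEN_WORD = re.compile(r"(\w+)-\s+(\w+)")

def dehyphenate(text: str, wordlist: set[str] = WORDLIST) -> str:
    """Rejoin words split by a hyphen and whitespace when the joined form is a known word."""
    def join_if_known(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        # Leave the text untouched when the joined form is not a known word.
        return joined if joined.lower() in wordlist else match.group(0)
    return BROKEN_WORD.sub(join_if_known, text)
```

Checking against the wordlist is what keeps genuine hyphenated terms intact: "cara-" followed by "melised" on the next line becomes "caramelised", but "sugar-" followed by "free" is left alone because "sugarfree" is not in the list.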

OCR has really improved, and programs like Tesseract (tesseract-ocr/tesseract on GitHub) work very well even with low-resolution images.