We already have many data quality warnings about ingredients, but they produce many false positives. We should try to find the real issues so that we can fix them.
We can probably identify two cases.
1. Ingredients with real issues
These ingredients should be identified and manually fixed. We should try to list some ingredients that contain issues.
a. Typos
- bceuf instead of boeuf, and ceuf instead of oeuf (this query gives 800+ results)
- sečer/šecer/secer/sećer/šečer instead of šećer (found 25 times (~3000 products in HR) 3859889283694, Macho Sandwich - Ledo - 86g, …)
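As a rough illustration of how such typos could be found and fixed, here is a minimal Python sketch. The per-language table `TYPO_FIXES` and the helpers `find_typos` and `fix_typos` are hypothetical names, not part of any existing code base; the table contains only the examples listed above, and the sketch assumes the language of the ingredient list is known.

```python
import re

# Hypothetical correction table, keyed by language code, built only from
# the typos listed above. A real table would need to be curated manually.
TYPO_FIXES = {
    "fr": {"bceuf": "boeuf", "ceuf": "oeuf"},
    "hr": {typo: "šećer" for typo in ("sečer", "šecer", "secer", "sećer", "šečer")},
}


def _typo_regex(lang):
    # Longest words first; \b is Unicode-aware for str patterns, so "ceuf"
    # is not matched inside "bceuf".
    words = sorted(TYPO_FIXES[lang], key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_typos(text, lang):
    """Return the known typos found in text, in order of appearance."""
    return [m.group(0) for m in _typo_regex(lang).finditer(text)]


def fix_typos(text, lang):
    """Replace known typos with their correction (always lowercase)."""
    fixes = TYPO_FIXES[lang]
    return _typo_regex(lang).sub(lambda m: fixes[m.group(0).lower()], text)
```

Restricting each table to one language is a deliberate choice in this sketch: a string such as "secer" is only treated as a typo in Croatian ingredient lists. Note that the replacement does not preserve the original capitalization.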
b. Words which are not ingredients
- "l.l.c.", "gmbh", "gaec", etc. (this query returns 2200+ products)
- Ingredients beginning with "Ingredients" or "ING:", "INGREDIENTS / INGREDIENTES:", etc.
- "à conserver", "store", … (this query returns more than 7400 products). Add a stopword?
- "made in", "fabriqué", …
- "Kann bei übermäßigem Verzehr abführend wirken" (meaning: Can have a laxative effect if consumed to excess)
- [to be continued]
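The non-ingredient words listed in 1.b could be flagged with a handful of regular expressions. The following is only a sketch: the category names, `NON_INGREDIENT_PATTERNS`, `flag_non_ingredient_text` and `strip_leading_label` are hypothetical, and the patterns cover just the examples given above.

```python
import re

# Hypothetical categories; each pattern covers only the examples listed in 1.b.
# (?<!\w) and (?!\w) are used instead of \b because "l.l.c." ends with a dot.
NON_INGREDIENT_PATTERNS = {
    "company_suffix": re.compile(r"(?<!\w)(?:l\.l\.c\.|gmbh|gaec)(?!\w)", re.IGNORECASE),
    "leading_label": re.compile(
        r"^\s*(?:ing\s*:"
        r"|(?:ingredients|ingredientes)\b"
        r"(?:\s*/\s*(?:ingredients|ingredientes)\b)*\s*:?)",
        re.IGNORECASE,
    ),
    "storage": re.compile(r"(?<!\w)(?:à conserver|store)(?!\w)", re.IGNORECASE),
    "origin": re.compile(r"(?<!\w)(?:made in|fabriqué)(?!\w)", re.IGNORECASE),
    "laxative_warning": re.compile(
        r"kann bei übermäßigem verzehr abführend wirken", re.IGNORECASE
    ),
}


def flag_non_ingredient_text(text):
    """Return the names of the categories whose pattern matches text."""
    return [name for name, rx in NON_INGREDIENT_PATTERNS.items() if rx.search(text)]


def strip_leading_label(text):
    """Remove a leading 'Ingredients' / 'ING:' style label, if present."""
    return NON_INGREDIENT_PATTERNS["leading_label"].sub("", text, count=1).lstrip()
```

Flagging and stripping are kept separate here because only the leading label looks safe to remove automatically; the other categories probably need a manual review, since it is not obvious where the non-ingredient text ends.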
2. Ingredients to be deleted
The whole ingredients field should be deleted because it is too “noisy”: a few words may be fine, but the rest is full of mistakes. We should try to identify patterns based on examples.
In this case, we could automatically delete the ingredients.
Examples
Patterns
- a. Ingredients only containing a sequence of one to five characters repeated at least 3 times
- b. Ingredients containing a specific letter repeated more than 4 times: no language at all should contain an alphabetic letter repeated more than 4 times
- c. Ingredients containing only non-alphanumeric signs.
- d. Ingredients containing mostly non-alphanumeric signs.
- e. Ingredients only containing one Latin character or diacritic sign
- f. [to be continued]
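Patterns a to e could be checked along the following lines. This is a sketch under several assumptions that the list above does not settle: "repeated" in pattern b is read as "repeated consecutively", "mostly" in pattern d is read as more than 50% of the non-whitespace characters, and a "diacritic sign" in pattern e is approximated by the Unicode categories Mn, Sk and Lm. The function name `noise_reasons` is hypothetical.

```python
import re
import unicodedata

# a. the whole text is a 1-5 character sequence repeated at least 3 times
_REPEATED_SEQ = re.compile(r"(.{1,5}?)\1{2,}", re.DOTALL)
# b. the same letter more than 4 times in a row (assumption: consecutive)
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1{4,}")
# e. assumption: combining marks and modifier symbols/letters count as diacritics
_DIACRITIC_CATEGORIES = {"Mn", "Sk", "Lm"}


def noise_reasons(text, mostly_threshold=0.5):
    """Return the letters of the patterns (a to e) that the text matches."""
    stripped = text.strip()
    if not stripped:
        return []  # an empty field has nothing to delete
    reasons = []
    if _REPEATED_SEQ.fullmatch(stripped):
        reasons.append("a")
    if _REPEATED_LETTER.search(stripped):
        reasons.append("b")
    visible = [ch for ch in stripped if not ch.isspace()]
    non_alnum = sum(1 for ch in visible if not ch.isalnum())
    if non_alnum == len(visible):
        reasons.append("c")  # only non-alphanumeric signs
    if non_alnum / len(visible) > mostly_threshold:
        reasons.append("d")  # mostly non-alphanumeric signs
    if len(stripped) == 1:
        is_latin = stripped.isalpha() and "LATIN" in unicodedata.name(stripped, "")
        if is_latin or unicodedata.category(stripped) in _DIACRITIC_CATEGORIES:
            reasons.append("e")
    return reasons
```

An ingredients field would be a candidate for automatic deletion when `noise_reasons` returns a non-empty list. Returning the matching pattern letters rather than a boolean makes it easier to review each pattern's false positives separately before anything is deleted.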