Ingredients errors

We already have many data quality warnings about ingredients, but they produce many false positives. We should try to find the real issues so that we can fix them.

We can probably identify two cases.

1. Ingredients with real issues

These ingredients should be identified and manually fixed. We should try to list some ingredients that contain issues.

a. Typos

b. Words which are not ingredients

  • l.l.c., gmbh, gaec, etc. (this query returns 2200+ products)
  • Ingredient lists beginning with Ingredients, ING:, INGREDIENTS / INGREDIENTES:, etc.
  • à conserver, store, … (this query returns more than 7400 products)
    • Add a stopword?
  • made in, fabriqué, …
  • Kann bei übermäßigem Verzehr abführend wirken (meaning: Can have a laxative effect if consumed to excess)
  • [to be continued]
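
As a rough illustration of how the non-ingredient words listed above could be detected, here is a minimal Python sketch. The category names, the function name flag_non_ingredients and the regular expressions are all assumptions made for this example. They only cover the few terms quoted in the list and are not the queries or the stopword mechanism mentioned above.

```python
import re

# Hypothetical pattern table: category name -> compiled regex.
# The terms come from the list above; the regexes are illustrative only.
NON_INGREDIENT_PATTERNS = {
    # Company legal forms such as "l.l.c.", "gmbh", "gaec".
    "company_suffix": re.compile(r"(?<!\w)(?:l\.l\.c\.|gmbh|gaec)(?!\w)", re.IGNORECASE),
    # Field starting with "Ingredients", "INGREDIENTES" or "ING:".
    "list_prefix": re.compile(r"^\s*(?:ingrediente?s|ing\s*:)", re.IGNORECASE),
    # Storage instructions such as "à conserver" or "store".
    "storage": re.compile(r"(?<!\w)(?:à conserver|store)(?!\w)", re.IGNORECASE),
    # Origin statements such as "made in" or "fabriqué".
    "origin": re.compile(r"(?<!\w)(?:made in|fabriqué)(?!\w)", re.IGNORECASE),
}


def flag_non_ingredients(ingredients_text):
    """Return the sorted names of the categories whose pattern matches the text."""
    return sorted(
        name
        for name, pattern in NON_INGREDIENT_PATTERNS.items()
        if pattern.search(ingredients_text)
    )


if __name__ == "__main__":
    for sample in ["ING: water, made in France", "sugar, salt"]:
        print(sample, "->", flag_non_ingredients(sample))
```

A product flagged this way would still need a manual check, since a word such as "store" can also appear in a legitimate context.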

2. Ingredients to be deleted

The whole ingredients field should be deleted because it is too “noisy”: a few words may be fine, but the rest is full of mistakes. We should try to identify patterns based on examples.

In this case, we could automatically delete the ingredients.
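
Until real patterns are collected, one possible way to picture "too noisy" is the share of entries that are not recognised as ingredients. The sketch below is only an assumption for discussion: the known_ingredients vocabulary, the comma-based splitting, the function names and the 0.8 threshold are all hypothetical, and none of them comes from an existing tool.

```python
def unknown_ratio(ingredients_text, known_ingredients):
    """Share of comma-separated entries that are not in the known vocabulary."""
    parts = [part.strip().lower() for part in ingredients_text.split(",")]
    parts = [part for part in parts if part]  # ignore empty entries
    if not parts:
        return 0.0
    unknown = sum(1 for part in parts if part not in known_ingredients)
    return unknown / len(parts)


def is_deletion_candidate(ingredients_text, known_ingredients, threshold=0.8):
    """Flag the field when the unknown share reaches the (arbitrary) threshold."""
    return unknown_ratio(ingredients_text, known_ingredients) >= threshold


if __name__ == "__main__":
    vocabulary = {"sugar", "salt", "water"}  # hypothetical known ingredients
    print(is_deletion_candidate("xx, yy, zz", vocabulary))
    print(is_deletion_candidate("sugar, salt, zz", vocabulary))
```

The threshold would have to be tuned on real examples before any automatic deletion is considered.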

Examples

Patterns
