I have been using the Open Food Facts JSONL file for some time now and have noticed that products sometimes disappear due to community updates. I’ve also noticed that nutritional info (which is what I’m primarily interested in) and image information are sometimes structured differently from record to record.
For my nutrition app, I created a subset of the Open Food Facts dataset that standardises and simplifies the nutritional info and contains only:
- Barcode
- Product Name (in all available languages)
- Brand
- Ingredients (in all available languages)
- Ingredients with Allergen Highlighting (in all available languages)
- Packaging Text (in all available languages)
- Primary language
- Serving Size number (e.g. 50g)
- Serving Size Description (e.g. 1 slice)
- Nutritional Info (including per-serving if serving size is available)
- URLs of detected images (i.e. the derived AWS and OFF URLs). Be sure to amend URLs to retrieve only the size you need (e.g. change “full” to “400” for a medium-sized image in OFF URLs).
Each product is stored as a JSON string like this:
{
"_id": "0008077102146",
"brand": "meiji",
"product_name": {
"en": "Hello Panda"
},
"code": "0008077102146",
"images": {
"front_en": {
"contributor": "Open Food Facts Contributor : stephane",
"aws": "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/000/807/710/2146/2.jpg",
"off": "https://images.openfoodfacts.org/images/products/000/807/710/2146/front_en.6.full.jpg"
},
"nutrition_en": {
"contributor": "Open Food Facts Contributor : stephane",
"aws": "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/000/807/710/2146/3.jpg",
"off": "https://images.openfoodfacts.org/images/products/000/807/710/2146/nutrition_en.7.full.jpg"
}
},
"ingredients_text": {
"en": "Wheat flour, vegetable shortening (partially hydrogenate palm & canola oils), sugar, malt syrup, lactose, whole milk powder, skim milk powder, emulsifier, (soya-lecithin), seasoning (natural), leavening (ammonium bicarbonate & sodium bicarbonate), salt, sugar flavor, milk flavor."
},
"ingredients_text_with_allergens": {
"en": "<span class=\"allergen\">Wheat flour</span>, vegetable shortening (partially hydrogenate palm & canola oils), sugar, malt syrup, <span class=\"allergen\">lactose</span>, whole milk powder, skim milk powder, emulsifier, (<span class=\"allergen\">soya-lecithin</span>), seasoning (natural), leavening (ammonium bicarbonate & sodium bicarbonate), salt, sugar flavor, milk flavor."
},
"lc": "en",
"nutriments": {
"carbohydrates_100g": 66.67,
"carbohydrates_serving": 10,
"energy_kcal_100g": 533,
"energy_kcal_serving": 80,
"energy_kj_100g": 2230,
"energy_kj_serving": 334,
"fat_100g": 26.67,
"fat_saturated_100g": 13.33,
"fat_saturated_serving": 2,
"fat_serving": 4,
"fibre_100g": 6.7,
"fibre_serving": 1.01,
"protein_100g": 6.67,
"protein_serving": 1,
"salt_100g": 0.9175,
"salt_serving": 0.138,
"sodium_100g": 0.367,
"sodium_serving": 0.0551,
"sugars_100g": 33.33,
"sugars_serving": 5
},
"serving_quantity": "15",
"serving_size": "4 COOKIES, PER CONTAINER ABOUT (15 g)"
}
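As a rough sketch of the image size tweak mentioned in the field list: the helper name `off_image_url` is made up for illustration, and it assumes the OFF URL ends in `.full.jpg` as in the record above. Only the “400” size comes from this post; I haven't listed which other sizes exist.

```python
import re


def off_image_url(url: str, size: str = "400") -> str:
    """Swap the trailing 'full' size marker of an OFF image URL for another size.

    Hypothetical helper: it assumes the URL ends in '.full.jpg', as in the
    sample record, and leaves any other URL unchanged.
    """
    # Anchor on the suffix so a 'full' elsewhere in the path is never touched.
    return re.sub(r"\.full\.jpg$", f".{size}.jpg", url)


url = "https://images.openfoodfacts.org/images/products/000/807/710/2146/front_en.6.full.jpg"
print(off_image_url(url))
```

The AWS URLs in the record don't carry a size marker, so the helper returns them as they are.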
Items in the main JSONL file are rejected if I can’t find:
- A Product Name
- Nutritional Info
This has reduced the number of products from 4.4M to 3.1M. The compressed JSONL is about 476MiB (3.36GiB when uncompressed).
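For anyone wanting to apply a similar filter to the main file themselves, here is a minimal sketch. It is not my actual generator code. It assumes a raw record keeps its name under `product_name` or a language-suffixed key such as `product_name_en`, that nutrition lives in a `nutriments` object, and that a record missing either one is dropped. The function names are hypothetical.

```python
def has_product_name(record: dict) -> bool:
    """True if any product-name field holds non-blank text (assumed key layout)."""
    return any(
        isinstance(value, str) and value.strip() != ""
        for key, value in record.items()
        if key == "product_name" or key.startswith("product_name_")
    )


def has_nutrition(record: dict) -> bool:
    """True if 'nutriments' holds at least one numeric value (assumed criterion)."""
    nutriments = record.get("nutriments")
    if not isinstance(nutriments, dict):
        return False
    return any(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in nutriments.values()
    )


def keep(record: dict) -> bool:
    """Keep a record only when both a name and some nutrition data are present."""
    return has_product_name(record) and has_nutrition(record)
```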
My JSONL is regenerated every day, finishing at about 5:30am UK local time, and I’m happy for others to use, modify and republish it. If I notice my server getting hammered, I might restrict access somehow or move the file to AWS or similar.
https://nutrit.app/data/openfoodfacts-nutritapp.jsonl.gz
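Since the file is over 3GiB uncompressed, it is probably easiest to stream it one line at a time rather than load it whole. A minimal sketch, assuming one JSON object per line; the name `iter_products` is just for illustration, and the tiny in-memory sample stands in for the real download so the snippet runs on its own:

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_products(stream: BinaryIO) -> Iterator[dict]:
    """Yield one product dict per non-empty line of a gzip-compressed JSONL stream."""
    # gzip.open accepts an already-open binary file object as well as a path.
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Tiny in-memory stand-in for the downloaded file (assumption: one object per line).
sample = gzip.compress(
    b'{"code": "0008077102146", "nutriments": {"energy_kcal_100g": 533}}\n'
)

for product in iter_products(io.BytesIO(sample)):
    print(product["code"], product["nutriments"].get("energy_kcal_100g"))
```

For the real file, download it first and pass `open("openfoodfacts-nutritapp.jsonl.gz", "rb")` instead of the sample. Note that `serving_quantity` is a string in the example record, so convert it before doing arithmetic with it.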
If you do make use of it and notice any errors or other issues, please let me know.