How we use AI to cross-reference 4,000 food additives across FDA and EFSA databases
Reconciling US and EU regulatory data on a single additive can take a chemist an hour. We do it for 4,000 substances. Here is exactly how the pipeline works — the models we tested, where they help, and where they don't.
The problem: same molecule, four different names
Pick any common food additive and look it up across the major regulatory databases. You will find that the FDA's EAFUS lists it under one chemical name, EFSA uses a different name plus an E-number, OpenFDA's adverse-event database may record it under a brand designation, and academic toxicology databases like ChemIDplus use the IUPAC form. They all describe the same molecule. None of them join on a single primary key.
The classical solution is to join on CAS Registry Number — the chemical identifier maintained by the American Chemical Society. That works for roughly 88% of the EAFUS dataset. For the rest, the CAS number is missing, shared across functionally distinct forms (anhydrous vs. monohydrate vs. acid salts), or recorded with typographic errors that propagate across decades of data entry. A human chemist can resolve these in seconds; doing it 4,000 times by hand is the kind of work that never gets finished.
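One class of typographic error can be caught mechanically, because the last digit of a CAS Registry Number is a check digit. The sketch below validates it before any join is attempted. The function name `is_valid_cas` is ours for illustration, and the check does nothing for missing or shared CAS numbers, which still need the steps that follow.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one,
    # then compare the weighted sum modulo 10 with the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("22839-47-0"))  # aspartame's CAS number
print(is_valid_cas("22839-74-0"))  # same number with two digits transposed
```

A failed check only says the identifier is wrong somewhere; deciding what the correct number should be remains a matching problem.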
Step 1 — Nomenclature matching with embeddings
Every additive in our database carries a list of known synonyms: the FDA name, the EFSA name, the INS/E-number, the IUPAC form, the CAS Registry Number, and any common trade names found in the OpenFDA adverse-event reports. We embed each string into a 3,072-dimension vector using OpenAI's text-embedding-3-large model and store the vectors in a pgvector index on the same Neon Postgres instance that backs the site.
When a new substance enters the pipeline (say, a fresh release of the EFSA OpenFoodTox spreadsheet), we embed each of its names and run a cosine-similarity nearest-neighbour search against the existing index. Anything with a similarity above 0.86 is flagged as a probable match; anything between 0.74 and 0.86 is queued for human review. Below 0.74 we treat it as a new substance. The thresholds were calibrated on a held-out set of 1,200 manually resolved pairs.
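The three-band triage can be sketched in a few lines. This is a toy version under stated assumptions: it does a brute-force search over an in-memory dictionary rather than a pgvector index, the vectors are three-dimensional stand-ins for real embeddings, and the names `triage`, `AUTO_ACCEPT`, and `REVIEW_FLOOR` are illustrative. Only the two threshold values come from the text.

```python
import math

AUTO_ACCEPT = 0.86   # above this: probable match
REVIEW_FLOOR = 0.74  # from here up to AUTO_ACCEPT: human review


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triage(query_vec, index):
    """Find the nearest indexed substance and assign a decision band.

    index maps a substance id to its embedding vector.
    Returns (decision, best_id_or_None, best_score).
    """
    best_id, best_score = None, -1.0
    for substance_id, vec in index.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_id, best_score = substance_id, score
    if best_score > AUTO_ACCEPT:
        return "probable_match", best_id, best_score
    if best_score >= REVIEW_FLOOR:
        return "human_review", best_id, best_score
    return "new_substance", None, best_score


# Toy index with made-up 3-D vectors in place of 3,072-D embeddings.
toy_index = {"E951": [1.0, 0.0, 0.0], "E621": [0.0, 1.0, 0.0]}
print(triage([0.99, 0.05, 0.0], toy_index)[0])
print(triage([1.0, 0.75, 0.0], toy_index)[0])
print(triage([1.0, 1.0, 0.0], toy_index)[0])
```

A score of exactly 0.86 falls into the review band here, which is one reading of "above 0.86"; the boundary handling is an assumption.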
The win here is not that embeddings are smarter than a chemist — they aren't. The win is that 92% of pairs land cleanly in the auto-accept band, leaving the chemist to spend their time only on the ambiguous 8%.
Step 2 — Vectorizing toxicological PDFs
EFSA Scientific Opinions are the most detailed open-access toxicology dossiers in the world. They also happen to be 80- to 250-page PDFs with multi-column layouts, embedded tables, and decades of inconsistent formatting. Extracting the acceptable-daily-intake (ADI), the no-observed-adverse-effect level (NOAEL), and the panel's verdict from one of these documents by hand takes about 25 minutes.
We chunk each PDF into ~800-token passages, embed them, and persist them alongside the substance record. At query time we retrieve the five most relevant passages and feed them to a long-context model — Gemini 2.5 Pro in our pipeline, because it was the only one that consistently picked the correct ADI when several values appeared in the same document (provisional ADIs, group ADIs, ADIs that were later revised). The model output is constrained to a strict JSON schema; any value it produces must quote the source span verbatim or the row is rejected.
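The chunking and the verbatim-quote rule can be illustrated as follows. Several things here are assumptions: whitespace-separated words stand in for model tokens, the row keys `value` and `source_quote` are hypothetical, whitespace is normalized before comparison to tolerate PDF line breaks, and the requirement that the value itself appears inside the quote is our addition. The sample passage is illustrative text, not a quotation from an EFSA opinion.

```python
def chunk_words(text, max_tokens=800):
    """Split text into passages of at most max_tokens words.

    Words are a rough stand-in for tokens; a real pipeline would
    count tokens with the embedding model's tokenizer.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def normalize(s):
    """Collapse runs of whitespace so line breaks do not defeat matching."""
    return " ".join(s.split())


def accept_extraction(row, passages):
    """Accept a row only if its quote occurs verbatim in a retrieved passage.

    row is a dict with hypothetical keys "value" and "source_quote".
    """
    quote = normalize(row.get("source_quote", ""))
    value = normalize(str(row.get("value", "")))
    if not quote or not value or value not in quote:
        return False
    return any(quote in normalize(p) for p in passages)


passages = ["The Panel established an ADI of\n40 mg/kg bw per day."]
good = {"value": "40 mg/kg bw per day",
        "source_quote": "established an ADI of 40 mg/kg bw per day"}
bad = {"value": "50 mg/kg bw per day",
       "source_quote": "established an ADI of 50 mg/kg bw per day"}
print(accept_extraction(good, passages), accept_extraction(bad, passages))
```

The check is deliberately dumb: it cannot tell whether the model picked the right ADI among several, only that the figure was not invented.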
Step 3 — Regulatory gap detection
Once both regulators are represented on a single primary key, the interesting queries become trivial:
- Permitted in US, banned in EU. This set includes potassium bromate, azodicarbonamide, and titanium dioxide (banned in the EU in 2022). Brominated vegetable oil belonged here until the FDA revoked its authorization in 2024.
- Permitted in EU, restricted in US. Several colour additives sit here, including some carotenoid preparations subject to FDA certification requirements that EU law does not impose.
- Conflicting ADIs. Substances where JECFA, EFSA and the FDA have set materially different acceptable daily intakes — a useful signal that the underlying science is contested.
These set-difference queries power our Banned in Europe page and the broader US vs EU food safety comparison. The AI is not doing the comparison — Postgres is. The AI's job was to make the join possible in the first place.
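To show how little work is left once the join exists, here is a sketch of the first query using Python's built-in `sqlite3` as a stand-in for Postgres. The table name `regulatory_status`, its columns, and the status vocabulary are hypothetical, and the six rows are hand-typed from examples mentioned in this article rather than pulled from the live database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE regulatory_status (
        substance    TEXT NOT NULL,
        jurisdiction TEXT NOT NULL,
        status       TEXT NOT NULL,
        PRIMARY KEY (substance, jurisdiction)
    )
    """
)

# Illustrative rows only.
rows = [
    ("potassium bromate", "US", "permitted"),
    ("potassium bromate", "EU", "banned"),
    ("titanium dioxide", "US", "permitted"),
    ("titanium dioxide", "EU", "banned"),
    ("aspartame", "US", "permitted"),
    ("aspartame", "EU", "permitted"),
]
conn.executemany("INSERT INTO regulatory_status VALUES (?, ?, ?)", rows)

# Substances in both sets: permitted in the US and banned in the EU.
PERMITTED_US_BANNED_EU = """
    SELECT substance FROM regulatory_status
    WHERE jurisdiction = 'US' AND status = 'permitted'
    INTERSECT
    SELECT substance FROM regulatory_status
    WHERE jurisdiction = 'EU' AND status = 'banned'
    ORDER BY substance
"""

permitted_us_banned_eu = [r[0] for r in conn.execute(PERMITTED_US_BANNED_EU)]
print(permitted_us_banned_eu)
```

The sketch uses `INTERSECT` on two filtered sets rather than `EXCEPT`, so a substance with no EU record at all is not reported as banned.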
Step 4 — Correlating FDA adverse events
The FDA's CFSAN Adverse Event Reporting System (CAERS), exposed through the OpenFDA API, contains 148,000+ consumer reports of suspected reactions to food, dietary supplements and cosmetics. The free-text "suspect product" field is exactly that — free text. A single substance may appear as "MSG", "monosodium glutamate", "monosodium L-glutamate", "Accent seasoning", or "E621" depending on the reporter.
We resolve each report to the canonical substance using the same embedding index built in Step 1, plus a Claude Sonnet 4.6 pass that performs the actual disambiguation when the cosine score is ambiguous. Claude is specifically used because it returns a "no-match" answer cleanly when none of the candidate substances is plausible — other models we tested were more inclined to force a match. The resulting counts feed the adverse-event section on each additive's detail page.
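The hand-off between the embedding score and the model pass might look like the sketch below. It assumes the ambiguous band is the same 0.74 to 0.86 range used in Step 1, which the text does not state, and `ask_model` is a hypothetical callable standing in for the LLM call. The one design point worth copying is the last line: any answer that is not on the shortlist is treated as no match, so the model cannot introduce a substance of its own.

```python
from typing import Callable, Optional, Sequence

NO_MATCH = "no-match"  # sentinel the model is asked to return (assumed)


def resolve_report(
    suspect_text: str,
    candidates: Sequence[tuple[str, float]],
    ask_model: Callable[[str, Sequence[str]], str],
    auto_accept: float = 0.86,
    review_floor: float = 0.74,
) -> Optional[str]:
    """Map a free-text suspect product to a substance id, or None.

    candidates holds (substance_id, cosine_score) pairs, best first.
    ask_model is only called when the top score is ambiguous.
    """
    if not candidates:
        return None
    best_id, best_score = candidates[0]
    if best_score > auto_accept:
        return best_id
    if best_score < review_floor:
        return None
    shortlist = [cid for cid, score in candidates if score >= review_floor]
    answer = ask_model(suspect_text, shortlist)
    # Anything outside the shortlist, including NO_MATCH, means no match.
    return answer if answer in shortlist else None


# Stub in place of a real model call, for demonstration only.
def always_no_match(text, shortlist):
    return NO_MATCH


print(resolve_report("Accent seasoning", [("E621", 0.80)], always_no_match))
```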
Which AI models we actually use
We benchmarked four model families on the four tasks above using a held-out evaluation set of 400 ambiguous cases drawn from chemist-reviewed history. When picking which AI models to use, we relied on our founder's other project, Joute.io — a French AI tool comparator that tests every model with a paid account rather than a free trial. The Joute methodology is what we mirrored for our internal scoring: same model, same prompt, same task, repeated until variance was acceptable.
| Model | Where it wins | F1 on eval set |
|---|---|---|
| Claude Sonnet 4.6 (Anthropic) | Synonym disambiguation and JSON-schema enforcement. Best at returning 'no-match' cleanly when none of the candidates fit. | 0.94 |
| GPT-5.1 (OpenAI) | Adverse-event narrative summarization. Most consistent at preserving the reporter's qualifiers ('mild', 'within 30 min'). | 0.91 |
| Gemini 2.5 Pro | Long-context PDF extraction (EFSA dossiers up to 250 pages). Picks the correct ADI when several values coexist in one document. | 0.89 |
| Mistral Large 2 (Mistral AI) | Bulk synonym expansion. Cheapest acceptable option for generating candidate synonyms before embedding. | 0.83 |
We route each task to the model that scored highest on it, not a single default. The cost difference between Claude and Mistral on the bulk-expansion job is roughly 7×, which matters when the input is 4,000 substances refreshed monthly.
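Per-task routing amounts to a lookup table. A minimal sketch, with hypothetical task keys and the model names taken from the table above:

```python
# Task keys are illustrative; model names come from the benchmark table.
ROUTES = {
    "synonym_disambiguation": "Claude Sonnet 4.6",
    "adverse_event_summarization": "GPT-5.1",
    "pdf_extraction": "Gemini 2.5 Pro",
    "bulk_synonym_expansion": "Mistral Large 2",
}


def model_for(task: str) -> str:
    """Return the model assigned to a task, failing loudly if unknown."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no model benchmarked for task {task!r}") from None


print(model_for("pdf_extraction"))
```

Raising on an unknown task, rather than falling back to a default model, keeps an unbenchmarked task from being routed silently.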
A concrete example — E951 (aspartame) across three databases
Aspartame is a useful case because it is one of the most extensively documented additives in regulatory history, and because the three databases describe it differently. Here is what our pipeline produces after resolving the three sources to a single substance record:
| Source | Field | Value |
|---|---|---|
| FDA EAFUS (CAS 22839-47-0) | Recorded name | Aspartame |
| EFSA OpenFoodTox (E 951) | Recorded name | Aspartame (L-aspartyl-L-phenylalanine methyl ester) |
| OpenFDA CAERS (N/A) | Recorded as | Aspartame, NutraSweet, Equal, AminoSweet; 1,200+ event reports since 2004 |
| EFSA Scientific Opinion (DOI 10.2903/j.efsa.2013.3496) | ADI (2013) | 40 mg/kg body weight per day, reaffirmed by JECFA in 2023 |
| IARC Monograph (Vol. 134, 2023) | Classification | Group 2B (possibly carcinogenic to humans) |
The embedding pipeline collapses all five rows above into a single internal record with E-number E951 and CAS 22839-47-0. Every figure on the public aspartame additive page traces back to one of these primary sources by ID, not by an LLM-generated paraphrase.
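One way to picture such a record is a structure in which no value can exist without a source identifier attached. The class and attribute names below are hypothetical, and the sketch loads only two of the rows above; it illustrates the "traceable by ID" idea rather than the actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourcedValue:
    """A single published figure plus the primary source it came from."""
    field_name: str
    value: str
    source: str
    source_id: str  # e.g. a DOI, CAS number, or monograph volume


@dataclass
class SubstanceRecord:
    e_number: str
    cas: str
    facts: list[SourcedValue] = field(default_factory=list)

    def add(self, fact: SourcedValue) -> None:
        """Refuse any fact that does not carry a source identifier."""
        if not fact.source_id.strip():
            raise ValueError("every fact needs a primary-source identifier")
        self.facts.append(fact)

    def lookup(self, field_name: str) -> list[SourcedValue]:
        return [f for f in self.facts if f.field_name == field_name]


aspartame = SubstanceRecord(e_number="E951", cas="22839-47-0")
aspartame.add(SourcedValue(
    "ADI", "40 mg/kg body weight per day",
    "EFSA Scientific Opinion", "DOI 10.2903/j.efsa.2013.3496",
))
aspartame.add(SourcedValue(
    "Classification", "Group 2B",
    "IARC Monograph", "Vol. 134, 2023",
))
print(aspartame.lookup("ADI")[0].source_id)
```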
What we deliberately do not let AI do
The model is never asked to assign a safety rating, infer a missing ADI, or summarize the regulatory state of a substance in its own voice. Those are editorial decisions taken by a human against the primary source. The model's job is to make the primary source machine-readable; the human's job is to read it.
We also do not let the model invent synonyms. The synonym list for a substance is restricted to strings that appear at least once in a primary source we trust. This is the single most important guardrail against hallucinated identifiers, and it was a problem we observed across every model we tested before putting it in place.
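The synonym guardrail reduces to a filter: a candidate string survives only if it occurs in trusted source text. The sketch below assumes case-insensitive, whitespace-normalized matching on whole words, which is our choice and not something the article specifies; the function names and the sample source sentence are illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for comparison."""
    return " ".join(text.casefold().split())


def appears_in(needle: str, haystack: str) -> bool:
    """True if needle occurs in haystack as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return re.search(pattern, haystack) is not None


def filter_synonyms(candidates, source_texts):
    """Keep only candidates attested in at least one trusted source text."""
    haystacks = [normalize(t) for t in source_texts]
    kept = []
    for candidate in candidates:
        needle = normalize(candidate)
        if needle and any(appears_in(needle, h) for h in haystacks):
            kept.append(candidate)
    return kept


# Illustrative source sentence, not a quotation from a regulator.
sources = ["Monosodium glutamate (MSG), also listed as E621."]
proposed = ["MSG", "monosodium glutamate", "E62", "sodium glutamate crystals"]
print(filter_synonyms(proposed, sources))
```

Whole-word matching matters here: a plain substring test would accept "E62" because it sits inside "E621".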
Frequently asked questions
Why is it hard to match the same food additive across FDA and EFSA databases?
Because the two regulators use different naming conventions. FDA EAFUS records substances by their US chemical name (e.g., 'FD&C Red No. 40'). EFSA uses INS/E-numbers (e.g., 'E129') and the EU-preferred chemical name ('Allura Red AC'). CAS numbers help bridge them, but roughly 12% of EAFUS entries either lack a CAS number or share one with several functionally distinct substances. Synonym resolution is the bottleneck — and it's what we use AI to solve.
Which AI models do you use to reconcile additive data?
We benchmarked Claude Sonnet 4.6, GPT-5.1, Gemini 2.5 Pro, and Mistral Large 2 on a held-out set of 400 ambiguous additive pairs. Claude won on nomenclature reconciliation and structured JSON output; GPT was strongest on adverse-event narrative summarization; Gemini handled long PDF toxicological dossiers best; Mistral was the cheapest acceptable option for bulk synonym expansion. We route each task to the model that scored highest on it, not a single 'default' model.
How do you avoid AI hallucinations in regulatory data?
Every fact published on Additive Facts is anchored to a primary source URL (FDA EAFUS entry, EFSA Scientific Opinion DOI, or OpenFDA event ID). The LLM is only used to (1) normalize names, (2) extract structured fields from official PDFs, and (3) draft plain-language summaries that a human editor verifies against the cited document. We do not allow the model to fill in values that are missing from the source. If EFSA hasn't set an ADI, the field stays empty — the model is not asked to estimate one.
How does embedding-based search help for food additives?
Keyword search fails when a substance has 6+ synonyms across regulators ('Allura Red AC', 'FD&C Red No. 40', 'CI 16035', 'E129', 'Disodium 6-hydroxy-5...'). We embed every name, synonym and CAS variant into a 3,072-dimension vector space using text-embedding-3-large, then cluster vectors with cosine similarity above 0.86. This collapses what used to require manual chemist review into an overnight batch job. A flagged cluster still gets human sign-off before merge.
Disclaimer
This article describes data engineering methodology. It is not medical, dietary, or regulatory advice. Model performance figures reflect our internal evaluation on a specific task set and are not benchmarks of general model capability. Consult primary regulatory sources before making decisions based on additive data.
Sources
- US FDA. 'Substances Added to Food (formerly EAFUS).' fda.gov/food/food-additives-petitions
- European Food Safety Authority. 'OpenFoodTox database.' zenodo.org/communities/efsa-kj
- EFSA Panel on Food Additives and Nutrient Sources added to Food. 'Scientific Opinion on the re-evaluation of aspartame (E 951) as a food additive.' EFSA Journal, 2013;11(12):3496.
- OpenFDA. 'CFSAN Adverse Event Reporting System (CAERS).' open.fda.gov/apis/food/event/
- IARC Working Group. 'Aspartame, methyleugenol, and isoeugenol.' IARC Monographs Volume 134, 2023.
- JECFA. 'Safety evaluation of certain food additives — 96th meeting.' WHO Food Additives Series 84, 2023.