Food Technology Magazine | Digital Exclusive
Jason Cohen, founder and CEO of Simulacra
At the 2025 IFT FIRST Annual Event and Expo’s Solutions Showcase session “Accelerating Predictive Insight with Causal AI and Synthetic Data in CPG Research,” Jason Cohen, founder and CEO of Simulacra, discussed how synthetic data and new model architectures can streamline product testing and insight generation. Drawing on more than a decade of experience building AI for consumer-facing industries, he outlined why traditional data approaches often fall short and how causal AI might address those limitations.
Q: To begin, what do we mean by synthetic data, and why is it attracting interest in consumer research?
Cohen: Synthetic data is artificially created data designed to mimic the statistical patterns of real-world datasets. If it can be generated reliably for market research, it could help teams overcome several industry-wide challenges. Traditional studies are expensive and often suffer from limited sample sizes, and many do not return statistically valid findings. Synthetic data offers the possibility of filling in those gaps: enabling cohort analysis, strengthening cross-tabbed insights, or allowing teams to explore directional results with greater statistical power.
Q: A common question is whether synthetic data is “fake.” How do you address that?
Cohen: Synthetic data is not fake; it is only as real as its application. We trust generative AI tools to summarize documents, assist with writing, or generate code because we can verify the outputs. But consumer research is different. There is no predefined ground truth against which to check whether a model-generated preference score or demographic relationship is correct. That makes the quality and training process especially important.
Q: You argue that large language models (LLMs) are a poor fit for synthetic tabular data. Why?
Cohen: Most LLMs are trained on unstructured text scraped from the internet, where almost none of the content is linked to the demographic or behavioral attributes that matter in market research. These datasets introduce multiple forms of bias—demographic, temporal, cultural, ideological—and they heavily overrepresent complaints and negative sentiment. Social listening has already shown the limitations of relying on this type of data. In addition, companies rarely have the required volume of structured observations to fine-tune an LLM for this purpose. Even leading providers suggest that tens of thousands of structured observations are required.
Q: What alternative approach are you proposing?
Cohen: We’ve developed a model architecture we call causal AI. Unlike generative text models, causal AI is trained exclusively on the structured research dataset provided by the user. It learns the causal relationships within that dataset—across every cohort and variable—before generating new synthetic data. It does not rely on internet training data and is designed to avoid hallucination or confabulation. The method can expand an existing dataset while preserving its statistical properties and internal relationships.
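Simulacra’s causal AI architecture itself is proprietary and not described in the session, but the core claim above, that a model fitted only to a user’s structured dataset can generate new rows preserving that dataset’s statistical profile, can be illustrated with a deliberately simple sketch. Here a toy two-variable survey is summarized by its mean and covariance, and a larger synthetic sample is drawn from that fitted profile; all variable names and figures are illustrative assumptions, not Simulacra’s method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: two correlated survey measures from 860 respondents
n = 860
liking = rng.normal(7.0, 1.2, n)
sweetness = 0.6 * liking + rng.normal(0.0, 0.8, n)
real = np.column_stack([liking, sweetness])

# Fit: estimate the mean vector and covariance (the "statistical profile")
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate a larger synthetic sample that preserves those fitted moments
synthetic = rng.multivariate_normal(mu, cov, size=5000)

# The synthetic data should reproduce the original correlation structure
corr_real = np.corrcoef(real, rowvar=False)[0, 1]
corr_syn = np.corrcoef(synthetic, rowvar=False)[0, 1]
```

A real system would need to capture nonlinear and cross-cohort relationships, not just second moments, but the check at the end, comparing correlations between the real and synthetic samples, is the kind of internal-consistency validation the approach depends on.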
Q: You demonstrated a scenario modeling example using product testing data. How does that work?
Cohen: Scenario modeling lets teams ask what it would have taken to achieve a desired outcome. In the example shown, we worked with an 860-observation product development dataset containing sensory attributes, demographics, and product evaluations. Using causal AI, we generated a synthetic dataset of 5,000 observations that preserved the statistical profile of the original sample. Then we asked the model to generate a dataset in which everyone evaluated a specific prototype and rated it highly. This revealed the most likely demographic shifts and product attribute changes associated with that outcome. The same technique can be used to identify meaningful subpopulations, optimize flavor profiles, or explore targeted marketing scenarios.
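The scenario-modeling step, asking what the data would have looked like if everyone rated a prototype highly, can be sketched as conditioning a fitted joint model on the desired outcome and reading off how the other variables shift. The sketch below uses the textbook Gaussian conditional-mean formula on a hypothetical three-variable model (sweetness, age, overall liking); the variables and numbers are invented for illustration and are not from the demonstration described above.

```python
import numpy as np

# Hypothetical fitted joint model over [sweetness, age, overall_liking]
mu = np.array([5.0, 40.0, 6.5])
cov = np.array([
    [1.0,   0.0,  0.5],   # sweetness correlates positively with liking
    [0.0, 100.0, -2.0],   # older respondents rate this prototype lower
    [0.5,  -2.0,  1.5],
])

# Condition on overall_liking = 9 ("everyone rated it highly") using the
# Gaussian conditional mean: mu_a + C_ab C_bb^{-1} (y - mu_b)
idx_free, idx_cond = [0, 1], [2]
C_ab = cov[np.ix_(idx_free, idx_cond)]
C_bb = cov[np.ix_(idx_cond, idx_cond)]
target = np.array([9.0])
shift = C_ab @ np.linalg.solve(C_bb, target - mu[idx_cond])
conditional_mu = mu[idx_free] + shift
# conditional_mu shows the attribute and demographic shifts implied by
# the high-rating scenario: sweetness moves up, mean age moves down.
```

In this toy model, forcing a high rating pulls the expected sweetness above its unconditional mean and the expected respondent age below it, which is the shape of insight the scenario exercise is after: which attribute and demographic shifts accompany the desired outcome.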
Q: How might this affect ongoing research programs?
Cohen: Many companies use this approach to reduce the base size of future studies, revisit incomplete or inconclusive results, or strengthen directional findings. In one validation example, we reconstructed the results of a 2,782-person dataset using just 500 observations. That suggests researchers may be able to reduce data-collection costs while maintaining decision quality.
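The validation logic described, reconstructing a 2,782-person result from 500 observations, can be sketched in miniature: collect a small subsample, fit a distribution to it, generate a full-size synthetic dataset, and measure how far the reconstruction lands from the original. The single-variable normal model and numbers below are illustrative assumptions, not the actual validation study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "full" study: overall-liking scores from 2,782 respondents
full = rng.normal(6.8, 1.5, 2782)

# Collect only 500 observations, then expand synthetically
# from the profile fitted to that small sample
small = rng.choice(full, size=500, replace=False)
synthetic = rng.normal(small.mean(), small.std(ddof=1), size=2782)

# How close does the reconstructed mean land to the original?
error = abs(synthetic.mean() - full.mean())
```

Whether this kind of reduction preserves decision quality in practice depends on how well the fitted model captures the structure of the real data, which is why the validation-against-a-held-out-full-study framing matters.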
Q: What kinds of applications are companies exploring with causal AI?
Cohen: We see uses in product development, purchase intent modeling, sensory preference mapping, brand and advertising research, and pricing and promotion optimization. Synthetic data is not intended to replace research but to increase statistical power and support faster insight generation. The goal is to expand what teams can do with the data they already have and to make scenario planning more rigorous.
This article is based on a live session at IFT FIRST. Responses have been edited for length and clarity.