Revisiting Exploratory Data Analysis through the lens of language.
Data is simply the traces of an event that happened in the real world: a user logged in, a bus arrived at the station, a comet completed its orbit around the sun, a customer recommended a product. These traces call us to reason about the real world, but data cannot generate reasoning: it is the evidence for it. And reasoning generates the questions we ask of the data.
Human reasoning is qualitative in nature. Copernicus, for example, was motivated by the Platonic ideal of uniform circular motion; he felt the geocentric model was not abiding by this philosophical rule. By placing the Sun at the center, he could maintain the qualitative ‘perfection’ of circular motion more naturally. The supply and demand curve, Bohr’s model of the atom: these are simplified models designed to trigger a progression in understanding. They do not have to be perfectly accurate; they can even be wrong. But they call us to defy or refine them, as George Box famously noted: “All models are wrong, but some are useful”.
The democratization of LLMs is creating the ‘Data Sapiens’: the citizens of the AI era. Yet as more people enter the circle of data literacy, the paradigm remains the same: ask questions, chat with your data. But what if we could do more with this massive human hivemind internalized within the model? What if there was another way to represent data that could serve as a ‘useful model’?
Language doesn’t just carry meaning. It generates it.
Cognitive scientist Andy Clark argues that language is ‘supra-communicative’: it acts as a cognitive tool that structures thought and reasoning rather than merely expressing them. Think back to 1933, when physicist Fritz Zwicky coined the term ‘dark matter.’ He wasn’t just naming an anomaly; he was defining a conceptual space that challenged physicists to either prove or defy it. The value of ‘dark matter’ was defined entirely by its relationship to visible matter and gravity, not by its own inherent properties. Without that conceptual structure, Vera Rubin’s later findings might have been dismissed as instrument error rather than a confirmation of a systemic reality. This ‘structuralism’ is now a pillar of modern physics: the Higgs boson and black holes were essential components of a mathematical structure long before they were ever observed in the physical world.
Human reasoning is complex, but inherently relational; we don’t just use structures, we represent through them. These conceptual contrasts and grids are mirrored in the training corpus. Evidence suggests that LLMs haven’t just ‘learned’ our data; they may have internalized the structuralist nature of our sensemaking. Think about it: an LLM doesn’t have a ‘cat.’ It has never seen, touched, or heard one. It lacks a physical referent entirely. In its vector space, a ‘cat’ exists by its proximity to ‘feline’ and its distance from ‘refrigerator.’ This isn’t just a simulation of language; it is the structural representation that defines how we think. By employing these gaps and tensions, the LLM can generate new structures, moving beyond mere statistical mimicry and finding the ‘dark matter’ in our data.
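The proximity claim above can be made concrete with a toy sketch. The vectors below are hand-made three-dimensional stand-ins, not real model embeddings, but the mechanism is the same one embedding spaces use: a word’s ‘meaning’ is its angle relative to other words, measured by cosine similarity.

```python
import math

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model).
vectors = {
    "cat":          [0.90, 0.80, 0.00],
    "feline":       [0.95, 0.60, 0.00],
    "refrigerator": [0.00, 0.30, 0.90],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 for near-parallel vectors, near 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["cat"], vectors["feline"]))        # high: near neighbors
print(cosine(vectors["cat"], vectors["refrigerator"]))  # low: distant concepts
```

The point is not the numbers but the structure: ‘cat’ is defined entirely by where it sits relative to ‘feline’ and ‘refrigerator’, with no physical referent anywhere in the computation.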
The missing layer.
Tukey’s foundational work on Exploratory Data Analysis urges us to explore data ‘without attempting to apply any particular model’ before we question it explicitly. He declared EDA to be ‘detective work’, insisting the analyst discover clues rather than confirm predetermined conclusions. While Tukey understood the relational nature of data, his philosophy positioned representation as the final product of analysis, not as a means of generating questions.
The rise of neural networks reframed this dilemma. They learn hierarchical representations without the analyst’s hand. The signal gets distilled, but the analyst can no longer see how. LLMs take the final leap: they don’t just learn a representation; they inhabit the linguistic structure of human thought. They are the first ‘useful models’ that don’t merely answer our questions — they provide the conceptual scaffolding to generate them.
By co-creating with AI, we can build such scaffolding for any dataset before a single question is asked. However, a core risk persists: ‘imposing a grid where none exists.’ Because LLMs are ‘pattern-hungry,’ they are prone to finding structure in random noise. This transforms the LLM into a ‘Paradigm Generator’ only if the user remains the ‘Paradigm Critic.’
To test whether this produces anything beyond sophisticated-sounding noise, I ran the approach on the World Happiness Report 2025, a dataset already analyzed by a team of leading economists, and compared the conceptual scaffolding the model surfaced against what the professional analysis found.
What this looks like in practice
The World Happiness Report is an annual global study published by the University of Oxford’s Wellbeing Research Centre that ranks 150 countries using the Cantril Ladder (asking respondents to rate their current life from 0, worst possible, to 10, best possible) and explains the scores through six variables: GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption.
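As a minimal sketch of the kind of relationship the report quantifies, the snippet below computes a Pearson correlation between ladder scores and GDP on a handful of made-up rows. The numbers are placeholders for illustration, not actual WHR 2025 figures.

```python
import math

# Illustrative rows: (country, ladder score, log GDP per capita).
# Placeholder values, not actual WHR 2025 data.
sample = [
    ("A", 7.7, 10.8),
    ("B", 7.5, 10.9),
    ("C", 6.7, 11.0),
    ("D", 4.3, 8.1),
    ("E", 3.2, 7.4),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ladder = [row[1] for row in sample]
gdp = [row[2] for row in sample]
print(pearson(ladder, gdp))  # strongly positive on this toy sample
```

This is the “How much?” view the economists start from; the question the rest of the piece explores is what a conceptual layer adds on top of it.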
I ran the same six variables through The Data Board, an open source tool I built to prompt an LLM to generate a conceptual vocabulary for a dataset before a single question is asked. The following board was generated for WHR 2025:
[Image: The Data Board generated for the World Happiness Report 2025]

Instead of pulling details from the raw logs, I interviewed the LLM about its concepts, just as I would interview a data analyst about their insights:
Q: GDP is the strongest predictor of happiness, so why did you name this concept card “Resource Elasticity” and not simply “Income”?
A: High GDP only addresses Resource Elasticity (the capacity to absorb shocks). However, well-being requires dual-track stability. We observe a structural “Agency-Resource Gap” — where material wealth exists but is decoupled from Institutional Predictability.
The United States (#23 overall): High Resource Elasticity but declining Communal Insulation. The data shows that while income per capita is high, “Social Support” scores have stagnated, leading to a “hollow” form of wealth.
The Nordic Cluster (e.g., Finland #1, Denmark #2): These nations maintain a “Nordic Stability Cycle”. They treat wealth not as an end-goal, but as the fuel for Institutional Predictability. Their high scores are driven by the absolute certainty that the “floor” will not fall out, which reduces the psychological cost of risk-taking.
The Claim: Happiness is not the sum of resources, but the ratio of Resource Elasticity to structural anxiety.
Q: The report asks the following question: “In North America and Western Europe, young people are much less happy than 15 years ago. Over the same period, social media use has greatly increased. Many people blame social media for this fall in happiness, but does this hypothesis stand the test of rigorous scientific analysis?” Do you think you answered it, and how?
A: The data suggests that social media is not a direct “poison,” but a driver of Digital Fragmentation — the algorithmic erosion of Communal Insulation. The decline is most extreme where digital immersion has replaced physical social infrastructure. For example, the North American Divergence: for the first time, the United States dropped out of the Top 20 overall primarily because youth happiness (under 30) plummeted to #62 globally.
The Western European Decline: Similar generation gaps are widening in the UK, France, and Germany, where “Social Support” among young people is at its lowest recorded levels.
The Counter-Example: Lithuania (#1 for youth globally): While highly digitized, Lithuania shows rising youth happiness. Why? Because they are experiencing a surge in Resource Elasticity (post-Soviet economic growth) combined with strong Cultural Anchoring. Their digital life hasn’t yet fully fragmented their “Physical Cohesion” (festivals, family structures, national identity).
Israel (#2 for youth globally): Despite high-stress environments, Israel’s youth remain remarkably resilient. This is a classic case of Communal Insulation. High Cultural Anchoring acts as a “buffer” that prevents Digital Fragmentation from leading to total social atomization.
The Conclusion: The “Social Media Hypothesis” only holds true when the user lacks Communal Insulation. In the US/UK, the digital world is a substitute for community; in Lithuania/Israel, it is an extension of it.
Q: Finally, I shared the WHR 2025 executive summary with the LLM and asked it to compare its analysis to the expert team’s findings.
A: While economists focus on “How much?” (Statistical Volume), the Data Board audits “How it works?” (Structural Mechanism). The Report (Volume): Economists use GDP as a wealth metric. It tells you there is money, but not what that money does. The Board (Mechanism): We replaced “GDP” with Resource Elasticity. In high-happiness nations like Finland, wealth isn’t just “high”; it is elastic. It effortlessly transforms into a safety net. Example: In many “rich” nations, wealth is rigid — you have a high salary but a precarious life. In the WHR leaders, Institutional Predictability ensures that $1 of GDP buys significantly more “Peace of Mind” than it does in a fragmented economy. We don’t just count the coins; we audit the Utility of the Environment.
Final thoughts
The Data Board is not perfect, but it might be one example of a ‘useful model’ for representing data in meaningful language. Beyond prompting the LLM to find concepts, I also encouraged it to find tension. Contrast is considered one of the most fundamental concepts in storytelling, driving conflict, highlighting themes, and shaping impact. Conceptual representation of data can supercharge the narrative by creating these contrastive concepts, which may be an easier and safer entry point than finding an anomaly in a sea of numerical data.
I debated whether I should restrict the LLM to only the provided data. The economists who wrote the WHR brought decades of parallel research, prior frameworks, and conceptual knowledge to their analysis of the same six variables. Analysts often bring domain knowledge to the analysis: they interview stakeholders, educate themselves outside the dataset, and fold that knowledge into a living framework. Do we restrict our resources when we think? Does restricting knowledge make sense in the age of a model trained on all of it? The Data Board used this hivemind to construct the conceptual scaffolding for the Paradigm Critic to interrogate.
If we view LLMs as a shadow of human reasoning and language as its native analysis tool, LLMs are actually more prolific than humans at generating new structural concepts like “dark matter”. The model creates clusters in its multidimensional space that represent abstract ideas. Occasionally, it finds a structural relationship between words that humans haven’t named yet, essentially identifying a “conceptual space” that exists in our data but isn’t yet part of our conscious vocabulary. This is where “chat with your data” can turn analysis into a discovery.
References
[1] J.W. Tukey, Exploratory Data Analysis (1977), Addison-Wesley.
[2] G.E.P. Box, Science and Statistics (1976), Journal of the American Statistical Association, 71(356), 791–799.
[3] A. Clark, Magic Words: How Language Augments Human Computation (1997), In P. Carruthers & J. Boucher (Eds.), Language and Thought, Cambridge University Press.
[4] E. Vromen, Language Models as Semiotic Machines: Reconceptualizing AI Language Systems through Structuralist and Post-Structuralist Theories of Language (2024), arXiv:2410.13065 (preprint).
[5] J.F. Helliwell et al., World Happiness Report 2025 (2025), Wellbeing Research Centre, University of Oxford.
[6] R. Aharon, The Data Board (2025), GitHub — open source code and an open tool for conceptual data representation
Beyond Chatting with Your Data was originally published in Towards AI on Medium.