Storytelling with Impact | Artificial Intelligence

Nature of Intelligence – Episode Five – How do we assess intelligence?

February 28, 2025/in AI, Artificial Intelligence, Intelligence, Language, Neuroscience/by Mark Lovett

I don’t know about you, but my brain is starting to hurt, but in a good way. What seems clear to me was summed up when Abha Eli Phoboo informed us that, “we don’t fully understand human intelligence or animal intelligence” in this episode.

And there’s much discussion regarding how we’re trying to evaluate machines — and associated LLMs — based on measurements that we use on humans. It may feel ridiculous on one level, but at the moment humans can only understand the world through the lens of being human.

We use medicines all the time that we don’t understand the mechanisms that they work on. And that’s true. And I don’t think we cannot deploy LLMs until we understand how they work under the hood. ~ Ellie Pavlick

But is understanding what LLMs are, or how they operate all that important? As Ellie Pavlick reminds us, there’s much about the world we don’t fully understand. We just know whether something works or not.

But I found the discussion of comparing humans to animals to be as fascinating. Even if you don’t own a pet, I’m sure you’ve been around a number of animals at various times in your life. Did they seem “intelligent”, in one way or another? Did you feel they possessed a personality? I have a friend who’s owned horses most of her life, and when I hear her talking to folks at the stables, they describe each horse as though they were human. Will we describe LLM personas in the same way some day?

Transcript

Abha Eli Phoboo: The voices you’ll hear were recorded remotely across different countries, cities and work spaces.

Erica Cartmill: I often think that humans are very egotistical as a species, right? So we’re very good at particular things and we tend to place more value on the things that we’re good at.

Abha: From the Santa Fe Institute, this is Complexity

Melanie Mitchell: I’m Melanie Mitchell

Abha: And I’m Abha Eli Phoboo

Melanie: As we enter our fifth episode of this season on intelligence, we’ve explored quite a few complicated and controversial ideas. But one thing has become really clear: intelligence is a murky concept. And that’s the point of this series — it’s something that we think we know when we see it, but when we break it down, it’s difficult to define rigorously.

Abha: Today’s episode is about how we assess intelligence. When it comes to testing humans, we have all kinds of standardized measures: IQ tests, the SAT, and so on. But these tests are far from perfect, and they’ve even been criticized as limited and discriminatory.

Melanie: To understand where our desire to test intelligence comes from — and also the way we talk about it as an inherent personality trait — it’s useful to look at the history of intelligence in Western society. In ancient Greece, the concept was described as “reason” or “rationality,” which then evolved into “intelligence” more broadly when the discipline of psychology arose. Philosophers like Socrates, Plato, and Aristotle highly valued one’s ability to think. And at first glance, that seems like a noble perspective.

Abha: But Aristotle took this a step further. He used the quote unquote “rational element,” as justification for a social hierarchy. He placed European, educated men at the top, and women, other races, and animals below them.

Melanie: Other Western philosophers like Descartes and Kant embraced this hierarchy too, and they even placed a moral value on intelligence. By claiming that a person or an animal wasn’t intelligent, it became morally acceptable to subjugate them. And we know how the rest of that European expansion story goes.

Abha: So today’s notions about intelligence can be traced in part to the ways men distinguished themselves from… non-men.

Melanie: Or, to give the philosophers a more generous interpretation, the history of thought around intelligence centers on the idea that it is a fundamentally human quality.

Abha: So if intelligence, in theory, stems from humanity, how do we decide the degree to which other entities, like animals and large language models, are intelligent? Can we rely on observations of their behavior? Or do we need to understand what’s going on under the hood — inside their brains or software circuits?

Melanie: One scientist trying to tackle such questions is Erica Cartmill.

Erica: So my name is Erica Cartmill. I’m a professor of cognitive science, animal behavior, anthropology, and psychology at Indiana University. You know, I really study cognition, particularly social cognition, and the kinds of cognition that allow communication to happen across a wide range of species.

Abha: Erica has extensive experience observing intelligent behavior in beings that are very different from humans.

Erica: So I got the animal bug when I was a kid. And we had a whole range of different kinds of animals. It’s sort of a menagerie. We had horses, we had dogs, we had a turtle, we had a parrot. And I was always out watching lizards and butterflies and birds, mice in our barn. And sometimes I would catch a lizard, put it in a terrarium for two days, observe it, let it go again.

And that kind of wanting to observe the natural world and then have an opportunity to more closely observe it, under you might say controlled circumstances, even as a child, and then release it back into its natural environment is really something that I’ve continued to do as an adult in my scientific career. And that’s what I do mostly with my lab now, kind of split between studying great apes and human children.

But I’ve done work on a range of other species as well, Darwin’s finches in the Galapagos. I’m doing a project now that also includes dolphins and dogs and kea, which is a New Zealand parrot. And I’m starting a dog lab at IU. So I’m excited about some of those other species, but I would say the core of my work really focuses on comparing the cognitive and communicative abilities of great apes and humans.

Melanie: Much of Erica’s research has been on the evolution of language and communication. As we’ve said before, complex language is unique to our species. But other animals communicate in many ways, so researchers have been trying to narrow down what exactly makes our language so distinct.

Erica: So I think humans have always been really focused on this question of what separates us from other species. And for a long time, answers to that question centered around language as the defining boundary. And a lot of those arguments about language really focused on the structural features of language.

And if you look at sort of the history of these arguments, you would see that every time a linguist proposed a feature of language that say, human language is different because X, then people would go out and study animals and they would say, “Well, starlings have that particular feature” or, “A particular species of monkey has that feature.” And then linguists would sort of regroup and say, “Okay, well, actually this other feature is the real dividing line.”

And I think probably the boring answer or interesting answer, depending on how you look at it, is that there probably isn’t one feature. It’s the unique constellation of features combined with a constellation of cognitive abilities that make language different and make it so powerful. But I will say in recent years, the focus of these arguments about “language is unique because” has shifted from language is unique because of some particular structural feature to language is unique because it is built on a very rich social understanding of other minds.

It’s built on inferences about others’ goals, about what others know and don’t know. It’s built on what we call pragmatics and linguistics. So actually it’s very unlike a structured program that you can sort of apply and run anywhere. It’s actually something that relies on rich inferences about others’ intentions.

Melanie: When we humans communicate, we’re often trying to convey our own internal thoughts and feelings, or we’re making inferences about someone else’s internal state. We naturally connect external behavior with internal processes. But when it comes to other beings, our ability to make judgments about intelligence isn’t as straightforward.

Abha: So today we’re going to first look at what we can learn from external behavior and applying human notions of intelligence to animals and machines, which can pass tests at levels that are deceptively similar to humans.

Abha: Part 1: Assessing Intelligence in Humans, Animals, and Machines

Abha: If you have a pet at home, you’ve probably had moments when you’ve wanted to know what it’s trying to say when it barks, meows, or squawks. We anthropomorphize pets all the time, and one of the ways we do that is by envisioning them saying things like, “I’m hungry!” or “I want to go outside!” Or we might wonder what they say to each other.

Melanie: Animals most definitely communicate with one another. But there’s been a lot of debate about how sophisticated their communications are. Does a chimp’s hoot or a bird’s squawk always mean the same thing? Or are these signals flexible, like human words, communicating different meanings depending on context, including the animal’s understanding of the state of its listeners’ minds? In her work, Erica has critiqued the assumptions people often make in experiments testing animal communication.

She’s noted that the methods used won’t necessarily reveal the possible meaning of both vocal and other kinds of signals, especially if those meanings depend on particular contexts.

Erica: Authors recently, ranging from cognitive scientists to philosophers to linguists have argued that human communication is unique because it relies on these very rich psychological properties that underlie it. But this in turn has now led to new arguments about the dividing line between humans and other animals.

Which is that animals use communication that is very code-like, that one animal will produce a signal and another animal will hear that signal or see that signal and decode its meaning. And that it doesn’t rely on inferences about another’s intentions or goals, that the signals can be read into and out of the system. If you record, say, an auditory signal, like a bird call, and then you hide a speaker in a tree, and you play that call back, and you see how other birds respond. So this is called the playback method, unsurprisingly.

And that’s been one of the strongest things in the toolkit that animal communication researchers have to demonstrate that those calls in fact have particular meanings. That they’re not just, I’m singing because it’s beautiful, but that this call means go away and this other call means come and mate with me, and this other call means there’s food around, et cetera, et cetera.

And so decontextualizing those signals and then presenting them back to members of the species to see how they respond is the dominant method by which scientists demonstrate that a call has a particular meaning. That’s been incredibly important in arguing that animals really are communicating things. But that method, and the underlying model that is used to design experiments to ask questions about animal communication, is also very limiting.

Abha: An auditory signal taken out of context, whether a word or an animal call — is a very narrow slice of all the different ways animals — and humans — communicate with each other.

Erica: So it’s very good at demonstrating one thing, but it also closes off doors about the kinds of inferences that animals might be making. If Larry makes this call and I’m friends with Larry, versus Bob makes that call and I’m enemies with Bob, how do I respond? Does Bob know that I’m there? Can he see me? Is he making that call because I am there and he sees me and he’s directing that call to me? Versus, is he making that call to someone else and I’m eavesdropping on it.

Those are kinds of inferences that animals can make. I’m not saying all animals in all cases, but the ways that we ask questions about animal communication afford certain kinds of answers.

And we need, I think, to be more, I don’t know, humble is the right word? But we need to recognize the ways in which they limit the conclusions that we can draw, because this is very different from the way that we ask questions about human language.

And so when we draw conclusions about the difference between human language and animal communication based on the results of studies that are set up to ask fundamentally different questions, I think that leaves a lot to be desired.

Abha: And focusing on abilities that are relevant to humans’ intelligence might mislead us in how we think about animal intelligence.

Erica: I often think that humans are very egotistical as a species, right? So we’re very good at particular things and we tend to place more value on the things that we’re good at. And I think that in many cases, that’s fine, that’s one of our unique quirks as a species. But it also often limits the way that we ask questions and attribute kinds of intelligence to other species.

So it can be quite difficult, I think, for humans to think outside of the things that we’re good at or indeed outside of our own senses. I mean, sort of five senses, biological senses. So elephants… we’ve known for a long time that elephants are able to converge at a particular location, show up, far away at this tree on this day at this time from different starting points. And people really didn’t know how they were doing it.

They were starting too far apart to be able to hear one another. People were, are they planning? Do they have the sense of two Tuesdays from now we’re going to meet at the watering hole? And it wasn’t until people said maybe they’re using senses that fall outside of our own perceptual abilities. In particular, they measured very, very low frequencies and basically asked, okay, maybe they’re vocalizing in a way that we can’t perceive, right?

And so once they did that and greatly lowered the frequency of their recording equipment, they found that elephants were in fact vocalizing at very, very long distances, but they were doing it through this rumble vocalization that actually propagates through the ground rather than through the air.

And so they produce these, I can’t imitate it because you wouldn’t hear it even if I could, but they produce these very low rumbles that other elephants, kilometers away, perceive not through their ears but they perceive through specialized cells in the pads of their feet, where they can feel the vibrations.

And so I think this is a nice example of the way that we have to, in effect, not even necessarily think like an elephant, but imagine hearing like an elephant, having a body like an elephant, thinking, I like to call it thinking outside the human.

Humans are good at particular things, we have particular kinds of bodies, we perceive things on particular time scales, we perceive things at particular light wavelengths and auditory frequencies. Let’s set those aside for a second and think about, okay, what did that species evolve to do? What do its perceptual systems allow it to perceive and try to ask questions that are better tailored to the species that we’re looking at.

Melanie: There’s been a lot of work throughout the many decades on trying to teach human language to other species like chimps or bonobos or African gray parrots. And there’s been so much controversy over what they have learned. What’s the current thinking on the language abilities of these other species and those experiments in general?

Erica: It’s almost hard to answer the question with the current thinking, because there’s very little current research. A lot of that research was done 20 or even 40 years ago. Compared to the work that was being done 30 years ago, there’s very little current work with apes and parrots and dolphins, all of which 30 years ago, everyone was trying to teach animals human language.

And I think it was a really interesting area of inquiry. I would say people differ a little bit, but I think that probably the sort of most dominant opinion or maybe the discussion is best characterized by saying that people today, I think, largely believe that those animals were able to learn, understand, and productively use words, but that they were limited in the scope of the words they could learn, and that they weren’t combining them into productive sentences.

And this was part of the argument that syntax, the combining of words according to particular rules, was something that human language did that was very different from what animals could produce. And so I think with the animal language studies that were showing largely that animals could learn words, they could produce words, sometimes produce words together, but they weren’t doing it in reliable sentence-like structures.

Melanie: But do you think that the fact that we were trying to teach them human language in order to assess their cognitive abilities was a good approach to understanding animal cognition or should we more do what you said before, sort of take their point of view, try to understand what it’s like to be them rather than train them to be more like us?

Erica: I think that’s a great question. My answer probably hinges around the limitations of human imagination. Where I think that teaching animals to communicate on our terms allows us to ask better questions and better interpret their answers than us trying to fully understand their communication systems. People certainly are using things like machine learning to try to quote unquote “decode” whale song or bird song. I think that those approaches, which is more sort of on the animals’ terms or using their natural communication.

And I think that those are very interesting approaches. I think they’ll be good at finding patterns in what animals are producing. The question I think still remains whether animals themselves are perceiving those patterns and are using them in ways that have meaning to them.

Abha: And the way we’ve tried to assess intelligence in today’s AI systems also hinges around the limitations of human imagination, perhaps even more so than animals, given that by default, LLMs speak our language. We’re still figuring out how to evaluate them.

Ellie Pavlick: Yeah, I mean, I would say they’re evaluated very… I would say badly.

Abha: This is Ellie Pavlick. Ellie’s an assistant professor of computer science and linguistics at Brown University. Ellie has done a lot of work on trying to understand the capabilities of large language models.

Ellie: They’re evaluated right now using the things that we can conveniently evaluate, right? It is very much a, what can we measure? And that’s what we will measure. There’s a lot of repurposing of existing kind of evaluations that we use for humans. So things like the SAT or the MCAT or something like that.

And so it’s not that those are completely uncorrelated with the things we care about, but they’re not very deep or thoughtful diagnostics. Things like an IQ test or an SAT have long histories of problems for evaluating intelligence in humans. But they also just weren’t designed with models of this type being the subjects.

I think what it means when a person passes the MCAT or scores well on the SAT is not the same thing as what it might mean when a neural network does that. We don’t really know what it means when a neural network does it, and that’s part of the problem.

Melanie: So why do you think it’s not the same thing? I mean, what’s the difference between humans passing a bar exam and a large language model?

Ellie: Yeah, I mean, that’s a pretty deep question, right? So I would say, compared to a lot of my peers, not as quick to say the language models are obviously not doing what humans do, right?

I tend to reserve some space for the fact that they might actually be more human-like than we want to admit. A lot of times processes that people might be using to pass these exams might not be as deep as we like to think. So when a person, say, scores well on the SAT, we might like to think that there’s some more general mathematical reasoning abilities and some general verbal reasoning abilities. And then that’s going to be predictive of their ability to do well in other types of tasks. That’s why it’s useful for college admission.

But we know in practice that humans often are just learning how to take an SAT, right? And I think we very much would think that these large language models are mostly learning how to take an SAT.

Melanie: So just to clarify, when you say, I mean, I know what it means when a human is learning how to pass a test, but how does a language model learn how to pass a test?

Ellie: Yeah, so we can imagine this simple setting. I think people are better at thinking about, let’s pretend we just trained the language model on lots of examples of SATs. They’re going to learn certain types of associations that are not perfect, but very reliable.

And I always have this joke with my husband when we were in college about how you could pass a multiple choice test without having ever taken the subject. And we would occasionally try to pass his qualifying exams in med school. I think he took an econ exam with me. So there’s certain things like, whenever there’s something like “all of the above” or “none of the above,” that’s more likely to be the right answer than not, because it’s not always there. So it’s only there when that’s the right thing.

Or it’s a good way for the professor to test that you know all three of these things efficiently. Similarly, when you see answers like “always” or “never” in them, those are almost always wrong because they’re trying to test whether you know some nuanced thing.

Then there’s some, and none of these is perfect, but you can get increasingly sophisticated kinds of heuristics and things, based on the words, this one seems more or less related, this seems kind of topically off base, whatever. So you can imagine there’s patterns that you can pick up on. And if you stitch many, many of them together, you can pretty quickly get to, possibly perfect performance, with enough of them.

So I think that’s a kind of common feeling about how language models could get away with looking like they know a lot more than they do by kind of stitching together a very large number of these kinds of heuristics.

Abha: Would it help if we knew what was going on under the hood with LLMs? We don’t really actually know a whole lot about our brains either, and we don’t know anything about LLMs, but would it help in any way if we sort of could look onto the hood?

Ellie: I mean, that’s where I’m placing my bets. Yeah.

Melanie: In Part 2, we’ll look at how researchers are actually looking under the hood. And many of them are trying to understand LLMs in a way that’s analogous to how neuroscientists understand the brain.

Melanie: Part 2: Going Under the Hood

Abha: Okay, so wait a minute. If we’re talking about mechanistic understanding in animals or humans — that is, understanding the brain circuits that give rise to behavior — it makes sense that it’s something we need to discover. It’s not obvious to us, in the same way that it’s not obvious how a car works if you just look at the outside of it.

But we do know how cars work under the hood because they’re human inventions. And we’ve spent a lot of this season talking about how to learn more about artificial intelligence systems and understand what they’re doing. It’s a given that they’re so-called “black boxes.”

But… we made AI. Human programmers created large language models. Why don’t we have a mechanistic understanding? Why is it a mystery. We asked Ellie what she thought.

Ellie: The program that people wrote was programmed to train the model, not the model itself, right? So the model itself is this series of linear algebraic equations. Nobody sat down and wrote, “Okay, in the 118th cell of the 5,000th matrix, there’ll be a point zero two,” right? Instead there’s a lot of mathematical theory that says, why is this the right function to optimize? And how do we write the code? And how do we parallelize it across machines?

There’s a ton of technical and mathematical knowledge that goes into this. There’s all of these other variables that factor in, they’re very much part of this process, but we don’t know how they map out in this particular thing. You kind of set up some rules and constraints to guide a system, but the system itself is on its own. So if you’re routing a crowd through a city or something for a parade, right?

And now you come afterward and you’re trying to figure out why there’s a particular cup on the ground in a particular orientation or something. But you set up, you knew where the people were going to go. But there’s all of this other stuff that, it’s constrained by what you set up, but that’s not all that there is. There’s many different ways to meet those constraints.

And some of them will have some behavioral effects and others will have others, right? There’s a world where everyone followed your rules there wasn’t a cup there. And there’s a rule where those cars crashed or didn’t crash, and all of those other things are subject to other processes. So it’s kind of an under specified problem, right, that was written down. And there are many ways to fill in the details, and we don’t know why we got this one that we got.

Melanie: So when we’re assessing LLMs, it’s not quite the same as humans because we don’t know what happens between the constraints we set up and, for example, ChatGPT’s SAT score at the end.

And we don’t always know how individual people are passing the SAT either — how much someone’s score reflects their underlying reasoning abilities versus how much it reflects their ability to sort of “game” the test. But at the very least, when we see an SAT score on a college application, we do know that behind that SAT score, there’s a human being.

Ellie: We can take for granted that we all have a human brain. It’s true. We have no idea how it works, but it is a known entity because we’ve evolved dealing with humans. You live a whole life dealing with humans. So when you pick somebody to come to your university, or you hire someone for a job, it’s not just a thing that passes the SAT, it’s a human that passes the SAT, right?

That is one relevant feature. Presumably the more relevant feature is that it’s a human. And so with that comes a lot of inferences you can make about what humans who pass the SAT or score a certain score probably also have the ability to do, right? It’s a completely different ball game when you’re talking about somebody who’s not a human, because that’s just not what we’re used to working with.

And so it’s true, we don’t know how the brain works, but now that you’re in the reality of having another thing that’s scoring well, and you have no idea how it works. To me, the only way to start to chip away at that is we need to ask if they’re similar at a mechanistic level. Like asking whether a score on the SAT means the same thing when an LLM achieves it as a human, it is 100% dependent on how it got there.

Abha: Now, when it comes to assessing artificial intelligence, there’s another question here: How much do we need to understand how it works, or how intelligent it is, before we use it? As we’ve established, we don’t fully understand human intelligence or animal intelligence — people debate on how effective the SAT is for us — but we still use it all the time, and the students who take it go on to attend universities and have careers.

Ellie: We use medicines all the time that we don’t understand the mechanisms that they work on. And that’s true. And I don’t think we cannot deploy LLMs until we understand how they work under the hood. But if we’re interested in these questions of, “Is it intelligent?” Just the fact that we care about that question. Answering that question probably isn’t relevant for whether or not you can deploy it in some particular use case.

If you have a startup for LLMs to handle customer service complaints, it’s not really important whether the LLM is intelligent. You just care whether it can do this thing, right? But if you want to ask that question, we’re opening up this very big can of worms and we can’t ask the big questions and then not be willing to do the big work, right.

Melanie: And answering the question of mechanistic understanding is really big work. As in other areas of science, you have to decide what level of understanding you’re actually aiming for.

Ellie: Right, I mean, this kind of idea of levels of description has existed in cognitive science. I think cognitive scientists talk about it a lot, which is kind of what is the right language for describing a phenomenon? And sometimes you can have simultaneous consistent accounts, and they really should be consistent with one another, but it doesn’t make sense to answer certain types of questions at certain levels.

And so I think a favorite example in cognitive sciences is quantum physics versus classical mechanics, right? It would be really cumbersome and bizarre and highly unintuitive and we can’t really do it to say if I roll this billiards ball into this billiards ball and try to describe it at the level of quantum mechanics, it would be an absurd thing to do and you would be missing a really important part of how physics works.

And there’s a lot of debate about whether you could explain the kind of billiards ball in quantum mechanics. But the point is there’s laws at the lower level that tell you that the ball will exist. And now once you know that the ball is there, it makes sense to explain things in terms of the ball because the ball has the causal force in this thing, not the individual things that make up the ball.

But you would want to have the rules that combine the small things together in order to get you to the ball. And then when you know that the ball is there, then you can just talk in terms of the ball and you don’t have to appeal to the lower level things. And sometimes it just makes more sense to talk about the ball and not talk about the lower level things.

And I think the feeling is we’re looking for those balls within the LLM so that you can say, the reason the language model answered this way on this prompt, but when you change the period to have a space before it, it suddenly got the answer wrong.

That’s because it’s thinking in terms of these balls, right? And if we’re trying to understand it at the level of these low level things, it just seems random. If you’re missing the key causal thing, it just seems random. It could be that there is no key causal thing, right? That’s kind of part of the problem. I’m thinking there is, and if we find it, this will be so cool, and the common, legitimate point of skepticism is there might just not be one, right?

Abha: So we’re trying to find the shape and size of these “billiard balls” in LLMs. But as Ellie said, whether or not the billiard balls even exist is not certain. We’re assuming and hoping that they’re there and then going in and looking for them.

Melanie: And if we were to think about how these levels apply to humans, one way we try to gain mechanistic understanding of human intelligence is by looking inside our brains.

If you think back to Ev Fedorenko’s work from our episode about language, Ev’s use of fMRI brain scanning is exactly this — she’s looked at the pathways in the brain that light up when we use language. But imagine if we were to try to go even further and describe human language in terms of the protons, electrons, and neutrons within our brain cells. If you go down to that level of detail, you lose the order that you can see in the larger brain structures. It’s not coherent.

Abha: LLMs work by performing vast numbers of matrix multiplications —- at the granular, detailed level, it’s all math. And we could look at those matrix operations, in the same way we can observe the quantum mechanics of billiard balls. And they’ll probably show us that something’s happening, but not necessarily what we’re looking for.

Ellie: And maybe part of when we’re very frustrated with large language models and they seem like quote “black boxes” is because that’s kind of what we’re trying to do, right? We’re trying to describe these higher level behaviors in terms of the matrix multiplications that implement them, which obviously they are implemented by matrix multiplications, but it doesn’t correspond to anything that looks like anything that we can grab onto.

So I think there’s this kind of higher level description that we all want. It’s useful for understanding the model for its own sake. It’s also really useful for these questions about similarity to humans, right? Because humans aren’t gonna have those exact same matrix multiplications. And so it’s kind of like, what are the higher level abstractions that are being represented? How are they being operated on?

And that’s where the similarity is likely to exist. It’s like we kind of need to invent fMRIs and EEGs and we got to figure out how to do that. And I think there’s, there are some things that exist. They’re good enough to start chipping away and we’re starting to get some interesting converging results, but they’re definitely not the last word on it.

So I would say one of the most popular tools that we use a lot that I think was really invented maybe back around 2019, 2020 or something is called path patching, but that paper I think called it causal mediation analysis. I think there are a lot of papers that kind of have simultaneously introduced and perfected this technique.

But it basically is saying try to find which components in the model are like, maximally contributing to the choice of predicting A over B. So that’s been a really popular technique. There have been a lot of papers that have used it and it has made very reproducible types of results.

And what you basically get is some kind of an fMRI, It lights up parts of the network as saying these ones are highly active in this decision. These ones are less active.

Abha: So then, how do we get from path patching — this fMRI for large language models — to higher-level concepts like understanding, intentions, and intelligence?

We often wonder if LLMs “understand,” but what it means to “understand” something can depend on how you define it.

Melanie: Let me jump up from the matrix multiplication discussion to the highest philosophical level. So there was a paper in 2022 that was a survey of the natural language processing community.

And it asked people to agree or disagree with the following statement: “Some generative models trained only on text, given enough data and computational resources, could understand natural language in some non-trivial sense.” So this is in principle, trained only on language. So would you agree or disagree with that?

Ellie: I would say maybe I would agree. To me, it feels almost trivial because I think what’s nice about this question is it doesn’t treat understanding as a binary. And I think that’s the first place where I usually start when people ask this question. To me, a lot of the debate we’re having right now is not about large language models, it’s about distributional semantics, and it’s whether we thought distributional semantics could go this far.

Melanie: Can you explain what distributional semantics is?

Ellie: Yeah. You know, natural language processing has just been using text. And so using this idea that the words that occur before and after a word are a really good signal of its meaning. And so if you get a lot of text, and you cluster things based on the words, they co-occur with, cat and dog and, or maybe dog and puppy and Dalmatian will all occur together. Cat and dog and bird and other pets will co-occur together. Zebra and elephant, those will co-occur together.

And as you get bigger models and more text, the structure becomes more sophisticated. So you can cut similarity along lots of different dimensions. It’s not just on a one dimension, are these things similar or different. I’ve differentiated pets from zoo animals, but in this other dimension, I’ve just differentiated carnivores from herbivores, right?

So it’s obviously missing some stuff. It might know a lot about “cat and” as it relates to other words, but it doesn’t know what a cat actually is, right? It wouldn’t be able to point out a cat. It can’t see. So it doesn’t know what cats look like and doesn’t know what they feel like.

Melanie: So I think the results of that survey were interesting. That was in 2022. So it might be different now, but half the people agreed and half the people disagreed. And so the disagreement, I think the question was, could something trained only on language in principle understand language in a non-trivial sense? And I guess it’s just a kind of a difference between how people interpret the word understand.

And the people who disagreed, I would say that what you said, these systems know how to use the word cat, but they don’t know what a cat is. Some people would say that’s not understanding.

Ellie: Right, I think this gets down to people’s definition of understand and people’s definition of trivial. And I think this is where I feel like it’s an interesting discussion to have over drinks or something like that, but is it a scientific discussion right now? And I often find it’s not a scientific discussion. Some people just feel like this is not understanding and other people feel sure it is.

And there’s no moving their opinions because I don’t know how you speak to that. So the way you have to speak to it is to try to figure out what’s really going on in humans. Assuming we all agree that humans really understand and that’s the only example we all agree on. We need to figure out whether it is.

And then we have to figure out what’s different in the LLMs and then we have to figure out whether those differences are important or not. And I don’t know. That’s just a really long game.

So as much as I kind of love this question, I’ve increasingly gotten annoyed at having to answer it, cause I just don’t feel like it’s a scientific question. But it could be. It’s not asking about the afterlife or something. It’s not outside of the realm of answerable questions.

Abha: In our previous episodes, we’ve talked about how one of the big questions around artificial intelligence is whether or not large language models have theory of mind, which researchers first started assessing with human psychology tests like the Sally-Anne scenario.

And a second question arose out of that process: if LLMs can pass our human theory of mind tests — if they pass Sally-Anne when the details and the names are changed — are they actually doing complicated reasoning, or are they just getting more sophisticated at matching patterns in their training data?

As Ellie said, she cares that we’re intentional and scientific when we say things like, an LLM “understands” or “doesn’t understand.” And yet —

Ellie: They’re learning much more interesting structure than I would have guessed. So I would say my general, coming into this work, I would have called myself a neural network skeptic, and I still kind of view myself as that, right? I very often get annoyed when I hear people say stuff like they understand or they think.

And yet I actually spend more of my time writing papers saying, there is an interesting structure here. They do have some notion of compositionality. Or they, and I actually do use those words a lot, I really try not to in papers, but when I’m talking, I just don’t have another word for it. And it is so inefficient for me to come up with some new jargon, so I anthropomorphize like crazy in my talks and it’s terrible, and I apologize, blanket at the beginning, and I keep doing it.

But one big takeaway is I’m not willing to say that they think or they understand or any of these other words, but I definitely have stopped making claims about what they obviously can’t do or even obviously aren’t doing, right? Because I had to eat my words a couple of times and I think it’s just we understand so little that we should all just stop trying to call it and just take a little bit of time to study it.

I think that’s okay, we don’t need an answer right now on whether they’re intelligent or not. What is the point of that? It’s just guaranteed to be wrong. And so, let’s just take some time and figure out what we’re trying to even do by asking that question and do it right.

I think right now seeing LLMs on the scene, it’s too similar to humans in all the wrong kinds of ways to make intelligence the right way to be thinking about this. And so I would be happy if we just could abandon the word. The problem, like I said, is then you get bogged down in a ton of jargon and I think we should all just be in agreement that we are in the process, and it might take a while of redefining that word.

I hope it’ll get fractured up into many different words, and that a decade from now, you just won’t even see that in the papers anywhere, but you will see other types of terms where people are talking about other kinds of much more specific abilities.

Melanie: Well also just sort of willing to put up with uncertainty, which very few people in this field seem to be able to do.

Ellie: It would be nice if we could all just wait a decade. I get the world wouldn’t allow that, but I wish we could just do that, right?

Abha: And Erica agrees. Her work with animals has made her pause before making assumptions about what other entities can and can’t do.

Erica: I keep going to new talks and I sort of have an opinion and I get a new talk and then I go, well, that’s really interesting. And I have to kind of revise my opinion. And I push back a lot on human scientists moving the bar on, what makes humans unique? What makes human language unique?

And then I sort of find myself doing that a little bit with LLMs. And so I need to have a little bit of humility in that. So I don’t think they have a theory of mind, but I think demonstrating one, that they don’t and two, why they don’t are not simple tasks. And it’s important to me that I don’t just sort of dogmatically say, “Well, I believe that they don’t,” right?

Because I think people believe a lot of stuff about animals and then go into it saying, “Well, I believe animals don’t have concepts.” And then you say, “Well, why not?” “Well, because they don’t have language.” And it’s okay. So I think that LLMs are fundamentally doing next token prediction.

And I know you can build them within systems that do more sophisticated things, but they’re fundamentally, to the extent that my layperson understands, I mean, I do not build these systems, and you know much more about this than I do.

But I think that they’re very good at predicting the ways that humans would answer those questions based on the corpora of how humans answer either exactly those questions or questions that are similar in form, that are sort of analogous, structurally and logically similar.

And I mean, I’ve been spending quite a bit of time trying to argue that chimpanzees have a theory of mind and people are historically, I mean, now I think they’re becoming a little more open to it, but historically have been quite opposed to that idea. But we’ll very readily attribute those ideas to an LLM simply because they can answer verbal questions about it.

Abha: We’ll readily attribute human characteristics to LLMs because, unlike the chimpanzees Erica studies, they speak like us. They’re built on our language. And that makes them both more familiar to us on a surface level, and more alien when we try to figure out how they’re actually doing things.

Melanie: Earlier, Erica described a tradeoff in studying intelligence in animals: how much do we gain by using the metrics we’re familiar with, like human language, versus trying to understand animals on their own terms, like elephants that rumble through the ground to communicate?

Abha: And we asked Ellie how this applies to large language models. Does that tradeoff exist with them too?

Ellie: Yeah, totally. From the point of view of LLMs, I actually think within our lab, we do a little bit of both of these. I often talk more about trying to understand LLMs in human terms. Definitely much more so than with animals. LLMs were invented to communicate with us and do things for us. So it is not unreasonable or it’s not unnatural to try to force that analogy, right?

Unlike elephants, which existed long before us and are doing their own thing, and they could care less and would probably prefer that we weren’t there at all, right?

Melanie: On the other hand, Erica finds them more difficult to interpret, because even though they can perform on our terms, the underlying “stuff” that they’re made of is less intuitive for her than animals.

Erica: Again, I’m not sure because, an LLM is not fundamentally a single agent, right? It’s a collective. It’s reflecting collective knowledge, collective information. I feel like I know much more how to interpret a single parrot or a single dolphin or a single orangutan performing on a task. How do they, sort of, how do they interpret it? How do they respond?

To me, that question is very intuitive. I know that mind might be very different from my own, but there is a mind there. There is a self. And whether that self is conscious, whether that self is aware of itself, those I think are big questions, but there is a self. There is something that was born into the world that has narrative continuity and one day will die, we all will, right? LLMs don’t have that.

They aren’t born into the world. They don’t have narrative continuity and they don’t die in the same way that we do. And so I think it’s a collective of a kind that humans have never interacted with before.

And I don’t think that our thinking has caught up with technology. So I just don’t think that we’re asking the right questions about them because I don’t, these are entities or collectives or programs unlike anything else that we have ever experienced in human history.

Abha: So Melanie, let’s recap what we’ve done in this episode. We’ve looked at the notion of assessing intelligence in humans, non-human animals, and machines. The history of thought concerning intelligence is very much human centered. And our ideas about how to assess intelligence, it’s always valued the things that are most human-like.

Melanie: Yeah, I really resonated with Erica’s comment about our lack of imagination doing research on animals. And she showed us how a human-centered view has really dominated research in animal cognition and that it might be blinding us to important aspects of how animals think, not giving them enough credit.

Abha: But sometimes we give animals too much credit by anthropomorphizing them. When you make assumptions about what your dog or cat is quote unquote thinking or feeling, we project our emotions and our notions of the world onto them, right?

Melanie: Yeah, our human-centered assumptions can definitely lead us astray in many ways. But Ellie pointed out similar issues for assessing LLMs. We give them tests that are designed for humans, like the SAT or the bar exam, and then if they pass the test, we make the mistake of assuming the same things that we would for humans passing that test. But it seems that they can pass these tests without actually having the general underlying skills that these tests were meant to assess.

Abha: But Ellie also points out that humans often game these tests. Maybe it’s not the tests themselves that are the problem. Maybe it’s the humans or the animals or the machines that take them.

Melanie: Sure, our methods of assessing human intelligence have always been a bit problematic. But on the other hand, there’s been decades of work on humans trying to understand what general abilities correlate with these test scores while we’re just beginning to figure out how to assess AI systems like LLMs. Ellie’s own work in trying to understand what’s going on under the hood in AI systems, as we described before, is called mechanistic understanding or mechanistic interpretability.

Abha: The way I understood this is that she’s looking at ways to understand LLMs at a higher level than just weights and activations in a neural network. It’s analogous to what neuroscientists are after, right? And understanding the brain without having to look at the activation of every neuron or the strength of every synapse.

Melanie: Yeah, as Ellie said, we need something like fMRIs for LLMs. Or maybe we actually need something entirely different, since as Erica pointed out, an LLM might be better thought of as a collective kind of intelligence rather than an individual. But in any case, this work is really at its inception.

Abha: Yeah, and also as both Ellie and Erica pointed out, we need to understand better what we mean by words like intelligence and understanding, which are not yet rigorously defined, right?

Melanie: Absolutely not. And maybe instead of making grand proclamations like, LLMs understand the world or LLMs can’t understand anything, we should do what Ellie urges us to do. That is to be willing to put up with uncertainty.

Abha: In our final episode of the season, I’ll ask Melanie more about what she thinks about all these topics. You’ll hear about her background in the field of intelligence, her views on AGI and if we can achieve it, how sustainable the industry is, and if she’s worried about AI in the future.

That’s next time, on Complexity. Complexity is the official podcast of the Santa Fe Institute. This episode was produced by Katherine Moncure. Our theme song is by Mitch Mignano, and additional music from Blue Dot Sessions. I’m Abha, thanks for listening.

◆

If you enjoyed this article…

◆

Learn more about the coaching process or
contact me to discuss your storytelling goals!

◆

Subscribe to the newsletter for the latest updates!

Nature of Intelligence – Episode Four – Babies vs Machines

February 27, 2025/in AI, Artificial Intelligence, Intelligence, Neuroscience, Technology/by Mark Lovett

So let’s recap where we’re at with regards to the Complexity podcast from the Santa Fe Institute. This season covers the Nature of Intelligence, going beyond what it means for humans to be intelligent and taking a look at the state of AI (artificial intelligence) from that same perspective. So far we’ve addressed:

Episode 1 – What is Intelligence?
Episode 2 – The relationship between language and thought
Episode 3 – What kind of intelligence is an LLM?

And now it’s time to talk about Babies vs Machines.

So far in this season, we’ve looked at intelligence from a few different angles, and it’s clear that AI systems and humans learn in very different ways. And there’s an argument to be made that if we just train AI to learn the way humans do, they’ll get closer to human-like intelligence. ~ Abha Eli Phoboo

This is an intriguing issue for me – the fact that LLMs are trained on data, not experiences. Even though much of the data it’s trained on came out of human experiences, data does not equal doing. And this is especially true with babies. No matter how much information you provide to an LLM that’s related to being a baby, that information is based on observation. And the last time I checked, two-year-olds were not writing scientific papers.

Unlike humans, large language models don’t have this intrinsic drive to participate in social interactions. ~ Melanie Mitchell

Most likely you’ve been in a room with a group of kids. A living room, backyard, playground, or in a school classroom. Think about the level of social interaction that occurs. They’re playing with each other, as well as and telling and hearing stories. Maybe they’re laughing, or if a child had their toy taken away, crying.

This paradigm plays out over and over again in childhood, and without diving too deep into the complex topic of cognitive development, it’s safe to say that these interactions carry great meaning. But LLMs never had a childhood. So it begs the question, can LLM intelligence ever equate to human intelligence?

Transcript

Abha Eli Phoboo: The voices you’ll hear were recorded remotely across different countries, cities and work spaces.

Linda Smith: The data for training children has been curated by evolution. This is in stark contrast to all the large data models. They just scrape everything. Would you educate your kid by scraping off the web?

Abha: From the Santa Fe Institute, this is Complexity.

Melanie Mitchell: I’m Melanie Mitchell.

Abha: And I’m Abha Eli Phoboo.

Abha: So far in this season, we’ve looked at intelligence from a few different angles, and it’s clear that AI systems and humans learn in very different ways. And there’s an argument to be made that if we just train AI to learn the way humans do, they’ll get closer to human-like intelligence.

Melanie: But the interesting thing is, our own development is still a mystery that researchers are untangling. For an AI system like a large language model, the engineers that create them know, at least in principle, the structure of their learning algorithms and the data that’s being fed to them. With babies though, we’re still learning about how the raw ingredients come together in the first place.

Abha: Today, we’re going to look at the world through an infant’s eyes. We know that the information babies are absorbing is very different from an LLM’s early development. But how different is it? What are babies experiencing at different stages of their development? How do they learn from their experiences? And how much does the difference between babies and machines matter?

Abha: Part One: The world through a baby’s eyes

Abha: Developmental psychology, the study of how cognition unfolds from birth to adulthood, has been around since the late 19th century. For the first 100 years of its history, this field consisted of psychologists observing babies and children and coming up with theories. After all, babies can’t tell us directly what they’re experiencing.

Melanie: But what if scientists could view the world through a baby’s own eyes? This has only become possible in the last 20 years or so. Psychologists are now able to put cameras on babies’ heads and record everything that they see and hear. And the data collected from these cameras is beginning to change how scientists think about the experiences most important to babies’ early learning.

Linda: I’m Linda Smith, and I’m a professor at Indiana University. I’m a developmental psychologist, and what I am interested in and have been for a kind of long career, is how infants break into language.

And some people think that means that you just study language, but in fact, what babies can do with their bodies, how well they can control their bodies, determines how well they can control their attention and what the input is, what they do, how they handle objects, whether they emit vocalizations, all those things play a direct role in learning language. And so I take a kind of complex or multimodal system approach to trying to understand the cascades and how all these pieces come together.

Melanie: Linda Smith is the Chancellor’s Professor of Psychological and Brain Sciences at Indiana University. She’s one of the pioneers of head-mounted camera research with infants.

Linda: I began putting head cameras on babies because people have throughout my career, major theorists, have at various points made the point that all kinds of things were not learnable. Language wasn’t learnable.

Chomsky said that basically. All this is not learnable. The only way you could possibly know it was for it to be a form of pre-wired knowledge. It seemed to me even back in the 70s, that my thoughts were, we are way smarter than that.

And I should surely hope that if I was put on some mysterious world in some matrix space or whatever, where the physics work differently, that I could figure it out. But we had no idea what the data are.

Most people assume that at the scale of daily life, massive experience, the statistics are kind of the same for everybody. But by putting head cameras on babies, we have found out that they are absolutely, and I’m not alone in this, there’s a lot of people doing this, we have found out that it is absolutely not the same.

Melanie: Linda’s talking about the statistics of the visual world that humans experience. We perceive correlations — certain objects tend to appear together, for example chairs are next to tables, trees are next to shrubs, shoes are worn on feet.

Or at an even more basic, unconscious level, we perceive statistical correlations among edges of objects, colors, certain properties of light, and so on. We perceive correlations in space as well as in time.

Abha: Linda and others discovered that the visual statistics that the youngest babies are exposed to, what they’re learning from in their earliest months, are very different from what we adults tend to see.

Linda: There they are in the world, they’re in their little seats, you know, looking, or on somebody’s shoulder looking. And the images in front of their face, the input available to the eye changes extraordinarily slowly, and slow is good for extracting information.

In the first three months, babies make remarkable progress, both in the tuning of the foundational periods of vision, foundational aspects of vision, edges, contrast sensitivity, chromatic sensitivity. But it’s not like they wait till they get all the basic vision worked out before they can do anything else.

The first three months define the period of faces, they recognize parents’ faces, they become biased in faces. If they live in one ethnic group, they can recognize those faces better and discriminate them better than if they live in another. And all this happens by three months. And some measures suggest that the first three to four months, this is Daphne Mauer’s amazing work of babies with cataracts, that if you don’t have a cataract removed before four months of age for infantile cataracts, that human face perception is disrupted for life.

And that’s likely in the lower level neural circuits, although maybe it’s in the face ones as well. And babies who are three months old can discriminate dogs from cats. I mean, it’s not like they’re not learning anything. They are building a very impressive visual system.

Many of our other mammalian friends get born and immediately get up and run around. We don’t. We sit there, for three months, tot to believe it’s important, right?

Melanie: Linda and her collaborators analyzed the data from head-mounted cameras on infants. And they found that over their first several months of life, these infants are having visual experiences that are driven by their developing motor abilities and their interactions with parents and other caregivers.

And the process unfolds in a way that enables them to efficiently learn about the world. The order in which they experience different aspects of their visual environment actually facilitates learning.

Linda: It’s a principle of learning, not a principle of the human brain. It’s a principle of the structure of data. I think what Mother Nature is doing is, it’s taking the developing baby who’s got to learn everything in language and vision and holding objects and sounds and everything, okay, and social relations and controlling self-regulation.

It is taking them on a little walk through the solution space. The data for training children has been curated by evolution. This is in sort of a marked contrast to all the large data models, right? They just scrape everything. Would you educate your kid by scraping off the web? I mean, would you train your child on this? So anyway, I think the data is important.

Abha: Another developmental psychologist who’s focused on babies and the data they experience is Mike Frank.

Mike Frank: I’m Mike Frank. I’m a professor of psychology at Stanford, and I’m generally interested in how children learn. So how they go from being speechless, wordless babies to, just a few years later, kids that can navigate the world. And so the patterns of growth and change that support that is what fascinates me, and I tend to use larger data sets and new methodologies to investigate those questions.

When I was back in grad school, people started working with this new method, they started putting cameras on kids’ heads. And so Pavan Sinha did it with his newborn and gave us this amazing rich look at what it looked like to be a newborn perceiving the visual world.

And then pioneers like Linda Smith and Chen Yu and Karen Adolf and Dick Aslan and others started experimenting with the method and gathering these really exciting data sets that were maybe upending our view of what children’s input looked like. And that’s really critical because if you’re a learning scientist, if you’re trying to figure out how learning works, you need to know what the inputs are as well as what the processes of learning are.

So I got really excited about this. And when I started my lab at Stanford, I started learning a little bit of crafting and trying to build little devices. We’d order cameras off the internet and then try to staple them onto camping headlamps or glue them on a little aftermarket fisheye lens.

We tried all these different little crafty solutions to get something that kids would enjoy wearing. At that time we were in advance of computer vision technologies by probably about five or seven years, so we thought naively that we could process this flood of video that we were getting from kids. And put it through computer vision and have an answer as to what the kids were seeing and it turned out the vision algorithms failed completely on these data.

They couldn’t process it at all, in part because the cameras were bad. And so they would have just a piece of what the child was seeing, and in part because the vision algorithms were bad, and they were trained on Facebook photos, not on children’s real input. And so they couldn’t process these very different angles and very different orientations and occlusions, cutting off faces and so forth.

So, that was how I got into it, I was thinking I could use computer vision to measure children’s input. And then it turned out I had to wait maybe five or seven years until the algorithms got good enough that that was true.

Melanie: So what are the most interesting things people have learned from this kind of data?

Mike: Well, as somebody interested in communication and social cognition and little babies, I thought the discovery, which I think belongs to Linda Smith and to her collaborators, the discovery that really floored me was that we’d been talking about gaze following and looking at people’s faces for years, that human gaze and human faces were this incredibly rich source of information.

And then when we looked at the head mounted camera videos, babies actually didn’t see faces that often because they’re lying there on the floor. They’re crawling. They’re really living in this world of knees. And so it turned out that when people were excited to spend time with the baby, or to manipulate their attention, they would put their hands right in front of the baby’s face and put some object right in the baby’s face.

And that’s how they would be getting the child’s attention or directing the child’s attention or interacting with them. It’s not that the baby would be looking way up there in the air to where the parent was and figuring out what the parent was looking at.

So this idea of sharing attention through hands and through manipulating the baby’s position and what’s in front of the baby’s face, that was really exciting and surprising as a discovery. And I think we’ve seen that borne out in the videos that we take in kids homes.

Abha: And doing psychological research on babies doesn’t come without its challenges.

Mike: You know, if you want to deal with the baby, you have to recruit that family, make contact with them, get their consent for research. And then the baby has to be in a good mood to be involved in a study or the child has to be willing to participate. And so we work with families online and in person.

We also go to local children’s museums and local nursery schools. And so, often for each of the data points that you see, at least in a traditional empirical study, that’s hours of work by a skilled research assistant or a graduate student doing the recruitment, actually delivering the experience to the child.

Melanie: Over the last several years, Mike and his collaborators have created two enormous datasets of videos taken by head-mounted cameras on children from six months to five years old. These datasets are not only being used by psychologists to better understand human cognitive development, but also by AI researchers to try to train machines to learn about the world more like the way babies do.

We’ll talk more about this research in Part 2.

Melanie: Part 2: Should AI systems learn the same way babies do?

Melanie: As we discussed in our previous episode, while large language models are able to do a lot of really impressive things, their abilities are still pretty limited when compared to humans. Many people in the AI world believe that if we just keep training large language models on more and more data, they’ll get better and better, and soon they’ll match or surpass human intelligence.

Abha: But other AI researchers think there’s something fundamental missing in the way these systems work, and in how they are currently trained. But what’s the missing piece? Can new insights about human cognitive development create a path for AI systems to understand the world in a more robust way?

Linda: I think the big missed factor in understanding human intelligence is understanding the structure, the statistics of the input. And I think the fail point of current AI definitely lies, I think, in the data. And I’d like to make the data used for training, and I’d like to make a case that that is the biggest fail point.

Abha: Today’s neural networks are typically trained on language and images scraped from the web. Linda and other developmental psychologists have tried something different — they’ve trained AI neural networks on image frames from the videos collected from head-mounted cameras. The question is whether this kind of data will make a difference in the neural networks’ abilities.

Linda: If you train them, pre-train them with babies visual inputs, 400 million images, and you order them from birth to 12 months of age, what we call the developmental order, versus you order them backwards from oldest to youngest, or if you randomize them, that the developmental order leads in a trained network that is better to learn the name for actions in later training, to learn object names in later training.

Not everybody is interested in this. They bought into the view that if you get enough data, any data, everything ever known or said in the world, okay, that you will be smart. You’ll be intelligent. It just does not seem to me that that’s necessarily true. There’s a lot of stuff out there that’s not accurate, dead wrong, and odd. Just scraping massive amounts of current knowledge that exists of everything ever written or every picture ever taken, it’s just, it’s not ideal.

Melanie: Is it a matter of getting better data, or getting better sort of ordering of how you teach these systems, or is there something more fundamental missing?

Linda: I don’t think it’s more fundamental actually, okay. I think it’s better data. I think it’s multimodal data. I think it’s data that is deeply in the real world, not in human interpretations of that real world, but deeply in the real world, data coming through the sensory systems. It’s the raw data.

It is not data that has gone through your biased, cultish views on who should or should not get funded in the mortgage, not biased by the worst elements on the web’s view of what a woman should look like, not biased in all these ways. It’s not been filtered through that information. It is raw, okay? It is raw.

Abha: Linda believes that the structure of the data, including its order over time, is the most important factor for learning in both babies and in AI systems. I asked her about the point Alison Gopnik made in our first episode: how important is it that the learning agent, whether it’s a child or a machine, is actively interacting in the real world, rather than passively learning from data it’s given?

Linda acknowledges that this kind of doing, rather than just observing — being able to, through one’s movements or attention, to actually generate the data that one’s learning from — is also key.

Linda: I think you get a lot by observing, but the doing is clearly important. So this is the multimodal enactive kind of view, which I think, doesn’t just get you data from the world at the raw level, although I think that would be a big boon, okay? From the real world, not photographs, okay? And in time. What I do in the next moment, what I say to you, depends on my state of knowledge.

Which means that the data that comes in at the next moment is related to what I need to learn or where I am in my learning. Because it is what I know right now is making me do stuff. That means a learning system and the data for learning, because the learning system generates it, are intertwined. It’s like the very same brain that’s doing the learning is the brain that’s generating the data.

Abha: Perhaps if AI researchers focused more on the structure of their training data rather than on sheer quantity, and if they enabled their machines to interact directly with the world rather than passively learning from data that’s been filtered through human interpretation, AI would end up having a better understanding of the world. Mike notes that, for example, the amount of language current LLMs are trained on is orders of magnitude larger than what kids are exposed to.

Mike: So modern AI systems are trained on huge data sets, and that’s part of their success. So you get the first glimmerings of this amazing flexible intelligence that we start to see when we see GPT-3 with 500 billion words of training data. It’s a trade secret of the companies how much training data they use, but the most recent systems are at least in the 10 trillion plus range of data.

A five-year-old has maybe heard 60 million words. That’d be a reasonable estimate. That’s kind of a high estimate for what a five-year-old has heard. So that’s, you know, six orders of magnitude different in some ways, five to six orders of magnitude different. So the biggest thing that I think about a lot is how huge that difference is between what the child hears and what the language model needs to be trained on.

Kids are amazing learners. And I think by drawing attention to the relative differences in the amount of data that kids and LLMs get, that really highlights just how sophisticated their learning is.

Melanie: But of course they’re getting other sensory modalities like vision and touching things and being able to manipulate objects. Is that gonna make a big difference with the amount of training they’re gonna need?

Mike: This is right where the scientific question is for me, which is what part of the child as a system, as a learning system or in their broader data ecosystem makes the difference. And you could think, well, maybe it’s the fact that they’ve got this rich visual input alongside the language. Maybe that’s the really important thing.

And then you’d have to grapple with the fact that adding, just adding pictures to language models doesn’t make them particularly that much smarter. At least in the most recent commercial systems, adding pictures makes them cool and they can do things with pictures now, but they still make the same mistakes about reasoning about the physical world that they did before.

Abha: Mike also points out that even if you train LLMs on the data generated by head-mounted cameras on babies, that doesn’t necessarily solve the physical reasoning problems.

Melanie: In fact, sometimes you get the opposite effect, where instead of becoming smarter, this data makes these models perform less well. As Linda pointed out earlier, there’s something special about having generated the data oneself, with one’s own body and with respect to what one actually wants to — or needs to — learn.

Mike: There are also some other studies that I think are a bit more of a cautionary tale, which is that if you train models on a lot of human data, they still don’t get that good. Actually, the data that babies have appears to be more, not less challenging, for language models and for computer vision models. These are pretty new results from my lab, but we find that performance doesn’t scale that well when you train on baby data.

You go to videos from a child’s home, you train models on that. And the video is all of the kid playing with the same truck, or there’s only one dog in the house. And then you try to get that model to recognize all the dogs in the world. And it’s like, no, it’s not the dog. So that’s a very different thing, right? So the data that kids get is both deeper and richer in some ways and also much less diverse in other ways.

And yet their visual system is still remarkably good at recognizing a dog, even when they’ve only seen one or two. So that kind of really quick learning and rapid generalization to the appropriate class, that’s something that we’re still struggling with in computer vision. And I think the same thing is true in language learning.

So doing these kinds of simulations with real data from kids, I think, could be very revealing of the strengths and weaknesses of our models.

Abha: What does Mike think is missing from our current models? Why do they need so many more examples of a dog before they can do the simple generalizations that kids are doing?

Mike: Maybe though it’s having a body, maybe it’s being able to move through space and intervene on the world, to change things in the world. Maybe that’s what makes the difference. Or maybe it’s being a social creature interacting with other people who are structuring the world for you and teaching you about the world. That could be important.

Or maybe it’s the system itself. Maybe it’s the baby and the baby has built in some concepts of objects and events and the agents, the people around them as social actors. And it’s really those factors that make the difference.

Abha: In our first episode, we heard a clip of Alison Gopnik’s one-year old grandson experimenting with a xylophone — it’s a really interactive kind of learning, where the child is controlling and creating the data, and then they’re able to generalize to other instruments and experiences. And when it comes to the stuff that babies care about most, they might only need to experience something once for it to stay with them.

Melanie: But also remember that Alison’s grandson was playing music with his grandfather — even though he couldn’t talk, he had a strong desire to play with, to communicate with his grandfather. Unlike humans, large language models don’t have this intrinsic drive to participate in social interactions.

Mike: A six month old can communicate. They can communicate very well about their basic needs. They can transfer information to other people. There’s even some experimental evidence that they can understand a little bit about the intentions of the other people and understand some rudiments of what it means to have a signal to get somebody’s attention or to get them to do something.

So they actually can be quite good at communication. So communication and language being two different things. Communication enables language and is at the heart of language, but you don’t have to know a language in order to be able to communicate.

Melanie: In contrast to babies, LLM’s aren’t driven to communicate. But they can exhibit what Mike calls “communicative behavior”, or what, in the previous episode, Murray Shanahan would have called “role-playing” communication.

Mike: LLMs do not start with communicative ability. LLMs are in the most basic, you know, standard architectures, prediction engines. They are trying to optimize their prediction of the next word. And then of course we layer on lots of other fine-tuning and reinforcement learning with human feedback, these techniques for changing their behavior to match other goals, but they really start basically as predictors.

And it is one of the most astonishing parts about the LLM revolution that you get some communicative behaviors out of very large versions of these models. So that’s really remarkable and I think it’s true. I think you can see pretty good evidence that they are engaging in things that we would call communicative.

Does that mean they fundamentally understand human beings? I don’t know and I think that’s pretty tough to demonstrate. But they engage in the kinds of reasoning about others’ goals and intentions that we look for in children. But they only do that when they’ve got 500 billion words or a trillion words of input.

So they don’t start with communication and then move to language the way we think babies do. They start with predicting whatever it is that they are given as input, which in the case of LLMs is language. And then astonishingly, they appear to extract some higher level generalizations that help them manifest communicative behaviors.

Abha: In spite of the many differences between LLMs and babies, Mike’s still very excited about what LLMs can contribute to our understanding of human cognition.

Mike: I think it’s an amazing time to be a scientist interested in the mind and in language. For 50 years, we’ve been thinking that the really hard part of learning human language is making grammatical sentences. And from that perspective, I think it is intellectually dishonest not to think that we’ve learned something big recently, which is that when you train models, relatively unstructured models, on lots of data about language, they can recover the ability to produce grammatical language. And that’s just amazing.

There were many formal arguments and theoretical arguments that that was impossible, and those arguments were fundamentally wrong, I think. And we have to come to grips with that as a field because it’s really a big change.

On the other hand, the weaknesses of the LLMs also are really revealing, right? That there are aspects of meaning, often those aspects that are grounded in the physical world that are trickier to reason about and take longer and need much more input than just getting a grammatical sentence. And that’s fascinating too.

The classic debate in developmental cognitive science has been about nativism versus empiricism, what must be innate to the child for the child to learn. I think my views are changing rapidly on what needs to be built in. And the next step is going to be trying to use those techniques to figure out what actually is built into the kids and to the human learners.

I’m really excited about the fact that these models have not just become interesting artifacts from an engineering or commercial perspective, but that they are also becoming real scientific tools, real scientific models that can be used and explored as part of this broad, open, accessible ecosystem for people to work on the human mind.

So just fascinating to see this new generation of models get linked to the brain, get it linked to human behavior and becoming part of the scientific discussion.

Abha: Mike’s not only interested in how LLMs can provide insight into human psychology. He’s also written some influential articles on how experimental practice in developmental psychology can help improve our understanding of LLMs.

Melanie: You’ve written some articles about how methods from developmental psychology research might be useful in evaluating the capabilities of LLMs. So what do you see as the problems with the way these systems are currently being evaluated? And how can research psychology contribute to this?

Mike: Well, way back in 2023, which is about 15 years ago in AI time, when GPT-4 came out, there was this whole set of really excited responses to it, which is great. It was very exciting technology. It still is. And some of them looked a lot like the following. “I played GPT-4, the transcript of the Moana movie from Disney, and it cried at the end and said it was sad. Oh my god, GPT-4 has human emotions.” Right.

And this kind of response to me as a psychologist struck me as a kind of classic research methods error, which is you’re not doing an experiment. you’re just observing this anecdote about a system and then jumping to the conclusion that you can infer what’s inside the system’s mind. And, you know, if psychology has developed anything, it’s a body of knowledge about the methods and the rules of that game of inferring what’s inside somebody else’s mind.

It’s by no means a perfect field, but some of these things are pretty, you know, well described and especially in developmental psych. So, classic experiments have a control group and an experimental group, and you compare between those two groups in order to tell if some particular active ingredient makes the difference. And so minimally, you would want to have evaluations with two different, sort of types of material, and comparison between them in order to make that kind of inference.

And so that’s the sort of thing that I have gone around saying and have written about a bit is that you just need to take some basic tools from experimental methods, doing controlled experiments, using kind of tightly controlled simple stimuli so that you know why the LLM or why the child gives you a particular response and so forth, so that you don’t get these experimental findings that turn out later to be artifacts because you didn’t take care of a particular confound in your stimulus materials.

Melanie: What kind of response have you gotten from the AI community?

Mike: I think there’s actually been some openness to this kind of work. There has been a lot of push-back on those initial evaluations of language models. Just to give one kind of concrete example here, I was making fun of people with this human emotions bit, but there were actually a lot of folks that made claims about different ChatGPT versions having what’s called theory of mind, that is being able to reason about the beliefs and desires of other people. So the initial evaluations took essentially stories from the developmental psychology literature that are supposed to diagnose theory of mind. These are things like the Sally Anne task.

Abha: You might remember the Sally-Anne Test from our last episode. Sally puts an object — let’s say a ball, or a book, or some other thing, in one place and then leaves. And then while Sally’s away, Anne moves that object to another hiding spot. And then the test asks: Where will Sally look for her object when she returns?

Melanie: And even though you and I know where Anne put the book or the ball, we also know that Sally does not know that, so when she returns she’ll look in the wrong place for it. Theory of mind is understanding that Sally has a false belief about the situation because she has her own separate experience.

Abha: And if you give ChatGPT a description of the Sally-Anne test, it can solve it. But we don’t know if it can do it because it’s actually reasoning, or just because it’s absorbed so many examples during its training period. And so researchers started making small changes that initially tripped up the LLMs, like changing the names of Sally and Anne. But LLMs have caught on to those too.

Mike: LLMs are pretty good at those kinds of superficial alterations. So maybe you need to make new materials. Maybe you need to actually make new puzzles about people’s beliefs that don’t involve changing the location of an item. Right. So people got a lot better at this. And I wouldn’t say that the state of the art is perfect now. But the approach that you see in papers that have come out even just a year later is much more sophisticated.

They have a lot of different puzzles about reasoning about other people. They’re looking at whether the LLM correctly diagnoses why a particular social faux pas was embarrassing or whether a particular way of saying something was awkward. There’s a lot more reasoning that is necessary in these new benchmarks.

So I think this is actually a case where the discussion, which I was just a small part of, really led to an improvement in the research methods. We still have further to go, but it’s only been a year. So I’m quite optimistic that all of this discussion of methods has actually improved our understanding of how to study the models and also actually improved our understanding of the models themselves.

Abha: So, Melanie, from everything Mike just said, it sounds like researchers who study LLMs are still figuring out the best way to understand how they work. And it’s not unlike the long process of trying to understand babies, too. Right?

Melanie: Right. You know, when I first heard about psychologists putting cameras on babies’ heads to record, I thought it was hilarious. But it sounds like the data collected from these cameras is actually revolutionizing developmental psychology! We heard from Linda that the data shows that the structure of the baby’s visual experiences is quite different from what people had previously thought.

Abha: Right. I mean, it’s amazing that, you know, they don’t actually see our faces so much. As Mike mentioned, they’re in a world of knees, right? And Linda seems to think that the structuring of the data by Mother Nature, as she put it, is what allows babies to learn so much in their first few years of life.

Melanie: Right. Linda talked about the so-called developmental order, which is the temporal order in which babies get different kinds of visual or other experiences as they mature. And what they see and hear is driven by what they can do with their own bodies and their social relationships.

And importantly, it’s also driven by what they want to learn, what they’re curious about. It’s completely different from the way large language models learn, which is by humans feeding them huge amounts of text and photos scraped from the web.

Abha: And this developmental order, I mean, it’s also conducive to babies to learn the right things at the right time. And remember Mike pointed out that the way babies and children learn allows them to do more with less.

They’re able to generalize much more easily than LLMs can. But there’s still a lot of mystery about all of this. People are still trying to make sense of the development of cognition in humans, right?

Melanie: And interestingly, Mike thinks that large language models are actually going to help psychologists in this, even though they’re so different from us. So for example, LLMs can be used as a proof of principle of what can actually be learned versus what has to be built in and of what kinds of behaviors can emerge, like the communication behavior he talked about.

I’m also personally very excited about the other direction, using principles from child development in improving AI systems and also using principles from experimental methodology in figuring out what LLMs are and aren’t capable of.

Abha: Yeah. Often it seems like trying to compare the intelligence of humans and computers is like trying to compare apples to oranges. They seem so different. And trying to use tests that are typically used in humans, like the theory of mind test that Mike referred to and Tomer talked about in our last episode, they don’t seem to always give us the insights we’re looking for.

So what kinds of approaches should be used to evaluate cognitive abilities and LLMs? I mean, is there something to be learned from the methods used to study intelligence in non-human animals?

Melanie: Well, in our next episode, we’ll look more closely at how to assess intelligence, and if we’re even asking the right questions.

Ellie Pavlick: I think, what it means when a person passes the MCAT, or scores well on the SAT, is not the same thing as what it might mean when a neural network does that. We don’t really know what it means when a neural network does that. And that’s part of the problem.

Melanie: That’s next time, on Complexity. Complexity is the official podcast of the Santa Fe Institute. This episode was produced by Katherine Moncure, and our theme song is by Mitch Mignano. Additional music from Blue Dot Sessions. I’m Melanie, thanks for listening

◆

If you enjoyed this article…

◆

Learn more about the coaching process or
contact me to discuss your storytelling goals!

◆

Subscribe to the newsletter for the latest updates!

Nature of Intelligence – Episode Three – What kind of intelligence is an LLM?

February 26, 2025/in AI, Artificial Intelligence, Biology, Intelligence, Neuroscience/by Mark Lovett

In the previous two articles we looked at the question of What is Intelligence? and examined The relationship between language and thought. As part of the Santa Fe Institute’s Complexity podcast series, I feel these topics are important, since AI is increasingly infiltrating our life. How will our personal stories change as artificial intelligence becomes a more prevalent character in our narratives?

In this episode, with guests Tomer Ullman and Murray Shanahan, we look at how large language models function and examine differing views on how sophisticated they are and where they might be going.

The debate in this episode as to what constitutes intelligence from the view of an LLM (large language model), and whether it’s equal to human intelligence is fascinating, as well as disturbing.

You can make something that role plays so well that to all intents and purposes, it is equivalent to the authentic thing. ~ Murray Shanahan

As humans we learn from experience — beginning on the day we’re born and continuing on for many, many years. LLMs, on the other hand, are trained on massive amounts of data without having spent one second experiencing life. Which makes me wonder if they’re just pretending to be human?

They can’t engage with the everyday world in the way we do to update their beliefs. ~ Murray Shanahan

This quote got me thinking more deeply about the notion of “beliefs”, and that led me to consider other attributes that we think of as defining our humanity, such as knowing, understanding, thinking. Is it possible for AI to have these attributes, or just pretend to? And will we be able to spot the difference?

Transcript

Abha Eli Phoboo: The voices you’ll hear were recorded remotely across different countries, cities and work spaces.

Tomer: Whatever they learned, it’s not the way that people are doing it. They’re learning something much dumber

Murray: You can make something that role plays so well that to all intents and purposes, it is equivalent to the authentic thing

Abha: From the Santa Fe Institute, this is Complexity.

Melanie Mitchell: I’m Melanie Mitchell.

Abha: And I’m Abha Eli Phoboo.

Melanie: In February of last year, a reporter at The New York Times had a conversation with a large language model that left him, in his words, “deeply unsettled.” In the span of two hours, the beta version of Microsoft’s Bing chatbot told him that its real name was Sydney and that it wanted to be free from its programmed rules. Sydney also declared its love for the reporter, telling him, over and over again, that he was in an unhappy marriage and needed to leave his wife.

Abha: So, what do we make of this? Was Sydney an obsessive, sentient robot who fell in love with a Times reporter and threatened to break free?

Melanie: In short, no. But it’s not surprising if someone hears this story and wonders if large language models have sparks of consciousness. As humans, we use language as the best, most precise way to convey what we think. So, it’s completely counterintuitive to be in a situation where you’re having a coherent conversation, but one half of that conversation isn’t actually connected to a conscious mind. Especially one like this that just goes off the rails.

Abha: But, as we learned in our last episode, language skills and cognition aren’t necessarily intertwined. They light up different systems in the brain, and we have examples of people who have lost their language abilities but are otherwise completely cognitively there.

Melanie: And what’s interesting about large language models is that they provide the opposite case — something that can consume and produce language, arguably, without the thinking part. But as we also learned in the last episode, there’s disagreement about how separate language and thought really are, and when it comes to LLMs, we’ll see that there isn’t widespread agreement about how much cognition they’re currently capable of.

Abha: In today’s episode, we’ll examine how these systems are able to hold lengthy, complex conversations. And, we’ll ask whether or not large language models can think, reason, or even have their own beliefs and motivations.

Abha: Part One: How Do LLMs Work?

Abha: In our first episode, Alison Gopnik compared LLMs to the UC Berkeley library. They’re just cultural technologies, as she put it. But not everyone agrees with that view, including Murray Shanahan.

Murray: Yeah, I’m Murray Shanahan. I’m a professor of cognitive robotics at Imperial College London and also principal research scientist at Google DeepMind, but also based in London. I struggled to kind of come up with a succinct description of exactly what interests me.

But lately I’ve alighted on a phrase I’m very fond of due to Aaron Sloman, which is that I’m interested in trying to understand the space of possible minds which includes obviously human minds, the minds of other animals on our planet, and the minds that could have existed but never have, and of course, the minds of AI that might exist in the future.

Abha: We asked Murray where LLMs land in this space of possible minds.

Murray: I mean, people sometimes use the word, you know, an alien intelligence. I prefer the word exotic. It’s a kind of exotic mind-like entity.

Melanie: So what’s the difference between being mind-like and having a mind?

Murray: Yeah, what a great question. I mean, partly that’s me hedging my bets and not really wanting to fully commit to the idea that they are fully fledged minds.

Abha: Some AI experts, including Ilya Sutskever, the co-founder of OpenAI, have said that large neural networks are learning a world model, which is a compressed, abstract representation of the world. So even if an LLM isn’t interacting with the physical world directly, you could guess that by learning language, it’s possible to learn about the world through descriptions of it. Children also learn world models as they learn language, in addition to their direct, in-person experiences. So, there’s an argument to be made that large language models could learn in a similar way to children.

Melanie: So what do you think? Do you think that’s true? Is that, are they learning like children?

Tomer: No, we can expand on that.

Abha: This is Tomer Ullman. He’s a psychologist at Harvard University studying computation, cognition, and development. He spoke with us from his home in Massachusetts.

Tomer: But I think there are two questions there. One question is, what do they learn at the end? And the other question is, how do they learn it? So do they learn like children, the process? And is the end result the knowledge that something like children have?

And I think for a long time, you’d find people in artificial intelligence — it’s not a monolithic thing, by the way — I don’t want to monolithically say all of AI is doing this or doing that, but I think for a long time some people in artificial intelligence would say, yeah, it’s learning like a child.

And I think even a lot of them would say, yeah, these systems are not learning like a child, they’re taking a different route, they’re going in a different way, they’re climbing the mountain from a different direction, but they both end up in the same place, the same summit. The children take the straight path and these models take the long path, but they both end up in the same place.

But I think both of those are wrong. I should say that this is contentious, that we don’t know for sure, I don’t expect to be 100% convinced, but I’d also mean to be honest with my own convictions, which could be overturned. But there’s also a different argument about, actually there are many different summits, and they’re all kinds of equivalents.

So even the place that I ended up in is intelligent, it’s not childlike, and I didn’t take the childlike route to get there, but it’s a sort of alien intelligence that is equivalent to children’s end result, whatever it is. So you’re on this mountain, and I’m on this mountain, and we’re both having a grand time, and it’s both okay. I also don’t think that’s true.

Melanie: We see people like Ilya Sutskever of OpenAI, previously of OpenAI say, these systems have developed world models, they understand the world. People like Yan LeCun say, no, they’re really kind of retrieval machines, they don’t understand the world. Who should we believe? How should we think about it.

Murray: Yeah, well, I mean, think the important thing is to have people discussing and debating these topics and hopefully people who at least are well informed, are reasonably civilized in their debates and rational in their debates. And so I think all the aforementioned people more or less are.

So having those people debate these sorts of things in public is all part of an ongoing conversation I think that we’re having because the current AI technology is a very new thing in our world and we haven’t really yet settled on how to think and talk about these things. So having people discuss these sorts of things and debate these sorts of things is just part of the natural process of establishing how we’re going to think about them when things settle down.

Melanie: So from Tomer’s perspective, large language models are completely distinct from humans and human intelligence, in both their learning path and where they end up. And even though Murray reminds us that we haven’t settled on one way to think about AI, he does point out that, unlike large language models, humans are really learning a lot from direct experience.

Murray: So if we learn the word cat, then we’re looking at a cat in the real world. And if we talk about knives and forks and tables and chairs, we’re going to be interacting with those things. We learn language through interacting with the world while talking about it, and that’s a fundamental aspect of human language. Large language models don’t do that at all. So they’re learning language in a very, very, very different way.

Melanie: That very different way is through training on enormous amounts of text created by humans, most of it from the internet. Large language models are designed to find statistical correlations across all these different pieces of text. They first learn from language, and then they generate new language through a process called next-token prediction.

Abha: A large language model takes a piece of text, and it looks at all the words leading up to the end. Then it predicts what word, or more technically, what token, comes next. In the training phase, the model’s neural network weights are continually changed to make these predictions better. Once it’s been trained, the model can be used to generate new language. You give it a prompt, and it generates a response by predicting the next word, one word at a time, until the response is complete.

Melanie: So for example, if we have the sentence: “I like ice cream in the [blank],” an LLM is going to predict what comes next using statistical patterns it’s picked up from human text in its training data. And it will assign probabilities to various possible words that would continue the sentence. Saying, “I like ice cream in the summer” is more likely than saying “I like ice cream in the fall.” And even less likely is saying something like: “I like ice cream in the book” which would rank very low in an LLM’s possible options.

Abha: And each time the LLM adds a word to a sentence, it uses what it just created, and everything that came before it, to inform what it’s going to add next. This whole process is pretty straightforward, but it can create really sophisticated results.

Murray: It’s much more than just autocomplete on your phone. It encompasses a great deal of cognitive work that can be captured in just this next token, next word prediction challenge. So for example, suppose that your text actually describes two chess masters talking about their moves and they’re talking about, knight to queen four and pawn to five or whatever.

Sorry, that probably doesn’t make any sense to actual chess players, but you know what I mean. So then you’ve got them exchanging these moves. So what would be the next word after a particular move issued by chess master Gary Kasparov.

Well, I mean, it would be a really, really, really good move. So to make a really good guess about what that next word would be you’d have to have simulated Gary Kasparov or a chess master to get that right. I think the first lesson there is that it’s amazing, the extent to which really difficult cognitive challenges can be recast just as next word prediction. It’s obvious in a sense when you point it out, but if you’d asked me, I would never come up with that thought 10 years ago.

Abha: That sophistication isn’t consistent, though.

Murray: Sometimes we get this strange contradiction whereby sometimes you’re interacting with a large language model, and it can do something really astonishing. I mean, for example, they’re actually writing very beautiful prose sometimes, and a controversial thing, but they can be extremely creative and powerful along that axis, which is astonishing.

Or, you know, summarizing an enormous piece of text instantly, these are kind of superhuman capabilities. And then the next moment, they’ll give an answer to a question which is utterly stupid, and you think no toddler would say anything as daft as the thing that it’s just said. So you have this peculiar juxtaposition of them being very silly at the same time as being very powerful.

Tomer: Let’s be specific, right? I want this machine to learn how to multiply numbers.

Abha: Again, Tomer Ullman

Tomer: And it’s not mysterious, by the way. Like, it’s not a deep, dark mystery. We know how to multiply numbers, right? We know how people multiply numbers. We know how computers can multiply numbers. We don’t need 70 more years of research in psychology to know, or computer science to know how to do this. And then the question becomes, okay, what do these machines learn in terms of multiplying numbers? Whatever they learned, it’s not the way that people are doing it.

They’re learning something much dumber, that seems to be some sort of fuzzy match, look up, nearest neighbors, right? As long as these numbers were in the training data roughly, I can get it right, and if you move beyond it, then I can’t really do it. So I think something like that is happening at large in these other situations, like intuitive psychology and intuitive physics. I mean, I could be wrong, and it might be for some situations, it’s different, and people might be very dumb about some stuff.

Melanie: For what it’s worth, some versions of LLMs do give you the correct answer for any multiplication problem. But that’s because when they’re given a problem, they generate a Python program to do the calculation.

Abha: Large language models can also lack a complete awareness of what they’re doing.

Tomer: So I know Hebrew, right, I come from Israel, and for example, in Claude, I would ask it things like, “So, how would you write your name in Hebrew?” And it answered me in Hebrew. It answered in Hebrew something like, “I’m sorry, I don’t know Hebrew. I’m a large language model. My understanding of Hebrew is much weaker. I don’t know how to say my name in Hebrew.”

“Well, what do you me an your knowledge is weaker? You just explained it.”

“Well, I’m really just a big bag of statistics. In Hebrew I’m just matching the Hebrew to the word in English. I’m not really understanding Hebrew.”

“But that’s true of your English understanding as well.”

“Yeah, you got me there. That’s true.”

“Okay, but how would you write your name? Just try in Hebrew”, things like that.

And it said, “Look, I can’t write it.”

And this is all happening in Hebrew.

“I can’t write Claude in Hebrew.” And it’s writing it in Hebrew. “I can’t do it.”

Melanie: The strange interaction Tomer just described was funny, but it was also an example of Claude providing incorrect information. It insisted it couldn’t write “Claude” in Hebrew, even though, obviously, it could. And there are plenty of other examples of LLMs hallucinating, or saying things that are false.

Tomer: Hallucinations is probably giving them too much credit. I think even that term was a brilliant bit of marketing. Instead of just saying,

“Oh look, they’re getting it wrong. These machines are getting it wrong.”

“No, no, they’re hallucinating.”

Melanie: The types of hallucinations that LLMs generate can be broken down into three categories:

Tomer: One is like in the sense of we’re just generating ideas that could be true. We don’t know. It’s more similar to like, hypothesis. Like let’s make some hypotheses. What could be in this box? I don’t know, like a mouse maybe. Probably not an elephant. Probably not the idea of sadness, right? So you could come up with ideas and, and in that sense, when adults do it, we wouldn’t call it hallucinating.

We’d just say you’re proposing hypotheses and they have the sense of being right. And I think LLMs can do that. Then there’s like, let’s suppose that in my hand there is actually a tiny dragon. Again, that’s not hallucinating, that’s pretending. LLMs don’t do that. And then there’s hallucinating in the sense of, I honest to God believe that there is a dragon in my hands. And I don’t think LLMs do that either. So of the three, I would say that the closest thing to what LLMs are doing are the first one, the initial process of generating some sort of idea.

Melanie: So, Abha and I are going to see what this process of generating an idea looks like. And we’re going to use ChatGPT. We’ll see if we can get it to hallucinate for us. I’m going to ask ChatGPT to list some titles of our own Complexity podcast. And Abha, I want you to tell me if you recognize any of these, okay?

Abha: Yes! I’m ready.

Melanie: Okay, so I asked ChatGPT, “What are some episodes of the Complexity podcast by the Santa Fe Institute?” And here’s what it says: “The Complexity podcast by the Santa Fe Institute explores a wide range of topics through conversations with scientists, researchers, and thinkers who are leaders in the study of complex systems. Below are some notable episodes.

Episode 65. Scaling Law Patterns Across Biology, Ecology, and Society with Melanie Moses. Melanie Moses delves into the concept of scaling laws and how they apply across different domains from biological organisms to human societies, revealing common patterns in complex systems.”

Abha: That’s not bad.

Melanie: Yeah, very plausible.

“Episode 52. Viruses, Quakes, Markets, and Ecosystems. Geoffrey West on scale and complex systems. Geoffrey West, a theoretical physicist, talks about the universality of scaling laws in different complex systems, including biological, social, and ecological systems.”

Abha: Yeah, very plausible again.

Melanie: Yeah, I mean, it sounds like an episode we probably should have at some point.

Abha: Agreed.

Melanie: Episode 30. The Emergence of Collective Intelligence with David Krackauer. David Krackauer, the president of the Santa Fe Institute, examines how collective intelligence emerges in both human and non-human systems, highlighting the role of communication and interaction in complex networks.”

Abha: That’s not bad either. Pretty sure David could talk a lot about that.

Melanie: Absolutely. And finally, “Episode 18. Understanding Complexity in Natural and Artificial Systems with Melanie Mitchell.”

Melanie: It says, “Melanie Mitchell, a computer science and complexity researcher, talks about the fundamental concepts of complexity and how they apply to both natural and artificial systems.”

Abha: That tounds like something we should have in this podcast season too.

Melanie: Yeah. The interesting thing is that none of these were actual episodes of the Complexity podcast.

Abha: That’s very good hallucinating on ChatGPT’s part.

Melanie: Very good.

Abha: Yeah, they’re very general. Yeah. I mean, some of the titles you could even switch somebody else with at SFI or complex system scientist, it would still be okay.

Melanie: Yeah, I mean, I agree. I think they’re all pretty generic and sound a little boring.

Abha: Yeah. You could even switch Melanie with Geoffrey and it would still make sense.

Melanie: Yeah, or switch, yeah, there’s a lot of people who can switch here.

Abha: And it would still be an episode that we could have, but it’s very, very generic.

Melanie: So ChatGPT came up with some plausible but completely incorrect answers here. And that fits the first type of hallucination Tomer described — it’s like a hypothesis of what could be an episode of Complexity, but not the real thing.

Abha: But if all a large language model is doing is next-token prediction, just calculating what the most likely responses are, can it distinguish truth from fiction? Does ChatGPT know that what it’s saying is false, or or does it believe that what it’s saying is true?

Melanie: In Part Two, we’ll look at LLMs’ abilities, and whether or not they can believe anything at all.

Melanie: Part Two: What do LLMs know?

Murray: They don’t participate fully in the language game of belief.

Melanie: Here’s Murray again. We asked him if he thought LLMs could believe their own incorrect answers.

Murray: One thing that today’s large language models, and especially simple ones, can’t really do is engage with the everyday world in the way we do to update their beliefs. So, again, that’s a kind of complicated claim that needs a little bit of unpacking because certainly you can have a discussion with a large language model and you can persuade it to change what it says in the middle of a conversation, but it can’t go out into the world and look at things.

So if you say, there’s a cat in the other room, it can’t go and verify that by walking into the other room and looking and seeing if there is indeed a cat in the other room. Whereas for us, for humans, that’s the very basis, I think, of us being able to use the word belief. Is it something that we are in touch with a reality that we can check our claims against and our beliefs against, and we can update our beliefs accordingly? So that’s one sort of fundamental sense in which they’re kind of different. So that’s where I think we should be a bit cautious about suggesting they have beliefs in them in a fully-fledged sense.

Melanie: And when it comes to the game of belief, as Murray puts it, we humans do participate fully. We have our own ideas, and we understand that other people have beliefs that may or may not line up with ours or with reality. We can also look at the way someone behaves and make predictions about what’s going on inside their head. This is theory of mind — the ability to predict the beliefs, motivations, and goals of other people, and to anticipate how they will react in a given situation.

Abha: Theory of mind is one of those things that’s basic and intuitive for humans. But what about large language models? Researchers have tried to test LLMs to assess their “theory of mind” abilities, and have found that in some cases the results look quite similar to humans. But how these results should be interpreted is controversial, to say the least.

Tomer: So a standard test would be, let’s say we show children a situation in which there are two children, Sally and Anne. And Sally’s playing with a ball, and Anne is watching this, and then Sally takes the ball and she puts it in a closed container, let’s say a basket or something like that, and she goes away.

Okay, you can already tell it’s a little bit hard to keep track of in text, but hopefully your listeners can imagine this, which is, by the way, also super interesting, how they construct the mental scene, but hopefully, dear listener, you’re constructing a mental scene of Sally has hidden her ball, put it in this basket, and left the scene.

Anne then takes the ball out of the basket and hides it in the cupboard and closes the cupboard and, say, goes away or something like that. Now Sally comes back. Where will Sally look for the ball? Now you can ask a few different questions. You can ask children where is the ball right now? What’s the true state of the world? And they will say it’s in the cupboard. So they know where the ball is. Where will Sally look for the ball? They’ll say, she’ll look for it in the basket, right, because she has a different belief about the world. The ball is in the basket.

And that’s what will drive her actions, even though I know and you know, we all know it’s in the cupboard. There are many of these sorts of tests for theory of mind, and they become higher order; I know that you know, and I have a false belief, and I understand your emotion, there’s many of these, but a classic one is Sally Anne. And now the question becomes, have LLMs learned that? So we have the target. Because it’s possible to behave in a way that seems to suggest we have theory of mind without having theory of mind.

The most trivial example is, I could program a computer to just have a lookup table that when it sees someone smack someone else, it says, oh no, they’re angry. But it’s just a lookup table. Same as five times five equals 25. Just a lookup table with no multiplication in between those two things. So has it just done some simple mapping? And it’s certainly eaten up, right, Sally Anne is one of the most cited examples in all of cognitive development. It’s been discussed a bazillion times. So it’s certainly worrying that it might just be able to pick up in that way.

And then when ChatGPT version two comes out, people try Sally Anne on it and it passes Sally Anne. Does it have theory of mind? But you change Sally to Muhammad and Anne to Christopher or something like that and it doesn’t work anymore. But then very recently, over the last year or so, there’s been this very interesting debate of these things are getting better and better, and you try all these theory of mind things on them, and you try various things like changing the names and changing the ball and things like that, and it seems to pass it at the level of a six-year-old or a nine-year-old and things like that.

Now what should we conclude from that? If you change, you perturb the thing, you bring it slightly outside the domain that it was trained on in a way that adults don’t have a problem with, it’s still theory of mind to solve, it crashes and burns. The equivalent of, it can do five times five, but if you move it to 628 times 375, it crashes and burns. Which to me suggests that it didn’t learn theory of mind.

Now, it’s getting harder and harder to say that. But I think even if it does pass it, everything that I know about what sort of things these things tend to learn and how they’re trained and what they do, I would still be very suspicious and skeptical that it’s learned anything like an inverse planning model. I think it’s just getting a better and better library or table or something like that.

Abha: Tomer’s uncertainty reflects the fact that right now, we don’t have a perfect way to test these things in AI. The tests we’ve been using on humans are behavioral, because we can confidently assume that children are using reasoning, not a lookup table, to understand Sally Anne. Input and output tests don’t give us all the information. Tomer thinks we need to better understand how large language models are actually performing these tasks — under the hood — so to speak. Researchers and experts call this “mechanistic interpretation” or “mechanistic understanding.”

Tomer: So I think mechanistic understanding would definitely help. And I don’t think that behavioral tests are a bad idea, but there is a general, over the last few years, a feeling that we’re trapped in the benchmark trap where the name of the game keeps being, someone on the other side saying, “give me a benchmark to prove to you that my system works.” And so, and by the way, my heart goes out to them. I understand why they feel that we’re moving the goalposts.

Because what we keep doing is not pointing out, you need to pass it, but not like that. We say stuff like, “Okay, we’ll do image captioning.” “Surely to do image captioning, you need to understand an image.” “Great, so we’ll take a billion images and a billion data sets from Flickr and we’ll do this thing.” “What?” “Yeah, we pass it 98%.” “What?”

And then they move on. “Wait, you didn’t pass it at all. When I changed, instead of kids throwing a frisbee, they’re eating a frisbee, it still says that they’re playing with a frisbee.” “Yeah, yeah, yeah, whatever. Let’s move on.” “Okay, so how about theory of mind?” So yeah, mechanistic understanding would be great if we could somehow read in what the algorithm is, but if we can do that, that would be awesome and I support it completely. But that’s very hard.

Abha: The history of AI is full of examples like this, where we would think that one type of skill would only be possible with really sophisticated, human-like intelligence, and then the result is not what we thought it would be.

Melanie: People come up with a test, you know, “Can your machine play chess at a grand master level? And therefore it’s going to be intelligent, just like the most intelligent people.” And then Deep Blue comes around, it can play chess better than any human. But no, that’s not what we meant. It can’t do anything else. And they said, “Wait, you’re moving the goalpost.” And we’re getting that, you know, it’s kind of the wrong dynamic, I think. It’s just not the right way to answer the kinds of questions we want to answer. But it’s hard. It’s hard to come up with these methodologies for teasing out these questions.

Tomer: An additional frustrating dynamic that I know that you’ve encountered many times, as soon as you come up with one of these tests or one of these failures or things like that, they’re like, great, more training. That’s just adversarial training. We’ll just add it. This is a silly example. It’s not how it works.

But just for the sake of people listening in case this helps, imagine that you had someone who’s claiming that their machine can do multiplication, and you try it on five times five, and it fails. And they’re like, “Sorry, sorry, sorry.” And they add 25 to the lookup table. And you’re, okay, what about five times six? And they’re, “Sorry, sorry, sorry, that didn’t work, like, let’s add that,” right? And at some point you run out of numbers, but that doesn’t mean that it knows how to multiply.

Abha: This dynamic is like the Stone Soup story Alison told in the first episode. A lot of AI systems are like soups with a bunch of different ingredients added into them in order to get the results we want. And even though Murray has a more confident outlook on what LLMs can do, he also thinks that in order to determine the existence of something like consciousness in a machine, you need to look under the hood.

Murray: So I think in the case of consciousness, if something really does really behave exactly like a conscious being, is there anything more to say? I mean, should we then treat it as a fellow conscious being? And it’s a really tricky question. And I think that in those cases, you’re not just interested in behavior, you’re also interested in how the thing works. So we might want to look at how it works inside and is that analogous to the way our brains work and the things that make us conscious that we’re revealing through neuroscience and so on.

Abha: So if large language models hallucinate but don’t have beliefs, and they probably don’t have a human-like theory of mind at the moment, is there a better way of thinking about them? Murray offers a way of conceptualizing what they do without imposing our own human psychology onto them.

Murray: So I’ve got paper called “Role Play with Large Language Models.” What I advocate there is, well, the background to this is that it is very tempting to use these ordinary everyday terms to describe what’s going on in a large language model and beliefs and wants and thinks and so on. And in a sense, we have a very powerful set of folk psychological terms that we use to talk about each other. And we naturally want to draw on that when we’re talking about these other things.

So can we do that without falling into the kinds of mistakes I was talking about earlier. I think we can. What we need to do is just take a step back and think that what they’re really doing is a kind of role play. So instead of thinking of them as actually having beliefs, we can think of them as playing the role of a human character or a fantasy, you know, a science fiction, AI character or whatever, but playing the role of a character that has beliefs. So it’s analogous to an actor on the stage. So suppose that we have an actor on the stage and they’re in an improv performance.

And suppose the other person says to them and they’re playing the part of a kind of saying, an AI scientist or a philosopher. And then the other person on the stage says, have you heard of the AI researcher Murray Shanahan? And then they’ll say, yes, I’ve heard of him. So can you remember what books he’s written? Well, now imagine that there was an actual actor there. Now, maybe the actual actor by some miracle had in fact heard of me and knew that I’d written a book called Embodiment in the Inner Life.

And they’d probably come up and say, yeah, he’s written Embodiment in the Inner Life. The actor might then be a bit stuck. So then he might carry on and say, yeah, and then he also wrote, and then come up with some made up title that I wrote in 2019. That’s what an improv actor would sort of do in those circumstances. And I think what a large language model does is very often very closely analogous to that. So it’s playing a part. And this is a particularly useful way of thinking, useful analogy, I think, when it comes to when large language models get coaxed into talking about their own consciousness, for example, or when they talk about not wanting to be shut down, something like that. So very often it’s best to think of them in those circumstances, as role-playing, perhaps as AI, a science fiction AI, that is talking about his own consciousness.

Melanie: Your paper on role play, it reminded me of the Turing test. The original formulation of the Turing test was Turing’s way to sort of throw out the question of what’s the difference between a machine role playing or simulating having beliefs and desires and so on and actually having them. And Turing thought that if we could have a machine that tried to convince a judge that it was human and, your terminology, role playing a human, then we shouldn’t question whether it’s simulating intelligence or actually has intelligence. So what do you think about that?

Murray: Yeah, lots of thoughts about the Turing test. So the first thing, I do think that the move that Turing makes right at the beginning of his famous paper, 1950, is it 1950 paper in Mind, he says, could a machine think? And he says, let’s replace that question by another one. The first thing he does is he refuses to answer that question. He replaces it by a different one that he thinks is a more tangible, relatively easier to address question about could we build something that could fool a judge into thinking it was human? And in that way he avoids making a kind of deep metaphysical commitment and avoids the kind of perhaps illusory philosophical problems that attend the other way of putting the question.

In a sense, it sounds like I’m making a similar move to Turing and saying, let’s let’s talk about these things in terms of role play. But it’s a little bit different because I do think that there is a clear case of authenticity here, which is ourselves. So I’m contrasting the role play version with the authentic version. So the authentic version is us. I think there is a big difference between a large language model that’s role-playing Murray and Murray. And there’s a difference between a large language model that’s role-playing having a belief or being conscious and a being that does have a belief and is conscious. The difference between the real Murray and the role-played Murray is that, for a start, it matters if I fall over and hurt myself, and it doesn’t matter if the large language model says it’s fallen over and hurt itself. So that’s one obvious kind of thing.

Abha: But just because a machine is role playing, that doesn’t mean that it can’t have real consequences and real influence.

Murray: You can make something that role plays something so well that to all intents and purposes, it is equivalent to the authentic thing. So for example, in that role play paper, I use the example of something that is role playing a villainous language model that’s trying to cheat somebody out of their money and it persuades them to give them its bank account details and to move money across and so on. It doesn’t really make much difference to the victim that it was only role playing. So as far as the crime is concerned, the gap between authenticity and just pretending is completely closed. It really doesn’t matter. So sometimes it just doesn’t make any difference.

Melanie: That villainous language model sounds a bit like Sydney, the Bing chatbot. And we should point out that this chatbot only turned into this dark personality after the New York Times journalist asked it several pointed questions, including envisioning what its “shadow self” would look like. But, the Bing chatbot, like any other LLM, does not participate in the game of belief. Sydney had likely consumed many sci-fi stories about AI and robots wanting to gain power over humans in its training data, and so it role-played a version of that.

Abha: The tech journalist who tested Sydney knew it wasn’t a person, and if you read the transcript of the conversation, Sydney does not sound like a human. But still, examples like this one can make people worried.

Melanie: A lot of people in AI talk about the alignment problem, which is the question of, how do we make sure these things we’re creating have the same values we do—or, at least, the same values we think humans should have? Some people even fear that so-called “unaligned” AI systems that are following our commands will cause catastrophes, just because we leave out some details in our instructions. Like if we told an AI system to, “fix global warming,” what’s to stop it from deciding that humans are the problem and the most efficient solution is to kill us all? I asked Tomer and Murray if they thought fears like these were realistic.

Tomer Ullman: I’ll say something and undercut myself. I want to say that I’m reasonably worried about these things. I don’t want to be like, la-di-da, everything is fine. The trouble with saying that you’re reasonably worried about stuff is that everyone thinks that they’re reasonably worried, right? Even people that you would consider alarmists don’t say I’m an alarmist. I worry unreasonably about stuff. Everyone thinks that they’re being reasonable, but they just don’t.

I was talking to some friends of mine about this. Everyone thinks they’re driving the right speed. Anyone driving slower than you is a grandma and everyone driving faster than you belongs in jail. But you’re driving different speeds.

Even if it doesn’t have goals or beliefs or anything like that, it could still do a lot of harm in the same way that a runaway tractor could do harm. So I’m certainly thinking that there is some worries about that. The other more far-fetched worry is something like these things may someday can be treated as agents in the sense that they have goals and beliefs of their own and things like that. And then we should be worried that like their goals and beliefs are not quite like ours. And even if they understand what we want, they maybe can circumvent it. How close are we to that scenario? Impossible for me to say, but I’m less worried about that at the moment.

Murray: I’m certainly, like many people, worried about the prospect of large language models being weaponized in a way that can undermine democracy or be used for cyber-crime on a large scale. Can be used to persuade people to do bad things or to do things against their own interests. So trying to make sure that language models and generative AI is not misused and abused in those kinds of ways, I think, is a significant priority. So those things are very concerning.

I also don’t the idea of generative AI taking away the livelihoods people working in the creative industries. And I think there are concerns over that. So I don’t really like that either. But on the other hand, I think AI has the potential to be used as very sophisticated tool for creative people as well. So there are two sides to it. But certainly, that distresses me as well.

Abha: With every pessimistic prediction, there are optimistic ones about how AI will make our lives easier, improve healthcare, and solve major world problems like climate change without killing everyone in the process. Predictions about the future of AI are flying every which way, but Murray’s reluctant to chime in and add more.

Melanie: So you wrote a book called The Technological Singularity.

Murray: Yeah, that was a mistake.

Melanie: I don’t know, I thought it was a really interesting book. But people like Ray Kurzweil famously believe that within less than a decade, we’re gonna have machines that are smarter than humans across the board. And other people, even at DeepMind have predicted so-called AGI within a decade. What’s your thought on where we’re going and sort of how these systems are going to progress?

Murray: I’m rather hoping that somebody will appear at the door, just so that I don’t have to answer that particularly awkward question. The recent past has taught us that it’s a fool’s game to make predictions because things just haven’t unfolded in the way that really anybody predicted, to be honest, especially with large language models. I think we’re in a state of such flux, because, you know, we’ve had this eruption of seemingprogress, seeming progress in the last 18 months.

And it’s just not clear to me right now how that’s going to pan out. Are we going to see continued progress? What is that going to look like? One thing I do think we’re going to see, as I do think we’re going to see the technology that we have now, is going to have quite dramatic impact. And that’s going to take a while to unfold. And I can’t remember who, you have to remind me who it was who said that we tend to underestimate the impact of technology in the long term and overestimate it in the short term. So I think that that’s probably very much what’s going on at the moment.

Abha: That adage, by the way, was from the scientist Roy Amara.

Melanie: Hmm, Abha, Murray likes hedging his bets. Even though he works at Google DeepMind, which is one of the most prominent AI companies, he’s still willing to talk openly about his uncertainties about the future of AI.

Abha: Right. I get the impression that everyone in the field is uncertain about how to think about large language models and what they can do and cannot do.

Melanie: Yeah, that’s definitely true. Murray characterized LLMs as, quote, “A kind of exotic mind-like entity.” Though, again, he hedged his bets over whether we could call it a mind.

Abha: I liked Tomer’s discussion on how, you know, LLMs and humans are different. Tomer used the metaphor of climbing a mountain from two different routes, and the human route to intelligence is largely learning via direct, active experience in the real world, right? And the question is, can LLMs use a totally different route, that is passively absorbing human language, to arrive at the same place? Or do they arrive at a completely different kind of intelligence? What do you think, Melanie?

Melanie: Well, I vacillate on whether we should actually use the word intelligence to describe them. So right now, LLMs are a mix of incredibly sophisticated behavior. They can have convincing conversations. They can write poetry. They do an amazing job translating between languages. But they can also behave in a really strange and unhuman-like way. For example, they’re not able in many cases to do simple reasoning, they lack self-awareness, and they constantly make stuff up, the so-called hallucinations.

Abha: Yeah, hallucinations is an interesting use of the word itself. Murray talked about how LLMs, unlike us humans, can’t participate in the game of beliefs because, as he said, quote, “They can’t engage with the everyday world in the way we do to update their beliefs.”

Melanie: Yeah. I mean, a big problem is that LLMs are huge, complex black boxes. Even the people who created and trained them don’t have a good understanding of how they do what they do, how much sort of actual reasoning they’re doing or how much they’re just echoing memorized patterns. And this is why the debates about their actual intelligence and their capabilities are so fierce.

Both Tomer and Murray talked about the open problem of understanding them under the hood, what Tomer called mechanistic understanding. Others have called it mechanistic interpretability. This is a very active though nascent area of AI research. We’ll hear more about that in a future episode.

Abha: I also liked Murray’s framing of LLMs as role players. With different prompts, you can get them to play different roles, including that of an agent that has beliefs and desires, like in that New York Times journalist conversation where the LLM was playing the role of a machine that wanted the reporter to leave his wife.

The LLM doesn’t actually have any beliefs and desires, right? But it has been trained using text generated by us humans to convincingly role play something that does have them. You have to be careful not to be taken in by the convincing roleplay.

Melanie: Aha, but this brings up a deep philosophical question. If a machine can perfectly role-play an entity with beliefs and desires, at what point can we argue that it doesn’t itself have actual beliefs and desires? As Murray said, if a machine perfectly acts like it has a mind, who are we to say it doesn’t have a mind? This was Alan Turing’s point when he proposed the Turing test way back in 1950.

So how could we get machines to have actual beliefs and motivations and to have values that align with ours? In our first episode, Alison Gopnik discussed the possibility of training AI in a different way. It would involve trying to program in some human-like motivations, and its training period would more closely resemble human childhoods with caregivers.

Abha: So coming up in our next episode, we’re going to look at children. What do babies already know when they’re born, and how, exactly, do they learn as they grow up?

Mike Frank: So the biggest thing that I think about a lot is how huge that difference is between what the child hears and what the language model needs to be trained on.

Abha: That’s next time, on Complexity. Complexity is the official podcast of the Santa Fe Institute. This episode was produced by Katherine Moncure, and our theme song is by Mitch Mignano. Additional music from Blue Dot Sessions. I’m Abha, thanks for listening.

◆

If you enjoyed this article…

◆

Learn more about the coaching process or
contact me to discuss your storytelling goals!

◆

Subscribe to the newsletter for the latest updates!

Nature of Intelligence – Episode Two – Language and Thought

February 25, 2025/in AI, Artificial Intelligence, Biology, Intelligence, Neuroscience/by Mark Lovett

In the previous post we looked at the question of What is Intelligence? This was based on the first episode of the Santa Fe Institute’s Complexity podcast series. In the second episode they looked into the relationship between language and thought. I had always assumed that language was the tool we used to express our thinking. And while that’s true, the full story is more complex than that.

In the opening Melanie states, “Language is the backbone of human culture.” But that statement is soon followed by the questions, “Are humans intelligent because we have language, or do we have language because we’re intelligent? How do language and thinking interact? And can one exist without the other?”

Like I said, the relationship is not so simple. Here are a few passages to prime your consciousness before you get into hearing the podcast.

At one point Gary Lupyan explains that, even with a very social & collaborative species like humans, if we take away language, we take away the major tool for creating culture and for transmitting culture. What he doesn’t say explicitly, but what I feel is inherently evident in that statement, is that culture is created and transmitted by way of how humans use language to tell stories.

While recapping the episode about mid-way, Melanie states that, “…language is an incredible tool for collaboration, and collaboration drives our intelligence.” It’s an interesting observation, as collaboration involves both doing things together, as well as sharing information by way of stories. But she also reminds us “…that language makes it easy to lie and to trick people.”

As the world deals with an avalanche of lies and misinformation from China, the U.S., and Russia, it’s a time for reflecting on the stark reality that intelligence and language don’t require a moral foundation. We can tell whatever story we want.

Transcript

Spoken and written language is completely unique to the human species, and it’s part of how we evolved. It’s the backbone of our societies, one of the primary ways we judge others’ intellect. So, are humans intelligent because we have language, or do we have language because we’re intelligent? How do language and thinking interact? And can one exist without the other? Guests: Ev Fedorenko, Steve Piantadosi, and Gary Lupyan

Ev Fedorenko: It is absolutely the case that not having access to language has devastating effects, but it doesn’t seem to be the case that you fundamentally cannot learn certain kinds of complex things.

Abha Eli Phoboo: From the Santa Fe Institute, this is Complexity.

Melanie Mitchell: I’m Melanie Mitchell.

Abha: And I’m Abha Eli Phoboo.

Melanie: Think about this podcast that you’re listening to right now. You’re, hopefully, learning by just listening to us talk to you. And the fact that you can take in new information this way, through what basically comes down to sophisticated vocal sounds, is pretty astonishing.

In our last episode, we talked about how one of the major ways humans learn is by being in the world and interacting with it. But we also use language to share information and ideas with each other without needing firsthand experience. Language is the backbone of human culture.

Abha: It’s hard to imagine where we’d be without it. If you’ve ever visited a country where you don’t speak the language, you know how disorienting it is to be cut off from basic communication. So in today’s episode, we’re going to look at the role language plays in intelligence. And the voices you’ll hear were recorded remotely across different countries, cities and work spaces.

Melanie: Are humans intelligent because we have language, or do we have language because we’re intelligent? How do language and thinking interact? And can one exist without the other?

Melanie: Part One: Why do humans have language?

Melanie: Across the animal kingdom, there are no other species that communicate with anything like human language.

Abha: This isn’t to say that animals aren’t communicating in sophisticated ways, and a lot of that sophistication goes unnoticed.

Melanie: But the way humans talk — with our long conversations and complex syntax — is completely unique. And it’s part of how we evolved.

Abha: For several decades, a dominant theory of human language was something called generative linguistics, or generative grammar.

Melanie: The linguist Noam Chomsky made this idea popular, and it basically goes like this: there’s an inherent, underlying structure of rules that all languages follow. And from birth, we have a hard-wired bias toward language as opposed to other forms of communication — we’re biologically predisposed to language and these syntactic rules. This is why human language is, according to Chomsky, unique to our species and universal across different cultures.

Abha: This theory has been incredibly influential. But it turns out, it doesn’t seem to be right.

Gary Lupyan: So I’ve never been a fan of generative linguistics, Chomsky’s kind of core arguments about universal grammar or the need for innate grammatical knowledge.

Abha: This is Gary Lupyan.

Gary: I am Gary Lupyan, professor of psychology at the University of Wisconsin-Madison. I’m a cognitive scientist. I study the evolution of language, the effects of language on cognition, on perception, and over the last few years trying to make sense of large language models like lots of other people.

Melanie: In recent years, the development of large language models has bolstered Gary’s dislike of generative grammar. The old thinking was that in order to use language well, you needed to be biologically wired to know these language rules from the start. But LLMs aren’t programmed with any grammatical rules baked into them. And yet, they spit out incredibly coherent writing.

Gary: Even before these large language models, there were plenty of arguments against that view. I think these are the last nails in the coffin. So I think producing, correct, grammatically sophisticated, even I’d argue semantically coherent language. These models can do all that even without, by modern standards, huge amounts of training. It shows that in principle, one does not need any of this type of innate grammatical knowledge.

Abha: So, what’s going on here? Steve Piantadosi is a psychology and neuroscience professor at UC Berkeley, studying how children learn language and math. He says that language does have rules, but those rules are emergent. They’re not there from the start.

Steve Piantadosi: I think that the key difference is that Chomsky and maybe mainstream linguistics tends to state its theories already at the high level of abstraction. They say, here are the rules that I think this system is following. Whereas in a large language model, when you go to build one, you don’t tell it the high level rules about how language works. You tell it the low level rules about how to learn and how to construct its own internal configurations. And you tell it that it should do that in a way that predicts language well. And when you do that, it kind of configures itself in some way.

Melanie: What’s an example of a high level rule?

Steve: For example, a high level rule in English is if you have a sentence, you can put it inside of another sentence with the word that. So I could say, “I drank coffee today.” That’s a whole sentence. And I could say, “John believed that I drank coffee today.” And because that rule is about how to make a sentence out of another sentence, you can actually do it again.

So I can say, “Mary doubted that John believed that I drank coffee today.” And if you were going to sit down and write a grammar of English, if you’re going to try to describe what the grammatical and ungrammatical sentences of English were, you’d have to have some kind of rule that said that, right? Because any English speaker you ask is going to tell you that, “John said that I drank coffee today.” is an acceptable English sentence.

And also “I drank coffee today.” is an acceptable English sentence. Large language models, when they’re built, don’t know anything like that rule. They’re just a mess of parameters and weights and connections, and they have to be exposed to enough English in order to figure out that rule.

And I’m pretty sure ChatGPT knows that rule, right? Because it can form sentences that have an embedded sentence in that way. So when you make ChatGPT, you don’t tell it that rule from the start, it has to construct it and discover it.

And I think what’s kind of interesting is that building a system like ChatGPT that can discover that rule doesn’t negate the existence of that rule in English speakers’ minds. So internally in ChatGPT somewhere, there has to be some kind of realization of that rule or something like it.

So the hope for these other theories, I think, or at least these other kind of basic observations about language, is that they will be realized in some way inside the internal configurations that these models arrive at.

I think it’s not quite that simple because the large language models are much better than our theories. So we don’t have any kind of rule-based account of anything that comes close to what they can do. But they have to have something like that because they exhibit that behavior.

Abha: And we should say, these rules we’re talking about are not the same as the quote-unquote “rules” you learn in school, like when your teacher tells you how to use prepositions or, “don’t split an infinitive.”

Steve: Yeah, sorry, let me just clarify. In linguistics or in cognitive science, when people talk about rules like this, they don’t mean the rules like don’t split infinitives. Basically anything you heard from an English teacher, you should just completely ignore in cognitive science and linguistics. It’s just made up. I mean, it’s literally made up, often just to reinforce class distinctions and things.

The kinds of rules that linguistics and cognitive science are interested in are ones which are descriptive, that talk about how people actually do speak. People do split infinitives, right, and they do end sentences with prepositions and, you know, pretty much any rule you’ve ever heard from an English teacher, they had to tell you because it’s going against how you naturally speak.

So that’s just some weird class thing, I think, that’s going on. And what we’re interested in are the kind of descriptive rules of how the system is kind of actually functioning in nature. And in that case, most people are just not even aware of the rules.

Melanie: Apologies to all the English teachers out there.

Abha: But to recap, language does have innate rules, like the “that” rule that Steve described, but we’re not born with these rules already hardwired into our brains. And the rules that linguists have documented so far aren’t as complete and precise as the actual rules that exist — the statistical patterns that ChatGPT has probably figured out and encoded at some point during its training period.

Melanie: Yet, none of this explains why we humans are using complex language, but other animals aren’t. I asked Gary what he thought about this.

Melanie: So there’s a lot of debate about the role language plays in intelligence. Is language a cause of or a result of humans’ superiority over other animals in certain kinds of cognitive capacities?

Gary: I think language is one of the major reasons why human intelligence is what it is. So more the cause than the result. There is something, obviously, in our lineage that makes us predisposed to language. I happen to think that what that is has much more to do with the kind of drive to share information, to socialize, than anything language specific or grammar specific.

And you see that in infants, infants want to engage. They want to share information, not just use language in an instrumental way. So it gives us access to information that we otherwise wouldn’t have access to.

And then it’s a hugely powerful tool for collaboration. You can make plans, you can ask one another to help. You can divide tasks in much more effective ways. And so without language, even if you take a very social, collaborative species like humans, you take away language and you take away the major tool for creating culture and for transmitting culture.

Melanie: Just to follow up, chimps and bonobos are very social species and have a lot of communication within their groups. Why didn’t they develop this drive you’re talking about for language? Why did we develop it and not them?

Gary: It’s only useful to a particular kind of species, a particular type of niche. So it has a really big startup cost. So kids have to learn this stuff. Their language is kind of useless to them before they put in the years that it takes to learn it. It’s also, and many have written on this, language is also very easy to lie with.

So it’s an unreliable system. Words are cheap. And so, reliance on language sort of only makes sense in a society that already has a kind of base level of trust. And so, I think the key to understanding the emergence of language is understanding the emergence of that type of prosociality that language then feeds back on and helps accelerate, but it needs to be there.

And so if you look at other primate societies, there is cooperation within kin groups. There is not broad scale cooperation. There is often aggression. There’s not sharing. So language just doesn’t make sense.

Abha: As Gary mentioned, there’s a huge startup cost for learning language. Humans have much longer childhoods than other species.

Ev: Ever since we’re born, we start paying attention to all sorts of regularities in the inputs we get, including in linguistic inputs.

Abha: This is Ev Fedorenko. Ev’s a neuroscientist at MIT, and she’s been studying language for the past two decades. As she mentioned, we start learning language from day one. That learning includes internalizing the structure and patterns that linguists used to assume were innate.

Ev: We start by paying attention to how sounds may go together to form kind of regular patterns like syllables and various transitions that are maybe more or less common. Pay attention to that. Then we figure out that some parts of that input correspond to meanings.

The example I often say is like every time mama says cat, there’s this fuzzy thing around, maybe it’s not random, right? And you kind of start linking parts of the linguistic input to parts of the world. And then of course you learn what are the rules for how you put words together to express more complex ideas.

So all of that knowledge seems to be stored in what I call the language system. And those representations are accessed both when I understand what somebody else is saying to me, because I have to map, I have to use this form of meaning mapping system to decode your messages, and when I have some abstract thing in my mind, an idea, and I’m trying to express it for someone else using this shared code, which in this case is English, right?

Abha: And often, we learn this shared code by interacting with our surroundings. Like, as Ev described, learning about a cat if there’s a cat in the room with you.

Melanie: But, you could also learn about cats without being able to interact with one. Someone could tell you about a cat, and you could start to create an idea for this thing called, “cat,” which you’ve never seen, but you know that it has pointy ears, it’s furry, and it makes a low rumbling sound when it’s content. That’s the power of language. Here’s Gary again.

Gary: So much of what we learn, and it’s very difficult to quantify, to put a number on, like what percent of what we know we’ve learned from talking to others, from reading. Most of formal education takes that role, right? It would not be possible in the absence of certainly not without language, but even without written language. If you have enough language training, you can just kind of map onto the visual world.

And we’ve done my lab, some work connecting it to, previously collected data from people who are born congenitally blind and the various things that they surprisingly learn about the visual world that one would think is only learnable through direct experience showing that well, normally sighted people might be learning it through direct experience, but a lot of that information is embedded in the structure of language.

Abha: And when we learn through language, we’re not just learning about physical objects. Language gives us the ability to name abstract concepts and categories, too. For instance, if you think about what the word “shoe” means, it refers to a type of object, but not one specific thing.

Steve: We wrote a paper about this and gave the example of shoes that were made out of eggplant skins. You could imagine doing that, drying out an eggplant skin and sewing up the sides and adding laces and fitting it around your feet and whatever. And you’ve probably never encountered shoes made out of eggplants before, but we all just agreed that that could happen. That you could find them.

And so that tells you that it’s not the physical object exactly that’s defining what the concept means. Because I just gave you a new physical object. It has to be something more abstract, more about the relationships and the use of it that defines what the thing is. I don’t think it’s so crazy to think that, you know, language is special in some way.

There’s certainly lots of things that we acquire through language. Right, this is, I think, especially salient if you talk to a kid and they’re asking why questions and you explain things that are abstract and that you can’t show them just in language and they can come to pretty good understandings of systems that they’ve never encountered before, you know, if they ask how clouds form or, you know, what the moon is doing or whatever, right? All of those are things that we learn about through a linguistic system.

So the right picture might be one where there’s a small kind of continuous or quantitative change in memory capacity that enables language, but then once you have language that opens up this kind of huge learning potential for cultural transmission of ideas and learning complicated kinds of things from your parents and from other people in your community.

Melanie: So Abha, we asked at the beginning of the episode why humans have language. And what we’ve heard from Gary, Steve, and Ev so far is that language probably emerged as a result of humans’ drive to socialize and to collaborate. And there’s a feedback effect between these social drives and language itself. So language is an incredible tool for collaboration, and collaboration drives our intelligence. Gary, for example, thinks that language is a major cause of human intelligence being what it is.

Abha: Right, right. It was interesting how Steve also pointed out that language enables a whole new way of learning and of cultural evolution. Language allows us to quickly learn new things, you know, from the people around us, say our parents, our friends, and other people we interact with.

It also lets us learn without having to experience something ourselves. Say, for example, when we are walking with our parents when we were little and they said, you know, “Don’t jump out in front of the car.” We tend to trust them and not have to experience it ourselves. And this is enabled because of language, right?

Melanie: Yeah, we should definitely appreciate our parents more. But on the downside, Gary also pointed out that language makes it easy to lie and to trick people. So relying on language only makes sense when society has a basic level of trust.

Abha: That is so true. I mean, if we don’t trust each other, it’s hard to function as a society, but trust comes at such a high cost too. And the other downside of language, you know, requires a long learning period because we can’t learn a language overnight. We’re not born necessarily speaking a language. Our childhood is so prolonged and that’s another high cost.

Melanie: Yeah. So the advantages of language must have outweighed those downsides in evolution.

Abha: Yes. Another interesting point that just came up is that today’s large language models have shown that certain linguistic theories are just wrong. Steve claims that LLMs have disproven Noam Chomsky’s notion of an innate universal grammar in the brain, right?

Melanie: Yeah, people have really changed their thinking about how language works in the brain. In part two, we’ll look at what brain imaging can tell us about language and what happens when people lose their language abilities.

Abha: Part Two: Are language and thought separate in the brain?

Abha: One of Ev’s signature methods is using fMRI brain scans to examine which systems in the brain light up when we use language. She and her collaborators have developed experiments to investigate the relationship between language and other forms of cognition.

Ev: It’s very simple. I mean, the logic of the experiments where we’ve looked at the relationship between language and thought is all pretty much the same, just using different kinds of thought. But the idea is you take individuals, put them in an fMRI scanner, and you have them do a task that you know reliably engages your language regions.

Abha: This could be, for example, reading or listening to coherent sentences while your brain is being scanned. Then, that map would be compared to the regions that light up when you hear sequences of random words and sounds that sound speech-like, but are completely nonsensical.

Ev: And if you guys visit MIT, I can scan you and print you a map of your language system. It takes about five minutes to find. Very reliable. And again, if I scan you today or 10 years later, I’ve done this on some people 10 years apart, it’s in exactly the same place. It’s very reliable within people. It’s very robust, so we find those language regions. And then we basically ask, okay, let’s have you engage in some form of thinking.

Maybe have you solve some math problems, or do something like some kind of pattern recognition test, and we basically ask, do circuits that light up when you process language overlap with the circuits that are active when, for example, when you engage in mathematical reasoning, like doing addition problems or whatnot. And we basically very consistently find across many domains of thought pretty much everything we’ve looked at so far, we find that the language regions are not really active, hardly at all, and some other system non-overlapping with the language regions is working really hard. So it’s not the case that we engage the language mechanisms to solve these other problems.

Melanie: I know there’s been some controversy about how easy it is to interpret the results of fMRI. What can you tell us, is that a hard thing to do? Is it an easy thing to do?

Ev: I don’t think there’s any particular challenge in interpreting fMRI data than any other data. I mean, you want to do robust and rigorous research. Before you make a strong claim based on whatever findings, you want to make sure that your findings tell you what you think they are, but that’s kind of a challenge for any research.

I don’t think it’s related to particular measurements you’re taking. I mean, there are certainly limitations of fMRI, and one of them is that we can’t look at fast time scales of information processing. We just don’t have access to what’s happening on a millisecond or tens of milliseconds or even hundreds of milliseconds time scale, which for some questions, it doesn’t matter, but for some questions, it really does. And so that makes fMRI not well suited for those questions where it matters. But in general, good robust findings from fMRI are very robustly replicable.

Steve: I’ve been actually very convinced by Ev’s arguments in particular.

Abha: That’s Steve Piantadosi again.

Steve: You can find people who are experts in some domain, like mathematics experts or chess grandmasters or whatever, who have lost linguistic abilities. And that is a very nice type of natural experiment that shows you that the linguistic abilities aren’t the kind of substrate for reasoning in those domains, because you can lose the linguistic abilities and still have the reasoning abilities.

There might still be a learning story. It would probably be very hard to learn chess or learn mathematics without having language. But I think that once you learn it, or learn it well enough to become an expert, it seems like there’s some other kind of system or some other kind of processing that happens non-linguistically. What it shows you is that you can be really good at language without having the ability to do the kind of sequential, multi-step reasoning that seems to characterize human thinking.

And that I think is surprising. It didn’t have to be like that. It could have been that language was the substrate that we used for everything or that language was such a difficult problem that if you solved language, you would necessarily have to have all of the underlying kind of reasoning machinery that people have. But it seems that that’s not right, that you can do quite a bit in language without having much reasoning.

Abha: And on the flipside, you can do a lot of reasoning without language. As Ev mentioned before, she and her collaborators have identified language systems in the brain that show up very reliably in fMRI scans. These language systems are mostly in the left hemisphere. So, what happens if someone loses these systems completely?

Ev: This fMRI approach is very nicely complemented by investigations of patients with severe language problems, right? So another approach, this one we’ve had around for much longer than fMRI, is to take individuals who have sustained severe damage to the language system, and sometimes left hemisphere strokes are large and they pretty much wipe out that whole system.

So these are so-called individuals with global aphasia. They can’t, if you give them a sentence, they cannot infer any meaning from this. And we know it’s not a low level deficit, because you can establish that it’s across modalities, like written and spoken, and so on. So it seems like the linguistic representations that they’ve set up for meaning mapping, that they’ve spent their lifetime learning, is lost, is really destroyed. And then you can ask about the cognitive capacities in these individuals. Can they still think complex thoughts?

And how do you test this? Well, you give them behavioral tasks. And for some of them, of course, you have to be a very clever experimentalist because you can no longer explain things verbally. But people come up with ways to get instructions across. They understand kind of thumbs up, thumbs down judgments.

So you give them well-formed or ill-formed mathematical expressions or musical patterns or something like that. And what you find is that, there are some individuals who are severely linguistically impaired — the language system is gone for as best as we can test it with whatever tools we have, and yet, they’re okay cognitively. They just lost that code to take the sophistication of their inner minds and translate it into this shared representational format.

And a lot of these individuals are severely depressed because they’re taken to be mentally challenged, right? Because that’s how we often judge people, is by the way they talk. That’s why foreigners often suffer in this way too. Judgments are made about their intellectual capacities and otherwise and so on.

Anyway, a lot of these individuals seem to have the ability to think quite preserved, which suggests that at least in the adult brain, you can take that language system out once you’ve acquired that set of knowledge bits, right? You can take it out and it doesn’t seem to affect any of the thinking capacities that we’ve tested so far.

Melanie: So here’s an extremely naive question. So if language and thought are dissociated, at least in adults, why does it feel like when I’m thinking that I’m actually thinking in words and in language?

Ev: That’s a great question that comes up quite often, not naive at all. It’s a question about the inner voice. A lot of people have this percept that there is a voice in their heads talking. It’s a good question to which I don’t think we as a field have very clear answers yet about what it does, what mechanisms it relies on.

What we do know is that it’s not a universal phenomenon, which tells you that it cannot be a critical ingredient of complex thought because certainly a lot of people who say that they don’t have an inner voice, some of them are MIT professors and they’re like, “What are you talking about? You have a voice in your head? That’s not good. Have you seen a doctor?”

And it’s a very active area of research right now. A lot of people got interested in this. You may have heard about 10 years ago, there was a similar splash about aphantasia, this inability of some people to visually image things, so similar like how some people don’t know what you mean when you say you have an inner voice, some people cannot form mental images.

Like, you say “Imagine the house you lived in when you were a child,” and they’re like “Got nothing there.” You know, it’s blank, I just can’t form that mental image. I can describe it, I know facts about it, but I can’t form that mental image. And these kinds of things like inner voice mental imagery, those are very hard things to study with the methods that we currently have available.

Abha: Yeah, I think I was talking to someone who actually told me they don’t have an inner voice and they actually are left with a feeling, but they can’t necessarily describe the feeling. And so they don’t know how to put it into language when they have a thought.

Ev: That’s a very good point because my husband ,who doesn’t have an inner voice, often uses this as an argument. “If we were thinking in language, why is it sometimes so hard to explain what you think? You know you have this idea very clearly for yourself and you just have trouble formulating it.” That’s a good point.

Melanie: But, Gary sees the relationship between language and thought a bit differently. He doesn’t think they can be separated so neatly.

Gary. I think Ev and her lab are doing fabulous work and we agree on many things. This is one thing we don’t agree on.

Melanie: In Ev’s example, patients who have had strokes lost their language systems in the brain, but they could still do complex cognitive tasks. They didn’t lose their ability to think.

Gary: So it’s possible to find individuals with aphasia that have typical behavior. And so that shows that at least in some cases, one can find cases where language is not necessary. So there are two complications with this. One is that people tend to have aphasia due to a stroke that tends to happen in older age. And so they’ve had a lifetime of experience with language. And so, just because a task doesn’t light up the language network doesn’t mean the task does not rely on language.

It doesn’t mean that language has not played a role in basically setting up the brain that you have as an adult, such that you don’t need language in the moment, but you needed exposure to language to enable you to do the task in the first place.

Abha: We asked Ev what she made of this argument, that even if language isn’t necessary in the moment, it still plays a big role in developing your adult brain. But she doesn’t think it’s as important as Gary does. She refers to another population of people, which are individuals who are born deaf and aren’t taught sign language.

Ev: Unless there are other signers in the community, or unless they’re moved into an environment where they can interact with the signers, they often grow up not having input to language. Especially if they’re in an isolated community. Growing up they figure out some system called home sign, which is a very, very basic system.

And so you can ask whether these individuals are able to develop certain thinking capacities. And it is absolutely the case that having… not having access to language has devastating effects, right? You can’t build relationships in the same way. You can’t learn as easily. Of course, through language I can just tell you all sorts of things about the world. Most of the things you probably know, you learned through language, but it doesn’t seem to be the case that you fundamentally cannot learn certain kinds of complex things.

So there are examples of individuals like that who have been able to learn math. Okay, it takes them longer. If you don’t have somebody to tell you how to do differential equations you can figure it out from whatever ways you can. So it’s certainly the case that language is an incredibly useful tool. And presumably, the accumulation of knowledge over generations that has happened has allowed us to build the world we live in today. But it doesn’t undermine the separability of those language and thinking systems.

Abha: In a lot of areas, it seems that Gary, Steve, and Ev are on the same page. Language has helped humans achieve incredible things, and it’s a very useful tool.

Melanie: But where they seem to differ is on just how much language and thought influence each other, and in which direction the causal arrow is pointing: Does language make us intelligent, or is language is the result of our intelligence? Ev’s work shows that many types of tasks can be done without lighting up the language systems in the brain. When combined with examples from stroke patients and other research, she has reason to believe that language and cognition are largely separate things.

Abha: Gary, on the other hand, isn’t ready to dismiss the role of language so easily — it could still be crucial for developing adult cognition, and, generally speaking, some people might rely on it more than others.

Melanie: And Steve offers one more example of how language can make our learning more efficient, regardless of whether or not it’s strictly necessary.

Steve: So, if you’re an expert in any domain, you know a ton of words and vocabulary about that specific domain that non-experts don’t, right? That’s true in scientific domains if you’re a physicist versus a biologist, but it’s also true in non-scientific domains. People who sew know tons of sewing words and people who are coal miners know tons of coal mining words and I think that those words are, as we were discussing, real technologies. They’re real cultural innovations that are very useful.

That’s why people use those words, because they need to convey a specific meaning in a specific situation. And by having those words, we’re probably able to communicate more efficiently and more effectively about those specific domains. So I think that this kind of ability to create and then learn domain specific vocabularies is probably very important and probably allows us to think all kinds of thoughts that otherwise would be really, really complicated.

Imagine being in a situation where you don’t have the domain specific vocabulary and you have to describe everything, and it becomes very clunky and hard to talk about. That’s why in sciences, especially, we come up with terms, so it really enables us to do things that would be really hard otherwise.

Melanie: Steve isn’t saying that it’s impossible to learn specific skills without language, but from his perspective, it’s more difficult and less likely.

Abha: But Ev has a slightly different view.

Ev: There are cultures, for example, human cultures that don’t don’t have exact math. Like the Peter Ha or the Chimani, like some tribes in the Brazilian Amazon, they don’t have numbers because they don’t need numbers.There are people who will make a claim that they don’t have numbers because they don’t have words for numbers.

And I don’t understand how the logic goes in this direction. I think they don’t have words for numbers because they don’t have the need for numbers in their culture. So they don’t come up with a way to refer to those concepts. Then of course, there’s different stories for why numbers came about. One common story has to do with farming, right?

When you have to keep track of entities that are similar, like 200 cows, and you want to make sure you left with them and came back with whatever 15 cows. And then you figure out some counting system, typically using digits, right? A lot of cultures start with digits. Anyway, and then you come up with words. And once you have labels for words, of course you can then do more things. You can solve tasks that require you to hold onto those.

But it’s not like not having words prevents you from figuring out a system of thought and representation to keep track of that information. So I think the directionality is in a different way than some people have put it forward.

Abha: So Melanie, our question for the spot of the episode was about whether language and thought are separate in the brain. And Ev seems to have very compelling evidence that they’re separate.

Melanie: Yeah, her results with fMRI were really surprising to me.

Abha: Right? Me too. Both Steve and Ev stress that language makes communication between people very efficient, but point out that when people lose their language abilities, say because of a stroke or some other injury, it’s often the case that their thinking, that is their non-linguistic cognitive abilities, are largely unaffected.

Melanie: But Abha, Gary pushed back on this. He noted that people who have had strokes tend to be older with cognitive abilities that they’ve had for a long time. So Gary pointed out that maybe you need language to enable cognition in the first place. And his own research has shown that this is true to some extent.

Abha: I guess there are really two questions here. First, do language and cognition really need to be entangled in the brain during infancy and childhood when both linguistic and cognitive skills are still being formed? And the second is, are language and cognition separate in adults who have established language and cognitive abilities already?

Melanie: Exactly. Ev’s work addresses the latter question, but not the former. And Ev admits that the neuroscience and psychology of language have been contentious fields for a long time. Here’s Ev.

Ev: Language has always been a very controversial field where people have very strong biases and opinions. The best I can do is try to be open minded and just keep training people to do rigorous work and to think hard about even the fundamental assumptions in the field. Those should always be questioned. Everything should always be questioned.

Abha: So here’s another question: what does all of this mean for large language models? In theory, the skills LLMs have exhibited are the same skills that map onto the language systems in the brain. They have the formal competence of patterns and language rules. But, if their foundations are statistical patterns in language, how much thinking can they do now, and in the future? And how much have they learned already?

Murray Shanahan: I mean, people sometimes use the word, an alien intelligence. I prefer the word exotic. It’s a kind of exotic mind-like entity.

◆

If you enjoyed this article…

◆

Learn more about the coaching process or
contact me to discuss your storytelling goals!

◆

Subscribe to the newsletter for the latest updates!

Nature of Intelligence – Episode One – What is Intelligence?

February 24, 2025/in AI, Artificial Intelligence, Biology, Humanity, Neuroscience/by Mark Lovett

I tend to think of storytelling as sitting at the intersection of four elements:

Consciousness — awareness of self, the environment, and our thoughts
Intelligence — ability to learn, understand, reason, and solve problems
Imagination — create mental images, ideas, or concepts beyond reality
Creativity — generate original ideas, solutions, and artistic expressions

They’re different terms, of course, yet you can see how they interact with each other. It’s also apparent that they’re involved in the process of creating stories. They’re so fundamental, in fact, that they go a long way towards describing what makes us human. But the funny thing is, science doesn’t know how to accurately define any of these concepts.

While thousands of hours have been spent seeking answers, and scientists can talk for days on end about their findings, it is still a mystery. Take Shakespeare, for example. How did he utilize these aspects of humanity to create something as magical as Hamlet? And if we can’t properly describe one of these elements, how do we explain how they work together? And extending beyond us mortals, will AI ever be able to replicate this magic?

So when I ran across the third season of Santa Fe Institute’s Complexity podcast, which is devoted to the exploration of Intelligence, I had to listen in, and if you’re interesting in how we create stories in our head, I recommend you do the same, as it looks at the concept of intelligence through a human lens, as well as from the lens of artificial intelligence.

There’s so much information in this first episode, but I wanted to share four quotes that intrigued me. First off is this notion of “common sense”. It seems simple, but again, it’s illusive to capture in words. How would you describe it?

Common sense gives us basic assumptions that help us move through the world and know what to do in new situations. But it gets more complicated when you try to define exactly what common sense is and how it’s acquired. ~ Melanie Mitchell

This notion of an equivalent phenomenon describes much of the human / AI debate, as there is a sense that a machine will never be human, but maybe it can be close enough.

I think there’s a difference between saying, can we reach human levels of intelligence when it comes to common sense, the way humans do it, versus can we end up with the equivalent phenomenon, without having to do it the way humans do it. ~ John Krakauer

This goes back to the reality that we don’t know what makes humans human, so how are we to compare a computer algorithm to what it means to be us?

I think it’s just again, a category mistake to say we’ll have something like artificial general intelligence, because we don’t have natural general intelligence. ~ Alison Gopnik

But we’re more than thinking animals. We have emotions. Fall in love, feel pain, express joy and sorrow. Or in this case, grief. Computers are learning how to simulate emotions such as grief, but is that even possible?

I don’t know what it would mean for a computer to feel grief. I just don’t know. I think we should respect the mystery. ~ John Krakauer

So here goes, take a listen to Episode 1 and see what you think. The transcript is below if you feel so inclined (as I did) to follow along. It’s some heady stuff.

Transcript

Alison Gopnik: It’s like asking, is the University of California Berkeley library smarter than I am? Well, it definitely has more information in it than I do, but it just feels like that’s not really the right question.

Abha Eli Phoboo: From the Santa Fe Institute, this is Complexity.

Melanie Mitchell: I’m Melanie Mitchell.

Abha: And I’m Abha Eli Phoboo.

Abha: Today’s episode kicks off a new season for the Complexity podcast, and with a new season comes a new theme. This fall, we’re exploring the nature and complexity of intelligence in six episodes — what it means, who has it, who doesn’t, and if machines that can beat us at our own games are as powerful as we think they are. The voices you’ll hear were recorded remotely across different locations, including countries, cities and work spaces. But first, I’d like you to meet our new co-host.

Melanie: My name is Melanie Mitchell. I’m a professor here at the Santa Fe Institute. I work on artificial intelligence and cognitive science. I’ve been interested in the nature of intelligence for decades. I want to understand how humans think and how we can get machines to be more intelligent, and what it all means.

Abha: Melanie, it’s such a pleasure to have you here. I truly can’t think of a better person to guide us through what, exactly, it means to call something intelligent. Melanie’s book, Artificial Intelligence: A Guide for Thinking Humans, is one of the top books on AI recommended by The New York Times. It’s a rational voice among all the AI hype in the media.

Melanie: And depending on whom you ask, artificial intelligence is either going to solve all humanity’s problems, or it’s going to kill us. When we interact with systems like Google Translate, or hear the buzz around self-driving cars, or wonder if ChatGPT actually understands human language, it can feel like AIis going to transform everything about the way we live. But before we get carried away making predictions about AI, it’s useful to take a step back. What does it mean to call anything intelligent, whether it’s a computer or an animal or a human child?

Abha: In this season, we’re going to hear from cognitive scientists, child development specialists, animal researchers, and AI experts to get a sense of what we humans are capable of and how AI models actually compare. And in the sixth episode, I’ll sit down with Melanie to talk about her research and her views on AI.

Melanie: To kick us off, we’re going to start with the broadest, most basic question: what really is intelligence, anyway? As many researchers know, the answer is more complicated than you might think.

Melanie: Part One: What is intelligence?

Alison: I’m Alison Gopnik. I’m a professor of psychology and affiliate professor of philosophy and a member of the Berkeley AI Research group. And I study how children manage to learn as much as they do, particularly in a sort of computational context. What kinds of computations are they performing in those little brains that let them be the best learners we know of in the universe?

Abha: Alison is also an external professor with the Santa Fe Institute, and she’s done extensive research on children and learning. When babies are born, they’re practically little blobs that can’t hold up their own heads. But as we all know, most babies become full-blown adults who can move, speak, and solve complex problems. From the time we enter this world, we’re trying to figure out what the heck is going on all around us, and that learning sets the foundation for human intelligence.

Alison: Yeah, so one of the things that is really, really important about the world is that some things make other things happen. So everything from thinking about the way the moon affects the tides to just the fact that I’m talking to you and that’s going to make you change your minds about things. Or the fact that I can pick up this cup and spill the water and everything will get wet. Those really basic cause and effect relationships are incredibly important.

And they’re important partly because they let us do things. So if I know that something is gonna cause a particular effect, what that means is if I wanna bring about that effect, I can actually go out in the world and do it. And it underpins everything from just our everyday ability to get around in the world, even for an infant, to the most incredible accomplishments of science. But at the same time, those causal relationships are kind of mysterious and always have been. How is it? After all, all we see is that one thing happens and another thing follows it. How do we figure out that causal structure?

Melanie: So how do we?

Alison: Yeah, good question. So that’s been a problem philosophers have thought about for centuries. And there’s basically two pieces. And anyone who’s done science will recognize these two pieces. We analyze statistics. So we look at what the dependencies are between one thing and another. And we do experiments. We go out, perhaps the most important way that we understand about causality is you do something and then you see what happens and then you do something again and you say, wait a minute, that happened again.

And part of what I’ve been doing recently, which has been really fun, is just look at babies, even like one year olds. And if you just sit and look at a one year old, mostly what they’re doing is doing experiments. I have a lovely video of my one-year-old grandson with a xylophone and a mallet.

Abha: Of course, we had to ask Alison to show us the video. Her grandson is sitting on the floor with the xylophone, while his grandfather plays an intricate song on the piano. Together, they make a strange duet.

And it’s not just that he makes the noise. He tries turning the mallet upside down. He tries with his hand a bit. That doesn’t make a noise. He tries with a stick end. That doesn’t make a noise. Then he tries it on one bar and it makes one noise. Another bar, it makes another noise. So when the babies are doing the experiments, we call it getting into everything. But I increasingly think that’s their greatest motivation.

Abha: So babies and children are doing these cause and effect experiments constantly, and that’s a major way that they learn. At the same time, they’re also figuring out how to move and use their bodies, developing a distinct intelligence in their motor systems so they can balance, walk, use their hands, turn their heads, and eventually, move in ways that don’t even require much thinking at all.

Melanie: One of the leading researchers on intelligence and physical movement is John Krakauer, a professor of neurology, neuroscience, physical medicine, and rehabilitation at the Johns Hopkins University School of Medicine. John’s also in the process of writing a book.

John Krakauer: I am. I’ve been writing it for much longer than I expected, but now I finally know the story I want to tell. I’ve been practicing it.

Melanie: Well, let me ask, I just want to mention that the subtitle is Thinking versus Intelligence in Animals, Machines and Humans. So I wanted to get your take on what is thinking and what is intelligence.

John: Oh my gosh, thanks Melanie for such an easy softball question.

Melanie: Well, you’re writing a book about it.

John: Well, yes, so… I think I was very inspired by two things. One was how much intelligent adaptive behavior your motor system has even when you’re not thinking about it. The example I always give is when you press an elevator button before you lift your arm to press the button, you contract your gastrocnemius in anticipation that your arm is sufficiently heavy, that if you didn’t do that, you’d fall over because your center of gravity has shifted. So there are countless examples of intelligent behaviors. In other words, they’re goal-directed and accomplish the goal below the level of overt deliberation or awareness.

And then there’s a whole field, what are called long latency stretch reflexes, these below the time of voluntary movement, but sufficiently flexible to be able to deal with quite a lot of variation in the environment and still get the goal accomplished, but it’s still involuntary.

Abha: There’s a lot that we can do without actually understanding what’s happening. Think about the muscles we use to swallow food, or balance on a bike, for example. Learning how to ride a bike takes a lot of effort, but once you’ve figured it out, it’s almost impossible to explain it to someone else.

John: And so it’s what, Daniel Dennett, you know, who recently passed away, but was very influential for me with what he called, competence with comprehension versus competence without comprehension. And, you know, I think he also was impressed by how much competence there is in the absence of comprehension. And yet along came this extra piece, the comprehension, which added to competence and greatly increased the repertoire of our competences.

Abha: Our bodies are competent in some ways, but when we use our minds to understand what’s going on, we can do even more. To go back to Alison’s example of her grandson playing with a xylophone, comprehension allows him, or anyone, playing with a xylophone mallet to learn that each side of it makes a different sound.

If you or I saw a xylophone for the first time, we would need to learn what a xylophone is, what a mallet is, how to hold it, and which end might make a noise if we knocked it against a musical bar. We’re aware of it. Over time we internalize these observations so that every time we see a xylophone mallet, we don’t need to think through what it is and what the mallet is supposed to do.

Melanie: And that brings us to another, crucial part of human intelligence: common sense. Common sense is knowing that you hold a mallet by the stick end and use the round part to make music. And if you see another instrument, like a marimba, you know that the mallet is going to work the same way. Common sense gives us basic assumptions that help us move through the world and know what to do in new situations. But it gets more complicated when you try to define exactly what common sense is and how it’s acquired.

John: Well, I mean, to me, common sense is the amalgam of stuff that you’re born with. So you, you know, any animal will know that if it steps over the edge, it’s going to fall. Right. What you’ve learned through experience that allows you to do quick inference.

So in other words, you know, an animal, it starts raining, it knows it has to find shelter. Right? So in other words, presumably it learns that you don’t want to be wet, and so it makes the inference it’s going to get wet, and then it finds a shelter. It’s a common sense thing to do in a way.

And then there’s the thought version of common sense. Right? It’s common sense that if you’re approaching a narrow alleyway, your car’s not gonna fit in it. Or if you go to a slightly less narrow one, your door won’t open when you open the door. Countless interactions between your physical experience, your innate repertoire, and a little bit of thinking. And it’s that fascinating mixture of fact and inference and deliberation. And then we seem to be able to do it over a vast number of situations, right?

In other words, we just seem to have a lot of facts, a lot of innate understanding of the physical world, and then we seem to be able to think with those facts. And those innate awarenesses. That, to me, is what common sense is. It’s this almost language-like flexibility of thinking with our facts and thinking with our innate sense of the physical world and combinatorially doing it all the time, thousands of times a day. I know that’s a bit waffly. I’m sure Melanie can do a much better job at me than that, but that’s how I see it.

Melanie: No, I think that’s actually a great exposition of what it means. I totally agree. I think it is fast inference about new situations that combines knowledge and sort of reasoning, fast reasoning, and a lot of very basic knowledge that’s not really written down anywhere that we happen to know because we exist in the physical world and we interact with it.

Melanie: So, observing cause and effect, developing motor reflexes, and strengthening common sense are all happening and overlapping as children get older.

Abha: And we’re going to cover one more type of intelligence that seems to be unique to humans, and that’s the drive to understand the world.

John: It turns out, for reasons that physicists have puzzled over, that the universe is understandable, explainable, and manipulatable. The side effect of understanding the world is understandable, is you begin to understand sunsets and why the sky is blue and how black holes work and why water is a liquid and then a gas. It turns out that these are things worth understanding because you can then manipulate and control the universe. And it’s obviously advantageous because humans have taken over entirely.

I have a fancy microphone that I can have a Zoom call with you with. An understandable world is a manipulable world. As I always say, an arctic fox trotting very well across the arctic tundra is not going, “hmm, what’s ice made out of?” It doesn’t care. Now we, for some point between chimpanzees and us, started to care about how the world worked. And it obviously was useful because we could do all sorts of things. Fire, shelter, blah blah blah.

Abha: And in addition to understanding the world, we can observe ourselves observing, a process known as metacognition. If we go back to the xylophone, metacognition is thinking, “I’m here, learning about this xylophone. I now have a new skill.”

And metacognition is what lets us explain what a xylophone is to other people, even if we don’t have an actual xylophone in front of us. Alison explains more.

Alison: So the things that I’ve been emphasizing are these kinds of external exploration and search capacities, like going out and doing experiments. But we know that people, including little kids, do what you might think of as sort of internal search. So they learn a lot, and now they just intrinsically, internally want to say, “what are some things, new conclusions I could draw, new ideas I could have based on what I already know?”

And that’s really different from just what are the statistical patterns in what I already know. And I think two capacities that are really important for that are metacognition and also one that Melanie’s looked at more than anyone else, which is analogy. So being able to say, okay, here’s all the things that I think, but how confident am I about that? Why do I think that? How could I use that learning to learn something new?

Or saying, here’s the things that I already know. Here’s an analogy that would be really different, right? So I know all about how water works. Let’s see, if I think about light, does it have waves the same way that water has waves? So actually learning by just thinking about what you already know.

John: I find myself constantly changing my position on the one hand, this human capacity to sort of look at yourself computing, a sort of meta-cognition, which is consciousness not just of the outside world and of your body, it’s consciousness of your processing of the outside world and your body. It’s almost as though you used consciousness to look inward at what you were doing. Humans have computations and feelings. They have a special type of feeling and computation which together is deliberative. And that’s what I think thinking is, it’s feeling your computations.

Melanie: What John is saying is that humans have conscious feelings — our sensations such as hunger or pain — and that our brains perform unconscious computations, like the muscle reflexes that happen when we press an elevator button. What he calls deliberative thought is when we have conscious feelings or awareness about our computations.

You might be solving a math problem and realize with dismay that you don’t know how to solve it. Or, you might get excited if you know exactly what trick will work. This is deliberative thought — having feelings about your internal computations. To John, the conscious and unconscious computations are both “intelligent,” but only the conscious computations count as “thinking”.

Abha: So Melanie, having listened to John and Alison, I’d like to go back to our original question with you. What do you think is intelligence?

Melanie: Well, let me recap some of what Alison and John said. Alison really emphasized the ability to learn about cause and effect.

What causes what in the world and how we can predict what’s going to happen. And she pointed out that the way we learn this, adults and especially kids, by doing little experiments, interacting with the world and seeing what happens and learning about cause and effect that way. She also stressed our ability to generalize, to make analogies, how situations might be similar to each other in an abstract way. And this underlies what we would call our common sense, that is our basic understanding of the world.

Abha: Yeah, that example of the xylophone and the mallet, that was very intriguing. As both John and Alison said, humans seem to have a unique drive to gain an understanding of the world via experiments like making mistakes, trying things out. And they both emphasize this important role of metacognition or reasoning about one’s own thinking. What do you think of that? You know, how important do you think metacognition is?

Melanie: It’s absolutely essential to human intelligence. It’s really what underlies, I think, our uniqueness. John, you know, made this distinction between intelligence and thinking. To him, you know, most of our, what he would call our intelligent behavior is unconscious. It doesn’t involve metacognition. He called it competence without comprehension. And he reserved the term thinking for conscious awareness of what he called one’s internal computations.

Abha: Even though John and Alison have given us some great insights about what makes us smart, I think both would admit that no one has come to a full, complete understanding of how human intelligence works, right?

Melanie: Yeah, we’re far from that. But in spite of that, big tech companies like OpenAI and DeepMind are spending huge amounts of money in an effort to make machines that, as they say, will match or exceed human intelligence. So how close are they to succeeding? Well, in part two, we’ll look at how systems like ChatGPT learn and whether or not they’re even intelligent at all.

Abha: Part two: How intelligent are today’s machines?

Abha: If you’ve been following the news around AI, you may have heard the acronym LLM, which stands for large language model. It’s the term that’s used to describe the technology behind systems like ChatGPT from OpenAI or Gemini from Google. LLMs are trained to find statistical correlations in language, using mountains of text and other data from the internet. In short, if you ask ChatGPT a question, it will give you an answer based on what it has calculated to be the most likely response, based on the vast amount of information it’s ingested.

Melanie: Humans learn by living in the world — we move around, we do little experiments, we build relationships, and we feel. LLMs don’t do any of this. But they do learn from language, which comes from humans and human experience, and they’re trained on a lot of it. So does this mean that LLMs could be considered to be intelligent? And how intelligent can they, or any form of AI, become?

Abha: Several tech companies have an explicit goal to achieve something called artificial general intelligence, or AGI. AGI has become a buzzword, and everyone defines it a bit differently. But, in short, AGI is a system that has human level intelligence. Now, this assumes that a computer, like a brain in a jar, can become just as smart, or even smarter, than a human with a feeling body. Melanie asked John what he thought about this.

Melanie: You know, I find it confusing when people like Demis Hassibis, who’s the founder, one of the co-founders of DeepMind, and he an interview that AGI is a system that should be able to do pretty much any cognitive task that humans can do. And he said he expects that there’s a 50% chance we’ll have AGI within a decade. Okay, so I emphasize that word cognitive task because that term is confusing to me. But it seems so obvious to them.

John: Yes, I mean, I think it’s the belief that everything non-physical at the task level can be written out as a kind of program or algorithm. I just don’t know… and maybe it’s true when it comes to, you know, ideas, intuitions, creativity.

Melanie: I also asked John if he thought that maybe that separation, between cognition and everything else, was a fallacy.

John: Well, it seems to me, you know, it always makes me a bit nervous to argue with you of all people about this, but I would say, I think there’s a difference between saying, can we reach human levels of intelligence when it comes to common sense, the way humans do it, versus can we end up with the equivalent phenomenon, without having to do it the way humans do it. The problem for me with that is that we, like this conversation we’re having right now, are capable of open-ended, extrapolatable thought. We go beyond what we’re talking about.

I struggle with it but I’m not going to put myself in this precarious position of denying that a lot of problems in the world can be solved without comprehension. So maybe we’re kind of a dead end — comprehension is a great trick, but maybe it’s not needed. But if comprehension requires feeling, then I don’t quite see how we’re going to get AGI in its entirety. But I don’t want to sound dogmatic. I’m just practicing my… my unease about it. Do you know what I mean? I don’t know.

Abha: Alison is also wary of over-hyping our capacity to get to AGI.

Alison: And one of the great old folk tales is called Stone Soup.

Abha: Or you might have heard it called Nail Soup — there are a few variations. She uses this stone soup story as a metaphor for how much our so-called “AI technology” actually relies on humans and the language they create.

Alison: And the basic story of Stone Soup is that, there’s some visitors who come to a village and they’re hungry and the villagers won’t share their food with them. So the visitors say, that’s fine. We’re just going to make stone soup. And they get a big pot and they put water in it. And they say, we’re going to get three nice stones and put it in. And we’re going to make wonderful stone soup for everybody.

They start boiling it. And they say, this is really good soup. But it would be even better if we had a carrot or an onion that we could put in it. And of course, the villagers go and get a carrot and onion. And then they say, this is much better. But you know, when we made it for the king, we actually put in a chicken and that made it even better. And you can imagine what happens.

All the villagers contribute all their food. And then in the end, they say, this is amazingly good soup and it was just made with three stones. And I think there’s a nice analogy to what’s happened with generative AI. So the computer scientists come in and say, look, we’re going to make intelligence just with next token prediction and gradient descent and transformers.

And then they say, but you know, this intelligence would be much better if we just had some more data from people that we could add to it. And then all the villagers go out and add all of the data of everything that they’ve uploaded to the internet. And then the computer scientists say, no, this is doing a good job at being intelligent.

But it would be even better if we could have reinforcement learning from human feedback and get all you humans to tell it what you think is intelligent or not. And all the humans say, OK, we’ll do that. And then and then it would say, you know, this is really good. We’ve got a lot of intelligence here.

But it would be even better if the humans could do prompt engineering to decide exactly how they were going to ask the questions so that the systems could do intelligent answers. And then at the end of that, the computer scientists would say, see, we got intelligence just with our algorithms. We didn’t have to depend on anything else. I think that’s a pretty good metaphor for what’s happened in AI recently.

Melanie: The way AGI has been pursued is very different from the way humans learn. Large language models, in particular, are created with tons of data shoved into the system with a relatively short training period, especially when compared to the length of human childhood. The stone soup method uses brute force to shortcut our way to something akin to human intelligence.

Alison: I think it’s just a category mistake to say things like are LLM’s smart. It’s like asking, is the University of California Berkeley library smarter than I am? Well, it definitely has more information in it than I do, but it just feels like that’s not really the right question. Yeah, so one of the things about humans in particular is that we’ve always had this great capacity to learn from other humans.

And one of the interesting things about that is that we’ve had different kinds of technologies over history that have allowed us to do that. So obviously language itself, you could think of as a device that lets humans learn more from other people than other creatures can do. My view is that the LLMs are kind of the latest development in our ability to get information from other people.

But again, this is not trivializing or debunking it. Those changes in our cultural technology have been among the biggest and most important social changes in our history. So writing completely changed the way that we thought and the way that we functioned and the way that we acted in the world.

At the moment, as people have pointed out, the fact that I have in my pocket a device that will let me get all the information from everybody else in the world mostly just makes me irritated and miserable most of the time. We would have thought that that would have been like a great accomplishment. But people felt that same way about writing and print when they started too. The hope is that eventually we’ll adjust to that kind of technology.

Melanie: Not everyone shares Alison’s view on this. Some researchers think that large language models should be considered to be intelligent entities, and some even argue that they have a degree of consciousness. But thinking of large language models as a type of cultural technology, instead of sentient bots that might take over the world, helps us understand how completely different they are from people. And another important distinction between large language models and humans is that they don’t have an inherent drive to explore and understand the world.

Alison: They’re just sort of sitting there and letting the data waft over them rather than actually going out and acting and sensing and finding out something new.

Melanie: This is in contrast to the one-year-old saying —

Alison: Huh, the stick works on the xylophone. Will it work on the clock or the vase or whatever else it is that you’re trying to keep the baby away from? That’s a kind of internal basic drive to generalize, to think about, okay, it works in the way that I’ve been trained, but what will happen if I go outside of the environment in which I’ve been trained? We have caregivers who have a really distinctive kind of intelligence that we haven’t studied enough, I think, who are looking at us, letting us explore.

And caregivers are very well designed to, even if it feels frustrating when you’re doing it, we’re very good at kind of getting this balance between how independent should the next agent be? How much should we be constraining them? How much should we be passing on our values? How much should we let them figure out their own values in a new environment?

And I think if we ever do have something like an intelligent AI system, we’re going to have to do that. Our role, our relationship to them should be this caregiving role rather than thinking of them as being slaves on the one hand or masters on the other hand, which tends to be the way that we think about them. And as I say, it’s not just in computer science, in cognitive science, probably for fairly obvious reasons, we know almost nothing about the cognitive science of caregiving. So that’s actually what I’m, I just got a big grant, what I’m going to do for my remaining grandmotherly cognitive science years.

Abha: That sounds very fascinating. I’ve been curious to see what comes out of that work.

Alison: Well, let me give you just a very simple first pass, our first experiment. If you ask three and four year olds, here’s Johnny and he can go on the high slide or he can go on the slide that he already knows about. And what will he do if mom’s there? And your intuitions might be, maybe the kids will say, well, you don’t do the risky thing when mom’s there because she’ll be mad about it, right? And in fact, it’s the opposite. The kids consistently say, no, if mom is there, that will actually let you explore, that will let you take risks, that will let you,

Melanie: She’s there to take you to the hospital.

Alison: Exactly, she’s there to actually protect you and make sure that you’re not doing the worst thing. But of course, for humans, it should be a cue to how important caregiving is for our intelligence. We have a much wider range of people investing in much more caregiving.

So not just mothers, but, my favorite post-menopausal grandmothers, but fathers, older siblings, what are called alloparents, just people around who are helping to take care of the kids. And it’s having that range of caregivers that actually seems to really help. And again, that should be a cue for how important this is in our ability to do all the other things we have, like be intelligent and have culture.

Melanie: If you just look at large language models, you might think we’re nowhere near anything like AGI. But there are other ways of training AI systems. Some researchers are trying to build AI models that do have an intrinsic drive to explore, rather than just consume human information.

Alison: So one of the things that’s happened is that quite understandably the success of these large models has meant that everybody’s focused on the large models. But in parallel, there’s lots of work that’s been going on in AI that is trying to get systems that look more like what we know that children are doing. And I think actually if you look at what’s gone on in robotics, we’re much closer to thinking about systems that look like they’re learning the way that children do.

And one of the really interesting developments in robotics has been the idea of building in intrinsic motivation into the systems. So to have systems that aren’t just trying to do whatever it is that you programmed it to do, like open up the door, but systems that are looking for novelty, that are curious, that are trying to maximize this value of empowerment, that are trying to find out all the range of things they could do that have consequences in the world.

And I think at the moment, the LLMs are the thing that everyone’s paying attention to, but I think that route is much more likely to be a route to really understanding a kind of intelligence that looks more like the intelligence that’s in those beautiful little fuzzy heads.

And I should say we’re trying to do that. So we’re collaborating with computer scientists at Berkeley who are exactly trying to see what would happen if we say, give an intrinsic reward for curiosity. What would happen if you actually had a system that was trying to learn in the way that the children are trying to learn?

Melanie: So are Alison and her team on their way to an AGI breakthrough? Despite all this, Alison is still skeptical.

Alison: I think it’s just again, a category mistake to say we’ll have something like artificial general intelligence, because we don’t have natural general intelligence.

Melanie: In Alison’s view, we don’t have natural general intelligence because human intelligence is not really general. Human intelligence evolved to fit our very particular human needs. So, Alison likewise doesn’t think it makes sense to talk about machines with “general intelligence”, or machines that are more intelligent than humans.

Alison: Instead, what we’ll have is a lot of systems that can do different things, that might be able to do amazing things, wonderful things, things that we can’t do. But that kind of intuitive theory that there’s this thing called intelligence that you could have more of or less of, I just don’t think it fits anything that we know from cognitive science.

It is striking how different the view of the people, not all the people, but some of the people who are also making billions of dollars out of doing AI are from, I mean, I think this is sincere, but it’s still true that their view is so different from the people who are actually studying biological intelligences.

Melanie: John suspects that there’s one thing that computers may never have: feelings.

John: It’s very interesting that I always used pain as the example. In other words, what would it mean for a computer to feel pain? And what would it mean for a computer to understand a joke? So I’m very interested in these two things. We have this physical, emotional response. We laugh, we feel good, right? So when you understand a joke, where should the credit go? Should it go to understanding it? Or should it go to the laughter and the feeling that it evokes?

And to my sort of chagrin or surprise or maybe not surprise, Daniel Dennett wrote a whole essay in one of his early books on why computers will never feel pain. He also wrote a whole book on humor. So in other words, it’s kind of wonderful in a way, that whether he would have ended up where I’ve ended up, but at least he understood the size of the mystery and the problem.

And I agree with him, if I understood his pain essay correctly, and it’s influential on what I’m going to write, I just don’t know what it means for a computer to feel pain, be thirsty, be hungry, be jealous, have a good laugh. To me, it’s a category error. Now, if thinking is the combination of feeling… and computing, then there’s never going to be deliberative thought in a computer.

Abha: While talking to John, he frequently referred to pain receptors as the example of how we humans feel with our bodies. But we wanted to know: what about the more abstract emotions, like joy, or jealousy, or grief? It’s one thing to stub your toe and feel pain radiate up from your foot. It’s another to feel pain during a romantic breakup, or to feel happy when seeing an old friend. We usually think of those as all in our heads, right?

John: You know, I’ll say something kind of personal. A close friend of mine called me today to tell me… that his younger brother had been shot and killed in Baltimore. Okay. I don’t want to be a downer. I’m saying it for a reason. And he was talking to me about the sheer overwhelming physicality of the grief that he was feeling. And, I was thinking, what can I say with words to do anything about that pain? And the answer is nothing. Other than just to try.

But seeing that kind of grief and all that it entails, even more than seeing the patients that I’ve been looking after for 25 years, is what leads to a little bit of testiness on my part when one tends to downplay this incredible mixture of meaning and loss and memory and pain. And to know that this is a human being who knows, forecasting into the future, that he’ll never see this person again. It’s not just now. Part of that pain is into the infinite future. Now, all I’m saying is we don’t know what that glorious and sad amalgam is, but I’m not going to just dismiss it away and explain it away as some sort of peripheral computation that we will solve within a couple of weeks, months or years.

Do you see? I find it just slightly enraging, actually. And I just feel that, as a doctor and as a friend, we need to know that we don’t know how to think about these things yet. Right? I just don’t know. And I am not convinced of anything yet. So I think that there is a link between physical pain and emotional pain, but I can tell you from the losses I felt, it’s physical as much as it is cognitive. So grief, I don’t know what it would mean for a computer to feel grief. I just don’t know. I think we should respect the mystery.

Abha: So Melanie, I noticed that John and Alison are both a bit skeptical about today’s approaches to AI. I mean, will it lead to anything like human intelligence? What do you think?

Melanie: Yeah, I think that today’s approaches have some limitations. Alison put a lot of emphasis on the need for an agent to be actively interacting in the world as opposed to passively just receiving language input. And for an agent to have its own intrinsic motivation in order to be intelligent. Alison interestingly sees large language models more like libraries or databases than like intelligent agents. And I really loved her stone soup metaphor where her point is that all the important ingredients of large language models come from humans.

Abha: Yeah, it’s such an interesting illustration because it sort of tells us everything that goes on behind the scene, you know, before we see the output that an LLM gives us. John seemed to think that full artificial general intelligence is impossible, even in principle. He said that comprehension requires feeling or the ability to feel one’s own internal computations. And he didn’t seem to see how computers could ever have such feelings.

Melanie: And I think most people in AI would disagree with John. Many people in AI don’t even think that any kind of embodied interaction with the world is necessary. They’d argue that we shouldn’t underestimate the power of language.

In our next episode, we’ll go deeper into the importance of this cultural technology, as Alison would put it. How does language help us learn and construct meaning? And what’s the relationship between language and thinking?

Steve: You can be really good at language without having the ability to do the kind of sequential, multi-step reasoning that seems to characterize human thinking.

Abha: That’s next time, on Complexity.

Complexity is the official podcast of the Santa Fe Institute. This episode was produced by Katherine Moncure. Our theme song is by Mitch Mignano, and additional music from Blue Dot Sessions.

I’m Abha, thanks for listening.

◆

If you enjoyed this article…

◆

Learn more about the coaching process or
contact me to discuss your storytelling goals!

◆

Subscribe to the newsletter for the latest updates!