Like many people, I've been arguing about the nature of LLMs a lot over the last few years. There is a particular set of arguments I found myself having to recreate from scratch over and over again in different contexts, so I finally put it together in a larger post, and this is that post.
The crux of it is that I think both the maximalist and minimalist claims about what LLMs can do/are doing are simultaneously true, and not in conflict with one another. A mind made out of text can vary along two axes: the quantity of text it has absorbed, which here I call "coverage," and the degree to which that text has been unified into a coherent model, which here I call "integration." As extreme points on that spectrum, a search engine is high coverage but low integration, while an individual person is low coverage but high integration; LLMs are intermediate between the two. And most importantly, every point on that spectrum is useful for different kinds of tasks.
I'm hoping this will be a more useful way of thinking about LLMs than the ways people have typically talked about them so far.
Notes -
It should be possible to train both "notice when something in its context window is wrong and say so" and "notice when something in its context window was said by the assistant persona it is being trained to write as", and I don't think either of those objectives would incentivize "say wrong things while writing in the assistant persona".
That said, if you are specifically referring to the behavior of "accurately indicate your confidence level in the thing you are about to say, and then say the thing", that does seem like a much more difficult behavior to train (still possible, since LLMs have a nonzero ability to plan ahead, but finicky and easy to screw up). But if it's fine for the evaluation-of-confidence step to come after the reasoning step, the task is much easier (and in fact that's what the chain-of-thought prompting technique aims to do).
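To make the ordering concrete, here's a minimal sketch of "confidence after reasoning" prompting. The prompt wording is just one way to phrase it, and `call_model` is a hypothetical stand-in for whatever completion API you actually use, not a specific library's interface:

```python
# Minimal sketch: ask for the reasoning and answer first, and the confidence
# estimate last, so the confidence can condition on reasoning the model has
# already produced. `call_model` is a hypothetical completion function.

PROMPT_TEMPLATE = """Question: {question}

Think through the problem step by step and give your answer. Then, on the final
line, write "Confidence: X%" where X is how likely you think your answer is correct."""

def answer_with_confidence(question: str, call_model):
    response = call_model(PROMPT_TEMPLATE.format(question=question))
    # Peel the trailing confidence line off from the reasoning + answer.
    lines = response.strip().splitlines()
    if lines and lines[-1].lower().startswith("confidence:"):
        return "\n".join(lines[:-1]), lines[-1]
    return response, "Confidence: (not reported)"
```

The point is just the ordering: the model never has to commit to a number before it has produced its reasoning tokens.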
Also, if you're interested in the interpretability side of things specifically, you might find "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" interesting.
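The rough idea, as I understand it, is to find directions in a model's activations that correlate with truthful answers and nudge a few attention-head outputs along those directions at inference time. Here's a toy PyTorch sketch of that general kind of activation steering; the single linear layer and the random "truthful direction" are placeholders for illustration, not the paper's actual setup:

```python
import torch
import torch.nn as nn

# Toy sketch of inference-time activation steering: add a fixed direction to
# one layer's output during the forward pass via a hook. In practice the
# direction would be estimated from probe data; here it's a random placeholder.

hidden_dim = 64
layer = nn.Linear(hidden_dim, hidden_dim)     # stand-in for an attention head's output projection
truthful_direction = torch.randn(hidden_dim)  # placeholder steering vector
truthful_direction /= truthful_direction.norm()
alpha = 5.0                                   # intervention strength

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * truthful_direction

handle = layer.register_forward_hook(steer)

x = torch.randn(1, hidden_dim)
steered = layer(x)   # output now includes the added direction
handle.remove()      # remove the hook to restore normal behavior
```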
The level of interpretability you want is currently beyond us, but I expect that over time that situation will improve quite a lot (I think well under a thousand person-years have been spent on this particular type of interpretability research so far, and even that estimate might be an order of magnitude or two high).