Like many people, I've been arguing about the nature of LLMs a lot over the last few years. There is a particular set of arguments I found myself having to recreate from scratch over and over again in different contexts, so I finally put it together in a larger post, and this is that post.
The crux of it is that I think both the maximalist and minimalist claims about what LLMs can do/are doing are simultaneously true, and not in conflict with one another. A mind made out of text can vary along two axes: the quantity of text it has absorbed, which here I call "coverage," and the degree to which that text has been unified into a coherent model, which here I call "integration." As extreme points on that spectrum, a search engine is high coverage but low integration, while an individual person is low coverage but high integration; LLMs are intermediate between the two. And most importantly, every point on that spectrum is useful for different kinds of tasks.
I'm hoping this will be a more useful way of thinking about LLMs than the ways people have typically talked about them so far.
Notes -
It should be possible to train both "notice when something in its context window is wrong and say so" and "notice when something in its context window was said by the assistant persona it is being trained to write as", and I don't think either of those objectives would incentivize "say wrong things while writing in the assistant persona".
That said, if you are specifically referring to the behavior of "accurately indicate your confidence level in the thing you are about to say, and then say the thing", that does seem like a much more difficult behavior to train (still possible, since LLMs have a nonzero ability to plan ahead, but finicky and easy to screw up). But if it's fine for the evaluation-of-confidence step to come after the reasoning step, the task is much easier (and in fact that's what the chain-of-thought prompting technique aims to do).
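To make the ordering concrete, here's a minimal sketch of "confidence after reasoning" prompting. The prompt wording is just one way to phrase it, and `call_model` is a hypothetical stand-in for whatever completion API you actually use, not a specific library's interface:

```python
# Minimal sketch: ask for the reasoning and answer first, and the confidence
# estimate last, so the confidence can condition on reasoning the model has
# already produced. `call_model` is a hypothetical completion function.

PROMPT_TEMPLATE = """Question: {question}

Think through the problem step by step and give your answer. Then, on the final
line, write "Confidence: X%" where X is how likely you think your answer is correct."""

def answer_with_confidence(question: str, call_model):
    response = call_model(PROMPT_TEMPLATE.format(question=question))
    # Peel the trailing confidence line off from the reasoning + answer.
    lines = response.strip().splitlines()
    if lines and lines[-1].lower().startswith("confidence:"):
        return "\n".join(lines[:-1]), lines[-1]
    return response, "Confidence: (not reported)"
```

The point is just the ordering: the model never has to commit to a number before it has produced its reasoning tokens.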
Also, if you're interested in the interpretability side of things specifically, you might find "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" interesting.
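The rough idea, as I understand it, is to find directions in a model's activations that correlate with truthful answers and nudge a few attention-head outputs along those directions at inference time. Here's a toy PyTorch sketch of that general kind of activation steering; the single linear layer and the random "truthful direction" are placeholders for illustration, not the paper's actual setup:

```python
import torch
import torch.nn as nn

# Toy sketch of inference-time activation steering: add a fixed direction to
# one layer's output during the forward pass via a hook. In practice the
# direction would be estimated from probe data; here it's a random placeholder.

hidden_dim = 64
layer = nn.Linear(hidden_dim, hidden_dim)     # stand-in for an attention head's output projection
truthful_direction = torch.randn(hidden_dim)  # placeholder steering vector
truthful_direction /= truthful_direction.norm()
alpha = 5.0                                   # intervention strength

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * truthful_direction

handle = layer.register_forward_hook(steer)

x = torch.randn(1, hidden_dim)
steered = layer(x)   # output now includes the added direction
handle.remove()      # remove the hook to restore normal behavior
```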
The level of interpretability you want is currently beyond us, but I expect that over time that situation will improve quite a lot (I think well under a thousand person-years have been spent on this particular type of interpretability research so far, and even that estimate might be an order of magnitude or two high).