Friday Fun Thread for December 27, 2024

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that); this thread is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

As the old saying goes, "Context is that which is scarce." I know, I know, it's all the rage to try to shove as much context into the latest LLM as you can. People are even suggesting organization design based around the idea. It's really exciting to see automated theorem provers starting to become possible. The best ones still use rigorous back-end engines rather than just pure inscrutable giant matrices. They can definitely speed up some things. But the hard part is not necessarily solving problems. Don't get me wrong, it's a super useful skill; I'm over the moon that I have multiple collaborators who are genuinely better than me at solving certain types of problems. They're a form of automated theorem prover from my perspective. No, the hard part is figuring out what question to ask. It has to be a worthwhile question. It has to be possible. It has to have some hope of leading to "elegance", even if some of the intermediate questions along the way seem like they're getting more and more inelegant.

Homework questions... and even the contest questions that folks are so fond of benchmarking with... have been extremely designed. Give your cousin a couple more years of doing actual research, and he'll tell you all about how much he loves his homework problems. Not necessarily because they're "easy". They might still be very hard. But they're the kind of hard that is designed to be hard... but designed to work. Designed to be possible. Designed to have a neat and tidy answer in at most a page or two (for most classes; I have seen some assigned problems that brutally extended into 5-6 pages' worth of necessary calculation). But when you're unmoored from such design, feeling like you might just be taking shots in the dark, going down possibly completely pointless paths, I'm honestly not sure what the role of the automated theorem prover is going to be. If you haven't hit on the correct, tidy problem statement, and it just comes back with question marks, then what? If it just says, "Nope, I can't do it with the information you've given me," then what? Is it going to have the intuition to be able to add, "...but ya know, if we add this very reasonable thing, which is actually in line with the context of what you're going for and contributes rather than detracts from the elegance, then we can say..."? Or is it going to be like an extremely skilled grad-student-level problem solver, who you can very quickly ping to get intermediate results, counterexamples, etc. that help you along the way? Hopefully, it won't come back with a confident-sounding answer every time that you then have to spend the next few days closely examining for an error. (This will be better mitigated the more they're tied into rigorous back-ends.) I don't know; nobody really knows yet. But it's gonna be fun.
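
To make that contrast concrete, here's a toy Lean 4 sketch (the statements are entirely my own made-up examples, assuming a recent toolchain with the `omega` tactic): a "designed" statement that a standard tactic closes instantly, next to an "unmoored" one that is false as written, where all the prover can do is refuse or hand back the counterexample rather than tell you which hypothesis you forgot.

```lean
-- A "designed" homework-style statement: tidy, and a standard tactic closes it.
theorem designed (n : Nat) : n + 0 = n := by
  simp

-- An "unmoored" statement: `n - 1 + 1 = n` is false as written (it fails at
-- n = 0, since Nat subtraction truncates), so a prover can only refuse or
-- hand back that counterexample; it won't volunteer that `0 < n` was the
-- missing piece of context. With the hypothesis added, it goes through:
theorem repaired (n : Nat) (h : 0 < n) : n - 1 + 1 = n := by
  omega
```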

You might have already read it, but I find Terence Tao's impression of a similar model, o1, illuminating:

https://mathstodon.xyz/@tao/113132502735585408

The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student. It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of "(static simulation of a) competent graduate student" is reached, at which point I could see this tool being of significant use in research level tasks.

In the context of AI capabilities, going from ~0% success to being, say, 30% correct on a problem set is difficult and hard to predict. Going from 30% to 80%, on the other hand, seems nigh inevitable.

I would absolutely expect that in a mere handful of years we're going to get self-directed Competent Mathematician levels of performance, with "intuition" and a sense of mathematical elegance. We've gone from "high schooler who's heard of advanced mathematical ideas but fumbles when asked to implement them" to "mediocre grad student" (and mediocre in the eyes of Tao!).

But when you're unmoored from such design, feeling like you might just be taking shots in the dark, going down possibly completely pointless paths, I'm honestly not sure what the role of the automated theorem prover is going to be. If you haven't hit on the correct, tidy problem statement, and it just comes back with question marks, then what? If it just says, "Nope, I can't do it with the information you've given me," then what? Is it going to have the intuition to be able to add, "...but ya know, if we add this very reasonable thing, which is actually in line with the context of what you're going for and contributes rather than detracts from the elegance, then we can say..."?

In this context, the existence of ATPs allows models to be rigorously evaluated on ground-truth signals through reinforcement learning. We have an objective function that unambiguously tells us whether the model has correctly solved a problem, without the now-extreme difficulty of having humans usefully grade responses. This allows synthetic data to be used with much more confidence, and with a degree of automation, since you can permute and modify questions to develop more difficult ones and then, when a solution is found, use it as training data. This is suspected to be why recent thinking models have shown large improvements in maths and coding while stagnating on what you'd think are simpler tasks like writing or poetry (because at a certain point the limiting factor becomes human graders, who have no ground truth to go off when asked whether one bit of prose is better than another).
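
Roughly the loop I have in mind, as a toy sketch rather than anyone's actual pipeline (every name here is made up; the "verifier" is exact arithmetic standing in for a proof checker, and the "model" is a random stub):

```python
import random

# Toy sketch of verifier-in-the-loop synthetic data generation: permute a seed
# problem, sample candidate solutions, keep only the ones a ground-truth check
# accepts, and recycle those as training data.

def permute_problem(seed: int) -> tuple[str, int]:
    """Generate a problem variant together with its ground-truth answer."""
    rng = random.Random(seed)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"Compute {a} * {b}.", a * b

def sample_model_answer(problem: str, rng: random.Random) -> int:
    """Stand-in for sampling the model being trained; sometimes right, sometimes not."""
    a, b = [int(tok.rstrip(".")) for tok in problem.split() if tok.rstrip(".").isdigit()]
    return a * b + rng.choice([0, 0, 0, 1, -1])

def verify(answer: int, truth: int) -> bool:
    """The ground-truth signal: unambiguous, no human grader in the loop."""
    return answer == truth

def build_training_set(n: int) -> list[tuple[str, int]]:
    rng = random.Random(0)
    kept = []
    for seed in range(n):
        problem, truth = permute_problem(seed)
        answer = sample_model_answer(problem, rng)
        if verify(answer, truth):            # reward only verified solutions...
            kept.append((problem, answer))   # ...and keep them as synthetic training data
    return kept

if __name__ == "__main__":
    print(f"kept {len(build_training_set(200))} verified problem/solution pairs")
```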

Totally agreed that having rigorous engines that are able to provide synthetic training data will massively help progress. But my sense is that the data they can generate is still of the type, "This works," "This doesn't work," or, "Here is a counterexample." Which can still be massively useful, but may still run into the context/problem definition/"elegance" concerns. Given that the back ends are getting good enough to provide the yes/no/counterexample results, I think it's highly likely that LLMs will become solidly good at translating human problem statements into rigorous problem statements for the back end to evaluate, which will be a huge help to the usefulness of those systems... but the jury is still out in my mind as to what extent they'll be able to go further and add appropriate context. It's a lot harder to find data or synthetic data for that part.
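
For what it's worth, that yes/no/counterexample interface already exists in miniature in SMT solvers. A tiny sketch with the z3-solver Python bindings (the claims are toy ones I picked just to show the two outcomes):

```python
from z3 import Int, Solver, Not  # pip install z3-solver

n = Int("n")

# "This works": search for a counterexample to a true claim and find none.
s = Solver()
s.add(n >= 0, Not(n - 1 + 1 == n))   # over the integers this claim always holds
print(s.check())                     # unsat -> no counterexample exists

# "Here is a counterexample": the claim n*n >= n + 1 fails for small n.
s2 = Solver()
s2.add(n >= 0, Not(n * n >= n + 1))
print(s2.check(), s2.model())        # sat, with a model like n = 0 or n = 1
```

The part neither of these engines does is tell you which "very reasonable thing" to add to an ill-posed statement; they just report the verdict on whatever you handed them.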