
Friday Fun Thread for December 27, 2024

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), nor is it for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

You might have already read it, but I find Terence Tao's impression of a similar model, o1, illuminating:

https://mathstodon.xyz/@tao/113132502735585408

The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student. It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of "(static simulation of a) competent graduate student" is reached, at which point I could see this tool being of significant use in research level tasks.

In the context of AI capabilities, going from ~0% success on a problem set to, say, 30% correct is the difficult, hard-to-predict step. Going from 30% to 80%, on the other hand, seems nigh inevitable.

I would absolutely expect that in a mere handful of years we're going to get self-directed Competent Mathematician levels of performance, with "intuition" and a sense of mathematical elegance. We've gone from "high schooler who's heard of advanced mathematical ideas but fumbles when asked to implement them" to "mediocre grad student" (and mediocre in the eyes of Tao!).

But when you're unmoored from such design, when it feels like you're just taking shots in the dark down possibly pointless paths, I'm honestly not sure what the role of the automated theorem prover is going to be. If you haven't hit on the correct, tidy problem statement, and it just comes back with question marks, then what? If it just says, "Nope, I can't do it with the information you've given me," then what? Is it going to have the intuition to be able to add, "...but ya know, if we add this very reasonable thing, which is actually in line with the context of what you're going for and contributes rather than detracts from the elegance, then we can say..."?

In this context, the existence of automated theorem provers (ATPs) allows models to be rigorously evaluated against ground-truth signals through reinforcement learning. We get an objective function that unambiguously tells us whether the model has correctly solved a problem, without the now-extreme difficulty of having humans usefully grade responses. This allows synthetic data to be used with much more confidence, and it enables a degree of automation: you can permute and modify questions to generate harder ones, and when a solution is found, feed it back in as training data. This is suspected to be why recent thinking models have shown large improvements in maths and coding while stagnating on what you'd think are simpler tasks like writing or poetry (because at a certain point the limitation becomes the human graders, who have no ground truth to go off when asked whether one bit of prose is better than another).
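To make that loop concrete, here's a minimal sketch of verifier-graded data generation. Everything here is illustrative: `model.propose_proof` is a hypothetical API, and the verifier simply shells out to the Lean proof checker, whose exit code serves as the unambiguous reward signal.

```python
import subprocess
import tempfile
from pathlib import Path

def verify_with_lean(proof_source: str) -> bool:
    """Ground-truth grading: the proof checker either accepts or rejects.
    No human grader is involved, so the signal doesn't drift or saturate."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = f.name
    # Exit code 0 means Lean elaborated the file without errors.
    result = subprocess.run(["lean", path], capture_output=True)
    Path(path).unlink()
    return result.returncode == 0

def collect_verified_solutions(model, problems, attempts_per_problem=8):
    """Sample candidate proofs and keep only the verified ones as training data."""
    dataset = []
    for problem in problems:
        for _ in range(attempts_per_problem):
            candidate = model.propose_proof(problem)  # hypothetical model API
            if verify_with_lean(candidate):
                dataset.append((problem, candidate))
                break  # one verified solution per problem suffices here
    return dataset
```

The verified pairs can then go back into fine-tuning, and the permute-and-modify step is exactly where the automation pays off: harder variants that get solved become new training data for free.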

Totally agreed that having rigorous engines able to provide synthetic training data will massively help progress. But my sense is that the data they can generate is still of the type "this works," "this doesn't work," or "here is a counterexample." That can still be massively useful, but it may still run into the context/problem-definition/"elegance" concerns. Given that the back ends are getting good enough to provide the yes/no/counterexample results, I think it's highly likely that LLMs will become solidly good at translating human problem statements into rigorous problem statements for the back end to evaluate, which will be a huge help to the usefulness of those systems... but the jury is still out in my mind as to what extent they'll be able to go further and add appropriate context. It's a lot harder to find data, or synthetic data, for that part.
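As a toy illustration of what "translating a human problem statement into a rigorous one" might look like, the informal claim "the sum of two even numbers is even" could be autoformalized into a Lean statement like the one below (my own example, not the output of any actual system), which the back end can then attempt to prove or reject:

```lean
import Mathlib

-- Informal claim: "the sum of two even numbers is even."
-- One plausible formalization for the back end to evaluate:
theorem even_add_even (a b : Nat) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by omega⟩
```

The part the jury is still out on is everything above the proof itself: picking the right formal rendering when the informal statement is ambiguous, and knowing which extra hypotheses are "reasonable" to add when the literal translation turns out to be unprovable.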