Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), this thread is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.
My younger cousin is a mathematician currently doing an integrated Masters and PhD. About a year back, I'd been trying to demonstrate to him the ever-increasing capability of SOTA LLMs at maths, and asked him to pose questions that they couldn't trivially answer.
He chose "is the one-point compactification of a Hausdorff space itself Hausdorff?".
At the time, all the models invariably insisted that the answer was no. I ran the prompt multiple times on the best models available then. My cousin said this was incorrect, and proceeded to sketch out a proof (which was quite simple once I finally understood that much of the jargon represented rather simple ideas at their core).
I ran into him again when we were both visiting home, and I decided to run the same question through the latest models to gauge their improvement.
I tried Gemini 1206, Gemini Flash Thinking Experimental, Claude 3.5 Sonnet (New) and GPT-4o.
Other than reinforcing the fact that AI companies have abysmal naming schemes, the exercise surprised me: almost all of them gave the correct answer. The exception was Claude, though it was hampered by Anthropic being cheapskates and turning on the concise responses mode.
I showed him how the extended reasoning worked for Gemini Flash (unlike o1, it doesn't hide its thinking tokens), and I could tell that he was shocked/impressed; he couldn't fault the reasoning process it and the other models went through.
To further shake him up, I had him find some recent homework problems he'd been assigned in his course (he's in a top-3 maths program in India) and used Gemini's inherent multimodality to just take a picture of an extended question and ask it to solve it. It did so, again, flawlessly.
He then demanded we try another, and this time he expressed doubts that the model could handle a problem that was compact, yet vague in the absence of context it hadn't been given. No surprises again: it handled that one too.
He admitted that this was the first time he'd taken my concerns seriously, though he got a rib in by saying doctors would be off the job market before mathematicians. I conjectured that was unlikely, given that maths and CS performance are more immediately beneficial to AI companies, since they're easier to drop in and automate, while also having direct benefits for ML, the goal being to replace human programmers and have the models recursively self-improve. Not to mention that performance in those domains is easier to make superhuman with the use of RL and automated theorem provers for ground truth. Oh well, I reassured him, we're probably all screwed, and in short order, to the point where there's not much benefit in quibbling over whose layoffs come a few months later.
How long do you have to stay in the UK before they can't deport you? (4 years?) What happens will happen; I am less concerned with the economic situation, because I think that after a brief period of chaos it will be resolved very quickly one way or the other. I'm more interested in the spiritual one: even last week, people here were arguing with me that these models don't capture something fundamental about human cognition.
I believe Indefinite Leave to Remain nominally takes 5 years, but with bureaucratic slowness, closer to 6 in practice.
I agree that economic turmoil will probably be a rapid shock. But I'm unsure whether "rapid" implies months or years of unemployment and uncertainty. Either way, all I can do is save enough money and hope to weather it.
On the plus side, if NHS workers were fired immediately when they became redundant, the service would be rather smaller haha.
As the old saying goes, "Context is that which is scarce." I know, I know, it's all the rage to try to shove as much context into the latest LLM as you can. People are even suggesting organization design based around the idea. It's really exciting to see automated theorem provers starting to become possible. The best ones still use rigorous back-end engines rather than just pure inscrutable giant matrices. They can definitely speed up some things. But the hard part is not necessarily solving problems. Don't get me wrong, it's a super useful skill; I'm over the moon that I have multiple collaborators who are genuinely better than me at solving certain types of problems. They're a form of automated theorem prover from my perspective. No, the hard part is figuring out what question to ask. It has to be a worthwhile question. It has to be possible. It has to have some hope of leading to "elegance", even if some of the intermediate questions along the way seem like they're getting more and more inelegant.
Homework questions... and even the contest questions that folks are so fond of benchmarking with... have been extremely designed. Give your cousin a couple more years of doing actual research, and he'll tell you all about how much he loves his homework problems. Not necessarily because they're "easy". They might still be very hard. But they're the kind of hard that is designed to be hard... but designed to work. Designed to be possible. Designed to have a neat and tidy answer in at most a page or two (for most classes; I have seen some assigned problems that brutally extended into 5-6 pages worth of necessary calculation).

But when you're unmoored from such design, feeling like you might just be taking shots in the dark, going down possibly completely pointless paths, I'm honestly not sure what the role of the automated theorem prover is going to be. If you haven't hit on the correct, tidy problem statement, and it just comes back with question marks, then what? If it just says, "Nope, I can't do it with the information you've given me," then what? Is it going to have the intuition to be able to add, "...but ya know, if we add this very reasonable thing, which is actually in line with the context of what you're going for and contributes rather than detracts from the elegance, then we can say..."? Or is it going to be like an extremely skilled grad-student-level problem solver, who you can very quickly ping to get intermediate results, counterexamples, etc. that help you along the way? Hopefully, it won't come back with a confident-sounding answer every time that you then have to spend the next few days closely examining for an error. (This will be better mitigated the more they're tied into rigorous back-ends.) I don't know; nobody really knows yet. But it's gonna be fun.
You might have already read it, but I find Terence Tao's impression of a similar model, o1, illuminating:
https://mathstodon.xyz/@tao/113132502735585408
In the context of AI capabilities, going from ~0% success to being, say, 30% correct on a problem set is difficult and hard to predict. Going from 30% to 80%, on the other hand, seems nigh inevitable.
I would absolutely expect that in a mere handful of years we're going to get self-directed Competent Mathematician levels of performance, with "intuition" and a sense of mathematical elegance. We've gone from "high schooler who's heard of advanced mathematical ideas but fumbles when asked to implement them" to "mediocre grad student" (and mediocre in the eyes of Tao!).
In this context, the existence of ATPs allows models to be rigorously evaluated on ground-truth signals through reinforcement learning. We have an objective function that unambiguously tells us whether the model has correctly solved a problem, without the now-extreme difficulty of having humans usefully grade responses. This allows for the use of synthetic data with much more confidence, and a degree of automation, since you can permute and modify questions to develop more difficult ones and then, when a solution is found, use it as training data. This is suspected to be why recent thinking models have shown large improvements in maths and coding while stagnating on what you'd think are simpler tasks like writing or poetry (because at a certain point the limitation becomes human graders, who have no ground truth to go off when asked whether one bit of prose is better than another).
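To make that loop concrete, here's a minimal sketch in Python of how I picture it working. Every function, and the model interface, is a hypothetical stand-in of my own for illustration, not anything a lab has published:

```python
# Hypothetical sketch of RL from verifier feedback; every name below is made up
# for illustration and is not any lab's actual pipeline.

class DummyModel:
    """Stand-in for a policy model with sampling and update hooks."""

    def sample(self, problem: str) -> str:
        return f"Attempted proof of: {problem}. QED"

    def reinforce(self, problem: str, candidate: str, reward: float) -> None:
        pass  # a real system would apply e.g. a policy-gradient update here


def check_proof(problem: str, candidate: str) -> bool:
    """Toy stand-in for a rigorous back end (a proof checker) that returns an
    unambiguous pass/fail verdict, with no human grader in the loop."""
    return candidate.strip().endswith("QED")  # placeholder check only


def permute_problem(problem: str) -> str:
    """Toy stand-in for generating a harder or perturbed variant of a solved problem."""
    return f"{problem} (harder variant)"


def training_round(model, problems):
    """Sample attempts, reward them on verifier verdicts, and harvest verified
    solutions as synthetic training data plus new, harder problems."""
    synthetic_data, harder_problems = [], []
    for problem in problems:
        candidate = model.sample(problem)
        reward = 1.0 if check_proof(problem, candidate) else 0.0
        model.reinforce(problem, candidate, reward)
        if reward == 1.0:
            synthetic_data.append((problem, candidate))
            harder_problems.append(permute_problem(problem))
    return synthetic_data, harder_problems


data, next_problems = training_round(DummyModel(), ["toy problem statement"])
```

The point is just that the reward is a hard pass/fail from the checker, and the verified (problem, solution) pairs feed back in as training data while the solved problems get permuted into harder ones.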
I just want to add a little bit from Zvi's latest:
The abilities are impressive, and I actually wouldn't be surprised if it's able to perform admirably on Tier 4 "closed-answer" problems, especially as they get better and better at using rigorous back-end engines. But notice what they're expecting. They're expecting to have teams of top tier mathematicians spend a significant amount of time crafting "closed-answer problems". That really is probably where the bottleneck is, and Zvi's offhand comment is also in that vein. One possible end state is that these algorithms become an extremely useful 'calculator-on-steroids' that, like calculators, programming languages, and other automated theorem proving tools before, supercharges mathematical productivity under the guidance and direction of intuitive humans trying to push forward human understanding of human-relevant/human-interesting subject domains. Another possible end state is that the algorithms will get 'smart' enough to have all that human context, human intuition, and understanding of human-relevance/human-interestingness and be able to actually drop-and-replace human math folks. I suppose a third possible end state would be that a society of super advanced AIs go off and create their own math that humans can tell somehow is objectively good, but that they have to work and struggle to try to understand bits and pieces of (see also the computer chess championship). I really don't have any first principles to guide my reasoning of which of these end states we'll end up in. It really feels to me like a 'wait, watch, and see' situation.
I would put the last option as the most likely over a time frame greater than a decade or two, but the first two options could be intermediate stages, though I don't expect either of them to last more than a few years. My reasoning is largely that, much like chess, when the reward signal is highly legible it becomes far easier to optimize for, and diminishing returns != nil returns, and are probably still positive-EV returns.
But you're right, the only way to find out is to strap in for the ride. We live in interesting times.
Totally agreed that having rigorous engines able to provide synthetic training data will massively help progress. But my sense is that the data they can generate is still of the type "This works," "This doesn't work," or "Here is a counterexample." That can still be massively useful, but it may still run into the context/problem-definition/"elegance" concerns. Given that the back ends are getting good enough to provide the yes/no/counterexample results, I think it's highly likely that LLMs will become solidly good at translating human problem statements into rigorous problem statements for the back end to evaluate, which will be a huge help to the usefulness of those systems... but the jury is still out, in my mind, as to what extent they'll be able to go further and add appropriate context. It's a lot harder to find data or synthetic data for that part.
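To pin down the division of labour I'm imagining, here's a small Python sketch; the function names and stub bodies are entirely made up, and the point is only the shape of the pipeline: the LLM handles the fuzzy translation step, the rigorous back end only ever returns one of the three verdicts above, and nothing in it addresses whether the question was worth asking:

```python
# Hypothetical sketch of the "LLM front end plus rigorous back end" split.
# All functions are illustrative stubs, not a real API.

from enum import Enum


class Verdict(Enum):
    PROVED = "this works"
    DISPROVED = "this doesn't work"
    COUNTEREXAMPLE = "here is a counterexample"


def formalize(informal_problem: str) -> str:
    """The LLM's job: translate a human problem statement into a rigorous one
    (imagine it emitting, say, a Lean statement). Stubbed here."""
    return f"theorem conjecture : {informal_problem} := sorry"


def rigorous_backend(formal_statement: str) -> Verdict:
    """The back end's job: an unambiguous verdict, and nothing about whether
    the question was elegant or worth asking in the first place."""
    return Verdict.COUNTEREXAMPLE  # placeholder verdict


def assist(informal_problem: str) -> Verdict:
    # Note what this pipeline does NOT do: suggest the "very reasonable extra
    # hypothesis" that would rescue an inelegant or impossible statement.
    return rigorous_backend(formalize(informal_problem))


print(assist("an informal conjecture goes here"))
```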