This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
There are definitely going to be massive blind spots with the current architecture. The strawberry thing always felt a little hollow to me though, as it's clearly an artifact of the tokenizer (i.e., GPT doesn't see "strawberry", it sees "[302, 1618, 19772]", the tokenization of "st" + "raw" + "berry"). If you explicitly break the string down into individual letters and ask it, it doesn't have any difficulty (unless it reassembles the string and parses it as three tokens again, which it will sometimes do unless you instruct otherwise).
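To make the tokenizer point concrete, here is a minimal sketch in Python. The three-entry vocabulary is a toy built from the IDs quoted above, not a real GPT vocabulary, and the greedy longest-match `tokenize` helper is a hypothetical stand-in for real BPE.

```python
# Toy vocabulary: the IDs are the ones quoted in the comment above and are
# used purely for illustration; they are not checked against a real tokenizer.
TOY_VOCAB = {"st": 302, "raw": 1618, "berry": 19772}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization (a simplified stand-in for BPE)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest piece that starts at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


word = "strawberry"
token_ids = tokenize(word)

# A character-level reader can count letters directly.
char_count = word.count("r")

# The model's input is three opaque integers with no letters in them.
print(token_ids)   # [302, 1618, 19772]
print(char_count)  # 3
```

Nothing in the list of integers says how many times "r" appears, so the count has to be recalled from training rather than read off the input.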
Likewise with ARC-AGI, comparing o3 performance to human evaluators is a little unkind to the robot, because while humans get these nice pictures, o3 is fed a JSON array of numbers, similar to this. While I agree the visually formatted problem is trivial for humans, if you gave humans the problems in the same format I think you'd see their success rate plummet (and if you enforced the same constraints, e.g., no drawing it out, all your "thinking" has to be done in text form, etc., then I suspect even much weaker models like o1 would be competitive with humans).
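As a rough illustration of the format gap, here is a small Python sketch that shows the same grid as serialized JSON and as a text picture. The grid values and the `SYMBOLS` mapping are made up for this example; it is not a real ARC task and does not reproduce the actual prompt given to any model.

```python
import json

# A made-up 3x3 grid; each integer stands for a colour. Not a real ARC task.
grid = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# Serialized form: one flat string of digits, brackets, and commas.
as_json = json.dumps(grid)

# A rough text "picture" of the same grid, closer to what a human would see.
SYMBOLS = {0: ".", 1: "#"}  # hypothetical colour-to-character mapping
as_picture = "\n".join("".join(SYMBOLS[cell] for cell in row) for row in grid)

print(as_json)
print(as_picture)
```

The plus shape is obvious in the second printout and has to be reconstructed mentally from the first, which is the disadvantage being described.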
I agree that any AI that can't complete these tasks is obviously not "true" AGI. (And it goes without saying that even if an AI could score 100% on ARC it wouldn't prove that it is AGI, either.) The only metric that really matters in the end is whether a model is capable of recursive self-improvement and expanding its own capabilities autonomously. If you crack that nut then everything else is within reach. Is it plausible that an AI could score 0% on ARC and yet be capable of designing, architecting, training, and running a model that achieves 100%? I think it's definitely a possibility, and that's where the fun(?) really begins. All I want to know is how far we are from that.
Edit: Looks like o3 wasn't ingesting raw JSON. I was under the impression that it was because of this tweet from roon (OpenAI employee), but scrolling through my "For You" page randomly surfaced the actual prompt used. Which, to be fair, is still quite far from how a human perceives it, especially once tokenized. But not quite as bad as I made it look originally!
To your point, someone pointed out on the birdsite that ARC and the like are not actually good measures for AGI, since if we use them as the only measures for AGI, LLM developers will warp their models to score well on them. We'll know AGI is here when it actually performs well in general, not just on benchmark tests.
Anyway, this was an interesting dive into tokenization, thanks!