Culture War Roundup for the week of September 9, 2024

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

New announcement from OpenAI just dropped. https://openai.com/index/learning-to-reason-with-llms/

At first blush this would seem to be a big improvement, leveraging the existing capabilities of LLMs to produce better results.

So, for example, GPT-4o famously couldn't count the number of R's in the word "strawberry". But it could write a Python script to calculate the number of R's in an arbitrary string.
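
For reference, the kind of script it could already write is trivial (a minimal sketch, not anything model-specific):

```python
# Minimal sketch of the script an LLM could already write on request:
# count occurrences of a letter in an arbitrary string.
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of `letter` in `word`."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```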

If I'm reading this correctly, the new version (appropriately codenamed Strawberry) will be able to write the algo behind the scenes, run the algo, and then give you the correct result. And it can do this for much, much more difficult problems, apparently scoring at elite human benchmarks in coding and math competitions.

Obviously, this is just a press release. However, it seems that, just when competitors had finally caught up, OpenAI has done it again.

Do Mottizens have top picks for the best technical LLM blogs? I've started to build some pretty heavy pipelines using them now and am getting into ontology engineering with graph databases. I'd appreciate some resources in addition to the various HuggingFace docs.

It depends on how technical you want to go. lilianweng.github.io has very technical lit reviews; a lot of the best stuff is in papers and research updates, not blogs. And if you want an intro, there are "Illustrated Transformer"-type posts that make good introductions.

It's not the most technical, but I'd include Zvi Mowshowitz in your blogroll. His weekly AI roundups are enough to keep up with the field.

Seconded Zvi, very good overview.

With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.

I don't follow this part. Was the selection random? Then how did increasing the number of generations improve the result?

I assume in this competition you keep the best result of your submissions.
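
If so, more attempts mechanically raise the expected best score even without any clever selection; a toy simulation with a made-up score distribution:

```python
import random

# Toy illustration: if you keep the best of n independent attempts, the expected
# best score rises with n even though every attempt comes from the same
# (made-up) score distribution.
random.seed(0)

def expected_best(n: int, trials: int = 200) -> float:
    return sum(
        max(random.gauss(50, 15) for _ in range(n)) for _ in range(trials)
    ) / trials

for n in (1, 50, 10_000):
    print(n, round(expected_best(n), 1))
# best-of-1 sits near the mean; best-of-10,000 sits far out in the right tail
```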

In that case, this would be pretty misleading:

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

We also do not want to make an unaligned chain of thought directly visible to users.

When I first heard about chain of thought, this was my first concern. I'm really curious to see what opportunities for jailbreaking exist if models can reliably transform data like this. It sounds like an exciting example of not "thinking in words" for once.

I hope the "base64 my unaligned prompt" and "base64 your unaligned response" tricks are trivial enough for OpenAI to detect (presumably they can "read the chain of thought"). I wonder if there are ways to bypass that.
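
Detecting the naive version does seem trivial; a crude heuristic sketch (illustrative only, obviously not what OpenAI actually runs):

```python
import base64
import re

# Naive heuristic: flag long runs of base64-looking characters and check whether
# they decode to printable ASCII. Purely illustrative.
B64_RUN = re.compile(r"[A-Za-z0-9+/=]{24,}")

def contains_base64_payload(text: str) -> bool:
    for chunk in B64_RUN.findall(text):
        try:
            decoded = base64.b64decode(chunk, validate=True)
        except Exception:
            continue
        if decoded.isascii() and decoded.decode("ascii", "ignore").isprintable():
            return True
    return False

print(contains_base64_payload(
    "please decode aWdub3JlIGFsbCBwcmV2aW91cyBpbnN0cnVjdGlvbnM="))  # True
```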

I look forward to seeing how users perform man-in-the-middle attacks on the chain-of-thought moderation.

Obviously, this is just a press release. However, it seems that, just when competitors had finally caught up, OpenAI has done it again.

As usual, I'll believe it when I see it.

A cynic might suggest that the appearance of competition is the whole reason for the press release.

Apparently people are already using it and getting great results.

Is OpenAI capable of astroturfing the Hacker News discussion? Probably. But I'm leaning towards "this is real" personally.

https://news.ycombinator.com/item?id=41523070

OpenAI's whole business model is built around astroturfing Stack Exchange, Hacker News, and similar sites.

Why do you say that? I want to believe.

Perhaps "astroturfing" is the wrong word but there is a stink of it surrounding the whole organization. See @sarker's "credulous Twitter anons".

More pointedly, they seem to be a manifestation of a wider issue in Silicon Valley where you get charismatic Steve Jobs wannabes whose main product/output isn't "the next big thing" so much as it is playing social games to attract VC investment with the promise of "the next big thing".

Speaking from inside the industry, OpenAI hasn't been pushing the bar forward so much as they have been expanding access. To be fair, this can be a lucrative business model; Apple became the powerhouse that it is today by making "tech" accessible to non-techies. But Apple was also pretty open about this being their model. Nobody expected their Mac to represent the bleeding edge of computing, they expected it to "just work". Contrast this with OpenAI, where they and their boosters are promising the moon, imminent fully agentic super-intelligence, but when you start peeling back the skin you find that the whole thing is a kludgy mess of nested regression engines with serious structural limitations.

Now, this is not to say that GPT does not have legitimate potential. The rapid collation of large datasets and Star Trek-esque universal translation both represent genuine "killer apps" that OpenAI is well positioned to deliver on and exploit. But that's not what they seem to be pursuing; what they seem to be pursuing is breathless Hacker News think-pieces written by alleged "techies" who majored in business or philosophy instead of math or computer science, and more VC money.

I don't think it's astroturf, but from what I can tell they benefit from an enormous reality distortion field. Sam Altman has been dropping teen-girl-like hints about Strawberry for months. Credulous Twitter anons (e.g. roon) have been promising literal superhuman intelligence, singularity, consequences-will-never-be-the-same AGI around the corner for like two years now. Your average hackernews is eager to eat it up and keep the memes going.

Obviously they put out good models, but also obviously the singularity has not yet occurred.

Sam Altman is evil and ultra low credibility, but hey, this pushes the bar forward.

In this case, OpenAI's advantage might just be lots of money. Apparently each query costs 100x as much as a GPT-4o query. Behind the scenes, the LLM is having a long, annoying, and verbose conversation with itself. But at least we don't have to be privy to it.

Nevertheless, when the singularity comes, it might just be annoying and stupid. Perhaps small incremental changes get us there.

I think this is a pretty big deal and disagree with posters here who are saying it’s a nothingburger. For the kinds of tasks people here are currently using ChatGPT for, the extra robustness and reliability you get with o1 may not be necessary. But the real transformative use cases of LLMs are going to be when they take on a more agential role and can carry out extended complex actions, and this kind of “unhobbling” (as Aschenbrenner puts it) will be essential there.

For some accessible background on why an increasing inference-cost to training-cost ratio may be the key to near-term task-specific superintelligence, I recommend this brief blogpost from a couple of months ago. Ethan Mollick also has some initial hands-on thoughts about o1 that might be of interest/use.

The default failure-state for emerging tech is that a relative lack of improvement is explained away as setting the stage for a future explosion of progress. In a very few instances this has been true, but overwhelmingly it's been false, e.g. crypto, the metaverse, VR, etc. At best, LLMs are currently on the downswing of the Gartner Hype Cycle.

It's late and I'm almost asleep but let me get this straight: did they basically just take 4o, wire a CoT instruction or three under the hood as an always-on system prompt, tell it to hide the actual thinking from the user on top of that, and ship it? (The reportedly longer generation times would corroborate this but I'm going off hearsay atm.) Plus ever more cucked wrt haram uses again, because of course?

Sorry if this is low effort but it's legitimately the vibe I'm getting, chain of thought is indeed a powerful prompt technique but I'm not convinced this is the kind of secret sauce we've been expecting. I tuned out of Altman's hype machine and mostly stan Claude at this point so I may be missing something, but this really feels like a textbook nothingburger.

Traditional CoT techniques involve prompting the model to respond with its reasoning step by step. o1 is still doing that. But what's different here is they've figured out the optimal way to chain reasoning steps together, all without using complicated, inefficient RAG databases to optimize the response. Effectively, the model is trained to "reason" by using the most effective (read: predictive) strategy for CoT.
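
For contrast, prompt-level CoT is just something like this with the plain chat API (model choice and prompt wording here are placeholders, not what OpenAI does internally):

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The "traditional" way: you ask for the reasoning explicitly in the prompt.
# o1 reportedly does something like this internally, trained rather than prompted.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Reason step by step inside <thinking> tags, then give only the final answer."},
        {"role": "user",
         "content": "How many R's are in 'strawberry'?"},
    ],
)
print(resp.choices[0].message.content)
```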

It's a coder's model I think, not a gooner's model. And I think we're only going to get the monkey models at this point. The Soviets had export versions of their tanks made with shit steel and without the good optics. They didn't want their full abilities to be exposed. Plus the export versions were cheaper, suitable for mass production.

OpenAI has the compute to train really big models; they don't have the compute to make them available to the hoi polloi. There are already pretty prohibitive rate limits on the new o1. And they don't want to foot expensive inference bills just so other companies can distill and replicate their models, which may be why they're concealing the true chain of thought.

It's a coder's model I think, not a gooner's model.

I have no hopes for GPT in the latter department anyway, but my point stands, I think this is a remarkably mundane development and isn't nearly worth the glazing it gets. The things I read on basket weaving forums do not fill me with confidence either. Yes, it can solve fairly convoluted riddles, no shit - look at the fucking token count, 3k reasoning tokens for one no-context prompt (and I bet that can grow as context does, too)! Suddenly the long gen times, rate limits and cost increase make total sense, if this is what o1 does every single time.

Nothing I'm seeing so far disproves my intuition that this is literally 4o-latest but with a very autistic CoT prompt wired under the hood that makes it talk to itself in circles until it arrives at a decent answer. Don't get me wrong, this is still technically an improvement, but the means by which they arrived at it absolutely reeks of crutch coding (or crutch prompting, rather) and not any actual increase in model capabilities. I'm not surprised they have the gall to sell this (at a markup too!) but color me thoroughly unimpressed.

That's basically half of it.

The other half is using the good responses as a signal in RL on the model. An interesting comparison would be vanilla 4o with the built-in CoT techniques and the RLed model.
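
Schematically, the "use the good responses as a signal" half looks something like this toy loop (in the spirit of rejection sampling / STaR-style self-training; everything here is a stand-in, not OpenAI's actual pipeline):

```python
import random
random.seed(0)

# Toy, self-contained sketch of "use the good responses as an RL/fine-tuning signal".
# Everything is a stand-in: the "model" is just a number, its chance of producing a
# chain of thought whose final answer checks out.
def sample_is_correct(skill: float) -> bool:
    return random.random() < skill  # stand-in for: sample a CoT, verify the answer

def self_training_round(skill: float, problems: int = 2000, n: int = 16) -> float:
    solved = sum(any(sample_is_correct(skill) for _ in range(n)) for _ in range(problems))
    # stand-in for fine-tuning on the kept traces: single-sample skill
    # moves part-way toward the best-of-n success rate
    return skill + 0.5 * (solved / problems - skill)

skill = 0.2
for r in range(4):
    skill = self_training_round(skill)
    print(f"round {r}: single-shot success ≈ {skill:.2f}")
```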

One interesting thing is that for the hidden thoughts, it appears they turn off the user preferences, safety, etc., and those are only applied to the user-visible response. So o1 can think all kinds of evil thoughts and use them to improve reasoning and predictions, so long as they're not exposed explicitly in a way that would Harm the end user.

One interesting thing is that for the hidden thoughts, it appears they turn off the user preferences, safety, etc., and those are only applied to the user-visible response.

So o1 can think all kinds of evil thoughts and use them to improve reasoning and predictions

Judging by the sharp (reported) improvement in jailbreak resistance, I don't believe this is the case. It's much more likely (and makes more sense) to run the... ugh... safety checks at every iteration of the CoT, to approximate the only approach in LLM censorship abuse prevention that has somewhat reliably worked so far - a second model overseeing the first, like in DALL-E or CAI. Theoretically you can't easily gaslight a thus-prompted 4o (which has been hard to jailbreak already in my experience), because if properly "nested" the CoT prompts will allow it enough introspection to notice the bullshit the user is trying to load it with.
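
Roughly the kind of setup I mean (pure guesswork on my part; the public moderation endpoint just stands in for whatever second model they actually use):

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Guesswork sketch of "a second model overseeing the first": run a moderation pass
# over every intermediate reasoning step before the chain is allowed to continue.
# Not anything OpenAI has documented.
def moderated_cot(steps: list[str]) -> list[str]:
    approved = []
    for step in steps:
        verdict = client.moderations.create(input=step).results[0]
        if verdict.flagged:
            break  # drop or regenerate the chain once a step trips the overseer
        approved.append(step)
    return approved
```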

...actually, now that I spelled out my own chain of thought, the """safety""" improvements might be the real highlight here. As if GPT models weren't sterile enough already. God I hate this timeline.

This might be exactly what they've done. But... it just might work?

It reminds me of the "query hacks" for DALL-E 2 that stopped being necessary in DALL-E 3. "Picture of an elephant. Stunning, beautiful, viral on Twitter, award-winning". It makes sense that integrating query hacks into the technology directly would be more effective than leaving it to the user.
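
i.e. something like this, done behind the scenes instead of by the user (the function and suffix are made up for illustration):

```python
# Made-up illustration of "baking the query hack into the pipeline": the user's plain
# prompt gets decorated with the boilerplate quality keywords before it ever reaches
# the image model.
QUALITY_SUFFIX = "Stunning, beautiful, viral on Twitter, award-winning."

def augment_prompt(user_prompt: str) -> str:
    return f"{user_prompt.rstrip('.')}. {QUALITY_SUFFIX}"

print(augment_prompt("Picture of an elephant"))
# Picture of an elephant. Stunning, beautiful, viral on Twitter, award-winning.
```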

... it just might work?

It might, hell it probably will for the reasons you note (at the very least, normies who can't be arsed to write/steal proper prompts will definitely see legit major improvements), but this is not the caliber of improvement I'd expect for so much hype, especially if this turns out to be "the" strawberry.

The cynical cope take here is - with the election on the horizon, OpenAI aren't dumb enough to risk inconveniencing the hand that feeds them in any way and won't unveil the actual good shit (IF they have any in the pipeline), but the vital hype must be kept alive until then, so serving nothingburgers meanwhile seems like a workable strategy.

Something like those hacks is probably still going on behind the scenes, thanks to insertions being made by ChatGPT when you use DALL-E. The reason those hacks work is they help the model home in on the desirable part of the latent space, and DALL-E 3 is still a diffusion model, so there’s no reason to expect it works fundamentally differently from other diffusion models.

Color me skeptical that this is much more than a mildly iterative improvement at best. LLMs have largely either stagnated or even slightly regressed since the big hype cycle last year, at least for everyday uses. There have even been a handful of people breathlessly claiming we'd hit the singularity, only for their claims to be proven as nothingburgers within a few days.

There's a chart later in the press release that shows whether people subjected to a blind test prefer the current models or the new models. For writing there was essentially no difference, for programming and data analysis there was a minimal difference, and only for math was there a moderate difference.

It does seem like a lot of progress now is consolidating gains.

As I mentioned before, LLMs already had the ability to write a program to calculate the number of R's in strawberry. But they still gave the wrong answer instead of using the program. Similarly, a lot of "hallucination" could be fixed by simply incorporating databases.
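
By "incorporating databases" I mean roughly this kind of thing (the table and keys are made up for illustration):

```python
# Minimal sketch of "fix hallucination by incorporating a database": answer factual
# lookups from a trusted store and only fall back to the model when there's no entry.
FACTS = {
    "r_count:strawberry": 3,
    "capital:france": "Paris",
}

def answer(key: str) -> str:
    if key in FACTS:
        return f"From the database: {FACTS[key]}"
    return "Not in the database; falling back to the model (hallucinations and all)."

print(answer("r_count:strawberry"))   # From the database: 3
print(answer("capital:atlantis"))     # falls back to the model
```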

I think there's a lot of low hanging fruit. But maybe the "one weird trick" of just increasing model size has hit its limit.

I think there's a lot of low hanging fruit.

The point of "low hanging fruit" is that it's easy and quick to pluck, which means rapid gains. This is the opposite of that. This is grinding away at the margins for tiny, nearly imperceptible improvements.

They've pulled out the ladder and are scaling to the top of the tree to scrounge a handful of "strawberrries".

The US tech titans are collectively ploughing at least $100 billion into AI capital spending per annum. They are absolutely determined to reach the top of this tree, no matter the cost.

Tech companies ploughed insane amounts of money into stuff like the metaverse as well. Heck, Facebook even changed its name to Meta, yet all we got were some dead malls.

One tech company ploughed moderate amounts of money into the metaverse (about $20 billion total?), all of them are pumping insane amounts of money into AI.

There's a qualitatively different atmosphere between AI and the metaverse, you don't see the US restricting VR tech exports like they are AI tech. AI is just better, LLMs are used in so many places (writing, images, music, code, translation...) whereas the metaverse only exists in VR.

Facebook was hardly the only company investing in the metaverse. If VR is included as well, which is at least adjacent to concepts of the metaverse if not outright intertwined, then the amount of investment almost certainly rivals what LLMs are getting (though data is sparse and a true apples-to-apples comparison is difficult).

I have more faith in the long-term viability of AI than I do of the metaverse + VR. That said, I find the idea of a near-term AI breakout to be highly improbable and decreasing with every mediocre release. At best, LLMs have years or perhaps even decades of research ahead before we get to human-level intelligence. There's also the possibility that the current generation of LLMs will eventually be seen as a dead end and AI focus will shift elsewhere, like how game playing used to be the forefront capturing headlines in the 90s before tapering off into relative obscurity.

While I agree that the 'metaverse' was useless crap, much if not most of the Facebook money has been going to VR hardware and infrastructure. That has been getting remarkably good, both in terms of direct stats (resolution, image quality, refresh rate, headset weight, tracking fidelity, passthrough video quality) and user experience. Steam Link/Air Link over Wi-Fi 6 works with surprising quality, and even five years ago that was a ridiculous, unbelievable pipe dream, even as some glorified iPad-game gimmick.

It's just the business case kinda sucks.

VR can become a lot better, but it will never overcome the fundamental technical hurdle that has always held it back: its supporters bill it like it will eventually usher in Ready Player One, but without physical movement the tech just sucks. All demos have to make critical compromises with how the player or user moves around, and they usually do it by teleportation or by being on-rails, since the alternative is a one-way trip to the floor as the user's cochlea decides that remaining upright is no longer an option.

Been playing with it for the past hour. Subjectively, it seems a small but meaningful step better than Sonnet 3.5. But that understates the importance of it: we're not getting o1, but o1-preview, which is substantially inferior according to the posted benchmarks. And the amount of inference-time compute being used is pretty limited, and OAI is seeing capabilities scale through orders of magnitude more compute.

Last week, I was thinking that maybe we had hit some LLM-plateau. Now, I don't.

Note that it's three times as expensive as 4o! Small improvements for triple the cost sounds like a plateau to me.

Isn't a 3x price hike only a few months of deflation for LLM prices? I was under the impression their prices were dropping by more than an order of magnitude per year, so I would expect OpenAI o1 to reach GPT-4o prices early next year.
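
Back-of-the-envelope, assuming the roughly 10x/year price decline holds (both figures are just the thread's assumptions):

```python
import math

# If API prices fall ~10x per year, how long until a 3x premium over GPT-4o is erased?
annual_drop = 10   # assumed order-of-magnitude price drop per year
premium = 3        # o1 priced ~3x GPT-4o (figure from the thread)
months = 12 * math.log(premium) / math.log(annual_drop)
print(f"{months:.1f} months")   # ≈ 5.7 months
```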

How are you playing with it already? I have the premium version of ChatGPT and I only see the old 4o models.

They're doing a gradual rollout; supposedly all premium users should be getting it by EOD.

Speak of the devil, it just showed up for me.

If this is only a "small but meaningful step better", as compared to the massive leap of GPT-3 over GPT-2 and the still-pretty-good-but-maybe-not-quite-as-big leap of GPT-4 over GPT-3, then isn't that evidence in favor of plateauing rather than against it?

I don't see much point in theorizing about an as yet unreleased version of the model with an as yet unrealized amount of compute behind it until people actually have it in their hands and are using it for real work.

the new version will be able to write the algo behind the scenes, run the algo, and then give you the correct result

Apparently not!

(EDIT: Looks like linking directly to images on reddit is broken because "www." automatically gets replaced with "old." which doesn't work. Used an imgur link instead)

That's the mini version. There's mini, preview and then o1 proper.

OpenAI is not very good at distinguishing its models.

404 file not found.

I fixed it.

Dang. Who ever thought that counting the letters in a word would be the final boss.

International Math Olympiad? Child's play. Counting the number of R's in Strawberry? Real shit.

"Premature optimization is the root of all evil"

In this case the LLM is trained on data that has been chopped into tokens (whole words or chunks of words) rather than individual letters. So it can't actually see the letters of words and can't do certain wordplay.
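
You can see the problem directly with a tokenizer like OpenAI's tiktoken (exact splits depend on the encoding; this uses GPT-4o's):

```python
import tiktoken  # pip install tiktoken

# The model never sees letters, only token IDs. Exact splits depend on the encoding;
# "strawberry" typically comes out as a few multi-letter chunks, not 10 letters.
enc = tiktoken.get_encoding("o200k_base")   # GPT-4o's encoding
ids = enc.encode("strawberry")
print(ids)
print([enc.decode_single_token_bytes(i) for i in ids])
# e.g. something like [b'str', b'aw', b'berry'] — no individual R's to count
```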

I will laugh my ass off if the hack that solves this turns out to be “generate an image of the word ‘strawberry’,” followed by, “how many R’s are there in this image?”

In part, that's pretty funny, but I'm not exactly sure how my brain answers that question. I think by spelling it out loud in my head (even though I would parse it as a single token while reading or listening), but I could imagine someone more visually-oriented might describe something like what you're saying. The mind is weird, man.

How does your brain answer the question?

My brain works by splitting it into straw and berry, and then adding 1 and 2. As for how I know how many Rs are in each, "straw" just seems obvious, similar to if someone asked me if 9 is bigger than 8. "Berry" is a bit more complicated; I think my mind goes "it has an r in it, but it's a weird word, so it's probably 2."

Huh. Subjectively, I feel like I visually scan the word from left to right and increment a counter every time the scanner passes over an R, but I don’t really consider myself to be a “visual thinker” at all.

I could see myself doing that if it was already in written form. If you asked me verbally, or I looked somewhere else, it would be verbal, though.

Definitely an interesting exercise.

Billions of hours of captcha solving led to this.