
Culture War Roundup for the week of July 15, 2024

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

I feel like this is a bad idea because LLMs appear to be a dead end beyond helping you write code. Like they can’t handle 9.9-9.11, so I don’t think they’ll be good at something that needs a lot of real-time precision. Maybe they’ll go after another approach but given that a lot of SF VCs’ plans appear to be to get Trump/Vance into office and then get huge handouts for whatever they’re currently working on, it seems like this’ll be the way it’ll go.

I gave the question 9.9-9.11 to both ChatGPT (3.5) and Claude 3.5 Sonnet. ChatGPT bungled it very badly even though I told it multiple times it was wrong. Claude 3.5 Sonnet was correct though on the first try.
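For reference, the arithmetic being tested can be checked exactly. This is a minimal sketch using Python's standard decimal module; the helper name exact_difference is made up for illustration, and the snippet says nothing about how any model computes its answer.

```python
from decimal import Decimal


def exact_difference(a: str, b: str) -> Decimal:
    """Subtract two decimal strings exactly (hypothetical helper).

    Parsing from strings avoids the rounding that binary floats
    introduce for values such as 9.9 and 9.11.
    """
    return Decimal(a) - Decimal(b)


result = exact_difference("9.9", "9.11")
print(result)  # 0.79
```

The point of working with strings is that 9.9 is treated as 9.90, so the difference is 0.79 and not a negative number.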

Claude 3.5 changed its tokenization scheme (R2L integers, so 1234 is tokenized as [234, 1]), which accounts for its models' (even Haiku!) superior performance over competitors.
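To make the "R2L integers" idea concrete, here is a rough sketch of grouping digits from the right versus from the left. The function names and the chunk size of three are assumptions for illustration, and the output order follows the [234, 1] notation used above; this is not a description of any vendor's actual tokenizer.

```python
def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Group digits into chunks of `size` starting from the right.

    Chunks are returned least-significant first, matching the
    [234, 1] notation in the comment above (hypothetical helper).
    """
    chunks = []
    end = len(digits)
    while end > 0:
        start = max(0, end - size)
        chunks.append(digits[start:end])
        end = start
    return chunks


def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Group digits into chunks of `size` starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


print(chunk_r2l("1234"))  # ['234', '1']
print(chunk_l2r("1234"))  # ['123', '4']
```

With right-to-left grouping, chunk boundaries always line up with thousands, so the same chunk position means the same place value regardless of how long the number is. With left-to-right grouping, the boundaries shift with the length of the number.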

I am torn about how to read this. It's very stupid that a change like that can skyrocket performance and shows that existing systems have some pretty serious flaws. On the other hand, it indicates that there is a lot of low-hanging fruit to improve things, even if there isn't a serious improvement in fundamental architecture coming in the short-term.

I continue to think that, once someone cracks keeping "chain of thought" out of the loss function, via something as simple as begin/end tokens, we'll see an improvement in performance that's the equivalent of the difference between an answer a human can give while blathering off the top of their head vs an answer a human can give by quietly thinking about it first and then thinking about how they're thinking about it and then assembling the best of those thoughts into a final verbal answer. I do not add 1234+8766 by going left-to-right, but certainly addition is also not the only place where I think about the later parts of a problem before coming to a conclusion about the earlier parts, so any kind of reversal that only applies to numbers is just a hack.

On the other hand, the longer I continue to think this, the less likely it is that someone hasn't tried to do it in enough ways to conclude that it's a failure for some subtle reason I don't understand.
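The 1234+8766 example above can be spelled out as ordinary schoolbook addition. This is only a sketch of the by-hand procedure (the function name is invented for illustration), not a claim about what happens inside a model.

```python
def add_right_to_left(a: str, b: str) -> str:
    """Add two non-negative decimal strings digit by digit,
    starting from the least significant digit and carrying left."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    out = []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))  # digit for this column
        carry = total // 10          # carry into the next column to the left
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))


print(add_right_to_left("1234", "8766"))  # 10000
```

In this case every column produces a carry, so the leading digit of the answer is not known until the last column has been processed. That is why emitting the answer strictly left to right requires working out the later digits first.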

I feel like this is a bad idea because LLMs appear to be a dead end beyond helping you write code.

Supposedly, it's already starting to put junior associates at some law firms out of work, and within striking distance of putting junior writers and editors, junior data scientists, and junior developers out of work. The article I linked argued that mid-May of this year was the turning point where models started to become genuinely viable for helping a senior developer with programming, with the releases of GPT-4o, Google's Gemini and Anthropic's Claude 3 Opus.

Like they can’t handle 9.9-9.11, so I don’t think they’ll be good at something that needs a lot of real-time precision.

It's pretty astonishing how years of demonstrable and constantly increasing utility can be dismissed with some funny example.

On the other hand, now this makes it easier for me to understand how people can ignore other, more politicized obvious stuff.

dead end beyond generating text

dead end beyond generating bad pictures

dead end beyond classifying pictures

dead end beyond generating good pictures

dead end beyond helping write code

dead end beyond doing basic math with guidance

you are here

We're into fairly advanced mathematics now, things are moving so quickly.

https://www.scientificamerican.com/article/ai-matches-the-abilities-of-the-best-math-olympians/

LLMs aren't helpful writing code.

Speak for yourself. They're great for doing entry level tasks in languages/tech you're not familiar with. Once you get a good sense for what they do and don't hallucinate, you can really cruise through. It's made me significantly faster.

That's exactly where they are worst! If you're not familiar with the language, you have no idea at all whether it's giving you correct code or not. Nobody should ever use an LLM for something they don't know well enough to validate.

Yeah IDK man, I can tell pretty quick whether the LLM code works or not. Either the UI starts looking right, the data starts getting transformed how I need it, or it doesn't. For anything not dead simple to validate I use my own brain. That's still automating a huge part of my work.

Why say something trivially disproved by common experience?

Maybe they don't feel helpful to you but, objectively, a lot of developers are being helped by them, myself and many of my coworkers included.

Why claim something trivially disproved by common experience?

Personally I'm with @SubstantialFrivolity on this one.

Generating a thousand lines of code in 5 minutes doesn't mean a thing if it's going to take a week or more to validate it.

I mean, it's a matter of opinion, not fact. But I don't think they are useful, and I think people who are using them for programming are playing with fire. You can't safely use anything they give you without validating it is correct, and if you have to check the code anyways it's not saving you work compared to just writing it.

LLMs are the worst sort of help - the unreliable kind, which you can't trust to actually do its job. Help like that is generally worse than no help at all, because at least you know what to expect and can plan for it when you have no help at all.

One of the main use cases I have is "take this algorithm described in this paper and implement it using numpy" or "render a heatmap" where it's pretty trivial to check that the code reads as doing the same thing as the paper. But it is nice to skip the innumerable finicky "wait was it numpy.linalg.eigenvalues() or numpy.linalg.eigvals()" gotchas - LLMs are pretty good at writing idiomatic code. And for the types of things I'm doing, the code is going to be pasted straight into a jupyter notebook, where it will either work or fail in an obvious way.

If you're trying to get it to solve a novel problem with poor feedback you're going to have a bad time, but for speeding up the sorts of finicky and annoying tasks where people experienced with the toolchain have just memorized the footguns and don't even notice them anymore but you have to keep retrying because you're not super familiar with the toolchain, LLMs are great.

Also you can ask "are there any obvious bugs or inefficiencies in this code". Usually the answer is garbage but sometimes it catches something real. Again, it's a case where the benefit of the LLM getting it right is noticeable and the downside if it gets it wrong is near zero.

Out of curiosity, would you say classic code completion is 'helpful' when writing code?

Suggesting a method or attribute is very useful because I know what I'm looking for and it's very quick to figure out which one is right. Suggesting the next ten lines means I need to carefully review them, and I might as well just write them myself at that point.

Suggesting the next ten lines means I need to carefully review them, and I might as well just write them myself at that point

Talking about Copilot specifically, the suggestions I wait for and review are typically 1-3 lines long, and it's the sort of code I review far faster than I type. Most of the time it knows exactly what I mean to write. Maybe extensive, expertly set up, high-quality tooling could match it, but I'm too incompetent and too poor, respectively.

Also Copilot chat is hugely helpful, superior to a search engine for simple questions, though admittedly only once I settled on an 'instructions' prompt that reliably prevents it from yapping. It gives very concise answers and rarely makes mistakes. It consistently saves me time, and is much more pleasant to interact with than your typical search engine result.

Maybe a little, but it's a minor convenience really.

Not in the cases where it routinely suggests complete garbage (as everyone with a suitably buggy / misconfigured IDE has found out).

LLMs don’t generate pictures. I have no idea why people keep repeating the blatantly obviously incorrect claim that AI equals LLM.

Stable diffusion contains a text transformer. Language models alone don't generate pictures but they're a necessary part of the text-to-image pipeline.

Also some LLMs can use tools, so an LLM using an image generation tool is in a sense generating a picture. It's not like humans regularly create pictures without using any tools.

Yes, it has an input parser. If you’ve studied SD details, you’ll know that it’s very different from what people call LLMs and is only a small part of Stable Diffusion (and not anything you could say “generates pictures”).

Yes, it has an input parser

Specifically OpenCLIP. As far as I can tell the text encoder is nearly a bog-standard GPT-style transformer. The transformer in question is used very differently than the GPT-style next token sampling loop, but architecturally the TextTransformer is quite similar to e.g. GPT-2.

Still, my understanding is that the secret sauce of stable diffusion is that it embeds the image and the text into tensors of the same shape, and then tries to "denoise" the image in such a way that the embedding of the "denoised" image is closer to the embedding of the text.

The UNet is the bit that generates the pictures, but the text transformer is the bit which determines which picture is generated. Without using a text transformer, CLIP and thus stable diffusion would not work nearly as well for generating images from text. And I expect that further advancements which improve how well instructions are followed by image generation models will come mainly from figuring out how to use larger language transformers and a higher dimensional shared embedding space.

Tbh, I didn’t notice that Tomato had called those out in particular. The top level post talks about various applications which are definitely not LLMs.

LLMs specifically are horrible with arithmetic (as Tomato said). I don’t see why a math-oriented AI couldn’t be made - it just wouldn’t be an LLM and quite possibly would have about as much in common with an LLM as e.g. image generators do (in other words, very little beyond an input parsing stage).

I don’t know what it is about this site (other than people being infatuated with ridiculously long meandering posts) that makes users think LLMs are the modal example of AI when their actual productive uses are limited to a few text generating and parsing niches. Meanwhile, e.g., every photo 99% of people take has multiple layers of AI applied to it.

We were in the same spot a few years ago though?

I think Stable Diffusion's public release in August 2022 marks the time when we reached "dead end beyond generating good pictures" - before that, AI being able to generate good pictures was either very very niche knowledge or just not considered true. That's not even 2 years ago. ChatGPT (running GPT-3.5) also came out publicly in 2022, at the end of November, so a bit under 2 years ago, and that probably marked when we reached "dead end beyond helping write code." I think it's arguable that the roughly 2 years since those periods haven't been revolutionary, but I think it's inarguable that lots of progress has happened in those 2 years, and in any case, 2 years is a rather short period of time, probably the lower bound of what is considered "a few years."

We've now got pretty good video generation.