This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
No, parsing bits from a video file happens practically instantly. Download a video file to your local disk and play it from there; you'll see. Even on YouTube, if you rewind, it has to re-fetch those bytes and present them again.
The reason a YouTube stream takes a while to start is that that's how long it takes YouTube to locate the bytes you asked for and start streaming them to you.
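That locate-and-serve step is just an HTTP byte-range request. A minimal sketch of the mechanism (the URL is a stand-in; real YouTube stream URLs are signed and short-lived):

```python
import requests

# Stand-in URL -- real YouTube stream URLs are signed and expire.
url = "https://example.com/video.mp4"

# Seeking doesn't re-download the file: the player asks the server for a
# byte range, here the 16 KiB starting at the 10 MB mark.
resp = requests.get(
    url,
    headers={"Range": "bytes=10485760-10502143"},
    stream=True,
)

print(resp.status_code)   # 206 Partial Content if the server supports ranges
print(len(resp.content))  # the requested slice, not the whole file
```

The latency you see is the server locating and shipping that slice, not the client parsing it.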
Yes, and for the LLM to parse those bits, YouTube first needs to locate them and then serve them to the LLM. Even if the LLM can convince YouTube to serve the bits as fast as bandwidth allows, it still needs to run them through some transcription algorithm -- and those typically hover on the edge of lagging even at 1x speed.
In the instant case, it would also need that algorithm to make some sort of judgement about the accent with which some of the words are pronounced -- which is not a thing I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the LLM is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.
That's just not true. You see them running at 1x speed because they only ever need to run at 1x speed. Running them faster is a waste of resources: what's the point of transcribing a video in its entirety if you're going to close the tab before it ends?
This is running on my home PC:
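Something like this minimal sketch, using the open-source openai-whisper package (the filename is a stand-in for a local copy of the song):

```python
import time

import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small model, comfortable on CPU

start = time.time()
result = model.transcribe("cash_song.mp3")  # hypothetical local audio file
elapsed = time.time() - start

print(result["text"][:200])
print(f"done in {elapsed:.1f} s")  # ~67 s for the 171 s song on my CPU
```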
That's 2.5x transcription speed, and it's just running on CPU, not even using GPU.
Look, guys, processing your stupid-ass text questions on a frontier LLM takes way more resources than speech transcription does. Whisper on my CPU does about 400 GFLOP/s, so the 67-second run above used something like 26,000 GFLOPs. For comparison, running a transformer on a single token costs roughly twice the model's parameter count in FLOPs. So, for example, GPT-3, a junk-tier model by today's standards, has 175B parameters, which means just feeding it the lyrics of Cash's song would need 150,000+ GFLOPs, and that's before it even produces any output.
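The arithmetic, spelled out (the lyric token count is my rough estimate):

```python
# Whisper side: sustained throughput times wall-clock time.
whisper_gflops_per_s = 400          # my CPU, roughly
run_seconds = 67
whisper_total = whisper_gflops_per_s * run_seconds
print(whisper_total)                # ~26,800 GFLOPs for the whole song

# GPT-3 side: ~2 FLOPs per parameter per token, just to read the input.
gpt3_params = 175e9
flops_per_token = 2 * gpt3_params   # 350 GFLOPs per token
lyric_tokens = 430                  # rough guess at the song's lyric length
gpt3_total_gflops = flops_per_token * lyric_tokens / 1e9
print(gpt3_total_gflops)            # ~150,000 GFLOPs before any output
```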
Modern speech transcription models not only make judgements about accents as a normal course of operation, they actually understand the speech they are transcribing to some (limited) extent, in the same sense that more general LLMs understand the text fed into them. It would actually be a pretty cool project for a deep learning class to take something like OpenAI Whisper and fine-tune it into an accent-detection tool. It's all there inside the models already; you'd just need to get it out.
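A minimal sketch of what that class project could look like, using Hugging Face's transformers (the accent labels and the dummy clip are made up; a real run would loop over a labeled accent dataset):

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperForAudioClassification

# Hypothetical accent classes for the fine-tuning task.
LABELS = ["southern_us", "general_american", "british", "other"]

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-base", num_labels=len(LABELS)
)  # pretrained encoder reused as-is, new classification head on top

# One training step on a fake 5-second, 16 kHz clip.
waveform = torch.randn(16000 * 5)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([0])  # "southern_us"

out = model(input_features=inputs.input_features, labels=labels)
out.loss.backward()  # the pretrained speech representations do the heavy lifting
```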
Just to be clear: I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could. The frontier LLMs are all multimodal, which usually means they can process and generate image tokens in addition to text tokens. There's literally no technical reason why they couldn't add an audio token type, and some of them probably already do.
OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.
Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?
LLMs lie, man -- we should get used to it, I guess.
In case it wasn't clear: it took 67 seconds to transcribe an entire 171-second song on my home CPU. You don't have to wait for it to finish, either: it produces output as it goes. It takes less than 20 seconds to process the entire song on my 9-year-old gamer GPU. It would take less than half a second on a single H100. On 5 H100s, transcribing the entire song would take less time than it takes you to blink. xAI reportedly has 100,000 H100s.
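Scaling the ~26,800 GFLOPs of work from above across that hardware (the throughput figures are my ballpark assumptions, held well below peak specs to allow for utilization):

```python
total_gflops = 26_800  # Whisper work for the whole 171 s song, from above

# Rough sustained throughputs in GFLOP/s -- assumptions, not benchmarks.
hardware = {
    "my CPU": 400,           # -> 67 s, matching the measured run
    "old gamer GPU": 1_500,  # -> ~18 s
    "one H100": 60_000,      # far below its ~1,000 TFLOPS FP16 peak
}

for name, rate in hardware.items():
    print(f"{name}: {total_gflops / rate:.2f} s")
```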
How is that any evidence? Responding to the question is a task at least an order of magnitude more computationally demanding than transcribing the entire song.
That's the amazing thing about multimodal LLMs: they wouldn't even have to add any special feature; with multimodality you get it for free. For a multimodal LLM trained on sound tokens as well as text tokens, understanding the words in a song and analyzing accents etc. is literally the same task as answering text questions. When you ask Grok a question, it searches the web, fetches websites, and processes the contents. Fetching and processing a song is, again, exactly the same task to a multimodal LLM as processing text.
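To make "the same task" concrete, a purely illustrative sketch -- nothing here is Grok's real API, and the tokenizer below is a toy stand-in for a learned audio codec:

```python
from typing import List

def audio_to_tokens(pcm: List[float], frame: int = 160) -> List[int]:
    # Toy stand-in: real systems quantize short audio frames against a
    # learned codebook; here we just hash 10 ms frames into 4096 buckets.
    return [
        hash(tuple(round(s, 3) for s in pcm[i:i + frame])) % 4096
        for i in range(0, len(pcm), frame)
    ]

text_tokens = [101, 7592, 2003]                # "what accent is this?" (made-up ids)
audio_tokens = audio_to_tokens([0.0] * 16000)  # 1 s of (silent) audio, made up

# To the model, answering is just next-token prediction over one
# interleaved sequence; 50257 is a hypothetical <audio> boundary marker.
sequence = text_tokens + [50257] + audio_tokens
print(len(sequence))  # 104 tokens, no special audio pathway required
```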
Sounds are not text though -- nothing is free, and nothing is instant.
Why don't you try it? Ask Grok to transcribe a song from a YouTube link and see what it does -- preferably a song that differs from the published lyrics somehow, maybe a live version or something.
So you think it's real? I think it's an interesting question, possible either way. On the other hand, gambler is clearly pronounced with three syllables in the song, which suggests some hallucination is going on.
Also, can’t this obviously be tested? Can’t we just see if it can accurately transcribe a 30 minute newly uploaded YouTube video in a couple of seconds? Idk.