Culture War Roundup for the week of February 17, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

(timestamped subtitles followed)

Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.

I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes? Also, listening to the track it seems like a pretty good example of the pronunciation that you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.

"AI lies and confidently misrepresents evidence in order to advance it's chosen position" is... not too surprising considering that it's been trained on decades of internet fora conversations, but probably not the kind of alignment we are looking for.

I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes?

It certainly has access to the subtitles, so they are probably cached at least. That video has like 10 million views; while I can't believe that anyone before me ever questioned the number of syllables in "gambeler", it definitely could have pre-subtitled it.

Also, listening to the track it seems like a pretty good example of the pronunciation that you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.

Thank you!

Yeah, after showing this to people and thinking about it, I lean heavily towards Grok having been fed a bunch of autogenerated subtitles (with timestamps). Which is very cool, but not nearly as cool as if it actually listened to stuff. And then it keeps hallucinating stuff on top of that.

I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes?

Remember that the algorithm is not processing physical vibrations in the atmospheric medium, it's parsing bits in an audio file. The relevant metric here is not the runtime but the file size, and a 3-minute song is unlikely to be more than a few MB.
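
(For scale, a quick back-of-the-envelope sketch in Python; the 128 kbps bitrate is just an assumed typical value, not something read off the actual video:)

# Rough size of a 3-minute audio track at a typical streaming bitrate.
bitrate_kbps = 128        # assumed typical audio bitrate
duration_s = 3 * 60       # 3-minute song
size_mb = bitrate_kbps * 1000 / 8 * duration_s / 1e6
print(f"~{size_mb:.1f} MB")   # ~2.9 MB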

I do understand this -- just the same, 'parsing bits' from a video file does not happen instantly. Indeed, just starting a stream on youtube is typically not what I would call 'instant'.

yt-dlp will download and transform a 20-minute YouTube video in under a second, including subtitles if you want.
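
(For illustration, a minimal sketch of that kind of fetch using yt-dlp's Python API; the URL and output filename are placeholders, and the option names come from yt-dlp's documented YoutubeDL options:)

# Sketch: grab the audio-only stream plus any available subtitles.
import yt_dlp

opts = {
    "format": "bestaudio",         # audio-only stream, no video bytes needed
    "writesubtitles": True,        # uploader-provided subtitles, if any
    "writeautomaticsub": True,     # YouTube's auto-generated captions
    "outtmpl": "clip.%(ext)s",     # placeholder output name
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=PLACEHOLDER"])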

OK, but will Grok? I guess it would be pretty easy to try, but it might refuse on copyright grounds or something.

The point is that YouTube will serve all the necessary files at the same time; you're not limited to a slow data stream. This is what we would call 'instant'.

Nothing happens instantly, but lots of things happen quickly enough that they might as well be instantaneous from the perspective of an unaugmented human.

Yes, but serving and parsing videos from youtube is not one of those things.

Let's agree to disagree.

I don't think so -- you are wrong on this one.

I don't think i am, see @veqq's comment above.

No, parsing bits from a video file does happen practically instantly. Download a video file to your local disk and play it from there, you'll see. Even on YouTube, if you rewind back, it has to parse the bytes again.

The reason a YouTube stream takes a while to start is that YouTube has to locate the bytes you asked for and start streaming them to you.

Yes, and for the LLM to parse these bits, YouTube first needs to locate them and then serve them to the LLM. Even if the LLM can convince YouTube to serve the bits as fast as bandwidth will allow, it still needs to run those bits through some transcription algos -- which typically are borderline on lagging at 1x speed.

In the instant case, it would also need that algo to make some sort of judgement on the accent with which some of the words are being pronounced -- which is not a thing that I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the LLM is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.

which typically are borderline on lagging at 1x speed.

That's just not true. You see them running at 1x speed because they only ever need to run at 1x speed. Running them faster is a waste of resources: what's the point of transcribing a video in its entirety if you are going to close the tab before it ends?

This is running on my home PC:

main: processing 'johny-cash-gods-gonna-cut-you-down.wav' (2739583 samples, 171.2 sec), 24 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
(...)
[00:00:38.360 --> 00:00:52.120]   Go tell that long-tongued liar, go and tell that midnight rider, tell the rambler, the gambler, the backbiter, tell him that God's gonna cut him down.
(...)
whisper_print_timings:    total time = 67538.11 ms

That's 2.5x transcription speed, and it's just running on CPU, not even using GPU.
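
(For anyone who wants to reproduce this without building whisper.cpp, roughly the same experiment can be sketched with the openai-whisper Python package; the model size here is an assumption:)

# Sketch: time a full transcription of the song locally.
import time
import whisper  # pip install openai-whisper

model = whisper.load_model("base")   # assumed model size
t0 = time.time()
result = model.transcribe("johny-cash-gods-gonna-cut-you-down.wav")
print(result["text"])
print(f"transcribed in {time.time() - t0:.1f} s")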

Look, guys, processing your stupid-ass text questions on a frontier LLM takes way more resources than speech transcription. Whisper on my CPU can do 400 GFLOPS, so the 67-second run above used something like 27,000 GFLOPs. For comparison, running a transformer model on a single token takes roughly two FLOPs per parameter. So, for example, GPT-3, a junk-tier model by today's standards, has 175B parameters, which means just feeding it the lyrics of Cash's song would need 150,000+ GFLOPs, and that's before it even produces any output.
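
(Spelling out the back-of-the-envelope numbers above; the token count for the lyrics is an assumption, everything else comes from the comment:)

# Rough FLOP comparison: transcribing the song vs. one forward pass over its lyrics.
whisper_gflops_per_s = 400          # claimed Whisper throughput on this CPU
transcription_s = 67.5              # total time from the whisper.cpp run
transcription_gflops = whisper_gflops_per_s * transcription_s    # ~27,000 GFLOPs

gpt3_params = 175e9                 # GPT-3 parameter count
flops_per_token = 2 * gpt3_params   # ~2 FLOPs per parameter per token
lyrics_tokens = 450                 # assumed rough token count for the lyrics
prefill_gflops = flops_per_token * lyrics_tokens / 1e9           # ~157,000 GFLOPs

print(f"{transcription_gflops:,.0f} vs {prefill_gflops:,.0f} GFLOPs")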

In the instant case, it would also need that algo to make some sort of judgement on the accent with which some of the words are being pronounced -- which is not a thing that I've seen.

Modern speech transcription models not only make judgements on accents as a normal course of operations, they actually understand the speech they are transcribing, to some (limited) extent, in the same sense as more general LLMs understand the text fed into them. It would actually be a pretty cool project for a deep learning class to take something like OpenAI Whisper and fine-tune it into an accent detection tool. It's all there inside the models already, you'd just need to get it out.
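
(A sketch of what that class project might look like, assuming the Hugging Face transformers implementation of Whisper: freeze the audio encoder, pool its hidden states, and train a small classification head on top. The model size and the number of accent classes are assumptions; full fine-tuning would also unfreeze the encoder:)

# Sketch: reuse Whisper's audio encoder as a feature extractor for accent detection.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
encoder = WhisperModel.from_pretrained("openai/whisper-base").get_encoder()
head = torch.nn.Linear(encoder.config.d_model, 4)   # 4 hypothetical accent classes

def accent_logits(waveform):
    # waveform: 1-D float array of 16 kHz audio
    feats = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    hidden = encoder(feats.input_features).last_hidden_state   # (1, frames, d_model)
    return head(hidden.mean(dim=1))   # pool over time, then classify

# Training would be a standard cross-entropy loop over accent-labelled clips.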

Just to be clear: I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could. The frontier LLMs are all multimodal, which usually means that they can process and generate image tokens in addition to text tokens. There's literally no technical reason why they couldn't add an audio token type, and some of them probably already do that.

whisper_print_timings: total time = 67538.11 ms

OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.

I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could.

Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?

LLMs lie man -- we should get used to it I guess.

In case it wasn't clear: it took 67 seconds to transcribe an entire 171-second-long song on my home CPU. You don't have to wait for it to finish either: it produces output as it goes. It takes less than 20 seconds to process the entire song on my 9-year-old gamer GPU. It would take less than half a second on a single H100. On 5 H100s, transcribing the entire song would take less time than it would take you to blink an eye. xAI reportedly has 100,000 H100s.
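
(Putting rough numbers on that scaling; the H100 throughput is an assumed ballpark figure, and this ignores batching and memory-bandwidth overhead:)

# Rough scaling of the 67-second CPU run to faster hardware.
work_gflops = 400 * 67.5           # total work implied by the CPU run above
h100_gflops_per_s = 1_000_000      # assumed ~1 PFLOPS-class throughput per H100

print(work_gflops / 400)                        # 67.5 s on the CPU
print(work_gflops / h100_gflops_per_s)          # ~0.03 s on one H100
print(work_gflops / (5 * h100_gflops_per_s))    # ~0.005 s on five H100s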

But it doesn't, and the fact that it responded instantly is evidence of that.

How is it any evidence? Responding to the question is at least an order of magnitude more computationally demanding task than transcribing the entire song.

Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?

That's the amazing thing about multimodal LLMs: they wouldn't even have to add any special feature; with multimodality you get it for free. For a multimodal LLM that is trained on sound tokens as well as text tokens, understanding the words in a song, analyzing accents, etc. is literally the same task as answering text questions. When you ask Grok a question, it searches the web, fetches websites, and processes the contents. Fetching and processing a song is, again, exactly the same kind of task for a multimodal LLM as processing text.

Sounds are not text though -- nothing is free, and nothing is instant.

Why don't you try it? Ask Grok to transcribe a song from a youtube link and see what it does -- preferably a song that differs from the published lyrics somehow, maybe a live version or something.

So you think it's real? I think it's an interesting question, possible either way. On the other hand, gambler is clearly pronounced with three syllables in the song, which suggests some hallucination is going on.

Also, can't this obviously be tested? Can't we just see if it can accurately transcribe a newly uploaded 30-minute YouTube video in a couple of seconds? Idk.

I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes?

I think your explanation about the AI lying and confidently misrepresenting evidence in this case is almost certainly true. But I don't see how the runtime of the music video would matter for this. If the AI were analyzing the music video - which I don't think it did - it would be analyzing the bits that make up the video file after downloading it from wherever it's hosted, and the speed of that would depend on many factors, but certainly not on how long the video is. A human might be limited to maybe half the runtime at the shortest, if they watched the video at 2x speed, but I don't see any reason why an AI couldn't transcribe, say, all recorded audio in human history within a second, just by going through the bits fast enough.

Any of this would need a pretty specialized video analysis module though, which AFAIK doesn't really exist period, much less one built into Grok -- plus the ability to download the video directly rather than look at a stream of it, which YouTube doesn't really provide. So if the AI were literally accessing the video through that link, 3:00/2x is indeed the fastest it would be able to provide the transcript.

(it would not be instant in any case; downloading the video takes X seconds, analyzing it Y -- X + Y might be less than three minutes, but it's not less than one second)

Any of this would need a pretty specialized video analysis module though

No it wouldn't, you just need the codec specification to extract the audio from the video, at which point it becomes a reasonably straightforward speech-to-text problem, which is something we've been doing since the 90s.
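
(As an illustrative aside, a minimal sketch of that 'extract the audio, then speech-to-text' pipeline, assuming ffmpeg is installed and a video file has already been downloaded; filenames are placeholders:)

# Sketch: strip the audio track out of a downloaded video, then transcribe it.
import subprocess
import whisper  # pip install openai-whisper

subprocess.run(
    ["ffmpeg", "-i", "clip.mp4", "-vn", "-ac", "1", "-ar", "16000", "clip.wav"],
    check=True,   # -vn drops the video stream; 16 kHz mono is what Whisper expects
)
print(whisper.load_model("base").transcribe("clip.wav")["text"])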

You don't even need that; as anyone who's looked at the unauthorized YouTube downloader tools out there will know, YouTube's protocols allow access to the audio-only and video-only portions of a recording as separate streams. All Grok would have to do is request the audio-only stream.

OK, then you need an audio analysis model -- this is not a thing that is integrated into LLMs.

The first practical LLMs were developed as tools to automatically generate transcripts and sub-titles. My point above is that even if we assume that Grok is not pulling a previous parse from some database, generating a fresh parse is well within its basic capabilities.

LLMs were developed as tools to automatically generate transcripts and sub-titles

Interesting assertion, but it doesn't really have any bearing on whether or not Grok can do this -- it takes text input from the user, and generates a text response. What makes you think it even has an interface to bring in audio inputs? (On the training end, they might -- given the hunger for data -- but it seems like an odd thing to include in a chatbot. Even for training, it would probably be better to do something like, oh, IDK -- run a transcription algo on as much YouTube content as you can grab and then feed the text from that into your training set. You might even include some timestamps!)

So if the AI were literally accessing the video through that link, 3:00/2x is indeed the fastest it would be able to provide the transcript.

You can make YouTube videos go at arbitrarily high speeds just using a Chrome extension. I actually had an issue recently where an extension was causing the videos to default to 10x speed, which was both amusing and annoying. In any case, anyone with a link to a YouTube video has the ability to just download the video using basic non-AI tools, so the AI wouldn't be limited by the UI that YouTube presents human users.

True (and interesting about the Chrome extension; what is the use case for 10x browser playback of YouTube videos, I wonder?) but I'm quite sure Grok is not currently programmed with anything like this.