This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
Notes -
Grok 3.0 apparently can listen to music and integrate the results with the rest of its knowledge.
Twitter released its newest AI iteration; it's fun, clever, and noticeably less lobotomized than basically everything before it. It's currently free with a basic Twitter account. But I accidentally discovered a thing that kinda blew my mind. I got into a silly argument with it, as one does, about whether you can legitimately pronounce "gambler" with three syllables. As one of my arguments I brought up a Johnny Cash song where, to my ear, he does it. The robot responded:
I disagree, obviously, but notice the timestamp!!! I'm reasonably sure that nobody in the history of the internet has had this exact argument before and mentioned the exact timestamp in that exact song. Moreover, before that I asked it about "House of the Rising Sun" (because I misremembered the vocalist drawling "gambling" there) and the robot also timestamped the place in the recording where it was said.
So I don't know. It's possible that this is a result of an unsophisticated hack, give the AI a database of timestamped subtitles for youtube videos (something they already have generated), then it bullshits its way through the argument about what was actually said and how. That's totally possible, it's really good at bullshitting!
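To make the hypothesized hack concrete, here's a toy sketch of the kind of lookup it would need. Everything in it is made up for illustration: the video id, the caption lines, and the timestamps are all hypothetical stand-ins, not real data from any actual subtitle database.

```python
# Toy sketch of the "subtitle database" hypothesis: the model never listens
# to any audio, it just looks words up in pre-generated, timestamped
# subtitles. All ids, lines, and timestamps below are invented.

# video_id -> list of (timestamp_seconds, caption_line)
SUBTITLE_DB = {
    "cash_gambler_demo": [          # hypothetical video id
        (12.0, "I am a rambler"),
        (15.5, "I am a gambler"),
        (19.0, "I am a backbiter"),
    ],
}

def find_word(video_id, word):
    """Return (timestamp, caption) pairs whose caption contains the word."""
    hits = []
    for ts, line in SUBTITLE_DB.get(video_id, []):
        if word.lower() in line.lower():
            hits.append((ts, line))
    return hits
```

With an index like this the model can cite an exact timestamp instantly and without ever touching the audio; anything it then says about how the word is *pronounced* at that timestamp is pure confabulation.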
The other possibility is that it actually listens to videos/audio and analyzes them on the fly or during training, whatever. What's super interesting about it is that, look, we started with LLMs that literally had not a single real-world referent, nothing that could remotely qualify as a quale of, say, seeing an apple. They were trained entirely on people talking about their perceptions of apples, and somehow they managed to learn what apples are pretty well, without ever seeing one (which all philosophers agreed should be impossible -- seeing apples must come first -- and yet here we were). And now, if it's not just a subtitle hack, then we have quietly passed another milestone: the robots can now hear and see and correlate that with their knowledge bases.
Also, I asked the robot directly:
(timestamped subtitles followed)
Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.
If you mean The Animals' version, I think there's a slide during the first syllable that probably threw you, but it's still only one syllable.
Grok and GPT are both largely derived from tools originally developed to generate automated transcripts and subtitles.
You don't need to give the computer a database of timestamped lyrics when you can generate them. That the LLM can generate subtitles, or that it defaults to the common US English/dictionary pronunciation of the word (which is naturally going to be far more central to the training corpus) rather than the four-beat tempo of the actual audio (the ram-bull-er, the gam-bull-er, the back-bite-er), should not be surprising to anyone. In fact, I find your example highly illustrative of both the capabilities and common failure modes of such models.
As an aside, surely you must have more intelligent things to do with your time than arguing with chatbots.
Like reading replies like this? Arguing with a chatbot was more productive, tbh.
I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes. Also, listening to the track, it seems like a pretty good example of the pronunciation that you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.
"AI lies and confidently misrepresents evidence in order to advance its chosen position" is... not too surprising considering that it's been trained on decades of internet fora conversations, but probably not the kind of alignment we are looking for.
It certainly has access to the subtitles, so they are probably cached at least. That video has like 10 million views; even if I was the first person to question the number of syllables in "gambeler", it definitely could have been pre-subtitled.
Thank you!
Yeah, after showing this to people and thinking about it, I lean heavily towards Grok having been fed a bunch of autogenerated subtitles (with timestamps). Which is very cool, but not at all as cool as if it actually listened to stuff. Also, it then keeps hallucinating stuff on top.
Remember that the algorithm is not processing physical vibrations in the atmospheric medium; it's parsing bits in an audio file. The relevant metric here is not the runtime but the file size, and a 3-minute song is unlikely to be more than a few MB.
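Back-of-envelope on the file size, assuming a typical compressed audio bitrate of 128 kbps (YouTube's actual audio bitrates vary, but are in this ballpark):

```python
# Size of a 3-minute audio stream at an assumed 128 kbps bitrate.
bitrate_bps = 128_000                  # bits per second (assumption)
duration_s = 3 * 60                    # 180-second song
size_mb = bitrate_bps * duration_s / 8 / 1e6
# ~2.9 MB, consistent with "a few MB"
```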
I do understand this -- just the same, 'parsing bits' from a video file does not happen instantly. Indeed, just starting a stream on youtube is typically not what I would call 'instant'.
yt-dlp will download and transform a 20 minute youtube video in under a second, including subtitles if you want.
OK, but will Grok? I guess it would be pretty easy to try, but it might refuse on copyright grounds or something.
Nothing happens instantly, but lots of things happen quickly enough that they might as well be instantaneous from the perspective of an unaugmented human.
Yes, but serving and parsing videos from youtube is not one of those things.
Let's agree to disagree.
I don't think so -- you are wrong on this one.
No, parsing bits from a video file does happen practically instantly. Download a video file to your local disk and play it from there; you'll see. Even on YouTube, if you rewind back, it will have to present the bytes again.
The reason it takes a while for a YouTube stream to start is that this is what it takes for YouTube to locate the bytes you asked for and start streaming them to you.
Yes, and for the LLM to parse these bits, first YouTube needs to locate them, then serve them to the LLM. Even if the LLM can convince YouTube to serve the bits as fast as bandwidth will allow, it still needs to run those bits through some transcription algo -- which typically is borderline on lagging at 1x speed.
In the instant case, it would also need that algo to make some sort of judgement on the accent with which some of the words are being pronounced -- which is not a thing that I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the llm is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.
That's just not true. You see them running at 1x speed, because they only ever need to run at 1x speed. Running them faster is a waste of resources: what's the point of transcribing video in its entirety, if you are going to close the tab before it ends?
This is running on my home PC:
That's 2.5x transcription speed, and it's just running on CPU, not even using GPU.
Look, guys, processing your stupid-ass text questions on a frontier LLM takes way more resources than speech transcription. Whisper on my CPU can do 400 GFLOPS, so the above 67 seconds run used something like 26,000 GFLOPs. For comparison, running a transformer model on a single token requires something like twice the number of parameters of the model worth of flops. So, for example, GPT-3, a junk tier model by today's standards, has 175B parameters, so just feeding it the lyrics of Cash's song would need 150,000+ GFLOPs, and that's before it even produces any output.
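The arithmetic above checks out; here it is worked through. The only number added is the ~430-token lyric length implied by the "150,000+ GFLOPs" figure, which is an assumption, not something stated in the thread.

```python
# Sanity-checking the comparison: transcribing the whole song vs. merely
# feeding GPT-3 its lyrics. Throughput figures are the commenter's claims;
# the 430-token lyric length is an assumed back-solve.

whisper_gflops_per_s = 400                 # claimed CPU throughput
run_seconds = 67
transcription_gflop = whisper_gflops_per_s * run_seconds   # ~26,800 GFLOPs

# Rule of thumb: a forward pass costs ~2 FLOPs per parameter per token.
gpt3_params = 175e9
gflop_per_token = 2 * gpt3_params / 1e9                    # 350 GFLOPs/token
lyric_tokens = 430                         # assumed lyric length
prompt_gflop = gflop_per_token * lyric_tokens              # ~150,500 GFLOPs
```

So under these assumptions, just *reading* the lyrics with GPT-3 costs several times the compute of transcribing the entire recording.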
Modern speech transcription models not only make judgements about accents as a normal course of operations, they actually understand the speech they are transcribing to some (limited) extent, in the same sense as more general LLMs understand the text fed into them. It would actually be a pretty cool project for a deep learning class to take something like OpenAI Whisper and fine-tune it into an accent-detection tool. It's all there inside the models already; you'd just need to get it out.
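Schematically, that class project would bolt a small classification head onto the frozen encoder: pool the per-frame features into one vector, then softmax over accent labels. The sketch below uses random stand-in data in place of real encoder output; the dimensions, label set, and weights are all hypothetical, and actually loading Whisper is out of scope here.

```python
import math
import random

# Schematic accent classifier on top of a (stood-in) speech encoder.
# Real project: replace `frames` with Whisper encoder hidden states and
# train W, b on labeled accent data with the encoder (mostly) frozen.

random.seed(0)
N_FRAMES, DIM = 100, 8                 # stand-in encoder output shape
ACCENTS = ["US", "UK", "Southern US"]  # hypothetical label set

frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FRAMES)]

def mean_pool(frames):
    """Average per-frame feature vectors into a single utterance vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Randomly initialized linear head (untrained, illustration only).
W = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in ACCENTS]
b = [0.0] * len(ACCENTS)

pooled = mean_pool(frames)
logits = [sum(w_i * x_i for w_i, x_i in zip(row, pooled)) + b_k
          for row, b_k in zip(W, b)]
probs = softmax(logits)                # one probability per accent label
```

The point of the sketch is just that the hard part (the encoder features) already exists; the accent head is a few lines on top.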
Just to be clear: I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could. The frontier LLMs are all multimodal, which usually means that they can process and generate image tokens in addition to text tokens. There's literally no technical reason why they couldn't add audio token type, and some of them probably already do that.
OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.
Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?
LLMs lie, man -- we should get used to it, I guess.
In case it wasn't clear: it took 67 seconds to transcribe an entire 171-second song on my home CPU. You don't have to wait for it to finish, either: it produces output as it goes. It takes less than 20 seconds to process the entire song on my 9-year-old gamer GPU. It would take less than half a second on a single H100. On 5 H100s, transcribing the entire song would take less time than it takes you to blink. xAI reportedly has 100,000 H100s.
How is it any evidence? Responding to the question is at least an order of magnitude more computationally demanding task than transcribing the entire song.
That's the amazing thing about multimodal LLMs: they wouldn't even have to add any special feature; with multimodality you get it for free. For a multimodal LLM trained on sound tokens as well as text tokens, understanding the words in a song, analyzing accents, etc. is literally the same task as answering text questions. When you ask Grok a question, it searches the web, fetches websites, and processes the contents. Fetching and processing a song is, again, exactly the same task to a multimodal LLM as processing text.
I think your explanation about the AI lying and confidently misrepresenting evidence in this case is almost certainly true. But I don't see how the runtime of the music video would matter for this. If the AI were analyzing the music video - which I don't think it did - it would be analyzing the bits that make up a video file after downloading it from wherever it is, in which case it just needs to process the bits that make up the file, and the speed of that would be dependent on many factors, but certainly not limited by how long the video is. A human might be limited to maybe half the time at the shortest if they watched the video at 2x speed, but I don't see any reason why an AI couldn't transcribe, say, all recorded audio in human history within a second, just by going through the bits fast enough.
Any of this would need a pretty specialized video analysis module, though, which AFAIK doesn't really exist, period, much less built into Grok -- plus the ability to download the video directly rather than look at a stream of it, which YouTube doesn't really provide. So if the AI were literally accessing the video through that link, 3:00/2x is indeed the fastest it would be able to provide the transcript.
(it would not be instant in any case; downloading the video takes X seconds, analyzing it Y -- X + Y might be less than three minutes, but it's not less than one second)
No it wouldn't; you just need the codec specification to extract the audio from the video, at which point it becomes a reasonably straightforward speech-to-text problem, which is something we've been doing since the 90s.
You don't even need that; as anyone who's looked at the unauthorized youtube downloader tools out there will know, youtube's protocols allow access to the audio-only and video-only portions of a recording as separate streams. All Grok would have to do is request the audio-only stream.
OK, then you need an audio analysis model -- this is not a thing that is integrated into LLMs.
The first practical LLMs were developed as tools to automatically generate transcripts and subtitles. My point above is that even if we assume that Grok is not pulling a previous parse from some database, generating a fresh parse is well within its basic capabilities.
Interesting assertion, but it doesn't really have any bearing on whether or not Grok can do this -- it takes text input from the user and generates a text response. What makes you think it even has an interface to bring in audio inputs? (On the training end, they might -- given the hunger for data -- but it seems like an odd thing to include in a chatbot. Even for training, it would probably be better to do something like, oh, IDK, run a transcription algo on as much YouTube content as you can grab and then feed the text from that into your training set. You might even include some timestamps!)
You can make YouTube videos go at arbitrarily high speeds just using a Chrome extension. I actually had an issue recently where an extension was causing the videos to default to 10x speed, which was both amusing and annoying. In any case, anyone with a link to a YouTube video has the ability to just download the video using basic non-AI tools, so the AI wouldn't be limited by the UI that YouTube presents human users.
True (and interesting about the Chrome extension; what is the use case for 10x browser playback of youtube videos, I wonder?), but I'm quite sure Grok is not currently programmed with anything like this.
Even if it did have preprocessed subtitles, presumably "gambler" just appears like that regardless of pronunciation?
I don't know exactly what's going on here, but LLMs often respond like that. I'm not sure that "lying" is the correct term, or if it's more accurate to say that they frequently don't "perceive themselves" as having the literal knowledge that they're trained on and can, with some prompting, reproduce.
Some of them, most notably ChatGPT, are explicitly trained and prompted not to reproduce potentially copyrighted work like song lyrics. Though OAI's recent model spec has been updated to standards where the LLM is supposed to decline politely rather than lie and say it's incapable of reproducing them.
I've had this happen to me with many things that aren't copyrighted, but I suppose such "instructions" could perhaps bleed over into other areas.