This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
Notes -
Grok 3.0 apparently can listen to music and integrate the results with the rest of its knowledge.
Twitter released its newest AI iteration; it's fun, clever, and noticeably less lobotomized than basically everything before it. It's currently free with a basic Twitter account. But I accidentally discovered a thing that kinda blew my mind. I got into a silly argument with it, as one does, about whether you can legitimately pronounce "gambler" with three syllables. As one of my arguments I brought up a Johnny Cash song where, to my ear, he does it. The robot responded:
I disagree, obviously, but notice the timestamp!!! I'm reasonably sure that nobody in the history of the internet has had this exact argument before and mentioned the exact timestamp in that exact song. Moreover, before that I asked it about "House of the Rising Sun" (because I misremembered the vocalist drawling "gambling" there) and the robot also timestamped the place in the recording where it was said.
So I don't know. It's possible that this is a result of an unsophisticated hack, give the AI a database of timestamped subtitles for youtube videos (something they already have generated), then it bullshits its way through the argument about what was actually said and how. That's totally possible, it's really good at bullshitting!
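To make the hypothesized hack concrete, here's a toy sketch of the kind of lookup it would need. Everything in it is made up for illustration: the video id, the caption lines, and the timestamps are all hypothetical stand-ins, not real data from any actual subtitle database.

```python
# Toy sketch of the "subtitle database" hypothesis: the model never listens
# to any audio, it just looks words up in pre-generated, timestamped
# subtitles. All ids, lines, and timestamps below are invented.

# video_id -> list of (timestamp_seconds, caption_line)
SUBTITLE_DB = {
    "cash_gambler_demo": [          # hypothetical video id
        (12.0, "I am a rambler"),
        (15.5, "I am a gambler"),
        (19.0, "I am a backbiter"),
    ],
}

def find_word(video_id, word):
    """Return (timestamp, caption) pairs whose caption contains the word."""
    hits = []
    for ts, line in SUBTITLE_DB.get(video_id, []):
        if word.lower() in line.lower():
            hits.append((ts, line))
    return hits
```

With an index like this the model can cite an exact timestamp instantly and without ever touching the audio; anything it then says about how the word is *pronounced* at that timestamp is pure confabulation.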
The other possibility is that it actually listens to videos/audio and analyzes them on the fly or during training, whatever. What's super interesting about it is that, look, we started with LLMs that literally had not a single real-world referent, nothing that could remotely qualify as a quale of, say, seeing an apple. They were trained entirely on people talking about their perceptions of apples, and somehow they managed to learn what apples are pretty well, without ever seeing one (which all philosophers agreed should be impossible -- seeing apples must come first -- and yet here we were). And now, if it's not just a subtitle hack, then we have quietly passed another milestone: the robots can now hear and see and correlate that with their knowledge bases.
Also, I asked the robot directly:
(timestamped subtitles followed)
Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.
If you mean The Animals' version, I think there's a slide during the first syllable that probably threw you, but it's still only one syllable.
Grok and GPT are both largely derived from tools originally developed to generate automated transcripts and subtitles.
You don't need to give the computer a database of timestamped lyrics when you can generate them. That the LLM can generate subtitles, or that it defaults to the common US English/dictionary pronunciation of the word (which is naturally going to be far more central to the training corpus) rather than the four-beat tempo of the actual audio (the ram-bull-er, the gam-bull-er, the back-bite-er), should not be surprising to anyone. In fact, I find your example highly illustrative of both the capabilities and common failure modes of such models.
As an aside, surely you must have more intelligent things to do with your time than arguing with chatbots.
Like reading replies like this? Arguing with a chatbot was more productive, tbh.
I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes. Also, listening to the track, it seems like a pretty good example of the pronunciation that you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.
"AI lies and confidently misrepresents evidence in order to advance its chosen position" is... not too surprising considering that it's been trained on decades of internet fora conversations, but probably not the kind of alignment we are looking for.
It certainly has access to the subtitles, so they are probably cached at least. That video has like 10 million views; even if I was the first person to question the number of syllables in "gambeler", it definitely could have been pre-subtitled.
Thank you!
Yeah, after showing this to people and thinking about it, I lean heavily towards Grok having been fed a bunch of autogenerated subtitles (with timestamps). Which is very cool, but not at all as cool as if it actually listened to stuff. Also, it then keeps hallucinating stuff on top.
Remember that the algorithm is not processing physical vibrations in the atmospheric medium; it's parsing bits in an audio file. The relevant metric here is not the runtime but the file size, and a 3-minute song is unlikely to be more than a few MB.
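Back-of-envelope on the file size, assuming a typical compressed audio bitrate of 128 kbps (YouTube's actual audio bitrates vary, but are in this ballpark):

```python
# Size of a 3-minute audio stream at an assumed 128 kbps bitrate.
bitrate_bps = 128_000                  # bits per second (assumption)
duration_s = 3 * 60                    # 180-second song
size_mb = bitrate_bps * duration_s / 8 / 1e6
# ~2.9 MB, consistent with "a few MB"
```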
I do understand this -- just the same, 'parsing bits' from a video file does not happen instantly. Indeed, just starting a stream on youtube is typically not what I would call 'instant'.
yt-dlp will download and transform a 20 minute youtube video in under a second, including subtitles if you want.
OK, but will Grok? I guess it would be pretty easy to try, but it might refuse on copyright grounds or something.
Nothing happens instantly, but lots of things happen quickly enough that they might as well be instantaneous from the perspective of an unaugmented human.
Yes, but serving and parsing videos from youtube is not one of those things.
Let's agree to disagree.
I don't think so -- you are wrong on this one.
No, parsing bits from a video file does happen practically instantly. Download a video file to your local disk and play it from there; you'll see. Even on YouTube, if you rewind back, it will have to present the bytes again.
The reason it takes a while for a YouTube stream to start is that this is what it takes for YouTube to locate the bytes you asked for and start streaming them to you.
Yes, and for the LLM to parse these bits, first YouTube needs to locate them, then serve them to the LLM. Even if the LLM can convince YouTube to serve the bits as fast as bandwidth will allow, it still needs to run those bits through some transcription algo -- which typically is borderline on lagging at 1x speed.
In the instant case, it would also need that algo to make some sort of judgement on the accent with which some of the words are being pronounced -- which is not a thing that I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the llm is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.
That's just not true. You see them running at 1x speed, because they only ever need to run at 1x speed. Running them faster is a waste of resources: what's the point of transcribing video in its entirety, if you are going to close the tab before it ends?
This is running on my home PC:
That's 2.5x transcription speed, and it's just running on CPU, not even using GPU.
Look, guys, processing your stupid-ass text questions on a frontier LLM takes way more resources than speech transcription. Whisper on my CPU can do 400 GFLOPS, so the above 67 seconds run used something like 26,000 GFLOPs. For comparison, running a transformer model on a single token requires something like twice the number of parameters of the model worth of flops. So, for example, GPT-3, a junk tier model by today's standards, has 175B parameters, so just feeding it the lyrics of Cash's song would need 150,000+ GFLOPs, and that's before it even produces any output.
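The arithmetic above checks out; here it is worked through. The only number added is the ~430-token lyric length implied by the "150,000+ GFLOPs" figure, which is an assumption, not something stated in the thread.

```python
# Sanity-checking the comparison: transcribing the whole song vs. merely
# feeding GPT-3 its lyrics. Throughput figures are the commenter's claims;
# the 430-token lyric length is an assumed back-solve.

whisper_gflops_per_s = 400                 # claimed CPU throughput
run_seconds = 67
transcription_gflop = whisper_gflops_per_s * run_seconds   # ~26,800 GFLOPs

# Rule of thumb: a forward pass costs ~2 FLOPs per parameter per token.
gpt3_params = 175e9
gflop_per_token = 2 * gpt3_params / 1e9                    # 350 GFLOPs/token
lyric_tokens = 430                         # assumed lyric length
prompt_gflop = gflop_per_token * lyric_tokens              # ~150,500 GFLOPs
```

So under these assumptions, just *reading* the lyrics with GPT-3 costs several times the compute of transcribing the entire recording.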
Modern speech transcription models not only make judgements about accents as a normal course of operations, they actually understand the speech they are transcribing to some (limited) extent, in the same sense as more general LLMs understand the text fed into them. It would actually be a pretty cool project for a deep learning class to take something like OpenAI Whisper and fine-tune it into an accent-detection tool. It's all there inside the models already; you'd just need to get it out.
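Schematically, that class project would bolt a small classification head onto the frozen encoder: pool the per-frame features into one vector, then softmax over accent labels. The sketch below uses random stand-in data in place of real encoder output; the dimensions, label set, and weights are all hypothetical, and actually loading Whisper is out of scope here.

```python
import math
import random

# Schematic accent classifier on top of a (stood-in) speech encoder.
# Real project: replace `frames` with Whisper encoder hidden states and
# train W, b on labeled accent data with the encoder (mostly) frozen.

random.seed(0)
N_FRAMES, DIM = 100, 8                 # stand-in encoder output shape
ACCENTS = ["US", "UK", "Southern US"]  # hypothetical label set

frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FRAMES)]

def mean_pool(frames):
    """Average per-frame feature vectors into a single utterance vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Randomly initialized linear head (untrained, illustration only).
W = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in ACCENTS]
b = [0.0] * len(ACCENTS)

pooled = mean_pool(frames)
logits = [sum(w_i * x_i for w_i, x_i in zip(row, pooled)) + b_k
          for row, b_k in zip(W, b)]
probs = softmax(logits)                # one probability per accent label
```

The point of the sketch is just that the hard part (the encoder features) already exists; the accent head is a few lines on top.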
Just to be clear: I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could. The frontier LLMs are all multimodal, which usually means that they can process and generate image tokens in addition to text tokens. There's literally no technical reason why they couldn't add audio token type, and some of them probably already do that.
OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.
Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?
LLMs lie, man -- we should get used to it, I guess.
In case it wasn't clear: it took 67 seconds to transcribe an entire 171-second song on my home CPU. You don't have to wait for it to finish, either: it produces output as it goes. It takes less than 20 seconds to process the entire song on my 9-year-old gamer GPU. It would take less than half a second on a single H100. On 5 H100s, transcribing the entire song would take less time than it takes you to blink. xAI reportedly has 100,000 H100s.
How is it any evidence? Responding to the question is at least an order of magnitude more computationally demanding task than transcribing the entire song.
That's the amazing thing about multimodal LLMs: they wouldn't even have to add any special feature; with multimodality you get it for free. For a multimodal LLM trained on sound tokens as well as text tokens, understanding the words in a song, analyzing accents, etc. is literally the same task as answering text questions. When you ask Grok a question, it searches the web, fetches websites, and processes the contents. Fetching and processing a song is, again, exactly the same task to a multimodal LLM as processing text.
I think your explanation about the AI lying and confidently misrepresenting evidence in this case is almost certainly true. But I don't see how the runtime of the music video would matter for this. If the AI were analyzing the music video - which I don't think it did - it would be analyzing the bits that make up a video file after downloading it from wherever it is, in which case it just needs to process the bits that make up the file, and the speed of that would be dependent on many factors, but certainly not limited by how long the video is. A human might be limited to maybe half the time at the shortest if they watched the video at 2x speed, but I don't see any reason why an AI couldn't transcribe, say, all recorded audio in human history within a second, just by going through the bits fast enough.
Any of this would need a pretty specialized video analysis module, though, which AFAIK doesn't really exist, period, much less built into Grok -- plus the ability to download the video directly rather than look at a stream of it, which YouTube doesn't really provide. So if the AI were literally accessing the video through that link, 3:00/2x is indeed the fastest it would be able to provide the transcript.
(it would not be instant in any case; downloading the video takes X seconds, analyzing it Y -- X + Y might be less than three minutes, but it's not less than one second)
No it wouldn't; you just need the codec specification to extract the audio from the video, at which point it becomes a reasonably straightforward speech-to-text problem, which is something we've been doing since the 90s.
You don't even need that; as anyone who's looked at the unauthorized youtube downloader tools out there will know, youtube's protocols allow access to the audio-only and video-only portions of a recording as separate streams. All Grok would have to do is request the audio-only stream.
OK, then you need an audio analysis model -- this is not a thing that is integrated into LLMs.
The first practical LLMs were developed as tools to automatically generate transcripts and subtitles. My point above is that even if we assume that Grok is not pulling a previous parse from some database, generating a fresh parse is well within its basic capabilities.
Interesting assertion, but it doesn't really have any bearing on whether or not Grok can do this -- it takes text input from the user and generates a text response. What makes you think it even has an interface to bring in audio inputs? (On the training end, they might -- given the hunger for data -- but it seems like an odd thing to include in a chatbot. Even for training, it would probably be better to do something like, oh, IDK, run a transcription algo on as much YouTube content as you can grab and then feed the text from that into your training set. You might even include some timestamps!)
You can make YouTube videos go at arbitrarily high speeds just using a Chrome extension. I actually had an issue recently where an extension was causing the videos to default to 10x speed, which was both amusing and annoying. In any case, anyone with a link to a YouTube video has the ability to just download the video using basic non-AI tools, so the AI wouldn't be limited by the UI that YouTube presents human users.
True (and interesting about the Chrome extension; what is the use case for 10x browser playback of youtube videos, I wonder?), but I'm quite sure Grok is not currently programmed with anything like this.
Even if it did have preprocessed subtitles, presumably "gambler" just appears like that regardless of pronunciation?
I don't know exactly what's going on here, but LLMs often respond like that. I'm not sure that "lying" is the correct term, or if it's more accurate to say that they frequently don't "perceive themselves" as having the literal knowledge that they're trained on and can, with some prompting, reproduce.
Some of them, most notably ChatGPT, are explicitly trained and prompted not to reproduce potentially copyrighted work like song lyrics. Though OAI's recent model spec has been updated to standards where the LLM is supposed to decline politely rather than lie and say it's incapable of reproducing them.
I've had this happen to me with many things that aren't copyrighted, but I suppose such "instructions" could perhaps bleed over into other areas.