This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
No, parsing bits from a video file happens practically instantly. Download a video file to your local disk and play it from there; you'll see. Even on YouTube, if you rewind, it has to re-fetch those bytes and present them again.
The reason a YouTube stream takes a while to start is that that's how long it takes YouTube to locate the bytes you asked for and start streaming them to you.
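That locate-and-serve step is just an HTTP byte-range request. A minimal sketch of the mechanism (the URL is a stand-in; real YouTube stream URLs are signed and short-lived):

```python
import requests

# Stand-in URL -- real YouTube stream URLs are signed and expire.
url = "https://example.com/video.mp4"

# Seeking doesn't re-download the file: the player asks the server for a
# byte range, here the 16 KiB starting at the 10 MB mark.
resp = requests.get(
    url,
    headers={"Range": "bytes=10485760-10502143"},
    stream=True,
)

print(resp.status_code)   # 206 Partial Content if the server supports ranges
print(len(resp.content))  # the requested slice, not the whole file
```

The latency you see is the server locating and shipping that slice, not the client parsing it.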
Yes, and for the LLM to parse those bits, YouTube first needs to locate them and then serve them to the LLM. Even if the LLM can convince YouTube to serve the bits as fast as bandwidth allows, it still needs to run them through some transcription algorithm -- and those typically hover on the edge of lagging even at 1x speed.
In the instant case, it would also need that algorithm to make some sort of judgement about the accent with which some of the words are pronounced -- which is not a thing I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the LLM is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.
That's just not true. You see them running at 1x speed because they only ever need to run at 1x speed. Running them faster is a waste of resources: what's the point of transcribing a video in its entirety if you're going to close the tab before it ends?
This is running on my home PC:
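Something like this minimal sketch, using the open-source openai-whisper package (the filename is a stand-in for a local copy of the song):

```python
import time

import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small model, comfortable on CPU

start = time.time()
result = model.transcribe("cash_song.mp3")  # hypothetical local audio file
elapsed = time.time() - start

print(result["text"][:200])
print(f"done in {elapsed:.1f} s")  # ~67 s for the 171 s song on my CPU
```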
That's 2.5x transcription speed, and it's just running on CPU, not even using GPU.
Look, guys, processing your stupid-ass text questions on a frontier LLM takes way more resources than speech transcription does. Whisper on my CPU does about 400 GFLOP/s, so the 67-second run above used something like 26,000 GFLOPs. For comparison, running a transformer on a single token costs roughly twice the model's parameter count in FLOPs. So, for example, GPT-3, a junk-tier model by today's standards, has 175B parameters, which means just feeding it the lyrics of Cash's song would need 150,000+ GFLOPs, and that's before it even produces any output.
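The arithmetic, spelled out (the lyric token count is my rough estimate):

```python
# Whisper side: sustained throughput times wall-clock time.
whisper_gflops_per_s = 400          # my CPU, roughly
run_seconds = 67
whisper_total = whisper_gflops_per_s * run_seconds
print(whisper_total)                # ~26,800 GFLOPs for the whole song

# GPT-3 side: ~2 FLOPs per parameter per token, just to read the input.
gpt3_params = 175e9
flops_per_token = 2 * gpt3_params   # 350 GFLOPs per token
lyric_tokens = 430                  # rough guess at the song's lyric length
gpt3_total_gflops = flops_per_token * lyric_tokens / 1e9
print(gpt3_total_gflops)            # ~150,000 GFLOPs before any output
```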
Modern speech transcription models not only make judgements about accents as a normal course of operation, they actually understand the speech they are transcribing to some (limited) extent, in the same sense that more general LLMs understand the text fed into them. It would actually be a pretty cool project for a deep learning class to take something like OpenAI Whisper and fine-tune it into an accent-detection tool. It's all there inside the models already; you'd just need to get it out.
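A minimal sketch of what that class project could look like, using Hugging Face's transformers (the accent labels and the dummy clip are made up; a real run would loop over a labeled accent dataset):

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperForAudioClassification

# Hypothetical accent classes for the fine-tuning task.
LABELS = ["southern_us", "general_american", "british", "other"]

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-base", num_labels=len(LABELS)
)  # pretrained encoder reused as-is, new classification head on top

# One training step on a fake 5-second, 16 kHz clip.
waveform = torch.randn(16000 * 5)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([0])  # "southern_us"

out = model(input_features=inputs.input_features, labels=labels)
out.loss.backward()  # the pretrained speech representations do the heavy lifting
```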
Just to be clear: I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could. The frontier LLMs are all multimodal, which usually means they can process and generate image tokens in addition to text tokens. There's literally no technical reason why they couldn't add an audio token type, and some of them probably already do.
OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.
Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?
LLMs lie, man -- we should get used to it, I guess.
In case it wasn't clear: it took 67 seconds to transcribe an entire 171-second song on my home CPU. You don't have to wait for it to finish, either: it produces output as it goes. It takes less than 20 seconds to process the entire song on my 9-year-old gamer GPU. It would take less than half a second on a single H100. On 5 H100s, transcribing the entire song would take less time than it takes you to blink. xAI reportedly has 100,000 H100s.
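Scaling the ~26,800 GFLOPs of work from above across that hardware (the throughput figures are my ballpark assumptions, held well below peak specs to allow for utilization):

```python
total_gflops = 26_800  # Whisper work for the whole 171 s song, from above

# Rough sustained throughputs in GFLOP/s -- assumptions, not benchmarks.
hardware = {
    "my CPU": 400,           # -> 67 s, matching the measured run
    "old gamer GPU": 1_500,  # -> ~18 s
    "one H100": 60_000,      # far below its ~1,000 TFLOPS FP16 peak
}

for name, rate in hardware.items():
    print(f"{name}: {total_gflops / rate:.2f} s")
```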
How is that any evidence? Responding to the question is a task at least an order of magnitude more computationally demanding than transcribing the entire song.
That's the amazing thing about multimodal LLMs: they wouldn't even have to add any special feature; with multimodality you get it for free. For a multimodal LLM trained on sound tokens as well as text tokens, understanding the words in a song and analyzing accents etc. is literally the same task as answering text questions. When you ask Grok a question, it searches the web, fetches websites, and processes the contents. Fetching and processing a song is, again, exactly the same task to a multimodal LLM as processing text.
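To make "the same task" concrete, a purely illustrative sketch -- nothing here is Grok's real API, and the tokenizer below is a toy stand-in for a learned audio codec:

```python
from typing import List

def audio_to_tokens(pcm: List[float], frame: int = 160) -> List[int]:
    # Toy stand-in: real systems quantize short audio frames against a
    # learned codebook; here we just hash 10 ms frames into 4096 buckets.
    return [
        hash(tuple(round(s, 3) for s in pcm[i:i + frame])) % 4096
        for i in range(0, len(pcm), frame)
    ]

text_tokens = [101, 7592, 2003]                # "what accent is this?" (made-up ids)
audio_tokens = audio_to_tokens([0.0] * 16000)  # 1 s of (silent) audio, made up

# To the model, answering is just next-token prediction over one
# interleaved sequence; 50257 is a hypothetical <audio> boundary marker.
sequence = text_tokens + [50257] + audio_tokens
print(len(sequence))  # 104 tokens, no special audio pathway required
```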
Sounds are not text though -- nothing is free, and nothing is instant.
Why don't you try it? Ask Grok to transcribe a song from a YouTube link and see what it does -- preferably a song that differs from the published lyrics somehow, maybe a live version or something.
So you think it's real? I think it's an interesting question, possible either way. On the other hand, gambler is clearly pronounced with three syllables in the song, which suggests some hallucination is going on.
Also, can’t this obviously be tested? Can’t we just see if it can accurately transcribe a 30 minute newly uploaded YouTube video in a couple of seconds? Idk.