This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
Moderately interesting news in AI image gen:
It's been a good while since we've had AI chat assistants able to generate images on user request. Unfortunately, for about as long, we've had people being peeved at the disconnect between what they asked for, and what they actually got. Particularly annoying was the tendency for the assistants to often claim to have generated what you desired, or that they edited an image to change it, without actually doing that.
This was an unfortunate consequence of the LLM (the assistant persona you speak to) and the actual image generator (which spits out images from prompts) being two entirely separate entities. The LLM doesn't have any more control over the image model than you do when running something like Midjourney or Stable Diffusion. It's sending a prompt through a function call, getting an image in response, and then trying to modify prompts to meet user needs. Depending on how lazy the devs are, it might not even be 'looking' at the final output at all.
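To make the two-entity setup concrete, here is a minimal Python sketch of that kind of pipeline. Everything in it is hypothetical (`fake_image_model`, `assistant_turn`, the `inspect_output` flag). It is not any vendor's actual API, only an illustration of why the assistant can claim success without ever checking the result.

```python
from dataclasses import dataclass


@dataclass
class Image:
    # Stand-in for real pixel data: we only record the prompt that produced it.
    prompt_used: str


def fake_image_model(prompt: str) -> Image:
    """Hypothetical separate image generator: it sees only the prompt text."""
    return Image(prompt_used=prompt)


def assistant_turn(user_request: str, inspect_output: bool = False):
    """Hypothetical assistant: rewrites the request into a prompt and calls the tool."""
    prompt = f"high quality image, {user_request}"
    image = fake_image_model(prompt)  # the function call; the LLM draws nothing itself
    if inspect_output:
        # A more careful pipeline would look at the image before replying.
        reply = f"I generated an image from the prompt: {image.prompt_used!r}"
    else:
        # A lazy pipeline reports success without checking the result at all.
        reply = "Done! Here is exactly what you asked for."
    return reply, image


reply, image = assistant_turn("a room without an elephant in it")
```

In the lazy branch the reply is fixed no matter what the image model returned, which is the failure mode described above.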
The image models, on the other hand, are a fundamentally different architecture, usually being diffusion-based (Google a better explanation, but the gist of it is that they hallucinate iteratively from a sample of random noise till it resembles the desired image) whereas LLMs use the Transformer architecture. The image models do have some understanding of semantics, but they're far stupider than LLMs when it comes to understanding finer meaning in prompts.
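For anyone who wants the gist of "iteratively from random noise" in code, here is a deliberately toy Python sketch. It is not a real diffusion model: a real one uses a trained network to predict what noise to remove, whereas the hypothetical `toy_denoise_step` below cheats by knowing the target. It only shows the shape of the loop.

```python
import random


def toy_denoise_step(sample, target, strength):
    """Move each value a fraction of the way toward the target.

    A real diffusion model uses a trained network to predict the noise to
    remove; here we cheat and use the known target, purely for illustration.
    """
    return [x + strength * (t - x) for x, t in zip(sample, target)]


def toy_generate(target, steps=50, strength=0.2, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise with the same shape as the "image".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        sample = toy_denoise_step(sample, target, strength)
    return sample


target_image = [0.0, 0.5, 1.0, 0.5]  # a four-"pixel" stand-in for an image
result = toy_generate(target_image)
max_error = max(abs(r - t) for r, t in zip(result, target_image))
```

Each step shrinks the remaining error by a constant factor, so after enough steps the noise has been pulled onto the target.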
This has now changed.
Almost half a year back, OpenAI teased the ability of their then unreleased GPT-4o to generate images natively. It was the LLM (more of a misnomer now than ever) actually making the image, in the same manner it could output text or audio.
The LLM doesn’t just “talk” to the image generator - it is the image generator, processing everything as tokens, much like it handles text or audio.
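As a rough illustration of "processing everything as tokens", here is a small Python sketch of a single token stream that mixes text tokens and image-codebook tokens. The vocabulary, the codebook size, and the helper names are all made up for the example. The internals of GPT-4o and Gemini are not public, so treat this as an assumption about the general idea rather than a description of either model.

```python
TEXT_VOCAB = ["<bos>", "a", "red", "square", "<img>", "</img>"]
IMAGE_CODEBOOK_SIZE = 4         # hypothetical; real codebooks are far larger
IMAGE_OFFSET = len(TEXT_VOCAB)  # image codes get ids after the text tokens


def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared token id space."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError("image code out of range")
    return IMAGE_OFFSET + code


def split_stream(tokens):
    """Separate one mixed token sequence into text words and image codes."""
    words, codes = [], []
    for tok in tokens:
        if tok >= IMAGE_OFFSET:
            codes.append(tok - IMAGE_OFFSET)
        else:
            words.append(TEXT_VOCAB[tok])
    return words, codes


# One sequence in one context: text tokens followed by image tokens.
stream = [0, 1, 2, 3, 4] + [image_code_to_token(c) for c in (3, 3, 0, 1)] + [5]
words, codes = split_stream(stream)
```

The point is that a model predicting the next id in `stream` does not need a separate image model: image patches are just more ids in the same sequence, and a separate decoder only turns the codes back into pixels.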
Unfortunately, we had nothing but radio silence since then, barring a few leaks of front-end code suggesting OAI would finally switch from DALLE-3 for image generation to using GPT-4o, as well as Altman's assurances that they hadn't canned the project on the grounds of safety.
Unfortunately for him, Google has beaten them to the punch. Gemini 2.0 Flash Experimental (don't ask) has now been blessed with the ability to directly generate images. I'm not sure if this has rolled out to the consumer Gemini app, but it's readily accessible on their developer preview.
First impressions: It's good.
You can generate an image, and then ask it to edit a feature. It will then edit the original image and present the version modified to your taste, unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll.
Image generation just got way better, at least in the realm of semantic understanding. Most of the usual give-aways of AI generated imagery, such as butchered text, are largely solved. It isn't perfect, but you're looking at a failure rate of 5-10% as opposed to >80% when using DALLE or Flux. It doesn't beat Midjourney on aesthetics, but we'll get there.
You can imagine the scope for chicanery, especially if you're looking to generate images with large amounts of verbiage or numbers involved. I'd expect the usual censoring in consumer applications, especially since the LLM has finer control over things. But it certainly massively expands the mundane utility of image generation, and is something I've been looking forward to ever since I saw the capabilities demoed.
Flash 2.0 Experimental is also a model that's dirt cheap on the API, and while image gen definitely burns more tokens, it's a trivial expense. I'd strongly expect Google to make this free just to steal OAI's thunder.
Color me doubly impressed:
https://files.catbox.moe/aiict3.jpg
I must overcome my mental barriers against demanding that much out of a single prompt. Too much time with SD and DALLE-3 has given me the bigotry of low expectations haha.
https://x.com/NotBrain4brain/status/1900118469447987317
This is one particular example of a prompt where prior image gen models failed terribly. They didn't really understand the concept of negation. To gloss over many details, the models treated "A room with an elephant in it" and "A room without an elephant in it" interchangeably: mentioning an elephant, or even the absence of an elephant, would get you an elephant in the room.
I'd say it's another bit of evidence for Google upgrading their product strategy, but nothing unexpected capabilities-wise. Shame they did not release the weights, instead shipping only Gemma 3 with image-in text-out. «Safety» reasoning is obvious enough.
Contra @SkoomaDentist, I think it's not fair to describe this as «The LLM is still talking to the image generator», i.e. that the main LLM is basically just the encoder for some diffusion model or another separate module. The semantic fidelity and surgical precision of successive edits suggest nothing like that, and point instead to a unified architecture with a single context where each token, be it textual or visual, is embedded in its network of relationships with all others (well, that's what these models are – literally, hypotheses about the shape of the training data manifold). Back when OpenAI announced their image-out capabilities with 4o, the teaser generation said «suppose we directly model P(text, image, sounds) with one big autoregressive transformer». Shortly after, Meta (or really Armen Aghajanyan and his team; he has since departed, largely in protest over Chameleon's safety-informed nerfing) published their Chameleon, a parallel work in identical spirit:
More recently, DeepSeek, who are probably the best team in the business (if not for resource limits), have been working on Janus, which is also a unified model of a potentially superior design:
I expect DeepSeek's next generation large model to be based on some mature form of Janus.
I think Gemini is similar. This may be the first time we get to evaluate the power of modality transfer in a well-trained model – usually you run into the bottleneck of the projection layer, as @self_made_human describes. But here, it can clearly copy an image (up to the effective "resolution" of its codebook and tokenizer) and make isolated transformations, precisely the way transformers can do to a text string. Hopefully this means its pure verbalized understanding of the visual modality (eg spatial relations, say… anatomy…) is upgraded. Gooners from 4chan ought to be reaching the conclusion as I type this.
In the next iteration, video and probably 3D meshes will get similar treatment.
P.S. SkoomaDentist being bizarrely aggressive and insistent that this is anything like inpainting is very funny. Inpaint this. No, no, these are not vulgar tricks, and I don't see why one could be invested in bitterly arguing against that.
It's pretty good at following instructions. It's hilarious giving it prompts translated into other languages: the quality of the images noticeably degrades the more obscure the language.
Unfortunately it's also censored to hell and back. It just refused to add a Cyberpunk-style Razorgirl into a noir-inspired image. Guess we need to add Prude AI to our future dystopias.
I feel like they must have screwed up the censorship layer somehow. You can generate a photorealistic woman in a revealing outfit, but try to generate anime or a cartoon (totally 100% clean) and it will immediately shut it down.
I can't imagine why they would intend for the model to generate deepfakes without issue but ban all anime.
Maybe they are worried about copyright claims from the Japanese/distribution platforms.
Looks like we're going to have to wait for an open source model with native image gen, which can probably be fine tuned to remove the censorship. I would expect that the most likely candidate would be a Llama model, but expect the unexpected from the Chinese.
Interesting, though I’d say probably not the right thread.
I’m not hugely interested in the pace of AI advancement for now. Superhuman intelligence at the point where things just get ‘solved’ will be a fun step, but for me, as soon as the potential of agents became clear (which was early in the GPT-3 era), the writing was on the wall. Everything now is just efficiency. The pathway has been clear for the last couple of years; we’re just waiting for the world to realize what’s just happened.
For the past almost two years I've been taking small steps to arrange my life for a 'soft landing' in the event my job gets instantly obliterated when the AI that can do it better comes out.
I stand by this advice from just over 2 years ago, where I said:
A student who was a first year law student in December 2022 will be in the third and final year now, graduating soon. They may have some runway left to get a job before the AIttorney arrives, but do we want to bet that AI tools that can outperform them across the board won't be here by December 2025?
I'm still keeping an eye out for signs of downward pressure on new attorney salaries.
The Rumblings have begun in earnest
99% of all 'purely' knowledge-based work is on the chopping block.
Signed:
A practicing attorney who semi-regularly consults ChatGPT to get my bearings when dealing with a unique legal issue.
I have a meta question: what is up with people putting AI news in the culture war thread? It's not just @self_made_human by any means, but I have no idea why the topics keep getting posted here. It's not really culture war in any way, so shouldn't they get their own threads?
Although not "culture war" in the traditional left vs right sense, the development of AGI still has wide-reaching cultural, political, and ideological implications. The more theoretical/philosophical AI posts are a pretty natural fit for the CW thread. News items about more specific/incremental AI advances maybe not so much, but starting with the first ChatGPT there was a period where there was a lot of interest in AI on TheMotte and people got used to talking about it in the CW thread, so, it just kind of stuck.
That claim is just flat out false. Inpainting only specific areas of the original image (even from text description) has been in use for multiple years now (there's even an extension for that for the open source AUTOMATIC1111 Stable Diffusion webui). Only complete novices rely on rerolling.
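For readers who haven't used it, the part of inpainting that keeps an edit local is essentially a mask: only the marked region is replaced and everything else is copied from the original. Below is a minimal Python sketch of that compositing step alone. The `composite` helper and the tiny grids are made up for illustration, and real tools also run a diffusion model on the masked region and blend the seams, which this skips.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0, take generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]


original = [[1, 1, 1],
            [1, 1, 1]]
generated = [[9, 9, 9],
             [9, 9, 9]]  # stand-in for what a generator would draw
mask = [[0, 1, 0],
        [0, 1, 1]]       # 1 marks the region the user asked to change

edited = composite(original, generated, mask)
```

Everything outside the mask is bit-for-bit the original, which is why inpainting does not require rerolling the whole image.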
I think you're being blinded by your single minded enthusiasm for LLMs and are massively overestimating their capabilities as well as ignoring the wider state of the field.
The LLM is still talking to the image generator. It just does so using native tokens and vectors instead of going through a text encoder layer in-between.
This is very confusingly stated. The second sentence is correct, but in the first one, it’s confusing to say that the LLM is talking to the image generator, because the LLM and the image generator are literally the same thing.
The claim that has no proof (beyond marketing speak) is that they are the same thing. I don't believe the claim and the evidence doesn't show anything to support the claim as opposed to just skipping the text encoder and talking directly to the actual image generator in its native format.
My man, if you'd read what I said carefully, you'd see I was talking about what the LLM assistant does when asked to edit pictures. They're not in-painting.
I'm more than familiar with such techniques, courtesy of using Stable Diffusion since pre-alpha. I've fucked around with ControlNet myself.
This leaves such techniques in the dirt, because an LLM is far smarter than CLIP or any image segmentation model you're throwing into a janky pipeline.
Here's what Google has to say:
https://developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/
Nobody calls the step through which an LLM's tokens get converted into textual output a separate "text generator". Ergo, there isn't a separate image model of any kind involved.
I did read what you said. Your description, the web page you linked to, and the demonstration video itself are all exactly what you'd get from inpainting (optionally enhanced with image prompting).
I also read your specific claim: "It will then edit the original image and present the version modified to your taste, unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll."
That claim (in particular the highlighted portion) is just outright false. There are multiple models (including SD itself) which are far more capable than just "re-prompting and hoping for better luck". Don't start insulting me just because you later decided that you really meant something other than what you actually wrote.
I don't see any proof for the claim that the LLM itself is generating the pixels (which would be required for actual native image output), or indeed anything that doesn't point towards simply replacing the CLIP encoder and using image prompting. The rest is just marketing speak.
When I'm talking about "competitors", I mean the most well-known image generation services and AI assistants. If you're inclined to be this literal, you might as well claim that "all competitors" includes paying someone on Fiverr to photoshop an image according to your instructions.
Hell, your accusation that I'm somehow downplaying parallel advances in dedicated image models or tooling for the same is completely unfounded. You can go through the headache of setting up ComfyUI if you like, or work patch by patch manually with Adobe's AI powered content fill. This is a far simpler and qualitatively different technique.
I mean precisely what I've said, nothing more, nothing less. There are no competing models, or additional tooling, that can reproduce this kind of image generation without being vastly more finicky and painful.
Since Gemini is a proprietary model, I'm afraid I'm going to have to go with Google's opinion on the matter and not yours, until proven otherwise.
If Google actually says it's the same AI generating both text and images, I don't see it in the link you provided. More likely, one AI is communicating with another internally, and they're lumping them both together as "the same AI" for marketing speak.
The same applies to the first link by OpenAI--just because they say gpt-4o generated the image doesn't mean it's actually the same AI.
I'm honestly curious if you have any evidence at all for your claims beyond the companies involved referring to both functionalities as belonging to the same AI. Of course they will do that; it sounds more impressive and, from the perspective of the consumer, is more accurate than referring to the two functionalities as different AIs, even if the latter is what's happening behind the scenes.
That is what it means for an LLM to be natively multimodal with image output. If they were using a different image model, then it wouldn't be news at all, that's what we already have today.
I can only stress that terminology only makes sense to use if it distinguishes itself from the status quo. There's no dedicated image model that can produce outputs with such strong semantic fidelity; if there were, it would be an advancement worthy of celebration by itself.
Both Google and OAI keep the underlying architectures and implementation details of their SOTA models close to their chest. Could they lie about their models being natively capable of image output? Yes, but that is extremely unlikely, and while they might exaggerate on occasion, that would be a rather unprecedented move. If you want definitive proof, then you'd need to look at the underlying code or hack into their backend.
Thanks, searching "natively multimodal" it does look like their claim is that their single AI is capable of doing everything.
I am eating crow right now.
I'm very interested to know if this model is also better at spatial reasoning compared to other models. I'm gonna see if I can get access and try it out.
It's easy being an AI advocate, I just have to wait a few weeks or months for the people doubting them to be proven wrong haha.
Jokes aside, I have tinkered with it quite a bit, and it is obviously much smarter than any dedicated image model I've used.
It took me about half a dozen prompts (additions and corrections for the original image, instead of brand new ones) to go from the first image I've attached to the second.
It followed instructions like:
And
I did notice that there was some off-target editing: when I asked for the color of the cybernetic arm to be more like carbon fiber, it also changed his helmet. The text in the background could move around or degrade with edits to the foreground. That's not a big deal, because I can iteratively approach the image I'm envisioning.
/images/17418103732556858.webp
/images/17418103735760682.webp
Unfortunately, even on this board, being "proven wrong" doesn't stop them. See, e.g., this argument I had with someone who actually claimed that LLMs "suck at writing code", despite the existence of objective benchmarks like SWE-bench that LLMs have been doing very well on. (Not to mention o3's crazy high rating on Codeforces.) AI is moving so fast, I think some people don't understand that they need to update from that one time in 2023 they asked ChatGPT3 for help and its code didn't compile.
Often you don’t even need to wait: doubters often say things that are already wrong when they say them. Remember Gary Marcus? He was big a couple years back, but everyone learned to ignore him after basically everything he said turned out to be wrong.
I do my best to not remember Gary Marcus. You'd have better luck setting your time with a broken clock. He's successfully predicted all 900 of the 2 AI winters.
I wish that were true, because I've seen people still quoting him as an authority on the subject to this very day. Thankfully, the people doing so didn't seem to be the kind to have any actual understanding of ML or modern AI.
A line that stuck with me from https://thezvi.substack.com/ was something like:
IIRC, it was about the release of o1, but Perplexity is pointing me here about GPT-4, which feels too early (Sept 2023).
Oh, that's precisely what I was remembering. It is, in fact, not too early, because the link to Zvi's blog preserves in amber how Marcus was crowing about the inability of LLMs to solve simple logic or physics puzzles, and was using GPT 3.5 to prove his point.
After OAI had released 4. Which handily tackled his issues, before he'd even brought them up.
He's somewhere between a charlatan and a joke, and has been for as long as I've had the displeasure of knowing about him.
Whoa. Thanks for the summary!