Text, audio-vqvae, image-vqvae (possibly video too) tokens in one stream
How do you suppose it reads tiny words with a VQVAE? Even an RQVAE shouldn't have the pixel precision needed to see tiny 5px font letters.
and some modest computation capability (say, a cluster of 3090s or a commitment to spend a moderately large sum on lambda.labs)
This is not sufficient. The rig as described by neonbjb has only 192GB of vram; fine-tuning an LM with 130B params (in the best possible case of GLM-130B; the less said about the shoddy performance of OPT/BLOOM, the better) requires somewhere in the ballpark of ~1.7TB of vram (at least 20+ A100s), and that's at batch size 1 with gradient checkpointing, mixed precision, 8-bit Adam, fused kernels, no kv cache, and so on. If you don't have an optimised trainer ready to go (or, god forbid, you're attempting distributed training), you should expect double the requirements.
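For a rough sense of where that ~1.7TB comes from, here's a back-of-the-envelope sketch; the per-parameter byte counts are assumptions about a typical mixed-precision + 8-bit Adam setup, not measurements:

```python
# Back-of-the-envelope memory estimate for fine-tuning a 130B-param model.
# Byte counts per parameter are assumptions, not measured values.
params = 130e9

bytes_per_param = (
    2    # bf16/fp16 weights
    + 2  # bf16/fp16 gradients
    + 4  # fp32 master copy of the weights
    + 2  # 8-bit Adam: two moments at 1 byte each
)

static_tb = params * bytes_per_param / 1e12
print(f"weights + grads + optimizer states: ~{static_tb:.1f} TB")  # ~1.3 TB
# Activations (even with checkpointing), temporary buffers, and allocator
# fragmentation push the total toward the ~1.7TB ballpark above.
```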
The cost of that isn't too bad, of course. Maybe $25 an hour on Lambda Labs; any machine learning engineer can surely afford that. The larger doubt I have is whether any of this will take place.
Training consumes far more matmuls than inference. LLM training operates at batch sizes in the millions of tokens -- so if you aren't training a new model, you have enough GPUs lying around to serve millions of customers.
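For a sense of scale, using the standard ~6ND training vs ~2ND inference FLOP approximations (the batch and request sizes below are illustrative assumptions, not anyone's published numbers):

```python
# Illustrative only: compare one training step against serving one user request.
N = 130e9                   # model parameters
train_batch_tokens = 4e6    # assumed training batch size, in tokens
request_tokens = 1e3        # assumed prompt + completion length for one request

flops_per_train_step = 6 * N * train_batch_tokens   # ~6*N FLOPs per trained token
flops_per_request    = 2 * N * request_tokens       # ~2*N FLOPs per processed token

print(flops_per_train_step / flops_per_request)     # ~12,000 requests per training step
```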
My instinct is that this should be smaller and easier than the Stable Diffusion I run on my PC, but maybe I am just super wrong about that?
Super-wrong is correct. Nobody has a consumer-sized solution for that, and if it ever happens it'll be huge news.
or?
TLDR: it should be possible for any chump with 12GB of ordinary RAM, or some combination of offloaded RAM+vRAM that sums to 9GB, because running encoder-only is fast enough. Tests and stats mostly extrapolated from T5-3B because of personal hardware constraints (converting models costs much more memory than loading them)
There are markdown tables in this comment that do not display correctly on the site, despite appearing correctly on the comment preview. You may wish to paste the source for this comment into a markdown preview site.
To start, T5-XXL's encoder is actually 4.6B, not 5.5. I do not know why the parameters aren't evenly split between the encoder & decoder, but they aren't.
Additionally, it's likely that int8 quantisation will perform well enough for most users. `load_in_8bit` was recently patched to work with T5-like models, so that brings the memory requirements for loading the model down to ∼5GB.
What about vram spikes during inference? Well, unlike SD, the memory use of T5 is not going to blow significantly beyond what its parameter count would imply, assuming the prompts remain short. Running T5-3B from huggingface [0], I get small jumps of:
| dtype | vram to load | vram after .encode(11 tokens) | vram after .encode(75 tokens) |
|-|-|-|-|
| 3B-int8 | 3.6GB | 4.00GB | 4.35GB |
| 3B-bf16 | 6.78GB | | 7.16GB |
Note that the bump in memory for bf16 is smaller than for int8, because int8 does on-the-fly type promotion shenanigans.
Extrapolating these values to T5-XXL, we can expect bumps of (0.4∼0.8) * 11/3 = 1.5∼3GB of memory use for an int8 T5-XXL encoder, or <1.5GB for a bf16 encoder. We should also expect the model to take 10∼20% extra vram to load than what its parameters should imply.
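Spelled out, that scaling is just the table numbers above times the 11B/3B parameter ratio; nothing here is newly measured:

```python
# Scale the measured T5-3B encode bumps by the XXL/3B size ratio.
ratio = 11 / 3                                  # T5-XXL (11B) vs T5-3B

int8_bumps = [round(0.4 * ratio, 1), round(0.8 * ratio, 1)]
bf16_bump  = round((7.16 - 6.78) * ratio, 1)

print(int8_bumps)   # [1.5, 2.9] -> the "1.5~3GB" range
print(bf16_bump)    # 1.4        -> "<1.5GB"
```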
So, an ideal int8 T5-XXL encoder would take up to (4.6*1.15+3)GB, or slightly more than 8GB of vram during runtime. That still locks out a substantial number of SD users -- not to mention the 10xx series users who lack int8 tensor cores to begin with. Are they fucked, then?
Short answer: no, we can get away with CPU inference via ONNX.
I first came across the idea below a Gwern comment. Given that prompts are limited to 77 tokens, would it be possible to run the encoder on CPU in a reasonable amount of wall-clock time? Say, <60s.
Huggingface's default settings are atrociously slow, so I installed the ONNX runtime for HF Optimum and built ONNX models for T5-3B [1]. Results:
| quantized? | model size on disk | python RAM after loading (encoder+decoder) | model.encoder(**input) duration | full seq2seq pass |
|-|-|-|-|-|
| no | 4.7+6.3GB | 17.5GB | 0.27s | 42s |
| yes | 1.3+1.7GB | 8.6GB | 0.37s | 28s |
I'm not sure whether I failed to use the encoder correctly here, considering how blazing fast the numbers I got were. Even if they're wrong, an encoder pass on T5-XXL is still likely to fall below 60s.
But regardless, the tougher problem here is RAM use. Assuming it is possible to load the text encoder standalone in 8bit (I have not done so here due to incompetency, but the model filesizes are indicative), the T5-XXL text encoder would still be too large for users with merely 8GB of RAM to use. An offloading scheme with DeepSpeed would probably only marginally help there.
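For what it's worth, the most direct route to a standalone 8-bit encoder on CPU would be to point onnxruntime at just the quantized encoder graph from [1] and ignore the decoder files entirely. A sketch (untested beyond T5-3B; the quantized file name and the graph's input names are my assumptions about what Optimum writes out, so check the save_dir):

```python
# Untested sketch: run only the int8-quantized encoder graph, standalone, on CPU.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-3b")
# Assumed file name -- verify what ORTQuantizer actually wrote into the save_dir.
sess = ort.InferenceSession(
    "./t5-3b-ort-quantized/encoder_model_quantized.onnx",
    providers=["CPUExecutionProvider"],
)

enc = tokenizer("a corgi wearing a tiny wizard hat", return_tensors="np")
outputs = sess.run(
    None,
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
print(outputs[0].shape)   # last_hidden_state: (1, seq_len, d_model)
```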
[0] - example code to reproduce:
import torch  # only needed if you pass torch_dtype below
from transformers import AutoTokenizer, T5ForConditionalGeneration
model_name = "t5-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
PROMPT = "..."
# add torch_dtype=torch.bfloat16 OR load_in_8bit=True to the call below
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map='auto', low_cpu_mem_usage=True)
inputs = tokenizer(PROMPT, return_tensors='pt')
output = model.encoder(**inputs)
[1] - example code for ONNX model creation:
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = "t5-3b"
model_name_local = "./t5-3b-ort"
model_name_quantized = "./t5-3b-ort-quantized"

def create_ORT_base():
    # export the huggingface checkpoint to ONNX and save it locally
    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, from_transformers=True)
    model.save_pretrained(model_name_local)

def create_ORT_quantized():
    model = ORTModelForSeq2SeqLM.from_pretrained(model_name_local)
    model_dir = model.model_save_dir
    # one quantizer per exported graph: encoder, decoder, decoder-with-past
    encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
    decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
    decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
    quantizers = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]
    # dynamic int8 quantization targeting AVX512-VNNI CPUs
    dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    for q in quantizers:
        q.quantize(save_dir=model_name_quantized, quantization_config=dqconfig)

# run create_ORT_base() first, then create_ORT_quantized()
I didn't have any good place to add this in my post, but it's worth noting that caching of text embeddings will help a lot with using T5-XXL. Workflows that involve large batch sizes/counts or repeated inpainting on the same prompt do not need to keep the text encoder loaded permanently. Similar to the --lowvram mechanism implemented now, the text encoder can be loaded on demand, only when the prompt changes, saving memory costs.
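A minimal sketch of what I mean, with load_t5_encoder()/unload() as hypothetical placeholders for however the UI actually pages the encoder in and out of memory:

```python
import functools

@functools.lru_cache(maxsize=64)
def get_text_embedding(prompt: str):
    # Only pay the load + encode cost when an unseen prompt shows up;
    # repeated batches / inpainting passes reuse the cached result.
    encoder = load_t5_encoder()   # hypothetical: load the encoder on demand
    embedding = encoder.encode(prompt)
    unload(encoder)               # hypothetical: free the RAM/vram again
    return embedding
```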
In all likelihood, you'd need something like 8x3090, but that's about as hard to trace as a stealth weed growbox in a basement. Inference, I expect, also won't be feasible on normal consumer machines, so it'll incentivize some stealthy cloud computing, maybe very small-scale.
I'll bet against that. It's supposed to be an Imagen-like model leveraging T5-XXL's encoder with a small series of 3 unets. Given that each unet is <1B, this is no worse than trying to run Muse-3B locally.
I miss the past.
Then you've missed the point of the article entirely? It's an election-prediction site, trying to put forward a case for a Republican electoral victory. It would be very odd and partisan to portray Redd as an anti-strategist who doesn't care about the outcome and "prefers to die on that hill and lose election".
That does nothing to dispute the claim that supporting pro-life policies is costing Republicans votes. Ditto for basically everyone in the parent comment tree. I don't understand how so many motteposters are conflating "There's lots of pro-life people" (true) with "Being pro-life will make it easier to win elections" (do you believe this?)
Because of course every single woman is pro-abortion, of course a Republican-voting guy is not going to know any women who might be pro-life, of course no woman picked at random in the USA is going to be "I think the Supreme Court decision was great".
Are you reading the same article as me?
Redd: And by the way, I do talk to my female friends about abortion. Abortion is a problem for Republicans. That’s why I’m not sure if we’re going to win small or win big. But pretty much everything else lines up on our side.
The paradigm and reason for party allegiance is not equality, and certainly not government handouts, considering how every citizen of their home country sidesteps the government in favour of black markets where exchanges are made at better rates. The current Democrat party is actually damaging their bottom line through its fiscal policy.
Call me skeptical. What makes you confident it has a strong impact on voter habits? You have presented a theory -- a possible factor of influence -- but no reason to believe it has any more predictive power than any other explanation of voter preferences.
China isn't suffering from stagflation right now like the rest of the world. They have inflation of about 2%; if anything, the worry is that inflation is too low. This is because they didn't print huge amounts of money as stimulus. And the damage to the Chinese economy? According to the Asian Development Bank, Chinese growth will drop to 3.3% this year thanks to Omicron and these lockdowns. US growth is somewhere around 1.5%, and there's a recession looming. The US and the rest of the West are being forced to raise interest rates to reduce the growth that we paid for with stimulus.
I see absolute figures for two groups that did not start with the same absolute numbers.
Resolve?
For the people that might get duped like me: no, this article isn't about figuring out the true, statistically rigorous answer for what a medium-sized breast looks like.
Ah, well... Fuck. That covers every idea I had and more.
I think the migration just failed, and I plan on doing some data scraping / analytics this weekend to demonstrate it.
My question was, albeit unclearly, not about "why would this be a bad thing", but rather: Conditional on the West recognising this as a true and obviously bad thing, what could even be done? "Just stop digging the hole", as reactionaries will know, is an incredibly difficult task at times.
But @crake has answered that question well.
This is a great response.
What's the point?
No, really -- let's say you win. You've convinced the entirety of the western public that COVID-19 was made in a Chinese biolab. Okay, now what?
I have 180°'d on my opinions, thanks.
I've always wondered if the parentheses attention format was intentionally designed for humour.
Delete (yes -- delete) all distractions. Mute everything. Lock your phone in a safe. Ensure that the only kind of rest you're permitted is passing out on the floor from exhaustion.
Bruh
where did you learn that from?