Friday Fun Thread for November 8, 2024

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that); this thread is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

Do we have anyone running local offline LLMs here?

How are they coming along? Do you need to load them into VRAM, or can you load them into RAM or something and use either CPU or GPU from there?

If you use llama.cpp, you can load part of the model into VRAM and evaluate it on the GPU, and do the rest of the evaluation on the CPU. (The -ngl [number of layers] parameter determines how many layers it tries to push to the GPU.)
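For a rough idea of what that looks like in practice (a sketch only: the binary is llama-cli in recent builds, main in older ones, and the model path here is just a placeholder):

```bash
# Offload 20 layers to the GPU via -ngl; the remaining layers run on the CPU.
# Model path and quantization are placeholders; use whatever GGUF you have.
./llama-cli -m ./models/mistral-7b-instruct-q4_k_m.gguf -ngl 20 -c 4096 \
  -p "Explain how unified memory works in one paragraph."
```

If you run out of VRAM, lower the -ngl value; if you have headroom left, raise it until as much of the model as possible sits on the GPU.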

In general, I strongly recommend using this over the "industry standard" Python-based setup, as the overheads of 1 GB+ of random dependencies and an interpreted language tend to build up in ways that don't show up in benchmarks. (You might not lose much time per token, but you will use more RAM (easy to measure), put more strain on assorted caches and buffers (harder to attribute), and have more context switches degrading UI interactivity.)

In general, I strongly recommend using this over the "industry standard" Python-based setup, as the overheads of 1 GB+ of random dependencies and an interpreted language tend to build up in ways that don't show up in benchmarks

Very true. Playing around with Stable Diffusion has achieved the impossible: it makes my Windows install seize up from memory pressure, causes actual app crashes, and forces me to restart the system more often than the once-every-three-weeks when the inevitable update happens.

I had to download RAMMap from Sysinternals because it can clear the gigabytes of memory that A1111 leaks every time a model gets swapped. Even then, A1111 needs restarting after every ~50 MB of generated pics.

Maybe I should switch to ComfyUI.

Thanks! Do you use it?

Yes, though I haven't paid attention to it in about half a year so I couldn't answer what the capabilities of the best models are nowadays. My general sense was that performance of the "reasonably-sized" models (of the kind that you could run on a standard-architecture laptop, perhaps up to 14B?) has stagnated somewhat, as the big research budgets go into higher-spec models and the local model community has structural issues (inadequate understanding of machine learning, inadequate mental model of LLMs, inadequate benchmarks/targets). That is not to say they aren't useful for certain things; I have encountered 7B models that could compete with Google Translate performance on translating some language pairs and were pretty usable as a "soft wiki" for API documentation and geographic trivia and what-not.

Do we have anyone running local offline LLMs here?

How are they coming along?

There are some really good models available to run, but they require beastly graphics cards. Here are some llama benchmarks for a rough idea.

Do you need to load them into VRAM, or can you load them into RAM or something and use either CPU or GPU from there?

In theory, they can be run on a CPU, but GPUs are far better suited to the task.
The best places I know of to find information on local LLMs are https://old.reddit.com/r/LocalLLaMA/ and https://boards.4chan.org/g/, especially the LLM general there.

Thank you.

I can run 7B models on a MacBook M2 with 8 GB of RAM. This works because MacBooks use unified memory, so the GPU draws from system RAM rather than needing dedicated VRAM.

It's pretty slow, and 7B models aren't great for general tasks. If you can use one that's fine-tuned for a specific thing, they're worth it.
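For reference, a sketch of how that looks with llama.cpp on Apple Silicon (Metal support is compiled in by default there; the model path is a placeholder):

```bash
# The GPU shares the system's unified memory, so -ngl 99 offloads every
# layer without dedicated VRAM, as long as the quantized model fits in RAM.
./llama-cli -m ./models/llama-2-7b-chat-q4_k_m.gguf -ngl 99 \
  -p "Write a haiku about unified memory."
```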

Frankly, however, I'd just recommend using something like together(dot)AI or OpenRouter to run larger models elsewhere. Normal caveats about not pushing sensitive info out there, of course. $30-$50 worth of credits, even for monster models like Meta's 405B, will easily take you through a month of pretty heavy usage (unless you're running big automated workloads 24/7).
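If you go the OpenRouter route, the API is OpenAI-compatible, so calling it is just an HTTP request. A minimal sketch with curl (the model slug is illustrative; check their model list for the exact identifier):

```bash
# Assumes OPENROUTER_API_KEY is set in your environment.
# The model slug below is illustrative; OpenRouter lists the current names.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/llama-3.1-405b-instruct",
        "messages": [{"role": "user", "content": "Summarize llama.cpp in one sentence."}]
      }'
```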

I think there's going to be a race between local AI-specific hardware for consumers and cloud-based hyperscaling, and I don't know which will win. Privacy definitely plays a part. I'm quite optimistic about seeing a new compute hardware paradigm emerge.

I'm using openrouter.ai daily. The credits last for a surprisingly long time. Sonnet 3.5 is my go-to model.

I'd like something offline and private for sensitive use though.