Brainstorming Post LoK/Avatar Seven Havens Story Ideas

brucethemoose@lemmy.world · 8 hours ago

CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.

brucethemoose@lemmy.world · edit-2 8 hours ago

I completely disagree.

Frankly, I find the description “VC funding a FOSS” offensive. They aren’t funding the engine. I’ve been messing with LLM inference engines since 2022, and Ollama is the worst I’ve seen in the community.

They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatibile with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn’t really work and they’ll never upstream one line. They set a default context size thats basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

And if that’s all fine, they’re enshittifying the app with closed code, and pointers to cloud models.

They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the “default.” Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama communtity hates their guts, and that’s not even getting into the interpersonal drama they’ve stirred.

They are a leech that’s a net drag to the whole community, that we can’t get rid of because they’re attention grifters. And they’ve gotten worse and worse over time.

It’s more morale to use any cloud API over Ollama, in my eyes. They’re a grift.

EDIT: And, to be clear, I’m not against VC funded downstream stuff.

LM Studio is good! Even though it’s closed source.

Tons of downstream projects are great.

brucethemoose@lemmy.world · edit-2 10 hours ago

Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I’m running a 300B model on a single 3090, and its faster than I can read.

You just need to use the right framework, and the right model.

I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

And speculative decoding like DFlash or MTP (which you can also get specific models for).

EDIT: Wrong link.

brucethemoose@lemmy.world · 11 hours ago

Oh, and I just saw you have a 3090.

To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

If you have 64GB, I’d suggest a quantization of Step 3.7.

If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.

brucethemoose@lemmy.world · 11 hours ago

https://sleepingrobots.com/dreams/stop-using-ollama/

And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.

LM Studio is better, and easy.

If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them, there can’t be.

But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

brucethemoose@lemmy.world · edit-2 11 hours ago

An aside for anyone reading this:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that barely scratches the surface. Please.

Use anything but Ollama. Even APIs.

brucethemoose@lemmy.world · 11 hours ago

How much CPU RAM do you have?

brucethemoose@lemmy.world · 11 hours ago

Did you serve them with ollama?

It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

brucethemoose@lemmy.world · edit-2 11 hours ago

Yep.

I have a RTX 3090 + 128GB CPU RAM.

Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.

Never thought I’d ever run such a thing on my lowly desktop.

For quick scripts or code assistant, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemini 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.

…And honestly, I use GLM 5.2 API a good bit.

I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.

brucethemoose@lemmy.world · 3 days ago

Well, what if Sony made the PS6 a Linux PC?

They can have their own embedded storefront and proprietary stuff, but that would still be cool.

brucethemoose@lemmy.world · 11 days ago

It seems like some person with a bot just asked to maintain a bunch of orphaned packages, abusing the 2-week waiting period. Right?

Thats why they used npm; off the shelf, almost “standard practice” credential harvesting malware. Nothing too fancy.

brucethemoose@lemmy.world · 27 days ago

No.

Even the biggest open weights models are trained on pennies compared to OpenAI and Claude. They just don’t have the hardware to be so wasteful.

In fact, the Nvidia GPU ban was the best thing to ever happen to “small” AI devs. It made them thrifty.

brucethemoose@lemmy.world · 28 days ago

Yeah.

It’s not even about efficiency, really, but independence from corporations, privacy, and principle. Kind of like Lemmy.

brucethemoose@lemmy.world · 4 months ago

If you’re wondering about Fedora vs CachyOS, it comes down to what you do on your PC. And what you’re used to.

If you want better “preconfiguration” for graphics stuff, CachyOS is the way to go. With Fedora you will end up referencing and maintaining a whole lot more yourself, while the CachyOS maintainers basically do all that maintinance and config optimization for you.

But Fedora might be better for a less GPU-focused “workstation” type system.

Generally, I’d look at the “style” and interests of distro maintainers. CachyOS is built by a collective of linux gaming/compute enthusiasts that snowballed into popularity, though it does inherit all the work from Arch. Fedora is a long standing workstation/server workhorse, a “pre release” for Red Hat enterprise linux.

brucethemoose@lemmy.world · 2 years ago

Brainstorming Post LoK/Avatar Seven Havens Story Ideas

brucethemoose@lemmy.world · edit-2 2 years ago

'Avatar: Seven Havens' Rumors Emerge

Brainstorming Post LoK/Avatar Seven Havens Story Ideas

Brainstorming Post LoK/Avatar Seven Havens Story Ideas

'Avatar: Seven Havens' Rumors Emerge

'Avatar: Seven Havens' Rumors Emerge

Moderates