I’m doing a lot of coding, and what I’d ideally like is a long-context model (128k tokens) that I can throw my whole codebase into.

I’ve been experimenting with Claude, for example, and what usually works well is attaching the whole architecture of a CRUD app along with the most recent docs of the framework I’m using; that’s okay for menial tasks. But I’m very uncomfortable sending any kind of data to these providers.

Unfortunately I don’t have a lot of space, so I can’t build a proper desktop. My options are either renting a VPS or going for something small like a Mac Studio. I know speeds aren’t great, but I was wondering if using something like RAG for documentation could help me get decent speeds.
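
To be clear about what I mean by RAG for documentation: embed the docs once and pull only the top few chunks into each prompt instead of pasting everything. A minimal sketch of the idea (the embedding model and the chunking are just placeholders I picked for illustration):

```python
# Rough sketch of RAG over framework docs: embed doc chunks once, then pull only
# the few most relevant ones into the prompt instead of pasting all the docs.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedder (placeholder choice)

def build_index(doc_chunks):
    # doc_chunks: list of strings, e.g. docs split by heading or ~500 tokens each
    return model.encode(doc_chunks, normalize_embeddings=True)

def retrieve(query, doc_chunks, index, k=5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                        # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [doc_chunks[i] for i in top]

chunks = ["...framework docs split into chunks..."]  # hypothetical content
index = build_index(chunks)
context = "\n\n".join(retrieve("how do I define a route?", chunks, index))
# `context` goes into the prompt instead of the full documentation.
```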

I’ve read that Macs become very slow, especially with larger contexts. I’m not fully convinced, but I could probably get a new one at 50% off as a business expense, so the Apple tax isn’t as much of an issue as the concern about speed.

Any ideas? Are there other mini PCs with a better architecture for this? I tried researching but couldn’t find much.

Edit: I found some stats on GitHub for different models: https://github.com/ggerganov/llama.cpp/issues/10444

Based on those numbers, I also conclude that you’re going to wait forever if you work with a large codebase.
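
For a sense of scale, here’s the back-of-the-envelope math I’m doing; the throughput figures are assumptions for illustration, not measurements from that issue:

```python
# Back-of-the-envelope: time to ingest a large codebase as a prompt.
# Throughput numbers are assumptions for illustration, not benchmarks.
prompt_tokens = 100_000  # roughly "whole codebase" / near-128K territory

for label, pp_speed in [("Mac-class (~a few hundred t/s prompt processing)", 300),
                        ("Big-GPU-class (~a few thousand t/s)", 3000)]:
    minutes = prompt_tokens / pp_speed / 60
    print(f"{label}: ~{minutes:.1f} min just to process the prompt")
# ~5.6 min vs ~0.6 min -- and that's before a single token is generated.
```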

  • brucethemoose@lemmy.world · 2 days ago

    Late to this post, but shoot for an AMD Strix Halo or Nvidia Digits mini PC.

    Prompt processing is just too slow on Apple, and the Nvidia/AMD backends are so much faster with long context.

    Otherwise, your only sane option for 128K context is a server with a bunch of big GPUs.

    Also… what model are you trying to use? You can fit Qwen Coder 32B with like 70K context on a single 3090, but honestly it’s not good above 32K tokens anyway.
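
    If it helps, loading something like that with llama-cpp-python looks roughly like this. Treat it as a sketch: the quant filename is an example, and the context size that actually fits depends on the quant and your card.

    ```python
    # Rough sketch: a ~4-bit 32B GGUF plus a large KV cache on a single 24 GB card.
    # The model filename is an example; the n_ctx that fits is hardware-dependent.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-coder-32b-instruct-q4_k_m.gguf",  # example quant file
        n_ctx=65536,       # long context is what eats VRAM, via the KV cache
        n_gpu_layers=-1,   # offload every layer to the GPU
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain what this module does: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])
    ```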

    • shaserlark@sh.itjust.works (OP) · 5 hours ago

      Thanks for the reply, still reading here. Yeah, thanks to the comments and reading some benchmarks I abandoned the idea of getting an Apple; it’s just too slow.

      I was hoping to test Qwen 32B or Llama 70B for running longer contexts, hence the Apple seemed appealing.

      • brucethemoose@lemmy.world · 2 hours ago

        Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.

        In my personal tests over API, Llama 70B is bad out there. Qwen (and any fine-tune based on Qwen Instruct, with maybe an exception or two) not only sucks but is impractical past 32K once its internal RoPE scaling kicks in. Even GPT-4 is bad out there, with Gemini and some other very large models being the only usable ones I found.

        So, ask yourself… Do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, and that is perfectly doable on a single 24 GB GPU like a 3090 or 7900 XTX, and that’s where models actually perform well.
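
        To put rough numbers on “a boatload of code”: count your repo’s tokens before worrying about 128K. A quick sketch (tiktoken’s cl100k_base is just a stand-in tokenizer, and the path is hypothetical):

        ```python
        # Rough check of how much code fits in 32K-64K tokens.
        # cl100k_base is a stand-in; code-tuned tokenizers land in a similar
        # ballpark (very roughly 3-4 characters per token for source code).
        from pathlib import Path
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        total = 0
        for path in Path("my_project/src").rglob("*.py"):   # hypothetical project path
            total += len(enc.encode(path.read_text(errors="ignore")))
        print(f"{total} tokens")
        # If this prints well under ~32K, the relevant code already fits in the
        # range where models still behave.
        ```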

    • wise_pancake@lemmy.ca · 22 hours ago

      That actually seems attractive to me, but I’m unsure where I stand yet. It’s pricey, and I just want a box I can put in the basement and connect everything to over Wi-Fi.

  • 0x01@lemmy.ml · 5 days ago

    I do this on my Ultra. Token speed is not great, depending on the model of course. A lot of the source code out there is optimized for Nvidia and won’t even use the native Mac GPU without modification, defaulting to CPU instead. I’ve had to modify about half of what I run.
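
    The usual culprit is hard-coded `.cuda()` calls. The kind of patch I end up making boils down to explicit device selection, roughly like this generic PyTorch sketch (not taken from any particular repo):

    ```python
    # Typical device-selection patch: prefer CUDA, then Apple's MPS backend,
    # otherwise fall back to CPU, instead of an unconditional model.cuda().
    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():   # Apple Silicon GPU via Metal
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()
    # model = model.to(device)
    # inputs = {k: v.to(device) for k, v in inputs.items()}
    ```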

    YMMV, but I find it’s actually cheaper to just use a hosted service.

    If you want some specific numbers lmk

    • shaserlark@sh.itjust.works (OP) · 5 days ago

      Interesting, is there any kind of model you could run at reasonable speed?

      I guess over time it could amortize, but if the usability sucks that may make it not worth it. OTOH I really don’t want to send my data to any company.

  • tehnomad@lemm.ee · 4 days ago

    The context cache doesn’t take up too much memory compared to the model. The main benefit of having a lot of VRAM is that you can run larger models. I think you’re better off buying a 24 GB Nvidia card from a cost and performance standpoint.

    • shaserlark@sh.itjust.works (OP) · 4 days ago

      Yeah, I was thinking about running something like Code Qwen 72B, which apparently requires 145 GB of RAM for the full model. But if it’s super slow, especially with large context, and I can only run small models at acceptable speed anyway, it may be worth going Nvidia for CUDA alone.
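
      For my own sanity check, that 145 GB figure is basically parameter count times bytes per weight; quantization is what pulls it back toward single-GPU territory. Rough math, ignoring overhead:

      ```python
      # Rough weight-memory math (ignores KV cache, activations and runtime overhead).
      params = 72e9                      # ~72B parameters

      for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
          gib = params * bits / 8 / 1024**3
          print(f"{label}: ~{gib:.0f} GiB")
      # FP16: ~134 GiB (plus overhead -> the ~145 GB figure),
      # 8-bit: ~67 GiB, 4-bit: ~34 GiB.
      ```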

      • tehnomad@lemm.ee · 4 days ago

        I found a VRAM calculator for LLMs here: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

        Wow, it seems like for 128K context you do need a lot of VRAM (~55 GB). Qwen 72B will take up ~39 GB, so you would either need 4x 24 GB Nvidia cards or the 192 GB Mac Pro. Probably the cheapest option would be to deploy GPU instances on a service like Runpod; I think you would have to do a lot of processing before you hit the break-even point of owning your own machine.
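
        The context part is mostly KV cache, which you can estimate by hand too. The hyperparameters below are my assumptions for a Qwen2-72B-class model with grouped-query attention, so treat the exact number loosely; the calculator presumably adds runtime buffers on top.

        ```python
        # Rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
        # Hyperparameters are assumptions for a Qwen2-72B-class model with GQA.
        layers, kv_heads, head_dim = 80, 8, 128
        bytes_per_value = 2                       # FP16 cache

        for ctx in (32_768, 131_072):
            gib = 2 * layers * kv_heads * head_dim * ctx * bytes_per_value / 1024**3
            print(f"{ctx} tokens -> ~{gib:.0f} GiB of KV cache")
        # ~10 GiB at 32K vs ~40 GiB at 128K -- the scaling with context is the point.
        ```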

  • just_another_person@lemmy.world · 5 days ago

    I’ve not run such things on Apple hardware, so I can’t speak to the functionality, but you’d definitely be able to do it cheaper with PC hardware.

    The problem with this kind of setup is going to be heat. There are definitely cheaper mini PCs, but I wouldn’t think they have the space for this much memory AND a GPU, so you’d be looking at an AMD APU/NPU combo, maybe. You could easily build something about the size of a game console that does this for maybe $1.5k.

    • shaserlark@sh.itjust.works (OP) · 5 days ago

      I’d honestly be open to that, but wouldn’t an AMD setup take up a lot of space and consume lots of power / be loud?

      It seems like in terms of price and speed the Macs suck compared to other options, but if you don’t have a lot of space and don’t want to hear an airplane engine constantly, I’m wondering what the options are.

      • just_another_person@lemmy.world · 5 days ago

        I just looked, and the Mac Mini maxes out at 24 GB anyway. Not sure where you got the 196 GB figure from. Never mind, you said M2 Ultra.

        Look, you have two choices. Just pick one. Whichever is more cost effective and works for you is the winner. Talking it down to the Nth degree here isn’t going to help you with the actual barriers to entry you’ve put in place.

  • KoalaUnknown@lemmy.world · 4 days ago

    There are some videos on YouTube of people running local LLMs on the newer M4 chips, which have pretty good AI performance. Obviously a 5090 is going to destroy it in raw compute power, but the large unified memory on Apple Silicon is nice.

    That being said, there are plenty of small ITX cases at about 13-15L that can fit a large nvidia GPU.

    • shaserlark@sh.itjust.works (OP) · 4 days ago

      Thanks! I hadn’t thought of YouTube at all, but it’s super helpful. I guess that’ll help me decide if the extra RAM is worth it, considering that inference will be much slower if I don’t go Nvidia.