Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)

A user ran GLM-5.2 locally on CPU-only hardware, using the UD-Q2-K_XL quantization with ik_llama.cpp for inference. The machine is a Dell PowerEdge R740 server with dual Xeon 6248R CPUs and 768 GB of RAM. Restricting inference to a single NUMA node improved performance by avoiding cross-socket memory access; that leaves one socket's 24 cores and 384 GB of memory available to the model. The user reported a relatively smooth experience with the model.
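The 24-core and 384 GB figures follow from splitting the machine evenly across its two sockets. A minimal sketch of that arithmetic is below. It assumes one NUMA node per socket, memory populated evenly across both sockets, and 24 cores per Xeon 6248R (48 in total). The function name is hypothetical and not from the post.

```python
def per_node_resources(total_cores: int, total_ram_gb: int, numa_nodes: int) -> tuple[int, int]:
    """Return (cores, ram_gb) available on one NUMA node.

    Assumes a symmetric layout: cores and memory are split evenly across nodes.
    """
    if numa_nodes <= 0:
        raise ValueError("numa_nodes must be positive")
    return total_cores // numa_nodes, total_ram_gb // numa_nodes


# Dual-socket box: 2 x 24 cores, 768 GB total, one NUMA node per socket (assumed).
cores, ram_gb = per_node_resources(total_cores=48, total_ram_gb=768, numa_nodes=2)
print(f"{cores} cores, {ram_gb} GB per node")
```

The quantized model file plus its context buffers has to fit within the per-node memory figure for single-node isolation to work without spilling onto the other socket.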

Key takeaways

GLM-5.2 can run on CPU-only hardware with quantization.
ik_llama.cpp provides performance improvements over llama.cpp for CPU inference.
NUMA node isolation helps mitigate cross-socket latency issues.

#local-llm #cpu-inference #quantization
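One common way to isolate a process to a single NUMA node on Linux is numactl, which can bind both CPU scheduling and memory allocation to a chosen node. The sketch below builds such a command line. The post's exact invocation is not shown here, so the binary name (./llama-server), the model path (model.gguf), and the helper name are placeholders. The -m and -t flags are llama.cpp's model and thread options, assumed to carry over to ik_llama.cpp.

```python
import shlex


def numa_pinned_command(node: int, binary: str, model_path: str, threads: int) -> list[str]:
    """Build an argv list that runs `binary` bound to one NUMA node.

    --cpunodebind restricts execution to the node's CPUs;
    --membind restricts memory allocation to the node's RAM.
    """
    return [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        binary,            # placeholder: inference binary
        "-m", model_path,  # placeholder: path to the quantized GGUF file
        "-t", str(threads),
    ]


# Node 0 with 24 threads, matching one socket of the machine described above.
cmd = numa_pinned_command(node=0, binary="./llama-server", model_path="model.gguf", threads=24)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as-is. Matching the thread count to the node's physical core count avoids scheduling threads onto the other socket.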


Read at r/LocalLLaMA
