Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)
A user ran GLM-5.2 locally on CPU-only hardware, using the UD-Q2-K_XL quantization with ik_llama.cpp for inference. The machine was a Dell PowerEdge R740 server with dual Xeon 6248R CPUs and 768 GB of RAM. Restricting inference to a single NUMA node, which gave the model 24 cores and 384 GB of memory, improved performance. The user reported a relatively smooth experience with the model.
Key takeaways
- GLM-5.2 can run on CPU-only hardware with quantization.
- ik_llama.cpp provides performance improvements over llama.cpp for CPU inference.
- NUMA node isolation helps mitigate cross-socket latency issues.
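The single-node setup can be sketched as follows. This is a minimal illustration, not the user's actual command: it assumes the two sockets split cores and RAM evenly, that `numactl` is installed, and that the ik_llama.cpp build provides a `llama-server` binary accepting `-m` (model) and `-t` (threads) as llama.cpp does. The model path is hypothetical.

```python
import shlex


def per_node_resources(total_cores, total_ram_gb, numa_nodes):
    """Split cores and RAM evenly across NUMA nodes (assumes a symmetric layout)."""
    return total_cores // numa_nodes, total_ram_gb // numa_nodes


def numa_pinned_command(node, threads, server_bin, model_path):
    """Build a numactl command that keeps threads and memory on one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",  # schedule threads only on this node's cores
        f"--membind={node}",      # allocate memory only from this node's RAM
        server_bin,               # assumed binary name, see lead-in
        "-m", model_path,         # hypothetical model path
        "-t", str(threads),       # one thread per physical core on the node
    ]


# Dual Xeon 6248R: 2 sockets x 24 cores, 768 GB total RAM.
cores, ram_gb = per_node_resources(total_cores=48, total_ram_gb=768, numa_nodes=2)
cmd = numa_pinned_command(0, cores, "./llama-server", "/models/glm-5.2-UD-Q2-K_XL.gguf")
print(f"node 0 budget: {cores} cores, {ram_gb} GB")
print(shlex.join(cmd))
```

Binding both CPU and memory to the same node is what avoids cross-socket traffic; the trade-off is that the model and its context must fit in that node's share of RAM.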