1sec.ai

Tag

#local-llm

Every item tagged local-llm, newest first.

37 items

other · new · 1h

Does anyone have enough compute to make a distillation dataset out of GLM5.2?

A Reddit user is asking whether anyone has enough compute to build a distillation dataset from GLM-5.2 that could be used to train smaller models such as Qwen-3.5. The proposed dataset would contain 700k-1M examples. The poster argues that a shared dataset of this kind would help the community train better small models.

Key takeaways
  • GLM-5.2 proposed as source for distillation dataset.
  • 700k-1M examples suggested for dataset size.
  • Smaller models like Qwen-3.5 could benefit from dataset.

models · new · 5h

Local Qwen isn't a worse Opus, it's a different tool

The Qwen 1.8B model is a local, open-weights alternative to Anthropic's Opus with different strengths and use cases. It is not a direct replacement, but it is a viable option for builders who need a locally deployable model. Its performance characteristics and licensing terms make it suitable for specific applications. Running Qwen 1.8B locally gives you more control over data and infrastructure.

Key takeaways
  • Qwen 1.8B is an open-weights, locally deployable model.
  • Different performance profile compared to Opus.
  • Licensing terms allow for local deployment and customization.

Lemonade v10.8: auto memory management, cloud offload, Omni improvements, and call your local models as MCP tools

Lemonade v10.8 brings auto memory management, cloud offload, and Omni improvements, allowing for dynamic VRAM management, automatic context sizing, and easier model switching. The update was driven by 20 contributors in 7 days. You can now call local models as MCP tools, streamlining workflows. These changes aim to enhance performance and user experience.

Key takeaways
  • 20 contributors in 7 days for v10.8 release
  • Dynamic VRAM management auto-unloads idle models
  • Automatic context sizing based on available memory and model architecture

llama.cpp - how to free up even more space on your GPU

llama.cpp has improved RAM usage, eliminating memory leaks and allowing efficient GPU usage with models like Qwen3.6-27B-UD-Q5_K_XL. The author seeks advice on further reducing memory usage to increase context size on their eGPU setup with a 3090. They currently use --n-gpu-layers 99 --no-mmap --mlock. You can experiment with adjusting these parameters or explore quantization techniques.

Key takeaways
  • llama.cpp has stable RAM usage with no memory leaks.
  • --n-gpu-layers 99 --no-mmap --mlock config avoids regular RAM usage.
  • Seeking tips to free up more memory for larger context sizes.
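
Since the goal in this item is a larger context window, it can help to estimate how much memory the KV cache itself needs. The sketch below uses the standard formula for a plain attention stack (one K and one V tensor per layer). The layer and head counts are hypothetical placeholders, not the real Qwen3.6-27B values, and models with sliding-window or hybrid attention will deviate from this estimate. The 8-bit figure assumes ggml's q8_0 layout of 34 bytes per 32 values. llama.cpp exposes cache-type options for a quantized KV cache, but check your build's --help before relying on them.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size for a standard attention stack.

    The leading factor 2 accounts for one K and one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical architecture numbers for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

F16_BYTES = 2.0        # 16-bit cache entries
Q8_0_BYTES = 34 / 32   # q8_0: 32 int8 values plus one 16-bit scale per block

if __name__ == "__main__":
    gib = 1024 ** 3
    for n_ctx in (32_768, 65_536):
        f16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx, F16_BYTES)
        q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx, Q8_0_BYTES)
        print(f"ctx={n_ctx}: f16 {f16 / gib:.2f} GiB, q8_0 {q8 / gib:.2f} GiB")
```

Plugging in a model's real layer count, KV head count, and head dimension shows how much VRAM each extra block of context costs, and roughly what a quantized cache would save.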

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

A user shared a Docker deployment configuration for running GLM-5.2 in FP8 with SGLang on an HGX H200 system. The setup spreads the model across 8 GPUs with tensor parallelism and sets the fraction of memory reserved for the model. The configuration may help others with similar hardware deploy GLM-5.2 locally.

Key takeaways
  • Uses lmsysorg/sglang:latest Docker image.
  • Configured for an HGX H200 system with 8-way tensor parallelism.
  • Allocates a fraction of system memory for the model.

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Gemma 4 E2B was run in-browser at 255 tokens/s using WebGPU kernels developed with Fable 5 before its shutdown. The optimized kernels and demo have been released for public use on Hugging Face. This achievement shows that high-performance LLM inference is feasible on web platforms, enabling new deployment options for builders. The demo provides a practical example of optimized WebGPU kernels in action.

Key takeaways
  • 255 tokens/s on M4 Max with WebGPU kernels.
  • Kernels and demo released on Hugging Face for public use.
  • Enables high-performance LLM inference on web platforms.

GLM-5.2 is a win for local AI

GLM-5.2, a massive 753B MIT-licensed LLM, has been released, offering a frontier-level coding agent. Although its large footprint makes local deployment impractical for most, its open license enables community fine-tuning of smaller architectures. This could lead to significant improvements in local AI setups through distillation of GLM-5.2's reasoning and synthetic datasets.

Key takeaways
  • GLM-5.2 has a 753B parameter footprint.
  • MIT-licensed for open use.
  • Community fine-tuning of smaller models may lead to significant local AI improvements.

I released a local LLM-powered RPG where generated NPCs, locations, items, and quests persist as in-game objects

A developer released a local LLM-powered RPG where generated NPCs, locations, items, and quests persist as in-game objects. The LLM handles dialogue and narration while the game system manages RPG structure like inventory and combat. This approach enables a dynamic experience with reusable generated content. You can interact with the same NPCs and locations multiple times.

Key takeaways
  • Generated content persists between interactions.
  • LLM handles dialogue and situational interpretation.
  • Game system manages RPG mechanics like inventory and combat.

Local models went from mostly useless to actually useful really fast. What changed?

Local models have rapidly improved in capability, shifting from mostly useless to actually useful in about a year. This change enables builders to use models like Gemma, Qwen, and GLM for practical applications such as coding, private document handling, and local workflows. The improvement is attributed to advancements in model training and fine-tuning techniques. You can now deploy these models locally for tasks that require privacy and low latency.

Key takeaways
  • Local models now usable for coding, private docs, and workflows.
  • Rapid improvement in model capabilities over the past year.
  • Deployable locally for privacy and low latency.

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

Researchers found that popular LLMs like Llama-3 and Mistral-7B produce stories with low diversity, often repeating common tropes like 'Elias in the lighthouse'. This issue persists even with fine-tuning and larger model sizes. The study suggests that improving LLM diversity may require new techniques beyond scaling up model parameters.

Key takeaways
  • LLMs like Llama-3 and Mistral-7B produce stories with low diversity.
  • Low diversity issue persists with fine-tuning and larger model sizes.
  • New techniques may be needed to improve LLM diversity.

Cheapest way to run GLM 5.x locally that's not a unified memory system?

The discussion explores cost-effective ways to locally run GLM 5.x models, focusing on 4bit quantization. Users share experiences with CPU-only setups like Sapphire Rapids ES 56core + DDR5 and multi-GPU configurations with partial offloading. The conversation aims to identify viable options for running large models like GLM 5.x outside unified memory systems. You can consider various hardware configurations for efficient local deployment.

Key takeaways
  • Sapphire Rapids ES 56core + DDR5 is a potential option for running GLM 5.x locally.
  • Multi-GPU setups with partial offloading are also being explored.
  • The discussion is not limited to GLM 5.x, but also applies to similarly sized models.
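
A quick way to judge any of these hardware options is to estimate how much memory the weights alone need at a given precision. This is a minimal sketch that takes the 753B parameter count cited for GLM-5.2 elsewhere in this listing and assumes a flat bits-per-weight figure. Real 4-bit formats store extra scale data, so actual files are somewhat larger, and the KV cache and activations come on top.

```python
def weight_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


GLM_PARAMS = 753e9  # parameter count as reported in this listing

if __name__ == "__main__":
    for label, bits in (("16-bit", 16), ("FP8", 8), ("4-bit", 4)):
        print(f"{label}: about {weight_gb(GLM_PARAMS, bits):.1f} GB")
```

At a flat 4 bits per weight the weights come to roughly 376.5 GB, which is why the thread centers on large DDR5 pools and multi-GPU setups with partial offloading.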

Someone awhile ago did a quant shootout for Qwen3.6, I did shoddy math on it (again)

A Reddit user ran their own rough calculations on an earlier quantization shootout for Qwen3.6, comparing performance across quantization schemes. The analysis includes perplexity and model size metrics. You can use this data to inform model deployment decisions, particularly for local inference. The numbers illustrate the trade-off between model accuracy and computational efficiency.

Key takeaways
  • Qwen3.6 was tested with various quantization schemes.
  • Perplexity and model size metrics were reported.
  • Results can inform local model deployment decisions.

I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..

A developer successfully compiled llama.cpp to run CUDA and Vulkan simultaneously, optimizing performance for a W7800 GPU using ds4 on opencode. The compilation was achieved with a specific CMake command that enabled both CUDA and Vulkan support. This allows the model to leverage multiple GPU architectures. Builders working on local LLM deployments may find this approach useful for optimizing performance across different hardware configurations.

Key takeaways
  • llama.cpp can be compiled to support both CUDA and Vulkan.
  • The compilation requires a specific CMake command with enabled flags for CUDA, Vulkan, and other optimizations.
  • This approach can be used to optimize performance on GPUs like the W7800.

GLM-5.2 is now 1st on Design Arena — ahead of the now unavailable Claude Fable 5.

GLM-5.2 has taken the top spot on Design Arena, a benchmark for evaluating AI models on design tasks. It surpassed Claude Fable 5, which is no longer available. This change in rankings may impact how builders choose and evaluate AI models for design applications. GLM-5.2's performance indicates its potential for design-related use cases.

Key takeaways
  • GLM-5.2 ranks 1st on Design Arena.
  • Claude Fable 5 is no longer available.
  • GLM-5.2 leads on design tasks.

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

GLM-5.2, a text-only open-weights LLM, was released with a 1M context window and MIT license. Early results show it performing well in coding tasks, near the top of arenas. Its open nature allows for local deployment and testing on real-world repositories. You can download and test GLM-5.2 locally, which may be attractive for builders seeking an alternative to API-only models.

Key takeaways
  • 1M context window, open weights, and MIT license.
  • Performs well in early coding task benchmarks.
  • Allows for local deployment and testing on real-world code.

GLM 5.2 API is live, weights are on HF, and ollama has it already

GLM 5.2's API is now live and its model weights have been released on Hugging Face under an MIT license. The model can be run locally or accessed through existing gateways. This development allows builders to deploy GLM 5.2 without restrictions, following its initial release locked behind a paid plan.

Key takeaways
  • GLM 5.2 model weights released under MIT license on Hugging Face.
  • API is live, allowing for remote access.
  • Ollama already supports running GLM 5.2 locally.

GLM-5.2 is the first open-weights model to cross 80% on Terminal-Bench and beats every other open model available

GLM-5.2 is the first open-weights model to achieve over 80% on Terminal-Bench, outperforming all other open models and even Gemini. This milestone marks a significant advancement in open-weights capabilities, offering a frontier-level model at a lower cost. You can now access a highly capable model without the high costs associated with closed models.

Key takeaways
  • GLM-5.2 crosses 80% on Terminal-Bench, a first for open-weights models.
  • Beats all other open models and Gemini on benchmarks.
  • Offers frontier-level performance at a lower cost.

Quoting Georgi Gerganov

Georgi Gerganov uses Qwen3.6-27B daily for coding tasks on his local machines, finding it a capable and helpful tool for small tasks. He runs it on both an M2 Ultra and an RTX 5090. The model helps with mundane tasks at ggml-org, though his usage is limited by time spent on PR reviews. Builders can consider Qwen3.6-27B for local deployment in coding workflows.

Key takeaways
  • Qwen3.6-27B used daily for coding tasks.
  • Runs on M2 Ultra and RTX 5090.
  • Use limited by time spent on PR reviews.

Be wary of Qwen/Claude distillations - they're often worse than the base model

A Reddit user warns that distilled/finetuned models like Qwopus, based on Qwen or Claude, often perform worse than their base models. The user aims to inform, not criticize, creators of these models. This issue may apply to other distilled models, such as Gemma 4/Claude. You should evaluate these models carefully before using them.

Key takeaways
  • Distilled Qwen/Claude models can be worse than base models.
  • Issue may apply to other distilled models like Gemma 4/Claude.
  • User aims to inform, not criticize, model creators.

Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

A new open dataset for training open-weight models has launched under CC-BY-4.0, aiming to counterbalance the coding-agent data that Anthropic and OpenAI collect through Claude and Codex. The initiative asks users to donate their coding agent traces so that other model labs can train on diverse user data. This supports the development of open-source models and promotes diversity in the LLM ecosystem.

Key takeaways
  • Open dataset for open-weight model training launched.
  • Anthropic and OpenAI currently hold most coding-agent data through Claude and Codex.
  • Open-source models can now train on diverse user traces.

Nex-N2 Pro is the real deal

The Nex-N2 Pro model, a rebranded version of Rio-3.5, has shown promising performance in local deployment tests. The model is a merge of N2 with the Qwen base model. Initial tests were hindered by bugs in the embedded GGUF chat template, but the issue was resolved. You can now run Nex-N2 Pro with bartowski's IQ2_S GGUFs.

Key takeaways
  • Nex-N2 Pro is a merge of N2 and Qwen base models.
  • Initial tests were hindered by bugs in the GGUF chat template.
  • bartowski's IQ2_S GGUFs enable successful deployment.

AI coding at home without going broke

The author shares strategies for using AI coding tools at home without incurring high costs. Local models like LLaMA and Phi-3 can be run on consumer hardware, reducing reliance on cloud services. Builders can save on API fees by self-hosting or using local models for development and testing. This approach enables cost-effective use of AI tools for personal projects.

Key takeaways
  • Local models like LLaMA and Phi-3 run on consumer hardware.
  • Self-hosting or local model use reduces API fees.
  • Cost-effective AI tool use enabled for personal projects.

models · Jun 10

Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

Google DeepMind released DiffusionGemma, a model that accelerates local AI inference by 4x for text and image generation. DiffusionGemma targets developers who want to deploy AI models locally on devices with limited resources. The model achieves faster inference through optimized diffusion-based architectures. You can integrate DiffusionGemma into your apps to improve performance and efficiency.

Key takeaways
  • 4x faster local inference for text and image generation.
  • Optimized diffusion-based architecture for efficient deployment.
  • Targets developers building local AI apps on resource-constrained devices.

research · Jan 27

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

Researchers from TII and NYU Abu Dhabi created the Alyah benchmark to evaluate Arabic LLMs on Emirati dialect understanding. The benchmark aims to improve LLM performance on local dialects. You can use Alyah to assess and compare model performance on Emirati Arabic.

Key takeaways
  • Alyah benchmark evaluates Emirati dialect understanding in Arabic LLMs.
  • Created by researchers from TII and NYU Abu Dhabi.
  • Benchmark assesses LLM performance on local dialects.

tools · Dec 11

New in llama.cpp: Model Management

The llama.cpp project now supports model management features. This allows users to easily switch between different models, track model versions, and manage model dependencies. Builders can use these features to streamline their local LLM development workflows.

Key takeaways
  • llama.cpp now supports model management.
  • Model version tracking and dependency management are included.
  • Streamlines local LLM development workflows.

tools · Nov 20

Introducing AnyLanguageModel: One API for Local and Remote LLMs on Apple Platforms

AnyLanguageModel provides a unified API for accessing local and remote LLMs on Apple platforms. It allows developers to integrate multiple models from different providers into their apps. The API supports both local model execution and remote inference via Hugging Face's API. This enables developers to build apps that can leverage the strengths of different models and deployment options.

Key takeaways
  • Unified API for local and remote LLMs on Apple platforms.
  • Supports integration of multiple models from different providers.
  • Enables local model execution and remote inference via Hugging Face's API.

models · Oct 13

Nemotron-Personas-India: Synthesized Data for Sovereign AI

NVIDIA released Nemotron-Personas-India, a dataset of 98,000 synthetic human interactions in 12 Indian languages. The dataset aims to support development of AI models tailored to Indian languages and cultural contexts. You can access the dataset via Hugging Face. This release supports the growth of sovereign AI capabilities in India.

Key takeaways
  • 98,000 synthetic human interactions in 12 Indian languages.
  • Dataset available on Hugging Face for model training.
  • Supports development of India-specific AI models.

research · Sep 26

Nemotron-Personas-Japan: A Synthetic Dataset for Sovereign AI

NVIDIA released Nemotron-Personas-Japan, a synthetic dataset for training sovereign AI systems in Japan. The dataset aims to support local language and cultural nuances. Builders can use this dataset to fine-tune models for Japanese language tasks.

Key takeaways
  • Nemotron-Personas-Japan supports training AI with local language and cultural context.
  • Dataset available on Hugging Face for use in model fine-tuning.
  • Supports development of sovereign AI systems in Japan.

models · Jun 19

(LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware

The FLUX.1-dev model can be fine-tuned on consumer hardware using LoRA, reducing memory requirements and enabling local deployment. This approach allows for efficient adaptation of large models to specific tasks. You can access the model and fine-tuning scripts on the Hugging Face blog. Builders can explore using LoRA for similar model optimizations.

Key takeaways
  • FLUX.1-dev can be fine-tuned with LoRA on consumer hardware.
  • LoRA reduces memory requirements for large model fine-tuning.
  • Fine-tuning scripts are available on Hugging Face blog.

models · Apr 5

Welcome Llama 4 Maverick & Scout on Hugging Face

Meta has released Llama 4 models Maverick and Scout on Hugging Face. The models are open-weights and available for download. You can deploy them locally or use them as a base for further fine-tuning. This release expands the Llama family of models.

Key takeaways
  • Llama 4 models are open-weights and downloadable.
  • Maverick and Scout are the latest additions to the Llama family.
  • Models are available on Hugging Face for local deployment or fine-tuning.

models · Oct 21

“Llama 3.2 in Keras”

Llama 3.2 works in Keras: the existing Keras Llama 3 implementation can load the Llama 3.2 checkpoints from Hugging Face, so developers can run the model locally. This provides a straightforward way to deploy Llama 3.2 without relying on proprietary APIs.

Key takeaways
  • Llama 3.2 is available in Keras for local deployment.
  • The Keras Llama 3 implementation loads standard Hugging Face checkpoints.
  • Keras implementation enables straightforward local deployment.

models · Sep 25

Llama can now see and run on your device - welcome Llama 3.2

Meta released Llama 3.2, an update to the Llama model family that adds vision support and on-device execution. The on-device models can run locally on devices with sufficient RAM, expanding deployment options for builders. Local execution enables lower latency and removes the reliance on cloud infrastructure. This update targets applications that require real-time responses or offline functionality.

Key takeaways
  • Llama 3.2 supports on-device execution on devices with sufficient RAM.
  • Enables lower latency and offline functionality.
  • Expands deployment options for local and edge applications.

models · Mar 20

A Chatbot on your Laptop: Phi-2 on Intel Meteor Lake

Microsoft researchers ran Phi-2, a 2.7B parameter LLM, on an Intel Meteor Lake laptop with on-device stable diffusion. The demo shows feasible deployment of small LLMs on consumer hardware. You can run similar benchmarks with Phi-2 on your own hardware using the Hugging Face model hub.

Key takeaways
  • Phi-2 runs on Intel Meteor Lake with on-device stable diffusion.
  • 2.7B parameter LLM feasible on consumer hardware.
  • Use Hugging Face model hub for similar benchmarks.