Tag

#optimization

Every item tagged optimization, newest first.

14 items

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

The Hugging Face blog details optimizing PyTorch performance by fusing MLP layers. Fusing nn.Linear layers into a single kernel improves inference speed and reduces memory usage. This technique can be applied to other PyTorch modules for similar performance gains. Builders can use these optimizations to deploy models more efficiently.

Key takeaways

Fusing nn.Linear layers improves inference speed.
Reduces memory usage.
Optimization technique applicable to other PyTorch modules.
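
To make the idea of fusion concrete, here is a minimal pure-Python sketch. It is not the blog's code and not a GPU kernel; it assumes a single linear layer followed by a bias-add and ReLU, and the function names are hypothetical. The unfused version materializes a full intermediate list after each step, while the fused version finishes each output in one pass, which is the kind of memory traffic that kernel fusion tries to avoid.

```python
def unfused_linear_relu(x, weight, bias):
    """Three separate passes, each materializing a full intermediate list."""
    # Pass 1: matrix-vector product (weight is [out_features][in_features]).
    matmul = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    # Pass 2: add the bias.
    biased = [m + b for m, b in zip(matmul, bias)]
    # Pass 3: apply ReLU.
    return [max(0.0, v) for v in biased]


def fused_linear_relu(x, weight, bias):
    """One pass: each output is fully computed before the next one starts."""
    out = []
    for row, b in zip(weight, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi
        out.append(max(0.0, acc))  # no intermediate lists are kept
    return out


# Toy inputs chosen for illustration only.
x = [1.0, -2.0, 0.5]
weight = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]
bias = [0.5, 2.0]

unfused = unfused_linear_relu(x, weight, bias)
fused = fused_linear_relu(x, weight, bias)
print(unfused, fused)
```

Both functions return the same values; only the number of passes and intermediate buffers differs.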

Hugging Face Blog · #pytorch #performance #optimization

models · Sep 11

Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers

The Hugging Face blog post walks through optimization techniques that arrived in transformers alongside OpenAI's gpt-oss open-weight models. These tricks aim to improve performance and efficiency, and they are not limited to gpt-oss: you can apply them to other transformer-based projects. The post provides actionable advice for builders.

Key takeaways

Optimization techniques improve transformer model performance.
Methods were introduced with OpenAI's gpt-oss models and carry over to other models.
Tricks enhance efficiency in transformer-based projects.

Hugging Face Blog · #transformers #optimization #open-source

research · Apr 16

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

Researchers from TNG and Hugging Face explain how the prefill and decode phases of LLM inference can be handled to optimize performance for concurrent requests. Prefill processes all of a request's prompt tokens in one pass, and decode then generates output tokens one at a time; treating these as two separate stages allows better utilization of GPU resources and increased throughput. With this split, latency can be reduced by up to 30%. Builders can apply this technique to improve performance in multi-user LLM applications.

Key takeaways

Prefill and decode reduces latency by up to 30% for concurrent requests.
Splits processing into prefill and decode stages for better GPU utilization.
Improves performance in multi-user LLM applications.
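
The two stages can be sketched with a toy key-value cache in pure Python. This is an illustration under stated assumptions, not the post's implementation: there is a single attention head, keys and values are the token vectors themselves (no learned projections), and `ToyKVCache` and `attend` are hypothetical names. Prefill fills the cache from the whole prompt at once; each decode step adds exactly one token and reuses everything already cached.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class ToyKVCache:
    """Holds keys and values so earlier tokens are never reprocessed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def prefill(self, prompt_vectors):
        # Prefill: take the whole prompt in one go and fill the cache.
        # Toy assumption: key == value == token vector.
        for vec in prompt_vectors:
            self.keys.append(vec)
            self.values.append(vec)
        return attend(prompt_vectors[-1], self.keys, self.values)

    def decode_step(self, new_vector):
        # Decode: add exactly one token, then attend over the cached history.
        self.keys.append(new_vector)
        self.values.append(new_vector)
        return attend(new_vector, self.keys, self.values)


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = ToyKVCache()
cache.prefill(prompt)
cached = cache.decode_step(new_token)

# Recomputing from scratch over the full sequence gives the same result.
full = prompt + [new_token]
recomputed = attend(new_token, full, full)
print(cached == recomputed)
```

The decode step touches one new token per call, which is why prefill (many tokens, one pass) and decode (one token, many passes) load the GPU so differently.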

Hugging Face Blog · #llm-performance #optimization #concurrent-requests

models · Jan 15

Train 400x faster Static Embedding Models with Sentence Transformers

Hugging Face describes how to train static embedding models with Sentence Transformers that run up to 400x faster on CPU than comparable embedding models. This speedup enables rapid prototyping and deployment of semantic search, clustering, and classification applications. You can leverage these advancements to build and refine your models more efficiently. The optimizations target performance-critical use cases requiring low-latency embeddings.

Key takeaways

Static embedding models that run up to 400x faster on CPU
Optimized for semantic search, clustering, and classification
Enables rapid prototyping and efficient deployment
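
As a rough sketch of why static embeddings are cheap to compute: a static model can be thought of as a lookup table of per-token vectors whose mean becomes the text embedding, with no transformer layers involved. The tiny vocabulary, the vectors, and the helper names below are hypothetical and exist only for illustration; a real model learns its table during training.

```python
import math

# Hypothetical toy vocabulary; in a real model this table is learned.
TOKEN_VECTORS = {
    "fast": [1.0, 0.0],
    "quick": [0.9, 0.1],
    "slow": [-1.0, 0.0],
    "model": [0.0, 1.0],
}


def embed(text):
    """Mean of per-token vectors; unknown tokens are skipped."""
    vectors = [TOKEN_VECTORS[t] for t in text.lower().split() if t in TOKEN_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = embed("fast model")
similar = cosine(query, embed("quick model"))
dissimilar = cosine(query, embed("slow model"))
print(similar > dissimilar)
```

Because embedding a text is only a lookup and an average, the cost per text stays small even on CPU.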

Hugging Face Blog · #semantic-search #embedding-models #optimization

tools · Jun 13

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

Hugging Face Accelerate supports both DeepSpeed and FSDP (PyTorch's Fully Sharded Data Parallel), allowing users to switch between the two distributed-training backends. This integration enables more flexibility for large-scale model training. Builders can choose the best approach for their specific use case. The addition of FSDP support addresses user requests for more optimization options.

Key takeaways

Hugging Face Accelerate supports both DeepSpeed and FSDP.
Users can switch between optimization libraries.
FSDP support added based on user requests.

Hugging Face Blog · #hugging-face #model-training #optimization

models · Feb 29

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Intel and Hugging Face collaborated on a text-generation pipeline optimized for Intel's Gaudi 2 AI accelerator. The pipeline enables faster and more efficient text generation on Gaudi 2 hardware. You can deploy this pipeline to improve performance and reduce costs for your text-generation workloads. The optimized pipeline is available on Hugging Face's model hub.

Key takeaways

Optimized for Intel Gaudi 2 AI accelerator
Faster and more efficient text generation
Available on Hugging Face's model hub

Hugging Face Blog · #text-generation #ai-accelerator #optimization

research · Dec 20

Speculative Decoding for 2x Faster Whisper Inference

Hugging Face researchers implemented speculative decoding for Whisper, making inference about 2x faster. In this method, a smaller and faster assistant model drafts several tokens ahead, and the main Whisper model verifies them in a single forward pass, so the output matches what the main model would have produced on its own. You can integrate this approach into your Whisper-based applications for faster performance. The technique is particularly useful for real-time transcription tasks where speed is crucial.

Key takeaways

Speculative decoding cuts Whisper inference time in half.
A small assistant model drafts tokens that the main model verifies in one pass.
Improves efficiency without losing accuracy.
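
Speculative decoding is usually implemented as draft-then-verify, which can be sketched in pure Python. This is a toy under stated assumptions, not the Whisper implementation: `target_next` and `draft_next` are hypothetical stand-ins for the large model and the small assistant, tokens are plain integers, and decoding is greedy. The point of the sketch is that the accepted output is identical to decoding with the target alone, while the target is consulted in fewer verification passes.

```python
def target_next(tokens):
    """Stand-in for the large model's greedy next token (hypothetical rule)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Stand-in for the assistant: wrong whenever len(tokens) is a multiple of 4."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt, num_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, draft_len=3):
    tokens = list(prompt)
    goal = len(prompt) + num_new
    verify_passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2) The target checks every drafted position. A real model scores
        #    all of them in one forward pass; here that is counted as one pass.
        verify_passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # replace the first wrong token
                break
            accepted.append(proposed)
        else:
            # Every draft token was accepted; the pass also yields one more.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
speculative, passes = speculative_decode(prompt, 12)
print(speculative == baseline, passes)
```

The speedup in practice depends on how often the assistant's drafts are accepted; a poorly matched assistant gives little or no gain.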

Hugging Face Blog · #real-time #speech-recognition #optimization

models · Mar 28

Accelerating Stable Diffusion Inference on Intel CPUs

Intel and Hugging Face collaborated to optimize Stable Diffusion inference on Intel CPUs, achieving up to 2x faster performance. The optimization leverages Intel's AVX-512 and VNNI instructions. This work enables faster and more efficient image generation on widely available hardware, benefiting developers who deploy Stable Diffusion models in production.

Key takeaways

Up to 2x faster Stable Diffusion inference on Intel CPUs.
Optimization uses Intel's AVX-512 and VNNI instructions.
Faster inference on widely available hardware reduces deployment costs.

Hugging Face Blog · #stable-diffusion #intel #optimization #inference

models · Feb 6

Accelerating PyTorch Transformers with Intel Sapphire Rapids - part 2

Intel and Hugging Face collaborated to optimize PyTorch transformer inference on Intel Sapphire Rapids processors. The work resulted in up to 2x faster inference performance for certain transformer models. You can reproduce the results and apply similar optimizations to your own models using the provided code and benchmarks.

Key takeaways

Up to 2x faster inference on Sapphire Rapids processors.
Optimizations available for PyTorch transformers.
Code and benchmarks provided for reproducibility.

Hugging Face Blog · #pytorch #transformers #optimization #hardware

models · Jan 2

Accelerating PyTorch Transformers with Intel Sapphire Rapids - part 1

Intel and Hugging Face collaborated to optimize PyTorch transformer performance on Intel Sapphire Rapids CPUs. The work resulted in significant speedups for transformer inference, making it more efficient for builders to deploy AI models. This optimization enables faster and more cost-effective model serving. You can leverage these improvements in your own applications.

Key takeaways

PyTorch transformer inference sped up on Intel Sapphire Rapids.
Optimization achieved through Intel and Hugging Face collaboration.
Faster inference enables more efficient model deployment.

Hugging Face Blog · #pytorch #transformers #optimization #intel

models · Jul 27

Faster Text Generation with TensorFlow and XLA

TensorFlow with XLA can accelerate text generation by up to 30% compared to standard TensorFlow. This performance boost enables faster model deployment and serving. You can integrate XLA into your TensorFlow workflow for improved efficiency. The approach works with popular models like T5 and OPT.

Key takeaways

Up to 30% faster text generation with XLA.
Works with T5 and OPT models.
Improves deployment and serving efficiency.

Hugging Face Blog · #tensorflow #xla #text-generation #optimization

tools · Sep 14

Introducing Optimum: The Optimization Toolkit for Transformers at Scale

Hugging Face introduced Optimum, a toolkit for optimizing transformer models at scale. It provides tools and techniques that enable faster and more efficient deployment. You can use Optimum to tune your transformer models for specific hardware and deployment scenarios, which helps you reduce costs and improve performance.

Key takeaways

Optimum is a toolkit for optimizing transformer models at scale.
It provides tools and techniques for faster and more efficient deployment.
Optimum enables optimization for specific hardware and deployment scenarios.

Hugging Face Blog · #transformers #optimization #hardware

models · Apr 20

Scaling-up BERT Inference on CPU (Part 1)

The Hugging Face team explores scaling up BERT inference on CPU, presenting optimizations and performance benchmarks. They achieved a 2x speedup on a single-socket Intel Xeon Platinum 8280 CPU. These improvements enable faster and more efficient deployment of BERT models on CPU infrastructure. You can apply these optimizations to your own BERT deployments.

Key takeaways

2x speedup on a single-socket Intel Xeon Platinum 8280 CPU.
Optimizations enable faster BERT deployment on CPU.
Improvements apply to existing BERT models.

Hugging Face Blog · #cpu-inference #bert #optimization

models · Jan 26

Faster TensorFlow models in Hugging Face Transformers

Hugging Face has optimized TensorFlow model serving in its Transformers library for faster inference. The update reduces latency by up to 30% across various models, so you can deploy models more efficiently and save on compute resources and costs.

Key takeaways

Up to 30% latency reduction in TensorFlow model serving.
Optimized for various models in the Transformers library.
More efficient deployment lowers compute usage and cost.

Hugging Face Blog · #tensorflow #model-serving #optimization