1sec.ai

Tag

#model-serving

Every item tagged model-serving, newest first.

13 items

otherApr 29

DeepInfra on Hugging Face Inference Providers 🔥

DeepInfra has joined Hugging Face as an inference provider, expanding access to optimized model serving. This partnership allows builders to deploy models with DeepInfra's performance-optimized infrastructure. You can now use DeepInfra's GPU-accelerated serving for Hugging Face-hosted models. The addition of DeepInfra brings more choices for scalable and cost-effective model deployment.

Key takeaways
  • DeepInfra joins Hugging Face as an inference provider.
  • Offers GPU-accelerated model serving for Hugging Face models.
  • Expands deployment options for builders.
otherSep 19

Scaleway on Hugging Face Inference Providers 🔥

Scaleway has joined Hugging Face as an inference provider, expanding access to scalable and secure cloud infrastructure for deploying AI models. This partnership enables users to deploy models on Scaleway's cloud infrastructure. Builders can now leverage Scaleway's GPU-accelerated instances for model serving. Scaleway's infrastructure is designed for high-performance and low-latency AI workloads.

Key takeaways
  • Scaleway joins Hugging Face as an inference provider.
  • Enables deployment of AI models on Scaleway's cloud infrastructure.
  • Scaleway offers GPU-accelerated instances for model serving.
modelsJul 10

Building the Hugging Face MCP Server

The Hugging Face team built the Model Catalog Platform (MCP) server to enable fast, scalable model serving. The MCP server handles millions of requests per second, supporting thousands of models and tens of thousands of users. You can deploy and serve models using the Hugging Face Hub API. The MCP server is a key component of Hugging Face's infrastructure, allowing for efficient model management and deployment.

Key takeaways
  • Handles millions of requests per second
  • Supports thousands of models and tens of thousands of users
  • Deploy models using Hugging Face Hub API
toolsJul 9

Upskill your LLMs With Gradio MCP Servers

Hugging Face has introduced Gradio MCP Servers, a new feature that enables you to deploy and manage LLMs at scale. This allows for efficient model serving and fine-tuning. You can now easily integrate LLMs into your applications using Gradio MCP Servers.

Key takeaways
  • Gradio MCP Servers enable scalable LLM deployment and management.
  • Efficient model serving and fine-tuning are supported.
  • Integration with applications is streamlined.
modelsJun 23

Transformers backend integration in SGLang

SGLang now supports integration with Transformers as a backend, allowing users to deploy models from the Transformers library with SGLang's serving infrastructure. This integration enables flexible model deployment and management. You can leverage the strengths of both frameworks. The update streamlines workflows for builders working with diverse model ecosystems.

Key takeaways
  • SGLang integrates with Transformers as a backend.
  • Enables deployment of Transformers models with SGLang serving.
  • Streamlines model deployment and management workflows.
toolsMar 21

The New and Fresh analytics in Inference Endpoints

Hugging Face has introduced new analytics features for Inference Endpoints, enabling users to monitor and optimize their model deployments. The updates provide detailed metrics on request latency, throughput, and error rates. Builders can now better understand performance bottlenecks and make data-driven decisions. This enhancement aims to improve the efficiency and reliability of model serving.

Key takeaways
  • New analytics features for monitoring request latency and throughput.
  • Detailed metrics help identify performance bottlenecks.
  • Improves efficiency and reliability of model serving.
modelsJun 7

Introducing the Hugging Face Embedding Container for Amazon SageMaker

Hugging Face and AWS have collaborated to launch the Hugging Face Embedding Container for Amazon SageMaker, streamlining the deployment of transformer-based embeddings. This integration allows you to deploy Hugging Face models directly on SageMaker, simplifying workflows and reducing operational overhead. The container supports popular Hugging Face models and is designed for efficient model serving. Builders can now easily leverage transformer-based embeddings in their SageMaker workflows.

Key takeaways
  • Hugging Face Embedding Container now available on Amazon SageMaker.
  • Simplifies deployment of transformer-based embeddings.
  • Supports popular Hugging Face models for efficient serving.
modelsDec 5

Goodbye cold boot - how we made LoRA Inference 300% faster

Hugging Face improved LoRA inference speed by 300% through dynamic adapter loading, eliminating cold boot times. This optimization enables faster model switching and reduces latency for builders using LoRA adapters. The technique allows for more efficient use of resources, making it easier to deploy and manage multiple models.

Key takeaways
  • 300% faster LoRA inference via dynamic loading.
  • Eliminates cold boot times for faster model switching.
  • Improves resource efficiency for multi-model deployment.
modelsOct 4

Accelerating over 130,000 Hugging Face models with ONNX Runtime

Microsoft and Hugging Face collaborated to integrate ONNX Runtime with Hugging Face Hub, enabling accelerated inference for over 130,000 models. This integration allows for faster and more efficient model deployment. You can now deploy models with optimized performance. The collaboration aims to improve the overall model serving experience.

Key takeaways
  • 130,000+ Hugging Face models accelerated with ONNX Runtime.
  • Integration enables faster and more efficient model deployment.
  • Optimized performance for model serving.
modelsSep 22

Inference for PROs

Hugging Face launched Inference for PROs, a paid API for high-priority access to optimized model inference. The API targets enterprise users who need fast, reliable, and scalable model serving. You can now get priority access to optimized inference for popular open-source models. This service aims to support production environments requiring low latency and high throughput.

Key takeaways
  • Hugging Face offers paid API for priority model inference.
  • Targets enterprise users with production environment needs.
  • Optimized for popular open-source models.
otherFeb 15

Why we’re switching to Hugging Face Inference Endpoints, and maybe you should too

A case study explains why one company switched to Hugging Face Inference Endpoints for model serving, citing cost savings and ease of use. They found Hugging Face's solution reduced costs and improved scalability. You can evaluate Hugging Face Inference Endpoints as an alternative for your model serving needs. The company's experience may help inform your own decisions about model deployment.

Key takeaways
  • Hugging Face Inference Endpoints reduced costs for one company.
  • Hugging Face's solution improved scalability.
  • Company switched from in-house model serving to Hugging Face.
otherNov 21

An overview of inference solutions on Hugging Face

Hugging Face offers a range of inference solutions for deploying machine learning models, including optimized runtimes, serverless endpoints, and on-premises deployments. These solutions support various frameworks like TensorFlow, PyTorch, and Transformers. You can use Hugging Face for model serving, monitoring, and logging. The platform provides a unified experience for model deployment and management.

Key takeaways
  • Hugging Face supports TensorFlow, PyTorch, and Transformers for inference.
  • The platform offers serverless endpoints and on-premises deployment options.
  • Hugging Face provides monitoring and logging for deployed models.
modelsJan 26

Faster TensorFlow models in Hugging Face Transformers

Hugging Face has optimized TensorFlow model serving in their Transformers library for faster inference. The update reduces latency by up to 30% across various models. You can now deploy models more efficiently. This improvement helps you save on compute resources and costs.

Key takeaways
  • Up to 30% latency reduction in TensorFlow model serving.
  • Optimized for various models in the Transformers library.
  • Efficient deployment reduces compute resource and cost needs.