models · 15h ago

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

r/LocalLLaMA · score 0.32

Gemma 4 E2B runs in-browser at 255 tokens/s on an M4 Max, using WebGPU kernels that were written with Fable 5 before its shutdown. The optimized kernels and a demo have been released publicly on Hugging Face. The result indicates that high-throughput LLM inference is feasible in the browser, which gives builders a deployment option beyond native runtimes. The demo lets readers see the optimized WebGPU kernels running rather than taking the figure on faith.

Key takeaways

255 tokens/s on M4 Max with WebGPU kernels.
Kernels and demo released on Hugging Face for public use.
Enables high-performance LLM inference on web platforms.
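
To put the 255 tokens/s figure in perspective, the arithmetic below converts throughput into average per-token latency and estimated reply time. This is a minimal sketch: the function names are hypothetical, and it assumes the figure is a steady decode rate (the summary does not say whether prompt processing is included or how the run was timed).

```typescript
// Hypothetical helpers for reasoning about a reported tokens/s figure.
// Assumption: the rate is a steady decode rate, excluding prompt prefill.

// Throughput from a measured run: tokens generated over elapsed wall time.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (!(elapsedMs > 0)) {
    throw new RangeError("elapsedMs must be positive");
  }
  return tokenCount / (elapsedMs / 1000);
}

// Average time budget per generated token, in milliseconds.
function msPerToken(tps: number): number {
  if (!(tps > 0)) {
    throw new RangeError("tps must be positive");
  }
  return 1000 / tps;
}

// Estimated seconds to generate a reply of a given length at a given rate.
function secondsForReply(tokenCount: number, tps: number): number {
  if (!(tps > 0)) {
    throw new RangeError("tps must be positive");
  }
  return tokenCount / tps;
}

const reportedTps = 255; // figure quoted in the post (M4 Max)
console.log(msPerToken(reportedTps).toFixed(2), "ms per token");
console.log(secondsForReply(500, reportedTps).toFixed(2), "s for 500 tokens");
```

At the reported rate, each token has a budget of just under 4 ms, and a 500-token reply would take about 2 seconds under these assumptions.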

#webgpu #in-browser-llm #local-llm #webml


Read at r/LocalLLaMA