Tag

#synthetic-data

Every item tagged synthetic-data, newest first.

4 items

Nemotron-Personas-India: Synthesized Data for Sovereign AI

NVIDIA released Nemotron-Personas-India, a dataset of 98,000 synthetic human interactions in 12 Indian languages. The dataset aims to support development of AI models tailored to Indian languages and cultural contexts. You can access the dataset via Hugging Face. This release supports the growth of sovereign AI capabilities in India.

Key takeaways

98,000 synthetic human interactions in 12 Indian languages.
Dataset available on Hugging Face for model training.
Supports development of India-specific AI models.

Hugging Face Blog · #synthetic-data #sovereign-ai #local-llm

research · Sep 26

Nemotron-Personas-Japan: A Synthetic Dataset for Sovereign AI

NVIDIA released Nemotron-Personas-Japan, a synthetic dataset for training sovereign AI systems in Japan. The dataset aims to support local language and cultural nuances. Builders can use this dataset to fine-tune models for Japanese language tasks.

Key takeaways

Nemotron-Personas-Japan supports training AI with local language and cultural context.
Dataset available on Hugging Face for use in model fine-tuning.
Supports development of sovereign AI systems in Japan.

Hugging Face Blog · #synthetic-data #sovereign-ai #local-llm

models · Jul 16

How we leveraged distilabel to create an Argilla 2.0 Chatbot

Argilla 2.0 integrated distilabel for data curation and model training. The Argilla team used distilabel to generate synthetic data, fine-tune models, and deploy a chatbot. This approach streamlined their development process and improved model performance. You can replicate this workflow using Argilla and distilabel.

Key takeaways

Argilla 2.0 used distilabel for data curation and model training.
Synthetic data generation improved model performance.
Streamlined development process via distilabel integration.

Hugging Face Blog · #fine-tuning #synthetic-data #chatbots

research · Mar 20

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Hugging Face introduces Cosmopedia, a large-scale synthetic dataset for pre-training Large Language Models, and describes how it was built. The dataset contains over 30 million files and 25 billion tokens of synthetic textbooks, blog posts, stories, and WikiHow-style articles generated with Mixtral-8x7B-Instruct-v0.1. The post walks through how prompts were curated for broad topic coverage and diverse outputs, and how generation was run at scale. The dataset is available on the Hugging Face Hub, and the generation code is open source.

Key takeaways

Cosmopedia is synthetic pre-training data generated with Mixtral-8x7B-Instruct-v0.1.
Prompt curation targets broad topic coverage and diverse outputs.
The dataset is available on the Hugging Face Hub, and the generation code is open source.

Hugging Face Blog · #synthetic-data #pre-training #large-language-models
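
A common way to get diverse synthetic text at scale is to vary the prompt rather than the model, for example by crossing seed topics with target audiences and output formats. The sketch below illustrates that idea only. The topics, audiences, formats, template, and the `build_prompts` helper are all hypothetical placeholders, not code or data from the Cosmopedia release, and sending the prompts to an LLM is left out.

```python
from itertools import product

# Hypothetical placeholders for illustration; not taken from Cosmopedia.
SEED_TOPICS = ["photosynthesis", "binary search"]
AUDIENCES = ["young children", "college students"]
FORMATS = ["textbook section", "blog post"]

# Assumed prompt template; a real pipeline would use richer instructions.
TEMPLATE = "Write a {fmt} about {topic} for {audience}."


def build_prompts(topics, audiences, formats):
    """Return one prompt per (topic, audience, format) combination."""
    return [
        TEMPLATE.format(fmt=fmt, topic=topic, audience=audience)
        for topic, audience, fmt in product(topics, audiences, formats)
    ]


prompts = build_prompts(SEED_TOPICS, AUDIENCES, FORMATS)
print(len(prompts))  # 2 topics x 2 audiences x 2 formats = 8
print(prompts[0])
```

Each added audience or format multiplies the number of distinct prompts, so a small seed list can fan out into many varied generation requests.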