#open-data — 1sec.ai

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Researchers at Stanford created the Stanford EDGAR Filings Dataset, an open dataset of SEC filings reconstructed into layout-faithful, token-efficient MultiMarkdown. It offers a new source of long-context training data for financial language models, addressing the scarcity of high-quality, publicly available documents. The dataset can be used to pretrain large language models, potentially improving their performance on financial tasks. You can access the dataset for your own research and model fine-tuning.

Key takeaways

The Stanford EDGAR Filings Dataset is an open dataset of SEC filings.
The filings are reconstructed into layout-faithful MultiMarkdown.
It addresses the scarcity of long-context training data for financial LLMs.

arXiv #open-data #financial-llm #long-context