Overview
Poolside trains its language models using two types of text-based data: programming code and natural language documents. Training data goes through a rigorous 5-step processing pipeline to create a safe, high-quality data set:
- Data collection from public and proprietary sources
- Safety screening: automated methods to identify and exclude harmful or inappropriate data as well as code subject to certain non-permissive licenses, such as copyleft licenses
- Quality filtering to exclude irrelevant or low-quality data
- Privacy protection: automated methods to identify and exclude PII, credentials and secrets
- Deduplication (exact and near-duplicate) to reduce memorization risk
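
To make the deduplication step concrete, the following is a minimal illustrative sketch and not a description of Poolside's actual implementation. It assumes exact duplicates are detected by hashing whitespace-normalized text and near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size of 3, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variations hash identically.
    return " ".join(text.lower().split())


def shingles(text: str, k: int = 3) -> set:
    # Set of k-word windows; texts with k words or fewer yield a single shingle.
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.7):
    # Keep the first occurrence of each document; drop exact and near-duplicates.
    # The threshold is an assumed value, not one taken from the source.
    seen_hashes = set()
    kept = []  # list of (original document, shingle set)
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(norm)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]


docs = [
    "def add(a, b): return a + b",
    "def add(a, b):   return a + b",
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog again",
    "completely different text here",
]
unique_docs = deduplicate(docs)
```

This sketch compares every new document against every kept document, which is quadratic in the number of documents. At training-data scale, near-duplicate detection is typically done with approximate methods such as MinHash with locality-sensitive hashing.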