This principle guides poolside’s data pipeline process and the team members who operate it.
Overview
poolside trains its language models on two types of text-based data: programming code and natural language documents. Training data goes through a rigorous five-step processing pipeline to create a safe, high-quality data set:
- Data collection: We use publicly available internet sources for raw code repositories and text documents. We may also source proprietary data where appropriate.
- Safety screening: We classify collected data and screen out content that we identify as harmful, copyleft-licensed, non-commercially licensed, inappropriate, or otherwise problematic.
- Quality filtering: We evaluate files for relevance and usefulness, removing files that are too small, too large, or that otherwise fail our quality standards for training data.
- Privacy protection: We scan for and remove personally identifiable information (PII), passwords, API keys, or other sensitive data that might have been accidentally included in public repositories.
- Deduplication: We eliminate duplicate and near-duplicate content to ensure our models train on diverse, unique examples rather than memorizing repeated patterns.
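The quality-filtering step above can be sketched as a simple size- and shape-based filter. This is a minimal illustration with hypothetical thresholds (`min_bytes`, `max_bytes`, `max_line_length` are assumptions, not poolside's actual values); a production filter would combine many more signals.

```python
def passes_quality_filter(text: str,
                          min_bytes: int = 64,
                          max_bytes: int = 1_000_000,
                          max_line_length: int = 2_000) -> bool:
    """Reject files that are too small, too large, or that look like
    minified/generated content rather than useful training data.
    Thresholds here are illustrative placeholders."""
    size = len(text.encode("utf-8"))
    if size < min_bytes or size > max_bytes:
        return False
    # Very long lines often indicate minified or machine-generated files.
    if any(len(line) > max_line_length for line in text.splitlines()):
        return False
    return True
```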
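The privacy-protection step can be illustrated with a pattern-based scan. The patterns below are a hypothetical minimal set for illustration only; a real pipeline would rely on far more sophisticated detectors (entropy checks, ML-based PII recognizers, per-secret-format rules).

```python
import re

# Illustrative patterns only -- not an exhaustive or production-grade set.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def redact_sensitive(text: str) -> str:
    """Replace each matched sensitive span with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<REDACTED:{label.upper()}>", text)
    return text
```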
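The deduplication step combines exact and near-duplicate detection. A minimal sketch follows: exact duplicates are caught by hashing normalized text, near-duplicates by word-shingle overlap. The pairwise comparison shown here is quadratic and purely illustrative; large-scale pipelines typically use MinHash/LSH instead.

```python
import hashlib

def content_hash(text: str) -> str:
    """Exact-duplicate fingerprint: hash of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def jaccard_similarity(a: str, b: str, k: int = 5) -> float:
    """Near-duplicate score via k-word shingle overlap (0.0 to 1.0)."""
    def shingles(text):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def deduplicate(docs, threshold=0.8):
    """Keep one representative per duplicate cluster. Quadratic sketch;
    production systems use MinHash/LSH to avoid pairwise comparison."""
    kept, seen_hashes = [], set()
    for doc in docs:
        h = content_hash(doc)
        if h in seen_hashes:
            continue  # exact duplicate of a kept document
        if any(jaccard_similarity(doc, other) >= threshold for other in kept):
            continue  # near-duplicate of a kept document
        seen_hashes.add(h)
        kept.append(doc)
    return kept
```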