Poolside trains its large language models on code and textual data drawn from a mixture of publicly available code repositories and textual documents, as well as proprietary datasets. When obtaining training data, Poolside follows a strict policy to ensure data safety and quality.

Overview

Poolside trains its language models using two types of text-based data: programming code and natural language documents. Training data goes through a rigorous 5-step processing pipeline to create a safe, high-quality dataset:
  1. Data collection from public and proprietary sources
  2. Safety screening: automated methods to identify and exclude harmful or inappropriate data as well as code subject to certain non-permissive licenses, such as copyleft licenses
  3. Quality filtering to exclude irrelevant or low-quality data
  4. Privacy protection: automated methods to identify and exclude PII, credentials and secrets
  5. Deduplication (exact and near-duplicate) to reduce memorization risk
This process applies uniformly to both code and natural language data.
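The steps above can be sketched as a chain of filters over a document collection. This is a minimal illustration under assumed placeholder rules (the predicates, the redaction pattern, and the thresholds are hypothetical, not Poolside's actual implementation):

```python
import re

def is_safe(doc: str) -> bool:
    # Placeholder safety screen: reject documents containing a flagged term.
    return "FORBIDDEN" not in doc

def is_high_quality(doc: str) -> bool:
    # Placeholder quality filter: require a minimum word count.
    return len(doc.split()) >= 5

def strip_secrets(doc: str) -> str:
    # Placeholder privacy step: redact anything that looks like an API key.
    return re.sub(r"(?i)api[_-]?key\s*[:=]\s*\S+", "[REDACTED]", doc)

def deduplicate(docs):
    # Exact deduplication; near-duplicate removal would add similarity hashing.
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

def build_dataset(raw_docs):
    # Pipeline order mirrors the steps above: safety and quality screening,
    # privacy redaction, then deduplication.
    screened = [d for d in raw_docs if is_safe(d) and is_high_quality(d)]
    redacted = [strip_secrets(d) for d in screened]
    return deduplicate(redacted)
```

In production, each stage would be a distributed job over billions of documents, but the filter-then-redact-then-deduplicate ordering is the essential shape.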

Code Data

Poolside selects top repositories through a ranking algorithm, allowing the model to learn coding practices from the repositories where the best code is created. Any repository in which Poolside identifies significant issues is excluded from the dataset. Poolside uses automated means to exclude code under certain licenses from our training data, such as copyleft licenses (e.g., GPL, AGPL).

We then use an AI model that is highly specialized in identifying low-quality data sources to remove them from the pipeline. We remove any file that doesn't fit our code quality patterns (e.g., files that are too small or too big, or written in programming languages we do not want to train the model on). Finally, we take steps to remove any PII, credentials, and secrets from the data corpus. This process helps us create a healthy, relevant data pipeline.

Only a small fraction of publicly available code meets our quality, licensing, and safety requirements. To address this limitation, we create additional synthetic coding data using Reinforcement Learning from Code Execution Feedback (RLCEF). This proprietary technique allows Poolside to generate high volumes of valid coding patterns, which are also included in model training. Poolside owns all of the synthetically generated data.
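File-level filtering of the kind described above can be expressed as a single predicate per file. The extension allowlist, size bounds, and license markers below are illustrative assumptions, not Poolside's actual values:

```python
import os

# Hypothetical filter parameters, chosen for illustration only.
ALLOWED_EXTENSIONS = {".py", ".js", ".go", ".rs", ".java"}
MIN_BYTES, MAX_BYTES = 50, 1_000_000
COPYLEFT_MARKERS = ("GNU General Public License", "GNU Affero")

def keep_file(path: str, contents: str) -> bool:
    """Return True if a source file passes the illustrative filters."""
    ext = os.path.splitext(path)[1]
    if ext not in ALLOWED_EXTENSIONS:
        return False                      # language we don't train on
    size = len(contents.encode("utf-8"))
    if not (MIN_BYTES <= size <= MAX_BYTES):
        return False                      # file too small or too big
    if any(marker in contents for marker in COPYLEFT_MARKERS):
        return False                      # copyleft license text detected
    return True
```

Real license detection would inspect repository-level metadata and LICENSE files rather than scanning file bodies, but the allowlist-plus-bounds structure is the same.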

Textual Data

We also use meaningful textual data to train the model. When selecting natural-language data sources for our LLMs, we use external indexing systems to help us identify relevant data sources. Poolside scans these sources and removes data identified as obscene, low-quality, toxic, or otherwise problematic. We perform this filtering using a set of specialized AI models and search tools.

After our filters have been applied, we run a final global deduplication step consisting of exact, document-similarity, and semantic deduplication for both natural language and code. For code, we apply this at both the file and repository level to obtain unique code bases as well as unique files. We use a similarity threshold so that we filter out not just exact duplicates but also documents that are highly similar to each other. These deduplication steps reduce the risk of memorizing source documents while also increasing the quality of our models.

As a last step, we remove documents that are too small or too large, that contain too many repeated words, or that are composed only of short words, as well as documents whose file extensions indicate structured markup rather than natural-language text, such as XML. By putting all this effort into data gathering, we end up with a diverse, responsible dataset of high-quality, real-world data.
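Threshold-based near-duplicate removal of the kind described above can be sketched with Jaccard similarity over word n-grams. The shingle size and threshold are illustrative, and production systems typically use MinHash/LSH to avoid the pairwise comparison shown here:

```python
def shingles(text: str, n: int = 3) -> set:
    # Break a document into overlapping word n-grams ("shingles").
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union|.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(docs, threshold: float = 0.8):
    # Keep a document only if it is below the similarity threshold
    # against every document already kept.
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

Lowering the threshold removes more near-duplicates at the cost of discarding some genuinely distinct documents, which is the trade-off a similarity threshold controls.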