poolside trains its large language models on large volumes of code and textual data. To obtain this data, poolside follows a strict policy that ensures data safety and quality; this principle is core to poolside's data pipeline and to the team that operates it.

Overview

poolside trains its language models using two types of text-based data: programming code and natural language documents. Training data goes through a rigorous five-step processing pipeline that produces a safe, high-quality dataset:

  1. Data collection: We use publicly available sources from the internet for raw code repositories and text documents. We may also source proprietary data where appropriate.
  2. Safety screening: We classify the collected data and screen out content identified as harmful, copyleft-licensed, non-commercial, inappropriate, or otherwise problematic.
  3. Quality filtering: We evaluate files for relevance and usefulness, removing files that are too small, too large, or don’t meet our quality standards for training data.
  4. Privacy protection: We scan for and remove personally identifiable information (PII), passwords, API keys, or other sensitive data that might have been accidentally included in public repositories.
  5. Deduplication: We eliminate duplicate and near-duplicate content to ensure our models train on diverse, unique examples rather than memorizing repeated patterns.
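Conceptually, steps 2 through 5 form a sequential filter chain over the collected documents. The sketch below is a minimal illustration with placeholder rules (a toy blocklist, size bounds, an email regex, exact deduplication); the function names and thresholds are hypothetical, and the real pipeline is far more sophisticated.

```python
import re

def is_safe(doc: str) -> bool:
    # Placeholder safety screen: block a toy disallowed marker.
    return "DISALLOWED" not in doc

def meets_quality(doc: str) -> bool:
    # Placeholder quality filter: drop files that are too small or too large.
    return 10 <= len(doc) <= 100_000

def redact_pii(doc: str) -> str:
    # Placeholder privacy pass: mask things that look like email addresses.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", doc)

def deduplicate(docs: list[str]) -> list[str]:
    # Placeholder exact dedup: keep the first occurrence of each document.
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

def run_pipeline(docs: list[str]) -> list[str]:
    """Apply steps 2-5 in order (collection, step 1, happens upstream)."""
    docs = [d for d in docs if is_safe(d)]        # 2. safety screening
    docs = [d for d in docs if meets_quality(d)]  # 3. quality filtering
    docs = [redact_pii(d) for d in docs]          # 4. privacy protection
    return deduplicate(docs)                      # 5. deduplication
```

Each stage only removes or rewrites documents, so the stages compose freely and can be reordered or parallelized in a production setting.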

This systematic approach helps us create a training dataset that contains high-quality, diverse, and safe content for model training.

Code Data

poolside selects top repositories through a ranking algorithm, allowing the model to learn coding practices from the repositories where the best code is created. We then exclude from the dataset any repository for which poolside identifies one of the following license types:

MPL-2.0, CC-BY-SA-4.0, CC-BY-NC-SA-4.0, CC-BY-NC-ND-4.0, CC-BY-SA-3.0, CC-BY-NC-4.0, EUPL-1.2, OFL-1.1, CC-BY-NC-SA-3.0, CC-BY-NC-3.0, UPL-1.0, GPL-3.0, CC-BY-NC-ND-3.0, APSL-2.0, CECILL-2.1, EUPL-1.1, CC-BY-SA-2.0, BlueOak-1.0.0, CDDL-1.0, OpenSSL, CC-BY-NC-SA-2.0, CECILL-B, CECILL-C, OGL-UK-3.0, CC-BY-SA-2.5, CC-BY-NC-2.0, OFL-1.1-no-RFN, OSL-3.0, MPL-1.1, LGPL-3.0, GPL-2.0, LGPL-2.0-only, LGPL-2.1, LGPL-2.0-or-later, LGPL-2.0, ClArtistic, LGPL-2.0+, NGPL, Sleepycat, CECILL-1.1, copyleft-next-0.3.0, OSL-2.1, NPL-1.1, GPL-2.0+, GPL-2.0-or-later, GPL-2.0-only, APSL-1.1, APSL-1.0, GPL-3.0-only, GPL-3.0-or-later, GPL-3.0-linking-source-exception, GPL-3.0-linking-exception, Classpath-exception-2.0, OSL-2.0, SISSL, AGPL-3.0, GPL-3.0+, CC-BY-NC-ND-2.0, CDLA-Sharing-1.0, IPL-1.0, GPL-2.0-with-font-exception, CC-BY-SA-1.0, GPL-2.0-with-classpath-exception, CC-BY-NC-ND-1.0, LGPL-2.1-or-later, CC-BY-NC-SA-2.5, CC-BY-NC-1.0, LiLiQ-P-1.1, OSET-PL-2.1, EPL-1.0, Aladdin, CECILL-1.0, LGPLLR, CC-BY-NC-SA-1.0, APL-1.0, GPL-2.0-with-font-exception, MS-RL, TMate, PDDL-1.0, OSL-1.1, NPL-1.0, RPL-1.1, NCSA
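As a sketch, this exclusion step amounts to a lookup against the SPDX identifiers above, assuming each repository's detected license is available in its metadata. The `keep_repository` helper and the conservative treatment of missing licenses are illustrative choices, not poolside's actual implementation, and `EXCLUDED_LICENSES` is only a small excerpt of the full list.

```python
# Illustrative license gate. EXCLUDED_LICENSES is an excerpt of the
# full SPDX identifier list above; a real gate would hold the whole list.
EXCLUDED_LICENSES = {
    "MPL-2.0", "GPL-3.0", "AGPL-3.0", "LGPL-2.1",
    "CC-BY-NC-4.0", "CC-BY-SA-4.0", "EPL-1.0",
}

def keep_repository(repo: dict) -> bool:
    """Keep a repo only if its detected license is known and not excluded."""
    license_id = repo.get("license")
    # Treating a missing/undetected license as excluded is an assumption
    # here (a conservative default), not a documented poolside policy.
    return license_id is not None and license_id not in EXCLUDED_LICENSES
```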

We then use an AI model that is highly specialized in identifying low-quality data sources and remove those sources from the pipeline. We remove any file that doesn't fit our code quality criteria (e.g. files that are too small or too large, or written in programming languages we do not want to train the model on). Finally, we take steps to remove any PII, passwords, or other similar information from the data corpus. This process helps us maintain a healthy, relevant data pipeline.
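The credential-scrubbing part of this step might look like the following pattern-based pass. The patterns are toy examples of what such a scanner could match (real secret scanners combine many more rules with entropy checks), and none of this reflects poolside's internal tooling.

```python
import re

# Illustrative secret-detection patterns, not an exhaustive rule set:
# one matches the shape of an AWS access key ID, the other catches
# obvious hard-coded api_key/password assignments.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"(?i)(api[_-]?key|password)\s*[:=]\s*['\"][^'\"]+['\"]"),
]

def redact_secrets(text: str) -> str:
    """Replace anything matching a known secret pattern with a marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```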

Only about 10% of publicly available code is actually useful for training high-quality models; the majority is filtered out due to licensing restrictions, duplication, or low-quality content. The industry currently relies on this limited subset of useful code, but even that has limitations. Chief among them: it is just output code, lacking reasoning, execution validation, additional checks, and the ability to scale further. To address this limitation, we create additional synthetic coding data using our Reinforcement Learning from Code Execution Feedback (RLCEF) technique. This proprietary technique allows poolside to generate high volumes of valid coding patterns that are also included in model training. poolside owns all of the synthetically generated data.

Textual Data

We also use meaningful textual data to train the model. When selecting natural language data sources for our LLMs, we use external indexing systems to help us filter for relevant sources. poolside scans these sources and removes data identified as obscene, low-quality, toxic, or otherwise problematic. We perform this filtering using a set of specialized AI models and search tools.

After our filters have been applied, we run a final global deduplication step that combines exact, document-similarity, and semantic deduplication for both natural language and code. For code, we apply this at both the file and repository level to obtain unique code bases as well as unique files. We use a similarity threshold so that we filter out not only exact duplicates but also documents that are highly similar to each other. These deduplication steps reduce the risk of memorizing source documents while also increasing the quality of our models.
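One common way to realize a similarity-threshold deduplication like this is shingle-based Jaccard comparison. The quadratic sketch below is purely illustrative; at corpus scale, approximate methods such as MinHash/LSH are typically used instead, and the threshold value here is a made-up example, not poolside's setting.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams used as a cheap fingerprint of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def near_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is below the similarity threshold
    against every document already kept (O(n^2) pairwise scan)."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The threshold trades recall for precision: lowering it removes more near-duplicates at the risk of discarding merely similar, still-useful documents.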

As a last step, we remove documents that are too small or too large, that contain too many repeated words, or that are composed only of very short words, as well as documents with file extensions that represent structured markup rather than prose, like XML.
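These heuristics can be sketched as a single predicate over a document. All thresholds below are hypothetical placeholders (this document does not publish poolside's actual values), and the repetition and word-length checks are simple proxies for the rules described above.

```python
def keep_document(text: str,
                  min_chars: int = 50,          # hypothetical threshold
                  max_chars: int = 1_000_000,   # hypothetical threshold
                  min_unique_ratio: float = 0.7,
                  min_mean_word_len: float = 3.0) -> bool:
    """Apply size, repetition, and word-length heuristics to a document."""
    words = text.split()
    # Too small, too large, or empty.
    if not (min_chars <= len(text) <= max_chars) or not words:
        return False
    # Too many repeated words: low ratio of unique words to total words.
    if len(set(words)) / len(words) < min_unique_ratio:
        return False
    # Composed only of very short words: mean word length too low.
    if sum(map(len, words)) / len(words) < min_mean_word_len:
        return False
    return True
```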

By putting all this effort into data gathering, we end up with a diverse, responsible dataset that represents high-quality, real-world data.