poolside uses high volumes of code and textual data to train its large language models (LLMs). This data is maintained securely at rest, and only specific, accredited individuals are authorized to access it.

To obtain data for model training, poolside follows a strict policy to ensure data safety and quality. poolside excludes all copyleft or otherwise non-permissive code repositories and removes offensive textual data (e.g., sexual content) from its dataset. In case of uncertainty, poolside removes any data source that could be considered risky, minimizing the chance of including low-quality or insecure data in model training. This is a core principle of poolside’s data pipeline and of the team members who run it.

Overview

poolside only uses a corpus of data composed of textual sources, in either code or natural-language form. Sources are obtained and validated based on the following:

  1. highly selective internet crawling sources

  2. data classification and removal of dangerous data

  3. data file filtering based on relevance and usefulness

  4. PII or classified data removal from datasets

  5. data deduplication (a minimal sketch of this step follows the list)
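Deduplication is the simplest of these steps to illustrate. Below is a minimal sketch of exact deduplication by content hash; the function name and normalization step are assumptions for the example, not poolside’s actual implementation, which may also involve fuzzy (near-duplicate) matching.

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document (exact match)."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        # Hash normalized content so byte-identical files collapse to one entry.
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```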

Code Data

poolside selects top repositories by identifying top contributors through a web-ranking algorithm, allowing the model to learn coding practices from repositories where the best developers actively participate. Any repository that includes one of the following license types is immediately excluded from the dataset.

Licenses poolside excludes:

MPL-2.0, CC-BY-SA-4.0, CC-BY-NC-SA-4.0, CC-BY-NC-ND-4.0, CC-BY-SA-3.0, CC-BY-NC-4.0, EUPL-1.2, OFL-1.1, CC-BY-NC-SA-3.0, CC-BY-NC-3.0, UPL-1.0, GPL-3.0, CC-BY-NC-ND-3.0, APSL-2.0, CECILL-2.1, EUPL-1.1, CC-BY-SA-2.0, BlueOak-1.0.0, CDDL-1.0, OpenSSL, CC-BY-NC-SA-2.0, CECILL-B, CECILL-C, OGL-UK-3.0, CC-BY-SA-2.5, CC-BY-NC-2.0, OFL-1.1-no-RFN, OSL-3.0, MPL-1.1, LGPL-3.0, GPL-2.0, LGPL-2.0-only, LGPL-2.1, LGPL-2.0-or-later, LGPL-2.0, ClArtistic, LGPL-2.0+, NGPL, Sleepycat, CECILL-1.1, copyleft-next-0.3.0, OSL-2.1, NPL-1.1, GPL-2.0+, GPL-2.0-or-later, GPL-2.0-only, APSL-1.1, APSL-1.0, GPL-3.0-only, GPL-3.0-or-later, GPL-3.0-linking-source-exception, GPL-3.0-linking-exception, Classpath-exception-2.0, OSL-2.0, SISSL, AGPL-3.0, GPL-3.0+, CC-BY-NC-ND-2.0, CDLA-Sharing-1.0, IPL-1.0, GPL-2.0-with-font-exception, CC-BY-SA-1.0, GPL-2.0-with-classpath-exception, CC-BY-NC-ND-1.0, LGPL-2.1-or-later, CC-BY-NC-SA-2.5, CC-BY-NC-1.0, LiLiQ-P-1.1, OSET-PL-2.1, EPL-1.0, Aladdin, CECILL-1.0, LGPLLR, CC-BY-NC-SA-1.0, APL-1.0, MS-RL, TMate, PDDL-1.0, OSL-1.1, NPL-1.0, RPL-1.1, NCSA
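To make the exclusion step concrete, here is a minimal sketch of a denylist check. The `EXCLUDED_LICENSES` set is abbreviated from the list above, and the `is_repo_allowed` helper is a hypothetical name used only for illustration:

```python
# Abbreviated denylist of SPDX identifiers from the exclusion list above.
EXCLUDED_LICENSES = {
    "GPL-3.0", "GPL-2.0", "AGPL-3.0", "LGPL-2.1", "MPL-2.0",
    "CC-BY-SA-4.0", "CC-BY-NC-4.0", "EUPL-1.2", "OSL-3.0",
    # ... the remaining identifiers from the list above
}

def is_repo_allowed(repo_license: str | None) -> bool:
    """Reject a repository if its detected license is on the denylist.

    Repositories with no detectable license are also rejected, matching
    the policy of dropping sources whenever there is uncertainty.
    """
    if repo_license is None:
        return False
    return repo_license not in EXCLUDED_LICENSES
```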

We then use an AI model that is highly specialized in identifying low-quality data sources and remove those sources from the pipeline. We also remove any file that doesn’t fit our code-quality patterns (e.g., files that are too small or too large, or written in programming languages we do not want to train the model on). Finally, we make our best effort to remove any PII, passwords, or other sensitive information from the data corpus. This process helps us create a healthy data pipeline: relevant code sources with irrelevant data filtered out.
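A hedged sketch of these file-level heuristics follows; the size thresholds, language allowlist, and redaction patterns are illustrative assumptions rather than poolside’s production rules:

```python
import re

ALLOWED_LANGUAGES = {"python", "go", "java", "typescript"}  # illustrative allowlist
MIN_BYTES, MAX_BYTES = 64, 1_000_000                        # illustrative size bounds

# Naive patterns standing in for real secret/PII scanners.
SECRET_PATTERNS = [
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),
    re.compile(r"(?i)password\s*[:=]\s*\S+"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # simple email matcher for PII
]

def keep_file(language: str, content: str) -> bool:
    """Apply the size and language filters described above."""
    size = len(content.encode("utf-8"))
    if not (MIN_BYTES <= size <= MAX_BYTES):
        return False
    return language.lower() in ALLOWED_LANGUAGES

def scrub_secrets(content: str) -> str:
    """Best-effort redaction of obvious secrets and PII before training."""
    for pattern in SECRET_PATTERNS:
        content = pattern.sub("[REDACTED]", content)
    return content
```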

In parallel, additional coding data is synthetically generated using our Reinforcement Learning from Code Execution Feedback (RLCEF) technique. This proprietary technique allows poolside to generate high volumes of valid coding patterns, which are also included in model training. poolside owns all of the synthetically generated data.
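RLCEF itself is proprietary, so the sketch below illustrates only the general execution-feedback pattern it builds on: generated code is run against tests, and only candidates that execute successfully are kept. The function and its details are assumptions for illustration, not poolside’s method:

```python
import subprocess
import tempfile
from pathlib import Path

def passes_execution_feedback(candidate_code: str, test_code: str) -> bool:
    """Run a generated candidate with its tests and report success.

    In an execution-feedback loop, candidates that fail to run, time out,
    or fail their tests are discarded rather than kept for training.
    """
    with tempfile.TemporaryDirectory() as tmp:
        module = Path(tmp) / "candidate.py"
        module.write_text(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                ["python", str(module)], capture_output=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```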

Textual Data

We also use meaningful textual data to train the model. When selecting relevant natural-language sources for our LLMs, we use external indexing systems to help filter them. For example, we only use documents that the research community has already vetted and classified as being under open-access licenses.
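As a sketch of this selection step, the check below keeps a document only if its index metadata marks it as open access. The record fields and license set are assumptions for the example:

```python
# Illustrative allowlist; real open-access vetting relies on external indexes.
OPEN_ACCESS_LICENSES = {"CC0-1.0", "CC-BY-4.0", "public-domain"}

def is_open_access(index_record: dict) -> bool:
    """Keep a document only if its index metadata marks it open access."""
    return index_record.get("license") in OPEN_ACCESS_LICENSES
```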

With the relevant data sources selected, poolside scans them and removes anything classified as inappropriate by our obscenity-list classifier, which looks for variations of obscene words.
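A minimal sketch of how a list-based classifier can catch such variations (character substitutions, inserted separators) is shown below; the word list and substitution table are placeholders, not poolside’s actual obscenity list:

```python
import re

OBSCENE_WORDS = {"badword"}  # placeholder; the real list is far larger

# Map common character substitutions back to letters before matching.
LEET_TABLE = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

def is_obscene(text: str) -> bool:
    """Flag text containing any listed word or a simple variant of it."""
    normalized = text.lower().translate(LEET_TABLE)
    # Drop separators so "b-a-d-w-o-r-d"-style spellings still match.
    collapsed = re.sub(r"[\s\-_.*]+", "", normalized)
    return any(word in collapsed for word in OBSCENE_WORDS)
```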

After this programmatic filtering, we use specialized LLMs to remove low-quality or toxic documents, such as those with bad formatting, little educational content, or adult innuendo. These models classify the data, and the classifications are then used to exclude information that doesn’t add value to model training.
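The classify-then-filter step can be sketched as follows; the `classify` callable and the label names stand in for poolside’s specialized models and are assumptions for the example:

```python
from typing import Callable

# Stand-in type for a specialized classifier: maps a document to a label.
Classifier = Callable[[str], str]

REJECTED_LABELS = {"low_quality", "toxic", "bad_formatting", "adult"}

def filter_documents(documents: list[str], classify: Classifier) -> list[str]:
    """Keep only documents whose predicted label is not on the reject list."""
    return [doc for doc in documents if classify(doc) not in REJECTED_LABELS]
```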

As a last step, we remove documents that are too small or too large, that contain too many repeated words, or that are composed only of very short words, as well as documents whose file extensions indicate markup or structured data rather than prose (XML, for example).
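Here is a sketch of this final heuristic pass; the exact cutoffs are illustrative assumptions, not poolside’s values:

```python
def keep_document(text: str, filename: str) -> bool:
    """Final heuristic pass mirroring the rules described above."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):            # too small or too large
        return False
    if len(set(words)) / len(words) < 0.3:           # too many repeated words
        return False
    if sum(len(w) for w in words) / len(words) < 3:  # composed only of tiny words
        return False
    # Drop files whose extension signals markup rather than prose.
    if filename.lower().endswith((".xml", ".html", ".json")):
        return False
    return True
```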

By putting all this effort into data gathering, we end up with a diverse dataset that represents high-quality, real-world data.