StarCoder2 Data

The Stack v2 Training Data

This organization contains the full datasets used to train StarCoder2:

  • the-stack-v2-train-full: the training data for StarCoder2-15B, covering 600+ programming languages, with files concatenated per repository
  • the-stack-v2-train-full-files: the same data as the-stack-v2-train-full but without repository concatenation, which makes it easier to filter by file or by license
  • the-stack-v2-train-smol: the training data for StarCoder2-3B and StarCoder2-7B, covering 17 programming languages, with files concatenated per repository
  • the-stack-v2-train-smol-files: the same data as the-stack-v2-train-smol but without repository concatenation, which makes it easier to filter by file or by license
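
To illustrate why the file-level variants are easier to filter, here is a minimal Python sketch that filters individual file records by language and license and then concatenates the surviving files per repository. The record keys (`repo_name`, `path`, `language`, `license`, `content`), the sample rows, and the newline separator are all assumptions made for this example; they are not the datasets' actual schema or the concatenation format used for StarCoder2, which the tech report describes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set

# Hypothetical record schema: the real datasets may use different column names.
Record = Dict[str, str]


def filter_files(
    records: Iterable[Record], languages: Set[str], licenses: Set[str]
) -> Iterator[Record]:
    """Yield only the files whose language and license are both allowed."""
    for rec in records:
        if rec["language"] in languages and rec["license"] in licenses:
            yield rec


def concat_by_repo(records: Iterable[Record], separator: str = "\n") -> Dict[str, str]:
    """Join file contents per repository, preserving input order.

    The separator is an assumption for illustration only.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["repo_name"]].append(rec["content"])
    return {repo: separator.join(parts) for repo, parts in grouped.items()}


# Made-up sample rows standing in for file-level records.
SAMPLE = [
    {"repo_name": "org/a", "path": "a.py", "language": "Python",
     "license": "MIT", "content": "print('a')"},
    {"repo_name": "org/a", "path": "b.py", "language": "Python",
     "license": "MIT", "content": "print('b')"},
    {"repo_name": "org/a", "path": "c.js", "language": "JavaScript",
     "license": "MIT", "content": "console.log('c')"},
    {"repo_name": "org/b", "path": "d.py", "language": "Python",
     "license": "GPL-3.0", "content": "print('d')"},
]

if __name__ == "__main__":
    kept = filter_files(SAMPLE, languages={"Python"}, licenses={"MIT"})
    for repo, text in concat_by_repo(kept).items():
        print(repo, repr(text))
```

With repository-concatenated data, the same filter would first require splitting each repository back into files, which is why filtering is simpler on the per-file variants.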

See the tech report for full details on the datasets.
