
StarCoder2 Data
The Stack v2 Training Data
This organization contains the full datasets used to train StarCoder2:
the-stack-v2-train-full: contains the training data with 600+ programming languages used to train StarCoder2-15B, with the files concatenated per repository
the-stack-v2-train-full-files: same as the-stack-v2-train-full but without repository concatenation, which makes filtering files or licenses easier
the-stack-v2-train-smol: contains the training data with 17 programming languages used to train StarCoder2-3B and 7B, with the files concatenated per repository
the-stack-v2-train-smol-files: same as the-stack-v2-train-smol but without repository concatenation, which makes filtering files or licenses easier
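To illustrate why the per-file variants are easier to filter, here is a minimal sketch in plain Python. The records and the field names (repo, path, language, license) are hypothetical stand-ins chosen for illustration and are not the datasets' documented schema; check the dataset viewer or the tech report for the real columns.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical per-file records, one dict per source file.
# Field names here are assumptions, not the real dataset schema.
FILES: List[Dict[str, str]] = [
    {"repo": "a/x", "path": "main.py", "language": "Python", "license": "mit"},
    {"repo": "a/x", "path": "lib.rs", "language": "Rust", "license": "mit"},
    {"repo": "b/y", "path": "app.py", "language": "Python", "license": "gpl-3.0"},
]


def filter_files(
    records: Iterable[Dict[str, str]],
    languages: Set[str],
    licenses: Set[str],
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose language and license are both allowed."""
    for record in records:
        if record["language"] in languages and record["license"] in licenses:
            yield record


def group_by_repo(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group file paths by repository, mimicking a per-repository layout."""
    grouped: Dict[str, List[str]] = {}
    for record in records:
        grouped.setdefault(record["repo"], []).append(record["path"])
    return grouped


# With one record per file, filtering is a single pass over flat rows.
kept = list(filter_files(FILES, {"Python"}, {"mit", "apache-2.0"}))
kept_paths = [record["path"] for record in kept]

# In a per-repository layout, files of different languages sit in one record,
# so the same filter would first require unpacking each repository.
repos = group_by_repo(FILES)
```

In the per-repository grouping, repository a/x holds both a Python and a Rust file in a single entry, which is why a language or license filter is simpler to apply to the flat per-file rows.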
See the tech report for all the details on the dataset.