@david-cortes-intel
Contributor

Description

This PR expands the sizes of the synthetic datasets used to benchmark PCA.

Currently, these cases request 3 components, in many cases out of thousands of features, which is not a representative application and thus not a good candidate for benchmarking. This PR raises that to 20, which is more reasonable.

It also makes the synthetic datasets wider (more columns) and shorter (fewer rows), as large-scale PCA is for the most part meant to be applied to wide datasets. In addition, it substantially increases the sizes of the inputs for .transform(), as the benchmarks for those cases currently finish very quickly.
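To make the effect of these shape changes concrete, here is a rough sketch of the arithmetic involved. The `PCACase` class and all numbers in it are hypothetical and chosen for illustration only; they are not the values used in this PR or in the benchmark configs. The FLOP count is the standard estimate for a dense projection and ignores mean-centering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PCACase:
    """Hypothetical description of one synthetic PCA benchmark case."""

    n_train_rows: int      # rows passed to fit()
    n_transform_rows: int  # rows passed to transform()
    n_features: int        # columns in both inputs
    n_components: int      # components requested from PCA

    def component_fraction(self) -> float:
        """Share of the feature space that the requested components cover."""
        return self.n_components / self.n_features

    def train_megabytes(self, itemsize: int = 8) -> float:
        """Dense size of the training matrix (float64 by default)."""
        return self.n_train_rows * self.n_features * itemsize / 1e6

    def transform_flops(self) -> int:
        """Approximate cost of projecting the transform input.

        One multiply and one add per (row, feature, component) triple.
        """
        return 2 * self.n_transform_rows * self.n_features * self.n_components


# Illustrative shapes only, NOT the ones in this PR.
before = PCACase(n_train_rows=1_000_000, n_transform_rows=10_000,
                 n_features=1_000, n_components=3)
after = PCACase(n_train_rows=100_000, n_transform_rows=100_000,
                n_features=5_000, n_components=20)

for name, case in (("before", before), ("after", after)):
    print(
        f"{name}: {case.component_fraction():.2%} of features kept, "
        f"{case.train_megabytes():.0f} MB train data, "
        f"{case.transform_flops():.1e} transform FLOPs"
    )
```

With shapes like these, the transform workload grows by orders of magnitude while the training matrix does not necessarily get larger, which is the kind of trade-off the changes above aim for.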

Note that this PR might increase the time it takes to execute a benchmark run, especially in the data generation step. I do not know how much the timings will change if this is merged.


Checklist:

Completeness and readability

  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@david-cortes-intel david-cortes-intel added the datasets Extension or fix load dataset label Oct 28, 2025
@david-cortes-intel
Contributor Author

CI errors should be fixed once this PR is merged in sklearnex: uxlfoundation/scikit-learn-intelex#2741
