
Conversation


@jkeelan (Contributor) commented Jul 12, 2021

If there are a large number of schemas with a large number of tables, particularly if those tables are Spectrum tables, then reflection queries can take a long time due to _get_all_column_info. As far as I can tell, there is no reason to return all columns (the table name is always supplied). This PR limits the returned columns to those in the specified table, optionally restricted to the schema if that argument is supplied.
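For illustration, a rough sketch of the per-table filtering described above. This assumes a simplified query against Redshift's SVV_COLUMNS system view and a hypothetical helper name; the dialect's actual query in _get_all_column_info is considerably more involved.

```python
import sqlalchemy as sa


def _get_column_info_for_table(connection, table_name, schema=None):
    """Hypothetical helper: fetch column metadata for a single table only."""
    query = (
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM svv_columns "
        "WHERE table_name = :table_name"
    )
    params = {"table_name": table_name}
    if schema is not None:
        # Optionally narrow further when a schema argument is supplied.
        query += " AND table_schema = :schema"
        params["schema"] = schema
    return connection.execute(sa.text(query), params).fetchall()
```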

Todos

  • MIT compatible
  • Tests
  • Documentation
  • Updated CHANGES.rst


@jklukas (Member) commented Jul 12, 2021

If there are a large number of schemas with a large number of tables, particularly if those tables are Spectrum tables, then reflection queries can take a long time due to _get_all_column_info

The original design of this reflection logic assumed that grabbing schema info in bulk would be more efficient when doing things like crawling an entire database to dump the schema. I believe the reflection code for the postgres dialect does individual queries per column, which was untenable on Redshift.

That evaluation was ~six years ago, when things like Spectrum did not exist, and it may well be that different trade-offs are appropriate today. But I'm unsure whether users have cases where they're crawling many tables at once, in which this change in behavior would negatively impact performance.

I'll kick off integration tests here to validate whether this causes any breakage, but I'm interested in your thoughts on any way to validate how this affects large-scale reflection performance.


@jkeelan (Contributor, Author) commented Jul 12, 2021

Ah OK, yeah, you're right; it would negatively impact performance when crawling many tables.

Would it be acceptable to use a WHERE clause that filtered by schema only, rather than by table and schema? From some very quick testing, this seems to be almost as performant as doing individual queries per table for our use case, and I'd imagine that in most cases when users are crawling many tables they are doing it within a single schema (although that is just an assumption). The worst-case scenario, I suppose, would be reflecting a single table from each of many schemas, but I don't have a feel for how common that kind of workload is.
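A minimal sketch of what the schema-only variant might look like, again assuming a simplified SVV_COLUMNS query and a hypothetical helper name rather than the dialect's real reflection SQL:

```python
from collections import defaultdict

import sqlalchemy as sa


def _get_schema_column_info(connection, schema):
    """Hypothetical helper: fetch column metadata for every table in one schema."""
    result = connection.execute(
        sa.text(
            "SELECT table_name, column_name, data_type "
            "FROM svv_columns WHERE table_schema = :schema"
        ),
        {"schema": schema},
    )
    columns_by_table = defaultdict(list)
    for table_name, column_name, data_type in result:
        columns_by_table[table_name].append((column_name, data_type))
    # A single round trip covers every table in the schema, so crawling one
    # schema stays cheap and tables in other schemas are never scanned.
    return columns_by_table
```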

All the integration tests failed because I forgot to unpack the parameter dict, sorry.


@jklukas (Member) commented Jul 12, 2021

Would it be acceptable to use a WHERE clause that filtered by schema only, rather than by table and schema? From some very quick testing, this seems to be almost as performant as doing individual queries per table for our use case, and I'd imagine that in most cases when users are crawling many tables they are doing it within a single schema

That does sound significantly less risky. And it likely solves the primary problem you've mentioned here, performance for Spectrum tables, since those are naturally segregated per schema.


@jkeelan (Contributor, Author) commented Jul 12, 2021

Yeah, that makes sense. I've modified it to filter by schema name only.

@jklukas merged commit b7c88fd into sqlalchemy-redshift:master Jul 15, 2021
@jklukas (Member) left a comment


Looks great! I just pushed 0.8.4 with this change
