-
Notifications
You must be signed in to change notification settings - Fork 164
Use where clause to limit columns returned during reflection #223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use where clause to limit columns returned during reflection #223
Conversation
The original design of this reflection logic assumed that grabbing schema info in bulk would be more efficient when doing things like crawling an entire database to dump the schema. I believe the reflection code for the postgres dialect does individual queries per column, which was untenable on Redshift. That evaluation was ~six years ago when things like Spectrum did not exist and it may well be that different trade-offs are appropriate today. But I'm unsure if users have cases where they're crawling many tables at once and this change in behavior would negatively impact performance. I'll kick off integration tests here to validate whether this causes any breakage in tests, but interested in thoughts if you can think of any way to validate how this affects large-scale reflection performance. |
|
Ah OK, yeah you're right and it would negatively impact performance when crawling many tables. Would it be acceptable to use a where clause that filtered by schema only, rather than by table and schema? From some very quick testing, this seems to be almost as performant as doing individual queries per table for our use case, and I'd imagine in most cases when users are crawling many tables they are doing it within a single schema (although, that is just an assumption). The worst case scenario I suppose would be when using a single table from multiple schema, but I don't have a feeling for how common that kind of process is. All the integration tests failed because I forgot to unpack the parameter dict, sorry. |
That does sound significantly less risky. And it likely solves the primary problem you've mentioned here of performance for spectrum tables, since those are naturally segregated per schema. |
|
Yeah, that makes sense. I've modified to only filter by schema name. |
…nnect(), use format strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I just pushed 0.8.4 with this change
If there are a large number of schema with a large number of tables, particularly if those tables are spectrum, then reflection queries can take a long time due to
_get_all_column_info. As far as I can tell, there is no reason to return all columns (the table name is always supplied). This PR limits the returned columns to those in the table specified, and optionally limited to the schema if that argument is supplied.Todos