
Conversation


@jkeelan (Contributor) commented Jul 12, 2021

If there are a large number of schemas with a large number of tables, particularly if those tables are Spectrum tables, then reflection queries can take a long time due to _get_all_column_info. As far as I can tell, there is no reason to return all columns (the table name is always supplied). This PR limits the returned columns to those in the specified table, optionally restricted to the schema if that argument is supplied.
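For illustration, a rough sketch of the per-table filtering described above. This assumes a simplified query against Redshift's SVV_COLUMNS system view and a hypothetical helper name; the dialect's actual query in _get_all_column_info is considerably more involved.

```python
import sqlalchemy as sa


def _get_column_info_for_table(connection, table_name, schema=None):
    """Hypothetical helper: fetch column metadata for a single table only."""
    query = (
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM svv_columns "
        "WHERE table_name = :table_name"
    )
    params = {"table_name": table_name}
    if schema is not None:
        # Optionally narrow further when a schema argument is supplied.
        query += " AND table_schema = :schema"
        params["schema"] = schema
    return connection.execute(sa.text(query), params).fetchall()
```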

Todos

  • MIT compatible
  • Tests
  • Documentation
  • Updated CHANGES.rst


@jklukas (Member) commented Jul 12, 2021

If there are a large number of schemas with a large number of tables, particularly if those tables are Spectrum tables, then reflection queries can take a long time due to _get_all_column_info

The original design of this reflection logic assumed that grabbing schema info in bulk would be more efficient when doing things like crawling an entire database to dump the schema. I believe the reflection code for the postgres dialect does individual queries per column, which was untenable on Redshift.

That evaluation was ~six years ago, when things like Spectrum did not exist, and it may well be that different trade-offs are appropriate today. But I'm unsure whether users have cases where they're crawling many tables at once, in which this change in behavior would negatively impact performance.

I'll kick off integration tests here to validate whether this causes any breakage, but I'm interested in your thoughts on any way to validate how this affects large-scale reflection performance.


@jkeelan (Contributor, Author) commented Jul 12, 2021

Ah OK, yeah, you're right; it would negatively impact performance when crawling many tables.

Would it be acceptable to use a WHERE clause that filtered by schema only, rather than by table and schema? From some very quick testing, this seems to be almost as performant as doing individual queries per table for our use case, and I'd imagine that in most cases when users are crawling many tables they are doing it within a single schema (although that is just an assumption). The worst-case scenario, I suppose, would be reflecting a single table from each of many schemas, but I don't have a feel for how common that kind of workload is.
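A minimal sketch of what the schema-only variant might look like, again assuming a simplified SVV_COLUMNS query and a hypothetical helper name rather than the dialect's real reflection SQL:

```python
from collections import defaultdict

import sqlalchemy as sa


def _get_schema_column_info(connection, schema):
    """Hypothetical helper: fetch column metadata for every table in one schema."""
    result = connection.execute(
        sa.text(
            "SELECT table_name, column_name, data_type "
            "FROM svv_columns WHERE table_schema = :schema"
        ),
        {"schema": schema},
    )
    columns_by_table = defaultdict(list)
    for table_name, column_name, data_type in result:
        columns_by_table[table_name].append((column_name, data_type))
    # A single round trip covers every table in the schema, so crawling one
    # schema stays cheap and tables in other schemas are never scanned.
    return columns_by_table
```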

All the integration tests failed because I forgot to unpack the parameter dict, sorry.


@jklukas (Member) commented Jul 12, 2021

Would it be acceptable to use a WHERE clause that filtered by schema only, rather than by table and schema? From some very quick testing, this seems to be almost as performant as doing individual queries per table for our use case, and I'd imagine that in most cases when users are crawling many tables they are doing it within a single schema

That does sound significantly less risky. And it likely solves the primary problem you've mentioned here, performance for Spectrum tables, since those are naturally segregated per schema.


@jkeelan (Contributor, Author) commented Jul 12, 2021

Yeah, that makes sense. I've modified it to filter by schema name only.

@jklukas merged commit b7c88fd into sqlalchemy-redshift:master Jul 15, 2021
@jklukas (Member) left a comment


Looks great! I just pushed 0.8.4 with this change
