Comparing changes
base repository: Unstructured-IO/unstructured-python-client
base: main
head repository: Unstructured-IO/unstructured-python-client
compare: release/0.25.x
- 9 commits
- 22 files changed
- 4 contributors
Commits on Aug 31, 2024
fix: Address some issues in the split pdf logic (#165)
We've encountered some bugs in the split pdf code. For one, these requests are not retried. With the new `split_pdf_allow_failed=False` behavior, this means one transient network error can interrupt the whole doc. We've also had some asyncio warnings such as `... was never awaited`. This PR adds retries, cleans up the logic, and gives us a much better base for the V2 client release.

# Changes

## Return a "dummy request" in the split BeforeRequestHook
When the BeforeRequestHook was called, we would split up the doc into N requests, issue coroutines for N-1 requests, and return the last one for the SDK to run. This created two paths for recombining the results. Instead, the BeforeRequestHook can return a dummy request that will get a 200. This takes us straight to the AfterSuccessHook, which awaits all of the splits and builds the response.

## Add retries to the split requests
This is a copy of the autogenerated code in `retry.py`, which will work for the async calls. At some point, we should be able to reuse the SDK for this so we aren't hardcoding the retry config values here. Need to work with Speakeasy on this.

## Clean up error handling
When the retries fail and we do have to bubble up an error, we pass it to `create_failure_response` before returning to the SDK. This puts a 500 status code into the response, purely so that the SDK does not see a 502/503/504 and retry the entire doc.

## Set a httpx timeout
Many of the failing requests right now are hi_res calls. This is because the default httpx client timeout is 5 seconds, and we immediately throw a ReadTimeout. For now, set this timeout to 10 minutes. This should be sufficient in the splitting code, where page size per request will be controlled. This is another hardcoded value that should go away once we're able to send our splits back into `sdk.general.partition`.

# Testing
Any pipelines that have failed consistently should work now.

For more fine-grained testing, I tend to mock up my local server to return a retryable error for specific pages, a certain number of times. In the `general_partition` function, I add something like

```
# Paste inside general_partition on the local server.
global num_bounces  # Initialize this somewhere up above

page_num = form_params.starting_page_number or 1
if num_bounces > 0 and page_num == 3:
    num_bounces -= 1
    logger.info(page_num)
    raise HTTPException(status_code=502, detail="BOUNCE")
```

Then, send an SDK request to your local server and verify that the split request for page 3 of your doc is retried up to the number of times you want.

Also, setting the max concurrency to 15 should reproduce the issue. Choose some 50+ page pdf and try the following with the current 0.25.5 branch. It will likely fail with `ServerError: {}`. Then try a local pip install off this branch.

```
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

s = UnstructuredClient(
    api_key_auth="my-api-key",
)

filename = "some-large-pdf"
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        split_pdf_page=True,
        strategy="hi_res",
        split_pdf_allow_failed=False,
        split_pdf_concurrency_level=15,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")
```
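The retry and error-handling changes described in this commit can be illustrated with a small standalone sketch. It is not the code from `retry.py` or the split pdf hook: `FakeResponse`, `call_with_retries`, `mask_failure`, `run_splits`, and `make_flaky_sender` are hypothetical names, and the backoff values are placeholders rather than the SDK's retry config. It only shows the general shape: each split is retried independently on 502/503/504, and a split that still fails is reported as a 500 so that an outer retry layer would not re-send the whole document.

```python
import asyncio
import random

RETRYABLE_STATUS_CODES = {502, 503, 504}


class FakeResponse:
    """Stand-in for an HTTP response; only the status code matters here."""

    def __init__(self, status_code: int, body: str = ""):
        self.status_code = status_code
        self.body = body


async def call_with_retries(send, max_attempts=4, initial_interval=0.01,
                            exponent=1.5, max_interval=1.0):
    """Await send() until it returns a non-retryable status or attempts run out."""
    interval = initial_interval
    response = None
    for attempt in range(1, max_attempts + 1):
        response = await send()
        if response.status_code not in RETRYABLE_STATUS_CODES:
            return response
        if attempt < max_attempts:
            # Exponential backoff with jitter (placeholder values).
            await asyncio.sleep(interval + random.uniform(0, interval))
            interval = min(interval * exponent, max_interval)
    return response


def mask_failure(response):
    """Report a 500 so an outer retry layer does not re-send the whole doc."""
    if response.status_code in RETRYABLE_STATUS_CODES:
        return FakeResponse(500, response.body)
    return response


async def run_splits(senders, max_attempts=4):
    """Run every split concurrently, retrying each one independently."""
    results = await asyncio.gather(
        *(call_with_retries(send, max_attempts) for send in senders)
    )
    return [mask_failure(r) for r in results]


def make_flaky_sender(failures: int):
    """Build a sender that returns 502 `failures` times, then 200."""
    remaining = failures

    async def send():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return FakeResponse(502, "BOUNCE")
        return FakeResponse(200, "ok")

    return send


if __name__ == "__main__":
    senders = [make_flaky_sender(0), make_flaky_sender(2), make_flaky_sender(10)]
    statuses = [r.status_code for r in asyncio.run(run_splits(senders))]
    print(statuses)  # [200, 200, 500]
```

The third sender never recovers within four attempts, so its 502 is surfaced as a 500, mirroring the reason given above for routing failures through `create_failure_response`.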
Commit 98028ec
Commit 2cd4093
chore: 🐝 Update SDK - Generate 0.25.6 (#167)
> [!IMPORTANT]
> Linting report available at: <https://app.speakeasyapi.dev/org/unstructured/unstructured5xr/linting-report/588d8dad25ed464d6e7356dd4dac9e59>
> OpenAPI Change report available at: <https://app.speakeasyapi.dev/org/unstructured/unstructured5xr/changes-report/7f809e2a81539ca75d969167631b3882>

# SDK update
Based on:
- OpenAPI Doc
- Speakeasy CLI 1.385.0 (2.407.2) https://github.com/speakeasy-api/speakeasy

## OpenAPI Change Summary
No specification changes

Co-authored-by: speakeasybot <[email protected]>
Commit 84b5482
Commit adbee1d
Commits on Sep 3, 2024
fix: raise httpx timeout value for split pdf requests (#168)
We're still seeing some ReadTimeout errors when a pdf is split and sent for `hi_res` processing. For now, let's raise the default timeout to 30 minutes, and allow for tweaking via the `UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES` variable. Users should not generally need to adjust this, but it may help us debug their environment. When our split pdf hook is able to reuse the SDK logic, the client timeout will be exposed as a parameter, and we can remove this variable.

Other changes:
* Update CI workflow to run against release branches. This 0.25.x branch should be running tests as long as it sticks around.
* Bump the gen.yaml version to 0.25.7. The next generate/publish job on this branch will use this.
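As an illustration of the timeout override, here is a minimal sketch of how a minutes value might be resolved into seconds. It assumes `UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES` is read from the process environment, which the message above does not state explicitly; `resolve_timeout_seconds` is a hypothetical helper, not a function from the client, and the fallback behavior for bad values is a design choice of this sketch.

```python
import math
import os

DEFAULT_TIMEOUT_MINUTES = 30  # default stated in the commit message
ENV_VAR = "UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES"


def resolve_timeout_seconds(environ=None) -> float:
    """Return the timeout in seconds, falling back to the default.

    Hypothetical helper: unset, empty, non-numeric, non-finite, or
    non-positive values all fall back to the 30 minute default.
    """
    env = os.environ if environ is None else environ
    default_seconds = DEFAULT_TIMEOUT_MINUTES * 60.0
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return default_seconds
    try:
        minutes = float(raw)
    except ValueError:
        return default_seconds
    if not math.isfinite(minutes) or minutes <= 0:
        return default_seconds
    return minutes * 60.0


if __name__ == "__main__":
    print(resolve_timeout_seconds({}))               # 1800.0
    print(resolve_timeout_seconds({ENV_VAR: "10"}))  # 600.0
```

The resulting number of seconds could then be passed as the `timeout` argument when the httpx client for the split requests is constructed.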
Commit 3d13c23
chore: 🐝 Update SDK - Generate 0.25.7 (#169)
# SDK update
Based on:
- OpenAPI Doc
- Speakeasy CLI 1.389.0 (2.409.0) https://github.com/speakeasy-api/speakeasy

## OpenAPI Change Summary
No specification changes

Co-authored-by: speakeasybot <[email protected]>
Commit 75eb8cd
Commits on Sep 10, 2024
chore: 🐝 Update SDK - Generate 0.25.8 (#172)
> [!IMPORTANT]
> Linting report available at: <https://app.speakeasy.com/org/unstructured/unstructured5xr/linting-report/e9de01e666565e62408c44cd641cc8b9>
> OpenAPI Change report available at: <https://app.speakeasy.com/org/unstructured/unstructured5xr/changes-report/eaa6a1456213d6cb26317f5ffd2fc8fc>

# SDK update
Based on:
- OpenAPI Doc
- Speakeasy CLI 1.394.0 (2.413.0) https://github.com/speakeasy-api/speakeasy

## OpenAPI Change Summary

```
├─┬Info
│ └──[🔀] version (4:14)
├─┬Components
│ └─┬partition_parameters
│   ├──[➕] properties (275:17)
│   └─┬strategy
│     ├──[🔀] description (193:34)
│     └──[🔀] default (194:30)❌
└─┬Extensions
  └──[🔀] x-speakeasy-retries (330:22)
```

| Document Element | Total Changes | Breaking Changes |
|------------------|---------------|------------------|
| components       | 3             | 1                |
| info             | 1             | 0                |

Co-authored-by: speakeasybot <[email protected]>
Commit f1d799c
Commits on Sep 18, 2024
Commit 84f39e9
chore: 🐝 Update SDK - Generate 0.25.9 (#177)
> [!IMPORTANT]
> Linting report available at: <https://app.speakeasy.com/org/unstructured/unstructured5xr/linting-report/d2bc6f83effd47e2e89c60737ce323e6>
> OpenAPI Change report available at: <https://app.speakeasy.com/org/unstructured/unstructured5xr/changes-report/7a3117a9f5b2e6252e1390b8cae91792>

# SDK update
Based on:
- OpenAPI Doc
- Speakeasy CLI 1.399.2 (2.416.6) https://github.com/speakeasy-api/speakeasy

## OpenAPI Change Summary
No specification changes

Co-authored-by: speakeasybot <[email protected]>
Commit f773b0d
GitHub could not render this comparison. To view the diff locally, run:

    git diff main...release/0.25.x