
Comparing changes

base repository: Unstructured-IO/unstructured-python-client
base: main
head repository: Unstructured-IO/unstructured-python-client
compare: release/0.25.x
  • 9 commits
  • 22 files changed
  • 4 contributors

Commits on Aug 31, 2024

  1. fix: Address some issues in the split pdf logic (#165)

    We've encountered some bugs in the split pdf code. For one, these
    requests are not retried. With the new `split_pdf_allow_failed=False`
    behavior, this means one transient network error can interrupt the whole
    doc. We've also had some asyncio warnings such as `... was never
    awaited`.
    
    This PR adds retries, cleans up the logic, and gives us a much better
    base for the V2 client release.
    
    # Changes
    ## Return a "dummy request" in the split BeforeRequestHook
    When the BeforeRequestHook is called, we would split up the doc into N
    requests, issue coroutines for N-1 requests, and return the last one for
    the SDK to run. This adds two paths for recombining the results.
    Instead, the BeforeRequest can return a dummy request that will get a
    200. This takes us straight to the AfterSuccessHook, which awaits all of
    the splits and builds the response.
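
As a rough illustration of the flow described above, here is a minimal sketch using only `asyncio`. The class `SplitHookSketch`, the function `partition_split`, and the dict used as a "dummy request" are all hypothetical stand-ins, not the hook code in this PR. The point is only that every split is queued as a coroutine and all of them are awaited in one place.

```python
import asyncio
from typing import Dict, List, Tuple


async def partition_split(first_page: int, last_page: int) -> List[dict]:
    """Hypothetical stand-in for sending one page range to the API."""
    await asyncio.sleep(0)  # placeholder for the real HTTP call
    return [{"page_number": p} for p in range(first_page, last_page + 1)]


class SplitHookSketch:
    """Assumed shape of the hook: queue every split, await them all later."""

    def __init__(self) -> None:
        self.pending: Dict[str, list] = {}  # operation id -> queued coroutines

    def before_request(self, operation_id: str, page_ranges: List[Tuple[int, int]]) -> dict:
        # Queue one coroutine per split; none of them is handed to the SDK.
        self.pending[operation_id] = [partition_split(a, b) for a, b in page_ranges]
        # Return a placeholder request (hypothetical URL) that should simply succeed.
        return {"method": "GET", "url": "http://localhost/dummy"}

    def after_success(self, operation_id: str) -> List[dict]:
        async def await_splits() -> list:
            # gather() returns results in the order the coroutines were queued.
            return await asyncio.gather(*self.pending.pop(operation_id))

        results = asyncio.run(await_splits())
        # Flatten the per-split element lists into one response body.
        return [element for split in results for element in split]
```

With a single recombination path, every queued coroutine is awaited exactly once, which is also what avoids the `... was never awaited` warnings mentioned earlier.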
    
    ## Add retries to the split requests
    This is a copy of the autogenerated code in `retry.py`, which will work
    for the async calls. At some point, we should be able to reuse the SDK
    for this so we aren't hardcoding the retry config values here. Need to
    work with Speakeasy on this.
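
For readers unfamiliar with the pattern, a retry loop for async calls could look roughly like the sketch below. This is not the code copied from `retry.py`. The names `RetryableError` and `call_with_retries` and the default config values are assumptions chosen for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical error raised when a split returns a retryable status."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"retryable status {status_code}")
        self.status_code = status_code


async def call_with_retries(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,          # assumed value, hardcoded for the sketch
    initial_interval: float = 0.5,  # seconds before the first retry (assumed)
    exponent: float = 1.5,          # backoff growth factor (assumed)
    max_interval: float = 60.0,     # cap on a single wait (assumed)
) -> T:
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller build the failure response
            delay = min(initial_interval * exponent**attempt, max_interval)
            # A little jitter keeps concurrent splits from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise ValueError("max_attempts must be at least 1")
```

Each split request would be wrapped in its own `call_with_retries`, so one transient error only repeats that page range rather than the whole doc.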
    
    ## Clean up error handling
    When the retries fail and we do have to bubble up an error, we pass it
    to `create_failure_response` before returning to the SDK. This puts a
    500 status code into the response so that the SDK does not see a
    502/503/504 and retry the entire doc.
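
The idea can be sketched as follows. `SketchResponse` and `to_failure_response` are hypothetical names, and this is not the actual `create_failure_response` implementation. The sketch assumes the original error detail is kept and only the status code is overwritten.

```python
from dataclasses import dataclass

# Status codes that, per the description above, make the SDK retry the whole doc.
SDK_RETRY_CODES = {502, 503, 504}


@dataclass
class SketchResponse:
    """Hypothetical minimal response type for this sketch."""

    status_code: int
    detail: str


def to_failure_response(failed: SketchResponse) -> SketchResponse:
    # Keep the error detail, but report a 500 so the SDK-level retry
    # logic does not resend every page of the document.
    return SketchResponse(status_code=500, detail=failed.detail)
```

A 500 still surfaces as an error to the caller; it just falls outside the set of codes that would trigger a second, document-wide round of retries.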
    
    ## Set a httpx timeout
    Many of the failing requests right now are hi_res calls. This is because
    the default httpx client timeout is 5 seconds, and we immediately throw
    a ReadTimeout. For now, set this timeout to 10 minutes. This should be
    sufficient in the splitting code, where page size per request will be
    controlled. This is another hardcoded value that should go away once
    we're able to send our splits back into `sdk.general.partition`.
    
    # Testing
    Any pipelines that have failed consistently should work now. For more
    fine-grained testing, I tend to mock up my local server to return a
    retryable error for specific pages, a certain number of times. In the
    `general_partition` function, I add something like
    ```
        global num_bounces # Initialize this somewhere up above
        page_num = form_params.starting_page_number or 1
        if num_bounces > 0 and page_num == 3:
            num_bounces -= 1
            logger.info(page_num)
            raise HTTPException(status_code=502, detail="BOUNCE")
    ```
    
    Then, send an SDK request to your local server and verify that the split
    request for page 3 of your doc is retrying up to the number of times you
    want.
    
    Also, setting the max concurrency to 15 should reproduce the issue.
    Choose some 50+ page pdf and try the following with the current 0.25.5
    branch. It will likely fail with `ServerError: {}`. Then try a local pip
    install off this branch.
    ```
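    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations, shared

    s = UnstructuredClient(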
        api_key_auth="my-api-key",
    )
    
    filename = "some-large-pdf"
    with open(filename, "rb") as f:
        files=shared.Files(
            content=f.read(),
            file_name=filename,
        )
    
    req = operations.PartitionRequest(
        shared.PartitionParameters(
            files=files,
            split_pdf_page=True,
            strategy="hi_res",
            split_pdf_allow_failed=False,
            split_pdf_concurrency_level=15,
        ),
    )
    
    resp = s.general.partition(req)
    
    if num_elements := len(resp.elements):
        print(f"Succeeded with {num_elements}")
    ```
    awalker4 authored Aug 31, 2024
    98028ec
  2. Bump to 0.25.6

    awalker4 committed Aug 31, 2024
    2cd4093
  3. chore: 🐝 Update SDK - Generate 0.25.6 (#167)

    > [!IMPORTANT]
    > Linting report available at:
    <https://app.speakeasyapi.dev/org/unstructured/unstructured5xr/linting-report/588d8dad25ed464d6e7356dd4dac9e59>
    > OpenAPI Change report available at:
    <https://app.speakeasyapi.dev/org/unstructured/unstructured5xr/changes-report/7f809e2a81539ca75d969167631b3882>
    # SDK update
    Based on:
    - OpenAPI Doc  
    - Speakeasy CLI 1.385.0 (2.407.2)
    https://github.com/speakeasy-api/speakeasy
    ## OpenAPI Change Summary
    No specification changes
    
    Co-authored-by: speakeasybot <[email protected]>
    github-actions[bot] and speakeasybot authored Aug 31, 2024
    84b5482
  4. adbee1d

Commits on Sep 3, 2024

  1. fix: raise httpx timeout value for split pdf requests (#168)

    We're still seeing some ReadTimeout errors when a pdf is split and sent
    for `hi_res` processing. For now, let's raise the default timeout to 30
    minutes, and allow for tweaking via the
    `UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES` variable. Users should not generally need to
    adjust this, but it may help us debug their environment. When our split
    pdf hook is able to reuse the SDK logic, the client timeout will be
    exposed as a parameter, and we can remove this variable.
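
A minimal sketch of how such an override might be resolved is shown below. It assumes the variable is read from the process environment and parsed as a number of minutes; the helper name `client_timeout_seconds` is hypothetical and not part of the client.

```python
import os
from typing import Mapping

DEFAULT_TIMEOUT_MINUTES = 30  # the default named in this commit message


def client_timeout_seconds(environ: Mapping[str, str] = os.environ) -> float:
    # Assumption: the override is an environment variable holding minutes.
    raw = environ.get("UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES")
    minutes = float(raw) if raw else DEFAULT_TIMEOUT_MINUTES
    # HTTP clients such as httpx express timeouts in seconds.
    return minutes * 60
```

The resulting value could then be handed to the HTTP client, for example as `httpx.Timeout(client_timeout_seconds())`.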
    
    Other changes:
    * Update CI workflow to run against release branches. This 0.25.x branch
    should be running tests as long as it sticks around.
    * Bump the gen.yaml version to 0.25.7. The next generate/publish job on
    this branch will use this.
    awalker4 authored Sep 3, 2024
    3d13c23
  2. chore: 🐝 Update SDK - Generate 0.25.7 (#169)

    # SDK update
    Based on:
    - OpenAPI Doc  
    - Speakeasy CLI 1.389.0 (2.409.0)
    https://github.com/speakeasy-api/speakeasy
    ## OpenAPI Change Summary
    No specification changes
    
    Co-authored-by: speakeasybot <[email protected]>
    github-actions[bot] and speakeasybot authored Sep 3, 2024
    75eb8cd

Commits on Sep 10, 2024

  1. chore: 🐝 Update SDK - Generate 0.25.8 (#172)

    > [!IMPORTANT]
    > Linting report available at:
    <https://app.speakeasy.com/org/unstructured/unstructured5xr/linting-report/e9de01e666565e62408c44cd641cc8b9>
    > OpenAPI Change report available at:
    <https://app.speakeasy.com/org/unstructured/unstructured5xr/changes-report/eaa6a1456213d6cb26317f5ffd2fc8fc>
    # SDK update
    Based on:
    - OpenAPI Doc  
    - Speakeasy CLI 1.394.0 (2.413.0)
    https://github.com/speakeasy-api/speakeasy
    ## OpenAPI Change Summary
    
    
    ```
    ├─┬Info
    │ └──[🔀] version (4:14)
    ├─┬Components
    │ └─┬partition_parameters
    │   ├──[➕] properties (275:17)
    │   └─┬strategy
    │     ├──[🔀] description (193:34)
    │     └──[🔀] default (194:30)❌ 
    └─┬Extensions
      └──[🔀] x-speakeasy-retries (330:22)
    ```
    
    | Document Element | Total Changes | Breaking Changes |
    |------------------|---------------|------------------|
    | components       | 3             | 1                |
    | info             | 1             | 0                |
    
    Co-authored-by: speakeasybot <[email protected]>
    github-actions[bot] and speakeasybot authored Sep 10, 2024
    f1d799c

Commits on Sep 18, 2024

  1. 84f39e9
  2. chore: 🐝 Update SDK - Generate 0.25.9 (#177)

    > [!IMPORTANT]
    > Linting report available at:
    <https://app.speakeasy.com/org/unstructured/unstructured5xr/linting-report/d2bc6f83effd47e2e89c60737ce323e6>
    > OpenAPI Change report available at:
    <https://app.speakeasy.com/org/unstructured/unstructured5xr/changes-report/7a3117a9f5b2e6252e1390b8cae91792>
    # SDK update
    Based on:
    - OpenAPI Doc  
    - Speakeasy CLI 1.399.2 (2.416.6)
    https://github.com/speakeasy-api/speakeasy
    ## OpenAPI Change Summary
    No specification changes
    
    Co-authored-by: speakeasybot <[email protected]>
    github-actions[bot] and speakeasybot authored Sep 18, 2024
    f773b0d