Integration tests for reading parquet w/ Variants #3
# Which issue does this PR close? - Related to apache#7395 - Closes apache#7495 - Closes apache#7377 # Rationale for this change Let's update tonic to the latest version. Given the open and unresolved questions on @rmn-boiko's PR apache#7377 from @Xuanwo and @sundy-li, I thought a new PR would result in a faster resolution. # What changes are included in this PR? This PR is based on apache#7495 from @MichaelScofield -- I resolved some merge conflicts and updated Cargo.toml in the integration tests # Are these changes tested? Yes, by CI # Are there any user-facing changes? New dependency version --------- Co-authored-by: LFC <[email protected]>
…pache#7922) # Which issue does this PR close? - Part of apache#7896 # Rationale for this change In apache#7896, we saw that inserting a large number of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into. `with_field_names` is an optimization to declare the field names upfront, which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare the field name upfront and 1 to insert the actual field name during object building. This PR adds a new method `with_field_name_capacity`, which allows you to reserve space in the metadata builder without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building. Before: (screenshot) https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c After: (screenshot) https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3
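A minimal generic sketch of the reservation idea above, using the `indexmap` crate to stand in for the builder's internal `IndexSet`; the function name is illustrative, not the arrow-rs API:

```rust
use indexmap::IndexSet;

/// Reserve the set's capacity once up front so inserting `expected` field
/// names never reallocates or rehashes the table mid-stream.
fn insert_field_names(names: impl Iterator<Item = String>, expected: usize) -> IndexSet<String> {
    let mut set = IndexSet::with_capacity(expected); // single upfront reservation
    for name in names {
        set.insert(name); // no rehash as long as `expected` was large enough
    }
    set
}
```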
…che#7914) # Which issue does this PR close? - Fixes apache#7907 # Rationale for this change When trying to append a `VariantObject` or `VariantList` directly to the `VariantBuilder`, it panics. # Changes to the public API `VariantBuilder` now has these additional methods: - `append_object`, which panics if shallow validation fails or the object has duplicate field names - `try_append_object`, which performs full validation on the object before appending - `append_list`, which panics if shallow validation fails - `try_append_list`, which performs full validation on the list before appending --------- Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close? - Closes apache#7893 # What changes are included in this PR? In parquet-variant: - Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it). - Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply. In parquet-variant-compute: - Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type. - Includes some basic unit tests. Not comprehensive. - Includes a simple micro-benchmark for reference. Current limitations: - It can only return another VariantArray. Casts are not implemented yet. - Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this. - Perf is a TODO. # Are these changes tested? Some basic unit tests are added. # Are there any user-facing changes? Yes --------- Co-authored-by: Andrew Lamb <[email protected]>
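A generic sketch of the path-traversal idea behind `Variant::get_path` (illustrative types only, not the parquet-variant API): each path element selects either a named object field or a list index.

```rust
use std::collections::HashMap;

/// One step of a path: either a named object field or a list index.
enum PathElement<'a> {
    Field(&'a str),
    Index(usize),
}

/// A stand-in for a variant value.
enum Value {
    Object(HashMap<String, Value>),
    List(Vec<Value>),
    Int(i64),
}

/// Walk the path, returning a reference to the nested value if every step resolves.
fn get_path<'a>(mut v: &'a Value, path: &[PathElement<'_>]) -> Option<&'a Value> {
    for elem in path {
        v = match (v, elem) {
            (Value::Object(fields), PathElement::Field(name)) => fields.get(*name)?,
            (Value::List(items), PathElement::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(v)
}
```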
…he#7774) # Which issue does this PR close? - Part of apache#7762 # Rationale for this change As part of apache#7762 I want to optimize applying filters by adding a new code path. To ensure that works well, let's ensure the filtered code path is well covered with tests # What changes are included in this PR? 1. Add tests for filtering batches with 0.01%, 1%, 10% and 90% selectivity and varying data types # Are these changes tested? Only tests, no functional changes # Are there any user-facing changes?
…ath (apache#7942) # Which issue does this PR close? - Follow on to apache#7919 # Rationale for this change While reviewing apache#7919 from @Samyak2 I found I wanted to write some additional tests directly for `Variant::get_path`. When I started doing that I found it was somewhat awkward to write examples, so I added some new conversion routines to make it easier. # What changes are included in this PR? 1. Add doc examples (and thus tests) of `VariantGet` and `VariantPath` 2. Add more documentation # Are these changes tested? Yes, by doc examples which run in CI # Are there any user-facing changes?
# Which issue does this PR close? - Closes apache#7901. --------- Signed-off-by: codephage2020 <[email protected]>
…pache#7931) # Which issue does this PR close? - Closes #NNN. This is minor but I can create an issue if needed. # Rationale for this change `make_builder` currently errors with `Data type Utf8View is not currently supported`. # What changes are included in this PR? Support `DataType::Utf8View` and `DataType::BinaryView` in `make_builder`. # Are these changes tested? Only via the exhaustive enum match. It doesn't look like there are any tests for `make_builder` in that file? # Are there any user-facing changes? No
# Rationale for this change - Closes apache#7948 This PR introduces a custom implementation of `PartialEq` for variant objects. According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values. Instead of comparing the raw bytes of 2 variant objects, this implementation recursively checks whether the field values are equal -- regardless of their order
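A generic sketch of the order-insensitive comparison described above (illustrative, not the actual `PartialEq` impl); it assumes field names are unique within each object, as the Variant spec requires:

```rust
/// Two objects are logically equal if they have the same number of fields and
/// every (name, value) pair in `a` appears somewhere in `b`, regardless of order.
fn objects_logically_equal(a: &[(String, i64)], b: &[(String, i64)]) -> bool {
    a.len() == b.len()
        && a.iter().all(|(name, value)| {
            b.iter()
                .any(|(other_name, other_value)| name == other_name && value == other_value)
        })
}
```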
# Which issue does this PR close?
Optimize `partition_validity` function used in sort kernels
- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of
`Array::is_valid`
- Avoid capacity checks inside the loop by writing to `spare_capacity_mut`
instead of using `push` (see the sketch after this list)
- Closes apache#7936.
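A minimal sketch of the `spare_capacity_mut` technique from the list above (illustrative, not the actual kernel code); it assumes `null_count` is exact, so neither vector's reservation is ever exceeded:

```rust
use std::mem::MaybeUninit;

fn partition_validity(len: usize, null_count: usize, is_valid: impl Fn(usize) -> bool) -> (Vec<u32>, Vec<u32>) {
    // Preallocate exactly once based on the known null count.
    let mut valid: Vec<u32> = Vec::with_capacity(len - null_count);
    let mut nulls: Vec<u32> = Vec::with_capacity(null_count);
    let (mut v, mut n) = (0usize, 0usize);
    {
        // Write into the reserved-but-uninitialized tails, skipping the
        // per-`push` capacity check inside the loop.
        let valid_spare = valid.spare_capacity_mut();
        let nulls_spare = nulls.spare_capacity_mut();
        for i in 0..len {
            if is_valid(i) {
                valid_spare[v] = MaybeUninit::new(i as u32);
                v += 1;
            } else {
                nulls_spare[n] = MaybeUninit::new(i as u32);
                n += 1;
            }
        }
    }
    // SAFETY: exactly `v` / `n` elements were initialized above.
    unsafe {
        valid.set_len(v);
        nulls.set_len(n);
    }
    (valid, nulls)
}
```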
# Rationale for this change
Microbenchmark results for `sort_kernels` compared to `main`, only
looking at benchmarks matching "nulls to indices":
```
sort i32 nulls to indices 2^10
time: [4.9325 µs 4.9370 µs 4.9422 µs]
change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
Performance has improved.
sort i32 nulls to indices 2^12
time: [20.096 µs 20.209 µs 20.327 µs]
change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
Performance has improved.
sort f32 nulls to indices 2^12
time: [26.329 µs 26.366 µs 26.406 µs]
change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
Performance has improved.
sort string[0-10] nulls to indices 2^12
time: [70.667 µs 70.762 µs 70.886 µs]
change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
Performance has improved.
sort string[0-100] nulls to indices 2^12
time: [101.98 µs 102.44 µs 102.99 µs]
change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
No change in performance detected.
sort string[0-400] nulls to indices 2^12
time: [84.952 µs 85.024 µs 85.102 µs]
change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
Performance has improved.
sort string[10] nulls to indices 2^12
time: [72.486 µs 72.664 µs 72.893 µs]
change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
Performance has improved.
sort string[100] nulls to indices 2^12
time: [71.354 µs 71.606 µs 71.902 µs]
change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
Performance has improved.
sort string[1000] nulls to indices 2^12
time: [73.088 µs 73.195 µs 73.311 µs]
change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view[10] nulls to indices 2^12
time: [32.592 µs 32.654 µs 32.731 µs]
change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view[0-400] nulls to indices 2^12
time: [32.981 µs 33.074 µs 33.189 µs]
change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view_inlined[0-12] nulls to indices 2^12
time: [28.467 µs 28.496 µs 28.529 µs]
change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
Performance has improved.
sort string[10] dict nulls to indices 2^12
time: [94.463 µs 94.503 µs 94.542 µs]
change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
Performance has improved.
```
# Are these changes tested?
Covered by existing tests
# Are there any user-facing changes?
No, the method is internal to the sort kernels.
# Which issue does this PR close? - Closes apache#7949 # Rationale for this change I would like it to be easier / more ergonomic to make objects # What changes are included in this PR? 1. Add `ObjectBuilder::with_field` 2. Add documentation w/ examples 3. Rewrite some tests # Are these changes tested? Yes, by doc tests # Are there any user-facing changes? Yes a new API
…ntArray (apache#7945) # Which issue does this PR close? - Closes apache#7920. # Are these changes tested? Tests were already implemented # Are there any user-facing changes? None
…che#7956) # Which issue does this PR close? - Follow-up to apache#7901 # Rationale for this change - apache#7934 introduced a minor regression by (accidentally?) forbidding the empty string as a dictionary key. Fix the bug and simplify the code a bit further while we're at it. # What changes are included in this PR? Revert the unsorted dictionary check back to what it had been (it just uses `Iterator::is_sorted_by` now, instead of `primitive.slice::is_sorted_by`). Remove the redundant offset monotonicity check from the ordered dictionary path, relying on the fact that string slice extraction will anyway fail if the offsets are not monotonic. Improve the error message now that it does double duty. # Are these changes tested? New unit tests for dictionaries containing the empty string. As a side effect, we now have at least a little coverage for sorted dictionaries -- somehow, I couldn't find any existing unit test that creates a sorted dictionary?? # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close? - Part of apache#7911 - Part of apache#6736 - Follow on to apache#7905 # Rationale for this change I wrote benchmarks for some changes to the json decoder in apache#7911, but those changes are non-trivial. To keep apache#7911 easier to review I have pulled the benchmarks out into their own PR # What changes are included in this PR? 1. Add new json benchmark 2. Include the `variant_get` benchmark added in apache#7919 by @Samyak2 # Are these changes tested? I tested them manually and clippy CI coverage ensures they compile # Are there any user-facing changes? No, these are only benchmarks
# Which issue does this PR close? - Closes apache#7951 . # Rationale for this change # What changes are included in this PR? # Are these changes tested? Yes # Are there any user-facing changes? New API Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close? - Follow on to apache#7943 - Part of apache#7948 # Rationale for this change I found a few more tests I would like to have seen while reviewing apache#7943 # What changes are included in this PR? Add some list equality tests # Are these changes tested? It is only tests, no functionality changes # Are there any user-facing changes? No
# Which issue does this PR close? - Closes apache#7947. Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close? - Related to apache#6736 # Rationale for this change I noticed in apache#7956 that some Clippy errors were introduced but not caught by CI. # What changes are included in this PR? Add `parquet-variant-compute` to the CI for parquet-variant related PRs # Are these changes tested? It is only tests # Are there any user-facing changes? No
# Which issue does this PR close? Does not yet close, but contributes towards: - apache#6356 - apache#5981 - apache#1206 # Rationale for this change See the above issues. And this is a follow up to * apache#6711 * apache#6873 This was also split out from: apache#7929 # What changes are included in this PR? This removes the API to allow preserving `dict_id` set in the `Schema`'s `Field` within arrow-ipc and arrow-flight. This is in an effort to remove the `dict_id` field entirely and make it an IPC/flight-only concern. # Are these changes tested? Yes, all existing tests continue to pass. # Are there any user-facing changes? Yes, these previously (in 54.0.0) deprecated functions/fields are removed: * `arrow_ipc::DictionaryTracker.set_dict_id` * `arrow_ipc::DictionaryTracker::new_with_preserve_dict_id` * `arrow_ipc::IpcWriteOptions.with_preserve_dict_id` * `arrow_ipc::IpcWriteOptions.preserve_dict_id` (function and field) * `arrow_ipc::schema_to_fb` * `arrow_ipc::schema_to_bytes`
# Which issue does this PR close? - Part of apache#4886 - Related to apache#6965 # Rationale for this change This change introduces support for Avro files generated by systems like Impala, which have a specific convention for representing nullable fields. In Avro, nullability is typically represented by a union of a type and `null`. This PR updates the Avro reader to correctly interpret these schemas, ensuring proper handling of nullable data and improving interoperability with Impala-generated data. # What changes are included in this PR? This pull request introduces several changes to support Impala-style nullability in the Avro reader: - The Avro schema parser has been updated to recognize unions where `null` is the second type (e.g., `['type', 'null']`) as a nullable field. - Logic has been added to handle this nullability convention during Avro decoding. - New tests are included to verify that Avro files using this nullability format are read correctly while ensuring that strict mode properly identifies them. # Are these changes tested? Yes, I added new test cases covering these changes to the tests named: `test_nonnullable_impala`, `test_nonnullable_impala_strict`, `test_nullable_impala` and `test_nullable_impala_strict`. # Are there any user-facing changes? N/A --------- Co-authored-by: Connor Sanders <[email protected]>
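A minimal sketch of the union-shape check described above, using plain strings for schema branches (the real parser works on parsed schema types, so the names here are illustrative):

```rust
/// Returns the index of the non-null branch if the union encodes a nullable
/// field, accepting both `["null", T]` and Impala-style `[T, "null"]`.
fn nullable_union_value_index(branches: &[&str]) -> Option<usize> {
    match branches {
        ["null", _] => Some(1), // standard convention: null first
        [_, "null"] => Some(0), // Impala-style: null second
        _ => None,
    }
}
```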
# Which issue does this PR close? Part of apache#4886 Completes the breaking down/porting of the changes in apache#6965, which will be closed upon merge of this PR. # Rationale for this change This change brings over the remaining integration tests present in the original PR, which validate the reader logic against the files from `testing/data/avro`. PRs containing this logic have already been merged (but are not yet released), which these tests now validate. # What changes are included in this PR? The following files are now read in: - alltypes_dictionary.avro - alltypes_nulls_plain.avro - binary.avro - dict-page-offset-zero.avro - avro/list_columns.avro - nested_lists.snappy.avro - single_nan.avro - datapage_v2.snappy.avro - nested_records.avro - repeated_no_annotation.avro # Are these changes tested? This PR consists of integration tests validating code merged recently into this crate. No changes in functionality are included. # Are there any user-facing changes? N/A
…pache#7968) # Which issue does this PR close? Closes apache#6356 # Rationale for this change Now that apache#7940 is merged, nothing useful can be done with the `dict_id` field, so it is now safe to remove it from this requirement. This was also split out from: apache#7467 # What changes are included in this PR? No longer require the `dict_id` fields of two `Field`s of schemas being merged to be equal. At this point the `dict_id` is only an IPC concern; the fact that it is still in the struct definition is just legacy, marked for removal, and we're just going through the proper procedure of deprecating and replacing the APIs that use it. # Are these changes tested? Tests passing. # Are there any user-facing changes? No API changes, just a behavior change that was to be expected and desired due to the deprecations around the `dict_id` field. @alamb @adriangb @tustvold
…e#7966) # Which issue does this PR close? - Part of apache#4886 - Follow up to apache#7834 # Rationale for this change The initial Avro reader implementation contained an under-developed and temporary safeguard to prevent infinite loops when processing records that consumed zero bytes from the input buffer. When the `Decoder` reported that zero bytes were consumed, the `Reader` would advance its cursor to the end of the current data block. While this successfully prevented an infinite loop, it had the critical side effect of silently discarding any remaining data in that block, leading to potential data loss. This change enhances the decoding logic to handle these zero-byte values correctly, ensuring that the `Reader` makes proper progress without dropping data and without risking an infinite loop. # What changes are included in this PR? - **Refined Decoder Logic**: The `Decoder` has been updated to accurately track and report the number of bytes consumed for all values, including valid zero-length records like `null` or empty `bytes`. This ensures the decoder always makes forward progress. - **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that previously advanced to the end of a block on a zero-byte read has been removed. The reader now relies on the decoder to report accurate consumption and advances its cursor incrementally and safely. - **New integration test** using a temporary `zero_byte.avro` file created via this Python script: https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395 # Are these changes tested? Yes, a new `test_read_zero_byte_avro_file` test was added that reads the new `zero_byte.avro` file and confirms the update. # Are there any user-facing changes? N/A # Follow-Up PRs 1. PR to update `test_read_zero_byte_avro_file` once apache/arrow-testing#109 is merged in.
# Which issue does this PR close? - Closes apache#7899 . This PR avoids the extra allocation for the object builder and the later buffer copy. # Rationale for this change Avoid the extra allocation in the object builder, as described in the issue. # What changes are included in this PR? This removes the internal `buffer` in `ObjectBuilder`. All data insertion is done directly to the parent buffer wrapped in `parent_state`. The corresponding new fields are added to `ObjectBuilder`: - add `object_start_offset` in `ObjectBuilder`, which describes the start offset in the parent buffer for the current object - add `has_been_finished` in `ObjectBuilder`, which describes whether the current object has been finished; it is used in the `Drop` implementation. This patch modifies the logic of the `new`, `finish`, `parent_state`, and `drop` functions accordingly. In particular, it writes data into the parent buffer directly when adding a field to the object (i.e., when `insert`/`try_insert` is called). When finalizing the object (when `finish` is called), since the header and field ids must be placed in front of the data in the buffer, the builder shifts the written data bytes to make the necessary space for the header and field ids, then writes them. In `drop`, if the builder has not been finalized before being dropped, it truncates the written bytes to roll back the parent buffer state. # Are these changes tested? The logic is covered by the existing tests. # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
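A minimal sketch of the finish-time shift described above (illustrative, not the actual `ObjectBuilder` code): field bytes are appended directly to the shared parent buffer, and only `finish` splices the header in front of them.

```rust
/// Insert the object's header (and encoded field ids) in front of the field
/// bytes that were already written directly into the parent buffer.
fn finish_object(parent_buffer: &mut Vec<u8>, object_start_offset: usize, header: &[u8]) {
    // `splice` with an empty range shifts the existing bytes right and copies
    // the header into the gap -- the "shift written data" step described above.
    parent_buffer.splice(object_start_offset..object_start_offset, header.iter().copied());
}
```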
# Which issue does this PR close? - Closes apache#7686 # Rationale for this change int96 min/max statistics emitted by arrow-rs are incorrect. # What changes are included in this PR? 1. Fix the int96 stats 2. Add round-trip test to verify the behavior # Not included in this PR: 1. Read stats only from known good writers. This will be implemented after a new arrow-rs release. # Are there any user-facing changes? The int96 min/max statistics will be different and correct. --------- Co-authored-by: Rahul Sharma <[email protected]> Co-authored-by: Ed Seidl <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Alkis Evlogimenos <[email protected]>
# Rationale for this change If a variant has an unsorted dictionary, you can't assume fields are unique nor ordered by name. This PR updates the logical equality check among `VariantMetadata` to properly handle this case. - Closes apache#7952 It also fixes a bug in apache#7934 where we do a uniqueness check when probing an unsorted dictionary --------- Co-authored-by: Andrew Lamb <[email protected]>
…and writers (apache#7841) # Which issue does this PR close? - Finishes remaining work and closes apache#6661. # What changes are included in this PR? This change adds `decimal32` and `decimal64` support to Parquet, JSON and CSV readers and writers. It does not change the current default behavior of the Parquet reader which (in the absence of a specification that says otherwise) will still translate the INT32 physical type with a logical DECIMAL type into a `decimal128` instead of a `decimal32`. # Are these changes tested? Yes. # Are there any user-facing changes? The `decimal32` and `decimal64` types are now supported in Parquet, JSON and CSV readers and writers. --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Matthijs Brobbel <[email protected]>
…#7911) # Which issue does this PR close? - part of apache#6736 - Closes apache#7964 - Follow on to apache#7905 # Rationale for this change In a quest to have the fastest and most efficient Variant implementation I would like to avoid copies if at all possible. Right now, making a VariantArray first requires completing an individual buffer and appending it to the array. Let's make that faster by having the VariantBuilder append directly into the buffer # What changes are included in this PR? 1. Add `VariantBuilder::new_from_existing` 2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers # Are these changes tested? 1. New unit tests 1. Yes by existing tests # Are there any user-facing changes? Hopefully faster performance --------- Co-authored-by: Congxian Qiu <[email protected]> Co-authored-by: Liang-Chi Hsieh <[email protected]>
…overflows (apache#7887) # Which issue does this PR close? Closes apache#7886. # Rationale for this change Casting large `Decimal256` values to `Float64` can exceed the representable range of floating point numbers. Previously, this could result in a panic due to unwrapping a failed conversion. This PR introduces a safe conversion that saturates overflowing values to `INFINITY` or `-INFINITY`, following standard floating point semantics. This ensures stable, predictable behavior without runtime crashes. # What changes are included in this PR? - Introduced a helper function `decimal256_to_f64` that converts `i256` to `f64`, returning `INFINITY` or `-INFINITY` when the value is out of range. - Updated the casting logic for `Decimal256` → `Float64` to use the new safe conversion. - Improved inline and module-level documentation to reflect that this conversion is lossy and saturating. - Added a unit test `test_cast_decimal256_to_f64_overflow` to validate overflow behavior. # Are there any user-facing changes? Yes. - **Behavior Change:** When casting `Decimal256` values that exceed the `f64` range, users now receive `INFINITY` or `-INFINITY` instead of a panic. - **Improved Docs:** Updated documentation clarifies the lossy and saturating behavior of decimal-to-float casting. - **Not a Breaking Change:** There are no API changes, but users relying on panics for overflow detection may observe different behavior.
…r coalesce (apache#7967) # Which issue does this PR close? This is a very interesting idea: only count the data buffer size when deciding whether to gc, because we almost only care about gc for the data buffers, not for the other view/null buffers. Since GC is only for data buffers, the *2 calculation should also compare against the data buffer size. # Rationale for this change Optimize actual_buffer_size to use only data buffer capacity # What changes are included in this PR? Optimize actual_buffer_size to use only data buffer capacity # Are these changes tested? The performance improvement for some high-selectivity benchmarks with a low null ratio is very good, about 2x faster: ```rust cargo bench --bench coalesce_kernels "single_utf8view" Compiling arrow-select v55.2.0 (/Users/zhuqi/arrow-rs/arrow-select) Compiling arrow-cast v55.2.0 (/Users/zhuqi/arrow-rs/arrow-cast) Compiling arrow-string v55.2.0 (/Users/zhuqi/arrow-rs/arrow-string) Compiling arrow-ord v55.2.0 (/Users/zhuqi/arrow-rs/arrow-ord) Compiling arrow-csv v55.2.0 (/Users/zhuqi/arrow-rs/arrow-csv) Compiling arrow-json v55.2.0 (/Users/zhuqi/arrow-rs/arrow-json) Compiling arrow v55.2.0 (/Users/zhuqi/arrow-rs/arrow) Finished `bench` profile [optimized] target(s) in 13.26s Running benches/coalesce_kernels.rs (target/release/deps/coalesce_kernels-bb9750abedb10ad6) filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001 time: [30.946 ms 31.071 ms 31.193 ms] change: [−1.7086% −1.1581% −0.6036%] (p = 0.00 < 0.05) Change within noise threshold. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) low mild 1 (1.00%) high mild filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01 time: [3.8178 ms 3.8311 ms 3.8444 ms] change: [−4.0521% −3.5467% −3.0345%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild Benchmarking filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40. filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1 time: [1.9337 ms 1.9406 ms 1.9478 ms] change: [+0.3699% +0.9557% +1.5666%] (p = 0.00 < 0.05) Change within noise threshold. Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) low mild 3 (3.00%) high severe filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8 time: [797.60 µs 805.31 µs 813.85 µs] change: [−59.177% −58.412% −57.639%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001 time: [43.742 ms 43.924 ms 44.108 ms] change: [−1.2146% −0.5778% +0.0828%] (p = 0.08 > 0.05) No change in performance detected. filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01 time: [5.5736 ms 5.5987 ms 5.6247 ms] change: [−0.2381% +0.4740% +1.1711%] (p = 0.18 > 0.05) No change in performance detected. filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1 time: [2.2963 ms 2.3035 ms 2.3109 ms] change: [−0.9314% −0.5125% −0.0931%] (p = 0.02 < 0.05) Change within noise threshold. Benchmarking filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8 time: [1.5482 ms 1.5697 ms 1.5903 ms] change: [−45.794% −44.386% −43.000%] (p = 0.00 < 0.05) Performance has improved. ``` # Are there any user-facing changes? --------- Co-authored-by: Andrew Lamb <[email protected]>
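A minimal sketch of the heuristic change described above (names and the factor-of-two threshold are illustrative of the description, not the `arrow-select` internals): only the variable-length data buffers count toward the size compared against twice the in-use bytes.

```rust
/// Decide whether a StringView-style array is worth garbage collecting:
/// gc only rewrites the data buffers, so view/null buffer capacity is ignored.
fn should_gc(data_buffer_capacity: usize, in_use_data_bytes: usize) -> bool {
    data_buffer_capacity > in_use_data_bytes * 2
}
```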
…th length zero (apache#8009) # Which issue does this PR close? - Closes apache#7549. # Rationale for this change The failure was described in the issue. In short, the buffer pointer of an empty buffer array exported from polars is a dangling, unaligned pointer. But currently we take the raw pointer from the C Data Interface and check its alignment before interpreting it as a `ScalarBuffer`, which causes the failure. # What changes are included in this PR? This patch changes the FFI module to create an empty buffer for an exported buffer with length zero. As we never dereference the dangling pointer, it seems unnecessary to require alignment for it. For non-empty buffers, we still keep the alignment check. # Are these changes tested? Added a unit test with necessary utility functions. # Are there any user-facing changes? No --------- Co-authored-by: Liang-Chi Hsieh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
…map scan (up to 30% faster) (apache#7962) # Which issue does this PR close? This PR is a follow-up to apache#7937. I want to experiment with the performance of word-level (u64) bit scanning. Details: apache#7937 (review) # Rationale for this change Use word-level (u64) bit scanning. This is implemented with `set_indices`, but we need u32 indices, so I also added `set_indices_u32`; it shows a ~7% improvement compared to calling `set_indices` and then casting to u32. # What changes are included in this PR? Use word-level (u64) bit scanning via `set_indices`, plus the new `set_indices_u32` described above. # Are these changes tested? Yes, added unit tests and fuzz testing; also covered by the existing sort fuzz tests. # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
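A generic sketch of word-level bit scanning as described above (not the arrow-buffer implementation): each u64 word is drained with `trailing_zeros`, clearing the lowest set bit on every iteration.

```rust
/// Collect the positions of all set bits, producing u32 indices directly.
fn set_indices_u32(words: &[u64]) -> Vec<u32> {
    let mut out = Vec::new();
    for (word_idx, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros(); // position of the lowest set bit
            out.push(word_idx as u32 * 64 + bit);
            w &= w - 1; // clear that bit and continue
        }
    }
    out
}
```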
…apache#8005) # Which issue does this PR close? - closes apache#8004 # Rationale for this change I'm opening this PR to explore the idea that, for a given column, the `ArrowWriter` should consider native arrays and dictionary arrays with the same value type to be compatible. e.g. if I write two record batches, and in the first batch column `a` is type `Dictionary<_, Utf8>`, I should be able to append a second batch to my writer where column `a` is type `Utf8` (the native array). # What changes are included in this PR? This PR relaxes the check added in apache#5341 when creating `LevelsInfoBuilder` to consider the types compatible using the logic explained above. # Are these changes tested? There is a basic unit test. I'm happy to add more if this change is something acceptable. # Are there any user-facing changes? No
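A minimal sketch of the relaxed compatibility idea described above (illustrative, not the `LevelsInfoBuilder` code), assuming `arrow_schema::DataType` is available:

```rust
use arrow_schema::DataType;

/// Treat a dictionary-encoded column and its native value type as the same
/// logical type for the purposes of appending a batch to an existing writer.
fn logically_compatible(writer_type: &DataType, batch_type: &DataType) -> bool {
    fn value_type(dt: &DataType) -> &DataType {
        match dt {
            DataType::Dictionary(_, value) => value.as_ref(),
            other => other,
        }
    }
    value_type(writer_type) == value_type(batch_type)
}
```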
…g strings (apache#8015) ### Description Add benchmark case for performance comparison in apache#7917 .
…cimal256 → Float64 casts (apache#7986) # Which issue does this PR close? Closes apache#7985 --- # Rationale for this change The existing Decimal256 → Float64 conversion was changed to **saturate** out-of-range values to `±INFINITY` (PR apache#7887) in order to avoid panics. However, every 256-bit signed integer actually fits within the exponent range of an IEEE-754 `f64` (±2¹⁰²³), so we can always produce a **finite** `f64`, only sacrificing mantissa precision. By overriding `i256::to_f64` to split the full 256-bit magnitude into high/low 128-bit halves, recombine as ```text (high as f64) * 2^128 + (low as f64) ``` and reapply the sign (special-casing i256::MIN), we: - Eliminate both panics and infinite results - Match Rust’s built-in (i128) as f64 rounding (ties-to-even) - Simplify casting logic—no saturating helpers or extra flags required # What changes are included in this PR? - Added full-range fn to_f64(&self) -> Option<f64> for i256, using checked_abs() + to_parts() + recombination - Removed fallback through 64-bit to_i64()/to_u64() and .unwrap() - Replaced the old decimal256_to_f64 saturating helper with a thin wrapper around the new i256::to_f64() (always returns Some) - Updated Decimal256 → Float64 cast sites to call the new helper ## Tests - Reworked “overflow” tests to assert finite & correctly signed results for i256::MAX and i256::MIN - Added typical-value tests; removed expectations of ∞/-∞ # Are there any user-facing changes? Behavior change: - Very large or small Decimal256 values no longer become +∞/-∞. - They now map to very large—but finite—f64 values (rounded to nearest mantissa). ## API impact: No public API signatures changed. Conversion remains lossy by design; users relying on saturation-to-infinity will observe different (more faithful) behavior. --------- Co-authored-by: Ryan Johnson <[email protected]>
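A minimal sketch of the recombination formula above, with the 256-bit magnitude represented as two u128 halves plus a separate sign (illustrative, not the actual `i256` implementation; the double rounding here is slightly coarser than the ties-to-even behavior described above):

```rust
/// value = ±(hi * 2^128 + lo); always finite because 2^255 is well within
/// f64's exponent range (up to 2^1023), only mantissa precision is lost.
fn u256_magnitude_to_f64(hi: u128, lo: u128, negative: bool) -> f64 {
    let two_pow_128 = 2f64.powi(128);
    let magnitude = (hi as f64) * two_pow_128 + (lo as f64);
    if negative { -magnitude } else { magnitude }
}
```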
…alidation disabled (apache#7917) # Which issue does this PR close? - Related to apache#6057 . # Rationale for this change As described in above issue, when constructing a `StringViewArray` from rows, we currently store inline strings twice: once through `make_view`, and again in the `values buffer` so that we can validate utf8 in one go. However, this is suboptimal in terms of memory consumption, so ideally, we should avoid placing inline strings into the values buffer when UTF-8 validation is disabled. # What changes are included in this PR? When UTF-8 validation is disabled, this PR modifies the string/bytes view array construction from rows as follows: 1. The capacity of the values buffer is set to accommodate only long strings plus 12 bytes for a single inline string placeholder. 2. All decoded strings are initially appended to the values buffer. 3. If a string turns out to be an inline string, it is included via `make_view`, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array. # Are these changes tested? 1. copied & modified existing `fuzz_test` to set disable utf8 validation. 2. Run bench & add bench case when array consists of both inline string & long strings # Are there any user-facing changes? No. # Considered alternatives One idea was to support separate buffers for inline strings even when UTF-8 validation is enabled. However, since we need to call `decoded_len()` first to determine the target buffer, this approach can be tricky or inefficient: - For example, precomputing a boolean flag per string to determine which buffer to use would increase temporary memory usage. - Alternatively, appending to the values buffer first and then moving inline strings to a separate buffer would lead to frequent memcpy overhead. Given that datafusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off. --------- Co-authored-by: Andrew Lamb <[email protected]>
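A minimal sketch of the truncate-after-append idea above (illustrative, not the row-format code): every decoded string is appended first, and inline-able strings are immediately truncated back out so they are not stored twice.

```rust
/// Strings of at most 12 bytes are stored inline in a string-view, so they
/// do not need to stay in the values buffer.
const MAX_INLINE_LEN: usize = 12;

fn append_decoded(values: &mut Vec<u8>, s: &[u8]) -> bool {
    let start = values.len();
    values.extend_from_slice(s);
    if s.len() <= MAX_INLINE_LEN {
        values.truncate(start); // inline string: drop it from the buffer again
        true // caller builds an inline view via `make_view`
    } else {
        false // long string: the view references the bytes left in `values`
    }
}
```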
…he#8014) # Which issue does this PR close? - part of apache#7395 # Rationale for this change Keep the code flowing # What changes are included in this PR? 1. Update CHANGELOG. See rendered version here: https://github.com/alamb/arrow-rs/blob/alamb/prepare_56.0.0/CHANGELOG.md 2. Update version to `56.0.0` # Are these changes tested? N/A # Are there any user-facing changes? Yes
# Which issue does this PR close? - part of apache#7395 - Closes apache#8018 # Rationale for this change Fix a bug I found while testing the RC # What changes are included in this PR? Check the ARROW_TESTING directory first before looking at the local path # Are these changes tested? Yes, by CI and I tested it manually # Are there any user-facing changes? No
# Which issue does this PR close? - related to apache#7850 # Rationale for this change While reviewing apache#7850 from @XiangpengHao I found myself wanting even more comments (or maybe I was doing this as an exercise to load the state back into my head). In any case, I wrote up some comments that I think would make the code easier to understand # What changes are included in this PR? Add some more docs # Are these changes tested? By CI # Are there any user-facing changes? No -- this is documentation of internal interfaces. There is no code or functional change
# Which issue does this PR close? - Closes apache#8023 # Rationale for this change Add a synchronous way to read parquet bloom filters. # What changes are included in this PR? Add a sync `get_row_group_column_bloom_filter` # Are these changes tested? By unit tests # Are there any user-facing changes? New API added
Fix a typo in README. # Which issue does this PR close? - Closes #NNN. # Rationale for this change Fix typo # What changes are included in this PR? Fix typo # Are these changes tested? Nope. # Are there any user-facing changes? Nope. Signed-off-by: Cheng-Yang Chou <[email protected]>
# Which issue does this PR close? - Part of apache#5854. # Rationale for this change Before embarking on radical changes to the thrift processing in the parquet crate, add a few more benchmarks to help evaluate the performance gains. # What changes are included in this PR? Adds a test originally written by @tustvold (https://github.com/tustvold/arrow-rs/tree/thrift-bench) and exposes a new public function in the thrift module rather than making the thrift parser public. # Are these changes tested? Benchmark code only, so no tests necessary. # Are there any user-facing changes? No, but adds a public function.
# Which issue does this PR close? - Part of apache#4886 - Pre-work for apache#8006 # Rationale for this change Apache Avro’s [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint ](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message. While the current `arrow‑avro` implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time. The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) hashed fingerprint of the [parsed canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) to look up the `Schema` from a local schema store or registry. This PR introduces **`SchemaStore`** and **fingerprinting** to enable: * **Zero‑copy schema identification** for decoding streaming Avro messages published in single‑object format (i.e. Kafka, Pulsar, etc) into Arrow. * **Dynamic schema evolution** by laying the foundation to resolve writer reader schema differences on the fly. **NOTE:** Integration with `Decoder` and `Reader` coming in next PR. # What changes are included in this PR? | Area | Highlights | | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. | | **`lib.rs`** | `mod schema` is now `pub` | | **Unit tests** | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. | | **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. | # Are these changes tested? Yes. New tests cover: 1. **Fingerprinting** against the canonical examples from the Avro spec 2. **`SchemaStore` behavior** deduplication, duplicate registration, and lookup. # Are there any user-facing changes? N/A
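A minimal sketch of the single-object framing described above: the 2-byte marker is followed by the 8-byte little-endian Rabin fingerprint, then the Avro body (the constant name mirrors the PR; the function itself is illustrative).

```rust
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Split a single-object-encoded message into (schema fingerprint, Avro body).
fn split_single_object(frame: &[u8]) -> Option<(u64, &[u8])> {
    if frame.len() >= 10 && frame[..2] == SINGLE_OBJECT_MAGIC {
        let fingerprint = u64::from_le_bytes(frame[2..10].try_into().ok()?);
        Some((fingerprint, &frame[10..]))
    } else {
        None
    }
}
```

The fingerprint can then be used to look up the writer schema in a `SchemaStore` before decoding the body.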
…ray` (apache#8044) # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Closes apache#8043 # Rationale for this change As @Samyak2 suggested on https://github.com/apache/arrow-rs/pull/8021/files#r2249926579, having the ability to convert *FROM* a typed value to a VariantArray will be important For example, in SQL it could be used to cast columns to variant like in `some_column::variant` # What changes are included in this PR? 1. Add `cast_to_variant` kernel to cast native types to `VariantArray` 2. Tests # Are these changes tested? yes # Are there any user-facing changes? New kernel
…8068) # Which issue does this PR close? - Follow on to apache#8044 # Rationale for this change @scovich had a good suggestion here https://github.com/apache/arrow-rs/pull/8044/files#r2257256848: > value.into() makes clear that the conversion is infallible? # What changes are included in this PR? Use `From` impl to make it clear the conversion is infallible and can not lose precision # Are these changes tested? Covered by existing tests # Are there any user-facing changes? No
# Which issue does this PR close? - Closes apache#8045 # Rationale for this change Simple bug fix where the order of `name` and `namespace` is wrong in Resolver field registration method. # What changes are included in this PR? One-line bug fix and corresponding tests. # Are these changes tested? - Added a test in the same file to test the fix. Made sure it doesn't pass with the original code. # Are there any user-facing changes? No
# Which issue does this PR close? - Closes #NNN. # Rationale for this change I was slightly annoyed that the int96 stats roundtrip test was not part of the other arrow-reader tests. You have to run it separately: ```shell cargo test --test int96_stats_roundtrip ``` Let's just move it to the arrow-reader tests, so it is run with the other tests: 1. It is easier to find 2. It takes slightly less time to build and test the parquet crate ```shell cargo test --test arrow_reader ``` # What changes are included in this PR? 1. Move the tests # Are these changes tested? Yes, by CI # Are there any user-facing changes? No
# Which issue does this PR close? Improve StringArray(Utf8) sort performance - Closes [apache#7847](apache#7847) # Rationale for this change Support prefix comparison: I optimized it to compare a u32 prefix first and then compare in u64 increments, which gave the best performance in my experiments. # What changes are included in this PR? Support prefix comparison with a u32 prefix and u64 incremental comparison, as described above (see also the sketch below). # Are these changes tested? Yes ```rust critcmp issue_7847 main --filter "sort string\[" group issue_7847 main ----- ---------- ---- sort string[0-400] nulls to indices 2^12 1.00 51.4±0.56µs ? ?/sec 1.19 61.0±1.02µs ? ?/sec sort string[0-400] to indices 2^12 1.00 96.5±1.63µs ? ?/sec 1.23 118.3±0.91µs ? ?/sec sort string[10] dict nulls to indices 2^12 1.00 72.4±1.00µs ? ?/sec 1.00 72.5±0.61µs ? ?/sec sort string[10] dict to indices 2^12 1.00 137.1±1.51µs ? ?/sec 1.01 138.1±1.06µs ? ?/sec sort string[10] nulls to indices 2^12 1.00 47.5±0.69µs ? ?/sec 1.18 56.3±0.56µs ? ?/sec sort string[10] to indices 2^12 1.00 86.4±1.37µs ? ?/sec 1.20 103.5±1.13µs ? ?/sec ``` # Are there any user-facing changes? --------- Co-authored-by: Andrew Lamb <[email protected]>
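A generic sketch of the chunked comparison idea above (not the arrow-ord code): the common prefix is compared 8 bytes at a time as big-endian u64 words, falling back to bytes for the tail.

```rust
use std::cmp::Ordering;

fn compare_bytes(a: &[u8], b: &[u8]) -> Ordering {
    let common = a.len().min(b.len());
    // Big-endian u64 words compare in the same order as the underlying bytes.
    for (ca, cb) in a[..common].chunks_exact(8).zip(b[..common].chunks_exact(8)) {
        let wa = u64::from_be_bytes(ca.try_into().unwrap());
        let wb = u64::from_be_bytes(cb.try_into().unwrap());
        if wa != wb {
            return wa.cmp(&wb);
        }
    }
    // Remaining (< 8) common bytes, then length as the tie-breaker.
    let tail = common - common % 8;
    a[tail..common].cmp(&b[tail..common]).then(a.len().cmp(&b.len()))
}
```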
# Which issue does this PR close? - Closes apache#8057 # Rationale for this change Adds Float16 conversion to the `cast_to_variant` kernel # What changes are included in this PR? - a macro to make converting array types that require a cast simpler - conversion of `DataType::Float16` => `Variant::Float` # Are these changes tested? Yes, additional unit tests have been added. # Are there any user-facing changes? Yes, adds new type conversion to kernel
…on (apache#8029) - Closes apache#7359. # Rationale for this change This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See apache#7359 for more. See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618 # What changes are included in this PR? * `ArrowWriter` now has a `pub get_column_writers` method that can be used to write columns concurrently. * Minor change to how encryption tests read test data. # Are these changes tested? Yes. # Are there any user-facing changes? `pub ArrowWriter.get_column_writers` and `pub ArrowWriter.append_row_group` are added. Both to enable concurrent use of column writers. `WriterPropertiesBuilder` now implements `Default`. --------- Co-authored-by: Adam Reeve <[email protected]>
# Which issue does this PR close? - Part of apache#4886 # Rationale for this change This change introduces a comprehensive benchmark suite for the `arrow-avro` decoder. Having robust benchmarks is crucial for several reasons: - It allows for the measurement and tracking of decoding performance over time. - It helps identify performance regressions or improvements as the codebase evolves. - It provides a standardized way to evaluate the impact of optimizations and new features. # What changes are included in this PR? This PR adds a new benchmark file: `arrow-avro/benches/decoder.rs`. The key components of this new file are: - **Comprehensive Type Coverage**: Adds benchmark scenarios for a wide range of data types, including: - Primitive types (`Int32`, `Int64`, `Float32`, `Float64`, `Boolean`) - Binary and String types (`Binary(Bytes)`, `String`, `StringView`) - Logical types (`Date32`, `TimeMillis`, `TimeMicros`, `TimestampMillis`, `TimestampMicros`, `Decimal128`, `UUID`, `Interval`, `Enum`) - Complex types (`Map`, `Array`, `Nested(Struct)`) - `FixedSizeBinary` - A `Mixed` schema with multiple fields - Update to criterion 7.0.0 - Made `mod schema` public # Are these changes tested? These changes are covered by the benchmark tests themselves. # Are there any user-facing changes? N/A
# Which issue does this PR close? # Rationale for this change Clippy is failing on main. Here is an example - https://github.com/apache/arrow-rs/actions/runs/16804746850/job/47594208868 Rust 1.89 was released today and it includes a new clippy version that is more strict about some lints: https://blog.rust-lang.org/2025/08/07/Rust-1.89.0/ # What changes are included in this PR? Fix clippy lints to make CI pass with Rust 1.89 # Are these changes tested? By CI # Are there any user-facing changes?
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5. Release notes (v5.0.0): README update by @nebuk89 in actions/download-artifact#407, and a BREAKING fix for inconsistent path behavior for single artifact downloads by ID by @GrantBirki in actions/download-artifact#416. Previously, single artifact downloads behaved differently depending on how the artifact was specified: by name (`name: my-artifact`) extracted to `path/` (direct), while by ID (`artifact-ids: 12345`) extracted to `path/my-artifact/` (nested). Both methods now extract directly to `path/`. No action is needed if you download artifacts by name, download multiple artifacts by ID, or already use `merge-multiple: true` as a workaround; workflows that download a single artifact by ID and expect the nested directory structure must be updated. Full changes: https://github.com/actions/download-artifact/compare/v4...v5 Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
---

<details>
<summary>Dependabot commands and options</summary>

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Binary` (apache#8074)

# Which issue does this PR close?

- Closes apache#8050

# Rationale for this change

Adds Binary, LargeBinary, and BinaryView conversions to the `cast_to_variant` kernel.

# What changes are included in this PR?

- a macro to simplify array type conversions
- conversion of `DataType::{Binary, LargeBinary, BinaryView}` => `Variant::Binary`

# Are these changes tested?

Yes, additional unit tests have been added.

# Are there any user-facing changes?

Yes, this adds new type conversions to the kernel.

---------

Co-authored-by: Andrew Lamb <[email protected]>
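As a concrete illustration of the conversion direction, here is a hypothetical, simplified sketch (the real kernel uses the parquet-variant crate's own `Variant` type and builders, and its actual macro differs); the stand-in `Variant` enum and `binary_to_variants!` macro below are assumptions for illustration only:

```rust
use arrow_array::{Array, BinaryArray};

// Stand-in for the real crate's Variant type (illustration only).
#[derive(Debug, PartialEq)]
enum Variant<'a> {
    Null,
    Binary(&'a [u8]),
}

// One macro body can serve BinaryArray, LargeBinaryArray and BinaryViewArray,
// since they all expose `len`, `is_null` and `value(i) -> &[u8]`.
macro_rules! binary_to_variants {
    ($array:expr) => {
        (0..$array.len())
            .map(|i| {
                if $array.is_null(i) {
                    Variant::Null
                } else {
                    Variant::Binary($array.value(i))
                }
            })
            .collect::<Vec<_>>()
    };
}

fn main() {
    let array = BinaryArray::from_opt_vec(vec![Some(&b"ab"[..]), None]);
    let variants = binary_to_variants!(array);
    assert_eq!(variants, vec![Variant::Binary(b"ab"), Variant::Null]);
}
```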
# Which issue does this PR close?

- Part of apache#6736
- Closes apache#7941
- Closes apache#7965

# Rationale for this change

This has a proposal for how to structure shredded `VariantArray`s and the `variant_get` kernel. If people like the basic idea I will file some more tickets to track additional follow-on work.

It is based on ideas from @carpecodeum in apache#7946 and @scovich in apache#7915. I basically took the tests from apache#7965 and the conversation with @scovich recorded in apache#7941 (comment), and bashed out how this might look.

# What changes are included in this PR?

1. Update `VariantArray` to represent shredding
2. Add code to `variant_get` to support extracting paths as both variants and typed fields
3. A pattern that I think can represent shredding and extraction
4. Tests for the same

Note there are many things that are NOT in this PR that I envision doing as follow-on PRs:

1. Support for and implementation of `Path`s
2. Support for shredded objects
3. Support for shredded lists
4. Support for nested objects / lists
5. Full casting support
6. Support for other output types: `StringArray`, `StringViewArray`, etc.
7. Many performance improvements

# Are these changes tested?

Yes

# Are there any user-facing changes?

New feature

---------

Co-authored-by: Samyak Sarnayak <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
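For orientation, here is a rough sketch of what a shredded variant column's Arrow type could look like. This is my reading of the Parquet variant shredding layout, not code from this PR; the field names and the choice of `Int64` for `typed_value` are assumptions:

```rust
use arrow_schema::{DataType, Field, Fields};

// An unshredded variant is a struct of two binary buffers; a shredded one adds
// a strongly typed `typed_value` child for the shredded portion, while `value`
// keeps whatever could not be shredded.
fn shredded_variant_type() -> DataType {
    DataType::Struct(Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, true),
        // Assumed here to be a shredded Int64 column; real shredding can nest
        // further structs and lists.
        Field::new("typed_value", DataType::Int64, true),
    ]))
}

fn main() {
    println!("{:#?}", shredded_variant_type());
}
```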
…Resolution (apache#8006)

# Which issue does this PR close?

- Part of apache#4886
- Follow up to apache#7834

# Rationale for this change

Apache Avro's [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message. While the current `arrow-avro` implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time. The Avro specification recommends computing a 64-bit CRC-64-AVRO (Rabin) fingerprint of the [parsed canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) to look up the `Schema` in a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero-copy schema identification** for decoding streaming Avro messages published in single-object format (i.e. Kafka, Pulsar, etc.) into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve writer/reader schema differences on the fly.

**NOTE:** Schema resolution support in `Codec` and `RecordDecoder` is coming in the next PR.

# What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| **`reader/mod.rs`** | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU-cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2-byte peek for high-throughput fixed-schema pipelines. |
| **`ReaderBuilder`** | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with rigorous validation to prevent misconfiguration. |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown-fingerprint errors, and interaction with UTF8-view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |

# Are these changes tested?

Yes. New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup
3. **Decoder fast path** with `static_store_mode=true`, ensuring the prefix is treated as payload, the 2-byte peek is skipped, and no schema switch is attempted

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. Implement schema resolution functionality in `Codec` and `RecordDecoder`
2. Add an ID `Fingerprint` variant on `SchemaStore` for Confluent Schema Registry compatibility
3. Improve arrow-avro errors and add more benchmarks & examples to prepare for public release

---------

Co-authored-by: Ryan Johnson <[email protected]>
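To make the framing concrete, the sketch below parses a single-object frame by hand. It is a minimal illustration of the wire format described in the Avro spec (2-byte marker, 8-byte little-endian Rabin fingerprint, then the Avro-encoded body), not the `arrow-avro` decoder's actual code:

```rust
/// Marker bytes that open every Avro single-object encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Splits a frame into (schema fingerprint, Avro-encoded body), or returns
/// `None` if the marker is missing or the frame is too short.
fn split_single_object(frame: &[u8]) -> Option<(u64, &[u8])> {
    if frame.len() < 10 || frame[..2] != SINGLE_OBJECT_MAGIC {
        return None;
    }
    // Bytes 2..10 hold the CRC-64-AVRO (Rabin) fingerprint, little-endian.
    let fingerprint = u64::from_le_bytes(frame[2..10].try_into().ok()?);
    Some((fingerprint, &frame[10..]))
}

fn main() {
    // 0xC3 0x01 marker, an 8-byte fingerprint of 1, then a one-byte body.
    let frame = [0xC3, 0x01, 1, 0, 0, 0, 0, 0, 0, 0, 0x02];
    let (fingerprint, body) = split_single_object(&frame).unwrap();
    assert_eq!(fingerprint, 1);
    assert_eq!(body, &[0x02u8]);
}
```

A store keyed by that fingerprint is then all a decoder needs to pick the right writer schema per message, which is the role `SchemaStore` plays here.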
…sync_reader) (apache#7850)

This is my latest attempt to make pushdown faster. Prior art: apache#6921

cc @alamb @zhuqi-lucas

- Part of apache#8000
- Related to apache/datafusion#3463
- Related to apache#7456
- Closes apache#7363
- Closes apache#8003

## Problems of apache#6921

1. It proactively loads the entire row group into memory (rather than only loading pages that pass the filter predicate).
2. It only caches decompressed pages, still paying the decoding cost twice.

This PR takes a different approach: it does not change the decoding pipeline, so we avoid problem 1. It also caches the arrow record batches, which avoids problem 2. But this means we need to use more memory to cache data.

## How it works

1. It instruments the `array_readers` with a transparent `cached_array_reader`.
2. The cache layer first consults the `RowGroupCache` to look for a batch, and only reads from the underlying reader on a cache miss.
3. There are a cache producer and a cache consumer: the producer inserts arrow arrays into the cache while we build filters, and the consumer removes arrow arrays from the cache while we build outputs. So the memory usage should look like this:

   ```
    ▲
    │        ╭─╮
    │       ╱   ╲
    │      ╱     ╲
    │     ╱       ╲
    │    ╱         ╲
    │   ╱           ╲
    └──╱─────────────╲──────► Time
       │      │      │
     Filter  Peak  Consume Phase
            (Built) (Decrease)
   ```

   In a concurrent setup, not all readers may reach the peak point at the same time, so the peak system memory usage might be lower.

4. It has a `max_cache_size` knob; this is a per-row-group setting. If the row group has used up the budget, the cache stops taking new data, and the `cached_array_reader` falls back to reading and decoding from Parquet.

## Other benefits

1. This architecture allows nested columns (not implemented in this PR), i.e., it is future proof.
2. There are many optimizations left to further squeeze out performance, but even in its current state it has no regressions.

## How does it perform?

My criterion run somehow won't produce a result from `--save-baseline`, so I asked an LLM to generate a table from this benchmark:

```
cargo bench --bench arrow_reader_clickbench --features "arrow async" "async"
```

`Baseline` is the implementation on the current main branch. `New Unlimited` is the new pushdown with an unlimited memory budget. `New 100MB` is the new pushdown with a 100MB per-row-group cache budget.
```
Query | Baseline (ms) | New Unlimited (ms) | Diff (ms) | New 100MB (ms) | Diff (ms)
-------+--------------+--------------------+-----------+----------------+-----------
Q1     |        0.847 |              0.803 |    -0.044 |          0.812 |    -0.035
Q10    |        4.060 |              6.273 |    +2.213 |          6.216 |    +2.156
Q11    |        5.088 |              7.152 |    +2.064 |          7.193 |    +2.105
Q12    |       18.485 |             14.937 |    -3.548 |         14.904 |    -3.581
Q13    |       24.859 |             21.908 |    -2.951 |         21.705 |    -3.154
Q14    |       23.994 |             20.691 |    -3.303 |         20.467 |    -3.527
Q19    |        1.894 |              1.980 |    +0.086 |          1.996 |    +0.102
Q20    |       90.325 |             64.689 |   -25.636 |         74.478 |   -15.847
Q21    |      106.610 |             74.766 |   -31.844 |         99.557 |    -7.053
Q22    |      232.730 |            101.660 |  -131.070 |        204.800 |   -27.930
Q23    |      222.800 |            186.320 |   -36.480 |        186.590 |   -36.210
Q24    |       24.840 |             19.762 |    -5.078 |         19.908 |    -4.932
Q27    |       80.463 |             47.118 |   -33.345 |         49.597 |   -30.866
Q28    |       78.999 |             47.583 |   -31.416 |         51.432 |   -27.567
Q30    |       28.587 |             28.710 |    +0.123 |         28.926 |    +0.339
Q36    |       80.157 |             57.954 |   -22.203 |         58.012 |   -22.145
Q37    |       46.962 |             45.901 |    -1.061 |         45.386 |    -1.576
Q38    |       16.324 |             16.492 |    +0.168 |         16.522 |    +0.198
Q39    |       20.754 |             20.734 |    -0.020 |         20.648 |    -0.106
Q40    |       22.554 |             21.707 |    -0.847 |         21.995 |    -0.559
Q41    |       16.430 |             16.391 |    -0.039 |         16.581 |    +0.151
Q42    |        6.045 |              6.157 |    +0.112 |          6.120 |    +0.075
```

1. If we consider diffs within 5 ms to be noise, then we are never worse than the current implementation.
2. We see significant improvements for string-heavy queries: string columns are large, and they take time to decompress and decode.
3. A 100MB cache budget seems to have a small performance impact.

## Limitations

1. It only works for async readers, because the sync reader does not follow the same row-group-by-row-group structure.
2. It is memory hungry compared to apache#6921. But changing the decoding pipeline without eagerly loading the entire row group would require significant changes to the current decoding infrastructure, e.g., we would need to make the page iterator an async function.
3. It currently doesn't support nested columns; more specifically, it doesn't support nested columns with nullable parents, but supporting them is straightforward, no big changes.
4. The current memory accounting is not accurate: it overestimates memory usage, especially when reading string view arrays, where multiple string views may share the same underlying buffer and that buffer size is counted twice. Regardless, we never exceed the user-configured memory limit.
5. If one row passes the filter, the entire batch will be cached. We can probably optimize this, though.

## Next steps?

This PR is largely a proof of concept; I want to collect some feedback before sending a multi-thousand-line PR :) Some items I can think of:

1. Design an interface for users to specify the cache size limit; currently it is hard-coded.
2. Don't instrument nested array readers if the parquet file has a nullable parent; currently it will panic.
3. More testing, and integration tests/benchmarks with DataFusion.

---------

Co-authored-by: Andrew Lamb <[email protected]>
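The consult-cache-first, fall-back-on-miss behavior described above is small enough to sketch. The following is a self-contained toy (the struct layout, key shape, and closure-based fallback are my own illustration, not the PR's actual `array_reader` plumbing), showing a per-row-group byte budget on the producer side and cache shrinkage on the consumer side:

```rust
use std::collections::HashMap;

/// Toy per-row-group cache keyed by (column index, batch index).
struct RowGroupCache {
    max_cache_size: usize, // byte budget for this row group
    used: usize,
    batches: HashMap<(usize, usize), Vec<i64>>,
}

impl RowGroupCache {
    fn new(max_cache_size: usize) -> Self {
        Self { max_cache_size, used: 0, batches: HashMap::new() }
    }

    /// Producer side (filter building): insert only while under budget.
    fn put(&mut self, key: (usize, usize), batch: Vec<i64>) {
        let size = batch.len() * std::mem::size_of::<i64>();
        if self.used + size <= self.max_cache_size {
            self.used += size;
            self.batches.insert(key, batch);
        }
    }

    /// Consumer side (output building): taking a batch shrinks the cache.
    fn take(&mut self, key: (usize, usize)) -> Option<Vec<i64>> {
        let batch = self.batches.remove(&key)?;
        self.used -= batch.len() * std::mem::size_of::<i64>();
        Some(batch)
    }
}

/// Stand-in for `cached_array_reader`: ask the cache, otherwise decode.
fn next_batch(
    cache: &mut RowGroupCache,
    key: (usize, usize),
    decode_from_parquet: impl FnOnce() -> Vec<i64>,
) -> Vec<i64> {
    match cache.take(key) {
        Some(batch) => batch,          // hit: decoding cost paid only once
        None => decode_from_parquet(), // miss (or over budget): decode again
    }
}

fn main() {
    let mut cache = RowGroupCache::new(1024);
    cache.put((0, 0), vec![1, 2, 3]); // filter phase cached this batch
    assert_eq!(next_batch(&mut cache, (0, 0), || vec![1, 2, 3]), vec![1, 2, 3]);
    assert_eq!(next_batch(&mut cache, (0, 1), || vec![4, 5]), vec![4, 5]); // miss
}
```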
closing this and opening a new one
# Which issue does this PR close?

- Closes apache#8084.

# Rationale for this change

Adds integration tests for reading variants.

# Are there any user-facing changes?

No, mostly test changes.
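As a sketch of what such an integration test can look like, the snippet below reads a Parquet file and checks a variant column. The fixture path and the column name `var` are hypothetical placeholders, not the actual test data added by this PR, and it assumes the `parquet` crate with its `arrow` feature:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical test fixture; the real tests ship their own parquet files.
    let file = File::open("testdata/variant_primitive.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    for batch in reader {
        let batch = batch?;
        let schema = batch.schema();
        // A variant column arrives as a struct of binary buffers; here we only
        // assert that the column named "var" survived the round trip.
        let (idx, field) = schema
            .column_with_name("var")
            .expect("variant column should be present");
        println!("column {idx} has type {:?}", field.data_type());
        assert_eq!(batch.column(idx).len(), batch.num_rows());
    }
    Ok(())
}
```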