@carpecodeum

Which issue does this PR close?

Closes apache#8084.

Rationale for this change

Adds integration tests for reading variants

Are there any user-facing changes?

No, mostly test changes.

alamb and others added 30 commits July 16, 2025 13:38
# Which issue does this PR close?

- Related to apache#7395 
- Closes apache#7495
- Closes apache#7377

# Rationale for this change

Let's update tonic to the latest

Given the open and unresolved questions on @rmn-boiko's PR
apache#7377 from @Xuanwo and @sundy-li,
I thought a new PR would result in a faster resolution.

# What changes are included in this PR?

This PR is based on apache#7495 from
@MichaelScofield -- I resolved some merge conflicts and updated
Cargo.toml in the integration tests

# Are these changes tested?

Yes, by CI
# Are there any user-facing changes?
New dependency version

---------

Co-authored-by: LFC <[email protected]>
…pache#7922)

# Which issue does this PR close?

- Part of apache#7896

# Rationale for this change

In apache#7896, we saw that inserting a large number of field names takes a long time -- in this case ~45s to
insert 2**24 field names. The bulk of this time is spent just allocating
the strings, but we also see quite a bit of time spent reallocating the
`IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront
which avoids having to reallocate and rehash the entire `IndexSet`
during field name insertion. Using this method requires at least 2
string allocations for each field name -- 1 to declare field names
upfront and 1 to insert the actual field name during object building.

This PR adds a new method `with_field_name_capacity`, which lets you reserve space in the metadata builder without needing to allocate the field names themselves upfront. With this change, we see a modest performance improvement when inserting the field names during object building (see the sketch below).
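
As a rough illustration of why reserving capacity upfront helps, here is a sketch using a plain `indexmap::IndexSet` (this is not the parquet-variant code; the capacity value is whatever the caller expects to insert):

```rust
use indexmap::IndexSet;

// Reserving capacity upfront avoids repeated rehash/reallocation of the set
// while field names are inserted one by one during object building.
fn build_field_names(names: impl IntoIterator<Item = String>, capacity: usize) -> IndexSet<String> {
    let mut set = IndexSet::with_capacity(capacity);
    for name in names {
        set.insert(name);
    }
    set
}
```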

Before: (benchmark screenshot)

After: (benchmark screenshot)
…che#7914)

# Which issue does this PR close?

- Fixes apache#7907

# Rationale for this change

Appending a `VariantObject` or `VariantList` directly to the `VariantBuilder` currently panics.


# Changes to the public API

`VariantBuilder` now has these additional methods: 

- `append_object`, will panic if shallow validation fails or the object
has duplicate field names
- `try_append_object`, will perform full validation on the object before
appending

- `append_list`, will panic if shallow validation fails
- `try_append_list`, will perform full validation on the list before
appending

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#7893

# What changes are included in this PR?

In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (it does not cast anything). See the sketch below.
- Add a new module `parquet_variant::path`: adds structs/enums to define a path for accessing a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a
`VariantArray`. In the future, this would also cast the values to a
specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.
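
To illustrate the kind of path traversal `Variant::get_path` performs, here is a self-contained sketch over a toy value type (hypothetical types and names, not the parquet-variant API):

```rust
// A path is a sequence of object-field and list-index accesses.
enum PathElement<'a> {
    Field(&'a str),
    Index(usize),
}

enum Value {
    Int(i64),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

// Walk the path, returning a reference into the original value (no casting).
fn get_path<'v>(mut value: &'v Value, path: &[PathElement<'_>]) -> Option<&'v Value> {
    for element in path {
        value = match element {
            PathElement::Field(name) => match value {
                Value::Object(fields) => &fields.iter().find(|(k, _)| k.as_str() == *name)?.1,
                _ => return None,
            },
            PathElement::Index(i) => match value {
                Value::List(items) => items.get(*i)?,
                _ => return None,
            },
        };
    }
    Some(value)
}
```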

Current limitations:
- It can only return another VariantArray. Casts are not implemented
yet.
- Only top-level object/list access is supported. It panics on finding a
nested object/list. Needs apache#7914
to fix this.
- Perf is a TODO.

# Are these changes tested?

Some basic unit tests are added.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Andrew Lamb <[email protected]>
…he#7774)

# Which issue does this PR close?


- Part of apache#7762


# Rationale for this change

As part of apache#7762 I want to
optimize applying filters by adding a new code path.

To ensure that works well, let's ensure the filtered code path is well
covered with tests


# What changes are included in this PR?

1. Add tests for filtering batches with 0.01%, 1%, 10%, and 90% selectivity and varying data types


# Are these changes tested?
Only tests, no functional changes


# Are there any user-facing changes?
…ath (apache#7942)

# Which issue does this PR close?


- Follow on to apache#7919

# Rationale for this change

While reviewing apache#7919 from
@Samyak2 I found I wanted to write some additional tests directly for
`Variant::get_path`

When I started doing that I found it was somewhat awkward to write
examples, so I added some new conversion routines to make it easier.

# What changes are included in this PR?

1. Add doc examples (and thus tests) of `VariantGet` and `VariantPath`
2. Add more documentation

# Are these changes tested?
Yes, by doc examples which run in CI
# Are there any user-facing changes?

# Which issue does this PR close?


- Closes apache#7901 .

# Rationale for this change


# What changes are included in this PR?


# Are these changes tested?


# Are there any user-facing changes?


---------

Signed-off-by: codephage2020 <[email protected]>
…pache#7931)

# Which issue does this PR close?

- Closes #NNN.

This is minor but I can create an issue if needed.

# Rationale for this change

`make_builder` currently errors with `Data type Utf8View is not
currently supported`.

# What changes are included in this PR?

Support `DataType::Utf8View` and `DataType::BinaryView` in
`make_builder`.

# Are these changes tested?

Only via the exhaustive enum match. It doesn't look like there are any
tests for `make_builder` in that file?

# Are there any user-facing changes?

No
# Rationale for this change

- Closes apache#7948

This PR introduces a custom implementation of `PartialEq` for variant
objects.

According to the spec, field values are not required to be in the same
order as the field IDs, to enable flexibility when constructing Variant
values.

Instead of comparing the raw bytes of 2 variant objects, this
implementation recursively checks whether the field values are equal --
regardless of their order
# Which issue does this PR close?

Optimize `partition_validity` function used in sort kernels
    
- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of
`Array::is_valid`
- Avoid capacity checks inside the loop by writing to `spare_capacity_mut`
instead of using `push` (see the sketch below)
- Closes apache#7936.
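
A minimal sketch of the "preallocate and write into spare capacity" pattern described above, over a plain `bool` slice rather than a `NullBuffer` (illustrative, not the arrow-ord code):

```rust
use std::mem::MaybeUninit;

/// Split row indices into (valid, null) partitions. Vectors are preallocated
/// from the known null count and filled through `spare_capacity_mut`, so the
/// hot loop performs no capacity checks.
fn partition_validity_sketch(validity: &[bool]) -> (Vec<u32>, Vec<u32>) {
    let valid_count = validity.iter().filter(|v| **v).count();
    let mut valid: Vec<u32> = Vec::with_capacity(valid_count);
    let mut nulls: Vec<u32> = Vec::with_capacity(validity.len() - valid_count);

    let (mut v, mut n) = (0usize, 0usize);
    {
        let valid_spare: &mut [MaybeUninit<u32>] = valid.spare_capacity_mut();
        let null_spare: &mut [MaybeUninit<u32>] = nulls.spare_capacity_mut();
        for (i, is_valid) in validity.iter().enumerate() {
            if *is_valid {
                valid_spare[v].write(i as u32);
                v += 1;
            } else {
                null_spare[n].write(i as u32);
                n += 1;
            }
        }
    }
    // SAFETY: exactly `v` valid and `n` null entries were initialized above.
    unsafe {
        valid.set_len(v);
        nulls.set_len(n);
    }
    (valid, nulls)
}
```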

# Rationale for this change

Microbenchmark results for `sort_kernels` compared to `main`, only
looking at benchmarks matching "nulls to indices":

```
sort i32 nulls to indices 2^10
                        time:   [4.9325 µs 4.9370 µs 4.9422 µs]
                        change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
                        Performance has improved.

sort i32 nulls to indices 2^12
                        time:   [20.096 µs 20.209 µs 20.327 µs]
                        change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
                        Performance has improved.

sort f32 nulls to indices 2^12
                        time:   [26.329 µs 26.366 µs 26.406 µs]
                        change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-10] nulls to indices 2^12
                        time:   [70.667 µs 70.762 µs 70.886 µs]
                        change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-100] nulls to indices 2^12
                        time:   [101.98 µs 102.44 µs 102.99 µs]
                        change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
                        No change in performance detected.

sort string[0-400] nulls to indices 2^12
                        time:   [84.952 µs 85.024 µs 85.102 µs]
                        change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] nulls to indices 2^12
                        time:   [72.486 µs 72.664 µs 72.893 µs]
                        change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[100] nulls to indices 2^12
                        time:   [71.354 µs 71.606 µs 71.902 µs]
                        change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[1000] nulls to indices 2^12
                        time:   [73.088 µs 73.195 µs 73.311 µs]
                        change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[10] nulls to indices 2^12
                        time:   [32.592 µs 32.654 µs 32.731 µs]
                        change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[0-400] nulls to indices 2^12
                        time:   [32.981 µs 33.074 µs 33.189 µs]
                        change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view_inlined[0-12] nulls to indices 2^12
                        time:   [28.467 µs 28.496 µs 28.529 µs]
                        change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] dict nulls to indices 2^12
                        time:   [94.463 µs 94.503 µs 94.542 µs]
                        change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
                        Performance has improved.
```

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No, the method is internal to the sort kernels.
# Which issue does this PR close?

- Closes apache#7949

# Rationale for this change

I would like it to be easier / more ergonomic to make objects

# What changes are included in this PR?

1. Add `ObjectBuilder::with_field`
2. Add documentation w/ examples
3. Rewrite some tests 

# Are these changes tested?

Yes, by doc tests

# Are there any user-facing changes?

Yes a new API
…ntArray (apache#7945)

# Which issue does this PR close?


- Closes apache#7920.

# Are these changes tested?


Tests were already implemented

# Are there any user-facing changes?

None
…che#7956)

# Which issue does this PR close?

- Follow-up to apache#7901

# Rationale for this change

- apache#7934 introduced a minor regression, (accidentally?) forbidding the empty string as a dictionary key. Fix the bug and simplify the code a bit further while we're at it.

# What changes are included in this PR?

Revert the unsorted dictionary check back to what it had been (it just
uses `Iterator::is_sorted_by` now, instead of
`primitive.slice::is_sorted_by`).

Remove the redundant offset monotonicity check from the ordered dictionary path, relying on the fact that string slice extraction will fail anyway if the offsets are not monotonic. Improve the error message now that it does double duty.
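
For reference, a check of this shape with `Iterator::is_sorted_by` might look like the following (illustrative only, not the parquet-variant code):

```rust
// Strictly increasing field names imply both sortedness and uniqueness,
// and the empty string is a perfectly valid first key.
fn is_sorted_and_unique(names: &[&str]) -> bool {
    names.iter().is_sorted_by(|a, b| a < b)
}
```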

# Are these changes tested?

New unit tests for dictionaries containing the empty string. As a side
effect, we now have at least a little coverage for sorted dictionaries
-- somehow, I couldn't find any existing unit test that creates a sorted
dictionary??

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?


- Part of apache#7911
- Part of  apache#6736
- Follow on to apache#7905


# Rationale for this change

I wrote benchmarks for some changes to the JSON decoder in apache#7911, but they are non-trivial. To keep apache#7911 easier to review, I have pulled the benchmarks out into their own PR.

# What changes are included in this PR?

1. Add a new JSON benchmark
2. Include the `variant_get` benchmark added in
apache#7919 by @Samyak2
# Are these changes tested?

I tested them manually and clippy CI coverage ensures they compile

# Are there any user-facing changes?

No these are only benchmarks
# Which issue does this PR close?


- Closes apache#7951 .

# Rationale for this change


# What changes are included in this PR?


# Are these changes tested?

Yes
# Are there any user-facing changes?

New API

Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?
- Follow on to apache#7943
- Part of apache#7948

# Rationale for this change

I found a few more tests I would like to have seen while reviewing
apache#7943

# What changes are included in this PR?

Add some list equality tests

# Are these changes tested?

It is only tests, no functionality changes

# Are there any user-facing changes?

No
# Which issue does this PR close?


- Closes apache#7947 .

# Rationale for this change


# What changes are included in this PR?


# Are these changes tested?


# Are there any user-facing changes?


Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?


-  Related to apache#6736 


# Rationale for this change

I noticed in apache#7956 that some
Clippy errors were introduced but not caught by CI.

# What changes are included in this PR?

Add `parquet-variant-compute` to the CI for parquet-variant related PRs

# Are these changes tested?

It is only tests


# Are there any user-facing changes?
No
# Which issue does this PR close?

Does not yet close, but contributes towards:

- apache#6356 
- apache#5981 
- apache#1206

# Rationale for this change

See the above issues. And this is a follow up to

* apache#6711
* apache#6873

This was also split out from:
apache#7929

# What changes are included in this PR?

This removes the API to allow preserving `dict_id` set in the `Schema`'s
`Field` within arrow-ipc and arrow-flight. This is in an effort to
remove the `dict_id` field entirely and make it an IPC/flight-only
concern.

# Are these changes tested?

Yes, all existing tests continue to pass.

# Are there any user-facing changes?

Yes, these previously (in 54.0.0) deprecated functions/fields are
removed:

* `arrow_ipc::DictionaryTracker.set_dict_id`
* `arrow_ipc::DictionaryTracker::new_with_preserve_dict_id`
* `arrow_ipc::IpcWriteOptions.with_preserve_dict_id`
* `arrow_ipc::IpcWriteOptions.preserve_dict_id` (function and field)
* `arrow_ipc::schema_to_fb`
* `arrow_ipc::schema_to_bytes`
# Which issue does this PR close?

- Part of apache#4886
- Related to apache#6965

# Rationale for this change

This change introduces support for Avro files generated by systems like Impala, which have a specific convention for representing nullable fields. In Avro, nullability is typically represented by a union of a type and `null`. This PR updates the Avro reader to correctly interpret these schemas, ensuring proper handling of nullable data and improving interoperability with Impala-generated data.

# What changes are included in this PR?

This pull request introduces several changes to support Impala-style
nullability in the Avro reader:
- The Avro schema parser has been updated to recognize unions where `null` is the second type (e.g., `['type', 'null']`) as a nullable field.
- Logic has been added to handle this nullability convention during Avro
decoding.
- New tests are included to verify that Avro files using this
nullability format are read correctly while ensuring that strict mode
properly identifies them.

# Are these changes tested?

Yes, I added new test cases covering these changes to the tests named:
`test_nonnullable_impala`, `test_nonnullable_impala_strict`,
`test_nullable_impala` and `test_nullable_impala_strict`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Connor Sanders <[email protected]>
# Which issue does this PR close?

Part of apache#4886

Completes the breaking down/porting of the changes in apache#6965, which will be closed upon merge of this PR.

# Rationale for this change

This change brings over the remaining integration tests present in the
original PR, which validate the reader logic against the files from
`testing/data/avro`. PRs containing this logic have already been merged
(but are not yet released) which these tests now validate.

# What changes are included in this PR?

The following files are now read in:

- alltypes_dictionary.avro
- alltypes_nulls_plain.avro
- binary.avro
- dict-page-offset-zero.avro
- avro/list_columns.avro
- nested_lists.snappy.avro
- single_nan.avro
- datapage_v2.snappy.avro
- nested_records.avro
- repeated_no_annotation.avro

# Are these changes tested?

This PR consists of integration tests validating code merged recently
into this crate. No changes in functionality are included.

# Are there any user-facing changes?

N/A
…pache#7968)

# Which issue does this PR close?

Closes apache#6356

# Rationale for this change

Now that apache#7940 is merged, nothing useful can be done with the `dict_id` field, so it is now safe to remove it from this requirement.

This was also split out from:
apache#7467

# What changes are included in this PR?

No longer require the `dict_id` fields of two `Field`s of schemas being merged to be equal. At this point the `dict_id` is only an IPC concern; the fact that it is still in the struct definition is legacy and marked for removal, and we're just going through the proper procedure of deprecating and replacing the APIs that use it.

# Are these changes tested?

Tests passing.

# Are there any user-facing changes?

No API changes, just a behavior change that was expected and desired given the deprecations around the `dict_id` field.

@alamb @adriangb @tustvold
…e#7966)

# Which issue does this PR close?

- Part of apache#4886

- Follow up to apache#7834

# Rationale for this change

The initial Avro reader implementation contained an under-developed and
temporary safeguard to prevent infinite loops when processing records
that consumed zero bytes from the input buffer.

When the `Decoder` reported that zero bytes were consumed, the `Reader`
would advance its cursor to the end of the current data block. While
this successfully prevented an infinite loop, it had the critical side
effect of silently discarding any remaining data in that block, leading
to potential data loss.

This change enhances the decoding logic to handle these zero-byte values
correctly, ensuring that the `Reader` makes proper progress without
dropping data and without risking an infinite loop.

# What changes are included in this PR?

- **Refined Decoder Logic**: The `Decoder` has been updated to
accurately track and report the number of bytes consumed for all values,
including valid zero-length records like `null` or empty `bytes`. This
ensures the decoder always makes forward progress.
- **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that
previously advanced to the end of a block on a zero-byte read has been
removed. The reader now relies on the decoder to report accurate
consumption and advances its cursor incrementally and safely.
- **New integration test** using a temporary `zero_byte.avro` file created via this Python script: https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395

# Are these changes tested?

Yes, a new `test_read_zero_byte_avro_file` test was added that reads the
new `zero_byte.avro` file and confirms the update.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. PR to update `test_read_zero_byte_avro_file` once
apache/arrow-testing#109 is merged in.
# Which issue does this PR close?

- Closes apache#7899 .

This PR avoids the extra allocation in the object builder and the subsequent buffer copy.

# Rationale for this change

Avoid the extra allocation in the object builder, as described in the issue.

# What changes are included in this PR?

This removes the internal `buffer` in `ObjectBuilder`. All data
insertion is done directly to the parent buffer wrapped in
`parent_state`.

The corresponding new fields are added to `ObjectBuilder`.

- add `object_start_offset` in `ObjectBuilder`, which describes the
start offset in the parent buffer for the current object
- Add `has_been_finished` in `ObjectBuilder`, which describes whether
the current object has been finished; it will be used in the `Drop`
function.

This patch modifies the logic of the `new`, `finish`, `parent_state`, and `drop` functions accordingly.

In particular, data is written into the parent buffer directly when a field is added to the object (i.e., when `insert`/`try_insert` is called). When finalizing the object (`finish` is called), the header and field ids must be placed in front of the data in the buffer, so the builder shifts the already-written data bytes to make room for them and then writes the header and field ids (see the sketch below).
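
A minimal sketch of this "shift on finish" step over a plain `Vec<u8>` (illustrative only; the real `ObjectBuilder` computes the header and field ids itself):

```rust
// Shift the value bytes written since `object_start_offset` to the right and
// copy the header (and field ids) into the gap that opens up in front of them.
fn finish_object(buffer: &mut Vec<u8>, object_start_offset: usize, header: &[u8]) {
    let value_len = buffer.len() - object_start_offset;
    buffer.resize(buffer.len() + header.len(), 0);
    buffer.copy_within(
        object_start_offset..object_start_offset + value_len,
        object_start_offset + header.len(),
    );
    buffer[object_start_offset..object_start_offset + header.len()].copy_from_slice(header);
}
```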

In `drop`, if the builder is not finalized before being dropped, it will
truncate the written bytes to roll back the parent buffer status.

# Are these changes tested?

The logic is covered by existing tests.

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#7686

# Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect. 

# What changes are included in this PR?

1. Fix the int96 stats
2. Add round-trip test to verify the behavior

# Not included in this PR:
1. Read stats only from known good writers. This will be implemented
after a new arrow-rs release.

# Are there any user-facing changes?
The int96 min/max statistics will be different and correct.

---------

Co-authored-by: Rahul Sharma <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Alkis Evlogimenos <[email protected]>
# Rationale for this change

If a variant has an unsorted dictionary, you can't assume fields are unique or ordered by name. This PR updates the logical equality check for `VariantMetadata` to properly handle this case.

- Closes apache#7952 

It also fixes a bug in apache#7934
where we do a uniqueness check when probing an unsorted dictionary

---------

Co-authored-by: Andrew Lamb <[email protected]>
…and writers (apache#7841)

# Which issue does this PR close?

- Finishes remaining work and closes apache#6661.

# What changes are included in this PR?

This change adds `decimal32` and `decimal64` support to Parquet, JSON
and CSV readers and writers. It does not change the current default
behavior of the Parquet reader which (in the absence of a specification
that says otherwise) will still translate the INT32 physical type with a
logical DECIMAL type into a `decimal128` instead of a `decimal32`.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

The `decimal32` and `decimal64` types are now supported in Parquet, JSON
and CSV readers and writers.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>
…#7911)

# Which issue does this PR close?

- part of apache#6736
- Closes apache#7964
- Follow on to apache#7905

# Rationale for this change

In a quest to have the fastest and most efficient Variant implementation, I would like to avoid copies if at all possible. Right now, making a `VariantArray` first requires completing an individual buffer and then appending it to the array.

Let's make that faster by having the VariantBuilder append directly into
the buffer


# What changes are included in this PR?

1. Add `VariantBuilder::new_from_existing`
2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers

# Are these changes tested?

1. New unit tests 
1. Yes by existing tests

# Are there any user-facing changes?

Hopefully faster performance

---------

Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
…overflows (apache#7887)

# Which issue does this PR close?

Closes apache#7886.

# Rationale for this change

Casting large `Decimal256` values to `Float64` can exceed the
representable range of floating point numbers. Previously, this could
result in a panic due to unwrapping a failed conversion.

This PR introduces a safe conversion that saturates overflowing values
to `INFINITY` or `-INFINITY`, following standard floating point
semantics. This ensures stable, predictable behavior without runtime
crashes.

# What changes are included in this PR?

- Introduced a helper function `decimal256_to_f64` that converts `i256`
to `f64`, returning `INFINITY` or `-INFINITY` when the value is out of
range.
- Updated the casting logic for `Decimal256` → `Float64` to use the new
safe conversion.
- Improved inline and module-level documentation to reflect that this
conversion is lossy and saturating.
- Added a unit test `test_cast_decimal256_to_f64_overflow` to validate
overflow behavior.

# Are there any user-facing changes?

Yes.

- **Behavior Change:** When casting `Decimal256` values that exceed the
`f64` range, users now receive `INFINITY` or `-INFINITY` instead of a
panic.
- **Improved Docs:** Updated documentation clarifies the lossy and
saturating behavior of decimal-to-float casting.
- **Not a Breaking Change:** There are no API changes, but users relying
on panics for overflow detection may observe different behavior.
…r coalesce (apache#7967)

# Which issue does this PR close?

This is a very interesting idea that we only calculate the data buffer
size when we choose to gc, because we almost only care about the gc for
data buffers, not for other field views/nulls.

GC is only for databuffers, so the *2 calculation should also compare
the databuffer size?
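
A tiny sketch of the heuristic being discussed (hypothetical names; the actual logic lives in the coalesce/gc path of arrow-select):

```rust
// Trigger gc only when the data buffers hold more than twice the bytes the
// views actually reference; view and null buffers are ignored on purpose.
fn should_gc(data_buffer_capacity: usize, referenced_bytes: usize) -> bool {
    data_buffer_capacity > referenced_bytes * 2
}
```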


# Rationale for this change

Optimize `actual_buffer_size` to use only the data buffer capacity.

# What changes are included in this PR?

Optimize `actual_buffer_size` to use only the data buffer capacity.

# Are these changes tested?

The performance improvement for some high-selectivity benchmarks with a low null ratio is very good, about 2x faster:

```text
cargo bench --bench coalesce_kernels "single_utf8view"
   Compiling arrow-select v55.2.0 (/Users/zhuqi/arrow-rs/arrow-select)
   Compiling arrow-cast v55.2.0 (/Users/zhuqi/arrow-rs/arrow-cast)
   Compiling arrow-string v55.2.0 (/Users/zhuqi/arrow-rs/arrow-string)
   Compiling arrow-ord v55.2.0 (/Users/zhuqi/arrow-rs/arrow-ord)
   Compiling arrow-csv v55.2.0 (/Users/zhuqi/arrow-rs/arrow-csv)
   Compiling arrow-json v55.2.0 (/Users/zhuqi/arrow-rs/arrow-json)
   Compiling arrow v55.2.0 (/Users/zhuqi/arrow-rs/arrow)
    Finished `bench` profile [optimized] target(s) in 13.26s
     Running benches/coalesce_kernels.rs (target/release/deps/coalesce_kernels-bb9750abedb10ad6)
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001
                        time:   [30.946 ms 31.071 ms 31.193 ms]
                        change: [−1.7086% −1.1581% −0.6036%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild

filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01
                        time:   [3.8178 ms 3.8311 ms 3.8444 ms]
                        change: [−4.0521% −3.5467% −3.0345%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Benchmarking filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40.
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1
                        time:   [1.9337 ms 1.9406 ms 1.9478 ms]
                        change: [+0.3699% +0.9557% +1.5666%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high severe

filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8
                        time:   [797.60 µs 805.31 µs 813.85 µs]
                        change: [−59.177% −58.412% −57.639%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001
                        time:   [43.742 ms 43.924 ms 44.108 ms]
                        change: [−1.2146% −0.5778% +0.0828%] (p = 0.08 > 0.05)
                        No change in performance detected.

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01
                        time:   [5.5736 ms 5.5987 ms 5.6247 ms]
                        change: [−0.2381% +0.4740% +1.1711%] (p = 0.18 > 0.05)
                        No change in performance detected.

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [2.2963 ms 2.3035 ms 2.3109 ms]
                        change: [−0.9314% −0.5125% −0.0931%] (p = 0.02 < 0.05)
                        Change within noise threshold.

Benchmarking filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [1.5482 ms 1.5697 ms 1.5903 ms]
                        change: [−45.794% −44.386% −43.000%] (p = 0.00 < 0.05)
                        Performance has improved.

```


# Are there any user-facing changes?


---------

Co-authored-by: Andrew Lamb <[email protected]>
viirya and others added 28 commits July 29, 2025 10:59
…th length zero (apache#8009)

# Which issue does this PR close?

- Closes apache#7549.

# Rationale for this change

The failure is described in the issue. In short, the buffer pointer of an empty buffer array exported from polars is a dangling pointer that is not aligned. Currently we take the raw pointer from the C Data Interface and check its alignment before interpreting it as a `ScalarBuffer`, which causes the failure.

# What changes are included in this PR?

This patch changes the FFI module to create an empty buffer for exported buffers with length zero. Since we never dereference the dangling pointer, it doesn't seem necessary to require alignment for it.

For non empty buffers, we still keep the alignment check.

# Are these changes tested?

Added a unit test with necessary utility functions.

# Are there any user-facing changes?

No

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
…map scan (up to 30% faster) (apache#7962)

# Which issue does this PR close?

This PR is follow-up for:

apache#7937 

I want to experiment with the performance of using word-level (u64) bit scanning:

Details:


apache#7937 (review)

# Rationale for this change

Use word-level (u64) bit scanning.

Use `set_indices` to implement this, but since we need u32 indices, I also added `set_indices_u32`; it shows a ~7% improvement compared to calling `set_indices` and then casting to u32.

# What changes are included in this PR?

Use word-level (u64) bit scanning via `set_indices`. Since we need u32 indices, this also adds `set_indices_u32`, which shows a ~7% improvement compared to calling `set_indices` and then casting to u32 (see the sketch below).
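
An illustrative word-level bit scan of the kind described above (this shows the general technique, not the arrow-select code):

```rust
// Yield the index of every set bit by scanning whole u64 words: skip zero
// words entirely and use trailing_zeros to jump from one set bit to the next.
fn set_bit_indices(words: &[u64]) -> Vec<u32> {
    let mut out = Vec::new();
    for (word_idx, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros();
            out.push(word_idx as u32 * 64 + bit);
            w &= w - 1; // clear the lowest set bit
        }
    }
    out
}
```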

# Are these changes tested?

Yes, add unit test also fuzz testing, also existed testing coverage sort
fuzz.


# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
…apache#8005)

# Which issue does this PR close?

- closes apache#8004

# Rationale for this change

I'm opening this PR to explore the idea that, for a given column, the `ArrowWriter` should consider native arrays and dictionary arrays with the same value type to be compatible.

e.g. if I write two record batches, and in the first batch column `a` is
type `Dictionary<_, Utf8>`, I should be able to append a second batch to
my writer where column `a` is type `Utf8` (the native array).

# What changes are included in this PR?

This PR relaxes the check added in apache#5341 when creating `LevelsInfoBuilder`, so that the types are considered compatible using the logic explained above.

# Are these changes tested?

There is a basic unit test. I'm happy to add more if this change is
something acceptable.

# Are there any user-facing changes?
No
…g strings (apache#8015)

### Description
Add benchmark case for performance comparison in
apache#7917 .
…cimal256 → Float64 casts (apache#7986)

# Which issue does this PR close?

Closes apache#7985

---

# Rationale for this change

The existing Decimal256 → Float64 conversion was changed to **saturate**
out-of-range values to `±INFINITY` (PR apache#7887) in order to avoid panics.
However, every 256-bit signed integer actually fits within the exponent
range of an IEEE-754 `f64` (±2¹⁰²³), so we can always produce a
**finite** `f64`, only sacrificing mantissa precision.
By overriding `i256::to_f64` to split the full 256-bit magnitude into
high/low 128-bit halves, recombine as
```text
(high as f64) * 2^128 + (low as f64)
```
and reapply the sign (special-casing i256::MIN), we:

- Eliminate both panics and infinite results

- Match Rust’s built-in (i128) as f64 rounding (ties-to-even)

- Simplify casting logic: no saturating helpers or extra flags required (see the sketch below)
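
A minimal sketch of the recombination described above, using plain 128-bit halves (illustrative; the real implementation is `i256::to_f64`):

```rust
// Recombine a 256-bit signed value, split as `high: i128` / `low: u128`
// (value = high * 2^128 + low), into an f64. Mantissa precision is lost, but
// the exponent range of f64 always covers 256-bit integers, so the result is
// always finite.
fn i256_halves_to_f64(high: i128, low: u128) -> f64 {
    let two_pow_128 = 2f64.powi(128);
    (high as f64) * two_pow_128 + (low as f64)
}
```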

# What changes are included in this PR?

- Added a full-range `fn to_f64(&self) -> Option<f64>` for `i256`, using `checked_abs()` + `to_parts()` + recombination

- Removed the fallback through 64-bit `to_i64()`/`to_u64()` and `.unwrap()`

- Replaced the old `decimal256_to_f64` saturating helper with a thin wrapper around the new `i256::to_f64()` (always returns `Some`)

- Updated Decimal256 → Float64 cast sites to call the new helper

## Tests

- Reworked “overflow” tests to assert finite & correctly signed results
for i256::MAX and i256::MIN

- Added typical-value tests; removed expectations of ∞/-∞

# Are there any user-facing changes?
Behavior change:

- Very large or small Decimal256 values no longer become +∞/-∞.

- They now map to very large—but finite—f64 values (rounded to nearest
mantissa).

## API impact:

No public API signatures changed.

Conversion remains lossy by design; users relying on
saturation-to-infinity will observe different (more faithful) behavior.

---------

Co-authored-by: Ryan Johnson <[email protected]>
…alidation disabled (apache#7917)

# Which issue does this PR close?

- Related to apache#6057 .

# Rationale for this change
As described in above issue, when constructing a `StringViewArray` from
rows, we currently store inline strings twice: once through `make_view`,
and again in the `values buffer` so that we can validate utf8 in one go.
However, this is suboptimal in terms of memory consumption, so ideally,
we should avoid placing inline strings into the values buffer when UTF-8
validation is disabled.

# What changes are included in this PR?
When UTF-8 validation is disabled, this PR modifies the string/bytes
view array construction from rows as follows:

1. The capacity of the values buffer is set to accommodate only long
strings plus 12 bytes for a single inline string placeholder.

2. All decoded strings are initially appended to the values buffer.

3. If a string turns out to be an inline string, it is included via `make_view`, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array (see the sketch below).
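
A hedged sketch of that "append, then truncate if inline" step, building the standard 16-byte view layout directly (illustrative only; the real code goes through `make_view` and the row decoder):

```rust
/// Append every decoded string to `values`, then roll the copy back if the
/// string turns out to be inline (<= 12 bytes), so inline strings live only
/// in their 16-byte view.
fn push_string(values: &mut Vec<u8>, views: &mut Vec<u128>, s: &[u8], buffer_index: u32) {
    let start = values.len();
    values.extend_from_slice(s);

    let len = s.len() as u32;
    let view = if s.len() <= 12 {
        values.truncate(start); // inline: undo the copy in the values buffer
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&len.to_le_bytes());
        bytes[4..4 + s.len()].copy_from_slice(s);
        u128::from_le_bytes(bytes)
    } else {
        // long string: 4-byte length, 4-byte prefix, buffer index, offset
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&len.to_le_bytes());
        bytes[4..8].copy_from_slice(&s[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&(start as u32).to_le_bytes());
        u128::from_le_bytes(bytes)
    };
    views.push(view);
}
```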

# Are these changes tested?
1. Copied & modified the existing `fuzz_test` to disable utf8 validation.
2. Ran the bench & added a bench case where the array consists of both inline and long strings.
 
# Are there any user-facing changes?
No.  

# Considered alternatives
One idea was to support separate buffers for inline strings even when
UTF-8 validation is enabled. However, since we need to call
`decoded_len()` first to determine the target buffer, this approach can
be tricky or inefficient:

- For example, precomputing a boolean flag per string to determine which
buffer to use would increase temporary memory usage.

- Alternatively, appending to the values buffer first and then moving
inline strings to a separate buffer would lead to frequent memcpy
overhead.

Given that datafusion disables UTF-8 validation when using RowConverter,
this PR focuses on improving memory efficiency specifically when
validation is turned off.

---------

Co-authored-by: Andrew Lamb <[email protected]>
…he#8014)

# Which issue does this PR close?

- part of apache#7395

# Rationale for this change

Keep the code flowing

# What changes are included in this PR?

1. Update CHANGELOG. See rendered version here:
https://github.com/alamb/arrow-rs/blob/alamb/prepare_56.0.0/CHANGELOG.md
2. Update version to `56.0.0`

# Are these changes tested?

N/A
# Are there any user-facing changes?

Yes
# Which issue does this PR close?

- part of apache#7395
- Closes apache#8018

# Rationale for this change

Fix a bug I found while testing the RC

# What changes are included in this PR?

Check the ARROW_TESTING directory first before looking at the local path

# Are these changes tested?

Yes, by CI and I tested it manually

# Are there any user-facing changes?

No
# Which issue does this PR close?


- related to  apache#7850

# Rationale for this change

While reviewing apache#7850 from
@XiangpengHao I found myself wanting even more comments (or maybe I was
doing this as an exercise to load the state back into my head)

In any case, I wrote up some comments that I think would make the code
easier to understand

# What changes are included in this PR?

Add some more docs

# Are these changes tested?

By CI

# Are there any user-facing changes?
No -- this is documentation to internal interfaces

There is no code or functional change
# Which issue does this PR close?

- Closes apache#8023

# Rationale for this change

Add a sync API for reading parquet bloom filters.

# What changes are included in this PR?

Add a sync `get_row_group_column_bloom_filter`

# Are these changes tested?

By unit tests

# Are there any user-facing changes?

New API added
Fix a typo in README.

# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Fix typo

# What changes are included in this PR?

Fix typo

# Are these changes tested?

Nope.

# Are there any user-facing changes?

Nope.

Signed-off-by: Cheng-Yang Chou <[email protected]>
# Which issue does this PR close?

- Part of apache#5854.

# Rationale for this change

Before embarking on radical changes to the thrift processing in the
parquet crate, add a few more benchmarks to help evaluate the
performance gains.

# What changes are included in this PR?

Adds a test originally written by @tustvold
(https://github.com/tustvold/arrow-rs/tree/thrift-bench) and exposes a
new public function in the thrift module rather than making the thrift
parser public.

# Are these changes tested?

Benchmark code only, so no tests necessary.

# Are there any user-facing changes?

No, but adds a public function.
# Which issue does this PR close?

- Part of apache#4886

- Pre-work for apache#8006

# Rationale for this change

Apache Avro’s [single object
encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding)
prefixes every record with the marker `0xC3 0x01` followed by a `Rabin`
[schema fingerprint
](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints)
so that readers can identify the correct writer schema without carrying
the full definition in each message.
While the current `arrow‑avro` implementation can read container files,
it cannot ingest these framed messages or handle streams where the
writer schema changes over time.

The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin)
hashed fingerprint of the [parsed canonical form of a
schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas)
to look up the `Schema` from a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero‑copy schema identification** for decoding streaming Avro
messages published in single‑object format (i.e. Kafka, Pulsar, etc)
into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve writer/reader schema differences on the fly.

**NOTE:**  Integration with `Decoder` and `Reader` coming in next PR.
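
For reference, the single-object framing described above can be split off a message roughly like this (illustrative sketch; the real lookup goes through `SchemaStore`):

```rust
// Split an Avro single-object encoded message into its schema fingerprint and
// record body: 2-byte marker 0xC3 0x01, then an 8-byte little-endian Rabin
// fingerprint, then the Avro-encoded record.
fn split_single_object(msg: &[u8]) -> Option<(u64, &[u8])> {
    if msg.len() < 10 || msg[0] != 0xC3 || msg[1] != 0x01 {
        return None;
    }
    let fingerprint = u64::from_le_bytes(msg[2..10].try_into().ok()?);
    Some((fingerprint, &msg[10..]))
}
```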

# What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical-form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| **`lib.rs`** | `mod schema` is now `pub` |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, unknown-fingerprint errors, and interaction with UTF8-view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |


# Are these changes tested?

Yes.  New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup.

# Are there any user-facing changes?

N/A
…ray` (apache#8044)

# Which issue does this PR close?


- Closes apache#8043


# Rationale for this change


As @Samyak2 suggested on
https://github.com/apache/arrow-rs/pull/8021/files#r2249926579, having
the ability to convert *FROM* a typed value to a VariantArray will be
important

For example, in SQL it could be used to cast columns to variant like in
`some_column::variant`


# What changes are included in this PR?

1. Add `cast_to_variant` kernel to cast native types to `VariantArray`
2. Tests


# Are these changes tested?
yes


# Are there any user-facing changes?

New kernel
…8068)

# Which issue does this PR close?

- Follow on to  apache#8044

# Rationale for this change

@scovich had a good suggestion here
https://github.com/apache/arrow-rs/pull/8044/files#r2257256848:
> value.into() makes clear that the conversion is infallible?


# What changes are included in this PR?

Use `From` impl to make it clear the conversion is infallible and can
not lose precision

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes apache#8045

# Rationale for this change

Simple bug fix where the order of `name` and `namespace` is wrong in
Resolver field registration method.

# What changes are included in this PR?

One-line bug fix and corresponding tests.

# Are these changes tested?

- Added a test in the same file to test the fix. Made sure it doesn't
pass with the original code.

# Are there any user-facing changes?

No
# Which issue does this PR close?


- Closes #NNN.

# Rationale for this change

I was slightly annoyed that the int96 stats roundtrip test was not part
of the other arrow-reader tests

You have to run it separately:
```shell
cargo test --test int96_stats_roundtrip
```

Let's just move it to the arrow-reader tests, so it runs with the other tests:
1. It is easier to find
2. It takes slightly less time to build and test the parquet crate



```shell
cargo test --test arrow_reader
```


# What changes are included in this PR?
1. Move the tests

# Are these changes tested?

Yes, by CI

# Are there any user-facing changes?
No
# Which issue does this PR close?

Improve StringArray(Utf8) sort performance

- Closes [apache#7847](apache#7847)

# Rationale for this change
Support prefix comparison; I optimized it to a u32 prefix with u64 incremental comparison, which had the best performance in my experiments.


# What changes are included in this PR?

Support prefix comparison, optimized to a u32 prefix with u64 incremental comparison, which had the best performance in my experiments (see the sketch below).
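
A simplified sketch of the prefix-compare idea (only the u32 prefix part; the u64 incremental comparison in the actual change is more involved):

```rust
use std::cmp::Ordering;

// Compare the first 4 bytes as a big-endian u32 so the integer order matches
// byte order; fall back to a full byte comparison only when the prefixes tie.
fn compare_with_prefix(a: &[u8], b: &[u8]) -> Ordering {
    match prefix_u32(a).cmp(&prefix_u32(b)) {
        Ordering::Equal => a.cmp(b),
        other => other,
    }
}

fn prefix_u32(s: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    let n = s.len().min(4);
    buf[..n].copy_from_slice(&s[..n]);
    u32::from_be_bytes(buf)
}
```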

# Are these changes tested?
Yes


```text
critcmp  issue_7847  main --filter "sort string\["
group                                         issue_7847                             main
-----                                         ----------                             ----
sort string[0-400] nulls to indices 2^12      1.00     51.4±0.56µs        ? ?/sec    1.19     61.0±1.02µs        ? ?/sec
sort string[0-400] to indices 2^12            1.00     96.5±1.63µs        ? ?/sec    1.23    118.3±0.91µs        ? ?/sec
sort string[10] dict nulls to indices 2^12    1.00     72.4±1.00µs        ? ?/sec    1.00     72.5±0.61µs        ? ?/sec
sort string[10] dict to indices 2^12          1.00    137.1±1.51µs        ? ?/sec    1.01    138.1±1.06µs        ? ?/sec
sort string[10] nulls to indices 2^12         1.00     47.5±0.69µs        ? ?/sec    1.18     56.3±0.56µs        ? ?/sec
sort string[10] to indices 2^12               1.00     86.4±1.37µs        ? ?/sec    1.20    103.5±1.13µs        ? ?/sec
```

# Are there any user-facing changes?


---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#8057

# Rationale for this change
Adds Float16 conversion to the `cast_to_variant` kernel

# What changes are included in this PR?

- a macro to make converting array types that require a cast simpler
- conversion of `DataType::Float16` => `Variant::Float`

# Are these changes tested?

Yes, additional unit tests have been added.

# Are there any user-facing changes?

Yes, adds new type conversion to kernel
…on (apache#8029)

- Closes apache#7359.

# Rationale for this change

This is to enable concurrent column writing with encryption downstream
(e.g. with datafusion). See apache#7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618

# What changes are included in this PR?

* `ArrowWriter` now has a `pub get_column_writers` method that can be
used to write columns concurrently.
* Minor change to how encryption tests read test data.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

`pub ArrowWriter.get_column_writers` and `pub ArrowWriter.append_row_group` are added, both to enable concurrent use of column writers. `WriterPropertiesBuilder` now implements `Default`.

---------

Co-authored-by: Adam Reeve <[email protected]>
# Which issue does this PR close?

- Part of apache#4886

# Rationale for this change

This change introduces a comprehensive benchmark suite for the
`arrow-avro` decoder. Having robust benchmarks is crucial for several
reasons:
- It allows for the measurement and tracking of decoding performance
over time.
- It helps identify performance regressions or improvements as the
codebase evolves.
- It provides a standardized way to evaluate the impact of optimizations
and new features.

# What changes are included in this PR?

This PR adds a new benchmark file: `arrow-avro/benches/decoder.rs`.

The key components of this new file are:
- **Comprehensive Type Coverage**: Adds benchmark scenarios for a wide
range of data types, including:
- Primitive types (`Int32`, `Int64`, `Float32`, `Float64`, `Boolean`)
    - Binary and String types (`Binary(Bytes)`, `String`, `StringView`)
- Logical types (`Date32`, `TimeMillis`, `TimeMicros`,
`TimestampMillis`, `TimestampMicros`, `Decimal128`, `UUID`, `Interval`,
`Enum`)
    - Complex types (`Map`, `Array`, `Nested(Struct)`)
    - `FixedSizeBinary`
    - A `Mixed` schema with multiple fields
- Update to criterion 7.0.0
- Made `mod schema` public

# Are these changes tested?

These changes are covered by the benchmark tests themselves.

# Are there any user-facing changes?

N/A
# Which issue does this PR close?


# Rationale for this change

Clippy is failing on main. Here is an example
-
https://github.com/apache/arrow-rs/actions/runs/16804746850/job/47594208868

Rust 1.89 was released today and it includes a new clippy version that
is more strict about some lints:
https://blog.rust-lang.org/2025/08/07/Rust-1.89.0/



# What changes are included in this PR?

Fix clippy lints to make CI pass with Rust 1.89



# Are these changes tested?

By CI

# Are there any user-facing changes?
Bumps
[actions/download-artifact](https://github.com/actions/download-artifact)
from 4 to 5.
<details>
<summary>Release notes</summary>

Sourced from [actions/download-artifact's releases](https://github.com/actions/download-artifact/releases).

## v5.0.0

### What's Changed

- Update README.md by @nebuk89 in actions/download-artifact#407
- BREAKING fix: inconsistent path behavior for single artifact downloads by ID by @GrantBirki in actions/download-artifact#416

### 🚨 Breaking Change

This release fixes an inconsistency in path behavior for single artifact downloads by ID. **If you're downloading single artifacts by ID, the output path may change.**

Previously, **single artifact downloads** behaved differently depending on how you specified the artifact:

- **By name**: `name: my-artifact` → extracted to `path/` (direct)
- **By ID**: `artifact-ids: 12345` → extracted to `path/my-artifact/` (nested)

Now both methods are consistent:

- **By name**: `name: my-artifact` → extracted to `path/` (unchanged)
- **By ID**: `artifact-ids: 12345` → extracted to `path/` (fixed, now direct)

#### Migration Guide

No action needed if you download artifacts by **name**, download **multiple** artifacts by ID, or already use `merge-multiple: true` as a workaround.

Action required if you download **single artifacts by ID** and your workflows expect the nested directory structure.

**Before v5 (nested structure):**

```yaml
- uses: actions/download-artifact@v4
  with:
    artifact-ids: 12345
    path: dist
# Files were in: dist/my-artifact/
```

(where `my-artifact` is the name of the artifact you previously uploaded)

... (truncated)

</details>

Full diff: https://github.com/actions/download-artifact/compare/v4...v5

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Binary` (apache#8074)

# Which issue does this PR close?

- Closes apache#8050

# Rationale for this change

Adds Binary, LargeBinary, and BinaryView conversions to the
cast_to_variant kernel

# What changes are included in this PR?

- a macro to simplify array type conversions
- conversion of `DataType::{Binary, LargeBinary, BinaryView}` =>
`Variant::Binary`
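
As a rough illustration of the mapping (the kernel call itself is omitted
here, since this sketch does not assume its exact signature): all three
binary encodings carry the same logical bytes, so each non-null element is
expected to become a `Variant::Binary` over those bytes, and nulls stay null.

```rust
use arrow_array::{Array, BinaryArray, BinaryViewArray, LargeBinaryArray};

#[test]
fn binary_encodings_carry_the_same_bytes() {
    // Three Arrow encodings of the same logical values: b"one" followed by a null.
    let binary = BinaryArray::from(vec![Some(b"one".as_ref()), None]);
    let large = LargeBinaryArray::from(vec![Some(b"one".as_ref()), None]);
    let view: BinaryViewArray = [Some(b"one".as_ref()), None].into_iter().collect();

    // All three hold the same bytes, so each non-null element is expected to map
    // to the same `Variant::Binary` value; nulls are expected to stay null.
    assert_eq!(binary.value(0), large.value(0));
    assert_eq!(binary.value(0), view.value(0));
    assert!(binary.is_null(1) && large.is_null(1) && view.is_null(1));
}
```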

# Are these changes tested?

Yes, additional unit tests have been added.

# Are there any user-facing changes?

Yes, adds new type conversions to the kernel

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Part of apache#6736
- Closes apache#7941
- Closes apache#7965

# Rationale for this change

This contains a proposal for how to structure shredded `VariantArray`s and
the `variant_get` kernel

If people like the basic idea I will file some more tickets to track
additional follow on work

It is based on ideas from @carpecodeum in
apache#7946 and @scovich in
apache#7915

I basically took the tests from
apache#7965 and the conversation with
@scovich recorded in
apache#7941 (comment)
and I bashed out how this might look.

# What changes are included in this PR?

1. Update `VariantArray` to represent shredding
2. Add code to `variant_get` to support extracting paths as both
variants and typed fields
3. A pattern that I think can represent shredding and extraction
4. Tests for same
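
To make the shredded representation concrete, here is a rough sketch (not
the PR's actual constructor; the field names follow the Variant shredding
spec and the metadata bytes are placeholders): a fully shredded
`VariantArray` can be pictured as a `StructArray` with a binary `metadata`
child and a strongly typed `typed_value` child.

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, BinaryArray, Int64Array, StructArray};
use arrow_schema::{DataType, Field};

fn main() {
    // Per-row variant metadata (placeholder bytes here; real metadata comes
    // from the variant builder).
    let metadata: ArrayRef = Arc::new(BinaryArray::from(vec![
        Some(b"\x01\x00\x00".as_ref()),
        Some(b"\x01\x00\x00".as_ref()),
    ]));

    // Fully shredded rows: values live in a strongly typed `typed_value`
    // child instead of variant-encoded bytes.
    let typed_value: ArrayRef = Arc::new(Int64Array::from(vec![Some(34), Some(42)]));

    let shredded = StructArray::from(vec![
        (
            Arc::new(Field::new("metadata", DataType::Binary, false)),
            metadata,
        ),
        (
            Arc::new(Field::new("typed_value", DataType::Int64, true)),
            typed_value,
        ),
    ]);
    assert_eq!(shredded.len(), 2);
}
```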

Note there are many things that are NOT in this PR that I envision doing
as follow-on PRs:
1. Support and implement `Path`s
2. Support for shredded objects
3. Support for shredded lists
4. Support for nested objects / lists
5. Full casting support
6. Support for other output types: `StringArray`, `StringViewArray`, etc.
7. Many performance improvements


# Are these changes tested?

Yes

# Are there any user-facing changes?

New feature

---------

Co-authored-by: Samyak Sarnayak <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
…Resolution (apache#8006)

# Which issue does this PR close?

- Part of apache#4886

- Follow up to apache#7834

# Rationale for this change

Apache Avro’s [single object
encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding)
prefixes every record with the marker `0xC3 0x01` followed by a `Rabin`
[schema fingerprint
](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints)
so that readers can identify the correct writer schema without carrying
the full definition in each message.
While the current `arrow‑avro` implementation can read container files,
it cannot ingest these framed messages or handle streams where the
writer schema changes over time.

The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin)
hashed fingerprint of the [parsed canonical form of a
schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas)
to look up the `Schema` from a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero‑copy schema identification** for decoding streaming Avro
messages published in single‑object format (e.g. Kafka, Pulsar)
into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve
writer/reader schema differences on the fly.
**NOTE:** Schema resolution support in `Codec` and `RecordDecoder` is
coming in the next PR.
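
For reference, the single-object framing itself is small: per the Avro
spec, a 2-byte `0xC3 0x01` marker, then the 8-byte little-endian Rabin
fingerprint of the writer schema, then the Avro-encoded body. A minimal
sketch of that framing (names here are illustrative, not the crate's API):

```rust
/// Marker bytes that open every Avro single-object-encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Split a framed message into its 64-bit Rabin fingerprint (little-endian,
/// per the Avro spec) and the Avro-encoded record body.
fn split_single_object(buf: &[u8]) -> Option<(u64, &[u8])> {
    if buf.len() < 10 || buf[..2] != SINGLE_OBJECT_MAGIC {
        return None; // not a single-object message; treat as plain payload
    }
    let fingerprint = u64::from_le_bytes(buf[2..10].try_into().ok()?);
    Some((fingerprint, &buf[10..]))
}

fn main() {
    let msg = [0xC3, 0x01, 1, 0, 0, 0, 0, 0, 0, 0, 0x2A];
    let (fp, body) = split_single_object(&msg).unwrap();
    assert_eq!(fp, 1);
    assert_eq!(body, &[0x2A_u8][..]);
}
```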

# What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| **`reader/mod.rs`** | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2 byte peek for high‑throughput fixed‑schema pipelines. |
| **`ReaderBuilder`** | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with rigorous validation to prevent misconfiguration. |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |

---

# Are these changes tested?

Yes.  New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and
lookup.
3. **Decoder fast‑path** with `static_store_mode=true`, ensuring the
prefix is treated as payload, the 2 byte peek is skipped, and no schema
switch is attempted.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. Implement Schema Resolution Functionality in Codec and RecordDecoder
2. Add ID `Fingerprint` variant on `SchemaStore` for Confluent Schema
Registry compatibility
3. Improve arrow-avro errors + add more benchmarks & examples to prepare
for public release

---------

Co-authored-by: Ryan Johnson <[email protected]>
…sync_reader) (apache#7850)

This is my latest attempt to make pushdown faster. Prior art: apache#6921

cc @alamb @zhuqi-lucas 

- Part of apache#8000
- Related to apache/datafusion#3463
- Related to apache#7456
- Closes apache#7363
- Closes apache#8003

## Problems of apache#6921

1. It proactively loads the entire row group into memory (rather than
only loading pages that pass the filter predicate).
2. It only caches decompressed pages, so the decoding cost is still paid
twice.

This PR takes a different approach: it does not change the decoding
pipeline, which avoids problem 1, and it caches the Arrow record
batches, which avoids problem 2.

But this means we need to use more memory to cache data.

## How it works?

1. It instruments the `array_readers` with a transparent
`cached_array_reader`.
2. The cache layer first consults the `RowGroupCache` to look for a
batch, and only reads from the underlying reader on a cache miss (see
the sketch after this list).
3. There is a cache producer and a cache consumer: while building
filters we insert Arrow arrays into the cache, and while building
outputs we remove them from the cache. So the memory usage should look
like this:
```
    ▲
    │     ╭─╮
    │    ╱   ╲
    │   ╱     ╲
    │  ╱       ╲
    │ ╱         ╲
    │╱           ╲
    └─────────────╲──────► Time
    │      │      │
    Filter  Peak  Consume
    Phase (Built) (Decrease)
```
In a concurrent setup, not all readers may reach their peak at the
same time, so the peak system memory usage might be lower.

4. It has a `max_cache_size` knob, which is a per-row-group setting. If
the row group has used up its budget, the cache stops taking new data
and the `cached_array_reader` falls back to reading and decoding from
Parquet.
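
A minimal sketch of the wrapper idea (the `BatchReader` trait and all
names here are simplified stand-ins, not the crate's internal types;
consume-side removal and the precise fallback accounting are omitted):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use arrow_array::{Array, ArrayRef};

// Simplified stand-in for the crate's internal array reader abstraction.
trait BatchReader {
    fn read_batch(&mut self, batch_id: usize) -> ArrayRef;
}

// Per-row-group cache keyed by (column index, batch index), with a byte budget.
struct RowGroupCache {
    entries: HashMap<(usize, usize), ArrayRef>,
    used_bytes: usize,
    max_bytes: usize,
}

impl RowGroupCache {
    fn get(&self, key: (usize, usize)) -> Option<ArrayRef> {
        self.entries.get(&key).cloned()
    }

    fn insert(&mut self, key: (usize, usize), batch: &ArrayRef) {
        let size = batch.get_array_memory_size();
        if self.used_bytes + size <= self.max_bytes {
            self.used_bytes += size;
            self.entries.insert(key, Arc::clone(batch));
        }
        // Over budget: skip caching; the reader simply decodes again later.
    }
}

// Transparent wrapper: consult the cache first, decode on a miss.
struct CachedArrayReader<R: BatchReader> {
    inner: R,
    column: usize,
    cache: Arc<Mutex<RowGroupCache>>,
}

impl<R: BatchReader> BatchReader for CachedArrayReader<R> {
    fn read_batch(&mut self, batch_id: usize) -> ArrayRef {
        let key = (self.column, batch_id);
        if let Some(hit) = self.cache.lock().unwrap().get(key) {
            return hit; // produced earlier, e.g. during the filter phase
        }
        let batch = self.inner.read_batch(batch_id);
        self.cache.lock().unwrap().insert(key, &batch);
        batch
    }
}
```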

## Other benefits

1. This architecture allows nested columns (not implemented in this
PR), i.e., it's future proof.
2. There are many further optimizations possible to squeeze out more
performance, but even in its current state it has no regressions.

## How does it perform?

My criterion run somehow won't produce a result from `--save-baseline`,
so I asked an LLM to generate a table from this benchmark:
```
cargo bench --bench arrow_reader_clickbench --features "arrow async" "async"
```

`Baseline` is the implementation on the current main branch.
`New Unlimited` is the new pushdown with an unlimited memory budget.
`New 100MB` is the new pushdown with a 100MB memory budget for row group
caching.

```
Query  | Baseline (ms) | New Unlimited (ms) | Diff (ms)  | New 100MB (ms) | Diff (ms)
-------+--------------+--------------------+-----------+----------------+-----------
Q1     | 0.847          | 0.803               | -0.044     | 0.812          | -0.035    
Q10    | 4.060          | 6.273               | +2.213     | 6.216          | +2.156    
Q11    | 5.088          | 7.152               | +2.064     | 7.193          | +2.105    
Q12    | 18.485         | 14.937              | -3.548     | 14.904         | -3.581    
Q13    | 24.859         | 21.908              | -2.951     | 21.705         | -3.154    
Q14    | 23.994         | 20.691              | -3.303     | 20.467         | -3.527    
Q19    | 1.894          | 1.980               | +0.086     | 1.996          | +0.102    
Q20    | 90.325         | 64.689              | -25.636    | 74.478         | -15.847   
Q21    | 106.610        | 74.766              | -31.844    | 99.557         | -7.053    
Q22    | 232.730        | 101.660             | -131.070   | 204.800        | -27.930   
Q23    | 222.800        | 186.320             | -36.480    | 186.590        | -36.210   
Q24    | 24.840         | 19.762              | -5.078     | 19.908         | -4.932    
Q27    | 80.463         | 47.118              | -33.345    | 49.597         | -30.866   
Q28    | 78.999         | 47.583              | -31.416    | 51.432         | -27.567   
Q30    | 28.587         | 28.710              | +0.123     | 28.926         | +0.339    
Q36    | 80.157         | 57.954              | -22.203    | 58.012         | -22.145   
Q37    | 46.962         | 45.901              | -1.061     | 45.386         | -1.576    
Q38    | 16.324         | 16.492              | +0.168     | 16.522         | +0.198    
Q39    | 20.754         | 20.734              | -0.020     | 20.648         | -0.106    
Q40    | 22.554         | 21.707              | -0.847     | 21.995         | -0.559    
Q41    | 16.430         | 16.391              | -0.039     | 16.581         | +0.151    
Q42    | 6.045          | 6.157               | +0.112     | 6.120          | +0.075    
```

1. If we consider diffs within 5ms to be noise, then we are never
worse than the current implementation.
2. We see significant improvements for string-heavy queries, because
string columns are large and take time to decompress and decode.
3. A 100MB cache budget seems to have a small performance impact.


## Limitations

1. It only works for async readers, because the sync reader does not
follow the same row-group-by-row-group structure.
2. It is memory hungry compared to apache#6921. But changing the
decoding pipeline without eagerly loading the entire row group would
require significant changes to the current decoding infrastructure,
e.g., we would need to make the page iterator an async function.
3. It currently doesn't support nested columns; more specifically, it
doesn't support nested columns with nullable parents. Supporting them
is straightforward, though, with no big changes needed.
4. The current memory accounting is not accurate: it will overestimate
the memory usage, especially when reading string view arrays, where
multiple string views may share the same underlying buffer and that
buffer size is counted twice. In any case, we never exceed the user
configured memory budget.
5. If one row passes the filter, the entire batch will be cached. We can
probably optimize this, though.

## Next steps?

This PR is largely a proof of concept; I want to collect some feedback
before sending a multi-thousand-line PR :)

Some items I can think of:
1. Design an interface for users to specify the cache size limit;
currently it's hard-coded.
2. Don't instrument the nested array reader if the parquet file has a
nullable parent; currently it will panic.
3. More testing, and integration tests/benchmarks with DataFusion

---------

Co-authored-by: Andrew Lamb <[email protected]>
@carpecodeum
Copy link
Author

closing this and opening a new one

@carpecodeum carpecodeum closed this Aug 9, 2025


Development

Successfully merging this pull request may close these issues.

[Variant] Integration tests for reading parquet w/ Variants