branch-3.0: [fix](orc) Should not pass selection vector when decode child column of List or Map #50136 #50317

Merged (1 commit) on Apr 25, 2025

Conversation

suxiaogang223
Contributor

bp: #50136

@Thearas
Contributor

Thearas commented Apr 23, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 40384 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f0b980786038ab3c63d6d011a42037eeef0947e8, data reload: false

------ Round 1 ----------------------------------
q1	17572	6755	6591	6591
q2	2057	174	192	174
q3	10849	1069	1171	1069
q4	10564	713	732	713
q5	7753	2853	2845	2845
q6	219	137	136	136
q7	977	626	602	602
q8	9348	1956	2033	1956
q9	6646	6408	6373	6373
q10	7053	2263	2334	2263
q11	476	262	269	262
q12	401	215	215	215
q13	17803	2986	2990	2986
q14	242	210	209	209
q15	500	461	463	461
q16	686	584	576	576
q17	977	568	632	568
q18	7312	6872	6889	6872
q19	1399	1029	1046	1029
q20	473	208	208	208
q21	4189	3251	3346	3251
q22	1138	1044	1025	1025
Total cold run time: 108634 ms
Total hot run time: 40384 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6698	6689	6519	6519
q2	334	237	235	235
q3	2996	2878	2902	2878
q4	2058	1779	1780	1779
q5	5740	5747	5717	5717
q6	219	130	130	130
q7	2217	1825	1783	1783
q8	3386	3559	3512	3512
q9	8868	8923	8870	8870
q10	3549	3506	3540	3506
q11	596	496	504	496
q12	793	586	608	586
q13	9549	3212	3167	3167
q14	298	294	267	267
q15	525	466	470	466
q16	702	668	634	634
q17	1848	1627	1634	1627
q18	8105	7842	7774	7774
q19	1703	1659	1538	1538
q20	2051	1869	1884	1869
q21	5558	5324	5302	5302
q22	1182	1026	1045	1026
Total cold run time: 68975 ms
Total hot run time: 59681 ms

@doris-robot

TPC-DS: Total hot run time: 197469 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f0b980786038ab3c63d6d011a42037eeef0947e8, data reload: false

query1	1300	904	903	903
query2	6249	1987	1952	1952
query3	10964	4449	4308	4308
query4	61654	29040	23570	23570
query5	5122	459	442	442
query6	382	175	173	173
query7	5475	313	313	313
query8	319	228	229	228
query9	8356	2629	2614	2614
query10	470	275	251	251
query11	17682	15111	15776	15111
query12	155	101	104	101
query13	1457	453	439	439
query14	10565	7477	7411	7411
query15	206	180	182	180
query16	7131	472	477	472
query17	1157	593	603	593
query18	1867	330	324	324
query19	218	166	158	158
query20	118	111	108	108
query21	202	101	104	101
query22	4748	4421	4750	4421
query23	34616	34134	33768	33768
query24	6147	2999	3004	2999
query25	551	408	421	408
query26	668	167	162	162
query27	1773	364	371	364
query28	4439	2470	2414	2414
query29	684	443	441	441
query30	244	170	164	164
query31	1005	799	846	799
query32	68	54	55	54
query33	434	291	297	291
query34	920	513	534	513
query35	843	737	716	716
query36	1101	970	984	970
query37	122	67	72	67
query38	4193	4017	3946	3946
query39	1522	1485	1454	1454
query40	197	97	103	97
query41	50	50	49	49
query42	116	99	105	99
query43	542	493	489	489
query44	1166	831	808	808
query45	189	169	168	168
query46	1151	738	732	732
query47	2052	1892	1918	1892
query48	499	392	378	378
query49	715	395	396	395
query50	856	439	427	427
query51	7429	7332	7251	7251
query52	102	88	89	88
query53	260	184	194	184
query54	578	475	487	475
query55	86	80	79	79
query56	275	258	252	252
query57	1313	1185	1176	1176
query58	230	200	218	200
query59	3156	2949	3040	2949
query60	267	257	252	252
query61	110	141	107	107
query62	785	681	650	650
query63	216	190	191	190
query64	1399	683	637	637
query65	3266	3206	3178	3178
query66	694	295	301	295
query67	15858	15517	15536	15517
query68	4300	585	566	566
query69	437	263	260	260
query70	1196	1079	1101	1079
query71	363	258	252	252
query72	6348	4124	4124	4124
query73	782	351	356	351
query74	10415	9087	9284	9087
query75	3331	2681	2657	2657
query76	2128	1107	1105	1105
query77	509	275	274	274
query78	10626	9691	9529	9529
query79	2240	595	605	595
query80	1375	430	458	430
query81	530	243	240	240
query82	1266	87	85	85
query83	268	140	140	140
query84	290	80	77	77
query85	1026	310	286	286
query86	420	299	300	299
query87	4429	4221	4290	4221
query88	3859	2392	2383	2383
query89	427	293	294	293
query90	1941	190	189	189
query91	183	172	147	147
query92	67	48	48	48
query93	2944	565	561	561
query94	788	311	291	291
query95	365	261	256	256
query96	629	287	279	279
query97	3316	3160	3174	3160
query98	213	203	195	195
query99	1603	1294	1293	1293
Total cold run time: 317298 ms
Total hot run time: 197469 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 40.22% (10556/26248)
Line Coverage 30.99% (89328/288214)
Region Coverage 30.11% (46063/152965)
Branch Coverage 26.62% (23548/88458)

@doris-robot

ClickBench: Total hot run time: 33.44 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f0b980786038ab3c63d6d011a42037eeef0947e8, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.02	0.03
query3	0.24	0.07	0.06
query4	1.62	0.10	0.11
query5	0.51	0.52	0.50
query6	1.13	0.73	0.75
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.51	0.52
query10	0.54	0.54	0.54
query11	0.15	0.11	0.10
query12	0.14	0.11	0.11
query13	0.62	0.59	0.59
query14	2.75	2.86	2.75
query15	0.89	0.82	0.83
query16	0.39	0.39	0.39
query17	1.00	1.06	0.99
query18	0.24	0.22	0.22
query19	1.99	1.86	2.03
query20	0.01	0.01	0.02
query21	15.36	0.58	0.57
query22	2.72	1.92	2.51
query23	17.09	0.93	0.95
query24	2.97	1.91	2.00
query25	0.20	0.28	0.09
query26	0.50	0.14	0.14
query27	0.05	0.04	0.04
query28	8.70	0.50	0.51
query29	12.55	3.20	3.24
query30	0.25	0.06	0.06
query31	2.85	0.39	0.39
query32	3.24	0.46	0.45
query33	2.97	2.99	2.99
query34	16.96	4.54	4.56
query35	4.59	4.55	4.62
query36	0.67	0.48	0.48
query37	0.09	0.06	0.06
query38	0.04	0.03	0.04
query39	0.04	0.02	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 105.12 s
Total hot run time: 33.44 s

…of List or Map (apache#50136)

Related PR: apache#18615

Problem Summary:
This problem is similar to apache/doris-thirdparty#256.
When performing late materialization for LIST or MAP types, the filter
(selection vector) must not be applied directly to their child fields.
These complex types rely on offsets to map each parent row to its child
elements within the columnar storage layout (e.g., in ORC or Parquet files).

If the filter is applied to the children of a LIST or MAP field, the
child data no longer aligns with the parent offsets, so incorrect data
is read: mismatched elements, missing values, or even runtime errors.
This breaks the structural integrity of the nested data and produces
incorrect query results.
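
To make the offset dependency concrete, here is a minimal Python sketch of a LIST column stored as an offsets array plus a flat child array. It is a toy model, not Doris or ORC code: the names `offsets`, `child`, `selected`, and `decode_lists` are hypothetical, and the "wrong" path only imitates the effect of treating a row-level selection vector as if it indexed child values.

```python
# Hypothetical in-memory layout of a LIST<STRING> column with 4 rows:
# row i owns child[offsets[i]:offsets[i + 1]].
offsets = [0, 2, 3, 5, 7]
child = ["a", "b", "b", "c", "a", "b", "c"]

# Row-level selection vector for a predicate such as `id > 2`:
# rows 0 and 1 are filtered out, rows 2 and 3 survive.
selected = [False, False, True, True]


def decode_lists(offsets, child, selected):
    """Rebuild the surviving rows from the parent offsets and child values."""
    return [
        child[offsets[i]:offsets[i + 1]]
        for i in range(len(selected))
        if selected[i]
    ]


# Correct: the child array is read in full and the filter is applied per row.
correct = decode_lists(offsets, child, selected)

# Incorrect (assumed failure mode): the row-level mask is applied to child
# positions. The mask has 4 entries but the child has 7 values, so values
# that belong to surviving rows are blanked out.
wrongly_filtered_child = [
    value if i < len(selected) and selected[i] else ""
    for i, value in enumerate(child)
]
broken = decode_lists(offsets, wrongly_filtered_child, selected)

print(correct)  # [['c', 'a'], ['b', 'c']]
print(broken)   # [['c', ''], ['', '']]
```

The blank strings in `broken` resemble the empty keys and elements in the wrong result shown in this description, although the real reader's failure path may differ in detail.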

```text
mysql> select * from complex_data_orc;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    1 | {"a":1, "b":2}           | ["a", "b"]      |
|    2 | {"b":3, "c":4}           | ["b"]           |
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
10 rows in set (0.02 sec)

!!!WRONG RESULT:
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+----------------+
| id   | m                        | l              |
+------+--------------------------+----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]     |
|    4 | {"a":8, "c":9}           | ["b", "c"]     |
|    5 | {"b":10, "":11}          | ["a"]          |
|    6 | {"":12, "":13}           | ["c"]          |
|    7 | {"":15}                  | ["a", ""]      |
|    8 | {"":17}                  | ["", ""]       |
|    9 | {"":19}                  | ["", ""]       |
|   10 | {"a":20, "b":21, "c":22} | ["", "b", "c"] |
+------+--------------------------+----------------+
8 rows in set (0.02 sec)
```

To ensure correctness, filters should only be applied at the top level
of the LIST or MAP, and their children should be read in full when late
materialization occurs.
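
As a rough illustration of that rule, the sketch below filters a LIST column only at the parent level: the child array is read in full, and the parent's offsets decide which child ranges are kept. The function `filter_list_column` and the sample data are hypothetical and are not taken from the Doris or ORC code base.

```python
def filter_list_column(offsets, child, selected):
    """Apply a row-level filter to a LIST column stored as offsets + child.

    The child array is always read in full; only the parent rows decide
    which child ranges survive, and offsets are rebuilt for the kept rows.
    """
    new_offsets = [0]
    new_child = []
    for row, keep in enumerate(selected):
        if not keep:
            continue
        start, end = offsets[row], offsets[row + 1]
        new_child.extend(child[start:end])
        new_offsets.append(len(new_child))
    return new_offsets, new_child


# Same toy layout as a 4-row LIST<STRING> column.
offsets = [0, 2, 3, 5, 7]
child = ["a", "b", "b", "c", "a", "b", "c"]
selected = [False, False, True, True]  # e.g. the result of `id > 2`

new_offsets, new_child = filter_list_column(offsets, child, selected)
print(new_offsets)  # [0, 2, 4]
print(new_child)    # ['c', 'a', 'b', 'c']
```

Because the filter never touches `child` directly, the rebuilt offsets always describe the compacted child array, at the cost of decoding child values for rows that are later discarded.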

After this PR:
```text
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
8 rows in set (1.41 sec)
```
@suxiaogang223 force-pushed the pick_fix_lazt_mat_3.0 branch from f0b9807 to 2d9f03e on April 24, 2025 02:15
@suxiaogang223
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 40.22% (10558/26248)
Line Coverage 31.00% (89348/288222)
Region Coverage 30.12% (46082/152982)
Branch Coverage 26.62% (23553/88464)

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 21a440d into apache:branch-3.0 Apr 25, 2025
19 of 21 checks passed
5 participants