Skip to content

[fix](function) Result type of regexp_extract_all should be Array #50367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mrhhsg
Copy link
Member

@mrhhsg mrhhsg commented Apr 24, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

Function regexp_extract_all should return a array instead of string.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg
Copy link
Member Author

mrhhsg commented Apr 24, 2025

run buildall

@morrySnow morrySnow requested review from eldenmoon and Copilot April 24, 2025 02:46
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the return type of the regexp_extract_all function so that it returns an array instead of a string. Key changes include updating the function registration in the Python builtins script, modifying the Java expression builder to support a three-argument signature with an array return type, and updating the vectorized function implementation and unit tests accordingly.

Reviewed Changes

Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
gensrc/script/doris_builtins_functions.py Updated function signature to register regexp_extract_all with ARRAY_VARCHAR return type instead of VARCHAR.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/RegexpExtractAll.java Revised the function implementation to support a 2- or 3-argument signature and return an array type, and removed legacy nullable interfaces.
be/test/vec/function/function_test_util.h Adjusted test utility template parameters to support the new array result type.
be/test/vec/function/function_like_test.cpp Modified test cases to validate the new array output from regexp_extract_all.
be/src/vec/functions/function_regexp.cpp Updated the implementation of the vectorized regexp_extract_all, including changes to argument handling and result construction for array outputs.
Files not reviewed (4)
  • regression-test/data/nereids_function_p0/scalar_function/R.out: Language not supported
  • regression-test/data/nereids_syntax_p0/function.out: Language not supported
  • regression-test/data/query_p0/sql_functions/string_functions/test_string_function_regexp.out: Language not supported
  • regression-test/suites/query_p0/sql_functions/string_functions/test_string_function_regexp.groovy: Language not supported
Comments suppressed due to low confidence (1)

be/src/vec/functions/function_regexp.cpp:407

  • [nitpick] Returning 0 in get_number_of_arguments() for RegexpExtractAllImpl may be non-obvious; consider adding a comment to clarify that this function is variadic and that the return value is not used conventionally.
return 0;

@mrhhsg
Copy link
Member Author

mrhhsg commented Apr 24, 2025

run buildall

@doris-robot
Copy link

TeamCity cloud ut coverage result:
Function Coverage: 83.06% (1098/1322)
Line Coverage: 65.85% (18369/27896)
Region Coverage: 65.26% (9098/13942)
Branch Coverage: 55.22% (4894/8862)
Coverage Report: http://coverage.selectdb-in.cc/coverage/dd7775660a6e913a33e0b522fa4b9c4bfb6d9605_dd7775660a6e913a33e0b522fa4b9c4bfb6d9605_cloud/report/index.html

@doris-robot
Copy link

TPC-H: Total hot run time: 34071 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit dd7775660a6e913a33e0b522fa4b9c4bfb6d9605, data reload: false

------ Round 1 ----------------------------------
q1	26153	5139	5042	5042
q2	2071	269	188	188
q3	10401	1275	737	737
q4	10237	1001	519	519
q5	7530	2320	2353	2320
q6	187	166	136	136
q7	920	751	610	610
q8	9330	1307	1066	1066
q9	6962	5046	5010	5010
q10	6809	2309	1898	1898
q11	472	288	275	275
q12	348	351	218	218
q13	17767	3690	3091	3091
q14	233	228	206	206
q15	528	480	484	480
q16	445	445	402	402
q17	597	861	379	379
q18	7567	7184	7209	7184
q19	1209	927	548	548
q20	347	338	223	223
q21	4198	3378	2565	2565
q22	1024	997	974	974
Total cold run time: 115335 ms
Total hot run time: 34071 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5185	5137	5139	5137
q2	247	330	245	245
q3	2160	2672	2302	2302
q4	1407	1817	1385	1385
q5	4586	4498	4436	4436
q6	216	169	125	125
q7	1979	1943	1710	1710
q8	2614	2431	2556	2431
q9	7210	7243	7075	7075
q10	2992	3185	2731	2731
q11	584	511	474	474
q12	686	775	617	617
q13	3618	3934	3278	3278
q14	290	289	266	266
q15	529	486	478	478
q16	465	530	469	469
q17	1157	1512	1415	1415
q18	7807	7608	7351	7351
q19	820	813	955	813
q20	1992	2022	1806	1806
q21	5126	4746	4705	4705
q22	1026	995	986	986
Total cold run time: 52696 ms
Total hot run time: 50235 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 185479 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit dd7775660a6e913a33e0b522fa4b9c4bfb6d9605, data reload: false

query1	1021	481	519	481
query2	6550	1827	1824	1824
query3	6746	217	227	217
query4	25891	23368	23153	23153
query5	4377	616	470	470
query6	317	208	220	208
query7	4609	497	276	276
query8	282	242	239	239
query9	8604	2560	2570	2560
query10	443	325	268	268
query11	15234	15243	14836	14836
query12	158	108	105	105
query13	1661	511	383	383
query14	8704	6204	6161	6161
query15	204	193	168	168
query16	7110	622	442	442
query17	922	718	569	569
query18	1973	398	295	295
query19	193	177	152	152
query20	121	114	114	114
query21	213	130	109	109
query22	4009	4106	3987	3987
query23	33848	32853	33012	32853
query24	8499	2391	2369	2369
query25	539	469	402	402
query26	1229	262	153	153
query27	2779	498	327	327
query28	4368	2105	2094	2094
query29	764	539	446	446
query30	283	219	191	191
query31	933	841	737	737
query32	71	65	63	63
query33	562	378	308	308
query34	793	827	527	527
query35	801	831	779	779
query36	952	995	867	867
query37	115	98	79	79
query38	4116	4180	4117	4117
query39	1458	1391	1396	1391
query40	222	120	119	119
query41	63	63	59	59
query42	121	106	110	106
query43	504	514	460	460
query44	1290	818	807	807
query45	181	182	182	182
query46	824	1030	620	620
query47	1747	1798	1737	1737
query48	376	415	296	296
query49	773	496	423	423
query50	657	671	403	403
query51	4104	4121	4057	4057
query52	107	104	94	94
query53	224	262	179	179
query54	574	568	509	509
query55	82	77	83	77
query56	324	286	279	279
query57	1112	1157	1098	1098
query58	297	248	251	248
query59	2579	2642	2474	2474
query60	314	313	319	313
query61	127	132	127	127
query62	805	755	660	660
query63	230	183	182	182
query64	4405	1016	666	666
query65	4335	4258	4241	4241
query66	1141	416	333	333
query67	15641	15350	15384	15350
query68	6282	881	512	512
query69	491	310	266	266
query70	1246	1103	1086	1086
query71	421	326	291	291
query72	5754	4770	4831	4770
query73	669	646	350	350
query74	8957	9092	8656	8656
query75	3176	3215	2811	2811
query76	3183	1187	760	760
query77	481	381	300	300
query78	9939	10145	9210	9210
query79	1910	831	568	568
query80	629	523	463	463
query81	500	251	226	226
query82	248	128	105	105
query83	245	257	244	244
query84	305	103	86	86
query85	794	361	309	309
query86	373	290	307	290
query87	4399	4367	4353	4353
query88	2951	2262	2275	2262
query89	391	336	295	295
query90	1808	220	218	218
query91	145	151	118	118
query92	71	71	59	59
query93	1177	972	572	572
query94	629	410	319	319
query95	379	293	281	281
query96	502	571	285	285
query97	3132	3253	3110	3110
query98	232	206	216	206
query99	1304	1410	1279	1279
Total cold run time: 267334 ms
Total hot run time: 185479 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 29.07 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit dd7775660a6e913a33e0b522fa4b9c4bfb6d9605, data reload: false

query1	0.04	0.04	0.03
query2	0.13	0.10	0.12
query3	0.25	0.20	0.19
query4	1.59	0.19	0.11
query5	0.57	0.54	0.55
query6	1.19	0.71	0.73
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.57	0.52	0.52
query10	0.58	0.58	0.57
query11	0.16	0.11	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.59
query14	1.18	1.19	1.21
query15	0.86	0.85	0.85
query16	0.39	0.37	0.38
query17	1.03	1.02	1.08
query18	0.22	0.19	0.20
query19	1.87	1.76	1.79
query20	0.01	0.01	0.01
query21	15.40	0.94	0.55
query22	0.76	1.20	0.75
query23	14.82	1.39	0.63
query24	7.44	0.95	0.37
query25	0.48	0.24	0.06
query26	0.61	0.17	0.13
query27	0.05	0.05	0.05
query28	9.56	0.96	0.44
query29	12.53	4.03	3.32
query30	0.26	0.11	0.06
query31	2.82	0.60	0.39
query32	3.22	0.56	0.47
query33	2.98	3.04	3.06
query34	15.80	5.11	4.50
query35	4.50	4.49	4.51
query36	0.66	0.49	0.50
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.17	0.14	0.12
query41	0.08	0.02	0.02
query42	0.03	0.03	0.02
query43	0.04	0.03	0.04
Total cold run time: 103.84 s
Total hot run time: 29.07 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 91.76% (78/85) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.59% (14548/27146)
Line Coverage 42.45% (126342/297608)
Region Coverage 41.23% (64514/156468)
Branch Coverage 35.83% (32456/90580)

@hello-stephen
Copy link
Contributor

BE Regression P0 && UT Coverage Report

Increment line coverage 91.76% (78/85) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage
Line Coverage
Region Coverage
Branch Coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants