\n",
" \n",
- " | 0 | \n",
- " ## BigQuery: Your Data Warehouse in the Cloud\n",
- "... | \n",
- " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
+ " 1 | \n",
+ " BQML stands for **BigQuery Machine Learning**.... | \n",
+ " <NA> | \n",
" | \n",
- " What is BigQuery? | \n",
+ " What is BQML? | \n",
"
\n",
" \n",
- " | 1 | \n",
- " ## BQML - BigQuery Machine Learning\n",
- "\n",
- "BQML stan... | \n",
- " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
+ " 0 | \n",
+ " BigQuery is a fully managed, serverless data w... | \n",
+ " <NA> | \n",
" | \n",
- " What is BQML? | \n",
+ " What is BigQuery? | \n",
"
\n",
" \n",
" | 2 | \n",
- " ## BigQuery DataFrames\n",
- "\n",
- "BigQuery DataFrames is... | \n",
- " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
+ " BigQuery DataFrames are a Python library that ... | \n",
+ " <NA> | \n",
" | \n",
" What is BigQuery DataFrames? | \n",
"
\n",
@@ -1337,29 +1394,24 @@
],
"text/plain": [
" ml_generate_text_llm_result \\\n",
- "0 ## BigQuery: Your Data Warehouse in the Cloud\n",
- "... \n",
- "1 ## BQML - BigQuery Machine Learning\n",
- "\n",
- "BQML stan... \n",
- "2 ## BigQuery DataFrames\n",
+ "1 BQML stands for **BigQuery Machine Learning**.... \n",
+ "0 BigQuery is a fully managed, serverless data w... \n",
+ "2 BigQuery DataFrames are a Python library that ... \n",
"\n",
- "BigQuery DataFrames is... \n",
- "\n",
- " ml_generate_text_rai_result ml_generate_text_status \\\n",
- "0 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
- "1 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
- "2 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
+ " ml_generate_text_rai_result ml_generate_text_status \\\n",
+ "1 \n",
+ "0 \n",
+ "2 \n",
"\n",
" prompt \n",
- "0 What is BigQuery? \n",
"1 What is BQML? \n",
+ "0 What is BigQuery? \n",
"2 What is BigQuery DataFrames? \n",
"\n",
"[3 rows x 4 columns]"
]
},
- "execution_count": 23,
+ "execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
@@ -1382,49 +1434,44 @@
},
{
"cell_type": "code",
- "execution_count": 24,
+ "execution_count": null,
"metadata": {},
"outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "## BigQuery DataFrames\n",
- "\n",
- "BigQuery DataFrames is a Python library that allows you to interact with BigQuery data using the familiar Pandas API. This means you can use all the powerful tools and methods from the Pandas library to explore, analyze, and transform your BigQuery data, without needing to learn a new language or API.\n",
- "\n",
- "Here are some of the key benefits of using BigQuery DataFrames:\n",
- "\n",
- "* **Ease of use:** If you're already familiar with Pandas, you can start using BigQuery DataFrames with minimal learning curve.\n",
- "* **Speed and efficiency:** BigQuery DataFrames leverages the power of BigQuery to perform complex operations on large datasets efficiently.\n",
- "* **Flexibility:** You can use BigQuery DataFrames for a wide range of tasks, including data exploration, analysis, cleaning, and transformation.\n",
- "* **Integration with other tools:** BigQuery DataFrames integrates seamlessly with other Google Cloud tools like Colab and Vertex AI, allowing you to build end-to-end data analysis pipelines.\n",
- "\n",
- "Here are some of the key features of BigQuery DataFrames:\n",
- "\n",
- "* **Support for most Pandas operations:** You can use most of the DataFrame methods you're familiar with, such as `groupby`, `filter`, `sort_values`, and `apply`.\n",
- "* **Automatic schema inference:** BigQuery DataFrames automatically infers the schema of your data, so you don't need to manually specify it.\n",
- "* **Efficient handling of large datasets:** BigQuery DataFrames pushes computations to BigQuery, which allows you to work with large datasets without running out of memory.\n",
- "* **Support for both public and private datasets:** You can use BigQuery DataFrames to access both public and private datasets stored in BigQuery.\n",
- "\n",
- "## Getting Started with BigQuery DataFrames\n",
- "\n",
- "Getting started with BigQuery DataFrames is easy. You just need to install the library and configure your authentication. Once you're set up, you can start using it to interact with your BigQuery data.\n",
- "\n",
- "Here are some resources to help you get started:\n",
- "\n",
- "* **Documentation:** https://cloud.google.com/bigquery/docs/reference/libraries/bigquery-dataframe\n",
- "* **Quickstart:** https://cloud.google.com/bigquery/docs/reference/libraries/bigquery-dataframe-python-quickstart\n",
- "* **Tutorials:** https://cloud.google.com/bigquery/docs/tutorials/bq-dataframe-pandas-tutorial\n",
- "\n",
- "## Conclusion\n",
- "\n",
- "BigQuery DataFrames is a powerful tool that can help you get the most out of your BigQuery data. If you're looking for a way to easily analyze and transform your BigQuery data using the familiar Pandas API, then BigQuery DataFrames is a great option.\n"
- ]
+ "data": {
+ "text/markdown": [
+ "BigQuery DataFrames are a Python library that provides a Pandas-like interface for interacting with BigQuery data. Instead of loading entire datasets into memory (which is impossible for very large BigQuery tables), BigQuery DataFrames allow you to work with BigQuery data in a way that feels familiar if you've used Pandas, but leverages BigQuery's processing power for efficiency. This means you can perform data analysis and manipulation on datasets that are too large for Pandas to handle directly.\n",
+ "\n",
+ "Key features and characteristics include:\n",
+ "\n",
+ "* **Lazy Evaluation:** BigQuery DataFrames don't load the entire dataset into memory. Operations are expressed as queries that are executed in BigQuery only when necessary (e.g., when you call `.to_dataframe()` to materialize a result, or when you explicitly trigger execution). This significantly reduces memory consumption.\n",
+ "\n",
+ "* **Pandas-like API:** The library aims for a familiar API similar to Pandas. You can use many of the same functions and methods you would use with Pandas DataFrames, such as filtering, selecting columns, aggregations, and joining.\n",
+ "\n",
+ "* **Integration with BigQuery:** The library seamlessly integrates with BigQuery. It allows you to read data from BigQuery tables and write data back to BigQuery.\n",
+ "\n",
+ "* **Scalability:** Because the processing happens in BigQuery, BigQuery DataFrames can scale to handle datasets of virtually any size. It's designed to efficiently process terabytes or even petabytes of data.\n",
+ "\n",
+ "* **Performance:** While providing a user-friendly interface, BigQuery DataFrames leverages BigQuery's optimized query engine for fast execution of operations.\n",
+ "\n",
+ "* **SQL integration:** While providing a Pythonic interface, you can easily incorporate SQL queries directly within the DataFrame operations providing flexibility and control over the data manipulation.\n",
+ "\n",
+ "\n",
+ "**In short:** BigQuery DataFrames provide a powerful and efficient way to work with large BigQuery datasets using a familiar Pandas-like syntax without the memory limitations of loading the entire dataset into local memory. They bridge the gap between the ease of use of Pandas and the scalability of BigQuery.\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
}
],
"source": [
- "# print(pred.loc[2][\"ml_generate_text_llm_result\"])"
+ "# import IPython.display\n",
+ "\n",
+ "# IPython.display.Markdown(pred.loc[2][\"ml_generate_text_llm_result\"])"
]
},
{
@@ -1443,7 +1490,7 @@
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
@@ -1491,7 +1538,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.12"
+ "version": "3.12.6"
}
},
"nbformat": 4,
diff --git a/noxfile.py b/noxfile.py
index b29cda7a51..3bdff699ed 100644
--- a/noxfile.py
+++ b/noxfile.py
@@ -76,7 +76,7 @@
]
UNIT_TEST_LOCAL_DEPENDENCIES: List[str] = []
UNIT_TEST_DEPENDENCIES: List[str] = []
-UNIT_TEST_EXTRAS: List[str] = []
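+# Install the "tests" extra (defined in setup.py) so optional test-only
+# dependencies such as freezegun and pytest-snapshot are available in unit tests.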
+UNIT_TEST_EXTRAS: List[str] = ["tests"]
UNIT_TEST_EXTRAS_BY_PYTHON: Dict[str, List[str]] = {
"3.12": ["polars", "scikit-learn"],
}
@@ -203,7 +203,7 @@ def install_unittest_dependencies(session, install_test_extra, *constraints):
if install_test_extra and UNIT_TEST_EXTRAS_BY_PYTHON:
extras = UNIT_TEST_EXTRAS_BY_PYTHON.get(session.python, [])
- elif install_test_extra and UNIT_TEST_EXTRAS:
+ if install_test_extra and UNIT_TEST_EXTRAS:
extras = UNIT_TEST_EXTRAS
else:
extras = []
diff --git a/samples/snippets/conftest.py b/samples/snippets/conftest.py
index 5cba045ce4..e8253bc5a7 100644
--- a/samples/snippets/conftest.py
+++ b/samples/snippets/conftest.py
@@ -24,6 +24,8 @@
"python-bigquery-dataframes", "samples/snippets"
)
+routine_prefixer = test_utils.prefixer.Prefixer("bigframes", "")
+
@pytest.fixture(scope="session", autouse=True)
def cleanup_datasets(bigquery_client: bigquery.Client) -> None:
@@ -106,3 +108,12 @@ def random_model_id_eu(
full_model_id = f"{project_id}.{dataset_id_eu}.{random_model_id_eu}"
yield full_model_id
bigquery_client.delete_model(full_model_id, not_found_ok=True)
+
+
+@pytest.fixture
+def routine_id() -> Iterator[str]:
+ """Create a new BQ routine ID each time, so random_routine_id can be used as
+ target for udf creation.
+ """
+ random_routine_id = routine_prefixer.create_prefix()
+ yield random_routine_id
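+ # Note: this fixture does not delete the routine itself; the udf sample
+ # cleans up the routine it creates.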
diff --git a/samples/snippets/multimodal_test.py b/samples/snippets/multimodal_test.py
index 85e118d671..27a7998ff9 100644
--- a/samples/snippets/multimodal_test.py
+++ b/samples/snippets/multimodal_test.py
@@ -21,6 +21,8 @@ def test_multimodal_dataframe(gcs_dst_bucket: str) -> None:
# Flag to enable the feature
bigframes.options.experiments.blob = True
+ # Flag to control the preview image/video display size
+ bigframes.options.experiments.blob_display_width = 300
import bigframes.pandas as bpd
@@ -47,10 +49,12 @@ def test_multimodal_dataframe(gcs_dst_bucket: str) -> None:
df_image["size"] = df_image["image"].blob.size()
df_image["updated"] = df_image["image"].blob.updated()
df_image
+ # [END bigquery_dataframes_multimodal_dataframe_merge]
+ # [START bigquery_dataframes_multimodal_dataframe_filter]
# Filter images and display, you can also display audio and video types. Use width/height parameters to constrain window sizes.
- df_image[df_image["author"] == "alice"]["image"].blob.display(width=400)
- # [END bigquery_dataframes_multimodal_dataframe_merge]
+ df_image[df_image["author"] == "alice"]["image"].blob.display()
+ # [END bigquery_dataframes_multimodal_dataframe_filter]
# [START bigquery_dataframes_multimodal_dataframe_image_transform]
df_image["blurred"] = df_image["image"].blob.image_blur(
diff --git a/samples/snippets/remote_function.py b/samples/snippets/remote_function.py
index 3a7031ef89..4c5b365007 100644
--- a/samples/snippets/remote_function.py
+++ b/samples/snippets/remote_function.py
@@ -21,7 +21,7 @@ def run_remote_function_and_read_gbq_function(project_id: str) -> None:
# Set BigQuery DataFrames options
bpd.options.bigquery.project = your_gcp_project_id
- bpd.options.bigquery.location = "us"
+ bpd.options.bigquery.location = "US"
# BigQuery DataFrames gives you the ability to turn your custom scalar
# functions into a BigQuery remote function. It requires the GCP project to
@@ -56,7 +56,7 @@ def get_bucket(num: float) -> str:
boundary = 4000
return "at_or_above_4000" if num >= boundary else "below_4000"
- # Then we can apply the remote function on the `Series`` of interest via
+ # Then we can apply the remote function on the `Series` of interest via
# `apply` API and store the result in a new column in the DataFrame.
df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket))
diff --git a/samples/snippets/udf.py b/samples/snippets/udf.py
new file mode 100644
index 0000000000..495cd33e84
--- /dev/null
+++ b/samples/snippets/udf.py
@@ -0,0 +1,121 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+def run_udf_and_read_gbq_function(
+ project_id: str, dataset_id: str, routine_id: str
+) -> None:
+ your_gcp_project_id = project_id
+ your_bq_dataset_id = dataset_id
+ your_bq_routine_id = routine_id
+
+ # [START bigquery_dataframes_udf]
+ import bigframes.pandas as bpd
+
+ # Set BigQuery DataFrames options
+ bpd.options.bigquery.project = your_gcp_project_id
+ bpd.options.bigquery.location = "US"
+
+ # BigQuery DataFrames gives you the ability to turn your custom functions
+ # into a BigQuery Python UDF. You can find more details about the usage and
+ # the requirements via the `help` command.
+ help(bpd.udf)
+
+ # Read a table and inspect the column of interest.
+ df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
+ df["body_mass_g"].peek(10)
+
+ # Define a custom function, and specify the intent to turn it into a
+ # BigQuery Python UDF. Let's try a `pandas`-like use case in which we want
+ # to apply a user-defined function to every value in a `Series`; more
+ # specifically, to bucketize the `body_mass_g` value of the penguins, which
+ # is a real number, into a category, which is a string.
+ @bpd.udf(
+ dataset=your_bq_dataset_id,
+ name=your_bq_routine_id,
+ )
+ def get_bucket(num: float) -> str:
+ if not num:
+ return "NA"
+ boundary = 4000
+ return "at_or_above_4000" if num >= boundary else "below_4000"
+
+ # Then we can apply the udf to the `Series` of interest via the
+ # `apply` API and store the result in a new column in the DataFrame.
+ df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket))
+
+ # This will add a new column `body_mass_bucket` in the DataFrame. You can
+ # preview the original value and the bucketized value side by side.
+ df[["body_mass_g", "body_mass_bucket"]].peek(10)
+
+ # The above operation was possible because all the computation was done in
+ # the cloud through an underlying BigQuery Python UDF that was created to
+ # support the user's operations in the Python code.
+
+ # The BigQuery Python UDF created to support the BigQuery DataFrames
+ # udf can be located via the `bigframes_bigquery_function` property
+ # set on the udf object.
+ print(f"Created BQ Python UDF: {get_bucket.bigframes_bigquery_function}")
+
+ # If you have already defined a custom function in BigQuery, whether via
+ # the BigQuery Google Cloud Console, the `udf` decorator, or otherwise,
+ # you can use it with BigQuery DataFrames through the
+ # `read_gbq_function` method. More details are available via the `help`
+ # command.
+ help(bpd.read_gbq_function)
+
+ existing_get_bucket_bq_udf = get_bucket.bigframes_bigquery_function
+
+ # Here is an example of using `read_gbq_function` to load an existing
+ # BigQuery Python UDF.
+ df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
+ get_bucket_function = bpd.read_gbq_function(existing_get_bucket_bq_udf)
+
+ df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket_function))
+ df.peek(10)
+
+ # Let's continue with other potential use cases of udf. Say we consider
+ # the `species`, `island` and `sex` of the penguins sensitive information
+ # and want to redact them by replacing each value with its hash code.
+ # Let's define another scalar custom function and decorate it as a udf.
+ # The custom function in this example has an external package dependency,
+ # which can be specified via the `packages` parameter.
+ @bpd.udf(
+ dataset=your_bq_dataset_id,
+ name=your_bq_routine_id,
+ packages=["cryptography"],
+ )
+ def get_hash(input: str) -> str:
+ from cryptography.fernet import Fernet
+
+ # handle missing value
+ if input is None:
+ input = ""
+
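+ # Note: a fresh Fernet key is generated on every call, so identical
+ # inputs do not produce identical outputs; the result is suitable for
+ # redaction rather than as a stable, joinable hash.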
+ key = Fernet.generate_key()
+ f = Fernet(key)
+ return f.encrypt(input.encode()).decode()
+
+ # We can use this udf with another `pandas`-like API, `map`, which
+ # can be applied to a DataFrame.
+ df_redacted = df[["species", "island", "sex"]].map(get_hash)
+ df_redacted.peek(10)
+
+ # [END bigquery_dataframes_udf]
+
+ # Clean up cloud artifacts
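+ # The named Python UDF created above persists as a routine in the dataset,
+ # so delete it once the sample is done.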
+ session = bpd.get_global_session()
+ session.bqclient.delete_routine(
+ f"{your_bq_dataset_id}.{your_bq_routine_id}", not_found_ok=True
+ )
diff --git a/samples/snippets/udf_test.py b/samples/snippets/udf_test.py
new file mode 100644
index 0000000000..a352b4c8ce
--- /dev/null
+++ b/samples/snippets/udf_test.py
@@ -0,0 +1,38 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pytest
+
+import bigframes.pandas
+
+from . import udf
+
+
+def test_udf_and_read_gbq_function(
+ capsys: pytest.CaptureFixture[str],
+ dataset_id: str,
+ routine_id: str,
+) -> None:
+ # We need a fresh session since we're modifying connection options.
+ bigframes.pandas.close_session()
+
+ # Determine the project ID, preferring the one set in the
+ # GOOGLE_CLOUD_PROJECT environment variable (if any).
+ import os
+
+ your_project_id = os.getenv("GOOGLE_CLOUD_PROJECT", "bigframes-dev")
+
+ udf.run_udf_and_read_gbq_function(your_project_id, dataset_id, routine_id)
+ out, _ = capsys.readouterr()
+ assert "Created BQ Python UDF:" in out
diff --git a/scripts/run_and_publish_benchmark.py b/scripts/run_and_publish_benchmark.py
index 402ba4d213..0ea3a5e162 100644
--- a/scripts/run_and_publish_benchmark.py
+++ b/scripts/run_and_publish_benchmark.py
@@ -93,10 +93,10 @@ def collect_benchmark_result(
error_files = sorted(path.rglob("*.error"))
if not (
- len(bytes_files)
- == len(millis_files)
+ len(millis_files)
== len(bq_seconds_files)
- <= len(query_char_count_files)
+ <= len(bytes_files)
+ == len(query_char_count_files)
== len(local_seconds_files)
):
raise ValueError(
@@ -108,10 +108,13 @@ def collect_benchmark_result(
for idx in range(len(local_seconds_files)):
query_char_count_file = query_char_count_files[idx]
local_seconds_file = local_seconds_files[idx]
+ bytes_file = bytes_files[idx]
filename = query_char_count_file.relative_to(path).with_suffix("")
- if filename != local_seconds_file.relative_to(path).with_suffix(""):
+ if filename != local_seconds_file.relative_to(path).with_suffix(
+ ""
+ ) or filename != bytes_file.relative_to(path).with_suffix(""):
raise ValueError(
- "File name mismatch between query_char_count and seconds reports."
+ "File name mismatch among query_char_count, bytes and seconds reports."
)
with open(query_char_count_file, "r") as file:
@@ -123,27 +126,23 @@ def collect_benchmark_result(
lines = file.read().splitlines()
local_seconds = sum(float(line) for line in lines) / iterations
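+ # Bytes processed is now read unconditionally rather than only when full
+ # metrics are available.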
+ with open(bytes_file, "r") as file:
+ lines = file.read().splitlines()
+ total_bytes = sum(int(line) for line in lines) / iterations
+
if not has_full_metrics:
- total_bytes = None
total_slot_millis = None
bq_seconds = None
else:
- bytes_file = bytes_files[idx]
millis_file = millis_files[idx]
bq_seconds_file = bq_seconds_files[idx]
- if (
- filename != bytes_file.relative_to(path).with_suffix("")
- or filename != millis_file.relative_to(path).with_suffix("")
- or filename != bq_seconds_file.relative_to(path).with_suffix("")
- ):
+ if filename != millis_file.relative_to(path).with_suffix(
+ ""
+ ) or filename != bq_seconds_file.relative_to(path).with_suffix(""):
raise ValueError(
"File name mismatch among query_char_count, bytes, millis, and seconds reports."
)
- with open(bytes_file, "r") as file:
- lines = file.read().splitlines()
- total_bytes = sum(int(line) for line in lines) / iterations
-
with open(millis_file, "r") as file:
lines = file.read().splitlines()
total_slot_millis = sum(int(line) for line in lines) / iterations
@@ -202,11 +201,7 @@ def collect_benchmark_result(
print(
f"{index} - query count: {row['Query_Count']},"
+ f" query char count: {row['Query_Char_Count']},"
- + (
- f" bytes processed sum: {row['Bytes_Processed']},"
- if has_full_metrics
- else ""
- )
+ + f" bytes processed sum: {row['Bytes_Processed']},"
+ (f" slot millis sum: {row['Slot_Millis']}," if has_full_metrics else "")
+ f" local execution time: {formatted_local_exec_time} seconds"
+ (
@@ -238,11 +233,7 @@ def collect_benchmark_result(
print(
f"---Geometric mean of queries: {geometric_mean_queries},"
+ f" Geometric mean of queries char counts: {geometric_mean_query_char_count},"
- + (
- f" Geometric mean of bytes processed: {geometric_mean_bytes},"
- if has_full_metrics
- else ""
- )
+ + f" Geometric mean of bytes processed: {geometric_mean_bytes},"
+ (
f" Geometric mean of slot millis: {geometric_mean_slot_millis},"
if has_full_metrics
diff --git a/setup.py b/setup.py
index 1fe7006860..489d9aacd9 100644
--- a/setup.py
+++ b/setup.py
@@ -38,10 +38,12 @@
"fsspec >=2023.3.0",
"gcsfs >=2023.3.0",
"geopandas >=0.12.2",
- "google-auth >=2.15.0,<3.0dev",
+ "google-auth >=2.15.0,<3.0",
"google-cloud-bigtable >=2.24.0",
"google-cloud-pubsub >=2.21.4",
"google-cloud-bigquery[bqstorage,pandas] >=3.31.0",
+ # 2.30 needed for arrow support.
+ "google-cloud-bigquery-storage >= 2.30.0, < 3.0.0",
"google-cloud-functions >=1.12.0",
"google-cloud-bigquery-connection >=1.12.0",
"google-cloud-iam >=2.12.1",
@@ -53,7 +55,7 @@
"pyarrow >=15.0.2",
"pydata-google-auth >=1.8.2",
"requests >=2.27.1",
- "shapely >=2.0.0",
+ "shapely >=1.8.5",
"sqlglot >=23.6.3",
"tabulate >=0.9",
"ipywidgets >=7.7.1",
@@ -70,7 +72,7 @@
]
extras = {
# Optional test dependencies packages. If they're missed, may skip some tests.
- "tests": [],
+ "tests": ["freezegun", "pytest-snapshot"],
# used for local engine, which is only needed for unit tests at present.
"polars": ["polars >= 1.7.0"],
"scikit-learn": ["scikit-learn>=1.2.2"],
@@ -80,7 +82,6 @@
"pre-commit",
"nox",
"google-cloud-testutils",
- "freezegun",
],
}
extras["all"] = list(sorted(frozenset(itertools.chain.from_iterable(extras.values()))))
diff --git a/testing/constraints-3.9.txt b/testing/constraints-3.9.txt
index b0537cd035..dff245d176 100644
--- a/testing/constraints-3.9.txt
+++ b/testing/constraints-3.9.txt
@@ -19,7 +19,7 @@ pyarrow==15.0.2
pydata-google-auth==1.8.2
requests==2.27.1
scikit-learn==1.2.2
-shapely==2.0.0
+shapely==1.8.5
sqlglot==23.6.3
tabulate==0.9
ipywidgets==7.7.1
diff --git a/tests/data/scalars.jsonl b/tests/data/scalars.jsonl
index 03755c94b7..2e5a1499b9 100644
--- a/tests/data/scalars.jsonl
+++ b/tests/data/scalars.jsonl
@@ -1,9 +1,9 @@
-{"bool_col": true, "bytes_col": "SGVsbG8sIFdvcmxkIQ==", "date_col": "2021-07-21", "datetime_col": "2021-07-21 11:39:45", "geography_col": "POINT(-122.0838511 37.3860517)", "int64_col": "123456789", "int64_too": "0", "numeric_col": "1.23456789", "float64_col": "1.25", "rowindex": 0, "rowindex_2": 0, "string_col": "Hello, World!", "time_col": "11:41:43.076160", "timestamp_col": "2021-07-21T17:43:43.945289Z"}
-{"bool_col": false, "bytes_col": "44GT44KT44Gr44Gh44Gv", "date_col": "1991-02-03", "datetime_col": "1991-01-02 03:45:06", "geography_col": "POINT(-71.104 42.315)", "int64_col": "-987654321", "int64_too": "1", "numeric_col": "1.23456789", "float64_col": "2.51", "rowindex": 1, "rowindex_2": 1, "string_col": "こんにちは", "time_col": "11:14:34.701606", "timestamp_col": "2021-07-21T17:43:43.945289Z"}
-{"bool_col": true, "bytes_col": "wqFIb2xhIE11bmRvIQ==", "date_col": "2023-03-01", "datetime_col": "2023-03-01 10:55:13", "geography_col": "POINT(-0.124474760143016 51.5007826749545)", "int64_col": "314159", "int64_too": "0", "numeric_col": "101.1010101", "float64_col": "2.5e10", "rowindex": 2, "rowindex_2": 2, "string_col": " ¡Hola Mundo! ", "time_col": "23:59:59.999999", "timestamp_col": "2023-03-01T10:55:13.250125Z"}
-{"bool_col": null, "bytes_col": null, "date_col": null, "datetime_col": null, "geography_col": null, "int64_col": null, "int64_too": "1", "numeric_col": null, "float64_col": null, "rowindex": 3, "rowindex_2": 3, "string_col": null, "time_col": null, "timestamp_col": null}
-{"bool_col": false, "bytes_col": "44GT44KT44Gr44Gh44Gv", "date_col": "2021-07-21", "datetime_col": null, "geography_col": null, "int64_col": "-234892", "int64_too": "-2345", "numeric_col": null, "float64_col": null, "rowindex": 4, "rowindex_2": 4, "string_col": "Hello, World!", "time_col": null, "timestamp_col": null}
-{"bool_col": false, "bytes_col": "R8O8dGVuIFRhZw==", "date_col": "1980-03-14", "datetime_col": "1980-03-14 15:16:17", "geography_col": null, "int64_col": "55555", "int64_too": "0", "numeric_col": "5.555555", "float64_col": "555.555", "rowindex": 5, "rowindex_2": 5, "string_col": "Güten Tag!", "time_col": "15:16:17.181921", "timestamp_col": "1980-03-14T15:16:17.181921Z"}
-{"bool_col": true, "bytes_col": "SGVsbG8JQmlnRnJhbWVzIQc=", "date_col": "2023-05-23", "datetime_col": "2023-05-23 11:37:01", "geography_col": "MULTIPOINT (20 20, 10 40, 40 30, 30 10)", "int64_col": "101202303", "int64_too": "2", "numeric_col": "-10.090807", "float64_col": "-123.456", "rowindex": 6, "rowindex_2": 6, "string_col": "capitalize, This ", "time_col": "01:02:03.456789", "timestamp_col": "2023-05-23T11:42:55.000001Z"}
-{"bool_col": true, "bytes_col": null, "date_col": "2038-01-20", "datetime_col": "2038-01-19 03:14:08", "geography_col": null, "int64_col": "-214748367", "int64_too": "2", "numeric_col": "11111111.1", "float64_col": "42.42", "rowindex": 7, "rowindex_2": 7, "string_col": " سلام", "time_col": "12:00:00.000001", "timestamp_col": "2038-01-19T03:14:17.999999Z"}
-{"bool_col": false, "bytes_col": null, "date_col": null, "datetime_col": null, "geography_col": null, "int64_col": "2", "int64_too": "1", "numeric_col": null, "float64_col": "6.87", "rowindex": 8, "rowindex_2": 8, "string_col": "T", "time_col": null, "timestamp_col": null}
\ No newline at end of file
+{"bool_col": true, "bytes_col": "SGVsbG8sIFdvcmxkIQ==", "date_col": "2021-07-21", "datetime_col": "2021-07-21 11:39:45", "geography_col": "POINT(-122.0838511 37.3860517)", "int64_col": "123456789", "int64_too": "0", "numeric_col": "1.23456789", "float64_col": "1.25", "rowindex": 0, "rowindex_2": 0, "string_col": "Hello, World!", "time_col": "11:41:43.076160", "timestamp_col": "2021-07-21T17:43:43.945289Z"}
+{"bool_col": false, "bytes_col": "44GT44KT44Gr44Gh44Gv", "date_col": "1991-02-03", "datetime_col": "1991-01-02 03:45:06", "geography_col": "POINT(-71.104 42.315)", "int64_col": "-987654321", "int64_too": "1", "numeric_col": "1.23456789", "float64_col": "2.51", "rowindex": 1, "rowindex_2": 1, "string_col": "こんにちは", "time_col": "11:14:34.701606", "timestamp_col": "2021-07-21T17:43:43.945289Z"}
+{"bool_col": true, "bytes_col": "wqFIb2xhIE11bmRvIQ==", "date_col": "2023-03-01", "datetime_col": "2023-03-01 10:55:13", "geography_col": "POINT(-0.124474760143016 51.5007826749545)", "int64_col": "314159", "int64_too": "0", "numeric_col": "101.1010101", "float64_col": "2.5e10", "rowindex": 2, "rowindex_2": 2, "string_col": " ¡Hola Mundo! ", "time_col": "23:59:59.999999", "timestamp_col": "2023-03-01T10:55:13.250125Z"}
+{"bool_col": null, "bytes_col": null, "date_col": null, "datetime_col": null, "geography_col": null, "int64_col": null, "int64_too": "1", "numeric_col": null, "float64_col": null, "rowindex": 3, "rowindex_2": 3, "string_col": null, "time_col": null, "timestamp_col": null}
+{"bool_col": false, "bytes_col": "44GT44KT44Gr44Gh44Gv", "date_col": "2021-07-21", "datetime_col": null, "geography_col": null, "int64_col": "-234892", "int64_too": "-2345", "numeric_col": null, "float64_col": null, "rowindex": 4, "rowindex_2": 4, "string_col": "Hello, World!", "time_col": null, "timestamp_col": null}
+{"bool_col": false, "bytes_col": "R8O8dGVuIFRhZw==", "date_col": "1980-03-14", "datetime_col": "1980-03-14 15:16:17", "geography_col": null, "int64_col": "55555", "int64_too": "0", "numeric_col": "5.555555", "float64_col": "555.555", "rowindex": 5, "rowindex_2": 5, "string_col": "Güten Tag!", "time_col": "15:16:17.181921", "timestamp_col": "1980-03-14T15:16:17.181921Z"}
+{"bool_col": true, "bytes_col": "SGVsbG8JQmlnRnJhbWVzIQc=", "date_col": "2023-05-23", "datetime_col": "2023-05-23 11:37:01", "geography_col": "LINESTRING(-0.127959 51.507728, -0.127026 51.507473)", "int64_col": "101202303", "int64_too": "2", "numeric_col": "-10.090807", "float64_col": "-123.456", "rowindex": 6, "rowindex_2": 6, "string_col": "capitalize, This ", "time_col": "01:02:03.456789", "timestamp_col": "2023-05-23T11:42:55.000001Z"}
+{"bool_col": true, "bytes_col": null, "date_col": "2038-01-20", "datetime_col": "2038-01-19 03:14:08", "geography_col": null, "int64_col": "-214748367", "int64_too": "2", "numeric_col": "11111111.1", "float64_col": "42.42", "rowindex": 7, "rowindex_2": 7, "string_col": " سلام", "time_col": "12:00:00.000001", "timestamp_col": "2038-01-19T03:14:17.999999Z"}
+{"bool_col": false, "bytes_col": null, "date_col": null, "datetime_col": null, "geography_col": null, "int64_col": "2", "int64_too": "1", "numeric_col": null, "float64_col": "6.87", "rowindex": 8, "rowindex_2": 8, "string_col": "T", "time_col": null, "timestamp_col": null}
\ No newline at end of file
diff --git a/tests/system/large/test_session.py b/tests/system/large/test_session.py
index 90955f5ddf..d28146498d 100644
--- a/tests/system/large/test_session.py
+++ b/tests/system/large/test_session.py
@@ -23,43 +23,6 @@
import bigframes.session._io.bigquery
-@pytest.mark.parametrize(
- ("query_or_table", "index_col"),
- [
- pytest.param(
- "bigquery-public-data.patents_view.ipcr_201708",
- (),
- id="1g_table_w_default_index",
- ),
- pytest.param(
- "bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011",
- (),
- id="30g_table_w_default_index",
- ),
- # TODO(chelsealin): Disable the long run tests until we have propertily
- # ordering support to avoid materializating any data.
- # # Adding default index to large tables would take much longer time,
- # # e.g. ~5 mins for a 100G table, ~20 mins for a 1T table.
- # pytest.param(
- # "bigquery-public-data.stackoverflow.post_history",
- # ["id"],
- # id="100g_table_w_unique_column_index",
- # ),
- # pytest.param(
- # "bigquery-public-data.wise_all_sky_data_release.all_wise",
- # ["cntr"],
- # id="1t_table_w_unique_column_index",
- # ),
- ],
-)
-def test_read_gbq_for_large_tables(
- session: bigframes.Session, query_or_table, index_col
-):
- """Verify read_gbq() is able to read large tables."""
- df = session.read_gbq(query_or_table, index_col=index_col)
- assert len(df.columns) != 0
-
-
def test_close(session: bigframes.Session):
# we will create two tables and confirm that they are deleted
# when the session is closed
diff --git a/tests/system/small/bigquery/test_geo.py b/tests/system/small/bigquery/test_geo.py
index fa2c522109..be517fb5cc 100644
--- a/tests/system/small/bigquery/test_geo.py
+++ b/tests/system/small/bigquery/test_geo.py
@@ -15,6 +15,7 @@
import geopandas # type: ignore
import pandas as pd
import pandas.testing
+import pytest
from shapely.geometry import ( # type: ignore
GeometryCollection,
LineString,
@@ -94,6 +95,12 @@ def test_geo_st_difference_with_geometry_objects():
def test_geo_st_difference_with_single_geometry_object():
+ pytest.importorskip(
+ "shapely",
+ minversion="2.0.0",
+ reason="shapely objects must be hashable to include in our expression trees",
+ )
+
data1 = [
Polygon([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]),
Polygon([(0, 1), (10, 1), (10, 9), (0, 9), (0, 1)]),
@@ -205,6 +212,12 @@ def test_geo_st_distance_with_geometry_objects():
def test_geo_st_distance_with_single_geometry_object():
+ pytest.importorskip(
+ "shapely",
+ minversion="2.0.0",
+ reason="shapely objects must be hashable to include in our expression trees",
+ )
+
data1 = [
# 0.00001 is approximately 1 meter.
Polygon([(0, 0), (0.00001, 0), (0.00001, 0.00001), (0, 0.00001), (0, 0)]),
@@ -279,6 +292,12 @@ def test_geo_st_intersection_with_geometry_objects():
def test_geo_st_intersection_with_single_geometry_object():
+ pytest.importorskip(
+ "shapely",
+ minversion="2.0.0",
+ reason="shapely objects must be hashable to include in our expression trees",
+ )
+
data1 = [
Polygon([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]),
Polygon([(0, 1), (10, 1), (10, 9), (0, 9), (0, 1)]),
diff --git a/tests/system/small/bigquery/test_json.py b/tests/system/small/bigquery/test_json.py
index 00f690ed54..df5a524b55 100644
--- a/tests/system/small/bigquery/test_json.py
+++ b/tests/system/small/bigquery/test_json.py
@@ -66,7 +66,7 @@ def test_json_set_w_more_pairs():
s, json_path_value_pairs=[("$.a", 1), ("$.b", 2), ("$.a", [3, 4, 5])]
)
- expected_json = ['{"a": 3, "b": 2}', '{"a": 4, "b": 2}', '{"a": 5, "b": 2, "c": 1}']
+ expected_json = ['{"a": 3,"b":2}', '{"a":4,"b": 2}', '{"a": 5,"b":2,"c":1}']
expected = bpd.Series(expected_json, dtype=dtypes.JSON_DTYPE)
pd.testing.assert_series_equal(actual.to_pandas(), expected.to_pandas())
diff --git a/tests/system/small/ml/test_llm.py b/tests/system/small/ml/test_llm.py
index 544889bf5a..51e9d8ad6a 100644
--- a/tests/system/small/ml/test_llm.py
+++ b/tests/system/small/ml/test_llm.py
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+from typing import Callable
from unittest import mock
import pandas as pd
@@ -151,6 +152,8 @@ def test_create_load_gemini_text_generator_model(
"gemini-1.5-flash-001",
"gemini-1.5-flash-002",
"gemini-2.0-flash-exp",
+ "gemini-2.0-flash-001",
+ "gemini-2.0-flash-lite-001",
),
)
@pytest.mark.flaky(retries=2)
@@ -176,6 +179,8 @@ def test_gemini_text_generator_predict_default_params_success(
"gemini-1.5-flash-001",
"gemini-1.5-flash-002",
"gemini-2.0-flash-exp",
+ "gemini-2.0-flash-001",
+ "gemini-2.0-flash-lite-001",
),
)
@pytest.mark.flaky(retries=2)
@@ -203,6 +208,8 @@ def test_gemini_text_generator_predict_with_params_success(
"gemini-1.5-flash-001",
"gemini-1.5-flash-002",
"gemini-2.0-flash-exp",
+ "gemini-2.0-flash-001",
+ "gemini-2.0-flash-lite-001",
),
)
@pytest.mark.flaky(retries=2)
@@ -222,6 +229,47 @@ def test_gemini_text_generator_multi_cols_predict_success(
)
+@pytest.mark.parametrize(
+ "model_name",
+ (
+ "gemini-1.5-pro-preview-0514",
+ "gemini-1.5-flash-preview-0514",
+ "gemini-1.5-pro-001",
+ "gemini-1.5-pro-002",
+ "gemini-1.5-flash-001",
+ "gemini-1.5-flash-002",
+ "gemini-2.0-flash-exp",
+ ),
+)
+@pytest.mark.flaky(retries=2)
+def test_gemini_text_generator_predict_output_schema_success(
+ llm_text_df: bpd.DataFrame, model_name, session, bq_connection
+):
+ gemini_text_generator_model = llm.GeminiTextGenerator(
+ model_name=model_name, connection_name=bq_connection, session=session
+ )
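+ # Request structured output: each entry maps a result column name to its
+ # BigQuery data type.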
+ output_schema = {
+ "bool_output": "bool",
+ "int_output": "int64",
+ "float_output": "float64",
+ "str_output": "string",
+ }
+ df = gemini_text_generator_model.predict(
+ llm_text_df, output_schema=output_schema
+ ).to_pandas()
+ utils.check_pandas_df_schema_and_index(
+ df,
+ columns=list(output_schema.keys()) + ["prompt", "full_response", "status"],
+ index=3,
+ col_exact=False,
+ )
+
+ assert df["bool_output"].dtype == pd.BooleanDtype()
+ assert df["int_output"].dtype == pd.Int64Dtype()
+ assert df["float_output"].dtype == pd.Float64Dtype()
+ assert df["str_output"].dtype == pd.StringDtype(storage="pyarrow")
+
+
# Overrides __eq__ function for comparing as mock.call parameter
class EqCmpAllDataFrame(bpd.DataFrame):
def __eq__(self, other):
@@ -239,9 +287,7 @@ def __eq__(self, other):
{
"temperature": 0.9,
"max_output_tokens": 8192,
- "top_k": 40,
"top_p": 1.0,
- "flatten_json_output": True,
"ground_with_google_search": False,
},
),
@@ -251,7 +297,6 @@ def __eq__(self, other):
"max_output_tokens": 128,
"top_k": 40,
"top_p": 0.95,
- "flatten_json_output": True,
},
),
],
@@ -297,11 +342,16 @@ def test_text_generator_retry_success(
session=session,
)
+ mock_generate_text = mock.create_autospec(
+ Callable[[core.BqmlModel, bpd.DataFrame, dict], bpd.DataFrame]
+ )
mock_bqml_model = mock.create_autospec(spec=core.BqmlModel)
type(mock_bqml_model).session = mock.PropertyMock(return_value=session)
-
+ generate_text_tvf = core.BqmlModel.TvfDef(
+ mock_generate_text, "ml_generate_text_status"
+ )
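+ # TvfDef pairs the (mocked) table-valued generate function with the name of
+ # its status column.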
# Responses. Retry twice then all succeeded.
- mock_bqml_model.generate_text.side_effect = [
+ mock_generate_text.side_effect = [
EqCmpAllDataFrame(
{
"ml_generate_text_status": ["", "error", "error"],
@@ -344,32 +394,33 @@ def test_text_generator_retry_success(
)
text_generator_model._bqml_model = mock_bqml_model
- # 3rd retry isn't triggered
- result = text_generator_model.predict(df0, max_retries=3)
+ with mock.patch.object(core.BqmlModel, "generate_text_tvf", generate_text_tvf):
+ # 3rd retry isn't triggered
+ result = text_generator_model.predict(df0, max_retries=3)
- mock_bqml_model.generate_text.assert_has_calls(
- [
- mock.call(df0, options),
- mock.call(df1, options),
- mock.call(df2, options),
- ]
- )
- pd.testing.assert_frame_equal(
- result.to_pandas(),
- pd.DataFrame(
- {
- "ml_generate_text_status": ["", "", ""],
- "prompt": [
- "What is BigQuery?",
- "What is BigQuery DataFrame?",
- "What is BQML?",
- ],
- },
- index=[0, 2, 1],
- ),
- check_dtype=False,
- check_index_type=False,
- )
+ mock_generate_text.assert_has_calls(
+ [
+ mock.call(mock_bqml_model, df0, options),
+ mock.call(mock_bqml_model, df1, options),
+ mock.call(mock_bqml_model, df2, options),
+ ]
+ )
+ pd.testing.assert_frame_equal(
+ result.to_pandas(),
+ pd.DataFrame(
+ {
+ "ml_generate_text_status": ["", "", ""],
+ "prompt": [
+ "What is BigQuery?",
+ "What is BigQuery DataFrame?",
+ "What is BQML?",
+ ],
+ },
+ index=[0, 2, 1],
+ ),
+ check_dtype=False,
+ check_index_type=False,
+ )
@pytest.mark.parametrize(
@@ -383,9 +434,7 @@ def test_text_generator_retry_success(
{
"temperature": 0.9,
"max_output_tokens": 8192,
- "top_k": 40,
"top_p": 1.0,
- "flatten_json_output": True,
"ground_with_google_search": False,
},
),
@@ -395,7 +444,6 @@ def test_text_generator_retry_success(
"max_output_tokens": 128,
"top_k": 40,
"top_p": 0.95,
- "flatten_json_output": True,
},
),
],
@@ -431,10 +479,16 @@ def test_text_generator_retry_no_progress(
session=session,
)
+ mock_generate_text = mock.create_autospec(
+ Callable[[core.BqmlModel, bpd.DataFrame, dict], bpd.DataFrame]
+ )
mock_bqml_model = mock.create_autospec(spec=core.BqmlModel)
type(mock_bqml_model).session = mock.PropertyMock(return_value=session)
+ generate_text_tvf = core.BqmlModel.TvfDef(
+ mock_generate_text, "ml_generate_text_status"
+ )
# Responses. Retry once, no progress, just stop.
- mock_bqml_model.generate_text.side_effect = [
+ mock_generate_text.side_effect = [
EqCmpAllDataFrame(
{
"ml_generate_text_status": ["", "error", "error"],
@@ -467,31 +521,32 @@ def test_text_generator_retry_no_progress(
)
text_generator_model._bqml_model = mock_bqml_model
- # No progress, only conduct retry once
- result = text_generator_model.predict(df0, max_retries=3)
+ with mock.patch.object(core.BqmlModel, "generate_text_tvf", generate_text_tvf):
+ # No progress, only conduct retry once
+ result = text_generator_model.predict(df0, max_retries=3)
- mock_bqml_model.generate_text.assert_has_calls(
- [
- mock.call(df0, options),
- mock.call(df1, options),
- ]
- )
- pd.testing.assert_frame_equal(
- result.to_pandas(),
- pd.DataFrame(
- {
- "ml_generate_text_status": ["", "error", "error"],
- "prompt": [
- "What is BigQuery?",
- "What is BQML?",
- "What is BigQuery DataFrame?",
- ],
- },
- index=[0, 1, 2],
- ),
- check_dtype=False,
- check_index_type=False,
- )
+ mock_generate_text.assert_has_calls(
+ [
+ mock.call(mock_bqml_model, df0, options),
+ mock.call(mock_bqml_model, df1, options),
+ ]
+ )
+ pd.testing.assert_frame_equal(
+ result.to_pandas(),
+ pd.DataFrame(
+ {
+ "ml_generate_text_status": ["", "error", "error"],
+ "prompt": [
+ "What is BigQuery?",
+ "What is BQML?",
+ "What is BigQuery DataFrame?",
+ ],
+ },
+ index=[0, 1, 2],
+ ),
+ check_dtype=False,
+ check_index_type=False,
+ )
def test_text_embedding_generator_retry_success(session, bq_connection):
@@ -529,11 +584,17 @@ def test_text_embedding_generator_retry_success(session, bq_connection):
session=session,
)
+ mock_generate_embedding = mock.create_autospec(
+ Callable[[core.BqmlModel, bpd.DataFrame, dict], bpd.DataFrame]
+ )
mock_bqml_model = mock.create_autospec(spec=core.BqmlModel)
type(mock_bqml_model).session = mock.PropertyMock(return_value=session)
+ generate_embedding_tvf = core.BqmlModel.TvfDef(
+ mock_generate_embedding, "ml_generate_embedding_status"
+ )
# Responses. Retry twice then all succeeded.
- mock_bqml_model.generate_embedding.side_effect = [
+ mock_generate_embedding.side_effect = [
EqCmpAllDataFrame(
{
"ml_generate_embedding_status": ["", "error", "error"],
@@ -568,41 +629,42 @@ def test_text_embedding_generator_retry_success(session, bq_connection):
session=session,
),
]
- options = {
- "flatten_json_output": True,
- }
+ options: dict = {}
text_embedding_model = llm.TextEmbeddingGenerator(
connection_name=bq_connection, session=session
)
text_embedding_model._bqml_model = mock_bqml_model
- # 3rd retry isn't triggered
- result = text_embedding_model.predict(df0, max_retries=3)
-
- mock_bqml_model.generate_embedding.assert_has_calls(
- [
- mock.call(df0, options),
- mock.call(df1, options),
- mock.call(df2, options),
- ]
- )
- pd.testing.assert_frame_equal(
- result.to_pandas(),
- pd.DataFrame(
- {
- "ml_generate_embedding_status": ["", "", ""],
- "content": [
- "What is BigQuery?",
- "What is BigQuery DataFrame?",
- "What is BQML?",
- ],
- },
- index=[0, 2, 1],
- ),
- check_dtype=False,
- check_index_type=False,
- )
+ with mock.patch.object(
+ core.BqmlModel, "generate_embedding_tvf", generate_embedding_tvf
+ ):
+ # 3rd retry isn't triggered
+ result = text_embedding_model.predict(df0, max_retries=3)
+
+ mock_generate_embedding.assert_has_calls(
+ [
+ mock.call(mock_bqml_model, df0, options),
+ mock.call(mock_bqml_model, df1, options),
+ mock.call(mock_bqml_model, df2, options),
+ ]
+ )
+ pd.testing.assert_frame_equal(
+ result.to_pandas(),
+ pd.DataFrame(
+ {
+ "ml_generate_embedding_status": ["", "", ""],
+ "content": [
+ "What is BigQuery?",
+ "What is BigQuery DataFrame?",
+ "What is BQML?",
+ ],
+ },
+ index=[0, 2, 1],
+ ),
+ check_dtype=False,
+ check_index_type=False,
+ )
def test_text_embedding_generator_retry_no_progress(session, bq_connection):
@@ -630,10 +692,17 @@ def test_text_embedding_generator_retry_no_progress(session, bq_connection):
session=session,
)
+ mock_generate_embedding = mock.create_autospec(
+ Callable[[core.BqmlModel, bpd.DataFrame, dict], bpd.DataFrame]
+ )
mock_bqml_model = mock.create_autospec(spec=core.BqmlModel)
type(mock_bqml_model).session = mock.PropertyMock(return_value=session)
+ generate_embedding_tvf = core.BqmlModel.TvfDef(
+ mock_generate_embedding, "ml_generate_embedding_status"
+ )
+
# Responses. Retry once, no progress, just stop.
- mock_bqml_model.generate_embedding.side_effect = [
+ mock_generate_embedding.side_effect = [
EqCmpAllDataFrame(
{
"ml_generate_embedding_status": ["", "error", "error"],
@@ -658,40 +727,41 @@ def test_text_embedding_generator_retry_no_progress(session, bq_connection):
session=session,
),
]
- options = {
- "flatten_json_output": True,
- }
+ options: dict = {}
text_embedding_model = llm.TextEmbeddingGenerator(
connection_name=bq_connection, session=session
)
text_embedding_model._bqml_model = mock_bqml_model
- # No progress, only conduct retry once
- result = text_embedding_model.predict(df0, max_retries=3)
+ with mock.patch.object(
+ core.BqmlModel, "generate_embedding_tvf", generate_embedding_tvf
+ ):
+ # No progress, only conduct retry once
+ result = text_embedding_model.predict(df0, max_retries=3)
- mock_bqml_model.generate_embedding.assert_has_calls(
- [
- mock.call(df0, options),
- mock.call(df1, options),
- ]
- )
- pd.testing.assert_frame_equal(
- result.to_pandas(),
- pd.DataFrame(
- {
- "ml_generate_embedding_status": ["", "error", "error"],
- "content": [
- "What is BigQuery?",
- "What is BQML?",
- "What is BigQuery DataFrame?",
- ],
- },
- index=[0, 1, 2],
- ),
- check_dtype=False,
- check_index_type=False,
- )
+ mock_generate_embedding.assert_has_calls(
+ [
+ mock.call(mock_bqml_model, df0, options),
+ mock.call(mock_bqml_model, df1, options),
+ ]
+ )
+ pd.testing.assert_frame_equal(
+ result.to_pandas(),
+ pd.DataFrame(
+ {
+ "ml_generate_embedding_status": ["", "error", "error"],
+ "content": [
+ "What is BigQuery?",
+ "What is BQML?",
+ "What is BigQuery DataFrame?",
+ ],
+ },
+ index=[0, 1, 2],
+ ),
+ check_dtype=False,
+ check_index_type=False,
+ )
@pytest.mark.flaky(retries=2)
@@ -700,6 +770,8 @@ def test_text_embedding_generator_retry_no_progress(session, bq_connection):
(
"gemini-1.5-pro-002",
"gemini-1.5-flash-002",
+ "gemini-2.0-flash-001",
+ "gemini-2.0-flash-lite-001",
),
)
def test_llm_gemini_score(llm_fine_tune_df_default_index, model_name):
@@ -728,6 +800,8 @@ def test_llm_gemini_score(llm_fine_tune_df_default_index, model_name):
(
"gemini-1.5-pro-002",
"gemini-1.5-flash-002",
+ "gemini-2.0-flash-001",
+ "gemini-2.0-flash-lite-001",
),
)
def test_llm_gemini_pro_score_params(llm_fine_tune_df_default_index, model_name):
diff --git a/tests/system/small/ml/test_multimodal_llm.py b/tests/system/small/ml/test_multimodal_llm.py
index 51e6bcb2d5..7c07d9ead2 100644
--- a/tests/system/small/ml/test_multimodal_llm.py
+++ b/tests/system/small/ml/test_multimodal_llm.py
@@ -47,6 +47,7 @@ def test_multimodal_embedding_generator_predict_default_params_success(
"gemini-1.5-flash-001",
"gemini-1.5-flash-002",
"gemini-2.0-flash-exp",
+ "gemini-2.0-flash-001",
),
)
@pytest.mark.flaky(retries=2)
diff --git a/tests/system/small/test_dataframe.py b/tests/system/small/test_dataframe.py
index e77319b551..362d736aeb 100644
--- a/tests/system/small/test_dataframe.py
+++ b/tests/system/small/test_dataframe.py
@@ -83,6 +83,7 @@ def test_df_construct_pandas_default(scalars_dfs):
("bigquery_inline"),
("bigquery_load"),
("bigquery_streaming"),
+ ("bigquery_write"),
],
)
def test_read_pandas_all_nice_types(
@@ -220,6 +221,21 @@ def test_get_column_nonstring(scalars_dfs):
assert_series_equal(bf_result, pd_result)
+@pytest.mark.parametrize(
+ "row_slice",
+ [
+ (slice(1, 7, 2)),
+ (slice(1, 7, None)),
+ (slice(None, -3, None)),
+ ],
+)
+def test_get_rows_with_slice(scalars_dfs, row_slice):
+ scalars_df, scalars_pandas_df = scalars_dfs
+ bf_result = scalars_df[row_slice].to_pandas()
+ pd_result = scalars_pandas_df[row_slice]
+ assert_pandas_df_equal(bf_result, pd_result)
+
+
def test_hasattr(scalars_dfs):
scalars_df, _ = scalars_dfs
assert hasattr(scalars_df, "int64_col")
@@ -1772,7 +1788,7 @@ def test_len(scalars_dfs):
)
@pytest.mark.parametrize(
"write_engine",
- ["bigquery_load", "bigquery_streaming"],
+ ["bigquery_load", "bigquery_streaming", "bigquery_write"],
)
def test_df_len_local(session, n_rows, write_engine):
assert (
@@ -5283,6 +5299,16 @@ def test_to_gbq_and_create_dataset(session, scalars_df_index, dataset_id_not_cre
assert not loaded_scalars_df_index.empty
+def test_read_gbq_to_pandas_no_exec(unordered_session: bigframes.Session):
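+ # Materializing a plain read_gbq result should not increment the session's
+ # execution count.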
+ metrics = unordered_session._metrics
+ execs_pre = metrics.execution_count
+ df = unordered_session.read_gbq("bigquery-public-data.ml_datasets.penguins")
+ df.to_pandas()
+ execs_post = metrics.execution_count
+ assert df.shape == (344, 7)
+ assert execs_pre == execs_post
+
+
def test_to_gbq_table_labels(scalars_df_index):
destination_table = "bigframes-dev.bigframes_tests_sys.table_labels"
result_table = scalars_df_index.to_gbq(
diff --git a/tests/system/small/test_dataframe_io.py b/tests/system/small/test_dataframe_io.py
index a69c26bc54..e12db3f598 100644
--- a/tests/system/small/test_dataframe_io.py
+++ b/tests/system/small/test_dataframe_io.py
@@ -552,6 +552,89 @@ def test_to_gbq_w_duplicate_column_names(
)
+def test_to_gbq_w_flexible_column_names(
+ scalars_df_index, dataset_id: str, bigquery_client
+):
+ """Test the `to_gbq` API when dealing with flexible column names.
+
+ This test is for BigQuery-backed storage nodes.
+
+ See: https://cloud.google.com/bigquery/docs/schemas#flexible-column-names
+ """
+ destination_table = f"{dataset_id}.test_to_gbq_w_flexible_column_names"
+ renamed_columns = {
+ # First column in Japanese (tests unicode).
+ "bool_col": "最初のカラム",
+ "bytes_col": "col with space",
+ # Dots aren't allowed in BigQuery column names, so these should be translated
+ "date_col": "col.with.dots",
+ "datetime_col": "col-with-hyphens",
+ "geography_col": "1start_with_number",
+ "int64_col": "col_with_underscore",
+ # Just numbers.
+ "int64_too": "123",
+ }
+ bf_df = scalars_df_index[renamed_columns.keys()].rename(columns=renamed_columns)
+ assert list(bf_df.columns) == list(renamed_columns.values())
+ bf_df.to_gbq(destination_table, index=False)
+
+ table = bigquery_client.get_table(destination_table)
+ columns = [field.name for field in table.schema]
+ assert columns == [
+ "最初のカラム",
+ "col with space",
+ # Dots aren't allowed in BigQuery column names, so these should be translated
+ "col_with_dots",
+ "col-with-hyphens",
+ "1start_with_number",
+ "col_with_underscore",
+ "123",
+ ]
+
+
+def test_to_gbq_w_flexible_column_names_local_node(
+ session, dataset_id: str, bigquery_client
+):
+ """Test the `to_gbq` API when dealing with flexible column names.
+
+ This test is for local nodes, e.g. read_pandas(), since those may go through
+ a different code path compared to data that starts in BigQuery.
+
+ See: https://cloud.google.com/bigquery/docs/schemas#flexible-column-names
+ """
+ destination_table = f"{dataset_id}.test_to_gbq_w_flexible_column_names_local_node"
+
+ data = {
+ # First column in Japanese (tests unicode).
+ "最初のカラム": [1, 2, 3],
+ "col with space": [4, 5, 6],
+ # Dots aren't allowed in BigQuery column names, so these should be translated
+ "col.with.dots": [7, 8, 9],
+ "col-with-hyphens": [10, 11, 12],
+ "1start_with_number": [13, 14, 15],
+ "col_with_underscore": [16, 17, 18],
+ "123": [19, 20, 21],
+ }
+ pd_df = pd.DataFrame(data)
+ assert list(pd_df.columns) == list(data.keys())
+ bf_df = session.read_pandas(pd_df)
+ assert list(bf_df.columns) == list(data.keys())
+ bf_df.to_gbq(destination_table, index=False)
+
+ table = bigquery_client.get_table(destination_table)
+ columns = [field.name for field in table.schema]
+ assert columns == [
+ "最初のカラム",
+ "col with space",
+ # Dots aren't allowed in BigQuery column names, so these should be translated
+ "col_with_dots",
+ "col-with-hyphens",
+ "1start_with_number",
+ "col_with_underscore",
+ "123",
+ ]
+
+
def test_to_gbq_w_None_column_names(
scalars_df_index, scalars_pandas_df_index, dataset_id
):
diff --git a/tests/system/small/test_session.py b/tests/system/small/test_session.py
index c7e7fa3573..ced01c940f 100644
--- a/tests/system/small/test_session.py
+++ b/tests/system/small/test_session.py
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import io
+import json
import random
import re
import tempfile
@@ -32,12 +33,22 @@
import pytest
import bigframes
-import bigframes.core.indexes.base
import bigframes.dataframe
import bigframes.dtypes
import bigframes.ml.linear_model
from tests.system import utils
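+# Reusable parametrization covering every write engine, including the new
+# "bigquery_write" engine.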
+all_write_engines = pytest.mark.parametrize(
+ "write_engine",
+ [
+ "default",
+ "bigquery_inline",
+ "bigquery_load",
+ "bigquery_streaming",
+ "bigquery_write",
+ ],
+)
+
@pytest.fixture(scope="module")
def df_and_local_csv(scalars_df_index):
@@ -48,7 +59,7 @@ def df_and_local_csv(scalars_df_index):
with tempfile.TemporaryDirectory() as dir:
# Prepares local CSV file for reading
- path = dir + "/write_df_to_local_csv_file.csv"
+ path = dir + "/test_read_csv_w_local_csv.csv"
scalars_df_index.to_csv(path, index=True)
yield scalars_df_index, path
@@ -60,7 +71,19 @@ def df_and_gcs_csv(scalars_df_index, gcs_folder):
drop_columns = ["bytes_col", "datetime_col", "numeric_col", "geography_col"]
scalars_df_index = scalars_df_index.drop(columns=drop_columns)
- path = gcs_folder + "test_read_csv_w_write_engine*.csv"
+ path = gcs_folder + "test_read_csv_w_gcs_csv*.csv"
+ read_path = utils.get_first_file_from_wildcard(path)
+ scalars_df_index.to_csv(path, index=True)
+ return scalars_df_index, read_path
+
+
+@pytest.fixture(scope="module")
+def df_and_gcs_csv_for_two_columns(scalars_df_index, gcs_folder):
+ # Some tests require only two columns to be present in the CSV file.
+ selected_cols = ["bool_col", "int64_col"]
+ scalars_df_index = scalars_df_index[selected_cols]
+
+ path = gcs_folder + "df_and_gcs_csv_for_two_columns*.csv"
read_path = utils.get_first_file_from_wildcard(path)
scalars_df_index.to_csv(path, index=True)
return scalars_df_index, read_path
@@ -860,15 +883,13 @@ def test_read_pandas_tokyo(
result = session_tokyo._executor.execute(
df._block.expr, use_explicit_destination=True
)
+ assert result.query_job is not None
assert result.query_job.location == tokyo_location
assert len(expected) == result.total_rows
-@pytest.mark.parametrize(
- "write_engine",
- ["default", "bigquery_inline", "bigquery_load", "bigquery_streaming"],
-)
+@all_write_engines
def test_read_pandas_timedelta_dataframes(session, write_engine):
pytest.importorskip(
"pandas",
@@ -886,10 +907,7 @@ def test_read_pandas_timedelta_dataframes(session, write_engine):
pd.testing.assert_frame_equal(actual_result, expected_result)
-@pytest.mark.parametrize(
- "write_engine",
- ["default", "bigquery_inline", "bigquery_load", "bigquery_streaming"],
-)
+@all_write_engines
def test_read_pandas_timedelta_series(session, write_engine):
expected_series = pd.Series(pd.to_timedelta([1, 2, 3], unit="d"))
@@ -904,10 +922,7 @@ def test_read_pandas_timedelta_series(session, write_engine):
)
-@pytest.mark.parametrize(
- "write_engine",
- ["default", "bigquery_inline", "bigquery_load", "bigquery_streaming"],
-)
+@all_write_engines
def test_read_pandas_timedelta_index(session, write_engine):
expected_index = pd.to_timedelta(
[1, 2, 3], unit="d"
@@ -922,15 +937,7 @@ def test_read_pandas_timedelta_index(session, write_engine):
pd.testing.assert_index_equal(actual_result, expected_index)
-@pytest.mark.parametrize(
- ("write_engine"),
- [
- pytest.param("default"),
- pytest.param("bigquery_load"),
- pytest.param("bigquery_streaming"),
- pytest.param("bigquery_inline"),
- ],
-)
+@all_write_engines
def test_read_pandas_json_dataframes(session, write_engine):
json_data = [
"1",
@@ -949,15 +956,7 @@ def test_read_pandas_json_dataframes(session, write_engine):
pd.testing.assert_frame_equal(actual_result, expected_df, check_index_type=False)
-@pytest.mark.parametrize(
- ("write_engine"),
- [
- pytest.param("default"),
- pytest.param("bigquery_load"),
- pytest.param("bigquery_streaming"),
- pytest.param("bigquery_inline"),
- ],
-)
+@all_write_engines
def test_read_pandas_json_series(session, write_engine):
json_data = [
"1",
@@ -975,37 +974,18 @@ def test_read_pandas_json_series(session, write_engine):
)
-@pytest.mark.parametrize(
- ("write_engine"),
- [
- pytest.param("default"),
- pytest.param("bigquery_inline"),
- pytest.param("bigquery_load"),
- pytest.param("bigquery_streaming"),
- ],
-)
+@all_write_engines
def test_read_pandas_json_series_w_invalid_json(session, write_engine):
json_data = [
"False", # Should be "false"
]
pd_s = pd.Series(json_data, dtype=bigframes.dtypes.JSON_DTYPE)
- with pytest.raises(
- ValueError,
- match="Invalid JSON format found",
- ):
+ with pytest.raises(json.JSONDecodeError):
session.read_pandas(pd_s, write_engine=write_engine)
-@pytest.mark.parametrize(
- ("write_engine"),
- [
- pytest.param("default"),
- pytest.param("bigquery_load"),
- pytest.param("bigquery_streaming"),
- pytest.param("bigquery_inline", marks=pytest.mark.xfail(raises=ValueError)),
- ],
-)
+@all_write_engines
def test_read_pandas_json_index(session, write_engine):
json_data = [
"1",
@@ -1052,6 +1032,7 @@ def test_read_pandas_w_nested_json_fails(session, write_engine):
pytest.param("default"),
pytest.param("bigquery_inline"),
pytest.param("bigquery_streaming"),
+ pytest.param("bigquery_write"),
],
)
def test_read_pandas_w_nested_json(session, write_engine):
@@ -1101,7 +1082,7 @@ def test_read_pandas_w_nested_invalid_json(session, write_engine):
),
)
- with pytest.raises(ValueError, match="Invalid JSON format found"):
+ with pytest.raises(json.JSONDecodeError):
session.read_pandas(pd_s, write_engine=write_engine)
@@ -1137,6 +1118,7 @@ def test_read_pandas_w_nested_json_index_fails(session, write_engine):
pytest.param("default"),
pytest.param("bigquery_inline"),
pytest.param("bigquery_streaming"),
+ pytest.param("bigquery_write"),
],
)
def test_read_pandas_w_nested_json_index(session, write_engine):
@@ -1159,15 +1141,7 @@ def test_read_pandas_w_nested_json_index(session, write_engine):
pd.testing.assert_index_equal(bq_idx, pd_idx)
-@pytest.mark.parametrize(
- ("write_engine",),
- (
- ("default",),
- ("bigquery_inline",),
- ("bigquery_load",),
- ("bigquery_streaming",),
- ),
-)
+@all_write_engines
def test_read_csv_for_gcs_file_w_write_engine(session, df_and_gcs_csv, write_engine):
scalars_df, path = df_and_gcs_csv
@@ -1298,6 +1272,98 @@ def test_read_csv_raises_error_for_invalid_index_col(
session.read_csv(path, engine="bigquery", index_col=index_col)
+def test_read_csv_for_names(session, df_and_gcs_csv_for_two_columns):
+ _, path = df_and_gcs_csv_for_two_columns
+
+ names = ["a", "b", "c"]
+ bf_df = session.read_csv(path, engine="bigquery", names=names)
+
+ # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
+ pd_df = session.read_csv(path, names=names, dtype=bf_df.dtypes.to_dict())
+
+ assert bf_df.shape == pd_df.shape
+ assert bf_df.columns.tolist() == pd_df.columns.tolist()
+
+ # BigFrames requires `sort_index()` because BigQuery doesn't preserve row IDs
+ # (b/280889935) or guarantee row ordering.
+ bf_df = bf_df.set_index(names[0]).sort_index()
+ pd_df = pd_df.set_index(names[0])
+ pd.testing.assert_frame_equal(bf_df.to_pandas(), pd_df.to_pandas())
+
+
+def test_read_csv_for_names_more_than_columns_can_raise_error(
+ session, df_and_gcs_csv_for_two_columns
+):
+ _, path = df_and_gcs_csv_for_two_columns
+ names = ["a", "b", "c", "d"]
+ with pytest.raises(
+ ValueError,
+ match="Too many columns specified: expected 3 and found 4",
+ ):
+ session.read_csv(path, engine="bigquery", names=names)
+
+
+def test_read_csv_for_names_less_than_columns(session, df_and_gcs_csv_for_two_columns):
+ _, path = df_and_gcs_csv_for_two_columns
+
+ names = ["b", "c"]
+ bf_df = session.read_csv(path, engine="bigquery", names=names)
+
+ # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
+ pd_df = session.read_csv(path, names=names, dtype=bf_df.dtypes.to_dict())
+
+ assert bf_df.shape == pd_df.shape
+ assert bf_df.columns.tolist() == pd_df.columns.tolist()
+
+ # BigFrames requires `sort_index()` because BigQuery doesn't preserve row IDs
+ # (b/280889935) or guarantee row ordering.
+ bf_df = bf_df.sort_index()
+
+ # Pandas's index name is None, while BigFrames's index name is "rowindex".
+ pd_df.index.name = "rowindex"
+ pd.testing.assert_frame_equal(bf_df.to_pandas(), pd_df.to_pandas())
+
+
+def test_read_csv_for_names_less_than_columns_raise_error_when_index_col_set(
+ session, df_and_gcs_csv_for_two_columns
+):
+ _, path = df_and_gcs_csv_for_two_columns
+
+ names = ["b", "c"]
+ with pytest.raises(
+ KeyError,
+ match="ensure the number of `names` matches the number of columns in your data.",
+ ):
+ session.read_csv(path, engine="bigquery", names=names, index_col="rowindex")
+
+
+@pytest.mark.parametrize(
+ "index_col",
+ [
+ pytest.param("a", id="single_str"),
+ pytest.param(["a", "b"], id="multi_str"),
+ pytest.param(0, id="single_int"),
+ ],
+)
+def test_read_csv_for_names_and_index_col(
+ session, df_and_gcs_csv_for_two_columns, index_col
+):
+ _, path = df_and_gcs_csv_for_two_columns
+ names = ["a", "b", "c"]
+ bf_df = session.read_csv(path, engine="bigquery", index_col=index_col, names=names)
+
+ # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
+ pd_df = session.read_csv(
+ path, index_col=index_col, names=names, dtype=bf_df.dtypes.to_dict()
+ )
+
+ assert bf_df.shape == pd_df.shape
+ assert bf_df.columns.tolist() == pd_df.columns.tolist()
+ pd.testing.assert_frame_equal(
+ bf_df.to_pandas(), pd_df.to_pandas(), check_index_type=False
+ )
+
+
@pytest.mark.parametrize(
("kwargs", "match"),
[
diff --git a/tests/unit/core/compile/sqlglot/compiler_session.py b/tests/unit/core/compile/sqlglot/compiler_session.py
new file mode 100644
index 0000000000..eddae8f891
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/compiler_session.py
@@ -0,0 +1,76 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import dataclasses
+import typing
+import weakref
+
+import bigframes.core
+import bigframes.core.compile.sqlglot as sqlglot
+import bigframes.dataframe
+import bigframes.session.executor
+import bigframes.session.metrics
+
+
+@dataclasses.dataclass
+class SQLCompilerExecutor(bigframes.session.executor.Executor):
+ """Executor for SQL compilation using sqlglot."""
+
+ compiler = sqlglot.SQLGlotCompiler()
+
+ def to_sql(
+ self,
+ array_value: bigframes.core.ArrayValue,
+ offset_column: typing.Optional[str] = None,
+ ordered: bool = True,
+ enable_cache: bool = False,
+ ) -> str:
+ if offset_column:
+ array_value, _ = array_value.promote_offsets()
+
+ # Compared with BigQueryCachingExecutor, SQLCompilerExecutor skips
+ # caching the subtree.
+ return self.compiler.compile(array_value.node, ordered=ordered)
+
+
+class SQLCompilerSession(bigframes.session.Session):
+ """Session for SQL compilation using sqlglot."""
+
+ def __init__(self):
+ # TODO: remove unused attributes.
+ self._location = None # type: ignore
+ self._bq_kms_key_name = None # type: ignore
+ self._clients_provider = None # type: ignore
+ self.ibis_client = None # type: ignore
+ self._bq_connection = None # type: ignore
+ self._skip_bq_connection_check = True
+ self._objects: list[
+ weakref.ReferenceType[
+ typing.Union[
+ bigframes.core.indexes.Index,
+ bigframes.series.Series,
+ bigframes.dataframe.DataFrame,
+ ]
+ ]
+ ] = []
+ self._strictly_ordered: bool = True
+ self._allow_ambiguity = False # type: ignore
+ self._default_index_type = bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64
+ self._metrics = bigframes.session.metrics.ExecutionMetrics()
+ self._remote_function_session = None # type: ignore
+ self._temp_storage_manager = None # type: ignore
+ self._loader = None # type: ignore
+
+ self._session_id: str = "sqlglot_unit_tests_session"
+ self._executor = SQLCompilerExecutor()
diff --git a/tests/unit/core/compile/sqlglot/conftest.py b/tests/unit/core/compile/sqlglot/conftest.py
new file mode 100644
index 0000000000..4d871fd707
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/conftest.py
@@ -0,0 +1,112 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pathlib
+
+import pandas as pd
+import pyarrow as pa
+import pytest
+
+from bigframes import dtypes
+import tests.system.utils
+
+CURRENT_DIR = pathlib.Path(__file__).parent
+DATA_DIR = CURRENT_DIR.parent.parent.parent.parent / "data"
+
+
+@pytest.fixture(scope="session")
+def compiler_session():
+ from . import compiler_session
+
+ return compiler_session.SQLCompilerSession()
+
+
+@pytest.fixture(scope="session")
+def scalars_types_pandas_df() -> pd.DataFrame:
+ """Returns a pandas DataFrame containing all scalar types and using the `rowindex`
+ column as the index."""
+ # TODO: add tests for empty dataframes
+ df = pd.read_json(
+ DATA_DIR / "scalars.jsonl",
+ lines=True,
+ )
+ tests.system.utils.convert_pandas_dtypes(df, bytes_col=True)
+
+ df = df.set_index("rowindex", drop=False)
+ return df
+
+
+@pytest.fixture(scope="session")
+def nested_structs_pandas_df() -> pd.DataFrame:
+ """Returns a pandas DataFrame containing STRUCT types and using the `id`
+ column as the index."""
+
+ df = pd.read_json(
+ DATA_DIR / "nested_structs.jsonl",
+ lines=True,
+ )
+ df = df.set_index("id")
+
+ address_struct_schema = pa.struct(
+ [pa.field("city", pa.string()), pa.field("country", pa.string())]
+ )
+ person_struct_schema = pa.struct(
+ [
+ pa.field("name", pa.string()),
+ pa.field("age", pa.int64()),
+ pa.field("address", address_struct_schema),
+ ]
+ )
+ df["person"] = df["person"].astype(pd.ArrowDtype(person_struct_schema))
+ return df
+
+
+@pytest.fixture(scope="session")
+def repeated_pandas_df() -> pd.DataFrame:
+ """Returns a pandas DataFrame containing LIST types and using the `rowindex`
+ column as the index."""
+
+ df = pd.read_json(
+ DATA_DIR / "repeated.jsonl",
+ lines=True,
+ )
+ df = df.set_index("rowindex")
+ return df
+
+
+@pytest.fixture(scope="session")
+def json_pandas_df() -> pd.DataFrame:
+ """Returns a pandas DataFrame containing JSON types and using the `rowindex`
+ column as the index."""
+ json_data = [
+ "null",
+ "true",
+ "100",
+ "0.98",
+ '"a string"',
+ "[]",
+ "[1, 2, 3]",
+ '[{"a": 1}, {"a": 2}, {"a": null}, {}]',
+ '"100"',
+ '{"date": "2024-07-16"}',
+ '{"int_value": 2, "null_filed": null}',
+ '{"list_data": [10, 20, 30]}',
+ ]
+ df = pd.DataFrame(
+ {
+ "json_col": pd.Series(json_data, dtype=dtypes.JSON_DTYPE),
+ },
+ index=pd.Series(range(len(json_data)), dtype=dtypes.INT_DTYPE),
+ )
+ return df
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal/out.sql
new file mode 100644
index 0000000000..0ef80dc8b0
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal/out.sql
@@ -0,0 +1,171 @@
+SELECT
+ `bfcol_0` AS `bfcol_16`,
+ `bfcol_1` AS `bfcol_17`,
+ `bfcol_2` AS `bfcol_18`,
+ `bfcol_3` AS `bfcol_19`,
+ `bfcol_4` AS `bfcol_20`,
+ `bfcol_5` AS `bfcol_21`,
+ `bfcol_6` AS `bfcol_22`,
+ `bfcol_7` AS `bfcol_23`,
+ `bfcol_8` AS `bfcol_24`,
+ `bfcol_9` AS `bfcol_25`,
+ `bfcol_10` AS `bfcol_26`,
+ `bfcol_11` AS `bfcol_27`,
+ `bfcol_12` AS `bfcol_28`,
+ `bfcol_13` AS `bfcol_29`,
+ `bfcol_14` AS `bfcol_30`,
+ `bfcol_15` AS `bfcol_31`
+FROM UNNEST(ARRAY<STRUCT<`bfcol_0` INT64, `bfcol_1` BOOLEAN, `bfcol_2` BYTES, `bfcol_3` DATE, `bfcol_4` DATETIME, `bfcol_5` GEOGRAPHY, `bfcol_6` INT64, `bfcol_7` INT64, `bfcol_8` NUMERIC, `bfcol_9` FLOAT64, `bfcol_10` INT64, `bfcol_11` INT64, `bfcol_12` STRING, `bfcol_13` TIME, `bfcol_14` TIMESTAMP, `bfcol_15` INT64>>[STRUCT(
+ 0,
+ TRUE,
+ CAST(b'Hello, World!' AS BYTES),
+ CAST('2021-07-21' AS DATE),
+ CAST('2021-07-21T11:39:45' AS DATETIME),
+ ST_GEOGFROMTEXT('POINT (-122.0838511 37.3860517)'),
+ 123456789,
+ 0,
+ 1.234567890,
+ 1.25,
+ 0,
+ 0,
+ 'Hello, World!',
+ CAST('11:41:43.076160' AS TIME),
+ CAST('2021-07-21T17:43:43.945289+00:00' AS TIMESTAMP),
+ 0
+), STRUCT(
+ 1,
+ FALSE,
+ CAST(b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf' AS BYTES),
+ CAST('1991-02-03' AS DATE),
+ CAST('1991-01-02T03:45:06' AS DATETIME),
+ ST_GEOGFROMTEXT('POINT (-71.104 42.315)'),
+ -987654321,
+ 1,
+ 1.234567890,
+ 2.51,
+ 1,
+ 1,
+ 'こんにちは',
+ CAST('11:14:34.701606' AS TIME),
+ CAST('2021-07-21T17:43:43.945289+00:00' AS TIMESTAMP),
+ 1
+), STRUCT(
+ 2,
+ TRUE,
+ CAST(b'\xc2\xa1Hola Mundo!' AS BYTES),
+ CAST('2023-03-01' AS DATE),
+ CAST('2023-03-01T10:55:13' AS DATETIME),
+ ST_GEOGFROMTEXT('POINT (-0.124474760143016 51.5007826749545)'),
+ 314159,
+ 0,
+ 101.101010100,
+ 25000000000.0,
+ 2,
+ 2,
+ ' ¡Hola Mundo! ',
+ CAST('23:59:59.999999' AS TIME),
+ CAST('2023-03-01T10:55:13.250125+00:00' AS TIMESTAMP),
+ 2
+), STRUCT(
+ 3,
+ CAST(NULL AS BOOLEAN),
+ CAST(NULL AS BYTES),
+ CAST(NULL AS DATE),
+ CAST(NULL AS DATETIME),
+ CAST(NULL AS GEOGRAPHY),
+ CAST(NULL AS INT64),
+ 1,
+ CAST(NULL AS NUMERIC),
+ CAST(NULL AS FLOAT64),
+ 3,
+ 3,
+ CAST(NULL AS STRING),
+ CAST(NULL AS TIME),
+ CAST(NULL AS TIMESTAMP),
+ 3
+), STRUCT(
+ 4,
+ FALSE,
+ CAST(b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf' AS BYTES),
+ CAST('2021-07-21' AS DATE),
+ CAST(NULL AS DATETIME),
+ CAST(NULL AS GEOGRAPHY),
+ -234892,
+ -2345,
+ CAST(NULL AS NUMERIC),
+ CAST(NULL AS FLOAT64),
+ 4,
+ 4,
+ 'Hello, World!',
+ CAST(NULL AS TIME),
+ CAST(NULL AS TIMESTAMP),
+ 4
+), STRUCT(
+ 5,
+ FALSE,
+ CAST(b'G\xc3\xbcten Tag' AS BYTES),
+ CAST('1980-03-14' AS DATE),
+ CAST('1980-03-14T15:16:17' AS DATETIME),
+ CAST(NULL AS GEOGRAPHY),
+ 55555,
+ 0,
+ 5.555555000,
+ 555.555,
+ 5,
+ 5,
+ 'Güten Tag!',
+ CAST('15:16:17.181921' AS TIME),
+ CAST('1980-03-14T15:16:17.181921+00:00' AS TIMESTAMP),
+ 5
+), STRUCT(
+ 6,
+ TRUE,
+ CAST(b'Hello\tBigFrames!\x07' AS BYTES),
+ CAST('2023-05-23' AS DATE),
+ CAST('2023-05-23T11:37:01' AS DATETIME),
+ ST_GEOGFROMTEXT('LINESTRING (-0.127959 51.507728, -0.127026 51.507473)'),
+ 101202303,
+ 2,
+ -10.090807000,
+ -123.456,
+ 6,
+ 6,
+ 'capitalize, This ',
+ CAST('01:02:03.456789' AS TIME),
+ CAST('2023-05-23T11:42:55.000001+00:00' AS TIMESTAMP),
+ 6
+), STRUCT(
+ 7,
+ TRUE,
+ CAST(NULL AS BYTES),
+ CAST('2038-01-20' AS DATE),
+ CAST('2038-01-19T03:14:08' AS DATETIME),
+ CAST(NULL AS GEOGRAPHY),
+ -214748367,
+ 2,
+ 11111111.100000000,
+ 42.42,
+ 7,
+ 7,
+ ' سلام',
+ CAST('12:00:00.000001' AS TIME),
+ CAST('2038-01-19T03:14:17.999999+00:00' AS TIMESTAMP),
+ 7
+), STRUCT(
+ 8,
+ FALSE,
+ CAST(NULL AS BYTES),
+ CAST(NULL AS DATE),
+ CAST(NULL AS DATETIME),
+ CAST(NULL AS GEOGRAPHY),
+ 2,
+ 1,
+ CAST(NULL AS NUMERIC),
+ 6.87,
+ 8,
+ 8,
+ 'T',
+ CAST(NULL AS TIME),
+ CAST(NULL AS TIMESTAMP),
+ 8
+)])
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_json_df/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_json_df/out.sql
new file mode 100644
index 0000000000..3b780e6d8e
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_json_df/out.sql
@@ -0,0 +1,4 @@
+SELECT
+ `bfcol_0` AS `bfcol_2`,
+ `bfcol_1` AS `bfcol_3`
+FROM UNNEST(ARRAY<STRUCT<`bfcol_0` JSON, `bfcol_1` INT64>>[STRUCT(PARSE_JSON('null'), 0), STRUCT(PARSE_JSON('true'), 1), STRUCT(PARSE_JSON('100'), 2), STRUCT(PARSE_JSON('0.98'), 3), STRUCT(PARSE_JSON('"a string"'), 4), STRUCT(PARSE_JSON('[]'), 5), STRUCT(PARSE_JSON('[1,2,3]'), 6), STRUCT(PARSE_JSON('[{"a":1},{"a":2},{"a":null},{}]'), 7), STRUCT(PARSE_JSON('"100"'), 8), STRUCT(PARSE_JSON('{"date":"2024-07-16"}'), 9), STRUCT(PARSE_JSON('{"int_value":2,"null_filed":null}'), 10), STRUCT(PARSE_JSON('{"list_data":[10,20,30]}'), 11)])
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_lists_df/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_lists_df/out.sql
new file mode 100644
index 0000000000..6998b41b27
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_lists_df/out.sql
@@ -0,0 +1,41 @@
+SELECT
+ `bfcol_0` AS `bfcol_9`,
+ `bfcol_1` AS `bfcol_10`,
+ `bfcol_2` AS `bfcol_11`,
+ `bfcol_3` AS `bfcol_12`,
+ `bfcol_4` AS `bfcol_13`,
+ `bfcol_5` AS `bfcol_14`,
+ `bfcol_6` AS `bfcol_15`,
+ `bfcol_7` AS `bfcol_16`,
+ `bfcol_8` AS `bfcol_17`
+FROM UNNEST(ARRAY<STRUCT<`bfcol_0` INT64, `bfcol_1` ARRAY<INT64>, `bfcol_2` ARRAY<BOOLEAN>, `bfcol_3` ARRAY<FLOAT64>, `bfcol_4` ARRAY<STRING>, `bfcol_5` ARRAY<STRING>, `bfcol_6` ARRAY<FLOAT64>, `bfcol_7` ARRAY<STRING>, `bfcol_8` INT64>>[STRUCT(
+ 0,
+ [1],
+ [TRUE],
+ [1.2, 2.3],
+ ['2021-07-21'],
+ ['2021-07-21 11:39:45'],
+ [1.2, 2.3, 3.4],
+ ['abc', 'de', 'f'],
+ 0
+), STRUCT(
+ 1,
+ [1, 2],
+ [TRUE, FALSE],
+ [1.1],
+ ['2021-07-21', '1987-03-28'],
+ ['1999-03-14 17:22:00'],
+ [5.5, 2.3],
+ ['a', 'bc', 'de'],
+ 1
+), STRUCT(
+ 2,
+ [1, 2, 3],
+ [TRUE],
+ [0.5, -1.9, 2.3],
+ ['2017-08-01', '2004-11-22'],
+ ['1979-06-03 03:20:45'],
+ [1.7000000000000002],
+ ['', 'a'],
+ 2
+)])
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_nested_structs_df/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_nested_structs_df/out.sql
new file mode 100644
index 0000000000..42b7bc7361
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_nested_structs_df/out.sql
@@ -0,0 +1,19 @@
+SELECT
+ *
+FROM UNNEST(ARRAY<STRUCT<`bfcol_0` INT64, `bfcol_1` STRUCT<`name` STRING, `age` INT64, `address` STRUCT<`city` STRING, `country` STRING>>, `bfcol_2` INT64>>[(
+ 1,
+ STRUCT(
+ 'Alice' AS `name`,
+ 30 AS `age`,
+ STRUCT('New York' AS `city`, 'USA' AS `country`) AS `address`
+ ),
+ 0
+), (
+ 2,
+ STRUCT(
+ 'Bob' AS `name`,
+ 25 AS `age`,
+ STRUCT('London' AS `city`, 'UK' AS `country`) AS `address`
+ ),
+ 1
+)])
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_structs_df/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_structs_df/out.sql
new file mode 100644
index 0000000000..99b94915bf
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readlocal/test_compile_readlocal_w_structs_df/out.sql
@@ -0,0 +1,21 @@
+SELECT
+ `bfcol_0` AS `bfcol_3`,
+ `bfcol_1` AS `bfcol_4`,
+ `bfcol_2` AS `bfcol_5`
+FROM UNNEST(ARRAY<STRUCT<`bfcol_0` INT64, `bfcol_1` STRUCT<`name` STRING, `age` INT64, `address` STRUCT<`city` STRING, `country` STRING>>, `bfcol_2` INT64>>[STRUCT(
+ 1,
+ STRUCT(
+ 'Alice' AS `name`,
+ 30 AS `age`,
+ STRUCT('New York' AS `city`, 'USA' AS `country`) AS `address`
+ ),
+ 0
+), STRUCT(
+ 2,
+ STRUCT(
+ 'Bob' AS `name`,
+ 25 AS `age`,
+ STRUCT('London' AS `city`, 'UK' AS `country`) AS `address`
+ ),
+ 1
+)])
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/test_compile_readlocal.py b/tests/unit/core/compile/sqlglot/test_compile_readlocal.py
new file mode 100644
index 0000000000..58587da129
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/test_compile_readlocal.py
@@ -0,0 +1,55 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pandas as pd
+import pytest
+
+import bigframes
+import bigframes.pandas as bpd
+
+pytest.importorskip("pytest_snapshot")
+
+
+def test_compile_readlocal(
+ scalars_types_pandas_df: pd.DataFrame, compiler_session: bigframes.Session, snapshot
+):
+ bf_df = bpd.DataFrame(scalars_types_pandas_df, session=compiler_session)
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_compile_readlocal_w_structs_df(
+ nested_structs_pandas_df: pd.DataFrame,
+ compiler_session: bigframes.Session,
+ snapshot,
+):
+ bf_df = bpd.DataFrame(nested_structs_pandas_df, session=compiler_session)
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_compile_readlocal_w_lists_df(
+ repeated_pandas_df: pd.DataFrame,
+ compiler_session: bigframes.Session,
+ snapshot,
+):
+ bf_df = bpd.DataFrame(repeated_pandas_df, session=compiler_session)
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_compile_readlocal_w_json_df(
+ json_pandas_df: pd.DataFrame,
+ compiler_session: bigframes.Session,
+ snapshot,
+):
+ bf_df = bpd.DataFrame(json_pandas_df, session=compiler_session)
+ snapshot.assert_match(bf_df.sql, "out.sql")
diff --git a/tests/unit/core/test_bf_utils.py b/tests/unit/core/test_bf_utils.py
index cb3b03d988..9b4c4f8742 100644
--- a/tests/unit/core/test_bf_utils.py
+++ b/tests/unit/core/test_bf_utils.py
@@ -46,7 +46,7 @@ def test_get_standardized_ids_indexes():
assert col_ids == ["duplicate_2"]
assert idx_ids == [
"string",
- "0",
+ "_0",
utils.UNNAMED_INDEX_ID,
"duplicate",
"duplicate_1",
diff --git a/tests/unit/core/test_blocks.py b/tests/unit/core/test_blocks.py
index b1b276bda3..7c06bedfd3 100644
--- a/tests/unit/core/test_blocks.py
+++ b/tests/unit/core/test_blocks.py
@@ -85,13 +85,11 @@ def test_block_from_local(data):
    # hard-coded the session's returned row count so that each test case contains 3 rows.
mock_session._executor = mock_executor
- mock_executor.get_row_count.return_value = 3
block = blocks.Block.from_local(pandas.DataFrame(data), mock_session)
pandas.testing.assert_index_equal(block.column_labels, expected.columns)
assert tuple(block.index.names) == tuple(expected.index.names)
- assert block.shape == expected.shape
def test_block_compute_dry_run__raises_error_when_sampling_is_enabled():
diff --git a/tests/unit/core/test_dtypes.py b/tests/unit/core/test_dtypes.py
index bbeac3602b..37658bc436 100644
--- a/tests/unit/core/test_dtypes.py
+++ b/tests/unit/core/test_dtypes.py
@@ -20,7 +20,7 @@
import pandas as pd
import pyarrow as pa # type: ignore
import pytest
-import shapely # type: ignore
+import shapely.geometry # type: ignore
import bigframes.core.compile.ibis_types
import bigframes.dtypes
@@ -231,9 +231,9 @@ def test_bigframes_string_dtype_converts(ibis_dtype, bigframes_dtype_str):
(bool, bigframes.dtypes.BOOL_DTYPE),
(int, bigframes.dtypes.INT_DTYPE),
(str, bigframes.dtypes.STRING_DTYPE),
- (shapely.Point, bigframes.dtypes.GEO_DTYPE),
- (shapely.Polygon, bigframes.dtypes.GEO_DTYPE),
- (shapely.Geometry, bigframes.dtypes.GEO_DTYPE),
+ (shapely.geometry.Point, bigframes.dtypes.GEO_DTYPE),
+ (shapely.geometry.Polygon, bigframes.dtypes.GEO_DTYPE),
+ (shapely.geometry.base.BaseGeometry, bigframes.dtypes.GEO_DTYPE),
],
)
def test_bigframes_type_supports_python_types(python_type, expected_dtype):
diff --git a/tests/unit/core/test_sql.py b/tests/unit/core/test_sql.py
index 913a5b61fe..17da3008fc 100644
--- a/tests/unit/core/test_sql.py
+++ b/tests/unit/core/test_sql.py
@@ -73,6 +73,67 @@ def test_simple_literal(value, expected_pattern):
assert re.match(expected_pattern, got) is not None
+@pytest.mark.parametrize(
+ ("value", "expected_pattern"),
+ (
+ # Try to have some list of literals for each scalar data type:
+ # https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
+ ([None, None], re.escape("[NULL, NULL]")),
+ ([True, False], re.escape("[True, False]")),
+ (
+ [b"\x01\x02\x03ABC", b"\x01\x02\x03ABC"],
+ re.escape("[b'\\x01\\x02\\x03ABC', b'\\x01\\x02\\x03ABC']"),
+ ),
+ (
+ [datetime.date(2025, 1, 1), datetime.date(2025, 1, 1)],
+ re.escape("[DATE('2025-01-01'), DATE('2025-01-01')]"),
+ ),
+ (
+ [datetime.datetime(2025, 1, 2, 3, 45, 6, 789123)],
+ re.escape("[DATETIME('2025-01-02T03:45:06.789123')]"),
+ ),
+ (
+ [shapely.geometry.Point(0, 1), shapely.geometry.Point(0, 2)],
+ r"\[ST_GEOGFROMTEXT\('POINT \(0[.]?0* 1[.]?0*\)'\), ST_GEOGFROMTEXT\('POINT \(0[.]?0* 2[.]?0*\)'\)\]",
+ ),
+ # TODO: INTERVAL type (e.g. from dateutil.relativedelta)
+ # TODO: JSON type (TBD what Python object that would correspond to)
+ ([123, 456], re.escape("[123, 456]")),
+ (
+ [decimal.Decimal("123.75"), decimal.Decimal("456.78")],
+ re.escape("[CAST('123.75' AS NUMERIC), CAST('456.78' AS NUMERIC)]"),
+ ),
+ # TODO: support BIGNUMERIC by looking at precision/scale of the DECIMAL
+ ([123.75, 456.78], re.escape("[123.75, 456.78]")),
+ # TODO: support RANGE type
+ (["abc", "def"], re.escape("['abc', 'def']")),
+ # TODO: support STRUCT type (possibly another method?)
+ (
+ [datetime.time(12, 34, 56, 789123), datetime.time(11, 25, 56, 789123)],
+ re.escape(
+ "[TIME(DATETIME('1970-01-01 12:34:56.789123')), TIME(DATETIME('1970-01-01 11:25:56.789123'))]"
+ ),
+ ),
+ (
+ [
+ datetime.datetime(
+ 2025, 1, 2, 3, 45, 6, 789123, tzinfo=datetime.timezone.utc
+ ),
+ datetime.datetime(
+ 2025, 2, 1, 4, 45, 6, 789123, tzinfo=datetime.timezone.utc
+ ),
+ ],
+ re.escape(
+ "[TIMESTAMP('2025-01-02T03:45:06.789123+00:00'), TIMESTAMP('2025-02-01T04:45:06.789123+00:00')]"
+ ),
+ ),
+ ),
+)
+def test_simple_literal_w_list(value: list, expected_pattern: str):
+ got = sql.simple_literal(value)
+ assert re.match(expected_pattern, got) is not None
+
+
def test_create_vector_search_sql_simple():
result_query = sql.create_vector_search_sql(
sql_string="SELECT embedding FROM my_embeddings_table WHERE id = 1",
diff --git a/tests/unit/session/test_session.py b/tests/unit/session/test_session.py
index 22b439a38b..91b6679702 100644
--- a/tests/unit/session/test_session.py
+++ b/tests/unit/session/test_session.py
@@ -108,14 +108,9 @@
@pytest.mark.parametrize(
("kwargs", "match"),
[
- pytest.param(
- {"engine": "bigquery", "names": []},
- "BigQuery engine does not support these arguments",
- id="with_names",
- ),
pytest.param(
{"engine": "bigquery", "dtype": {}},
- "BigQuery engine does not support these arguments",
+ "BigQuery engine does not support the `dtype` argument",
id="with_dtype",
),
pytest.param(
@@ -203,6 +198,23 @@ def test_read_csv_with_incompatible_write_engine(engine, write_engine):
)
+@pytest.mark.parametrize(
+ ("names", "error_message"),
+ (
+ pytest.param("abc", "Names should be an ordered collection."),
+ pytest.param({"a", "b", "c"}, "Names should be an ordered collection."),
+ pytest.param(["a", "a"], "Duplicated names are not allowed."),
+ ),
+)
+def test_read_csv_w_bigquery_engine_raises_error_for_invalid_names(
+ names, error_message
+):
+ session = mocks.create_bigquery_session()
+
+ with pytest.raises(ValueError, match=error_message):
+ session.read_csv("path/to/csv.csv", engine="bigquery", names=names)
+
+
@pytest.mark.parametrize("missing_parts_table_id", [(""), ("table")])
def test_read_gbq_missing_parts(missing_parts_table_id):
session = mocks.create_bigquery_session()
diff --git a/tests/unit/test_local_data.py b/tests/unit/test_local_data.py
new file mode 100644
index 0000000000..9cd08787c9
--- /dev/null
+++ b/tests/unit/test_local_data.py
@@ -0,0 +1,46 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import pandas as pd
+import pandas.testing
+import pyarrow as pa
+
+from bigframes import dtypes
+from bigframes.core import local_data
+
+pd_data = pd.DataFrame(
+ {
+ "ints": [10, 20, 30, 40],
+ "nested_ints": [[1, 2], [3, 4, 5], [], [20, 30]],
+ "structs": [{"a": 100}, {}, {"b": 200}, {"b": 300}],
+ }
+)
+
+pd_data_normalized = pd.DataFrame(
+ {
+ "ints": pd.Series([10, 20, 30, 40], dtype=dtypes.INT_DTYPE),
+ "nested_ints": pd.Series(
+ [[1, 2], [3, 4, 5], [], [20, 30]], dtype=pd.ArrowDtype(pa.list_(pa.int64()))
+ ),
+ "structs": pd.Series(
+ [{"a": 100}, {}, {"b": 200}, {"b": 300}],
+ dtype=pd.ArrowDtype(pa.struct({"a": pa.int64(), "b": pa.int64()})),
+ ),
+ }
+)
+
+
+def test_local_data_well_formed_round_trip():
+ local_entry = local_data.ManagedArrowTable.from_pandas(pd_data)
+ result = pd.DataFrame(local_entry.itertuples(), columns=pd_data.columns)
+ pandas.testing.assert_frame_equal(pd_data_normalized, result, check_dtype=False)
diff --git a/tests/unit/test_local_engine.py b/tests/unit/test_local_engine.py
index d4e0dae1f3..b4672d07a9 100644
--- a/tests/unit/test_local_engine.py
+++ b/tests/unit/test_local_engine.py
@@ -41,7 +41,7 @@ def small_inline_frame() -> pd.DataFrame:
"bools": pd.Series([True, None, False], dtype="boolean"),
"strings": pd.Series(["b", "aa", "ccc"], dtype="string[pyarrow]"),
"intLists": pd.Series(
- [[1, 2, 3], [4, 5, 6, 7], None],
+ [[1, 2, 3], [4, 5, 6, 7], []],
dtype=pd.ArrowDtype(pa.list_(pa.int64())),
),
},
diff --git a/tests/unit/test_sequences.py b/tests/unit/test_sequences.py
new file mode 100644
index 0000000000..d901670b9b
--- /dev/null
+++ b/tests/unit/test_sequences.py
@@ -0,0 +1,55 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import itertools
+from typing import Sequence
+
+import pytest
+
+from bigframes.core import sequences
+
+LARGE_LIST = list(range(100, 500))
+SMALL_LIST = list(range(1, 5))
+CHAINED_LIST = sequences.ChainedSequence([SMALL_LIST for i in range(100)])
+
+
+def _build_reference(*parts):
+ return tuple(itertools.chain(*parts))
+
+
+def _check_equivalence(expected: Sequence, actual: Sequence):
+ assert len(expected) == len(actual)
+ assert tuple(expected) == tuple(actual)
+ assert expected[10:1:-2] == actual[10:1:-2]
+ if len(expected) > 0:
+        assert expected[len(expected) - 1] == actual[len(actual) - 1]
+
+
+@pytest.mark.parametrize(
+ ("parts",),
+ [
+ ([],),
+ ([[]],),
+ ([[0, 1, 2]],),
+ ([LARGE_LIST, SMALL_LIST, LARGE_LIST],),
+ ([SMALL_LIST * 100],),
+ ([CHAINED_LIST, LARGE_LIST, CHAINED_LIST, SMALL_LIST],),
+ ],
+)
+def test_init_chained_sequence_single_list(parts):
+ value = sequences.ChainedSequence(*parts)
+ expected = _build_reference(*parts)
+ _check_equivalence(expected, value)
diff --git a/third_party/bigframes_vendored/constants.py b/third_party/bigframes_vendored/constants.py
index d11d8ba2cb..af87694cd5 100644
--- a/third_party/bigframes_vendored/constants.py
+++ b/third_party/bigframes_vendored/constants.py
@@ -47,6 +47,10 @@
)
WriteEngineType = Literal[
- "default", "bigquery_inline", "bigquery_load", "bigquery_streaming"
+ "default",
+ "bigquery_inline",
+ "bigquery_load",
+ "bigquery_streaming",
+ "bigquery_write",
]
VALID_WRITE_ENGINES = typing.get_args(WriteEngineType)
diff --git a/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py b/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py
index 7e001d1ac3..be8f9fc555 100644
--- a/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py
+++ b/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py
@@ -1067,7 +1067,6 @@ def visit_InMemoryTable(self, op, *, name, schema, data):
columns=columns,
),
)
- # return expr
return sg.select(sge.Star()).from_(expr)
def visit_ArrayAggregate(self, op, *, arg, order_by, where):
diff --git a/third_party/bigframes_vendored/pandas/io/parsers/readers.py b/third_party/bigframes_vendored/pandas/io/parsers/readers.py
index 2b1e3dd70b..4757f5ed9d 100644
--- a/third_party/bigframes_vendored/pandas/io/parsers/readers.py
+++ b/third_party/bigframes_vendored/pandas/io/parsers/readers.py
@@ -114,7 +114,7 @@ def read_csv(
names (default None):
a list of column names to use. If the file contains a header row and you
want to pass this parameter, then `header=0` should be passed as well so the
- first (header) row is ignored. Only to be used with default engine.
+ first (header) row is ignored.
index_col (default None):
column(s) to use as the row labels of the DataFrame, either given as
string name or column index. `index_col=False` can be used with the default
diff --git a/third_party/bigframes_vendored/version.py b/third_party/bigframes_vendored/version.py
index b671169b24..c6ca0ee57c 100644
--- a/third_party/bigframes_vendored/version.py
+++ b/third_party/bigframes_vendored/version.py
@@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-__version__ = "2.1.0"
+__version__ = "2.2.0"
# {x-release-please-start-date}
-__release_date__ = "2025-04-22"
+__release_date__ = "2025-04-30"
# {x-release-please-end}