- 2.25.0 (latest)
- 2.24.0
- 2.23.0
- 2.22.0
- 2.21.0
- 2.20.0
- 2.19.0
- 2.18.0
- 2.17.0
- 2.16.0
- 2.15.0
- 2.14.0
- 2.13.0
- 2.12.0
- 2.11.0
- 2.10.0
- 2.9.0
- 2.8.0
- 2.7.0
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.0
- 2.2.0
- 1.36.0
- 1.35.0
- 1.34.0
- 1.33.0
- 1.32.0
- 1.31.0
- 1.30.0
- 1.29.0
- 1.28.0
- 1.27.0
- 1.26.0
- 1.25.0
- 1.24.0
- 1.22.0
- 1.21.0
- 1.20.0
- 1.19.0
- 1.18.0
- 1.17.0
- 1.16.0
- 1.15.0
- 1.14.0
- 1.13.0
- 1.12.0
- 1.11.1
- 1.10.0
- 1.9.0
- 1.8.0
- 1.7.0
- 1.6.0
- 1.5.0
- 1.4.0
- 1.3.0
- 1.2.0
- 1.1.0
- 1.0.0
- 0.26.0
- 0.25.0
- 0.24.0
- 0.23.0
- 0.22.0
- 0.21.0
- 0.20.1
- 0.19.2
- 0.18.0
- 0.17.0
- 0.16.0
- 0.15.0
- 0.14.1
- 0.13.0
- 0.12.0
- 0.11.0
- 0.10.0
- 0.9.0
- 0.8.0
- 0.7.0
- 0.6.0
- 0.5.0
- 0.4.0
- 0.3.0
- 0.2.0
API documentation for pandas
package.
Classes
ArrowDtype
An ExtensionDtype for PyArrow data types.
BooleanDtype
Extension dtype for boolean data.
.. method:: None :noindex:
.. rubric:: Examples
pd.BooleanDtype() BooleanDtype
DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
DatetimeIndex
Immutable sequence used for indexing and alignment with datetime-like values
Float64Dtype
An ExtensionDtype for float64 data.
This dtype uses pd.NA
as missing value indicator.
For Float32Dtype:
ser = pd.Series([2.25, pd.NA], dtype=pd.Float32Dtype()) ser.dtype Float32Dtype()
For Float64Dtype:
ser = pd.Series([2.25, pd.NA], dtype=pd.Float64Dtype()) ser.dtype Float64Dtype()
Index
Immutable sequence used for indexing and alignment.
The basic object storing axis labels for all objects.
Int64Dtype
An ExtensionDtype for int64 integer data.
Uses pandas.NA
as its missing value, rather than numpy.nan
.
For Int8Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.Int8Dtype()) ser.dtype Int8Dtype()
For Int16Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.Int16Dtype()) ser.dtype Int16Dtype()
For Int32Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.Int32Dtype()) ser.dtype Int32Dtype()
For Int64Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.Int64Dtype()) ser.dtype Int64Dtype()
For UInt8Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.UInt8Dtype()) ser.dtype UInt8Dtype()
For UInt16Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.UInt16Dtype()) ser.dtype UInt16Dtype()
For UInt32Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.UInt32Dtype()) ser.dtype UInt32Dtype()
For UInt64Dtype:
ser = pd.Series([2, pd.NA], dtype=pd.UInt64Dtype()) ser.dtype UInt64Dtype()
MultiIndex
A multi-level, or hierarchical, index object for pandas objects.
NamedAgg
NamedAgg(column, aggfunc)
Series
N-dimensional analogue of DataFrame. Store multi-dimensional in a size-mutable, labeled data structure
StringDtype
Extension dtype for string data.
option_context
Context manager to temporarily set thread-local options in the with
statement context.
You need to invoke as option_context(pat, val, [(pat, val), ...])
.
import bigframes
with bigframes.option_context('display.max_rows', 10, 'display.max_columns', 5): ... pass
Packages Functions
clean_up_by_session_id
clean_up_by_session_id(
session_id: str,
location: typing.Optional[str] = None,
project: typing.Optional[str] = None,
) -> None
Searches through BigQuery tables and routines and deletes the ones created during the session with the given session id. The match is determined by having the session id present in the resource name or metadata. The cloud functions serving the cleaned up routines are also cleaned up.
This could be useful if the session object has been lost.
Calling session.close()
or <xref uid="bigframes.pandas.close_session">bigframes.pandas.close_session</xref>()
is preferred in most cases.
Parameters | |
---|---|
Name | Description |
session_id |
str The session id to clean up. Can be found using session.session_id or get_default_session_id(). |
location |
str, default None The location of the session to clean up. If given, used together with project kwarg to determine the dataset to search through for tables to clean up. |
project |
str, default None The project id associated with the session to clean up. If given, used together with location kwarg to determine the dataset to search through for tables to clean up. |
close_session
close_session()
Start a fresh session the next time a function requires a session.
Closes the current session if it was already started, deleting any temporary tables that were created.
concat
concat(
objs: typing.Iterable[
typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]
],
*,
axis: typing.Union[str, int] = 0,
join: typing.Literal["inner", "outer"] = "outer",
ignore_index: bool = False
) -> typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]
Concatenate BigQuery DataFrames objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
Examples:>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None
Combine two Series
.
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: string
Clear the existing index and reset it in the result
by setting the ignore_index
option to True
.
>>> pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: string
Combine two DataFrame
objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
<BLANKLINE>
[2 rows x 2 columns]
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
<BLANKLINE>
[2 rows x 2 columns]
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
<BLANKLINE>
[4 rows x 2 columns]
Combine DataFrame
objects with overlapping columns
and return everything. Columns outside the intersection will
be filled with NaN
values.
>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
... columns=['letter', 'number', 'animal'])
>>> df3
letter number animal
0 c 3 cat
1 d 4 dog
<BLANKLINE>
[2 rows x 3 columns]
>>> pd.concat([df1, df3])
letter number animal
0 a 1 <NA>
1 b 2 <NA>
0 c 3 cat
1 d 4 dog
<BLANKLINE>
[4 rows x 3 columns]
Combine DataFrame
objects with overlapping columns
and return only those that are shared by passing inner
to
the join
keyword argument.
>>> pd.concat([df1, df3], join="inner")
letter number
0 a 1
1 b 2
0 c 3
1 d 4
<BLANKLINE>
[4 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
objs |
list of objects Objects to concatenate. Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised. |
axis |
{0 or 'index', 1 or 'columns'}, default 0 The axis to concatenate along. |
join |
{'inner', 'outer'}, default 'outer' How to handle indexes on other axis (or axes). |
ignore_index |
bool, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. |
cut
cut(
x: bigframes.series.Series,
bins: typing.Union[
int, pandas.core.indexes.interval.IntervalIndex, typing.Iterable
],
*,
right: typing.Optional[bool] = True,
labels: typing.Optional[typing.Union[typing.Iterable[str], bool]] = None
) -> bigframes.series.Series
Bin values into discrete intervals.
Use cut
when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, cut
could convert ages to groups of
age ranges. Supports binning into an equal number of bins, or a
pre-specified array of bins.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> s = bpd.Series([0, 1, 5, 10])
>>> s
0 0
1 1
2 5
3 10
dtype: Int64
Cut with an integer (equal-width bins):
>>> bpd.cut(s, bins=4)
0 {'left_exclusive': -0.01, 'right_inclusive': 2.5}
1 {'left_exclusive': -0.01, 'right_inclusive': 2.5}
2 {'left_exclusive': 2.5, 'right_inclusive': 5.0}
3 {'left_exclusive': 7.5, 'right_inclusive': 10.0}
dtype: struct<left_exclusive: double, right_inclusive: double>[pyarrow]
Cut with the same bins, but assign them specific labels:
>>> bpd.cut(s, bins=3, labels=["bad", "medium", "good"])
0 bad
1 bad
2 medium
3 good
dtype: string
labels=False
implies you want the bins back.
>>> bpd.cut(s, bins=4, labels=False)
0 0
1 0
2 1
3 3
dtype: Int64
Cut with pd.IntervalIndex, requires importing pandas for IntervalIndex:
>>> import pandas as pd
>>> interval_index = pd.IntervalIndex.from_tuples([(0, 1), (1, 5), (5, 20)])
>>> bpd.cut(s, bins=interval_index)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 {'left_exclusive': 1, 'right_inclusive': 5}
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Cut with an iterable of tuples:
>>> bins_tuples = [(0, 1), (1, 4), (5, 20)]
>>> bpd.cut(s, bins=bins_tuples)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 <NA>
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Cut with an iterable of ints:
>>> bins_ints = [0, 1, 5, 20]
>>> bpd.cut(s, bins=bins_ints)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 {'left_exclusive': 1, 'right_inclusive': 5}
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Cut with an interable of ints, where intervals are left-inclusive and right-exclusive.
>>> bins_ints = [0, 1, 5, 20]
>>> bpd.cut(s, bins=bins_ints, right=False)
0 {'left_inclusive': 0, 'right_exclusive': 1}
1 {'left_inclusive': 1, 'right_exclusive': 5}
2 {'left_inclusive': 5, 'right_exclusive': 20}
3 {'left_inclusive': 5, 'right_exclusive': 20}
dtype: struct<left_inclusive: int64, right_exclusive: int64>[pyarrow]
Parameters | |
---|---|
Name | Description |
x |
bigframes.pandas.Series The input Series to be binned. Must be 1-dimensional. |
bins |
int, pd.IntervalIndex, Iterable The criteria to bin by. int: Defines the number of equal-width bins in the range of |
right |
bool, default True Indicates whether |
labels |
bool, Iterable, default None Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container. This argument is ignored when |
deploy_remote_function
deploy_remote_function(func, **kwargs)
Orchestrates the creation of a BigQuery remote function that deploys immediately.
This method ensures that the remote function is created and available for use in BigQuery as soon as this call is made.
deploy_udf
deploy_udf(func, **kwargs)
Orchestrates the creation of a BigQuery UDF that deploys immediately.
This method ensures that the UDF is created and available for use in BigQuery as soon as this call is made.
from_glob_path
from_glob_path(
path: str,
*,
connection: typing.Optional[str] = None,
name: typing.Optional[str] = None
) -> bigframes.dataframe.DataFrame
Create a BigFrames DataFrame that contains a BigFrames Blob column from a global wildcard path. This operation creates a temporary BQ Object Table under the hood and requires bigquery.connections.delegate permission or BigQuery Connection Admin role. If you have an existing BQ Object Table, use read_gbq_object_table().
Parameters | |
---|---|
Name | Description |
path |
str The wildcard global path, such as "gs://
|
connection |
str or None, default None Connection to connect with remote service. str of the format <PROJECT_NUMBER/PROJECT_ID>.
|
name |
str The column name of the Blob column. |
get_default_session_id
get_default_session_id() -> str
Gets the session id that is used whenever a custom session has not been provided.
It is the session id of the default global session. It is prefixed to the table id of all temporary tables created in the global session.
get_dummies
get_dummies(
data: typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series],
prefix: typing.Optional[typing.Union[typing.List, dict, str]] = None,
prefix_sep: typing.Optional[typing.Union[typing.List, dict, str]] = "_",
dummy_na: bool = False,
columns: typing.Optional[typing.List] = None,
drop_first: bool = False,
dtype: typing.Any = None,
) -> bigframes.dataframe.DataFrame
Convert categorical variable into dummy/indicator variables.
Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.
Examples:
>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 True False False
1 False True False
2 False False True
3 True False False
<BLANKLINE>
[4 rows x 3 columns]
>>> s1 = pd.Series(['a', 'b', None])
>>> pd.get_dummies(s1)
a b
0 True False
1 False True
2 False False
<BLANKLINE>
[3 rows x 2 columns]
>>> pd.get_dummies(s1, dummy_na=True)
a b <NA>
0 True False False
1 False True False
2 False False True
<BLANKLINE>
[3 rows x 3 columns]
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 True False False True False
1 2 False True True False False
2 3 True False False False True
<BLANKLINE>
[3 rows x 6 columns]
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 True False False
1 False True False
2 False False True
3 True False False
4 True False False
<BLANKLINE>
[5 rows x 3 columns]
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 False False
1 True False
2 False True
3 False False
4 False False
<BLANKLINE>
[5 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
data |
Series or DataFrame Data of which to get dummy indicators. |
prefix |
str, list of str, or dict of str, default None String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes. |
prefix_sep |
str, list of str, or dict of str, default '_' Separator/delimiter to use, appended to prefix. Or pass a list or dictionary as with prefix. |
dummy_na |
bool, default False Add a column to indicate NaNs, if False NaNs are ignored. |
columns |
list-like, default None Column names in the DataFrame to be encoded. If columns is None then only the columns with string dtype will be converted. |
drop_first |
bool, default False Whether to get k-1 dummies out of k categorical levels by removing the first level. |
dtype |
dtype, default bool Data type for new columns. Only a single dtype is allowed. |
get_global_session
get_global_session()
Gets the global session.
Creates the global session if it does not exist.
merge
merge(
left: bigframes.dataframe.DataFrame,
right: bigframes.dataframe.DataFrame,
how: Literal["inner", "left", "outer", "right", "cross"] = "inner",
on: Optional[str] = None,
*,
left_on: Optional[str] = None,
right_on: Optional[str] = None,
sort: bool = False,
suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame
Merge DataFrame objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Parameters | |
---|---|
Name | Description |
on |
label or list of labels Columns to join on. It must be found in both DataFrames. Either on or left_on + right_on must be passed in. |
left_on |
label or list of labels Columns to join on in the left DataFrame. Either on or left_on + right_on must be passed in. |
right_on |
label or list of labels Columns to join on in the right DataFrame. Either on or left_on + right_on must be passed in. |
qcut
qcut(
x: bigframes.series.Series,
q: typing.Union[int, typing.Sequence[float]],
*,
labels: typing.Optional[bool] = None,
duplicates: typing.Literal["drop", "error"] = "error"
) -> bigframes.series.Series
Quantile-based discretization function.
Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
Parameters | |
---|---|
Name | Description |
x |
Series The input Series to be binned. Must be 1-dimensional. |
q |
int or list-like of float Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles. |
labels |
None Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error. |
duplicates |
{default 'raise', 'drop'}, optional If bin edges are not unique, raise ValueError or drop non-uniques. |
read_arrow
read_arrow(pa_table: pyarrow.lib.Table) -> bigframes.dataframe.DataFrame
Load a PyArrow Table to a BigQuery DataFrames DataFrame.
Parameter | |
---|---|
Name | Description |
pa_table |
pyarrow.Table PyArrow table to load data from. |
read_csv
read_csv(
filepath_or_buffer: typing.Union[str, typing.IO[bytes]],
*,
sep: typing.Optional[str] = ",",
header: typing.Optional[int] = 0,
names: typing.Optional[
typing.Union[
typing.MutableSequence[typing.Any],
numpy.ndarray[typing.Any, typing.Any],
typing.Tuple[typing.Any, ...],
range,
]
] = None,
index_col: typing.Optional[
typing.Union[
int,
str,
typing.Sequence[typing.Union[str, int]],
bigframes.enums.DefaultIndexKind,
typing.Literal[False],
]
] = None,
usecols: typing.Optional[
typing.Union[
typing.MutableSequence[str],
typing.Tuple[str, ...],
typing.Sequence[int],
pandas.core.series.Series,
pandas.core.indexes.base.Index,
numpy.ndarray[typing.Any, typing.Any],
typing.Callable[[typing.Any], bool],
]
] = None,
dtype: typing.Optional[typing.Dict] = None,
engine: typing.Optional[
typing.Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
] = None,
encoding: typing.Optional[str] = None,
write_engine: typing.Literal[
"default",
"bigquery_inline",
"bigquery_load",
"bigquery_streaming",
"bigquery_write",
"_deferred",
] = "default",
**kwargs
) -> bigframes.dataframe.DataFrame
Loads data from a comma-separated values (csv) file into a DataFrame.
The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
name post_abbr
0 Alabama AL
1 Alaska AK
<BLANKLINE>
[2 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
filepath_or_buffer |
str A local or Google Cloud Storage ( |
sep |
Optional[str], default "," the separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support |
header |
Optional[int], default 0 row number to use as the column names. - |
names |
default None a list of column names to use. If the file contains a header row and you want to pass this parameter, then |
index_col |
default None column(s) to use as the row labels of the DataFrame, either given as string name or column index. |
usecols |
default None List of column names to use): The BigQuery engine only supports having a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in |
dtype |
data type for data or columns Data type for data or columns. Only to be used with default engine. |
engine |
Optional[Dict], default None Type of engine to use. If |
encoding |
Optional[str], default to None encoding the character encoding of the data. The default encoding is |
write_engine |
str How data should be written to BigQuery (if at all). See |
read_gbq
read_gbq(
query_or_table: str,
*,
index_col: typing.Union[
typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
] = (),
columns: typing.Iterable[str] = (),
configuration: typing.Optional[typing.Dict] = None,
max_results: typing.Optional[int] = None,
filters: typing.Union[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
],
typing.Iterable[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
]
],
] = (),
use_cache: typing.Optional[bool] = None,
col_order: typing.Iterable[str] = (),
dry_run: bool = False,
allow_large_results: typing.Optional[bool] = None
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series
Loads a DataFrame from BigQuery.
BigQuery tables are an unordered, unindexed data source. To add support
pandas-compatibility, the following indexing options are supported via
the index_col
parameter:
(Empty iterable, default) A default index. Behavior may change. Explicitly set
index_col
if your application makes use of specific index values.If a table has primary key(s), those are used as the index, otherwise a sequential index is generated.
- (
<xref uid="bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64">bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64</xref>
) Add an arbitrary sequential index and ordering. Warning This uses an analytic windowed operation that prevents filtering push down. Avoid using on large clustered or partitioned tables. - (Recommended) Set the
index_col
argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-compatible outer join behavior.
GENERATE_UUID() AS
rowindex
in your SQL and set index_col='rowindex'
for the
best performance.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
If the input is a table ID:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
Read table path with wildcard suffix and filters:
>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])
Preserve ordering in a query input.
>>> df = bpd.read_gbq('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
Reading data with columns
and filters
parameters:
>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
... "bigquery-public-data.baseball.games_wide",
... columns=columns,
... filters=filters,
... )
>>> df.head(1)
pitcherFirstName pitcherLastName year pitchSpeed
0 John Gant 2016 95
<BLANKLINE>
[1 rows x 4 columns]
Parameters | |
---|---|
Name | Description |
query_or_table |
str A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of |
index_col |
Iterable[str], str, bigframes.enums.DefaultIndexKind Name of result column(s) to use for index in results DataFrame. If an empty iterable, such as |
columns |
Iterable[str] List of BigQuery column names in the desired order for results DataFrame. |
configuration |
dict, optional Query config parameters for job processing. For example: configuration = {'query': {'useQueryCache': False}}. For more information see |
max_results |
Optional[int], default None If set, limit the maximum number of rows to fetch from the query results. |
filters |
Union[Iterable[FilterType], Iterable[Iterable[FilterType]]], default () To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. If using wildcard table suffix in query_or_table, can specify '_table_suffix' pseudo column to filter the tables to be read into the DataFrame. |
use_cache |
Optional[bool], default None Caches query results if set to |
col_order |
Iterable[str] Alias for columns, retained for backwards compatibility. |
allow_large_results |
bool, optional Whether to allow large query results. If |
read_gbq_function
read_gbq_function(function_name: str, is_row_processor: bool = False)
Loads a BigQuery function from BigQuery.
Then it can be applied to a DataFrame or Series.
BigQuery Utils provides many public functions under thebqutil
project on Google Cloud Platform project
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs).
You can checkout Community UDFs to use community-contributed functions.
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Use the cw_lower_case_ascii_only function from Community UDFs.
>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")
You can run it on scalar input. Usually you would do so to verify that it works as expected before applying to all values in a Series.
>>> func('AURÉLIE')
'aurÉlie'
You can apply it to a BigQuery DataFrames Series.
>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
id name
0 1 AURÉLIE
1 2 CÉLESTINE
2 3 DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]
>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
id name new_name
0 1 AURÉLIE aurÉlie
1 2 CÉLESTINE cÉlestine
2 3 DAPHNÉ daphnÉ
<BLANKLINE>
[3 rows x 3 columns]
You can even use a function with multiple inputs. For example, cw_regexp_replace_5 from Community UDFs.
>>> func = bpd.read_gbq_function("bqutil.fn.cw_regexp_replace_5")
>>> func('TestStr123456', 'Str', 'Cad$', 1, 1)
'TestCad$123456'
>>> df = bpd.DataFrame({
... "haystack" : ["TestStr123456", "TestStr123456Str", "TestStr123456Str"],
... "regexp" : ["Str", "Str", "Str"],
... "replacement" : ["Cad$", "Cad$", "Cad$"],
... "offset" : [1, 1, 1],
... "occurrence" : [1, 2, 1]
... })
>>> df
haystack regexp replacement offset occurrence
0 TestStr123456 Str Cad$ 1 1
1 TestStr123456Str Str Cad$ 1 2
2 TestStr123456Str Str Cad$ 1 1
<BLANKLINE>
[3 rows x 5 columns]
>>> df.apply(func, axis=1)
0 TestCad$123456
1 TestStr123456Cad$
2 TestCad$123456Str
dtype: string
Another use case is to define your own remote function and use it later. For example, define the remote function:
>>> @bpd.remote_function(cloud_function_service_account="default")
... def tenfold(num: int) -> float:
... return num * 10
Then, read back the deployed BQ remote function:
>>> tenfold_ref = bpd.read_gbq_function(
... tenfold.bigframes_remote_function,
... )
>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
a b c
0 1 3 5
1 2 4 6
<BLANKLINE>
[2 rows x 3 columns]
>>> df['a'].apply(tenfold_ref)
0 10.0
1 20.0
Name: a, dtype: Float64
It also supports row processing by using is_row_processor=True
. Please
note, row processor implies that the function has only one input
parameter.
>>> @bpd.remote_function(cloud_function_service_account="default")
... def row_sum(s: bpd.Series) -> float:
... return s['a'] + s['b'] + s['c']
>>> row_sum_ref = bpd.read_gbq_function(
... row_sum.bigframes_remote_function,
... is_row_processor=True,
... )
>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
a b c
0 1 3 5
1 2 4 6
<BLANKLINE>
[2 rows x 3 columns]
>>> df.apply(row_sum_ref, axis=1)
0 9.0
1 12.0
dtype: Float64
Parameters | |
---|---|
Name | Description |
function_name |
str The function's name in BigQuery in the format |
is_row_processor |
bool, default False Whether the function is a row processor. This is set to True for a function which receives an entire row of a DataFrame as a pandas Series. |
read_gbq_model
read_gbq_model(model_name: str)
Loads a BigQuery ML model from BigQuery.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read an existing BigQuery ML model.
>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)
Parameter | |
---|---|
Name | Description |
model_name |
str the model's name in BigQuery in the format |
read_gbq_object_table
read_gbq_object_table(
object_table: str, *, name: typing.Optional[str] = None
) -> bigframes.dataframe.DataFrame
Read an existing object table to create a BigFrames Blob DataFrame. Use the connection of the object table for the connection of the blob. This function dosen't retrieve the object table data. If you want to read the data, use read_gbq() instead.
Parameters | |
---|---|
Name | Description |
object_table |
str name of the object table of form <PROJECT_ID>.<DATASET_ID>.<TABLE_ID>. |
name |
str or None the returned blob column name. |
read_gbq_query
read_gbq_query(
query: str,
*,
index_col: typing.Union[
typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
] = (),
columns: typing.Iterable[str] = (),
configuration: typing.Optional[typing.Dict] = None,
max_results: typing.Optional[int] = None,
use_cache: typing.Optional[bool] = None,
col_order: typing.Iterable[str] = (),
filters: typing.Union[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
],
typing.Iterable[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
]
],
] = (),
dry_run: bool = False,
allow_large_results: typing.Optional[bool] = None
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series
Turn a SQL query into a DataFrame.
Note: Because the results are written to a temporary table, ordering by
ORDER BY
is not preserved. A unique index_col
is recommended. Use
row_number() over ()
if there is no natural unique index or you
want to preserve ordering.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Simple query input:
>>> df = bpd.read_gbq_query('''
... SELECT
... pitcherFirstName,
... pitcherLastName,
... pitchSpeed,
... FROM `bigquery-public-data.baseball.games_wide`
... ''')
Preserve ordering in a query input.
>>> df = bpd.read_gbq_query('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
See also: Session.read_gbq
.
Parameters | |
---|---|
Name | Description |
query |
str A SQL query to execute. |
index_col |
Iterable[str] or str, optional The column(s) to use as the index for the DataFrame. This can be a single column name or a list of column names. If not provided, a default index will be used. |
columns |
Iterable[str], optional The columns to read from the query result. If not specified, all columns will be read. |
configuration |
dict, optional A dictionary of query job configuration options. See the BigQuery REST API documentation for a list of available options: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query |
max_results |
int, optional The maximum number of rows to retrieve from the query result. If not specified, all rows will be loaded. |
use_cache |
bool, optional Whether to use cached results for the query. Defaults to |
col_order |
Iterable[str], optional The desired order of columns in the resulting DataFrame. This parameter is deprecated and will be removed in a future version. Use |
filters |
list[tuple], optional A list of filters to apply to the data. Filters are specified as a list of tuples, where each tuple contains a column name, an operator (e.g., '==', '!='), and a value. |
dry_run |
bool, optional If |
allow_large_results |
bool, optional Whether to allow large query results. If |
read_gbq_table
read_gbq_table(
query: str,
*,
index_col: typing.Union[
typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
] = (),
columns: typing.Iterable[str] = (),
max_results: typing.Optional[int] = None,
filters: typing.Union[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
],
typing.Iterable[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
],
typing.Any,
]
]
],
] = (),
use_cache: bool = True,
col_order: typing.Iterable[str] = (),
dry_run: bool = False
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series
Turn a BigQuery table into a DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).
>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
See also: Session.read_gbq
.
Parameters | |
---|---|
Name | Description |
table_id |
str The identifier of the BigQuery table to read. |
index_col |
Iterable[str] or str, optional The column(s) to use as the index for the DataFrame. This can be a single column name or a list of column names. If not provided, a default index will be used. |
columns |
Iterable[str], optional The columns to read from the table. If not specified, all columns will be read. |
max_results |
int, optional The maximum number of rows to retrieve from the table. If not specified, all rows will be loaded. |
filters |
list[tuple], optional A list of filters to apply to the data. Filters are specified as a list of tuples, where each tuple contains a column name, an operator (e.g., '==', '!='), and a value. |
use_cache |
bool, optional Whether to use cached results for the query. Defaults to |
col_order |
Iterable[str], optional The desired order of columns in the resulting DataFrame. This parameter is deprecated and will be removed in a future version. Use |
dry_run |
bool, optional If |
read_json
read_json(
path_or_buf: typing.Union[str, typing.IO[bytes]],
*,
orient: typing.Literal[
"split", "records", "index", "columns", "values", "table"
] = "columns",
dtype: typing.Optional[typing.Dict] = None,
encoding: typing.Optional[str] = None,
lines: bool = False,
engine: typing.Literal["ujson", "pyarrow", "bigquery"] = "ujson",
write_engine: typing.Literal[
"default",
"bigquery_inline",
"bigquery_load",
"bigquery_streaming",
"bigquery_write",
"_deferred",
] = "default",
**kwargs
) -> bigframes.dataframe.DataFrame
Convert a JSON string to DataFrame object.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
id name
0 1 Alice
1 2 Bob
<BLANKLINE>
[2 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
path_or_buf |
a valid JSON str, path object or file-like object A local or Google Cloud Storage ( |
orient |
str, optional If |
dtype |
bool or dict, default None If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all |
encoding |
str, default is 'utf-8' The encoding to use to decode py3 bytes. |
lines |
bool, default False Read the file as a json object per line. If using |
engine |
{{"ujson", "pyarrow", "bigquery"}}, default "ujson" Type of engine to use. If |
write_engine |
str How data should be written to BigQuery (if at all). See |
read_pandas
read_pandas(
pandas_dataframe: typing.Union[
pandas.core.frame.DataFrame,
pandas.core.series.Series,
pandas.core.indexes.base.Index,
],
*,
write_engine: typing.Literal[
"default",
"bigquery_inline",
"bigquery_load",
"bigquery_streaming",
"bigquery_write",
"_deferred",
] = "default"
)
Loads DataFrame from a pandas DataFrame.
The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Examples:>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
col1 col2
0 1 3
1 2 4
<BLANKLINE>
[2 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
pandas_dataframe |
pandas.DataFrame, pandas.Series, or pandas.Index a pandas DataFrame/Series/Index object to be loaded. |
write_engine |
str How data should be written to BigQuery (if at all). Supported values: * "default": (Recommended) Select an appropriate mechanism to write data to BigQuery. Depends on data size and supported data types. * "bigquery_inline": Inline data in BigQuery SQL. Use this when you know the data is small enough to fit within BigQuery's 1 MB query text size limit. * "bigquery_load": Use a BigQuery load job. Use this for larger data sizes. * "bigquery_streaming": Use the BigQuery streaming JSON API. Use this if your workload is such that you exhaust the BigQuery load job quota and your data cannot be embedded in SQL due to size or data type limitations. * "bigquery_write": [Preview] Use the BigQuery Storage Write API. This feature is in public preview. |
read_parquet
read_parquet(
path: typing.Union[str, typing.IO[bytes]],
*,
engine: str = "auto",
write_engine: typing.Literal[
"default",
"bigquery_inline",
"bigquery_load",
"bigquery_streaming",
"bigquery_write",
"_deferred",
] = "default"
) -> bigframes.dataframe.DataFrame
Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")
Parameters | |
---|---|
Name | Description |
path |
str Local or Cloud Storage path to Parquet file. |
engine |
str One of |
read_pickle
read_pickle(
filepath_or_buffer: FilePath | ReadPickleBuffer,
compression: CompressionOptions = "infer",
storage_options: StorageOptions = None,
*,
write_engine: constants.WriteEngineType = "default"
)
Load pickled BigFrames object (or any object) from file.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
Parameters | |
---|---|
Name | Description |
filepath_or_buffer |
str, path object, or file-like object String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS. |
compression |
str or dict, default 'infer' For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}. |
storage_options |
dict, default None Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here. |
write_engine |
str How data should be written to BigQuery (if at all). See |
remote_function
remote_function(
input_types: typing.Union[None, type, typing.Sequence[type]] = None,
output_type: typing.Optional[type] = None,
dataset: typing.Optional[str] = None,
*,
bigquery_connection: typing.Optional[str] = None,
reuse: bool = True,
name: typing.Optional[str] = None,
packages: typing.Optional[typing.Sequence[str]] = None,
cloud_function_service_account: str,
cloud_function_kms_key_name: typing.Optional[str] = None,
cloud_function_docker_repository: typing.Optional[str] = None,
max_batching_rows: typing.Optional[int] = 1000,
cloud_function_timeout: typing.Optional[int] = 600,
cloud_function_max_instances: typing.Optional[int] = None,
cloud_function_vpc_connector: typing.Optional[str] = None,
cloud_function_vpc_connector_egress_settings: typing.Literal[
"all", "private-ranges-only", "unspecified"
] = "private-ranges-only",
cloud_function_memory_mib: typing.Optional[int] = 1024,
cloud_function_ingress_settings: typing.Literal[
"all", "internal-only", "internal-and-gclb"
] = "internal-only",
cloud_build_service_account: typing.Optional[str] = None
)
Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
See, https://cloud.google.com/functions/docs/securing/function-identity.Have the below APIs enabled for your project:
- BigQuery Connection API
- Cloud Functions API
- Cloud Run API
- Cloud Build API
- Artifact Registry API
- Cloud Resource Manager API
This can be done from the cloud console (change
PROJECT_ID
to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_IDOr from the gcloud CLI:
$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com
Have following IAM roles enabled for you:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Cloud Functions Developer (roles/cloudfunctions.developer)
- Service Account User (roles/iam.serviceAccountUser) on the service account
[email protected]
- Storage Object Viewer (roles/storage.objectViewer)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:
- To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
Alternatively, the IAM could also be setup via the gcloud CLI:
$ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"
.
Parameters | |
---|---|
Name | Description |
input_types |
type or sequence(type), Optional For scalar user defined function it should be the input type or sequence of input types. The supported scalar input types are |
output_type |
type, Optional Data type of the output in the user defined function. If the user defined function returns an array, then |
dataset |
str, Optional Dataset in which to create a BigQuery remote function. It should be in |
bigquery_connection |
str, Optional Name of the BigQuery connection. You should either have the connection already created in the |
reuse |
bool, Optional Reuse the remote function if already exists. |
name |
str, Optional Explicit name of the persisted BigQuery remote function. Use it with caution, because more than one users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. When an explicit name is provided, any session specific clean up ( |
packages |
str[], Optional Explicit name of the external package dependencies. Each dependency is added to the |
cloud_function_service_account |
str Service account to use for the cloud functions. If "default" provided then the default service account would be used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration. |
cloud_function_kms_key_name |
str, Optional Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key. |
cloud_function_docker_repository |
str, Optional Docker repository created with the same encryption key as |
max_batching_rows |
int, Optional The maximum number of rows to be batched for processing in the BQ remote function. Default value is 1000. A lower number can be passed to avoid timeouts in case the user code is too complex to process large number of rows fast enough. A higher number can be used to increase throughput in case the user code is fast enough. |
cloud_function_timeout |
int, Optional The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. See for more details https://cloud.google.com/functions/docs/configuring/timeout. Please note that even though the cloud function (2nd gen) itself allows seeting up to 60 minutes of timeout, BigQuery remote function can wait only up to 20 minutes, see for more details https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. |
cloud_function_max_instances |
int, Optional The maximumm instance count for the cloud function created. This can be used to control how many cloud function instances can be active at max at any given point of time. Lower setting can help control the spike in the billing. Higher setting can help support processing larger scale data. When not specified, cloud function's default setting applies. For more details see https://cloud.google.com/functions/docs/configuring/max-instances. |
cloud_function_vpc_connector |
str, Optional The VPC connector you would like to configure for your cloud function. This is useful if your code needs access to data or service(s) that are on a VPC network. See for more details https://cloud.google.com/functions/docs/networking/connecting-vpc. |
cloud_function_vpc_connector_egress_settings |
str, Optional Egress settings for the VPC connector, controlling what outbound traffic is routed through the VPC connector. Options are: |
cloud_function_memory_mib |
int, Optional The amounts of memory (in mebibytes) to allocate for the cloud function (2nd gen) created. This also dictates a corresponding amount of allocated CPU for the function. By default a memory of 1024 MiB is set for the cloud functions created to support BigQuery DataFrames remote function. If you want to let the default memory of cloud functions be allocated, pass |
cloud_function_ingress_settings |
str, Optional Ingress settings controls dictating what traffic can reach the function. Options are: |
cloud_build_service_account |
str, Optional Service account in the fully qualified format |
reset_session
reset_session()
Start a fresh session the next time a function requires a session.
Closes the current session if it was already started, deleting any temporary tables that were created.
to_datetime
This function converts a scalar, array-like or Series to a datetime object.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Converting a Scalar to datetime:
>>> scalar = 123456.789
>>> bpd.to_datetime(scalar, unit = 's')
Timestamp('1970-01-02 10:17:36.789000')
Converting a List of Strings without Timezone Information:
>>> list_str = ["01-31-2021 14:30", "02-28-2021 15:45"]
>>> bpd.to_datetime(list_str, format="%m-%d-%Y %H:%M", utc=True)
0 2021-01-31 14:30:00+00:00
1 2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]
Converting a Series of Strings with Timezone Information:
>>> series_str = bpd.Series(["01-31-2021 14:30+08:00", "02-28-2021 15:45+00:00"])
>>> bpd.to_datetime(series_str, format="%m-%d-%Y %H:%M%Z", utc=True)
0 2021-01-31 06:30:00+00:00
1 2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]
Parameters | |
---|---|
Name | Description |
arg |
int, float, str, datetime, date, list, tuple, 1-d array, Series The object to convert to a datetime. |
utc |
bool, default False Control timezone-related parsing, localization and conversion. If True, the function always returns a timezone-aware UTC-localized timestamp or series. If False (default), inputs will not be coerced to UTC. |
format |
str, default None The strftime to parse time, e.g. "%d/%m/%Y". |
unit |
str, default 'ns' The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. |
to_timedelta
to_timedelta(
arg,
unit: typing.Optional[
typing.Literal[
"W",
"w",
"D",
"d",
"days",
"day",
"hours",
"hour",
"hr",
"h",
"m",
"minute",
"min",
"minutes",
"s",
"seconds",
"sec",
"second",
"ms",
"milliseconds",
"millisecond",
"milli",
"millis",
"us",
"microseconds",
"microsecond",
"µs",
"micro",
"micros",
]
] = None,
*,
session: typing.Optional[bigframes.session.Session] = None
)
Converts a scalar or Series to a timedelta object.
Examples:>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Converting a Scalar to timedelta
>>> scalar = 2
>>> bpd.to_timedelta(scalar, unit='s')
Timedelta('0 days 00:00:02')
Converting a Series of integers to a Series of timedeltas
>>> int_series = bpd.Series([1,2,3])
>>> bpd.to_timedelta(int_series, unit='s')
0 0 days 00:00:01
1 0 days 00:00:02
2 0 days 00:00:03
dtype: duration`us][pyarrow]`
Parameters | |
---|---|
Name | Description |
arg |
int, float, str, Series The object to convert to a dataframe |
unit |
str, default 'us' Denotes the unit of the arg for numeric |
udf
udf(
*,
input_types: typing.Union[None, type, typing.Sequence[type]] = None,
output_type: typing.Optional[type] = None,
dataset: str,
bigquery_connection: typing.Optional[str] = None,
name: str,
packages: typing.Optional[typing.Sequence[str]] = None,
max_batching_rows: typing.Optional[int] = None,
container_cpu: typing.Optional[float] = None,
container_memory: typing.Optional[str] = None
)
Decorator to turn a Python user defined function (udf) into a BigQuery managed user-defined function.
Examples:>>> import bigframes.pandas as bpd
>>> import datetime
>>> bpd.options.display.progress_bar = None
Turning an arbitrary python function into a BigQuery managed python udf:
>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(dataset="bigfranes_testing", name=bq_name)
... def minutes_to_hours(x: int) -> float:
... return x/60
>>> minutes = bpd.Series([0, 30, 60, 90, 120])
>>> minutes
0 0
1 30
2 60
3 90
4 120
dtype: Int64
>>> hours = minutes.apply(minutes_to_hours)
>>> hours
0 0.0
1 0.5
2 1.0
3 1.5
4 2.0
dtype: Float64
To turn a user defined function with external package dependencies into
a BigQuery managed python udf, you would provide the names of the
packages (optionally with the package version) via packages
param.
>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(
... dataset="bigfranes_testing",
... name=bq_name,
... packages=["cryptography"]
... )
... def get_hash(input: str) -> str:
... from cryptography.fernet import Fernet
...
... # handle missing value
... if input is None:
... input = ""
...
... key = Fernet.generate_key()
... f = Fernet(key)
... return f.encrypt(input.encode()).decode()
>>> names = bpd.Series(["Alice", "Bob"])
>>> hashes = names.apply(get_hash)
You can clean-up the BigQuery functions created above using the BigQuery client from the BigQuery DataFrames session:
>>> session = bpd.get_global_session()
>>> session.bqclient.delete_routine(minutes_to_hours.bigframes_bigquery_function)
>>> session.bqclient.delete_routine(get_hash.bigframes_bigquery_function)