- To raw data
+ To raw data
@@ -209,7 +209,7 @@ Plot the typical :math:`NO_2` pattern during the day of our time series of all s
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
kind='bar', rot=0, ax=axs
)
- plt.xlabel("Hour of the day"); # custom x label using matplotlib
+ plt.xlabel("Hour of the day"); # custom x label using Matplotlib
@savefig 09_bar_chart.png
plt.ylabel("$NO_2 (µg/m^3)$");
diff --git a/doc/source/getting_started/intro_tutorials/10_text_data.rst b/doc/source/getting_started/intro_tutorials/10_text_data.rst
index 63db920164ac3..148ac246d7bf8 100644
--- a/doc/source/getting_started/intro_tutorials/10_text_data.rst
+++ b/doc/source/getting_started/intro_tutorials/10_text_data.rst
@@ -179,7 +179,7 @@ applied to integers, so no ``str`` is used.
Based on the index name of the row (``307``) and the column (``Name``),
we can do a selection using the ``loc`` operator, introduced in the
-`tutorial on subsetting <3_subset_data.ipynb>`__.
+:ref:`tutorial on subsetting <10min_tut_03_subset>`.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
index a5a5442330e43..43790bd53f587 100644
--- a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
@@ -8,7 +8,7 @@
For this tutorial, air quality data about :math:`NO_2` is used, made
-available by `openaq `__ and using the
+available by `OpenAQ `__ and using the
`py-openaq `__ package.
The ``air_quality_no2.csv`` data set provides :math:`NO_2` values for
the measurement stations *FR04014*, *BETR801* and *London Westminster*
@@ -17,6 +17,6 @@ in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/includes/titanic.rst b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
index 7032b70b3f1cf..19b8e81914e31 100644
--- a/doc/source/getting_started/intro_tutorials/includes/titanic.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
@@ -11,22 +11,21 @@ This tutorial uses the Titanic data set, stored as CSV. The data
consists of the following data columns:
- PassengerId: Id of every passenger.
-- Survived: This feature have value 0 and 1. 0 for not survived and 1
- for survived.
-- Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
+- Survived: Indication whether passenger survived. ``1`` for yes and ``0`` for no.
+- Pclass: One out of the 3 ticket classes: Class ``1``, Class ``2`` and Class ``3``.
- Name: Name of passenger.
- Sex: Gender of passenger.
-- Age: Age of passenger.
-- SibSp: Indication that passenger have siblings and spouse.
-- Parch: Whether a passenger is alone or have family.
+- Age: Age of passenger in years.
+- SibSp: Number of siblings or spouses aboard.
+- Parch: Number of parents or children aboard.
- Ticket: Ticket number of passenger.
- Fare: Indicating the fare.
-- Cabin: The cabin of passenger.
-- Embarked: The embarked category.
+- Cabin: Cabin number of passenger.
+- Embarked: Port of embarkation.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 306eb28d23fe7..320d2da01418c 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -75,7 +75,7 @@ Some other notes
specialized tool.
- pandas is a dependency of `statsmodels
- `__, making it an important part of the
+ `__, making it an important part of the
statistical computing ecosystem in Python.
- pandas has been used extensively in production in financial applications.
diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst
index a349251bdfca6..bff50bb1e4c2d 100644
--- a/doc/source/getting_started/tutorials.rst
+++ b/doc/source/getting_started/tutorials.rst
@@ -75,6 +75,16 @@ Excel charts with pandas, vincent and xlsxwriter
* `Using Pandas and XlsxWriter to create Excel charts `_
+Joyful pandas
+-------------
+
+A tutorial written in Chinese by Yuanhao Geng. It covers the basic operations
+of NumPy and pandas, 4 main data manipulation methods (including indexing, groupby, reshaping
+and concatenation) and 4 main data types (including missing data, string data, categorical
+data and time series data). At the end of each chapter, corresponding exercises are posted.
+All the datasets and related materials can be found in the GitHub repository
+`datawhalechina/joyful-pandas `_.
+
Video tutorials
---------------
@@ -90,11 +100,11 @@ Video tutorials
* `Data analysis in Python with pandas `_
(2016-2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
* `Best practices with pandas `_
(2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
Various tutorials
@@ -108,3 +118,4 @@ Various tutorials
* `Pandas and Python: Top 10, by Manish Amde `_
* `Pandas DataFrames Tutorial, by Karlijn Willems `_
* `A concise tutorial with real life examples `_
+* `430+ Searchable Pandas recipes by Isshin Inada `_
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index 3b440122c2b97..59280536536db 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -10,7 +10,7 @@ pandas documentation
**Date**: |today| **Version**: |version|
-**Download documentation**: `PDF Version `__ | `Zipped HTML `__
+**Download documentation**: `Zipped HTML `__
**Previous versions**: Documentation of previous pandas versions is available at
`pandas.pydata.org `__.
@@ -26,6 +26,7 @@ pandas documentation
easy-to-use data structures and data analysis tools for the `Python `__
programming language.
+{% if not single_doc -%}
.. panels::
:card: + intro-card text-center
:column: col-lg-6 col-md-6 col-sm-6 col-xs-12 d-flex
@@ -96,16 +97,22 @@ programming language.
:text: To the development guide
:classes: btn-block btn-secondary stretched-link
-
+{% endif %}
{% if single_doc and single_doc.endswith('.rst') -%}
.. toctree::
:maxdepth: 3
:titlesonly:
{{ single_doc[:-4] }}
+{% elif single_doc and single_doc.count('.') <= 1 %}
+.. autosummary::
+ :toctree: reference/api/
+
+ {{ single_doc }}
{% elif single_doc %}
.. autosummary::
:toctree: reference/api/
+ :template: autosummary/accessor_method.rst
{{ single_doc }}
{% else -%}
diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst
index 38792c46e5f54..6d09e10f284af 100644
--- a/doc/source/reference/arrays.rst
+++ b/doc/source/reference/arrays.rst
@@ -6,6 +6,10 @@
pandas arrays, scalars, and data types
======================================
+*******
+Objects
+*******
+
.. currentmodule:: pandas
For most data types, pandas uses NumPy arrays as the concrete
@@ -15,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or
For some data types, pandas extends NumPy's type system. String aliases for these types
can be found at :ref:`basics.dtypes`.
-=================== ========================= ================== =============================
-Kind of Data pandas Data Type Scalar Array
-=================== ========================= ================== =============================
-TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime`
-Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta`
-Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period`
-Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval`
-Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
-Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
-Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
-Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
-Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool`
-=================== ========================= ================== =============================
+=================== ========================= ============================= =============================
+Kind of Data pandas Data Type Scalar Array
+=================== ========================= ============================= =============================
+TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime`
+Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta`
+Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period`
+Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval`
+Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
+Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
+Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
+Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
+Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool`
+PyArrow             :class:`ArrowDtype`       Python scalars or :class:`NA` :ref:`api.arrays.arrow`
+=================== ========================= ============================= =============================
pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
The top-level :meth:`array` method can be used to create a new array, which may be
@@ -38,10 +43,48 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra
array
+.. _api.arrays.arrow:
+
+PyArrow
+-------
+
+.. warning::
+
+ This feature is experimental, and the API can change in a future release without warning.
+
+The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a
+:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of a :class:`arrays.ArrowExtensionArray`
+is an :class:`ArrowDtype`.
+
+`PyArrow `__ provides array and `data type `__
+support similar to NumPy's, including first-class nullability for all data types, immutability and more.
+
+.. note::
+
+ For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
+ by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section `
+ below.
+
+While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned**
+as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or :class:`NA` for missing
+values.
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ arrays.ArrowExtensionArray
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ ArrowDtype
+
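+As a minimal sketch (assuming ``pyarrow`` is installed; ``"int64[pyarrow]"`` is
+the string alias that resolves to an :class:`ArrowDtype`):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   ser = pd.Series([1, 2, None], dtype="int64[pyarrow]")
+   ser.dtype     # int64[pyarrow], an ArrowDtype
+   ser[0]        # 1, returned as a Python int
+   ser[2]        # <NA>, the missing-value scalar
+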
.. _api.arrays.datetime:
-Datetime data
--------------
+Datetimes
+---------
NumPy cannot natively represent timezone-aware datetimes. pandas supports this
with the :class:`arrays.DatetimeArray` extension array, which can hold timezone-naive
@@ -161,8 +204,8 @@ If the data are timezone-aware, then every value in the array must have the same
.. _api.arrays.timedelta:
-Timedelta data
---------------
+Timedeltas
+----------
NumPy can natively represent timedeltas. pandas provides :class:`Timedelta`
for symmetry with :class:`Timestamp`.
@@ -216,8 +259,8 @@ A collection of :class:`Timedelta` may be stored in a :class:`TimedeltaArray`.
.. _api.arrays.period:
-Timespan data
--------------
+Periods
+-------
pandas represents spans of times as :class:`Period` objects.
@@ -284,8 +327,8 @@ Every period in a :class:`arrays.PeriodArray` must have the same ``freq``.
.. _api.arrays.interval:
-Interval data
--------------
+Intervals
+---------
Arbitrary intervals can be represented as :class:`Interval` objects.
@@ -379,8 +422,8 @@ pandas provides this through :class:`arrays.IntegerArray`.
.. _api.arrays.categorical:
-Categorical data
-----------------
+Categoricals
+------------
pandas defines a custom data type for representing data that can take only a
limited, fixed set of values. The dtype of a :class:`Categorical` can be described by
@@ -444,8 +487,8 @@ data. See :ref:`api.series.cat` for more.
.. _api.arrays.sparse:
-Sparse data
------------
+Sparse
+------
Data where a single value is repeated many times (e.g. ``0`` or ``NaN``) may
be stored efficiently as a :class:`arrays.SparseArray`.
@@ -464,13 +507,13 @@ be stored efficiently as a :class:`arrays.SparseArray`.
The ``Series.sparse`` accessor may be used to access sparse-specific attributes
and methods if the :class:`Series` contains sparse values. See
-:ref:`api.series.sparse` for more.
+:ref:`api.series.sparse` and :ref:`the user guide ` for more.
.. _api.arrays.string:
-Text data
----------
+Strings
+-------
When working with text data, where each valid element is a string or missing,
we recommend using :class:`StringDtype` (with the alias ``"string"``).
@@ -494,8 +537,8 @@ See :ref:`api.series.str` for more.
.. _api.arrays.bool:
-Boolean data with missing values
---------------------------------
+Nullable Boolean
+----------------
The boolean dtype (with the alias ``"boolean"``) provides support for storing
boolean data (``True``, ``False``) with missing values, which is not possible
@@ -525,3 +568,72 @@ with a bool :class:`numpy.ndarray`.
DatetimeTZDtype.tz
PeriodDtype.freq
IntervalDtype.subtype
+
+*********
+Utilities
+*********
+
+Constructors
+------------
+.. autosummary::
+ :toctree: api/
+
+ api.types.union_categoricals
+ api.types.infer_dtype
+ api.types.pandas_dtype
+
+Data type introspection
+~~~~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_bool_dtype
+ api.types.is_categorical_dtype
+ api.types.is_complex_dtype
+ api.types.is_datetime64_any_dtype
+ api.types.is_datetime64_dtype
+ api.types.is_datetime64_ns_dtype
+ api.types.is_datetime64tz_dtype
+ api.types.is_extension_type
+ api.types.is_extension_array_dtype
+ api.types.is_float_dtype
+ api.types.is_int64_dtype
+ api.types.is_integer_dtype
+ api.types.is_interval_dtype
+ api.types.is_numeric_dtype
+ api.types.is_object_dtype
+ api.types.is_period_dtype
+ api.types.is_signed_integer_dtype
+ api.types.is_string_dtype
+ api.types.is_timedelta64_dtype
+ api.types.is_timedelta64_ns_dtype
+ api.types.is_unsigned_integer_dtype
+ api.types.is_sparse
+
+Iterable introspection
+~~~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_dict_like
+ api.types.is_file_like
+ api.types.is_list_like
+ api.types.is_named_tuple
+ api.types.is_iterator
+
+Scalar introspection
+~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_bool
+ api.types.is_categorical
+ api.types.is_complex
+ api.types.is_float
+ api.types.is_hashable
+ api.types.is_integer
+ api.types.is_interval
+ api.types.is_number
+ api.types.is_re
+ api.types.is_re_compilable
+ api.types.is_scalar
diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst
index 9a1ebc8d670dc..e71ee80767d29 100644
--- a/doc/source/reference/frame.rst
+++ b/doc/source/reference/frame.rst
@@ -373,6 +373,7 @@ Serialization / IO / conversion
DataFrame.from_dict
DataFrame.from_records
+ DataFrame.to_orc
DataFrame.to_parquet
DataFrame.to_pickle
DataFrame.to_csv
@@ -391,3 +392,4 @@ Serialization / IO / conversion
DataFrame.to_clipboard
DataFrame.to_markdown
DataFrame.style
+ DataFrame.__dataframe__
diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst
index b5832cb8aa591..474e37a85d857 100644
--- a/doc/source/reference/general_functions.rst
+++ b/doc/source/reference/general_functions.rst
@@ -23,6 +23,7 @@ Data manipulations
merge_asof
concat
get_dummies
+ from_dummies
factorize
unique
wide_to_long
@@ -37,15 +38,15 @@ Top-level missing data
notna
notnull
-Top-level conversions
-~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with numeric data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
to_numeric
-Top-level dealing with datetimelike
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with datetimelike data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
@@ -57,8 +58,8 @@ Top-level dealing with datetimelike
timedelta_range
infer_freq
-Top-level dealing with intervals
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with Interval data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
@@ -79,9 +80,9 @@ Hashing
util.hash_array
util.hash_pandas_object
-Testing
-~~~~~~~
+Importing from other DataFrame libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
- test
+ api.interchange.from_dataframe
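+
+A short sketch of the interchange entry point (assuming pandas >= 1.5; any
+object exposing a ``__dataframe__`` method can be passed):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   other = pd.DataFrame({"a": [1, 2]})            # stand-in for a non-pandas DataFrame
+   df = pd.api.interchange.from_dataframe(other)  # convert via the protocol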
diff --git a/doc/source/reference/general_utility_functions.rst b/doc/source/reference/general_utility_functions.rst
deleted file mode 100644
index ee17ef3831164..0000000000000
--- a/doc/source/reference/general_utility_functions.rst
+++ /dev/null
@@ -1,127 +0,0 @@
-{{ header }}
-
-.. _api.general_utility_functions:
-
-=========================
-General utility functions
-=========================
-.. currentmodule:: pandas
-
-Working with options
---------------------
-.. autosummary::
- :toctree: api/
-
- describe_option
- reset_option
- get_option
- set_option
- option_context
-
-.. _api.general.testing:
-
-Testing functions
------------------
-.. autosummary::
- :toctree: api/
-
- testing.assert_frame_equal
- testing.assert_series_equal
- testing.assert_index_equal
- testing.assert_extension_array_equal
-
-Exceptions and warnings
------------------------
-.. autosummary::
- :toctree: api/
-
- errors.AbstractMethodError
- errors.AccessorRegistrationWarning
- errors.DtypeWarning
- errors.DuplicateLabelError
- errors.EmptyDataError
- errors.InvalidIndexError
- errors.IntCastingNaNError
- errors.MergeError
- errors.NullFrequencyError
- errors.NumbaUtilError
- errors.OptionError
- errors.OutOfBoundsDatetime
- errors.OutOfBoundsTimedelta
- errors.ParserError
- errors.ParserWarning
- errors.PerformanceWarning
- errors.UnsortedIndexError
- errors.UnsupportedFunctionCall
-
-Data types related functionality
---------------------------------
-.. autosummary::
- :toctree: api/
-
- api.types.union_categoricals
- api.types.infer_dtype
- api.types.pandas_dtype
-
-Dtype introspection
-~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_bool_dtype
- api.types.is_categorical_dtype
- api.types.is_complex_dtype
- api.types.is_datetime64_any_dtype
- api.types.is_datetime64_dtype
- api.types.is_datetime64_ns_dtype
- api.types.is_datetime64tz_dtype
- api.types.is_extension_type
- api.types.is_extension_array_dtype
- api.types.is_float_dtype
- api.types.is_int64_dtype
- api.types.is_integer_dtype
- api.types.is_interval_dtype
- api.types.is_numeric_dtype
- api.types.is_object_dtype
- api.types.is_period_dtype
- api.types.is_signed_integer_dtype
- api.types.is_string_dtype
- api.types.is_timedelta64_dtype
- api.types.is_timedelta64_ns_dtype
- api.types.is_unsigned_integer_dtype
- api.types.is_sparse
-
-Iterable introspection
-~~~~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_dict_like
- api.types.is_file_like
- api.types.is_list_like
- api.types.is_named_tuple
- api.types.is_iterator
-
-Scalar introspection
-~~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_bool
- api.types.is_categorical
- api.types.is_complex
- api.types.is_float
- api.types.is_hashable
- api.types.is_integer
- api.types.is_interval
- api.types.is_number
- api.types.is_re
- api.types.is_re_compilable
- api.types.is_scalar
-
-Bug report function
--------------------
-.. autosummary::
- :toctree: api/
-
- show_versions
diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst
index 2bb0659264eb0..51bd659081b8f 100644
--- a/doc/source/reference/groupby.rst
+++ b/doc/source/reference/groupby.rst
@@ -132,9 +132,7 @@ The following methods are available only for ``SeriesGroupBy`` objects.
SeriesGroupBy.hist
SeriesGroupBy.nlargest
SeriesGroupBy.nsmallest
- SeriesGroupBy.nunique
SeriesGroupBy.unique
- SeriesGroupBy.value_counts
SeriesGroupBy.is_monotonic_increasing
SeriesGroupBy.is_monotonic_decreasing
diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst
index f7c5eaf242b34..fc920db671ee5 100644
--- a/doc/source/reference/index.rst
+++ b/doc/source/reference/index.rst
@@ -37,8 +37,9 @@ public functions related to data types in pandas.
resampling
style
plotting
- general_utility_functions
+ options
extensions
+ testing
.. This is to prevent warnings in the doc build. We don't want to encourage
.. these methods.
@@ -46,20 +47,11 @@ public functions related to data types in pandas.
..
.. toctree::
- api/pandas.DataFrame.blocks
- api/pandas.DataFrame.as_matrix
api/pandas.Index.asi8
- api/pandas.Index.data
- api/pandas.Index.flags
api/pandas.Index.holds_integer
api/pandas.Index.is_type_compatible
api/pandas.Index.nlevels
api/pandas.Index.sort
- api/pandas.Series.asobject
- api/pandas.Series.blocks
- api/pandas.Series.from_array
- api/pandas.Series.imag
- api/pandas.Series.real
.. Can't convince sphinx to generate toctree for this class attribute.
diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst
index 7aad937d10a18..425b5f81be966 100644
--- a/doc/source/reference/io.rst
+++ b/doc/source/reference/io.rst
@@ -53,7 +53,6 @@ Excel
.. autosummary::
:toctree: api/
- :template: autosummary/class_without_autosummary.rst
ExcelWriter
@@ -135,7 +134,7 @@ HDFStore: PyTables (HDF5)
.. warning::
- One can store a subclass of ``DataFrame`` or ``Series`` to HDF5,
+ One can store a subclass of :class:`DataFrame` or :class:`Series` to HDF5,
but the type of the subclass is lost upon storing.
Feather
@@ -160,6 +159,7 @@ ORC
:toctree: api/
read_orc
+ DataFrame.to_orc
SAS
~~~
diff --git a/doc/source/reference/options.rst b/doc/source/reference/options.rst
new file mode 100644
index 0000000000000..7316b6e9c72b1
--- /dev/null
+++ b/doc/source/reference/options.rst
@@ -0,0 +1,21 @@
+{{ header }}
+
+.. _api.options:
+
+====================
+Options and settings
+====================
+.. currentmodule:: pandas
+
+API for configuring global behavior. See :ref:`the User Guide ` for more.
+
+Working with options
+--------------------
+.. autosummary::
+ :toctree: api/
+
+ describe_option
+ reset_option
+ get_option
+ set_option
+ option_context
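+
+A minimal sketch of the options API:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   pd.get_option("display.max_rows")        # read the current value
+   pd.set_option("display.max_rows", 20)    # set it globally
+   with pd.option_context("display.max_rows", 5):
+       pass                                 # applies only inside the block
+   pd.reset_option("display.max_rows")      # restore the default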
diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
index a60dab549e66d..fcdc9ea9b95da 100644
--- a/doc/source/reference/series.rst
+++ b/doc/source/reference/series.rst
@@ -342,6 +342,7 @@ Datetime methods
:toctree: api/
:template: autosummary/accessor_method.rst
+ Series.dt.isocalendar
Series.dt.to_period
Series.dt.to_pydatetime
Series.dt.tz_localize
diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst
index a739993e4d376..5144f12fa373a 100644
--- a/doc/source/reference/style.rst
+++ b/doc/source/reference/style.rst
@@ -27,6 +27,7 @@ Styler properties
Styler.template_html_style
Styler.template_html_table
Styler.template_latex
+ Styler.template_string
Styler.loader
Style application
@@ -40,7 +41,9 @@ Style application
Styler.applymap_index
Styler.format
Styler.format_index
+ Styler.relabel_index
Styler.hide
+ Styler.concat
Styler.set_td_classes
Styler.set_table_styles
Styler.set_table_attributes
@@ -74,5 +77,6 @@ Style export and import
Styler.to_html
Styler.to_latex
Styler.to_excel
+ Styler.to_string
Styler.export
Styler.use
diff --git a/doc/source/reference/testing.rst b/doc/source/reference/testing.rst
new file mode 100644
index 0000000000000..1144c767942d4
--- /dev/null
+++ b/doc/source/reference/testing.rst
@@ -0,0 +1,77 @@
+{{ header }}
+
+.. _api.testing:
+
+=======
+Testing
+=======
+.. currentmodule:: pandas
+
+.. _api.general.testing:
+
+Assertion functions
+-------------------
+.. autosummary::
+ :toctree: api/
+
+ testing.assert_frame_equal
+ testing.assert_series_equal
+ testing.assert_index_equal
+ testing.assert_extension_array_equal
+
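+A small sketch of the assertion helpers as they might appear in a test:
+
+.. code-block:: python
+
+   import pandas as pd
+   import pandas.testing as tm
+
+   left = pd.DataFrame({"a": [1, 2]})
+   right = pd.DataFrame({"a": [1, 2]})
+   tm.assert_frame_equal(left, right)   # raises AssertionError on mismatch
+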
+Exceptions and warnings
+-----------------------
+.. autosummary::
+ :toctree: api/
+
+ errors.AbstractMethodError
+ errors.AccessorRegistrationWarning
+ errors.AttributeConflictWarning
+ errors.CategoricalConversionWarning
+ errors.ClosedFileError
+ errors.CSSWarning
+ errors.DatabaseError
+ errors.DataError
+ errors.DtypeWarning
+ errors.DuplicateLabelError
+ errors.EmptyDataError
+ errors.IncompatibilityWarning
+ errors.IndexingError
+ errors.InvalidColumnName
+ errors.InvalidIndexError
+ errors.IntCastingNaNError
+ errors.MergeError
+ errors.NullFrequencyError
+ errors.NumbaUtilError
+ errors.NumExprClobberingError
+ errors.OptionError
+ errors.OutOfBoundsDatetime
+ errors.OutOfBoundsTimedelta
+ errors.ParserError
+ errors.ParserWarning
+ errors.PerformanceWarning
+ errors.PossibleDataLossError
+ errors.PossiblePrecisionLoss
+ errors.PyperclipException
+ errors.PyperclipWindowsException
+ errors.SettingWithCopyError
+ errors.SettingWithCopyWarning
+ errors.SpecificationError
+ errors.UndefinedVariableError
+ errors.UnsortedIndexError
+ errors.UnsupportedFunctionCall
+ errors.ValueLabelTypeMismatch
+
+Bug report function
+-------------------
+.. autosummary::
+ :toctree: api/
+
+ show_versions
+
+Test suite runner
+-----------------
+.. autosummary::
+ :toctree: api/
+
+ test
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 08488a33936f0..c767fb1ebef7f 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -29,7 +29,7 @@ a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
-Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index
+Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index using :func:`date_range`
and labeled columns:
.. ipython:: python
@@ -93,14 +93,15 @@ Viewing data
See the :ref:`Basics section `.
-Here is how to view the top and bottom rows of the frame:
+Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame
+respectively:
.. ipython:: python
df.head()
df.tail(3)
-Display the index, columns:
+Display the :attr:`DataFrame.index` or :attr:`DataFrame.columns`:
.. ipython:: python
@@ -116,7 +117,7 @@ while pandas DataFrames have one dtype per column**. When you call
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
-For ``df``, our :class:`DataFrame` of all floating-point values,
+For our :class:`DataFrame` ``df`` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data:
.. ipython:: python
@@ -147,13 +148,13 @@ Transposing your data:
df.T
-Sorting by an axis:
+:meth:`DataFrame.sort_index` sorts by an axis:
.. ipython:: python
df.sort_index(axis=1, ascending=False)
-Sorting by values:
+:meth:`DataFrame.sort_values` sorts by values:
.. ipython:: python
@@ -166,8 +167,8 @@ Selection
While standard Python / NumPy expressions for selecting and setting are
intuitive and come in handy for interactive work, for production code, we
- recommend the optimized pandas data access methods, ``.at``, ``.iat``,
- ``.loc`` and ``.iloc``.
+ recommend the optimized pandas data access methods, :meth:`DataFrame.at`, :meth:`DataFrame.iat`,
+ :meth:`DataFrame.loc` and :meth:`DataFrame.iloc`.
See the indexing documentation :ref:`Indexing and Selecting Data ` and :ref:`MultiIndex / Advanced Indexing `.
@@ -181,7 +182,7 @@ equivalent to ``df.A``:
df["A"]
-Selecting via ``[]``, which slices the rows:
+Selecting via ``[]`` (``__getitem__``), which slices the rows:
.. ipython:: python
@@ -191,7 +192,7 @@ Selecting via ``[]``, which slices the rows:
Selection by label
~~~~~~~~~~~~~~~~~~
-See more in :ref:`Selection by Label `.
+See more in :ref:`Selection by Label ` using :meth:`DataFrame.loc` or :meth:`DataFrame.at`.
For getting a cross section using a label:
@@ -232,7 +233,7 @@ For getting fast access to a scalar (equivalent to the prior method):
Selection by position
~~~~~~~~~~~~~~~~~~~~~
-See more in :ref:`Selection by Position `.
+See more in :ref:`Selection by Position ` using :meth:`DataFrame.iloc` or :meth:`DataFrame.iat`.
Select via the position of the passed integers:
@@ -327,6 +328,7 @@ Setting values by position:
Setting by assigning with a NumPy array:
.. ipython:: python
+ :okwarning:
df.loc[:, "D"] = np.array([5] * len(df))
@@ -361,19 +363,19 @@ returns a copy of the data:
df1.loc[dates[0] : dates[1], "E"] = 1
df1
-To drop any rows that have missing data:
+:meth:`DataFrame.dropna` drops any rows that have missing data:
.. ipython:: python
df1.dropna(how="any")
-Filling missing data:
+:meth:`DataFrame.fillna` fills missing data:
.. ipython:: python
df1.fillna(value=5)
-To get the boolean mask where values are ``nan``:
+:func:`isna` gets the boolean mask where values are ``nan``:
.. ipython:: python
@@ -415,7 +417,7 @@ In addition, pandas automatically broadcasts along the specified dimension:
Apply
~~~~~
-Applying functions to the data:
+:meth:`DataFrame.apply` applies a user-defined function to the data:
.. ipython:: python
@@ -461,7 +463,7 @@ operations.
See the :ref:`Merging section `.
-Concatenating pandas objects together with :func:`concat`:
+Concatenating pandas objects together along an axis with :func:`concat`:
.. ipython:: python
@@ -482,7 +484,7 @@ Concatenating pandas objects together with :func:`concat`:
Join
~~~~
-SQL style merges. See the :ref:`Database style joining ` section.
+:func:`merge` enables SQL-style joins on specific columns. See the :ref:`Database style joining ` section.
.. ipython:: python
@@ -531,7 +533,7 @@ groups:
.. ipython:: python
- df.groupby("A").sum()
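+ # selecting the numeric columns "C" and "D" avoids aggregating the string column "B"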
+ df.groupby("A")[["C", "D"]].sum()
Grouping by multiple columns forms a hierarchical index, and again we can
apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
@@ -553,10 +555,8 @@ Stack
tuples = list(
zip(
- *[
- ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
- ["one", "two", "one", "two", "one", "two", "one", "two"],
- ]
+ ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
+ ["one", "two", "one", "two", "one", "two", "one", "two"],
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
@@ -572,7 +572,7 @@ columns:
stacked = df2.stack()
stacked
-With a "stacked" DataFrame or Series (having a ``MultiIndex`` as the
+With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
``index``), the inverse operation of :meth:`~DataFrame.stack` is
:meth:`~DataFrame.unstack`, which by default unstacks the **last level**:
@@ -599,7 +599,7 @@ See the section on :ref:`Pivot Tables `.
)
df
-We can produce pivot tables from this data very easily:
+:func:`pivot_table` pivots a :class:`DataFrame` specifying the ``values``, ``index`` and ``columns``:
.. ipython:: python
@@ -620,7 +620,7 @@ financial applications. See the :ref:`Time Series section `.
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()
-Time zone representation:
+:meth:`Series.tz_localize` localizes a time series to a time zone:
.. ipython:: python
@@ -630,7 +630,7 @@ Time zone representation:
ts_utc = ts.tz_localize("UTC")
ts_utc
-Converting to another time zone:
+:meth:`Series.tz_convert` converts a timezone-aware time series to another time zone:
.. ipython:: python
@@ -680,12 +680,12 @@ Converting the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
-Rename the categories to more meaningful names (assigning to
-:meth:`Series.cat.categories` is in place!):
+Rename the categories to more meaningful names:
.. ipython:: python
- df["grade"].cat.categories = ["very good", "good", "very bad"]
+ new_categories = ["very good", "good", "very bad"]
+ df["grade"] = df["grade"].cat.rename_categories(new_categories)
Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default):
@@ -722,7 +722,7 @@ We use the standard convention for referencing the matplotlib API:
plt.close("all")
-The :meth:`~plt.close` method is used to `close `__ a figure window:
+The ``plt.close`` function is used to `close `__ a figure window:
.. ipython:: python
@@ -732,7 +732,7 @@ The :meth:`~plt.close` method is used to `close `__ to show it or
`matplotlib.pyplot.savefig `__ to write it to a file.
@@ -756,19 +756,19 @@ of the columns with labels:
@savefig frame_plot_basic.png
plt.legend(loc='best');
-Getting data in/out
--------------------
+Importing and exporting data
+----------------------------
CSV
~~~
-:ref:`Writing to a csv file: `
+:ref:`Writing to a csv file: ` using :meth:`DataFrame.to_csv`
.. ipython:: python
df.to_csv("foo.csv")
-:ref:`Reading from a csv file: `
+:ref:`Reading from a csv file: ` using :func:`read_csv`
.. ipython:: python
@@ -786,13 +786,13 @@ HDF5
Reading and writing to :ref:`HDFStores `.
-Writing to a HDF5 Store:
+Writing to an HDF5 Store using :meth:`DataFrame.to_hdf`:
.. ipython:: python
df.to_hdf("foo.h5", "df")
-Reading from a HDF5 Store:
+Reading from an HDF5 Store using :func:`read_hdf`:
.. ipython:: python
@@ -806,15 +806,15 @@ Reading from a HDF5 Store:
Excel
~~~~~
-Reading and writing to :ref:`MS Excel `.
+Reading and writing to :ref:`Excel `.
-Writing to an excel file:
+Writing to an Excel file using :meth:`DataFrame.to_excel`:
.. ipython:: python
df.to_excel("foo.xlsx", sheet_name="Sheet1")
-Reading from an excel file:
+Reading from an Excel file using :func:`read_excel`:
.. ipython:: python
@@ -828,16 +828,13 @@ Reading from an excel file:
Gotchas
-------
-If you are attempting to perform an operation you might see an exception like:
+If you are attempting to perform a boolean operation on a :class:`Series` or :class:`DataFrame`,
+you might see an exception like:
-.. code-block:: python
-
- >>> if pd.Series([False, True, False]):
- ... print("I was true")
- Traceback
- ...
- ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+.. ipython:: python
+ :okexcept:
-See :ref:`Comparisons` for an explanation and what to do.
+ if pd.Series([False, True, False]):
+ print("I was true")
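+
+One way out (a short sketch) is to reduce the :class:`Series` to a single boolean
+explicitly, as the error message suggests:
+
+.. code-block:: python
+
+   if pd.Series([False, True, False]).any():
+       print("at least one was true")
+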
-See :ref:`Gotchas` as well.
+See :ref:`Comparisons` and :ref:`Gotchas` for an explanation and what to do.
diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst
index 45b2e57f52c6c..b8df21ab5a5b4 100644
--- a/doc/source/user_guide/advanced.rst
+++ b/doc/source/user_guide/advanced.rst
@@ -1246,5 +1246,5 @@ This is because the (re)indexing operations above silently inserts ``NaNs`` and
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs``
such as ``numpy.logical_and``.
-See the `this old issue `__ for a more
+See :issue:`2388` for a more
detailed discussion.
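+
+A tiny sketch of the silent ``NaN`` insertion and dtype change behind this caveat:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   s = pd.Series([1, 2, 3], index=["a", "b", "c"])
+   s.reindex(["a", "b", "d"]).dtype   # float64 -- the inserted NaN forces a cast from int64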
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
index 40ff1049e5820..a34d4891b9d77 100644
--- a/doc/source/user_guide/basics.rst
+++ b/doc/source/user_guide/basics.rst
@@ -848,8 +848,8 @@ have introduced the popular ``(%>%)`` (read pipe) operator for R_.
The implementation of ``pipe`` here is quite clean and feels right at home in Python.
We encourage you to view the source code of :meth:`~DataFrame.pipe`.
-.. _dplyr: https://github.com/hadley/dplyr
-.. _magrittr: https://github.com/smbache/magrittr
+.. _dplyr: https://github.com/tidyverse/dplyr
+.. _magrittr: https://github.com/tidyverse/magrittr
.. _R: https://www.r-project.org
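+
+A short sketch of method chaining with :meth:`~DataFrame.pipe`:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   def add_constant(df, k):
+       return df + k
+
+   df = pd.DataFrame({"a": [1, 2]})
+   df.pipe(add_constant, k=1)   # equivalent to add_constant(df, k=1)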
diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index 0105cf99193dd..b5cb1d83a9f52 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -334,8 +334,7 @@ It's also possible to pass in the categories in a specific order:
Renaming categories
~~~~~~~~~~~~~~~~~~~
-Renaming categories is done by assigning new values to the
-``Series.cat.categories`` property or by using the
+Renaming categories is done by using the
:meth:`~pandas.Categorical.rename_categories` method:
@@ -343,9 +342,8 @@ Renaming categories is done by assigning new values to the
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s
- s.cat.categories = ["Group %s" % g for g in s.cat.categories]
- s
- s = s.cat.rename_categories([1, 2, 3])
+ new_categories = ["Group %s" % g for g in s.cat.categories]
+ s = s.cat.rename_categories(new_categories)
s
# You can also pass a dict-like object to map the renaming
s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
@@ -365,7 +363,7 @@ Categories must be unique or a ``ValueError`` is raised:
.. ipython:: python
try:
- s.cat.categories = [1, 1, 1]
+ s = s.cat.rename_categories([1, 1, 1])
except ValueError as e:
print("ValueError:", str(e))
@@ -374,7 +372,7 @@ Categories must also not be ``NaN`` or a ``ValueError`` is raised:
.. ipython:: python
try:
- s.cat.categories = [1, 2, np.nan]
+ s = s.cat.rename_categories([1, 2, np.nan])
except ValueError as e:
print("ValueError:", str(e))
@@ -702,7 +700,7 @@ of length "1".
.. ipython:: python
df.iat[0, 0]
- df["cats"].cat.categories = ["x", "y", "z"]
+ df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
df.at["h", "cats"] # returns a string
.. note::
@@ -960,7 +958,7 @@ relevant columns back to ``category`` and assign the right categories and catego
s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
# rename the categories
- s.cat.categories = ["very good", "good", "bad"]
+ s = s.cat.rename_categories(["very good", "good", "bad"])
# reorder the categories and add missing categories
s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
@@ -1164,6 +1162,7 @@ Constructing a ``Series`` from a ``Categorical`` will not copy the input
change the original ``Categorical``:
.. ipython:: python
+ :okwarning:
cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat")
diff --git a/doc/source/user_guide/computation.rst b/doc/source/user_guide/computation.rst
deleted file mode 100644
index 6007129e96ba0..0000000000000
--- a/doc/source/user_guide/computation.rst
+++ /dev/null
@@ -1,212 +0,0 @@
-.. _computation:
-
-{{ header }}
-
-Computational tools
-===================
-
-
-Statistical functions
----------------------
-
-.. _computation.pct_change:
-
-Percent change
-~~~~~~~~~~~~~~
-
-``Series`` and ``DataFrame`` have a method
-:meth:`~DataFrame.pct_change` to compute the percent change over a given number
-of periods (using ``fill_method`` to fill NA/null values *before* computing
-the percent change).
-
-.. ipython:: python
-
- ser = pd.Series(np.random.randn(8))
-
- ser.pct_change()
-
-.. ipython:: python
-
- df = pd.DataFrame(np.random.randn(10, 4))
-
- df.pct_change(periods=3)
-
-.. _computation.covariance:
-
-Covariance
-~~~~~~~~~~
-
-:meth:`Series.cov` can be used to compute covariance between series
-(excluding missing values).
-
-.. ipython:: python
-
- s1 = pd.Series(np.random.randn(1000))
- s2 = pd.Series(np.random.randn(1000))
- s1.cov(s2)
-
-Analogously, :meth:`DataFrame.cov` to compute pairwise covariances among the
-series in the DataFrame, also excluding NA/null values.
-
-.. _computation.covariance.caveats:
-
-.. note::
-
- Assuming the missing data are missing at random this results in an estimate
- for the covariance matrix which is unbiased. However, for many applications
- this estimate may not be acceptable because the estimated covariance matrix
- is not guaranteed to be positive semi-definite. This could lead to
- estimated correlations having absolute values which are greater than one,
- and/or a non-invertible covariance matrix. See `Estimation of covariance
- matrices `_
- for more details.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
- frame.cov()
-
-``DataFrame.cov`` also supports an optional ``min_periods`` keyword that
-specifies the required minimum number of observations for each column pair
-in order to have a valid result.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
- frame.loc[frame.index[:5], "a"] = np.nan
- frame.loc[frame.index[5:10], "b"] = np.nan
-
- frame.cov()
-
- frame.cov(min_periods=12)
-
-
-.. _computation.correlation:
-
-Correlation
-~~~~~~~~~~~
-
-Correlation may be computed using the :meth:`~DataFrame.corr` method.
-Using the ``method`` parameter, several methods for computing correlations are
-provided:
-
-.. csv-table::
- :header: "Method name", "Description"
- :widths: 20, 80
-
- ``pearson (default)``, Standard correlation coefficient
- ``kendall``, Kendall Tau correlation coefficient
- ``spearman``, Spearman rank correlation coefficient
-
-.. \rho = \cov(x, y) / \sigma_x \sigma_y
-
-All of these are currently computed using pairwise complete observations.
-Wikipedia has articles covering the above correlation coefficients:
-
-* `Pearson correlation coefficient `_
-* `Kendall rank correlation coefficient `_
-* `Spearman's rank correlation coefficient `_
-
-.. note::
-
- Please see the :ref:`caveats ` associated
- with this method of calculating correlation matrices in the
- :ref:`covariance section `.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
- frame.iloc[::2] = np.nan
-
- # Series with Series
- frame["a"].corr(frame["b"])
- frame["a"].corr(frame["b"], method="spearman")
-
- # Pairwise correlation of DataFrame columns
- frame.corr()
-
-Note that non-numeric columns will be automatically excluded from the
-correlation calculation.
-
-Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword:
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
- frame.loc[frame.index[:5], "a"] = np.nan
- frame.loc[frame.index[5:10], "b"] = np.nan
-
- frame.corr()
-
- frame.corr(min_periods=12)
-
-
-The ``method`` argument can also be a callable for a generic correlation
-calculation. In this case, it should be a single function
-that produces a single value from two ndarray inputs. Suppose we wanted to
-compute the correlation based on histogram intersection:
-
-.. ipython:: python
-
- # histogram intersection
- def histogram_intersection(a, b):
- return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()
-
-
- frame.corr(method=histogram_intersection)
-
-A related method :meth:`~DataFrame.corrwith` is implemented on DataFrame to
-compute the correlation between like-labeled Series contained in different
-DataFrame objects.
-
-.. ipython:: python
-
- index = ["a", "b", "c", "d", "e"]
- columns = ["one", "two", "three", "four"]
- df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
- df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
- df1.corrwith(df2)
- df2.corrwith(df1, axis=1)
-
-.. _computation.ranking:
-
-Data ranking
-~~~~~~~~~~~~
-
-The :meth:`~Series.rank` method produces a data ranking with ties being
-assigned the mean of the ranks (by default) for the group:
-
-.. ipython:: python
-
- s = pd.Series(np.random.randn(5), index=list("abcde"))
- s["d"] = s["b"] # so there's a tie
- s.rank()
-
-:meth:`~DataFrame.rank` is also a DataFrame method and can rank either the rows
-(``axis=0``) or the columns (``axis=1``). ``NaN`` values are excluded from the
-ranking.
-
-.. ipython:: python
-
- df = pd.DataFrame(np.random.randn(10, 6))
- df[4] = df[2][:5] # some ties
- df
- df.rank(1)
-
-``rank`` optionally takes a parameter ``ascending`` which by default is true;
-when false, data is reverse-ranked, with larger values assigned a smaller rank.
-
-``rank`` supports different tie-breaking methods, specified with the ``method``
-parameter:
-
- - ``average`` : average rank of tied group
- - ``min`` : lowest rank in the group
- - ``max`` : highest rank in the group
- - ``first`` : ranks assigned in the order they appear in the array
-
-.. _computation.windowing:
-
-Windowing functions
-~~~~~~~~~~~~~~~~~~~
-
-See :ref:`the window operations user guide ` for an overview of windowing functions.
diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst
index 8c2dd3ba60f13..daf5a0e481b8e 100644
--- a/doc/source/user_guide/cookbook.rst
+++ b/doc/source/user_guide/cookbook.rst
@@ -193,8 +193,7 @@ The :ref:`indexing ` docs.
df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
-`Use loc for label-oriented slicing and iloc positional slicing
-`__
+Use loc for label-oriented slicing and iloc for positional slicing :issue:`2904`
.. ipython:: python
@@ -229,7 +228,7 @@ Ambiguity arises when an index consists of integers with a non-zero start or non
df2.loc[1:3] # Label-oriented
`Using inverse operator (~) to take the complement of a mask
-`__
+`__
.. ipython:: python
@@ -259,7 +258,7 @@ New columns
df
`Keep other columns when using min() with groupby
-`__
+`__
.. ipython:: python
@@ -389,14 +388,13 @@ Sorting
*******
`Sort by specific column or an ordered list of columns, with a MultiIndex
-`__
+`__
.. ipython:: python
df.sort_values(by=("Labs", "II"), ascending=False)
-`Partial selection, the need for sortedness;
-`__
+Partial selection, the need for sortedness :issue:`2995`
Levels
******
@@ -405,7 +403,7 @@ Levels
`__
`Flatten Hierarchical columns
-`__
+`__
.. _cookbook.missing_data:
@@ -425,7 +423,7 @@ Fill forward a reversed timeseries
)
df.loc[df.index[3], "A"] = np.nan
df
- df.reindex(df.index[::-1]).ffill()
+ df.bfill()
`cumsum reset at NaN values
`__
@@ -513,7 +511,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
def replace(g):
mask = g < 0
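+ # keep the non-negative entries; replace negatives with the mean of the non-negatives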
- return g.where(mask, g[~mask].mean())
+ return g.where(~mask, g[~mask].mean())
gb.transform(replace)
@@ -556,7 +554,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
ts
`Create a value counts column and reassign back to the DataFrame
-`__
+`__
.. ipython:: python
@@ -663,7 +661,7 @@ Pivot
The :ref:`Pivot ` docs.
`Partial sums and subtotals
-`__
+`__
.. ipython:: python
@@ -870,7 +868,7 @@ Timeseries
`__
`Constructing a datetime range that excludes weekends and includes only certain times
-`__
+`__
`Vectorized Lookup
`__
@@ -910,8 +908,7 @@ Valid frequency arguments to Grouper :ref:`Timeseries `__
-`Using TimeGrouper and another grouping to create subgroups, then apply a custom function
-`__
+Using TimeGrouper and another grouping to create subgroups, then apply a custom function :issue:`3791`
`Resampling with custom periods
`__
@@ -947,8 +944,7 @@ Depending on df construction, ``ignore_index`` may be needed
df = pd.concat([df1, df2], ignore_index=True)
df
-`Self Join of a DataFrame
-`__
+Self Join of a DataFrame :issue:`2996`
.. ipython:: python
@@ -1038,7 +1034,7 @@ Data in/out
-----------
`Performance comparison of SQL vs HDF5
-`__
+`__
.. _cookbook.csv:
@@ -1070,14 +1066,7 @@ using that handle to read.
`Inferring dtypes from a file
`__
-`Dealing with bad lines
-`__
-
-`Dealing with bad lines II
-`__
-
-`Reading CSV with Unix timestamps and converting to local timezone
-`__
+Dealing with bad lines :issue:`2886`
`Write a multi-row index CSV without writing duplicates
`__
@@ -1211,8 +1200,7 @@ The :ref:`Excel ` docs
`Modifying formatting in XlsxWriter output
`__
-`Loading only visible sheets
-`__
+Loading only visible sheets :issue:`19842#issuecomment-892150745`
.. _cookbook.html:
@@ -1232,8 +1220,7 @@ The :ref:`HDFStores ` docs
`Simple queries with a Timestamp Index
`__
-`Managing heterogeneous data using a linked multiple table hierarchy
-`__
+Managing heterogeneous data using a linked multiple table hierarchy :issue:`3032`
`Merging on-disk tables with millions of rows
`__
@@ -1253,7 +1240,7 @@ csv file and creating a store by chunks, with date parsing as well.
`__
`Large Data work flows
-`__
+`__
`Reading in a sequence of files, then providing a global unique index to a store while appending
`__
@@ -1384,7 +1371,7 @@ Computation
-----------
`Numerical integration (sample-based) of a time series
-`__
+`__
Correlation
***********
diff --git a/doc/source/user_guide/dsintro.rst b/doc/source/user_guide/dsintro.rst
index efcf1a8703d2b..571f8980070af 100644
--- a/doc/source/user_guide/dsintro.rst
+++ b/doc/source/user_guide/dsintro.rst
@@ -8,7 +8,7 @@ Intro to data structures
We'll start with a quick, non-comprehensive overview of the fundamental data
structures in pandas to get you started. The fundamental behavior about data
-types, indexing, and axis labeling / alignment apply across all of the
+types, indexing, axis labeling, and alignment applies across all of the
objects. To get started, import NumPy and load pandas into your namespace:
.. ipython:: python
@@ -16,7 +16,7 @@ objects. To get started, import NumPy and load pandas into your namespace:
import numpy as np
import pandas as pd
-Here is a basic tenet to keep in mind: **data alignment is intrinsic**. The link
+Fundamentally, **data alignment is intrinsic**. The link
between labels and data will not be broken unless done so explicitly by you.
We'll give a brief intro to the data structures, then consider all of the broad
@@ -29,7 +29,7 @@ Series
:class:`Series` is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The axis
-labels are collectively referred to as the **index**. The basic method to create a Series is to call:
+labels are collectively referred to as the **index**. The basic method to create a :class:`Series` is to call:
::
@@ -61,32 +61,17 @@ index is passed, one will be created having values ``[0, ..., len(data) - 1]``.
pandas supports non-unique index values. If an operation
that does not support duplicate index values is attempted, an exception
- will be raised at that time. The reason for being lazy is nearly all performance-based
- (there are many instances in computations, like parts of GroupBy, where the index
- is not used).
+ will be raised at that time.
**From dict**
-Series can be instantiated from dicts:
+:class:`Series` can be instantiated from dicts:
.. ipython:: python
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)
-.. note::
-
- When the data is a dict, and an index is not passed, the ``Series`` index
- will be ordered by the dict's insertion order, if you're using Python
- version >= 3.6 and pandas version >= 0.23.
-
- If you're using Python < 3.6 or pandas < 0.23, and an index is not passed,
- the ``Series`` index will be the lexically ordered list of dict keys.
-
-In the example above, if you were on a Python version lower than 3.6 or a
-pandas version lower than 0.23, the ``Series`` would be ordered by the lexical
-order of the dict keys (i.e. ``['a', 'b', 'c']`` rather than ``['b', 'a', 'c']``).
-
If an index is passed, the values in data corresponding to the labels in the
index will be pulled out.
@@ -112,7 +97,7 @@ provided. The value will be repeated to match the length of **index**.
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~
-``Series`` acts very similarly to a ``ndarray``, and is a valid argument to most NumPy functions.
+:class:`Series` acts very similarly to a ``ndarray`` and is a valid argument to most NumPy functions.
However, operations such as slicing will also slice the index.
.. ipython:: python
@@ -128,7 +113,7 @@ However, operations such as slicing will also slice the index.
We will address array-based indexing like ``s[[4, 3, 1]]``
in :ref:`section on indexing `.
-Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.
+Like a NumPy array, a pandas :class:`Series` has a single :attr:`~Series.dtype`.
.. ipython:: python
@@ -140,7 +125,7 @@ be an :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.
-If you need the actual array backing a ``Series``, use :attr:`Series.array`.
+If you need the actual array backing a :class:`Series`, use :attr:`Series.array`.
.. ipython:: python
@@ -151,24 +136,24 @@ index (to disable :ref:`automatic alignment `, for example).
:attr:`Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
Briefly, an ExtensionArray is a thin wrapper around one or more *concrete* arrays like a
-:class:`numpy.ndarray`. pandas knows how to take an ``ExtensionArray`` and
-store it in a ``Series`` or a column of a ``DataFrame``.
+:class:`numpy.ndarray`. pandas knows how to take an :class:`~pandas.api.extensions.ExtensionArray` and
+store it in a :class:`Series` or a column of a :class:`DataFrame`.
See :ref:`basics.dtypes` for more.
-While Series is ndarray-like, if you need an *actual* ndarray, then use
+While :class:`Series` is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.
.. ipython:: python
s.to_numpy()
-Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
+Even if the :class:`Series` is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.
Series is dict-like
~~~~~~~~~~~~~~~~~~~
-A Series is like a fixed-size dict in that you can get and set values by index
+A :class:`Series` is also like a fixed-size dict in that you can get and set values by index
label:
.. ipython:: python
@@ -179,14 +164,14 @@ label:
"e" in s
"f" in s
-If a label is not contained, an exception is raised:
+If a label is not contained in the index, an exception is raised:
-.. code-block:: python
+.. ipython:: python
+ :okexcept:
- >>> s["f"]
- KeyError: 'f'
+ s["f"]
-Using the ``get`` method, a missing label will return None or specified default:
+Using the :meth:`Series.get` method, a missing label will return ``None`` or a specified default:
.. ipython:: python
@@ -194,14 +179,14 @@ Using the ``get`` method, a missing label will return None or specified default:
s.get("f", np.nan)
-See also the :ref:`section on attribute access`.
+These labels can also be accessed by :ref:`attribute`.
Vectorized operations and label alignment with Series
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When working with raw NumPy arrays, looping through value-by-value is usually
-not necessary. The same is true when working with Series in pandas.
-Series can also be passed into most NumPy methods expecting an ndarray.
+not necessary. The same is true when working with :class:`Series` in pandas.
+:class:`Series` can also be passed into most NumPy methods expecting an ndarray.
.. ipython:: python
@@ -209,17 +194,17 @@ Series can also be passed into most NumPy methods expecting an ndarray.
s * 2
np.exp(s)
-A key difference between Series and ndarray is that operations between Series
+A key difference between :class:`Series` and ndarray is that operations between :class:`Series`
automatically align the data based on label. Thus, you can write computations
-without giving consideration to whether the Series involved have the same
+without giving consideration to whether the :class:`Series` involved have the same
labels.
.. ipython:: python
s[1:] + s[:-1]
-The result of an operation between unaligned Series will have the **union** of
-the indexes involved. If a label is not found in one Series or the other, the
+The result of an operation between unaligned :class:`Series` will have the **union** of
+the indexes involved. If a label is not found in one :class:`Series` or the other, the
result will be marked as missing ``NaN``. Being able to write code without doing
any explicit data alignment grants immense freedom and flexibility in
interactive data analysis and research. The integrated data alignment features
@@ -240,7 +225,7 @@ Name attribute
.. _dsintro.name_attribute:
-Series can also have a ``name`` attribute:
+:class:`Series` also has a ``name`` attribute:
.. ipython:: python
@@ -248,10 +233,11 @@ Series can also have a ``name`` attribute:
s
s.name
-The Series ``name`` will be assigned automatically in many cases, in particular
-when taking 1D slices of DataFrame as you will see below.
+The :class:`Series` ``name`` can be assigned automatically in many cases; in particular,
+when selecting a single column from a :class:`DataFrame`, the ``name`` will be assigned
+the column label.
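+
+For instance (a small sketch):
+
+.. code-block:: python
+
+   df = pd.DataFrame({"col": [1, 2, 3]})
+   df["col"].name   # 'col', taken from the column label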
-You can rename a Series with the :meth:`pandas.Series.rename` method.
+You can rename a :class:`Series` with the :meth:`pandas.Series.rename` method.
.. ipython:: python
@@ -265,17 +251,17 @@ Note that ``s`` and ``s2`` refer to different objects.
DataFrame
---------
-**DataFrame** is a 2-dimensional labeled data structure with columns of
+:class:`DataFrame` is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet or SQL
table, or a dict of Series objects. It is generally the most commonly used
pandas object. Like Series, DataFrame accepts many different kinds of input:
-* Dict of 1D ndarrays, lists, dicts, or Series
+* Dict of 1D ndarrays, lists, dicts, or :class:`Series`
* 2-D numpy.ndarray
* `Structured or record
`__ ndarray
-* A ``Series``
-* Another ``DataFrame``
+* A :class:`Series`
+* Another :class:`DataFrame`
Along with the data, you can optionally pass **index** (row labels) and
**columns** (column labels) arguments. If you pass an index and / or columns,
@@ -286,16 +272,6 @@ not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data
based on common sense rules.
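+A hedged sketch of these rules (labels are illustrative): passing explicit
+``index`` and ``columns`` keeps matching data and fills the rest with ``NaN``:
+
+.. code-block:: python
+
+   d = {
+       "one": pd.Series([1.0, 2.0], index=["a", "b"]),
+       "two": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
+   }
+   pd.DataFrame(d, index=["b", "c"], columns=["two", "three"])
+   # only rows "b" and "c" are kept; "three" is not in d, so it is all NaN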
-.. note::
-
- When the data is a dict, and ``columns`` is not specified, the ``DataFrame``
- columns will be ordered by the dict's insertion order, if you are using
- Python version >= 3.6 and pandas >= 0.23.
-
- If you are using Python < 3.6 or pandas < 0.23, and ``columns`` is not
- specified, the ``DataFrame`` columns will be the lexically ordered list of dict
- keys.
-
From dict of Series or dicts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -333,7 +309,7 @@ From dict of ndarrays / lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ndarrays must all be the same length. If an index is passed, it must
-clearly also be the same length as the arrays. If no index is passed, the
+also be the same length as the arrays. If no index is passed, the
result will be ``range(n)``, where ``n`` is the array length.
.. ipython:: python
@@ -402,6 +378,10 @@ The result will be a DataFrame with the same index as the input Series, and
with one column whose name is the original name of the Series (only if no other
column name provided).
+.. ipython:: python
+
+ ser = pd.Series(range(3), index=list("abc"), name="ser")
+ pd.DataFrame(ser)
.. _basics.dataframe.from_list_namedtuples:
@@ -409,8 +389,8 @@ From a list of namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~
The field names of the first ``namedtuple`` in the list determine the columns
-of the ``DataFrame``. The remaining namedtuples (or tuples) are simply unpacked
-and their values are fed into the rows of the ``DataFrame``. If any of those
+of the :class:`DataFrame`. The remaining namedtuples (or tuples) are simply unpacked
+and their values are fed into the rows of the :class:`DataFrame`. If any of those
tuples is shorter than the first ``namedtuple`` then the later columns in the
corresponding row are marked as missing values. If any are longer than the
first ``namedtuple``, a ``ValueError`` is raised.
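+A minimal sketch of these rules:
+
+.. code-block:: python
+
+   from collections import namedtuple
+
+   Point = namedtuple("Point", "x y")
+   pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])
+   # columns are "x" and "y"; the plain tuple is unpacked positionally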
@@ -440,7 +420,7 @@ can be passed into the DataFrame constructor.
Passing a list of dataclasses is equivalent to passing a list of dictionaries.
Please be aware that all values in the list should be dataclasses; mixing
-types in the list would result in a TypeError.
+types in the list would result in a ``TypeError``.
.. ipython:: python
@@ -452,11 +432,10 @@ types in the list would result in a TypeError.
**Missing data**
-Much more will be said on this topic in the :ref:`Missing data `
-section. To construct a DataFrame with missing data, we use ``np.nan`` to
+To construct a DataFrame with missing data, we use ``np.nan`` to
represent missing values. Alternatively, you may pass a ``numpy.MaskedArray``
as the data argument to the DataFrame constructor, and its masked entries will
-be considered missing.
+be considered missing. See :ref:`Missing data ` for more.
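+A brief sketch, assuming a small frame with ``np.nan`` entries:
+
+.. code-block:: python
+
+   import numpy as np
+
+   pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})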
Alternate constructors
~~~~~~~~~~~~~~~~~~~~~~
@@ -465,8 +444,8 @@ Alternate constructors
**DataFrame.from_dict**
-``DataFrame.from_dict`` takes a dict of dicts or a dict of array-like sequences
-and returns a DataFrame. It operates like the ``DataFrame`` constructor except
+:meth:`DataFrame.from_dict` takes a dict of dicts or a dict of array-like sequences
+and returns a DataFrame. It operates like the :class:`DataFrame` constructor except
for the ``orient`` parameter which is ``'columns'`` by default, but which can be
set to ``'index'`` in order to use the dict keys as row labels.
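+A short sketch of the ``orient`` parameter (keys and values are illustrative):
+
+.. code-block:: python
+
+   d = {"A": [1, 2, 3], "B": [4, 5, 6]}
+   pd.DataFrame.from_dict(d)                  # keys become column labels
+   pd.DataFrame.from_dict(d, orient="index")  # keys become row labels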
@@ -490,10 +469,10 @@ case, you can also pass the desired column names:
**DataFrame.from_records**
-``DataFrame.from_records`` takes a list of tuples or an ndarray with structured
-dtype. It works analogously to the normal ``DataFrame`` constructor, except that
+:meth:`DataFrame.from_records` takes a list of tuples or an ndarray with structured
+dtype. It works analogously to the normal :class:`DataFrame` constructor, except that
the resulting DataFrame index may be a specific field of the structured
-dtype. For example:
+dtype.
.. ipython:: python
@@ -505,7 +484,7 @@ dtype. For example:
Column selection, addition, deletion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-You can treat a DataFrame semantically like a dict of like-indexed Series
+You can treat a :class:`DataFrame` semantically like a dict of like-indexed :class:`Series`
objects. Getting, setting, and deleting columns works with the same syntax as
the analogous dict operations:
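+A sketch of these operations (column names are illustrative):
+
+.. code-block:: python
+
+   df = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]})
+   df["three"] = df["one"] * df["two"]  # set a new column
+   df["flag"] = df["one"] > 1.5         # set from a boolean expression
+   del df["two"]                        # delete a column
+   three = df.pop("three")              # pop like a dict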
@@ -532,7 +511,7 @@ column:
df["foo"] = "bar"
df
-When inserting a Series that does not have the same index as the DataFrame, it
+When inserting a :class:`Series` that does not have the same index as the :class:`DataFrame`, it
will be conformed to the DataFrame's index:
.. ipython:: python
@@ -543,8 +522,8 @@ will be conformed to the DataFrame's index:
You can insert raw ndarrays but their length must match the length of the
DataFrame's index.
-By default, columns get inserted at the end. The ``insert`` function is
-available to insert at a particular location in the columns:
+By default, columns get inserted at the end. :meth:`DataFrame.insert`
+inserts at a particular location in the columns:
.. ipython:: python
@@ -575,12 +554,12 @@ a function of one argument to be evaluated on the DataFrame being assigned to.
iris.assign(sepal_ratio=lambda x: (x["SepalWidth"] / x["SepalLength"])).head()
-``assign`` **always** returns a copy of the data, leaving the original
+:meth:`~pandas.DataFrame.assign` **always** returns a copy of the data, leaving the original
DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is
useful when you don't have a reference to the DataFrame at hand. This is
-common when using ``assign`` in a chain of operations. For example,
+common when using :meth:`~pandas.DataFrame.assign` in a chain of operations. For example,
we can limit the DataFrame to just those observations with a Sepal Length
greater than 5, calculate the ratio, and plot:
@@ -602,13 +581,13 @@ to those rows with sepal length greater than 5. The filtering happens first,
and then the ratio calculations. This is an example where we didn't
have a reference to the *filtered* DataFrame available.
-The function signature for ``assign`` is simply ``**kwargs``. The keys
+The function signature for :meth:`~pandas.DataFrame.assign` is simply ``**kwargs``. The keys
are the column names for the new fields, and the values are either a value
-to be inserted (for example, a ``Series`` or NumPy array), or a function
-of one argument to be called on the ``DataFrame``. A *copy* of the original
-DataFrame is returned, with the new values inserted.
+to be inserted (for example, a :class:`Series` or NumPy array), or a function
+of one argument to be called on the :class:`DataFrame`. A *copy* of the original
+:class:`DataFrame` is returned, with the new values inserted.
-Starting with Python 3.6 the order of ``**kwargs`` is preserved. This allows
+The order of ``**kwargs`` is preserved. This allows
for *dependent* assignment, where an expression later in ``**kwargs`` can refer
to a column created earlier in the same :meth:`~DataFrame.assign`.
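+A minimal sketch of dependent assignment:
+
+.. code-block:: python
+
+   dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
+   dfa.assign(C=lambda x: x["A"] + x["B"],
+              D=lambda x: x["A"] + x["C"])
+   # "D" refers to "C", created earlier in the same assign call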
@@ -635,8 +614,8 @@ The basics of indexing are as follows:
Slice rows, ``df[5:10]``, DataFrame
Select rows by boolean vector, ``df[bool_vec]``, DataFrame
-Row selection, for example, returns a Series whose index is the columns of the
-DataFrame:
+Row selection, for example, returns a :class:`Series` whose index is the columns of the
+:class:`DataFrame`:
.. ipython:: python
@@ -653,7 +632,7 @@ fundamentals of reindexing / conforming to new sets of labels in the
Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Data alignment between DataFrame objects automatically align on **both the
+Operations between :class:`DataFrame` objects automatically align on **both the
columns and the index (row labels)**. Again, the resulting object will have the
union of the column and row labels.
@@ -663,8 +642,8 @@ union of the column and row labels.
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])
df + df2
-When doing an operation between DataFrame and Series, the default behavior is
-to align the Series **index** on the DataFrame **columns**, thus `broadcasting
+When doing an operation between :class:`DataFrame` and :class:`Series`, the default behavior is
+to align the :class:`Series` **index** on the :class:`DataFrame` **columns**, thus `broadcasting
`__
row-wise. For example:
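+A sketch of such an operation (``np`` is assumed imported as NumPy):
+
+.. code-block:: python
+
+   df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
+   df - df.iloc[0]  # the Series index ("A".."D") aligns with the columns,
+                    # so the first row is subtracted from every row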
@@ -675,7 +654,7 @@ row-wise. For example:
For explicit control over the matching and broadcasting behavior, see the
section on :ref:`flexible binary operations `.
-Operations with scalars are just as you would expect:
+Arithmetic operations with scalars operate element-wise:
.. ipython:: python
@@ -685,7 +664,7 @@ Operations with scalars are just as you would expect:
.. _dsintro.boolean:
-Boolean operators work as well:
+Boolean operators operate element-wise as well:
.. ipython:: python
@@ -699,7 +678,7 @@ Boolean operators work as well:
Transposing
~~~~~~~~~~~
-To transpose, access the ``T`` attribute (also the ``transpose`` function),
+To transpose, access the ``T`` attribute or :meth:`DataFrame.transpose`,
similar to an ndarray:
.. ipython:: python
@@ -712,23 +691,21 @@ similar to an ndarray:
DataFrame interoperability with NumPy functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions
-can be used with no issues on Series and DataFrame, assuming the data within
-are numeric:
+Most NumPy functions can be called directly on :class:`Series` and :class:`DataFrame`.
.. ipython:: python
np.exp(df)
np.asarray(df)
-DataFrame is not intended to be a drop-in replacement for ndarray as its
+:class:`DataFrame` is not intended to be a drop-in replacement for ndarray as its
indexing semantics and data model are quite different in places from an n-dimensional
array.
:class:`Series` implements ``__array_ufunc__``, which allows it to work with NumPy's
`universal functions `_.
-The ufunc is applied to the underlying array in a Series.
+The ufunc is applied to the underlying array in a :class:`Series`.
.. ipython:: python
@@ -737,7 +714,7 @@ The ufunc is applied to the underlying array in a Series.
.. versionchanged:: 0.25.0
- When multiple ``Series`` are passed to a ufunc, they are aligned before
+ When multiple :class:`Series` are passed to a ufunc, they are aligned before
performing the operation.
Like other parts of the library, pandas will automatically align labeled inputs
@@ -761,8 +738,8 @@ with missing values.
ser3
np.remainder(ser1, ser3)
-When a binary ufunc is applied to a :class:`Series` and :class:`Index`, the Series
-implementation takes precedence and a Series is returned.
+When a binary ufunc is applied to a :class:`Series` and :class:`Index`, the :class:`Series`
+implementation takes precedence and a :class:`Series` is returned.
.. ipython:: python
@@ -778,10 +755,9 @@ the ufunc is applied without converting the underlying data to an ndarray.
Console display
~~~~~~~~~~~~~~~
-Very large DataFrames will be truncated to display them in the console.
+A very large :class:`DataFrame` will be truncated when displayed in the console.
You can also get a summary using :meth:`~pandas.DataFrame.info`.
-(Here I am reading a CSV version of the **baseball** dataset from the **plyr**
-R package):
+(The **baseball** dataset is from the **plyr** R package):
.. ipython:: python
:suppress:
@@ -802,8 +778,8 @@ R package):
# restore GlobalPrintConfig
pd.reset_option(r"^display\.")
-However, using ``to_string`` will return a string representation of the
-DataFrame in tabular form, though it won't always fit the console width:
+However, using :meth:`DataFrame.to_string` will return a string representation of the
+:class:`DataFrame` in tabular form, though it won't always fit the console width:
.. ipython:: python
@@ -855,7 +831,7 @@ This will print the table in one block.
DataFrame column attribute access and IPython completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-If a DataFrame column label is a valid Python variable name, the column can be
+If a :class:`DataFrame` column label is a valid Python variable name, the column can be
accessed like an attribute:
.. ipython:: python
diff --git a/doc/source/user_guide/duplicates.rst b/doc/source/user_guide/duplicates.rst
index 36c2ec53d58b4..7894789846ce8 100644
--- a/doc/source/user_guide/duplicates.rst
+++ b/doc/source/user_guide/duplicates.rst
@@ -172,7 +172,7 @@ going forward, to ensure that your data pipeline doesn't introduce duplicates.
>>> deduplicated = raw.groupby(level=0).first() # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward
-Setting ``allows_duplicate_labels=True`` on a ``Series`` or ``DataFrame`` with duplicate
+Setting ``allows_duplicate_labels=False`` on a ``Series`` or ``DataFrame`` with duplicate
labels or performing an operation that introduces duplicate labels on a ``Series`` or
``DataFrame`` that disallows duplicates will raise an
:class:`errors.DuplicateLabelError`.
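+A minimal sketch of the error case:
+
+.. code-block:: python
+
+   s = pd.Series([0, 1], index=["a", "a"])
+   s.set_flags(allows_duplicate_labels=False)  # raises errors.DuplicateLabelError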
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index c78d972f33d65..1a1229f95523b 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -7,10 +7,10 @@ Enhancing performance
*********************
In this part of the tutorial, we will investigate how to speed up certain
-functions operating on pandas ``DataFrames`` using three different techniques:
+functions operating on pandas :class:`DataFrame` using three different techniques:
Cython, Numba and :func:`pandas.eval`. We will see a speed improvement of ~200x
when we use Cython and Numba on a test function operating row-wise on the
-``DataFrame``. Using :func:`pandas.eval` we will speed up a sum by an order of
+:class:`DataFrame`. Using :func:`pandas.eval` we will speed up a sum by a factor of
~2.
.. note::
@@ -35,7 +35,7 @@ by trying to remove for-loops and making use of NumPy vectorization. It's always
optimising in Python first.
This tutorial walks through a "typical" process of cythonizing a slow computation.
-We use an `example from the Cython documentation `__
+We use an `example from the Cython documentation `__
but in the context of pandas. Our final cythonized solution is around 100 times
faster than the pure Python solution.
@@ -44,7 +44,7 @@ faster than the pure Python solution.
Pure Python
~~~~~~~~~~~
-We have a ``DataFrame`` to which we want to apply a function row-wise.
+We have a :class:`DataFrame` to which we want to apply a function row-wise.
.. ipython:: python
@@ -73,12 +73,11 @@ Here's the function in pure Python:
s += f(a + i * dx)
return s * dx
-We achieve our result by using ``apply`` (row-wise):
+We achieve our result by using :meth:`DataFrame.apply` (row-wise):
-.. code-block:: ipython
+.. ipython:: python
- In [7]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 174 ms per loop
+ %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
But clearly this isn't fast enough for us. Let's take a look and see where the
time is spent during this operation (limited to the most time consuming
@@ -126,10 +125,9 @@ is here to distinguish between function versions):
to be using bleeding edge IPython for paste to play well with cell magics.
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 85.5 ms per loop
+ %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
Already this has shaved a third off, not too bad for a simple copy and paste.
@@ -155,10 +153,9 @@ We get another huge improvement simply by providing type information:
...: return s * dx
...:
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 20.3 ms per loop
+ %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
Now, we're talking! It's now over ten times faster than the original Python
implementation, and we haven't *really* modified the code. Let's have another
@@ -173,7 +170,7 @@ look at what's eating up time:
Using ndarray
~~~~~~~~~~~~~
-It's calling series... a lot! It's creating a Series from each row, and get-ting from both
+It's calling the :class:`Series` constructor a lot! It's creating a :class:`Series` from each row, and calling ``get`` from both
the index and the series (three times for each row). Function calls are expensive
in Python, so maybe we could minimize these by cythonizing the apply part.
@@ -216,10 +213,10 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
.. warning::
- You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
+ You can **not pass** a :class:`Series` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using
:meth:`Series.to_numpy`. The reason is that the Cython
- definition is specific to an ndarray and not the passed ``Series``.
+ definition is specific to an ndarray and not the passed :class:`Series`.
So, do not do this:
@@ -238,10 +235,9 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
Loops like this would be *extremely* slow in Python, but in Cython looping
over NumPy arrays is *fast*.
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
- 1000 loops, best of 3: 1.25 ms per loop
+ %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
We've gotten another big improvement. Let's check again where the time is spent:
@@ -267,33 +263,33 @@ advanced Cython techniques:
...: cimport cython
...: cimport numpy as np
...: import numpy as np
- ...: cdef double f_typed(double x) except? -2:
+ ...: cdef np.float64_t f_typed(np.float64_t x) except? -2:
...: return x * (x - 1)
- ...: cpdef double integrate_f_typed(double a, double b, int N):
- ...: cdef int i
- ...: cdef double s, dx
- ...: s = 0
+ ...: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
+ ...: cdef np.int64_t i
+ ...: cdef np.float64_t s = 0.0, dx
...: dx = (b - a) / N
...: for i in range(N):
...: s += f_typed(a + i * dx)
...: return s * dx
...: @cython.boundscheck(False)
...: @cython.wraparound(False)
- ...: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
- ...: np.ndarray[double] col_b,
- ...: np.ndarray[int] col_N):
- ...: cdef int i, n = len(col_N)
+ ...: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
+ ...: np.ndarray[np.float64_t] col_a,
+ ...: np.ndarray[np.float64_t] col_b,
+ ...: np.ndarray[np.int64_t] col_N
+ ...: ):
+ ...: cdef np.int64_t i, n = len(col_N)
...: assert len(col_a) == len(col_b) == n
- ...: cdef np.ndarray[double] res = np.empty(n)
+ ...: cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
...: for i in range(n):
...: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
...: return res
...:
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
- 1000 loops, best of 3: 987 us per loop
+ %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
for example) might cause a segfault because memory access isn't checked.
@@ -321,7 +317,7 @@ Numba supports compilation of Python to run on either CPU or GPU hardware and is
Numba can be used in 2 ways with pandas:
#. Specify the ``engine="numba"`` keyword in select pandas methods
-#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`Dataframe` (using ``to_numpy()``) into the function
+#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function
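+A minimal sketch of the second approach (the function and frame are illustrative):
+
+.. code-block:: python
+
+   import numba
+
+   @numba.jit(nopython=True)
+   def double_every_value(arr):
+       return arr * 2
+
+   df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
+   double_every_value(df["a"].to_numpy())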
pandas Numba Engine
~~~~~~~~~~~~~~~~~~~
@@ -354,6 +350,28 @@ a larger amount of data points (e.g. 1+ million).
In [6]: %timeit roll.apply(f, engine='cython', raw=True)
3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+If your compute hardware contains multiple CPUs, the largest performance gain can be realized by setting ``parallel`` to ``True``
+to leverage more than 1 CPU. Internally, pandas uses Numba to parallelize computations over the columns of a :class:`DataFrame`;
+therefore, this performance improvement is only beneficial for a :class:`DataFrame` with a large number of columns.
+
+.. code-block:: ipython
+
+ In [1]: import numba
+
+ In [2]: numba.set_num_threads(1)
+
+ In [3]: df = pd.DataFrame(np.random.randn(10_000, 100))
+
+ In [4]: roll = df.rolling(100)
+
+ In [5]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
+ 347 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+ In [6]: numba.set_num_threads(2)
+
+ In [7]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
+ 201 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
Custom Function Examples
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -595,8 +613,8 @@ Now let's do the same thing but with comparisons:
of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
operations in plain Python.
-The ``DataFrame.eval`` method
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The :meth:`DataFrame.eval` method
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In addition to the top level :func:`pandas.eval` function you can also
evaluate an expression in the "context" of a :class:`~pandas.DataFrame`.
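+A brief sketch (column names are illustrative):
+
+.. code-block:: python
+
+   df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
+   df.eval("a + b")  # column names resolve in the frame's own context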
@@ -630,7 +648,7 @@ new column name or an existing column name, and it must be a valid Python
identifier.
The ``inplace`` keyword determines whether this assignment will be performed
-on the original ``DataFrame`` or return a copy with the new column.
+on the original :class:`DataFrame` or return a copy with the new column.
.. ipython:: python
@@ -640,7 +658,7 @@ on the original ``DataFrame`` or return a copy with the new column.
df.eval("a = 1", inplace=True)
df
-When ``inplace`` is set to ``False``, the default, a copy of the ``DataFrame`` with the
+When ``inplace`` is set to ``False``, the default, a copy of the :class:`DataFrame` with the
new or modified columns is returned and the original frame is unchanged.
.. ipython:: python
@@ -672,7 +690,7 @@ The equivalent in standard Python would be
df["a"] = 1
df
-The ``query`` method has a ``inplace`` keyword which determines
+The :meth:`DataFrame.query` method has an ``inplace`` keyword which determines
whether the query modifies the original frame.
.. ipython:: python
@@ -814,7 +832,7 @@ computation. The two lines are two different engines.
.. image:: ../_static/eval-perf-small.png
-This plot was created using a ``DataFrame`` with 3 columns each containing
+This plot was created using a :class:`DataFrame` with 3 columns each containing
floating point values generated using ``numpy.random.randn()``.
Technical minutia regarding expression evaluation
diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst
index 1de978b195382..adb40e166eab4 100644
--- a/doc/source/user_guide/gotchas.rst
+++ b/doc/source/user_guide/gotchas.rst
@@ -10,13 +10,13 @@ Frequently Asked Questions (FAQ)
DataFrame memory usage
----------------------
-The memory usage of a ``DataFrame`` (including the index) is shown when calling
+The memory usage of a :class:`DataFrame` (including the index) is shown when calling
:meth:`~DataFrame.info`. A configuration option, ``display.memory_usage``
(see :ref:`the list of options `), specifies if the
-``DataFrame``'s memory usage will be displayed when invoking the ``df.info()``
+:class:`DataFrame` memory usage will be displayed when invoking the ``df.info()``
method.
-For example, the memory usage of the ``DataFrame`` below is shown
+For example, the memory usage of the :class:`DataFrame` below is shown
when calling :meth:`~DataFrame.info`:
.. ipython:: python
@@ -53,9 +53,9 @@ By default the display option is set to ``True`` but can be explicitly
overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
The memory usage of each column can be found by calling the
-:meth:`~DataFrame.memory_usage` method. This returns a ``Series`` with an index
+:meth:`~DataFrame.memory_usage` method. This returns a :class:`Series` with an index
represented by column names and memory usage of each column shown in bytes. For
-the ``DataFrame`` above, the memory usage of each column and the total memory
+the :class:`DataFrame` above, the memory usage of each column and the total memory
usage can be found with the ``memory_usage`` method:
.. ipython:: python
@@ -65,8 +65,8 @@ usage can be found with the ``memory_usage`` method:
# total memory usage of dataframe
df.memory_usage().sum()
-By default the memory usage of the ``DataFrame``'s index is shown in the
-returned ``Series``, the memory usage of the index can be suppressed by passing
+By default the memory usage of the :class:`DataFrame` index is shown in the
+returned :class:`Series`; the memory usage of the index can be suppressed by passing
the ``index=False`` argument:
.. ipython:: python
@@ -75,7 +75,7 @@ the ``index=False`` argument:
The memory usage displayed by the :meth:`~DataFrame.info` method utilizes the
:meth:`~DataFrame.memory_usage` method to determine the memory usage of a
-``DataFrame`` while also formatting the output in human-readable units (base-2
+:class:`DataFrame` while also formatting the output in human-readable units (base-2
representation; i.e. 1KB = 1024 bytes).
See also :ref:`Categorical Memory Usage `.
@@ -98,32 +98,28 @@ of the following code should be:
Should it be ``True`` because it's not zero-length, or ``False`` because there
are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:
-.. code-block:: python
+.. ipython:: python
+ :okexcept:
- >>> if pd.Series([False, True, False]):
- ... print("I was true")
- Traceback
- ...
- ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+ if pd.Series([False, True, False]):
+ print("I was true")
-You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
+You need to explicitly choose what you want to do with the :class:`DataFrame`, e.g.
use :meth:`~DataFrame.any`, :meth:`~DataFrame.all` or :meth:`~DataFrame.empty`.
Alternatively, you might want to compare if the pandas object is ``None``:
-.. code-block:: python
+.. ipython:: python
- >>> if pd.Series([False, True, False]) is not None:
- ... print("I was not None")
- I was not None
+ if pd.Series([False, True, False]) is not None:
+ print("I was not None")
Below is how to check if any of the values are ``True``:
-.. code-block:: python
+.. ipython:: python
- >>> if pd.Series([False, True, False]).any():
- ... print("I am any")
- I am any
+ if pd.Series([False, True, False]).any():
+ print("I am any")
To evaluate single-element pandas objects in a boolean context, use the method
:meth:`~DataFrame.bool`:
@@ -138,27 +134,21 @@ To evaluate single-element pandas objects in a boolean context, use the method
Bitwise boolean
~~~~~~~~~~~~~~~
-Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
-which is almost always what you want anyways.
+Bitwise boolean operators like ``==`` and ``!=`` return a boolean :class:`Series`
+which is the result of an element-wise comparison with a scalar.
-.. code-block:: python
+.. ipython:: python
- >>> s = pd.Series(range(5))
- >>> s == 4
- 0 False
- 1 False
- 2 False
- 3 False
- 4 True
- dtype: bool
+ s = pd.Series(range(5))
+ s == 4
See :ref:`boolean comparisons` for more examples.
Using the ``in`` operator
~~~~~~~~~~~~~~~~~~~~~~~~~
-Using the Python ``in`` operator on a ``Series`` tests for membership in the
-index, not membership among the values.
+Using the Python ``in`` operator on a :class:`Series` tests for membership in the
+**index**, not membership among the values.
.. ipython:: python
@@ -167,7 +157,7 @@ index, not membership among the values.
'b' in s
If this behavior is surprising, keep in mind that using ``in`` on a Python
-dictionary tests keys, not values, and ``Series`` are dict-like.
+dictionary tests keys, not values, and :class:`Series` are dict-like.
To test for membership in the values, use the method :meth:`~pandas.Series.isin`:
.. ipython:: python
@@ -175,7 +165,7 @@ To test for membership in the values, use the method :meth:`~pandas.Series.isin`
s.isin([2])
s.isin([2]).any()
-For ``DataFrames``, likewise, ``in`` applies to the column axis,
+For :class:`DataFrame`, likewise, ``in`` applies to the column axis,
testing for membership in the list of column names.
.. _gotchas.udf-mutation:
@@ -206,8 +196,8 @@ causing unexpected behavior. Consider the example:
One probably would have expected that the result would be ``[1, 3, 5]``.
When using a pandas method that takes a UDF, internally pandas is often
iterating over the
-``DataFrame`` or other pandas object. Therefore, if the UDF mutates (changes)
-the ``DataFrame``, unexpected behavior can arise.
+:class:`DataFrame` or other pandas object. Therefore, if the UDF mutates (changes)
+the :class:`DataFrame`, unexpected behavior can arise.
Here is a similar example with :meth:`DataFrame.apply`:
@@ -267,7 +257,7 @@ For many reasons we chose the latter. After years of production use it has
proven, at least in my opinion, to be the best decision given the state of
affairs in NumPy and Python in general. The special value ``NaN``
(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
-functions ``isna`` and ``notna`` which can be used across the dtypes to
+functions :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
detect NA values.
However, it comes with it a couple of trade-offs which I most certainly have
@@ -293,7 +283,7 @@ arrays. For example:
s2.dtype
This trade-off is made largely for memory and performance reasons, and also so
-that the resulting ``Series`` continues to be "numeric".
+that the resulting :class:`Series` continues to be "numeric".
If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas
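+For instance, a short sketch of one such dtype:
+
+.. code-block:: python
+
+   pd.Series([1, 2, None], dtype="Int64")  # NA is preserved, dtype stays integer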
@@ -318,7 +308,7 @@ See :ref:`integer_na` for more.
``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
-When introducing NAs into an existing ``Series`` or ``DataFrame`` via
+When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
:meth:`~Series.reindex` or some other means, boolean and integer types will be
promoted to a different dtype in order to store the NAs. The promotions are
summarized in this table:
@@ -341,7 +331,7 @@ Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language `R
-`__. Part of the reason is the NumPy type hierarchy:
+`__. Part of the reason is the NumPy type hierarchy:
.. csv-table::
:header: "Typeclass","Dtypes"
@@ -376,18 +366,19 @@ integer arrays to floating when NAs must be introduced.
Differences with NumPy
----------------------
-For ``Series`` and ``DataFrame`` objects, :meth:`~DataFrame.var` normalizes by
-``N-1`` to produce unbiased estimates of the sample variance, while NumPy's
-``var`` normalizes by N, which measures the variance of the sample. Note that
+For :class:`Series` and :class:`DataFrame` objects, :meth:`~DataFrame.var` normalizes by
+``N-1`` to produce `unbiased estimates of the population variance `__, while NumPy's
+:func:`numpy.var` normalizes by ``N``, which measures the variance of the sample. Note that
:meth:`~DataFrame.cov` normalizes by ``N-1`` in both pandas and NumPy.
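+A minimal sketch of the difference:
+
+.. code-block:: python
+
+   s = pd.Series([1.0, 2.0, 3.0])
+   s.var()               # 1.0, normalized by N-1
+   np.var(s.to_numpy())  # ~0.667, normalized by N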
+.. _gotchas.thread-safety:
Thread-safety
-------------
-As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
+pandas is not 100% thread safe. The known issues relate to
the :meth:`~DataFrame.copy` method. If you are doing a lot of copying of
-``DataFrame`` objects shared among threads, we recommend holding locks inside
+:class:`DataFrame` objects shared among threads, we recommend holding locks inside
the threads where the data copying occurs.
See `this link `__
@@ -406,7 +397,7 @@ symptom of this issue is an error like::
To deal
with this issue you should convert the underlying NumPy array to the native
-system byte order *before* passing it to ``Series`` or ``DataFrame``
+system byte order *before* passing it to :class:`Series` or :class:`DataFrame`
constructors using something similar to the following:
.. ipython:: python
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index 0fb59c50efa74..5d8ef7ce02097 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -477,7 +477,7 @@ An obvious one is aggregation via the
.. ipython:: python
grouped = df.groupby("A")
- grouped.aggregate(np.sum)
+ grouped[["C", "D"]].aggregate(np.sum)
grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
@@ -492,7 +492,7 @@ changed by using the ``as_index`` option:
grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)
- df.groupby("A", as_index=False).sum()
+ df.groupby("A", as_index=False)[["C", "D"]].sum()
Note that you could use the ``reset_index`` DataFrame function to achieve the
same result as the column names are stored in the resulting ``MultiIndex``:
@@ -539,19 +539,19 @@ Some common aggregating functions are tabulated below:
:widths: 20, 80
:delim: ;
- :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes
- :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group
- :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics
- :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
- :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes
+ :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group
+ :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics
+ :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
+ :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
The aggregating functions above will exclude NA values. Any function which
@@ -730,7 +730,7 @@ optimized Cython implementations:
.. ipython:: python
- df.groupby("A").sum()
+ df.groupby("A")[["C", "D"]].sum()
df.groupby(["A", "B"]).mean()
Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above
@@ -761,7 +761,7 @@ different dtypes, then a common dtype will be determined in the same way as ``Da
Transformation
--------------
-The ``transform`` method returns an object that is indexed the same (same size)
+The ``transform`` method returns an object that is indexed the same
as the one being grouped. The transform function must:
* Return a result that is either the same size as the group chunk or
@@ -776,6 +776,14 @@ as the one being grouped. The transform function must:
* (Optionally) operates on the entire group chunk. If this is supported, a
fast path is used starting from the *second* chunk.
+.. deprecated:: 1.5.0
+
+ When using ``.transform`` on a grouped DataFrame and the transformation function
+ returns a DataFrame, currently pandas does not align the result's index
+ with the input's index. This behavior is deprecated and alignment will
+ be performed in a future version of pandas. You can apply ``.to_numpy()`` to the
+ result of the transformation function to avoid alignment.
+
Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
transformation function. If the results from different groups have different dtypes, then
a common dtype will be determined in the same way as ``DataFrame`` construction.
@@ -831,10 +839,10 @@ Alternatively, the built-in methods could be used to produce the same outputs.
.. ipython:: python
- max = ts.groupby(lambda x: x.year).transform("max")
- min = ts.groupby(lambda x: x.year).transform("min")
+ max_ts = ts.groupby(lambda x: x.year).transform("max")
+ min_ts = ts.groupby(lambda x: x.year).transform("min")
- max - min
+ max_ts - min_ts
Another common data transform is to replace missing data with the group mean.
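+A sketch of this pattern (``ts`` is assumed from the surrounding example):
+
+.. code-block:: python
+
+   ts.groupby(lambda x: x.year).transform(lambda x: x.fillna(x.mean()))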
@@ -1052,7 +1060,14 @@ Some operations on the grouped data might not fit into either the aggregate or
transform categories. Or, you may simply want GroupBy to infer how to combine
the results. For these, use the ``apply`` function, which can be substituted
for both ``aggregate`` and ``transform`` in many standard use cases. However,
-``apply`` can handle some exceptional use cases, for example:
+``apply`` can handle some exceptional use cases.
+
+.. note::
+
+ ``apply`` can act as a reducer, transformer, *or* filter function, depending
+ on exactly what is passed to it and on exactly what you are grouping. Thus
+ the grouped column(s) may be included in the output and may set the indices.
.. ipython:: python
@@ -1064,16 +1079,14 @@ for both ``aggregate`` and ``transform`` in many standard use cases. However,
The dimension of the returned result can also change:
-.. ipython::
-
- In [8]: grouped = df.groupby('A')['C']
+.. ipython:: python
- In [10]: def f(group):
- ....: return pd.DataFrame({'original': group,
- ....: 'demeaned': group - group.mean()})
- ....:
+ grouped = df.groupby('A')['C']
- In [11]: grouped.apply(f)
+ def f(group):
+ return pd.DataFrame({'original': group,
+ 'demeaned': group - group.mean()})
+ grouped.apply(f)
``apply`` on a Series can operate on a returned value from the applied function,
that is itself a series, and possibly upcast the result to a DataFrame:
@@ -1088,11 +1101,33 @@ that is itself a series, and possibly upcast the result to a DataFrame:
s
s.apply(f)
+Control grouped column(s) placement with ``group_keys``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
.. note::
- ``apply`` can act as a reducer, transformer, *or* filter function, depending on exactly what is passed to it.
- So depending on the path taken, and exactly what you are grouping. Thus the grouped columns(s) may be included in
- the output as well as set the indices.
+ If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`,
+ functions passed to ``apply`` that return like-indexed outputs will have the
+ group keys added to the result index. Previous versions of pandas would add
+ the group keys only when the result from the applied function had a different
+ index than the input. If ``group_keys`` is not specified, the group keys will
+ not be added for like-indexed outputs. In the future this behavior
+ will change to always respect ``group_keys``, which defaults to ``True``.
+
+ .. versionchanged:: 1.5.0
+
+To control whether the grouped column(s) are included in the indices, you can use
+the argument ``group_keys``. Compare
+
+.. ipython:: python
+
+ df.groupby("A", group_keys=True).apply(lambda x: x)
+
+with
+
+.. ipython:: python
+
+ df.groupby("A", group_keys=False).apply(lambda x: x)
Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
apply function. If the results from different groups have different dtypes, then
@@ -1132,13 +1167,12 @@ Again consider the example DataFrame we've been looking at:
Suppose we wish to compute the standard deviation grouped by the ``A``
column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. If the passed
-aggregation function can't be applied to some columns, the troublesome columns
-will be (silently) dropped. Thus, this does not pose any problems:
+column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
+columns by specifying ``numeric_only=True``:
.. ipython:: python
- df.groupby("A").std()
+ df.groupby("A").std(numeric_only=True)
Note that ``df.groupby('A').colname.std()`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
@@ -1153,7 +1187,14 @@ is only interesting over one column (here ``colname``), it may be filtered
If you do wish to include decimal or object columns in an aggregation with
other non-nuisance data types, you must do so explicitly.
+.. warning::
+ The automatic dropping of nuisance columns has been deprecated and will be removed
+ in a future version of pandas. If columns are included that cannot be operated
+ on, pandas will instead raise an error. In order to avoid this, either select
+ the columns you wish to operate on or specify ``numeric_only=True``.
+
.. ipython:: python
+ :okwarning:
from decimal import Decimal
@@ -1277,7 +1318,7 @@ Groupby a specific column with the desired frequency. This is like resampling.
.. ipython:: python
- df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()
You have an ambiguous specification in that you have a named index and a column
that could be potential groupers.
@@ -1286,9 +1327,9 @@ that could be potential groupers.
df = df.set_index("Date")
df["Date"] = df.index + pd.offsets.MonthEnd(2)
- df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()
- df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()
Taking the first rows of each group
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index 6b6e212cde635..a6392706eb7a3 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -17,6 +17,43 @@ For a high level summary of the pandas fundamentals, see :ref:`dsintro` and :ref
Further information on any specific method can be obtained in the
:ref:`api`.
+How to read these guides
+------------------------
+In these guides you will see input code inside code blocks such as:
+
+::
+
+ import pandas as pd
+ pd.DataFrame({'A': [1, 2, 3]})
+
+
+or:
+
+.. ipython:: python
+
+ import pandas as pd
+ pd.DataFrame({'A': [1, 2, 3]})
+
+The first block is standard Python input, while in the second the ``In [1]:`` prompt indicates the input is inside a `notebook `__. In Jupyter Notebooks, the last line is printed and plots are shown inline.
+
+For example:
+
+.. ipython:: python
+
+ a = 1
+ a
+is equivalent to:
+
+::
+
+ a = 1
+ print(a)
+
+
+
+Guides
+-------
+
.. If you update this toctree, also update the manual toctree in the
main index.rst.template
@@ -39,7 +76,6 @@ Further information on any specific method can be obtained in the
boolean
visualization
style
- computation
groupby
window
timeseries
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index e41f938170417..f939945fc6cda 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -89,7 +89,7 @@ Getting values from an object with multi-axes selection uses the following
notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as
well). Any of the axes accessors may be the null slice ``:``. Axes left out of
the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to
-``p.loc['a', :, :]``.
+``p.loc['a', :]``.
.. csv-table::
:header: "Object Type", "Indexers"
@@ -583,7 +583,7 @@ without using a temporary variable.
.. ipython:: python
bb = pd.read_csv('data/baseball.csv', index_col='id')
- (bb.groupby(['year', 'team']).sum()
+ (bb.groupby(['year', 'team']).sum(numeric_only=True)
.loc[lambda df: df['r'] > 100])
@@ -1885,7 +1885,7 @@ chained indexing expression, you can set the :ref:`option `
``mode.chained_assignment`` to one of these values:
* ``'warn'``, the default, means a ``SettingWithCopyWarning`` is printed.
-* ``'raise'`` means pandas will raise a ``SettingWithCopyException``
+* ``'raise'`` means pandas will raise a ``SettingWithCopyError``
you have to deal with.
* ``None`` will suppress the warnings entirely.
@@ -1953,7 +1953,7 @@ Last, the subsequent example will **not** work at all, and so should be avoided:
>>> dfd.loc[0]['a'] = 1111
Traceback (most recent call last)
...
- SettingWithCopyException:
+ SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
diff --git a/doc/source/user_guide/integer_na.rst b/doc/source/user_guide/integer_na.rst
index 2ce8bf23de824..fe732daccb649 100644
--- a/doc/source/user_guide/integer_na.rst
+++ b/doc/source/user_guide/integer_na.rst
@@ -29,7 +29,7 @@ Construction
------------
pandas can represent integer data with possibly missing values using
-:class:`arrays.IntegerArray`. This is an :ref:`extension types `
+:class:`arrays.IntegerArray`. This is an :ref:`extension type `
implemented within pandas.
.. ipython:: python
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index 9faef9b15bfb4..7a7e518e1f7db 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -26,11 +26,11 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
text;`XML `__;:ref:`read_xml`;:ref:`to_xml`
text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard`
binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel`
- binary;`OpenDocument `__;:ref:`read_excel`;
+ binary;`OpenDocument `__;:ref:`read_excel`;
binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf`
binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather`
binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet`
- binary;`ORC Format `__;:ref:`read_orc`;
+ binary;`ORC Format `__;:ref:`read_orc`;:ref:`to_orc`
binary;`Stata `__;:ref:`read_stata`;:ref:`to_stata`
binary;`SAS `__;:ref:`read_sas`;
binary;`SPSS `__;:ref:`read_spss`;
@@ -107,9 +107,10 @@ index_col : int, str, sequence of int / str, or False, optional, default ``None`
string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
- Note: ``index_col=False`` can be used to force pandas to *not* use the first
- column as the index, e.g. when you have a malformed file with delimiters at
- the end of each line.
+ .. note::
+ ``index_col=False`` can be used to force pandas to *not* use the first
+ column as the index, e.g. when you have a malformed file with delimiters at
+ the end of each line.
The default value of ``None`` instructs pandas to guess. If the number of
fields in the column header row is equal to the number of fields in the body
@@ -178,14 +179,24 @@ mangle_dupe_cols : boolean, default ``True``
Passing in ``False`` will cause data to be overwritten if there are duplicate
names in the columns.
+ .. deprecated:: 1.5.0
+ The argument was never implemented, and a new argument where the
+ renaming pattern can be specified will be added instead.
+
General parsing configuration
+++++++++++++++++++++++++++++
dtype : Type name or dict of column -> type, default ``None``
- Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
- (unsupported with ``engine='python'``). Use ``str`` or ``object`` together
- with suitable ``na_values`` settings to preserve and
- not interpret dtype.
+ Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}``
+ Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve
+ and not interpret dtype. If converters are specified, they will be applied INSTEAD
+ of dtype conversion.
+
+ .. versionadded:: 1.5.0
+
+ Support for defaultdict was added. Specify a defaultdict as input where
+ the default determines the dtype of the columns which are not explicitly
+ listed.
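+
+   A hedged sketch of the defaultdict input (column names are illustrative):
+
+   .. code-block:: python
+
+      from collections import defaultdict
+      from io import StringIO
+
+      dtype = defaultdict(lambda: "float64", a="int64")
+      pd.read_csv(StringIO("a,b,c\n1,2.5,3.5"), dtype=dtype)
+      # "a" is read as int64; unlisted columns fall back to float64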
engine : {``'c'``, ``'python'``, ``'pyarrow'``}
Parser engine to use. The C and pyarrow engines are faster, while the python engine
is currently more feature-complete. Multithreading is currently only supported by
@@ -278,7 +289,9 @@ parse_dates : boolean or list of ints or names or list of lists or dict, default
* If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
column.
* If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
- A fast-path exists for iso8601-formatted dates.
+
+ .. note::
+ A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default ``False``
If ``True`` and parse_dates is enabled for a column, attempt to infer the
datetime format to speed up the processing.
@@ -549,7 +562,8 @@ This matches the behavior of :meth:`Categorical.set_categories`.
df = pd.read_csv(StringIO(data), dtype="category")
df.dtypes
df["col3"]
- df["col3"].cat.categories = pd.to_numeric(df["col3"].cat.categories)
+ new_categories = pd.to_numeric(df["col3"].cat.categories)
+ df["col3"] = df["col3"].cat.rename_categories(new_categories)
df["col3"]
@@ -601,6 +615,10 @@ If the header is in a row other than the first, pass the row number to
Duplicate names parsing
'''''''''''''''''''''''
+ .. deprecated:: 1.5.0
+ ``mangle_dupe_cols`` was never implemented, and a new argument where the
+ renaming pattern can be specified will be added instead.
+
If the file or header contains duplicate names, pandas will by default
distinguish between them so as to prevent overwriting data:
@@ -611,27 +629,7 @@ distinguish between them so as to prevent overwriting data:
There is no more duplicate data because ``mangle_dupe_cols=True`` by default,
which modifies a series of duplicate columns 'X', ..., 'X' to become
-'X', 'X.1', ..., 'X.N'. If ``mangle_dupe_cols=False``, duplicate data can
-arise:
-
-.. code-block:: ipython
-
- In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
- In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
- Out[3]:
- a b a
- 0 2 1 2
- 1 5 4 5
-
-To prevent users from encountering this problem with duplicate data, a ``ValueError``
-exception is raised if ``mangle_dupe_cols != True``:
-
-.. code-block:: ipython
-
- In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
- In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
- ...
- ValueError: Setting mangle_dupe_cols=False is not supported yet
+'X', 'X.1', ..., 'X.N'.
.. _io.usecols:
@@ -837,13 +835,9 @@ input text data into ``datetime`` objects.
The simplest case is to just pass in ``parse_dates=True``:
.. ipython:: python
- :suppress:
- f = open("foo.csv", "w")
- f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
- f.close()
-
-.. ipython:: python
+ with open("foo.csv", mode="w") as f:
+ f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
# Use a column as an index, and parse it as dates.
df = pd.read_csv("foo.csv", index_col=0, parse_dates=True)
@@ -862,7 +856,6 @@ order) and the new column names will be the concatenation of the component
column names:
.. ipython:: python
- :suppress:
data = (
"KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
@@ -876,9 +869,6 @@ column names:
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
-
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]])
df
@@ -1058,19 +1048,20 @@ While US date formats tend to be MM/DD/YYYY, many international formats use
DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided:
.. ipython:: python
- :suppress:
data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c"
+ print(data)
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
-
- print(open("tmp.csv").read())
-
pd.read_csv("tmp.csv", parse_dates=[0])
pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0])
+.. ipython:: python
+ :suppress:
+
+ os.remove("tmp.csv")
+
Writing CSVs to binary file objects
+++++++++++++++++++++++++++++++++++
@@ -1133,8 +1124,9 @@ For large numbers that have been written with a thousands separator, you can
set the ``thousands`` keyword to a string of length 1 so that integers will be parsed
correctly:
+By default, numbers with a thousands separator will be parsed as strings:
+
.. ipython:: python
- :suppress:
data = (
"ID|level|category\n"
@@ -1146,11 +1138,6 @@ correctly:
with open("tmp.csv", "w") as fh:
fh.write(data)
-By default, numbers with a thousands separator will be parsed as strings:
-
-.. ipython:: python
-
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", sep="|")
df
@@ -1160,7 +1147,6 @@ The ``thousands`` keyword allows integers to be parsed correctly:
.. ipython:: python
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", sep="|", thousands=",")
df
@@ -1239,16 +1225,13 @@ as a ``Series``:
``read_csv`` instead.
.. ipython:: python
- :suppress:
+ :okwarning:
data = "level\nPatient1,123000\nPatient2,23000\nPatient3,1234018"
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
- :okwarning:
-
print(open("tmp.csv").read())
output = pd.read_csv("tmp.csv", squeeze=True)
@@ -1305,14 +1288,38 @@ You can elect to skip bad lines:
0 1 2 3
1 8 9 10
+Or pass a callable function to handle the bad line if ``engine="python"``.
+The bad line will be a list of strings, split by the ``sep``:
+
+.. code-block:: ipython
+
+ In [29]: external_list = []
+
+ In [30]: def bad_lines_func(line):
+ ...: external_list.append(line)
+ ...: return line[-3:]
+
+ In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
+ Out[31]:
+ a b c
+ 0 1 2 3
+ 1 5 6 7
+ 2 8 9 10
+
+ In [32]: external_list
+ Out[32]: [['4', '5', '6', '7']]
+
+ .. versionadded:: 1.4.0
+
+
You can also use the ``usecols`` parameter to eliminate extraneous column
data that appear in some lines but not others:
.. code-block:: ipython
- In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+ In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
- Out[30]:
+ Out[33]:
a b c
0 1 2 3
1 4 5 6
@@ -1324,9 +1331,9 @@ fields are filled with ``NaN``.
.. code-block:: ipython
- In [31]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
+ In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
- Out[31]:
+ Out[34]:
a b c d
0 1 2 3 NaN
1 4 5 6 7
@@ -1341,15 +1348,11 @@ The ``dialect`` keyword gives greater flexibility in specifying the file format.
By default it uses the Excel dialect but you can specify either the dialect name
or a :class:`python:csv.Dialect` instance.
-.. ipython:: python
- :suppress:
-
- data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
-
Suppose you had data with unenclosed quotes:
.. ipython:: python
+ data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
print(data)
By default, ``read_csv`` uses the Excel dialect and treats the double quote as
@@ -1425,10 +1428,10 @@ a different usage of the ``delimiter`` parameter:
Can be used to specify the filler character of the fields
if it is not spaces (e.g., '~').
+Consider a typical fixed-width data file:
+
.. ipython:: python
- :suppress:
- f = open("bar.csv", "w")
data1 = (
"id8141 360.242940 149.910199 11950.7\n"
"id1594 444.953632 166.985655 11788.4\n"
@@ -1436,14 +1439,8 @@ a different usage of the ``delimiter`` parameter:
"id1230 413.836124 184.375703 11916.8\n"
"id1948 502.953953 173.237159 12468.3"
)
- f.write(data1)
- f.close()
-
-Consider a typical fixed-width data file:
-
-.. ipython:: python
-
- print(open("bar.csv").read())
+ with open("bar.csv", "w") as f:
+     f.write(data1)
In order to parse this file into a ``DataFrame``, we simply need to supply the
column specifications to the ``read_fwf`` function along with the file name:
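+A minimal sketch (the half-open column intervals below are inferred from the
+fixed-width data written above):
+
+.. code-block:: python
+
+   # (start, end) character positions of the four fields
+   colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
+   df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0)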
@@ -1499,19 +1496,15 @@ Indexes
Files with an "implicit" index column
+++++++++++++++++++++++++++++++++++++
-.. ipython:: python
- :suppress:
-
- f = open("foo.csv", "w")
- f.write("A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
- f.close()
-
Consider a file with one less entry in the header than the number of data
column:
.. ipython:: python
- print(open("foo.csv").read())
+ data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5"
+ print(data)
+ with open("foo.csv", "w") as f:
+     f.write(data)
In this special case, ``read_csv`` assumes that the first column is to be used
as the index of the ``DataFrame``:
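+A minimal sketch of the resulting read (using the ``foo.csv`` written above):
+
+.. code-block:: python
+
+   # the extra first field of each row becomes the index
+   pd.read_csv("foo.csv")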
@@ -1543,7 +1536,10 @@ Suppose you have data indexed by two columns:
.. ipython:: python
- print(open("data/mindex_ex.csv").read())
+ data = 'year,indiv,zit,xit\n1977,"A",1.2,.6\n1977,"B",1.5,.5'
+ print(data)
+ with open("mindex_ex.csv", mode="w") as f:
+     f.write(data)
The ``index_col`` argument to ``read_csv`` can take a list of
column numbers to turn multiple columns into a ``MultiIndex`` for the index of the
@@ -1551,9 +1547,14 @@ returned object:
.. ipython:: python
- df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
+ df = pd.read_csv("mindex_ex.csv", index_col=[0, 1])
df
- df.loc[1978]
+ df.loc[1977]
+
+.. ipython:: python
+ :suppress:
+
+ os.remove("mindex_ex.csv")
.. _io.multi_index_columns:
@@ -1577,20 +1578,18 @@ rows will skip the intervening rows.
of multi-columns indices.
.. ipython:: python
- :suppress:
data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12"
- fh = open("mi2.csv", "w")
- fh.write(data)
- fh.close()
-
-.. ipython:: python
+ print(data)
+ with open("mi2.csv", "w") as fh:
+     fh.write(data)
- print(open("mi2.csv").read())
pd.read_csv("mi2.csv", header=[0, 1], index_col=0)
-Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
-with ``df.to_csv(..., index=False)``, then any ``names`` on the columns index will be *lost*.
+.. note::
+   If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
+   with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will
+ be *lost*.
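+A minimal sketch of that behaviour (the frame and its columns name are illustrative):
+
+.. code-block:: python
+
+   from io import StringIO
+
+   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+   df.columns.name = "letters"   # a name on the columns index
+   csv = df.to_csv(index=False)  # no index column written
+   # the columns name does not survive the round trip
+   pd.read_csv(StringIO(csv)).columns.name is None  # True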
.. ipython:: python
:suppress:
@@ -1608,16 +1607,16 @@ comma-separated) files, as pandas uses the :class:`python:csv.Sniffer`
class of the csv module. For this, you have to specify ``sep=None``.
.. ipython:: python
- :suppress:
df = pd.DataFrame(np.random.randn(10, 4))
- df.to_csv("tmp.sv", sep="|")
- df.to_csv("tmp2.sv", sep=":")
+ df.to_csv("tmp.csv", sep="|")
+ df.to_csv("tmp2.csv", sep=":")
+ pd.read_csv("tmp2.csv", sep=None, engine="python")
.. ipython:: python
+ :suppress:
- print(open("tmp2.sv").read())
- pd.read_csv("tmp2.sv", sep=None, engine="python")
+ os.remove("tmp2.csv")
.. _io.multiple_files:
@@ -1638,8 +1637,9 @@ rather than reading the entire file into memory, such as the following:
.. ipython:: python
- print(open("tmp.sv").read())
- table = pd.read_csv("tmp.sv", sep="|")
+ df = pd.DataFrame(np.random.randn(10, 4))
+ df.to_csv("tmp.csv", sep="|")
+ table = pd.read_csv("tmp.csv", sep="|")
table
@@ -1648,7 +1648,7 @@ value will be an iterable object of type ``TextFileReader``:
.. ipython:: python
- with pd.read_csv("tmp.sv", sep="|", chunksize=4) as reader:
+ with pd.read_csv("tmp.csv", sep="|", chunksize=4) as reader:
    reader
    for chunk in reader:
        print(chunk)
@@ -1661,14 +1661,13 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
.. ipython:: python
- with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
+ with pd.read_csv("tmp.csv", sep="|", iterator=True) as reader:
    reader.get_chunk(5)
.. ipython:: python
:suppress:
- os.remove("tmp.sv")
- os.remove("tmp2.sv")
+ os.remove("tmp.csv")
Specifying the parser engine
''''''''''''''''''''''''''''
@@ -1827,7 +1826,7 @@ function takes a number of arguments. Only the first is required.
* ``mode`` : Python write mode, default 'w'
* ``encoding``: a string representing the encoding to use if the contents are
non-ASCII, for Python versions prior to 3
-* ``line_terminator``: Character sequence denoting line end (default ``os.linesep``)
+* ``lineterminator``: Character sequence denoting line end (default ``os.linesep``)
* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a ``float_format`` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
* ``quotechar``: Character used to quote fields (default '"')
* ``doublequote``: Control quoting of ``quotechar`` in fields (default True)
@@ -2555,42 +2554,66 @@ Let's look at a few examples.
Read a URL with no options:
-.. ipython:: python
+.. code-block:: ipython
- url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
- dfs = pd.read_html(url)
- dfs
+   In [320]: url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
+
+   In [321]: pd.read_html(url)
+ Out[321]:
+ [ Bank NameBank CityCity StateSt ... Acquiring InstitutionAI Closing DateClosing FundFund
+ 0 Almena State Bank Almena KS ... Equity Bank October 23, 2020 10538
+ 1 First City Bank of Florida Fort Walton Beach FL ... United Fidelity Bank, fsb October 16, 2020 10537
+ 2 The First State Bank Barboursville WV ... MVB Bank, Inc. April 3, 2020 10536
+ 3 Ericson State Bank Ericson NE ... Farmers and Merchants Bank February 14, 2020 10535
+ 4 City National Bank of New Jersey Newark NJ ... Industrial Bank November 1, 2019 10534
+ .. ... ... ... ... ... ... ...
+ 558 Superior Bank, FSB Hinsdale IL ... Superior Federal, FSB July 27, 2001 6004
+ 559 Malta National Bank Malta OH ... North Valley Bank May 3, 2001 4648
+ 560 First Alliance Bank & Trust Co. Manchester NH ... Southern New Hampshire Bank & Trust February 2, 2001 4647
+ 561 National State Bank of Metropolis Metropolis IL ... Banterra Bank of Marion December 14, 2000 4646
+ 562 Bank of Honolulu Honolulu HI ... Bank of the Orient October 13, 2000 4645
+
+ [563 rows x 7 columns]]
.. note::
- The data from the above URL changes every Monday so the resulting data above
- and the data below may be slightly different.
+ The data from the above URL changes every Monday so the resulting data above may be slightly different.
Read in the content of the file from the above URL and pass it to ``read_html``
as a string:
.. ipython:: python
- :suppress:
- rel_path = os.path.join("..", "pandas", "tests", "io", "data", "html",
- "banklist.html")
- file_path = os.path.abspath(rel_path)
+   html_str = """
+            <table>
+                <tr>
+                    <th>A</th>
+                    <th>B</th>
+                    <th>C</th>
+                </tr>
+                <tr>
+                    <td>a</td>
+                    <td>b</td>
+                    <td>c</td>
+                </tr>
+            </table>
+        """
+
+ with open("tmp.html", "w") as f:
+     f.write(html_str)
+ df = pd.read_html("tmp.html")
+ df[0]
.. ipython:: python
+ :suppress:
- with open(file_path, "r") as f:
- dfs = pd.read_html(f.read())
- dfs
+ os.remove("tmp.html")
You can even pass in an instance of ``StringIO`` if you so desire:
.. ipython:: python
- with open(file_path, "r") as f:
- sio = StringIO(f.read())
-
- dfs = pd.read_html(sio)
- dfs
+ dfs = pd.read_html(StringIO(html_str))
+ dfs[0]
.. note::
@@ -2598,7 +2621,7 @@ You can even pass in an instance of ``StringIO`` if you so desire:
that having so many network-accessing functions slows down the documentation
build. If you spot an error or an example that doesn't run, please do not
hesitate to report it over on `pandas GitHub issues page
- `__.
+ `__.
Read a URL and match a table that contains specific text:
@@ -2708,6 +2731,30 @@ succeeds, the function will return*.
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])
+Links can be extracted from cells along with the text using ``extract_links="all"``.
+
+.. ipython:: python
+
+   html_table = """
+   <table>
+     <tr>
+       <th>GitHub</th>
+     </tr>
+     <tr>
+       <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
+     </tr>
+   </table>
+   """
+
+   df = pd.read_html(
+       html_table,
+       extract_links="all"
+   )[0]
+ df
+ df[("GitHub", None)]
+ df[("GitHub", None)].str[1]
+
+.. versionadded:: 1.5.0
.. _io.html:
@@ -2724,77 +2771,48 @@ in the method ``to_string`` described above.
brevity's sake. See :func:`~pandas.core.frame.DataFrame.to_html` for the
full set of options.
-.. ipython:: python
- :suppress:
+.. note::
- def write_html(df, filename, *args, **kwargs):
- static = os.path.abspath(os.path.join("source", "_static"))
- with open(os.path.join(static, filename + ".html"), "w") as f:
- df.to_html(f, *args, **kwargs)
+   In an environment that supports HTML rendering, like a Jupyter Notebook,
+   ``display(HTML(...))`` will render the raw HTML into the environment.
.. ipython:: python
+ from IPython.display import display, HTML
+
df = pd.DataFrame(np.random.randn(2, 2))
df
- print(df.to_html()) # raw html
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "basic")
-
-HTML:
-
-.. raw:: html
- :file: ../_static/basic.html
+ html = df.to_html()
+ print(html) # raw html
+ display(HTML(html))
The ``columns`` argument will limit the columns shown:
.. ipython:: python
- print(df.to_html(columns=[0]))
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "columns", columns=[0])
-
-HTML:
-
-.. raw:: html
- :file: ../_static/columns.html
+ html = df.to_html(columns=[0])
+ print(html)
+ display(HTML(html))
``float_format`` takes a Python callable to control the precision of floating
point values:
.. ipython:: python
- print(df.to_html(float_format="{0:.10f}".format))
+ html = df.to_html(float_format="{0:.10f}".format)
+ print(html)
+ display(HTML(html))
-.. ipython:: python
- :suppress:
-
- write_html(df, "float_format", float_format="{0:.10f}".format)
-
-HTML:
-
-.. raw:: html
- :file: ../_static/float_format.html
``bold_rows`` will make the row labels bold by default, but you can turn that
off:
.. ipython:: python
- print(df.to_html(bold_rows=False))
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "nobold", bold_rows=False)
+ html = df.to_html(bold_rows=False)
+ print(html)
+ display(HTML(html))
-.. raw:: html
- :file: ../_static/nobold.html
The ``classes`` argument provides the ability to give the resulting HTML
table CSS classes. Note that these classes are *appended* to the existing
@@ -2815,17 +2833,9 @@ that contain URLs.
"url": ["https://www.python.org/", "https://pandas.pydata.org/"],
}
)
- print(url_df.to_html(render_links=True))
-
-.. ipython:: python
- :suppress:
-
- write_html(url_df, "render_links", render_links=True)
-
-HTML:
-
-.. raw:: html
- :file: ../_static/render_links.html
+ html = url_df.to_html(render_links=True)
+ print(html)
+ display(HTML(html))
Finally, the ``escape`` argument allows you to control whether the
"<", ">" and "&" characters escaped in the resulting HTML (by default it is
@@ -2835,30 +2845,21 @@ Finally, the ``escape`` argument allows you to control whether the
df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "escape")
- write_html(df, "noescape", escape=False)
-
Escaped:
.. ipython:: python
- print(df.to_html())
-
-.. raw:: html
- :file: ../_static/escape.html
+ html = df.to_html()
+ print(html)
+ display(HTML(html))
Not escaped:
.. ipython:: python
- print(df.to_html(escape=False))
-
-.. raw:: html
- :file: ../_static/noescape.html
+ html = df.to_html(escape=False)
+ print(html)
+ display(HTML(html))
.. note::
@@ -3038,13 +3039,10 @@ Read in the content of the "books.xml" file and pass it to ``read_xml``
as a string:
.. ipython:: python
- :suppress:
- rel_path = os.path.join("..", "pandas", "tests", "io", "data", "xml",
- "books.xml")
- file_path = os.path.abspath(rel_path)
-
-.. ipython:: python
+ file_path = "books.xml"
+ with open(file_path, "w") as f:
+     f.write(xml)
with open(file_path, "r") as f:
    df = pd.read_xml(f.read())
@@ -3069,15 +3067,15 @@ Read in the content of the "books.xml" as instance of ``StringIO`` or
df = pd.read_xml(bio)
df
-Even read XML from AWS S3 buckets such as Python Software Foundation's IRS 990 Form:
+Even read XML from AWS S3 buckets such as the NIH NCBI PMC Article Datasets providing
+Biomedical and Life Science Journals:
.. ipython:: python
:okwarning:
df = pd.read_xml(
- "s3://irs-form-990/201923199349319487_public.xml",
- xpath=".//irs:Form990PartVIISectionAGrp",
- namespaces={"irs": "http://www.irs.gov/efile"}
+ "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
+ xpath=".//journal-meta",
)
df
@@ -3104,6 +3102,11 @@ Specify only elements or only attributes to parse:
df = pd.read_xml(file_path, attrs_only=True)
df
+.. ipython:: python
+ :suppress:
+
+ os.remove("books.xml")
+
XML documents can have namespaces with prefixes and default namespaces without
prefixes both of which are denoted with a special attribute ``xmlns``. In order
to parse by node under a namespace context, ``xpath`` must reference a prefix.
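+A minimal sketch (``xml_with_ns`` and the namespace URI are hypothetical):
+
+.. code-block:: python
+
+   df = pd.read_xml(
+       xml_with_ns,  # hypothetical XML string using a default namespace
+       xpath="//doc:row",
+       namespaces={"doc": "https://example.com/"},
+   )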
@@ -3266,6 +3269,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet=xsl)
df
+For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
+which are memory-efficient methods to iterate through an XML tree and extract specific elements
+and attributes without holding the entire tree in memory.
+
+.. versionadded:: 1.5.0
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
+To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
+Files should not be compressed or point to online sources but stored on local disk. Also, ``iterparse`` should be
+a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of
+any element or attribute that is a descendant (i.e., child, grandchild) of the repeating node. Since XPath is not
+used in this method, descendants do not need to share the same relationship with one another. Below is an example
+of reading in Wikipedia's very large (12 GB+) latest article data dump.
+
+.. code-block:: ipython
+
+   In [1]: df = pd.read_xml(
+      ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+      ...:     iterparse={"page": ["title", "ns", "id"]}
+      ...: )
+
+   In [2]: df
+   Out[2]:
+ title ns id
+ 0 Gettysburg Address 0 21450
+ 1 Main Page 0 42950
+ 2 Declaration by United Nations 0 8435
+ 3 Constitution of the United States of America 0 8435
+ 4 Declaration of Independence (Israel) 0 17858
+ ... ... ... ...
+ 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
+ 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
+ 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
+ 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
+ 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
+
+ [3578765 rows x 3 columns]
.. _io.xml:
@@ -3464,7 +3506,7 @@ See the :ref:`cookbook` for some advanced strategies.
**Please do not report issues when using ``xlrd`` to read ``.xlsx`` files.**
This is no longer supported, switch to using ``openpyxl`` instead.
- Attempting to use the the ``xlwt`` engine will raise a ``FutureWarning``
+ Attempting to use the ``xlwt`` engine will raise a ``FutureWarning``
unless the option :attr:`io.excel.xls.writer` is set to ``"xlwt"``.
While this option is now deprecated and will also raise a ``FutureWarning``,
it can be globally set and the warning suppressed. Users are recommended to
@@ -3654,6 +3696,10 @@ should be passed to ``index_col`` and ``header``:
os.remove("path_to_file.xlsx")
+Missing values in columns specified in ``index_col`` will be forward filled to
+allow roundtripping with ``to_excel`` for ``merged_cells=True``. To avoid forward
+filling the missing values use ``set_index`` after reading the data instead of
+``index_col``.
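+
+For example, a minimal sketch of the workaround (the file and column name are
+illustrative):
+
+.. code-block:: python
+
+   # read without index_col so missing values are not forward filled,
+   # then build the index explicitly
+   df = pd.read_excel("path_to_file.xlsx")
+   df = df.set_index("lvl1")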
Parsing specific columns
++++++++++++++++++++++++
@@ -4968,7 +5014,7 @@ control compression: ``complevel`` and ``complib``.
rates but is somewhat slow.
- `lzo `_: Fast
compression and decompression.
- - `bzip2 `_: Good compression rates.
+ - `bzip2 `_: Good compression rates.
- `blosc `_: Fast compression and
decompression.
@@ -4977,10 +5023,10 @@ control compression: ``complevel`` and ``complib``.
- `blosc:blosclz `_ This is the
default compressor for ``blosc``
- `blosc:lz4
- `_:
+ `_:
A compact, very popular and fast compressor.
- `blosc:lz4hc
- `_:
+ `_:
A tweaked version of LZ4, produces better
compression ratios at the expense of speed.
- `blosc:snappy `_:
@@ -5401,7 +5447,7 @@ See the documentation for `pyarrow `__ an
.. note::
These engines are very similar and should read/write nearly identical parquet format files.
- Currently ``pyarrow`` does not support timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+ ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
.. ipython:: python
@@ -5548,13 +5594,64 @@ ORC
.. versionadded:: 1.0.0
Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization
-for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
-ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow `__ library.
+for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the
+ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow `__ library.
.. warning::
* It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
- * :func:`~pandas.read_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `.
+ * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
+ * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet; you can find valid environments on :ref:`install optional dependencies `.
+ * For supported dtypes please refer to `supported ORC features in Arrow `__.
+ * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
+
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {
+           "a": list("abc"),
+           "b": list(range(1, 4)),
+           "c": np.arange(4.0, 7.0, dtype="float64"),
+           "d": [True, False, True],
+           "e": pd.date_range("20130101", periods=3),
+       }
+   )
+
+ df
+ df.dtypes
+
+Write to an ORC file.
+
+.. ipython:: python
+ :okwarning:
+
+ df.to_orc("example_pa.orc", engine="pyarrow")
+
+Read from an ORC file.
+
+.. ipython:: python
+ :okwarning:
+
+ result = pd.read_orc("example_pa.orc")
+
+ result.dtypes
+
+Read only certain columns of an ORC file.
+
+.. ipython:: python
+
+   result = pd.read_orc(
+       "example_pa.orc",
+       columns=["a", "b"],
+   )
+ result.dtypes
+
+
+.. ipython:: python
+ :suppress:
+
+ os.remove("example_pa.orc")
+
.. _io.sql:
@@ -5564,7 +5661,7 @@ SQL queries
The :mod:`pandas.io.sql` module provides a collection of query wrappers to both
facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
is provided by SQLAlchemy if installed. In addition you will need a driver library for
-your database. Examples of such drivers are `psycopg2 `__
+your database. Examples of such drivers are `psycopg2 `__
for PostgreSQL or `pymysql `__ for MySQL.
For `SQLite `__ this is
included in Python's standard library by default.
@@ -5596,7 +5693,7 @@ The key functions are:
the provided input (database table name or sql query).
Table names do not need to be quoted if they have special characters.
-In the following example, we use the `SQlite `__ SQL database
+In the following example, we use the `SQLite `__ SQL database
engine. You can use a temporary SQLite database where data are stored in
"memory".
@@ -5626,9 +5723,9 @@ for an explanation of how the database connection is handled.
.. warning::
- When you open a connection to a database you are also responsible for closing it.
- Side effects of leaving a connection open may include locking the database or
- other breaking behaviour.
+ When you open a connection to a database you are also responsible for closing it.
+ Side effects of leaving a connection open may include locking the database or
+ other breaking behaviour.
Writing DataFrames
''''''''''''''''''
@@ -5648,7 +5745,6 @@ the database using :func:`~pandas.DataFrame.to_sql`.
.. ipython:: python
- :suppress:
import datetime
@@ -5661,10 +5757,8 @@ the database using :func:`~pandas.DataFrame.to_sql`.
data = pd.DataFrame(d, columns=c)
-.. ipython:: python
-
- data
- data.to_sql("data", engine)
+ data
+ data.to_sql("data", engine)
With some databases, writing large DataFrames can result in errors due to
packet size limitations being exceeded. This can be avoided by setting the
@@ -5760,7 +5854,7 @@ Possible values are:
specific backend dialect features.
Example of a callable using PostgreSQL `COPY clause
-`__::
+`__::
# Alternative to_sql() *method* for DBs that support COPY FROM
import csv
@@ -6022,7 +6116,7 @@ pandas integrates with this external package. if ``pandas-gbq`` is installed, yo
use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
respective functions from ``pandas-gbq``.
-Full documentation can be found `here `__.
+Full documentation can be found `here `__.
.. _io.stata:
@@ -6230,7 +6324,7 @@ Obtain an iterator and read an XPORT file 100,000 lines at a time:
The specification_ for the xport file format is available from the SAS
web site.
-.. _specification: https://support.sas.com/techsup/technote/ts140.pdf
+.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf
No official documentation is available for the SAS7BDAT format.
@@ -6272,7 +6366,7 @@ avoid converting categorical columns into ``pd.Categorical``:
More information about the SAV and ZSAV file formats is available here_.
-.. _here: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/savedatatypes.htm
+.. _here: https://www.ibm.com/docs/en/spss-statistics/22.0.0
.. _io.other:
@@ -6290,7 +6384,7 @@ xarray_ provides data structures inspired by the pandas ``DataFrame`` for workin
with multi-dimensional datasets, with a focus on the netCDF file format and
easy conversion to and from pandas.
-.. _xarray: https://xarray.pydata.org/
+.. _xarray: https://xarray.pydata.org/en/stable/
.. _io.perf:
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
index 1621b37f31b23..3052ee3001681 100644
--- a/doc/source/user_guide/missing_data.rst
+++ b/doc/source/user_guide/missing_data.rst
@@ -470,7 +470,7 @@ at the new values.
interp_s = ser.reindex(new_index).interpolate(method="pchip")
interp_s[49:51]
-.. _scipy: https://www.scipy.org
+.. _scipy: https://scipy.org/
.. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
@@ -580,7 +580,7 @@ String/regular expression replacement
backslashes than strings without this prefix. Backslashes in raw strings
will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. You
should `read about them
- `__
+ `__
if this is unclear.
Replace the '.' with ``NaN`` (str -> str):
diff --git a/doc/source/user_guide/options.rst b/doc/source/user_guide/options.rst
index 93448dae578c9..c7f5d3ddf66d3 100644
--- a/doc/source/user_guide/options.rst
+++ b/doc/source/user_guide/options.rst
@@ -8,8 +8,8 @@ Options and settings
Overview
--------
-pandas has an options system that lets you customize some aspects of its behaviour,
-display-related options being those the user is most likely to adjust.
+pandas has an options API to configure and customize global behavior related to
+:class:`DataFrame` display, data behavior and more.
Options have a full "dotted-style", case-insensitive name (e.g. ``display.max_rows``).
You can get/set options directly as attributes of the top-level ``options`` attribute:
@@ -31,10 +31,12 @@ namespace:
* :func:`~pandas.option_context` - execute a codeblock with a set of options
that revert to prior settings after execution.
-**Note:** Developers can check out `pandas/core/config_init.py `_ for more information.
+.. note::
+
+ Developers can check out `pandas/core/config_init.py `_ for more information.
All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
-and so passing in a substring will work - as long as it is unambiguous:
+to match an unambiguous substring:
.. ipython:: python
@@ -51,17 +53,13 @@ The following will **not work** because it matches multiple option names, e.g.
.. ipython:: python
:okexcept:
- try:
- pd.get_option("max")
- except KeyError as e:
- print(e)
+ pd.get_option("max")
-**Note:** Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
+.. warning::
+ Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
-You can get a list of available options and their descriptions with ``describe_option``. When called
-with no argument ``describe_option`` will print out the descriptions for all available options.
.. ipython:: python
:suppress:
@@ -69,6 +67,18 @@ with no argument ``describe_option`` will print out the descriptions for all ava
pd.reset_option("all")
+.. _options.available:
+
+Available options
+-----------------
+
+You can get a list of available options and their descriptions with :func:`~pandas.describe_option`. When called
+with no argument :func:`~pandas.describe_option` will print out the descriptions for all available options.
+
+.. ipython:: python
+
+ pd.describe_option()
+
Getting and setting options
---------------------------
@@ -82,9 +92,11 @@ are available from the pandas namespace. To change an option, call
pd.set_option("mode.sim_interactive", True)
pd.get_option("mode.sim_interactive")
-**Note:** The option 'mode.sim_interactive' is mostly used for debugging purposes.
+.. note::
+
+ The option ``'mode.sim_interactive'`` is mostly used for debugging purposes.
-All options also have a default value, and you can use ``reset_option`` to do just that:
+You can use :func:`~pandas.reset_option` to revert to a setting's default value:
.. ipython:: python
:suppress:
@@ -108,7 +120,7 @@ It's also possible to reset multiple options at once (using a regex):
pd.reset_option("^display")
-``option_context`` context manager has been exposed through
+The :func:`~pandas.option_context` context manager has been exposed through
the top-level API, allowing you to execute code with given option values. Option values
are restored automatically when you exit the ``with`` block:
@@ -124,7 +136,9 @@ are restored automatically when you exit the ``with`` block:
Setting startup options in Python/IPython environment
-----------------------------------------------------
-Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default IPython profile can be found at:
+Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient.
+To do this, create a ``.py`` or ``.ipy`` script in the startup directory of the desired profile.
+An example where the startup folder is in a default IPython profile can be found at:
.. code-block:: none
@@ -144,10 +158,10 @@ More information can be found in the `IPython documentation
Frequently used options
-----------------------
-The following is a walk-through of the more frequently used display options.
+The following demonstrates the more frequently used display options.
``display.max_rows`` and ``display.max_columns`` set the maximum number
-of rows and columns displayed when a frame is pretty-printed. Truncated
+of rows and columns displayed when a frame is pretty-printed. Truncated
lines are replaced by an ellipsis.
.. ipython:: python
@@ -175,8 +189,8 @@ determines how many rows are shown in the truncated repr.
pd.reset_option("display.max_rows")
pd.reset_option("display.min_rows")
-``display.expand_frame_repr`` allows for the representation of
-dataframes to stretch across pages, wrapped over the full column vs row-wise.
+``display.expand_frame_repr`` allows for the representation of a
+:class:`DataFrame` to stretch across pages, wrapped over all the columns.
.. ipython:: python
@@ -187,8 +201,8 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise.
df
pd.reset_option("expand_frame_repr")
-``display.large_repr`` lets you select whether to display dataframes that exceed
-``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
+``display.large_repr`` displays a :class:`DataFrame` that exceeds
+``max_columns`` or ``max_rows`` as a truncated frame or summary.
.. ipython:: python
@@ -220,8 +234,8 @@ of this length or longer will be truncated with an ellipsis.
df
pd.reset_option("max_colwidth")
-``display.max_info_columns`` sets a threshold for when by-column info
-will be given.
+``display.max_info_columns`` sets a threshold for the number of columns
+displayed when calling :meth:`~pandas.DataFrame.info`.
.. ipython:: python
@@ -232,10 +246,10 @@ will be given.
df.info()
pd.reset_option("max_info_columns")
-``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
-For large frames this can be quite slow. ``max_info_rows`` and ``max_info_cols``
-limit this null check only to frames with smaller dimensions then specified. Note that you
-can specify the option ``df.info(null_counts=True)`` to override on showing a particular frame.
+``display.max_info_rows``: :meth:`~pandas.DataFrame.info` will usually show null-counts for each column.
+For a large :class:`DataFrame`, this can be quite slow. ``max_info_rows`` and ``max_info_cols``
+limit this null check to the specified rows and columns respectively. The :meth:`~pandas.DataFrame.info`
+keyword argument ``null_counts=True`` will override this.
.. ipython:: python
@@ -248,7 +262,6 @@ can specify the option ``df.info(null_counts=True)`` to override on showing a pa
pd.reset_option("max_info_rows")
``display.precision`` sets the output display precision in terms of decimal places.
-This is only a suggestion.
.. ipython:: python
@@ -258,8 +271,8 @@ This is only a suggestion.
pd.set_option("display.precision", 4)
df
-``display.chop_threshold`` sets at what level pandas rounds to zero when
-it displays a Series of DataFrame. This setting does not change the
+``display.chop_threshold`` sets the threshold below which pandas displays a value as
+zero in a :class:`Series` or :class:`DataFrame`. This setting does not change the
precision at which the number is stored.
.. ipython:: python
@@ -272,7 +285,7 @@ precision at which the number is stored.
pd.reset_option("chop_threshold")
``display.colheader_justify`` controls the justification of the headers.
-The options are 'right', and 'left'.
+The options are ``'right'``, and ``'left'``.
.. ipython:: python
@@ -288,238 +301,6 @@ The options are 'right', and 'left'.
pd.reset_option("colheader_justify")
-
-.. _options.available:
-
-Available options
------------------
-
-======================================= ============ ==================================
-Option Default Function
-======================================= ============ ==================================
-display.chop_threshold None If set to a float value, all float
- values smaller then the given
- threshold will be displayed as
- exactly 0 by repr and friends.
-display.colheader_justify right Controls the justification of
- column headers. used by DataFrameFormatter.
-display.column_space 12 No description available.
-display.date_dayfirst False When True, prints and parses dates
- with the day first, eg 20/01/2005
-display.date_yearfirst False When True, prints and parses dates
- with the year first, eg 2005/01/20
-display.encoding UTF-8 Defaults to the detected encoding
- of the console. Specifies the encoding
- to be used for strings returned by
- to_string, these are generally strings
- meant to be displayed on the console.
-display.expand_frame_repr True Whether to print out the full DataFrame
- repr for wide DataFrames across
- multiple lines, ``max_columns`` is
- still respected, but the output will
- wrap-around across multiple "pages"
- if its width exceeds ``display.width``.
-display.float_format None The callable should accept a floating
- point number and return a string with
- the desired format of the number.
- This is used in some places like
- SeriesFormatter.
- See core.format.EngFormatter for an example.
-display.large_repr truncate For DataFrames exceeding max_rows/max_cols,
- the repr (and HTML repr) can show
- a truncated table (the default),
- or switch to the view from df.info()
- (the behaviour in earlier versions of pandas).
- allowable settings, ['truncate', 'info']
-display.latex.repr False Whether to produce a latex DataFrame
- representation for Jupyter frontends
- that support it.
-display.latex.escape True Escapes special characters in DataFrames, when
- using the to_latex method.
-display.latex.longtable False Specifies if the to_latex method of a DataFrame
- uses the longtable format.
-display.latex.multicolumn True Combines columns when using a MultiIndex
-display.latex.multicolumn_format 'l' Alignment of multicolumn labels
-display.latex.multirow False Combines rows when using a MultiIndex.
- Centered instead of top-aligned,
- separated by clines.
-display.max_columns 0 or 20 max_rows and max_columns are used
- in __repr__() methods to decide if
- to_string() or info() is used to
- render an object to a string. In
- case Python/IPython is running in
- a terminal this is set to 0 by default and
- pandas will correctly auto-detect
- the width of the terminal and switch to
- a smaller format in case all columns
- would not fit vertically. The IPython
- notebook, IPython qtconsole, or IDLE
- do not run in a terminal and hence
- it is not possible to do correct
- auto-detection, in which case the default
- is set to 20. 'None' value means unlimited.
-display.max_colwidth 50 The maximum width in characters of
- a column in the repr of a pandas
- data structure. When the column overflows,
- a "..." placeholder is embedded in
- the output. 'None' value means unlimited.
-display.max_info_columns 100 max_info_columns is used in DataFrame.info
- method to decide if per column information
- will be printed.
-display.max_info_rows 1690785 df.info() will usually show null-counts
- for each column. For large frames
- this can be quite slow. max_info_rows
- and max_info_cols limit this null
- check only to frames with smaller
- dimensions then specified.
-display.max_rows 60 This sets the maximum number of rows
- pandas should output when printing
- out various output. For example,
- this value determines whether the
- repr() for a dataframe prints out
- fully or just a truncated or summary repr.
- 'None' value means unlimited.
-display.min_rows 10 The numbers of rows to show in a truncated
- repr (when ``max_rows`` is exceeded). Ignored
- when ``max_rows`` is set to None or 0. When set
- to None, follows the value of ``max_rows``.
-display.max_seq_items 100 when pretty-printing a long sequence,
- no more then ``max_seq_items`` will
- be printed. If items are omitted,
- they will be denoted by the addition
- of "..." to the resulting string.
- If set to None, the number of items
- to be printed is unlimited.
-display.memory_usage True This specifies if the memory usage of
- a DataFrame should be displayed when the
- df.info() method is invoked.
-display.multi_sparse True "Sparsify" MultiIndex display (don't
- display repeated elements in outer
- levels within groups)
-display.notebook_repr_html True When True, IPython notebook will
- use html representation for
- pandas objects (if it is available).
-display.pprint_nest_depth 3 Controls the number of nested levels
- to process when pretty-printing
-display.precision 6 Floating point output precision in
- terms of number of places after the
- decimal, for regular formatting as well
- as scientific notation. Similar to
- numpy's ``precision`` print option
-display.show_dimensions truncate Whether to print out dimensions
- at the end of DataFrame repr.
- If 'truncate' is specified, only
- print out the dimensions if the
- frame is truncated (e.g. not display
- all rows and/or columns)
-display.width 80 Width of the display in characters.
- In case Python/IPython is running in
- a terminal this can be set to None
- and pandas will correctly auto-detect
- the width. Note that the IPython notebook,
- IPython qtconsole, or IDLE do not run in a
- terminal and hence it is not possible
- to correctly detect the width.
-display.html.table_schema False Whether to publish a Table Schema
- representation for frontends that
- support it.
-display.html.border 1 A ``border=value`` attribute is
- inserted in the ``
`` tag
- for the DataFrame HTML repr.
-display.html.use_mathjax True When True, Jupyter notebook will process
- table contents using MathJax, rendering
- mathematical expressions enclosed by the
- dollar symbol.
-display.max_dir_items 100 The number of columns from a dataframe that
- are added to dir. These columns can then be
- suggested by tab completion. 'None' value means
- unlimited.
-io.excel.xls.writer xlwt The default Excel writer engine for
- 'xls' files.
-
- .. deprecated:: 1.2.0
-
- As `xlwt `__
- package is no longer maintained, the ``xlwt``
- engine will be removed in a future version of
- pandas. Since this is the only engine in pandas
- that supports writing to ``.xls`` files,
- this option will also be removed.
-
-io.excel.xlsm.writer openpyxl The default Excel writer engine for
- 'xlsm' files. Available options:
- 'openpyxl' (the default).
-io.excel.xlsx.writer openpyxl The default Excel writer engine for
- 'xlsx' files.
-io.hdf.default_format None default format writing format, if
- None, then put will default to
- 'fixed' and append will default to
- 'table'
-io.hdf.dropna_table True drop ALL nan rows when appending
- to a table
-io.parquet.engine None The engine to use as a default for
- parquet reading and writing. If None
- then try 'pyarrow' and 'fastparquet'
-io.sql.engine None The engine to use as a default for
- sql reading and writing, with SQLAlchemy
- as a higher level interface. If None
- then try 'sqlalchemy'
-mode.chained_assignment warn Controls ``SettingWithCopyWarning``:
- 'raise', 'warn', or None. Raise an
- exception, warn, or no action if
- trying to use :ref:`chained assignment `.
-mode.sim_interactive False Whether to simulate interactive mode
- for purposes of testing.
-mode.use_inf_as_na False True means treat None, NaN, -INF,
- INF as NA (old way), False means
- None and NaN are null, but INF, -INF
- are not NA (new way).
-compute.use_bottleneck True Use the bottleneck library to accelerate
- computation if it is installed.
-compute.use_numexpr True Use the numexpr library to accelerate
- computation if it is installed.
-plotting.backend matplotlib Change the plotting backend to a different
- backend than the current matplotlib one.
- Backends can be implemented as third-party
- libraries implementing the pandas plotting
- API. They can use other plotting libraries
- like Bokeh, Altair, etc.
-plotting.matplotlib.register_converters True Register custom converters with
- matplotlib. Set to False to de-register.
-styler.sparse.index True "Sparsify" MultiIndex display for rows
- in Styler output (don't display repeated
- elements in outer levels within groups).
-styler.sparse.columns True "Sparsify" MultiIndex display for columns
- in Styler output.
-styler.render.repr html Standard output format for Styler rendered in Jupyter Notebook.
- Should be one of "html" or "latex".
-styler.render.max_elements 262144 Maximum number of datapoints that Styler will render
- trimming either rows, columns or both to fit.
-styler.render.max_rows None Maximum number of rows that Styler will render. By default
- this is dynamic based on ``max_elements``.
-styler.render.max_columns None Maximum number of columns that Styler will render. By default
- this is dynamic based on ``max_elements``.
-styler.render.encoding utf-8 Default encoding for output HTML or LaTeX files.
-styler.format.formatter None Object to specify formatting functions to ``Styler.format``.
-styler.format.na_rep None String representation for missing data.
-styler.format.precision 6 Precision to display floating point and complex numbers.
-styler.format.decimal . String representation for decimal point separator for floating
- point and complex numbers.
-styler.format.thousands None String representation for thousands separator for
- integers, and floating point and complex numbers.
-styler.format.escape None Whether to escape "html" or "latex" special
- characters in the display representation.
-styler.html.mathjax True If set to False will render specific CSS classes to
- table attributes that will prevent Mathjax from rendering
- in Jupyter Notebook.
-styler.latex.multicol_align r Alignment of headers in a merged column due to sparsification. Can be in {"r", "c", "l"}.
-styler.latex.multirow_align c Alignment of index labels in a merged row due to sparsification. Can be in {"c", "t", "b"}.
-styler.latex.environment None If given will replace the default ``\\begin{table}`` environment. If "longtable" is specified
- this will render with a specific "longtable" template with longtable features.
-styler.latex.hrules False If set to True will render ``\\toprule``, ``\\midrule``, and ``\bottomrule`` by default.
-======================================= ============ ==================================
-
-
.. _basics.console_output:
Number formatting
@@ -532,8 +313,6 @@ Use the ``set_eng_float_format`` function
to alter the floating-point formatting of pandas objects to produce a particular
format.
-For instance:
-
.. ipython:: python
import numpy as np
@@ -549,7 +328,7 @@ For instance:
pd.reset_option("^display")
-To round floats on a case-by-case basis, you can also use :meth:`~pandas.Series.round` and :meth:`~pandas.DataFrame.round`.
+Use :meth:`~pandas.DataFrame.round` to specifically control the rounding of an individual :class:`DataFrame`.
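+
+A minimal sketch:
+
+.. code-block:: python
+
+   df = pd.DataFrame(np.random.randn(2, 2))
+   # round this one DataFrame without changing any display options
+   df.round(2)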
.. _options.east_asian_width:
@@ -564,15 +343,11 @@ Unicode formatting
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters.
If a DataFrame or Series contains these characters, the default output mode may not align them properly.
-.. note:: Screen captures are attached for each output to show the actual results.
-
.. ipython:: python
df = pd.DataFrame({"国籍": ["UK", "日本"], "名前": ["Alice", "しのぶ"]})
df
-.. image:: ../_static/option_unicode01.png
-
Enabling ``display.unicode.east_asian_width`` allows pandas to check each character's "East Asian Width" property.
These characters can be aligned properly by setting this option to ``True``. However, this will result in longer render
times than the standard ``len`` function.
@@ -582,19 +357,16 @@ times than the standard ``len`` function.
pd.set_option("display.unicode.east_asian_width", True)
df
-.. image:: ../_static/option_unicode02.png
-
-In addition, Unicode characters whose width is "Ambiguous" can either be 1 or 2 characters wide depending on the
+In addition, Unicode characters whose width is "ambiguous" can either be 1 or 2 characters wide depending on the
terminal setting or encoding. The option ``display.unicode.ambiguous_as_wide`` can be used to handle the ambiguity.
-By default, an "Ambiguous" character's width, such as "¡" (inverted exclamation) in the example below, is taken to be 1.
+By default, an "ambiguous" character's width, such as "¡" (inverted exclamation) in the example below, is taken to be 1.
.. ipython:: python
df = pd.DataFrame({"a": ["xxx", "¡¡"], "b": ["yyy", "¡¡"]})
df
-.. image:: ../_static/option_unicode03.png
Enabling ``display.unicode.ambiguous_as_wide`` makes pandas interpret these characters' widths to be 2.
(Note that this option will only be effective when ``display.unicode.east_asian_width`` is enabled.)
@@ -606,7 +378,6 @@ However, setting this option incorrectly for your terminal will cause these char
pd.set_option("display.unicode.ambiguous_as_wide", True)
df
-.. image:: ../_static/option_unicode04.png
.. ipython:: python
:suppress:
@@ -619,8 +390,8 @@ However, setting this option incorrectly for your terminal will cause these char
Table schema display
--------------------
-``DataFrame`` and ``Series`` will publish a Table Schema representation
-by default. False by default, this can be enabled globally with the
+:class:`DataFrame` and :class:`Series` can publish a Table Schema representation.
+This is disabled by default and can be enabled globally with the
``display.html.table_schema`` option:
.. ipython:: python
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
index e74272c825e46..adca9de6c130a 100644
--- a/doc/source/user_guide/reshaping.rst
+++ b/doc/source/user_guide/reshaping.rst
@@ -13,37 +13,12 @@ Reshaping by pivoting DataFrame objects
.. image:: ../_static/reshaping_pivot.png
-.. ipython:: python
- :suppress:
-
- import pandas._testing as tm
-
- def unpivot(frame):
- N, K = frame.shape
- data = {
- "value": frame.to_numpy().ravel("F"),
- "variable": np.asarray(frame.columns).repeat(N),
- "date": np.tile(np.asarray(frame.index), K),
- }
- columns = ["date", "variable", "value"]
- return pd.DataFrame(data, columns=columns)
-
- df = unpivot(tm.makeTimeDataFrame(3))
-
Data is often stored in so-called "stacked" or "record" format:
.. ipython:: python
- df
-
-
-For the curious here is how the above ``DataFrame`` was created:
-
-.. code-block:: python
-
import pandas._testing as tm
-
def unpivot(frame):
    N, K = frame.shape
    data = {
@@ -53,14 +28,15 @@ For the curious here is how the above ``DataFrame`` was created:
    }
    return pd.DataFrame(data, columns=["date", "variable", "value"])
-
df = unpivot(tm.makeTimeDataFrame(3))
+ df
To select out everything for variable ``A`` we could do:
.. ipython:: python
- df[df["variable"] == "A"]
+ filtered = df[df["variable"] == "A"]
+ filtered
But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
@@ -70,11 +46,12 @@ top level function :func:`~pandas.pivot`):
.. ipython:: python
- df.pivot(index="date", columns="variable", values="value")
+ pivoted = df.pivot(index="date", columns="variable", values="value")
+ pivoted
-If the ``values`` argument is omitted, and the input ``DataFrame`` has more than
-one column of values which are not used as column or index inputs to ``pivot``,
-then the resulting "pivoted" ``DataFrame`` will have :ref:`hierarchical columns
+If the ``values`` argument is omitted, and the input :class:`DataFrame` has more than
+one column of values which are not used as column or index inputs to :meth:`~DataFrame.pivot`,
+then the resulting "pivoted" :class:`DataFrame` will have :ref:`hierarchical columns
` whose topmost level indicates the respective value
column:
@@ -84,7 +61,7 @@ column:
pivoted = df.pivot(index="date", columns="variable")
pivoted
-You can then select subsets from the pivoted ``DataFrame``:
+You can then select subsets from the pivoted :class:`DataFrame`:
.. ipython:: python
@@ -108,16 +85,16 @@ Reshaping by stacking and unstacking
Closely related to the :meth:`~DataFrame.pivot` method are the related
:meth:`~DataFrame.stack` and :meth:`~DataFrame.unstack` methods available on
-``Series`` and ``DataFrame``. These methods are designed to work together with
-``MultiIndex`` objects (see the section on :ref:`hierarchical indexing
+:class:`Series` and :class:`DataFrame`. These methods are designed to work together with
+:class:`MultiIndex` objects (see the section on :ref:`hierarchical indexing
`). Here are essentially what these methods do:
-* ``stack``: "pivot" a level of the (possibly hierarchical) column labels,
- returning a ``DataFrame`` with an index with a new inner-most level of row
+* :meth:`~DataFrame.stack`: "pivot" a level of the (possibly hierarchical) column labels,
+ returning a :class:`DataFrame` with an index with a new inner-most level of row
labels.
-* ``unstack``: (inverse operation of ``stack``) "pivot" a level of the
+* :meth:`~DataFrame.unstack`: (inverse operation of :meth:`~DataFrame.stack`) "pivot" a level of the
(possibly hierarchical) row index to the column axis, producing a reshaped
- ``DataFrame`` with a new inner-most level of column labels.
+ :class:`DataFrame` with a new inner-most level of column labels.
.. image:: ../_static/reshaping_unstack.png
@@ -139,22 +116,22 @@ from the hierarchical indexing section:
df2 = df[:4]
df2
-The ``stack`` function "compresses" a level in the ``DataFrame``'s columns to
+The :meth:`~DataFrame.stack` function "compresses" a level in the :class:`DataFrame` columns to
produce either:
-* A ``Series``, in the case of a simple column Index.
-* A ``DataFrame``, in the case of a ``MultiIndex`` in the columns.
+* A :class:`Series`, in the case of a simple column Index.
+* A :class:`DataFrame`, in the case of a :class:`MultiIndex` in the columns.
-If the columns have a ``MultiIndex``, you can choose which level to stack. The
-stacked level becomes the new lowest level in a ``MultiIndex`` on the columns:
+If the columns have a :class:`MultiIndex`, you can choose which level to stack. The
+stacked level becomes the new lowest level in a :class:`MultiIndex` on the columns:
.. ipython:: python
stacked = df2.stack()
stacked
-With a "stacked" ``DataFrame`` or ``Series`` (having a ``MultiIndex`` as the
-``index``), the inverse operation of ``stack`` is ``unstack``, which by default
+With a "stacked" :class:`DataFrame` or :class:`Series` (having a :class:`MultiIndex` as the
+``index``), the inverse operation of :meth:`~DataFrame.stack` is :meth:`~DataFrame.unstack`, which by default
unstacks the **last level**:
.. ipython:: python
@@ -177,9 +154,9 @@ the level numbers:
.. image:: ../_static/reshaping_unstack_0.png
-Notice that the ``stack`` and ``unstack`` methods implicitly sort the index
-levels involved. Hence a call to ``stack`` and then ``unstack``, or vice versa,
-will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
+Notice that the :meth:`~DataFrame.stack` and :meth:`~DataFrame.unstack` methods implicitly sort the index
+levels involved. Hence a call to :meth:`~DataFrame.stack` and then :meth:`~DataFrame.unstack`, or vice versa,
+will result in a **sorted** copy of the original :class:`DataFrame` or :class:`Series`:
.. ipython:: python
@@ -188,7 +165,7 @@ will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
df
all(df.unstack().stack() == df.sort_index())
-The above code will raise a ``TypeError`` if the call to ``sort_index`` is
+The above code will raise a ``TypeError`` if the call to :meth:`~DataFrame.sort_index` is
removed.
.. _reshaping.stack_multiple:
@@ -231,7 +208,7 @@ Missing data
These functions are intelligent about handling missing data and do not expect
each subgroup within the hierarchical index to have the same set of labels.
They also can handle the index being unsorted (but you can make it sorted by
-calling ``sort_index``, of course). Here is a more complex example:
+calling :meth:`~DataFrame.sort_index`, of course). Here is a more complex example:
.. ipython:: python
@@ -251,7 +228,7 @@ calling ``sort_index``, of course). Here is a more complex example:
df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
df2
-As mentioned above, ``stack`` can be called with a ``level`` argument to select
+As mentioned above, :meth:`~DataFrame.stack` can be called with a ``level`` argument to select
which level in the columns to stack:
.. ipython:: python
@@ -281,7 +258,7 @@ the value of missing data.
With a MultiIndex
~~~~~~~~~~~~~~~~~
-Unstacking when the columns are a ``MultiIndex`` is also careful about doing
+Unstacking when the columns are a :class:`MultiIndex` is also careful about doing
the right thing:
.. ipython:: python
@@ -297,7 +274,7 @@ Reshaping by melt
.. image:: ../_static/reshaping_melt.png
The top-level :func:`~pandas.melt` function and the corresponding :meth:`DataFrame.melt`
-are useful to massage a ``DataFrame`` into a format where one or more columns
+are useful to massage a :class:`DataFrame` into a format where one or more columns
are *identifier variables*, while all other columns, considered *measured
variables*, are "unpivoted" to the row axis, leaving just two non-identifier
columns, "variable" and "value". The names of those columns can be customized
@@ -363,7 +340,7 @@ user-friendly.
Combining with stats and GroupBy
--------------------------------
-It should be no shock that combining ``pivot`` / ``stack`` / ``unstack`` with
+It should be no shock that combining :meth:`~DataFrame.pivot` / :meth:`~DataFrame.stack` / :meth:`~DataFrame.unstack` with
GroupBy and the basic Series and DataFrame statistical functions can produce
some very expressive and fast data manipulations.
@@ -385,8 +362,6 @@ Pivot tables
.. _reshaping.pivot:
-
-
While :meth:`~DataFrame.pivot` provides general purpose pivoting with various
data types (strings, numerics, etc.), pandas also provides :func:`~pandas.pivot_table`
for pivoting with aggregation of numeric data.
@@ -437,30 +412,29 @@ We can produce pivot tables from this data very easily:
aggfunc=np.sum,
)
-The result object is a ``DataFrame`` having potentially hierarchical indexes on the
+The result object is a :class:`DataFrame` having potentially hierarchical indexes on the
rows and columns. If the ``values`` column name is not given, the pivot table
-will include all of the data that can be aggregated in an additional level of
-hierarchy in the columns:
+will include all of the data in an additional level of hierarchy in the columns:
.. ipython:: python
- pd.pivot_table(df, index=["A", "B"], columns=["C"])
+ pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])
-Also, you can use ``Grouper`` for ``index`` and ``columns`` keywords. For detail of ``Grouper``, see :ref:`Grouping with a Grouper specification `.
+Also, you can use :class:`Grouper` for ``index`` and ``columns`` keywords. For detail of :class:`Grouper`, see :ref:`Grouping with a Grouper specification `.
.. ipython:: python
pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
You can render a nice output of the table omitting the missing values by
-calling ``to_string`` if you wish:
+calling :meth:`~DataFrame.to_string` if you wish:
.. ipython:: python
- table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
+ table = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])
print(table.to_string(na_rep=""))
-Note that ``pivot_table`` is also available as an instance method on DataFrame,
+Note that :meth:`~DataFrame.pivot_table` is also available as an instance method on DataFrame,
i.e. :meth:`DataFrame.pivot_table`.
.. _reshaping.pivot.margins:
@@ -468,13 +442,19 @@ Note that ``pivot_table`` is also available as an instance method on DataFrame,
Adding margins
~~~~~~~~~~~~~~
-If you pass ``margins=True`` to ``pivot_table``, special ``All`` columns and
+If you pass ``margins=True`` to :meth:`~DataFrame.pivot_table`, special ``All`` columns and
rows will be added with partial group aggregates across the categories on the
rows and columns:
.. ipython:: python
- table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+   table = df.pivot_table(
+       index=["A", "B"],
+       columns="C",
+       values=["D", "E"],
+       margins=True,
+       aggfunc=np.std
+   )
table
Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
@@ -490,7 +470,7 @@ Cross tabulations
-----------------
Use :func:`~pandas.crosstab` to compute a cross-tabulation of two (or more)
-factors. By default ``crosstab`` computes a frequency table of the factors
+factors. By default :func:`~pandas.crosstab` computes a frequency table of the factors
unless an array of values and an aggregation function are passed.
It takes a number of arguments
@@ -509,7 +489,7 @@ It takes a number of arguments
Normalize by dividing all values by the sum of values.
-Any ``Series`` passed will have their name attributes used unless row or column
+Any :class:`Series` passed will have their name attributes used unless row or column
names for the cross-tabulation are specified
For example:
@@ -523,7 +503,7 @@ For example:
pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
-If ``crosstab`` receives only two Series, it will provide a frequency table.
+If :func:`~pandas.crosstab` receives only two Series, it will provide a frequency table.
.. ipython:: python
@@ -534,8 +514,8 @@ If ``crosstab`` receives only two Series, it will provide a frequency table.
pd.crosstab(df["A"], df["B"])
-``crosstab`` can also be implemented
-to ``Categorical`` data.
+:func:`~pandas.crosstab` can also be applied
+to :class:`Categorical` data.
.. ipython:: python
@@ -568,9 +548,9 @@ using the ``normalize`` argument:
pd.crosstab(df["A"], df["B"], normalize="columns")
-``crosstab`` can also be passed a third ``Series`` and an aggregation function
-(``aggfunc``) that will be applied to the values of the third ``Series`` within
-each group defined by the first two ``Series``:
+:func:`~pandas.crosstab` can also be passed a third :class:`Series` and an aggregation function
+(``aggfunc``) that will be applied to the values of the third :class:`Series` within
+each group defined by the first two :class:`Series`:
.. ipython:: python
@@ -611,7 +591,7 @@ Alternatively we can specify custom bin-edges:
c = pd.cut(ages, bins=[0, 18, 35, 70])
c
-If the ``bins`` keyword is an ``IntervalIndex``, then these will be
+If the ``bins`` keyword is an :class:`IntervalIndex`, then these will be
used to bin the passed data.::
pd.cut([25, 20, 50], bins=c.categories)
@@ -622,9 +602,9 @@ used to bin the passed data.::
Computing indicator / dummy variables
-------------------------------------
-To convert a categorical variable into a "dummy" or "indicator" ``DataFrame``,
-for example a column in a ``DataFrame`` (a ``Series``) which has ``k`` distinct
-values, can derive a ``DataFrame`` containing ``k`` columns of 1s and 0s using
+To convert a categorical variable into a "dummy" or "indicator" :class:`DataFrame`,
+for example a column in a :class:`DataFrame` (a :class:`Series`) which has ``k`` distinct
+values, you can derive a :class:`DataFrame` containing ``k`` columns of 1s and 0s using
:func:`~pandas.get_dummies`:
.. ipython:: python
@@ -634,7 +614,7 @@ values, can derive a ``DataFrame`` containing ``k`` columns of 1s and 0s using
pd.get_dummies(df["key"])
Sometimes it's useful to prefix the column names, for example when merging the result
-with the original ``DataFrame``:
+with the original :class:`DataFrame`:
.. ipython:: python
@@ -643,7 +623,7 @@ with the original ``DataFrame``:
df[["data1"]].join(dummies)
-This function is often used along with discretization functions like ``cut``:
+This function is often used along with discretization functions like :func:`~pandas.cut`:
.. ipython:: python
@@ -656,7 +636,7 @@ This function is often used along with discretization functions like ``cut``:
See also :func:`Series.str.get_dummies `.
-:func:`get_dummies` also accepts a ``DataFrame``. By default all categorical
+:func:`get_dummies` also accepts a :class:`DataFrame`. By default all categorical
variables (categorical in the statistical sense, those with ``object`` or
``categorical`` dtype) are encoded as dummy variables.
@@ -677,8 +657,8 @@ Notice that the ``B`` column is still included in the output, it just hasn't
been encoded. You can drop ``B`` before calling ``get_dummies`` if you don't
want to include it in the output.
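A minimal sketch (assuming a frame ``df`` containing the ``B`` column as above):

.. code-block:: python

   # encode everything except the "B" column
   pd.get_dummies(df.drop(columns=["B"]))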
-As with the ``Series`` version, you can pass values for the ``prefix`` and
-``prefix_sep``. By default the column name is used as the prefix, and '_' as
+As with the :class:`Series` version, you can pass values for the ``prefix`` and
+``prefix_sep``. By default the column name is used as the prefix, and ``_`` as
the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways:
* string: Use the same value for ``prefix`` or ``prefix_sep`` for each column
@@ -726,6 +706,30 @@ To choose another dtype, use the ``dtype`` argument:
pd.get_dummies(df, dtype=bool).dtypes
+.. versionadded:: 1.5.0
+
+To convert a "dummy" or "indicator" ``DataFrame``, into a categorical ``DataFrame``,
+for example ``k`` columns of a ``DataFrame`` containing 1s and 0s can derive a
+``DataFrame`` which has ``k`` distinct values using
+:func:`~pandas.from_dummies`:
+
+.. ipython:: python
+
+ df = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})
+ df
+
+ pd.from_dummies(df, sep="_")
+
+Dummy coded data only requires ``k - 1`` categories to be included; in this case
+the ``k`` th category is the default category, implied by not being assigned any of
+the other ``k - 1`` categories. The default category can be passed via ``default_category``.
+
+.. ipython:: python
+
+ df = pd.DataFrame({"prefix_a": [0, 1, 0]})
+ df
+
+ pd.from_dummies(df, sep="_", default_category="b")
.. _reshaping.factorize:
@@ -742,7 +746,7 @@ To encode 1-d values as an enumerated type use :func:`~pandas.factorize`:
labels
uniques
-Note that ``factorize`` is similar to ``numpy.unique``, but differs in its
+Note that :func:`~pandas.factorize` is similar to ``numpy.unique``, but differs in its
handling of NaN:
.. note::
@@ -750,16 +754,12 @@ handling of NaN:
because of an ordering bug. See also
`here `__.
-.. code-block:: ipython
-
- In [1]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
- In [2]: pd.factorize(x, sort=True)
- Out[2]:
- (array([ 2, 2, -1, 3, 0, 1]),
- Index([3.14, inf, 'A', 'B'], dtype='object'))
+.. ipython:: python
+ :okexcept:
- In [3]: np.unique(x, return_inverse=True)[::-1]
- Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
+ ser = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
+ pd.factorize(ser, sort=True)
+ np.unique(ser, return_inverse=True)[::-1]
.. note::
If you just want to handle one column as a categorical variable (like R's factor),
@@ -907,13 +907,13 @@ We can 'explode' the ``values`` column, transforming each list-like to a separat
df["values"].explode()
-You can also explode the column in the ``DataFrame``.
+You can also explode the column in the :class:`DataFrame`.
.. ipython:: python
df.explode("values")
-:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
+:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting :class:`Series` is always ``object``.
.. ipython:: python
diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
index 71aef4fdd75f6..129f43dd36930 100644
--- a/doc/source/user_guide/scale.rst
+++ b/doc/source/user_guide/scale.rst
@@ -18,36 +18,9 @@ tool for all situations. If you're working with very large datasets and a tool
like PostgreSQL fits your needs, then you should probably be using that.
Assuming you want or need the expressiveness and power of pandas, let's carry on.
-.. ipython:: python
-
- import pandas as pd
- import numpy as np
-
-.. ipython:: python
- :suppress:
-
- from pandas._testing import _make_timeseries
-
- # Make a random in-memory dataset
- ts = _make_timeseries(freq="30S", seed=0)
- ts.to_csv("timeseries.csv")
- ts.to_parquet("timeseries.parquet")
-
-
Load less data
--------------
-.. ipython:: python
- :suppress:
-
- # make a similar dataset with many columns
- timeseries = [
- _make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
- for i in range(10)
- ]
- ts_wide = pd.concat(timeseries, axis=1)
- ts_wide.to_parquet("timeseries_wide.parquet")
-
Suppose our raw dataset on disk has many columns::
id_0 name_0 x_0 y_0 id_1 name_1 x_1 ... name_8 x_8 y_8 id_9 name_9 x_9 y_9
@@ -66,6 +39,34 @@ Suppose our raw dataset on disk has many columns::
[525601 rows x 40 columns]
+That can be generated by the following code snippet:
+
+.. ipython:: python
+
+ import pandas as pd
+ import numpy as np
+
+ def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
+ index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
+ n = len(index)
+ state = np.random.RandomState(seed)
+ columns = {
+ "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
+ "id": state.poisson(1000, size=n),
+ "x": state.rand(n) * 2 - 1,
+ "y": state.rand(n) * 2 - 1,
+ }
+ df = pd.DataFrame(columns, index=index, columns=sorted(columns))
+ if df.index[-1] == end:
+ df = df.iloc[:-1]
+ return df
+
+ timeseries = [
+ make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
+ for i in range(10)
+ ]
+ ts_wide = pd.concat(timeseries, axis=1)
+ ts_wide.to_parquet("timeseries_wide.parquet")
To load the columns we want, we have two options.
Option 1 loads in all the data and then filters to what we need.
@@ -82,6 +83,13 @@ Option 2 only loads the columns we request.
pd.read_parquet("timeseries_wide.parquet", columns=columns)
+.. ipython:: python
+ :suppress:
+
+ import os
+
+ os.remove("timeseries_wide.parquet")
+
If we were to measure the memory usage of the two calls, we'd see that specifying
``columns`` uses about 1/10th the memory in this case.
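A sketch of how one might verify this (``columns`` as defined above; it assumes
the parquet file is still on disk):

.. code-block:: python

   all_cols = pd.read_parquet("timeseries_wide.parquet")
   few_cols = pd.read_parquet("timeseries_wide.parquet", columns=columns)
   # compare the in-memory footprint of the two results
   all_cols.memory_usage(deep=True).sum()
   few_cols.memory_usage(deep=True).sum()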
@@ -99,9 +107,16 @@ can store larger datasets in memory.
.. ipython:: python
+ ts = make_timeseries(freq="30S", seed=0)
+ ts.to_parquet("timeseries.parquet")
ts = pd.read_parquet("timeseries.parquet")
ts
+.. ipython:: python
+ :suppress:
+
+ os.remove("timeseries.parquet")
+
Now, let's inspect the data types and memory usage to see where we should focus our
attention.
@@ -116,7 +131,7 @@ attention.
The ``name`` column is taking up much more memory than any other. It has just a
few unique values, so it's a good candidate for converting to a
-:class:`Categorical`. With a Categorical, we store each unique name once and use
+:class:`pandas.Categorical`. With a :class:`pandas.Categorical`, we store each unique name once and use
space-efficient integers to know which specific name is used in each row.
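A minimal sketch of the conversion (assuming the ``ts`` frame loaded above):

.. code-block:: python

   ts2 = ts.copy()
   ts2["name"] = ts2["name"].astype("category")
   # compare the per-column footprint before and after
   ts["name"].memory_usage(deep=True)
   ts2["name"].memory_usage(deep=True)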
@@ -147,7 +162,7 @@ using :func:`pandas.to_numeric`.
In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
original size.
-See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes`
+See :ref:`categorical` for more on :class:`pandas.Categorical` and :ref:`basics.dtypes`
for an overview of all of pandas' dtypes.
Use chunking
@@ -168,7 +183,6 @@ Suppose we have an even larger "logical dataset" on disk that's a directory of p
files. Each file in the directory represents a different year of the entire dataset.
.. ipython:: python
- :suppress:
import pathlib
@@ -179,7 +193,7 @@ files. Each file in the directory represents a different year of the entire data
pathlib.Path("data/timeseries").mkdir(exist_ok=True)
for i, (start, end) in enumerate(zip(starts, ends)):
- ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
+ ts = make_timeseries(start=start, end=end, freq="1T", seed=i)
ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
@@ -200,7 +214,7 @@ files. Each file in the directory represents a different year of the entire data
├── ts-10.parquet
└── ts-11.parquet
-Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this
+Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
workflow is the single largest chunk, plus a small series storing the unique value
counts up to this point. As long as each individual file fits in memory, this will
work for arbitrary-sized datasets.
@@ -211,9 +225,7 @@ work for arbitrary-sized datasets.
files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
counts = pd.Series(dtype=int)
for path in files:
- # Only one dataframe is in memory at a time...
df = pd.read_parquet(path)
- # ... plus a small Series ``counts``, which is updated.
counts = counts.add(df["name"].value_counts(), fill_value=0)
counts.astype(int)
@@ -221,7 +233,7 @@ Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
``chunksize`` when reading a single file.
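For example, a sketch of the same counting workflow using ``chunksize`` (the
file name ``big.csv`` and its ``name`` column are hypothetical):

.. code-block:: python

   counts = pd.Series(dtype=int)
   with pd.read_csv("big.csv", chunksize=100_000) as reader:
       for chunk in reader:
           counts = counts.add(chunk["name"].value_counts(), fill_value=0)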
Manually chunking is an OK option for workflows that don't
-require too sophisticated of operations. Some operations, like ``groupby``, are
+require overly sophisticated operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are
much harder to do chunkwise. In these cases, you may be better off switching to a
different library that implements these out-of-core algorithms for you.
@@ -259,7 +271,7 @@ Inspecting the ``ddf`` object, we see a few things
* There are new attributes like ``.npartitions`` and ``.divisions``
The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas DataFrames. A single method call on a
+DataFrame is made up of many pandas :class:`pandas.DataFrame` objects. A single method call on a
Dask DataFrame ends up making many pandas method calls, and Dask knows how to
coordinate everything to get the result.
@@ -275,6 +287,7 @@ column names and dtypes. That's because Dask hasn't actually read the data yet.
Rather than executing immediately, operations build up a **task graph**.
.. ipython:: python
+ :okwarning:
ddf
ddf["name"]
@@ -282,8 +295,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
Each of these calls is instant because the result isn't being computed yet.
We're just building up a list of computations to do when someone needs the
-result. Dask knows that the return type of a ``pandas.Series.value_counts``
-is a pandas Series with a certain dtype and a certain name. So the Dask version
+result. Dask knows that the return type of a :meth:`pandas.Series.value_counts`
+is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
returns a Dask Series with the same dtype and the same name.
To get the actual result you can call ``.compute()``.
@@ -293,13 +306,13 @@ To get the actual result you can call ``.compute()``.
%time ddf["name"].value_counts().compute()
At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas Series with the count of each ``name``.
+a concrete pandas :class:`pandas.Series` with the count of each ``name``.
Calling ``.compute`` causes the full task graph to be executed. This includes
reading the data, selecting the columns, and doing the ``value_counts``. The
execution is done *in parallel* where possible, and Dask tries to keep the
overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
+than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
By default, ``dask.dataframe`` operations use a threadpool to do operations in
parallel. We can also connect to a cluster to distribute the work on many
@@ -333,6 +346,7 @@ known automatically. In this case, since we created the parquet files manually,
we need to supply the divisions manually.
.. ipython:: python
+ :okwarning:
N = 12
starts = [f"20{i:>02d}-01-01" for i in range(N)]
@@ -364,6 +378,13 @@ out of memory. At that point it's just a regular pandas object.
@savefig dask_resample.png
ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()
+.. ipython:: python
+ :suppress:
+
+ import shutil
+
+ shutil.rmtree("data/timeseries")
+
These Dask examples have all been done using multiple processes on a single
machine. Dask can be `deployed on a cluster
`_ to scale up to even larger
diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst
index b2b3678e48534..bc4eec1c23a35 100644
--- a/doc/source/user_guide/sparse.rst
+++ b/doc/source/user_guide/sparse.rst
@@ -23,7 +23,7 @@ array that are ``nan`` aren't actually stored, only the non-``nan`` elements are
Those non-``nan`` elements have a ``float64`` dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a
-large, mostly NA ``DataFrame``:
+large, mostly NA :class:`DataFrame`:
.. ipython:: python
@@ -139,7 +139,7 @@ Sparse calculation
------------------
You can apply NumPy `ufuncs `_
-to ``SparseArray`` and get a ``SparseArray`` as a result.
+to :class:`arrays.SparseArray` and get a :class:`arrays.SparseArray` as a result.
.. ipython:: python
@@ -183,7 +183,7 @@ your code, rather than ignoring the warning.
**Construction**
From an array-like, use the regular :class:`Series` or
-:class:`DataFrame` constructors with :class:`SparseArray` values.
+:class:`DataFrame` constructors with :class:`arrays.SparseArray` values.
.. code-block:: python
@@ -240,7 +240,7 @@ Sparse-specific properties, like ``density``, are available on the ``.sparse`` a
**General differences**
In a ``SparseDataFrame``, *all* columns were sparse. A :class:`DataFrame` can have a mixture of
-sparse and dense columns. As a consequence, assigning new columns to a ``DataFrame`` with sparse
+sparse and dense columns. As a consequence, assigning new columns to a :class:`DataFrame` with sparse
values will not automatically convert the input to be sparse.
.. code-block:: python
@@ -266,10 +266,10 @@ have no replacement.
.. _sparse.scipysparse:
-Interaction with scipy.sparse
------------------------------
+Interaction with *scipy.sparse*
+-------------------------------
-Use :meth:`DataFrame.sparse.from_spmatrix` to create a ``DataFrame`` with sparse values from a sparse matrix.
+Use :meth:`DataFrame.sparse.from_spmatrix` to create a :class:`DataFrame` with sparse values from a sparse matrix.
.. versionadded:: 0.25.0
@@ -294,9 +294,9 @@ To convert back to sparse SciPy matrix in COO format, you can use the :meth:`Dat
sdf.sparse.to_coo()
-:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
+:meth:`Series.sparse.to_coo` is implemented for transforming a :class:`Series` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
-The method requires a ``MultiIndex`` with two or more levels.
+The method requires a :class:`MultiIndex` with two or more levels.
.. ipython:: python
@@ -315,7 +315,7 @@ The method requires a ``MultiIndex`` with two or more levels.
ss = s.astype('Sparse')
ss
-In the example below, we transform the ``Series`` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
+In the example below, we transform the :class:`Series` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
.. ipython:: python
@@ -341,7 +341,7 @@ Specifying different row and column labels (and not sorting them) yields a diffe
rows
columns
-A convenience method :meth:`Series.sparse.from_coo` is implemented for creating a ``Series`` with sparse values from a ``scipy.sparse.coo_matrix``.
+A convenience method :meth:`Series.sparse.from_coo` is implemented for creating a :class:`Series` with sparse values from a ``scipy.sparse.coo_matrix``.
.. ipython:: python
@@ -350,7 +350,7 @@ A convenience method :meth:`Series.sparse.from_coo` is implemented for creating
A
A.todense()
-The default behaviour (with ``dense_index=False``) simply returns a ``Series`` containing
+The default behaviour (with ``dense_index=False``) simply returns a :class:`Series` containing
only the non-null entries.
.. ipython:: python
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb
index f94f86b4eea58..620e3806a33b5 100644
--- a/doc/source/user_guide/style.ipynb
+++ b/doc/source/user_guide/style.ipynb
@@ -11,7 +11,7 @@
"\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[viz]: visualization.rst\n",
- "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb"
+ "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/main/doc/source/user_guide/style.ipynb"
]
},
{
@@ -151,7 +151,7 @@
"\n",
"### Formatting Values\n",
"\n",
- "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavlaues and index or columns headers. To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n",
+ "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavalues and index or columns headers. To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n",
"\n",
"Additionally, the format function has a **precision** argument to specifically help formatting floats, as well as **decimal** and **thousands** separators to support other locales, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML or safe-LaTeX. The default formatter is configured to adopt pandas' `styler.format.precision` option, controllable using `with pd.option_context('format.precision', 2):` \n",
"\n",
@@ -612,7 +612,7 @@
"source": [
"### Acting on the Index and Column Headers\n",
"\n",
- "Similar application is acheived for headers by using:\n",
+ "Similar application is achieved for headers by using:\n",
" \n",
"- [.applymap_index()][applymapindex] (elementwise): accepts a function that takes a single value and returns a string with the CSS attribute-value pair.\n",
"- [.apply_index()][applyindex] (level-wise): accepts a function that takes a Series and returns a Series, or numpy array with an identical shape where each element is a string with a CSS attribute-value pair. This method passes each level of your Index one-at-a-time. To style the index use `axis=0` and to style the column headers use `axis=1`.\n",
@@ -1100,7 +1100,7 @@
" - [.highlight_null][nullfunc]: for use with identifying missing data. \n",
" - [.highlight_min][minfunc] and [.highlight_max][maxfunc]: for use with identifying extremeties in data.\n",
" - [.highlight_between][betweenfunc] and [.highlight_quantile][quantilefunc]: for use with identifying classes within data.\n",
- " - [.background_gradient][bgfunc]: a flexible method for highlighting cells based or their, or other, values on a numeric scale.\n",
+ " - [.background_gradient][bgfunc]: a flexible method for highlighting cells based on their, or other, values on a numeric scale.\n",
" - [.text_gradient][textfunc]: similar method for highlighting text based on their, or other, values on a numeric scale.\n",
" - [.bar][barfunc]: to display mini-charts within cell backgrounds.\n",
" \n",
@@ -1131,7 +1131,7 @@
"source": [
"df2.iloc[0,2] = np.nan\n",
"df2.iloc[4,3] = np.nan\n",
- "df2.loc[:4].style.highlight_null(null_color='yellow')"
+ "df2.loc[:4].style.highlight_null(color='yellow')"
]
},
{
@@ -1196,7 +1196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap."
+ "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](http://seaborn.pydata.org/) to get a nice colormap."
]
},
{
@@ -1577,6 +1577,9 @@
"Some support (*since version 0.20.0*) is available for exporting styled `DataFrames` to Excel worksheets using the `OpenPyXL` or `XlsxWriter` engines. CSS2.2 properties handled include:\n",
"\n",
"- `background-color`\n",
+ "- `border-style` properties\n",
+ "- `border-width` properties\n",
+ "- `border-color` properties\n",
"- `color`\n",
"- `font-family`\n",
"- `font-style`\n",
@@ -1587,12 +1590,13 @@
"- `white-space: nowrap`\n",
"\n",
"\n",
- "- Currently broken: `border-style`, `border-width`, `border-color` and their {`top`, `right`, `bottom`, `left` variants}\n",
+ "- Shorthand and side-specific border properties are supported (e.g. `border-style` and `border-left-style`) as well as the `border` shorthands for all sides (`border: 1px solid green`) or specified sides (`border-left: 1px solid green`). Using a `border` shorthand will override any border properties set before it (See [CSS Working Group](https://drafts.csswg.org/css-backgrounds/#border-shorthands) for more details)\n",
"\n",
"\n",
"- Only CSS2 named colors and hex colors of the form `#rgb` or `#rrggbb` are currently supported.\n",
- "- The following pseudo CSS properties are also available to set excel specific style properties:\n",
+ "- The following pseudo CSS properties are also available to set Excel specific style properties:\n",
" - `number-format`\n",
+ " - `border-style` (for Excel-specific styles: \"hair\", \"mediumDashDot\", \"dashDotDot\", \"mediumDashDotDot\", \"dashDot\", \"slantDashDot\", or \"mediumDashed\")\n",
"\n",
"Table level styles, and data cell CSS-classes are not included in the export to Excel: individual cells must have their properties mapped by the `Styler.apply` and/or `Styler.applymap` methods."
]
@@ -1759,7 +1763,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the above case the text is blue because the selector `#T_b_ .cls-1` is worth 110 (ID plus class), which takes precendence."
+ "In the above case the text is blue because the selector `#T_b_ .cls-1` is worth 110 (ID plus class), which takes precedence."
]
},
{
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
index 6df234a027ee9..474068e43a4d4 100644
--- a/doc/source/user_guide/timeseries.rst
+++ b/doc/source/user_guide/timeseries.rst
@@ -388,7 +388,7 @@ We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by
.. _timeseries.origin:
-Using the ``origin`` Parameter
+Using the ``origin`` parameter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the ``origin`` parameter, one can specify an alternative starting point for creation
@@ -869,6 +869,7 @@ arithmetic operator (``+``) can be used to perform the shift.
friday + two_business_days
(friday + two_business_days).day_name()
+
Most ``DateOffsets`` have associated frequency strings, or offset aliases, that can be passed
into ``freq`` keyword arguments. The available date offsets and associated frequency strings can be found below:
@@ -1522,7 +1523,7 @@ or calendars with additional rules.
.. _timeseries.advanced_datetime:
-Time series-related instance methods
+Time Series-related instance methods
------------------------------------
Shifting / lagging
@@ -1820,7 +1821,7 @@ to resample based on datetimelike column in the frame, it can passed to the
),
)
df
- df.resample("M", on="date").sum()
+ df.resample("M", on="date")[["a"]].sum()
Similarly, if you instead want to resample by a datetimelike
level of ``MultiIndex``, its name or location can be passed to the
@@ -1828,7 +1829,7 @@ level of ``MultiIndex``, its name or location can be passed to the
.. ipython:: python
- df.resample("M", level="d").sum()
+ df.resample("M", level="d")[["a"]].sum()
.. _timeseries.iterating-label:
@@ -1980,7 +1981,6 @@ frequency. Arithmetic is not allowed between ``Period`` with different ``freq``
p = pd.Period("2012-01", freq="2M")
p + 2
p - 1
- @okexcept
p == pd.Period("2012-01", freq="3M")
@@ -2404,9 +2404,9 @@ you can use the ``tz_convert`` method.
.. warning::
- Be wary of conversions between libraries. For some time zones, ``pytz`` and ``dateutil`` have different
- definitions of the zone. This is more of a problem for unusual time zones than for
- 'standard' zones like ``US/Eastern``.
+ Be wary of conversions between libraries. For some time zones, ``pytz`` and ``dateutil`` have different
+ definitions of the zone. This is more of a problem for unusual time zones than for
+ 'standard' zones like ``US/Eastern``.
.. warning::
@@ -2600,7 +2600,7 @@ Transform nonexistent times to ``NaT`` or shift the times.
.. _timeseries.timezone_series:
-Time zone series operations
+Time zone Series operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A :class:`Series` with time zone **naive** values is
diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst
index fd0af7583f5dc..147981f29476f 100644
--- a/doc/source/user_guide/visualization.rst
+++ b/doc/source/user_guide/visualization.rst
@@ -3,9 +3,14 @@
{{ header }}
*******************
-Chart Visualization
+Chart visualization
*******************
+
+.. note::
+
+ The examples below assume that you're using `Jupyter `_.
+
This section demonstrates visualization through charting. For information on
visualization of tabular data please see the section on `Table Visualization `_.
@@ -272,7 +277,7 @@ horizontal and cumulative histograms can be drawn by
plt.close("all")
See the :meth:`hist ` method and the
-`matplotlib hist documentation `__ for more.
+`matplotlib hist documentation `__ for more.
The existing interface ``DataFrame.hist`` can still be used to plot histograms.
@@ -410,7 +415,7 @@ For example, horizontal and custom-positioned boxplot can be drawn by
See the :meth:`boxplot ` method and the
-`matplotlib boxplot documentation `__ for more.
+`matplotlib boxplot documentation `__ for more.
The existing interface ``DataFrame.boxplot`` can still be used to plot boxplots.
@@ -620,6 +625,7 @@ To plot multiple column groups in a single axes, repeat ``plot`` method specifyi
It is recommended to specify ``color`` and ``label`` keywords to distinguish each group.
.. ipython:: python
+ :okwarning:
ax = df.plot.scatter(x="a", y="b", color="DarkBlue", label="Group 1")
@savefig scatter_plot_repeated.png
@@ -674,7 +680,7 @@ bubble chart using a column of the ``DataFrame`` as the bubble size.
plt.close("all")
See the :meth:`scatter ` method and the
-`matplotlib scatter documentation `__ for more.
+`matplotlib scatter documentation `__ for more.
.. _visualization.hexbin:
@@ -734,7 +740,7 @@ given by column ``z``. The bins are aggregated with NumPy's ``max`` function.
plt.close("all")
See the :meth:`hexbin ` method and the
-`matplotlib hexbin documentation `__ for more.
+`matplotlib hexbin documentation `__ for more.
.. _visualization.pie:
@@ -823,7 +829,7 @@ Also, other keywords supported by :func:`matplotlib.pyplot.pie` can be used.
figsize=(6, 6),
);
-If you pass values whose sum total is less than 1.0, matplotlib draws a semicircle.
+If you pass values whose sum total is less than 1.0, they will be rescaled so that they sum to 1.
.. ipython:: python
:suppress:
@@ -839,7 +845,7 @@ If you pass values whose sum total is less than 1.0, matplotlib draws a semicirc
@savefig series_pie_plot_semi.png
series.plot.pie(figsize=(6, 6));
-See the `matplotlib pie documentation `__ for more.
+See the `matplotlib pie documentation `__ for more.
.. ipython:: python
:suppress:
@@ -956,7 +962,7 @@ for more information. By coloring these curves differently for each class
it is possible to visualize data clustering. Curves belonging to samples
of the same class will usually be closer together and form larger structures.
-**Note**: The "Iris" dataset is available `here `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
@@ -1113,10 +1119,10 @@ unit interval). The point in the plane, where our sample settles to (where the
forces acting on our sample are at an equilibrium) is where a dot representing
our sample will be drawn. Depending on which class that sample belongs to, it will
be colored differently.
-See the R package `Radviz `__
+See the R package `Radviz `__
for more information.
-**Note**: The "Iris" dataset is available `here `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
@@ -1384,7 +1390,7 @@ tick locator methods, it is useful to call the automatic
date tick adjustment from matplotlib for figures whose ticklabels overlap.
See the :meth:`autofmt_xdate ` method and the
-`matplotlib documentation `__ for more.
+`matplotlib documentation `__ for more.
Subplots
~~~~~~~~
@@ -1620,7 +1626,7 @@ as seen in the example below.
There also exists a helper function ``pandas.plotting.table``, which creates a
table from :class:`DataFrame` or :class:`Series`, and adds it to a
``matplotlib.Axes`` instance. This function can accept keywords which the
-matplotlib `table `__ has.
+matplotlib `table `__ has.
.. ipython:: python
@@ -1746,7 +1752,7 @@ Andrews curves charts:
plt.close("all")
-Plotting directly with matplotlib
+Plotting directly with Matplotlib
---------------------------------
In some situations it may still be preferable or necessary to prepare plots
diff --git a/doc/source/user_guide/window.rst b/doc/source/user_guide/window.rst
index 3e533cbadc5f7..e08fa81c5fa09 100644
--- a/doc/source/user_guide/window.rst
+++ b/doc/source/user_guide/window.rst
@@ -3,7 +3,7 @@
{{ header }}
********************
-Windowing Operations
+Windowing operations
********************
pandas contains a compact set of APIs for performing windowing operations - an operation that performs
@@ -287,7 +287,7 @@ and we want to use an expanding window where ``use_expanding`` is ``True`` other
3 3.0
4 10.0
-You can view other examples of ``BaseIndexer`` subclasses `here `__
+You can view other examples of ``BaseIndexer`` subclasses `here `__
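+
+For illustration, a minimal ``BaseIndexer`` subclass might look like the sketch
+below (the ``step`` argument is assumed to match recent pandas versions; pandas
+already ships :class:`pandas.api.indexers.FixedForwardWindowIndexer` for this
+exact pattern):
+
+.. code-block:: python
+
+   import numpy as np
+   import pandas as pd
+   from pandas.api.indexers import BaseIndexer
+
+   class ForwardIndexer(BaseIndexer):
+       # each window covers the current row and the next window_size - 1 rows
+       def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
+           start = np.arange(num_values, dtype=np.int64)
+           end = np.minimum(start + self.window_size, num_values)
+           return start, end
+
+   pd.Series(range(5)).rolling(ForwardIndexer(window_size=2)).sum()
+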
.. versionadded:: 1.1
@@ -427,10 +427,16 @@ can even be omitted:
.. note::
Missing values are ignored and each entry is computed using the pairwise
- complete observations. Please see the :ref:`covariance section
- ` for :ref:`caveats
- ` associated with this method of
- calculating covariance and correlation matrices.
+ complete observations.
+
+ Assuming the missing data are missing at random, this results in an estimate
+ for the covariance matrix which is unbiased. However, for many applications
+ this estimate may not be acceptable because the estimated covariance matrix
+ is not guaranteed to be positive semi-definite. This could lead to
+ estimated correlations having absolute values which are greater than one,
+ and/or a non-invertible covariance matrix. See `Estimation of covariance
+ matrices `_
+ for more details.
.. ipython:: python
@@ -484,7 +490,7 @@ For all supported aggregation functions, see :ref:`api.functions_expanding`.
.. _window.exponentially_weighted:
-Exponentially Weighted window
+Exponentially weighted window
-----------------------------
An exponentially weighted window is similar to an expanding window but with each prior point
@@ -618,13 +624,13 @@ average of ``3, NaN, 5`` would be calculated as
.. math::
- \frac{(1-\alpha)^2 \cdot 3 + 1 \cdot 5}{(1-\alpha)^2 + 1}.
+ \frac{(1-\alpha)^2 \cdot 3 + 1 \cdot 5}{(1-\alpha)^2 + 1}.
Whereas if ``ignore_na=True``, the weighted average would be calculated as
.. math::
- \frac{(1-\alpha) \cdot 3 + 1 \cdot 5}{(1-\alpha) + 1}.
+ \frac{(1-\alpha) \cdot 3 + 1 \cdot 5}{(1-\alpha) + 1}.
The :meth:`~Ewm.var`, :meth:`~Ewm.std`, and :meth:`~Ewm.cov` functions have a ``bias`` argument,
specifying whether the result should contain biased or unbiased statistics.
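A small sketch of the ``bias`` argument:

.. code-block:: python

   s = pd.Series([1.0, 2.0, 3.0, 4.0])
   s.ewm(com=2.0).var(bias=False)  # unbiased statistics, the default
   s.ewm(com=2.0).var(bias=True)   # biased statistics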
diff --git a/doc/source/whatsnew/index.rst b/doc/source/whatsnew/index.rst
index df33174804a33..e2f3b45d47bef 100644
--- a/doc/source/whatsnew/index.rst
+++ b/doc/source/whatsnew/index.rst
@@ -10,12 +10,27 @@ This is the list of changes to pandas between each release. For full details,
see the `commit logs `_. For install and
upgrade instructions, see :ref:`install`.
+Version 1.5
+-----------
+
+.. toctree::
+ :maxdepth: 2
+
+ v1.5.3
+ v1.5.2
+ v1.5.1
+ v1.5.0
+
Version 1.4
-----------
.. toctree::
:maxdepth: 2
+ v1.4.4
+ v1.4.3
+ v1.4.2
+ v1.4.1
v1.4.0
Version 1.3
diff --git a/doc/source/whatsnew/v0.13.0.rst b/doc/source/whatsnew/v0.13.0.rst
index b2596358d0c9d..44223bc694360 100644
--- a/doc/source/whatsnew/v0.13.0.rst
+++ b/doc/source/whatsnew/v0.13.0.rst
@@ -664,7 +664,7 @@ Enhancements
other = pd.DataFrame({'A': [1, 3, 3, 7], 'B': ['e', 'f', 'f', 'e']})
mask = dfi.isin(other)
mask
- dfi[mask.any(1)]
+ dfi[mask.any(axis=1)]
- ``Series`` now supports a ``to_frame`` method to convert it to a single-column DataFrame (:issue:`5164`)
@@ -733,7 +733,7 @@ Enhancements
.. _scipy: http://www.scipy.org
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
-.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
+.. _guide: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html
- ``to_csv`` now takes a ``date_format`` keyword argument that specifies how
output datetime objects should be formatted. Datetimes encountered in the
diff --git a/doc/source/whatsnew/v0.15.0.rst b/doc/source/whatsnew/v0.15.0.rst
index fc2b070df4392..04506f1655c7d 100644
--- a/doc/source/whatsnew/v0.15.0.rst
+++ b/doc/source/whatsnew/v0.15.0.rst
@@ -462,15 +462,15 @@ Rolling/expanding moments improvements
.. code-block:: ipython
- In [51]: ewma(s, com=3., min_periods=2)
- Out[51]:
- 0 NaN
- 1 NaN
- 2 1.000000
- 3 1.000000
- 4 1.571429
- 5 2.189189
- dtype: float64
+ In [51]: pd.ewma(s, com=3., min_periods=2)
+ Out[51]:
+ 0 NaN
+ 1 NaN
+ 2 1.000000
+ 3 1.000000
+ 4 1.571429
+ 5 2.189189
+ dtype: float64
New behavior (note values start at index ``4``, the location of the 2nd (since ``min_periods=2``) non-empty value):
@@ -557,21 +557,21 @@ Rolling/expanding moments improvements
.. code-block:: ipython
- In [89]: ewmvar(s, com=2., bias=False)
- Out[89]:
- 0 -2.775558e-16
- 1 3.000000e-01
- 2 9.556787e-01
- 3 3.585799e+00
- dtype: float64
-
- In [90]: ewmvar(s, com=2., bias=False) / ewmvar(s, com=2., bias=True)
- Out[90]:
- 0 1.25
- 1 1.25
- 2 1.25
- 3 1.25
- dtype: float64
+ In [89]: pd.ewmvar(s, com=2., bias=False)
+ Out[89]:
+ 0 -2.775558e-16
+ 1 3.000000e-01
+ 2 9.556787e-01
+ 3 3.585799e+00
+ dtype: float64
+
+ In [90]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True)
+ Out[90]:
+ 0 1.25
+ 1 1.25
+ 2 1.25
+ 3 1.25
+ dtype: float64
Note that entry ``0`` is approximately 0, and the debiasing factors are a constant 1.25.
By comparison, the following 0.15.0 results have a ``NaN`` for entry ``0``,
diff --git a/doc/source/whatsnew/v0.16.2.rst b/doc/source/whatsnew/v0.16.2.rst
index 40d764e880c9c..c6c134a383e11 100644
--- a/doc/source/whatsnew/v0.16.2.rst
+++ b/doc/source/whatsnew/v0.16.2.rst
@@ -83,7 +83,7 @@ popular ``(%>%)`` pipe operator for R_.
See the :ref:`documentation ` for more. (:issue:`10129`)
-.. _dplyr: https://github.com/hadley/dplyr
+.. _dplyr: https://github.com/tidyverse/dplyr
.. _magrittr: https://github.com/smbache/magrittr
.. _R: http://www.r-project.org
diff --git a/doc/source/whatsnew/v0.17.0.rst b/doc/source/whatsnew/v0.17.0.rst
index 991b9a40d151b..7067407604d24 100644
--- a/doc/source/whatsnew/v0.17.0.rst
+++ b/doc/source/whatsnew/v0.17.0.rst
@@ -363,16 +363,12 @@ Some East Asian countries use Unicode characters its width is corresponding to 2
.. ipython:: python
df = pd.DataFrame({u"国籍": ["UK", u"日本"], u"名前": ["Alice", u"しのぶ"]})
- df;
-
-.. image:: ../_static/option_unicode01.png
+ df
.. ipython:: python
pd.set_option("display.unicode.east_asian_width", True)
- df;
-
-.. image:: ../_static/option_unicode02.png
+ df
For further details, see :ref:`here `
diff --git a/doc/source/whatsnew/v0.17.1.rst b/doc/source/whatsnew/v0.17.1.rst
index 6b0a28ec47568..774d17e6ff6b0 100644
--- a/doc/source/whatsnew/v0.17.1.rst
+++ b/doc/source/whatsnew/v0.17.1.rst
@@ -37,9 +37,7 @@ Conditional HTML formatting
.. warning::
This is a new feature and is under active development.
We'll be adding features and possibly making breaking changes in future
- releases. Feedback is welcome_.
-
-.. _welcome: https://github.com/pandas-dev/pandas/issues/11610
+ releases. Feedback is welcome in :issue:`11610`.
We've added *experimental* support for conditional HTML formatting:
the visual styling of a DataFrame based on the data.
diff --git a/doc/source/whatsnew/v0.18.1.rst b/doc/source/whatsnew/v0.18.1.rst
index 3db00f686d62c..7d9008fdbdecd 100644
--- a/doc/source/whatsnew/v0.18.1.rst
+++ b/doc/source/whatsnew/v0.18.1.rst
@@ -149,8 +149,8 @@ can return a valid boolean indexer or anything which is valid for these indexer'
# callable returns list of labels
df.loc[lambda x: [1, 2], lambda x: ["A", "B"]]
-Indexing with``[]``
-"""""""""""""""""""
+Indexing with ``[]``
+""""""""""""""""""""
Finally, you can use a callable in ``[]`` indexing of Series, DataFrame and Panel.
The callable must return a valid input for ``[]`` indexing depending on its
@@ -166,7 +166,7 @@ without using temporary variable.
.. ipython:: python
bb = pd.read_csv("data/baseball.csv", index_col="id")
- (bb.groupby(["year", "team"]).sum().loc[lambda df: df.r > 100])
+ (bb.groupby(["year", "team"]).sum(numeric_only=True).loc[lambda df: df.r > 100])
.. _whatsnew_0181.partial_string_indexing:
diff --git a/doc/source/whatsnew/v0.19.0.rst b/doc/source/whatsnew/v0.19.0.rst
index 340e1ce9ee1ef..f2fdd23af1297 100644
--- a/doc/source/whatsnew/v0.19.0.rst
+++ b/doc/source/whatsnew/v0.19.0.rst
@@ -271,6 +271,7 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification
such as :func:`to_datetime`.
.. ipython:: python
+ :okwarning:
df = pd.read_csv(StringIO(data), dtype="category")
df.dtypes
@@ -497,8 +498,8 @@ Other enhancements
),
)
df
- df.resample("M", on="date").sum()
- df.resample("M", level="d").sum()
+ df.resample("M", on="date")[["a"]].sum()
+ df.resample("M", level="d")[["a"]].sum()
- The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials `__. See the docs for more details (:issue:`13577`).
- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raising a ``NonExistentTimeError`` (:issue:`13057`)
@@ -1553,7 +1554,7 @@ Bug fixes
- Bug in invalid datetime parsing in ``to_datetime`` and ``DatetimeIndex`` may raise ``TypeError`` rather than ``ValueError`` (:issue:`11169`, :issue:`11287`)
- Bug in ``Index`` created with tz-aware ``Timestamp`` and mismatched ``tz`` option incorrectly coerces timezone (:issue:`13692`)
- Bug in ``DatetimeIndex`` with nanosecond frequency does not include timestamp specified with ``end`` (:issue:`13672`)
-- Bug in ```Series`` when setting a slice with a ``np.timedelta64`` (:issue:`14155`)
+- Bug in ``Series`` when setting a slice with a ``np.timedelta64`` (:issue:`14155`)
- Bug in ``Index`` raises ``OutOfBoundsDatetime`` if ``datetime`` exceeds ``datetime64[ns]`` bounds, rather than coercing to ``object`` dtype (:issue:`13663`)
- Bug in ``Index`` may ignore specified ``datetime64`` or ``timedelta64`` passed as ``dtype`` (:issue:`13981`)
- Bug in ``RangeIndex`` can be created with no arguments rather than raising ``TypeError`` (:issue:`13793`)
diff --git a/doc/source/whatsnew/v0.19.2.rst b/doc/source/whatsnew/v0.19.2.rst
index bba89d78be869..db9d9e65c923d 100644
--- a/doc/source/whatsnew/v0.19.2.rst
+++ b/doc/source/whatsnew/v0.19.2.rst
@@ -18,7 +18,7 @@ We recommend that all users upgrade to this version.
Highlights include:
- Compatibility with Python 3.6
-- Added a `Pandas Cheat Sheet `__. (:issue:`13202`).
+- Added a `Pandas Cheat Sheet `__. (:issue:`13202`).
.. contents:: What's new in v0.19.2
diff --git a/doc/source/whatsnew/v0.20.0.rst b/doc/source/whatsnew/v0.20.0.rst
index 239431b7621c6..faf4b1ac44d5b 100644
--- a/doc/source/whatsnew/v0.20.0.rst
+++ b/doc/source/whatsnew/v0.20.0.rst
@@ -188,7 +188,7 @@ support for bz2 compression in the python 2 C-engine improved (:issue:`14874`).
url = ('https://github.com/%7Brepo%7D/raw/%7Bbranch%7D/%7Bpath%7D'
.format(repo='pandas-dev/pandas',
- branch='master',
+ branch='main',
path='pandas/tests/io/parser/data/salaries.csv.bz2'))
# default, infer compression
df = pd.read_csv(url, sep='\t', compression='infer')
@@ -328,7 +328,7 @@ more information about the data.
You must enable this by setting the ``display.html.table_schema`` option to ``True``.
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
-.. _nteract: http://nteract.io/
+.. _nteract: https://nteract.io/
.. _whatsnew_0200.enhancements.scipy_sparse:
diff --git a/doc/source/whatsnew/v0.21.1.rst b/doc/source/whatsnew/v0.21.1.rst
index 090a988d6406a..e217e1a75efc5 100644
--- a/doc/source/whatsnew/v0.21.1.rst
+++ b/doc/source/whatsnew/v0.21.1.rst
@@ -125,7 +125,7 @@ Indexing
IO
^^
-- Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
+- Bug in :class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
- Bug in :func:`read_csv` when reading a compressed UTF-16 encoded file (:issue:`18071`)
- Bug in :func:`read_csv` for handling null values in index columns when specifying ``na_filter=False`` (:issue:`5239`)
- Bug in :func:`read_csv` when reading numeric category fields with high cardinality (:issue:`18186`)
diff --git a/doc/source/whatsnew/v0.23.0.rst b/doc/source/whatsnew/v0.23.0.rst
index be84c562b3c32..9f24bc8e8ec50 100644
--- a/doc/source/whatsnew/v0.23.0.rst
+++ b/doc/source/whatsnew/v0.23.0.rst
@@ -1126,7 +1126,7 @@ Removal of prior version deprecations/changes
- The ``Panel`` class has dropped the ``to_long`` and ``toLong`` methods (:issue:`19077`)
- The options ``display.line_with`` and ``display.height`` are removed in favor of ``display.width`` and ``display.max_rows`` respectively (:issue:`4391`, :issue:`19107`)
- The ``labels`` attribute of the ``Categorical`` class has been removed in favor of :attr:`Categorical.codes` (:issue:`7768`)
-- The ``flavor`` parameter have been removed from func:`to_sql` method (:issue:`13611`)
+- The ``flavor`` parameter has been removed from the :func:`to_sql` method (:issue:`13611`)
- The modules ``pandas.tools.hashing`` and ``pandas.util.hashing`` have been removed (:issue:`16223`)
- The top-level functions ``pd.rolling_*``, ``pd.expanding_*`` and ``pd.ewm*`` have been removed (Deprecated since v0.18).
Instead, use the DataFrame/Series methods :attr:`~DataFrame.rolling`, :attr:`~DataFrame.expanding` and :attr:`~DataFrame.ewm` (:issue:`18723`)
diff --git a/doc/source/whatsnew/v0.25.0.rst b/doc/source/whatsnew/v0.25.0.rst
index 9cbfa49cc8c5c..e4dd6fa091d80 100644
--- a/doc/source/whatsnew/v0.25.0.rst
+++ b/doc/source/whatsnew/v0.25.0.rst
@@ -342,10 +342,15 @@ Now every group is evaluated only a single time.
*New behavior*:
-.. ipython:: python
-
- df.groupby("a").apply(func)
+.. code-block:: python
+ In [3]: df.groupby('a').apply(func)
+ x
+ y
+ Out[3]:
+ a b
+ 0 x 1
+ 1 y 2
Concatenating sparse values
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1116,7 +1121,7 @@ Indexing
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
- Allow keyword arguments for callable local reference used in the :meth:`DataFrame.query` string (:issue:`26426`)
-- Fixed a ``KeyError`` when indexing a :class:`MultiIndex`` level with a list containing exactly one label, which is missing (:issue:`27148`)
+- Fixed a ``KeyError`` when indexing a :class:`MultiIndex` level with a list containing exactly one label, which is missing (:issue:`27148`)
- Bug which produced ``AttributeError`` on partial matching :class:`Timestamp` in a :class:`MultiIndex` (:issue:`26944`)
- Bug in :class:`Categorical` and :class:`CategoricalIndex` with :class:`Interval` values when using the ``in`` operator (``__contains``) with objects that are not comparable to the values in the ``Interval`` (:issue:`23705`)
- Bug in :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` on a :class:`DataFrame` with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a :class:`Series` (:issue:`27110`)
diff --git a/doc/source/whatsnew/v0.6.0.rst b/doc/source/whatsnew/v0.6.0.rst
index 19e2e85c09a87..5ddcd5d90e65c 100644
--- a/doc/source/whatsnew/v0.6.0.rst
+++ b/doc/source/whatsnew/v0.6.0.rst
@@ -24,7 +24,7 @@ New features
- :ref:`Added ` multiple levels to groupby (:issue:`103`)
- :ref:`Allow ` multiple columns in ``by`` argument of ``DataFrame.sort_index`` (:issue:`92`, :issue:`362`)
- :ref:`Added ` fast ``get_value`` and ``put_value`` methods to DataFrame (:issue:`360`)
-- :ref:`Added ` ``cov`` instance methods to Series and DataFrame (:issue:`194`, :issue:`362`)
+- Added ``cov`` instance methods to Series and DataFrame (:issue:`194`, :issue:`362`)
- :ref:`Added ` ``kind='bar'`` option to ``DataFrame.plot`` (:issue:`348`)
- :ref:`Added ` ``idxmin`` and ``idxmax`` to Series and DataFrame (:issue:`286`)
- :ref:`Added ` ``read_clipboard`` function to parse DataFrame from clipboard (:issue:`300`)
diff --git a/doc/source/whatsnew/v0.6.1.rst b/doc/source/whatsnew/v0.6.1.rst
index 4e72a630ad9f1..58a7d1ee13278 100644
--- a/doc/source/whatsnew/v0.6.1.rst
+++ b/doc/source/whatsnew/v0.6.1.rst
@@ -7,7 +7,7 @@ Version 0.6.1 (December 13, 2011)
New features
~~~~~~~~~~~~
- Can append single rows (as Series) to a DataFrame
-- Add Spearman and Kendall rank :ref:`correlation `
+- Add Spearman and Kendall rank correlation
options to Series.corr and DataFrame.corr (:issue:`428`)
- :ref:`Added ` ``get_value`` and ``set_value`` methods to
Series, DataFrame, and Panel for very low-overhead access (>2x faster in many
@@ -19,7 +19,7 @@ New features
- Implement new :ref:`SparseArray ` and ``SparseList``
data structures. SparseSeries now derives from SparseArray (:issue:`463`)
- :ref:`Better console printing options ` (:issue:`453`)
-- Implement fast :ref:`data ranking ` for Series and
+- Implement fast data ranking for Series and
DataFrame, fast versions of scipy.stats.rankdata (:issue:`428`)
- Implement ``DataFrame.from_items`` alternate
constructor (:issue:`444`)
diff --git a/doc/source/whatsnew/v0.7.0.rst b/doc/source/whatsnew/v0.7.0.rst
index 1b947030ab8ab..1ee6a9899a655 100644
--- a/doc/source/whatsnew/v0.7.0.rst
+++ b/doc/source/whatsnew/v0.7.0.rst
@@ -190,11 +190,11 @@ been added:
:header: "Method","Description"
:widths: 40,60
- ``Series.iget_value(i)``, Retrieve value stored at location ``i``
- ``Series.iget(i)``, Alias for ``iget_value``
- ``DataFrame.irow(i)``, Retrieve the ``i``-th row
- ``DataFrame.icol(j)``, Retrieve the ``j``-th column
- "``DataFrame.iget_value(i, j)``", Retrieve the value at row ``i`` and column ``j``
+ ``Series.iget_value(i)``, Retrieve value stored at location ``i``
+ ``Series.iget(i)``, Alias for ``iget_value``
+ ``DataFrame.irow(i)``, Retrieve the ``i``-th row
+ ``DataFrame.icol(j)``, Retrieve the ``j``-th column
+ "``DataFrame.iget_value(i, j)``", Retrieve the value at row ``i`` and column ``j``
API tweaks regarding label-based slicing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/doc/source/whatsnew/v0.8.0.rst b/doc/source/whatsnew/v0.8.0.rst
index 490175914cef1..ce02525a69ace 100644
--- a/doc/source/whatsnew/v0.8.0.rst
+++ b/doc/source/whatsnew/v0.8.0.rst
@@ -145,7 +145,7 @@ Other new features
- Add :ref:`'kde' ` plot option for density plots
- Support for converting DataFrame to R data.frame through rpy2
- Improved support for complex numbers in Series and DataFrame
-- Add :ref:`pct_change ` method to all data structures
+- Add ``pct_change`` method to all data structures
- Add max_colwidth configuration option for DataFrame console output
- :ref:`Interpolate ` Series values using index values
- Can select multiple columns from GroupBy
diff --git a/doc/source/whatsnew/v0.9.1.rst b/doc/source/whatsnew/v0.9.1.rst
index 6b05e5bcded7e..cdc0671feeeb2 100644
--- a/doc/source/whatsnew/v0.9.1.rst
+++ b/doc/source/whatsnew/v0.9.1.rst
@@ -54,52 +54,53 @@ New features
- DataFrame has new ``where`` and ``mask`` methods to select values according to a
given boolean mask (:issue:`2109`, :issue:`2151`)
- DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the ``[]``).
- The returned DataFrame has the same number of columns as the original, but is sliced on its index.
+ DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the ``[]``).
+ The returned DataFrame has the same number of columns as the original, but is sliced on its index.
.. ipython:: python
- df = DataFrame(np.random.randn(5, 3), columns = ['A','B','C'])
+ df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
- df
+ df
- df[df['A'] > 0]
+ df[df['A'] > 0]
- If a DataFrame is sliced with a DataFrame based boolean condition (with the same size as the original DataFrame),
- then a DataFrame the same size (index and columns) as the original is returned, with
- elements that do not meet the boolean condition as ``NaN``. This is accomplished via
- the new method ``DataFrame.where``. In addition, ``where`` takes an optional ``other`` argument for replacement.
+ If a DataFrame is sliced with a DataFrame based boolean condition (with the same size as the original DataFrame),
+ then a DataFrame the same size (index and columns) as the original is returned, with
+ elements that do not meet the boolean condition as ``NaN``. This is accomplished via
+ the new method ``DataFrame.where``. In addition, ``where`` takes an optional ``other`` argument for replacement.
- .. ipython:: python
+ .. ipython:: python
- df[df>0]
+ df[df > 0]
- df.where(df>0)
+ df.where(df > 0)
- df.where(df>0,-df)
+ df.where(df > 0, -df)
- Furthermore, ``where`` now aligns the input boolean condition (ndarray or DataFrame), such that partial selection
- with setting is possible. This is analogous to partial setting via ``.ix`` (but on the contents rather than the axis labels)
+ Furthermore, ``where`` now aligns the input boolean condition (ndarray or DataFrame), such that partial selection
+ with setting is possible. This is analogous to partial setting via ``.ix`` (but on the contents rather than the axis labels)
- .. ipython:: python
+ .. ipython:: python
- df2 = df.copy()
- df2[ df2[1:4] > 0 ] = 3
- df2
+ df2 = df.copy()
+ df2[df2[1:4] > 0] = 3
+ df2
- ``DataFrame.mask`` is the inverse boolean operation of ``where``.
+ ``DataFrame.mask`` is the inverse boolean operation of ``where``.
- .. ipython:: python
+ .. ipython:: python
- df.mask(df<=0)
+ df.mask(df <= 0)
- Enable referencing of Excel columns by their column names (:issue:`1936`)
- .. ipython:: python
+ .. code-block:: ipython
+
+ In [1]: xl = pd.ExcelFile('data/test.xls')
- xl = pd.ExcelFile('data/test.xls')
- xl.parse('Sheet1', index_col=0, parse_dates=True,
- parse_cols='A:D')
+ In [2]: xl.parse('Sheet1', index_col=0, parse_dates=True,
+ parse_cols='A:D')
- Added option to disable pandas-style tick locators and formatters
diff --git a/doc/source/whatsnew/v1.0.0.rst b/doc/source/whatsnew/v1.0.0.rst
index 03dfe475475a1..2ab0af46cda88 100755
--- a/doc/source/whatsnew/v1.0.0.rst
+++ b/doc/source/whatsnew/v1.0.0.rst
@@ -525,7 +525,7 @@ Use :meth:`arrays.IntegerArray.to_numpy` with an explicit ``na_value`` instead.
a.to_numpy(dtype="float", na_value=np.nan)
-**Reductions can return ``pd.NA``**
+**Reductions can return** ``pd.NA``
When performing a reduction such as a sum with ``skipna=False``, the result
will now be ``pd.NA`` instead of ``np.nan`` in presence of missing values
diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 9f3ccb3e14116..e1f54c439ae9b 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -265,7 +265,7 @@ SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.
The existing capability to interface with S3 and GCS will be unaffected by this
change, as ``fsspec`` will still bring in the same packages as before.
-.. _Azure Datalake and Blob: https://github.com/dask/adlfs
+.. _Azure Datalake and Blob: https://github.com/fsspec/adlfs
.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/
@@ -665,9 +665,9 @@ the previous index (:issue:`32240`).
In [4]: result
Out[4]:
min_val
- 0 x
- 1 y
- 2 z
+ 0 x
+ 1 y
+ 2 z
*New behavior*:
diff --git a/doc/source/whatsnew/v1.2.0.rst b/doc/source/whatsnew/v1.2.0.rst
index 3d3ec53948a01..49f9abd99db53 100644
--- a/doc/source/whatsnew/v1.2.0.rst
+++ b/doc/source/whatsnew/v1.2.0.rst
@@ -354,6 +354,7 @@ of columns could result in a larger Series result. See (:issue:`37799`).
*New behavior*:
.. ipython:: python
+ :okwarning:
In [5]: df.all(bool_only=True)
diff --git a/doc/source/whatsnew/v1.3.0.rst b/doc/source/whatsnew/v1.3.0.rst
index ed66861efad93..a392aeb5274c2 100644
--- a/doc/source/whatsnew/v1.3.0.rst
+++ b/doc/source/whatsnew/v1.3.0.rst
@@ -811,7 +811,7 @@ Other Deprecations
- Deprecated allowing scalars to be passed to the :class:`Categorical` constructor (:issue:`38433`)
- Deprecated constructing :class:`CategoricalIndex` without passing list-like data (:issue:`38944`)
- Deprecated allowing subclass-specific keyword arguments in the :class:`Index` constructor, use the specific subclass directly instead (:issue:`14093`, :issue:`21311`, :issue:`22315`, :issue:`26974`)
-- Deprecated the :meth:`astype` method of datetimelike (``timedelta64[ns]``, ``datetime64[ns]``, ``Datetime64TZDtype``, ``PeriodDtype``) to convert to integer dtypes, use ``values.view(...)`` instead (:issue:`38544`)
+- Deprecated the :meth:`astype` method of datetimelike (``timedelta64[ns]``, ``datetime64[ns]``, ``Datetime64TZDtype``, ``PeriodDtype``) to convert to integer dtypes, use ``values.view(...)`` instead (:issue:`38544`). This deprecation was later reverted in pandas 1.4.0.
- Deprecated :meth:`MultiIndex.is_lexsorted` and :meth:`MultiIndex.lexsort_depth`, use :meth:`MultiIndex.is_monotonic_increasing` instead (:issue:`32259`)
- Deprecated keyword ``try_cast`` in :meth:`Series.where`, :meth:`Series.mask`, :meth:`DataFrame.where`, :meth:`DataFrame.mask`; cast results manually if desired (:issue:`38836`)
- Deprecated comparison of :class:`Timestamp` objects with ``datetime.date`` objects. Instead of e.g. ``ts <= mydate`` use ``ts <= pd.Timestamp(mydate)`` or ``ts.date() <= mydate`` (:issue:`36131`)
diff --git a/doc/source/whatsnew/v1.4.0.rst b/doc/source/whatsnew/v1.4.0.rst
index 54fb6555fa05d..697070e50a40a 100644
--- a/doc/source/whatsnew/v1.4.0.rst
+++ b/doc/source/whatsnew/v1.4.0.rst
@@ -1,6 +1,6 @@
.. _whatsnew_140:
-What's new in 1.4.0 (January ??, 2022)
+What's new in 1.4.0 (January 22, 2022)
--------------------------------------
These are the changes in pandas 1.4.0. See :ref:`release` for a full changelog
@@ -20,7 +20,8 @@ Enhancements
Improved warning messages
^^^^^^^^^^^^^^^^^^^^^^^^^
-Previously, warning messages may have pointed to lines within the pandas library. Running the script ``setting_with_copy_warning.py``
+Previously, warning messages may have pointed to lines within the pandas
+library. Running the script ``setting_with_copy_warning.py``
.. code-block:: python
@@ -34,7 +35,10 @@ with pandas 1.3 resulted in::
.../site-packages/pandas/core/indexing.py:1951: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
-This made it difficult to determine where the warning was being generated from. Now pandas will inspect the call stack, reporting the first line outside of the pandas library that gave rise to the warning. The output of the above script is now::
+This made it difficult to determine where the warning was being generated from.
+Now pandas will inspect the call stack, reporting the first line outside of the
+pandas library that gave rise to the warning. The output of the above script is
+now::
setting_with_copy_warning.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
@@ -47,8 +51,9 @@ This made it difficult to determine where the warning was being generated from.
Index can hold arbitrary ExtensionArrays
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Until now, passing a custom :class:`ExtensionArray` to ``pd.Index`` would cast the
-array to ``object`` dtype. Now :class:`Index` can directly hold arbitrary ExtensionArrays (:issue:`43930`).
+Until now, passing a custom :class:`ExtensionArray` to ``pd.Index`` would cast
+the array to ``object`` dtype. Now :class:`Index` can directly hold arbitrary
+ExtensionArrays (:issue:`43930`).
*Previous behavior*:
@@ -87,39 +92,45 @@ Styler
- Styling and formatting of indexes has been added, with :meth:`.Styler.apply_index`, :meth:`.Styler.applymap_index` and :meth:`.Styler.format_index`. These mirror the signature of the methods already used to style and format data values, and work with both HTML, LaTeX and Excel format (:issue:`41893`, :issue:`43101`, :issue:`41993`, :issue:`41995`)
- The new method :meth:`.Styler.hide` deprecates :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns` (:issue:`43758`)
- - The keyword arguments ``level`` and ``names`` have been added to :meth:`.Styler.hide` (and implicitly to the deprecated methods :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns`) for additional control of visibility of MultiIndexes and of index names (:issue:`25475`, :issue:`43404`, :issue:`43346`)
+ - The keyword arguments ``level`` and ``names`` have been added to :meth:`.Styler.hide` (and implicitly to the deprecated methods :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns`) for additional control of visibility of MultiIndexes and of Index names (:issue:`25475`, :issue:`43404`, :issue:`43346`)
- The :meth:`.Styler.export` and :meth:`.Styler.use` have been updated to address all of the added functionality from v1.2.0 and v1.3.0 (:issue:`40675`)
- - Global options under the category ``pd.options.styler`` have been extended to configure default ``Styler`` properties which address formatting, encoding, and HTML and LaTeX rendering. Note that formerly ``Styler`` relied on ``display.html.use_mathjax``, which has now been replaced by ``styler.html.mathjax``. (:issue:`41395`)
+ - Global options under the category ``pd.options.styler`` have been extended to configure default ``Styler`` properties which address formatting, encoding, and HTML and LaTeX rendering. Note that formerly ``Styler`` relied on ``display.html.use_mathjax``, which has now been replaced by ``styler.html.mathjax`` (:issue:`41395`)
- Validation of certain keyword arguments, e.g. ``caption`` (:issue:`43368`)
- Various bug fixes as recorded below
Additionally there are specific enhancements to the HTML specific rendering:
- - :meth:`.Styler.bar` introduces additional arguments to control alignment and display (:issue:`26070`, :issue:`36419`), and it also validates the input arguments ``width`` and ``height`` (:issue:`42511`).
- - :meth:`.Styler.to_html` introduces keyword arguments ``sparse_index``, ``sparse_columns``, ``bold_headers``, ``caption``, ``max_rows`` and ``max_columns`` (:issue:`41946`, :issue:`43149`, :issue:`42972`).
+ - :meth:`.Styler.bar` introduces additional arguments to control alignment and display (:issue:`26070`, :issue:`36419`), and it also validates the input arguments ``width`` and ``height`` (:issue:`42511`)
+ - :meth:`.Styler.to_html` introduces keyword arguments ``sparse_index``, ``sparse_columns``, ``bold_headers``, ``caption``, ``max_rows`` and ``max_columns`` (:issue:`41946`, :issue:`43149`, :issue:`42972`)
- :meth:`.Styler.to_html` omits CSSStyle rules for hidden table elements as a performance enhancement (:issue:`43619`)
- Custom CSS classes can now be directly specified without string replacement (:issue:`43686`)
- Ability to render hyperlinks automatically via a new ``hyperlinks`` formatting keyword argument (:issue:`45058`)
There are also some LaTeX specific enhancements:
- - :meth:`.Styler.to_latex` introduces keyword argument ``environment``, which also allows a specific "longtable" entry through a separate jinja2 template (:issue:`41866`).
+ - :meth:`.Styler.to_latex` introduces keyword argument ``environment``, which also allows a specific "longtable" entry through a separate jinja2 template (:issue:`41866`)
- Naive sparsification is now possible for LaTeX without the necessity of including the multirow package (:issue:`43369`)
+ - *cline* support has been added for :class:`MultiIndex` row sparsification through a keyword argument (:issue:`45138`)
.. _whatsnew_140.enhancements.pyarrow_csv_engine:
-Multithreaded CSV reading with a new CSV Engine based on pyarrow
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Multi-threaded CSV reading with a new CSV Engine based on pyarrow
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:func:`pandas.read_csv` now accepts ``engine="pyarrow"`` (requires at least ``pyarrow`` 1.0.1) as an argument, allowing for faster csv parsing on multicore machines
-with pyarrow installed. See the :doc:`I/O docs ` for more info. (:issue:`23697`, :issue:`43706`)
+:func:`pandas.read_csv` now accepts ``engine="pyarrow"`` (requires at least
+``pyarrow`` 1.0.1) as an argument, allowing for faster CSV parsing on multicore
+machines with pyarrow installed. See the :doc:`I/O docs ` for
+more info (:issue:`23697`, :issue:`43706`).
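+
+For example (a minimal sketch, assuming pyarrow is installed and a hypothetical
+local ``data.csv`` file):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   df = pd.read_csv("data.csv", engine="pyarrow")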
.. _whatsnew_140.enhancements.window_rank:
Rank function for rolling and expanding windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Added ``rank`` function to :class:`Rolling` and :class:`Expanding`. The new function supports the ``method``, ``ascending``, and ``pct`` flags of :meth:`DataFrame.rank`. The ``method`` argument supports ``min``, ``max``, and ``average`` ranking methods.
+Added ``rank`` function to :class:`Rolling` and :class:`Expanding`. The new
+function supports the ``method``, ``ascending``, and ``pct`` flags of
+:meth:`DataFrame.rank`. The ``method`` argument supports ``min``, ``max``, and
+``average`` ranking methods.
Example:
.. ipython:: python
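    # A minimal sketch (hypothetical data) of the new rank method;
    # it mirrors the method/ascending/pct options of DataFrame.rank.
    s = pd.Series([1, 4, 2, 3, 5, 3])
    s.rolling(3).rank()
    s.rolling(3).rank(method="max")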
@@ -134,10 +145,12 @@ Example:
Groupby positional indexing
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-It is now possible to specify positional ranges relative to the ends of each group.
+It is now possible to specify positional ranges relative to the ends of each
+group.
-Negative arguments for :meth:`.GroupBy.head` and :meth:`.GroupBy.tail` now work correctly and result in ranges relative to the end and start of each group, respectively.
-Previously, negative arguments returned empty frames.
+Negative arguments for :meth:`.GroupBy.head` and :meth:`.GroupBy.tail` now work
+correctly and result in ranges relative to the end and start of each group,
+respectively. Previously, negative arguments returned empty frames.
.. ipython:: python
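    # A minimal sketch (hypothetical frame): head(-1) now keeps every row
    # of each group except the last, instead of returning an empty frame.
    df = pd.DataFrame([["g", 0], ["g", 1], ["g", 2], ["h", 3], ["h", 4]],
                      columns=["A", "B"])
    df.groupby("A").head(-1)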
@@ -166,10 +179,11 @@ Previously, negative arguments returned empty frames.
DataFrame.from_dict and DataFrame.to_dict have new ``'tight'`` option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-A new ``'tight'`` dictionary format that preserves :class:`MultiIndex` entries and names
-is now available with the :meth:`DataFrame.from_dict` and :meth:`DataFrame.to_dict` methods
-and can be used with the standard ``json`` library to produce a tight
-representation of :class:`DataFrame` objects (:issue:`4889`).
+A new ``'tight'`` dictionary format that preserves :class:`MultiIndex` entries
+and names is now available with the :meth:`DataFrame.from_dict` and
+:meth:`DataFrame.to_dict` methods and can be used with the standard ``json``
+library to produce a tight representation of :class:`DataFrame` objects
+(:issue:`4889`).
.. ipython:: python
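    # A minimal sketch: round-trip a MultiIndexed frame through the
    # 'tight' format, which preserves index/column entries and names.
    df = pd.DataFrame(
        {"x": [1, 2]},
        index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["n1", "n2"]),
    )
    d = df.to_dict(orient="tight")
    d
    pd.DataFrame.from_dict(d, orient="tight")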
@@ -187,41 +201,42 @@ representation of :class:`DataFrame` objects (:issue:`4889`).
Other enhancements
^^^^^^^^^^^^^^^^^^
-- :meth:`concat` will preserve the ``attrs`` when it is the same for all objects and discard the ``attrs`` when they are different. (:issue:`41828`)
+- :meth:`concat` will preserve the ``attrs`` when they are the same for all objects and discard the ``attrs`` when they are different (:issue:`41828`)
- :class:`DataFrameGroupBy` operations with ``as_index=False`` now correctly retain ``ExtensionDtype`` dtypes for columns being grouped on (:issue:`41373`)
- Add support for assigning values to ``by`` argument in :meth:`DataFrame.plot.hist` and :meth:`DataFrame.plot.box` (:issue:`15079`)
- :meth:`Series.sample`, :meth:`DataFrame.sample`, and :meth:`.GroupBy.sample` now accept a ``np.random.Generator`` as input to ``random_state``. A generator will be more performant, especially with ``replace=False`` (:issue:`38100`)
-- :meth:`Series.ewm`, :meth:`DataFrame.ewm`, now support a ``method`` argument with a ``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`. See :ref:`Window Overview ` for performance and functional benefits (:issue:`42273`)
+- :meth:`Series.ewm` and :meth:`DataFrame.ewm` now support a ``method`` argument with a ``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`. See :ref:`Window Overview ` for performance and functional benefits (:issue:`42273`)
- :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` now support the argument ``skipna`` (:issue:`34047`)
- :meth:`read_table` now supports the argument ``storage_options`` (:issue:`39167`)
-- :meth:`DataFrame.to_stata` and :meth:`StataWriter` now accept the keyword only argument ``value_labels`` to save labels for non-categorical columns
+- :meth:`DataFrame.to_stata` and :meth:`StataWriter` now accept the keyword-only argument ``value_labels`` to save labels for non-categorical columns (:issue:`38454`)
- Methods that relied on hashmap based algos such as :meth:`DataFrameGroupBy.value_counts`, :meth:`DataFrameGroupBy.count` and :func:`factorize` ignored imaginary component for complex numbers (:issue:`17927`)
- Add :meth:`Series.str.removeprefix` and :meth:`Series.str.removesuffix` introduced in Python 3.9 to remove pre-/suffixes from string-type :class:`Series` (:issue:`36944`)
- Attempting to write into a file in missing parent directory with :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_html`, :meth:`DataFrame.to_excel`, :meth:`DataFrame.to_feather`, :meth:`DataFrame.to_parquet`, :meth:`DataFrame.to_stata`, :meth:`DataFrame.to_json`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_xml` now explicitly mentions missing parent directory, the same is true for :class:`Series` counterparts (:issue:`24306`)
- Indexing with ``.loc`` and ``.iloc`` now supports ``Ellipsis`` (:issue:`37750`)
- :meth:`IntegerArray.all` , :meth:`IntegerArray.any`, :meth:`FloatingArray.any`, and :meth:`FloatingArray.all` use Kleene logic (:issue:`41967`)
- Added support for nullable boolean and integer types in :meth:`DataFrame.to_stata`, :class:`~pandas.io.stata.StataWriter`, :class:`~pandas.io.stata.StataWriter117`, and :class:`~pandas.io.stata.StataWriterUTF8` (:issue:`40855`)
-- :meth:`DataFrame.__pos__`, :meth:`DataFrame.__neg__` now retain ``ExtensionDtype`` dtypes (:issue:`43883`)
+- :meth:`DataFrame.__pos__` and :meth:`DataFrame.__neg__` now retain ``ExtensionDtype`` dtypes (:issue:`43883`)
- The error raised when an optional dependency can't be imported now includes the original exception, for easier investigation (:issue:`43882`)
- Added :meth:`.ExponentialMovingWindow.sum` (:issue:`13297`)
- :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
- :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
- Added :meth:`DataFrameGroupBy.value_counts` (:issue:`43564`)
+- :func:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines (:issue:`5686`); see the sketch after this list
- :class:`ExcelWriter` argument ``if_sheet_exists="overlay"`` option added (:issue:`40231`)
- :meth:`read_excel` now accepts a ``decimal`` argument that allow the user to specify the decimal point when parsing string columns to numeric (:issue:`14403`)
-- :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, :meth:`.GroupBy.sum` now supports `Numba `_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
-- :meth:`Timestamp.isoformat`, now handles the ``timespec`` argument from the base :class:``datetime`` class (:issue:`26131`)
+- :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, and :meth:`.GroupBy.sum` now support `Numba `_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
+- :meth:`Timestamp.isoformat` now handles the ``timespec`` argument from the base ``datetime`` class (:issue:`26131`)
- :meth:`NaT.to_numpy` ``dtype`` argument is now respected, so ``np.timedelta64`` can be returned (:issue:`44460`)
- New option ``display.max_dir_items`` customizes the number of columns added to :meth:`Dataframe.__dir__` and suggested for tab completion (:issue:`37996`)
-- Added "Juneteenth National Independence Day" to ``USFederalHolidayCalendar``. See also `Other API changes`_.
-- :meth:`.Rolling.var`, :meth:`.Expanding.var`, :meth:`.Rolling.std`, :meth:`.Expanding.std` now support `Numba `_ execution with the ``engine`` keyword (:issue:`44461`)
+- Added "Juneteenth National Independence Day" to ``USFederalHolidayCalendar`` (:issue:`44574`)
+- :meth:`.Rolling.var`, :meth:`.Expanding.var`, :meth:`.Rolling.std`, and :meth:`.Expanding.std` now support `Numba `_ execution with the ``engine`` keyword (:issue:`44461`)
- :meth:`Series.info` has been added, for compatibility with :meth:`DataFrame.info` (:issue:`5167`)
-- Implemented :meth:`IntervalArray.min`, :meth:`IntervalArray.max`, as a result of which ``min`` and ``max`` now work for :class:`IntervalIndex`, :class:`Series` and :class:`DataFrame` with ``IntervalDtype`` (:issue:`44746`)
+- Implemented :meth:`IntervalArray.min` and :meth:`IntervalArray.max`, as a result of which ``min`` and ``max`` now work for :class:`IntervalIndex`, :class:`Series` and :class:`DataFrame` with ``IntervalDtype`` (:issue:`44746`)
- :meth:`UInt64Index.map` now retains ``dtype`` where possible (:issue:`44609`)
- :meth:`read_json` can now parse unsigned long long integers (:issue:`26068`)
- :meth:`DataFrame.take` now raises a ``TypeError`` when passed a scalar for the indexer (:issue:`42875`)
- :meth:`is_list_like` now identifies duck-arrays as list-like unless ``.ndim == 0`` (:issue:`35131`)
-- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`).
+- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`)
- Add support for `Zstandard `_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`)
- :meth:`DataFrame.to_sql` now returns an ``int`` of the number of written rows (:issue:`23998`)
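+
+A minimal sketch of the ``on_bad_lines`` callable mentioned above (hypothetical
+data; the handler simply truncates over-long rows):
+
+.. code-block:: python
+
+   import io
+
+   import pandas as pd
+
+   def handle_bad_line(bad_line):
+       # keep only the first three fields of a malformed row
+       return bad_line[:3]
+
+   data = "a,b,c\n1,2,3\n4,5,6,7\n"
+   pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=handle_bad_line)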
@@ -239,24 +254,30 @@ These are bug fixes that might have notable behavior changes.
Inconsistent date string parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The ``dayfirst`` option of :func:`to_datetime` isn't strict, and this can lead to surprising behaviour:
+The ``dayfirst`` option of :func:`to_datetime` isn't strict, and this can lead
+to surprising behavior:
.. ipython:: python
:okwarning:
pd.to_datetime(["31-12-2021"], dayfirst=False)
-Now, a warning will be raised if a date string cannot be parsed accordance to the given ``dayfirst`` value when
-the value is a delimited date string (e.g. ``31-12-2012``).
+Now, a warning will be raised if a date string cannot be parsed in accordance
+with the given ``dayfirst`` value when the value is a delimited date string
+(e.g. ``31-12-2012``).
.. _whatsnew_140.notable_bug_fixes.concat_with_empty_or_all_na:
Ignoring dtypes in concat with empty or all-NA columns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. note::
+ This behavior change has been reverted in pandas 1.4.3.
+
When using :func:`concat` to concatenate two or more :class:`DataFrame` objects,
-if one of the DataFrames was empty or had all-NA values, its dtype was *sometimes*
-ignored when finding the concatenated dtype. These are now consistently *not* ignored (:issue:`43507`).
+if one of the DataFrames was empty or had all-NA values, its dtype was
+*sometimes* ignored when finding the concatenated dtype. These are now
+consistently *not* ignored (:issue:`43507`).
.. ipython:: python
@@ -264,7 +285,9 @@ ignored when finding the concatenated dtype. These are now consistently *not* i
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
res = pd.concat([df1, df2])
-Previously, the float-dtype in ``df2`` would be ignored so the result dtype would be ``datetime64[ns]``. As a result, the ``np.nan`` would be cast to ``NaT``.
+Previously, the float-dtype in ``df2`` would be ignored, so the result dtype
+would be ``datetime64[ns]``. As a result, the ``np.nan`` would be cast to
+``NaT``.
*Previous behavior*:
@@ -276,20 +299,30 @@ Previously, the float-dtype in ``df2`` would be ignored so the result dtype woul
0 2013-01-01
1 NaT
-Now the float-dtype is respected. Since the common dtype for these DataFrames is object, the ``np.nan`` is retained.
+Now the float-dtype is respected. Since the common dtype for these DataFrames is
+object, the ``np.nan`` is retained.
*New behavior*:
-.. ipython:: python
+.. code-block:: ipython
+
+ In [4]: res
+ Out[4]:
+ bar
+ 0 2013-01-01 00:00:00
+ 1 NaN
+
- res
.. _whatsnew_140.notable_bug_fixes.value_counts_and_mode_do_not_coerce_to_nan:
Null-values are no longer coerced to NaN-value in value_counts and mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``, ``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This behavior is now consistent with ``unique``, ``isin`` and others (:issue:`42688`).
+:meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``,
+``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This
+behavior is now consistent with ``unique``, ``isin`` and others
+(:issue:`42688`).
.. ipython:: python
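    # A minimal sketch: None and NaT now stay distinct in the result
    # instead of being coerced to NaN.
    s = pd.Series([True, None, pd.NaT, None])
    s.value_counts(dropna=False)
    s.mode(dropna=False)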
@@ -318,11 +351,12 @@ Now null-values are no longer mangled.
.. _whatsnew_140.notable_bug_fixes.read_csv_mangle_dup_cols:
-mangle_dupe_cols in read_csv no longer renaming unique columns conflicting with target names
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+mangle_dupe_cols in read_csv no longer renames unique columns conflicting with target names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:func:`read_csv` no longer renaming unique cols, which conflict with the target names of duplicated columns.
-Already existing columns are jumped, e.g. the next available index is used for the target column name (:issue:`14704`).
+:func:`read_csv` no longer renames unique column labels which conflict with the target
+names of duplicated columns. Already existing columns are skipped, i.e. the next
+available index is used for the target column name (:issue:`14704`).
.. ipython:: python
@@ -331,7 +365,8 @@ Already existing columns are jumped, e.g. the next available index is used for t
data = "a,a,a.1\n1,2,3"
res = pd.read_csv(io.StringIO(data))
-Previously, the second column was called ``a.1``, while the third col was also renamed to ``a.1.1``.
+Previously, the second column was called ``a.1``, while the third column was
+also renamed to ``a.1.1``.
*Previous behavior*:
@@ -342,8 +377,9 @@ Previously, the second column was called ``a.1``, while the third col was also r
a a.1 a.1.1
0 1 2 3
-Now the renaming checks if ``a.1`` already exists when changing the name of the second column and jumps this index. The
-second column is instead renamed to ``a.2``.
+Now the renaming checks if ``a.1`` already exists when changing the name of the
+second column and skips this index. The second column is instead renamed to
+``a.2``.
*New behavior*:
@@ -356,9 +392,10 @@ second column is instead renamed to ``a.2``.
unstack and pivot_table no longer raises ValueError for result that would exceed int32 limit
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Previously :meth:`DataFrame.pivot_table` and :meth:`DataFrame.unstack` would raise a ``ValueError`` if the operation
-could produce a result with more than ``2**31 - 1`` elements. This operation now raises a :class:`errors.PerformanceWarning`
-instead (:issue:`26314`).
+Previously :meth:`DataFrame.pivot_table` and :meth:`DataFrame.unstack` would
+raise a ``ValueError`` if the operation could produce a result with more than
+``2**31 - 1`` elements. This operation now raises a
+:class:`errors.PerformanceWarning` instead (:issue:`26314`).
*Previous behavior*:
@@ -377,11 +414,86 @@ instead (:issue:`26314`).
.. ---------------------------------------------------------------------------
+.. _whatsnew_140.notable_bug_fixes.groupby_apply_mutation:
+
+groupby.apply consistent transform detection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`.GroupBy.apply` is designed to be flexible, allowing users to perform
+aggregations, transformations, filters, and use it with user-defined functions
+that might not fall into any of these categories. As part of this, apply will
+attempt to detect when an operation is a transform, and in such a case, the
+result will have the same index as the input. In order to determine if the
+operation is a transform, pandas compares the input's index to the result's and
+determines if it has been mutated. Previously in pandas 1.3, different code
+paths used different definitions of "mutated": some would use Python's ``is``
+whereas others would test only up to equality.
+
+This inconsistency has been removed; pandas now tests up to equality.
+
+.. ipython:: python
+
+ def func(x):
+ return x.copy()
+
+ df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
+ df
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [3]: df.groupby(['a']).apply(func)
+ Out[3]:
+ a b c
+ a
+ 1 0 1 3 5
+ 2 1 2 4 6
+
+ In [4]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
+ Out[4]:
+ c
+ a b
+ 1 3 5
+ 2 4 6
+
+In the examples above, the first uses a code path where pandas uses ``is`` and
+determines that ``func`` is not a transform, whereas the second tests up to
+equality and determines that ``func`` is a transform. In the first case, the
+result's index is not the same as the input's.
+
+*New behavior*:
+
+.. code-block:: ipython
+
+ In [5]: df.groupby(['a']).apply(func)
+ Out[5]:
+ a b c
+ 0 1 3 5
+ 1 2 4 6
+
+ In [6]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
+ Out[6]:
+ c
+ a b
+ 1 3 5
+ 2 4 6
+
+Now in both cases it is determined that ``func`` is a transform. In each case,
+the result has the same index as the input.
+
.. _whatsnew_140.api_breaking:
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. _whatsnew_140.api_breaking.python:
+
+Increased minimum version for Python
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+pandas 1.4.0 supports Python 3.8 and higher.
+
.. _whatsnew_140.api_breaking.deps:
Increased minimum versions for dependencies
@@ -407,9 +519,12 @@ If installed, we now require:
| mypy (dev) | 0.930 | | X |
+-----------------+-----------------+----------+---------+
-For `optional libraries `_ the general recommendation is to use the latest version.
-The following table lists the lowest version per library that is currently being tested throughout the development of pandas.
-Optional libraries below the lowest tested version may still work, but are not considered supported.
+For `optional libraries
+`_ the general
+recommendation is to use the latest version. The following table lists the
+lowest version per library that is currently being tested throughout the
+development of pandas. Optional libraries below the lowest tested version may
+still work, but are not considered supported.
+-----------------+-----------------+---------+
| Package | Minimum Version | Changed |
@@ -430,6 +545,8 @@ Optional libraries below the lowest tested version may still work, but are not c
+-----------------+-----------------+---------+
| openpyxl | 3.0.3 | X |
+-----------------+-----------------+---------+
+| pandas-gbq | 0.14.0 | X |
++-----------------+-----------------+---------+
| pyarrow | 1.0.1 | X |
+-----------------+-----------------+---------+
| pymysql | 0.10.1 | X |
@@ -452,8 +569,6 @@ Optional libraries below the lowest tested version may still work, but are not c
+-----------------+-----------------+---------+
| xlwt | 1.3.0 | |
+-----------------+-----------------+---------+
-| pandas-gbq | 0.14.0 | X |
-+-----------------+-----------------+---------+
See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
@@ -461,7 +576,7 @@ See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for mor
Other API changes
^^^^^^^^^^^^^^^^^
-- :meth:`Index.get_indexer_for` no longer accepts keyword arguments (other than 'target'); in the past these would be silently ignored if the index was not unique (:issue:`42310`)
+- :meth:`Index.get_indexer_for` no longer accepts keyword arguments (other than ``target``); in the past these would be silently ignored if the index was not unique (:issue:`42310`)
- Change in the position of the ``min_rows`` argument in :meth:`DataFrame.to_string` due to change in the docstring (:issue:`44304`)
- Reduction operations for :class:`DataFrame` or :class:`Series` now raising a ``ValueError`` when ``None`` is passed for ``skipna`` (:issue:`44178`)
- :func:`read_csv` and :func:`read_html` no longer raising an error when one of the header rows consists only of ``Unnamed:`` columns (:issue:`13054`)
@@ -477,7 +592,6 @@ Other API changes
- "Thanksgiving" is now "Thanksgiving Day"
- "Christmas" is now "Christmas Day"
- Added "Juneteenth National Independence Day"
--
.. ---------------------------------------------------------------------------
@@ -491,11 +605,13 @@ Deprecations
Deprecated Int64Index, UInt64Index & Float64Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` have been deprecated
-in favor of the base :class:`Index` class and will be removed in Pandas 2.0 (:issue:`43028`).
+:class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` have been
+deprecated in favor of the base :class:`Index` class and will be removed in
+pandas 2.0 (:issue:`43028`).
-For constructing a numeric index, you can use the base :class:`Index` class instead
-specifying the data type (which will also work on older pandas releases):
+For constructing a numeric index, you can use the base :class:`Index` class
+instead, specifying the data type (which will also work on older pandas
+releases):
.. code-block:: python
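    # A minimal sketch: dtype-explicit construction with the base Index class
    pd.Index([1, 2, 3], dtype="int64")
    pd.Index([1.0, 2.0, 3.0], dtype="float64")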
@@ -514,9 +630,10 @@ checks with checking the ``dtype``:
# with
idx.dtype == "int64"
-Currently, in order to maintain backward compatibility, calls to
-:class:`Index` will continue to return :class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index`
-when given numeric data, but in the future, an :class:`Index` will be returned.
+Currently, in order to maintain backward compatibility, calls to :class:`Index`
+will continue to return :class:`Int64Index`, :class:`UInt64Index` and
+:class:`Float64Index` when given numeric data, but in the future, an
+:class:`Index` will be returned.
*Current behavior*:
@@ -539,11 +656,11 @@ when given numeric data, but in the future, an :class:`Index` will be returned.
.. _whatsnew_140.deprecations.frame_series_append:
-Deprecated Frame.append and Series.append
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Deprecated DataFrame.append and Series.append
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:meth:`DataFrame.append` and :meth:`Series.append` have been deprecated and will be removed in Pandas 2.0.
-Use :func:`pandas.concat` instead (:issue:`35407`).
+:meth:`DataFrame.append` and :meth:`Series.append` have been deprecated and will
+be removed in a future version. Use :func:`pandas.concat` instead (:issue:`35407`).
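+
+For example (a minimal sketch with hypothetical frames), where one previously
+wrote ``df.append(df2)``:
+
+.. code-block:: python
+
+   df = pd.DataFrame({"a": [1, 2]})
+   df2 = pd.DataFrame({"a": [3]})
+   pd.concat([df, df2], ignore_index=True)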
*Deprecated syntax*
@@ -588,28 +705,27 @@ Other Deprecations
- Deprecated ``method`` argument in :meth:`Index.get_loc`, use ``index.get_indexer([label], method=...)`` instead (:issue:`42269`)
- Deprecated treating integer keys in :meth:`Series.__setitem__` as positional when the index is a :class:`Float64Index` not containing the key, a :class:`IntervalIndex` with no entries containing the key, or a :class:`MultiIndex` with leading :class:`Float64Index` level not containing the key (:issue:`33469`)
- Deprecated treating ``numpy.datetime64`` objects as UTC times when passed to the :class:`Timestamp` constructor along with a timezone. In a future version, these will be treated as wall-times. To retain the old behavior, use ``Timestamp(dt64).tz_localize("UTC").tz_convert(tz)`` (:issue:`24559`)
-- Deprecated ignoring missing labels when indexing with a sequence of labels on a level of a MultiIndex (:issue:`42351`)
-- Creating an empty Series without a dtype will now raise a more visible ``FutureWarning`` instead of a ``DeprecationWarning`` (:issue:`30017`)
-- Deprecated the 'kind' argument in :meth:`Index.get_slice_bound`, :meth:`Index.slice_indexer`, :meth:`Index.slice_locs`; in a future version passing 'kind' will raise (:issue:`42857`)
+- Deprecated ignoring missing labels when indexing with a sequence of labels on a level of a :class:`MultiIndex` (:issue:`42351`)
+- Creating an empty :class:`Series` without a ``dtype`` will now raise a more visible ``FutureWarning`` instead of a ``DeprecationWarning`` (:issue:`30017`)
+- Deprecated the ``kind`` argument in :meth:`Index.get_slice_bound`, :meth:`Index.slice_indexer`, and :meth:`Index.slice_locs`; in a future version passing ``kind`` will raise (:issue:`42857`)
- Deprecated dropping of nuisance columns in :class:`Rolling`, :class:`Expanding`, and :class:`EWM` aggregations (:issue:`42738`)
-- Deprecated :meth:`Index.reindex` with a non-unique index (:issue:`42568`)
-- Deprecated :meth:`.Styler.render` in favour of :meth:`.Styler.to_html` (:issue:`42140`)
-- Deprecated :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns` in favour of :meth:`.Styler.hide` (:issue:`43758`)
+- Deprecated :meth:`Index.reindex` with a non-unique :class:`Index` (:issue:`42568`)
+- Deprecated :meth:`.Styler.render` in favor of :meth:`.Styler.to_html` (:issue:`42140`)
+- Deprecated :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns` in favor of :meth:`.Styler.hide` (:issue:`43758`)
- Deprecated passing in a string column label into ``times`` in :meth:`DataFrame.ewm` (:issue:`43265`)
-- Deprecated the 'include_start' and 'include_end' arguments in :meth:`DataFrame.between_time`; in a future version passing 'include_start' or 'include_end' will raise (:issue:`40245`)
-- Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the DataFrame afterwards with ``.squeeze("columns")`` instead. (:issue:`43242`)
+- Deprecated the ``include_start`` and ``include_end`` arguments in :meth:`DataFrame.between_time`; in a future version passing ``include_start`` or ``include_end`` will raise (:issue:`40245`)
+- Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the :class:`DataFrame` afterwards with ``.squeeze("columns")`` instead (:issue:`43242`)
- Deprecated the ``index`` argument to :class:`SparseArray` construction (:issue:`23089`)
- Deprecated the ``closed`` argument in :meth:`date_range` and :meth:`bdate_range` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`)
- Deprecated :meth:`.Rolling.validate`, :meth:`.Expanding.validate`, and :meth:`.ExponentialMovingWindow.validate` (:issue:`43665`)
- Deprecated silent dropping of columns that raised a ``TypeError`` in :class:`Series.transform` and :class:`DataFrame.transform` when used with a dictionary (:issue:`43740`)
- Deprecated silent dropping of columns that raised a ``TypeError``, ``DataError``, and some cases of ``ValueError`` in :meth:`Series.aggregate`, :meth:`DataFrame.aggregate`, :meth:`Series.groupby.aggregate`, and :meth:`DataFrame.groupby.aggregate` when used with a list (:issue:`43740`)
- Deprecated casting behavior when setting timezone-aware value(s) into a timezone-aware :class:`Series` or :class:`DataFrame` column when the timezones do not match. Previously this cast to object dtype. In a future version, the values being inserted will be converted to the series or column's existing timezone (:issue:`37605`)
-- Deprecated casting behavior when passing an item with mismatched-timezone to :meth:`DatetimeIndex.insert`, :meth:`DatetimeIndex.putmask`, :meth:`DatetimeIndex.where` :meth:`DatetimeIndex.fillna`, :meth:`Series.mask`, :meth:`Series.where`, :meth:`Series.fillna`, :meth:`Series.shift`, :meth:`Series.replace`, :meth:`Series.reindex` (and :class:`DataFrame` column analogues). In the past this has cast to object dtype. In a future version, these will cast the passed item to the index or series's timezone (:issue:`37605`,:issue:`44940`)
-- Deprecated the 'errors' keyword argument in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, and :meth:`DataFrame.mask`; in a future version the argument will be removed (:issue:`44294`)
+- Deprecated casting behavior when passing an item with mismatched-timezone to :meth:`DatetimeIndex.insert`, :meth:`DatetimeIndex.putmask`, :meth:`DatetimeIndex.where`, :meth:`DatetimeIndex.fillna`, :meth:`Series.mask`, :meth:`Series.where`, :meth:`Series.fillna`, :meth:`Series.shift`, :meth:`Series.replace`, :meth:`Series.reindex` (and :class:`DataFrame` column analogues). In the past this has cast to object ``dtype``. In a future version, these will cast the passed item to the index or series's timezone (:issue:`37605`, :issue:`44940`)
- Deprecated the ``prefix`` keyword argument in :func:`read_csv` and :func:`read_table`, in a future version the argument will be removed (:issue:`43396`)
-- Deprecated passing non boolean argument to sort in :func:`concat` (:issue:`41518`)
-- Deprecated passing arguments as positional for :func:`read_fwf` other than ``filepath_or_buffer`` (:issue:`41485`):
-- Deprecated passing arguments as positional for :func:`read_xml` other than ``path_or_buffer`` (:issue:`45133`):
+- Deprecated passing a non-boolean argument to ``sort`` in :func:`concat` (:issue:`41518`)
+- Deprecated passing arguments as positional for :func:`read_fwf` other than ``filepath_or_buffer`` (:issue:`41485`)
+- Deprecated passing arguments as positional for :func:`read_xml` other than ``path_or_buffer`` (:issue:`45133`)
- Deprecated passing ``skipna=None`` for :meth:`DataFrame.mad` and :meth:`Series.mad`, pass ``skipna=True`` instead (:issue:`44580`)
- Deprecated the behavior of :func:`to_datetime` with the string "now" with ``utc=False``; in a future version this will match ``Timestamp("now")``, which in turn matches :meth:`Timestamp.now` returning the local time (:issue:`18705`)
- Deprecated :meth:`DateOffset.apply`, use ``offset + other`` instead (:issue:`44522`)
@@ -629,6 +745,7 @@ Other Deprecations
- Deprecated the behavior of :meth:`Timestamp.utcfromtimestamp`, in the future it will return a timezone-aware UTC :class:`Timestamp` (:issue:`22451`)
- Deprecated :meth:`NaT.freq` (:issue:`45071`)
- Deprecated behavior of :class:`Series` and :class:`DataFrame` construction when passed float-dtype data containing ``NaN`` and an integer dtype ignoring the dtype argument; in a future version this will raise (:issue:`40110`)
+- Deprecated the behavior of :meth:`Series.to_frame` and :meth:`Index.to_frame` to ignore the ``name`` argument when ``name=None``. Currently, this means to preserve the existing name, but in the future explicitly passing ``name=None`` will set ``None`` as the name of the column in the resulting DataFrame (:issue:`44212`); see the sketch after this list
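+
+A minimal sketch of the :meth:`Series.to_frame` deprecation noted above:
+passing the name explicitly avoids relying on the deprecated ``name=None``
+behavior:
+
+.. code-block:: python
+
+   s = pd.Series([1, 2], name="x")
+   s.to_frame(name="x")  # explicit name; name=None will mean a None column label in the future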
.. ---------------------------------------------------------------------------
@@ -650,9 +767,9 @@ Performance improvements
- Performance improvement in :meth:`Series.sparse.to_coo` (:issue:`42880`)
- Performance improvement in indexing with a :class:`UInt64Index` (:issue:`43862`)
- Performance improvement in indexing with a :class:`Float64Index` (:issue:`43705`)
-- Performance improvement in indexing with a non-unique Index (:issue:`43792`)
+- Performance improvement in indexing with a non-unique :class:`Index` (:issue:`43792`)
- Performance improvement in indexing with a listlike indexer on a :class:`MultiIndex` (:issue:`43370`)
-- Performance improvement in indexing with a :class:`MultiIndex` indexer on another :class:`MultiIndex` (:issue:43370`)
+- Performance improvement in indexing with a :class:`MultiIndex` indexer on another :class:`MultiIndex` (:issue:`43370`)
- Performance improvement in :meth:`GroupBy.quantile` (:issue:`43469`, :issue:`43725`)
- Performance improvement in :meth:`GroupBy.count` (:issue:`43730`, :issue:`43694`)
- Performance improvement in :meth:`GroupBy.any` and :meth:`GroupBy.all` (:issue:`43675`, :issue:`42841`)
@@ -709,26 +826,28 @@ Datetimelike
- :func:`to_datetime` would silently swap ``MM/DD/YYYY`` and ``DD/MM/YYYY`` formats if the given ``dayfirst`` option could not be respected - now, a warning is raised in the case of delimited date strings (e.g. ``31-12-2012``) (:issue:`12585`)
- Bug in :meth:`date_range` and :meth:`bdate_range` do not return right bound when ``start`` = ``end`` and set is closed on one side (:issue:`43394`)
- Bug in inplace addition and subtraction of :class:`DatetimeIndex` or :class:`TimedeltaIndex` with :class:`DatetimeArray` or :class:`TimedeltaArray` (:issue:`43904`)
-- Bug in in calling ``np.isnan``, ``np.isfinite``, or ``np.isinf`` on a timezone-aware :class:`DatetimeIndex` incorrectly raising ``TypeError`` (:issue:`43917`)
+- Bug in calling ``np.isnan``, ``np.isfinite``, or ``np.isinf`` on a timezone-aware :class:`DatetimeIndex` incorrectly raising ``TypeError`` (:issue:`43917`)
- Bug in constructing a :class:`Series` from datetime-like strings with mixed timezones incorrectly partially-inferring datetime values (:issue:`40111`)
-- Bug in addition with a :class:`Tick` object and a ``np.timedelta64`` object incorrectly raising instead of returning :class:`Timedelta` (:issue:`44474`)
+- Bug in addition of a :class:`Tick` object and a ``np.timedelta64`` object incorrectly raising instead of returning :class:`Timedelta` (:issue:`44474`)
- ``np.maximum.reduce`` and ``np.minimum.reduce`` now correctly return :class:`Timestamp` and :class:`Timedelta` objects when operating on :class:`Series`, :class:`DataFrame`, or :class:`Index` with ``datetime64[ns]`` or ``timedelta64[ns]`` dtype (:issue:`43923`)
- Bug in adding a ``np.timedelta64`` object to a :class:`BusinessDay` or :class:`CustomBusinessDay` object incorrectly raising (:issue:`44532`)
- Bug in :meth:`Index.insert` for inserting ``np.datetime64``, ``np.timedelta64`` or ``tuple`` into :class:`Index` with ``dtype='object'`` with negative loc adding ``None`` and replacing existing value (:issue:`44509`)
+- Bug in :meth:`Timestamp.to_pydatetime` failing to retain the ``fold`` attribute (:issue:`45087`)
- Bug in :meth:`Series.mode` with ``DatetimeTZDtype`` incorrectly returning timezone-naive and ``PeriodDtype`` incorrectly raising (:issue:`41927`)
-- Bug in :class:`DateOffset`` addition with :class:`Timestamp` where ``offset.nanoseconds`` would not be included in the result (:issue:`43968`, :issue:`36589`)
+- Fixed regression in :meth:`~Series.reindex` raising an error when using an incompatible fill value with a datetime-like dtype (or not raising a deprecation warning for using a ``datetime.date`` as fill value) (:issue:`42921`)
+- Bug in :class:`DateOffset` addition with :class:`Timestamp` where ``offset.nanoseconds`` would not be included in the result (:issue:`43968`, :issue:`36589`)
- Bug in :meth:`Timestamp.fromtimestamp` not supporting the ``tz`` argument (:issue:`45083`)
- Bug in :class:`DataFrame` construction from dict of :class:`Series` with mismatched index dtypes sometimes raising depending on the ordering of the passed dict (:issue:`44091`)
- Bug in :class:`Timestamp` hashing during some DST transitions caused a segmentation fault (:issue:`33931` and :issue:`40817`)
Timedelta
^^^^^^^^^
-- Bug in division of all-``NaT`` :class:`TimeDeltaIndex`, :class:`Series` or :class:`DataFrame` column with object-dtype arraylike of numbers failing to infer the result as timedelta64-dtype (:issue:`39750`)
+- Bug in division of all-``NaT`` :class:`TimeDeltaIndex`, :class:`Series` or :class:`DataFrame` column with object-dtype array-like of numbers failing to infer the result as timedelta64-dtype (:issue:`39750`)
- Bug in floor division of ``timedelta64[ns]`` data with a scalar returning garbage values (:issue:`44466`)
-- Bug in :class:`Timedelta` now properly taking into account any nanoseconds contribution of any kwarg (:issue:`43764`)
+- Bug in :class:`Timedelta` now properly taking into account any nanoseconds contribution of any kwarg (:issue:`43764`, :issue:`45227`)
-Timezones
-^^^^^^^^^
+Time Zones
+^^^^^^^^^^
- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`)
- Bug in :meth:`Series.dt.tz_convert` resetting index in a :class:`Series` with :class:`CategoricalIndex` (:issue:`43080`)
- Bug in ``Timestamp`` and ``DatetimeIndex`` incorrectly raising a ``TypeError`` when subtracting two timezone-aware objects with mismatched timezones (:issue:`31793`)
@@ -747,11 +866,11 @@ Numeric
Conversion
^^^^^^^^^^
-- Bug in :class:`UInt64Index` constructor when passing a list containing both positive integers small enough to cast to int64 and integers too large too hold in int64 (:issue:`42201`)
+- Bug in :class:`UInt64Index` constructor when passing a list containing both positive integers small enough to cast to int64 and integers too large to hold in int64 (:issue:`42201`)
- Bug in :class:`Series` constructor returning 0 for missing values with dtype ``int64`` and ``False`` for dtype ``bool`` (:issue:`43017`, :issue:`43018`)
- Bug in constructing a :class:`DataFrame` from a :class:`PandasArray` containing :class:`Series` objects behaving differently than an equivalent ``np.ndarray`` (:issue:`43986`)
- Bug in :class:`IntegerDtype` not allowing coercion from string dtype (:issue:`25472`)
-- Bug in :func:`to_datetime` with ``arg:xr.DataArray`` and ``unit="ns"`` specified raises TypeError (:issue:`44053`)
+- Bug in :func:`to_datetime` with ``arg:xr.DataArray`` and ``unit="ns"`` specified raises ``TypeError`` (:issue:`44053`)
- Bug in :meth:`DataFrame.convert_dtypes` not returning the correct type when a subclass does not overload :meth:`_constructor_sliced` (:issue:`43201`)
- Bug in :meth:`DataFrame.astype` not propagating ``attrs`` from the original :class:`DataFrame` (:issue:`44414`)
- Bug in :meth:`DataFrame.convert_dtypes` result losing ``columns.names`` (:issue:`41435`)
@@ -760,7 +879,7 @@ Conversion
Strings
^^^^^^^
-- Fixed bug in checking for ``string[pyarrow]`` dtype incorrectly raising an ImportError when pyarrow is not installed (:issue:`44276`)
+- Bug in checking for ``string[pyarrow]`` dtype incorrectly raising an ``ImportError`` when pyarrow is not installed (:issue:`44276`)
Interval
^^^^^^^^
@@ -768,18 +887,18 @@ Interval
Indexing
^^^^^^^^
-- Bug in :meth:`Series.rename` when index in Series is MultiIndex and level in rename is provided. (:issue:`43659`)
-- Bug in :meth:`DataFrame.truncate` and :meth:`Series.truncate` when the object's Index has a length greater than one but only one unique value (:issue:`42365`)
+- Bug in :meth:`Series.rename` with a :class:`MultiIndex` when ``level`` is provided (:issue:`43659`)
+- Bug in :meth:`DataFrame.truncate` and :meth:`Series.truncate` when the object's :class:`Index` has a length greater than one but only one unique value (:issue:`42365`)
- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` with a :class:`MultiIndex` when indexing with a tuple in which one of the levels is also a tuple (:issue:`27591`)
-- Bug in :meth:`Series.loc` when with a :class:`MultiIndex` whose first level contains only ``np.nan`` values (:issue:`42055`)
+- Bug in :meth:`Series.loc` with a :class:`MultiIndex` whose first level contains only ``np.nan`` values (:issue:`42055`)
- Bug in indexing on a :class:`Series` or :class:`DataFrame` with a :class:`DatetimeIndex` when passing a string, the return type depended on whether the index was monotonic (:issue:`24892`)
- Bug in indexing on a :class:`MultiIndex` failing to drop scalar levels when the indexer is a tuple containing a datetime-like string (:issue:`42476`)
- Bug in :meth:`DataFrame.sort_values` and :meth:`Series.sort_values` when passing an ascending value, failed to raise or incorrectly raising ``ValueError`` (:issue:`41634`)
- Bug in updating values of :class:`pandas.Series` using boolean index, created by using :meth:`pandas.DataFrame.pop` (:issue:`42530`)
- Bug in :meth:`Index.get_indexer_non_unique` when index contains multiple ``np.nan`` (:issue:`35392`)
-- Bug in :meth:`DataFrame.query` did not handle the degree sign in a backticked column name, such as \`Temp(°C)\`, used in an expression to query a dataframe (:issue:`42826`)
+- Bug in :meth:`DataFrame.query` did not handle the degree sign in a backticked column name, such as \`Temp(°C)\`, used in an expression to query a :class:`DataFrame` (:issue:`42826`)
- Bug in :meth:`DataFrame.drop` where the error message did not show missing labels with commas when raising ``KeyError`` (:issue:`42881`)
-- Bug in :meth:`DataFrame.query` where method calls in query strings led to errors when the ``numexpr`` package was installed. (:issue:`22435`)
+- Bug in :meth:`DataFrame.query` where method calls in query strings led to errors when the ``numexpr`` package was installed (:issue:`22435`)
- Bug in :meth:`DataFrame.nlargest` and :meth:`Series.nlargest` where sorted result did not count indexes containing ``np.nan`` (:issue:`28984`)
- Bug in indexing on a non-unique object-dtype :class:`Index` with an NA scalar (e.g. ``np.nan``) (:issue:`43711`)
- Bug in :meth:`DataFrame.__setitem__` incorrectly writing into an existing column's array rather than setting a new array when the new dtype and the old dtype match (:issue:`43406`)
@@ -802,16 +921,16 @@ Indexing
- Bug in :meth:`IntervalIndex.get_indexer_non_unique` not handling targets of ``dtype`` 'object' with NaNs correctly (:issue:`44482`)
- Fixed regression where a single column ``np.matrix`` was no longer coerced to a 1d ``np.ndarray`` when added to a :class:`DataFrame` (:issue:`42376`)
- Bug in :meth:`Series.__getitem__` with a :class:`CategoricalIndex` of integers treating lists of integers as positional indexers, inconsistent with the behavior with a single scalar integer (:issue:`15470`, :issue:`14865`)
-- Bug in :meth:`Series.__setitem__` when setting floats or integers into integer-dtype series failing to upcast when necessary to retain precision (:issue:`45121`)
+- Bug in :meth:`Series.__setitem__` when setting floats or integers into integer-dtype :class:`Series` failing to upcast when necessary to retain precision (:issue:`45121`)
- Bug in :meth:`DataFrame.iloc.__setitem__` ignores axis argument (:issue:`45032`)
Missing
^^^^^^^
-- Bug in :meth:`DataFrame.fillna` with limit and no method ignores axis='columns' or ``axis = 1`` (:issue:`40989`)
+- Bug in :meth:`DataFrame.fillna` with ``limit`` and no ``method`` ignores ``axis='columns'`` or ``axis = 1`` (:issue:`40989`, :issue:`17399`)
- Bug in :meth:`DataFrame.fillna` not replacing missing values when using a dict-like ``value`` and duplicate column names (:issue:`43476`)
- Bug in constructing a :class:`DataFrame` with a dictionary ``np.datetime64`` as a value and ``dtype='timedelta64[ns]'``, or vice-versa, incorrectly casting instead of raising (:issue:`44428`)
- Bug in :meth:`Series.interpolate` and :meth:`DataFrame.interpolate` with ``inplace=True`` not writing to the underlying array(s) in-place (:issue:`44749`)
-- Bug in :meth:`Index.fillna` incorrectly returning an un-filled :class:`Index` when NA values are present and ``downcast`` argument is specified. This now raises ``NotImplementedError`` instead; do not pass ``downcast`` argument (:issue:`44873`)
+- Bug in :meth:`Index.fillna` incorrectly returning an unfilled :class:`Index` when NA values are present and ``downcast`` argument is specified. This now raises ``NotImplementedError`` instead; do not pass ``downcast`` argument (:issue:`44873`)
- Bug in :meth:`DataFrame.dropna` changing :class:`Index` even if no entries were dropped (:issue:`41965`)
- Bug in :meth:`Series.fillna` with an object-dtype incorrectly ignoring ``downcast="infer"`` (:issue:`44241`)
@@ -830,33 +949,33 @@ I/O
- Bug in :func:`json_normalize` where ``errors=ignore`` could fail to ignore missing values of ``meta`` when ``record_path`` has a length greater than one (:issue:`41876`)
- Bug in :func:`read_csv` with multi-header input and arguments referencing column names as tuples (:issue:`42446`)
- Bug in :func:`read_fwf`, where difference in lengths of ``colspecs`` and ``names`` was not raising ``ValueError`` (:issue:`40830`)
-- Bug in :func:`Series.to_json` and :func:`DataFrame.to_json` where some attributes were skipped when serialising plain Python objects to JSON (:issue:`42768`, :issue:`33043`)
+- Bug in :func:`Series.to_json` and :func:`DataFrame.to_json` where some attributes were skipped when serializing plain Python objects to JSON (:issue:`42768`, :issue:`33043`)
- Column headers are dropped when constructing a :class:`DataFrame` from a sqlalchemy's ``Row`` object (:issue:`40682`)
-- Bug in unpickling a :class:`Index` with object dtype incorrectly inferring numeric dtypes (:issue:`43188`)
-- Bug in :func:`read_csv` where reading multi-header input with unequal lengths incorrectly raising uncontrolled ``IndexError`` (:issue:`43102`)
+- Bug in unpickling an :class:`Index` with object dtype incorrectly inferring numeric dtypes (:issue:`43188`)
+- Bug in :func:`read_csv` where reading multi-header input with unequal lengths incorrectly raised ``IndexError`` (:issue:`43102`)
- Bug in :func:`read_csv` raising ``ParserError`` when reading file in chunks and some chunk blocks have fewer columns than header for ``engine="c"`` (:issue:`21211`)
- Bug in :func:`read_csv`, changed exception class when expecting a file path name or file-like object from ``OSError`` to ``TypeError`` (:issue:`43366`)
- Bug in :func:`read_csv` and :func:`read_fwf` ignoring all ``skiprows`` except first when ``nrows`` is specified for ``engine='python'`` (:issue:`44021`, :issue:`10261`)
- Bug in :func:`read_csv` keeping the original column in object format when ``keep_date_col=True`` is set (:issue:`13378`)
- Bug in :func:`read_json` not handling non-numpy dtypes correctly (especially ``category``) (:issue:`21892`, :issue:`33205`)
- Bug in :func:`json_normalize` where multi-character ``sep`` parameter is incorrectly prefixed to every key (:issue:`43831`)
-- Bug in :func:`json_normalize` where reading data with missing multi-level metadata would not respect errors="ignore" (:issue:`44312`)
+- Bug in :func:`json_normalize` where reading data with missing multi-level metadata would not respect ``errors="ignore"`` (:issue:`44312`)
- Bug in :func:`read_csv` used second row to guess implicit index if ``header`` was set to ``None`` for ``engine="python"`` (:issue:`22144`)
- Bug in :func:`read_csv` not recognizing bad lines when ``names`` were given for ``engine="c"`` (:issue:`22144`)
- Bug in :func:`read_csv` with :code:`float_precision="round_trip"` which did not skip initial/trailing whitespace (:issue:`43713`)
-- Bug when Python is built without lzma module: a warning was raised at the pandas import time, even if the lzma capability isn't used. (:issue:`43495`)
+- Bug when Python is built without the lzma module: a warning was raised at pandas import time, even if the lzma capability isn't used (:issue:`43495`)
- Bug in :func:`read_csv` not applying dtype for ``index_col`` (:issue:`9435`)
- Bug in dumping/loading a :class:`DataFrame` with ``yaml.dump(frame)`` (:issue:`42748`)
-- Bug in :func:`read_csv` raising ``ValueError`` when names was longer than header but equal to data rows for ``engine="python"`` (:issue:`38453`)
+- Bug in :func:`read_csv` raising ``ValueError`` when ``names`` was longer than ``header`` but equal to data rows for ``engine="python"`` (:issue:`38453`)
- Bug in :class:`ExcelWriter`, where ``engine_kwargs`` were not passed through to all engines (:issue:`43442`)
-- Bug in :func:`read_csv` raising ``ValueError`` when ``parse_dates`` was used with ``MultiIndex`` columns (:issue:`8991`)
+- Bug in :func:`read_csv` raising ``ValueError`` when ``parse_dates`` was used with :class:`MultiIndex` columns (:issue:`8991`)
- Bug in :func:`read_csv` not raising a ``ValueError`` when ``\n`` was specified as ``delimiter`` or ``sep`` which conflicts with ``lineterminator`` (:issue:`43528`)
- Bug in :func:`to_csv` converting datetimes in categorical :class:`Series` to integers (:issue:`40754`)
- Bug in :func:`read_csv` converting columns to numeric after date parsing failed (:issue:`11019`)
- Bug in :func:`read_csv` not replacing ``NaN`` values with ``np.nan`` before attempting date conversion (:issue:`26203`)
- Bug in :func:`read_csv` raising ``AttributeError`` when attempting to read a .csv file and infer index column dtype from a nullable integer type (:issue:`44079`)
- Bug in :func:`to_csv` always coercing datetime columns with different formats to the same format (:issue:`21734`)
-- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` with ``compression`` set to ``'zip'`` no longer create a zip file containing a file ending with ".zip". Instead, they try to infer the inner file name more smartly. (:issue:`39465`)
+- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` with ``compression`` set to ``'zip'`` no longer create a zip file containing a file ending with ".zip". Instead, they try to infer the inner file name more smartly (:issue:`39465`)
- Bug in :func:`read_csv` where reading a mixed column of booleans and missing values to a float type results in the missing values becoming 1.0 rather than NaN (:issue:`42808`, :issue:`34120`)
- Bug in :func:`to_xml` raising error for ``pd.NA`` with extension array dtype (:issue:`43903`)
- Bug in :func:`read_csv` where, when simultaneously passing a parser via ``date_parser`` and ``parse_dates=False``, parsing was still called (:issue:`44366`)
@@ -865,6 +984,8 @@ I/O
- Bug in :func:`read_csv` when passing a ``tempfile.SpooledTemporaryFile`` opened in binary mode (:issue:`44748`)
- Bug in :func:`read_json` raising ``ValueError`` when attempting to parse json strings containing "://" (:issue:`36271`)
- Bug in :func:`read_csv` when ``engine="c"`` and ``encoding_errors=None`` which caused a segfault (:issue:`45180`)
+- Bug in :func:`read_csv` where an invalid value of ``usecols`` led to an unclosed file handle (:issue:`45384`)
+- Bug in :meth:`DataFrame.to_json` causing a memory leak (:issue:`43877`)
Period
^^^^^^
@@ -876,16 +997,16 @@ Period
Plotting
^^^^^^^^
-- When given non-numeric data, :meth:`DataFrame.boxplot` now raises a ``ValueError`` rather than a cryptic ``KeyError`` or ``ZeroDivisionError``, in line with other plotting functions like :meth:`DataFrame.hist`. (:issue:`43480`)
+- When given non-numeric data, :meth:`DataFrame.boxplot` now raises a ``ValueError`` rather than a cryptic ``KeyError`` or ``ZeroDivisionError``, in line with other plotting functions like :meth:`DataFrame.hist` (:issue:`43480`)
Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
-- Fixed bug in :meth:`SeriesGroupBy.apply` where passing an unrecognized string argument failed to raise ``TypeError`` when the underlying ``Series`` is empty (:issue:`42021`)
+- Bug in :meth:`SeriesGroupBy.apply` where passing an unrecognized string argument failed to raise ``TypeError`` when the underlying ``Series`` is empty (:issue:`42021`)
- Bug in :meth:`Series.rolling.apply`, :meth:`DataFrame.rolling.apply`, :meth:`Series.expanding.apply` and :meth:`DataFrame.expanding.apply` with ``engine="numba"`` where ``*args`` were being cached with the user passed function (:issue:`42287`)
- Bug in :meth:`GroupBy.max` and :meth:`GroupBy.min` with nullable integer dtypes losing precision (:issue:`41743`)
- Bug in :meth:`DataFrame.groupby.rolling.var` would calculate the rolling variance only on the first group (:issue:`42442`)
-- Bug in :meth:`GroupBy.shift` that would return the grouping columns if ``fill_value`` was not None (:issue:`41556`)
-- Bug in :meth:`SeriesGroupBy.nlargest` and :meth:`SeriesGroupBy.nsmallest` would have an inconsistent index when the input Series was sorted and ``n`` was greater than or equal to all group sizes (:issue:`15272`, :issue:`16345`, :issue:`29129`)
+- Bug in :meth:`GroupBy.shift` that would return the grouping columns if ``fill_value`` was not ``None`` (:issue:`41556`)
+- Bug in :meth:`SeriesGroupBy.nlargest` and :meth:`SeriesGroupBy.nsmallest` would have an inconsistent index when the input :class:`Series` was sorted and ``n`` was greater than or equal to all group sizes (:issue:`15272`, :issue:`16345`, :issue:`29129`)
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`)
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`)
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`)
@@ -894,37 +1015,37 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby.rolling` when specifying ``on`` and calling ``__getitem__`` would subsequently return incorrect results (:issue:`43355`)
- Bug in :meth:`GroupBy.apply` with time-based :class:`Grouper` objects incorrectly raising ``ValueError`` in corner cases where the grouping vector contains a ``NaT`` (:issue:`43500`, :issue:`43515`)
- Bug in :meth:`GroupBy.mean` failing with ``complex`` dtype (:issue:`43701`)
-- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not calculating window bounds correctly for the first row when ``center=True`` and index is decreasing (:issue:`43927`)
-- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` for centered datetimelike windows with uneven nanosecond (:issue:`43997`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not calculating window bounds correctly for the first row when ``center=True`` and index is decreasing (:issue:`43927`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` for centered datetimelike windows with uneven nanoseconds (:issue:`43997`)
- Bug in :meth:`GroupBy.mean` raising ``KeyError`` when column was selected at least twice (:issue:`44924`)
- Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
-- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`3944`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows if the index contains duplicates (:issue:`3944`)
- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`)
-- Bug in :meth:`Groupby.nunique` not respecting ``observed=True`` for Categorical grouping columns (:issue:`45128`)
+- Bug in :meth:`Groupby.nunique` not respecting ``observed=True`` for ``categorical`` grouping columns (:issue:`45128`)
- Bug in :meth:`GroupBy.head` and :meth:`GroupBy.tail` not dropping groups with ``NaN`` when ``dropna=True`` (:issue:`45089`)
-- Fixed bug in :meth:`GroupBy.__iter__` after selecting a subset of columns in a :class:`GroupBy` object, which returned all columns instead of the chosen subset (:issue:`#44821`)
+- Bug in :meth:`GroupBy.__iter__` after selecting a subset of columns in a :class:`GroupBy` object, which returned all columns instead of the chosen subset (:issue:`44821`)
- Bug in :meth:`Groupby.rolling` failing to correctly raise ``ValueError`` when non-monotonic data is passed (:issue:`43909`)
-- Fixed bug where grouping by a :class:`Series` that has a categorical data type and length unequal to the axis of grouping raised ``ValueError`` (:issue:`44179`)
+- Bug where grouping by a :class:`Series` that has a ``categorical`` data type and length unequal to the axis of grouping raised ``ValueError`` (:issue:`44179`)
Reshaping
^^^^^^^^^
- Improved error message when creating a :class:`DataFrame` column from a multi-dimensional :class:`numpy.ndarray` (:issue:`42463`)
-- :func:`concat` creating :class:`MultiIndex` with duplicate level entries when concatenating a :class:`DataFrame` with duplicates in :class:`Index` and multiple keys (:issue:`42651`)
-- Bug in :meth:`pandas.cut` on :class:`Series` with duplicate indices (:issue:`42185`) and non-exact :meth:`pandas.CategoricalIndex` (:issue:`42425`)
+- Bug in :func:`concat` creating :class:`MultiIndex` with duplicate level entries when concatenating a :class:`DataFrame` with duplicates in :class:`Index` and multiple keys (:issue:`42651`)
+- Bug in :meth:`pandas.cut` on :class:`Series` with duplicate indices and non-exact :meth:`pandas.CategoricalIndex` (:issue:`42185`, :issue:`42425`)
- Bug in :meth:`DataFrame.append` failing to retain dtypes when appended columns do not match (:issue:`43392`)
- Bug in :func:`concat` of ``bool`` and ``boolean`` dtypes resulting in ``object`` dtype instead of ``boolean`` dtype (:issue:`42800`)
-- Bug in :func:`crosstab` when inputs are are categorical Series, there are categories that are not present in one or both of the Series, and ``margins=True``. Previously the margin value for missing categories was ``NaN``. It is now correctly reported as 0 (:issue:`43505`)
+- Bug in :func:`crosstab` when inputs are categorical :class:`Series`, there are categories that are not present in one or both of the :class:`Series`, and ``margins=True``. Previously the margin value for missing categories was ``NaN``. It is now correctly reported as 0 (:issue:`43505`)
- Bug in :func:`concat` would fail when the ``objs`` argument all had the same index and the ``keys`` argument contained duplicates (:issue:`43595`)
- Bug in :func:`concat` which ignored the ``sort`` parameter (:issue:`43375`)
-- Fixed bug in :func:`merge` with :class:`MultiIndex` as column index for the ``on`` argument returning an error when assigning a column internally (:issue:`43734`)
+- Bug in :func:`merge` with :class:`MultiIndex` as column index for the ``on`` argument returning an error when assigning a column internally (:issue:`43734`)
- Bug in :func:`crosstab` would fail when inputs are lists or tuples (:issue:`44076`)
- Bug in :meth:`DataFrame.append` failing to retain ``index.name`` when appending a list of :class:`Series` objects (:issue:`44109`)
- Fixed metadata propagation in :meth:`DataFrame.apply` method, consequently fixing the same issue for :meth:`DataFrame.transform`, :meth:`DataFrame.nunique` and :meth:`DataFrame.mode` (:issue:`28283`)
-- Bug in :func:`concat` casting levels of :class:`MultiIndex` to float if the only consist of missing values (:issue:`44900`)
+- Bug in :func:`concat` casting levels of :class:`MultiIndex` to float if all levels only consist of missing values (:issue:`44900`)
- Bug in :meth:`DataFrame.stack` with ``ExtensionDtype`` columns incorrectly raising (:issue:`43561`)
- Bug in :func:`merge` raising ``KeyError`` when joining over differently named indexes with on keywords (:issue:`45094`)
- Bug in :meth:`Series.unstack` with object doing unwanted type inference on resulting columns (:issue:`44595`)
-- Bug in :class:`MultiIndex` failing join operations with overlapping ``IntervalIndex`` levels (:issue:`44096`)
+- Bug in :meth:`MultiIndex.join()` with overlapping ``IntervalIndex`` levels (:issue:`44096`)
- Bug in :meth:`DataFrame.replace` and :meth:`Series.replace` returning results with a different ``dtype`` depending on the ``regex`` parameter (:issue:`44864`)
- Bug in :meth:`DataFrame.pivot` with ``index=None`` when the :class:`DataFrame` index was a :class:`MultiIndex` (:issue:`23955`)
@@ -943,24 +1064,24 @@ ExtensionArray
- NumPy ufuncs ``np.abs``, ``np.positive``, ``np.negative`` now correctly preserve dtype when called on ExtensionArrays that implement ``__abs__, __pos__, __neg__``, respectively. In particular this is fixed for :class:`TimedeltaArray` (:issue:`43899`, :issue:`23316`)
- NumPy ufuncs ``np.minimum.reduce`` ``np.maximum.reduce``, ``np.add.reduce``, and ``np.prod.reduce`` now work correctly instead of raising ``NotImplementedError`` on :class:`Series` with ``IntegerDtype`` or ``FloatDtype`` (:issue:`43923`, :issue:`44793`)
- NumPy ufuncs with ``out`` keyword are now supported by arrays with ``IntegerDtype`` and ``FloatingDtype`` (:issue:`45122`)
-- Avoid raising ``PerformanceWarning`` about fragmented DataFrame when using many columns with an extension dtype (:issue:`44098`)
+- Avoid raising ``PerformanceWarning`` about fragmented :class:`DataFrame` when using many columns with an extension dtype (:issue:`44098`)
- Bug in :class:`IntegerArray` and :class:`FloatingArray` construction incorrectly coercing mismatched NA values (e.g. ``np.timedelta64("NaT")``) to numeric NA (:issue:`44514`)
- Bug in :meth:`BooleanArray.__eq__` and :meth:`BooleanArray.__ne__` raising ``TypeError`` on comparison with an incompatible type (like a string). This caused :meth:`DataFrame.replace` to sometimes raise a ``TypeError`` if a nullable boolean column was included (:issue:`44499`)
- Bug in :func:`array` incorrectly raising when passed a ``ndarray`` with ``float16`` dtype (:issue:`44715`)
- Bug in calling ``np.sqrt`` on :class:`BooleanArray` returning a malformed :class:`FloatingArray` (:issue:`44715`)
-- Bug in :meth:`Series.where` with ``ExtensionDtype`` when ``other`` is a NA scalar incompatible with the series dtype (e.g. ``NaT`` with a numeric dtype) incorrectly casting to a compatible NA value (:issue:`44697`)
-- Fixed bug in :meth:`Series.replace` where explicitly passing ``value=None`` is treated as if no ``value`` was passed, and ``None`` not being in the result (:issue:`36984`, :issue:`19998`)
-- Fixed bug in :meth:`Series.replace` with unwanted downcasting being done in no-op replacements (:issue:`44498`)
-- Fixed bug in :meth:`Series.replace` with ``FloatDtype``, ``string[python]``, or ``string[pyarrow]`` dtype not being preserved when possible (:issue:`33484`, :issue:`40732`, :issue:`31644`, :issue:`41215`, :issue:`25438`)
+- Bug in :meth:`Series.where` with ``ExtensionDtype`` when ``other`` is a NA scalar incompatible with the :class:`Series` dtype (e.g. ``NaT`` with a numeric dtype) incorrectly casting to a compatible NA value (:issue:`44697`)
+- Bug in :meth:`Series.replace` where explicitly passing ``value=None`` is treated as if no ``value`` was passed, and ``None`` not being in the result (:issue:`36984`, :issue:`19998`)
+- Bug in :meth:`Series.replace` with unwanted downcasting being done in no-op replacements (:issue:`44498`)
+- Bug in :meth:`Series.replace` with ``FloatDtype``, ``string[python]``, or ``string[pyarrow]`` dtype not being preserved when possible (:issue:`33484`, :issue:`40732`, :issue:`31644`, :issue:`41215`, :issue:`25438`)
Styler
^^^^^^
-- Minor bug in :class:`.Styler` where the ``uuid`` at initialization maintained a floating underscore (:issue:`43037`)
+- Bug in :class:`.Styler` where the ``uuid`` at initialization maintained a floating underscore (:issue:`43037`)
- Bug in :meth:`.Styler.to_html` where the ``Styler`` object was updated if the ``to_html`` method was called with some args (:issue:`43034`)
- Bug in :meth:`.Styler.copy` where ``uuid`` was not previously copied (:issue:`40675`)
-- Bug in :meth:`Styler.apply` where functions which returned Series objects were not correctly handled in terms of aligning their index labels (:issue:`13657`, :issue:`42014`)
-- Bug when rendering an empty DataFrame with a named index (:issue:`43305`).
-- Bug when rendering a single level MultiIndex (:issue:`43383`).
+- Bug in :meth:`Styler.apply` where functions which returned :class:`Series` objects were not correctly handled in terms of aligning their index labels (:issue:`13657`, :issue:`42014`)
+- Bug when rendering an empty :class:`DataFrame` with a named :class:`Index` (:issue:`43305`)
+- Bug when rendering a single level :class:`MultiIndex` (:issue:`43383`)
- Bug when combining non-sparse rendering and :meth:`.Styler.hide_columns` or :meth:`.Styler.hide_index` (:issue:`43464`)
- Bug setting a table style when using multiple selectors in :class:`.Styler` (:issue:`44011`)
- Bugs where row trimming and column trimming failed to reflect hidden rows (:issue:`43703`, :issue:`44247`)
@@ -971,7 +1092,6 @@ Other
- Bug in :meth:`CustomBusinessMonthBegin.__add__` (:meth:`CustomBusinessMonthEnd.__add__`) not applying the extra ``offset`` parameter when beginning (end) of the target month is already a business day (:issue:`41356`)
- Bug in :meth:`RangeIndex.union` with another ``RangeIndex`` with matching (even) ``step`` and starts differing by strictly less than ``step / 2`` (:issue:`44019`)
- Bug in :meth:`RangeIndex.difference` with ``sort=None`` and ``step<0`` failing to sort (:issue:`44085`)
-- Bug in :meth:`Series.to_frame` and :meth:`Index.to_frame` ignoring the ``name`` argument when ``name=None`` is explicitly passed (:issue:`44212`)
- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` with ``value=None`` and ExtensionDtypes (:issue:`44270`, :issue:`37899`)
- Bug in :meth:`FloatingArray.equals` failing to consider two arrays equal if they contain ``np.nan`` values (:issue:`44382`)
- Bug in :meth:`DataFrame.shift` with ``axis=1`` and ``ExtensionDtype`` columns incorrectly raising when an incompatible ``fill_value`` is passed (:issue:`44564`)
@@ -980,6 +1100,7 @@ Other
- Bug in :meth:`Series.replace` raising ``ValueError`` when using ``regex=True`` with a :class:`Series` containing ``np.nan`` values (:issue:`43344`)
- Bug in :meth:`DataFrame.to_records` where an incorrect ``n`` was used when missing names were replaced by ``level_n`` (:issue:`44818`)
- Bug in :meth:`DataFrame.eval` where ``resolvers`` argument was overriding the default resolvers (:issue:`34966`)
+- :meth:`Series.__repr__` and :meth:`DataFrame.__repr__` no longer replace all null values in indexes with "NaN" but use their real string representations. "NaN" is used only for ``float("nan")`` (:issue:`45263`)
.. ---------------------------------------------------------------------------
@@ -988,4 +1109,4 @@ Other
Contributors
~~~~~~~~~~~~
-.. contributors:: v1.3.5..v1.4.0|HEAD
+.. contributors:: v1.3.5..v1.4.0
diff --git a/doc/source/whatsnew/v1.4.1.rst b/doc/source/whatsnew/v1.4.1.rst
new file mode 100644
index 0000000000000..dd2002bb87648
--- /dev/null
+++ b/doc/source/whatsnew/v1.4.1.rst
@@ -0,0 +1,56 @@
+.. _whatsnew_141:
+
+What's new in 1.4.1 (February 12, 2022)
+---------------------------------------
+
+These are the changes in pandas 1.4.1. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_141.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Regression in :meth:`Series.mask` with ``inplace=True`` and ``PeriodDtype`` and an incompatible ``other`` coercing to a common dtype instead of raising (:issue:`45546`)
+- Regression in :func:`.assert_frame_equal` not respecting ``check_flags=False`` (:issue:`45554`)
+- Regression in :meth:`DataFrame.loc` raising ``ValueError`` when indexing (getting values) on a :class:`MultiIndex` with one level (:issue:`45779`)
+- Regression in :meth:`Series.fillna` with ``downcast=False`` incorrectly downcasting ``object`` dtype (:issue:`45603`)
+- Regression in :func:`api.types.is_bool_dtype` raising an ``AttributeError`` when evaluating a categorical :class:`Series` (:issue:`45615`)
+- Regression in :meth:`DataFrame.iat` where set values did not propagate correctly in subsequent lookups (:issue:`45684`)
+- Regression when setting values with :meth:`DataFrame.loc` losing :class:`Index` name if :class:`DataFrame` was empty before (:issue:`45621`)
+- Regression in :meth:`~Index.join` with overlapping :class:`IntervalIndex` raising an ``InvalidIndexError`` (:issue:`45661`)
+- Regression when setting values with :meth:`Series.loc` raising with an all-``False`` indexer and a :class:`Series` on the right hand side (:issue:`45778`)
+- Regression in :func:`read_sql` with a DBAPI2 connection that is not an instance of ``sqlite3.Connection`` incorrectly requiring SQLAlchemy be installed (:issue:`45660`)
+- Regression in :class:`DateOffset` where constructing with an integer argument and no keywords (e.g. ``pd.DateOffset(n)``) behaved like ``datetime.timedelta(days=0)`` (:issue:`45643`, :issue:`45890`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_141.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Fixed segfault in :meth:`DataFrame.to_json` when dumping tz-aware datetimes in Python 3.10 (:issue:`42130`)
+- Stopped emitting unnecessary ``FutureWarning`` in :meth:`DataFrame.sort_values` with sparse columns (:issue:`45618`)
+- Fixed window aggregations in :meth:`DataFrame.rolling` and :meth:`Series.rolling` to skip over unused elements (:issue:`45647`)
+- Fixed builtin highlighters in :class:`.Styler` to be responsive to ``NA`` with nullable dtypes (:issue:`45804`)
+- Bug in :meth:`~Rolling.apply` with ``axis=1`` raising an erroneous ``ValueError`` (:issue:`45912`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_141.other:
+
+Other
+~~~~~
+- Reverted performance speedup of :meth:`DataFrame.corr` for ``method=pearson`` to fix precision regression (:issue:`45640`, :issue:`42761`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_141.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.4.0..v1.4.1
diff --git a/doc/source/whatsnew/v1.4.2.rst b/doc/source/whatsnew/v1.4.2.rst
new file mode 100644
index 0000000000000..64c36632bfefe
--- /dev/null
+++ b/doc/source/whatsnew/v1.4.2.rst
@@ -0,0 +1,45 @@
+.. _whatsnew_142:
+
+What's new in 1.4.2 (April 2, 2022)
+-----------------------------------
+
+These are the changes in pandas 1.4.2. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_142.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`DataFrame.drop` and :meth:`Series.drop` when :class:`Index` had extension dtype and duplicates (:issue:`45860`)
+- Fixed regression in :func:`read_csv` killing python process when invalid file input was given for ``engine="c"`` (:issue:`45957`)
+- Fixed memory performance regression in :meth:`Series.fillna` when called on a :class:`DataFrame` column with ``inplace=True`` (:issue:`46149`)
+- Provided an alternative solution for passing custom Excel formats in :meth:`.Styler.to_excel`, which was a regression based on stricter CSS validation. Examples available in the documentation for :meth:`.Styler.format` (:issue:`46152`)
+- Fixed regression in :meth:`DataFrame.replace` when a replacement value was also a target for replacement (:issue:`46306`)
+- Fixed regression in :meth:`DataFrame.replace` when the replacement value was explicitly ``None`` when passed in a dictionary to ``to_replace`` (:issue:`45601`, :issue:`45836`)
+- Fixed regression when setting values with :meth:`DataFrame.loc` losing :class:`MultiIndex` names if :class:`DataFrame` was empty before (:issue:`46317`)
+- Fixed regression when rendering boolean datatype columns with :meth:`.Styler` (:issue:`46384`)
+- Fixed regression in :meth:`Groupby.rolling` with a frequency window that would raise a ``ValueError`` even if the datetimes within each group were monotonic (:issue:`46061`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_142.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Fixed some cases for subclasses that define their ``_constructor`` properties as general callables (:issue:`46018`)
+- Fixed "longtable" formatting in :meth:`.Styler.to_latex` when ``column_format`` is given in extended format (:issue:`46037`)
+- Fixed incorrect rendering in :meth:`.Styler.format` with ``hyperlinks="html"`` when the url contains a colon or other special characters (:issue:`46389`)
+- Improved error message in :class:`~pandas.core.window.Rolling` when ``window`` is a frequency and ``NaT`` is in the rolling axis (:issue:`46087`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_142.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.4.1..v1.4.2
diff --git a/doc/source/whatsnew/v1.4.3.rst b/doc/source/whatsnew/v1.4.3.rst
new file mode 100644
index 0000000000000..70b451a231453
--- /dev/null
+++ b/doc/source/whatsnew/v1.4.3.rst
@@ -0,0 +1,72 @@
+.. _whatsnew_143:
+
+What's new in 1.4.3 (June 23, 2022)
+-----------------------------------
+
+These are the changes in pandas 1.4.3. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_143.concat:
+
+Behavior of ``concat`` with empty or all-NA DataFrame columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The behavior change in version 1.4.0 to stop ignoring the data type
+of empty or all-NA columns with float or object dtype in :func:`concat`
+(:ref:`whatsnew_140.notable_bug_fixes.concat_with_empty_or_all_na`) has been
+reverted (:issue:`45637`).
+
+
+.. _whatsnew_143.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`DataFrame.replace` when the replacement value was explicitly ``None`` when passed in a dictionary to ``to_replace`` also casting other columns to object dtype even when there were no values to replace (:issue:`46634`)
+- Fixed regression in :meth:`DataFrame.to_csv` raising error when :class:`DataFrame` contains extension dtype categorical column (:issue:`46297`, :issue:`46812`)
+- Fixed regression in representation of ``dtypes`` attribute of :class:`MultiIndex` (:issue:`46900`)
+- Fixed regression when setting values with :meth:`DataFrame.loc` updating :class:`RangeIndex` when index was set as new column and column was updated afterwards (:issue:`47128`)
+- Fixed regression in :meth:`DataFrame.fillna` and :meth:`DataFrame.update` creating a copy when updating inplace (:issue:`47188`)
+- Fixed regression in :meth:`DataFrame.nsmallest` that led to wrong results when the sorting column has ``np.nan`` values (:issue:`46589`)
+- Fixed regression in :func:`read_fwf` raising ``ValueError`` when ``widths`` was specified with ``usecols`` (:issue:`46580`)
+- Fixed regression in :func:`concat` not sorting columns for mixed column names (:issue:`47127`)
+- Fixed regression in :meth:`.Groupby.transform` and :meth:`.Groupby.agg` failing with ``engine="numba"`` when the index was a :class:`MultiIndex` (:issue:`46867`)
+- Fixed regression in ``NaN`` comparison for :class:`Index` operations where the same object was compared (:issue:`47105`)
+- Fixed regression in :meth:`.Styler.to_latex` and :meth:`.Styler.to_html` where ``buf`` failed in combination with ``encoding`` (:issue:`47053`)
+- Fixed regression in :func:`read_csv` with ``index_col=False`` identifying first row as index names when ``header=None`` (:issue:`46955`)
+- Fixed regression in :meth:`.DataFrameGroupBy.agg` when used with list-likes or dict-likes and ``axis=1`` that would give incorrect results; now raises ``NotImplementedError`` (:issue:`46995`)
+- Fixed regression in :meth:`DataFrame.resample` and :meth:`DataFrame.rolling` when used with list-likes or dict-likes and ``axis=1`` that would raise an unintuitive error message; now raises ``NotImplementedError`` (:issue:`46904`)
+- Fixed regression in :func:`testing.assert_index_equal` when ``check_order=False`` and :class:`Index` has extension or object dtype (:issue:`47207`)
+- Fixed regression in :func:`read_excel` returning ints as floats on certain input sheets (:issue:`46988`)
+- Fixed regression in :meth:`DataFrame.shift` where ``freq`` was ignored when ``axis`` is ``columns`` and ``fill_value`` is absent (:issue:`47039`)
+- Fixed regression in :meth:`DataFrame.to_json` causing a segmentation violation when :class:`DataFrame` is created with an ``index`` parameter of the type :class:`PeriodIndex` (:issue:`46683`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_143.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Bug in :func:`pandas.eval`, :meth:`DataFrame.eval` and :meth:`DataFrame.query` where passing empty ``local_dict`` or ``global_dict`` was treated as passing ``None`` (:issue:`47084`)
+- Most I/O methods no longer suppress ``OSError`` and ``ValueError`` when closing file handles (:issue:`47136`)
+- Improved error message raised by :meth:`DataFrame.from_dict` when passing an invalid ``orient`` parameter (:issue:`47450`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_143.other:
+
+Other
+~~~~~
+- The minimum version of Cython needed to compile pandas is now ``0.29.30`` (:issue:`41935`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_143.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.4.2..v1.4.3
diff --git a/doc/source/whatsnew/v1.4.4.rst b/doc/source/whatsnew/v1.4.4.rst
new file mode 100644
index 0000000000000..56b1254d8a359
--- /dev/null
+++ b/doc/source/whatsnew/v1.4.4.rst
@@ -0,0 +1,65 @@
+.. _whatsnew_144:
+
+What's new in 1.4.4 (August 31, 2022)
+-------------------------------------
+
+These are the changes in pandas 1.4.4. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_144.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`DataFrame.fillna` not working on a :class:`DataFrame` with a :class:`MultiIndex` (:issue:`47649`)
+- Fixed regression in taking NULL ``object`` values from a :class:`DataFrame` causing a segmentation violation. These NULL values are created by :func:`numpy.empty_like` (:issue:`46848`)
+- Fixed regression in :func:`concat` materializing the :class:`Index` during sorting even if the :class:`Index` was already sorted (:issue:`47501`)
+- Fixed regression in :func:`concat` or :func:`merge` handling of all-NaN ExtensionArrays with custom attributes (:issue:`47762`)
+- Fixed regression in calling bitwise numpy ufuncs (for example, ``np.bitwise_and``) on Index objects (:issue:`46769`)
+- Fixed regression in :func:`cut` when using a ``datetime64`` IntervalIndex as bins (:issue:`46218`)
+- Fixed regression in :meth:`DataFrame.select_dtypes` where ``include="number"`` included :class:`BooleanDtype` (:issue:`46870`)
+- Fixed regression in :meth:`DataFrame.loc` raising error when indexing with a ``NamedTuple`` (:issue:`48124`)
+- Fixed regression in :meth:`DataFrame.loc` not updating the cache correctly after values were set (:issue:`47867`)
+- Fixed regression in :meth:`DataFrame.loc` not aligning index in some cases when setting a :class:`DataFrame` (:issue:`47578`)
+- Fixed regression in :meth:`DataFrame.loc` setting a length-1 array-like value to a single value in the DataFrame (:issue:`46268`)
+- Fixed regression when slicing with :meth:`DataFrame.loc` with :class:`DatetimeIndex` with a :class:`.DateOffset` object for its ``freq`` (:issue:`46671`)
+- Fixed regression in setting ``None`` or non-string value into a ``string``-dtype Series using a mask (:issue:`47628`)
+- Fixed regression in updating a DataFrame column through Series ``__setitem__`` (using chained assignment) not updating column values inplace and using too much memory (:issue:`47172`)
+- Fixed regression in :meth:`DataFrame.select_dtypes` returning a view on the original DataFrame (:issue:`48090`)
+- Fixed regression using custom Index subclasses (for example, used in xarray) with :meth:`~DataFrame.reset_index` or :meth:`Index.insert` (:issue:`47071`)
+- Fixed regression in :meth:`~Index.intersection` when the :class:`DatetimeIndex` has dates crossing daylight savings time (:issue:`46702`)
+- Fixed regression in :func:`merge` throwing an error when passing a :class:`Series` with a multi-level name (:issue:`47946`)
+- Fixed regression in :meth:`DataFrame.eval` creating a copy when updating inplace (:issue:`47449`)
+- Fixed regression where getting a row using :meth:`DataFrame.iloc` with :class:`SparseDtype` would raise (:issue:`46406`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_144.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- The ``FutureWarning`` raised when passing arguments (other than ``filepath_or_buffer``) as positional in :func:`read_csv` is now raised at the correct stacklevel (:issue:`47385`)
+- Bug in :meth:`DataFrame.to_sql` raising a ``TypeError`` when ``method`` was a ``callable`` that did not return an ``int`` (:issue:`46891`)
+- Bug in :meth:`.DataFrameGroupBy.value_counts` where ``subset`` had no effect (:issue:`46383`)
+- Bug when getting values with :meth:`DataFrame.loc` with a list of keys causing an internal inconsistency that could lead to a disconnect between ``frame.at[x, y]`` vs ``frame[y].loc[x]`` (:issue:`22372`)
+- Bug in the :meth:`Series.dt.strftime` accessor returning a float instead of object dtype :class:`Series` for all-``NaT`` input, which also caused a spurious deprecation warning (:issue:`45858`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_144.other:
+
+Other
+~~~~~
+- The minimum version of Cython needed to compile pandas is now ``0.29.32`` (:issue:`47978`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_144.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.4.3..v1.4.4|HEAD
diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst
new file mode 100644
index 0000000000000..ecd38555be040
--- /dev/null
+++ b/doc/source/whatsnew/v1.5.0.rst
@@ -0,0 +1,1294 @@
+.. _whatsnew_150:
+
+What's new in 1.5.0 (September 19, 2022)
+----------------------------------------
+
+These are the changes in pandas 1.5.0. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.enhancements:
+
+Enhancements
+~~~~~~~~~~~~
+
+.. _whatsnew_150.enhancements.pandas-stubs:
+
+``pandas-stubs``
+^^^^^^^^^^^^^^^^
+
+The ``pandas-stubs`` library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit
+https://github.com/pandas-dev/pandas-stubs for more information.
+
+We thank VirtusLab and Microsoft for their initial, significant contributions to ``pandas-stubs``.
+
+.. _whatsnew_150.enhancements.arrow:
+
+Native PyArrow-backed ExtensionArray
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+With `Pyarrow `__ installed, users can now create pandas objects
+that are backed by a ``pyarrow.ChunkedArray`` and ``pyarrow.DataType``.
+
+The ``dtype`` argument can accept a string of a `pyarrow data type `__
+with ``pyarrow`` in brackets, e.g. ``"int64[pyarrow]"``, or, for pyarrow data types that take parameters, an :class:`ArrowDtype`
+initialized with a ``pyarrow.DataType``.
+
+.. ipython:: python
+
+ import pyarrow as pa
+ ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
+ ser_float
+
+ list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
+ ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
+ ser_list
+
+ ser_list.take([1, 0])
+ ser_float * 5
+ ser_float.mean()
+ ser_float.dropna()
+
+Most operations are supported and have been implemented using `pyarrow compute `__ functions.
+We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.
+
+.. warning::
+
+ This feature is experimental, and the API can change in a future release without warning.
+
+.. _whatsnew_150.enhancements.dataframe_interchange:
+
+DataFrame interchange protocol implementation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+pandas now implements the DataFrame interchange API spec.
+See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html
+
+The protocol consists of two parts:
+
+- New method :meth:`DataFrame.__dataframe__` which produces the interchange object.
+ It effectively "exports" the pandas dataframe as an interchange object so
+ any other library which has the protocol implemented can "import" that dataframe
+ without knowing anything about the producer except that it makes an interchange object.
+- New function :func:`pandas.api.interchange.from_dataframe` which can take
+ an arbitrary interchange object from any conformant library and construct a
+ pandas DataFrame out of it.
+
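+A minimal sketch of the round trip, with pandas acting as both producer and
+consumer (in practice the consumer would typically be a different library; the
+variable names are illustrative only):
+
+.. code-block:: python
+
+ import pandas as pd
+
+ df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
+
+ # "Export" the DataFrame as an interchange object ...
+ interchange_object = df.__dataframe__()
+
+ # ... and "import" it again on the consumer side.
+ df_roundtrip = pd.api.interchange.from_dataframe(interchange_object)
+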
+.. _whatsnew_150.enhancements.styler:
+
+Styler
+^^^^^^
+
+The most notable development is the new method :meth:`.Styler.concat` which
+allows adding customised footer rows to visualise additional calculations on the data,
+e.g. totals and counts (:issue:`43875`, :issue:`46186`).
+
+Additionally there is an alternative output method :meth:`.Styler.to_string`,
+which allows using the Styler's formatting methods to create, for example, CSVs (:issue:`44502`).
+
+A new feature :meth:`.Styler.relabel_index` is also made available to provide full customisation of the display of
+index or column headers (:issue:`47864`).
+
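+As a rough sketch of the first two methods (the data and the ``"sum"`` footer
+are illustrative assumptions, not prescribed usage):
+
+.. code-block:: python
+
+ import pandas as pd
+
+ df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+
+ # Append a "sum" footer row computed from the data ...
+ styler = df.style.concat(df.agg(["sum"]).style)
+
+ # ... and render the styled table as plain text instead of HTML.
+ print(styler.to_string())
+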
+Minor feature improvements are:
+
+ - Adding the ability to render ``border`` and ``border-{side}`` CSS properties in Excel (:issue:`42276`)
+ - Making keyword arguments consistent: :meth:`.Styler.highlight_null` now accepts ``color`` and deprecates ``null_color`` although this remains backwards compatible (:issue:`45907`)
+
+.. _whatsnew_150.enhancements.resample_group_keys:
+
+Control of index with ``group_keys`` in :meth:`DataFrame.resample`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The argument ``group_keys`` has been added to the method :meth:`DataFrame.resample`.
+As with :meth:`DataFrame.groupby`, this argument controls whether each group is added
+to the index of the resampled result when :meth:`.Resampler.apply` is used.
+
+.. warning::
+ Not specifying the ``group_keys`` argument will retain the
+ previous behavior and emit a warning if the result will change
+ by specifying ``group_keys=False``. In a future version
+ of pandas, not specifying ``group_keys`` will default to
+ the same behavior as ``group_keys=False``.
+
+.. ipython:: python
+
+ df = pd.DataFrame(
+     {'a': range(6)},
+     index=pd.date_range("2021-01-01", periods=6, freq="8H")
+ )
+ df.resample("D", group_keys=True).apply(lambda x: x)
+ df.resample("D", group_keys=False).apply(lambda x: x)
+
+Previously, the resulting index would depend upon the values returned by ``apply``,
+as seen in the following example.
+
+.. code-block:: ipython
+
+ In [1]: # pandas 1.3
+ In [2]: df.resample("D").apply(lambda x: x)
+ Out[2]:
+ a
+ 2021-01-01 00:00:00 0
+ 2021-01-01 08:00:00 1
+ 2021-01-01 16:00:00 2
+ 2021-01-02 00:00:00 3
+ 2021-01-02 08:00:00 4
+ 2021-01-02 16:00:00 5
+
+ In [3]: df.resample("D").apply(lambda x: x.reset_index())
+ Out[3]:
+ index a
+ 2021-01-01 0 2021-01-01 00:00:00 0
+ 1 2021-01-01 08:00:00 1
+ 2 2021-01-01 16:00:00 2
+ 2021-01-02 0 2021-01-02 00:00:00 3
+ 1 2021-01-02 08:00:00 4
+ 2 2021-01-02 16:00:00 5
+
+.. _whatsnew_150.enhancements.from_dummies:
+
+from_dummies
+^^^^^^^^^^^^
+
+Added new function :func:`~pandas.from_dummies` to convert a dummy coded :class:`DataFrame` into a categorical :class:`DataFrame`.
+
+.. ipython:: python
+
+ import pandas as pd
+
+ df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
+ "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
+ "col2_c": [0, 0, 1]})
+
+ pd.from_dummies(df, sep="_")
+
+.. _whatsnew_150.enhancements.orc:
+
+Writing to ORC files
+^^^^^^^^^^^^^^^^^^^^
+
+The new method :meth:`DataFrame.to_orc` allows writing to ORC files (:issue:`43864`).
+
+This functionality depends on the `pyarrow `__ library. For more details, see :ref:`the IO docs on ORC `.
+
+.. warning::
+
+ * It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
+ * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
+ * :func:`~pandas.DataFrame.to_orc` is not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies `.
+ * For supported dtypes please refer to `supported ORC features in Arrow `__.
+ * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
+
+.. code-block:: python
+
+ df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
+ df.to_orc("./out.orc")
+
+.. _whatsnew_150.enhancements.tar:
+
+Reading directly from TAR archives
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+I/O methods like :func:`read_csv` or :meth:`DataFrame.to_json` now allow reading and writing
+directly on TAR archives (:issue:`44787`).
+
+.. code-block:: python
+
+ df = pd.read_csv("./movement.tar.gz")
+ # ...
+ df.to_csv("./out.tar.gz")
+
+This supports ``.tar``, ``.tar.gz``, ``.tar.bz2`` and ``.tar.xz`` archives.
+The compression method used is inferred from the filename.
+If it cannot be inferred, use the ``compression`` argument:
+
+.. code-block:: python
+
+ df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821
+
+(``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
+
+
+.. _whatsnew_150.enhancements.read_xml_dtypes:
+
+read_xml now supports ``dtype``, ``converters``, and ``parse_dates``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns,
+applying converter methods, and parsing dates (:issue:`43567`).
+
+.. ipython:: python
+
+ xml_dates = """<?xml version='1.0' encoding='utf-8'?>
+ <data>
+   <row>
+     <shape>square</shape>
+     <degrees>00360</degrees>
+     <sides>4.0</sides>
+     <date>2020-01-01</date>
+   </row>
+   <row>
+     <shape>circle</shape>
+     <degrees>00360</degrees>
+     <sides/>
+     <date>2021-01-01</date>
+   </row>
+   <row>
+     <shape>triangle</shape>
+     <degrees>00180</degrees>
+     <sides>3.0</sides>
+     <date>2022-01-01</date>
+   </row>
+ </data>"""
+
+ df = pd.read_xml(
+     xml_dates,
+     dtype={'sides': 'Int64'},
+     converters={'degrees': str},
+     parse_dates=['date']
+ )
+ df
+ df.dtypes
+
+
+.. _whatsnew_150.enhancements.read_xml_iterparse:
+
+read_xml now supports large XML using ``iterparse``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
+which are memory-efficient methods to iterate through XML trees and extract specific elements
+and attributes without holding the entire tree in memory (:issue:`45442`).
+
+.. code-block:: ipython
+
+ In [1]: df = pd.read_xml(
+    ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+    ...:     iterparse={"page": ["title", "ns", "id"]}
+    ...: )
+
+ In [2]: df
+ Out[2]:
+ title ns id
+ 0 Gettysburg Address 0 21450
+ 1 Main Page 0 42950
+ 2 Declaration by United Nations 0 8435
+ 3 Constitution of the United States of America 0 8435
+ 4 Declaration of Independence (Israel) 0 17858
+ ... ... ... ...
+ 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
+ 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
+ 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
+ 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
+ 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
+
+ [3578765 rows x 3 columns]
+
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
+.. _whatsnew_150.enhancements.copy_on_write:
+
+Copy on Write
+^^^^^^^^^^^^^
+
+A new feature ``copy_on_write`` was added (:issue:`46958`). Copy on write ensures that
+any DataFrame or Series derived from another in any way always behaves as a copy.
+Copy on write disallows updating any object other than the one the method
+was applied to.
+
+Copy on write can be enabled through:
+
+.. code-block:: python
+
+ pd.set_option("mode.copy_on_write", True)
+ pd.options.mode.copy_on_write = True
+
+Alternatively, copy on write can be enabled locally through:
+
+.. code-block:: python
+
+ with pd.option_context("mode.copy_on_write", True):
+ ...
+
+Without copy on write, the parent :class:`DataFrame` is updated when updating a child
+:class:`DataFrame` that was derived from this :class:`DataFrame`.
+
+.. ipython:: python
+
+ df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
+ view = df["foo"]
+ view.iloc[0] = 10
+ df
+
+With copy on write enabled, ``df`` won't be updated anymore:
+
+.. ipython:: python
+
+ with pd.option_context("mode.copy_on_write", True):
+     df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
+     view = df["foo"]
+     view.iloc[0] = 10
+
+ df
+
+A more detailed explanation can be found `here `_.
+
+.. _whatsnew_150.enhancements.other:
+
+Other enhancements
+^^^^^^^^^^^^^^^^^^
+- :meth:`Series.map` now raises when ``arg`` is a dict but ``na_action`` is neither ``None`` nor ``'ignore'`` (:issue:`46588`)
+- :meth:`MultiIndex.to_frame` now supports the argument ``allow_duplicates`` and raises on duplicate labels if it is missing or ``False`` (:issue:`45245`)
+- :class:`.StringArray` now accepts array-likes containing nan-likes (``None``, ``np.nan``) for the ``values`` parameter in its constructor, in addition to strings and :attr:`pandas.NA` (:issue:`40839`)
+- Improved the rendering of ``categories`` in :class:`CategoricalIndex` (:issue:`45218`)
+- :meth:`DataFrame.plot` will now allow the ``subplots`` parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (:issue:`29688`).
+- :meth:`to_numeric` now preserves float64 arrays when downcasting would generate values not representable in float32 (:issue:`43693`)
+- :meth:`Series.reset_index` and :meth:`DataFrame.reset_index` now support the argument ``allow_duplicates`` (:issue:`44410`)
+- :meth:`.GroupBy.min` and :meth:`.GroupBy.max` now support `Numba `_ execution with the ``engine`` keyword (:issue:`45428`)
+- :func:`read_csv` now supports ``defaultdict`` as a ``dtype`` parameter (:issue:`41574`)
+- :meth:`DataFrame.rolling` and :meth:`Series.rolling` now support a ``step`` parameter with fixed-length windows, as sketched after this list (:issue:`15354`)
+- Implemented a ``bool``-dtype :class:`Index`; passing a bool-dtype array-like to ``pd.Index`` will now retain ``bool`` dtype instead of casting to ``object`` (:issue:`45061`)
+- Implemented a complex-dtype :class:`Index`; passing a complex-dtype array-like to ``pd.Index`` will now retain complex dtype instead of casting to ``object`` (:issue:`45845`)
+- :class:`Series` and :class:`DataFrame` with :class:`IntegerDtype` now support bitwise operations (:issue:`34463`)
+- Add ``milliseconds`` field support for :class:`.DateOffset` (:issue:`43371`)
+- :meth:`DataFrame.where` tries to maintain dtype of :class:`DataFrame` if fill value can be cast without loss of precision (:issue:`45582`)
+- :meth:`DataFrame.reset_index` now accepts a ``names`` argument which renames the index names (:issue:`6878`)
+- :func:`concat` now raises when ``levels`` is given but ``keys`` is None (:issue:`46653`)
+- :func:`concat` now raises when ``levels`` contains duplicate values (:issue:`46653`)
+- Added ``numeric_only`` argument to :meth:`DataFrame.corr`, :meth:`DataFrame.corrwith`, :meth:`DataFrame.cov`, :meth:`DataFrame.idxmin`, :meth:`DataFrame.idxmax`, :meth:`.DataFrameGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.GroupBy.var`, :meth:`.GroupBy.std`, :meth:`.GroupBy.sem`, and :meth:`.DataFrameGroupBy.quantile` (:issue:`46560`)
+- A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`, :issue:`46725`)
+- Added ``validate`` argument to :meth:`DataFrame.join` (:issue:`46622`)
+- Added ``numeric_only`` argument to :meth:`Resampler.sum`, :meth:`Resampler.prod`, :meth:`Resampler.min`, :meth:`Resampler.max`, :meth:`Resampler.first`, and :meth:`Resampler.last` (:issue:`46442`)
+- ``times`` argument in :class:`.ExponentialMovingWindow` now accepts ``np.timedelta64`` (:issue:`47003`)
+- :class:`.DataError`, :class:`.SpecificationError`, :class:`.SettingWithCopyError`, :class:`.SettingWithCopyWarning`, :class:`.NumExprClobberingError`, :class:`.UndefinedVariableError`, :class:`.IndexingError`, :class:`.PyperclipException`, :class:`.PyperclipWindowsException`, :class:`.CSSWarning`, :class:`.PossibleDataLossError`, :class:`.ClosedFileError`, :class:`.IncompatibilityWarning`, :class:`.AttributeConflictWarning`, :class:`.DatabaseError`, :class:`.PossiblePrecisionLoss`, :class:`.ValueLabelTypeMismatch`, :class:`.InvalidColumnName`, and :class:`.CategoricalConversionWarning` are now exposed in ``pandas.errors`` (:issue:`27656`)
+- Added ``check_like`` argument to :func:`testing.assert_series_equal` (:issue:`47247`)
+- Add support for :meth:`.GroupBy.ohlc` for extension array dtypes (:issue:`37493`)
+- Allow reading compressed SAS files with :func:`read_sas` (e.g., ``.sas7bdat.gz`` files)
+- :func:`pandas.read_html` now supports extracting links from table cells (:issue:`13141`)
+- :meth:`DatetimeIndex.astype` now supports casting timezone-naive indexes to ``datetime64[s]``, ``datetime64[ms]``, and ``datetime64[us]``, and timezone-aware indexes to the corresponding ``datetime64[unit, tzname]`` dtypes (:issue:`47579`)
+- :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`)
+- :meth:`RangeIndex.union` can now return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`)
+- :meth:`DataFrame.compare` now accepts an argument ``result_names`` to allow the user to specify the result's names of both left and right DataFrame which are being compared. This is by default ``'self'`` and ``'other'`` (:issue:`44354`)
+- :meth:`DataFrame.quantile` gained a ``method`` argument that can accept ``table`` to evaluate multi-column quantiles (:issue:`43881`)
+- :class:`Interval` now supports checking whether one interval is contained by another interval (:issue:`46613`)
+- Added ``copy`` keyword to :meth:`Series.set_axis` and :meth:`DataFrame.set_axis` to allow user to set axis on a new object without necessarily copying the underlying data (:issue:`47932`)
+- The method :meth:`.ExtensionArray.factorize` accepts ``use_na_sentinel=False`` for determining how null values are to be treated (:issue:`46601`)
+- The ``Dockerfile`` now installs a dedicated ``pandas-dev`` virtual environment for pandas development instead of using the ``base`` environment (:issue:`48427`)
+
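+As a small illustration of the new rolling ``step`` parameter noted in the list
+above (a sketch with arbitrary example data):
+
+.. code-block:: python
+
+ import pandas as pd
+
+ s = pd.Series(range(10))
+ # A fixed-length window of 3, evaluated only at every 2nd position.
+ s.rolling(window=3, step=2).sum()
+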
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.notable_bug_fixes:
+
+Notable bug fixes
+~~~~~~~~~~~~~~~~~
+
+These are bug fixes that might have notable behavior changes.
+
+.. _whatsnew_150.notable_bug_fixes.groupby_transform_dropna:
+
+Using ``dropna=True`` with ``groupby`` transforms
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A transform is an operation whose result has the same size as its input. When the
+result is a :class:`DataFrame` or :class:`Series`, it is also required that the
+index of the result matches that of the input. In pandas 1.4, using
+:meth:`.DataFrameGroupBy.transform` or :meth:`.SeriesGroupBy.transform` with null
+values in the groups and ``dropna=True`` gave incorrect results. As demonstrated by the
+examples below, the incorrect results either contained incorrect values or did not have
+the same index as the input.
+
+.. ipython:: python
+
+ df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
+
+*Old behavior*:
+
+.. code-block:: ipython
+
+ In [3]: # Value in the last row should be np.nan
+    ...: df.groupby('a', dropna=True).transform('sum')
+ Out[3]:
+ b
+ 0 5
+ 1 5
+ 2 5
+
+ In [3]: # Should have one additional row with the value np.nan
+    ...: df.groupby('a', dropna=True).transform(lambda x: x.sum())
+ Out[3]:
+ b
+ 0 5
+ 1 5
+
+ In [3]: # The value in the last row is np.nan interpreted as an integer
+    ...: df.groupby('a', dropna=True).transform('ffill')
+ Out[3]:
+ b
+ 0 2
+ 1 3
+ 2 -9223372036854775808
+
+ In [3]: # Should have one additional row with the value np.nan
+    ...: df.groupby('a', dropna=True).transform(lambda x: x)
+ Out[3]:
+ b
+ 0 2
+ 1 3
+
+*New behavior*:
+
+.. ipython:: python
+
+ df.groupby('a', dropna=True).transform('sum')
+ df.groupby('a', dropna=True).transform(lambda x: x.sum())
+ df.groupby('a', dropna=True).transform('ffill')
+ df.groupby('a', dropna=True).transform(lambda x: x)
+
+.. _whatsnew_150.notable_bug_fixes.to_json_incorrectly_localizing_naive_timestamps:
+
+Serializing tz-naive Timestamps with to_json() with ``iso_dates=True``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.to_json`, :meth:`Series.to_json`, and :meth:`Index.to_json`
+would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps
+to UTC. (:issue:`38760`)
+
+Note that this patch does not fix the localization of tz-aware Timestamps to UTC
+upon serialization. (Related issue :issue:`12997`)
+
+*Old Behavior*
+
+.. ipython:: python
+
+ index = pd.date_range(
+     start='2020-12-28 00:00:00',
+     end='2020-12-28 02:00:00',
+     freq='1H',
+ )
+ a = pd.Series(
+     data=range(3),
+     index=index,
+ )
+
+.. code-block:: ipython
+
+ In [4]: a.to_json(date_format='iso')
+ Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'
+
+ In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
+ Out[5]: array([False, False, False])
+
+*New Behavior*
+
+.. ipython:: python
+
+ a.to_json(date_format='iso')
+ # Roundtripping now works
+ pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
+
+.. _whatsnew_150.notable_bug_fixes.groupby_value_counts_categorical:
+
+DataFrameGroupBy.value_counts with non-grouping categorical columns and ``observed=True``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Calling :meth:`.DataFrameGroupBy.value_counts` with ``observed=True`` would incorrectly drop non-observed categories of non-grouping columns (:issue:`46357`).
+
+.. code-block:: ipython
+
+ In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]
+ In [7]: df
+ Out[7]:
+ 0
+ 0 a
+ 1 b
+
+*Old Behavior*
+
+.. code-block:: ipython
+
+ In [8]: df.groupby(level=0, observed=True).value_counts()
+ Out[8]:
+ 0 a 1
+ 1 b 1
+ dtype: int64
+
+
+*New Behavior*
+
+.. code-block:: ipython
+
+ In [9]: df.groupby(level=0, observed=True).value_counts()
+ Out[9]:
+ 0 a 1
+ 1 a 0
+ b 1
+ 0 b 0
+ c 0
+ 1 c 0
+ dtype: int64
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.api_breaking:
+
+Backwards incompatible API changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_150.api_breaking.deps:
+
+Increased minimum versions for dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Some minimum supported versions of dependencies were updated.
+If installed, we now require:
+
++-----------------+-----------------+----------+---------+
+| Package         | Minimum Version | Required | Changed |
++=================+=================+==========+=========+
+| numpy           | 1.20.3          |    X     |    X    |
++-----------------+-----------------+----------+---------+
+| mypy (dev)      | 0.971           |          |    X    |
++-----------------+-----------------+----------+---------+
+| beautifulsoup4  | 4.9.3           |          |    X    |
++-----------------+-----------------+----------+---------+
+| blosc           | 1.21.0          |          |    X    |
++-----------------+-----------------+----------+---------+
+| bottleneck      | 1.3.2           |          |    X    |
++-----------------+-----------------+----------+---------+
+| fsspec          | 2021.07.0       |          |    X    |
++-----------------+-----------------+----------+---------+
+| hypothesis      | 6.13.0          |          |    X    |
++-----------------+-----------------+----------+---------+
+| gcsfs           | 2021.07.0       |          |    X    |
++-----------------+-----------------+----------+---------+
+| jinja2          | 3.0.0           |          |    X    |
++-----------------+-----------------+----------+---------+
+| lxml            | 4.6.3           |          |    X    |
++-----------------+-----------------+----------+---------+
+| numba           | 0.53.1          |          |    X    |
++-----------------+-----------------+----------+---------+
+| numexpr         | 2.7.3           |          |    X    |
++-----------------+-----------------+----------+---------+
+| openpyxl        | 3.0.7           |          |    X    |
++-----------------+-----------------+----------+---------+
+| pandas-gbq      | 0.15.0          |          |    X    |
++-----------------+-----------------+----------+---------+
+| psycopg2        | 2.8.6           |          |    X    |
++-----------------+-----------------+----------+---------+
+| pymysql         | 1.0.2           |          |    X    |
++-----------------+-----------------+----------+---------+
+| pyreadstat      | 1.1.2           |          |    X    |
++-----------------+-----------------+----------+---------+
+| pyxlsb          | 1.0.8           |          |    X    |
++-----------------+-----------------+----------+---------+
+| s3fs            | 2021.08.0       |          |    X    |
++-----------------+-----------------+----------+---------+
+| scipy           | 1.7.1           |          |    X    |
++-----------------+-----------------+----------+---------+
+| sqlalchemy      | 1.4.16          |          |    X    |
++-----------------+-----------------+----------+---------+
+| tabulate        | 0.8.9           |          |    X    |
++-----------------+-----------------+----------+---------+
+| xarray          | 0.19.0          |          |    X    |
++-----------------+-----------------+----------+---------+
+| xlsxwriter      | 1.4.3           |          |    X    |
++-----------------+-----------------+----------+---------+
+
+For `optional libraries `_ the general recommendation is to use the latest version.
+The following table lists the lowest version per library that is currently being tested throughout the development of pandas.
+Optional libraries below the lowest tested version may still work, but are not considered supported.
+
++-----------------+-----------------+---------+
+| Package         | Minimum Version | Changed |
++=================+=================+=========+
+| beautifulsoup4  | 4.9.3           |    X    |
++-----------------+-----------------+---------+
+| blosc           | 1.21.0          |    X    |
++-----------------+-----------------+---------+
+| bottleneck      | 1.3.2           |    X    |
++-----------------+-----------------+---------+
+| brotlipy        | 0.7.0           |         |
++-----------------+-----------------+---------+
+| fastparquet     | 0.4.0           |         |
++-----------------+-----------------+---------+
+| fsspec          | 2021.08.0       |    X    |
++-----------------+-----------------+---------+
+| html5lib        | 1.1             |         |
++-----------------+-----------------+---------+
+| hypothesis      | 6.13.0          |    X    |
++-----------------+-----------------+---------+
+| gcsfs           | 2021.08.0       |    X    |
++-----------------+-----------------+---------+
+| jinja2          | 3.0.0           |    X    |
++-----------------+-----------------+---------+
+| lxml            | 4.6.3           |    X    |
++-----------------+-----------------+---------+
+| matplotlib      | 3.3.2           |         |
++-----------------+-----------------+---------+
+| numba           | 0.53.1          |    X    |
++-----------------+-----------------+---------+
+| numexpr         | 2.7.3           |    X    |
++-----------------+-----------------+---------+
+| odfpy           | 1.4.1           |         |
++-----------------+-----------------+---------+
+| openpyxl        | 3.0.7           |    X    |
++-----------------+-----------------+---------+
+| pandas-gbq      | 0.15.0          |    X    |
++-----------------+-----------------+---------+
+| psycopg2        | 2.8.6           |    X    |
++-----------------+-----------------+---------+
+| pyarrow         | 1.0.1           |         |
++-----------------+-----------------+---------+
+| pymysql         | 1.0.2           |    X    |
++-----------------+-----------------+---------+
+| pyreadstat      | 1.1.2           |    X    |
++-----------------+-----------------+---------+
+| pytables        | 3.6.1           |         |
++-----------------+-----------------+---------+
+| python-snappy   | 0.6.0           |         |
++-----------------+-----------------+---------+
+| pyxlsb          | 1.0.8           |    X    |
++-----------------+-----------------+---------+
+| s3fs            | 2021.08.0       |    X    |
++-----------------+-----------------+---------+
+| scipy           | 1.7.1           |    X    |
++-----------------+-----------------+---------+
+| sqlalchemy      | 1.4.16          |    X    |
++-----------------+-----------------+---------+
+| tabulate        | 0.8.9           |    X    |
++-----------------+-----------------+---------+
+| tzdata          | 2022a           |         |
++-----------------+-----------------+---------+
+| xarray          | 0.19.0          |    X    |
++-----------------+-----------------+---------+
+| xlrd            | 2.0.1           |         |
++-----------------+-----------------+---------+
+| xlsxwriter      | 1.4.3           |    X    |
++-----------------+-----------------+---------+
+| xlwt            | 1.3.0           |         |
++-----------------+-----------------+---------+
+| zstandard       | 0.15.2          |         |
++-----------------+-----------------+---------+
+
+See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
+
+.. _whatsnew_150.api_breaking.other:
+
+Other API changes
+^^^^^^^^^^^^^^^^^
+
+- BigQuery I/O methods :func:`read_gbq` and :meth:`DataFrame.to_gbq` default to
+ ``auth_local_webserver = True``. Google has deprecated the
+ ``auth_local_webserver = False`` `"out of band" (copy-paste) flow
+ `_.
+ The ``auth_local_webserver = False`` option is planned to stop working in
+ October 2022. (:issue:`46312`)
+- :func:`read_json` now raises ``FileNotFoundError`` (previously ``ValueError``) when input is a string ending in ``.json``, ``.json.gz``, ``.json.bz2``, etc. but no such file exists. (:issue:`29102`)
+- Operations with :class:`Timestamp` or :class:`Timedelta` that would previously raise ``OverflowError`` instead raise ``OutOfBoundsDatetime`` or ``OutOfBoundsTimedelta`` where appropriate (:issue:`47268`)
+- When :func:`read_sas` previously returned ``None``, it now returns an empty :class:`DataFrame` (:issue:`47410`)
+- :class:`DataFrame` constructor raises if ``index`` or ``columns`` arguments are sets (:issue:`47215`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.deprecations:
+
+Deprecations
+~~~~~~~~~~~~
+
+.. warning::
+
+ In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation such as
+ making the standard library `zoneinfo `_ the default timezone implementation instead of ``pytz``,
+ having the :class:`Index` support all data types instead of having multiple subclasses (:class:`CategoricalIndex`, :class:`Int64Index`, etc.), and more.
+   The changes under consideration are logged in `this GitHub issue `_, and any
+ feedback or concerns are welcome.
+
+.. _whatsnew_150.deprecations.int_slicing_series:
+
+Label-based integer slicing on a Series with an Int64Index or RangeIndex
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In a future version, integer slicing on a :class:`Series` with an :class:`Int64Index` or :class:`RangeIndex` will be treated as *label-based*, not positional. This will make the behavior consistent with other :meth:`Series.__getitem__` and :meth:`Series.__setitem__` behaviors (:issue:`45162`).
+
+For example:
+
+.. ipython:: python
+
+ ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])
+
+In the old behavior, ``ser[2:4]`` treats the slice as positional:
+
+*Old behavior*:
+
+.. code-block:: ipython
+
+ In [3]: ser[2:4]
+ Out[3]:
+ 5 3
+ 7 4
+ dtype: int64
+
+In a future version, this will be treated as label-based:
+
+*Future behavior*:
+
+.. code-block:: ipython
+
+ In [4]: ser.loc[2:4]
+ Out[4]:
+ 2 1
+ 3 2
+ dtype: int64
+
+To retain the old behavior, use ``series.iloc[i:j]``. To get the future behavior,
+use ``series.loc[i:j]``.
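+
+For example, with the ``ser`` defined above, both spellings make the intent
+explicit and are unaffected by the change:
+
+.. code-block:: ipython
+
+    In [5]: ser.iloc[2:4]  # positional, matches the old ser[2:4]
+    Out[5]:
+    5    3
+    7    4
+    dtype: int64
+
+    In [6]: ser.loc[2:4]  # label-based, matches the future ser[2:4]
+    Out[6]:
+    2    1
+    3    2
+    dtype: int64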
+
+Slicing on a :class:`DataFrame` will not be affected.
+
+.. _whatsnew_150.deprecations.excel_writer_attributes:
+
+:class:`ExcelWriter` attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All attributes of :class:`ExcelWriter` were previously documented as not
+public. However some third party Excel engines documented accessing
+``ExcelWriter.book`` or ``ExcelWriter.sheets``, and users were utilizing these
+and possibly other attributes. Previously these attributes were not safe to use;
+e.g. modifications to ``ExcelWriter.book`` would not update ``ExcelWriter.sheets``
+and conversely. In order to support this, pandas has made some attributes public
+and improved their implementations so that they may now be safely used. (:issue:`45572`)
+
+The following attributes are now public and considered safe to access.
+
+ - ``book``
+ - ``check_extension``
+ - ``close``
+ - ``date_format``
+ - ``datetime_format``
+ - ``engine``
+ - ``if_sheet_exists``
+ - ``sheets``
+ - ``supported_extensions``
+
+The following attributes have been deprecated. They now raise a ``FutureWarning``
+when accessed and will be removed in a future version. Users should be aware
+that their usage is considered unsafe, and can lead to unexpected results.
+
+ - ``cur_sheet``
+ - ``handles``
+ - ``path``
+ - ``save``
+ - ``write_cells``
+
+See the documentation of :class:`ExcelWriter` for further details.
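+
+As a short sketch of the now-public attributes (assuming the default ``openpyxl``
+engine for ``.xlsx`` files is installed; the file and sheet names here are
+illustrative):
+
+.. code-block:: python
+
+    df = pd.DataFrame({"a": [1, 2]})
+    with pd.ExcelWriter("out.xlsx") as writer:
+        df.to_excel(writer, sheet_name="data")
+        workbook = writer.book             # the engine's workbook object, now public
+        worksheet = writer.sheets["data"]  # mapping of sheet names to sheet objects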
+
+.. _whatsnew_150.deprecations.group_keys_in_apply:
+
+Using ``group_keys`` with transformers in :meth:`.GroupBy.apply`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In previous versions of pandas, if it was inferred that the function passed to
+:meth:`.GroupBy.apply` was a transformer (i.e. the resulting index was equal to
+the input index), the ``group_keys`` argument of :meth:`DataFrame.groupby` and
+:meth:`Series.groupby` was ignored and the group keys would never be added to
+the index of the result. In the future, the group keys will be added to the index
+when the user specifies ``group_keys=True``.
+
+As ``group_keys=True`` is the default value of :meth:`DataFrame.groupby` and
+:meth:`Series.groupby`, not specifying ``group_keys`` with a transformer will
+raise a ``FutureWarning``. This can be silenced and the previous behavior
+retained by specifying ``group_keys=False``.
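+
+A minimal sketch (the frame is illustrative) of opting out explicitly to keep the
+previous behavior and silence the warning:
+
+.. code-block:: python
+
+    df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
+    # The lambda is a transformer: its result has the same index as the input.
+    # Passing group_keys=False retains the previous behavior without a warning.
+    df.groupby("a", group_keys=False).apply(lambda g: g)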
+
+.. _whatsnew_150.deprecations.setitem_column_try_inplace:
+.. see also: whatsnew_130.notable_bug_fixes.setitem_column_try_inplace
+
+Inplace operation when setting values with ``loc`` and ``iloc``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Most of the time, setting values with :meth:`DataFrame.iloc` attempts to set values
+inplace, only falling back to inserting a new array if necessary. There are
+some cases where this rule is not followed, for example when setting an entire
+column from an array with different dtype:
+
+.. ipython:: python
+
+ df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])
+ original_prices = df['price']
+ new_prices = np.array([98, 99])
+
+*Old behavior*:
+
+.. code-block:: ipython
+
+ In [3]: df.iloc[:, 0] = new_prices
+ In [4]: df.iloc[:, 0]
+ Out[4]:
+ book1 98
+ book2 99
+ Name: price, dtype: int64
+ In [5]: original_prices
+ Out[5]:
+ book1 11.1
+ book2 12.2
+    Name: price, dtype: float64
+
+This behavior is deprecated. In a future version, setting an entire column with
+iloc will attempt to operate inplace.
+
+*Future behavior*:
+
+.. code-block:: ipython
+
+ In [3]: df.iloc[:, 0] = new_prices
+ In [4]: df.iloc[:, 0]
+ Out[4]:
+ book1 98.0
+ book2 99.0
+ Name: price, dtype: float64
+ In [5]: original_prices
+ Out[5]:
+ book1 98.0
+ book2 99.0
+ Name: price, dtype: float64
+
+To get the old behavior, use :meth:`DataFrame.__setitem__` directly:
+
+.. code-block:: ipython
+
+ In [3]: df[df.columns[0]] = new_prices
+ In [4]: df.iloc[:, 0]
+    Out[4]:
+ book1 98
+ book2 99
+ Name: price, dtype: int64
+ In [5]: original_prices
+ Out[5]:
+ book1 11.1
+ book2 12.2
+ Name: price, dtype: float64
+
+To get the old behaviour when ``df.columns`` is not unique and you want to
+change a single column by index, you can use :meth:`DataFrame.isetitem`, which
+has been added in pandas 1.5:
+
+.. code-block:: ipython
+
+ In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')
+    In [4]: df_with_duplicated_cols.isetitem(0, new_prices)
+    In [5]: df_with_duplicated_cols.iloc[:, 0]
+    Out[5]:
+    book1    98
+    book2    99
+    Name: price, dtype: int64
+    In [6]: original_prices
+    Out[6]:
+    book1    11.1
+    book2    12.2
+    Name: price, dtype: float64
+
+.. _whatsnew_150.deprecations.numeric_only_default:
+
+``numeric_only`` default value
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Across the :class:`DataFrame`, :class:`.DataFrameGroupBy`, and :class:`.Resampler` operations such as
+``min``, ``sum``, and ``idxmax``, the default
+value of the ``numeric_only`` argument, if it exists at all, was inconsistent.
+Furthermore, operations with the default value ``None`` can lead to surprising
+results. (:issue:`46560`)
+
+.. code-block:: ipython
+
+ In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
+
+ In [2]: # Reading the next line without knowing the contents of df, one would
+ # expect the result to contain the products for both columns a and b.
+ df[["a", "b"]].prod()
+ Out[2]:
+ a 2
+ dtype: int64
+
+To avoid this behavior, specifying the value ``numeric_only=None`` has been
+deprecated, and will be removed in a future version of pandas. In the future,
+all operations with a ``numeric_only`` argument will default to ``False``. Users
+should either call the operation only with columns that can be operated on, or
+specify ``numeric_only=True`` to operate only on Boolean, integer, and float columns.
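+
+For example, with the frame above, opting in to the future default makes the
+result explicit:
+
+.. code-block:: ipython
+
+    In [3]: df[["a", "b"]].prod(numeric_only=True)
+    Out[3]:
+    a    2
+    dtype: int64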
+
+In order to support the transition to the new behavior, the following methods have
+gained the ``numeric_only`` argument.
+
+- :meth:`DataFrame.corr`
+- :meth:`DataFrame.corrwith`
+- :meth:`DataFrame.cov`
+- :meth:`DataFrame.idxmin`
+- :meth:`DataFrame.idxmax`
+- :meth:`.DataFrameGroupBy.cummin`
+- :meth:`.DataFrameGroupBy.cummax`
+- :meth:`.DataFrameGroupBy.idxmin`
+- :meth:`.DataFrameGroupBy.idxmax`
+- :meth:`.GroupBy.var`
+- :meth:`.GroupBy.std`
+- :meth:`.GroupBy.sem`
+- :meth:`.DataFrameGroupBy.quantile`
+- :meth:`.Resampler.mean`
+- :meth:`.Resampler.median`
+- :meth:`.Resampler.sem`
+- :meth:`.Resampler.std`
+- :meth:`.Resampler.var`
+- :meth:`DataFrame.rolling` operations
+- :meth:`DataFrame.expanding` operations
+- :meth:`DataFrame.ewm` operations
+
+.. _whatsnew_150.deprecations.other:
+
+Other Deprecations
+^^^^^^^^^^^^^^^^^^
+- Deprecated the keyword ``line_terminator`` in :meth:`DataFrame.to_csv` and :meth:`Series.to_csv`, use ``lineterminator`` instead; this is for consistency with :func:`read_csv` and the standard library 'csv' module (:issue:`9568`)
+- Deprecated behavior of :meth:`SparseArray.astype`, :meth:`Series.astype`, and :meth:`DataFrame.astype` with :class:`SparseDtype` when passing a non-sparse ``dtype``. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a :class:`SparseDtype` (:issue:`34457`)
+- Deprecated behavior of :meth:`DatetimeIndex.intersection` and :meth:`DatetimeIndex.symmetric_difference` (``union`` behavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (:issue:`39328`, :issue:`45357`)
+- Deprecated :meth:`DataFrame.iteritems`, :meth:`Series.iteritems`, :meth:`HDFStore.iteritems` in favor of :meth:`DataFrame.items`, :meth:`Series.items`, :meth:`HDFStore.items` (:issue:`45321`)
+- Deprecated :meth:`Series.is_monotonic` and :meth:`Index.is_monotonic` in favor of :meth:`Series.is_monotonic_increasing` and :meth:`Index.is_monotonic_increasing` (:issue:`45422`, :issue:`21335`)
+- Deprecated behavior of :meth:`DatetimeIndex.astype`, :meth:`TimedeltaIndex.astype`, :meth:`PeriodIndex.astype` when converting to an integer dtype other than ``int64``. In a future version, these will convert to exactly the specified dtype (instead of always ``int64``) and will raise if the conversion overflows (:issue:`45034`)
+- Deprecated the ``__array_wrap__`` method of DataFrame and Series, rely on standard numpy ufuncs instead (:issue:`45451`)
+- Deprecated treating float-dtype data as wall-times when passed with a timezone to :class:`Series` or :class:`DatetimeIndex` (:issue:`45573`)
+- Deprecated the behavior of :meth:`Series.fillna` and :meth:`DataFrame.fillna` with ``timedelta64[ns]`` dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (:issue:`45746`)
+- Deprecated the ``warn`` parameter in :func:`infer_freq` (:issue:`45947`)
+- Deprecated allowing non-keyword arguments in :meth:`.ExtensionArray.argsort` (:issue:`46134`)
+- Deprecated treating all-bool ``object``-dtype columns as bool-like in :meth:`DataFrame.any` and :meth:`DataFrame.all` with ``bool_only=True``, explicitly cast to bool instead (:issue:`46188`)
+- Deprecated the behavior of :meth:`DataFrame.quantile`; the ``numeric_only`` argument will default to ``False`` in a future version, including datetime/timedelta columns in the result (:issue:`7308`)
+- Deprecated :attr:`Timedelta.freq` and :attr:`Timedelta.is_populated` (:issue:`46430`)
+- Deprecated :attr:`Timedelta.delta` (:issue:`46476`)
+- Deprecated passing arguments as positional in :meth:`DataFrame.any` and :meth:`Series.any` (:issue:`44802`)
+- Deprecated passing positional arguments to :meth:`DataFrame.pivot` and :func:`pivot` except ``data`` (:issue:`30228`)
+- Deprecated the methods :meth:`DataFrame.mad`, :meth:`Series.mad`, and the corresponding groupby methods (:issue:`11787`)
+- Deprecated positional arguments to :meth:`Index.join` except for ``other``, use keyword-only arguments instead of positional arguments (:issue:`46518`)
+- Deprecated positional arguments to :meth:`StringMethods.rsplit` and :meth:`StringMethods.split` except for ``pat``, use keyword-only arguments instead of positional arguments (:issue:`47423`)
+- Deprecated indexing on a timezone-naive :class:`DatetimeIndex` using a string representing a timezone-aware datetime (:issue:`46903`, :issue:`36148`)
+- Deprecated allowing ``unit="M"`` or ``unit="Y"`` in :class:`Timestamp` constructor with a non-round float value (:issue:`47267`)
+- Deprecated the ``display.column_space`` global configuration option (:issue:`7576`)
+- Deprecated the argument ``na_sentinel`` in :func:`factorize`, :meth:`Index.factorize`, and :meth:`.ExtensionArray.factorize`; pass ``use_na_sentinel=True`` instead to use the sentinel ``-1`` for NaN values and ``use_na_sentinel=False`` instead of ``na_sentinel=None`` to encode NaN values (:issue:`46910`)
+- Deprecated :meth:`DataFrameGroupBy.transform` not aligning the result when the UDF returned DataFrame (:issue:`45648`)
+- Clarified warning from :func:`to_datetime` when delimited dates can't be parsed in accordance with the specified ``dayfirst`` argument (:issue:`46210`)
+- Emit warning from :func:`to_datetime` when delimited dates can't be parsed in accordance with the specified ``dayfirst`` argument even for dates where the leading zero is omitted (e.g. ``31/1/2001``) (:issue:`47880`)
+- Deprecated :class:`Series` and :class:`Resampler` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) raising a ``NotImplementedError`` when the dtype is non-numeric and ``numeric_only=True`` is provided; this will raise a ``TypeError`` in a future version (:issue:`47500`)
+- Deprecated :meth:`Series.rank` returning an empty result when the dtype is non-numeric and ``numeric_only=True`` is provided; this will raise a ``TypeError`` in a future version (:issue:`47500`)
+- Deprecated argument ``errors`` for :meth:`Series.mask`, :meth:`Series.where`, :meth:`DataFrame.mask`, and :meth:`DataFrame.where` as ``errors`` had no effect on these methods (:issue:`47728`)
+- Deprecated arguments ``*args`` and ``**kwargs`` in :class:`Rolling`, :class:`Expanding`, and :class:`ExponentialMovingWindow` ops. (:issue:`47836`)
+- Deprecated the ``inplace`` keyword in :meth:`Categorical.set_ordered`, :meth:`Categorical.as_ordered`, and :meth:`Categorical.as_unordered` (:issue:`37643`)
+- Deprecated setting a categorical's categories with ``cat.categories = ['a', 'b', 'c']``, use :meth:`Categorical.rename_categories` instead (:issue:`37643`)
+- Deprecated unused arguments ``encoding`` and ``verbose`` in :meth:`Series.to_excel` and :meth:`DataFrame.to_excel` (:issue:`47912`)
+- Deprecated the ``inplace`` keyword in :meth:`DataFrame.set_axis` and :meth:`Series.set_axis`, use ``obj = obj.set_axis(..., copy=False)`` instead (:issue:`48130`)
+- Deprecated producing a single element when iterating over a :class:`DataFrameGroupBy` or a :class:`SeriesGroupBy` that has been grouped by a list of length 1; a tuple of length one will be returned instead (:issue:`42795`)
+- Fixed up warning message of deprecation of :meth:`MultiIndex.lesort_depth` as public method, as the message previously referred to :meth:`MultiIndex.is_lexsorted` instead (:issue:`38701`)
+- Deprecated the ``sort_columns`` argument in :meth:`DataFrame.plot` and :meth:`Series.plot` (:issue:`47563`).
+- Deprecated positional arguments for all but the first argument of :meth:`DataFrame.to_stata` and :func:`read_stata`, use keyword arguments instead (:issue:`48128`).
+- Deprecated the ``mangle_dupe_cols`` argument in :func:`read_csv`, :func:`read_fwf`, :func:`read_table` and :func:`read_excel`. The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead (:issue:`47718`)
+- Deprecated allowing ``dtype='datetime64'`` or ``dtype=np.datetime64`` in :meth:`Series.astype`, use "datetime64[ns]" instead (:issue:`47844`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.performance:
+
+Performance improvements
+~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement in :meth:`DataFrame.corrwith` for column-wise (axis=0) Pearson and Spearman correlation when other is a :class:`Series` (:issue:`46174`)
+- Performance improvement in :meth:`.GroupBy.transform` for some user-defined DataFrame -> Series functions (:issue:`45387`)
+- Performance improvement in :meth:`DataFrame.duplicated` when subset consists of only one column (:issue:`45236`)
+- Performance improvement in :meth:`.GroupBy.diff` (:issue:`16706`)
+- Performance improvement in :meth:`.GroupBy.transform` when broadcasting values for user-defined functions (:issue:`45708`)
+- Performance improvement in :meth:`.GroupBy.transform` for user-defined functions when only a single group exists (:issue:`44977`)
+- Performance improvement in :meth:`.GroupBy.apply` when grouping on a non-unique unsorted index (:issue:`46527`)
+- Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` for tuple-based indexing of a :class:`MultiIndex` (:issue:`45681`, :issue:`46040`, :issue:`46330`)
+- Performance improvement in :meth:`.GroupBy.var` with ``ddof`` other than one (:issue:`48152`)
+- Performance improvement in :meth:`DataFrame.to_records` when the index is a :class:`MultiIndex` (:issue:`47263`)
+- Performance improvement in :attr:`MultiIndex.values` when the MultiIndex contains levels of type DatetimeIndex, TimedeltaIndex or ExtensionDtypes (:issue:`46288`)
+- Performance improvement in :func:`merge` when left and/or right are empty (:issue:`45838`)
+- Performance improvement in :meth:`DataFrame.join` when left and/or right are empty (:issue:`46015`)
+- Performance improvement in :meth:`DataFrame.reindex` and :meth:`Series.reindex` when target is a :class:`MultiIndex` (:issue:`46235`)
+- Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`)
+- Performance improvement in :func:`factorize` (:issue:`46109`)
+- Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`)
+- Performance improvement in :func:`read_excel` when ``nrows`` argument provided (:issue:`32727`)
+- Performance improvement in :meth:`.Styler.to_excel` when applying repeated CSS formats (:issue:`47371`)
+- Performance improvement in :meth:`MultiIndex.is_monotonic_increasing` (:issue:`47458`)
+- Performance improvement in :class:`BusinessHour` ``str`` and ``repr`` (:issue:`44764`)
+- Performance improvement in datetime arrays string formatting when one of the default strftime formats ``"%Y-%m-%d %H:%M:%S"`` or ``"%Y-%m-%d %H:%M:%S.%f"`` is used. (:issue:`44764`)
+- Performance improvement in :meth:`Series.to_sql` and :meth:`DataFrame.to_sql` (:class:`SQLiteTable`) when processing time arrays. (:issue:`44764`)
+- Performance improvement to :func:`read_sas` (:issue:`47404`)
+- Performance improvement in ``argmax`` and ``argmin`` for :class:`arrays.SparseArray` (:issue:`34197`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+
+Categorical
+^^^^^^^^^^^
+- Bug in :meth:`.Categorical.view` not accepting integer dtypes (:issue:`25464`)
+- Bug in :meth:`.CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`)
+- Bug in :meth:`concat` when concatenating two (or more) unordered :class:`CategoricalIndex` variables, whose categories are permutations, yields incorrect index values (:issue:`24845`)
+
+Datetimelike
+^^^^^^^^^^^^
+- Bug in :meth:`DataFrame.quantile` with datetime-like dtypes and no rows incorrectly returning ``float64`` dtype instead of retaining datetime-like dtype (:issue:`41544`)
+- Bug in :func:`to_datetime` with sequences of ``np.str_`` objects incorrectly raising (:issue:`32264`)
+- Bug in :class:`Timestamp` construction when passing datetime components as positional arguments and ``tzinfo`` as a keyword argument incorrectly raising (:issue:`31929`)
+- Bug in :meth:`Index.astype` when casting from object dtype to ``timedelta64[ns]`` dtype incorrectly casting ``np.datetime64("NaT")`` values to ``np.timedelta64("NaT")`` instead of raising (:issue:`45722`)
+- Bug in the index of :meth:`SeriesGroupBy.value_counts` when passing a categorical column (:issue:`44324`)
+- Bug in :meth:`DatetimeIndex.tz_localize` localizing to UTC failing to make a copy of the underlying data (:issue:`46460`)
+- Bug in :meth:`DatetimeIndex.resolution` incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (:issue:`46903`)
+- Bug in :class:`Timestamp` with an integer or float value and ``unit="Y"`` or ``unit="M"`` giving slightly-wrong results (:issue:`47266`)
+- Bug in :class:`.DatetimeArray` construction when passed another :class:`.DatetimeArray` and ``freq=None`` incorrectly inferring the freq from the given array (:issue:`47296`)
+- Bug in :func:`to_datetime` where ``OutOfBoundsDatetime`` would be thrown even if ``errors="coerce"`` if there were more than 50 rows (:issue:`45319`)
+- Bug where adding a :class:`DateOffset` to a :class:`Series` would not add the ``nanoseconds`` field (:issue:`47856`)
+
+Timedelta
+^^^^^^^^^
+- Bug in :func:`astype_nansafe` where casting to ``timedelta64[ns]`` failed when ``np.nan`` was included (:issue:`45798`)
+- Bug in constructing a :class:`Timedelta` with a ``np.timedelta64`` object and a ``unit`` sometimes silently overflowing and returning incorrect results instead of raising ``OutOfBoundsTimedelta`` (:issue:`46827`)
+- Bug in constructing a :class:`Timedelta` from a large integer or float with ``unit="W"`` silently overflowing and returning incorrect results instead of raising ``OutOfBoundsTimedelta`` (:issue:`47268`)
+
+Time Zones
+^^^^^^^^^^
+- Bug in :class:`Timestamp` constructor raising when passed a ``ZoneInfo`` tzinfo object (:issue:`46425`)
+
+Numeric
+^^^^^^^
+- Bug in operations with array-likes with ``dtype="boolean"`` and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`)
+- Bug in arithmetic operations with nullable types without :attr:`NA` values not matching the same operation with non-nullable types (:issue:`48223`)
+- Bug in ``floordiv`` when dividing by ``IntegerDtype`` ``0`` returning ``0`` instead of ``inf`` (:issue:`48223`)
+- Bug in division, ``pow`` and ``mod`` operations on array-likes with ``dtype="boolean"`` not behaving like their ``np.bool_`` counterparts (:issue:`46063`)
+- Bug in multiplying a :class:`Series` with ``IntegerDtype`` or ``FloatingDtype`` by an array-like with ``timedelta64[ns]`` dtype incorrectly raising (:issue:`45622`)
+- Bug in :meth:`mean` where the optional dependency ``bottleneck`` causes precision loss linear in the length of the array. ``bottleneck`` has been disabled for :meth:`mean` improving the loss to log-linear but may result in a performance decrease. (:issue:`42878`)
+
+Conversion
+^^^^^^^^^^
+- Bug in :meth:`DataFrame.astype` not preserving subclasses (:issue:`40810`)
+- Bug in constructing a :class:`Series` from a float-containing list or a floating-dtype ndarray-like (e.g. ``dask.Array``) and an integer dtype raising instead of casting like we would with an ``np.ndarray`` (:issue:`40110`)
+- Bug in :meth:`Float64Index.astype` to unsigned integer dtype incorrectly casting to ``np.int64`` dtype (:issue:`45309`)
+- Bug in :meth:`Series.astype` and :meth:`DataFrame.astype` from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (:issue:`45151`)
+- Bug in :func:`array` with ``FloatingDtype`` and values containing float-castable strings incorrectly raising (:issue:`45424`)
+- Bug when comparing string and ``datetime64[ns]`` objects causing an ``OverflowError`` exception (:issue:`45506`)
+- Bug in metaclass of generic abstract dtypes causing :meth:`DataFrame.apply` and :meth:`Series.apply` to raise for the built-in function ``type`` (:issue:`46684`)
+- Bug in :meth:`DataFrame.to_records` returning inconsistent numpy types if the index was a :class:`MultiIndex` (:issue:`47263`)
+- Bug in :meth:`DataFrame.to_dict` for ``orient="list"`` or ``orient="index"`` was not returning native types (:issue:`46751`)
+- Bug in :meth:`DataFrame.apply` that returns a :class:`DataFrame` instead of a :class:`Series` when applied to an empty :class:`DataFrame` and ``axis=1`` (:issue:`39111`)
+- Bug where inferring the dtype from an iterable that is *not* a NumPy ``ndarray`` and consists of all NumPy unsigned integer scalars did not result in an unsigned integer dtype (:issue:`47294`)
+- Bug in :meth:`DataFrame.eval` when pandas objects (e.g. ``'Timestamp'``) were column names (:issue:`44603`)
+
+Strings
+^^^^^^^
+- Bug in :meth:`str.startswith` and :meth:`str.endswith` when using another :class:`Series` as the ``pat`` parameter; this now raises a ``TypeError`` (:issue:`3485`)
+- Bug in :meth:`Series.str.zfill` when strings contain leading signs, padding ``'0'`` before the sign character rather than after it, as ``str.zfill`` from the standard library does (:issue:`20868`)
+
+Interval
+^^^^^^^^
+- Bug in :meth:`IntervalArray.__setitem__` when setting ``np.nan`` into an integer-backed array raising ``ValueError`` instead of ``TypeError`` (:issue:`45484`)
+- Bug in :class:`IntervalDtype` when using datetime64[ns, tz] as a dtype string (:issue:`46999`)
+
+Indexing
+^^^^^^^^
+- Bug in :meth:`DataFrame.iloc` where indexing a single row on a :class:`DataFrame` with a single ExtensionDtype column gave a copy instead of a view on the underlying data (:issue:`45241`)
+- Bug in :meth:`DataFrame.__getitem__` returning copy when :class:`DataFrame` has duplicated columns even if a unique column is selected (:issue:`45316`, :issue:`41062`)
+- Bug in :meth:`Series.align` not creating a :class:`MultiIndex` with the union of levels when both MultiIndexes' intersections are identical (:issue:`45224`)
+- Bug in setting a NA value (``None`` or ``np.nan``) into a :class:`Series` with int-based :class:`IntervalDtype` incorrectly casting to object dtype instead of a float-based :class:`IntervalDtype` (:issue:`45568`)
+- Bug in indexing setting values into an ``ExtensionDtype`` column with ``df.iloc[:, i] = values`` with ``values`` having the same dtype as ``df.iloc[:, i]`` incorrectly inserting a new array instead of setting in-place (:issue:`33457`)
+- Bug in :meth:`Series.__setitem__` with a non-integer :class:`Index` when using an integer key to set a value that cannot be set inplace where a ``ValueError`` was raised instead of casting to a common dtype (:issue:`45070`)
+- Bug in :meth:`DataFrame.loc` not casting ``None`` to ``NA`` when setting value as a list into :class:`DataFrame` (:issue:`47987`)
+- Bug in :meth:`Series.__setitem__` when setting incompatible values into a ``PeriodDtype`` or ``IntervalDtype`` :class:`Series` raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with :meth:`Series.mask` and :meth:`Series.where` (:issue:`45768`)
+- Bug in :meth:`DataFrame.where` with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (:issue:`45837`)
+- Bug in :func:`isin` upcasting to ``float64`` with unsigned integer dtype and list-like argument without a dtype (:issue:`46485`)
+- Bug in :meth:`Series.loc.__setitem__` and :meth:`Series.loc.__getitem__` not raising when using multiple keys without using a :class:`MultiIndex` (:issue:`13831`)
+- Bug in :meth:`Index.reindex` raising ``AssertionError`` when ``level`` was specified but no :class:`MultiIndex` was given; level is ignored now (:issue:`35132`)
+- Bug when setting a value too large for a :class:`Series` dtype failing to coerce to a common type (:issue:`26049`, :issue:`32878`)
+- Bug in :meth:`loc.__setitem__` treating ``range`` keys as positional instead of label-based (:issue:`45479`)
+- Bug in :meth:`DataFrame.__setitem__` casting extension array dtypes to object when setting with a scalar key and :class:`DataFrame` as value (:issue:`46896`)
+- Bug in :meth:`Series.__setitem__` when setting a scalar to a nullable pandas dtype would not raise a ``TypeError`` if the scalar could not be cast (losslessly) to the nullable type (:issue:`45404`)
+- Bug in :meth:`Series.__setitem__` when setting ``boolean`` dtype values containing ``NA`` incorrectly raising instead of casting to ``boolean`` dtype (:issue:`45462`)
+- Bug in :meth:`Series.loc` raising with boolean indexer containing ``NA`` when :class:`Index` did not match (:issue:`46551`)
+- Bug in :meth:`Series.__setitem__` where setting :attr:`NA` into a numeric-dtype :class:`Series` would incorrectly upcast to object-dtype rather than treating the value as ``np.nan`` (:issue:`44199`)
+- Bug in :meth:`DataFrame.loc` when setting values to a column and right hand side is a dictionary (:issue:`47216`)
+- Bug in :meth:`Series.__setitem__` with ``datetime64[ns]`` dtype, an all-``False`` boolean mask, and an incompatible value incorrectly casting to ``object`` instead of retaining ``datetime64[ns]`` dtype (:issue:`45967`)
+- Bug in :meth:`Index.__getitem__` raising ``ValueError`` when indexer is from boolean dtype with ``NA`` (:issue:`45806`)
+- Bug in :meth:`Series.__setitem__` losing precision when enlarging :class:`Series` with scalar (:issue:`32346`)
+- Bug in :meth:`Series.mask` with ``inplace=True`` or setting values with a boolean mask with small integer dtypes incorrectly raising (:issue:`45750`)
+- Bug in :meth:`DataFrame.mask` with ``inplace=True`` and ``ExtensionDtype`` columns incorrectly raising (:issue:`45577`)
+- Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (:issue:`42950`)
+- Bug in :meth:`DataFrame.__getattribute__` raising ``AttributeError`` if columns have ``"string"`` dtype (:issue:`46185`)
+- Bug in :meth:`DataFrame.compare` returning all ``NaN`` column when comparing extension array dtype and numpy dtype (:issue:`44014`)
+- Bug in :meth:`DataFrame.where` setting wrong values with ``"boolean"`` mask for numpy dtype (:issue:`44014`)
+- Bug in indexing on a :class:`DatetimeIndex` with a ``np.str_`` key incorrectly raising (:issue:`45580`)
+- Bug in :meth:`CategoricalIndex.get_indexer` when index contains ``NaN`` values, resulting in elements that are in target but not present in the index to be mapped to the index of the NaN element, instead of -1 (:issue:`45361`)
+- Bug in setting large integer values into :class:`Series` with ``float32`` or ``float16`` dtype incorrectly altering these values instead of coercing to ``float64`` dtype (:issue:`45844`)
+- Bug in :meth:`Series.asof` and :meth:`DataFrame.asof` incorrectly casting bool-dtype results to ``float64`` dtype (:issue:`16063`)
+- Bug in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` not always propagating metadata (:issue:`28283`)
+- Bug in :meth:`DataFrame.sum` with ``min_count`` changing dtype if input contains NaNs (:issue:`46947`)
+- Bug in :class:`IntervalTree` that led to an infinite recursion (:issue:`46658`)
+- Bug in :class:`PeriodIndex` raising ``AttributeError`` when indexing on ``NA``, rather than putting ``NaT`` in its place. (:issue:`46673`)
+- Bug in :meth:`DataFrame.at` would allow the modification of multiple columns (:issue:`48296`)
+
+Missing
+^^^^^^^
+- Bug in :meth:`Series.fillna` and :meth:`DataFrame.fillna` with ``downcast`` keyword not being respected in some cases where there are no NA values present (:issue:`45423`)
+- Bug in :meth:`Series.fillna` and :meth:`DataFrame.fillna` with :class:`IntervalDtype` and incompatible value raising instead of casting to a common (usually object) dtype (:issue:`45796`)
+- Bug in :meth:`Series.map` not respecting ``na_action`` argument if mapper is a ``dict`` or :class:`Series` (:issue:`47527`)
+- Bug in :meth:`DataFrame.interpolate` with object-dtype column not returning a copy with ``inplace=False`` (:issue:`45791`)
+- Bug in :meth:`DataFrame.dropna` allowing the incompatible arguments ``how`` and ``thresh`` to be set at the same time (:issue:`46575`)
+- Bug in :meth:`DataFrame.fillna` ignoring ``axis`` when the :class:`DataFrame` is a single block (:issue:`47713`)
+
+MultiIndex
+^^^^^^^^^^
+- Bug in :meth:`DataFrame.loc` returning empty result when slicing a :class:`MultiIndex` with a negative step size and non-null start/stop values (:issue:`46156`)
+- Bug in :meth:`DataFrame.loc` raising when slicing a :class:`MultiIndex` with a negative step size other than -1 (:issue:`46156`)
+- Bug in :meth:`DataFrame.loc` raising when slicing a :class:`MultiIndex` with a negative step size and slicing a non-int labeled index level (:issue:`46156`)
+- Bug in :meth:`Series.to_numpy` where multiindexed Series could not be converted to numpy arrays when an ``na_value`` was supplied (:issue:`45774`)
+- Bug in :meth:`MultiIndex.equals` not being commutative when only one side has an extension array dtype (:issue:`46026`)
+- Bug in :meth:`MultiIndex.from_tuples` not being able to construct an :class:`Index` of empty tuples (:issue:`45608`)
+
+I/O
+^^^
+- Bug in :meth:`DataFrame.to_stata` where no error is raised if the :class:`DataFrame` contains ``-np.inf`` (:issue:`45350`)
+- Bug in :func:`read_excel` results in an infinite loop with certain ``skiprows`` callables (:issue:`45585`)
+- Bug in :meth:`DataFrame.info` where a new line at the end of the output is omitted when called on an empty :class:`DataFrame` (:issue:`45494`)
+- Bug in :func:`read_csv` not recognizing line break for ``on_bad_lines="warn"`` for ``engine="c"`` (:issue:`41710`)
+- Bug in :meth:`DataFrame.to_csv` not respecting ``float_format`` for ``Float64`` dtype (:issue:`45991`)
+- Bug in :func:`read_csv` not respecting a specified converter to index columns in all cases (:issue:`40589`)
+- Bug in :func:`read_csv` interpreting second row as :class:`Index` names even when ``index_col=False`` (:issue:`46569`)
+- Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`)
+- Bug in :meth:`DataFrame.to_excel` and :class:`ExcelWriter` raising when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`)
+- Bug in :func:`read_csv` ignoring non-existing header row for ``engine="python"`` (:issue:`47400`)
+- Bug in :func:`read_excel` raising uncontrolled ``IndexError`` when ``header`` references non-existing rows (:issue:`43143`)
+- Bug in :func:`read_html` where elements surrounding ``<br>`` were joined without a space between them (:issue:`29528`)
+- Bug in :func:`read_csv` when data is longer than header leading to issues with callables in ``usecols`` expecting strings (:issue:`46997`)
+- Bug in Parquet roundtrip for Interval dtype with ``datetime64[ns]`` subtype (:issue:`45881`)
+- Bug in :func:`read_excel` when reading a ``.ods`` file with newlines between xml elements (:issue:`45598`)
+- Bug in :func:`read_parquet` when ``engine="fastparquet"`` where the file was not closed on error (:issue:`46555`)
+- :meth:`DataFrame.to_html` now excludes the ``border`` attribute from ``<table>`` elements when the ``border`` keyword is set to ``False``.
+- Bug in :func:`read_sas` with certain types of compressed SAS7BDAT files (:issue:`35545`)
+- Bug in :func:`read_excel` not forward filling :class:`MultiIndex` when no names were given (:issue:`47487`)
+- Bug in :func:`read_sas` returned ``None`` rather than an empty DataFrame for SAS7BDAT files with zero rows (:issue:`18198`)
+- Bug in :meth:`DataFrame.to_string` using wrong missing value with extension arrays in :class:`MultiIndex` (:issue:`47986`)
+- Bug in :class:`StataWriter` where value labels were always written with default encoding (:issue:`46750`)
+- Bug in :class:`StataWriterUTF8` where some valid characters were removed from variable names (:issue:`47276`)
+- Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`)
+- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`)
+- Bug in :func:`read_sas` that scrambled column names (:issue:`31243`)
+- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`)
+- Bug in :func:`read_parquet` with ``use_nullable_dtypes=True`` where ``float64`` dtype was returned instead of nullable ``Float64`` dtype (:issue:`45694`)
+- Bug in :meth:`DataFrame.to_json` where ``PeriodDtype`` would not make the serialization roundtrip when read back with :meth:`read_json` (:issue:`44720`)
+- Bug in :func:`read_xml` when reading XML files with Chinese character tags and would raise ``XMLSyntaxError`` (:issue:`47902`)
+
+Period
+^^^^^^
+- Bug in subtraction of :class:`Period` from :class:`.PeriodArray` returning wrong results (:issue:`45999`)
+- Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, directives ``%l`` and ``%u`` were giving wrong results (:issue:`46252`)
+- Bug in inferring an incorrect ``freq`` when passing a string with microseconds that are a multiple of 1000 to :class:`Period` (:issue:`46811`)
+- Bug in constructing a :class:`Period` from a :class:`Timestamp` or ``np.datetime64`` object with non-zero nanoseconds and ``freq="ns"`` incorrectly truncating the nanoseconds (:issue:`46811`)
+- Bug in adding ``np.timedelta64("NaT", "ns")`` to a :class:`Period` with a timedelta-like freq incorrectly raising ``IncompatibleFrequency`` instead of returning ``NaT`` (:issue:`47196`)
+- Bug in adding an array of integers to an array with :class:`PeriodDtype` giving incorrect results when ``dtype.freq.n > 1`` (:issue:`47209`)
+- Bug in subtracting a :class:`Period` from an array with :class:`PeriodDtype` returning incorrect results instead of raising ``OverflowError`` when the operation overflows (:issue:`47538`)
+
+Plotting
+^^^^^^^^
+- Bug in :meth:`DataFrame.plot.barh` that prevented labeling the x-axis and ``xlabel`` updating the y-axis label (:issue:`45144`)
+- Bug in :meth:`DataFrame.plot.box` that prevented labeling the x-axis (:issue:`45463`)
+- Bug in :meth:`DataFrame.boxplot` that prevented passing in ``xlabel`` and ``ylabel`` (:issue:`45463`)
+- Bug in :meth:`DataFrame.boxplot` that prevented specifying ``vert=False`` (:issue:`36918`)
+- Bug in :meth:`DataFrame.plot.scatter` that prevented specifying ``norm`` (:issue:`45809`)
+- Fix showing "None" as ylabel in :meth:`Series.plot` when not setting ylabel (:issue:`46129`)
+- Bug in :meth:`DataFrame.plot` that led to xticks and vertical grids being improperly placed when plotting a quarterly series (:issue:`47602`)
+- Bug in :meth:`DataFrame.plot` that prevented setting y-axis label, limits and ticks for a secondary y-axis (:issue:`47753`)
+
+Groupby/resample/rolling
+^^^^^^^^^^^^^^^^^^^^^^^^
+- Bug in :meth:`DataFrame.resample` ignoring ``closed="right"`` on :class:`TimedeltaIndex` (:issue:`45414`)
+- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``func="size"`` and the input DataFrame has multiple columns (:issue:`27469`)
+- Bug in :meth:`.DataFrameGroupBy.size` and :meth:`.DataFrameGroupBy.transform` with ``func="size"`` produced incorrect results when ``axis=1`` (:issue:`45715`)
+- Bug in :meth:`.ExponentialMovingWindow.mean` with ``axis=1`` and ``engine='numba'`` when the :class:`DataFrame` has more columns than rows (:issue:`46086`)
+- Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`)
+- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`)
+- Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`)
+- Bug in :meth:`.GroupBy.sum`, :meth:`.GroupBy.prod` and :meth:`.GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)
+- Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`)
+- Bug in :meth:`.GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)
+- Bug in :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` with nullable dtypes incorrectly altering the original data in place (:issue:`46220`)
+- Bug in :meth:`DataFrame.groupby` raising error when ``None`` is in first level of :class:`MultiIndex` (:issue:`47348`)
+- Bug in :meth:`.GroupBy.cummax` with ``int64`` dtype with leading value being the smallest possible int64 (:issue:`46382`)
+- Bug in :meth:`.GroupBy.cumprod` where ``NaN`` in one column influenced the calculation in different columns with ``skipna=False`` (:issue:`48064`)
+- Bug in :meth:`.GroupBy.max` with empty groups and ``uint64`` dtype incorrectly raising ``RuntimeError`` (:issue:`46408`)
+- Bug in :meth:`.GroupBy.apply` would fail when ``func`` was a string and args or kwargs were supplied (:issue:`46479`)
+- Bug in :meth:`SeriesGroupBy.apply` would incorrectly name its result when there was a unique group (:issue:`46369`)
+- Bug in :meth:`.Rolling.sum` and :meth:`.Rolling.mean` would give incorrect result with window of same values (:issue:`42064`, :issue:`46431`)
+- Bug in :meth:`.Rolling.var` and :meth:`.Rolling.std` would give non-zero result with window of same values (:issue:`42064`)
+- Bug in :meth:`.Rolling.skew` and :meth:`.Rolling.kurt` would give NaN with window of same values (:issue:`30993`)
+- Bug in :meth:`.Rolling.var` would segfault calculating weighted variance when window size was larger than data size (:issue:`46760`)
+- Bug in :meth:`Grouper.__repr__` where ``dropna`` was not included. Now it is (:issue:`46754`)
+- Bug in :meth:`DataFrame.rolling` giving a ``ValueError`` when ``center=True``, ``axis=1`` and ``win_type`` are specified (:issue:`46135`)
+- Bug in :meth:`.DataFrameGroupBy.describe` and :meth:`.SeriesGroupBy.describe` produces inconsistent results for empty datasets (:issue:`41575`)
+- Bug in :meth:`DataFrame.resample` reduction methods when used with ``on`` would attempt to aggregate the provided column (:issue:`47079`)
+- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` not respecting ``dropna=False`` when the input DataFrame/Series had NaN values in a :class:`MultiIndex` (:issue:`46783`)
+- Bug in :meth:`DataFrameGroupBy.resample` raising ``KeyError`` when getting the result from a key list which misses the resample key (:issue:`47362`)
+- Bug in :meth:`DataFrame.groupby` losing index columns when the DataFrame is empty for transforms, like fillna (:issue:`47787`)
+- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` with ``dropna=False`` and ``sort=False`` putting any null groups at the end instead of in the order in which they are encountered (:issue:`46584`)
+
+Reshaping
+^^^^^^^^^
+- Bug in :func:`concat` between a :class:`Series` with integer dtype and another with :class:`CategoricalDtype` with integer categories and containing ``NaN`` values casting to object dtype instead of ``float64`` (:issue:`45359`)
+- Bug in :func:`get_dummies` that selected object and categorical dtypes but not string (:issue:`44965`)
+- Bug in :meth:`DataFrame.align` when aligning a :class:`MultiIndex` to a :class:`Series` with another :class:`MultiIndex` (:issue:`46001`)
+- Bug in concatenation with ``IntegerDtype``, or ``FloatingDtype`` arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (:issue:`46379`)
+- Bug in :func:`concat` losing dtype of columns when ``join="outer"`` and ``sort=True`` (:issue:`47329`)
+- Bug in :func:`concat` not sorting the column names when ``None`` is included (:issue:`47331`)
+- Bug in :func:`concat` with identical key leading to an error when indexing a :class:`MultiIndex` (:issue:`46519`)
+- Bug in :func:`pivot_table` raising ``TypeError`` when ``dropna=True`` and aggregation column has extension array dtype (:issue:`47477`)
+- Bug in :func:`merge` raising error for ``how="cross"`` when using ``FIPS`` mode in ssl library (:issue:`48024`)
+- Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`)
+- Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` resulting in a sorted index (:issue:`17041`)
+- Bug in :func:`concat` when ``axis=1`` and ``sort=False`` where the resulting Index was an :class:`Int64Index` instead of a :class:`RangeIndex` (:issue:`46675`)
+- Bug in :func:`wide_to_long` raising when ``stubnames`` is missing in columns and ``i`` contains a string dtype column (:issue:`46044`)
+- Bug in :meth:`DataFrame.join` with a categorical index resulting in unexpected reordering (:issue:`47812`)
+
+Sparse
+^^^^^^
+- Bug in :meth:`Series.where` and :meth:`DataFrame.where` with ``SparseDtype`` failing to retain the array's ``fill_value`` (:issue:`45691`)
+- Bug in :meth:`SparseArray.unique` failing to keep the original order of elements (:issue:`47809`)
+
+ExtensionArray
+^^^^^^^^^^^^^^
+- Bug in :meth:`IntegerArray.searchsorted` and :meth:`FloatingArray.searchsorted` returning inconsistent results when acting on ``np.nan`` (:issue:`45255`)
+
+Styler
+^^^^^^
+- Bug when attempting to apply styling functions to an empty DataFrame subset (:issue:`45313`)
+- Bug in :class:`CSSToExcelConverter` leading to ``TypeError`` when border color provided without border style for ``xlsxwriter`` engine (:issue:`42276`)
+- Bug in :meth:`Styler.set_sticky` leading to white text on white background in dark mode (:issue:`46984`)
+- Bug in :meth:`Styler.to_latex` causing ``UnboundLocalError`` when ``clines="all;data"`` and the ``DataFrame`` has no rows. (:issue:`47203`)
+- Bug in :meth:`Styler.to_excel` when using ``vertical-align: middle;`` with ``xlsxwriter`` engine (:issue:`30107`)
+- Bug when applying styles to a DataFrame with boolean column labels (:issue:`47838`)
+
+Metadata
+^^^^^^^^
+- Fixed metadata propagation in :meth:`DataFrame.melt` (:issue:`28283`)
+- Fixed metadata propagation in :meth:`DataFrame.explode` (:issue:`28283`)
+
+Other
+^^^^^
+
+.. ***DO NOT USE THIS SECTION***
+
+- Bug in :func:`.assert_index_equal` with ``names=True`` and ``check_order=False`` not checking names (:issue:`47328`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_150.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.4.4..v1.5.0
diff --git a/doc/source/whatsnew/v1.5.1.rst b/doc/source/whatsnew/v1.5.1.rst
new file mode 100644
index 0000000000000..bcd8ddb9cbc0b
--- /dev/null
+++ b/doc/source/whatsnew/v1.5.1.rst
@@ -0,0 +1,122 @@
+.. _whatsnew_151:
+
+What's new in 1.5.1 (October 19, 2022)
+--------------------------------------
+
+These are the changes in pandas 1.5.1. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_151.groupby_categorical_regr:
+
+Behavior of ``groupby`` with categorical groupers (:issue:`48645`)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In versions of pandas prior to 1.5, ``groupby`` with ``dropna=False`` would still drop
+NA values when the grouper was a categorical dtype. A fix for this was attempted in
+1.5; however, it introduced a regression where passing ``observed=False`` and
+``dropna=False`` to ``groupby`` would result in only observed categories. It was found
+that the patch fixing the ``dropna=False`` bug is incompatible with ``observed=False``,
+and it was decided that the best resolution is to restore the correct ``observed=False``
+behavior at the cost of reintroducing the ``dropna=False`` bug.
+
+.. ipython:: python
+
+ df = pd.DataFrame(
+ {
+ "x": pd.Categorical([1, None], categories=[1, 2, 3]),
+ "y": [3, 4],
+ }
+ )
+ df
+
+*1.5.0 behavior*:
+
+.. code-block:: ipython
+
+ In [3]: # Correct behavior, NA values are not dropped
+ df.groupby("x", observed=True, dropna=False).sum()
+ Out[3]:
+ y
+ x
+ 1 3
+ NaN 4
+
+
+ In [4]: # Incorrect behavior, only observed categories present
+ df.groupby("x", observed=False, dropna=False).sum()
+ Out[4]:
+ y
+ x
+ 1 3
+ NaN 4
+
+
+*1.5.1 behavior*:
+
+.. ipython:: python
+
+ # Incorrect behavior, NA values are dropped
+ df.groupby("x", observed=True, dropna=False).sum()
+
+ # Correct behavior, unobserved categories present (NA values still dropped)
+ df.groupby("x", observed=False, dropna=False).sum()
+
+.. _whatsnew_151.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`Series.__setitem__` casting ``None`` to ``NaN`` for object dtype (:issue:`48665`)
+- Fixed regression in :meth:`DataFrame.loc` when setting values as a :class:`DataFrame` with all ``True`` indexer (:issue:`48701`)
+- Fixed regression in :func:`read_csv` causing an ``EmptyDataError`` when using a UTF-8 file handle that was already read from (:issue:`48646`)
+- Fixed regression in :func:`to_datetime` raising a ``ValueError`` when ``utc=True`` and ``arg`` contained timezone-naive and timezone-aware arguments (:issue:`48678`)
+- Fixed regression in :meth:`DataFrame.loc` raising ``FutureWarning`` when setting an empty :class:`DataFrame` (:issue:`48480`)
+- Fixed regression in :meth:`DataFrame.describe` raising ``TypeError`` when result contains ``NA`` (:issue:`48778`)
+- Fixed regression in :meth:`DataFrame.plot` ignoring invalid ``colormap`` for ``kind="scatter"`` (:issue:`48726`)
+- Fixed regression in :meth:`MultiIndex.values` resetting ``freq`` attribute of underlying :class:`Index` object (:issue:`49054`)
+- Fixed performance regression in :func:`factorize` when ``na_sentinel`` is not ``None`` and ``sort=False`` (:issue:`48620`)
+- Fixed regression causing an ``AttributeError`` when a warning was emitted because the table name provided to :meth:`DataFrame.to_sql` and the table name actually used in the database did not match (:issue:`48733`)
+- Fixed regression in :func:`to_datetime` when ``arg`` was a date string with nanosecond and ``format`` contained ``%f`` would raise a ``ValueError`` (:issue:`48767`)
+- Fixed regression in :func:`testing.assert_frame_equal` raising for :class:`MultiIndex` with :class:`Categorical` and ``check_like=True`` (:issue:`48975`)
+- Fixed regression in :meth:`DataFrame.fillna` replacing wrong values for ``datetime64[ns]`` dtype and ``inplace=True`` (:issue:`48863`)
+- Fixed :meth:`.DataFrameGroupBy.size` not returning a Series when ``axis=1`` (:issue:`48738`)
+- Fixed regression in :meth:`.DataFrameGroupBy.apply` when a user-defined function was called on an empty DataFrame (:issue:`47985`)
+- Fixed regression in :meth:`DataFrame.apply` when passing non-zero ``axis`` via keyword argument (:issue:`48656`)
+- Fixed regression in :meth:`Series.groupby` and :meth:`DataFrame.groupby` when the grouper is a nullable data type (e.g. :class:`Int64`) or a PyArrow-backed string array, contains null values, and ``dropna=False`` (:issue:`48794`)
+- Fixed performance regression in :meth:`Series.isin` with mismatching dtypes (:issue:`49162`)
+- Fixed regression in :meth:`DataFrame.to_parquet` raising when file name was specified as ``bytes`` (:issue:`48944`)
+- Fixed regression in :class:`ExcelWriter` where the ``book`` attribute could no longer be set; however setting this attribute is now deprecated and this ability will be removed in a future version of pandas (:issue:`48780`)
+- Fixed regression in :meth:`DataFrame.corrwith` when computing correlation on tied data with ``method="spearman"`` (:issue:`48826`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_151.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Bug in :meth:`Series.__getitem__` not falling back to positional for integer keys and boolean :class:`Index` (:issue:`48653`)
+- Bug in :meth:`DataFrame.to_hdf` raising ``AssertionError`` with boolean index (:issue:`48667`)
+- Bug in :func:`testing.assert_index_equal` for extension arrays with non-matching ``NA`` raising ``ValueError`` (:issue:`48608`)
+- Bug in :meth:`DataFrame.pivot_table` raising unexpected ``FutureWarning`` when setting datetime column as index (:issue:`48683`)
+- Bug in :meth:`DataFrame.sort_values` emitting unnecessary ``FutureWarning`` when called on :class:`DataFrame` with boolean sparse columns (:issue:`48784`)
+- Bug in :class:`.arrays.ArrowExtensionArray` where comparing with an invalid object would not raise a ``NotImplementedError`` (:issue:`48833`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_151.other:
+
+Other
+~~~~~
+- Avoid showing deprecated signatures when introspecting functions with warnings about arguments becoming keyword-only (:issue:`48692`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_151.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.5.0..v1.5.1
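To make the ``to_datetime`` fix at the top of the list above concrete, a minimal sketch (assuming pandas >= 1.5.1; the input values are illustrative only)::

    import pandas as pd

    # With utc=True, a mix of timezone-naive and timezone-aware inputs
    # should be localized to UTC rather than raise ValueError (GH 48678).
    mixed = ["2022-10-01 12:00:00", "2022-10-01 12:00:00+02:00"]
    idx = pd.to_datetime(mixed, utc=True)
    print(idx.dtype)  # datetime64[ns, UTC]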
diff --git a/doc/source/whatsnew/v1.5.2.rst b/doc/source/whatsnew/v1.5.2.rst
new file mode 100644
index 0000000000000..6397016d827f2
--- /dev/null
+++ b/doc/source/whatsnew/v1.5.2.rst
@@ -0,0 +1,46 @@
+.. _whatsnew_152:
+
+What's new in 1.5.2 (November 21, 2022)
+---------------------------------------
+
+These are the changes in pandas 1.5.2. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_152.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`MultiIndex.join` for extension array dtypes (:issue:`49277`)
+- Fixed regression in :meth:`Series.replace` raising ``RecursionError`` for numeric dtype when ``value=None`` is specified (:issue:`45725`)
+- Fixed regression in arithmetic operations for :class:`DataFrame` with :class:`MultiIndex` columns with different dtypes (:issue:`49769`)
+- Fixed regression in :meth:`DataFrame.plot` preventing :class:`~matplotlib.colors.Colormap` instance
+ from being passed using the ``colormap`` argument if Matplotlib 3.6+ is used (:issue:`49374`)
+- Fixed regression in :func:`date_range` returning an invalid set of periods for ``CustomBusinessDay`` frequency and ``start`` date with timezone (:issue:`49441`)
+- Fixed performance regression in groupby operations (:issue:`49676`)
+- Fixed regression in :class:`Timedelta` constructor returning object of wrong type when subclassing ``Timedelta`` (:issue:`49579`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_152.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Bug in the Copy-on-Write implementation losing track of views in certain chained indexing cases (:issue:`48996`)
+- Fixed memory leak in :meth:`.Styler.to_excel` (:issue:`49751`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_152.other:
+
+Other
+~~~~~
+- Reverted ``color`` as an alias for ``c`` and ``size`` as an alias for ``s`` in :meth:`DataFrame.plot.scatter` (:issue:`49732`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_152.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.5.1..v1.5.2|HEAD
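The ``Timedelta`` constructor regression listed above is easiest to see with a trivial subclass; a minimal sketch (``MyTimedelta`` is a hypothetical name, assuming pandas >= 1.5.2)::

    import pandas as pd

    class MyTimedelta(pd.Timedelta):
        # Trivial subclass used only to observe the constructor's return type.
        pass

    td = MyTimedelta("1 day")
    # The constructor should preserve the subclass rather than
    # return a plain Timedelta (GH 49579).
    print(type(td).__name__)  # MyTimedelta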
diff --git a/doc/source/whatsnew/v1.5.3.rst b/doc/source/whatsnew/v1.5.3.rst
new file mode 100644
index 0000000000000..97c4c73f08c37
--- /dev/null
+++ b/doc/source/whatsnew/v1.5.3.rst
@@ -0,0 +1,59 @@
+.. _whatsnew_153:
+
+What's new in 1.5.3 (January 18, 2023)
+--------------------------------------
+
+These are the changes in pandas 1.5.3. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_153.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed performance regression in :meth:`Series.isin` when ``values`` is empty (:issue:`49839`)
+- Fixed regression in :meth:`DataFrame.memory_usage` showing unnecessary ``FutureWarning`` when :class:`DataFrame` is empty (:issue:`50066`)
+- Fixed regression in :meth:`.DataFrameGroupBy.transform` when used with ``as_index=False`` (:issue:`49834`)
+- Enforced reversion of ``color`` as an alias for ``c`` and ``size`` as an alias for ``s`` in :meth:`DataFrame.plot.scatter` (:issue:`49732`)
+- Fixed regression in :meth:`.SeriesGroupBy.apply` setting a ``name`` attribute on the result if the result was a :class:`DataFrame` (:issue:`49907`)
+- Fixed performance regression in setting with the :meth:`~DataFrame.at` indexer (:issue:`49771`)
+- Fixed regression in the methods ``apply``, ``agg``, and ``transform`` when used with NumPy functions that informed users to supply ``numeric_only=True`` if the operation failed on non-numeric dtypes; such columns must be dropped prior to using these methods (:issue:`50538`)
+- Fixed regression in :func:`to_datetime` raising ``ValueError`` when parsing array of ``float`` containing ``np.nan`` (:issue:`50237`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_153.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Bug in the Copy-on-Write implementation losing track of views when indexing a :class:`DataFrame` with another :class:`DataFrame` (:issue:`50630`)
+- Bug in :meth:`.Styler.to_excel` leading to an error when an unrecognized ``border-style`` (e.g. ``"hair"``) was provided to Excel writers (:issue:`48649`)
+- Bug in :meth:`Series.quantile` emitting a warning from NumPy when :class:`Series` has only ``NA`` values (:issue:`50681`)
+- Bug where, when chaining several :meth:`.Styler.concat` calls, only the last styler was concatenated (:issue:`49207`)
+- Fixed bug when instantiating a :class:`DataFrame` subclass inheriting from ``typing.Generic`` that triggered a ``UserWarning`` on Python 3.11 (:issue:`49649`)
+- Bug in :func:`pivot_table` with NumPy 1.24 or greater when the :class:`DataFrame` columns have nested elements (:issue:`50342`)
+- Bug in :func:`pandas.testing.assert_series_equal` (and the equivalent ``assert_`` functions) with nested data and NumPy >= 1.25 (:issue:`50360`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_153.other:
+
+Other
+~~~~~
+
+.. note::
+
+ If you are using :meth:`DataFrame.to_sql`, :func:`read_sql`, :func:`read_sql_table`, or :func:`read_sql_query` with SQLAlchemy 1.4.46 or greater,
+ you may see a ``sqlalchemy.exc.RemovedIn20Warning``. These warnings can be safely ignored for the SQLAlchemy 1.4.x releases
+ as pandas works toward compatibility with SQLAlchemy 2.0.
+
+- Reverted deprecation (:issue:`45324`) of behavior of :meth:`Series.__getitem__` and :meth:`Series.__setitem__` slicing with an integer :class:`Index`; this will remain positional (:issue:`49612`)
+- A ``FutureWarning`` raised when attempting to set values inplace with :meth:`DataFrame.loc` or :meth:`DataFrame.iloc` has been changed to a ``DeprecationWarning`` (:issue:`48673`)
+
+.. ---------------------------------------------------------------------------
+.. _whatsnew_153.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.5.2..v1.5.3|HEAD
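Until that compatibility work lands, one way to silence the warnings mentioned in the note above is an explicit filter; a minimal sketch (assuming SQLAlchemy 1.4.46 or greater is installed)::

    import warnings

    try:
        from sqlalchemy.exc import RemovedIn20Warning
    except ImportError:
        RemovedIn20Warning = None  # older SQLAlchemy: nothing to filter

    if RemovedIn20Warning is not None:
        # The warnings originate in SQLAlchemy 1.4.x, not in pandas,
        # and are safe to ignore per the note above.
        warnings.filterwarnings("ignore", category=RemovedIn20Warning)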
diff --git a/doc/sphinxext/README.rst b/doc/sphinxext/README.rst
index 8f0f4a8b2636d..ef52433e5e869 100644
--- a/doc/sphinxext/README.rst
+++ b/doc/sphinxext/README.rst
@@ -1,17 +1,5 @@
sphinxext
=========
-This directory contains copies of different sphinx extensions in use in the
-pandas documentation. These copies originate from other projects:
-
-- ``numpydoc`` - Numpy's Sphinx extensions: this can be found at its own
- repository: https://github.com/numpy/numpydoc
-- ``ipython_directive`` and ``ipython_console_highlighting`` in the folder
- ``ipython_sphinxext`` - Sphinx extensions from IPython: these are included
- in IPython: https://github.com/ipython/ipython/tree/master/IPython/sphinxext
-
-.. note::
-
- These copies are maintained at the respective projects, so fixes should,
- to the extent possible, be pushed upstream instead of only adapting our
- local copy to avoid divergence between the local and upstream version.
+This directory contains custom Sphinx extensions used in the pandas
+documentation.
diff --git a/environment.yml b/environment.yml
index bf317d24122ae..20f839db9ad60 100644
--- a/environment.yml
+++ b/environment.yml
@@ -1,128 +1,132 @@
+# Local development dependencies including docs building, website upload, ASV benchmark
name: pandas-dev
channels:
- conda-forge
dependencies:
- # required
- - numpy>=1.18.5, <1.22.0
- python=3.8
- - python-dateutil>=2.8.1
+
+ # test dependencies
+ - cython=0.29.32
+ - pytest>=6.0
+ - pytest-cov
+ - pytest-xdist>=1.31
+ - psutil
+ - pytest-asyncio>=0.17
+ - boto3
+
+ # required dependencies
+ - python-dateutil
+ - numpy
- pytz
+ # optional dependencies
+ - beautifulsoup4
+ - blosc
+ - brotlipy
+ - bottleneck
+ - fastparquet
+ - fsspec
+ - html5lib
+ - hypothesis
+ - gcsfs
+ - jinja2
+ - lxml
+ - matplotlib>=3.6.1
+ - numba>=0.53.1
+ - numexpr>=2.8.0 # pin for "Run checks on imported code" job
+ - openpyxl
+ - odfpy
+ - pandas-gbq
+ - psycopg2
+ - pyarrow<10
+ - pymysql
+ - pyreadstat
+ - pytables
+ - python-snappy
+ - pyxlsb
+ - s3fs>=2021.08.0
+ - scipy
+ - sqlalchemy<1.4.46
+ - tabulate
+ - tzdata>=2022a
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
+
+ # downstream packages
+ - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild
+ - botocore
+ - cftime
+ - dask
+ - ipython
+ - geopandas-base
+ - seaborn
+ - scikit-learn
+ - statsmodels
+ - coverage
+ - pandas-datareader
+ - pyyaml
+ - py
+ - pytorch
+
+ # local testing dependencies
+ - moto
+ - flask
+
# benchmarks
- asv
- # building
# The compiler packages are meta-packages and install the correct compiler (activation) packages on the respective platforms.
- c-compiler
- cxx-compiler
- - cython>=0.29.24
# code checks
- - black=21.5b2
+ - black=22.3.0
- cpplint
- - flake8=4.0.1
- - flake8-bugbear=21.3.2 # used by flake8, find likely bugs
- - flake8-comprehensions=3.7.0 # used by flake8, linting of unnecessary comprehensions
+ - flake8=5.0.4
+ - flake8-bugbear=22.7.1 # used by flake8, find likely bugs
- isort>=5.2.1 # check that imports are in the right order
- - mypy=0.930
- - pre-commit>=2.9.2
+ - mypy=0.971
+ - pre-commit>=2.15.0
- pycodestyle # used by flake8
- pyupgrade
# documentation
- gitpython # obtain contributors from git for whatsnew
- gitdb
+ - natsort # DataFrame.sort_values doctest
+ - numpydoc
+ - pandas-dev-flaker=0.5.0
+ - pydata-sphinx-theme<0.11
+ - pytest-cython # doctest
- sphinx
- sphinx-panels
- - numpydoc < 1.2 # 2021-02-09 1.2dev breaking CI
+ - sphinx-copybutton
- types-python-dateutil
- types-PyMySQL
- types-pytz
- types-setuptools
# documentation (jupyter notebooks)
- - nbconvert>=5.4.1
+ - nbconvert>=6.4.5
- nbsphinx
- pandoc
-
- # Dask and its dependencies (that dont install with dask)
- - dask-core
- - toolz>=0.7.3
- - partd>=0.3.10
- - cloudpickle>=0.2.1
-
- # web (jinja2 is also needed, but it's also an optional pandas dependency)
- - markdown
- - feedparser
- - pyyaml
- - requests
-
- # testing
- - boto3
- - botocore>=1.11
- - hypothesis>=5.5.3
- - moto # mock S3
- - flask
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.31
- - pytest-asyncio
- - pytest-instafail
-
- # downstream tests
- - seaborn
- - statsmodels
-
- # unused (required indirectly may be?)
- ipywidgets
- nbformat
- notebook>=6.0.3
- - pip
-
- # optional
- - blosc
- - bottleneck>=1.3.1
- ipykernel
- - ipython>=7.11.1
- - jinja2 # pandas.Styler
- - matplotlib>=3.3.2 # pandas.plotting, Series.plot, DataFrame.plot
- - numexpr>=2.7.1
- - scipy>=1.4.1
- - numba>=0.50.1
- # optional for io
- # ---------------
- # pd.read_html
- - beautifulsoup4>=4.8.2
- - html5lib
- - lxml
-
- # pd.read_excel, DataFrame.to_excel, pd.ExcelWriter, pd.ExcelFile
- - openpyxl
- - xlrd
- - xlsxwriter
- - xlwt
- - odfpy
-
- - fastparquet>=0.4.0 # pandas.read_parquet, DataFrame.to_parquet
- - pyarrow>2.0.1 # pandas.read_parquet, DataFrame.to_parquet, pandas.read_feather, DataFrame.to_feather
- - python-snappy # required by pyarrow
+ # web
+ - jinja2 # in optional dependencies, but documented here as needed
+ - markdown
+ - feedparser
+ - pyyaml
+ - requests
- - pytables>=3.6.1 # pandas.read_hdf, DataFrame.to_hdf
- - s3fs>=0.4.0 # file IO when using 's3://...' path
- - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild
- - fsspec>=0.7.4 # for generic remote file operations
- - gcsfs>=0.6.0 # file IO when using 'gcs://...' path
- - sqlalchemy # pandas.read_sql, DataFrame.to_sql
- - xarray<0.19 # DataFrame.to_xarray
- - cftime # Needed for downstream xarray.CFTimeIndex test
- - pyreadstat # pandas.read_spss
- - tabulate>=0.8.3 # DataFrame.to_markdown
- - natsort # DataFrame.sort_values
+ # build the interactive terminal
+ - jupyterlab >=3.4,<4
- pip:
- #issue with building environment in conda on windows. Issue: https://github.com/pandas-dev/pandas/issues/45123
- #issue with pydata-sphix-theme on windows. Issue: https://github.com/pydata/pydata-sphinx-theme/issues/523
- #using previous stable version as workaround
- - git+https://github.com/pydata/pydata-sphinx-theme.git@41764f5
- - pandas-dev-flaker==0.2.0
- - pytest-cython
+ - jupyterlite==0.1.0b10
+ - sphinx-toggleprompt
diff --git a/pandas/__init__.py b/pandas/__init__.py
index 1b18af0f69cf2..5016bde000c3b 100644
--- a/pandas/__init__.py
+++ b/pandas/__init__.py
@@ -1,35 +1,35 @@
-# flake8: noqa
+from __future__ import annotations
__docformat__ = "restructuredtext"
# Let users know if they're missing any of our hard dependencies
-hard_dependencies = ("numpy", "pytz", "dateutil")
-missing_dependencies = []
+_hard_dependencies = ("numpy", "pytz", "dateutil")
+_missing_dependencies = []
-for dependency in hard_dependencies:
+for _dependency in _hard_dependencies:
try:
- __import__(dependency)
- except ImportError as e:
- missing_dependencies.append(f"{dependency}: {e}")
+ __import__(_dependency)
+ except ImportError as _e:
+ _missing_dependencies.append(f"{_dependency}: {_e}")
-if missing_dependencies:
+if _missing_dependencies:
raise ImportError(
- "Unable to import required dependencies:\n" + "\n".join(missing_dependencies)
+ "Unable to import required dependencies:\n" + "\n".join(_missing_dependencies)
)
-del hard_dependencies, dependency, missing_dependencies
+del _hard_dependencies, _dependency, _missing_dependencies
# numpy compat
-from pandas.compat import is_numpy_dev as _is_numpy_dev
+from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
try:
from pandas._libs import hashtable as _hashtable, lib as _lib, tslib as _tslib
-except ImportError as err: # pragma: no cover
- module = err.name
+except ImportError as _err: # pragma: no cover
+ _module = _err.name
raise ImportError(
- f"C extension: {module} not built. If you want to import "
+ f"C extension: {_module} not built. If you want to import "
"pandas from the source directory, you may need to run "
"'python setup.py build_ext --force' to build the C extensions first."
- ) from err
+ ) from _err
else:
del _tslib, _lib, _hashtable
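The underscore renaming above follows from how module globals work: any name bound at import time, including a loop variable, becomes a module attribute unless it is deleted afterwards. A standalone illustration (``demo`` is a throwaway module name)::

    import types

    mod = types.ModuleType("demo")
    exec("for item in (1, 2, 3): pass", mod.__dict__)
    print(hasattr(mod, "item"))    # True -- the loop variable leaked

    exec("for _item in (1, 2, 3): pass\ndel _item", mod.__dict__)
    print(hasattr(mod, "_item"))   # False -- prefixed and cleaned up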
@@ -43,10 +43,11 @@
)
# let init-time option registration happen
-import pandas.core.config_init
+import pandas.core.config_init # pyright: ignore # noqa:F401
from pandas.core.api import (
# dtype
+ ArrowDtype,
Int8Dtype,
Int16Dtype,
Int32Dtype,
@@ -128,11 +129,13 @@
pivot,
pivot_table,
get_dummies,
+ from_dummies,
cut,
qcut,
)
-from pandas import api, arrays, errors, io, plotting, testing, tseries
+from pandas import api, arrays, errors, io, plotting, tseries
+from pandas import testing # noqa:PDF015
from pandas.util._print_versions import show_versions
from pandas.io.api import (
@@ -184,7 +187,7 @@
__deprecated_num_index_names = ["Float64Index", "Int64Index", "UInt64Index"]
-def __dir__():
+def __dir__() -> list[str]:
# GH43028
# Int64Index etc. are deprecated, but we still want them to be available in the dir.
# Remove in Pandas 2.0, when we remove Int64Index etc. from the code base.
@@ -306,6 +309,7 @@ def __getattr__(name):
# Pandas is not (yet) a py.typed library: the public API is determined
# based on the documentation.
__all__ = [
+ "ArrowDtype",
"BooleanDtype",
"Categorical",
"CategoricalDtype",
@@ -361,6 +365,7 @@ def __getattr__(name):
"eval",
"factorize",
"get_dummies",
+ "from_dummies",
"get_option",
"infer_freq",
"interval_range",
diff --git a/pandas/_config/__init__.py b/pandas/_config/__init__.py
index 65936a9fcdbf3..929f8a5af6b3f 100644
--- a/pandas/_config/__init__.py
+++ b/pandas/_config/__init__.py
@@ -16,7 +16,7 @@
"options",
]
from pandas._config import config
-from pandas._config import dates # noqa:F401
+from pandas._config import dates # pyright: ignore # noqa:F401
from pandas._config.config import (
describe_option,
get_option,
diff --git a/pandas/_config/config.py b/pandas/_config/config.py
index 5a0f58266c203..b4b06c819431f 100644
--- a/pandas/_config/config.py
+++ b/pandas/_config/config.py
@@ -58,13 +58,19 @@
from typing import (
Any,
Callable,
+ Generic,
Iterable,
+ Iterator,
NamedTuple,
cast,
)
import warnings
-from pandas._typing import F
+from pandas._typing import (
+ F,
+ T,
+)
+from pandas.util._exceptions import find_stack_level
class DeprecatedOption(NamedTuple):
@@ -97,8 +103,9 @@ class RegisteredOption(NamedTuple):
class OptionError(AttributeError, KeyError):
"""
- Exception for pandas.options, backwards compatible with KeyError
- checks.
+ Exception raised for pandas.options.
+
+ Backwards compatible with KeyError checks.
"""
@@ -124,7 +131,7 @@ def _get_single_key(pat: str, silent: bool) -> str:
return key
-def _get_option(pat: str, silent: bool = False):
+def _get_option(pat: str, silent: bool = False) -> Any:
key = _get_single_key(pat, silent)
# walk the nested dict
@@ -164,7 +171,7 @@ def _set_option(*args, **kwargs) -> None:
o.cb(key)
-def _describe_option(pat: str = "", _print_desc: bool = True):
+def _describe_option(pat: str = "", _print_desc: bool = True) -> str | None:
keys = _select_options(pat)
if len(keys) == 0:
@@ -174,8 +181,8 @@ def _describe_option(pat: str = "", _print_desc: bool = True):
if _print_desc:
print(s)
- else:
- return s
+ return None
+ return s
def _reset_option(pat: str, silent: bool = False) -> None:
@@ -204,7 +211,7 @@ def get_default_val(pat: str):
class DictWrapper:
"""provide attribute-style access to a nested dict"""
- def __init__(self, d: dict[str, Any], prefix: str = ""):
+ def __init__(self, d: dict[str, Any], prefix: str = "") -> None:
object.__setattr__(self, "d", d)
object.__setattr__(self, "prefix", prefix)
@@ -247,16 +254,17 @@ def __dir__(self) -> Iterable[str]:
# of options, and option descriptions.
-class CallableDynamicDoc:
- def __init__(self, func, doc_tmpl):
+class CallableDynamicDoc(Generic[T]):
+ def __init__(self, func: Callable[..., T], doc_tmpl: str) -> None:
self.__doc_tmpl__ = doc_tmpl
self.__func__ = func
- def __call__(self, *args, **kwds):
+ def __call__(self, *args, **kwds) -> T:
return self.__func__(*args, **kwds)
+ # error: Signature of "__doc__" incompatible with supertype "object"
@property
- def __doc__(self):
+ def __doc__(self) -> str: # type: ignore[override]
opts_desc = _describe_option("all", _print_desc=False)
opts_list = pp_options_list(list(_registered_options.keys()))
return self.__doc_tmpl__.format(opts_desc=opts_desc, opts_list=opts_list)
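The ``Generic[T]`` rewrite above lets type checkers carry the wrapped function's return type through ``__call__`` while ``__doc__`` is rendered on access. A stripped-down sketch of the same pattern, using hypothetical names::

    from typing import Any, Callable, Generic, TypeVar

    T = TypeVar("T")

    class DynamicDoc(Generic[T]):
        def __init__(self, func: Callable[..., T], doc_tmpl: str) -> None:
            self._func = func
            self._tmpl = doc_tmpl

        def __call__(self, *args: Any, **kwds: Any) -> T:
            return self._func(*args, **kwds)

        @property
        def __doc__(self) -> str:  # type: ignore[override]
            # Rebuilt on every access, so the text can reflect runtime state.
            return self._tmpl.format(name=self._func.__name__)

    def add(a: int, b: int) -> int:
        return a + b

    wrapped = DynamicDoc(add, "Wrapper around {name}.")
    print(wrapped(1, 2))    # 3 -- checkers infer int via T
    print(wrapped.__doc__)  # Wrapper around add.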
@@ -289,6 +297,8 @@ def __doc__(self):
Notes
-----
+Please reference the :ref:`User Guide <options>` for more information.
+
The available options with its descriptions:
{opts_desc}
@@ -323,6 +333,8 @@ def __doc__(self):
Notes
-----
+Please reference the :ref:`User Guide <options>` for more information.
+
The available options with its descriptions:
{opts_desc}
@@ -355,6 +367,8 @@ def __doc__(self):
Notes
-----
+Please reference the :ref:`User Guide <options>` for more information.
+
The available options with its descriptions:
{opts_desc}
@@ -385,6 +399,8 @@ def __doc__(self):
Notes
-----
+Please reference the :ref:`User Guide <options>` for more information.
+
The available options with its descriptions:
{opts_desc}
@@ -414,7 +430,7 @@ class option_context(ContextDecorator):
... pass
"""
- def __init__(self, *args):
+ def __init__(self, *args) -> None:
if len(args) % 2 != 0 or len(args) < 2:
raise ValueError(
"Need to invoke as option_context(pat, val, [(pat, val), ...])."
@@ -422,13 +438,13 @@ def __init__(self, *args):
self.ops = list(zip(args[::2], args[1::2]))
- def __enter__(self):
+ def __enter__(self) -> None:
self.undo = [(pat, _get_option(pat, silent=True)) for pat, val in self.ops]
for pat, val in self.ops:
_set_option(pat, val, silent=True)
- def __exit__(self, *args):
+ def __exit__(self, *args) -> None:
if self.undo:
for pat, val in self.undo:
_set_option(pat, val, silent=True)
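``option_context`` itself is public API; the annotations above do not change its behavior. Typical usage, for reference::

    import pandas as pd

    # Options set inside the block are restored on exit, even on error.
    with pd.option_context("display.max_rows", 5, "display.precision", 2):
        print(pd.get_option("display.max_rows"))  # 5
    print(pd.get_option("display.max_rows"))      # back to the prior value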
@@ -642,7 +658,11 @@ def _warn_if_deprecated(key: str) -> bool:
d = _get_deprecated_option(key)
if d:
if d.msg:
- warnings.warn(d.msg, FutureWarning)
+ warnings.warn(
+ d.msg,
+ FutureWarning,
+ stacklevel=find_stack_level(),
+ )
else:
msg = f"'{key}' is deprecated"
if d.removal_ver:
@@ -652,7 +672,7 @@ def _warn_if_deprecated(key: str) -> bool:
else:
msg += ", please refrain from using it."
- warnings.warn(msg, FutureWarning)
+ warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
return True
return False
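``find_stack_level`` computes the ``stacklevel`` dynamically so the ``FutureWarning`` points at the user's call site instead of library internals. A self-contained sketch of the idea (a hypothetical re-implementation, not pandas' actual helper)::

    import inspect
    import os
    import warnings

    def find_stack_level() -> int:
        # Count consecutive frames that live inside this package; passing
        # the count as stacklevel reports the warning at the caller's line.
        pkg_dir = os.path.dirname(os.path.abspath(__file__))
        frame = inspect.currentframe()
        n = 0
        while frame is not None:
            if os.path.abspath(inspect.getfile(frame)).startswith(pkg_dir):
                n += 1
                frame = frame.f_back
            else:
                break
        return max(n, 1)

    def deprecated_option() -> None:
        warnings.warn("'foo' is deprecated", FutureWarning,
                      stacklevel=find_stack_level())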
@@ -720,7 +740,7 @@ def pp(name: str, ks: Iterable[str]) -> list[str]:
@contextmanager
-def config_prefix(prefix):
+def config_prefix(prefix) -> Iterator[None]:
"""
contextmanager for multiple invocations of API with a common prefix
diff --git a/pandas/_config/dates.py b/pandas/_config/dates.py
index 5bf2b49ce5904..b37831f96eb73 100644
--- a/pandas/_config/dates.py
+++ b/pandas/_config/dates.py
@@ -1,6 +1,8 @@
"""
config for datetime formatting
"""
+from __future__ import annotations
+
from pandas._config import config as cf
pc_date_dayfirst_doc = """
diff --git a/pandas/_config/localization.py b/pandas/_config/localization.py
index 2a487fa4b6877..c4355e954c67c 100644
--- a/pandas/_config/localization.py
+++ b/pandas/_config/localization.py
@@ -39,13 +39,14 @@ def set_locale(
particular locale, without globally setting the locale. This probably isn't
thread-safe.
"""
- current_locale = locale.getlocale()
+ # getlocale is not always compliant with setlocale, use setlocale. GH#46595
+ current_locale = locale.setlocale(lc_var)
try:
locale.setlocale(lc_var, new_locale)
- normalized_locale = locale.getlocale()
- if all(x is not None for x in normalized_locale):
- yield ".".join(normalized_locale)
+ normalized_code, normalized_encoding = locale.getlocale()
+ if normalized_code is not None and normalized_encoding is not None:
+ yield f"{normalized_code}.{normalized_encoding}"
else:
yield new_locale
finally:
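The change above relies on a subtlety of the :mod:`locale` module: ``setlocale`` called with only a category queries the current setting and returns it in a form ``setlocale`` will accept back, whereas ``getlocale`` normalizes the value and can produce something ``setlocale`` rejects. For illustration::

    import locale

    # Query without modifying: no second argument means "read only".
    saved = locale.setlocale(locale.LC_TIME)
    try:
        ...  # temporarily switch LC_TIME here and do locale-aware work
    finally:
        # saved is guaranteed to round-trip, unlike getlocale() output.
        locale.setlocale(locale.LC_TIME, saved)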
diff --git a/pandas/_libs/algos.pxd b/pandas/_libs/algos.pxd
index fdeff2ed11805..c3b83b9bd40cb 100644
--- a/pandas/_libs/algos.pxd
+++ b/pandas/_libs/algos.pxd
@@ -1,4 +1,7 @@
-from pandas._libs.dtypes cimport numeric_t
+from pandas._libs.dtypes cimport (
+ numeric_object_t,
+ numeric_t,
+)
cdef numeric_t kth_smallest_c(numeric_t* arr, Py_ssize_t k, Py_ssize_t n) nogil
@@ -10,3 +13,10 @@ cdef enum TiebreakEnumType:
TIEBREAK_FIRST
TIEBREAK_FIRST_DESCENDING
TIEBREAK_DENSE
+
+
+cdef numeric_object_t get_rank_nan_fill_val(
+ bint rank_nans_highest,
+ numeric_object_t val,
+ bint is_datetimelike=*,
+)
diff --git a/pandas/_libs/algos.pyi b/pandas/_libs/algos.pyi
index df8ac3f3b0696..5a2005722c85c 100644
--- a/pandas/_libs/algos.pyi
+++ b/pandas/_libs/algos.pyi
@@ -1,5 +1,3 @@
-from __future__ import annotations
-
from typing import Any
import numpy as np
@@ -42,7 +40,7 @@ def groupsort_indexer(
np.ndarray, # ndarray[int64_t, ndim=1]
]: ...
def kth_smallest(
- a: np.ndarray, # numeric[:]
+ arr: np.ndarray, # numeric[:]
k: int,
) -> Any: ... # numeric
@@ -61,52 +59,39 @@ def nancorr_spearman(
# ----------------------------------------------------------------------
-# ctypedef fused algos_t:
-# float64_t
-# float32_t
-# object
-# int64_t
-# int32_t
-# int16_t
-# int8_t
-# uint64_t
-# uint32_t
-# uint16_t
-# uint8_t
-
def validate_limit(nobs: int | None, limit=...) -> int: ...
def pad(
- old: np.ndarray, # ndarray[algos_t]
- new: np.ndarray, # ndarray[algos_t]
+ old: np.ndarray, # ndarray[numeric_object_t]
+ new: np.ndarray, # ndarray[numeric_object_t]
limit=...,
) -> npt.NDArray[np.intp]: ... # np.ndarray[np.intp, ndim=1]
def pad_inplace(
- values: np.ndarray, # algos_t[:]
+ values: np.ndarray, # numeric_object_t[:]
mask: np.ndarray, # uint8_t[:]
limit=...,
) -> None: ...
def pad_2d_inplace(
- values: np.ndarray, # algos_t[:, :]
+ values: np.ndarray, # numeric_object_t[:, :]
mask: np.ndarray, # const uint8_t[:, :]
limit=...,
) -> None: ...
def backfill(
- old: np.ndarray, # ndarray[algos_t]
- new: np.ndarray, # ndarray[algos_t]
+ old: np.ndarray, # ndarray[numeric_object_t]
+ new: np.ndarray, # ndarray[numeric_object_t]
limit=...,
) -> npt.NDArray[np.intp]: ... # np.ndarray[np.intp, ndim=1]
def backfill_inplace(
- values: np.ndarray, # algos_t[:]
+ values: np.ndarray, # numeric_object_t[:]
mask: np.ndarray, # uint8_t[:]
limit=...,
) -> None: ...
def backfill_2d_inplace(
- values: np.ndarray, # algos_t[:, :]
+ values: np.ndarray, # numeric_object_t[:, :]
mask: np.ndarray, # const uint8_t[:, :]
limit=...,
) -> None: ...
def is_monotonic(
- arr: np.ndarray, # ndarray[algos_t, ndim=1]
+ arr: np.ndarray, # ndarray[numeric_object_t, ndim=1]
timelike: bool,
) -> tuple[bool, bool, bool]: ...
@@ -114,23 +99,18 @@ def is_monotonic(
# rank_1d, rank_2d
# ----------------------------------------------------------------------
-# ctypedef fused rank_t:
-# object
-# float64_t
-# uint64_t
-# int64_t
-
def rank_1d(
- values: np.ndarray, # ndarray[rank_t, ndim=1]
+ values: np.ndarray, # ndarray[numeric_object_t, ndim=1]
labels: np.ndarray | None = ..., # const int64_t[:]=None
is_datetimelike: bool = ...,
ties_method=...,
ascending: bool = ...,
pct: bool = ...,
na_option=...,
+ mask: npt.NDArray[np.bool_] | None = ...,
) -> np.ndarray: ... # np.ndarray[float64_t, ndim=1]
def rank_2d(
- in_arr: np.ndarray, # ndarray[rank_t, ndim=2]
+ in_arr: np.ndarray, # ndarray[numeric_object_t, ndim=2]
axis: int = ...,
is_datetimelike: bool = ...,
ties_method=...,
@@ -147,17 +127,11 @@ def diff_2d(
) -> None: ...
def ensure_platform_int(arr: object) -> npt.NDArray[np.intp]: ...
def ensure_object(arr: object) -> npt.NDArray[np.object_]: ...
-def ensure_complex64(arr: object, copy=...) -> npt.NDArray[np.complex64]: ...
-def ensure_complex128(arr: object, copy=...) -> npt.NDArray[np.complex128]: ...
def ensure_float64(arr: object, copy=...) -> npt.NDArray[np.float64]: ...
-def ensure_float32(arr: object, copy=...) -> npt.NDArray[np.float32]: ...
def ensure_int8(arr: object, copy=...) -> npt.NDArray[np.int8]: ...
def ensure_int16(arr: object, copy=...) -> npt.NDArray[np.int16]: ...
def ensure_int32(arr: object, copy=...) -> npt.NDArray[np.int32]: ...
def ensure_int64(arr: object, copy=...) -> npt.NDArray[np.int64]: ...
-def ensure_uint8(arr: object, copy=...) -> npt.NDArray[np.uint8]: ...
-def ensure_uint16(arr: object, copy=...) -> npt.NDArray[np.uint16]: ...
-def ensure_uint32(arr: object, copy=...) -> npt.NDArray[np.uint32]: ...
def ensure_uint64(arr: object, copy=...) -> npt.NDArray[np.uint64]: ...
def take_1d_int8_int8(
values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
diff --git a/pandas/_libs/algos.pyx b/pandas/_libs/algos.pyx
index 3d099a53163bc..c05d6a300ccf0 100644
--- a/pandas/_libs/algos.pyx
+++ b/pandas/_libs/algos.pyx
@@ -1,6 +1,5 @@
-import cython
-from cython import Py_ssize_t
-
+cimport cython
+from cython cimport Py_ssize_t
from libc.math cimport (
fabs,
sqrt,
@@ -46,7 +45,6 @@ cnp.import_array()
cimport pandas._libs.util as util
from pandas._libs.dtypes cimport (
- iu_64_floating_obj_t,
numeric_object_t,
numeric_t,
)
@@ -182,6 +180,8 @@ def is_lexsorted(list_of_arrays: list) -> bint:
else:
result = False
break
+ if not result:
+ break
free(vecs)
return result
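The two added lines short-circuit the scan: once a row proves the arrays unsorted, ``result`` can never become ``True`` again, so the outer loop over rows can stop immediately. A pure-Python sketch of the intended control flow::

    def is_lexsorted(list_of_arrays):
        # True if the arrays, compared level by level, are in
        # non-decreasing lexicographic order.
        nlevels = len(list_of_arrays)
        n = len(list_of_arrays[0])
        result = True
        for i in range(1, n):
            for k in range(nlevels):
                cur, pre = list_of_arrays[k][i], list_of_arrays[k][i - 1]
                if cur == pre:
                    continue          # tie: defer to the next level
                elif cur > pre:
                    break             # this row is ordered; next row
                else:
                    result = False    # strictly smaller: not sorted
                    break
            if not result:
                break                 # the fix: stop the outer loop too
        return result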
@@ -324,17 +324,14 @@ def kth_smallest(numeric_t[::1] arr, Py_ssize_t k) -> numeric_t:
@cython.boundscheck(False)
@cython.wraparound(False)
+@cython.cdivision(True)
def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
cdef:
Py_ssize_t i, j, xi, yi, N, K
bint minpv
float64_t[:, ::1] result
- # Initialize to None since we only use in the no missing value case
- float64_t[::1] means=None, ssqds=None
ndarray[uint8_t, ndim=2] mask
- bint no_nans
int64_t nobs = 0
- float64_t mean, ssqd, val
float64_t vx, vy, dx, dy, meanx, meany, divisor, ssqdmx, ssqdmy, covxy
N, K = (