- To raw data
+ To raw data
@@ -209,7 +209,7 @@ Plot the typical :math:`NO_2` pattern during the day of our time series of all s
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
kind='bar', rot=0, ax=axs
)
- plt.xlabel("Hour of the day"); # custom x label using matplotlib
+ plt.xlabel("Hour of the day"); # custom x label using Matplotlib
@savefig 09_bar_chart.png
plt.ylabel("$NO_2 (µg/m^3)$");
diff --git a/doc/source/getting_started/intro_tutorials/10_text_data.rst b/doc/source/getting_started/intro_tutorials/10_text_data.rst
index 63db920164ac3..148ac246d7bf8 100644
--- a/doc/source/getting_started/intro_tutorials/10_text_data.rst
+++ b/doc/source/getting_started/intro_tutorials/10_text_data.rst
@@ -179,7 +179,7 @@ applied to integers, so no ``str`` is used.
Based on the index name of the row (``307``) and the column (``Name``),
we can do a selection using the ``loc`` operator, introduced in the
-`tutorial on subsetting <3_subset_data.ipynb>`__.
+:ref:`tutorial on subsetting <10min_tut_03_subset>`.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
index a5a5442330e43..43790bd53f587 100644
--- a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
@@ -8,7 +8,7 @@
For this tutorial, air quality data about :math:`NO_2` is used, made
-available by `openaq `__ and using the
+available by `OpenAQ `__ and using the
`py-openaq `__ package.
The ``air_quality_no2.csv`` data set provides :math:`NO_2` values for
the measurement stations *FR04014*, *BETR801* and *London Westminster*
@@ -17,6 +17,6 @@ in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/includes/titanic.rst b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
index 7032b70b3f1cf..19b8e81914e31 100644
--- a/doc/source/getting_started/intro_tutorials/includes/titanic.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
@@ -11,22 +11,21 @@ This tutorial uses the Titanic data set, stored as CSV. The data
consists of the following data columns:
- PassengerId: Id of every passenger.
-- Survived: This feature have value 0 and 1. 0 for not survived and 1
- for survived.
-- Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
+- Survived: Indication whether passenger survived. ``0`` for no and ``1`` for yes.
+- Pclass: One out of the 3 ticket classes: Class ``1``, Class ``2`` and Class ``3``.
- Name: Name of passenger.
- Sex: Gender of passenger.
-- Age: Age of passenger.
-- SibSp: Indication that passenger have siblings and spouse.
-- Parch: Whether a passenger is alone or have family.
+- Age: Age of passenger in years.
+- SibSp: Number of siblings or spouses aboard.
+- Parch: Number of parents or children aboard.
- Ticket: Ticket number of passenger.
- Fare: Indicating the fare.
-- Cabin: The cabin of passenger.
-- Embarked: The embarked category.
+- Cabin: Cabin number of passenger.
+- Embarked: Port of embarkation.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 7084b67cf9424..320d2da01418c 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -29,7 +29,7 @@ and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use
cases in finance, statistics, social science, and many areas of
engineering. For R users, :class:`DataFrame` provides everything that R's
``data.frame`` provides and much more. pandas is built on top of `NumPy
-`__ and is intended to integrate well within a scientific
+`__ and is intended to integrate well within a scientific
computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
@@ -75,7 +75,7 @@ Some other notes
specialized tool.
- pandas is a dependency of `statsmodels
- `__, making it an important part of the
+ `__, making it an important part of the
statistical computing ecosystem in Python.
- pandas has been used extensively in production in financial applications.
@@ -168,7 +168,7 @@ The list of the Core Team members and more detailed information can be found on
Institutional partners
----------------------
-The information about current institutional partners can be found on `pandas website page `__.
+The information about current institutional partners can be found on the `pandas website page `__.
License
-------
diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst
index b8940d2efed2f..bff50bb1e4c2d 100644
--- a/doc/source/getting_started/tutorials.rst
+++ b/doc/source/getting_started/tutorials.rst
@@ -18,6 +18,19 @@ entails.
For the table of contents, see the `pandas-cookbook GitHub
repository `_.
+pandas workshop by Stefanie Molin
+---------------------------------
+
+An introductory workshop by `Stefanie Molin `_
+designed to quickly get you up to speed with pandas using real-world datasets.
+It covers getting started with pandas, data wrangling, and data visualization
+(with some exposure to matplotlib and seaborn). The
+`pandas-workshop GitHub repository `_
+features detailed environment setup instructions (including a Binder environment),
+slides and notebooks for following along, and exercises to practice the concepts.
+There is also a lab with new exercises on a dataset not covered in the workshop for
+additional practice.
+
Learn pandas by Hernan Rojas
----------------------------
@@ -62,6 +75,16 @@ Excel charts with pandas, vincent and xlsxwriter
* `Using Pandas and XlsxWriter to create Excel charts `_
+Joyful pandas
+-------------
+
+A tutorial written in Chinese by Yuanhao Geng. It covers the basic operations
+of NumPy and pandas, the four main data manipulation methods (indexing, groupby, reshaping
+and concatenation) and the four main data types (missing data, string data, categorical
+data and time series data). At the end of each chapter, corresponding exercises are posted.
+All the datasets and related materials can be found in the GitHub repository
+`datawhalechina/joyful-pandas `_.
+
Video tutorials
---------------
@@ -77,11 +100,11 @@ Video tutorials
* `Data analysis in Python with pandas `_
(2016-2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
* `Best practices with pandas `_
(2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
Various tutorials
@@ -95,3 +118,4 @@ Various tutorials
* `Pandas and Python: Top 10, by Manish Amde `_
* `Pandas DataFrames Tutorial, by Karlijn Willems `_
* `A concise tutorial with real life examples `_
+* `430+ Searchable Pandas recipes by Isshin Inada `_
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index 51a6807b30e2a..59280536536db 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -10,7 +10,10 @@ pandas documentation
**Date**: |today| **Version**: |version|
-**Download documentation**: `PDF Version `__ | `Zipped HTML `__
+**Download documentation**: `Zipped HTML `__
+
+**Previous versions**: Documentation of previous pandas versions is available at
+`pandas.pydata.org `__.
**Useful links**:
`Binary Installers `__ |
@@ -23,6 +26,7 @@ pandas documentation
easy-to-use data structures and data analysis tools for the `Python `__
programming language.
+{% if not single_doc -%}
.. panels::
:card: + intro-card text-center
:column: col-lg-6 col-md-6 col-sm-6 col-xs-12 d-flex
@@ -93,16 +97,22 @@ programming language.
:text: To the development guide
:classes: btn-block btn-secondary stretched-link
-
+{% endif %}
{% if single_doc and single_doc.endswith('.rst') -%}
.. toctree::
:maxdepth: 3
:titlesonly:
{{ single_doc[:-4] }}
+{% elif single_doc and single_doc.count('.') <= 1 %}
+.. autosummary::
+ :toctree: reference/api/
+
+ {{ single_doc }}
{% elif single_doc %}
.. autosummary::
:toctree: reference/api/
+ :template: autosummary/accessor_method.rst
{{ single_doc }}
{% else -%}
diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst
index c6fda85b0486d..6d09e10f284af 100644
--- a/doc/source/reference/arrays.rst
+++ b/doc/source/reference/arrays.rst
@@ -2,9 +2,13 @@
.. _api.arrays:
-=============
-pandas arrays
-=============
+======================================
+pandas arrays, scalars, and data types
+======================================
+
+*******
+Objects
+*******
.. currentmodule:: pandas
@@ -15,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or
For some data types, pandas extends NumPy's type system. String aliases for these types
can be found at :ref:`basics.dtypes`.
-=================== ========================= ================== =============================
-Kind of Data pandas Data Type Scalar Array
-=================== ========================= ================== =============================
-TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime`
-Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta`
-Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period`
-Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval`
-Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
-Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
-Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
-Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
-Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool`
-=================== ========================= ================== =============================
+=================== ========================= ============================= =============================
+Kind of Data pandas Data Type Scalar Array
+=================== ========================= ============================= =============================
+TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime`
+Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta`
+Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period`
+Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval`
+Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
+Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
+Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
+Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
+Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool`
+PyArrow :class:`ArrowDtype` Python Scalars or :class:`NA` :ref:`api.arrays.arrow`
+=================== ========================= ============================= =============================
pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
The top-level :meth:`array` method can be used to create a new array, which may be
@@ -38,10 +43,48 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra
array
+.. _api.arrays.arrow:
+
+PyArrow
+-------
+
+.. warning::
+
+ This feature is experimental, and the API can change in a future release without warning.
+
+The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a
+:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of a :class:`arrays.ArrowExtensionArray`
+is an :class:`ArrowDtype`.
+
+`PyArrow `__ provides similar array and `data type `__
+support to NumPy, including first-class nullability support for all data types, immutability and more.
+
+.. note::
+
+ For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
+ by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section `
+ below.
+
+While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned**
+as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or :class:`NA` for missing
+values.
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ arrays.ArrowExtensionArray
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ ArrowDtype
+
.. _api.arrays.datetime:
-Datetime data
--------------
+Datetimes
+---------
NumPy cannot natively represent timezone-aware datetimes. pandas supports this
with the :class:`arrays.DatetimeArray` extension array, which can hold timezone-naive
@@ -141,11 +184,11 @@ Methods
Timestamp.weekday
A collection of timestamps may be stored in a :class:`arrays.DatetimeArray`.
-For timezone-aware data, the ``.dtype`` of a ``DatetimeArray`` is a
+For timezone-aware data, the ``.dtype`` of a :class:`arrays.DatetimeArray` is a
:class:`DatetimeTZDtype`. For timezone-naive data, ``np.dtype("datetime64[ns]")``
is used.
-If the data are tz-aware, then every value in the array must have the same timezone.
+If the data are timezone-aware, then every value in the array must have the same timezone.
.. autosummary::
:toctree: api/
@@ -161,8 +204,8 @@ If the data are tz-aware, then every value in the array must have the same timez
.. _api.arrays.timedelta:
-Timedelta data
---------------
+Timedeltas
+----------
NumPy can natively represent timedeltas. pandas provides :class:`Timedelta`
for symmetry with :class:`Timestamp`.
@@ -206,7 +249,7 @@ Methods
Timedelta.to_numpy
Timedelta.total_seconds
-A collection of timedeltas may be stored in a :class:`TimedeltaArray`.
+A collection of :class:`Timedelta` may be stored in a :class:`TimedeltaArray`.
.. autosummary::
:toctree: api/
@@ -216,8 +259,8 @@ A collection of timedeltas may be stored in a :class:`TimedeltaArray`.
.. _api.arrays.period:
-Timespan data
--------------
+Periods
+-------
pandas represents spans of times as :class:`Period` objects.
@@ -267,8 +310,8 @@ Methods
Period.strftime
Period.to_timestamp
-A collection of timedeltas may be stored in a :class:`arrays.PeriodArray`.
-Every period in a ``PeriodArray`` must have the same ``freq``.
+A collection of :class:`Period` may be stored in a :class:`arrays.PeriodArray`.
+Every period in a :class:`arrays.PeriodArray` must have the same ``freq``.
.. autosummary::
:toctree: api/
@@ -284,8 +327,8 @@ Every period in a ``PeriodArray`` must have the same ``freq``.
.. _api.arrays.interval:
-Interval data
--------------
+Intervals
+---------
Arbitrary intervals can be represented as :class:`Interval` objects.
@@ -379,12 +422,12 @@ pandas provides this through :class:`arrays.IntegerArray`.
.. _api.arrays.categorical:
-Categorical data
-----------------
+Categoricals
+------------
pandas defines a custom data type for representing data that can take only a
-limited, fixed set of values. The dtype of a ``Categorical`` can be described by
-a :class:`pandas.api.types.CategoricalDtype`.
+limited, fixed set of values. The dtype of a :class:`Categorical` can be described by
+a :class:`CategoricalDtype`.
.. autosummary::
:toctree: api/
@@ -414,7 +457,7 @@ have the categories and integer codes already:
Categorical.from_codes
-The dtype information is available on the ``Categorical``
+The dtype information is available on the :class:`Categorical`
.. autosummary::
:toctree: api/
@@ -425,27 +468,27 @@ The dtype information is available on the ``Categorical``
Categorical.codes
``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts
-the Categorical back to a NumPy array, so categories and order information is not preserved!
+the :class:`Categorical` back to a NumPy array, so categories and order information is not preserved!
.. autosummary::
:toctree: api/
Categorical.__array__
-A ``Categorical`` can be stored in a ``Series`` or ``DataFrame``.
+A :class:`Categorical` can be stored in a :class:`Series` or :class:`DataFrame`.
To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or
``Series(..., dtype=dtype)`` where ``dtype`` is either
* the string ``'category'``
-* an instance of :class:`~pandas.api.types.CategoricalDtype`.
+* an instance of :class:`CategoricalDtype`.
-If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
+If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical
data. See :ref:`api.series.cat` for more.
.. _api.arrays.sparse:
-Sparse data
------------
+Sparse
+------
Data where a single value is repeated many times (e.g. ``0`` or ``NaN``) may
be stored efficiently as a :class:`arrays.SparseArray`.
@@ -464,13 +507,13 @@ be stored efficiently as a :class:`arrays.SparseArray`.
The ``Series.sparse`` accessor may be used to access sparse-specific attributes
and methods if the :class:`Series` contains sparse values. See
-:ref:`api.series.sparse` for more.
+:ref:`api.series.sparse` and :ref:`the user guide ` for more.
.. _api.arrays.string:
-Text data
----------
+Strings
+-------
When working with text data, where each valid element is a string or missing,
we recommend using :class:`StringDtype` (with the alias ``"string"``).
@@ -488,17 +531,17 @@ we recommend using :class:`StringDtype` (with the alias ``"string"``).
StringDtype
-The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
+The ``Series.str`` accessor is available for :class:`Series` backed by a :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.
.. _api.arrays.bool:
-Boolean data with missing values
---------------------------------
+Nullable Boolean
+----------------
The boolean dtype (with the alias ``"boolean"``) provides support for storing
-boolean data (True, False values) with missing values, which is not possible
+boolean data (``True``, ``False``) with missing values, which is not possible
with a bool :class:`numpy.ndarray`.
.. autosummary::
@@ -525,3 +568,72 @@ with a bool :class:`numpy.ndarray`.
DatetimeTZDtype.tz
PeriodDtype.freq
IntervalDtype.subtype
+
+*********
+Utilities
+*********
+
+Constructors
+------------
+.. autosummary::
+ :toctree: api/
+
+ api.types.union_categoricals
+ api.types.infer_dtype
+ api.types.pandas_dtype
+
+Data type introspection
+~~~~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_bool_dtype
+ api.types.is_categorical_dtype
+ api.types.is_complex_dtype
+ api.types.is_datetime64_any_dtype
+ api.types.is_datetime64_dtype
+ api.types.is_datetime64_ns_dtype
+ api.types.is_datetime64tz_dtype
+ api.types.is_extension_type
+ api.types.is_extension_array_dtype
+ api.types.is_float_dtype
+ api.types.is_int64_dtype
+ api.types.is_integer_dtype
+ api.types.is_interval_dtype
+ api.types.is_numeric_dtype
+ api.types.is_object_dtype
+ api.types.is_period_dtype
+ api.types.is_signed_integer_dtype
+ api.types.is_string_dtype
+ api.types.is_timedelta64_dtype
+ api.types.is_timedelta64_ns_dtype
+ api.types.is_unsigned_integer_dtype
+ api.types.is_sparse
+
+Iterable introspection
+~~~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_dict_like
+ api.types.is_file_like
+ api.types.is_list_like
+ api.types.is_named_tuple
+ api.types.is_iterator
+
+Scalar introspection
+~~~~~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ api.types.is_bool
+ api.types.is_categorical
+ api.types.is_complex
+ api.types.is_float
+ api.types.is_hashable
+ api.types.is_integer
+ api.types.is_interval
+ api.types.is_number
+ api.types.is_re
+ api.types.is_re_compilable
+ api.types.is_scalar
diff --git a/doc/source/reference/extensions.rst b/doc/source/reference/extensions.rst
index 7b451ed3bf296..ce8d8d5c2ca10 100644
--- a/doc/source/reference/extensions.rst
+++ b/doc/source/reference/extensions.rst
@@ -48,6 +48,7 @@ objects.
api.extensions.ExtensionArray.equals
api.extensions.ExtensionArray.factorize
api.extensions.ExtensionArray.fillna
+ api.extensions.ExtensionArray.insert
api.extensions.ExtensionArray.isin
api.extensions.ExtensionArray.isna
api.extensions.ExtensionArray.ravel
@@ -60,6 +61,7 @@ objects.
api.extensions.ExtensionArray.nbytes
api.extensions.ExtensionArray.ndim
api.extensions.ExtensionArray.shape
+ api.extensions.ExtensionArray.tolist
Additionally, we have some utility methods for ensuring your object
behaves correctly.
diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst
index 9a1ebc8d670dc..e71ee80767d29 100644
--- a/doc/source/reference/frame.rst
+++ b/doc/source/reference/frame.rst
@@ -373,6 +373,7 @@ Serialization / IO / conversion
DataFrame.from_dict
DataFrame.from_records
+ DataFrame.to_orc
DataFrame.to_parquet
DataFrame.to_pickle
DataFrame.to_csv
@@ -391,3 +392,4 @@ Serialization / IO / conversion
DataFrame.to_clipboard
DataFrame.to_markdown
DataFrame.style
+ DataFrame.__dataframe__
diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst
index b5832cb8aa591..474e37a85d857 100644
--- a/doc/source/reference/general_functions.rst
+++ b/doc/source/reference/general_functions.rst
@@ -23,6 +23,7 @@ Data manipulations
merge_asof
concat
get_dummies
+ from_dummies
factorize
unique
wide_to_long
@@ -37,15 +38,15 @@ Top-level missing data
notna
notnull
-Top-level conversions
-~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with numeric data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
to_numeric
-Top-level dealing with datetimelike
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with datetimelike data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
@@ -57,8 +58,8 @@ Top-level dealing with datetimelike
timedelta_range
infer_freq
-Top-level dealing with intervals
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with Interval data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
@@ -79,9 +80,9 @@ Hashing
util.hash_array
util.hash_pandas_object
-Testing
-~~~~~~~
+Importing from other DataFrame libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
- test
+ api.interchange.from_dataframe
diff --git a/doc/source/reference/general_utility_functions.rst b/doc/source/reference/general_utility_functions.rst
deleted file mode 100644
index 37fe980dbf68c..0000000000000
--- a/doc/source/reference/general_utility_functions.rst
+++ /dev/null
@@ -1,124 +0,0 @@
-{{ header }}
-
-.. _api.general_utility_functions:
-
-=========================
-General utility functions
-=========================
-.. currentmodule:: pandas
-
-Working with options
---------------------
-.. autosummary::
- :toctree: api/
-
- describe_option
- reset_option
- get_option
- set_option
- option_context
-
-.. _api.general.testing:
-
-Testing functions
------------------
-.. autosummary::
- :toctree: api/
-
- testing.assert_frame_equal
- testing.assert_series_equal
- testing.assert_index_equal
- testing.assert_extension_array_equal
-
-Exceptions and warnings
------------------------
-.. autosummary::
- :toctree: api/
-
- errors.AccessorRegistrationWarning
- errors.DtypeWarning
- errors.DuplicateLabelError
- errors.EmptyDataError
- errors.InvalidIndexError
- errors.MergeError
- errors.NullFrequencyError
- errors.NumbaUtilError
- errors.OutOfBoundsDatetime
- errors.OutOfBoundsTimedelta
- errors.ParserError
- errors.ParserWarning
- errors.PerformanceWarning
- errors.UnsortedIndexError
- errors.UnsupportedFunctionCall
-
-Data types related functionality
---------------------------------
-.. autosummary::
- :toctree: api/
-
- api.types.union_categoricals
- api.types.infer_dtype
- api.types.pandas_dtype
-
-Dtype introspection
-~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_bool_dtype
- api.types.is_categorical_dtype
- api.types.is_complex_dtype
- api.types.is_datetime64_any_dtype
- api.types.is_datetime64_dtype
- api.types.is_datetime64_ns_dtype
- api.types.is_datetime64tz_dtype
- api.types.is_extension_type
- api.types.is_extension_array_dtype
- api.types.is_float_dtype
- api.types.is_int64_dtype
- api.types.is_integer_dtype
- api.types.is_interval_dtype
- api.types.is_numeric_dtype
- api.types.is_object_dtype
- api.types.is_period_dtype
- api.types.is_signed_integer_dtype
- api.types.is_string_dtype
- api.types.is_timedelta64_dtype
- api.types.is_timedelta64_ns_dtype
- api.types.is_unsigned_integer_dtype
- api.types.is_sparse
-
-Iterable introspection
-~~~~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_dict_like
- api.types.is_file_like
- api.types.is_list_like
- api.types.is_named_tuple
- api.types.is_iterator
-
-Scalar introspection
-~~~~~~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- api.types.is_bool
- api.types.is_categorical
- api.types.is_complex
- api.types.is_float
- api.types.is_hashable
- api.types.is_integer
- api.types.is_interval
- api.types.is_number
- api.types.is_re
- api.types.is_re_compilable
- api.types.is_scalar
-
-Bug report function
--------------------
-.. autosummary::
- :toctree: api/
-
- show_versions
diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst
index ccf130d03418c..51bd659081b8f 100644
--- a/doc/source/reference/groupby.rst
+++ b/doc/source/reference/groupby.rst
@@ -122,6 +122,7 @@ application to columns of a specific data type.
DataFrameGroupBy.skew
DataFrameGroupBy.take
DataFrameGroupBy.tshift
+ DataFrameGroupBy.value_counts
The following methods are available only for ``SeriesGroupBy`` objects.
@@ -131,9 +132,7 @@ The following methods are available only for ``SeriesGroupBy`` objects.
SeriesGroupBy.hist
SeriesGroupBy.nlargest
SeriesGroupBy.nsmallest
- SeriesGroupBy.nunique
SeriesGroupBy.unique
- SeriesGroupBy.value_counts
SeriesGroupBy.is_monotonic_increasing
SeriesGroupBy.is_monotonic_decreasing
diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst
index f7c5eaf242b34..fc920db671ee5 100644
--- a/doc/source/reference/index.rst
+++ b/doc/source/reference/index.rst
@@ -37,8 +37,9 @@ public functions related to data types in pandas.
resampling
style
plotting
- general_utility_functions
+ options
extensions
+ testing
.. This is to prevent warnings in the doc build. We don't want to encourage
.. these methods.
@@ -46,20 +47,11 @@ public functions related to data types in pandas.
..
.. toctree::
- api/pandas.DataFrame.blocks
- api/pandas.DataFrame.as_matrix
api/pandas.Index.asi8
- api/pandas.Index.data
- api/pandas.Index.flags
api/pandas.Index.holds_integer
api/pandas.Index.is_type_compatible
api/pandas.Index.nlevels
api/pandas.Index.sort
- api/pandas.Series.asobject
- api/pandas.Series.blocks
- api/pandas.Series.from_array
- api/pandas.Series.imag
- api/pandas.Series.real
.. Can't convince sphinx to generate toctree for this class attribute.
diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst
index 1a8c21a2c1a74..ddfef14036ef3 100644
--- a/doc/source/reference/indexing.rst
+++ b/doc/source/reference/indexing.rst
@@ -406,6 +406,7 @@ Methods
:toctree: api/
DatetimeIndex.mean
+ DatetimeIndex.std
TimedeltaIndex
--------------
diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst
index 442631de50c7a..425b5f81be966 100644
--- a/doc/source/reference/io.rst
+++ b/doc/source/reference/io.rst
@@ -13,6 +13,7 @@ Pickling
:toctree: api/
read_pickle
+ DataFrame.to_pickle
Flat file
~~~~~~~~~
@@ -21,6 +22,7 @@ Flat file
read_table
read_csv
+ DataFrame.to_csv
read_fwf
Clipboard
@@ -29,6 +31,7 @@ Clipboard
:toctree: api/
read_clipboard
+ DataFrame.to_clipboard
Excel
~~~~~
@@ -36,14 +39,25 @@ Excel
:toctree: api/
read_excel
+ DataFrame.to_excel
ExcelFile.parse
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_excel
+
+.. currentmodule:: pandas
+
.. autosummary::
:toctree: api/
- :template: autosummary/class_without_autosummary.rst
ExcelWriter
+.. currentmodule:: pandas
+
JSON
~~~~
.. autosummary::
@@ -51,6 +65,7 @@ JSON
read_json
json_normalize
+ DataFrame.to_json
.. currentmodule:: pandas.io.json
@@ -67,6 +82,16 @@ HTML
:toctree: api/
read_html
+ DataFrame.to_html
+
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_html
+
+.. currentmodule:: pandas
XML
~~~~
@@ -74,6 +99,23 @@ XML
:toctree: api/
read_xml
+ DataFrame.to_xml
+
+LaTeX
+~~~~~
+.. autosummary::
+ :toctree: api/
+
+ DataFrame.to_latex
+
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_latex
+
+.. currentmodule:: pandas
HDFStore: PyTables (HDF5)
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -92,7 +134,7 @@ HDFStore: PyTables (HDF5)
.. warning::
- One can store a subclass of ``DataFrame`` or ``Series`` to HDF5,
+ One can store a subclass of :class:`DataFrame` or :class:`Series` to HDF5,
but the type of the subclass is lost upon storing.
Feather
@@ -101,6 +143,7 @@ Feather
:toctree: api/
read_feather
+ DataFrame.to_feather
Parquet
~~~~~~~
@@ -108,6 +151,7 @@ Parquet
:toctree: api/
read_parquet
+ DataFrame.to_parquet
ORC
~~~
@@ -115,6 +159,7 @@ ORC
:toctree: api/
read_orc
+ DataFrame.to_orc
SAS
~~~
@@ -138,6 +183,7 @@ SQL
read_sql_table
read_sql_query
read_sql
+ DataFrame.to_sql
Google BigQuery
~~~~~~~~~~~~~~~
@@ -152,6 +198,7 @@ STATA
:toctree: api/
read_stata
+ DataFrame.to_stata
.. currentmodule:: pandas.io.stata
diff --git a/doc/source/reference/options.rst b/doc/source/reference/options.rst
new file mode 100644
index 0000000000000..7316b6e9c72b1
--- /dev/null
+++ b/doc/source/reference/options.rst
@@ -0,0 +1,21 @@
+{{ header }}
+
+.. _api.options:
+
+====================
+Options and settings
+====================
+.. currentmodule:: pandas
+
+API for configuring global behavior. See :ref:`the User Guide ` for more.
+
+Working with options
+--------------------
+.. autosummary::
+ :toctree: api/
+
+ describe_option
+ reset_option
+ get_option
+ set_option
+ option_context
diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
index 3ff3b2bb53fda..fcdc9ea9b95da 100644
--- a/doc/source/reference/series.rst
+++ b/doc/source/reference/series.rst
@@ -342,6 +342,7 @@ Datetime methods
:toctree: api/
:template: autosummary/accessor_method.rst
+ Series.dt.isocalendar
Series.dt.to_period
Series.dt.to_pydatetime
Series.dt.tz_localize
@@ -427,6 +428,8 @@ strings and apply several methods to it. These can be accessed like
Series.str.normalize
Series.str.pad
Series.str.partition
+ Series.str.removeprefix
+ Series.str.removesuffix
Series.str.repeat
Series.str.replace
Series.str.rfind
diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst
index 5a2ff803f0323..5144f12fa373a 100644
--- a/doc/source/reference/style.rst
+++ b/doc/source/reference/style.rst
@@ -24,7 +24,10 @@ Styler properties
Styler.env
Styler.template_html
+ Styler.template_html_style
+ Styler.template_html_table
Styler.template_latex
+ Styler.template_string
Styler.loader
Style application
@@ -34,13 +37,19 @@ Style application
Styler.apply
Styler.applymap
- Styler.where
+ Styler.apply_index
+ Styler.applymap_index
Styler.format
+ Styler.format_index
+ Styler.relabel_index
+ Styler.hide
+ Styler.concat
Styler.set_td_classes
Styler.set_table_styles
Styler.set_table_attributes
Styler.set_tooltips
Styler.set_caption
+ Styler.set_sticky
Styler.set_properties
Styler.set_uuid
Styler.clear
@@ -65,9 +74,9 @@ Style export and import
.. autosummary::
:toctree: api/
- Styler.render
- Styler.export
- Styler.use
Styler.to_html
- Styler.to_excel
Styler.to_latex
+ Styler.to_excel
+ Styler.to_string
+ Styler.export
+ Styler.use
diff --git a/doc/source/reference/testing.rst b/doc/source/reference/testing.rst
new file mode 100644
index 0000000000000..1144c767942d4
--- /dev/null
+++ b/doc/source/reference/testing.rst
@@ -0,0 +1,77 @@
+{{ header }}
+
+.. _api.testing:
+
+=======
+Testing
+=======
+.. currentmodule:: pandas
+
+.. _api.general.testing:
+
+Assertion functions
+-------------------
+.. autosummary::
+ :toctree: api/
+
+ testing.assert_frame_equal
+ testing.assert_series_equal
+ testing.assert_index_equal
+ testing.assert_extension_array_equal
+
+Exceptions and warnings
+-----------------------
+.. autosummary::
+ :toctree: api/
+
+ errors.AbstractMethodError
+ errors.AccessorRegistrationWarning
+ errors.AttributeConflictWarning
+ errors.CategoricalConversionWarning
+ errors.ClosedFileError
+ errors.CSSWarning
+ errors.DatabaseError
+ errors.DataError
+ errors.DtypeWarning
+ errors.DuplicateLabelError
+ errors.EmptyDataError
+ errors.IncompatibilityWarning
+ errors.IndexingError
+ errors.InvalidColumnName
+ errors.InvalidIndexError
+ errors.IntCastingNaNError
+ errors.MergeError
+ errors.NullFrequencyError
+ errors.NumbaUtilError
+ errors.NumExprClobberingError
+ errors.OptionError
+ errors.OutOfBoundsDatetime
+ errors.OutOfBoundsTimedelta
+ errors.ParserError
+ errors.ParserWarning
+ errors.PerformanceWarning
+ errors.PossibleDataLossError
+ errors.PossiblePrecisionLoss
+ errors.PyperclipException
+ errors.PyperclipWindowsException
+ errors.SettingWithCopyError
+ errors.SettingWithCopyWarning
+ errors.SpecificationError
+ errors.UndefinedVariableError
+ errors.UnsortedIndexError
+ errors.UnsupportedFunctionCall
+ errors.ValueLabelTypeMismatch
+
+Bug report function
+-------------------
+.. autosummary::
+ :toctree: api/
+
+ show_versions
+
+Test suite runner
+-----------------
+.. autosummary::
+ :toctree: api/
+
+ test
diff --git a/doc/source/reference/window.rst b/doc/source/reference/window.rst
index a255b3ae8081e..0be3184a9356c 100644
--- a/doc/source/reference/window.rst
+++ b/doc/source/reference/window.rst
@@ -35,6 +35,7 @@ Rolling window functions
Rolling.aggregate
Rolling.quantile
Rolling.sem
+ Rolling.rank
.. _api.functions_window:
@@ -75,6 +76,7 @@ Expanding window functions
Expanding.aggregate
Expanding.quantile
Expanding.sem
+ Expanding.rank
.. _api.functions_ewm:
@@ -86,6 +88,7 @@ Exponentially-weighted window functions
:toctree: api/
ExponentialMovingWindow.mean
+ ExponentialMovingWindow.sum
ExponentialMovingWindow.std
ExponentialMovingWindow.var
ExponentialMovingWindow.corr
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 2b329ef362354..c767fb1ebef7f 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -19,7 +19,7 @@ Customarily, we import as follows:
Object creation
---------------
-See the :ref:`Data Structure Intro section `.
+See the :ref:`Intro to data structures section `.
Creating a :class:`Series` by passing a list of values, letting pandas create
a default integer index:
@@ -29,7 +29,7 @@ a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
-Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index
+Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index using :func:`date_range`
and labeled columns:
.. ipython:: python
@@ -39,7 +39,8 @@ and labeled columns:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
-Creating a :class:`DataFrame` by passing a dict of objects that can be converted to series-like.
+Creating a :class:`DataFrame` by passing a dictionary of objects that can be
+converted into a series-like structure:
.. ipython:: python
@@ -56,7 +57,7 @@ Creating a :class:`DataFrame` by passing a dict of objects that can be converted
df2
The columns of the resulting :class:`DataFrame` have different
-:ref:`dtypes `.
+:ref:`dtypes `:
.. ipython:: python
@@ -92,14 +93,15 @@ Viewing data
See the :ref:`Basics section `.
-Here is how to view the top and bottom rows of the frame:
+Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame
+respectively:
.. ipython:: python
df.head()
df.tail(3)
-Display the index, columns:
+Display the :attr:`DataFrame.index` or :attr:`DataFrame.columns`:
.. ipython:: python
@@ -115,15 +117,15 @@ while pandas DataFrames have one dtype per column**. When you call
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
-For ``df``, our :class:`DataFrame` of all floating-point values,
-:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
+For ``df``, our :class:`DataFrame` of all floating-point values,
+:meth:`DataFrame.to_numpy` is fast and doesn't require copying data:
.. ipython:: python
df.to_numpy()
For ``df2``, the :class:`DataFrame` with multiple dtypes,
-:meth:`DataFrame.to_numpy` is relatively expensive.
+:meth:`DataFrame.to_numpy` is relatively expensive:
.. ipython:: python
@@ -146,13 +148,13 @@ Transposing your data:
df.T
-Sorting by an axis:
+:meth:`DataFrame.sort_index` sorts by an axis:
.. ipython:: python
df.sort_index(axis=1, ascending=False)
-Sorting by values:
+:meth:`DataFrame.sort_values` sorts by values:
.. ipython:: python
@@ -165,8 +167,8 @@ Selection
While standard Python / NumPy expressions for selecting and setting are
intuitive and come in handy for interactive work, for production code, we
- recommend the optimized pandas data access methods, ``.at``, ``.iat``,
- ``.loc`` and ``.iloc``.
+ recommend the optimized pandas data access methods, :meth:`DataFrame.at`, :meth:`DataFrame.iat`,
+ :meth:`DataFrame.loc` and :meth:`DataFrame.iloc`.
See the indexing documentation :ref:`Indexing and Selecting Data ` and :ref:`MultiIndex / Advanced Indexing `.
@@ -180,7 +182,7 @@ equivalent to ``df.A``:
df["A"]
-Selecting via ``[]``, which slices the rows.
+Selecting via ``[]`` (``__getitem__``), which slices the rows:
.. ipython:: python
@@ -190,7 +192,7 @@ Selecting via ``[]``, which slices the rows.
Selection by label
~~~~~~~~~~~~~~~~~~
-See more in :ref:`Selection by Label `.
+See more in :ref:`Selection by Label ` using :meth:`DataFrame.loc` or :meth:`DataFrame.at`.
For getting a cross section using a label:
@@ -231,7 +233,7 @@ For getting fast access to a scalar (equivalent to the prior method):
Selection by position
~~~~~~~~~~~~~~~~~~~~~
-See more in :ref:`Selection by Position `.
+See more in :ref:`Selection by Position ` using :meth:`DataFrame.iloc` or :meth:`DataFrame.iat`.
Select via the position of the passed integers:
@@ -278,13 +280,13 @@ For getting fast access to a scalar (equivalent to the prior method):
Boolean indexing
~~~~~~~~~~~~~~~~
-Using a single column's values to select data.
+Using a single column's values to select data:
.. ipython:: python
df[df["A"] > 0]
-Selecting values from a DataFrame where a boolean condition is met.
+Selecting values from a DataFrame where a boolean condition is met:
.. ipython:: python
@@ -303,7 +305,7 @@ Setting
~~~~~~~
Setting a new column automatically aligns the data
-by the indexes.
+by the indexes:
.. ipython:: python
@@ -326,16 +328,17 @@ Setting values by position:
Setting by assigning with a NumPy array:
.. ipython:: python
+ :okwarning:
df.loc[:, "D"] = np.array([5] * len(df))
-The result of the prior setting operations.
+The result of the prior setting operations:
.. ipython:: python
df
-A ``where`` operation with setting.
+A ``where`` operation with setting:
.. ipython:: python
@@ -352,7 +355,7 @@ default not included in computations. See the :ref:`Missing Data section
`.
Reindexing allows you to change/add/delete the index on a specified axis. This
-returns a copy of the data.
+returns a copy of the data:
.. ipython:: python
@@ -360,19 +363,19 @@ returns a copy of the data.
df1.loc[dates[0] : dates[1], "E"] = 1
df1
-To drop any rows that have missing data.
+:meth:`DataFrame.dropna` drops any rows that have missing data:
.. ipython:: python
df1.dropna(how="any")
-Filling missing data.
+:meth:`DataFrame.fillna` fills missing data:
.. ipython:: python
df1.fillna(value=5)
-To get the boolean mask where values are ``nan``.
+:func:`isna` gets the boolean mask where values are ``nan``:
.. ipython:: python
@@ -402,7 +405,7 @@ Same operation on the other axis:
df.mean(1)
Operating with objects that have different dimensionality and need alignment.
-In addition, pandas automatically broadcasts along the specified dimension.
+In addition, pandas automatically broadcasts along the specified dimension:
.. ipython:: python
@@ -414,7 +417,7 @@ In addition, pandas automatically broadcasts along the specified dimension.
Apply
~~~~~
-Applying functions to the data:
+:meth:`DataFrame.apply` applies a user-defined function to the data:
.. ipython:: python
@@ -460,7 +463,7 @@ operations.
See the :ref:`Merging section `.
-Concatenating pandas objects together with :func:`concat`:
+Concatenating pandas objects together along an axis with :func:`concat`:
.. ipython:: python
@@ -477,12 +480,11 @@ Concatenating pandas objects together with :func:`concat`:
a row requires a copy, and may be expensive. We recommend passing a
pre-built list of records to the :class:`DataFrame` constructor instead
of building a :class:`DataFrame` by iteratively appending records to it.
- See :ref:`Appending to dataframe ` for more.
Join
~~~~
-SQL style merges. See the :ref:`Database style joining ` section.
+:func:`merge` enables SQL style join types along specific columns. See the :ref:`Database style joining ` section.
.. ipython:: python
@@ -527,14 +529,14 @@ See the :ref:`Grouping section `.
df
Grouping and then applying the :meth:`~pandas.core.groupby.GroupBy.sum` function to the resulting
-groups.
+groups:
.. ipython:: python
- df.groupby("A").sum()
+ df.groupby("A")[["C", "D"]].sum()
Grouping by multiple columns forms a hierarchical index, and again we can
-apply the :meth:`~pandas.core.groupby.GroupBy.sum` function.
+apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
.. ipython:: python
@@ -553,10 +555,8 @@ Stack
tuples = list(
zip(
- *[
- ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
- ["one", "two", "one", "two", "one", "two", "one", "two"],
- ]
+ ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
+ ["one", "two", "one", "two", "one", "two", "one", "two"],
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
@@ -565,14 +565,14 @@ Stack
df2
The :meth:`~DataFrame.stack` method "compresses" a level in the DataFrame's
-columns.
+columns:
.. ipython:: python
stacked = df2.stack()
stacked
-With a "stacked" DataFrame or Series (having a ``MultiIndex`` as the
+With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
``index``), the inverse operation of :meth:`~DataFrame.stack` is
:meth:`~DataFrame.unstack`, which by default unstacks the **last level**:
@@ -599,7 +599,7 @@ See the section on :ref:`Pivot Tables `.
)
df
-We can produce pivot tables from this data very easily:
+:func:`pivot_table` pivots a :class:`DataFrame` specifying the ``values``, ``index`` and ``columns``:
.. ipython:: python
@@ -620,7 +620,7 @@ financial applications. See the :ref:`Time Series section `.
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()
-Time zone representation:
+:meth:`Series.tz_localize` localizes a time series to a time zone:
.. ipython:: python
@@ -630,7 +630,7 @@ Time zone representation:
ts_utc = ts.tz_localize("UTC")
ts_utc
-Converting to another time zone:
+:meth:`Series.tz_convert` converts a timezone-aware time series to another time zone:
.. ipython:: python
@@ -673,21 +673,21 @@ pandas can include categorical data in a :class:`DataFrame`. For full docs, see
-Convert the raw grades to a categorical data type.
+Converting the raw grades to a categorical data type:
.. ipython:: python
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
-Rename the categories to more meaningful names (assigning to
-:meth:`Series.cat.categories` is in place!).
+Rename the categories to more meaningful names:
.. ipython:: python
- df["grade"].cat.categories = ["very good", "good", "very bad"]
+ new_categories = ["very good", "good", "very bad"]
+ df["grade"] = df["grade"].cat.rename_categories(new_categories)
-Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default).
+Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default):
.. ipython:: python
@@ -696,13 +696,13 @@ Reorder the categories and simultaneously add the missing categories (methods un
)
df["grade"]
-Sorting is per order in the categories, not lexical order.
+Sorting is per order in the categories, not lexical order:
.. ipython:: python
df.sort_values(by="grade")
-Grouping by a categorical column also shows empty categories.
+Grouping by a categorical column also shows empty categories:
.. ipython:: python
@@ -722,7 +722,7 @@ We use the standard convention for referencing the matplotlib API:
plt.close("all")
-The :meth:`~plt.close` method is used to `close `__ a figure window.
+The ``plt.close`` method is used to `close `__ a figure window:
.. ipython:: python
@@ -732,6 +732,14 @@ The :meth:`~plt.close` method is used to `close `__ to show it or
+`matplotlib.pyplot.savefig `__ to write it to a file.
+
+.. ipython:: python
+
+ plt.show();
+
On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all
of the columns with labels:
@@ -748,19 +756,19 @@ of the columns with labels:
@savefig frame_plot_basic.png
plt.legend(loc='best');
-Getting data in/out
--------------------
+Importing and exporting data
+----------------------------
CSV
~~~
-:ref:`Writing to a csv file. `
+:ref:`Writing to a csv file: ` using :meth:`DataFrame.to_csv`
.. ipython:: python
df.to_csv("foo.csv")
-:ref:`Reading from a csv file. `
+:ref:`Reading from a csv file: ` using :func:`read_csv`
.. ipython:: python
@@ -778,13 +786,13 @@ HDF5
Reading and writing to :ref:`HDFStores `.
-Writing to a HDF5 Store.
+Writing to a HDF5 Store using :meth:`DataFrame.to_hdf`:
.. ipython:: python
df.to_hdf("foo.h5", "df")
-Reading from a HDF5 Store.
+Reading from a HDF5 Store using :func:`read_hdf`:
.. ipython:: python
@@ -798,15 +806,15 @@ Reading from a HDF5 Store.
Excel
~~~~~
-Reading and writing to :ref:`MS Excel `.
+Reading and writing to :ref:`Excel `.
-Writing to an excel file.
+Writing to an excel file using :meth:`DataFrame.to_excel`:
.. ipython:: python
df.to_excel("foo.xlsx", sheet_name="Sheet1")
-Reading from an excel file.
+Reading from an excel file using :func:`read_excel`:
.. ipython:: python
@@ -820,16 +828,13 @@ Reading from an excel file.
Gotchas
-------
-If you are attempting to perform an operation you might see an exception like:
-
-.. code-block:: python
+If you are attempting to perform a boolean operation on a :class:`Series` or :class:`DataFrame`,
+you might see an exception like:
- >>> if pd.Series([False, True, False]):
- ... print("I was true")
- Traceback
- ...
- ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+.. ipython:: python
+ :okexcept:
-See :ref:`Comparisons` for an explanation and what to do.
+ if pd.Series([False, True, False]):
+ print("I was true")
-See :ref:`Gotchas` as well.
+See :ref:`Comparisons` and :ref:`Gotchas` for an explanation and what to do.
diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst
index 3b33ebe701037..b8df21ab5a5b4 100644
--- a/doc/source/user_guide/advanced.rst
+++ b/doc/source/user_guide/advanced.rst
@@ -7,7 +7,7 @@ MultiIndex / advanced indexing
******************************
This section covers :ref:`indexing with a MultiIndex `
-and :ref:`other advanced indexing features `.
+and :ref:`other advanced indexing features `.
See the :ref:`Indexing and Selecting Data ` for general indexing documentation.
@@ -738,7 +738,7 @@ faster than fancy indexing.
%timeit ser.iloc[indexer]
%timeit ser.take(indexer)
-.. _indexing.index_types:
+.. _advanced.index_types:
Index types
-----------
@@ -749,7 +749,7 @@ and documentation about ``TimedeltaIndex`` is found :ref:`here `__.
-.. _indexing.float64index:
+.. _advanced.float64index:
Float64Index
~~~~~~~~~~~~
+.. deprecated:: 1.4.0
+ :class:`Index` will become the default index type for numeric types in the future
+ instead of ``Int64Index``, ``Float64Index`` and ``UInt64Index``, and those index types
+ are therefore deprecated and will be removed in a future version of pandas.
+ ``RangeIndex`` will not be removed as it represents an optimized version of an integer index.
+
By default a :class:`Float64Index` will be automatically created when passing floating, or mixed-integer-floating values in index creation.
This enables a pure label-based slicing paradigm that makes ``[],ix,loc`` for scalar indexing and slicing work exactly the
same.
@@ -956,6 +968,7 @@ If you need integer based selection, you should use ``iloc``:
dfir.iloc[0:5]
+
.. _advanced.intervalindex:
IntervalIndex
@@ -1233,5 +1246,5 @@ This is because the (re)indexing operations above silently inserts ``NaNs`` and
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs``
such as ``numpy.logical_and``.
-See the `this old issue `__ for a more
+See :issue:`2388` for a more
detailed discussion.
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
index 82c8a27bec3a5..a34d4891b9d77 100644
--- a/doc/source/user_guide/basics.rst
+++ b/doc/source/user_guide/basics.rst
@@ -848,8 +848,8 @@ have introduced the popular ``(%>%)`` (read pipe) operator for R_.
The implementation of ``pipe`` here is quite clean and feels right at home in Python.
We encourage you to view the source code of :meth:`~DataFrame.pipe`.
-.. _dplyr: https://github.com/hadley/dplyr
-.. _magrittr: https://github.com/smbache/magrittr
+.. _dplyr: https://github.com/tidyverse/dplyr
+.. _magrittr: https://github.com/tidyverse/magrittr
.. _R: https://www.r-project.org
@@ -1045,6 +1045,9 @@ not noted for a particular column will be ``NaN``:
Mixed dtypes
++++++++++++
+.. deprecated:: 1.4.0
+ Attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any portion of the columns or operations provided fails, the call to ``.agg`` will raise.
+
When presented with mixed dtypes that cannot aggregate, ``.agg`` will only take the valid
aggregations. This is similar to how ``.groupby.agg`` works.
@@ -1061,6 +1064,7 @@ aggregations. This is similar to how ``.groupby.agg`` works.
mdf.dtypes
.. ipython:: python
+ :okwarning:
mdf.agg(["min", "sum"])
@@ -2047,32 +2051,33 @@ The following table lists all of pandas extension types. For methods requiring `
arguments, strings can be specified as indicated. See the respective
documentation sections for more on each type.
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Kind of Data | Data Type | Scalar | Array | String Aliases | Documentation |
-+===================+===========================+====================+===============================+=========================================+===============================+
-| tz-aware datetime | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` | :ref:`timeseries.timezone` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Categorical | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` | :ref:`categorical` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| period | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, | :ref:`timeseries.periods` |
-| (time spans) | | | | ``'Period[]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| sparse | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, | :ref:`sparse` |
-| | | | | ``'Sparse[float]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| intervals | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, | :ref:`advanced.intervalindex` |
-| | | | | ``'Interval[]'``, | |
-| | | | | ``'Interval[datetime64[ns, ]]'``, | |
-| | | | | ``'Interval[timedelta64[]]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| nullable integer + :class:`Int64Dtype`, ... | (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, | :ref:`integer_na` |
-| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``, | |
-| | | | | ``'UInt32'``, ``'UInt64'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Strings | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` | :ref:`text` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Boolean (with NA) | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` | :ref:`api.arrays.bool` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| Kind of Data | Data Type | Scalar | Array | String Aliases |
++=================================================+===========================+====================+===============================+========================================+
+| :ref:`tz-aware datetime ` | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` |
+| | | | | |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Categorical ` | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`period (time spans) ` | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, |
+| | | | | ``'Period[]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`sparse ` | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, |
+| | | | | ``'Sparse[float]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`intervals ` | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, |
+| | | | | ``'Interval[]'``, |
+| | | | | ``'Interval[datetime64[ns, ]]'``, |
+| | | | | ``'Interval[timedelta64[]]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`nullable integer ` | :class:`Int64Dtype`, ... | (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, |
+| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``,|
+| | | | | ``'UInt32'``, ``'UInt64'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Strings ` | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Boolean (with NA) ` | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
pandas has two ways to store strings.
diff --git a/doc/source/user_guide/boolean.rst b/doc/source/user_guide/boolean.rst
index 76c922fcef638..54c67674b890c 100644
--- a/doc/source/user_guide/boolean.rst
+++ b/doc/source/user_guide/boolean.rst
@@ -12,6 +12,11 @@
Nullable Boolean data type
**************************
+.. note::
+
+ BooleanArray is currently experimental. Its API or implementation may
+ change without warning.
+
.. versionadded:: 1.0.0
diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index f65638cd78a2b..b5cb1d83a9f52 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -334,8 +334,7 @@ It's also possible to pass in the categories in a specific order:
Renaming categories
~~~~~~~~~~~~~~~~~~~
-Renaming categories is done by assigning new values to the
-``Series.cat.categories`` property or by using the
+Renaming categories is done by using the
:meth:`~pandas.Categorical.rename_categories` method:
@@ -343,9 +342,8 @@ Renaming categories is done by assigning new values to the
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s
- s.cat.categories = ["Group %s" % g for g in s.cat.categories]
- s
- s = s.cat.rename_categories([1, 2, 3])
+ new_categories = ["Group %s" % g for g in s.cat.categories]
+ s = s.cat.rename_categories(new_categories)
s
# You can also pass a dict-like object to map the renaming
s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
@@ -365,7 +363,7 @@ Categories must be unique or a ``ValueError`` is raised:
.. ipython:: python
try:
- s.cat.categories = [1, 1, 1]
+ s = s.cat.rename_categories([1, 1, 1])
except ValueError as e:
print("ValueError:", str(e))
@@ -374,7 +372,7 @@ Categories must also not be ``NaN`` or a ``ValueError`` is raised:
.. ipython:: python
try:
- s.cat.categories = [1, 2, np.nan]
+ s = s.cat.rename_categories([1, 2, np.nan])
except ValueError as e:
print("ValueError:", str(e))
@@ -702,7 +700,7 @@ of length "1".
.. ipython:: python
df.iat[0, 0]
- df["cats"].cat.categories = ["x", "y", "z"]
+ df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
df.at["h", "cats"] # returns a string
.. note::
@@ -777,8 +775,8 @@ value is included in the ``categories``:
df
try:
df.iloc[2:4, :] = [["c", 3], ["c", 3]]
- except ValueError as e:
- print("ValueError:", str(e))
+ except TypeError as e:
+ print("TypeError:", str(e))
Setting values by assigning categorical data will also check that the ``categories`` match:
@@ -788,8 +786,8 @@ Setting values by assigning categorical data will also check that the ``categori
df
try:
df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
- except ValueError as e:
- print("ValueError:", str(e))
+ except TypeError as e:
+ print("TypeError:", str(e))
Assigning a ``Categorical`` to parts of a column of other types will use the values:
@@ -960,7 +958,7 @@ relevant columns back to ``category`` and assign the right categories and catego
s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
# rename the categories
- s.cat.categories = ["very good", "good", "bad"]
+ s = s.cat.rename_categories(["very good", "good", "bad"])
# reorder the categories and add missing categories
s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
@@ -1141,7 +1139,7 @@ Categorical index
``CategoricalIndex`` is a type of index that is useful for supporting
indexing with duplicates. This is a container around a ``Categorical``
and allows efficient indexing and storage of an index with a large number of duplicated elements.
-See the :ref:`advanced indexing docs ` for a more detailed
+See the :ref:`advanced indexing docs ` for a more detailed
explanation.
Setting the index will create a ``CategoricalIndex``:
@@ -1164,6 +1162,7 @@ Constructing a ``Series`` from a ``Categorical`` will not copy the input
change the original ``Categorical``:
.. ipython:: python
+ :okwarning:
cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat")
diff --git a/doc/source/user_guide/computation.rst b/doc/source/user_guide/computation.rst
deleted file mode 100644
index 6007129e96ba0..0000000000000
--- a/doc/source/user_guide/computation.rst
+++ /dev/null
@@ -1,212 +0,0 @@
-.. _computation:
-
-{{ header }}
-
-Computational tools
-===================
-
-
-Statistical functions
----------------------
-
-.. _computation.pct_change:
-
-Percent change
-~~~~~~~~~~~~~~
-
-``Series`` and ``DataFrame`` have a method
-:meth:`~DataFrame.pct_change` to compute the percent change over a given number
-of periods (using ``fill_method`` to fill NA/null values *before* computing
-the percent change).
-
-.. ipython:: python
-
- ser = pd.Series(np.random.randn(8))
-
- ser.pct_change()
-
-.. ipython:: python
-
- df = pd.DataFrame(np.random.randn(10, 4))
-
- df.pct_change(periods=3)
-
-.. _computation.covariance:
-
-Covariance
-~~~~~~~~~~
-
-:meth:`Series.cov` can be used to compute covariance between series
-(excluding missing values).
-
-.. ipython:: python
-
- s1 = pd.Series(np.random.randn(1000))
- s2 = pd.Series(np.random.randn(1000))
- s1.cov(s2)
-
-Analogously, :meth:`DataFrame.cov` to compute pairwise covariances among the
-series in the DataFrame, also excluding NA/null values.
-
-.. _computation.covariance.caveats:
-
-.. note::
-
- Assuming the missing data are missing at random this results in an estimate
- for the covariance matrix which is unbiased. However, for many applications
- this estimate may not be acceptable because the estimated covariance matrix
- is not guaranteed to be positive semi-definite. This could lead to
- estimated correlations having absolute values which are greater than one,
- and/or a non-invertible covariance matrix. See `Estimation of covariance
- matrices `_
- for more details.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
- frame.cov()
-
-``DataFrame.cov`` also supports an optional ``min_periods`` keyword that
-specifies the required minimum number of observations for each column pair
-in order to have a valid result.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
- frame.loc[frame.index[:5], "a"] = np.nan
- frame.loc[frame.index[5:10], "b"] = np.nan
-
- frame.cov()
-
- frame.cov(min_periods=12)
-
-
-.. _computation.correlation:
-
-Correlation
-~~~~~~~~~~~
-
-Correlation may be computed using the :meth:`~DataFrame.corr` method.
-Using the ``method`` parameter, several methods for computing correlations are
-provided:
-
-.. csv-table::
- :header: "Method name", "Description"
- :widths: 20, 80
-
- ``pearson (default)``, Standard correlation coefficient
- ``kendall``, Kendall Tau correlation coefficient
- ``spearman``, Spearman rank correlation coefficient
-
-.. \rho = \cov(x, y) / \sigma_x \sigma_y
-
-All of these are currently computed using pairwise complete observations.
-Wikipedia has articles covering the above correlation coefficients:
-
-* `Pearson correlation coefficient `_
-* `Kendall rank correlation coefficient `_
-* `Spearman's rank correlation coefficient `_
-
-.. note::
-
- Please see the :ref:`caveats ` associated
- with this method of calculating correlation matrices in the
- :ref:`covariance section `.
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
- frame.iloc[::2] = np.nan
-
- # Series with Series
- frame["a"].corr(frame["b"])
- frame["a"].corr(frame["b"], method="spearman")
-
- # Pairwise correlation of DataFrame columns
- frame.corr()
-
-Note that non-numeric columns will be automatically excluded from the
-correlation calculation.
-
-Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword:
-
-.. ipython:: python
-
- frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
- frame.loc[frame.index[:5], "a"] = np.nan
- frame.loc[frame.index[5:10], "b"] = np.nan
-
- frame.corr()
-
- frame.corr(min_periods=12)
-
-
-The ``method`` argument can also be a callable for a generic correlation
-calculation. In this case, it should be a single function
-that produces a single value from two ndarray inputs. Suppose we wanted to
-compute the correlation based on histogram intersection:
-
-.. ipython:: python
-
- # histogram intersection
- def histogram_intersection(a, b):
- return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()
-
-
- frame.corr(method=histogram_intersection)
-
-A related method :meth:`~DataFrame.corrwith` is implemented on DataFrame to
-compute the correlation between like-labeled Series contained in different
-DataFrame objects.
-
-.. ipython:: python
-
- index = ["a", "b", "c", "d", "e"]
- columns = ["one", "two", "three", "four"]
- df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
- df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
- df1.corrwith(df2)
- df2.corrwith(df1, axis=1)
-
-.. _computation.ranking:
-
-Data ranking
-~~~~~~~~~~~~
-
-The :meth:`~Series.rank` method produces a data ranking with ties being
-assigned the mean of the ranks (by default) for the group:
-
-.. ipython:: python
-
- s = pd.Series(np.random.randn(5), index=list("abcde"))
- s["d"] = s["b"] # so there's a tie
- s.rank()
-
-:meth:`~DataFrame.rank` is also a DataFrame method and can rank either the rows
-(``axis=0``) or the columns (``axis=1``). ``NaN`` values are excluded from the
-ranking.
-
-.. ipython:: python
-
- df = pd.DataFrame(np.random.randn(10, 6))
- df[4] = df[2][:5] # some ties
- df
- df.rank(1)
-
-``rank`` optionally takes a parameter ``ascending`` which by default is true;
-when false, data is reverse-ranked, with larger values assigned a smaller rank.
-
-``rank`` supports different tie-breaking methods, specified with the ``method``
-parameter:
-
- - ``average`` : average rank of tied group
- - ``min`` : lowest rank in the group
- - ``max`` : highest rank in the group
- - ``first`` : ranks assigned in the order they appear in the array
-
-.. _computation.windowing:
-
-Windowing functions
-~~~~~~~~~~~~~~~~~~~
-
-See :ref:`the window operations user guide ` for an overview of windowing functions.
diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst
index e1aae0fd481b1..daf5a0e481b8e 100644
--- a/doc/source/user_guide/cookbook.rst
+++ b/doc/source/user_guide/cookbook.rst
@@ -193,8 +193,7 @@ The :ref:`indexing ` docs.
df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
-`Use loc for label-oriented slicing and iloc positional slicing
-`__
+Use loc for label-oriented slicing and iloc positional slicing :issue:`2904`
.. ipython:: python
@@ -229,7 +228,7 @@ Ambiguity arises when an index consists of integers with a non-zero start or non
df2.loc[1:3] # Label-oriented
`Using inverse operator (~) to take the complement of a mask
-`__
+`__
.. ipython:: python
@@ -259,7 +258,7 @@ New columns
df
`Keep other columns when using min() with groupby
-`__
+`__
.. ipython:: python
@@ -389,14 +388,13 @@ Sorting
*******
`Sort by specific column or an ordered list of columns, with a MultiIndex
-`__
+`__
.. ipython:: python
df.sort_values(by=("Labs", "II"), ascending=False)
-`Partial selection, the need for sortedness;
-`__
+Partial selection, the need for sortedness :issue:`2995`
Levels
******
@@ -405,7 +403,7 @@ Levels
`__
`Flatten Hierarchical columns
-`__
+`__
.. _cookbook.missing_data:
@@ -425,7 +423,7 @@ Fill forward a reversed timeseries
)
df.loc[df.index[3], "A"] = np.nan
df
- df.reindex(df.index[::-1]).ffill()
+ df.bfill()
`cumsum reset at NaN values
`__
@@ -513,7 +511,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
def replace(g):
mask = g < 0
- return g.where(mask, g[~mask].mean())
+ return g.where(~mask, g[~mask].mean())
gb.transform(replace)
@@ -556,7 +554,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
ts
`Create a value counts column and reassign back to the DataFrame
-`__
+`__
.. ipython:: python
@@ -663,7 +661,7 @@ Pivot
The :ref:`Pivot ` docs.
`Partial sums and subtotals
-`__
+`__
.. ipython:: python
@@ -870,7 +868,7 @@ Timeseries
`__
`Constructing a datetime range that excludes weekends and includes only certain times
-`__
+`__
`Vectorized Lookup
`__
@@ -910,8 +908,7 @@ Valid frequency arguments to Grouper :ref:`Timeseries `__
-`Using TimeGrouper and another grouping to create subgroups, then apply a custom function
-`__
+Using TimeGrouper and another grouping to create subgroups, then apply a custom function :issue:`3791`
`Resampling with custom periods
`__
@@ -929,9 +926,9 @@ Valid frequency arguments to Grouper :ref:`Timeseries ` docs. The :ref:`Join ` docs.
+The :ref:`Join ` docs.
-`Append two dataframes with overlapping index (emulate R rbind)
+`Concatenate two dataframes with overlapping index (emulate R rbind)
`__
.. ipython:: python
@@ -944,11 +941,10 @@ Depending on df construction, ``ignore_index`` may be needed
.. ipython:: python
- df = df1.append(df2, ignore_index=True)
+ df = pd.concat([df1, df2], ignore_index=True)
df
-`Self Join of a DataFrame
-`__
+Self Join of a DataFrame :issue:`2996`
.. ipython:: python
@@ -1038,7 +1034,7 @@ Data in/out
-----------
`Performance comparison of SQL vs HDF5
-`__
+`__
.. _cookbook.csv:
@@ -1070,14 +1066,7 @@ using that handle to read.
`Inferring dtypes from a file
`__
-`Dealing with bad lines
-`__
-
-`Dealing with bad lines II
-`__
-
-`Reading CSV with Unix timestamps and converting to local timezone
-`__
+Dealing with bad lines :issue:`2886`
`Write a multi-row index CSV without writing duplicates
`__
@@ -1211,6 +1200,8 @@ The :ref:`Excel ` docs
`Modifying formatting in XlsxWriter output
`__
+Loading only visible sheets :issue:`19842#issuecomment-892150745`
+
.. _cookbook.html:
HTML
@@ -1229,8 +1220,7 @@ The :ref:`HDFStores ` docs
`Simple queries with a Timestamp Index
`__
-`Managing heterogeneous data using a linked multiple table hierarchy
-`__
+Managing heterogeneous data using a linked multiple table hierarchy :issue:`3032`
`Merging on-disk tables with millions of rows
`__
@@ -1250,7 +1240,7 @@ csv file and creating a store by chunks, with date parsing as well.
`__
`Large Data work flows
-`__
+`__
`Reading in a sequence of files, then providing a global unique index to a store while appending
`__
@@ -1300,7 +1290,7 @@ is closed.
.. ipython:: python
- store = pd.HDFStore("test.h5", "w", diver="H5FD_CORE")
+ store = pd.HDFStore("test.h5", "w", driver="H5FD_CORE")
df = pd.DataFrame(np.random.randn(8, 3))
store["test"] = df
@@ -1381,7 +1371,7 @@ Computation
-----------
`Numerical integration (sample-based) of a time series
-`__
+`__
Correlation
***********
diff --git a/doc/source/user_guide/dsintro.rst b/doc/source/user_guide/dsintro.rst
index efcf1a8703d2b..571f8980070af 100644
--- a/doc/source/user_guide/dsintro.rst
+++ b/doc/source/user_guide/dsintro.rst
@@ -8,7 +8,7 @@ Intro to data structures
We'll start with a quick, non-comprehensive overview of the fundamental data
structures in pandas to get you started. The fundamental behavior about data
-types, indexing, and axis labeling / alignment apply across all of the
+types, indexing, axis labeling, and alignment apply across all of the
objects. To get started, import NumPy and load pandas into your namespace:
.. ipython:: python
@@ -16,7 +16,7 @@ objects. To get started, import NumPy and load pandas into your namespace:
import numpy as np
import pandas as pd
-Here is a basic tenet to keep in mind: **data alignment is intrinsic**. The link
+Fundamentally, **data alignment is intrinsic**. The link
between labels and data will not be broken unless done so explicitly by you.
We'll give a brief intro to the data structures, then consider all of the broad
@@ -29,7 +29,7 @@ Series
:class:`Series` is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The axis
-labels are collectively referred to as the **index**. The basic method to create a Series is to call:
+labels are collectively referred to as the **index**. The basic method to create a :class:`Series` is to call:
::
@@ -61,32 +61,17 @@ index is passed, one will be created having values ``[0, ..., len(data) - 1]``.
pandas supports non-unique index values. If an operation
that does not support duplicate index values is attempted, an exception
- will be raised at that time. The reason for being lazy is nearly all performance-based
- (there are many instances in computations, like parts of GroupBy, where the index
- is not used).
+ will be raised at that time.
**From dict**
-Series can be instantiated from dicts:
+:class:`Series` can be instantiated from dicts:
.. ipython:: python
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)
-.. note::
-
- When the data is a dict, and an index is not passed, the ``Series`` index
- will be ordered by the dict's insertion order, if you're using Python
- version >= 3.6 and pandas version >= 0.23.
-
- If you're using Python < 3.6 or pandas < 0.23, and an index is not passed,
- the ``Series`` index will be the lexically ordered list of dict keys.
-
-In the example above, if you were on a Python version lower than 3.6 or a
-pandas version lower than 0.23, the ``Series`` would be ordered by the lexical
-order of the dict keys (i.e. ``['a', 'b', 'c']`` rather than ``['b', 'a', 'c']``).
-
If an index is passed, the values in data corresponding to the labels in the
index will be pulled out.
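As a quick illustration of that label-based selection (a minimal sketch, with a made-up dictionary and labels), passing an explicit index keeps only the matching keys and fills the rest with ``NaN``:

.. code-block:: python

    import pandas as pd

    d = {"b": 1, "a": 0, "c": 2}
    # only labels present in the index are kept; "d" has no entry and becomes NaN
    pd.Series(d, index=["b", "c", "d", "a"])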
@@ -112,7 +97,7 @@ provided. The value will be repeated to match the length of **index**.
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~
-``Series`` acts very similarly to a ``ndarray``, and is a valid argument to most NumPy functions.
+:class:`Series` acts very similarly to a ``ndarray`` and is a valid argument to most NumPy functions.
However, operations such as slicing will also slice the index.
.. ipython:: python
@@ -128,7 +113,7 @@ However, operations such as slicing will also slice the index.
We will address array-based indexing like ``s[[4, 3, 1]]``
in :ref:`section on indexing `.
-Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.
+Like a NumPy array, a pandas :class:`Series` has a single :attr:`~Series.dtype`.
.. ipython:: python
@@ -140,7 +125,7 @@ be an :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.
-If you need the actual array backing a ``Series``, use :attr:`Series.array`.
+If you need the actual array backing a :class:`Series`, use :attr:`Series.array`.
.. ipython:: python
@@ -151,24 +136,24 @@ index (to disable :ref:`automatic alignment `, for example).
:attr:`Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
Briefly, an ExtensionArray is a thin wrapper around one or more *concrete* arrays like a
-:class:`numpy.ndarray`. pandas knows how to take an ``ExtensionArray`` and
-store it in a ``Series`` or a column of a ``DataFrame``.
+:class:`numpy.ndarray`. pandas knows how to take an :class:`~pandas.api.extensions.ExtensionArray` and
+store it in a :class:`Series` or a column of a :class:`DataFrame`.
See :ref:`basics.dtypes` for more.
-While Series is ndarray-like, if you need an *actual* ndarray, then use
+While :class:`Series` is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.
.. ipython:: python
s.to_numpy()
-Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
+Even if the :class:`Series` is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.
Series is dict-like
~~~~~~~~~~~~~~~~~~~
-A Series is like a fixed-size dict in that you can get and set values by index
+A :class:`Series` is also like a fixed-size dict in that you can get and set values by index
label:
.. ipython:: python
@@ -179,14 +164,14 @@ label:
"e" in s
"f" in s
-If a label is not contained, an exception is raised:
+If a label is not contained in the index, an exception is raised:
-.. code-block:: python
+.. ipython:: python
+ :okexcept:
- >>> s["f"]
- KeyError: 'f'
+ s["f"]
-Using the ``get`` method, a missing label will return None or specified default:
+Using the :meth:`Series.get` method, a missing label will return None or specified default:
.. ipython:: python
@@ -194,14 +179,14 @@ Using the ``get`` method, a missing label will return None or specified default:
s.get("f", np.nan)
-See also the :ref:`section on attribute access`.
+These labels can also be accessed by :ref:`attribute`.
Vectorized operations and label alignment with Series
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When working with raw NumPy arrays, looping through value-by-value is usually
-not necessary. The same is true when working with Series in pandas.
-Series can also be passed into most NumPy methods expecting an ndarray.
+not necessary. The same is true when working with :class:`Series` in pandas.
+:class:`Series` can also be passed into most NumPy methods expecting an ndarray.
.. ipython:: python
@@ -209,17 +194,17 @@ Series can also be passed into most NumPy methods expecting an ndarray.
s * 2
np.exp(s)
-A key difference between Series and ndarray is that operations between Series
+A key difference between :class:`Series` and ndarray is that operations between :class:`Series`
automatically align the data based on label. Thus, you can write computations
-without giving consideration to whether the Series involved have the same
+without giving consideration to whether the :class:`Series` involved have the same
labels.
.. ipython:: python
s[1:] + s[:-1]
-The result of an operation between unaligned Series will have the **union** of
-the indexes involved. If a label is not found in one Series or the other, the
+The result of an operation between unaligned :class:`Series` will have the **union** of
+the indexes involved. If a label is not found in one :class:`Series` or the other, the
result will be marked as missing ``NaN``. Being able to write code without doing
any explicit data alignment grants immense freedom and flexibility in
interactive data analysis and research. The integrated data alignment features
@@ -240,7 +225,7 @@ Name attribute
.. _dsintro.name_attribute:
-Series can also have a ``name`` attribute:
+:class:`Series` also has a ``name`` attribute:
.. ipython:: python
@@ -248,10 +233,11 @@ Series can also have a ``name`` attribute:
s
s.name
-The Series ``name`` will be assigned automatically in many cases, in particular
-when taking 1D slices of DataFrame as you will see below.
+The :class:`Series` ``name`` can be assigned automatically in many cases; in particular,
+when selecting a single column from a :class:`DataFrame`, the ``name`` will be assigned
+the column label.
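For instance (a small illustrative sketch; the column names are invented for the example), selecting a single column copies the column label into the ``name`` attribute:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"price": [1.0, 2.0], "volume": [10, 20]})
    s = df["price"]
    s.name  # 'price' -- the column label becomes the Series name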
-You can rename a Series with the :meth:`pandas.Series.rename` method.
+You can rename a :class:`Series` with the :meth:`pandas.Series.rename` method.
.. ipython:: python
@@ -265,17 +251,17 @@ Note that ``s`` and ``s2`` refer to different objects.
DataFrame
---------
-**DataFrame** is a 2-dimensional labeled data structure with columns of
+:class:`DataFrame` is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet or SQL
table, or a dict of Series objects. It is generally the most commonly used
pandas object. Like Series, DataFrame accepts many different kinds of input:
-* Dict of 1D ndarrays, lists, dicts, or Series
+* Dict of 1D ndarrays, lists, dicts, or :class:`Series`
* 2-D numpy.ndarray
* `Structured or record
`__ ndarray
-* A ``Series``
-* Another ``DataFrame``
+* A :class:`Series`
+* Another :class:`DataFrame`
Along with the data, you can optionally pass **index** (row labels) and
**columns** (column labels) arguments. If you pass an index and / or columns,
@@ -286,16 +272,6 @@ not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data
based on common sense rules.
-.. note::
-
- When the data is a dict, and ``columns`` is not specified, the ``DataFrame``
- columns will be ordered by the dict's insertion order, if you are using
- Python version >= 3.6 and pandas >= 0.23.
-
- If you are using Python < 3.6 or pandas < 0.23, and ``columns`` is not
- specified, the ``DataFrame`` columns will be the lexically ordered list of dict
- keys.
-
From dict of Series or dicts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -333,7 +309,7 @@ From dict of ndarrays / lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ndarrays must all be the same length. If an index is passed, it must
-clearly also be the same length as the arrays. If no index is passed, the
+also be the same length as the arrays. If no index is passed, the
result will be ``range(n)``, where ``n`` is the array length.
.. ipython:: python
@@ -402,6 +378,10 @@ The result will be a DataFrame with the same index as the input Series, and
with one column whose name is the original name of the Series (only if no other
column name provided).
+.. ipython:: python
+
+ ser = pd.Series(range(3), index=list("abc"), name="ser")
+ pd.DataFrame(ser)
.. _basics.dataframe.from_list_namedtuples:
@@ -409,8 +389,8 @@ From a list of namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~
The field names of the first ``namedtuple`` in the list determine the columns
-of the ``DataFrame``. The remaining namedtuples (or tuples) are simply unpacked
-and their values are fed into the rows of the ``DataFrame``. If any of those
+of the :class:`DataFrame`. The remaining namedtuples (or tuples) are simply unpacked
+and their values are fed into the rows of the :class:`DataFrame`. If any of those
tuples is shorter than the first ``namedtuple`` then the later columns in the
corresponding row are marked as missing values. If any are longer than the
first ``namedtuple``, a ``ValueError`` is raised.
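A short sketch of this behavior (the ``Point`` namedtuple here is hypothetical, chosen only for illustration):

.. code-block:: python

    from collections import namedtuple

    import pandas as pd

    Point = namedtuple("Point", ["x", "y"])

    # the field names of the first namedtuple become the columns;
    # the plain tuple (2,) is shorter, so its missing "y" value becomes NaN
    pd.DataFrame([Point(0, 0), Point(1, 3), (2,)])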
@@ -440,7 +420,7 @@ can be passed into the DataFrame constructor.
Passing a list of dataclasses is equivalent to passing a list of dictionaries.
Please be aware that all values in the list should be dataclasses; mixing
-types in the list would result in a TypeError.
+types in the list would result in a ``TypeError``.
.. ipython:: python
@@ -452,11 +432,10 @@ types in the list would result in a TypeError.
**Missing data**
-Much more will be said on this topic in the :ref:`Missing data `
-section. To construct a DataFrame with missing data, we use ``np.nan`` to
+To construct a DataFrame with missing data, we use ``np.nan`` to
represent missing values. Alternatively, you may pass a ``numpy.MaskedArray``
as the data argument to the DataFrame constructor, and its masked entries will
-be considered missing.
+be considered missing. See :ref:`Missing data ` for more.
Alternate constructors
~~~~~~~~~~~~~~~~~~~~~~
@@ -465,8 +444,8 @@ Alternate constructors
**DataFrame.from_dict**
-``DataFrame.from_dict`` takes a dict of dicts or a dict of array-like sequences
-and returns a DataFrame. It operates like the ``DataFrame`` constructor except
+:meth:`DataFrame.from_dict` takes a dict of dicts or a dict of array-like sequences
+and returns a DataFrame. It operates like the :class:`DataFrame` constructor except
for the ``orient`` parameter which is ``'columns'`` by default, but which can be
set to ``'index'`` in order to use the dict keys as row labels.
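For example (a minimal sketch with made-up data), ``orient="index"`` turns the dict keys into row labels, and the desired column names can be supplied separately:

.. code-block:: python

    import pandas as pd

    data = {"row_1": [3, 2, 1, 0], "row_2": ["a", "b", "c", "d"]}

    # the dict keys become row labels rather than column labels
    pd.DataFrame.from_dict(data, orient="index", columns=["A", "B", "C", "D"])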
@@ -490,10 +469,10 @@ case, you can also pass the desired column names:
**DataFrame.from_records**
-``DataFrame.from_records`` takes a list of tuples or an ndarray with structured
-dtype. It works analogously to the normal ``DataFrame`` constructor, except that
+:meth:`DataFrame.from_records` takes a list of tuples or an ndarray with structured
+dtype. It works analogously to the normal :class:`DataFrame` constructor, except that
the resulting DataFrame index may be a specific field of the structured
-dtype. For example:
+dtype.
.. ipython:: python
@@ -505,7 +484,7 @@ dtype. For example:
Column selection, addition, deletion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-You can treat a DataFrame semantically like a dict of like-indexed Series
+You can treat a :class:`DataFrame` semantically like a dict of like-indexed :class:`Series`
objects. Getting, setting, and deleting columns works with the same syntax as
the analogous dict operations:
@@ -532,7 +511,7 @@ column:
df["foo"] = "bar"
df
-When inserting a Series that does not have the same index as the DataFrame, it
+When inserting a :class:`Series` that does not have the same index as the :class:`DataFrame`, it
will be conformed to the DataFrame's index:
.. ipython:: python
@@ -543,8 +522,8 @@ will be conformed to the DataFrame's index:
You can insert raw ndarrays but their length must match the length of the
DataFrame's index.
-By default, columns get inserted at the end. The ``insert`` function is
-available to insert at a particular location in the columns:
+By default, columns get inserted at the end. :meth:`DataFrame.insert`
+inserts at a particular location in the columns:
.. ipython:: python
@@ -575,12 +554,12 @@ a function of one argument to be evaluated on the DataFrame being assigned to.
iris.assign(sepal_ratio=lambda x: (x["SepalWidth"] / x["SepalLength"])).head()
-``assign`` **always** returns a copy of the data, leaving the original
+:meth:`~pandas.DataFrame.assign` **always** returns a copy of the data, leaving the original
DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is
useful when you don't have a reference to the DataFrame at hand. This is
-common when using ``assign`` in a chain of operations. For example,
+common when using :meth:`~pandas.DataFrame.assign` in a chain of operations. For example,
we can limit the DataFrame to just those observations with a Sepal Length
greater than 5, calculate the ratio, and plot:
@@ -602,13 +581,13 @@ to those rows with sepal length greater than 5. The filtering happens first,
and then the ratio calculations. This is an example where we didn't
have a reference to the *filtered* DataFrame available.
-The function signature for ``assign`` is simply ``**kwargs``. The keys
+The function signature for :meth:`~pandas.DataFrame.assign` is simply ``**kwargs``. The keys
are the column names for the new fields, and the values are either a value
-to be inserted (for example, a ``Series`` or NumPy array), or a function
-of one argument to be called on the ``DataFrame``. A *copy* of the original
-DataFrame is returned, with the new values inserted.
+to be inserted (for example, a :class:`Series` or NumPy array), or a function
+of one argument to be called on the :class:`DataFrame`. A *copy* of the original
+:class:`DataFrame` is returned, with the new values inserted.
-Starting with Python 3.6 the order of ``**kwargs`` is preserved. This allows
+The order of ``**kwargs`` is preserved. This allows
for *dependent* assignment, where an expression later in ``**kwargs`` can refer
to a column created earlier in the same :meth:`~DataFrame.assign`.
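A small sketch of dependent assignment (column names invented for the example):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2, 3]})
    # "C" refers to "B", which is created earlier in the same assign call
    df.assign(B=lambda x: x["A"] + 1, C=lambda x: x["A"] + x["B"])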
@@ -635,8 +614,8 @@ The basics of indexing are as follows:
Slice rows, ``df[5:10]``, DataFrame
Select rows by boolean vector, ``df[bool_vec]``, DataFrame
-Row selection, for example, returns a Series whose index is the columns of the
-DataFrame:
+Row selection, for example, returns a :class:`Series` whose index is the columns of the
+:class:`DataFrame`:
.. ipython:: python
@@ -653,7 +632,7 @@ fundamentals of reindexing / conforming to new sets of labels in the
Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Data alignment between DataFrame objects automatically align on **both the
+Data alignment between :class:`DataFrame` objects automatically align on **both the
columns and the index (row labels)**. Again, the resulting object will have the
union of the column and row labels.
@@ -663,8 +642,8 @@ union of the column and row labels.
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])
df + df2
-When doing an operation between DataFrame and Series, the default behavior is
-to align the Series **index** on the DataFrame **columns**, thus `broadcasting
+When doing an operation between :class:`DataFrame` and :class:`Series`, the default behavior is
+to align the :class:`Series` **index** on the :class:`DataFrame` **columns**, thus `broadcasting
`__
row-wise. For example:
@@ -675,7 +654,7 @@ row-wise. For example:
For explicit control over the matching and broadcasting behavior, see the
section on :ref:`flexible binary operations `.
-Operations with scalars are just as you would expect:
+Arithmetic operations with scalars operate element-wise:
.. ipython:: python
@@ -685,7 +664,7 @@ Operations with scalars are just as you would expect:
.. _dsintro.boolean:
-Boolean operators work as well:
+Boolean operators operate element-wise as well:
.. ipython:: python
@@ -699,7 +678,7 @@ Boolean operators work as well:
Transposing
~~~~~~~~~~~
-To transpose, access the ``T`` attribute (also the ``transpose`` function),
+To transpose, access the ``T`` attribute or :meth:`DataFrame.transpose`,
similar to an ndarray:
.. ipython:: python
@@ -712,23 +691,21 @@ similar to an ndarray:
DataFrame interoperability with NumPy functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions
-can be used with no issues on Series and DataFrame, assuming the data within
-are numeric:
+Most NumPy functions can be called directly on :class:`Series` and :class:`DataFrame`.
.. ipython:: python
np.exp(df)
np.asarray(df)
-DataFrame is not intended to be a drop-in replacement for ndarray as its
+:class:`DataFrame` is not intended to be a drop-in replacement for ndarray as its
indexing semantics and data model are quite different in places from an n-dimensional
array.
:class:`Series` implements ``__array_ufunc__``, which allows it to work with NumPy's
`universal functions `_.
-The ufunc is applied to the underlying array in a Series.
+The ufunc is applied to the underlying array in a :class:`Series`.
.. ipython:: python
@@ -737,7 +714,7 @@ The ufunc is applied to the underlying array in a Series.
.. versionchanged:: 0.25.0
- When multiple ``Series`` are passed to a ufunc, they are aligned before
+ When multiple :class:`Series` are passed to a ufunc, they are aligned before
performing the operation.
Like other parts of the library, pandas will automatically align labeled inputs
@@ -761,8 +738,8 @@ with missing values.
ser3
np.remainder(ser1, ser3)
-When a binary ufunc is applied to a :class:`Series` and :class:`Index`, the Series
-implementation takes precedence and a Series is returned.
+When a binary ufunc is applied to a :class:`Series` and :class:`Index`, the :class:`Series`
+implementation takes precedence and a :class:`Series` is returned.
.. ipython:: python
@@ -778,10 +755,9 @@ the ufunc is applied without converting the underlying data to an ndarray.
Console display
~~~~~~~~~~~~~~~
-Very large DataFrames will be truncated to display them in the console.
+A very large :class:`DataFrame` will be truncated when displayed in the console.
You can also get a summary using :meth:`~pandas.DataFrame.info`.
-(Here I am reading a CSV version of the **baseball** dataset from the **plyr**
-R package):
+(The **baseball** dataset is from the **plyr** R package):
.. ipython:: python
:suppress:
@@ -802,8 +778,8 @@ R package):
# restore GlobalPrintConfig
pd.reset_option(r"^display\.")
-However, using ``to_string`` will return a string representation of the
-DataFrame in tabular form, though it won't always fit the console width:
+However, using :meth:`DataFrame.to_string` will return a string representation of the
+:class:`DataFrame` in tabular form, though it won't always fit the console width:
.. ipython:: python
@@ -855,7 +831,7 @@ This will print the table in one block.
DataFrame column attribute access and IPython completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-If a DataFrame column label is a valid Python variable name, the column can be
+If a :class:`DataFrame` column label is a valid Python variable name, the column can be
accessed like an attribute:
.. ipython:: python
diff --git a/doc/source/user_guide/duplicates.rst b/doc/source/user_guide/duplicates.rst
index 7cda067fb24ad..7894789846ce8 100644
--- a/doc/source/user_guide/duplicates.rst
+++ b/doc/source/user_guide/duplicates.rst
@@ -28,6 +28,7 @@ duplicates present. The output can't be determined, and so pandas raises.
.. ipython:: python
:okexcept:
+ :okwarning:
s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])
s1.reindex(["a", "b", "c"])
@@ -171,7 +172,7 @@ going forward, to ensure that your data pipeline doesn't introduce duplicates.
>>> deduplicated = raw.groupby(level=0).first() # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward
-Setting ``allows_duplicate_labels=True`` on a ``Series`` or ``DataFrame`` with duplicate
+Setting ``allows_duplicate_labels=False`` on a ``Series`` or ``DataFrame`` with duplicate
labels or performing an operation that introduces duplicate labels on a ``Series`` or
``DataFrame`` that disallows duplicates will raise an
:class:`errors.DuplicateLabelError`.
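A minimal sketch of that error path (the data is invented for the example):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "b", "b"])
    try:
        # the index already holds duplicate labels, so disallowing them raises
        df.set_flags(allows_duplicate_labels=False)
    except pd.errors.DuplicateLabelError as err:
        print(err)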
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index aa9a1ba6d6bf0..1a1229f95523b 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -7,10 +7,10 @@ Enhancing performance
*********************
In this part of the tutorial, we will investigate how to speed up certain
-functions operating on pandas ``DataFrames`` using three different techniques:
+functions operating on pandas :class:`DataFrame` using three different techniques:
Cython, Numba and :func:`pandas.eval`. We will see a speed improvement of ~200
when we use Cython and Numba on a test function operating row-wise on the
-``DataFrame``. Using :func:`pandas.eval` we will speed up a sum by an order of
+:class:`DataFrame`. Using :func:`pandas.eval` we will speed up a sum by an order of
~2.
.. note::
@@ -35,7 +35,7 @@ by trying to remove for-loops and making use of NumPy vectorization. It's always
optimising in Python first.
This tutorial walks through a "typical" process of cythonizing a slow computation.
-We use an `example from the Cython documentation `__
+We use an `example from the Cython documentation `__
but in the context of pandas. Our final cythonized solution is around 100 times
faster than the pure Python solution.
@@ -44,7 +44,7 @@ faster than the pure Python solution.
Pure Python
~~~~~~~~~~~
-We have a ``DataFrame`` to which we want to apply a function row-wise.
+We have a :class:`DataFrame` to which we want to apply a function row-wise.
.. ipython:: python
@@ -73,12 +73,11 @@ Here's the function in pure Python:
s += f(a + i * dx)
return s * dx
-We achieve our result by using ``apply`` (row-wise):
+We achieve our result by using :meth:`DataFrame.apply` (row-wise):
-.. code-block:: ipython
+.. ipython:: python
- In [7]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 174 ms per loop
+ %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
But clearly this isn't fast enough for us. Let's take a look and see where the
time is spent during this operation (limited to the most time consuming
@@ -126,10 +125,9 @@ is here to distinguish between function versions):
to be using bleeding edge IPython for paste to play well with cell magics.
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 85.5 ms per loop
+ %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
Already this has shaved a third off, not too bad for a simple copy and paste.
@@ -155,10 +153,9 @@ We get another huge improvement simply by providing type information:
...: return s * dx
...:
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
- 10 loops, best of 3: 20.3 ms per loop
+ %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
Now, we're talking! It's now over ten times faster than the original Python
implementation, and we haven't *really* modified the code. Let's have another
@@ -173,7 +170,7 @@ look at what's eating up time:
Using ndarray
~~~~~~~~~~~~~
-It's calling series... a lot! It's creating a Series from each row, and get-ting from both
+It's calling series a lot! It's creating a :class:`Series` from each row, and calling get from both
the index and the series (three times for each row). Function calls are expensive
in Python, so maybe we could minimize these by cythonizing the apply part.
@@ -216,10 +213,10 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
.. warning::
- You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
+ You can **not pass** a :class:`Series` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using the
:meth:`Series.to_numpy`. The reason is that the Cython
- definition is specific to an ndarray and not the passed ``Series``.
+ definition is specific to an ndarray and not the passed :class:`Series`.
So, do not do this:
@@ -238,10 +235,9 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
Loops like this would be *extremely* slow in Python, but in Cython looping
over NumPy arrays is *fast*.
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
- 1000 loops, best of 3: 1.25 ms per loop
+ %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
We've gotten another big improvement. Let's check again where the time is spent:
@@ -267,33 +263,33 @@ advanced Cython techniques:
...: cimport cython
...: cimport numpy as np
...: import numpy as np
- ...: cdef double f_typed(double x) except? -2:
+ ...: cdef np.float64_t f_typed(np.float64_t x) except? -2:
...: return x * (x - 1)
- ...: cpdef double integrate_f_typed(double a, double b, int N):
- ...: cdef int i
- ...: cdef double s, dx
- ...: s = 0
+ ...: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
+ ...: cdef np.int64_t i
+ ...: cdef np.float64_t s = 0.0, dx
...: dx = (b - a) / N
...: for i in range(N):
...: s += f_typed(a + i * dx)
...: return s * dx
...: @cython.boundscheck(False)
...: @cython.wraparound(False)
- ...: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
- ...: np.ndarray[double] col_b,
- ...: np.ndarray[int] col_N):
- ...: cdef int i, n = len(col_N)
+ ...: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
+ ...: np.ndarray[np.float64_t] col_a,
+ ...: np.ndarray[np.float64_t] col_b,
+ ...: np.ndarray[np.int64_t] col_N
+ ...: ):
+ ...: cdef np.int64_t i, n = len(col_N)
...: assert len(col_a) == len(col_b) == n
- ...: cdef np.ndarray[double] res = np.empty(n)
+ ...: cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
...: for i in range(n):
...: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
...: return res
...:
-.. code-block:: ipython
+.. ipython:: python
- In [4]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
- 1000 loops, best of 3: 987 us per loop
+ %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
for example) might cause a segfault because memory access isn't checked.
@@ -302,28 +298,85 @@ For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
.. _enhancingperf.numba:
-Using Numba
------------
+Numba (JIT compilation)
+-----------------------
-A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
+An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with `Numba `__.
-Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
+Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran,
+by decorating your function with ``@jit``.
-Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
+Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool).
+Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.
.. note::
- You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda`.
+ The ``@jit`` compilation will add overhead to the runtime of the function, so performance benefits may not be realized especially when using small data sets.
+ Consider `caching `__ your function to avoid compilation overhead each time your function is run.
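As a rough sketch of that caching suggestion (assuming Numba is installed; the function below is a toy example, not tied to any pandas API), ``cache=True`` writes the compiled result to disk so later runs can skip recompilation:

.. code-block:: python

    import numpy as np
    from numba import jit

    @jit(nopython=True, cache=True)  # cache the compiled machine code on disk
    def column_sum(values):
        total = 0.0
        for v in values:
            total += v
        return total

    column_sum(np.arange(1_000_000, dtype=np.float64))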
-.. note::
+Numba can be used in 2 ways with pandas:
+
+#. Specify the ``engine="numba"`` keyword in select pandas methods
+#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function
+
+pandas Numba Engine
+~~~~~~~~~~~~~~~~~~~
+
+If Numba is installed, one can specify ``engine="numba"`` in select pandas methods to execute the method using Numba.
+Methods that support ``engine="numba"`` will also have an ``engine_kwargs`` keyword that accepts a dictionary that allows one to specify
+``"nogil"``, ``"nopython"`` and ``"parallel"`` keys with boolean values to pass into the ``@jit`` decorator.
+If ``engine_kwargs`` is not specified, it defaults to ``{"nogil": False, "nopython": True, "parallel": False}``.
+
+In terms of performance, **the first time a function is run using the Numba engine it will be slow**
+as Numba has some function compilation overhead. However, the JIT compiled functions are cached,
+and subsequent calls will be fast. In general, the Numba engine is performant with
+a large number of data points (e.g. 1+ million).
+
+.. code-block:: ipython
- As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
+ In [1]: data = pd.Series(range(1_000_000)) # noqa: E225
-Jit
-~~~
+ In [2]: roll = data.rolling(10)
-We demonstrate how to use Numba to just-in-time compile our code. We simply
-take the plain Python code from above and annotate with the ``@jit`` decorator.
+ In [3]: def f(x):
+ ...: return np.sum(x) + 5
+ # Run the first time, compilation time will affect performance
+ In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
+ 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
+ # Function is cached and performance will improve
+ In [5]: %timeit roll.apply(f, engine='numba', raw=True)
+ 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+ In [6]: %timeit roll.apply(f, engine='cython', raw=True)
+ 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+If your compute hardware contains multiple CPUs, the largest performance gain can be realized by setting ``parallel`` to ``True``
+to leverage more than 1 CPU. Internally, pandas leverages Numba to parallelize computations over the columns of a :class:`DataFrame`;
+therefore, this benefit will only be realized on a :class:`DataFrame` with a large number of columns.
+
+.. code-block:: ipython
+
+ In [1]: import numba
+
+ In [2]: numba.set_num_threads(1)
+
+ In [3]: df = pd.DataFrame(np.random.randn(10_000, 100))
+
+ In [4]: roll = df.rolling(100)
+
+ In [5]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
+ 347 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+ In [6]: numba.set_num_threads(2)
+
+ In [7]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
+ 201 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+Custom Function Examples
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A custom Python function decorated with ``@jit`` can be used with pandas objects by passing their NumPy array
+representations with ``to_numpy()``.
.. code-block:: python
@@ -360,8 +413,6 @@ take the plain Python code from above and annotate with the ``@jit`` decorator.
)
return pd.Series(result, index=df.index, name="result")
-Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
-nicer interface by passing/returning pandas objects.
.. code-block:: ipython
@@ -370,19 +421,9 @@ nicer interface by passing/returning pandas objects.
In this example, using Numba was faster than Cython.
-Numba as an argument
-~~~~~~~~~~~~~~~~~~~~
-
-Additionally, we can leverage the power of `Numba `__
-by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
-` for an extensive example.
-
-Vectorize
-~~~~~~~~~
-
Numba can also be used to write vectorized functions that do not require the user to explicitly
loop over the observations of a vector; a vectorized function will be applied to each row automatically.
-Consider the following toy example of doubling each observation:
+Consider the following example of doubling each observation:
.. code-block:: python
@@ -414,25 +455,23 @@ Consider the following toy example of doubling each observation:
Caveats
~~~~~~~
-.. note::
-
- Numba will execute on any function, but can only accelerate certain classes of functions.
-
Numba is best at accelerating functions that apply numerical functions to NumPy
-arrays. When passed a function that only uses operations it knows how to
-accelerate, it will execute in ``nopython`` mode.
-
-If Numba is passed a function that includes something it doesn't know how to
-work with -- a category that currently includes sets, lists, dictionaries, or
-string functions -- it will revert to ``object mode``. In ``object mode``,
-Numba will execute but your code will not speed up significantly. If you would
+arrays. If you try to ``@jit`` a function that contains unsupported `Python `__
+or `NumPy `__
+code, compilation will revert to `object mode `__, which
+will most likely not speed up your function. If you would
prefer that Numba throw an error if it cannot compile a function in a way that
speeds up your code, pass Numba the argument
-``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
+``nopython=True`` (e.g. ``@jit(nopython=True)``). For more on
troubleshooting Numba modes, see the `Numba troubleshooting page
`__.
-Read more in the `Numba docs `__.
+Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe
+behavior. You can first `specify a safe threading layer `__
+before running a JIT function with ``parallel=True``.
+
+Generally, if you encounter a segfault (``SIGSEGV``) while using Numba, please report the issue
+to the `Numba issue tracker `__.
.. _enhancingperf.eval:
@@ -574,8 +613,8 @@ Now let's do the same thing but with comparisons:
of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
operations in plain Python.
-The ``DataFrame.eval`` method
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The :meth:`DataFrame.eval` method
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In addition to the top level :func:`pandas.eval` function you can also
evaluate an expression in the "context" of a :class:`~pandas.DataFrame`.
@@ -609,7 +648,7 @@ new column name or an existing column name, and it must be a valid Python
identifier.
The ``inplace`` keyword determines whether this assignment will be performed
-on the original ``DataFrame`` or return a copy with the new column.
+on the original :class:`DataFrame` or return a copy with the new column.
.. ipython:: python
@@ -619,7 +658,7 @@ on the original ``DataFrame`` or return a copy with the new column.
df.eval("a = 1", inplace=True)
df
-When ``inplace`` is set to ``False``, the default, a copy of the ``DataFrame`` with the
+When ``inplace`` is set to ``False``, the default, a copy of the :class:`DataFrame` with the
new or modified columns is returned and the original frame is unchanged.
.. ipython:: python
@@ -651,7 +690,7 @@ The equivalent in standard Python would be
df["a"] = 1
df
-The ``query`` method has a ``inplace`` keyword which determines
+The :meth:`DataFrame.query` method has an ``inplace`` keyword which determines
whether the query modifies the original frame.
.. ipython:: python
@@ -793,7 +832,7 @@ computation. The two lines are two different engines.
.. image:: ../_static/eval-perf-small.png
-This plot was created using a ``DataFrame`` with 3 columns each containing
+This plot was created using a :class:`DataFrame` with 3 columns each containing
floating point values generated using ``numpy.random.randn()``.
Technical minutia regarding expression evaluation
diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst
index 1de978b195382..adb40e166eab4 100644
--- a/doc/source/user_guide/gotchas.rst
+++ b/doc/source/user_guide/gotchas.rst
@@ -10,13 +10,13 @@ Frequently Asked Questions (FAQ)
DataFrame memory usage
----------------------
-The memory usage of a ``DataFrame`` (including the index) is shown when calling
+The memory usage of a :class:`DataFrame` (including the index) is shown when calling
the :meth:`~DataFrame.info`. A configuration option, ``display.memory_usage``
(see :ref:`the list of options `), specifies if the
-``DataFrame``'s memory usage will be displayed when invoking the ``df.info()``
+:class:`DataFrame` memory usage will be displayed when invoking the ``df.info()``
method.
-For example, the memory usage of the ``DataFrame`` below is shown
+For example, the memory usage of the :class:`DataFrame` below is shown
when calling :meth:`~DataFrame.info`:
.. ipython:: python
@@ -53,9 +53,9 @@ By default the display option is set to ``True`` but can be explicitly
overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
The memory usage of each column can be found by calling the
-:meth:`~DataFrame.memory_usage` method. This returns a ``Series`` with an index
+:meth:`~DataFrame.memory_usage` method. This returns a :class:`Series` with an index
represented by column names and memory usage of each column shown in bytes. For
-the ``DataFrame`` above, the memory usage of each column and the total memory
+the :class:`DataFrame` above, the memory usage of each column and the total memory
usage can be found with the ``memory_usage`` method:
.. ipython:: python
@@ -65,8 +65,8 @@ usage can be found with the ``memory_usage`` method:
# total memory usage of dataframe
df.memory_usage().sum()
-By default the memory usage of the ``DataFrame``'s index is shown in the
-returned ``Series``, the memory usage of the index can be suppressed by passing
+By default the memory usage of the :class:`DataFrame` index is shown in the
+returned :class:`Series`, the memory usage of the index can be suppressed by passing
the ``index=False`` argument:
.. ipython:: python
@@ -75,7 +75,7 @@ the ``index=False`` argument:
The memory usage displayed by the :meth:`~DataFrame.info` method utilizes the
:meth:`~DataFrame.memory_usage` method to determine the memory usage of a
-``DataFrame`` while also formatting the output in human-readable units (base-2
+:class:`DataFrame` while also formatting the output in human-readable units (base-2
representation; i.e. 1KB = 1024 bytes).
See also :ref:`Categorical Memory Usage `.
@@ -98,32 +98,28 @@ of the following code should be:
Should it be ``True`` because it's not zero-length, or ``False`` because there
are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:
-.. code-block:: python
+.. ipython:: python
+ :okexcept:
- >>> if pd.Series([False, True, False]):
- ... print("I was true")
- Traceback
- ...
- ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+ if pd.Series([False, True, False]):
+ print("I was true")
-You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
+You need to explicitly choose what you want to do with the :class:`DataFrame`, e.g.
use :meth:`~DataFrame.any`, :meth:`~DataFrame.all` or :meth:`~DataFrame.empty`.
Alternatively, you might want to compare if the pandas object is ``None``:
-.. code-block:: python
+.. ipython:: python
- >>> if pd.Series([False, True, False]) is not None:
- ... print("I was not None")
- I was not None
+ if pd.Series([False, True, False]) is not None:
+ print("I was not None")
Below is how to check if any of the values are ``True``:
-.. code-block:: python
+.. ipython:: python
- >>> if pd.Series([False, True, False]).any():
- ... print("I am any")
- I am any
+ if pd.Series([False, True, False]).any():
+ print("I am any")
To evaluate single-element pandas objects in a boolean context, use the method
:meth:`~DataFrame.bool`:
@@ -138,27 +134,21 @@ To evaluate single-element pandas objects in a boolean context, use the method
Bitwise boolean
~~~~~~~~~~~~~~~
-Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
-which is almost always what you want anyways.
+Bitwise boolean operators like ``==`` and ``!=`` return a boolean :class:`Series`
+which performs an element-wise comparison when compared to a scalar.
-.. code-block:: python
+.. ipython:: python
- >>> s = pd.Series(range(5))
- >>> s == 4
- 0 False
- 1 False
- 2 False
- 3 False
- 4 True
- dtype: bool
+ s = pd.Series(range(5))
+ s == 4
See :ref:`boolean comparisons` for more examples.
Using the ``in`` operator
~~~~~~~~~~~~~~~~~~~~~~~~~
-Using the Python ``in`` operator on a ``Series`` tests for membership in the
-index, not membership among the values.
+Using the Python ``in`` operator on a :class:`Series` tests for membership in the
+**index**, not membership among the values.
.. ipython:: python
@@ -167,7 +157,7 @@ index, not membership among the values.
'b' in s
If this behavior is surprising, keep in mind that using ``in`` on a Python
-dictionary tests keys, not values, and ``Series`` are dict-like.
+dictionary tests keys, not values, and :class:`Series` are dict-like.
To test for membership in the values, use the method :meth:`~pandas.Series.isin`:
.. ipython:: python
@@ -175,7 +165,7 @@ To test for membership in the values, use the method :meth:`~pandas.Series.isin`
s.isin([2])
s.isin([2]).any()
-For ``DataFrames``, likewise, ``in`` applies to the column axis,
+For :class:`DataFrame`, likewise, ``in`` applies to the column axis,
testing for membership in the list of column names.
.. _gotchas.udf-mutation:
@@ -206,8 +196,8 @@ causing unexpected behavior. Consider the example:
One probably would have expected that the result would be ``[1, 3, 5]``.
When using a pandas method that takes a UDF, internally pandas is often
iterating over the
-``DataFrame`` or other pandas object. Therefore, if the UDF mutates (changes)
-the ``DataFrame``, unexpected behavior can arise.
+:class:`DataFrame` or other pandas object. Therefore, if the UDF mutates (changes)
+the :class:`DataFrame`, unexpected behavior can arise.
Here is a similar example with :meth:`DataFrame.apply`:
@@ -267,7 +257,7 @@ For many reasons we chose the latter. After years of production use it has
proven, at least in my opinion, to be the best decision given the state of
affairs in NumPy and Python in general. The special value ``NaN``
(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
-functions ``isna`` and ``notna`` which can be used across the dtypes to
+functions :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
detect NA values.
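A small sketch of those two functions on made-up data:

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([1.0, np.nan, 3.0])

   s.isna()      # True where the value is missing
   s.notna()     # element-wise complement of isna()
   s[s.notna()]  # keep only the non-missing entries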
However, it comes with it a couple of trade-offs which I most certainly have
@@ -293,7 +283,7 @@ arrays. For example:
s2.dtype
This trade-off is made largely for memory and performance reasons, and also so
-that the resulting ``Series`` continues to be "numeric".
+that the resulting :class:`Series` continues to be "numeric".
If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas
@@ -318,7 +308,7 @@ See :ref:`integer_na` for more.
``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
-When introducing NAs into an existing ``Series`` or ``DataFrame`` via
+When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
:meth:`~Series.reindex` or some other means, boolean and integer types will be
promoted to a different dtype in order to store the NAs. The promotions are
summarized in this table:
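As a quick illustration of the integer case from that table (a sketch with arbitrary data):

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, 3])        # dtype: int64
   s2 = s.reindex([0, 1, 2, 3])    # label 3 has no value, so NaN is introduced
   s2.dtype                        # float64 -- promoted to be able to hold the NaN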
@@ -341,7 +331,7 @@ Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language `R
-`__. Part of the reason is the NumPy type hierarchy:
+`__. Part of the reason is the NumPy type hierarchy:
.. csv-table::
:header: "Typeclass","Dtypes"
@@ -376,18 +366,19 @@ integer arrays to floating when NAs must be introduced.
Differences with NumPy
----------------------
-For ``Series`` and ``DataFrame`` objects, :meth:`~DataFrame.var` normalizes by
-``N-1`` to produce unbiased estimates of the sample variance, while NumPy's
-``var`` normalizes by N, which measures the variance of the sample. Note that
+For :class:`Series` and :class:`DataFrame` objects, :meth:`~DataFrame.var` normalizes by
+``N-1`` to produce `unbiased estimates of the population variance `__, while NumPy's
+:meth:`numpy.var` normalizes by N, which measures the variance of the sample. Note that
:meth:`~DataFrame.cov` normalizes by ``N-1`` in both pandas and NumPy.
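A small numeric sketch of the difference (values chosen arbitrarily):

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([1.0, 2.0, 3.0, 4.0])

   s.var()                       # 1.666... -> divides by N - 1
   np.var(s.to_numpy())          # 1.25     -> divides by N
   np.var(s.to_numpy(), ddof=1)  # 1.666... -> matches pandas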
+.. _gotchas.thread-safety:
Thread-safety
-------------
-As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
+pandas is not 100% thread safe. The known issues relate to
the :meth:`~DataFrame.copy` method. If you are doing a lot of copying of
-``DataFrame`` objects shared among threads, we recommend holding locks inside
+:class:`DataFrame` objects shared among threads, we recommend holding locks inside
the threads where the data copying occurs.
See `this link `__
@@ -406,7 +397,7 @@ symptom of this issue is an error like::
To deal
with this issue you should convert the underlying NumPy array to the native
-system byte order *before* passing it to ``Series`` or ``DataFrame``
+system byte order *before* passing it to :class:`Series` or :class:`DataFrame`
constructors using something similar to the following:
.. ipython:: python
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index 870ec6763c72f..5d8ef7ce02097 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -391,7 +391,6 @@ something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:
.. ipython:: python
- :suppress:
df = pd.DataFrame(
{
@@ -402,7 +401,7 @@ getting a column from a DataFrame, you can do:
}
)
-.. ipython:: python
+ df
grouped = df.groupby(["A"])
grouped_C = grouped["C"]
@@ -478,7 +477,7 @@ An obvious one is aggregation via the
.. ipython:: python
grouped = df.groupby("A")
- grouped.aggregate(np.sum)
+ grouped[["C", "D"]].aggregate(np.sum)
grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
@@ -493,7 +492,7 @@ changed by using the ``as_index`` option:
grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)
- df.groupby("A", as_index=False).sum()
+ df.groupby("A", as_index=False)[["C", "D"]].sum()
Note that you could use the ``reset_index`` DataFrame function to achieve the
same result as the column names are stored in the resulting ``MultiIndex``:
@@ -540,19 +539,19 @@ Some common aggregating functions are tabulated below:
:widths: 20, 80
:delim: ;
- :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes
- :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group
- :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups
- :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics
- :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
- :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
- :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes
+ :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group
+ :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups
+ :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics
+ :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
+ :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
+ :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
The aggregating functions above will exclude NA values. Any function which
@@ -579,7 +578,7 @@ column, which produces an aggregated result with a hierarchical index:
.. ipython:: python
- grouped.agg([np.sum, np.mean, np.std])
+ grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
The resulting aggregations are named for the functions themselves. If you
@@ -598,7 +597,7 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
.. ipython:: python
(
- grouped.agg([np.sum, np.mean, np.std]).rename(
+ grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename(
columns={"sum": "foo", "mean": "bar", "std": "baz"}
)
)
@@ -731,7 +730,7 @@ optimized Cython implementations:
.. ipython:: python
- df.groupby("A").sum()
+ df.groupby("A")[["C", "D"]].sum()
df.groupby(["A", "B"]).mean()
Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above
@@ -762,7 +761,7 @@ different dtypes, then a common dtype will be determined in the same way as ``Da
Transformation
--------------
-The ``transform`` method returns an object that is indexed the same (same size)
+The ``transform`` method returns an object that is indexed the same
as the one being grouped. The transform function must:
* Return a result that is either the same size as the group chunk or
@@ -777,6 +776,14 @@ as the one being grouped. The transform function must:
* (Optionally) operates on the entire group chunk. If this is supported, a
fast path is used starting from the *second* chunk.
+.. deprecated:: 1.5.0
+
+ When using ``.transform`` on a grouped DataFrame and the transformation function
+ returns a DataFrame, currently pandas does not align the result's index
+ with the input's index. This behavior is deprecated and alignment will
+ be performed in a future version of pandas. You can apply ``.to_numpy()`` to the
+ result of the transformation function to avoid alignment.
+
Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
transformation function. If the results from different groups have different dtypes, then
a common dtype will be determined in the same way as ``DataFrame`` construction.
@@ -832,10 +839,10 @@ Alternatively, the built-in methods could be used to produce the same outputs.
.. ipython:: python
- max = ts.groupby(lambda x: x.year).transform("max")
- min = ts.groupby(lambda x: x.year).transform("min")
+ max_ts = ts.groupby(lambda x: x.year).transform("max")
+ min_ts = ts.groupby(lambda x: x.year).transform("min")
- max - min
+ max_ts - min_ts
Another common data transform is to replace missing data with the group mean.
@@ -1053,7 +1060,14 @@ Some operations on the grouped data might not fit into either the aggregate or
transform categories. Or, you may simply want GroupBy to infer how to combine
the results. For these, use the ``apply`` function, which can be substituted
for both ``aggregate`` and ``transform`` in many standard use cases. However,
-``apply`` can handle some exceptional use cases, for example:
+``apply`` can handle some exceptional use cases.
+
+.. note::
+
+ ``apply`` can act as a reducer, transformer, *or* filter function, depending
+ on exactly what is passed to it and on what you are grouping. Thus the grouped
+ column(s) may be included in the output and may also set the indices.
.. ipython:: python
@@ -1065,16 +1079,14 @@ for both ``aggregate`` and ``transform`` in many standard use cases. However,
The dimension of the returned result can also change:
-.. ipython::
-
- In [8]: grouped = df.groupby('A')['C']
+.. ipython:: python
- In [10]: def f(group):
- ....: return pd.DataFrame({'original': group,
- ....: 'demeaned': group - group.mean()})
- ....:
+ grouped = df.groupby('A')['C']
- In [11]: grouped.apply(f)
+ def f(group):
+ return pd.DataFrame({'original': group,
+ 'demeaned': group - group.mean()})
+ grouped.apply(f)
``apply`` on a Series can operate on a returned value from the applied function,
that is itself a series, and possibly upcast the result to a DataFrame:
@@ -1089,11 +1101,33 @@ that is itself a series, and possibly upcast the result to a DataFrame:
s
s.apply(f)
+Control grouped column(s) placement with ``group_keys``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
.. note::
- ``apply`` can act as a reducer, transformer, *or* filter function, depending on exactly what is passed to it.
- So depending on the path taken, and exactly what you are grouping. Thus the grouped columns(s) may be included in
- the output as well as set the indices.
+ If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`,
+ functions passed to ``apply`` that return like-indexed outputs will have the
+ group keys added to the result index. Previous versions of pandas would add
+ the group keys only when the result from the applied function had a different
+ index than the input. If ``group_keys`` is not specified, the group keys will
+ not be added for like-indexed outputs. In the future this behavior
+ will change to always respect ``group_keys``, which defaults to ``True``.
+
+ .. versionchanged:: 1.5.0
+
+To control whether the grouped column(s) are included in the indices, you can use
+the argument ``group_keys``. Compare
+
+.. ipython:: python
+
+ df.groupby("A", group_keys=True).apply(lambda x: x)
+
+with
+
+.. ipython:: python
+
+ df.groupby("A", group_keys=False).apply(lambda x: x)
Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
apply function. If the results from different groups have different dtypes, then
@@ -1106,11 +1140,9 @@ Numba Accelerated Routines
.. versionadded:: 1.1
If `Numba `__ is installed as an optional dependency, the ``transform`` and
-``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. The ``engine_kwargs``
-argument is a dictionary of keyword arguments that will be passed into the
-`numba.jit decorator `__.
-These keyword arguments will be applied to the passed function. Currently only ``nogil``, ``nopython``,
-and ``parallel`` are supported, and their default values are set to ``False``, ``True`` and ``False`` respectively.
+``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments.
+See :ref:`enhancing performance with Numba ` for general usage of the arguments
+and performance considerations.
The function signature must start with ``values, index`` **exactly** as the data belonging to each group
will be passed into ``values``, and the group index will be passed into ``index``.
@@ -1121,52 +1153,6 @@ will be passed into ``values``, and the group index will be passed into ``index`
data and group index will be passed as NumPy arrays to the JITed user defined function, and no
alternative execution attempts will be tried.
-.. note::
-
- In terms of performance, **the first time a function is run using the Numba engine will be slow**
- as Numba will have some function compilation overhead. However, the compiled functions are cached,
- and subsequent calls will be fast. In general, the Numba engine is performant with
- a larger amount of data points (e.g. 1+ million).
-
-.. code-block:: ipython
-
- In [1]: N = 10 ** 3
-
- In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N}
-
- In [3]: df = pd.DataFrame(data, columns=[0, 1])
-
- In [4]: def f_numba(values, index):
- ...: total = 0
- ...: for i, value in enumerate(values):
- ...: if i % 2:
- ...: total += value + 5
- ...: else:
- ...: total += value * 2
- ...: return total
- ...:
-
- In [5]: def f_cython(values):
- ...: total = 0
- ...: for i, value in enumerate(values):
- ...: if i % 2:
- ...: total += value + 5
- ...: else:
- ...: total += value * 2
- ...: return total
- ...:
-
- In [6]: groupby = df.groupby(0)
- # Run the first time, compilation time will affect performance
- In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba') # noqa: E225
- 2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
- # Function is cached and performance will improve
- In [8]: %timeit groupby.aggregate(f_numba, engine='numba')
- 4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
- In [9]: %timeit groupby.aggregate(f_cython, engine='cython')
- 18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
Other useful features
---------------------
@@ -1181,13 +1167,12 @@ Again consider the example DataFrame we've been looking at:
Suppose we wish to compute the standard deviation grouped by the ``A``
column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. If the passed
-aggregation function can't be applied to some columns, the troublesome columns
-will be (silently) dropped. Thus, this does not pose any problems:
+column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
+columns by specifying ``numeric_only=True``:
.. ipython:: python
- df.groupby("A").std()
+ df.groupby("A").std(numeric_only=True)
Note that ``df.groupby('A').colname.std().`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
@@ -1202,7 +1187,14 @@ is only interesting over one column (here ``colname``), it may be filtered
If you do wish to include decimal or object columns in an aggregation with
other non-nuisance data types, you must do so explicitly.
+.. warning::
+ The automatic dropping of nuisance columns has been deprecated and will be removed
+ in a future version of pandas. If columns are included that cannot be operated
+ on, pandas will instead raise an error. In order to avoid this, either select
+ the columns you wish to operate on or specify ``numeric_only=True``.
+
.. ipython:: python
+ :okwarning:
from decimal import Decimal
@@ -1326,7 +1318,7 @@ Groupby a specific column with the desired frequency. This is like resampling.
.. ipython:: python
- df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()
You have an ambiguous specification in that you have a named index and a column
that could be potential groupers.
@@ -1335,9 +1327,9 @@ that could be potential groupers.
df = df.set_index("Date")
df["Date"] = df.index + pd.offsets.MonthEnd(2)
- df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()
- df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"]).sum()
+ df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()
Taking the first rows of each group
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index 6b6e212cde635..a6392706eb7a3 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -17,6 +17,43 @@ For a high level summary of the pandas fundamentals, see :ref:`dsintro` and :ref
Further information on any specific method can be obtained in the
:ref:`api`.
+How to read these guides
+------------------------
+In these guides you will see input code inside code blocks such as:
+
+::
+
+ import pandas as pd
+ pd.DataFrame({'A': [1, 2, 3]})
+
+
+or:
+
+.. ipython:: python
+
+ import pandas as pd
+ pd.DataFrame({'A': [1, 2, 3]})
+
+The first block is a standard Python input, while in the second the ``In [1]:`` indicates the input is inside a `notebook `__. In Jupyter Notebooks the last line is printed and plots are shown inline.
+
+For example:
+
+.. ipython:: python
+
+ a = 1
+ a
+
+is equivalent to:
+
+::
+
+ a = 1
+ print(a)
+
+
+
+Guides
+-------
+
.. If you update this toctree, also update the manual toctree in the
main index.rst.template
@@ -39,7 +76,6 @@ Further information on any specific method can be obtained in the
boolean
visualization
style
- computation
groupby
window
timeseries
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index dc66303a44f53..f939945fc6cda 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -89,7 +89,7 @@ Getting values from an object with multi-axes selection uses the following
notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as
well). Any of the axes accessors may be the null slice ``:``. Axes left out of
the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to
-``p.loc['a', :, :]``.
+``p.loc['a', :]``.
.. csv-table::
:header: "Object Type", "Indexers"
@@ -583,7 +583,7 @@ without using a temporary variable.
.. ipython:: python
bb = pd.read_csv('data/baseball.csv', index_col='id')
- (bb.groupby(['year', 'team']).sum()
+ (bb.groupby(['year', 'team']).sum(numeric_only=True)
.loc[lambda df: df['r'] > 100])
@@ -701,7 +701,7 @@ Having a duplicated index will raise for a ``.reindex()``:
.. code-block:: ipython
In [17]: s.reindex(labels)
- ValueError: cannot reindex from a duplicate axis
+ ValueError: cannot reindex on an axis with duplicate labels
Generally, you can intersect the desired labels with the current
axis, and then reindex.
@@ -717,7 +717,7 @@ However, this would *still* raise if your resulting index is duplicated.
In [41]: labels = ['a', 'd']
In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
- ValueError: cannot reindex from a duplicate axis
+ ValueError: cannot reindex on an axis with duplicate labels
.. _indexing.basics.partial_setting:
@@ -997,6 +997,15 @@ a list of items you want to check for.
df.isin(values)
+To return the DataFrame of booleans where the values are *not* in the original DataFrame,
+use the ``~`` operator:
+
+.. ipython:: python
+
+ values = {'ids': ['a', 'b'], 'vals': [1, 3]}
+
+ ~df.isin(values)
+
Combine DataFrame's ``isin`` with the ``any()`` and ``all()`` methods to
quickly select subsets of your data that meet a given criteria.
To select a row where each column meets its own criterion:
@@ -1523,8 +1532,8 @@ Looking up values by index/column labels
----------------------------------------
Sometimes you want to extract a set of values given a sequence of row labels
-and column labels, this can be achieved by ``DataFrame.melt`` combined by filtering the corresponding
-rows with ``DataFrame.loc``. For instance:
+and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
+For instance:
.. ipython:: python
@@ -1532,9 +1541,8 @@ rows with ``DataFrame.loc``. For instance:
'A': [80, 23, np.nan, 22],
'B': [80, 55, 76, 67]})
df
- melt = df.melt('col')
- melt = melt.loc[melt['col'] == melt['variable'], 'value']
- melt.reset_index(drop=True)
+ idx, cols = pd.factorize(df['col'])
+ df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Formerly this could be achieved with the dedicated ``DataFrame.lookup`` method
which was deprecated in version 1.2.0.
@@ -1877,7 +1885,7 @@ chained indexing expression, you can set the :ref:`option `
``mode.chained_assignment`` to one of these values:
* ``'warn'``, the default, means a ``SettingWithCopyWarning`` is printed.
-* ``'raise'`` means pandas will raise a ``SettingWithCopyException``
+* ``'raise'`` means pandas will raise a ``SettingWithCopyError``
you have to deal with.
* ``None`` will suppress the warnings entirely.
@@ -1945,7 +1953,7 @@ Last, the subsequent example will **not** work at all, and so should be avoided:
>>> dfd.loc[0]['a'] = 1111
Traceback (most recent call last)
...
- SettingWithCopyException:
+ SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
diff --git a/doc/source/user_guide/integer_na.rst b/doc/source/user_guide/integer_na.rst
index 2ce8bf23de824..fe732daccb649 100644
--- a/doc/source/user_guide/integer_na.rst
+++ b/doc/source/user_guide/integer_na.rst
@@ -29,7 +29,7 @@ Construction
------------
pandas can represent integer data with possibly missing values using
-:class:`arrays.IntegerArray`. This is an :ref:`extension types `
+:class:`arrays.IntegerArray`. This is an :ref:`extension type `
implemented within pandas.
.. ipython:: python
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index c2b030d732ba9..7a7e518e1f7db 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -26,11 +26,11 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
text;`XML `__;:ref:`read_xml`;:ref:`to_xml`
text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard`
binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel`
- binary;`OpenDocument `__;:ref:`read_excel`;
+ binary;`OpenDocument `__;:ref:`read_excel`;
binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf`
binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather`
binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet`
- binary;`ORC Format `__;:ref:`read_orc`;
+ binary;`ORC Format `__;:ref:`read_orc`;:ref:`to_orc`
binary;`Stata `__;:ref:`read_stata`;:ref:`to_stata`
binary;`SAS `__;:ref:`read_sas`;
binary;`SPSS `__;:ref:`read_spss`;
@@ -102,25 +102,34 @@ header : int or list of ints, default ``'infer'``
names : array-like, default ``None``
List of column names to use. If file contains no header row, then you should
explicitly pass ``header=None``. Duplicates in this list are not allowed.
-index_col : int, str, sequence of int / str, or False, default ``None``
+index_col : int, str, sequence of int / str, or False, optional, default ``None``
Column(s) to use as the row labels of the ``DataFrame``, either given as
string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
- Note: ``index_col=False`` can be used to force pandas to *not* use the first
- column as the index, e.g. when you have a malformed file with delimiters at
- the end of each line.
+ .. note::
+ ``index_col=False`` can be used to force pandas to *not* use the first
+ column as the index, e.g. when you have a malformed file with delimiters at
+ the end of each line.
The default value of ``None`` instructs pandas to guess. If the number of
fields in the column header row is equal to the number of fields in the body
of the data file, then a default index is used. If it is larger, then
the first columns are used as index so that the remaining number of fields in
the body are equal to the number of fields in the header.
+
+ The first row after the header is used to determine the number of columns,
+ which will go into the index. If the subsequent rows contain fewer columns
+ than the first row, they are filled with ``NaN``.
+
+ This can be avoided through ``usecols``, which ensures that the listed columns are
+ taken as is and the trailing data are ignored.
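   A short sketch of the trailing-delimiter case described above (the data string is
   purely illustrative):

   .. code-block:: python

      from io import StringIO
      import pandas as pd

      data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

      # The trailing comma makes the body one field wider than the header,
      # so the first column is pulled into the index by default.
      pd.read_csv(StringIO(data))

      # index_col=False keeps a default RangeIndex instead.
      pd.read_csv(StringIO(data), index_col=False)

      # usecols takes the listed columns as is and ignores the trailing data.
      pd.read_csv(StringIO(data), usecols=["a", "b", "c"])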
usecols : list-like or callable, default ``None``
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in ``names`` or
- inferred from the document header row(s). For example, a valid list-like
+ inferred from the document header row(s). If ``names`` are given, the document
+ header row(s) are not taken into account. For example, a valid list-like
``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To
@@ -142,27 +151,61 @@ usecols : list-like or callable, default ``None``
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
- Using this parameter results in much faster parsing time and lower memory usage.
+ Using this parameter results in much faster parsing time and lower memory usage
+ when using the C engine. The Python engine loads the data first before deciding
+ which columns to drop.
squeeze : boolean, default ``False``
If the parsed data only contains one column then return a ``Series``.
+
+ .. deprecated:: 1.4.0
+ Append ``.squeeze("columns")`` to the call to ``{func_name}`` to squeeze
+ the data.
prefix : str, default ``None``
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
+
+ .. deprecated:: 1.4.0
+ Use a list comprehension on the DataFrame's columns after calling ``read_csv``.
+
+ .. ipython:: python
+
+ data = "col1,col2,col3\na,b,1"
+
+ df = pd.read_csv(StringIO(data))
+ df.columns = [f"pre_{col}" for col in df.columns]
+ df
+
mangle_dupe_cols : boolean, default ``True``
Duplicate columns will be specified as 'X', 'X.1'...'X.N', rather than 'X'...'X'.
Passing in ``False`` will cause data to be overwritten if there are duplicate
names in the columns.
+ .. deprecated:: 1.5.0
+ Support for ``mangle_dupe_cols=False`` was never implemented, and a new argument where the
+ renaming pattern can be specified will be added instead.
+
General parsing configuration
+++++++++++++++++++++++++++++
dtype : Type name or dict of column -> type, default ``None``
- Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
- (unsupported with ``engine='python'``). Use ``str`` or ``object`` together
- with suitable ``na_values`` settings to preserve and
- not interpret dtype.
-engine : {``'c'``, ``'python'``}
- Parser engine to use. The C engine is faster while the Python engine is
- currently more feature-complete.
+ Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}``
+ Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve
+ and not interpret dtype. If converters are specified, they will be applied INSTEAD
+ of dtype conversion.
+
+ .. versionadded:: 1.5.0
+
+ Support for defaultdict was added. Specify a defaultdict as input where
+ the default determines the dtype of the columns which are not explicitly
+ listed.
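   A sketch of the defaultdict form (the column names are invented for the example):

   .. code-block:: python

      from collections import defaultdict
      from io import StringIO
      import pandas as pd

      data = "a,b,c\n1,2,3"
      # "a" is read as int64; every column not listed falls back to float64
      dtypes = defaultdict(lambda: "float64", a="int64")
      pd.read_csv(StringIO(data), dtype=dtypes).dtypes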
+engine : {``'c'``, ``'python'``, ``'pyarrow'``}
+ Parser engine to use. The C and pyarrow engines are faster, while the python engine
+ is currently more feature-complete. Multithreading is currently only supported by
+ the pyarrow engine.
+
+ .. versionadded:: 1.4.0
+
+ The "pyarrow" engine was added as an *experimental* engine, and some features
+ are unsupported, or may not work correctly, with this engine.
converters : dict, default ``None``
Dict of functions for converting values in certain columns. Keys can either be
integers or column labels.
@@ -246,7 +289,9 @@ parse_dates : boolean or list of ints or names or list of lists or dict, default
* If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
column.
* If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
- A fast-path exists for iso8601-formatted dates.
+
+ .. note::
+ A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default ``False``
If ``True`` and parse_dates is enabled for a column, attempt to infer the
datetime format to speed up the processing.
@@ -284,14 +329,14 @@ chunksize : int, default ``None``
Quoting, compression, and file format
+++++++++++++++++++++++++++++++++++++
-compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``, ``dict``}, default ``'infer'``
+compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'``
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
- bz2, zip, or xz if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
- '.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip',
+ bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
+ '.zip', '.xz', or '.zst', respectively, and no decompression otherwise. If using 'zip',
the ZIP file must contain only one data file to be read in.
Set to ``None`` for no decompression. Can also be a dict with key ``'method'``
- set to one of {``'zip'``, ``'gzip'``, ``'bz2'``} and other key-value pairs are
- forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, or ``bz2.BZ2File``.
+ set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are
+ forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``.
As an example, the following could be passed for faster compression and to
create a reproducible gzip archive:
``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``.
@@ -342,7 +387,7 @@ dialect : str or :class:`python:csv.Dialect` instance, default ``None``
Error handling
++++++++++++++
-error_bad_lines : boolean, default ``None``
+error_bad_lines : boolean, optional, default ``None``
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no ``DataFrame`` will be
returned. If ``False``, then these "bad lines" will dropped from the
@@ -352,7 +397,7 @@ error_bad_lines : boolean, default ``None``
.. deprecated:: 1.3.0
The ``on_bad_lines`` parameter should be used instead to specify behavior upon
encountering a bad line instead.
-warn_bad_lines : boolean, default ``None``
+warn_bad_lines : boolean, optional, default ``None``
If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
each "bad line" will be output.
@@ -517,7 +562,8 @@ This matches the behavior of :meth:`Categorical.set_categories`.
df = pd.read_csv(StringIO(data), dtype="category")
df.dtypes
df["col3"]
- df["col3"].cat.categories = pd.to_numeric(df["col3"].cat.categories)
+ new_categories = pd.to_numeric(df["col3"].cat.categories)
+ df["col3"] = df["col3"].cat.rename_categories(new_categories)
df["col3"]
@@ -569,6 +615,10 @@ If the header is in a row other than the first, pass the row number to
Duplicate names parsing
'''''''''''''''''''''''
+ .. deprecated:: 1.5.0
+ Support for ``mangle_dupe_cols=False`` was never implemented, and a new argument where the
+ renaming pattern can be specified will be added instead.
+
If the file or header contains duplicate names, pandas will by default
distinguish between them so as to prevent overwriting data:
@@ -579,27 +629,7 @@ distinguish between them so as to prevent overwriting data:
There is no more duplicate data because ``mangle_dupe_cols=True`` by default,
which modifies a series of duplicate columns 'X', ..., 'X' to become
-'X', 'X.1', ..., 'X.N'. If ``mangle_dupe_cols=False``, duplicate data can
-arise:
-
-.. code-block:: ipython
-
- In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
- In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
- Out[3]:
- a b a
- 0 2 1 2
- 1 5 4 5
-
-To prevent users from encountering this problem with duplicate data, a ``ValueError``
-exception is raised if ``mangle_dupe_cols != True``:
-
-.. code-block:: ipython
-
- In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
- In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
- ...
- ValueError: Setting mangle_dupe_cols=False is not supported yet
+'X', 'X.1', ..., 'X.N'.
.. _io.usecols:
@@ -805,13 +835,9 @@ input text data into ``datetime`` objects.
The simplest case is to just pass in ``parse_dates=True``:
.. ipython:: python
- :suppress:
-
- f = open("foo.csv", "w")
- f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
- f.close()
-.. ipython:: python
+ with open("foo.csv", mode="w") as f:
+ f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
# Use a column as an index, and parse it as dates.
df = pd.read_csv("foo.csv", index_col=0, parse_dates=True)
@@ -830,7 +856,6 @@ order) and the new column names will be the concatenation of the component
column names:
.. ipython:: python
- :suppress:
data = (
"KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
@@ -844,9 +869,6 @@ column names:
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
-
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]])
df
@@ -1026,19 +1048,20 @@ While US date formats tend to be MM/DD/YYYY, many international formats use
DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided:
.. ipython:: python
- :suppress:
data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c"
+ print(data)
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
-
- print(open("tmp.csv").read())
-
pd.read_csv("tmp.csv", parse_dates=[0])
pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0])
+.. ipython:: python
+ :suppress:
+
+ os.remove("tmp.csv")
+
Writing CSVs to binary file objects
+++++++++++++++++++++++++++++++++++
@@ -1101,8 +1124,9 @@ For large numbers that have been written with a thousands separator, you can
set the ``thousands`` keyword to a string of length 1 so that integers will be parsed
correctly:
+By default, numbers with a thousands separator will be parsed as strings:
+
.. ipython:: python
- :suppress:
data = (
"ID|level|category\n"
@@ -1114,11 +1138,6 @@ correctly:
with open("tmp.csv", "w") as fh:
fh.write(data)
-By default, numbers with a thousands separator will be parsed as strings:
-
-.. ipython:: python
-
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", sep="|")
df
@@ -1128,7 +1147,6 @@ The ``thousands`` keyword allows integers to be parsed correctly:
.. ipython:: python
- print(open("tmp.csv").read())
df = pd.read_csv("tmp.csv", sep="|", thousands=",")
df
@@ -1202,16 +1220,18 @@ Returning Series
Using the ``squeeze`` keyword, the parser will return output with a single column
as a ``Series``:
+.. deprecated:: 1.4.0
+ Users should append ``.squeeze("columns")`` to the DataFrame returned by
+ ``read_csv`` instead.
+
.. ipython:: python
- :suppress:
+ :okwarning:
data = "level\nPatient1,123000\nPatient2,23000\nPatient3,1234018"
with open("tmp.csv", "w") as fh:
fh.write(data)
-.. ipython:: python
-
print(open("tmp.csv").read())
output = pd.read_csv("tmp.csv", squeeze=True)
@@ -1268,19 +1288,57 @@ You can elect to skip bad lines:
0 1 2 3
1 8 9 10
+Or pass a callable function to handle the bad line if ``engine="python"``.
+The bad line will be a list of strings that was split by the ``sep``:
+
+.. code-block:: ipython
+
+ In [29]: external_list = []
+
+ In [30]: def bad_lines_func(line):
+ ...: external_list.append(line)
+ ...: return line[-3:]
+
+ In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
+ Out[31]:
+ a b c
+ 0 1 2 3
+ 1 5 6 7
+ 2 8 9 10
+
+ In [32]: external_list
+ Out[32]: [['4', '5', '6', '7']]
+
+ .. versionadded:: 1.4.0
+
+
You can also use the ``usecols`` parameter to eliminate extraneous column
data that appear in some lines but not others:
.. code-block:: ipython
- In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+ In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
- Out[30]:
+ Out[33]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
+In case you want to keep all data including the lines with too many fields, you can
+specify a sufficient number of ``names``. This ensures that lines with not enough
+fields are filled with ``NaN``.
+
+.. code-block:: ipython
+
+ In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
+
+ Out[34]:
+ a b c d
+ 0 1 2 3 NaN
+ 1 4 5 6 7
+ 2 8 9 10 NaN
+
.. _io.dialect:
Dialect
@@ -1290,15 +1348,11 @@ The ``dialect`` keyword gives greater flexibility in specifying the file format.
By default it uses the Excel dialect but you can specify either the dialect name
or a :class:`python:csv.Dialect` instance.
-.. ipython:: python
- :suppress:
-
- data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
-
Suppose you had data with unenclosed quotes:
.. ipython:: python
+ data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
print(data)
By default, ``read_csv`` uses the Excel dialect and treats the double quote as
@@ -1374,10 +1428,10 @@ a different usage of the ``delimiter`` parameter:
Can be used to specify the filler character of the fields
if it is not spaces (e.g., '~').
+Consider a typical fixed-width data file:
+
.. ipython:: python
- :suppress:
- f = open("bar.csv", "w")
data1 = (
"id8141 360.242940 149.910199 11950.7\n"
"id1594 444.953632 166.985655 11788.4\n"
@@ -1385,14 +1439,8 @@ a different usage of the ``delimiter`` parameter:
"id1230 413.836124 184.375703 11916.8\n"
"id1948 502.953953 173.237159 12468.3"
)
- f.write(data1)
- f.close()
-
-Consider a typical fixed-width data file:
-
-.. ipython:: python
-
- print(open("bar.csv").read())
+ with open("bar.csv", "w") as f:
+ f.write(data1)
In order to parse this file into a ``DataFrame``, we simply need to supply the
column specifications to the ``read_fwf`` function along with the file name:
@@ -1448,19 +1496,15 @@ Indexes
Files with an "implicit" index column
+++++++++++++++++++++++++++++++++++++
-.. ipython:: python
- :suppress:
-
- f = open("foo.csv", "w")
- f.write("A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
- f.close()
-
Consider a file with one less entry in the header than the number of data
column:
.. ipython:: python
- print(open("foo.csv").read())
+ data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5"
+ print(data)
+ with open("foo.csv", "w") as f:
+ f.write(data)
In this special case, ``read_csv`` assumes that the first column is to be used
as the index of the ``DataFrame``:
@@ -1492,7 +1536,10 @@ Suppose you have data indexed by two columns:
.. ipython:: python
- print(open("data/mindex_ex.csv").read())
+ data = 'year,indiv,zit,xit\n1977,"A",1.2,.6\n1977,"B",1.5,.5'
+ print(data)
+ with open("mindex_ex.csv", mode="w") as f:
+ f.write(data)
The ``index_col`` argument to ``read_csv`` can take a list of
column numbers to turn multiple columns into a ``MultiIndex`` for the index of the
@@ -1500,9 +1547,14 @@ returned object:
.. ipython:: python
- df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
+ df = pd.read_csv("mindex_ex.csv", index_col=[0, 1])
df
- df.loc[1978]
+ df.loc[1977]
+
+.. ipython:: python
+ :suppress:
+
+ os.remove("mindex_ex.csv")
.. _io.multi_index_columns:
@@ -1526,20 +1578,18 @@ rows will skip the intervening rows.
of multi-columns indices.
.. ipython:: python
- :suppress:
data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12"
- fh = open("mi2.csv", "w")
- fh.write(data)
- fh.close()
-
-.. ipython:: python
+ print(data)
+ with open("mi2.csv", "w") as fh:
+ fh.write(data)
- print(open("mi2.csv").read())
pd.read_csv("mi2.csv", header=[0, 1], index_col=0)
-Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
-with ``df.to_csv(..., index=False)``, then any ``names`` on the columns index will be *lost*.
+.. note::
+ If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
+ with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will
+ be *lost*.
.. ipython:: python
:suppress:
@@ -1557,16 +1607,16 @@ comma-separated) files, as pandas uses the :class:`python:csv.Sniffer`
class of the csv module. For this, you have to specify ``sep=None``.
.. ipython:: python
- :suppress:
df = pd.DataFrame(np.random.randn(10, 4))
- df.to_csv("tmp.sv", sep="|")
- df.to_csv("tmp2.sv", sep=":")
+ df.to_csv("tmp.csv", sep="|")
+ df.to_csv("tmp2.csv", sep=":")
+ pd.read_csv("tmp2.csv", sep=None, engine="python")
.. ipython:: python
+ :suppress:
- print(open("tmp2.sv").read())
- pd.read_csv("tmp2.sv", sep=None, engine="python")
+ os.remove("tmp2.csv")
.. _io.multiple_files:
@@ -1587,8 +1637,9 @@ rather than reading the entire file into memory, such as the following:
.. ipython:: python
- print(open("tmp.sv").read())
- table = pd.read_csv("tmp.sv", sep="|")
+ df = pd.DataFrame(np.random.randn(10, 4))
+ df.to_csv("tmp.csv", sep="|")
+ table = pd.read_csv("tmp.csv", sep="|")
table
@@ -1597,7 +1648,7 @@ value will be an iterable object of type ``TextFileReader``:
.. ipython:: python
- with pd.read_csv("tmp.sv", sep="|", chunksize=4) as reader:
+ with pd.read_csv("tmp.csv", sep="|", chunksize=4) as reader:
reader
for chunk in reader:
print(chunk)
@@ -1610,23 +1661,28 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
.. ipython:: python
- with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
+ with pd.read_csv("tmp.csv", sep="|", iterator=True) as reader:
reader.get_chunk(5)
.. ipython:: python
:suppress:
- os.remove("tmp.sv")
- os.remove("tmp2.sv")
+ os.remove("tmp.csv")
Specifying the parser engine
''''''''''''''''''''''''''''
-Under the hood pandas uses a fast and efficient parser implemented in C as well
-as a Python implementation which is currently more feature-complete. Where
-possible pandas uses the C parser (specified as ``engine='c'``), but may fall
-back to Python if C-unsupported options are specified. Currently, C-unsupported
-options include:
+pandas currently supports three engines: the C engine, the Python engine, and an experimental
+pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest
+on larger workloads and is equivalent in speed to the C engine on most other workloads.
+The Python engine tends to be slower than the pyarrow and C engines on most workloads. However,
+the pyarrow engine is much less robust than the C engine, which in turn lacks a few features
+compared to the Python engine.
+
+Where possible, pandas uses the C parser (specified as ``engine='c'``), but it may fall
+back to Python if C-unsupported options are specified.
+
+Currently, options unsupported by the C and pyarrow engines include:
* ``sep`` other than a single character (e.g. regex separators)
* ``skipfooter``
@@ -1635,6 +1691,32 @@ options include:
Specifying any of the above options will produce a ``ParserWarning`` unless the
python engine is selected explicitly using ``engine='python'``.
+Options that are unsupported by the pyarrow engine which are not covered by the list above include:
+
+* ``float_precision``
+* ``chunksize``
+* ``comment``
+* ``nrows``
+* ``thousands``
+* ``memory_map``
+* ``dialect``
+* ``warn_bad_lines``
+* ``error_bad_lines``
+* ``on_bad_lines``
+* ``delim_whitespace``
+* ``quoting``
+* ``lineterminator``
+* ``converters``
+* ``decimal``
+* ``iterator``
+* ``dayfirst``
+* ``infer_datetime_format``
+* ``verbose``
+* ``skipinitialspace``
+* ``low_memory``
+
+Specifying these options with ``engine='pyarrow'`` will raise a ``ValueError``.
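As a quick sketch of explicit engine selection (the data string is illustrative), a regex
separator is one of the options above that only the Python engine handles:

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = "a; b;c\n1; 2;3"

   # sep of more than one character (here a regex) is unsupported by the C and
   # pyarrow engines, so the Python engine is requested explicitly.
   pd.read_csv(StringIO(data), sep=r";\s*", engine="python")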
+
.. _io.remote:
Reading/writing remote files
@@ -1744,7 +1826,7 @@ function takes a number of arguments. Only the first is required.
* ``mode`` : Python write mode, default 'w'
* ``encoding``: a string representing the encoding to use if the contents are
non-ASCII, for Python versions prior to 3
-* ``line_terminator``: Character sequence denoting line end (default ``os.linesep``)
+* ``lineterminator``: Character sequence denoting line end (default ``os.linesep``)
* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a ``float_format`` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
* ``quotechar``: Character used to quote fields (default '"')
* ``doublequote``: Control quoting of ``quotechar`` in fields (default True)
@@ -1820,6 +1902,7 @@ with optional parameters:
``index``; dict like {index -> {column -> value}}
``columns``; dict like {column -> {index -> value}}
``values``; just the values array
+ ``table``; adhering to the JSON `Table Schema`_
* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10.
@@ -2394,7 +2477,6 @@ A few notes on the generated table schema:
* For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
then ``level_`` is used.
-
``read_json`` also accepts ``orient='table'`` as an argument. This allows for
the preservation of metadata such as dtypes and index names in a
round-trippable manner.
@@ -2436,8 +2518,18 @@ indicate missing values and the subsequent read cannot distinguish the intent.
os.remove("test.json")
+When using ``orient='table'`` along with user-defined ``ExtensionArray``,
+the generated schema will contain an additional ``extDtype`` key in the respective
+``fields`` element. This extra key is not standard but does enable JSON roundtrips
+for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``).
+
+The ``extDtype`` key carries the name of the extension; if you have properly registered
+the ``ExtensionDtype``, pandas will use that name to perform a lookup into the registry
+and re-convert the serialized data into your custom dtype.
+
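A brief sketch using an extension dtype that ships with pandas and is therefore already
registered (``Int64``); a custom ``ExtensionDtype`` registered through
``pandas.api.extensions.register_extension_dtype`` would round-trip the same way:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

   json_str = df.to_json(orient="table")
   roundtripped = pd.read_json(json_str, orient="table")
   roundtripped.dtypes  # "a" comes back as Int64 via the extDtype lookup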
.. _Table Schema: https://specs.frictionlessdata.io/table-schema/
+
HTML
----
@@ -2462,40 +2554,66 @@ Let's look at a few examples.
Read a URL with no options:
-.. ipython:: python
+.. code-block:: ipython
- url = (
- "https://raw.githubusercontent.com/pandas-dev/pandas/master/"
- "pandas/tests/io/data/html/spam.html"
- )
- dfs = pd.read_html(url)
- dfs
+ In [320]: url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
+ In [321]: pd.read_html(url)
+ Out[321]:
+ [ Bank NameBank CityCity StateSt ... Acquiring InstitutionAI Closing DateClosing FundFund
+ 0 Almena State Bank Almena KS ... Equity Bank October 23, 2020 10538
+ 1 First City Bank of Florida Fort Walton Beach FL ... United Fidelity Bank, fsb October 16, 2020 10537
+ 2 The First State Bank Barboursville WV ... MVB Bank, Inc. April 3, 2020 10536
+ 3 Ericson State Bank Ericson NE ... Farmers and Merchants Bank February 14, 2020 10535
+ 4 City National Bank of New Jersey Newark NJ ... Industrial Bank November 1, 2019 10534
+ .. ... ... ... ... ... ... ...
+ 558 Superior Bank, FSB Hinsdale IL ... Superior Federal, FSB July 27, 2001 6004
+ 559 Malta National Bank Malta OH ... North Valley Bank May 3, 2001 4648
+ 560 First Alliance Bank & Trust Co. Manchester NH ... Southern New Hampshire Bank & Trust February 2, 2001 4647
+ 561 National State Bank of Metropolis Metropolis IL ... Banterra Bank of Marion December 14, 2000 4646
+ 562 Bank of Honolulu Honolulu HI ... Bank of the Orient October 13, 2000 4645
+
+ [563 rows x 7 columns]]
+
+.. note::
-Read in the content of the "banklist.html" file and pass it to ``read_html``
+ The data from the above URL changes every Monday so the resulting data above may be slightly different.
+
+Read in the content of an HTML file and pass it to ``read_html``
as a string:
.. ipython:: python
- :suppress:
- rel_path = os.path.join("..", "pandas", "tests", "io", "data", "html",
- "banklist.html")
- file_path = os.path.abspath(rel_path)
+    html_str = """
+    <table>
+      <tr>
+        <th>A</th>
+        <th>B</th>
+        <th>C</th>
+      </tr>
+      <tr>
+        <td>a</td>
+        <td>b</td>
+        <td>c</td>
+      </tr>
+    </table>
+    """
+
+ with open("tmp.html", "w") as f:
+ f.write(html_str)
+ df = pd.read_html("tmp.html")
+ df[0]
.. ipython:: python
+ :suppress:
- with open(file_path, "r") as f:
- dfs = pd.read_html(f.read())
- dfs
+ os.remove("tmp.html")
You can even pass in an instance of ``StringIO`` if you so desire:
.. ipython:: python
- with open(file_path, "r") as f:
- sio = StringIO(f.read())
-
- dfs = pd.read_html(sio)
- dfs
+ dfs = pd.read_html(StringIO(html_str))
+ dfs[0]
.. note::
@@ -2503,7 +2621,7 @@ You can even pass in an instance of ``StringIO`` if you so desire:
that having so many network-accessing functions slows down the documentation
build. If you spot an error or an example that doesn't run, please do not
hesitate to report it over on `pandas GitHub issues page
- `__.
+ `__.
Read a URL and match a table that contains specific text:
@@ -2613,6 +2731,30 @@ succeeds, the function will return*.
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])
+Links can be extracted from cells along with the text using ``extract_links="all"``.
+
+.. ipython:: python
+
+    html_table = """
+    <table>
+      <tr>
+        <th>GitHub</th>
+      </tr>
+      <tr>
+        <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
+      </tr>
+    </table>
+    """
+
+ df = pd.read_html(
+ html_table,
+ extract_links="all"
+ )[0]
+ df
+ df[("GitHub", None)]
+ df[("GitHub", None)].str[1]
+
+.. versionadded:: 1.5.0
.. _io.html:
@@ -2629,77 +2771,48 @@ in the method ``to_string`` described above.
brevity's sake. See :func:`~pandas.core.frame.DataFrame.to_html` for the
full set of options.
-.. ipython:: python
- :suppress:
+.. note::
- def write_html(df, filename, *args, **kwargs):
- static = os.path.abspath(os.path.join("source", "_static"))
- with open(os.path.join(static, filename + ".html"), "w") as f:
- df.to_html(f, *args, **kwargs)
+ In an environment that supports HTML rendering, such as a Jupyter Notebook, ``display(HTML(...))``
+ will render the raw HTML into the environment.
.. ipython:: python
+ from IPython.display import display, HTML
+
df = pd.DataFrame(np.random.randn(2, 2))
df
- print(df.to_html()) # raw html
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "basic")
-
-HTML:
-
-.. raw:: html
- :file: ../_static/basic.html
+ html = df.to_html()
+ print(html) # raw html
+ display(HTML(html))
The ``columns`` argument will limit the columns shown:
.. ipython:: python
- print(df.to_html(columns=[0]))
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "columns", columns=[0])
-
-HTML:
-
-.. raw:: html
- :file: ../_static/columns.html
+ html = df.to_html(columns=[0])
+ print(html)
+ display(HTML(html))
``float_format`` takes a Python callable to control the precision of floating
point values:
.. ipython:: python
- print(df.to_html(float_format="{0:.10f}".format))
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "float_format", float_format="{0:.10f}".format)
+ html = df.to_html(float_format="{0:.10f}".format)
+ print(html)
+ display(HTML(html))
-HTML:
-
-.. raw:: html
- :file: ../_static/float_format.html
``bold_rows`` will make the row labels bold by default, but you can turn that
off:
.. ipython:: python
- print(df.to_html(bold_rows=False))
+ html = df.to_html(bold_rows=False)
+ print(html)
+ display(HTML(html))
-.. ipython:: python
- :suppress:
-
- write_html(df, "nobold", bold_rows=False)
-
-.. raw:: html
- :file: ../_static/nobold.html
The ``classes`` argument provides the ability to give the resulting HTML
table CSS classes. Note that these classes are *appended* to the existing
@@ -2720,17 +2833,9 @@ that contain URLs.
"url": ["https://www.python.org/", "https://pandas.pydata.org/"],
}
)
- print(url_df.to_html(render_links=True))
-
-.. ipython:: python
- :suppress:
-
- write_html(url_df, "render_links", render_links=True)
-
-HTML:
-
-.. raw:: html
- :file: ../_static/render_links.html
+ html = url_df.to_html(render_links=True)
+ print(html)
+ display(HTML(html))
Finally, the ``escape`` argument allows you to control whether the
"<", ">" and "&" characters escaped in the resulting HTML (by default it is
@@ -2740,30 +2845,21 @@ Finally, the ``escape`` argument allows you to control whether the
df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
-
-.. ipython:: python
- :suppress:
-
- write_html(df, "escape")
- write_html(df, "noescape", escape=False)
-
Escaped:
.. ipython:: python
- print(df.to_html())
-
-.. raw:: html
- :file: ../_static/escape.html
+ html = df.to_html()
+ print(html)
+ display(HTML(html))
Not escaped:
.. ipython:: python
- print(df.to_html(escape=False))
-
-.. raw:: html
- :file: ../_static/noescape.html
+ html = df.to_html(escape=False)
+ print(html)
+ display(HTML(html))
.. note::
@@ -2943,13 +3039,10 @@ Read in the content of the "books.xml" file and pass it to ``read_xml``
as a string:
.. ipython:: python
- :suppress:
- rel_path = os.path.join("..", "pandas", "tests", "io", "data", "xml",
- "books.xml")
- file_path = os.path.abspath(rel_path)
-
-.. ipython:: python
+ file_path = "books.xml"
+ with open(file_path, "w") as f:
+ f.write(xml)
with open(file_path, "r") as f:
df = pd.read_xml(f.read())
@@ -2974,14 +3067,15 @@ Read in the content of the "books.xml" as instance of ``StringIO`` or
df = pd.read_xml(bio)
df
-Even read XML from AWS S3 buckets such as Python Software Foundation's IRS 990 Form:
+Even read XML from AWS S3 buckets such as the NIH NCBI PMC Article Datasets providing
+Biomedical and Life Science Journals:
.. ipython:: python
+ :okwarning:
df = pd.read_xml(
- "s3://irs-form-990/201923199349319487_public.xml",
- xpath=".//irs:Form990PartVIISectionAGrp",
- namespaces={"irs": "http://www.irs.gov/efile"}
+ "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
+ xpath=".//journal-meta",
)
df
@@ -3008,6 +3102,11 @@ Specify only elements or only attributes to parse:
df = pd.read_xml(file_path, attrs_only=True)
df
+.. ipython:: python
+ :suppress:
+
+ os.remove("books.xml")
+
XML documents can have namespaces with prefixes and default namespaces without
prefixes both of which are denoted with a special attribute ``xmlns``. In order
to parse by node under a namespace context, ``xpath`` must reference a prefix.
@@ -3170,6 +3269,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet=xsl)
df
+For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
+which are memory-efficient methods to iterate through an XML tree and extract specific elements
+and attributes without holding the entire tree in memory.
+
+ .. versionadded:: 1.5.0
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
+To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
+The file should not be compressed or point to an online source but be stored on local disk. Also, ``iterparse`` should be
+a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of
+any element or attribute that is a descendant (i.e., child, grandchild) of the repeating node. Since XPath is not
+used in this method, descendants do not need to share the same relationship with one another. Below is an example
+of reading in Wikipedia's very large (12 GB+) latest article data dump.
+
+.. code-block:: ipython
+
+ In [1]: df = pd.read_xml(
+ ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+ ... iterparse = {"page": ["title", "ns", "id"]}
+ ... )
+
+ In [2]: df
+ Out[2]:
+ title ns id
+ 0 Gettysburg Address 0 21450
+ 1 Main Page 0 42950
+ 2 Declaration by United Nations 0 8435
+ 3 Constitution of the United States of America 0 8435
+ 4 Declaration of Independence (Israel) 0 17858
+ ... ... ... ...
+ 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
+ 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
+ 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
+ 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
+ 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
+
+ [3578765 rows x 3 columns]
.. _io.xml:
@@ -3368,7 +3506,7 @@ See the :ref:`cookbook` for some advanced strategies.
**Please do not report issues when using ``xlrd`` to read ``.xlsx`` files.**
This is no longer supported, switch to using ``openpyxl`` instead.
- Attempting to use the the ``xlwt`` engine will raise a ``FutureWarning``
+ Attempting to use the ``xlwt`` engine will raise a ``FutureWarning``
unless the option :attr:`io.excel.xls.writer` is set to ``"xlwt"``.
While this option is now deprecated and will also raise a ``FutureWarning``,
it can be globally set and the warning suppressed. Users are recommended to
@@ -3460,9 +3598,9 @@ with ``on_demand=True``.
Specifying sheets
+++++++++++++++++
-.. note :: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``.
+.. note:: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``.
-.. note :: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets.
+.. note:: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets.
* The argument ``sheet_name`` allows specifying the sheet or sheets to read.
* The default value for ``sheet_name`` is 0, indicating to read the first sheet
@@ -3558,6 +3696,10 @@ should be passed to ``index_col`` and ``header``:
os.remove("path_to_file.xlsx")
+Missing values in columns specified in ``index_col`` will be forward filled to
+allow roundtripping with ``to_excel`` for ``merged_cells=True``. To avoid forward
+filling the missing values use ``set_index`` after reading the data instead of
+``index_col``.
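+
+A minimal sketch of the ``set_index`` alternative, assuming a hypothetical ``path_to_file.xlsx``
+whose first two columns, here named ``lev1`` and ``lev2``, hold the index labels:
+
+.. code-block:: python
+
+   # read without index_col so missing cells in the index columns are left as NaN
+   # rather than forward filled, then build the index explicitly
+   df = pd.read_excel("path_to_file.xlsx")
+   df = df.set_index(["lev1", "lev2"])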
Parsing specific columns
++++++++++++++++++++++++
@@ -3936,18 +4078,18 @@ Compressed pickle files
'''''''''''''''''''''''
:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read
-and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing.
+and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, and ``zstd`` are supported for reading and writing.
The ``zip`` file format only supports reading and must contain only one data file
to be read.
The compression type can be an explicit parameter or be inferred from the file extension.
-If 'infer', then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or
-``'.xz'``, respectively.
+If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, or ``zstd`` if the filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``,
+``'.xz'``, or ``'.zst'``, respectively.
The compression parameter can also be a ``dict`` in order to pass options to the
compression protocol. It must have a ``'method'`` key set to the name
of the compression protocol, which must be one of
-{``'zip'``, ``'gzip'``, ``'bz2'``}. All other key-value pairs are passed to
+{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to
the underlying compression library.
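+A minimal sketch of the ``dict`` form, forwarding a ``compresslevel`` option to the underlying
+``gzip`` library:
+
+.. code-block:: python
+
+   df = pd.DataFrame({"A": range(5)})
+   # "compresslevel" is forwarded to gzip.GzipFile by pandas
+   df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1})
+   # compression is inferred from the ".gz" extension when reading back
+   pd.read_pickle("data.pkl.gz")
+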
.. ipython:: python
@@ -4872,7 +5014,7 @@ control compression: ``complevel`` and ``complib``.
rates but is somewhat slow.
- `lzo `_: Fast
compression and decompression.
- - `bzip2 `_: Good compression rates.
+ - `bzip2 `_: Good compression rates.
- `blosc `_: Fast compression and
decompression.
@@ -4881,10 +5023,10 @@ control compression: ``complevel`` and ``complib``.
- `blosc:blosclz `_ This is the
default compressor for ``blosc``
- `blosc:lz4
- `_:
+ `_:
A compact, very popular and fast compressor.
- `blosc:lz4hc
- `_:
+ `_:
A tweaked version of LZ4, produces better
compression ratios at the expense of speed.
- `blosc:snappy `_:
@@ -5226,15 +5368,6 @@ Several caveats:
See the `Full Documentation `__.
-.. ipython:: python
- :suppress:
-
- import warnings
-
- # This can be removed once building with pyarrow >=0.15.0
- warnings.filterwarnings("ignore", "The Sparse", FutureWarning)
-
-
.. ipython:: python
df = pd.DataFrame(
@@ -5314,7 +5447,7 @@ See the documentation for `pyarrow `__ an
.. note::
These engines are very similar and should read/write nearly identical parquet format files.
- Currently ``pyarrow`` does not support timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+ ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
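+A minimal sketch, assuming ``pyarrow>=8.0.0`` is installed, of round-tripping timedelta data:
+
+.. code-block:: python
+
+   df_td = pd.DataFrame({"delta": pd.to_timedelta(["1 day", "2 hours", "15 minutes"])})
+   df_td.to_parquet("example_td.parquet", engine="pyarrow")
+   pd.read_parquet("example_td.parquet").dtypes  # "delta" comes back as timedelta64[ns]
+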
.. ipython:: python
@@ -5461,13 +5594,64 @@ ORC
.. versionadded:: 1.0.0
Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization
-for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
-ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow `__ library.
+for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the
+ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow `__ library.
.. warning::
 * It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
- * :func:`~pandas.read_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `.
+ * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
+ * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet; you can find valid environments on :ref:`install optional dependencies `.
+ * For supported dtypes please refer to `supported ORC features in Arrow `__.
+ * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
+
+.. ipython:: python
+
+ df = pd.DataFrame(
+ {
+ "a": list("abc"),
+ "b": list(range(1, 4)),
+ "c": np.arange(4.0, 7.0, dtype="float64"),
+ "d": [True, False, True],
+ "e": pd.date_range("20130101", periods=3),
+ }
+ )
+
+ df
+ df.dtypes
+
+Write to an ORC file.
+
+.. ipython:: python
+ :okwarning:
+
+ df.to_orc("example_pa.orc", engine="pyarrow")
+
+Read from an ORC file.
+
+.. ipython:: python
+ :okwarning:
+
+ result = pd.read_orc("example_pa.orc")
+
+ result.dtypes
+
+Read only certain columns of an ORC file.
+
+.. ipython:: python
+
+ result = pd.read_orc(
+ "example_pa.orc",
+ columns=["a", "b"],
+ )
+ result.dtypes
+
+
+.. ipython:: python
+ :suppress:
+
+ os.remove("example_pa.orc")
+
.. _io.sql:
@@ -5477,7 +5661,7 @@ SQL queries
The :mod:`pandas.io.sql` module provides a collection of query wrappers to both
facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
is provided by SQLAlchemy if installed. In addition you will need a driver library for
-your database. Examples of such drivers are `psycopg2 `__
+your database. Examples of such drivers are `psycopg2 `__
for PostgreSQL or `pymysql `__ for MySQL.
For `SQLite `__ this is
included in Python's standard library by default.
@@ -5509,7 +5693,7 @@ The key functions are:
the provided input (database table name or sql query).
Table names do not need to be quoted if they have special characters.
-In the following example, we use the `SQlite `__ SQL database
+In the following example, we use the `SQLite `__ SQL database
engine. You can use a temporary SQLite database where data are stored in
"memory".
@@ -5526,13 +5710,23 @@ below and the SQLAlchemy `documentation `__
+for an explanation of how the database connection is handled.
.. code-block:: python
with engine.connect() as conn, conn.begin():
data = pd.read_sql_table("data", conn)
+.. warning::
+
+ When you open a connection to a database you are also responsible for closing it.
+ Side effects of leaving a connection open may include locking the database or
+ other breaking behaviour.
+
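+A minimal sketch, assuming ``engine`` was created as above, of releasing a connection explicitly
+when a ``with`` block is not used:
+
+.. code-block:: python
+
+   conn = engine.connect()
+   try:
+       data = pd.read_sql_table("data", conn)
+   finally:
+       conn.close()  # always release the connection, even if the read fails
+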
Writing DataFrames
''''''''''''''''''
@@ -5551,7 +5745,6 @@ the database using :func:`~pandas.DataFrame.to_sql`.
.. ipython:: python
- :suppress:
import datetime
@@ -5564,10 +5757,8 @@ the database using :func:`~pandas.DataFrame.to_sql`.
data = pd.DataFrame(d, columns=c)
-.. ipython:: python
-
- data
- data.to_sql("data", engine)
+ data
+ data.to_sql("data", engine)
With some databases, writing large DataFrames can result in errors due to
packet size limitations being exceeded. This can be avoided by setting the
@@ -5663,7 +5854,7 @@ Possible values are:
specific backend dialect features.
Example of a callable using PostgreSQL `COPY clause
-`__::
+`__::
# Alternative to_sql() *method* for DBs that support COPY FROM
import csv
@@ -5689,7 +5880,7 @@ Example of a callable using PostgreSQL `COPY clause
writer.writerows(data_iter)
s_buf.seek(0)
- columns = ', '.join('"{}"'.format(k) for k in keys)
+ columns = ', '.join(['"{}"'.format(k) for k in keys])
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name)
else:
@@ -5925,7 +6116,7 @@ pandas integrates with this external package. if ``pandas-gbq`` is installed, yo
use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
respective functions from ``pandas-gbq``.
-Full documentation can be found `here `__.
+Full documentation can be found `here `__.
.. _io.stata:
@@ -6133,7 +6324,7 @@ Obtain an iterator and read an XPORT file 100,000 lines at a time:
The specification_ for the xport file format is available from the SAS
web site.
-.. _specification: https://support.sas.com/techsup/technote/ts140.pdf
+.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf
No official documentation is available for the SAS7BDAT format.
@@ -6175,7 +6366,7 @@ avoid converting categorical columns into ``pd.Categorical``:
More information about the SAV and ZSAV file formats is available here_.
-.. _here: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/savedatatypes.htm
+.. _here: https://www.ibm.com/docs/en/spss-statistics/22.0.0
.. _io.other:
@@ -6193,7 +6384,7 @@ xarray_ provides data structures inspired by the pandas ``DataFrame`` for workin
with multi-dimensional datasets, with a focus on the netCDF file format and
easy conversion to and from pandas.
-.. _xarray: https://xarray.pydata.org/
+.. _xarray: https://xarray.pydata.org/en/stable/
.. _io.perf:
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index 09b3d3a8c96df..bbca5773afdfe 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -237,59 +237,6 @@ Similarly, we could index before the concatenation:
p.plot([df1, df4], result, labels=["df1", "df4"], vertical=False);
plt.close("all");
-.. _merging.concatenation:
-
-Concatenating using ``append``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
-instance methods on ``Series`` and ``DataFrame``. These methods actually predated
-``concat``. They concatenate along ``axis=0``, namely the index:
-
-.. ipython:: python
-
- result = df1.append(df2)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append1.png
- p.plot([df1, df2], result, labels=["df1", "df2"], vertical=True);
- plt.close("all");
-
-In the case of ``DataFrame``, the indexes must be disjoint but the columns do not
-need to be:
-
-.. ipython:: python
-
- result = df1.append(df4, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append2.png
- p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
- plt.close("all");
-
-``append`` may take multiple objects to concatenate:
-
-.. ipython:: python
-
- result = df1.append([df2, df3])
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append3.png
- p.plot([df1, df2, df3], result, labels=["df1", "df2", "df3"], vertical=True);
- plt.close("all");
-
-.. note::
-
- Unlike the :py:meth:`~list.append` method, which appends to the original list
- and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
- ``df1`` and returns its copy with ``df2`` appended.
-
.. _merging.ignore_index:
Ignoring indexes on the concatenation axis
@@ -309,19 +256,6 @@ do this, use the ``ignore_index`` argument:
p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
plt.close("all");
-This is also a valid argument to :meth:`DataFrame.append`:
-
-.. ipython:: python
-
- result = df1.append(df4, ignore_index=True, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append_ignore_index.png
- p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
- plt.close("all");
-
.. _merging.mixed_ndims:
Concatenating with mixed ndims
@@ -473,14 +407,13 @@ like GroupBy where the order of a categorical variable is meaningful.
Appending rows to a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-While not especially efficient (since a new object must be created), you can
-append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
-``append``, which returns a new ``DataFrame`` as above.
+If you have a ``Series`` that you want to append as a single row to a ``DataFrame``, you can convert the row into a
+``DataFrame`` and use ``concat``:
.. ipython:: python
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
- result = df1.append(s2, ignore_index=True)
+ result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
.. ipython:: python
:suppress:
@@ -493,20 +426,6 @@ You should use ``ignore_index`` with this method to instruct DataFrame to
discard its index. If you wish to preserve the index, you should construct an
appropriately-indexed DataFrame and append or concatenate those objects.
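+A minimal sketch, assuming ``df1`` from the earlier examples (rows indexed ``0`` through ``3``),
+of preserving the index by constructing an appropriately-indexed row first:
+
+.. code-block:: python
+
+   new_row = pd.DataFrame(
+       [["X0", "X1", "X2", "X3"]], columns=["A", "B", "C", "D"], index=[4]
+   )
+   result = pd.concat([df1, new_row])  # index 0..4 is preserved, no ignore_index needed
+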
-You can also pass a list of dicts or Series:
-
-.. ipython:: python
-
- dicts = [{"A": 1, "B": 2, "C": 3, "X": 4}, {"A": 5, "B": 6, "C": 7, "Y": 8}]
- result = df1.append(dicts, ignore_index=True, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append_dits.png
- p.plot([df1, pd.DataFrame(dicts)], result, labels=["df1", "dicts"], vertical=True);
- plt.close("all");
-
.. _merging.join:
Database-style DataFrame or named Series joining/merging
@@ -562,7 +481,7 @@ all standard database join operations between ``DataFrame`` or named ``Series``
(hierarchical), the number of levels must match the number of join keys
from the right DataFrame or Series.
* ``right_index``: Same usage as ``left_index`` for the right DataFrame or Series
-* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults
+* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``, ``'cross'``. Defaults
to ``inner``. See below for more detailed description of each method.
* ``sort``: Sort the result DataFrame by the join keys in lexicographical
order. Defaults to ``True``, setting to ``False`` will improve performance
@@ -707,6 +626,7 @@ either the left or right tables, the values in the joined table will be
``right``, ``RIGHT OUTER JOIN``, Use keys from right frame only
``outer``, ``FULL OUTER JOIN``, Use union of keys from both frames
``inner``, ``INNER JOIN``, Use intersection of keys from both frames
+ ``cross``, ``CROSS JOIN``, Create the cartesian product of rows of both frames
.. ipython:: python
@@ -751,6 +671,17 @@ either the left or right tables, the values in the joined table will be
p.plot([left, right], result, labels=["left", "right"], vertical=False);
plt.close("all");
+.. ipython:: python
+
+ result = pd.merge(left, right, how="cross")
+
+.. ipython:: python
+ :suppress:
+
+ @savefig merging_merge_cross.png
+ p.plot([left, right], result, labels=["left", "right"], vertical=False);
+ plt.close("all");
+
You can merge a multi-indexed Series and a DataFrame, if the names of
the MultiIndex correspond to the columns from the DataFrame. Transform
the Series to a DataFrame using :meth:`Series.reset_index` before merging,
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
index 1621b37f31b23..3052ee3001681 100644
--- a/doc/source/user_guide/missing_data.rst
+++ b/doc/source/user_guide/missing_data.rst
@@ -470,7 +470,7 @@ at the new values.
interp_s = ser.reindex(new_index).interpolate(method="pchip")
interp_s[49:51]
-.. _scipy: https://www.scipy.org
+.. _scipy: https://scipy.org/
.. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
@@ -580,7 +580,7 @@ String/regular expression replacement
backslashes than strings without this prefix. Backslashes in raw strings
will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. You
should `read about them
- `__
+ `__
if this is unclear.
Replace the '.' with ``NaN`` (str -> str):
diff --git a/doc/source/user_guide/options.rst b/doc/source/user_guide/options.rst
index 62a347acdaa34..c7f5d3ddf66d3 100644
--- a/doc/source/user_guide/options.rst
+++ b/doc/source/user_guide/options.rst
@@ -8,8 +8,8 @@ Options and settings
Overview
--------
-pandas has an options system that lets you customize some aspects of its behaviour,
-display-related options being those the user is most likely to adjust.
+pandas has an options API to configure and customize global behavior related to
+:class:`DataFrame` display, data behavior and more.
Options have a full "dotted-style", case-insensitive name (e.g. ``display.max_rows``).
You can get/set options directly as attributes of the top-level ``options`` attribute:
@@ -31,18 +31,20 @@ namespace:
* :func:`~pandas.option_context` - execute a codeblock with a set of options
that revert to prior settings after execution.
-**Note:** Developers can check out `pandas/core/config_init.py `_ for more information.
+.. note::
+
+ Developers can check out `pandas/core/config_init.py `_ for more information.
All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
-and so passing in a substring will work - as long as it is unambiguous:
+to match an unambiguous substring:
.. ipython:: python
- pd.get_option("display.max_rows")
- pd.set_option("display.max_rows", 101)
- pd.get_option("display.max_rows")
- pd.set_option("max_r", 102)
- pd.get_option("display.max_rows")
+ pd.get_option("display.chop_threshold")
+ pd.set_option("display.chop_threshold", 2)
+ pd.get_option("display.chop_threshold")
+ pd.set_option("chop", 4)
+ pd.get_option("display.chop_threshold")
The following will **not work** because it matches multiple option names, e.g.
@@ -51,17 +53,13 @@ The following will **not work** because it matches multiple option names, e.g.
.. ipython:: python
:okexcept:
- try:
- pd.get_option("column")
- except KeyError as e:
- print(e)
+ pd.get_option("max")
-**Note:** Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
+.. warning::
+ Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
-You can get a list of available options and their descriptions with ``describe_option``. When called
-with no argument ``describe_option`` will print out the descriptions for all available options.
.. ipython:: python
:suppress:
@@ -69,6 +67,18 @@ with no argument ``describe_option`` will print out the descriptions for all ava
pd.reset_option("all")
+.. _options.available:
+
+Available options
+-----------------
+
+You can get a list of available options and their descriptions with :func:`~pandas.describe_option`. When called
+with no argument :func:`~pandas.describe_option` will print out the descriptions for all available options.
+
+.. ipython:: python
+
+ pd.describe_option()
+
Getting and setting options
---------------------------
@@ -82,9 +92,11 @@ are available from the pandas namespace. To change an option, call
pd.set_option("mode.sim_interactive", True)
pd.get_option("mode.sim_interactive")
-**Note:** The option 'mode.sim_interactive' is mostly used for debugging purposes.
+.. note::
+
+ The option ``'mode.sim_interactive'`` is mostly used for debugging purposes.
-All options also have a default value, and you can use ``reset_option`` to do just that:
+You can use :func:`~pandas.reset_option` to revert to a setting's default value:
.. ipython:: python
:suppress:
@@ -108,7 +120,7 @@ It's also possible to reset multiple options at once (using a regex):
pd.reset_option("^display")
-``option_context`` context manager has been exposed through
+:func:`~pandas.option_context` context manager has been exposed through
the top-level API, allowing you to execute code with given option values. Option values
are restored automatically when you exit the ``with`` block:
@@ -124,7 +136,9 @@ are restored automatically when you exit the ``with`` block:
Setting startup options in Python/IPython environment
-----------------------------------------------------
-Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default IPython profile can be found at:
+Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient.
+To do this, create a ``.py`` or ``.ipy`` script in the startup directory of the desired profile.
+An example where the startup folder is in a default IPython profile can be found at:
.. code-block:: none
@@ -138,45 +152,45 @@ More information can be found in the `IPython documentation
import pandas as pd
pd.set_option("display.max_rows", 999)
- pd.set_option("precision", 5)
+ pd.set_option("display.precision", 5)
.. _options.frequently_used:
Frequently used options
-----------------------
-The following is a walk-through of the more frequently used display options.
+The following demonstrates the more frequently used display options.
``display.max_rows`` and ``display.max_columns`` sets the maximum number
-of rows and columns displayed when a frame is pretty-printed. Truncated
+of rows and columns displayed when a frame is pretty-printed. Truncated
lines are replaced by an ellipsis.
.. ipython:: python
df = pd.DataFrame(np.random.randn(7, 2))
- pd.set_option("max_rows", 7)
+ pd.set_option("display.max_rows", 7)
df
- pd.set_option("max_rows", 5)
+ pd.set_option("display.max_rows", 5)
df
- pd.reset_option("max_rows")
+ pd.reset_option("display.max_rows")
Once the ``display.max_rows`` is exceeded, the ``display.min_rows`` option
determines how many rows are shown in the truncated repr.
.. ipython:: python
- pd.set_option("max_rows", 8)
- pd.set_option("min_rows", 4)
+ pd.set_option("display.max_rows", 8)
+ pd.set_option("display.min_rows", 4)
# below max_rows -> all rows shown
df = pd.DataFrame(np.random.randn(7, 2))
df
# above max_rows -> only min_rows (4) rows shown
df = pd.DataFrame(np.random.randn(9, 2))
df
- pd.reset_option("max_rows")
- pd.reset_option("min_rows")
+ pd.reset_option("display.max_rows")
+ pd.reset_option("display.min_rows")
-``display.expand_frame_repr`` allows for the representation of
-dataframes to stretch across pages, wrapped over the full column vs row-wise.
+``display.expand_frame_repr`` allows for the representation of a
+:class:`DataFrame` to stretch across pages, wrapped over all the columns.
.. ipython:: python
@@ -187,19 +201,19 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise.
df
pd.reset_option("expand_frame_repr")
-``display.large_repr`` lets you select whether to display dataframes that exceed
-``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
+``display.large_repr`` displays a :class:`DataFrame` that exceeds
+``max_columns`` or ``max_rows`` as a truncated frame or summary.
.. ipython:: python
df = pd.DataFrame(np.random.randn(10, 10))
- pd.set_option("max_rows", 5)
+ pd.set_option("display.max_rows", 5)
pd.set_option("large_repr", "truncate")
df
pd.set_option("large_repr", "info")
df
pd.reset_option("large_repr")
- pd.reset_option("max_rows")
+ pd.reset_option("display.max_rows")
``display.max_colwidth`` sets the maximum width of columns. Cells
of this length or longer will be truncated with an ellipsis.
@@ -220,8 +234,8 @@ of this length or longer will be truncated with an ellipsis.
df
pd.reset_option("max_colwidth")
-``display.max_info_columns`` sets a threshold for when by-column info
-will be given.
+``display.max_info_columns`` sets a threshold for the number of columns
+displayed when calling :meth:`~pandas.DataFrame.info`.
.. ipython:: python
@@ -232,10 +246,10 @@ will be given.
df.info()
pd.reset_option("max_info_columns")
-``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
-For large frames this can be quite slow. ``max_info_rows`` and ``max_info_cols``
-limit this null check only to frames with smaller dimensions then specified. Note that you
-can specify the option ``df.info(null_counts=True)`` to override on showing a particular frame.
+``display.max_info_rows``: :meth:`~pandas.DataFrame.info` will usually show null-counts for each column.
+For a large :class:`DataFrame`, this can be quite slow. ``max_info_rows`` and ``max_info_cols``
+limit this null check to the specified rows and columns respectively. The :meth:`~pandas.DataFrame.info`
+keyword argument ``null_counts=True`` will override this.
.. ipython:: python
@@ -248,18 +262,17 @@ can specify the option ``df.info(null_counts=True)`` to override on showing a pa
pd.reset_option("max_info_rows")
``display.precision`` sets the output display precision in terms of decimal places.
-This is only a suggestion.
.. ipython:: python
df = pd.DataFrame(np.random.randn(5, 5))
- pd.set_option("precision", 7)
+ pd.set_option("display.precision", 7)
df
- pd.set_option("precision", 4)
+ pd.set_option("display.precision", 4)
df
-``display.chop_threshold`` sets at what level pandas rounds to zero when
-it displays a Series of DataFrame. This setting does not change the
+``display.chop_threshold`` sets the threshold below which values are displayed as zero in a
+:class:`Series` or :class:`DataFrame`. This setting does not change the
precision at which the number is stored.
.. ipython:: python
@@ -272,7 +285,7 @@ precision at which the number is stored.
pd.reset_option("chop_threshold")
``display.colheader_justify`` controls the justification of the headers.
-The options are 'right', and 'left'.
+The options are ``'right'``, and ``'left'``.
.. ipython:: python
@@ -288,210 +301,6 @@ The options are 'right', and 'left'.
pd.reset_option("colheader_justify")
-
-.. _options.available:
-
-Available options
------------------
-
-======================================= ============ ==================================
-Option Default Function
-======================================= ============ ==================================
-display.chop_threshold None If set to a float value, all float
- values smaller then the given
- threshold will be displayed as
- exactly 0 by repr and friends.
-display.colheader_justify right Controls the justification of
- column headers. used by DataFrameFormatter.
-display.column_space 12 No description available.
-display.date_dayfirst False When True, prints and parses dates
- with the day first, eg 20/01/2005
-display.date_yearfirst False When True, prints and parses dates
- with the year first, eg 2005/01/20
-display.encoding UTF-8 Defaults to the detected encoding
- of the console. Specifies the encoding
- to be used for strings returned by
- to_string, these are generally strings
- meant to be displayed on the console.
-display.expand_frame_repr True Whether to print out the full DataFrame
- repr for wide DataFrames across
- multiple lines, ``max_columns`` is
- still respected, but the output will
- wrap-around across multiple "pages"
- if its width exceeds ``display.width``.
-display.float_format None The callable should accept a floating
- point number and return a string with
- the desired format of the number.
- This is used in some places like
- SeriesFormatter.
- See core.format.EngFormatter for an example.
-display.large_repr truncate For DataFrames exceeding max_rows/max_cols,
- the repr (and HTML repr) can show
- a truncated table (the default),
- or switch to the view from df.info()
- (the behaviour in earlier versions of pandas).
- allowable settings, ['truncate', 'info']
-display.latex.repr False Whether to produce a latex DataFrame
- representation for Jupyter frontends
- that support it.
-display.latex.escape True Escapes special characters in DataFrames, when
- using the to_latex method.
-display.latex.longtable False Specifies if the to_latex method of a DataFrame
- uses the longtable format.
-display.latex.multicolumn True Combines columns when using a MultiIndex
-display.latex.multicolumn_format 'l' Alignment of multicolumn labels
-display.latex.multirow False Combines rows when using a MultiIndex.
- Centered instead of top-aligned,
- separated by clines.
-display.max_columns 0 or 20 max_rows and max_columns are used
- in __repr__() methods to decide if
- to_string() or info() is used to
- render an object to a string. In
- case Python/IPython is running in
- a terminal this is set to 0 by default and
- pandas will correctly auto-detect
- the width of the terminal and switch to
- a smaller format in case all columns
- would not fit vertically. The IPython
- notebook, IPython qtconsole, or IDLE
- do not run in a terminal and hence
- it is not possible to do correct
- auto-detection, in which case the default
- is set to 20. 'None' value means unlimited.
-display.max_colwidth 50 The maximum width in characters of
- a column in the repr of a pandas
- data structure. When the column overflows,
- a "..." placeholder is embedded in
- the output. 'None' value means unlimited.
-display.max_info_columns 100 max_info_columns is used in DataFrame.info
- method to decide if per column information
- will be printed.
-display.max_info_rows 1690785 df.info() will usually show null-counts
- for each column. For large frames
- this can be quite slow. max_info_rows
- and max_info_cols limit this null
- check only to frames with smaller
- dimensions then specified.
-display.max_rows 60 This sets the maximum number of rows
- pandas should output when printing
- out various output. For example,
- this value determines whether the
- repr() for a dataframe prints out
- fully or just a truncated or summary repr.
- 'None' value means unlimited.
-display.min_rows 10 The numbers of rows to show in a truncated
- repr (when ``max_rows`` is exceeded). Ignored
- when ``max_rows`` is set to None or 0. When set
- to None, follows the value of ``max_rows``.
-display.max_seq_items 100 when pretty-printing a long sequence,
- no more then ``max_seq_items`` will
- be printed. If items are omitted,
- they will be denoted by the addition
- of "..." to the resulting string.
- If set to None, the number of items
- to be printed is unlimited.
-display.memory_usage True This specifies if the memory usage of
- a DataFrame should be displayed when the
- df.info() method is invoked.
-display.multi_sparse True "Sparsify" MultiIndex display (don't
- display repeated elements in outer
- levels within groups)
-display.notebook_repr_html True When True, IPython notebook will
- use html representation for
- pandas objects (if it is available).
-display.pprint_nest_depth 3 Controls the number of nested levels
- to process when pretty-printing
-display.precision 6 Floating point output precision in
- terms of number of places after the
- decimal, for regular formatting as well
- as scientific notation. Similar to
- numpy's ``precision`` print option
-display.show_dimensions truncate Whether to print out dimensions
- at the end of DataFrame repr.
- If 'truncate' is specified, only
- print out the dimensions if the
- frame is truncated (e.g. not display
- all rows and/or columns)
-display.width 80 Width of the display in characters.
- In case Python/IPython is running in
- a terminal this can be set to None
- and pandas will correctly auto-detect
- the width. Note that the IPython notebook,
- IPython qtconsole, or IDLE do not run in a
- terminal and hence it is not possible
- to correctly detect the width.
-display.html.table_schema False Whether to publish a Table Schema
- representation for frontends that
- support it.
-display.html.border 1 A ``border=value`` attribute is
-                                           inserted in the ``<table>`` tag
- for the DataFrame HTML repr.
-display.html.use_mathjax True When True, Jupyter notebook will process
- table contents using MathJax, rendering
- mathematical expressions enclosed by the
- dollar symbol.
-io.excel.xls.writer xlwt The default Excel writer engine for
- 'xls' files.
-
- .. deprecated:: 1.2.0
-
- As `xlwt `__
- package is no longer maintained, the ``xlwt``
- engine will be removed in a future version of
- pandas. Since this is the only engine in pandas
- that supports writing to ``.xls`` files,
- this option will also be removed.
-
-io.excel.xlsm.writer openpyxl The default Excel writer engine for
- 'xlsm' files. Available options:
- 'openpyxl' (the default).
-io.excel.xlsx.writer openpyxl The default Excel writer engine for
- 'xlsx' files.
-io.hdf.default_format None default format writing format, if
- None, then put will default to
- 'fixed' and append will default to
- 'table'
-io.hdf.dropna_table True drop ALL nan rows when appending
- to a table
-io.parquet.engine None The engine to use as a default for
- parquet reading and writing. If None
- then try 'pyarrow' and 'fastparquet'
-io.sql.engine None The engine to use as a default for
- sql reading and writing, with SQLAlchemy
- as a higher level interface. If None
- then try 'sqlalchemy'
-mode.chained_assignment warn Controls ``SettingWithCopyWarning``:
- 'raise', 'warn', or None. Raise an
- exception, warn, or no action if
- trying to use :ref:`chained assignment `.
-mode.sim_interactive False Whether to simulate interactive mode
- for purposes of testing.
-mode.use_inf_as_na False True means treat None, NaN, -INF,
- INF as NA (old way), False means
- None and NaN are null, but INF, -INF
- are not NA (new way).
-compute.use_bottleneck True Use the bottleneck library to accelerate
- computation if it is installed.
-compute.use_numexpr True Use the numexpr library to accelerate
- computation if it is installed.
-plotting.backend matplotlib Change the plotting backend to a different
- backend than the current matplotlib one.
- Backends can be implemented as third-party
- libraries implementing the pandas plotting
- API. They can use other plotting libraries
- like Bokeh, Altair, etc.
-plotting.matplotlib.register_converters True Register custom converters with
- matplotlib. Set to False to de-register.
-styler.sparse.index True "Sparsify" MultiIndex display for rows
- in Styler output (don't display repeated
- elements in outer levels within groups).
-styler.sparse.columns True "Sparsify" MultiIndex display for columns
- in Styler output.
-styler.render.max_elements 262144 Maximum number of datapoints that Styler will render
- trimming either rows, columns or both to fit.
-======================================= ============ ==================================
-
-
.. _basics.console_output:
Number formatting
@@ -504,8 +313,6 @@ Use the ``set_eng_float_format`` function
to alter the floating-point formatting of pandas objects to produce a particular
format.
-For instance:
-
.. ipython:: python
import numpy as np
@@ -521,7 +328,7 @@ For instance:
pd.reset_option("^display")
-To round floats on a case-by-case basis, you can also use :meth:`~pandas.Series.round` and :meth:`~pandas.DataFrame.round`.
+Use :meth:`~pandas.DataFrame.round` to specifically control rounding of an individual :class:`DataFrame`.
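+
+A minimal sketch of per-object rounding that leaves the global display options untouched:
+
+.. code-block:: python
+
+   import numpy as np
+   import pandas as pd
+
+   df = pd.DataFrame(np.random.randn(3, 3))
+   df.round(2)  # returns a new DataFrame rounded to 2 decimal places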
.. _options.east_asian_width:
@@ -536,15 +343,11 @@ Unicode formatting
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters.
If a DataFrame or Series contains these characters, the default output mode may not align them properly.
-.. note:: Screen captures are attached for each output to show the actual results.
-
.. ipython:: python
df = pd.DataFrame({"国籍": ["UK", "日本"], "名前": ["Alice", "しのぶ"]})
df
-.. image:: ../_static/option_unicode01.png
-
Enabling ``display.unicode.east_asian_width`` allows pandas to check each character's "East Asian Width" property.
These characters can be aligned properly by setting this option to ``True``. However, this will result in longer render
times than the standard ``len`` function.
@@ -554,19 +357,16 @@ times than the standard ``len`` function.
pd.set_option("display.unicode.east_asian_width", True)
df
-.. image:: ../_static/option_unicode02.png
-
-In addition, Unicode characters whose width is "Ambiguous" can either be 1 or 2 characters wide depending on the
+In addition, Unicode characters whose width is "ambiguous" can either be 1 or 2 characters wide depending on the
terminal setting or encoding. The option ``display.unicode.ambiguous_as_wide`` can be used to handle the ambiguity.
-By default, an "Ambiguous" character's width, such as "¡" (inverted exclamation) in the example below, is taken to be 1.
+By default, an "ambiguous" character's width, such as "¡" (inverted exclamation) in the example below, is taken to be 1.
.. ipython:: python
df = pd.DataFrame({"a": ["xxx", "¡¡"], "b": ["yyy", "¡¡"]})
df
-.. image:: ../_static/option_unicode03.png
Enabling ``display.unicode.ambiguous_as_wide`` makes pandas interpret these characters' widths to be 2.
(Note that this option will only be effective when ``display.unicode.east_asian_width`` is enabled.)
@@ -578,7 +378,6 @@ However, setting this option incorrectly for your terminal will cause these char
pd.set_option("display.unicode.ambiguous_as_wide", True)
df
-.. image:: ../_static/option_unicode04.png
.. ipython:: python
:suppress:
@@ -591,8 +390,8 @@ However, setting this option incorrectly for your terminal will cause these char
Table schema display
--------------------
-``DataFrame`` and ``Series`` will publish a Table Schema representation
-by default. False by default, this can be enabled globally with the
+:class:`DataFrame` and :class:`Series` can publish a Table Schema representation.
+This is disabled by default but can be enabled globally with the
``display.html.table_schema`` option:
.. ipython:: python
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
index 7d1d03fe020a6..adca9de6c130a 100644
--- a/doc/source/user_guide/reshaping.rst
+++ b/doc/source/user_guide/reshaping.rst
@@ -13,37 +13,12 @@ Reshaping by pivoting DataFrame objects
.. image:: ../_static/reshaping_pivot.png
-.. ipython:: python
- :suppress:
-
- import pandas._testing as tm
-
- def unpivot(frame):
- N, K = frame.shape
- data = {
- "value": frame.to_numpy().ravel("F"),
- "variable": np.asarray(frame.columns).repeat(N),
- "date": np.tile(np.asarray(frame.index), K),
- }
- columns = ["date", "variable", "value"]
- return pd.DataFrame(data, columns=columns)
-
- df = unpivot(tm.makeTimeDataFrame(3))
-
Data is often stored in so-called "stacked" or "record" format:
.. ipython:: python
- df
-
-
-For the curious here is how the above ``DataFrame`` was created:
-
-.. code-block:: python
-
import pandas._testing as tm
-
def unpivot(frame):
N, K = frame.shape
data = {
@@ -53,14 +28,15 @@ For the curious here is how the above ``DataFrame`` was created:
}
return pd.DataFrame(data, columns=["date", "variable", "value"])
-
df = unpivot(tm.makeTimeDataFrame(3))
+ df
To select out everything for variable ``A`` we could do:
.. ipython:: python
- df[df["variable"] == "A"]
+ filtered = df[df["variable"] == "A"]
+ filtered
But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
@@ -70,11 +46,12 @@ top level function :func:`~pandas.pivot`):
.. ipython:: python
- df.pivot(index="date", columns="variable", values="value")
+ pivoted = df.pivot(index="date", columns="variable", values="value")
+ pivoted
-If the ``values`` argument is omitted, and the input ``DataFrame`` has more than
-one column of values which are not used as column or index inputs to ``pivot``,
-then the resulting "pivoted" ``DataFrame`` will have :ref:`hierarchical columns
+If the ``values`` argument is omitted, and the input :class:`DataFrame` has more than
+one column of values which are not used as column or index inputs to :meth:`~DataFrame.pivot`,
+then the resulting "pivoted" :class:`DataFrame` will have :ref:`hierarchical columns
` whose topmost level indicates the respective value
column:
@@ -84,7 +61,7 @@ column:
pivoted = df.pivot(index="date", columns="variable")
pivoted
-You can then select subsets from the pivoted ``DataFrame``:
+You can then select subsets from the pivoted :class:`DataFrame`:
.. ipython:: python
@@ -108,16 +85,16 @@ Reshaping by stacking and unstacking
Closely related to the :meth:`~DataFrame.pivot` method are the related
:meth:`~DataFrame.stack` and :meth:`~DataFrame.unstack` methods available on
-``Series`` and ``DataFrame``. These methods are designed to work together with
-``MultiIndex`` objects (see the section on :ref:`hierarchical indexing
+:class:`Series` and :class:`DataFrame`. These methods are designed to work together with
+:class:`MultiIndex` objects (see the section on :ref:`hierarchical indexing
`). Here are essentially what these methods do:
-* ``stack``: "pivot" a level of the (possibly hierarchical) column labels,
- returning a ``DataFrame`` with an index with a new inner-most level of row
+* :meth:`~DataFrame.stack`: "pivot" a level of the (possibly hierarchical) column labels,
+ returning a :class:`DataFrame` with an index with a new inner-most level of row
labels.
-* ``unstack``: (inverse operation of ``stack``) "pivot" a level of the
+* :meth:`~DataFrame.unstack`: (inverse operation of :meth:`~DataFrame.stack`) "pivot" a level of the
(possibly hierarchical) row index to the column axis, producing a reshaped
- ``DataFrame`` with a new inner-most level of column labels.
+ :class:`DataFrame` with a new inner-most level of column labels.
.. image:: ../_static/reshaping_unstack.png
@@ -139,22 +116,22 @@ from the hierarchical indexing section:
df2 = df[:4]
df2
-The ``stack`` function "compresses" a level in the ``DataFrame``'s columns to
+The :meth:`~DataFrame.stack` function "compresses" a level in the :class:`DataFrame` columns to
produce either:
-* A ``Series``, in the case of a simple column Index.
-* A ``DataFrame``, in the case of a ``MultiIndex`` in the columns.
+* A :class:`Series`, in the case of a simple column Index.
+* A :class:`DataFrame`, in the case of a :class:`MultiIndex` in the columns.
-If the columns have a ``MultiIndex``, you can choose which level to stack. The
-stacked level becomes the new lowest level in a ``MultiIndex`` on the columns:
+If the columns have a :class:`MultiIndex`, you can choose which level to stack. The
+stacked level becomes the new lowest level in a :class:`MultiIndex` on the columns:
.. ipython:: python
stacked = df2.stack()
stacked
-With a "stacked" ``DataFrame`` or ``Series`` (having a ``MultiIndex`` as the
-``index``), the inverse operation of ``stack`` is ``unstack``, which by default
+With a "stacked" :class:`DataFrame` or :class:`Series` (having a :class:`MultiIndex` as the
+``index``), the inverse operation of :meth:`~DataFrame.stack` is :meth:`~DataFrame.unstack`, which by default
unstacks the **last level**:
.. ipython:: python
@@ -177,9 +154,9 @@ the level numbers:
.. image:: ../_static/reshaping_unstack_0.png
-Notice that the ``stack`` and ``unstack`` methods implicitly sort the index
-levels involved. Hence a call to ``stack`` and then ``unstack``, or vice versa,
-will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
+Notice that the :meth:`~DataFrame.stack` and :meth:`~DataFrame.unstack` methods implicitly sort the index
+levels involved. Hence a call to :meth:`~DataFrame.stack` and then :meth:`~DataFrame.unstack`, or vice versa,
+will result in a **sorted** copy of the original :class:`DataFrame` or :class:`Series`:
.. ipython:: python
@@ -188,7 +165,7 @@ will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
df
all(df.unstack().stack() == df.sort_index())
-The above code will raise a ``TypeError`` if the call to ``sort_index`` is
+The above code will raise a ``TypeError`` if the call to :meth:`~DataFrame.sort_index` is
removed.
.. _reshaping.stack_multiple:
@@ -231,7 +208,7 @@ Missing data
These functions are intelligent about handling missing data and do not expect
each subgroup within the hierarchical index to have the same set of labels.
They also can handle the index being unsorted (but you can make it sorted by
-calling ``sort_index``, of course). Here is a more complex example:
+calling :meth:`~DataFrame.sort_index`, of course). Here is a more complex example:
.. ipython:: python
@@ -251,7 +228,7 @@ calling ``sort_index``, of course). Here is a more complex example:
df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
df2
-As mentioned above, ``stack`` can be called with a ``level`` argument to select
+As mentioned above, :meth:`~DataFrame.stack` can be called with a ``level`` argument to select
which level in the columns to stack:
.. ipython:: python
@@ -281,7 +258,7 @@ the value of missing data.
With a MultiIndex
~~~~~~~~~~~~~~~~~
-Unstacking when the columns are a ``MultiIndex`` is also careful about doing
+Unstacking when the columns are a :class:`MultiIndex` is also careful about doing
the right thing:
.. ipython:: python
@@ -297,7 +274,7 @@ Reshaping by melt
.. image:: ../_static/reshaping_melt.png
The top-level :func:`~pandas.melt` function and the corresponding :meth:`DataFrame.melt`
-are useful to massage a ``DataFrame`` into a format where one or more columns
+are useful to massage a :class:`DataFrame` into a format where one or more columns
are *identifier variables*, while all other columns, considered *measured
variables*, are "unpivoted" to the row axis, leaving just two non-identifier
columns, "variable" and "value". The names of those columns can be customized
@@ -363,7 +340,7 @@ user-friendly.
Combining with stats and GroupBy
--------------------------------
-It should be no shock that combining ``pivot`` / ``stack`` / ``unstack`` with
+It should be no shock that combining :meth:`~DataFrame.pivot` / :meth:`~DataFrame.stack` / :meth:`~DataFrame.unstack` with
GroupBy and the basic Series and DataFrame statistical functions can produce
some very expressive and fast data manipulations.
@@ -385,8 +362,6 @@ Pivot tables
.. _reshaping.pivot:
-
-
While :meth:`~DataFrame.pivot` provides general purpose pivoting with various
data types (strings, numerics, etc.), pandas also provides :func:`~pandas.pivot_table`
for pivoting with aggregation of numeric data.
@@ -437,30 +412,29 @@ We can produce pivot tables from this data very easily:
aggfunc=np.sum,
)
-The result object is a ``DataFrame`` having potentially hierarchical indexes on the
+The result object is a :class:`DataFrame` having potentially hierarchical indexes on the
rows and columns. If the ``values`` column name is not given, the pivot table
-will include all of the data that can be aggregated in an additional level of
-hierarchy in the columns:
+will include all of the data in an additional level of hierarchy in the columns:
.. ipython:: python
- pd.pivot_table(df, index=["A", "B"], columns=["C"])
+ pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])
-Also, you can use ``Grouper`` for ``index`` and ``columns`` keywords. For detail of ``Grouper``, see :ref:`Grouping with a Grouper specification `.
+Also, you can use :class:`Grouper` for ``index`` and ``columns`` keywords. For detail of :class:`Grouper`, see :ref:`Grouping with a Grouper specification `.
.. ipython:: python
pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
You can render a nice output of the table omitting the missing values by
-calling ``to_string`` if you wish:
+calling :meth:`~DataFrame.to_string` if you wish:
.. ipython:: python
- table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
+ table = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])
print(table.to_string(na_rep=""))
-Note that ``pivot_table`` is also available as an instance method on DataFrame,
+Note that :meth:`~DataFrame.pivot_table` is also available as an instance method on DataFrame,
i.e. :meth:`DataFrame.pivot_table`.
.. _reshaping.pivot.margins:
@@ -468,13 +442,27 @@ Note that ``pivot_table`` is also available as an instance method on DataFrame,
Adding margins
~~~~~~~~~~~~~~
-If you pass ``margins=True`` to ``pivot_table``, special ``All`` columns and
+If you pass ``margins=True`` to :meth:`~DataFrame.pivot_table`, special ``All`` columns and
rows will be added with partial group aggregates across the categories on the
rows and columns:
.. ipython:: python
- df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+ table = df.pivot_table(
+ index=["A", "B"],
+ columns="C",
+ values=["D", "E"],
+ margins=True,
+ aggfunc=np.std
+ )
+ table
+
+Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
+as having a multi-level index:
+
+.. ipython:: python
+
+ table.stack()
.. _reshaping.crosstabulations:
@@ -482,7 +470,7 @@ Cross tabulations
-----------------
Use :func:`~pandas.crosstab` to compute a cross-tabulation of two (or more)
-factors. By default ``crosstab`` computes a frequency table of the factors
+factors. By default :func:`~pandas.crosstab` computes a frequency table of the factors
unless an array of values and an aggregation function are passed.
It takes a number of arguments
@@ -501,7 +489,7 @@ It takes a number of arguments
Normalize by dividing all values by the sum of values.
-Any ``Series`` passed will have their name attributes used unless row or column
+Any :class:`Series` passed will have their name attributes used unless row or column
names for the cross-tabulation are specified
For example:
@@ -515,7 +503,7 @@ For example:
pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
-If ``crosstab`` receives only two Series, it will provide a frequency table.
+If :func:`~pandas.crosstab` receives only two Series, it will provide a frequency table.
.. ipython:: python
@@ -526,8 +514,8 @@ If ``crosstab`` receives only two Series, it will provide a frequency table.
pd.crosstab(df["A"], df["B"])
-``crosstab`` can also be implemented
-to ``Categorical`` data.
+:func:`~pandas.crosstab` can also be implemented
+to :class:`Categorical` data.
.. ipython:: python
@@ -560,9 +548,9 @@ using the ``normalize`` argument:
pd.crosstab(df["A"], df["B"], normalize="columns")
-``crosstab`` can also be passed a third ``Series`` and an aggregation function
-(``aggfunc``) that will be applied to the values of the third ``Series`` within
-each group defined by the first two ``Series``:
+:func:`~pandas.crosstab` can also be passed a third :class:`Series` and an aggregation function
+(``aggfunc``) that will be applied to the values of the third :class:`Series` within
+each group defined by the first two :class:`Series`:
.. ipython:: python
@@ -603,7 +591,7 @@ Alternatively we can specify custom bin-edges:
c = pd.cut(ages, bins=[0, 18, 35, 70])
c
-If the ``bins`` keyword is an ``IntervalIndex``, then these will be
+If the ``bins`` keyword is an :class:`IntervalIndex`, then these will be
used to bin the passed data.::
pd.cut([25, 20, 50], bins=c.categories)
@@ -614,9 +602,9 @@ used to bin the passed data.::
Computing indicator / dummy variables
-------------------------------------
-To convert a categorical variable into a "dummy" or "indicator" ``DataFrame``,
-for example a column in a ``DataFrame`` (a ``Series``) which has ``k`` distinct
-values, can derive a ``DataFrame`` containing ``k`` columns of 1s and 0s using
+To convert a categorical variable into a "dummy" or "indicator" :class:`DataFrame`,
+for example a column in a :class:`DataFrame` (a :class:`Series`) which has ``k`` distinct
+values, you can derive a :class:`DataFrame` containing ``k`` columns of 1s and 0s using
:func:`~pandas.get_dummies`:
.. ipython:: python
@@ -626,7 +614,7 @@ values, can derive a ``DataFrame`` containing ``k`` columns of 1s and 0s using
pd.get_dummies(df["key"])
Sometimes it's useful to prefix the column names, for example when merging the result
-with the original ``DataFrame``:
+with the original :class:`DataFrame`:
.. ipython:: python
@@ -635,7 +623,7 @@ with the original ``DataFrame``:
df[["data1"]].join(dummies)
-This function is often used along with discretization functions like ``cut``:
+This function is often used along with discretization functions like :func:`~pandas.cut`:
.. ipython:: python
@@ -648,7 +636,7 @@ This function is often used along with discretization functions like ``cut``:
See also :func:`Series.str.get_dummies `.
-:func:`get_dummies` also accepts a ``DataFrame``. By default all categorical
+:func:`get_dummies` also accepts a :class:`DataFrame`. By default all categorical
variables (categorical in the statistical sense, those with ``object`` or
``categorical`` dtype) are encoded as dummy variables.
@@ -669,8 +657,8 @@ Notice that the ``B`` column is still included in the output, it just hasn't
been encoded. You can drop ``B`` before calling ``get_dummies`` if you don't
want to include it in the output.
-As with the ``Series`` version, you can pass values for the ``prefix`` and
-``prefix_sep``. By default the column name is used as the prefix, and '_' as
+As with the :class:`Series` version, you can pass values for the ``prefix`` and
+``prefix_sep``. By default the column name is used as the prefix, and ``_`` as
the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways:
* string: Use the same value for ``prefix`` or ``prefix_sep`` for each column
@@ -718,6 +706,30 @@ To choose another dtype, use the ``dtype`` argument:
pd.get_dummies(df, dtype=bool).dtypes
+.. versionadded:: 1.5.0
+
+To convert a "dummy" or "indicator" ``DataFrame``, into a categorical ``DataFrame``,
+for example ``k`` columns of a ``DataFrame`` containing 1s and 0s can derive a
+``DataFrame`` which has ``k`` distinct values using
+:func:`~pandas.from_dummies`:
+
+.. ipython:: python
+
+ df = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})
+ df
+
+ pd.from_dummies(df, sep="_")
+
+Dummy coded data only requires ``k - 1`` categories to be included; in this case
+the ``k`` th category is the default category, implied by not being assigned any
+of the other ``k - 1`` categories. The default category can be passed via
+``default_category``.
+
+.. ipython:: python
+
+ df = pd.DataFrame({"prefix_a": [0, 1, 0]})
+ df
+
+ pd.from_dummies(df, sep="_", default_category="b")
.. _reshaping.factorize:
@@ -734,7 +746,7 @@ To encode 1-d values as an enumerated type use :func:`~pandas.factorize`:
labels
uniques
-Note that ``factorize`` is similar to ``numpy.unique``, but differs in its
+Note that :func:`~pandas.factorize` is similar to ``numpy.unique``, but differs in its
handling of NaN:
.. note::
@@ -742,16 +754,12 @@ handling of NaN:
because of an ordering bug. See also
`here `__.
-.. code-block:: ipython
-
- In [1]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
- In [2]: pd.factorize(x, sort=True)
- Out[2]:
- (array([ 2, 2, -1, 3, 0, 1]),
- Index([3.14, inf, 'A', 'B'], dtype='object'))
+.. ipython:: python
+ :okexcept:
- In [3]: np.unique(x, return_inverse=True)[::-1]
- Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
+ ser = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
+ pd.factorize(ser, sort=True)
+ np.unique(ser, return_inverse=True)[::-1]
.. note::
If you just want to handle one column as a categorical variable (like R's factor),
@@ -899,13 +907,13 @@ We can 'explode' the ``values`` column, transforming each list-like to a separat
df["values"].explode()
-You can also explode the column in the ``DataFrame``.
+You can also explode the column in the :class:`DataFrame`.
.. ipython:: python
df.explode("values")
-:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
+:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting :class:`Series` is always ``object``.
.. ipython:: python
diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
index 71aef4fdd75f6..129f43dd36930 100644
--- a/doc/source/user_guide/scale.rst
+++ b/doc/source/user_guide/scale.rst
@@ -18,36 +18,9 @@ tool for all situations. If you're working with very large datasets and a tool
like PostgreSQL fits your needs, then you should probably be using that.
Assuming you want or need the expressiveness and power of pandas, let's carry on.
-.. ipython:: python
-
- import pandas as pd
- import numpy as np
-
-.. ipython:: python
- :suppress:
-
- from pandas._testing import _make_timeseries
-
- # Make a random in-memory dataset
- ts = _make_timeseries(freq="30S", seed=0)
- ts.to_csv("timeseries.csv")
- ts.to_parquet("timeseries.parquet")
-
-
Load less data
--------------
-.. ipython:: python
- :suppress:
-
- # make a similar dataset with many columns
- timeseries = [
- _make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
- for i in range(10)
- ]
- ts_wide = pd.concat(timeseries, axis=1)
- ts_wide.to_parquet("timeseries_wide.parquet")
-
Suppose our raw dataset on disk has many columns::
id_0 name_0 x_0 y_0 id_1 name_1 x_1 ... name_8 x_8 y_8 id_9 name_9 x_9 y_9
@@ -66,6 +39,34 @@ Suppose our raw dataset on disk has many columns::
[525601 rows x 40 columns]
+That can be generated by the following code snippet:
+
+.. ipython:: python
+
+ import pandas as pd
+ import numpy as np
+
+ def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
+ index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
+ n = len(index)
+ state = np.random.RandomState(seed)
+ columns = {
+ "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
+ "id": state.poisson(1000, size=n),
+ "x": state.rand(n) * 2 - 1,
+ "y": state.rand(n) * 2 - 1,
+ }
+ df = pd.DataFrame(columns, index=index, columns=sorted(columns))
+ if df.index[-1] == end:
+ df = df.iloc[:-1]
+ return df
+
+ timeseries = [
+ make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
+ for i in range(10)
+ ]
+ ts_wide = pd.concat(timeseries, axis=1)
+ ts_wide.to_parquet("timeseries_wide.parquet")
To load the columns we want, we have two options.
Option 1 loads in all the data and then filters to what we need.
@@ -82,6 +83,13 @@ Option 2 only loads the columns we request.
pd.read_parquet("timeseries_wide.parquet", columns=columns)
+.. ipython:: python
+ :suppress:
+
+ import os
+
+ os.remove("timeseries_wide.parquet")
+
If we were to measure the memory usage of the two calls, we'd see that specifying
``columns`` uses about 1/10th the memory in this case.
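+
+One way to check this yourself is to compare the footprint of the full frame
+against the column subset, e.g. with :meth:`DataFrame.memory_usage` (a sketch,
+assuming the parquet file and the ``columns`` list from above are still available):
+
+.. code-block:: python
+
+   full = pd.read_parquet("timeseries_wide.parquet")
+   subset = pd.read_parquet("timeseries_wide.parquet", columns=columns)
+
+   # deep=True also counts the memory held by object (string) columns
+   full.memory_usage(deep=True).sum()
+   subset.memory_usage(deep=True).sum()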
@@ -99,9 +107,16 @@ can store larger datasets in memory.
.. ipython:: python
+ ts = make_timeseries(freq="30S", seed=0)
+ ts.to_parquet("timeseries.parquet")
ts = pd.read_parquet("timeseries.parquet")
ts
+.. ipython:: python
+ :suppress:
+
+ os.remove("timeseries.parquet")
+
Now, let's inspect the data types and memory usage to see where we should focus our
attention.
@@ -116,7 +131,7 @@ attention.
The ``name`` column is taking up much more memory than any other. It has just a
few unique values, so it's a good candidate for converting to a
-:class:`Categorical`. With a Categorical, we store each unique name once and use
+:class:`pandas.Categorical`. With a :class:`pandas.Categorical`, we store each unique name once and use
space-efficient integers to know which specific name is used in each row.
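+
+A minimal sketch of such a conversion on the ``ts`` frame loaded above:
+
+.. code-block:: python
+
+   ts2 = ts.copy()
+   # Each unique name is stored once; rows hold small integer codes instead of strings
+   ts2["name"] = ts2["name"].astype("category")
+   ts2.memory_usage(deep=True)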
@@ -147,7 +162,7 @@ using :func:`pandas.to_numeric`.
In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
original size.
-See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes`
+See :ref:`categorical` for more on :class:`pandas.Categorical` and :ref:`basics.dtypes`
for an overview of all of pandas' dtypes.
Use chunking
@@ -168,7 +183,6 @@ Suppose we have an even larger "logical dataset" on disk that's a directory of p
files. Each file in the directory represents a different year of the entire dataset.
.. ipython:: python
- :suppress:
import pathlib
@@ -179,7 +193,7 @@ files. Each file in the directory represents a different year of the entire data
pathlib.Path("data/timeseries").mkdir(exist_ok=True)
for i, (start, end) in enumerate(zip(starts, ends)):
- ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
+ ts = make_timeseries(start=start, end=end, freq="1T", seed=i)
ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
@@ -200,7 +214,7 @@ files. Each file in the directory represents a different year of the entire data
├── ts-10.parquet
└── ts-11.parquet
-Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this
+Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
workflow is the single largest chunk, plus a small series storing the unique value
counts up to this point. As long as each individual file fits in memory, this will
work for arbitrary-sized datasets.
@@ -211,9 +225,7 @@ work for arbitrary-sized datasets.
files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
counts = pd.Series(dtype=int)
for path in files:
- # Only one dataframe is in memory at a time...
df = pd.read_parquet(path)
- # ... plus a small Series ``counts``, which is updated.
counts = counts.add(df["name"].value_counts(), fill_value=0)
counts.astype(int)
@@ -221,7 +233,7 @@ Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
``chunksize`` when reading a single file.
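+
+A rough sketch of that pattern (assuming a large ``timeseries.csv`` similar to
+the dataset used above):
+
+.. code-block:: python
+
+   counts = pd.Series(dtype=int)
+   # Each iteration yields a DataFrame with at most 1_000_000 rows
+   for chunk in pd.read_csv("timeseries.csv", chunksize=1_000_000):
+       counts = counts.add(chunk["name"].value_counts(), fill_value=0)
+   counts.astype(int)
+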
Manually chunking is an OK option for workflows that don't
-require too sophisticated of operations. Some operations, like ``groupby``, are
+require overly sophisticated operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are
much harder to do chunkwise. In these cases, you may be better switching to a
different library that implements these out-of-core algorithms for you.
@@ -259,7 +271,7 @@ Inspecting the ``ddf`` object, we see a few things
* There are new attributes like ``.npartitions`` and ``.divisions``
The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas DataFrames. A single method call on a
+DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a
Dask DataFrame ends up making many pandas method calls, and Dask knows how to
coordinate everything to get the result.
@@ -275,6 +287,7 @@ column names and dtypes. That's because Dask hasn't actually read the data yet.
Rather than executing immediately, doing operations build up a **task graph**.
.. ipython:: python
+ :okwarning:
ddf
ddf["name"]
@@ -282,8 +295,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
Each of these calls is instant because the result isn't being computed yet.
We're just building up a list of computation to do when someone needs the
-result. Dask knows that the return type of a ``pandas.Series.value_counts``
-is a pandas Series with a certain dtype and a certain name. So the Dask version
+result. Dask knows that the return type of a :meth:`pandas.Series.value_counts`
+is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
returns a Dask Series with the same dtype and the same name.
To get the actual result you can call ``.compute()``.
@@ -293,13 +306,13 @@ To get the actual result you can call ``.compute()``.
%time ddf["name"].value_counts().compute()
At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas Series with the count of each ``name``.
+a concrete pandas :class:`pandas.Series` with the count of each ``name``.
Calling ``.compute`` causes the full task graph to be executed. This includes
reading the data, selecting the columns, and doing the ``value_counts``. The
execution is done *in parallel* where possible, and Dask tries to keep the
overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
+than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
By default, ``dask.dataframe`` operations use a threadpool to do operations in
parallel. We can also connect to a cluster to distribute the work on many
@@ -333,6 +346,7 @@ known automatically. In this case, since we created the parquet files manually,
we need to supply the divisions manually.
.. ipython:: python
+ :okwarning:
N = 12
starts = [f"20{i:>02d}-01-01" for i in range(N)]
@@ -364,6 +378,13 @@ out of memory. At that point it's just a regular pandas object.
@savefig dask_resample.png
ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()
+.. ipython:: python
+ :suppress:
+
+ import shutil
+
+ shutil.rmtree("data/timeseries")
+
These Dask examples have all been done using multiple processes on a single
machine. Dask can be `deployed on a cluster
`_ to scale up to even larger
diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst
index 52d99533c1f60..bc4eec1c23a35 100644
--- a/doc/source/user_guide/sparse.rst
+++ b/doc/source/user_guide/sparse.rst
@@ -23,7 +23,7 @@ array that are ``nan`` aren't actually stored, only the non-``nan`` elements are
Those non-``nan`` elements have a ``float64`` dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a
-large, mostly NA ``DataFrame``:
+large, mostly NA :class:`DataFrame`:
.. ipython:: python
@@ -139,7 +139,7 @@ Sparse calculation
------------------
You can apply NumPy `ufuncs `_
-to ``SparseArray`` and get a ``SparseArray`` as a result.
+to :class:`arrays.SparseArray` and get a :class:`arrays.SparseArray` as a result.
.. ipython:: python
@@ -183,7 +183,7 @@ your code, rather than ignoring the warning.
**Construction**
From an array-like, use the regular :class:`Series` or
-:class:`DataFrame` constructors with :class:`SparseArray` values.
+:class:`DataFrame` constructors with :class:`arrays.SparseArray` values.
.. code-block:: python
@@ -240,7 +240,7 @@ Sparse-specific properties, like ``density``, are available on the ``.sparse`` a
**General differences**
In a ``SparseDataFrame``, *all* columns were sparse. A :class:`DataFrame` can have a mixture of
-sparse and dense columns. As a consequence, assigning new columns to a ``DataFrame`` with sparse
+sparse and dense columns. As a consequence, assigning new columns to a :class:`DataFrame` with sparse
values will not automatically convert the input to be sparse.
.. code-block:: python
@@ -266,10 +266,10 @@ have no replacement.
.. _sparse.scipysparse:
-Interaction with scipy.sparse
------------------------------
+Interaction with *scipy.sparse*
+-------------------------------
-Use :meth:`DataFrame.sparse.from_spmatrix` to create a ``DataFrame`` with sparse values from a sparse matrix.
+Use :meth:`DataFrame.sparse.from_spmatrix` to create a :class:`DataFrame` with sparse values from a sparse matrix.
.. versionadded:: 0.25.0
@@ -294,9 +294,9 @@ To convert back to sparse SciPy matrix in COO format, you can use the :meth:`Dat
sdf.sparse.to_coo()
-meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
+:meth:`Series.sparse.to_coo` is implemented for transforming a :class:`Series` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
-The method requires a ``MultiIndex`` with two or more levels.
+The method requires a :class:`MultiIndex` with two or more levels.
.. ipython:: python
@@ -315,7 +315,7 @@ The method requires a ``MultiIndex`` with two or more levels.
ss = s.astype('Sparse')
ss
-In the example below, we transform the ``Series`` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
+In the example below, we transform the :class:`Series` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
.. ipython:: python
@@ -341,7 +341,7 @@ Specifying different row and column labels (and not sorting them) yields a diffe
rows
columns
-A convenience method :meth:`Series.sparse.from_coo` is implemented for creating a ``Series`` with sparse values from a ``scipy.sparse.coo_matrix``.
+A convenience method :meth:`Series.sparse.from_coo` is implemented for creating a :class:`Series` with sparse values from a ``scipy.sparse.coo_matrix``.
.. ipython:: python
@@ -350,7 +350,7 @@ A convenience method :meth:`Series.sparse.from_coo` is implemented for creating
A
A.todense()
-The default behaviour (with ``dense_index=False``) simply returns a ``Series`` containing
+The default behaviour (with ``dense_index=False``) simply returns a :class:`Series` containing
only the non-null entries.
.. ipython:: python
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb
index 7d8d8e90dfbda..620e3806a33b5 100644
--- a/doc/source/user_guide/style.ipynb
+++ b/doc/source/user_guide/style.ipynb
@@ -11,7 +11,7 @@
"\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[viz]: visualization.rst\n",
- "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb"
+ "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/main/doc/source/user_guide/style.ipynb"
]
},
{
@@ -49,6 +49,7 @@
"source": [
"import pandas as pd\n",
"import numpy as np\n",
+ "import matplotlib as mpl\n",
"\n",
"df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]], \n",
" index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'), \n",
@@ -60,9 +61,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.render()][render] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n",
+ "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.to_html()][tohtml] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n",
"\n",
- "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst"
+ "[tohtml]: ../reference/api/pandas.io.formats.style.Styler.to_html.rst"
]
},
{
@@ -150,15 +151,14 @@
"\n",
"### Formatting Values\n",
"\n",
- "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value. To control the display value, the text is printed in each cell, and we can use the [.format()][formatfunc] method to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table or for individual columns. \n",
+ "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavalues and index or columns headers. To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n",
"\n",
- "Additionally, the format function has a **precision** argument to specifically help formatting floats, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML. The default formatter is configured to adopt pandas' regular `display.precision` option, controllable using `with pd.option_context('display.precision', 2):`\n",
- "\n",
- "Here is an example of using the multiple options to control the formatting generally and with specific column formatters.\n",
+ "Additionally, the format function has a **precision** argument to specifically help formatting floats, as well as **decimal** and **thousands** separators to support other locales, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML or safe-LaTeX. The default formatter is configured to adopt pandas' `styler.format.precision` option, controllable using `with pd.option_context('format.precision', 2):` \n",
"\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[format]: https://docs.python.org/3/library/string.html#format-specification-mini-language\n",
- "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst"
+ "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst\n",
+ "[formatfuncindex]: ../reference/api/pandas.io.formats.style.Styler.format_index.rst"
]
},
{
@@ -167,28 +167,72 @@
"metadata": {},
"outputs": [],
"source": [
- "df.style.format(precision=0, na_rep='MISSING', \n",
+ "df.style.format(precision=0, na_rep='MISSING', thousands=\" \",\n",
" formatter={('Decision Tree', 'Tumour'): \"{:.2f}\",\n",
- " ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e3)\n",
+ " ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e6)\n",
" })"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using Styler to manipulate the display is a useful feature because maintaining the indexing and datavalues for other purposes gives greater control. You do not have to overwrite your DataFrame to display it how you like. Here is an example of using the formatting functions whilst still relying on the underlying data for indexing and calculations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weather_df = pd.DataFrame(np.random.rand(10,2)*5, \n",
+ " index=pd.date_range(start=\"2021-01-01\", periods=10),\n",
+ " columns=[\"Tokyo\", \"Beijing\"])\n",
+ "\n",
+ "def rain_condition(v): \n",
+ " if v < 1.75:\n",
+ " return \"Dry\"\n",
+ " elif v < 2.75:\n",
+ " return \"Rain\"\n",
+ " return \"Heavy Rain\"\n",
+ "\n",
+ "def make_pretty(styler):\n",
+ " styler.set_caption(\"Weather Conditions\")\n",
+ " styler.format(rain_condition)\n",
+ " styler.format_index(lambda v: v.strftime(\"%A\"))\n",
+ " styler.background_gradient(axis=None, vmin=1, vmax=5, cmap=\"YlGnBu\")\n",
+ " return styler\n",
+ "\n",
+ "weather_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weather_df.loc[\"2021-01-04\":\"2021-01-08\"].style.pipe(make_pretty)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hiding Data\n",
"\n",
- "The index can be hidden from rendering by calling [.hide_index()][hideidx], which might be useful if your index is integer based.\n",
+ "The index and column headers can be completely hidden, as well subselecting rows or columns that one wishes to exclude. Both these options are performed using the same methods.\n",
"\n",
- "Columns can be hidden from rendering by calling [.hide_columns()][hidecols] and passing in the name of a column, or a slice of columns.\n",
+ "The index can be hidden from rendering by calling [.hide()][hideidx] without any arguments, which might be useful if your index is integer based. Similarly column headers can be hidden by calling [.hide(axis=\"columns\")][hideidx] without any further arguments.\n",
"\n",
- "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will start at `col2`, since `col0` and `col1` are simply ignored.\n",
+ "Specific rows or columns can be hidden from rendering by calling the same [.hide()][hideidx] method and passing in a row/column label, a list-like or a slice of row/column labels to for the ``subset`` argument.\n",
"\n",
- "We can update our `Styler` object to hide some data and format the values.\n",
+ "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will still start at `col2`, since `col0` and `col1` are simply ignored.\n",
"\n",
- "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide_index.rst\n",
- "[hidecols]: ../reference/api/pandas.io.formats.style.Styler.hide_columns.rst"
+ "We can update our `Styler` object from before to hide some data and format the values.\n",
+ "\n",
+ "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide.rst"
]
},
{
@@ -197,7 +241,7 @@
"metadata": {},
"outputs": [],
"source": [
- "s = df.style.format('{:.0f}').hide_columns([('Random', 'Tumour'), ('Random', 'Non-Tumour')])\n",
+ "s = df.style.format('{:.0f}').hide([('Random', 'Tumour'), ('Random', 'Non-Tumour')], axis=\"columns\")\n",
"s"
]
},
@@ -223,13 +267,15 @@
"\n",
"- Using [.set_table_styles()][table] to control broader areas of the table with specified internal CSS. Although table styles allow the flexibility to add CSS selectors and properties controlling all individual parts of the table, they are unwieldy for individual cell specifications. Also, note that table styles cannot be exported to Excel. \n",
"- Using [.set_td_classes()][td_class] to directly link either external CSS classes to your data cells or link the internal CSS classes created by [.set_table_styles()][table]. See [here](#Setting-Classes-and-Linking-to-External-CSS). These cannot be used on column header rows or indexes, and also won't export to Excel. \n",
- "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). These cannot be used on column header rows or indexes, but only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n",
+ "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). As of v1.4.0 there are also methods that work directly on column header rows or indexes; [.apply_index()][applyindex] and [.applymap_index()][applymapindex]. Note that only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n",
"\n",
"[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[td_class]: ../reference/api/pandas.io.formats.style.Styler.set_td_classes.rst\n",
"[apply]: ../reference/api/pandas.io.formats.style.Styler.apply.rst\n",
"[applymap]: ../reference/api/pandas.io.formats.style.Styler.applymap.rst\n",
+ "[applyindex]: ../reference/api/pandas.io.formats.style.Styler.apply_index.rst\n",
+ "[applymapindex]: ../reference/api/pandas.io.formats.style.Styler.applymap_index.rst\n",
"[dfapply]: ../reference/api/pandas.DataFrame.apply.rst\n",
"[dfapplymap]: ../reference/api/pandas.DataFrame.applymap.rst"
]
@@ -377,7 +423,7 @@
"metadata": {},
"outputs": [],
"source": [
- "out = s.set_table_attributes('class=\"my-table-cls\"').render()\n",
+ "out = s.set_table_attributes('class=\"my-table-cls\"').to_html()\n",
"print(out[out.find('
"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Acting on the Index and Column Headers\n",
+ "\n",
+ "Similar application is achieved for headers by using:\n",
+ " \n",
+ "- [.applymap_index()][applymapindex] (elementwise): accepts a function that takes a single value and returns a string with the CSS attribute-value pair.\n",
+ "- [.apply_index()][applyindex] (level-wise): accepts a function that takes a Series and returns a Series, or numpy array with an identical shape where each element is a string with a CSS attribute-value pair. This method passes each level of your Index one-at-a-time. To style the index use `axis=0` and to style the column headers use `axis=1`.\n",
+ "\n",
+ "You can select a `level` of a `MultiIndex` but currently no similar `subset` application is available for these methods.\n",
+ "\n",
+ "[applyindex]: ../reference/api/pandas.io.formats.style.Styler.apply_index.rst\n",
+ "[applymapindex]: ../reference/api/pandas.io.formats.style.Styler.applymap_index.rst"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "s2.applymap_index(lambda v: \"color:pink;\" if v>4 else \"color:darkblue;\", axis=0)\n",
+ "s2.apply_index(lambda s: np.where(s.isin([\"A\", \"B\"]), \"color:pink;\", \"color:darkblue;\"), axis=1)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -959,7 +1046,7 @@
"source": [
"### 5. If every byte counts use string replacement\n",
"\n",
- "You can remove unnecessary HTML, or shorten the default class names with string replace functions."
+ "You can remove unnecessary HTML, or shorten the default class names by replacing the default css dict. You can read a little more about CSS [below](#More-About-CSS-and-HTML)."
]
},
{
@@ -968,21 +1055,24 @@
"metadata": {},
"outputs": [],
"source": [
- "html = Styler(df4, uuid_len=0, cell_ids=False)\\\n",
- " .set_table_styles([{'selector': 'td', 'props': props},\n",
- " {'selector': '.col1', 'props': 'color:green;'},\n",
- " {'selector': '.level0', 'props': 'color:blue;'}])\\\n",
- " .render()\\\n",
- " .replace('blank', '')\\\n",
- " .replace('data', '')\\\n",
- " .replace('level0', 'l0')\\\n",
- " .replace('col_heading', '')\\\n",
- " .replace('row_heading', '')\n",
- "\n",
- "import re\n",
- "html = re.sub(r'col[0-9]+', lambda x: x.group().replace('col', 'c'), html)\n",
- "html = re.sub(r'row[0-9]+', lambda x: x.group().replace('row', 'r'), html)\n",
- "print(html)"
+ "my_css = {\n",
+ " \"row_heading\": \"\",\n",
+ " \"col_heading\": \"\",\n",
+ " \"index_name\": \"\",\n",
+ " \"col\": \"c\",\n",
+ " \"row\": \"r\",\n",
+ " \"col_trim\": \"\",\n",
+ " \"row_trim\": \"\",\n",
+ " \"level\": \"l\",\n",
+ " \"data\": \"\",\n",
+ " \"blank\": \"\",\n",
+ "}\n",
+ "html = Styler(df4, uuid_len=0, cell_ids=False)\n",
+ "html.set_table_styles([{'selector': 'td', 'props': props},\n",
+ " {'selector': '.c1', 'props': 'color:green;'},\n",
+ " {'selector': '.l0', 'props': 'color:blue;'}],\n",
+ " css_class_names=my_css)\n",
+ "print(html.to_html())"
]
},
{
@@ -991,8 +1081,7 @@
"metadata": {},
"outputs": [],
"source": [
- "from IPython.display import HTML\n",
- "HTML(html)"
+ "html"
]
},
{
@@ -1011,7 +1100,7 @@
" - [.highlight_null][nullfunc]: for use with identifying missing data. \n",
" - [.highlight_min][minfunc] and [.highlight_max][maxfunc]: for use with identifying extremeties in data.\n",
" - [.highlight_between][betweenfunc] and [.highlight_quantile][quantilefunc]: for use with identifying classes within data.\n",
- " - [.background_gradient][bgfunc]: a flexible method for highlighting cells based or their, or other, values on a numeric scale.\n",
+ " - [.background_gradient][bgfunc]: a flexible method for highlighting cells based on their, or other, values on a numeric scale.\n",
" - [.text_gradient][textfunc]: similar method for highlighting text based on their, or other, values on a numeric scale.\n",
" - [.bar][barfunc]: to display mini-charts within cell backgrounds.\n",
" \n",
@@ -1042,7 +1131,7 @@
"source": [
"df2.iloc[0,2] = np.nan\n",
"df2.iloc[4,3] = np.nan\n",
- "df2.loc[:4].style.highlight_null(null_color='yellow')"
+ "df2.loc[:4].style.highlight_null(color='yellow')"
]
},
{
@@ -1107,7 +1196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap."
+ "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](http://seaborn.pydata.org/) to get a nice colormap."
]
},
{
@@ -1188,9 +1277,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In version 0.20.0 the ability to customize the bar chart further was given. You can now have the `df.style.bar` be centered on zero or midpoint value (in addition to the already existing way of having the min value at the left side of the cell), and you can pass a list of `[color_negative, color_positive]`.\n",
+ "Additional keyword arguments give more control on centering and positioning, and you can pass a list of `[color_negative, color_positive]` to highlight lower and higher values or a matplotlib colormap.\n",
"\n",
- "Here's how you can change the above with the new `align='mid'` option:"
+ "To showcase an example here's how you can change the above with the new `align` option, combined with setting `vmin` and `vmax` limits, the `width` of the figure, and underlying css `props` of cells, leaving space to display the text and the bars. We also use `text_gradient` to color the text the same as the bars using a matplotlib colormap (although in this case the visualization is probably better without this additional effect)."
]
},
{
@@ -1199,7 +1288,10 @@
"metadata": {},
"outputs": [],
"source": [
- "df2.style.bar(subset=['A', 'B'], align='mid', color=['#d65f5f', '#5fba7d'])"
+ "df2.style.format('{:.3f}', na_rep=\"\")\\\n",
+ " .bar(align=0, vmin=-2.5, vmax=2.5, cmap=\"bwr\", height=50,\n",
+ " width=60, props=\"width: 120px; border-right: 1px solid black;\")\\\n",
+ " .text_gradient(cmap=\"bwr\", vmin=-2.5, vmax=2.5)"
]
},
{
@@ -1223,30 +1315,33 @@
"\n",
"# Test series\n",
"test1 = pd.Series([-100,-60,-30,-20], name='All Negative')\n",
- "test2 = pd.Series([10,20,50,100], name='All Positive')\n",
- "test3 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')\n",
+ "test2 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')\n",
+ "test3 = pd.Series([10,20,50,100], name='All Positive')\n",
+ "test4 = pd.Series([100, 103, 101, 102], name='Large Positive')\n",
+ "\n",
"\n",
"head = \"\"\"\n",
"
\".format(align)\n",
- " for series in [test1,test2,test3]:\n",
+ " for series in [test1,test2,test3, test4]:\n",
" s = series.copy()\n",
" s.name=''\n",
- " row += \"
'\n",
" head += row\n",
" \n",
@@ -1284,8 +1379,12 @@
"metadata": {},
"outputs": [],
"source": [
- "style1 = df2.style.applymap(style_negative, props='color:red;')\\\n",
- " .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)"
+ "style1 = df2.style\\\n",
+ " .applymap(style_negative, props='color:red;')\\\n",
+ " .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)\\\n",
+ " .set_table_styles([{\"selector\": \"th\", \"props\": \"color: blue;\"}])\\\n",
+ " .hide(axis=\"index\")\n",
+ "style1"
]
},
{
@@ -1312,13 +1411,10 @@
"source": [
"## Limitations\n",
"\n",
- "- DataFrame only `(use Series.to_frame().style)`\n",
- "- The index and columns must be unique\n",
+ "- DataFrame only (use `Series.to_frame().style`)\n",
+ "- The index and columns do not need to be unique, but certain styling functions can only work with unique indexes.\n",
"- No large repr, and construction performance isn't great; although we have some [HTML optimizations](#Optimization)\n",
- "- You can only style the *values*, not the index or columns (except with `table_styles` above)\n",
- "- You can only apply styles, you can't insert new HTML entities\n",
- "\n",
- "Some of these might be addressed in the future. "
+ "- You can only apply styles, you can't insert new HTML entities, except via subclassing."
]
},
{
@@ -1403,7 +1499,9 @@
"source": [
"### Sticky Headers\n",
"\n",
- "If you display a large matrix or DataFrame in a notebook, but you want to always see the column and row headers you can use the following CSS to make them stick. We might make this into an API function later."
+ "If you display a large matrix or DataFrame in a notebook, but you want to always see the column and row headers you can use the [.set_sticky][sticky] method which manipulates the table styles CSS.\n",
+ "\n",
+ "[sticky]: ../reference/api/pandas.io.formats.style.Styler.set_sticky.rst"
]
},
{
@@ -1412,20 +1510,15 @@
"metadata": {},
"outputs": [],
"source": [
- "bigdf = pd.DataFrame(np.random.randn(15, 100))\n",
- "bigdf.style.set_table_styles([\n",
- " {'selector': 'thead th', 'props': 'position: sticky; top:0; background-color:salmon;'},\n",
- " {'selector': 'tbody th', 'props': 'position: sticky; left:0; background-color:lightgreen;'} \n",
- "])"
+ "bigdf = pd.DataFrame(np.random.randn(16, 100))\n",
+ "bigdf.style.set_sticky(axis=\"index\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Hiding Headers\n",
- "\n",
- "We don't yet have any API to hide headers so a quick fix is:"
+ "It is also possible to stick MultiIndexes and even only specific levels."
]
},
{
@@ -1434,7 +1527,8 @@
"metadata": {},
"outputs": [],
"source": [
- "df3.style.set_table_styles([{'selector': 'thead tr', 'props': 'display: none;'}]) # or 'thead th'"
+ "bigdf.index = pd.MultiIndex.from_product([[\"A\",\"B\"],[0,1],[0,1,2,3]])\n",
+ "bigdf.style.set_sticky(axis=\"index\", pixel_size=18, levels=[1,2])"
]
},
{
@@ -1483,6 +1577,9 @@
"Some support (*since version 0.20.0*) is available for exporting styled `DataFrames` to Excel worksheets using the `OpenPyXL` or `XlsxWriter` engines. CSS2.2 properties handled include:\n",
"\n",
"- `background-color`\n",
+ "- `border-style` properties\n",
+ "- `border-width` properties\n",
+ "- `border-color` properties\n",
"- `color`\n",
"- `font-family`\n",
"- `font-style`\n",
@@ -1493,12 +1590,13 @@
"- `white-space: nowrap`\n",
"\n",
"\n",
- "- Currently broken: `border-style`, `border-width`, `border-color` and their {`top`, `right`, `bottom`, `left` variants}\n",
+ "- Shorthand and side-specific border properties are supported (e.g. `border-style` and `border-left-style`) as well as the `border` shorthands for all sides (`border: 1px solid green`) or specified sides (`border-left: 1px solid green`). Using a `border` shorthand will override any border properties set before it (See [CSS Working Group](https://drafts.csswg.org/css-backgrounds/#border-shorthands) for more details)\n",
"\n",
"\n",
"- Only CSS2 named colors and hex colors of the form `#rgb` or `#rrggbb` are currently supported.\n",
- "- The following pseudo CSS properties are also available to set excel specific style properties:\n",
+ "- The following pseudo CSS properties are also available to set Excel specific style properties:\n",
" - `number-format`\n",
+ " - `border-style` (for Excel-specific styles: \"hair\", \"mediumDashDot\", \"dashDotDot\", \"mediumDashDotDot\", \"dashDot\", \"slantDashDot\", or \"mediumDashed\")\n",
"\n",
"Table level styles, and data cell CSS-classes are not included in the export to Excel: individual cells must have their properties mapped by the `Styler.apply` and/or `Styler.applymap` methods."
]
@@ -1524,6 +1622,17 @@
"\n"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Export to LaTeX\n",
+ "\n",
+ "There is support (*since version 1.3.0*) to export `Styler` to LaTeX. The documentation for the [.to_latex][latex] method gives further detail and numerous examples.\n",
+ "\n",
+ "[latex]: ../reference/api/pandas.io.formats.style.Styler.to_latex.rst"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -1555,12 +1664,13 @@
" + `row`, where `m` is the numeric position of the cell.\n",
" + `col`, where `n` is the numeric position of the cell.\n",
"- Blank cells include `blank`\n",
+ "- Trimmed cells include `col_trim` or `row_trim`\n",
"\n",
"The structure of the `id` is `T_uuid_level_row_col` where `level` is used only on headings, and headings will only have either `row` or `col` whichever is needed. By default we've also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesn't collide with the styling from another within the same notebook or page. You can read more about the use of UUIDs in [Optimization](#Optimization).\n",
"\n",
- "We can see example of the HTML by calling the [.render()][render] method.\n",
+ "We can see example of the HTML by calling the [.to_html()][tohtml] method.\n",
"\n",
- "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst"
+ "[tohtml]: ../reference/api/pandas.io.formats.style.Styler.to_html.rst"
]
},
{
@@ -1569,7 +1679,7 @@
"metadata": {},
"outputs": [],
"source": [
- "print(pd.DataFrame([[1,2],[3,4]], index=['i1', 'i2'], columns=['c1', 'c2']).style.render())"
+ "print(pd.DataFrame([[1,2],[3,4]], index=['i1', 'i2'], columns=['c1', 'c2']).style.to_html())"
]
},
{
@@ -1653,7 +1763,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the above case the text is blue because the selector `#T_b_ .cls-1` is worth 110 (ID plus class), which takes precendence."
+ "In the above case the text is blue because the selector `#T_b_ .cls-1` is worth 110 (ID plus class), which takes precedence."
]
},
{
@@ -1769,7 +1879,7 @@
" Styler.loader, # the default\n",
" ])\n",
" )\n",
- " template_html = env.get_template(\"myhtml.tpl\")"
+ " template_html_table = env.get_template(\"myhtml.tpl\")"
]
},
{
@@ -1796,7 +1906,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Our custom template accepts a `table_title` keyword. We can provide the value in the `.render` method."
+ "Our custom template accepts a `table_title` keyword. We can provide the value in the `.to_html` method."
]
},
{
@@ -1805,7 +1915,7 @@
"metadata": {},
"outputs": [],
"source": [
- "HTML(MyStyler(df3).render(table_title=\"Extending Example\"))"
+ "HTML(MyStyler(df3).to_html(table_title=\"Extending Example\"))"
]
},
{
@@ -1822,14 +1932,63 @@
"outputs": [],
"source": [
"EasyStyler = Styler.from_custom_template(\"templates\", \"myhtml.tpl\")\n",
- "EasyStyler(df3)"
+ "HTML(EasyStyler(df3).to_html(table_title=\"Another Title\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Template Structure\n",
+ "\n",
+ "Here's the template structure for the both the style generation template and the table generation template:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Style template:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "nbsphinx": "hidden"
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"templates/html_style_structure.html\") as f:\n",
+ " style_structure = f.read()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "HTML(style_structure)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Here's the template structure:"
+ "Table template:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "nbsphinx": "hidden"
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"templates/html_table_structure.html\") as f:\n",
+ " table_structure = f.read()"
]
},
{
@@ -1838,10 +1997,7 @@
"metadata": {},
"outputs": [],
"source": [
- "with open(\"templates/template_structure.html\") as f:\n",
- " structure = f.read()\n",
- " \n",
- "HTML(structure)"
+ "HTML(table_structure)"
]
},
{
@@ -1871,7 +2027,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -1885,7 +2041,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.6"
+ "version": "3.9.5"
}
},
"nbformat": 4,
diff --git a/doc/source/user_guide/templates/html_style_structure.html b/doc/source/user_guide/templates/html_style_structure.html
new file mode 100644
index 0000000000000..dc0c03ac363a9
--- /dev/null
+++ b/doc/source/user_guide/templates/html_style_structure.html
@@ -0,0 +1,35 @@
+
+
+
+
before_style
+
style
+
<style type="text/css">
+
table_styles
+
before_cellstyle
+
cellstyle
+
</style>
+
diff --git a/doc/source/user_guide/templates/template_structure.html b/doc/source/user_guide/templates/html_table_structure.html
similarity index 80%
rename from doc/source/user_guide/templates/template_structure.html
rename to doc/source/user_guide/templates/html_table_structure.html
index 0778d8e2e6f18..e03f9591d2a35 100644
--- a/doc/source/user_guide/templates/template_structure.html
+++ b/doc/source/user_guide/templates/html_table_structure.html
@@ -25,15 +25,6 @@
}
-
{{ super() }}
diff --git a/doc/source/user_guide/text.rst b/doc/source/user_guide/text.rst
index db9485f3f2348..d350351075cb6 100644
--- a/doc/source/user_guide/text.rst
+++ b/doc/source/user_guide/text.rst
@@ -335,6 +335,19 @@ regular expression object will raise a ``ValueError``.
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex
+``removeprefix`` and ``removesuffix`` have the same effect as ``str.removeprefix`` and ``str.removesuffix`` added in Python 3.9
+`__:
+
+.. versionadded:: 1.4.0
+
+.. ipython:: python
+
+ s = pd.Series(["str_foo", "str_bar", "no_prefix"])
+ s.str.removeprefix("str_")
+
+ s = pd.Series(["foo_str", "bar_str", "no_suffix"])
+ s.str.removesuffix("_str")
+
.. _text.concatenate:
Concatenation
@@ -742,6 +755,8 @@ Method summary
:meth:`~Series.str.get_dummies`;Split strings on the delimiter returning DataFrame of dummy variables
:meth:`~Series.str.contains`;Return boolean array if each string contains pattern/regex
:meth:`~Series.str.replace`;Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
+ :meth:`~Series.str.removeprefix`;Remove prefix from string, i.e. only remove if string starts with prefix.
+ :meth:`~Series.str.removesuffix`;Remove suffix from string, i.e. only remove if string ends with suffix.
:meth:`~Series.str.repeat`;Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
:meth:`~Series.str.pad`;"Add whitespace to left, right, or both sides of strings"
:meth:`~Series.str.center`;Equivalent to ``str.center``
diff --git a/doc/source/user_guide/timedeltas.rst b/doc/source/user_guide/timedeltas.rst
index 0b4ddaaa8a42a..180de1df53f9e 100644
--- a/doc/source/user_guide/timedeltas.rst
+++ b/doc/source/user_guide/timedeltas.rst
@@ -88,13 +88,19 @@ or a list/array of strings:
pd.to_timedelta(["1 days 06:05:01.00003", "15.5us", "nan"])
-The ``unit`` keyword argument specifies the unit of the Timedelta:
+The ``unit`` keyword argument specifies the unit of the Timedelta if the input
+is numeric:
.. ipython:: python
pd.to_timedelta(np.arange(5), unit="s")
pd.to_timedelta(np.arange(5), unit="d")
+.. warning::
+ If a string or array of strings is passed as an input then the ``unit`` keyword
+ argument will be ignored. If a string without units is passed then the default
+ unit of nanoseconds is assumed.
+
.. _timedeltas.limitations:
Timedelta limitations
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
index 6f005f912fe37..474068e43a4d4 100644
--- a/doc/source/user_guide/timeseries.rst
+++ b/doc/source/user_guide/timeseries.rst
@@ -204,6 +204,7 @@ If you use dates which start with the day first (i.e. European style),
you can pass the ``dayfirst`` flag:
.. ipython:: python
+ :okwarning:
pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)
@@ -211,9 +212,10 @@ you can pass the ``dayfirst`` flag:
.. warning::
- You see in the above example that ``dayfirst`` isn't strict, so if a date
+ You see in the above example that ``dayfirst`` isn't strict. If a date
can't be parsed with the day being first it will be parsed as if
- ``dayfirst`` were False.
+ ``dayfirst`` were False, and in the case of parsing delimited date strings
+ (e.g. ``31-12-2012``) then a warning will also be raised.
If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
``Timestamp`` can also accept string input, but it doesn't accept string parsing
@@ -386,7 +388,7 @@ We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by
.. _timeseries.origin:
-Using the ``origin`` Parameter
+Using the ``origin`` parameter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the ``origin`` parameter, one can specify an alternative starting point for creation
@@ -850,7 +852,7 @@ savings time. However, all :class:`DateOffset` subclasses that are an hour or sm
The basic :class:`DateOffset` acts similar to ``dateutil.relativedelta`` (`relativedelta documentation`_)
that shifts a date time by the corresponding calendar duration specified. The
-arithmetic operator (``+``) or the ``apply`` method can be used to perform the shift.
+arithmetic operator (``+``) can be used to perform the shift.
.. ipython:: python
@@ -864,10 +866,10 @@ arithmetic operator (``+``) or the ``apply`` method can be used to perform the s
friday.day_name()
# Add 2 business days (Friday --> Tuesday)
two_business_days = 2 * pd.offsets.BDay()
- two_business_days.apply(friday)
friday + two_business_days
(friday + two_business_days).day_name()
+
Most ``DateOffsets`` have associated frequency strings, or offset aliases, that can be passed
into ``freq`` keyword arguments. The available date offsets and associated frequency strings can be found below:
@@ -936,14 +938,14 @@ in the operation).
ts = pd.Timestamp("2014-01-01 09:00")
day = pd.offsets.Day()
- day.apply(ts)
- day.apply(ts).normalize()
+ day + ts
+ (day + ts).normalize()
ts = pd.Timestamp("2014-01-01 22:00")
hour = pd.offsets.Hour()
- hour.apply(ts)
- hour.apply(ts).normalize()
- hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
+ hour + ts
+ (hour + ts).normalize()
+ (hour + pd.Timestamp("2014-01-01 23:30")).normalize()
.. _relativedelta documentation: https://dateutil.readthedocs.io/en/stable/relativedelta.html
@@ -1183,16 +1185,16 @@ under the default business hours (9:00 - 17:00), there is no gap (0 minutes) bet
pd.offsets.BusinessHour().rollback(pd.Timestamp("2014-08-02 15:00"))
pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02 15:00"))
- # It is the same as BusinessHour().apply(pd.Timestamp('2014-08-01 17:00')).
- # And it is the same as BusinessHour().apply(pd.Timestamp('2014-08-04 09:00'))
- pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02 15:00"))
+ # It is the same as BusinessHour() + pd.Timestamp('2014-08-01 17:00').
+ # And it is the same as BusinessHour() + pd.Timestamp('2014-08-04 09:00')
+ pd.offsets.BusinessHour() + pd.Timestamp("2014-08-02 15:00")
# BusinessDay results (for reference)
pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02"))
- # It is the same as BusinessDay().apply(pd.Timestamp('2014-08-01'))
+ # It is the same as BusinessDay() + pd.Timestamp('2014-08-01')
    # The result is the same as rollforward because BusinessDay never overlaps.
- pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02"))
+ pd.offsets.BusinessHour() + pd.Timestamp("2014-08-02")
``BusinessHour`` regards Saturday and Sunday as holidays. To use arbitrary
holidays, you can use ``CustomBusinessHour`` offset, as explained in the
@@ -1269,6 +1271,36 @@ frequencies. We will refer to these aliases as *offset aliases*.
"U, us", "microseconds"
"N", "nanoseconds"
+.. note::
+
+ When using the offset aliases above, it should be noted that functions
+ such as :func:`date_range` and :func:`bdate_range` will only return
+ timestamps that are in the interval defined by ``start_date`` and
+ ``end_date``. If the ``start_date`` does not correspond to the frequency,
+ the returned timestamps will start at the next valid timestamp; likewise,
+ for ``end_date``, the returned timestamps will stop at the previous valid
+ timestamp.
+
+ For example, for the offset ``MS``, if the ``start_date`` is not the first
+ of the month, the returned timestamps will start with the first day of the
+ next month. If ``end_date`` is not the first day of a month, the last
+ returned timestamp will be the first day of the corresponding month.
+
+ .. ipython:: python
+
+ dates_lst_1 = pd.date_range("2020-01-06", "2020-04-03", freq="MS")
+ dates_lst_1
+
+ dates_lst_2 = pd.date_range("2020-01-01", "2020-04-01", freq="MS")
+ dates_lst_2
+
+ We can see in the above example that :func:`date_range` and
+ :func:`bdate_range` will only return the valid timestamps between the
+ ``start_date`` and ``end_date``. If these are not valid timestamps for the
+ given frequency, it will roll to the next valid timestamp for ``start_date``
+ (respectively to the previous valid timestamp for ``end_date``).
+
+
Combining aliases
~~~~~~~~~~~~~~~~~
@@ -1491,7 +1523,7 @@ or calendars with additional rules.
.. _timeseries.advanced_datetime:
-Time series-related instance methods
+Time Series-related instance methods
------------------------------------
Shifting / lagging
@@ -1789,7 +1821,7 @@ to resample based on datetimelike column in the frame, it can passed to the
),
)
df
- df.resample("M", on="date").sum()
+ df.resample("M", on="date")[["a"]].sum()
Similarly, if you instead want to resample by a datetimelike
level of ``MultiIndex``, its name or location can be passed to the
@@ -1797,7 +1829,7 @@ level of ``MultiIndex``, its name or location can be passed to the
.. ipython:: python
- df.resample("M", level="d").sum()
+ df.resample("M", level="d")[["a"]].sum()
.. _timeseries.iterating-label:
@@ -1949,7 +1981,6 @@ frequency. Arithmetic is not allowed between ``Period`` with different ``freq``
p = pd.Period("2012-01", freq="2M")
p + 2
p - 1
- @okexcept
p == pd.Period("2012-01", freq="3M")
@@ -2079,7 +2110,6 @@ The ``period`` dtype can be used in ``.astype(...)``. It allows one to change th
dti
dti.astype("period[M]")
-
PeriodIndex partial string indexing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2374,9 +2404,9 @@ you can use the ``tz_convert`` method.
.. warning::
- Be wary of conversions between libraries. For some time zones, ``pytz`` and ``dateutil`` have different
- definitions of the zone. This is more of a problem for unusual time zones than for
- 'standard' zones like ``US/Eastern``.
+ Be wary of conversions between libraries. For some time zones, ``pytz`` and ``dateutil`` have different
+ definitions of the zone. This is more of a problem for unusual time zones than for
+ 'standard' zones like ``US/Eastern``.
.. warning::
@@ -2389,7 +2419,7 @@ you can use the ``tz_convert`` method.
For ``pytz`` time zones, it is incorrect to pass a time zone object directly into
the ``datetime.datetime`` constructor
- (e.g., ``datetime.datetime(2011, 1, 1, tz=pytz.timezone('US/Eastern'))``.
+ (e.g., ``datetime.datetime(2011, 1, 1, tzinfo=pytz.timezone('US/Eastern'))``.
Instead, the datetime needs to be localized using the ``localize`` method
on the ``pytz`` time zone object.
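+
+    A minimal sketch of the correct pattern (assuming ``pytz`` is installed):
+
+    .. code-block:: python
+
+       import datetime
+
+       import pytz
+
+       eastern = pytz.timezone("US/Eastern")
+       # localize attaches the zone (including any DST offset) to a naive datetime
+       eastern.localize(datetime.datetime(2011, 1, 1))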
@@ -2570,7 +2600,7 @@ Transform nonexistent times to ``NaT`` or shift the times.
.. _timeseries.timezone_series:
-Time zone series operations
+Time zone Series operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A :class:`Series` with time zone **naive** values is
diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst
index 1c02be989eeeb..147981f29476f 100644
--- a/doc/source/user_guide/visualization.rst
+++ b/doc/source/user_guide/visualization.rst
@@ -3,9 +3,14 @@
{{ header }}
*******************
-Chart Visualization
+Chart visualization
*******************
+
+.. note::
+
+ The examples below assume that you're using `Jupyter `_.
+
This section demonstrates visualization through charting. For information on
visualization of tabular data please see the section on `Table Visualization `_.
@@ -272,7 +277,7 @@ horizontal and cumulative histograms can be drawn by
plt.close("all")
See the :meth:`hist ` method and the
-`matplotlib hist documentation `__ for more.
+`matplotlib hist documentation