pandas

Python's essential library for structured data manipulation.

By NumFOCUS
pandas is an open-source Python library providing DataFrame and Series data structures for structured data manipulation, analysis, and transformation. It reads and writes CSV, Excel, SQL, Parquet, and JSON through built-in functions (some formats rely on optional dependencies), supports time-series operations, groupby aggregations, merges, and reshaping, and integrates directly with NumPy, Matplotlib, scikit-learn, and the broader Python data ecosystem. Data engineers, analysts, and software teams use it across ETL pipelines, exploratory analysis, financial data processing, and machine learning preprocessing workflows.
Vendor: NumFOCUS

Features

DataFrame and Series: two-dimensional and one-dimensional labelled data structures

Data I/O: native readers and writers for CSV, Excel, SQL, Parquet, JSON, and HDF5

Data cleaning: null handling, type casting, deduplication, and string normalisation

Aggregation and groupby: split-apply-combine operations over one or more key columns or index levels

Time-series operations: date range generation, resampling, shifting, and rolling windows

Merge and join: SQL-style inner, outer, left, and right joins between DataFrames

Python ecosystem integration: works natively with NumPy, Matplotlib, and scikit-learn

BSD-3-Clause license: free for commercial, research, and educational use

What is pandas?

pandas is an open-source Python library that provides two primary data structures: DataFrame (a labelled two-dimensional table) and Series (a labelled one-dimensional array). These structures support the full range of operations needed to work with structured data: loading from files and databases, cleaning and transforming, aggregating and reshaping, and exporting results.

It was originally developed for financial data analysis, and time-series handling remains one of its strongest features. Date-indexed DataFrames support resampling, shifting, rolling calculations, and timezone handling out of the box. The library has since become the standard data manipulation layer in Python across analytics, data engineering, and machine learning workflows.
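The time-series operations mentioned above can be sketched as follows. This is a minimal illustration rather than a recommended workflow: the price values and the name `prices` are made up, and the time zone is an arbitrary choice.

```python
import pandas as pd

# Hypothetical daily closing prices; the values are invented for illustration.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0], index=idx, name="close")

# Resampling: mean price per two-day bucket.
two_day_mean = prices.resample("2D").mean()

# Shifting: subtract the previous day's close to get the daily change.
daily_change = prices - prices.shift(1)

# Rolling window: three-day moving average (the first two values are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Timezone handling: mark the naive timestamps as UTC, then convert.
prices_ny = prices.tz_localize("UTC").tz_convert("America/New_York")

print(two_day_mean)
print(daily_change.head(3))
print(rolling_mean.head(4))
```

All four operations rely on the DatetimeIndex; the same calls on a plain integer index would either fail (resample, tz_localize) or lose their calendar meaning.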

pandas is fiscally sponsored by NumFOCUS and maintained by a large contributor community. It is one of the most downloaded Python packages on PyPI and is part of the default data stack in most Python data science environments.

Core Data Structures

A DataFrame is a two-dimensional, size-mutable, labelled data structure with columns of potentially different types, similar to a database table or a spreadsheet. Each column is a Series, which is a one-dimensional labelled array capable of holding any data type. Both structures support label-based indexing (loc), integer-based indexing (iloc), and boolean mask indexing, giving developers multiple ways to select and filter data depending on what is most readable for the task.
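As a small sketch of the three selection styles described above, the following uses a made-up table of products (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical product table with string row labels.
df = pd.DataFrame(
    {"price": [9.5, 20.0, 14.25], "qty": [3, 1, 7]},
    index=["apple", "melon", "pear"],
)

# Selecting a single column returns a Series.
price_col = df["price"]

# Label-based indexing: row "melon", column "price".
by_label = df.loc["melon", "price"]

# Integer-based indexing: first row, second column (qty for "apple").
by_position = df.iloc[0, 1]

# Boolean mask indexing: keep rows where price is below 15.
cheap = df[df["price"] < 15]

print(type(price_col).__name__, by_label, by_position)
print(cheap)
```

Which style reads best depends on the task: loc when the labels carry meaning, iloc when position matters, and masks for condition-based filtering.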

The index is a first-class concept in pandas: it can be a simple integer range, a set of string labels, a DatetimeIndex, or a MultiIndex for hierarchical data. A well-chosen index can speed up joins, lookups, and time-series operations, because pandas aligns data by index label and can look up labels in a unique or sorted index without scanning every row. For financial applications working with timestamped market data or transaction histories, DatetimeIndex is the most common choice.
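A minimal sketch of how the index drives alignment and selection, using invented labels and values (the ticker names "AAA" and "BBB" are placeholders, not real instruments):

```python
import pandas as pd

# Alignment: arithmetic matches rows by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])
total = a + b  # labels present in only one Series produce NaN

# DatetimeIndex: a partial date string selects a whole month.
ts = pd.Series(
    [0, 1, 2, 3],
    index=pd.to_datetime(["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"]),
)
january = ts.loc["2024-01"]

# MultiIndex: selecting on the outer level returns the inner level as the index.
levels = pd.MultiIndex.from_product(
    [["AAA", "BBB"], ["open", "close"]], names=["ticker", "field"]
)
quotes = pd.Series([1.0, 1.5, 2.0, 2.5], index=levels)
aaa = quotes.loc["AAA"]

print(total)
print(january)
print(aaa)
```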

Data I/O and Pipeline Integration

pandas ships with readers and writers for the most common data formats: read_csv, read_excel, read_sql, read_parquet, read_json, and several more. The corresponding to_* methods write DataFrames back to the same formats. Some formats rely on optional packages, for example a Parquet engine such as pyarrow for read_parquet and an Excel engine such as openpyxl for read_excel. This makes pandas a practical glue layer between data sources and downstream systems in ETL pipelines.
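The reader and writer pairing can be sketched with in-memory text, so nothing touches the file system. The sample records are invented; in practice the argument to read_csv would usually be a path or URL.

```python
import io

import pandas as pd

# Hypothetical CSV content held in a string instead of a file.
csv_text = "id,name,amount\n1,alice,10.5\n2,bob,7.25\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write back to CSV; index=False leaves out the row index column.
csv_out = df.to_csv(index=False)

# Convert the same data to JSON records and read it back.
json_out = df.to_json(orient="records")
df_back = pd.read_json(io.StringIO(json_out))

print(csv_out)
print(json_out)
```

The other formats follow the same read_*/to_* pattern, subject to their optional dependencies being installed.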

In Python data engineering stacks, pandas typically sits between a data ingestion layer (a database connector, an API client, or a file system reader) and a processing or output layer (a machine learning model, a reporting tool, or a data warehouse loader). Teams commonly use pandas for the transformation layer even when the surrounding infrastructure uses dedicated pipeline orchestration tools such as Airflow or Prefect.
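One way to picture pandas as the transformation layer is the extract-transform-load sketch below. It assumes an in-memory SQLite database standing in for both the source and the destination; the table names `orders` and `customer_totals` and all values are hypothetical.

```python
import sqlite3

import pandas as pd

# Stand-in data source: an in-memory SQLite database with made-up orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)],
)
conn.commit()

# Extract: pull the source table into a DataFrame.
orders = pd.read_sql("SELECT * FROM orders", conn)

# Transform: total order amount per customer.
totals = orders.groupby("customer", as_index=False)["amount"].sum()

# Load: write the result to a destination table.
totals.to_sql("customer_totals", conn, index=False, if_exists="replace")

# Read the destination back to confirm what was written.
loaded = pd.read_sql("SELECT * FROM customer_totals ORDER BY customer", conn)
conn.close()

print(loaded)
```

In an orchestrated pipeline, each of the three steps would typically live in its own task, with pandas handling only the middle one.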

Analytics and Transformation Capabilities

pandas covers the full range of tabular data transformations: filtering rows by condition, selecting and renaming columns, handling missing values with fillna and dropna, casting types, and applying arbitrary functions element-wise or column-wise. The groupby API implements split-apply-combine: split a DataFrame by one or more keys, apply an aggregation or transformation function to each group, and combine the results into a new DataFrame.
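The cleaning steps and the split-apply-combine pattern described above might look like this in practice. The data and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales rows with a missing region, a missing unit count,
# and prices stored as strings.
df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", None],
        "units": [10, np.nan, 4, 6, 3],
        "price": ["1.5", "2.0", "3.0", "3.0", "1.0"],
    }
)

clean = (
    df.dropna(subset=["region"])            # drop rows with no region
    .fillna({"units": 0})                   # treat missing units as zero
    .astype({"price": "float64"})           # cast string prices to floats
    .rename(columns={"units": "qty"})       # rename a column
    .assign(revenue=lambda d: d["qty"] * d["price"])  # column-wise computation
)

# Filter rows by condition.
large = clean[clean["revenue"] > 10]

# Split by region, apply aggregations, combine into a new DataFrame.
summary = clean.groupby("region").agg(
    total_qty=("qty", "sum"),
    mean_price=("price", "mean"),
)

print(large)
print(summary)
```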

Merge and join operations follow SQL semantics: inner, left, right, and outer joins between DataFrames on one or more key columns. concat stacks DataFrames vertically or horizontally. pivot_table and melt reshape data between wide and long formats. For SaaS and technology product teams building analytics features or internal reporting pipelines, these capabilities cover the majority of transformation requirements without needing a dedicated query engine.
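A short sketch of the join and reshape operations named above, using two made-up tables (`customers` and `orders`) and a small sales table in long format:

```python
import pandas as pd

# Hypothetical tables: customer 2 has no orders, order cust_id 4 has no customer.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["ann", "ben", "cat"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3, 4], "amount": [20, 30, 15, 99]})

# SQL-style joins on a key column.
inner = customers.merge(orders, on="cust_id", how="inner")  # matching keys only
left = customers.merge(orders, on="cust_id", how="left")    # all customers
outer = customers.merge(orders, on="cust_id", how="outer")  # all keys from both

# concat stacks DataFrames vertically (the default, axis=0).
stacked = pd.concat([orders, orders], ignore_index=True)

# pivot_table: long to wide, one column per product.
long_sales = pd.DataFrame(
    {
        "month": ["jan", "jan", "feb", "feb"],
        "product": ["a", "b", "a", "b"],
        "sales": [1, 2, 3, 4],
    }
)
wide = long_sales.pivot_table(
    index="month", columns="product", values="sales", aggfunc="sum"
)

# melt: wide back to long.
back = wide.reset_index().melt(
    id_vars="month", var_name="product", value_name="sales"
)

print(len(inner), len(left), len(outer))
print(wide)
print(back)
```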

Performance and Scalability Considerations

pandas is optimised for datasets that fit comfortably in memory. Performance on large DataFrames benefits from using appropriate dtypes (for example, category for low-cardinality string columns, int32 instead of int64 where range allows) to reduce memory footprint and speed up operations.
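The effect of dtype choice on memory can be checked directly with memory_usage. The sketch below uses synthetic data (a three-value country column and a counter that fits in int32); actual savings depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

n = 100_000

# Synthetic data: a low-cardinality string column and a small-range integer column.
df = pd.DataFrame(
    {
        "country": ["DE", "FR", "US", "DE"] * (n // 4),
        "visits": np.arange(n, dtype="int64"),
    }
)

# deep=True counts the memory held by the string values themselves.
before = df.memory_usage(deep=True).sum()

# category stores each distinct string once; int32 halves the integer width.
slim = df.astype({"country": "category", "visits": "int32"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The int32 cast is only safe because the values here stay well below the int32 maximum; casting a column with larger values would overflow or raise, depending on the operation.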

For datasets that exceed available memory, Dask provides a DataFrame API modelled on pandas that partitions DataFrames across cores or machines. Polars is a more recent alternative with a Rust-based execution engine that outperforms pandas on many in-memory workloads and supports lazy evaluation. For teams where pandas performance has become a bottleneck, both are worth evaluating. Most data engineering and analytics teams start with pandas and move to alternatives only when specific scale requirements demand it.