Pandas

Pandas is a powerful data manipulation and analysis library for Python. It introduces two main data structures:

DataFrame: A 2-dimensional labeled data structure with columns that can be of different types (like Excel spreadsheets)
Series: A 1-dimensional labeled array that can hold data of any type
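
To make the relationship between the two structures concrete, here is a minimal sketch. The names and values are invented for illustration only.

```python
import pandas as pd

# A Series: one labeled column of values
ages = pd.Series([25, 30], index=['Alice', 'Bob'], name='age')

# A DataFrame: several columns sharing one index, each with its own dtype
people = pd.DataFrame(
    {'age': [25, 30], 'city': ['Oslo', 'Lima']},
    index=['Alice', 'Bob'],
)

# Selecting a single column of a DataFrame gives back a Series
age_column = people['age']
bob_age = age_column.loc['Bob']   # look up a value by its label
```

Each column of a DataFrame behaves like a Series that shares the DataFrame's index.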

Pandas excels at:

Reading/writing various file formats (CSV, Excel, SQL databases)
Data cleaning and preparation
Merging and joining datasets
Handling missing data
Time series functionality
Statistical operations

It's built on top of NumPy and integrates well with other scientific Python libraries, making it a cornerstone of data science in Python.

Official Docs

The official documentation for this library can be found at pandas.pydata.org

Below is a short reference for basic functionality of the pandas library. Check out the official user guide for a more in-depth reference.

Creating DataFrames

import pandas as pd

# From dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30]
})

# From CSV
df = pd.read_csv('file.csv')

# From Excel
df = pd.read_excel('file.xlsx')

Basic Operations

# Basic info
df.info()                   # Column dtypes, non-null counts, memory usage
df.describe()               # Summary statistics for numeric columns
df.shape                    # (rows, columns)
df.columns                  # Column names
df.dtypes                   # Column data types

# Viewing data
df.head(n)                  # First n rows (default 5)
df.tail(n)                  # Last n rows (default 5)
df.sample(n)                # n random rows (default 1)

Selection and Indexing

# Column selection
df['column']                # Single column
df[['col1', 'col2']]       # Multiple columns

# Row selection
df.loc[row_label]          # Label-based
df.iloc[row_index]         # Integer-based
df.iloc[0:5, 0:2]          # Slice rows and columns

# Boolean indexing
df[df['age'] > 25]         # Filter by condition
df.query('age > 25')       # Query method
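
The difference between label-based and position-based selection is easiest to see on a small frame with a non-numeric index. The following is a sketch with made-up data, not a complete tour of the indexing API.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Cara'], 'age': [25, 30, 35]},
    index=['a', 'b', 'c'],
)

by_label = df.loc['b', 'age']      # row labeled 'b', column 'age'
by_position = df.iloc[1, 1]        # second row, second column
label_slice = df.loc['a':'b']      # label slices include the end label
position_slice = df.iloc[0:2]      # position slices exclude the end position

# Combine conditions with & and |, wrapping each one in parentheses
older_not_cara = df[(df['age'] > 25) & (df['name'] != 'Cara')]
```

Here both `by_label` and `by_position` refer to Bob's age, and both slices return the first two rows.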

Data Cleaning

# Missing values
df.isna().sum()            # Count missing values per column
df.dropna()                # Drop rows with any NA (returns a new DataFrame)
df.fillna(value)           # Fill NA with value (returns a new DataFrame)

# Duplicates
df.duplicated()            # Find duplicates
df.drop_duplicates()       # Remove duplicates
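
A short worked example of the cleaning calls above, using invented data. Filling with the column mean is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Cara'],
    'age': [25.0, 30.0, 30.0, np.nan],
})

missing_per_column = df.isna().sum()     # 'age' has one missing value
deduplicated = df.drop_duplicates()      # drops the repeated 'Bob' row

# Fill the remaining gap in 'age' with the mean of the known ages
filled = deduplicated.fillna({'age': deduplicated['age'].mean()})
```

None of these calls modify `df`; each returns a new object, so assign the result if you want to keep it.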

Data Manipulation

# Adding/removing columns
df['new_col'] = values     # Add new column
df.drop(columns='col')     # Remove column (returns a new DataFrame; df is unchanged)

# Sorting
df.sort_values('col')      # Sort by column
df.sort_index()            # Sort by index

# Type conversion
df['col'].astype('int64')  # Convert type, here to int64 (returns a new Series)
pd.to_datetime(df['date']) # Convert to datetime

Grouping and Aggregation

# Group by operations
df.groupby('category').mean(numeric_only=True)  # Mean of numeric columns per group
df.groupby(['cat1', 'cat2']).agg({
    'col1': 'sum',
    'col2': 'mean'
})

# Pivot tables
df.pivot_table(
    values='value',
    index='row_groups',
    columns='col_groups',
    aggfunc='mean'
)
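
To show what these calls return, here is a small sketch on invented sales data. The column names are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south'],
    'product': ['a', 'b', 'a', 'b'],
    'units': [10, 20, 30, 40],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# One aggregation per column, computed within each region
summary = sales.groupby('region').agg({'units': 'sum', 'price': 'mean'})

# The same data reshaped: regions as rows, products as columns
table = sales.pivot_table(values='units', index='region',
                          columns='product', aggfunc='sum')
```

`summary` has one row per region, while `table` has one row per region and one column per product.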

Merging and Joining

# Combining DataFrames
pd.merge(df1, df2, on='key')           # SQL-style join
pd.concat([df1, df2], axis=0)          # Vertical concatenation
pd.concat([df1, df2], axis=1)          # Horizontal concatenation
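
`pd.merge` performs an inner join by default; the `how` parameter selects other join types. A minimal sketch with invented tables:

```python
import pandas as pd

customers = pd.DataFrame({'key': [1, 2, 3], 'name': ['Alice', 'Bob', 'Cara']})
orders = pd.DataFrame({'key': [2, 3, 4], 'total': [20.0, 30.0, 40.0]})

inner = pd.merge(customers, orders, on='key')               # keys 2 and 3
left = pd.merge(customers, orders, on='key', how='left')    # keys 1, 2 and 3
outer = pd.merge(customers, orders, on='key', how='outer')  # keys 1 to 4

# Stacking rows keeps the original index labels unless ignore_index=True
stacked = pd.concat([customers, customers], axis=0, ignore_index=True)
```

Rows without a match on the other side get NaN in the columns that come from it.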

Data Export

df.to_csv('output.csv', index=False)    # Export to CSV (index=False omits the row index)
df.to_excel('output.xlsx', index=False) # Export to Excel (requires openpyxl)
df.to_json('output.json')               # Export to JSON

Common String Operations

df['col'].str.lower()                   # Lowercase
df['col'].str.contains('pattern')       # Regex match by default (regex=False for literal text)
df['col'].str.split(',')               # Split strings
df['col'].str.strip()                  # Remove whitespace
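
String methods can be chained, and they pass missing values through. The sketch below uses invented values; `na=False` and `expand=True` are optional arguments chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ['  Apple,Red ', 'banana,Yellow', None]})

cleaned = df['col'].str.strip().str.lower()            # trim, then lowercase
has_apple = cleaned.str.contains('apple', na=False)    # treat missing as no match
parts = cleaned.str.split(',', expand=True)            # one column per piece
```

Without `expand=True`, `str.split` returns a Series of lists instead of a DataFrame.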

Time Series

df.resample('ME').mean()                # Monthly mean; needs a DatetimeIndex ('M' before pandas 2.2)
df['date'].dt.month                     # Extract month from a datetime column
df.set_index('date').asfreq('D')        # Reindex to daily frequency
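
Resampling only works on a datetime-like index, so a typical pattern is to set the date column as the index first. This sketch uses generated data and the `'MS'` (month start) alias, which is accepted across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=60, freq='D'),
    'value': range(60),
})

ts = df.set_index('date')                          # resample needs a datetime index
monthly_mean = ts['value'].resample('MS').mean()   # bins labeled by month start
months = df['date'].dt.month                       # .dt works on a datetime column
```

The 60 days cover January and February 2024, so `monthly_mean` has two entries.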

The sections below cover common plotting patterns with pandas and pyplot.

Basic Plotting with Pandas/Pyplot

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # Optional; used below for the violin plot

# Style configuration
plt.style.use('seaborn-v0_8')  # Optional; this style was named 'seaborn' before matplotlib 3.6

Line Plots

# Simple line plot
df.plot(x='date', y='value')
# Multiple lines
df.plot(x='date', y=['value1', 'value2'])

# More control with pyplot
plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['value'], label='Series')
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.legend()
plt.grid(True)
plt.show()

Bar Plots

# Vertical bars
df.plot(kind='bar', x='category', y='value')
# Horizontal bars
df.plot(kind='barh', x='category', y='value')

# Grouped bars
df.plot(kind='bar', x='category', y=['value1', 'value2'])

# Stacked bars
df.plot(kind='bar', x='category', y=['value1', 'value2'], stacked=True)

Statistical Plots

# Histogram
df['value'].hist(bins=50)

# Box plot
df.boxplot(column='value', by='category')

# Violin plot (requires seaborn)
sns.violinplot(data=df, x='category', y='value')

# Density plot (kernel density estimate; requires SciPy)
df['value'].plot(kind='kde')

Scatter Plots

# Simple scatter
df.plot.scatter(x='value1', y='value2')

# With pyplot for more control
plt.scatter(df['value1'], df['value2'], 
           c=df['color_value'],  # Color by value
           s=df['size_value'],   # Size by value
           alpha=0.5)            # Transparency

# Matrix of scatter plots
pd.plotting.scatter_matrix(df)

Subplots

# Multiple plots in grid
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot on specific subplot
df.plot(ax=axes[0, 0], title='Plot 1')
df.plot.scatter(ax=axes[0, 1], x='val1', y='val2', title='Plot 2')
df.plot.box(ax=axes[1, 0], title='Plot 3')
df.plot.hist(ax=axes[1, 1], title='Plot 4')

plt.tight_layout()
plt.show()

Time Series Plots

# Time series with proper date formatting
ts = df.set_index('date')  # Use the dates as the index
ts.plot()

# Customize time series plot
plt.figure(figsize=(12, 6))
plt.plot(ts.index, ts['value'])
plt.gcf().autofmt_xdate()  # Angle and align the tick labels

Customization and Saving

# Common customization options
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'],
         color='blue',
         linestyle='--',
         marker='o',
         linewidth=2,
         markersize=8,
         alpha=0.7)

# Add annotations (x and y are placeholders for the coordinates of the point)
plt.annotate('Important Point', 
            xy=(x, y),
            xytext=(x+1, y+1),
            arrowprops=dict(facecolor='black', shrink=0.05))

# Save plot
plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Interactive Plotting with plotly (Optional)

import plotly.express as px

# Interactive line plot
fig = px.line(df, x='date', y='value')
fig.show()

# Interactive scatter
fig = px.scatter(df, x='value1', y='value2', 
                 color='category',
                 size='size_value',
                 hover_data=['additional_info'])
fig.show()

Remember to call plt.show() when using pyplot directly, unless you're in a Jupyter notebook where it's displayed automatically. Also, it's good practice to use plt.close() after saving figures to free up memory when creating many plots in a script.
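
The save-then-close pattern can be sketched as follows. The data, the temporary output directory and the `Agg` backend are assumptions made so the example runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')            # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'a': [1, 4, 9], 'b': [2, 3, 5]})
out_dir = tempfile.mkdtemp()     # hypothetical output location
saved = []

for column in ['a', 'b']:
    fig, ax = plt.subplots(figsize=(6, 4))
    df.plot(x='x', y=column, ax=ax, title=column)
    path = os.path.join(out_dir, f'{column}.png')
    fig.savefig(path, dpi=100, bbox_inches='tight')
    plt.close(fig)               # release the figure once it is on disk
    saved.append(path)
```

Closing each figure inside the loop keeps memory use flat no matter how many plots the script produces.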
