DataFrame Operations Cheat Sheet

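Pandas is one of the main Python libraries for data analysis and offers a wide range of features. As a result, people using Pandas for the first time may not know where to start. This article is a cheat sheet for them: it runs through the Pandas features that are used most often.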

The examples below use the following sample DataFrame.

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)

print(df)

Running this code produces the following output.

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Cheat Sheet: Checking Basic DataFrame Information with Pandas

Here are a few operations for checking basic information about a DataFrame.

# Check the number of rows and columns
print(df.shape)  # (4, 3)

# Check the index
print(df.index)  # RangeIndex(start=0, stop=4, step=1)

# List the column names
print(df.columns)  # Index(['Name', 'Age', 'City'], dtype='object')

# Check the data type of each column
print(df.dtypes)  
# Name    object
# Age      int64
# City    object
# dtype: object

# Check the unique values in a column
print(df['City'].unique())  # ['New York' 'Paris' 'Berlin' 'London']

# Count missing values per column
print(df.isnull().sum())
# Name    0
# Age     0
# City    0
# dtype: int64

# Check summary statistics (numeric columns only by default)
print(df.describe())
#            Age
# count   4.00000
# mean   29.75000
# std     4.78714
# min    24.00000
# 25%    27.00000
# 50%    30.00000
# 75%    32.75000
# max    35.00000
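
Because the sample DataFrame has no missing values, `isnull().sum()` returns zero for every column. The following is a minimal sketch of what the check looks like when values are actually missing. `df_missing` is a hypothetical variant of the sample data, with one age and one city removed purely for illustration.

```python
import pandas as pd

# Hypothetical variant of the sample data with two values removed for illustration
df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London'],
})

# Number of missing values in each column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df_missing[df_missing.isnull().any(axis=1)]
print(rows_with_missing)
```

Counting per column shows where the gaps are, and the row filter shows which records are affected before you decide whether to fill or drop them.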

Cheat Sheet: Reshaping Data with Pandas

Next, here are operations for reshaping and cleaning data.

# Set the index (df.reset_index() would move it back to a column)
df.set_index('Name', inplace=True)
print(df)
#       Age       City
# Name                
# John   28   New York
# Anna   24      Paris
# Peter  35     Berlin
# Linda  32     London

# Rename a column
df.rename(columns={'Age': 'Age (years)'}, inplace=True)
print(df)
#       Age (years)       City
# Name                         
# John           28   New York
# Anna           24      Paris
# Peter          35     Berlin
# Linda          32     London

# Sort by a column
df.sort_values(by='Age (years)', inplace=True)
print(df)
#       Age (years)       City
# Name                         
# Anna           24      Paris
# John           28   New York
# Linda          32     London
# Peter          35     Berlin

# Replace values
df.replace({'City': {'New York': 'NY', 'Paris': 'PAR', 'Berlin': 'BER', 'London': 'LON'}}, inplace=True)
print(df)
#       Age (years) City
# Name                  
# Anna           24  PAR
# John           28   NY
# Linda          32  LON
# Peter          35  BER

# Delete a row by its index label
df.drop('Anna', axis=0, inplace=True)
print(df)
#       Age (years) City
# Name                  
# John           28   NY
# Linda          32  LON
# Peter          35  BER
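
The steps above modify `df` in place with `inplace=True`. As an alternative sketch, the same transformations can be written as a single method chain, because each of these methods returns a new DataFrame by default. The names `df_raw` and `df_clean` are made up for this example.

```python
import pandas as pd

df_raw = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Each method returns a new DataFrame, so df_raw is left unchanged
df_clean = (
    df_raw
    .set_index('Name')
    .rename(columns={'Age': 'Age (years)'})
    .sort_values(by='Age (years)')
    .replace({'City': {'New York': 'NY', 'Paris': 'PAR', 'Berlin': 'BER', 'London': 'LON'}})
    .drop('Anna', axis=0)
)
print(df_clean)
```

Keeping the original DataFrame untouched makes it easier to rerun a step or compare the data before and after.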

# Handle missing values
df['Age (years)'] = df['Age (years)'].astype(float)  # NaN needs a float column
df.loc['John', 'Age (years)'] = float('nan')  # create a missing value
df['Age (years)'] = df['Age (years)'].fillna(df['Age (years)'].mean())  # fill with the column mean
print(df)
#        Age (years) City
# Name
# John          33.5   NY
# Linda         32.0  LON
# Peter         35.0  BER
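
Passing a single value to `DataFrame.fillna` fills the gaps in every column with that value, so a mean age could end up in a text column. The sketch below shows two common alternatives: a per-column dictionary of fill values, and `dropna()` to discard incomplete rows. `df_gaps` and the `'Unknown'` placeholder are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data with one gap in each column
df_gaps = pd.DataFrame(
    {'Age (years)': [None, 32.0, 35.0], 'City': ['NY', None, 'BER']},
    index=pd.Index(['John', 'Linda', 'Peter'], name='Name'),
)

# Fill each column with its own replacement value
fill_values = {'Age (years)': df_gaps['Age (years)'].mean(), 'City': 'Unknown'}
df_filled = df_gaps.fillna(fill_values)
print(df_filled)

# Alternatively, drop every row that has a missing value
df_dropped = df_gaps.dropna()
print(df_dropped)
```

Whether filling or dropping is appropriate depends on the analysis; filling with the mean keeps the row count but flattens the variation in the data.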

Cheat Sheet: Aggregating Data with Pandas

Here are operations for aggregating data.

# Aggregate a single column
print(df['Age (years)'].sum())  # 100.5

# Group by a column and aggregate
print(df.groupby('City').mean())
#      Age (years)
# City             
# BER    35.0
# LON    32.0
# NY     33.5
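
In the example above every city appears only once, so each group mean equals the single value in that group. To show grouping doing real work, here is a minimal sketch with a hypothetical `people` table in which two invented rows (Maria and Tom) give some cities more than one entry. It uses named aggregation to compute several statistics at once.

```python
import pandas as pd

# Hypothetical data: two invented rows so that some cities have more than one entry
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Maria', 'Tom'],
    'Age': [28, 24, 35, 32, 40, 30],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Paris', 'Berlin'],
})

# Count and mean age per city, using named aggregation
age_by_city = people.groupby('City').agg(
    count=('Age', 'count'),
    mean_age=('Age', 'mean'),
)
print(age_by_city)
```

Each keyword argument names an output column and pairs an input column with an aggregation function, which keeps the result columns readable.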

Cheat Sheet: Selecting Data with Pandas

Here are operations for selecting specific data.

# Select a column
print(df['City'])
# Name
# John      NY
# Linda    LON
# Peter    BER
# Name: City, dtype: object

# Select a row by label
print(df.loc['John'])  
# Age (years)    33.5
# City             NY
# Name: John, dtype: object

# Select by row and column position
print(df.iloc[0, 1])  # NY
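
Selecting by label or position covers cases where you already know which rows you want. Filtering by a condition is just as common, so here is a short sketch of boolean indexing on a fresh copy of the sample data. The variable names (`df_people`, `over_30`, and so on) are made up for this example.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
}).set_index('Name')

# Rows where a condition holds
over_30 = df_people[df_people['Age'] > 30]
print(over_30)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_or_berlin = df_people[(df_people['Age'] < 25) | (df_people['City'] == 'Berlin')]
print(young_or_berlin)

# Select rows by label and a subset of columns in one step
subset = df_people.loc[['John', 'Linda'], ['City']]
print(subset)
```

The comparison produces a boolean Series aligned with the index, and indexing with it keeps only the rows marked `True`.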

Cheat Sheet: Combining Data with Pandas

Here are operations for combining DataFrames and Series.

df2 = pd.DataFrame({'City': ['NY', 'BER', 'LON'], 'Population': [8623000, 3769000, 8908000]}, index=['John', 'Peter', 'Linda'])

# Merge two DataFrames on a shared column
df_merged = pd.merge(df, df2, on='City')
print(df_merged)
#   Age (years) City  Population
# 0        33.5   NY     8623000
# 1        32.0  LON     8908000
# 2        35.0  BER     3769000
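
`pd.merge` performs an inner join by default, so rows without a match in the other table are silently dropped. The sketch below illustrates the difference using the `how` and `indicator` parameters. The `people` and `cities` tables are hypothetical: they reuse the city codes and population figures from the example above, plus a `PAR` row that has no match.

```python
import pandas as pd

# Hypothetical tables: PAR has no entry in the lookup table
people = pd.DataFrame({
    'Name': ['John', 'Linda', 'Peter', 'Anna'],
    'City': ['NY', 'LON', 'BER', 'PAR'],
})
cities = pd.DataFrame({
    'City': ['NY', 'BER', 'LON'],
    'Population': [8623000, 3769000, 8908000],
})

# Default inner join: Anna is dropped because PAR has no match
inner = pd.merge(people, cities, on='City')
print(inner)

# Left join: every row of `people` is kept; unmatched rows get NaN
# indicator=True adds a _merge column showing where each row came from
left = pd.merge(people, cities, on='City', how='left', indicator=True)
print(left)
```

Checking the `_merge` column after a left join is a quick way to spot keys that failed to match.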