开发手册 欢迎您!
软件开发者资料库

Python pandas merge(join) 通过单列或多列合并连接两个DataFrame

Python pandas中处理两个DataFrame时,有些情况我们可能需要将两个DataFrame合并成一个DataFrame,本文主要介绍Python pandas 中通过单列或多列合并连接两个DataFrame的方法,以及相关的示例代码。

1、内连接(inner join)

内连接是满足条件时,左边的和右边的DataFrame都存在的数据。

1)单列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='inner', on ='Id')print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='inner',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender  Like0  2    John     2      F     C1  4  Mickey     4      M  Java

2)多列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                     'UserId':['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],                    'UserId': ['1', '2', '3', '4', '5'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='inner', on =['Id','UserId'])print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='inner',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender   Like0  3      3    Lynn     3      M     JS1  4      4  Mickey     4      F  Linux

2、外连接(outer join)

外连接返回左侧DataFrame中的所有行,右侧DataFrame中的所有行,并在可能的情况下匹配行,在其他地方使用NaN

1)单列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='outer', on ='Id')print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='outer',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender    Like0  1    Levi     1    NaN     NaN1  2    John     2      F       C2  3    Lynn     3    NaN     NaN3  4  Mickey     4      M    Java4  5    Rose     5    NaN     NaN5  6     NaN   NaN      M      JS6  7     NaN   NaN      F   Linux7  8     NaN   NaN      F  Python

2)多列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                     'UserId':['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],                    'UserId': ['1', '2', '3', '4', '5'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='outer', on =['Id','UserId'])print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='outer',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender    Like0  1      1    Levi     1    NaN     NaN1  2      2    John     2    NaN     NaN2  3      3    Lynn     3      M      JS3  4      4  Mickey     4      F   Linux4  5      5    Rose     5    NaN     NaN5  2      1     NaN   NaN      F       C6  9      2     NaN   NaN      M    Java7  8      5     NaN   NaN      F  Python

3、左连接(left join)

使用左连接,将显示第一个DataFrame中的所有记录,而不管第一个DataFrame中的键是否可以在第二个DataFrame中找到。而对于第二个DataFrame,只会显示在第一个DataFrame中可以找到的具有第二个DataFrame中的key的记录。

1)单列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='left', on ='Id')print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='left',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender  Like0  1    Levi     1    NaN   NaN1  2    John     2      F     C2  3    Lynn     3    NaN   NaN3  4  Mickey     4      M  Java4  5    Rose     5    NaN   NaN

2)多列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                     'UserId':['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],                    'UserId': ['1', '2', '3', '4', '5'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='left', on =['Id','UserId'])print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='left',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender   Like0  1      1    Levi     1    NaN    NaN1  2      2    John     2    NaN    NaN2  3      3    Lynn     3      M     JS3  4      4  Mickey     4      F  Linux4  5      5    Rose     5    NaN    NaN

4、右连接(right join)

对于右连接,将显示来自第二个DataFrame的所有记录。但是,只会显示在第二个DataFrame中可以找到的第一个DataFrame中具有Key的记录。

1)单列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='right', on ='Id')print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='right',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender    Like0  2    John     2      F       C1  4  Mickey     4      M    Java2  6     NaN   NaN      M      JS3  7     NaN   NaN      F   Linux4  8     NaN   NaN      F  Python

2)多列条件

import numpy as npimport pandas as pd  left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],                     'UserId':['1', '2', '3', '4', '5'],                    'Name': ['Levi', 'John', 'Lynn',                             'Mickey', 'Rose'],                    'order': ['1', '2', '3', '4', '5']}) right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],                    'UserId': ['1', '2', '3', '4', '5'],                    'Gender': ['F', 'M', 'M', 'F', 'F'],                    'Like': ['C', 'Java',                                 'JS', 'Linux','Python']})#合并        df = pd.merge(left, right, how ='right', on =['Id','UserId'])print(df)#两个DataFrame列名不同,使用left_on和right_on指定列名#df = pd.merge(left,right,how ='right',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender    Like0  2      1     NaN   NaN      F       C1  9      2     NaN   NaN      M    Java2  3      3    Lynn     3      M      JS3  4      4  Mickey     4      F   Linux4  8      5     NaN   NaN      F  Python