pandas merging Joining Concatenating#

প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD

এবং Official Doc

.concat() method#

.concat() method ব্যবহার করে আমরা same length এর একাধিক data-frame কে একটি data-frame এ রূপান্তর করতে পারি বা একাধিক data কে এক সঙ্ঘে concatination করতে পারি।

সাধারণত obj signature নিয়ে কাজ করা যায়। কিন্তু প্রয়োজন এর তাগিদে আমাদের আর ও কিছু signature ব্যবহার করতে হতে পারে। যেমন ignore_index ব্যবহার করলে একটি continous indexing পাওয়া যায়। তা না হইলে আমাদের 1st data-frame এর index 2nd data-frame এর index মিলে problem হতে পারে। keys ব্যবহার করে আমরা external Index তৈরি করতে পারি। names ব্যবহার করে আমরা multi-Index গুলোর specific নাম দিতে পারি।

pandas.concat(objs, keys, names, ignore_index)

Example:

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")

len(week1)    # same length
len(week2)

wek1and2_index = pd.concat([week1, week2])
wek1and2_index.tail()

#  create continous numaric index
wek1and2 = pd.concat([week1, week2], ignore_index = True)

# create external index with key1 for obj1 and key2 for obj2
sales = pd.concat(objs= [week1, week2], keys= ["week1", "week2"])

# extract only week1 external row
sales.loc["week1"]

# extract whoes CustomerID 240 and external index week2
sales.loc[("week2", 240)]

# extract thoes all other datas whoes CustomerID 240 and external index week2
sales.loc[("week2", 240), "Customer ID"]

# we can change index label name 
sales = pd.concat(objs=[week1, week2], keys=["week1", "week2"], names=['data frames', 'indexs'])

sales.head()

OUTPUT:

Customer ID Food ID
data frames indexs
week1 0 537 9
1 97 4
2 658 1
3 202 2
4 155 9

.append() method#

.append() method অনেকটাই .concat() method এর মত। .concat() সরাসরি pandas কে call করে, আর .append() দ্বারা specific data-frame কে call করতে হয়।

dataFrame1.append(dataFrame2, ignore_index=True)

Example:


import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")

week1_2 = week2.append(week1, ignore_index = True)  # work same as concat method
week1_2.head()

OUTPUT:

Customer ID Food ID
0 688 10
1 813 7
2 495 10
3 189 5
4 267 3

.merge() Method#

left_on and right_on parameters#

২টি data-set কে একত্রে আনার প্রক্রিয়াই marging.

pandas.merge(leftDataFrame, rightDataFrame, how, left_on, right_on)

pandas কে merge method দ্বারা call করে signuture গুলোতে সঠিক data পাঠিয়ে marge করতে পারি। how parameter এ define করতে হবে merge টাই left dataFrame এ নাকি right dataFrame প্রথমে থাকবে । left_on এ left dataFrame এর specific label এবং right_on এ right dataFrame এর specific label থাকবে।

same type এর যে কোন label দ্বারা index define করে merge data কে compare করা যায়।

Example:

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv", index_col = "ID")  # define index column cause data frame have extra number column serial

mergeData= pd.merge(week1, customers, how = "left", left_on = "Customer ID", right_on = "ID")
mergeData.head()

OUTPUT:

Customer ID Food ID First Name Last Name Gender Company Occupation
0 537 9 Cheryl Carroll Female Zoombeat Registered Nurse
1 97 4 Amanda Watkins Female Ozu Account Coordinator
2 658 1 Patrick Webb Male Browsebug Community Outreach Specialist
3 202 2 Louis Campbell Male Rhynoodle Account Representative III
4 155 9 Carolyn Diaz Female Gigazoom Database Administrator III

Merging by Indexes ( left_index & right_index) Parameters#

by-Default left_index এবং right_index এর মান False থাকে, merge করার সময় left Data-Frame বা right Data-Frame এর index কে main index হিসেবে ব্যবহার করার জন্য specific index এর মান True দিতে হবে।

leftDataFrame.merge(rightDataFrame, how, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])

  • how তে left merge না কি right merge টা দিতে হবে,

  • left DataFrame এর index হিসেবে যদি label ব্যবহার করা হয় তবে টা left_on এ দিতে হবে। unless দিতে হবে না।

  • right DataFrame এর index হিসেবে যদি label ব্যবহার করা হয় তবে টা right_on এ দিতে হবে। unless দিতে হবে না।

  • left DataFrame এর index হিসেবে যদি left DataFrame এর index ব্যবহার করা হয় তবে left_index = True দিতে হবে। unless দিতে হবে না।

  • right DataFrame এর index হিসেবে যদি right DataFrame এর index ব্যবহার করা হয় তবে right_index = True দিতে হবে। unless দিতে হবে না।

  • merge data চেনার জন্য যদি specific data-frame এর সকল label এর সঙ্ঘে specific কোন word add করতে চাই তা হইলে suffixes parameter ব্যবহার করতে হবে।

Example:

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv", index_col = "ID")  # define index column cause data frame have extra number column serial
food = pd.read_csv("Restaurant - Foods.csv",  index_col = "Food ID")
food.head()

""" 
use for marge with index value or marge with index and column
"""
sales = week1.merge(customers, how = "left", left_on = "Customer ID", right_index = True)

# now marge with foods data set to add food details with sales 

sales.merge(food, how = "left", left_on = "Food ID", right_index = True)
sales.head()


# merge by Index In both datasets and include suffixes name on label
week1.merge(week2, how = "left", left_index = True, right_index = True, suffixes = ["-week1", "-week2"])


OUTPUT:

Customer ID-week1 Food ID-week1 Customer ID-week2 Food ID-week2
0 537 9 688 10
1 97 4 813 7
2 658 1 495 10
3 202 2 189 5
4 155 9 267 3
... ... ... ... ...
245 413 9 783 10
246 926 6 556 10
247 134 3 547 9
248 396 6 252 9
249 535 10 249 6

250 rows × 4 columns

.join() Method#

.merge() method এর মত .join() method ব্যবহার করে আমরা ২টি data-frame কে একটি data-frame এ রূপান্তর করতে পারি। সাধারণত leftDataFrame.join(rightDataFrame) ব্যবহার করে পাশাপাশি 2 টি DataFrame কে join করা যায়।

Example:

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
satisfaction = pd.read_csv("Restaurant - Week 1 Satisfaction.csv")

# use for join 2 dataSet with out extra parameter
week1.join(satisfaction).head()


OUTPUT:

Customer ID Food ID Satisfaction Rating
0 537 9 2
1 97 4 7
2 658 1 3
3 202 2 7
4 155 9 10

image

inner Joins (inner marge)#

২ টা data-frame এর মদ্ধে common value গুলো মিলে new data-frame তৈরি করলে inner-join করা হয়। join করার সময় notice করতে হবে যেন ২ টা data-frame এ same Name এবং same size এর common label থাকে, যে label টার উপর deppend করে আমাদের inner join complete হবে।

সকল প্রকার join করতে .marge() method এর how parameter এর পরিবর্তন ঘটাতে হয়।

leftDataFrame.merge(rightDataFrame, how=”inner”, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])

Example:

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")

""" try to find which customer are came on week1 and week2 both """

# both dataset have same name of column "Customer ID" so it defaultly create "Customer ID_x""Customer ID_y" for reduse duplicate column name
week1.merge(week2, how = "inner", on = "Customer ID").head(4)

# now give some specific name to "Customer ID"
week1.merge(week2, how = "inner", on = "Customer ID",suffixes = [" - week 1"," - week 2"]).head(4)

""" try to find which customer are came on both week1 and week2 both and ordered exjact same product to both week1 & week2"""
week1.merge(week2, how = "inner", on = ["Customer ID", "Food ID"]).head(4)




OUTPUT:

Customer ID Food ID
0 304 3
1 540 3
2 937 10
3 233 3

outer Joins (outer marge)#

উপরের picture এ সকল join এ কি ঘটে তা দাওয়া হয়েছে। আপনারা যদি উপরের topic গুলো ভাল মত বুঝতে পারেন এবং SQL এর join topic জেনে থাকেন তা হইলে code এর flow অনুসরণ করুন এবং line by line code run করুন। আশা করি বুঝতে পারবেন।

inner join এর similar. just how=”outer”

leftDataFrame.merge(rightDataFrame, how=”outer”, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])

Example:


import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")

""" try to find which customer are not came on week1 and week2 both """

# both dataset have same name of column "Customer ID" so it defaultly create "Customer ID_x""Customer ID_y" for reduse duplicate column name
week1.merge(week2, how = "outer", on = "Customer ID").head(4)

# NaN found when those costomer is not found in column
# now give some specific name to "Customer ID".  indicator parameter define where the value is
marged = week1.merge(week2, how = "outer", on = "Customer ID",
                     suffixes = [" - week 1"," - week 2"], indicator = True)

len(marged)  # 454 rows have not similar value 


""" try to find which customer are not came on both week1 and week2 both and ordered exjact same product to both week1 & week2"""
week1.merge(week2, how = "outer", on = ["Customer ID", "Food ID"]).head(4)


# total discription
marged["_merge"].value_counts()


OUTPUT:

right_only    197
left_only     195
both           62
Name: _merge, dtype: int64

left Joins (left marge)#

.merge() method এ যে example দেখেছেন তা left join এর উদাহরণ।

how = “left” ব্যবহার করে বাকি সব same.

just আপনাদের join বাপার টার visual flow মনে রাখতে হবে এবং কেন কোন merge ব্যবহার করবেন টা নিশ্চিত করতে হবে।

Example:


import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")


""" try to find which food item buy customer on week1 """ # so food ID is our matching column
buy = week1.merge(food, how = "left", on = "Food ID", sort = True) # it's sort with matching column
buy.head()


OUTPUT:

Customer ID Food ID Food Item Price
0 658 1 Sushi 3.99
1 600 1 Sushi 3.99
2 155 1 Sushi 3.99
3 341 1 Sushi 3.99
4 20 1 Sushi 3.99

right Joins (right marge)#

how = “right” ব্যবহার করে বাকি সব same.

just আপনাদের join বাপার টার visual flow মনে রাখতে হবে এবং কেন কোন merge ব্যবহার করবেন টা নিশ্চিত করতে হবে।

Example:


import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")


""" try to find which food item buy customer on week1 """ # so food ID is our matching column
buy = food.merge(week1, how = "right", on = "Food ID", sort = True) # it's sort with matching column
buy.head()


OUTPUT:

Food ID Food Item Price Customer ID
0 1 Sushi 3.99 658
1 1 Sushi 3.99 600
2 1 Sushi 3.99 155
3 1 Sushi 3.99 341
4 1 Sushi 3.99 20

Drop label#

যদি merged data-frame এ অপ্রয়োজনীয় label এবং data থাকে তা হইলে .drop() method ব্যবহার করব।

Example:

When marged column have not same name then we cann't work with previous marge or join system. matching or comparing columns always need same name. If that's not happend then we need extra 2 parameter left_on & right_on

import pandas as pd

week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")

week2.merge(customers, how = "left", left_on = "Customer ID", right_on = "ID") # "Customer ID" & "ID" same so we don't need duplicate column
 
week2CustomerDemo = week2.merge(customers, how = "left", left_on = "Customer ID", 
                                right_on = "ID", sort = True).drop("ID", axis= "columns")  # now ID column is drop
week2CustomerDemo.head()


OUTPUT:

Customer ID Food ID First Name Last Name Gender Company Occupation
0 8 6 Frances Adams Female Dabshots Developer III
1 13 2 Ruth Alvarez Female Twitterlist Mechanical Systems Engineer
2 21 4 Albert Burns Male Rhynoodle Junior Executive
3 24 8 Donna Thomas Female Jaxbean Chief Design Engineer
4 27 4 Jessica Bennett Female Twitternation Account Executive