pandas merging Joining Concatenating#
প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD
এবং Official Doc
.concat() method#
.concat() method ব্যবহার করে আমরা same length এর একাধিক data-frame কে একটি data-frame এ রূপান্তর করতে পারি বা একাধিক data কে এক সঙ্ঘে concatination করতে পারি।
সাধারণত obj signature নিয়ে কাজ করা যায়। কিন্তু প্রয়োজন এর তাগিদে আমাদের আর ও কিছু signature ব্যবহার করতে হতে পারে। যেমন ignore_index ব্যবহার করলে একটি continous indexing পাওয়া যায়। তা না হইলে আমাদের 1st data-frame এর index 2nd data-frame এর index মিলে problem হতে পারে। keys ব্যবহার করে আমরা external Index তৈরি করতে পারি। names ব্যবহার করে আমরা multi-Index গুলোর specific নাম দিতে পারি।
pandas.concat(objs, keys, names, ignore_index)
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
len(week1) # same length
len(week2)
wek1and2_index = pd.concat([week1, week2])
wek1and2_index.tail()
# create continous numaric index
wek1and2 = pd.concat([week1, week2], ignore_index = True)
# create external index with key1 for obj1 and key2 for obj2
sales = pd.concat(objs= [week1, week2], keys= ["week1", "week2"])
# extract only week1 external row
sales.loc["week1"]
# extract whoes CustomerID 240 and external index week2
sales.loc[("week2", 240)]
# extract thoes all other datas whoes CustomerID 240 and external index week2
sales.loc[("week2", 240), "Customer ID"]
# we can change index label name
sales = pd.concat(objs=[week1, week2], keys=["week1", "week2"], names=['data frames', 'indexs'])
sales.head()
OUTPUT:
Customer ID | Food ID | ||
---|---|---|---|
data frames | indexs | ||
week1 | 0 | 537 | 9 |
1 | 97 | 4 | |
2 | 658 | 1 | |
3 | 202 | 2 | |
4 | 155 | 9 |
.append() method#
.append() method অনেকটাই .concat() method এর মত। .concat() সরাসরি pandas কে call করে, আর .append() দ্বারা specific data-frame কে call করতে হয়।
dataFrame1.append(dataFrame2, ignore_index=True)
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
week1_2 = week2.append(week1, ignore_index = True) # work same as concat method
week1_2.head()
OUTPUT:
Customer ID | Food ID | |
---|---|---|
0 | 688 | 10 |
1 | 813 | 7 |
2 | 495 | 10 |
3 | 189 | 5 |
4 | 267 | 3 |
.merge() Method#
left_on and right_on parameters#
২টি data-set কে একত্রে আনার প্রক্রিয়াই marging.
pandas.merge(leftDataFrame, rightDataFrame, how, left_on, right_on)
pandas কে merge method দ্বারা call করে signuture গুলোতে সঠিক data পাঠিয়ে marge করতে পারি। how parameter এ define করতে হবে merge টাই left dataFrame এ নাকি right dataFrame প্রথমে থাকবে । left_on এ left dataFrame এর specific label এবং right_on এ right dataFrame এর specific label থাকবে।
same type এর যে কোন label দ্বারা index define করে merge data কে compare করা যায়।
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv", index_col = "ID") # define index column cause data frame have extra number column serial
mergeData= pd.merge(week1, customers, how = "left", left_on = "Customer ID", right_on = "ID")
mergeData.head()
OUTPUT:
Customer ID | Food ID | First Name | Last Name | Gender | Company | Occupation | |
---|---|---|---|---|---|---|---|
0 | 537 | 9 | Cheryl | Carroll | Female | Zoombeat | Registered Nurse |
1 | 97 | 4 | Amanda | Watkins | Female | Ozu | Account Coordinator |
2 | 658 | 1 | Patrick | Webb | Male | Browsebug | Community Outreach Specialist |
3 | 202 | 2 | Louis | Campbell | Male | Rhynoodle | Account Representative III |
4 | 155 | 9 | Carolyn | Diaz | Female | Gigazoom | Database Administrator III |
Merging by Indexes ( left_index & right_index) Parameters#
by-Default left_index এবং right_index এর মান False থাকে, merge করার সময় left Data-Frame বা right Data-Frame এর index কে main index হিসেবে ব্যবহার করার জন্য specific index এর মান True দিতে হবে।
leftDataFrame.merge(rightDataFrame, how, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])
how তে left merge না কি right merge টা দিতে হবে,
left DataFrame এর index হিসেবে যদি label ব্যবহার করা হয় তবে টা left_on এ দিতে হবে। unless দিতে হবে না।
right DataFrame এর index হিসেবে যদি label ব্যবহার করা হয় তবে টা right_on এ দিতে হবে। unless দিতে হবে না।
left DataFrame এর index হিসেবে যদি left DataFrame এর index ব্যবহার করা হয় তবে left_index = True দিতে হবে। unless দিতে হবে না।
right DataFrame এর index হিসেবে যদি right DataFrame এর index ব্যবহার করা হয় তবে right_index = True দিতে হবে। unless দিতে হবে না।
merge data চেনার জন্য যদি specific data-frame এর সকল label এর সঙ্ঘে specific কোন word add করতে চাই তা হইলে suffixes parameter ব্যবহার করতে হবে।
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv", index_col = "ID") # define index column cause data frame have extra number column serial
food = pd.read_csv("Restaurant - Foods.csv", index_col = "Food ID")
food.head()
"""
use for marge with index value or marge with index and column
"""
sales = week1.merge(customers, how = "left", left_on = "Customer ID", right_index = True)
# now marge with foods data set to add food details with sales
sales.merge(food, how = "left", left_on = "Food ID", right_index = True)
sales.head()
# merge by Index In both datasets and include suffixes name on label
week1.merge(week2, how = "left", left_index = True, right_index = True, suffixes = ["-week1", "-week2"])
OUTPUT:
Customer ID-week1 | Food ID-week1 | Customer ID-week2 | Food ID-week2 | |
---|---|---|---|---|
0 | 537 | 9 | 688 | 10 |
1 | 97 | 4 | 813 | 7 |
2 | 658 | 1 | 495 | 10 |
3 | 202 | 2 | 189 | 5 |
4 | 155 | 9 | 267 | 3 |
... | ... | ... | ... | ... |
245 | 413 | 9 | 783 | 10 |
246 | 926 | 6 | 556 | 10 |
247 | 134 | 3 | 547 | 9 |
248 | 396 | 6 | 252 | 9 |
249 | 535 | 10 | 249 | 6 |
250 rows × 4 columns
.join() Method#
.merge() method এর মত .join() method ব্যবহার করে আমরা ২টি data-frame কে একটি data-frame এ রূপান্তর করতে পারি। সাধারণত leftDataFrame.join(rightDataFrame) ব্যবহার করে পাশাপাশি 2 টি DataFrame কে join করা যায়।
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
satisfaction = pd.read_csv("Restaurant - Week 1 Satisfaction.csv")
# use for join 2 dataSet with out extra parameter
week1.join(satisfaction).head()
OUTPUT:
Customer ID | Food ID | Satisfaction Rating | |
---|---|---|---|
0 | 537 | 9 | 2 |
1 | 97 | 4 | 7 |
2 | 658 | 1 | 3 |
3 | 202 | 2 | 7 |
4 | 155 | 9 | 10 |
inner Joins (inner marge)#
২ টা data-frame এর মদ্ধে common value গুলো মিলে new data-frame তৈরি করলে inner-join করা হয়। join করার সময় notice করতে হবে যেন ২ টা data-frame এ same Name এবং same size এর common label থাকে, যে label টার উপর deppend করে আমাদের inner join complete হবে।
সকল প্রকার join করতে .marge() method এর how parameter এর পরিবর্তন ঘটাতে হয়।
leftDataFrame.merge(rightDataFrame, how=”inner”, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
""" try to find which customer are came on week1 and week2 both """
# both dataset have same name of column "Customer ID" so it defaultly create "Customer ID_x""Customer ID_y" for reduse duplicate column name
week1.merge(week2, how = "inner", on = "Customer ID").head(4)
# now give some specific name to "Customer ID"
week1.merge(week2, how = "inner", on = "Customer ID",suffixes = [" - week 1"," - week 2"]).head(4)
""" try to find which customer are came on both week1 and week2 both and ordered exjact same product to both week1 & week2"""
week1.merge(week2, how = "inner", on = ["Customer ID", "Food ID"]).head(4)
OUTPUT:
Customer ID | Food ID | |
---|---|---|
0 | 304 | 3 |
1 | 540 | 3 |
2 | 937 | 10 |
3 | 233 | 3 |
outer Joins (outer marge)#
উপরের picture এ সকল join এ কি ঘটে তা দাওয়া হয়েছে। আপনারা যদি উপরের topic গুলো ভাল মত বুঝতে পারেন এবং SQL এর join topic জেনে থাকেন তা হইলে code এর flow অনুসরণ করুন এবং line by line code run করুন। আশা করি বুঝতে পারবেন।
inner join এর similar. just how=”outer”
leftDataFrame.merge(rightDataFrame, how=”outer”, left_on, right_on, left_index, right_index, suffixes=[“lastNameForLeft”, “LastNameForRight”])
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")
""" try to find which customer are not came on week1 and week2 both """
# both dataset have same name of column "Customer ID" so it defaultly create "Customer ID_x""Customer ID_y" for reduse duplicate column name
week1.merge(week2, how = "outer", on = "Customer ID").head(4)
# NaN found when those costomer is not found in column
# now give some specific name to "Customer ID". indicator parameter define where the value is
marged = week1.merge(week2, how = "outer", on = "Customer ID",
suffixes = [" - week 1"," - week 2"], indicator = True)
len(marged) # 454 rows have not similar value
""" try to find which customer are not came on both week1 and week2 both and ordered exjact same product to both week1 & week2"""
week1.merge(week2, how = "outer", on = ["Customer ID", "Food ID"]).head(4)
# total discription
marged["_merge"].value_counts()
OUTPUT:
right_only 197
left_only 195
both 62
Name: _merge, dtype: int64
left Joins (left marge)#
.merge() method এ যে example দেখেছেন তা left join এর উদাহরণ।
how = “left” ব্যবহার করে বাকি সব same.
just আপনাদের join বাপার টার visual flow মনে রাখতে হবে এবং কেন কোন merge ব্যবহার করবেন টা নিশ্চিত করতে হবে।
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")
""" try to find which food item buy customer on week1 """ # so food ID is our matching column
buy = week1.merge(food, how = "left", on = "Food ID", sort = True) # it's sort with matching column
buy.head()
OUTPUT:
Customer ID | Food ID | Food Item | Price | |
---|---|---|---|---|
0 | 658 | 1 | Sushi | 3.99 |
1 | 600 | 1 | Sushi | 3.99 |
2 | 155 | 1 | Sushi | 3.99 |
3 | 341 | 1 | Sushi | 3.99 |
4 | 20 | 1 | Sushi | 3.99 |
right Joins (right marge)#
how = “right” ব্যবহার করে বাকি সব same.
just আপনাদের join বাপার টার visual flow মনে রাখতে হবে এবং কেন কোন merge ব্যবহার করবেন টা নিশ্চিত করতে হবে।
Example:
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")
""" try to find which food item buy customer on week1 """ # so food ID is our matching column
buy = food.merge(week1, how = "right", on = "Food ID", sort = True) # it's sort with matching column
buy.head()
OUTPUT:
Food ID | Food Item | Price | Customer ID | |
---|---|---|---|---|
0 | 1 | Sushi | 3.99 | 658 |
1 | 1 | Sushi | 3.99 | 600 |
2 | 1 | Sushi | 3.99 | 155 |
3 | 1 | Sushi | 3.99 | 341 |
4 | 1 | Sushi | 3.99 | 20 |
Drop label#
যদি merged data-frame এ অপ্রয়োজনীয় label এবং data থাকে তা হইলে .drop() method ব্যবহার করব।
Example:
When marged column have not same name then we cann't work with previous marge or join system. matching or comparing columns always need same name. If that's not happend then we need extra 2 parameter left_on & right_on
import pandas as pd
week1 = pd.read_csv("Restaurant - Week 1 Sales.csv")
week2 = pd.read_csv("Restaurant - Week 2 Sales.csv")
customers = pd.read_csv("Restaurant - Customers.csv")
food = pd.read_csv("Restaurant - Foods.csv")
week2.merge(customers, how = "left", left_on = "Customer ID", right_on = "ID") # "Customer ID" & "ID" same so we don't need duplicate column
week2CustomerDemo = week2.merge(customers, how = "left", left_on = "Customer ID",
right_on = "ID", sort = True).drop("ID", axis= "columns") # now ID column is drop
week2CustomerDemo.head()
OUTPUT:
Customer ID | Food ID | First Name | Last Name | Gender | Company | Occupation | |
---|---|---|---|---|---|---|---|
0 | 8 | 6 | Frances | Adams | Female | Dabshots | Developer III |
1 | 13 | 2 | Ruth | Alvarez | Female | Twitterlist | Mechanical Systems Engineer |
2 | 21 | 4 | Albert | Burns | Male | Rhynoodle | Junior Executive |
3 | 24 | 8 | Donna | Thomas | Female | Jaxbean | Chief Design Engineer |
4 | 27 | 4 | Jessica | Bennett | Female | Twitternation | Account Executive |