Pandas GroupBy#
প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD
এবং Official Doc
Pandas এর groupby function টি পাইথনের একটি শক্তিশালী এবং বহুমুখী function. এটি আপনাকে আরও ভাল বিশ্লেষণের জন্য, গণনা সম্পাদনের জন্য, পৃথক গ্রুপে আপনার ডেটা বিভক্ত করতে দেয়।
common value গুলোকে আলাদা column এ পৃথক করতে বা data-set কে বেশকিছু segment এ ভাগ করতে groupby function অনেক কাজে লাগে।
fortune নাম এর ১০০০ company এর information সহ একটা data-set থেকে fortune নাম এর একটা data-frame তৈরি করা হয়েছে। এখন ঐ data-frame থেকে sector column টাকে আলাদা একটা group বানানো হয়েছে।
Example:
#please run one script line than another
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
# data-frame and groupby's data type are not same
print("dataframe type: ", type(fortune), "\n\n\n")
print("groupby object type: ", type(sectors), "\n\n\n")
len(sectors) # gropeBy create 21 unique group
print("total unique object : ",fortune["Sector"].nunique(), "\n\n\n") # we see total unique object
# we can see our total unique object with there value by .size()
sectors.size() # there are 21 sector , we see
# we can also see our unique object with its value by .value_count() but value_count make sorted as
fortune["Sector"].value_counts() # there are 21 sector , we see
Output:
dataframe type: <class 'pandas.core.frame.DataFrame'>
groupby object type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
total unique object : 21
Financials 139
Energy 122
Technology 102
Retailing 80
Health Care 75
Business Services 51
Industrials 46
Food, Beverages & Tobacco 43
Materials 43
Wholesalers 40
Transportation 36
Chemicals 30
Household Products 28
Engineering & Construction 26
Media 25
Hotels, Resturants & Leisure 25
Motor Vehicles & Parts 24
Aerospace & Defense 20
Telecommunications 15
Apparel 15
Food and Drug Stores 15
Name: Sector, dtype: int64
Basic group object#
data-set এর যে সব label এ repeated value থাকে সেই সব labels নিয়ে groupby object বানানো হয়। তাই একটা group value সম্পূর্ণ data-frame এ অনেক বার repeated থাকে।
আমরা groupbyDataFrame.first() ব্যবহার করে প্রতিটি group value এর under এ যে row টি প্রথমে আছে তা pick করতে পারি।
আমরা groupbyDataFrame.last() ব্যবহার করে প্রতিটি group value এর under এ যে row টি শেষ এ আছে তা pick করতে পারি।
আমরা groupbyDataFrame.groups ব্যবহার করে একটি python dictionary পেয়ে থাকি, যার key=each_group_value আর value=index_position_which_are_in_this_group_value.
আমরা .groups ব্যবহার করে index_number নিতে পারি। যা dataFrame.loc[index_number] এ ব্যবহার করে ঐ index এর সব data পাইতে পারি।
Example:
#please run one script line than another
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
print(fortune.head(3),"\n\n\n")
# from 1000 companies total 21 sector's each values
print(fortune["Sector"].value_counts(), "\n\n\n")
# we can extract very first row on group by .first()
first=sectors.first()
# we can extract last row on group by .first()
sectors.last()
# we can represent every group as a python dictionary. each group object represent as key and values are index label as
sectors.groups
# now we see our first keys 1st value what represent by .loc[]
fortune.loc[24] # here sector which is group object, that match with 1st key
Output:
Company Sector Industry Location \
Rank
1 Walmart Retailing General Merchandisers Bentonville, AR
2 Exxon Mobil Energy Petroleum Refining Irving, TX
3 Apple Technology Computers, Office Equipment Cupertino, CA
Revenue Profits Employees
Rank
1 482130 14694 2300000
2 246204 16150 75600
3 233715 53394 110000
Financials 139
Energy 122
Technology 102
Retailing 80
Health Care 75
Business Services 51
Industrials 46
Food, Beverages & Tobacco 43
Materials 43
Wholesalers 40
Transportation 36
Chemicals 30
Household Products 28
Engineering & Construction 26
Media 25
Hotels, Resturants & Leisure 25
Motor Vehicles & Parts 24
Aerospace & Defense 20
Telecommunications 15
Apparel 15
Food and Drug Stores 15
Name: Sector, dtype: int64
Company Boeing
Sector Aerospace & Defense
Industry Aerospace and Defense
Location Chicago, IL
Revenue 96114
Profits 5176
Employees 161400
Name: 24, dtype: object
.get_group() Method#
datafeame.get_group(“target_value”) ব্যবহার করে আমরা DataFrame থেকে যে কোন value define করতে পারি এবং যে সব row তে ঐ value আছে শুধু সেই সব row বিশিষ্ট একটা নতুন dataFrame পেতে পারি।
Example:
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
# now try to create sub-set of all group values separetly by
fortune[fortune["Sector"] == "Apparel"] # thats create one dataframe similar as .get_group()method
# or
# now we extract any group object's total value as a sub set by .get_group("groupObj")
energy = sectors.get_group("Energy")
# same as
tec = sectors.get_group("Technology")
tec.head(5)
Output:
Company | Sector | Industry | Location | Revenue | Profits | Employees | |
---|---|---|---|---|---|---|---|
Rank | |||||||
3 | Apple | Technology | Computers, Office Equipment | Cupertino, CA | 233715 | 53394 | 110000 |
18 | Amazon.com | Technology | Internet Services and Retailing | Seattle, WA | 107006 | 596 | 230800 |
20 | HP | Technology | Computers, Office Equipment | Palo Alto, CA | 103355 | 4554 | 287000 |
25 | Microsoft | Technology | Computer Software | Redmond, WA | 93580 | 12193 | 118000 |
31 | IBM | Technology | Information Technology Services | Armonk, NY | 82461 | 13190 | 411798 |
Methods on GroupBy#
groupby দ্বারা specific label আলাদা করে নিয়ে আমরা সকল groupby object থেকে বেশকিছু operation করতে পারি। যা আমাদের analysis কে অনেক efficient করে।
আমরা আমাদের প্রতিটি group এর under এ যে valueটি maximum বা minimum কিংবা প্রতিটি group এর numerical value নিয়ে arithmatic operation করতে পারি।
Example:
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
# now if we use .max() its display last alphabetcaly or max value of each group object's very first columns defaultly
sectors.max()
# same for .min() method
sectors.min()
# but for numeric operation it's just take those columns which have numeric value as
sectors.sum()
sectors.mean()
# we can also work same for specific group object by
sectors.get_group("Apparel")["Revenue"].sum()
# we can also work same for specific colums all value on group boject by
sectors["Revenue"].sum()
sectors["Employees"].sum()
sectors["Profits"].max()
sectors[["Employees", "Profits"]].sum()
Output:
Employees | Profits | |
---|---|---|
Sector | ||
Aerospace & Defense | 968057 | 28742 |
Apparel | 346397 | 8236 |
Business Services | 1361050 | 28227 |
Chemicals | 463651 | 22628 |
Energy | 1188927 | -73447 |
Engineering & Construction | 406708 | 5304 |
Financials | 3359948 | 260209 |
Food and Drug Stores | 1395398 | 16759 |
Food, Beverages & Tobacco | 1211632 | 51417 |
Health Care | 2678289 | 106114 |
Hotels, Resturants & Leisure | 2484245 | 20697 |
Household Products | 646038 | 14428 |
Industrials | 1545229 | 20764 |
Materials | 638123 | 4428 |
Media | 550314 | 24347 |
Motor Vehicles & Parts | 1082560 | 25898 |
Retailing | 6227629 | 47830 |
Technology | 3578949 | 180473 |
Telecommunications | 832468 | 48637 |
Transportation | 1536793 | 44169 |
Wholesalers | 525597 | 8233 |
Grouping By Multiple Columns#
আমরা groupby দ্বারা সাধারণত series তৈরি করে থাকি যা প্রদত্ত data-frame এর সঙ্ঘে বিভিন্ন operation করে থাকে। কিন্তু multi-columns এ groupby ব্যবহার করা মানে multi-index series তৈরি করা। আমরা multi-index নিয়ে আগের post এ জেনেছি । আমরা groupby এ list আকারে multiple index define করতে পারি। যা একটি multi-index series.
আমরা groupby এর যে কোন method ব্যবহার করলে তা প্রথমে outer index টার পর same way তে inner index এ execute হবে।
Example:
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby(["Sector", "Industry"])
fortune.head(3)
# here we can see our created multi indexing
sectors.size()
# now try to work with multiIndex group object
sectors["Revenue"].sum()
Output:
Sector Industry
Aerospace & Defense Aerospace and Defense 357940
Apparel Apparel 95968
Business Services Advertising, marketing 22748
Diversified Outsourcing Services 64829
Education 7485
...
Transportation Trucking, Truck Leasing 35950
Wholesalers Miscellaneous 8982
Wholesalers: Diversified 176138
Wholesalers: Electronics and Office Equipment 147906
Wholesalers: Food and Grocery 111774
Name: Revenue, Length: 79, dtype: int64
.agg() Method (Aggragation)#
.agg() method ব্যবহার করে আমরা অনেকগুলো label এ বিভিন্ন type এর operation একই সঙ্ঘে করতে পারি।
groupBy_Object.agg({"labelName1" : ["operation1", "operation2"], "labelName2" : "operation3", "labelName3" : "operation4"})
আবার সকল label এর জন্য বিভিন্ন operation এক সঙ্ঘে define করতে পারি।
groupBy_Object.agg(["operation1","operation2","operation3"])
Example:
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
# now we can aggragation specific unique column by
sectors.agg({"Revenue" : ["sum","mean"],
"Profits" : "sum",
"Employees" : "mean"})
# or we can do same by
a = sectors.agg(["size","sum","mean"])
a
Output:
Revenue | Profits | Employees | |||||||
---|---|---|---|---|---|---|---|---|---|
size | sum | mean | size | sum | mean | size | sum | mean | |
Sector | |||||||||
Aerospace & Defense | 20 | 357940 | 17897.000000 | 20 | 28742 | 1437.100000 | 20 | 968057 | 48402.850000 |
Apparel | 15 | 95968 | 6397.866667 | 15 | 8236 | 549.066667 | 15 | 346397 | 23093.133333 |
Business Services | 51 | 272195 | 5337.156863 | 51 | 28227 | 553.470588 | 51 | 1361050 | 26687.254902 |
Chemicals | 30 | 243897 | 8129.900000 | 30 | 22628 | 754.266667 | 30 | 463651 | 15455.033333 |
Energy | 122 | 1517809 | 12441.057377 | 122 | -73447 | -602.024590 | 122 | 1188927 | 9745.303279 |
Engineering & Construction | 26 | 153983 | 5922.423077 | 26 | 5304 | 204.000000 | 26 | 406708 | 15642.615385 |
Financials | 139 | 2217159 | 15950.784173 | 139 | 260209 | 1872.007194 | 139 | 3359948 | 24172.287770 |
Food and Drug Stores | 15 | 483769 | 32251.266667 | 15 | 16759 | 1117.266667 | 15 | 1395398 | 93026.533333 |
Food, Beverages & Tobacco | 43 | 555967 | 12929.465116 | 43 | 51417 | 1195.744186 | 43 | 1211632 | 28177.488372 |
Health Care | 75 | 1614707 | 21529.426667 | 75 | 106114 | 1414.853333 | 75 | 2678289 | 35710.520000 |
Hotels, Resturants & Leisure | 25 | 169546 | 6781.840000 | 25 | 20697 | 827.880000 | 25 | 2484245 | 99369.800000 |
Household Products | 28 | 234737 | 8383.464286 | 28 | 14428 | 515.285714 | 28 | 646038 | 23072.785714 |
Industrials | 46 | 497581 | 10816.978261 | 46 | 20764 | 451.391304 | 46 | 1545229 | 33591.934783 |
Materials | 43 | 259145 | 6026.627907 | 43 | 4428 | 102.976744 | 43 | 638123 | 14840.069767 |
Media | 25 | 220764 | 8830.560000 | 25 | 24347 | 973.880000 | 25 | 550314 | 22012.560000 |
Motor Vehicles & Parts | 24 | 482540 | 20105.833333 | 24 | 25898 | 1079.083333 | 24 | 1082560 | 45106.666667 |
Retailing | 80 | 1465076 | 18313.450000 | 80 | 47830 | 597.875000 | 80 | 6227629 | 77845.362500 |
Technology | 102 | 1377600 | 13505.882353 | 102 | 180473 | 1769.343137 | 102 | 3578949 | 35087.735294 |
Telecommunications | 15 | 461834 | 30788.933333 | 15 | 48637 | 3242.466667 | 15 | 832468 | 55497.866667 |
Transportation | 36 | 408508 | 11347.444444 | 36 | 44169 | 1226.916667 | 36 | 1536793 | 42688.694444 |
Wholesalers | 40 | 444800 | 11120.000000 | 40 | 8233 | 205.825000 | 40 | 525597 | 13139.925000 |
Iterating through Groups#
আমাদের data-set এ যে ১০০০ company এর data আছে টার মধ্যে প্রতিটি sector এ highest revenue যে company এর তা বাহির করার try করে দেখি।
Example:
import pandas as pd
fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
# we can extract every value from a group object with specific operation by
df = pd.DataFrame(columns = fortune.columns) # seen empty value with all columns
for sector, data in sectors:
highest_revenue_company_in_group = data.nlargest(1, "Revenue")
df = df.append(highest_revenue_company_in_group)
# or we can do same by
cities = fortune.groupby("Location")
df2 = pd.DataFrame(columns = fortune.columns)
df2 # seen empty value with all columns
for city, data in cities:
highest_revenue_in_city = data.nlargest(1, "Revenue")
df2 = df2.append(highest_revenue_in_city)
df2
Output:
Company | Sector | Industry | Location | Revenue | Profits | Employees | |
---|---|---|---|---|---|---|---|
138 | Abbott Laboratories | Health Care | Medical Products and Equipment | Abbott Park, IL | 20661 | 4423 | 74000 |
169 | Goodyear Tire & Rubber | Motor Vehicles & Parts | Motor Vehicles and Parts | Akron, OH | 16443 | 307 | 66000 |
288 | Air Products & Chemicals | Chemicals | Chemicals | Allentown, PA | 9895 | 1278 | 19550 |
830 | Benchmark Electronics | Technology | Semiconductors and Other Electronic Components | Angleton, TX | 2541 | 95 | 10500 |
374 | Casey’s General Stores | Retailing | Specialty Retailers: Other | Ankeny, IA | 7052 | 181 | 22408 |
... | ... | ... | ... | ... | ... | ... | ... |
7 | CVS Health | Food and Drug Stores | Food and Drug Stores | Woonsocket, RI | 153290 | 5237 | 199000 |
506 | Hanover Insurance Group | Financials | Insurance: Property and Casualty (Stock) | Worcester, MA | 5034 | 332 | 4800 |
764 | Penn National Gaming | Hotels, Resturants & Leisure | Hotels, Casinos, Resorts | Wyomissing, PA | 2838 | 1 | 18204 |
773 | Bon-Ton Stores | Retailing | General Merchandisers | York, PA | 2790 | -57 | 24100 |
932 | Herman Miller | Household Products | Home Equipment, Furnishings | Zeeland, MI | 2142 | 98 | 7510 |
416 rows × 7 columns