Pandas GroupBy#

প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD

এবং Official Doc

Pandas এর groupby function টি পাইথনের একটি শক্তিশালী এবং বহুমুখী function. এটি আপনাকে আরও ভাল বিশ্লেষণের জন্য, গণনা সম্পাদনের জন্য, পৃথক গ্রুপে আপনার ডেটা বিভক্ত করতে দেয়।

common value গুলোকে আলাদা column এ পৃথক করতে বা data-set কে বেশকিছু segment এ ভাগ করতে groupby function অনেক কাজে লাগে।

fortune নাম এর ১০০০ company এর information সহ একটা data-set থেকে fortune নাম এর একটা data-frame তৈরি করা হয়েছে। এখন ঐ data-frame থেকে sector column টাকে আলাদা একটা group বানানো হয়েছে।

Example:

#please run one script line than another

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# data-frame and groupby's data type are not same
print("dataframe type: ", type(fortune), "\n\n\n")
print("groupby object type: ", type(sectors), "\n\n\n")

len(sectors)  # gropeBy create 21 unique group
print("total unique object : ",fortune["Sector"].nunique(), "\n\n\n")  # we see total unique object

# we can see our total unique object with there value by .size()
sectors.size() # there are 21 sector , we see

# we can also see our unique object with its value by .value_count() but value_count make sorted as
fortune["Sector"].value_counts()  # there are 21 sector , we see

Output:

dataframe type:  <class 'pandas.core.frame.DataFrame'> 



groupby object type:  <class 'pandas.core.groupby.generic.DataFrameGroupBy'> 



total unique object :  21 








Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Media                            25
Hotels, Resturants & Leisure     25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Telecommunications               15
Apparel                          15
Food and Drug Stores             15
Name: Sector, dtype: int64

Basic group object#

data-set এর যে সব label এ repeated value থাকে সেই সব labels নিয়ে groupby object বানানো হয়। তাই একটা group value সম্পূর্ণ data-frame এ অনেক বার repeated থাকে।

আমরা groupbyDataFrame.first() ব্যবহার করে প্রতিটি group value এর under এ যে row টি প্রথমে আছে তা pick করতে পারি।

আমরা groupbyDataFrame.last() ব্যবহার করে প্রতিটি group value এর under এ যে row টি শেষ এ আছে তা pick করতে পারি।

আমরা groupbyDataFrame.groups ব্যবহার করে একটি python dictionary পেয়ে থাকি, যার key=each_group_value আর value=index_position_which_are_in_this_group_value.

আমরা .groups ব্যবহার করে index_number নিতে পারি। যা dataFrame.loc[index_number] এ ব্যবহার করে ঐ index এর সব data পাইতে পারি।

Example:

#please run one script line than another

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
print(fortune.head(3),"\n\n\n")

# from 1000 companies total 21 sector's each values 
print(fortune["Sector"].value_counts(), "\n\n\n")

# we can extract very first row on group by .first()
first=sectors.first()

# we can extract last row on group by .first()
sectors.last()

# we can represent every group as a python dictionary. each group object represent as key and values are index label as
sectors.groups

# now we see our first keys 1st value what represent by .loc[]
fortune.loc[24] # here sector which is group object, that match with 1st key

Output:

          Company      Sector                     Industry         Location  \
Rank                                                                          
1         Walmart   Retailing        General Merchandisers  Bentonville, AR   
2     Exxon Mobil      Energy           Petroleum Refining       Irving, TX   
3           Apple  Technology  Computers, Office Equipment    Cupertino, CA   

      Revenue  Profits  Employees  
Rank                               
1      482130    14694    2300000  
2      246204    16150      75600  
3      233715    53394     110000   



Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Media                            25
Hotels, Resturants & Leisure     25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Telecommunications               15
Apparel                          15
Food and Drug Stores             15
Name: Sector, dtype: int64 








Company                     Boeing
Sector         Aerospace & Defense
Industry     Aerospace and Defense
Location               Chicago, IL
Revenue                      96114
Profits                       5176
Employees                   161400
Name: 24, dtype: object

.get_group() Method#

datafeame.get_group(“target_value”) ব্যবহার করে আমরা DataFrame থেকে যে কোন value define করতে পারি এবং যে সব row তে ঐ value আছে শুধু সেই সব row বিশিষ্ট একটা নতুন dataFrame পেতে পারি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now try to create sub-set of all group values separetly by
fortune[fortune["Sector"] == "Apparel"]  # thats create one dataframe similar as .get_group()method

# or
# now we extract any group object's total value as a sub set by .get_group("groupObj")
energy = sectors.get_group("Energy")
# same as
tec = sectors.get_group("Technology")
tec.head(5)

Output:

Company Sector Industry Location Revenue Profits Employees
Rank
3 Apple Technology Computers, Office Equipment Cupertino, CA 233715 53394 110000
18 Amazon.com Technology Internet Services and Retailing Seattle, WA 107006 596 230800
20 HP Technology Computers, Office Equipment Palo Alto, CA 103355 4554 287000
25 Microsoft Technology Computer Software Redmond, WA 93580 12193 118000
31 IBM Technology Information Technology Services Armonk, NY 82461 13190 411798

Methods on GroupBy#

groupby দ্বারা specific label আলাদা করে নিয়ে আমরা সকল groupby object থেকে বেশকিছু operation করতে পারি। যা আমাদের analysis কে অনেক efficient করে।

আমরা আমাদের প্রতিটি group এর under এ যে valueটি maximum বা minimum কিংবা প্রতিটি group এর numerical value নিয়ে arithmatic operation করতে পারি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now if we use .max() its display last alphabetcaly or max value of each group object's very first columns defaultly
sectors.max()
# same for .min() method
sectors.min()

# but for numeric operation it's just take those columns which have numeric value as
sectors.sum()
sectors.mean()

# we can also work same for specific group object by
sectors.get_group("Apparel")["Revenue"].sum()

# we can also work same for specific colums  all value on group boject by
sectors["Revenue"].sum()
sectors["Employees"].sum()
sectors["Profits"].max()
sectors[["Employees", "Profits"]].sum()

Output:

Employees Profits
Sector
Aerospace & Defense 968057 28742
Apparel 346397 8236
Business Services 1361050 28227
Chemicals 463651 22628
Energy 1188927 -73447
Engineering & Construction 406708 5304
Financials 3359948 260209
Food and Drug Stores 1395398 16759
Food, Beverages & Tobacco 1211632 51417
Health Care 2678289 106114
Hotels, Resturants & Leisure 2484245 20697
Household Products 646038 14428
Industrials 1545229 20764
Materials 638123 4428
Media 550314 24347
Motor Vehicles & Parts 1082560 25898
Retailing 6227629 47830
Technology 3578949 180473
Telecommunications 832468 48637
Transportation 1536793 44169
Wholesalers 525597 8233

Grouping By Multiple Columns#

আমরা groupby দ্বারা সাধারণত series তৈরি করে থাকি যা প্রদত্ত data-frame এর সঙ্ঘে বিভিন্ন operation করে থাকে। কিন্তু multi-columns এ groupby ব্যবহার করা মানে multi-index series তৈরি করা। আমরা multi-index নিয়ে আগের post এ জেনেছি । আমরা groupby এ list আকারে multiple index define করতে পারি। যা একটি multi-index series.

আমরা groupby এর যে কোন method ব্যবহার করলে তা প্রথমে outer index টার পর same way তে inner index এ execute হবে।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby(["Sector", "Industry"])
fortune.head(3)

# here we can see our created multi indexing
sectors.size()

# now try to work with multiIndex group object
sectors["Revenue"].sum()

Output:

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

.agg() Method (Aggragation)#

  • .agg() method ব্যবহার করে আমরা অনেকগুলো label এ বিভিন্ন type এর operation একই সঙ্ঘে করতে পারি।

      groupBy_Object.agg({"labelName1" : ["operation1", "operation2"],
                                                "labelName2" : "operation3",
                                                "labelName3" : "operation4"})
    
  • আবার সকল label এর জন্য বিভিন্ন operation এক সঙ্ঘে define করতে পারি।

      groupBy_Object.agg(["operation1","operation2","operation3"])
    

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now we can aggragation specific unique column by
sectors.agg({"Revenue" : ["sum","mean"],
             "Profits" : "sum",
             "Employees" : "mean"})

# or we can do same by
a = sectors.agg(["size","sum","mean"])
a

Output:

Revenue Profits Employees
size sum mean size sum mean size sum mean
Sector
Aerospace & Defense 20 357940 17897.000000 20 28742 1437.100000 20 968057 48402.850000
Apparel 15 95968 6397.866667 15 8236 549.066667 15 346397 23093.133333
Business Services 51 272195 5337.156863 51 28227 553.470588 51 1361050 26687.254902
Chemicals 30 243897 8129.900000 30 22628 754.266667 30 463651 15455.033333
Energy 122 1517809 12441.057377 122 -73447 -602.024590 122 1188927 9745.303279
Engineering & Construction 26 153983 5922.423077 26 5304 204.000000 26 406708 15642.615385
Financials 139 2217159 15950.784173 139 260209 1872.007194 139 3359948 24172.287770
Food and Drug Stores 15 483769 32251.266667 15 16759 1117.266667 15 1395398 93026.533333
Food, Beverages & Tobacco 43 555967 12929.465116 43 51417 1195.744186 43 1211632 28177.488372
Health Care 75 1614707 21529.426667 75 106114 1414.853333 75 2678289 35710.520000
Hotels, Resturants & Leisure 25 169546 6781.840000 25 20697 827.880000 25 2484245 99369.800000
Household Products 28 234737 8383.464286 28 14428 515.285714 28 646038 23072.785714
Industrials 46 497581 10816.978261 46 20764 451.391304 46 1545229 33591.934783
Materials 43 259145 6026.627907 43 4428 102.976744 43 638123 14840.069767
Media 25 220764 8830.560000 25 24347 973.880000 25 550314 22012.560000
Motor Vehicles & Parts 24 482540 20105.833333 24 25898 1079.083333 24 1082560 45106.666667
Retailing 80 1465076 18313.450000 80 47830 597.875000 80 6227629 77845.362500
Technology 102 1377600 13505.882353 102 180473 1769.343137 102 3578949 35087.735294
Telecommunications 15 461834 30788.933333 15 48637 3242.466667 15 832468 55497.866667
Transportation 36 408508 11347.444444 36 44169 1226.916667 36 1536793 42688.694444
Wholesalers 40 444800 11120.000000 40 8233 205.825000 40 525597 13139.925000

Iterating through Groups#

আমাদের data-set এ যে ১০০০ company এর data আছে টার মধ্যে প্রতিটি sector এ highest revenue যে company এর তা বাহির করার try করে দেখি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)


# we can extract every value from a group object with specific operation by

df = pd.DataFrame(columns = fortune.columns) # seen empty value with all columns

for sector, data in sectors:
    highest_revenue_company_in_group = data.nlargest(1, "Revenue")
    df = df.append(highest_revenue_company_in_group)



# or we can do same by
cities = fortune.groupby("Location")
df2 = pd.DataFrame(columns = fortune.columns)
df2  # seen empty value with all columns

for city, data in cities:
    highest_revenue_in_city = data.nlargest(1, "Revenue")
    df2 = df2.append(highest_revenue_in_city)   

df2

Output:

Company Sector Industry Location Revenue Profits Employees
138 Abbott Laboratories Health Care Medical Products and Equipment Abbott Park, IL 20661 4423 74000
169 Goodyear Tire & Rubber Motor Vehicles & Parts Motor Vehicles and Parts Akron, OH 16443 307 66000
288 Air Products & Chemicals Chemicals Chemicals Allentown, PA 9895 1278 19550
830 Benchmark Electronics Technology Semiconductors and Other Electronic Components Angleton, TX 2541 95 10500
374 Casey’s General Stores Retailing Specialty Retailers: Other Ankeny, IA 7052 181 22408
... ... ... ... ... ... ... ...
7 CVS Health Food and Drug Stores Food and Drug Stores Woonsocket, RI 153290 5237 199000
506 Hanover Insurance Group Financials Insurance: Property and Casualty (Stock) Worcester, MA 5034 332 4800
764 Penn National Gaming Hotels, Resturants & Leisure Hotels, Casinos, Resorts Wyomissing, PA 2838 1 18204
773 Bon-Ton Stores Retailing General Merchandisers York, PA 2790 -57 24100
932 Herman Miller Household Products Home Equipment, Furnishings Zeeland, MI 2142 98 7510

416 rows × 7 columns