Pandas GroupBy#

প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD

Pandas এর groupby function টি পাইথনের একটি শক্তিশালী এবং বহুমুখী function. এটি আপনাকে আরও ভাল বিশ্লেষণের জন্য, গণনা সম্পাদনের জন্য, পৃথক গ্রুপে আপনার ডেটা বিভক্ত করতে দেয়।

common value গুলোকে আলাদা column এ পৃথক করতে বা data-set কে বেশকিছু segment এ ভাগ করতে groupby function অনেক কাজে লাগে।

fortune নাম এর ১০০০ company এর information সহ একটা data-set থেকে fortune নাম এর একটা data-frame তৈরি করা হয়েছে। এখন ঐ data-frame থেকে sector column টাকে আলাদা একটা group বানানো হয়েছে।

Example:

#please run one script line than another

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# data-frame and groupby's data type are not same
print("dataframe type: ", type(fortune), "\n\n\n")
print("groupby object type: ", type(sectors), "\n\n\n")

len(sectors)  # gropeBy create 21 unique group
print("total unique object : ",fortune["Sector"].nunique(), "\n\n\n")  # we see total unique object

# we can see our total unique object with there value by .size()
sectors.size() # there are 21 sector , we see

# we can also see our unique object with its value by .value_count() but value_count make sorted as
fortune["Sector"].value_counts()  # there are 21 sector , we see

Output:

dataframe type:  <class 'pandas.core.frame.DataFrame'> 



groupby object type:  <class 'pandas.core.groupby.generic.DataFrameGroupBy'> 



total unique object :  21 








Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Media                            25
Hotels, Resturants & Leisure     25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Telecommunications               15
Apparel                          15
Food and Drug Stores             15
Name: Sector, dtype: int64

Basic group object#

data-set এর যে সব label এ repeated value থাকে সেই সব labels নিয়ে groupby object বানানো হয়। তাই একটা group value সম্পূর্ণ data-frame এ অনেক বার repeated থাকে।

আমরা groupbyDataFrame.first() ব্যবহার করে প্রতিটি group value এর under এ যে row টি প্রথমে আছে তা pick করতে পারি।

আমরা groupbyDataFrame.last() ব্যবহার করে প্রতিটি group value এর under এ যে row টি শেষ এ আছে তা pick করতে পারি।

আমরা groupbyDataFrame.groups ব্যবহার করে একটি python dictionary পেয়ে থাকি, যার key=each_group_value আর value=index_position_which_are_in_this_group_value.

আমরা .groups ব্যবহার করে index_number নিতে পারি। যা dataFrame.loc[index_number] এ ব্যবহার করে ঐ index এর সব data পাইতে পারি।

Example:

#please run one script line than another

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
print(fortune.head(3),"\n\n\n")

# from 1000 companies total 21 sector's each values 
print(fortune["Sector"].value_counts(), "\n\n\n")

# we can extract very first row on group by .first()
first=sectors.first()

# we can extract last row on group by .first()
sectors.last()

# we can represent every group as a python dictionary. each group object represent as key and values are index label as
sectors.groups

# now we see our first keys 1st value what represent by .loc[]
fortune.loc[24] # here sector which is group object, that match with 1st key

Output:

          Company      Sector                     Industry         Location  \
Rank                                                                          
1         Walmart   Retailing        General Merchandisers  Bentonville, AR   
2     Exxon Mobil      Energy           Petroleum Refining       Irving, TX   
3           Apple  Technology  Computers, Office Equipment    Cupertino, CA   

      Revenue  Profits  Employees  
Rank                               
1      482130    14694    2300000  
2      246204    16150      75600  
3      233715    53394     110000   



Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Media                            25
Hotels, Resturants & Leisure     25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Telecommunications               15
Apparel                          15
Food and Drug Stores             15
Name: Sector, dtype: int64 








Company                     Boeing
Sector         Aerospace & Defense
Industry     Aerospace and Defense
Location               Chicago, IL
Revenue                      96114
Profits                       5176
Employees                   161400
Name: 24, dtype: object

.get_group() Method#

datafeame.get_group(“target_value”) ব্যবহার করে আমরা DataFrame থেকে যে কোন value define করতে পারি এবং যে সব row তে ঐ value আছে শুধু সেই সব row বিশিষ্ট একটা নতুন dataFrame পেতে পারি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now try to create sub-set of all group values separetly by
fortune[fortune["Sector"] == "Apparel"]  # thats create one dataframe similar as .get_group()method

# or
# now we extract any group object's total value as a sub set by .get_group("groupObj")
energy = sectors.get_group("Energy")
# same as
tec = sectors.get_group("Technology")
tec.head(5)

Output:

	Company	Sector	Industry	Location	Revenue	Profits	Employees
Rank
3	Apple	Technology	Computers, Office Equipment	Cupertino, CA	233715	53394	110000
18	Amazon.com	Technology	Internet Services and Retailing	Seattle, WA	107006	596	230800
20	HP	Technology	Computers, Office Equipment	Palo Alto, CA	103355	4554	287000
25	Microsoft	Technology	Computer Software	Redmond, WA	93580	12193	118000
31	IBM	Technology	Information Technology Services	Armonk, NY	82461	13190	411798

Methods on GroupBy#

groupby দ্বারা specific label আলাদা করে নিয়ে আমরা সকল groupby object থেকে বেশকিছু operation করতে পারি। যা আমাদের analysis কে অনেক efficient করে।

আমরা আমাদের প্রতিটি group এর under এ যে valueটি maximum বা minimum কিংবা প্রতিটি group এর numerical value নিয়ে arithmatic operation করতে পারি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now if we use .max() its display last alphabetcaly or max value of each group object's very first columns defaultly
sectors.max()
# same for .min() method
sectors.min()

# but for numeric operation it's just take those columns which have numeric value as
sectors.sum()
sectors.mean()

# we can also work same for specific group object by
sectors.get_group("Apparel")["Revenue"].sum()

# we can also work same for specific colums  all value on group boject by
sectors["Revenue"].sum()
sectors["Employees"].sum()
sectors["Profits"].max()
sectors[["Employees", "Profits"]].sum()

Output:

	Employees	Profits
Sector
Aerospace & Defense	968057	28742
Apparel	346397	8236
Business Services	1361050	28227
Chemicals	463651	22628
Energy	1188927	-73447
Engineering & Construction	406708	5304
Financials	3359948	260209
Food and Drug Stores	1395398	16759
Food, Beverages & Tobacco	1211632	51417
Health Care	2678289	106114
Hotels, Resturants & Leisure	2484245	20697
Household Products	646038	14428
Industrials	1545229	20764
Materials	638123	4428
Media	550314	24347
Motor Vehicles & Parts	1082560	25898
Retailing	6227629	47830
Technology	3578949	180473
Telecommunications	832468	48637
Transportation	1536793	44169
Wholesalers	525597	8233

Grouping By Multiple Columns#

আমরা groupby দ্বারা সাধারণত series তৈরি করে থাকি যা প্রদত্ত data-frame এর সঙ্ঘে বিভিন্ন operation করে থাকে। কিন্তু multi-columns এ groupby ব্যবহার করা মানে multi-index series তৈরি করা। আমরা multi-index নিয়ে আগের post এ জেনেছি । আমরা groupby এ list আকারে multiple index define করতে পারি। যা একটি multi-index series.

আমরা groupby এর যে কোন method ব্যবহার করলে তা প্রথমে outer index টার পর same way তে inner index এ execute হবে।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby(["Sector", "Industry"])
fortune.head(3)

# here we can see our created multi indexing
sectors.size()

# now try to work with multiIndex group object
sectors["Revenue"].sum()

Output:

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

.agg() Method (Aggragation)#

.agg() method ব্যবহার করে আমরা অনেকগুলো label এ বিভিন্ন type এর operation একই সঙ্ঘে করতে পারি।

  groupBy_Object.agg({"labelName1" : ["operation1", "operation2"],
                                            "labelName2" : "operation3",
                                            "labelName3" : "operation4"})

আবার সকল label এর জন্য বিভিন্ন operation এক সঙ্ঘে define করতে পারি।
```
  groupBy_Object.agg(["operation1","operation2","operation3"])
```

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)

# now we can aggragation specific unique column by
sectors.agg({"Revenue" : ["sum","mean"],
             "Profits" : "sum",
             "Employees" : "mean"})

# or we can do same by
a = sectors.agg(["size","sum","mean"])
a

Output:

	Revenue			Profits			Employees
	size	sum	mean	size	sum	mean	size	sum	mean
Sector
Aerospace & Defense	20	357940	17897.000000	20	28742	1437.100000	20	968057	48402.850000
Apparel	15	95968	6397.866667	15	8236	549.066667	15	346397	23093.133333
Business Services	51	272195	5337.156863	51	28227	553.470588	51	1361050	26687.254902
Chemicals	30	243897	8129.900000	30	22628	754.266667	30	463651	15455.033333
Energy	122	1517809	12441.057377	122	-73447	-602.024590	122	1188927	9745.303279
Engineering & Construction	26	153983	5922.423077	26	5304	204.000000	26	406708	15642.615385
Financials	139	2217159	15950.784173	139	260209	1872.007194	139	3359948	24172.287770
Food and Drug Stores	15	483769	32251.266667	15	16759	1117.266667	15	1395398	93026.533333
Food, Beverages & Tobacco	43	555967	12929.465116	43	51417	1195.744186	43	1211632	28177.488372
Health Care	75	1614707	21529.426667	75	106114	1414.853333	75	2678289	35710.520000
Hotels, Resturants & Leisure	25	169546	6781.840000	25	20697	827.880000	25	2484245	99369.800000
Household Products	28	234737	8383.464286	28	14428	515.285714	28	646038	23072.785714
Industrials	46	497581	10816.978261	46	20764	451.391304	46	1545229	33591.934783
Materials	43	259145	6026.627907	43	4428	102.976744	43	638123	14840.069767
Media	25	220764	8830.560000	25	24347	973.880000	25	550314	22012.560000
Motor Vehicles & Parts	24	482540	20105.833333	24	25898	1079.083333	24	1082560	45106.666667
Retailing	80	1465076	18313.450000	80	47830	597.875000	80	6227629	77845.362500
Technology	102	1377600	13505.882353	102	180473	1769.343137	102	3578949	35087.735294
Telecommunications	15	461834	30788.933333	15	48637	3242.466667	15	832468	55497.866667
Transportation	36	408508	11347.444444	36	44169	1226.916667	36	1536793	42688.694444
Wholesalers	40	444800	11120.000000	40	8233	205.825000	40	525597	13139.925000

Iterating through Groups#

আমাদের data-set এ যে ১০০০ company এর data আছে টার মধ্যে প্রতিটি sector এ highest revenue যে company এর তা বাহির করার try করে দেখি।

Example:

import pandas as pd

fortune = pd.read_csv("fortune1000.csv", index_col = "Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)


# we can extract every value from a group object with specific operation by

df = pd.DataFrame(columns = fortune.columns) # seen empty value with all columns

for sector, data in sectors:
    highest_revenue_company_in_group = data.nlargest(1, "Revenue")
    df = df.append(highest_revenue_company_in_group)



# or we can do same by
cities = fortune.groupby("Location")
df2 = pd.DataFrame(columns = fortune.columns)
df2  # seen empty value with all columns

for city, data in cities:
    highest_revenue_in_city = data.nlargest(1, "Revenue")
    df2 = df2.append(highest_revenue_in_city)   

df2

Output:

	Company	Sector	Industry	Location	Revenue	Profits	Employees
138	Abbott Laboratories	Health Care	Medical Products and Equipment	Abbott Park, IL	20661	4423	74000
169	Goodyear Tire & Rubber	Motor Vehicles & Parts	Motor Vehicles and Parts	Akron, OH	16443	307	66000
288	Air Products & Chemicals	Chemicals	Chemicals	Allentown, PA	9895	1278	19550
830	Benchmark Electronics	Technology	Semiconductors and Other Electronic Components	Angleton, TX	2541	95	10500
374	Casey’s General Stores	Retailing	Specialty Retailers: Other	Ankeny, IA	7052	181	22408
...	...	...	...	...	...	...	...
7	CVS Health	Food and Drug Stores	Food and Drug Stores	Woonsocket, RI	153290	5237	199000
506	Hanover Insurance Group	Financials	Insurance: Property and Casualty (Stock)	Worcester, MA	5034	332	4800
764	Penn National Gaming	Hotels, Resturants & Leisure	Hotels, Casinos, Resorts	Wyomissing, PA	2838	1	18204
773	Bon-Ton Stores	Retailing	General Merchandisers	York, PA	2790	-57	24100
932	Herman Miller	Household Products	Home Equipment, Furnishings	Zeeland, MI	2142	98	7510

416 rows × 7 columns