Pandas Text Data#
প্রয়োজনীয় data-set যা এই post এ ব্যবহার করা হয়েছে DOWNLOAD
এবং Official Doc
Text-Data নিয়ে কাজ করতে হইলে pandas এর str function ব্যবহার করতে হয়। কারণ pandas এর text-data related কাজ গুলো .str function এর sub-function গুলো করে থাকে । আমরা নিচে function গুলো এর example দেখব।
.str.lower(), .str.upper(), .str.title(), .str.len()#
যদি data-frame এর সব data string type হয়, তবে আমরা .str function ব্যবহার করতে পারি।
.str.title() ব্যবহার করে data-frame এর specific column বা Row এর সকল data এর capitalize করা যায়। মানে সকল word এর প্রথম letter Capital হবে। বাকি letter গুলো smaller থাকবে।
.str.lower() ব্যবহার করে data-frame এর specific column বা Row এর সকল data এর সব word গুলো lower case করা যায়। মানে সকল word এর সকল letter lower case হবে।
.str.upper() ব্যবহার করে data-frame এর specific column বা Row এর সকল word গুলো Upper case করা যায়। মানে সকল word এর সকল letter Upper case হবে।
.str.len() ব্যবহার করে data-frame এর specific column বা Row এর সকল data এর মধ্যে total কতগুলো character আছে তা count করা যায়। মানে সকল word এর total character এর পরিমাণ জানা যায় ।
Example
import pandas as pd#
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
# info
chicago.info()
Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 32062 non-null object
1 Position Title 32062 non-null category
2 Department 32062 non-null category
3 Employee Annual Salary 32062 non-null category
dtypes: category(3), object(1)
memory usage: 756.6+ KB
# Capitalize Name column's Data
chicago["Name"] = chicago["Name"].str.title()
chicago["Name"]
Output
0 Aaron, Elvia J
1 Aaron, Jeffery M
2 Aaron, Karina
3 Aaron, Kimberlei R
4 Abad Jr, Vicente M
...
32057 Zygadlo, Michael J
32058 Zygowicz, Peter J
32059 Zymantas, Mark E
32060 Zyrkowski, Carlo E
32061 Zyskowski, Dariusz
Name: Name, Length: 32062, dtype: object
# lower all Department column's Data
chicago["Department"] = chicago["Department"].str.lower()
chicago["Department"]
Output
0 water mgmnt
1 police
2 police
3 general services
4 water mgmnt
...
32057 general services
32058 police
32059 police
32060 police
32061 doit
Name: Department, Length: 32062, dtype: object
# Upper all Position Title column's Data
chicago["Position Title"] = chicago["Position Title"].str.upper()
chicago["Position Title"]
Output
0 WATER RATE TAKER
1 POLICE OFFICER
2 POLICE OFFICER
3 CHIEF CONTRACT EXPEDITER
4 CIVIL ENGINEER IV
...
32057 FRM OF MACHINISTS - AUTOMOTIVE
32058 POLICE OFFICER
32059 POLICE OFFICER
32060 POLICE OFFICER
32061 CHIEF DATA BASE ANALYST
Name: Position Title, Length: 32062, dtype: object
# count total character par row by
chicago["Position Title"].str.len()
Output
0 16
1 14
2 14
3 24
4 17
..
32057 30
32058 14
32059 14
32060 14
32061 23
Name: Position Title, Length: 32062, dtype: int64
#count total String
print(len(chicago["Position Title"]))
Output
32062
.str.replace()#
আমরা data-frame এর specific column এর যে কোন data কে change করতে পারি। আমরা .str.replace(“target”,”output”) এ target=যে value change করতে চাই এবং output=update value হিসেবে যে value চাই। তা ব্যবহার করে আমরা Data-frame বা Series এর যে কোন data যত জায়গায় আছে সব জায়গা থেকে update করতে পারি define করার মাধ্যমে।
Example
import pandas as pd#
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
# # info
# chicago.info()
print("before : \n \n",chicago["Department"]) # before change
chicago["Department"] = chicago["Department"].str.replace("MGMNT","MANAGEMENT") # replace("terget", "output")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].str.replace("$","").astype(float) # astype() for reduce memory
print("\n after : \n \n ",chicago["Department"]) # after change
Output
before :
0 WATER MGMNT
1 POLICE
2 POLICE
3 GENERAL SERVICES
4 WATER MGMNT
...
32057 GENERAL SERVICES
32058 POLICE
32059 POLICE
32060 POLICE
32061 DoIT
Name: Department, Length: 32062, dtype: category
Categories (35, object): [ADMIN HEARNG, ANIMAL CONTRL, AVIATION, BOARD OF ELECTION, ..., STREETS & SAN, TRANSPORTN, TREASURER, WATER MGMNT]
after :
0 WATER MANAGEMENT
1 POLICE
2 POLICE
3 GENERAL SERVICES
4 WATER MANAGEMENT
...
32057 GENERAL SERVICES
32058 POLICE
32059 POLICE
32060 POLICE
32061 DoIT
Name: Department, Length: 32062, dtype: object
Filtering with String Method#
Data-frame বা Series এর data গুলো আমরা specific data অনুসারে filter করতে পারি।
filter এর জন্য সকল data কে upper বা lower case এ convert করার প্রয়োজন হইতে পারে। আমরা একটা script এ বেশকিছু condition apply করতে পারি।
.str.contains() ব্যবহার করে যদি আমরা filter-data define করি তা হইলে initial data-frame বা series এর যে সব data এর মধ্যে filter-data থাকবে , আমরা শুধু ঐ সব data পাব।
.str.startswith() ব্যবহার করে যদি আমরা filter-data define করি তা হইলে initial data-frame বা series এর যে সব data এর শুরুতে filter-data থাকবে , আমরা শুধু ঐ সব data পাব।
.str.endswith() ব্যবহার করে যদি আমরা filter-data define করি তা হইলে initial data-frame বা series এর যে সব data এর শেষ এ filter-data থাকবে , আমরা শুধু ঐ সব data পাব।
Example
import pandas as pd
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
# if found water in any whhere in Position Title
mask = chicago["Position Title"].str.lower().str.contains("water")
print("found water in any whhere in Position Title \n \n \n", chicago[mask])
# if found water in first in Position Title
print("\n \n if found water in first in Position Title: \n \n \n" , chicago[chicago["Position Title"].str.lower().str.startswith("water")])
# if found ist in last in Position Title
print("\n \n if found ist in last in Position Title: \n \n \n" , chicago[chicago["Position Title"].str.lower().str.endswith("ist")])
Output
found water in any whhere in Position Title
Name Position Title \
0 AARON, ELVIA J WATER RATE TAKER
554 ALUISE, VINCENT G FOREMAN OF WATER PIPE CONSTRUCTION
671 ANDER, PERRY A WATER CHEMIST II
685 ANDERSON, ANDREW J DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION
702 ANDERSON, DONALD FOREMAN OF WATER PIPE CONSTRUCTION
... ... ...
29669 VERMA, ANUPAM MANAGING ENGINEER - WATER MANAGEMENT
30239 WASHINGTON, JOSEPH WATER CHEMIST III
30544 WEST, THOMAS R GEN SUPT OF WATER MANAGEMENT
30991 WILLIAMS, MATTHEW FOREMAN OF WATER PIPE CONSTRUCTION
31405 WOODRIDGE, ROBERT L FOREMAN OF WATER PIPE CONSTRUCTION
Department Employee Annual Salary
0 WATER MGMNT $90744.00
554 WATER MGMNT $102440.00
671 WATER MGMNT $82044.00
685 WATER MGMNT $109272.00
702 WATER MGMNT $102440.00
... ... ...
29669 WATER MGMNT $111192.00
30239 WATER MGMNT $89676.00
30544 WATER MGMNT $115704.00
30991 WATER MGMNT $102440.00
31405 WATER MGMNT $102440.00
[111 rows x 4 columns]
if found water in first in Position Title:
Name Position Title Department \
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT
671 ANDER, PERRY A WATER CHEMIST II WATER MGMNT
1054 ASHLEY, KARMA T WATER CHEMIST II WATER MGMNT
1079 ATKINS, JOANNA M WATER CHEMIST II WATER MGMNT
1181 AZEEM, MOHAMMED A WATER CHEMIST II WATER MGMNT
... ... ... ...
28574 THREATT, DENISE R WATER QUALITY INSPECTOR WATER MGMNT
28602 TIGNOR, DARRYL B WATER RATE TAKER WATER MGMNT
28955 TRAVIS COOK, LESLIE R WATER RATE TAKER WATER MGMNT
29584 VELAZQUEZ, JOHN WATER RATE TAKER WATER MGMNT
30239 WASHINGTON, JOSEPH WATER CHEMIST III WATER MGMNT
Employee Annual Salary
0 $90744.00
671 $82044.00
1054 $82044.00
1079 $82044.00
1181 $53172.00
... ...
28574 $62004.00
28602 $78948.00
28955 $78948.00
29584 $78948.00
30239 $89676.00
[75 rows x 4 columns]
if found ist in last in Position Title:
Name Position Title \
184 AFROZ, NAYYAR PSYCHIATRIST
308 ALARCON, LUIS J LOAN PROCESSING SPECIALIST
422 ALLAIN, CAROLYN SENIOR TELECOMMUNICATIONS SPECIALIST
472 ALLEN, ROBERT MACHINIST
705 ANDERSON, EDWARD M SR PROCUREMENT SPECIALIST
... ... ...
31667 YODER, TERESA G ARCHIVAL SPECIALIST
31688 YOUNGBLOOM, LAURENCE G CRIMES SURVEILLANCE SPECIALIST
31717 YOUNG, KIMBERLY M SR PROCUREMENT SPECIALIST
31837 ZAPATA, HUGO SR PROCUREMENT SPECIALIST
31918 ZEMKE, RICHARD P MACHINIST
Department Employee Annual Salary
184 HEALTH $99840.00
308 COMMUNITY DEVELOPMENT $81948.00
422 DoIT $89880.00
472 WATER MGMNT $94328.00
705 PROCUREMENT $91476.00
... ... ...
31667 PUBLIC LIBRARY $74304.00
31688 OEMC $19676.80
31717 PROCUREMENT $68556.00
31837 PROCUREMENT $87324.00
31918 AVIATION $94328.00
[172 rows x 4 columns]
.strip(), .lstrip(), .rstrip() for remove freeSpace#
data-set এর data গুলো তে অনেক সময় অপ্রয়োজনীয় space থেকে যায়। আমরা .str.strip() বা .str.lstrip() বা .str.rstrip() ব্যবহার করে space remove করতে পারি।
.str.rstrip() দ্বারা right-side এর space remove করা যায়। .str.lstrip() দ্বারা left-side এর space remove করা যায়।
.str.strip() দ্বারা left-side এবং right-side (উভয় পাশ) এর space একই সঙ্ঘে remove করা যায়।
Example
import pandas as pd
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
print(chicago.head())
# remove white space only from left and from right
chicago["Name"] = chicago["Name"].str.rstrip().str.lstrip()
print("use rstrip() and lstrip : \n \n \n",chicago["Name"])
chicago["Position Title"] = chicago["Position Title"].str.strip()
print("\n \n \n use strip() : \n \n \n", chicago["Position Title"])
Output
Name Position Title Department \
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT
1 AARON, JEFFERY M POLICE OFFICER POLICE
2 AARON, KARINA POLICE OFFICER POLICE
3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES
4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT
Employee Annual Salary
0 $90744.00
1 $84450.00
2 $84450.00
3 $89880.00
4 $106836.00
use rstrip() and lstrip :
0 AARON, ELVIA J
1 AARON, JEFFERY M
2 AARON, KARINA
3 AARON, KIMBERLEI R
4 ABAD JR, VICENTE M
...
32057 ZYGADLO, MICHAEL J
32058 ZYGOWICZ, PETER J
32059 ZYMANTAS, MARK E
32060 ZYRKOWSKI, CARLO E
32061 ZYSKOWSKI, DARIUSZ
Name: Name, Length: 32062, dtype: object
use strip() :
0 WATER RATE TAKER
1 POLICE OFFICER
2 POLICE OFFICER
3 CHIEF CONTRACT EXPEDITER
4 CIVIL ENGINEER IV
...
32057 FRM OF MACHINISTS - AUTOMOTIVE
32058 POLICE OFFICER
32059 POLICE OFFICER
32060 POLICE OFFICER
32061 CHIEF DATA BASE ANALYST
Name: Position Title, Length: 32062, dtype: object
.str.split() method for Spliting Strings#
.str.split() method data-frame এ কাজ করে, series এ single data থাকে তাই series এ কিছুই split করার নাই। কোন series এ split() ব্যবহার করলে error face করে।
Example
import pandas as pd
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
print(chicago.head(3))
firstName = chicago["Name"].str.split(",").str.get(0)
lastName = chicago["Name"].str.split(",").str.get(1)
print(lastName)
#position = chicago["Position Title"].srt.split(" ").str.get(0).value_counts() # no separated multiple data as Series
# split and got single name
Output
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $90744.00
1 AARON, JEFFERY M POLICE OFFICER POLICE $84450.00
2 AARON, KARINA POLICE OFFICER POLICE $84450.00
0 ELVIA J
1 JEFFERY M
2 KARINA
3 KIMBERLEI R
4 VICENTE M
...
32057 MICHAEL J
32058 PETER J
32059 MARK E
32060 CARLO E
32061 DARIUSZ
Name: Name, Length: 32062, dtype: object
some Fun With .split() and .strip()#
import pandas as pd
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
print(chicago.head(3))
# try to understand or comment
print("\n \n \n take first part of last Name \n \n \n", chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0))
# try to understand or comment
print("\n \n count last Name's first part \n \n \n", chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).value_counts().head(7))
Output
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $90744.00
1 AARON, JEFFERY M POLICE OFFICER POLICE $84450.00
2 AARON, KARINA POLICE OFFICER POLICE $84450.00
take first part of last Name
0 ELVIA
1 JEFFERY
2 KARINA
3 KIMBERLEI
4 VICENTE
...
32057 MICHAEL
32058 PETER
32059 MARK
32060 CARLO
32061 DARIUSZ
Name: Name, Length: 32062, dtype: object
count last Name's first part
MICHAEL 1153
JOHN 899
JAMES 676
ROBERT 622
JOSEPH 537
DAVID 506
THOMAS 490
Name: Name, dtype: int64
more parameter of .srt.split(expend = True)#
expend paramenet এর মান defaultly False থাকে। expend=True এর দ্বারা আমরা data-frame এর split করা data গুলো দ্বারা new data-frame structure পাই। যার columns নামে define করা থাকে না।
Example
import pandas as pd
chicago = pd.read_csv("chicago.csv").dropna(how = "all")
# reduse memory size
chicago["Department"] = chicago["Department"].astype("category")
chicago["Position Title"] = chicago["Position Title"].astype("category")
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype("category")
print(chicago["Name"].str.split(",", expand=True))
chicago[["First Name", "Last Name"]] = chicago["Name"].str.split(",", expand=True)
chicago.head()
Output
0 1
0 AARON ELVIA J
1 AARON JEFFERY M
2 AARON KARINA
3 AARON KIMBERLEI R
4 ABAD JR VICENTE M
... ... ...
32057 ZYGADLO MICHAEL J
32058 ZYGOWICZ PETER J
32059 ZYMANTAS MARK E
32060 ZYRKOWSKI CARLO E
32061 ZYSKOWSKI DARIUSZ
[32062 rows x 2 columns]
Name | Position Title | Department | Employee Annual Salary | First Name | Last Name | |
---|---|---|---|---|---|---|
0 | AARON, ELVIA J | WATER RATE TAKER | WATER MGMNT | $90744.00 | AARON | ELVIA J |
1 | AARON, JEFFERY M | POLICE OFFICER | POLICE | $84450.00 | AARON | JEFFERY M |
2 | AARON, KARINA | POLICE OFFICER | POLICE | $84450.00 | AARON | KARINA |
3 | AARON, KIMBERLEI R | CHIEF CONTRACT EXPEDITER | GENERAL SERVICES | $89880.00 | AARON | KIMBERLEI R |
4 | ABAD JR, VICENTE M | CIVIL ENGINEER IV | WATER MGMNT | $106836.00 | ABAD JR | VICENTE M |
more parameter of .srt.split(expend = True, n)#
n এর মান define করার মাধ্যমে আমরা n পরিমাণ columns তৈরি করা যায় । .str.split(expend=True, n=3 দ্বারা ৩টি column create হয়। যে সব data এর ৩টি value নাই সেই সব data এর empty row তে None থাকে।