Some Usecases On A Test Dataset From SAP¶
[1]:
import pandas as pd
[2]:
model = pd.read_csv('sap-dataset.csv')
[3]:
model
[3]:
AccountId | AccountDescription | AccountParentId | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | ... | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | NaN | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | ... | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1779.763910 | USD |
1 | Original_Sales_Price | Original Sales Price | NaN | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | ... | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | NaN | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | ... | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
3 | Original_Sales_Price | Original Sales Price | NaN | ST188 | Arrowhead Corner Store | CT1 | Los Angeles | PD19 | Coconut Water | SM8 | ... | 2016-11-01 | 2016 | 20164 | 201611 | 1 | Actual | Actuals | public | 383.078689 | USD |
4 | Original_Sales_Price | Original Sales Price | NaN | ST107 | Spire Market | CT5 | San Jose | PD15 | Pineapple Juice | SM5 | ... | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 3110.303679 | USD |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1068 | Quantity_sold | Quantity sold | NaN | ST4 | Gstore | CT13 | Portland | PD26 | Red Wine | SM4 | ... | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 250.000000 | NaN |
1069 | Original_Sales_Price | Original Sales Price | NaN | ST49 | English Bay depot | CT4 | Sacramento | PD27 | White Wine | SM5 | ... | 2015-01-01 | 2015 | 20151 | 201501 | 1 | Actual | Actuals | public | 11348.592904 | USD |
1070 | Discount | Discount | NaN | ST36 | PriceLow Arena | CT10 | Reno | PD22 | Lager | SM3 | ... | 2016-05-01 | 2016 | 20162 | 201605 | 1 | Actual | Actuals | public | 4351.045830 | USD |
1071 | Original_Sales_Price | Original Sales Price | NaN | ST7 | Main Street Co | CT4 | Sacramento | PD16 | Watermelon Juice | SM5 | ... | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 7905.590606 | USD |
1072 | Quantity_sold | Quantity sold | NaN | ST98 | Walter Store Inc | CT14 | Salem | PD12 | Lemonade | SM1 | ... | 2015-07-01 | 2015 | 20153 | 201507 | 1 | Actual | Actuals | public | 100.000000 | NaN |
1073 rows × 21 columns
Dropping Columns That Contain Only NaN Values¶
DataFrame.drop()
accepts a inplace
parameter.
When not given,
drop()
returns a modified copy of the frameWhen given, it modified the frame in place
[4]:
modified_model = model.drop(model.columns[model.isna().all()], axis='columns')
modified_model.head(3)
[4]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | Janet Bury | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1779.763910 | USD |
1 | Original_Sales_Price | Original Sales Price | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | Nancy Miller | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | Janet Bury | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
Apparently the AccountParentId
column has been dropped. So how did this work? Lets dissect this setp by step.
DataFrame.isna()
¶
isna()
on a frame returns an equally shaped matrix, only with bool
values representing the Nan
-ness of the respective field in the original frame
[5]:
nan_fields = model.isna()
nan_fields.head(5)
[5]:
AccountId | AccountDescription | AccountParentId | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | ... | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 21 columns
DataFrame.all()
(and any()
)¶
Returns a bitfield (err, a bool
type Series
object) to indicate whether all elements are True
. By default, it operates on columns - all elements of a column are True
.
The companion method, any()
, indicates whether there is at least one element whose value is True
.
[6]:
nan_fields.all()
[6]:
AccountId False
AccountDescription False
AccountParentId True
StoreId False
StoreDescription False
LocationId False
LocationDescription False
ProductId False
ProductDescription False
Sales_ManagerId False
Sales_ManagerDescription False
DateDATE False
DateYear False
DateQuarter False
DateMonth False
DateDAY False
CategoryVersion False
CategoryCategory False
CategoryType False
Value False
ValueUnit False
dtype: bool
Column Selection¶
A bitfied applied to a Series
(using the []
operator of Series
) selects the elements. jjj link to filters
[7]:
model.columns
[7]:
Index(['AccountId', 'AccountDescription', 'AccountParentId', 'StoreId',
'StoreDescription', 'LocationId', 'LocationDescription', 'ProductId',
'ProductDescription', 'Sales_ManagerId', 'Sales_ManagerDescription',
'DateDATE', 'DateYear', 'DateQuarter', 'DateMonth', 'DateDAY',
'CategoryVersion', 'CategoryCategory', 'CategoryType', 'Value',
'ValueUnit'],
dtype='object')
[8]:
nan_columns = model.columns[model.isna().all()]
nan_columns
[8]:
Index(['AccountParentId'], dtype='object')
Dropping Columns¶
[9]:
modified_model = model.drop(model.columns[model.isna().all()], axis='columns')
modified_model.head(5)
[9]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | Janet Bury | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1779.763910 | USD |
1 | Original_Sales_Price | Original Sales Price | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | Nancy Miller | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | Janet Bury | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
3 | Original_Sales_Price | Original Sales Price | ST188 | Arrowhead Corner Store | CT1 | Los Angeles | PD19 | Coconut Water | SM8 | Gabriel Walton | 2016-11-01 | 2016 | 20164 | 201611 | 1 | Actual | Actuals | public | 383.078689 | USD |
4 | Original_Sales_Price | Original Sales Price | ST107 | Spire Market | CT5 | San Jose | PD15 | Pineapple Juice | SM5 | David Carl | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 3110.303679 | USD |
Now for real:
[10]:
model.drop(model.columns[model.isna().all()], axis='columns', inplace=True)
Fixing Date Values/Columns, And Operating With Date And Time¶
The DateDATE
column contains string values. Convert it to contain real timestamps that one can work with, numerically (e.g. add and subtract).
[11]:
model['DateDATE']
[11]:
0 2014-12-01
1 2014-10-01
2 2014-06-01
3 2016-11-01
4 2015-12-01
...
1068 2015-12-01
1069 2015-01-01
1070 2016-05-01
1071 2015-12-01
1072 2015-07-01
Name: DateDATE, Length: 1073, dtype: object
[12]:
model['DateDATE'].iloc[0]
[12]:
'2014-12-01'
[13]:
type(model['DateDATE'].iloc[0])
[13]:
str
pandas.to_datetime()
¶
Converts anything to a working timestamp value (or a vectorized version thereof): a str
, a Series
, …
[14]:
ts = pd.to_datetime('2023-03-13 13:02:01') # str
ts
[14]:
Timestamp('2023-03-13 13:02:01')
[15]:
pd.to_datetime('2023-03-13') # str
[15]:
Timestamp('2023-03-13 00:00:00')
pandas.Timestamp
is compatible with datetime.datetime
jjjjj linke to both
[16]:
import datetime
[17]:
ref_ts = datetime.datetime(year=2016, month=3, day=13)
ref_ts
[17]:
datetime.datetime(2016, 3, 13, 0, 0)
[18]:
ts < ref_ts
[18]:
False
[19]:
ts > ref_ts
[19]:
True
Converting The Entire DateDATE
Column To A Workable Type (And Working With It)¶
The following converts a Series
object (of str
) to a Series
object. That contains numpy.datetime64
objects - which in turn go well together with any of pandas.Timestamp
and datetime
.
Conversion¶
[20]:
pd.to_datetime(model['DateDATE'])
[20]:
0 2014-12-01
1 2014-10-01
2 2014-06-01
3 2016-11-01
4 2015-12-01
...
1068 2015-12-01
1069 2015-01-01
1070 2016-05-01
1071 2015-12-01
1072 2015-07-01
Name: DateDATE, Length: 1073, dtype: datetime64[ns]
Long story short: convert damn column!
[21]:
model['DateDATE'] = pd.to_datetime(model['DateDATE'])
[22]:
model.head(5)
[22]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | Janet Bury | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1779.763910 | USD |
1 | Original_Sales_Price | Original Sales Price | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | Nancy Miller | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | Janet Bury | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
3 | Original_Sales_Price | Original Sales Price | ST188 | Arrowhead Corner Store | CT1 | Los Angeles | PD19 | Coconut Water | SM8 | Gabriel Walton | 2016-11-01 | 2016 | 20164 | 201611 | 1 | Actual | Actuals | public | 383.078689 | USD |
4 | Original_Sales_Price | Original Sales Price | ST107 | Spire Market | CT5 | San Jose | PD15 | Pineapple Juice | SM5 | David Carl | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 3110.303679 | USD |
Usage: A Filter¶
Now we are in a position to define a filter that works with time (and not with strings)
[23]:
ref_2015_12_04 = pd.to_datetime('2015-12-04') # or use datetime.datetime()
ref_2015_12_04
[23]:
Timestamp('2015-12-04 00:00:00')
[24]:
flt_before_2015_12_04 = model['DateDATE'] < ref_2015_12_04
[25]:
flt_before_2015_12_04
[25]:
0 True
1 True
2 True
3 False
4 True
...
1068 True
1069 True
1070 False
1071 True
1072 True
Name: DateDATE, Length: 1073, dtype: bool
Select row based on filter
[26]:
rows_before_2015_12_04 = model.loc[flt_before_2015_12_04]
rows_before_2015_12_04.head(5)
[26]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | Janet Bury | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1779.763910 | USD |
1 | Original_Sales_Price | Original Sales Price | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | Nancy Miller | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | Janet Bury | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
4 | Original_Sales_Price | Original Sales Price | ST107 | Spire Market | CT5 | San Jose | PD15 | Pineapple Juice | SM5 | David Carl | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 3110.303679 | USD |
6 | Original_Sales_Price | Original Sales Price | ST119 | Pay N Go | CT2 | San Francisco | PD11 | Orange no pulp | SM7 | Kiran Raj | 2015-03-01 | 2015 | 20151 | 201503 | 1 | Actual | Actuals | public | 548.226847 | USD |
Modify Rows: Increase All Prices Of One Particular Store By 10%¶
Let store be ST43
, and increment Value
by 10%
See how many different products the store has …
[27]:
(model['StoreId'] == 'ST43').value_counts()
[27]:
False 1069
True 4
Name: StoreId, dtype: int64
Select prices (column Value
) of those
[28]:
model.loc[model['StoreId'] == 'ST43', 'Value']
[28]:
0 1779.763910
465 13627.793342
782 3427.896898
917 15569.353983
Name: Value, dtype: float64
Create a Series
which contains increases prices
[29]:
model.loc[model['StoreId'] == 'ST43', 'Value'] * 1.1
[29]:
0 1957.740301
465 14990.572676
782 3770.686587
917 17126.289381
Name: Value, dtype: float64
Assign increased prices back into frame
[30]:
model.loc[model['StoreId'] == 'ST43', 'Value'] = model.loc[model['StoreId'] == 'ST43', 'Value'] * 1.1
[31]:
model.head(5)
[31]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Original_Sales_Price | Original Sales Price | ST43 | franchescini Hand | CT14 | Salem | PD23 | IPA | SM1 | Janet Bury | 2014-12-01 | 2014 | 20144 | 201412 | 1 | Actual | Actuals | public | 1957.740301 | USD |
1 | Original_Sales_Price | Original Sales Price | ST35 | Quality Shopping | CT14 | Salem | PD2 | Pepsi | SM4 | Nancy Miller | 2014-10-01 | 2014 | 20144 | 201410 | 1 | Actual | Actuals | public | 130.957113 | USD |
2 | Original_Sales_Price | Original Sales Price | ST98 | Walter Store Inc | CT14 | Salem | PD24 | Amber | SM1 | Janet Bury | 2014-06-01 | 2014 | 20142 | 201406 | 1 | Actual | Actuals | public | 10226.449395 | USD |
3 | Original_Sales_Price | Original Sales Price | ST188 | Arrowhead Corner Store | CT1 | Los Angeles | PD19 | Coconut Water | SM8 | Gabriel Walton | 2016-11-01 | 2016 | 20164 | 201611 | 1 | Actual | Actuals | public | 383.078689 | USD |
4 | Original_Sales_Price | Original Sales Price | ST107 | Spire Market | CT5 | San Jose | PD15 | Pineapple Juice | SM5 | David Carl | 2015-12-01 | 2015 | 20154 | 201512 | 1 | Actual | Actuals | public | 3110.303679 | USD |
Filter Readability: Combining Filters¶
[32]:
flt_product_pd2 = (model['ProductId'] == 'PD2') # only product PD2
[33]:
flt_2015 = (model['DateDATE'] < pd.to_datetime('2016')) & \
(model['DateDATE'] >= pd.to_datetime('2015')) # only rows from 2015
[34]:
flt_only_sold = (model['AccountId'] == 'Quantity_sold') # only "sold" type records
[35]:
flt_combined = flt_product_pd2 & flt_2015 & flt_only_sold
[36]:
model.loc[flt_combined]
[36]:
AccountId | AccountDescription | StoreId | StoreDescription | LocationId | LocationDescription | ProductId | ProductDescription | Sales_ManagerId | Sales_ManagerDescription | DateDATE | DateYear | DateQuarter | DateMonth | DateDAY | CategoryVersion | CategoryCategory | CategoryType | Value | ValueUnit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
160 | Quantity_sold | Quantity sold | ST139 | Farlos | CT15 | Eugene | PD2 | Pepsi | SM11 | James Frank | 2015-11-01 | 2015 | 20154 | 201511 | 1 | Actual | Actuals | public | 6858.0 | NaN |
828 | Quantity_sold | Quantity sold | ST178 | Rays Super Foods | CT8 | Beverly Hills | PD2 | Pepsi | SM4 | Nancy Miller | 2015-08-01 | 2015 | 20153 | 201508 | 1 | Actual | Actuals | public | 200.0 | NaN |