Some Usecases On A Test Dataset From SAP

[1]:
import pandas as pd
[2]:
model = pd.read_csv('sap-dataset.csv')
[3]:
model
[3]:
AccountId AccountDescription AccountParentId StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId ... DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price NaN ST43 franchescini Hand CT14 Salem PD23 IPA SM1 ... 2014-12-01 2014 20144 201412 1 Actual Actuals public 1779.763910 USD
1 Original_Sales_Price Original Sales Price NaN ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 ... 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price NaN ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 ... 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD
3 Original_Sales_Price Original Sales Price NaN ST188 Arrowhead Corner Store CT1 Los Angeles PD19 Coconut Water SM8 ... 2016-11-01 2016 20164 201611 1 Actual Actuals public 383.078689 USD
4 Original_Sales_Price Original Sales Price NaN ST107 Spire Market CT5 San Jose PD15 Pineapple Juice SM5 ... 2015-12-01 2015 20154 201512 1 Actual Actuals public 3110.303679 USD
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1068 Quantity_sold Quantity sold NaN ST4 Gstore CT13 Portland PD26 Red Wine SM4 ... 2015-12-01 2015 20154 201512 1 Actual Actuals public 250.000000 NaN
1069 Original_Sales_Price Original Sales Price NaN ST49 English Bay depot CT4 Sacramento PD27 White Wine SM5 ... 2015-01-01 2015 20151 201501 1 Actual Actuals public 11348.592904 USD
1070 Discount Discount NaN ST36 PriceLow Arena CT10 Reno PD22 Lager SM3 ... 2016-05-01 2016 20162 201605 1 Actual Actuals public 4351.045830 USD
1071 Original_Sales_Price Original Sales Price NaN ST7 Main Street Co CT4 Sacramento PD16 Watermelon Juice SM5 ... 2015-12-01 2015 20154 201512 1 Actual Actuals public 7905.590606 USD
1072 Quantity_sold Quantity sold NaN ST98 Walter Store Inc CT14 Salem PD12 Lemonade SM1 ... 2015-07-01 2015 20153 201507 1 Actual Actuals public 100.000000 NaN

1073 rows × 21 columns

Dropping Columns That Contain Only NaN Values

DataFrame.drop() accepts a inplace parameter.

  • When not given, drop() returns a modified copy of the frame

  • When given, it modified the frame in place

[4]:
modified_model = model.drop(model.columns[model.isna().all()], axis='columns')
modified_model.head(3)
[4]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price ST43 franchescini Hand CT14 Salem PD23 IPA SM1 Janet Bury 2014-12-01 2014 20144 201412 1 Actual Actuals public 1779.763910 USD
1 Original_Sales_Price Original Sales Price ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 Nancy Miller 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 Janet Bury 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD

Apparently the AccountParentId column has been dropped. So how did this work? Lets dissect this setp by step.

DataFrame.isna()

isna() on a frame returns an equally shaped matrix, only with bool values representing the Nan-ness of the respective field in the original frame

[5]:
nan_fields = model.isna()
nan_fields.head(5)
[5]:
AccountId AccountDescription AccountParentId StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId ... DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 False False True False False False False False False False ... False False False False False False False False False False
1 False False True False False False False False False False ... False False False False False False False False False False
2 False False True False False False False False False False ... False False False False False False False False False False
3 False False True False False False False False False False ... False False False False False False False False False False
4 False False True False False False False False False False ... False False False False False False False False False False

5 rows × 21 columns

DataFrame.all() (and any())

Returns a bitfield (err, a bool type Series object) to indicate whether all elements are True. By default, it operates on columns - all elements of a column are True.

The companion method, any(), indicates whether there is at least one element whose value is True.

[6]:
nan_fields.all()
[6]:
AccountId                   False
AccountDescription          False
AccountParentId              True
StoreId                     False
StoreDescription            False
LocationId                  False
LocationDescription         False
ProductId                   False
ProductDescription          False
Sales_ManagerId             False
Sales_ManagerDescription    False
DateDATE                    False
DateYear                    False
DateQuarter                 False
DateMonth                   False
DateDAY                     False
CategoryVersion             False
CategoryCategory            False
CategoryType                False
Value                       False
ValueUnit                   False
dtype: bool

Column Selection

A bitfied applied to a Series (using the [] operator of Series) selects the elements. jjj link to filters

[7]:
model.columns
[7]:
Index(['AccountId', 'AccountDescription', 'AccountParentId', 'StoreId',
       'StoreDescription', 'LocationId', 'LocationDescription', 'ProductId',
       'ProductDescription', 'Sales_ManagerId', 'Sales_ManagerDescription',
       'DateDATE', 'DateYear', 'DateQuarter', 'DateMonth', 'DateDAY',
       'CategoryVersion', 'CategoryCategory', 'CategoryType', 'Value',
       'ValueUnit'],
      dtype='object')
[8]:
nan_columns = model.columns[model.isna().all()]
nan_columns
[8]:
Index(['AccountParentId'], dtype='object')

Dropping Columns

[9]:
modified_model = model.drop(model.columns[model.isna().all()], axis='columns')
modified_model.head(5)
[9]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price ST43 franchescini Hand CT14 Salem PD23 IPA SM1 Janet Bury 2014-12-01 2014 20144 201412 1 Actual Actuals public 1779.763910 USD
1 Original_Sales_Price Original Sales Price ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 Nancy Miller 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 Janet Bury 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD
3 Original_Sales_Price Original Sales Price ST188 Arrowhead Corner Store CT1 Los Angeles PD19 Coconut Water SM8 Gabriel Walton 2016-11-01 2016 20164 201611 1 Actual Actuals public 383.078689 USD
4 Original_Sales_Price Original Sales Price ST107 Spire Market CT5 San Jose PD15 Pineapple Juice SM5 David Carl 2015-12-01 2015 20154 201512 1 Actual Actuals public 3110.303679 USD

Now for real:

[10]:
model.drop(model.columns[model.isna().all()], axis='columns', inplace=True)

Fixing Date Values/Columns, And Operating With Date And Time

The DateDATE column contains string values. Convert it to contain real timestamps that one can work with, numerically (e.g. add and subtract).

[11]:
model['DateDATE']
[11]:
0       2014-12-01
1       2014-10-01
2       2014-06-01
3       2016-11-01
4       2015-12-01
           ...
1068    2015-12-01
1069    2015-01-01
1070    2016-05-01
1071    2015-12-01
1072    2015-07-01
Name: DateDATE, Length: 1073, dtype: object
[12]:
model['DateDATE'].iloc[0]
[12]:
'2014-12-01'
[13]:
type(model['DateDATE'].iloc[0])
[13]:
str

pandas.to_datetime()

Converts anything to a working timestamp value (or a vectorized version thereof): a str, a Series, …

[14]:
ts = pd.to_datetime('2023-03-13 13:02:01')       # str
ts
[14]:
Timestamp('2023-03-13 13:02:01')
[15]:
pd.to_datetime('2023-03-13')                # str
[15]:
Timestamp('2023-03-13 00:00:00')

pandas.Timestamp is compatible with datetime.datetime jjjjj linke to both

[16]:
import datetime
[17]:
ref_ts = datetime.datetime(year=2016, month=3, day=13)
ref_ts
[17]:
datetime.datetime(2016, 3, 13, 0, 0)
[18]:
ts < ref_ts
[18]:
False
[19]:
ts > ref_ts
[19]:
True

Converting The Entire DateDATE Column To A Workable Type (And Working With It)

The following converts a Series object (of str) to a Series object. That contains numpy.datetime64 objects - which in turn go well together with any of pandas.Timestamp and datetime.

Conversion

[20]:
pd.to_datetime(model['DateDATE'])
[20]:
0      2014-12-01
1      2014-10-01
2      2014-06-01
3      2016-11-01
4      2015-12-01
          ...
1068   2015-12-01
1069   2015-01-01
1070   2016-05-01
1071   2015-12-01
1072   2015-07-01
Name: DateDATE, Length: 1073, dtype: datetime64[ns]

Long story short: convert damn column!

[21]:
model['DateDATE'] = pd.to_datetime(model['DateDATE'])
[22]:
model.head(5)
[22]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price ST43 franchescini Hand CT14 Salem PD23 IPA SM1 Janet Bury 2014-12-01 2014 20144 201412 1 Actual Actuals public 1779.763910 USD
1 Original_Sales_Price Original Sales Price ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 Nancy Miller 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 Janet Bury 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD
3 Original_Sales_Price Original Sales Price ST188 Arrowhead Corner Store CT1 Los Angeles PD19 Coconut Water SM8 Gabriel Walton 2016-11-01 2016 20164 201611 1 Actual Actuals public 383.078689 USD
4 Original_Sales_Price Original Sales Price ST107 Spire Market CT5 San Jose PD15 Pineapple Juice SM5 David Carl 2015-12-01 2015 20154 201512 1 Actual Actuals public 3110.303679 USD

Usage: A Filter

Now we are in a position to define a filter that works with time (and not with strings)

[23]:
ref_2015_12_04 = pd.to_datetime('2015-12-04')      # or use datetime.datetime()
ref_2015_12_04
[23]:
Timestamp('2015-12-04 00:00:00')
[24]:
flt_before_2015_12_04 = model['DateDATE'] < ref_2015_12_04
[25]:
flt_before_2015_12_04
[25]:
0        True
1        True
2        True
3       False
4        True
        ...
1068     True
1069     True
1070    False
1071     True
1072     True
Name: DateDATE, Length: 1073, dtype: bool

Select row based on filter

[26]:
rows_before_2015_12_04 = model.loc[flt_before_2015_12_04]
rows_before_2015_12_04.head(5)
[26]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price ST43 franchescini Hand CT14 Salem PD23 IPA SM1 Janet Bury 2014-12-01 2014 20144 201412 1 Actual Actuals public 1779.763910 USD
1 Original_Sales_Price Original Sales Price ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 Nancy Miller 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 Janet Bury 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD
4 Original_Sales_Price Original Sales Price ST107 Spire Market CT5 San Jose PD15 Pineapple Juice SM5 David Carl 2015-12-01 2015 20154 201512 1 Actual Actuals public 3110.303679 USD
6 Original_Sales_Price Original Sales Price ST119 Pay N Go CT2 San Francisco PD11 Orange no pulp SM7 Kiran Raj 2015-03-01 2015 20151 201503 1 Actual Actuals public 548.226847 USD

Modify Rows: Increase All Prices Of One Particular Store By 10%

Let store be ST43, and increment Value by 10%

See how many different products the store has …

[27]:
(model['StoreId'] == 'ST43').value_counts()
[27]:
False    1069
True        4
Name: StoreId, dtype: int64

Select prices (column Value) of those

[28]:
model.loc[model['StoreId'] == 'ST43', 'Value']
[28]:
0       1779.763910
465    13627.793342
782     3427.896898
917    15569.353983
Name: Value, dtype: float64

Create a Series which contains increases prices

[29]:
model.loc[model['StoreId'] == 'ST43', 'Value'] * 1.1
[29]:
0       1957.740301
465    14990.572676
782     3770.686587
917    17126.289381
Name: Value, dtype: float64

Assign increased prices back into frame

[30]:
model.loc[model['StoreId'] == 'ST43', 'Value'] = model.loc[model['StoreId'] == 'ST43', 'Value'] * 1.1
[31]:
model.head(5)
[31]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
0 Original_Sales_Price Original Sales Price ST43 franchescini Hand CT14 Salem PD23 IPA SM1 Janet Bury 2014-12-01 2014 20144 201412 1 Actual Actuals public 1957.740301 USD
1 Original_Sales_Price Original Sales Price ST35 Quality Shopping CT14 Salem PD2 Pepsi SM4 Nancy Miller 2014-10-01 2014 20144 201410 1 Actual Actuals public 130.957113 USD
2 Original_Sales_Price Original Sales Price ST98 Walter Store Inc CT14 Salem PD24 Amber SM1 Janet Bury 2014-06-01 2014 20142 201406 1 Actual Actuals public 10226.449395 USD
3 Original_Sales_Price Original Sales Price ST188 Arrowhead Corner Store CT1 Los Angeles PD19 Coconut Water SM8 Gabriel Walton 2016-11-01 2016 20164 201611 1 Actual Actuals public 383.078689 USD
4 Original_Sales_Price Original Sales Price ST107 Spire Market CT5 San Jose PD15 Pineapple Juice SM5 David Carl 2015-12-01 2015 20154 201512 1 Actual Actuals public 3110.303679 USD

Filter Readability: Combining Filters

[32]:
flt_product_pd2 = (model['ProductId'] == 'PD2')               # only product PD2
[33]:
flt_2015 = (model['DateDATE'] < pd.to_datetime('2016')) & \
           (model['DateDATE'] >= pd.to_datetime('2015'))      # only rows from 2015
[34]:
flt_only_sold = (model['AccountId'] == 'Quantity_sold')       # only "sold" type records
[35]:
flt_combined = flt_product_pd2 & flt_2015 & flt_only_sold
[36]:
model.loc[flt_combined]
[36]:
AccountId AccountDescription StoreId StoreDescription LocationId LocationDescription ProductId ProductDescription Sales_ManagerId Sales_ManagerDescription DateDATE DateYear DateQuarter DateMonth DateDAY CategoryVersion CategoryCategory CategoryType Value ValueUnit
160 Quantity_sold Quantity sold ST139 Farlos CT15 Eugene PD2 Pepsi SM11 James Frank 2015-11-01 2015 20154 201511 1 Actual Actuals public 6858.0 NaN
828 Quantity_sold Quantity sold ST178 Rays Super Foods CT8 Beverly Hills PD2 Pepsi SM4 Nancy Miller 2015-08-01 2015 20153 201508 1 Actual Actuals public 200.0 NaN