Whetting Your Appetite (Livehacking Demo)¶
Pipeline Driver¶
Read CSV
Read pipe members; drive empty pipeline
Print frame
⟶ DataFrame
#!/usr/bin/env python

import sys
import pandas

csvname = sys.argv[1]
pipe_stages = sys.argv[2:]

data = pandas.read_csv(
    csvname,
    delimiter=';', encoding='iso-8859-1',
    names=('account', 'info', 'time_booked', 'time_valuta', 'amount', 'unit'))

# every pipe stage is a Python file that defines transform(data)
for ps in pipe_stages:
    context = {
        'pd': pandas,
    }
    exec('import numpy as np', context)
    with open(ps) as f:          # close the stage file after reading
        code = f.read()
    exec(code, context)
    transform = context['transform']
    data = transform(data)

# print the entire frame, no truncation
pandas.options.display.max_colwidth = None
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None
pandas.options.display.width = None
print(data)
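The driver loads each stage by executing its source in a fresh namespace and then picking the transform function out of that namespace. A minimal sketch of just that mechanism, using a plain list instead of a DataFrame; STAGE_SOURCE, load_transform and the sample stage are made up for illustration and are not part of the demo:

```python
# Source of a hypothetical pipeline stage; in the driver this text
# comes from a file named on the command line.
STAGE_SOURCE = '''
def transform(data):
    # illustrative stage: drop empty entries
    return [item for item in data if item]
'''

def load_transform(source, predefined=None):
    """Execute stage source in its own namespace, return its transform()."""
    context = dict(predefined or {})   # names the stage may use, like 'pd'
    exec(source, context)              # defines transform inside context
    return context['transform']

transform = load_transform(STAGE_SOURCE)
print(transform(['a', '', 'b']))       # ['a', 'b']
```

A stage that does not define transform makes the lookup fail with a KeyError, which is also how the driver above would react.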
Won’t work though …
$ ./drive-pipeline.py
Traceback (most recent call last):
  File "/home/jfasch/Homebrain/Firma/Kunden/039-IT-Visions/2023-03-13--Python-SAP--Consolut/Demo/./drive-pipeline.py", line 4, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
Virtual Environment Setup¶
Modules from Python standard library
Modules that are not in Python standard library (like pandas, for example)
Sandboxing
Python interpreter (and standard library) version
External module version
Jupyter notebook? Sure!
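To see what a virtual environment pins down, one can ask the running interpreter directly. The following is a small sketch, not part of the demo; the function name describe_environment is hypothetical, and it only relies on the standard library:

```python
import importlib.util
import sys

def describe_environment(module_name='pandas'):
    """Collect the facts that a virtual environment is meant to control."""
    return {
        # inside a venv, sys.prefix points into the venv directory
        'in_venv': sys.prefix != sys.base_prefix,
        'python_version': '.'.join(str(n) for n in sys.version_info[:3]),
        # True if the module could be imported from this interpreter
        'module_available': importlib.util.find_spec(module_name) is not None,
    }

if __name__ == '__main__':
    for key, value in describe_environment().items():
        print(f'{key}: {value}')
```

Run once with the system interpreter and once with the interpreter of the virtual environment, the output shows whether pandas is visible in each of them.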
Filter: Add Category: card-payment¶
info.startswith('Bezahlung Karte')
Add column category, containing value card-payment only
def transform(data):
    data['category'] = data['info'].str.startswith('Bezahlung Karte')
    return data
Pandas vectorized string methods ⟶ Series
Modeled after Python’s built-in str methods (only on a Series instead)
Insufficient: adds only bool column
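Why the result is only a bool column can be seen on a tiny Series. This is a sketch with made-up data: the second entry, 'Dauerauftrag Miete', is an invented example and not taken from the real CSV.

```python
import pandas as pd

# two sample 'info' values; the second one is invented for illustration
info = pd.Series(['Bezahlung Karte MC/000009258', 'Dauerauftrag Miete'])

# .str.startswith() tests every element, like str.startswith() would
is_card = info.str.startswith('Bezahlung Karte')

print(is_card.tolist())    # [True, False]
```

The column answers "is this a card payment?" but does not carry a category name, which is what the next version of the filter fixes.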
def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return 'card-payment'
    else:
        return 'unknown'

def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data
apply()
⟶ generic way to hook in custom functionality
⟶ enter real Python programming
Filter: Select Uncategorized¶
def transform(data):
    filt_uncat = data['category'] == 'unknown'
    uncat_rows = data.loc[filt_uncat]
    return uncat_rows
Hiccup: duplicating the string unknown across (at least) two different filters/files
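One way around the duplication, sketched under the assumption that both filters can import a shared module: the category names become constants that are defined once. The names UNKNOWN, CARD_PAYMENT and is_uncategorized are hypothetical.

```python
# Hypothetical content of a module shared by all filters.
UNKNOWN = 'unknown'
CARD_PAYMENT = 'card-payment'

def categorize(info):
    """Same logic as the categorizing filter, without string literals."""
    if info.startswith('Bezahlung Karte'):
        return CARD_PAYMENT
    return UNKNOWN

def is_uncategorized(category):
    """Predicate for the selecting filter; compares against the constant."""
    return category == UNKNOWN

categories = [categorize(s) for s in ('Bezahlung Karte MC/000009258', 'anything else')]
print([c for c in categories if is_uncategorized(c)])   # ['unknown']
```

In the selecting filter the comparison would then read data['category'] == UNKNOWN, so a typo in the category name turns into a NameError instead of a silently empty result.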
More Categories¶
card-payment is far too unspecific
Useless: want “Food”, “Car”, “Luxury”, …
def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'
    return 'card-unknown'

def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data
Heavily modified though
Python programming
Split the info field
Manually unpacking fields, first
⟶ tuple unpacking
Interpret fields
Guess category
Into the wild
Working with crap data
Date formats
Floating point/currency formats and units (EUR?!)
Field tunneling (info has three fields, but not always)
Uncertainty!
Fear!!
⟶ Testing!!!
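The step from manual indexing to tuple unpacking, combined with the "three fields, but not always" problem, could look like the sketch below. The helper split_info is hypothetical, and padding missing fields with empty strings is an assumption about how such records should be treated, not something the demo prescribes.

```python
def split_info(info, nfields=3):
    """Split info at '|' and always return exactly nfields strings."""
    # maxsplit keeps surplus separators inside the last field
    fields = info.split('|', nfields - 1)
    # pad short records with empty strings (assumption: '' means "missing")
    fields += [''] * (nfields - len(fields))
    return tuple(fields)

# tuple unpacking replaces fields[0], fields[1], fields[2]
which, pos, company = split_info(
    r'Bezahlung Karte MC/000009258|POS 2800 K002 07.02. 12:34|SPAR DANKT 5362\\GRAZ\8020')
print(company.startswith('SPAR DANKT'))            # True

# a record with only one field no longer raises IndexError
print(split_info('Bezahlung Karte MC/000009258'))  # ('Bezahlung Karte MC/000009258', '', '')
```

With the plain info.split('|') and fields[2] from the filter above, the second record would raise an IndexError in the middle of apply().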
Testing¶
Modularize
Externalize stuff from filters/categorize_v1.py ⟶ filters/categorize_v2.py
import stuff.category       # <-- use code from stuff/category.py

def transform(data):
    data['category'] = data['info'].apply(stuff.category.categorize)
    return data
Import from stuff/category.py
def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'
    return 'card-unknown'
See if still works ⟶ ok
Add second importer: test
Problems
Primary problem: finding a category based upon the info field (a str)
Secondary (if at all): reading CSV
Unit test: tests/test_category.py
Test only the info column (⟶ raw strings)
Minor hiccup: have to set PYTHONPATH (see here)
from stuff.category import categorize

def test_basic():
    info = r'Bezahlung Karte MC/000009258|POS 2800 K002 07.02. 12:34|SPAR DANKT 5362\\GRAZ\8020'
    cat = categorize(info)
    assert cat == 'living'
$ pytest
======================================= test session starts =======================================
platform linux -- Python 3.10.7, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/jfasch/work/jfasch-home/trainings/log/detail/2023-03-13-Python-SAP/Demo
collected 1 item

tests/test_category.py . [100%]

======================================== 1 passed in 0.01s ========================================
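The PYTHONPATH hiccup comes from tests/test_category.py importing stuff.category, which lives in the project root rather than next to the test. One possible alternative to setting the environment variable, given here as an assumption and not as what the demo did, is a conftest.py in the project root that puts that directory on sys.path. The helper name ensure_importable is made up.

```python
# conftest.py (hypothetical; placed in the directory that contains stuff/)
import os
import sys

def ensure_importable(directory):
    """Put directory at the front of sys.path unless it is already listed."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory

# pytest imports conftest.py before it collects the tests, so this runs early.
# Assumption: pytest is started from the project root.
ensure_importable(os.getcwd())
```

Running python -m pytest instead of pytest has a similar effect, because python -m adds the current directory to sys.path.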