Pandas: Basics (DataFrame And Series)¶
Naive: Objects, And Collections Of Objects¶
A Person object, represented as a naive dictionary in Python
joerg = {
'firstname': 'Joerg',
'lastname': 'Faschingbauer',
'email': 'jf@faschingbauer.co.at',
'age': 56,
}
Again naive, a collection of persons: a native Python list
caro = {
'firstname': 'Caro',
'lastname': 'Faschingbauer',
'email': 'caro@email.com',
'age': 25,
}
persons = [joerg, caro]
persons
[{'firstname': 'Joerg',
'lastname': 'Faschingbauer',
'email': 'jf@faschingbauer.co.at',
'age': 56},
{'firstname': 'Caro',
'lastname': 'Faschingbauer',
'email': 'caro@email.com',
'age': 25}]
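With this row-oriented layout, selecting a "column" or aggregating over it means looping over every record. The following is only a minimal sketch of that cost; the name persons_rows is made up for illustration, and the email field is left out for brevity.

```python
# Row-oriented layout: one dictionary per person (illustrative data only).
persons_rows = [
    {'firstname': 'Joerg', 'lastname': 'Faschingbauer', 'age': 56},
    {'firstname': 'Caro', 'lastname': 'Faschingbauer', 'age': 25},
]

# "Column selection" requires visiting every row.
ages = [person['age'] for person in persons_rows]
print(ages)        # [56, 25]

# Aggregation is built on top of the same loop.
total_age = sum(person['age'] for person in persons_rows)
print(total_age)   # 81
```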
Inverted: Objects, And Collections Of Objects (⟶ DataFrame)¶
A Pandas DataFrame is different
… analogous to a dictionary that contains database columns
persons = {
'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp' ],
'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger' ],
'email': ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com', 'philipp@email.com'],
'age': [56, 27, 25, 37 ],
}
Operation: column selection
persons['firstname']
['Joerg', 'Johanna', 'Caro', 'Philipp']
persons['age']
[56, 27, 25, 37]
Operation: aggregation
sum(persons['age'])
145
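The inverted layout makes column operations trivial, but a single "object" (a row) now has to be gathered from all columns. A small sketch of that trade-off, using a hypothetical persons_columns dictionary with only two of the columns:

```python
# Column-oriented layout: one list per column (illustrative subset of the data).
persons_columns = {
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age': [56, 27, 25, 37],
}

# Column aggregation is a one-liner ...
mean_age = sum(persons_columns['age']) / len(persons_columns['age'])
print(mean_age)   # 36.25

# ... but reassembling the row at position 2 means touching every column.
row = {column: values[2] for column, values in persons_columns.items()}
print(row)        # {'firstname': 'Caro', 'age': 25}
```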
Enter pandas, DataFrame, Series¶
import pandas as pd
Native Python dictionaries are not efficient enough
Native Python dictionaries are not feature-rich enough
Python lists permit mixing of data types inside a list/column
Pandas uses NumPy internally ⟶ values inside one column (Series) have the same type
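The "same type per column" point can be checked via the dtype attribute of a Series. This is a sketch on made-up values; the exact integer width depends on the platform, and a column with mixed values falls back to the generic object dtype.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Homogeneous values: stored with one integer dtype for the whole column.
ages = pd.Series([56, 27, 25, 37])
print(ages.dtype)                 # an integer dtype, typically int64
print(is_integer_dtype(ages))     # True

# Mixed values: pandas falls back to the generic "object" dtype.
mixed = pd.Series([56, 'twenty-seven', 25.0])
print(mixed.dtype)                # object
```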
persons = pd.DataFrame(persons)
persons
| | firstname | lastname | email | age |
|---|---|---|---|---|
| 0 | Joerg | Faschingbauer | jf@faschingbauer.co.at | 56 |
| 1 | Johanna | Faschingbauer | johanna@email.com | 27 |
| 2 | Caro | Faschingbauer | caro@email.com | 25 |
| 3 | Philipp | Lichtenberger | philipp@email.com | 37 |
Note the index column
persons.shape
(4, 4)
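The index column and the shape can be inspected directly. A minimal sketch, assuming a reduced two-column frame called df (a name chosen here for illustration), which is why its shape is (4, 2) rather than (4, 4):

```python
import pandas as pd

# Reduced illustrative frame; no index is given, so pandas creates a default one.
df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age': [56, 27, 25, 37],
})

print(df.index)          # RangeIndex(start=0, stop=4, step=1)

# shape is (number of rows, number of columns); the index is not counted.
n_rows, n_cols = df.shape
print(n_rows, n_cols)    # 4 2
print(len(df))           # 4, the number of rows
```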
Selecting A Column ⟶ Series¶
Just like a Python dictionary: index operator []
persons.columns
Index(['firstname', 'lastname', 'email', 'age'], dtype='object')
persons['firstname']
0 Joerg
1 Johanna
2 Caro
3 Philipp
Name: firstname, dtype: object
type(persons['firstname'])
pandas.core.series.Series
persons['firstname'].iloc[0]
'Joerg'
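A selected Series brings its own aggregation methods, so the earlier sum(persons['age']) on the plain dictionary has a direct counterpart. A short sketch on a reduced frame (df is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age': [56, 27, 25, 37],
})

ages = df['age']       # a Series

print(ages.sum())      # 145
print(ages.mean())     # 36.25
print(ages.max())      # 56
```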
Selecting Multiple Columns¶
Unlike Python dictionary: using index operator with a list of column names
persons[['firstname', 'age']]
| | firstname | age |
|---|---|---|
| 0 | Joerg | 56 |
| 1 | Johanna | 27 |
| 2 | Caro | 25 |
| 3 | Philipp | 37 |
type(persons[['firstname', 'age']])
pandas.core.frame.DataFrame
Note
One would wish that slicing works, just as with loc and iloc (see Pandas: Selecting Rows (And Columns) With iloc[] and Pandas: Selecting Rows (And Columns) With loc[]):
persons['firstname':'age']
Unfortunately this does not work: a slice inside [] is applied to rows, not to columns.
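For comparison, a column slice can be written with loc, where the first position selects rows and the second selects columns. A minimal sketch on a copy of the example data (df and subset are illustrative names); note that label slices include the end label.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'email': ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com', 'philipp@email.com'],
    'age': [56, 27, 25, 37],
})

# All rows (:), columns from 'firstname' up to and including 'email'.
subset = df.loc[:, 'firstname':'email']
print(list(subset.columns))   # ['firstname', 'lastname', 'email']
```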
To Copy Or Not To Copy¶
Working on large datasets (i.e. ones that take a long time to load)
One does not want to make irreversible changes
⟶ make a backup copy before experimenting
persons2 = persons.copy()
Or rely on inplace=False (the default where that parameter exists): the method then returns a modified copy and leaves the original untouched
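Both approaches can be sketched on a tiny frame. The names df, backup and without_age are illustrative, and drop() is used here only as one example of a method that has an inplace parameter.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Caro'],
    'age': [56, 25],
})

# Approach 1: take a backup copy, then modify the original freely.
backup = df.copy()
df['age'] = df['age'] + 1
print(list(df['age']))        # [57, 26]
print(list(backup['age']))    # [56, 25], the backup is unaffected

# Approach 2: leave inplace at its default (False); drop() returns a new frame.
without_age = df.drop(columns=['age'])
print(list(without_age.columns))   # ['firstname']
print(list(df.columns))            # ['firstname', 'age'], df is unchanged
```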