Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let’s introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.
Series
The first main data type we will learn about for pandas is the Series data type. Let’s import Pandas and explore the Series object. A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.
Let’s explore this concept through some examples:
import numpy as np
import pandas as pd
You can convert a list,numpy array, or dictionary to a Series:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
** Using Lists**
pd.Series(data=my_list)
0 10
1 20
2 30
dtype: int64
pd.Series(data=my_list,index=labels)
a 10
b 20
c 30
dtype: int64
pd.Series(my_list,labels)
a 10
b 20
c 30
dtype: int64
** NumPy Arrays **
pd.Series(arr)
0 10
1 20
2 30
dtype: int64
pd.Series(arr,labels)
a 10
b 20
c 30
dtype: int64
** Dictionary**
pd.Series(d)
a 10
b 20
c 30
dtype: int64
Data in a Series
A pandas Series can hold a variety of object types:
pd.Series(data=labels)
0 a
1 b
2 c
dtype: object
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
Using an Index
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).
Let’s see some examples of how to grab information from a Series. Let us create two sereis, ser1 and ser2:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
ser1['USA']
1
Operations are then also done based off of index:
ser1 + ser2
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
Let’s stop here for now and move on to DataFrames, which will expand on the concept of Series!
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
Selection and Indexing
Let’s learn the various methods to grab data from a DataFrame
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column names
df[['W','Z']]
| W | Z |
---|
A | 2.706850 | 0.503826 |
---|
B | 0.651118 | 0.605965 |
---|
C | -2.018168 | -0.589001 |
---|
D | 0.188695 | 0.955057 |
---|
E | 0.190794 | 0.683509 |
---|
# SQL Syntax (NOT RECOMMENDED!)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame Columns are just Series
type(df['W'])
pandas.core.series.Series
Creating a new column
df['new'] = df['W'] + df['Y']
df
| W | X | Y | Z | new |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
---|
Removing Columns
df.drop('new',axis=1)
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
# Not inplace unless specified!
df
| W | X | Y | Z | new |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
---|
df.drop('new',axis=1,inplace=True)
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
Can also drop rows this way:
df.drop('E',axis=0)
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
Selecting Rows
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
Or select based off of position instead of label
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Selecting subset of rows and columns
df.loc['B','Y']
-0.84807698340363147
df.loc[['A','B'],['W','Y']]
| W | Y |
---|
A | 2.706850 | 0.907969 |
---|
B | 0.651118 | -0.848077 |
---|
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
df>0
| W | X | Y | Z |
---|
A | True | True | True | True |
---|
B | True | False | False | True |
---|
C | False | True | True | False |
---|
D | True | False | False | True |
---|
E | True | True | True | True |
---|
df[df>0]
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | NaN | NaN | 0.605965 |
---|
C | NaN | 0.740122 | 0.528813 | NaN |
---|
D | 0.188695 | NaN | NaN | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
df[df['W']>0]
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
| Y | X |
---|
A | 0.907969 | 0.628133 |
---|
B | -0.848077 | -0.319318 |
---|
D | -0.933237 | -0.758872 |
---|
E | 2.605967 | 1.978757 |
---|
For two conditions you can use | and & with parenthesis:
df[(df['W']>0) & (df['Y'] > 1)]
| W | X | Y | Z |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
More Index Details
Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
# Reset to default 0,1...n index
df.reset_index()
| index | W | X | Y | Z |
---|
0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
| W | X | Y | Z | States |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
---|
df.set_index('States')
| W | X | Y | Z |
---|
States | | | | |
---|
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
df
| W | X | Y | Z | States |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
---|
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
---|
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
---|
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
---|
df.set_index('States',inplace=True)
df
| W | X | Y | Z |
---|
States | | | | |
---|
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
---|
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
---|
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
---|
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
---|
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
---|
Multi-Index and Index Hierarchy
Let us go over how to work with Multi-Index, first we’ll create a quick example of what a Multi-Indexed DataFrame would look like:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
| | A | B |
---|
G1 | 1 | 0.153661 | 0.167638 |
---|
2 | -0.765930 | 0.962299 |
---|
3 | 0.902826 | -0.537909 |
---|
G2 | 1 | -1.549671 | 0.435253 |
---|
2 | 1.259904 | -0.447898 |
---|
3 | 0.266207 | 0.412580 |
---|
Now let’s show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:
df.loc['G1']
| A | B |
---|
1 | 0.153661 | 0.167638 |
---|
2 | -0.765930 | 0.962299 |
---|
3 | 0.902826 | -0.537909 |
---|
df.loc['G1'].loc[1]
A 0.153661
B 0.167638
Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names = ['Group','Num']
df
| | A | B |
---|
Group | Num | | |
---|
G1 | 1 | 0.153661 | 0.167638 |
---|
2 | -0.765930 | 0.962299 |
---|
3 | 0.902826 | -0.537909 |
---|
G2 | 1 | -1.549671 | 0.435253 |
---|
2 | 1.259904 | -0.447898 |
---|
3 | 0.266207 | 0.412580 |
---|
df.xs('G1')
| A | B |
---|
Num | | |
---|
1 | 0.153661 | 0.167638 |
---|
2 | -0.765930 | 0.962299 |
---|
3 | 0.902826 | -0.537909 |
---|
df.xs(['G1',1])
A 0.153661
B 0.167638
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
| A | B |
---|
Group | | |
---|
G1 | 0.153661 | 0.167638 |
---|
G2 | -1.549671 | 0.435253 |
---|
Missing Data
Let’s show a few convenient methods to deal with Missing Data in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
| A | B | C |
---|
0 | 1.0 | 5.0 | 1 |
---|
1 | 2.0 | NaN | 2 |
---|
2 | NaN | NaN | 3 |
---|
df.dropna()
df.dropna(axis=1)
df.dropna(thresh=2)
df.fillna(value='FILL VALUE')
| A | B | C |
---|
0 | 1 | 5 | 1 |
---|
1 | 2 | FILL VALUE | 2 |
---|
2 | FILL VALUE | FILL VALUE | 3 |
---|
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
Groupby
The groupby method allows you to group rows of data together and call aggregate functions
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
| Company | Person | Sales |
---|
0 | GOOG | Sam | 200 |
---|
1 | GOOG | Charlie | 120 |
---|
2 | MSFT | Amy | 340 |
---|
3 | MSFT | Vanessa | 124 |
---|
4 | FB | Carl | 243 |
---|
5 | FB | Sarah | 350 |
---|
Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:
df.groupby('Company')
You can save this object as a new variable:
by_comp = df.groupby("Company")
And then call aggregate methods off the object:
by_comp.mean()
| Sales |
---|
Company | |
---|
FB | 296.5 |
---|
GOOG | 160.0 |
---|
MSFT | 232.0 |
---|
df.groupby('Company').mean()
| Sales |
---|
Company | |
---|
FB | 296.5 |
---|
GOOG | 160.0 |
---|
MSFT | 232.0 |
---|
More examples of aggregate methods:
by_comp.std()
| Sales |
---|
Company | |
---|
FB | 75.660426 |
---|
GOOG | 56.568542 |
---|
MSFT | 152.735065 |
---|
by_comp.min()
| Person | Sales |
---|
Company | | |
---|
FB | Carl | 243 |
---|
GOOG | Charlie | 120 |
---|
MSFT | Amy | 124 |
---|
by_comp.max()
| Person | Sales |
---|
Company | | |
---|
FB | Sarah | 350 |
---|
GOOG | Sam | 200 |
---|
MSFT | Vanessa | 340 |
---|
by_comp.count()
| Person | Sales |
---|
Company | | |
---|
FB | 2 | 2 |
---|
GOOG | 2 | 2 |
---|
MSFT | 2 | 2 |
---|
by_comp.describe()
| | Sales |
---|
Company | | |
---|
FB | count | 2.000000 |
---|
mean | 296.500000 |
---|
std | 75.660426 |
---|
min | 243.000000 |
---|
25% | 269.750000 |
---|
50% | 296.500000 |
---|
75% | 323.250000 |
---|
max | 350.000000 |
---|
GOOG | count | 2.000000 |
---|
mean | 160.000000 |
---|
std | 56.568542 |
---|
min | 120.000000 |
---|
25% | 140.000000 |
---|
50% | 160.000000 |
---|
75% | 180.000000 |
---|
max | 200.000000 |
---|
MSFT | count | 2.000000 |
---|
mean | 232.000000 |
---|
std | 152.735065 |
---|
min | 124.000000 |
---|
25% | 178.000000 |
---|
50% | 232.000000 |
---|
75% | 286.000000 |
---|
max | 340.000000 |
---|
by_comp.describe().transpose()
Company | FB | GOOG | MSFT |
---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
---|
Sales | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 | 2.0 | 160.0 | ... | 180.0 | 200.0 | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |
---|
1 rows × 24 columns
by_comp.describe().transpose()['GOOG']
| count | mean | std | min | 25% | 50% | 75% | max |
---|
Sales | 2.0 | 160.0 | 56.568542 | 120.0 | 140.0 | 160.0 | 180.0 | 200.0 |
---|
Merging, Joining, and Concatenating
There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.
Example DataFrames
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
| A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 |
---|
1 | A1 | B1 | C1 | D1 |
---|
2 | A2 | B2 | C2 | D2 |
---|
3 | A3 | B3 | C3 | D3 |
---|
df2
| A | B | C | D |
---|
4 | A4 | B4 | C4 | D4 |
---|
5 | A5 | B5 | C5 | D5 |
---|
6 | A6 | B6 | C6 | D6 |
---|
7 | A7 | B7 | C7 | D7 |
---|
df3
| A | B | C | D |
---|
8 | A8 | B8 | C8 | D8 |
---|
9 | A9 | B9 | C9 | D9 |
---|
10 | A10 | B10 | C10 | D10 |
---|
11 | A11 | B11 | C11 | D11 |
---|
Concatenation
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
pd.concat([df1,df2,df3])
| A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 |
---|
1 | A1 | B1 | C1 | D1 |
---|
2 | A2 | B2 | C2 | D2 |
---|
3 | A3 | B3 | C3 | D3 |
---|
4 | A4 | B4 | C4 | D4 |
---|
5 | A5 | B5 | C5 | D5 |
---|
6 | A6 | B6 | C6 | D6 |
---|
7 | A7 | B7 | C7 | D7 |
---|
8 | A8 | B8 | C8 | D8 |
---|
9 | A9 | B9 | C9 | D9 |
---|
10 | A10 | B10 | C10 | D10 |
---|
11 | A11 | B11 | C11 | D11 |
---|
pd.concat([df1,df2,df3],axis=1)
| A | B | C | D | A | B | C | D | A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
---|
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
---|
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
---|
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
---|
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
---|
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
---|
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
---|
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
---|
Example DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
| A | B | key |
---|
0 | A0 | B0 | K0 |
---|
1 | A1 | B1 | K1 |
---|
2 | A2 | B2 | K2 |
---|
3 | A3 | B3 | K3 |
---|
right
| C | D | key |
---|
0 | C0 | D0 | K0 |
---|
1 | C1 | D1 | K1 |
---|
2 | C2 | D2 | K2 |
---|
3 | C3 | D3 | K3 |
---|
Merging
The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:
pd.merge(left,right,how='inner',on='key')
| A | B | key | C | D |
---|
0 | A0 | B0 | K0 | C0 | D0 |
---|
1 | A1 | B1 | K1 | C1 | D1 |
---|
2 | A2 | B2 | K2 | C2 | D2 |
---|
3 | A3 | B3 | K3 | C3 | D3 |
---|
Or to show a more complicated example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
---|
1 | A2 | B2 | K1 | K0 | C1 | D1 |
---|
2 | A2 | B2 | K1 | K0 | C2 | D2 |
---|
pd.merge(left, right, how='outer', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
---|
1 | A1 | B1 | K0 | K1 | NaN | NaN |
---|
2 | A2 | B2 | K1 | K0 | C1 | D1 |
---|
3 | A2 | B2 | K1 | K0 | C2 | D2 |
---|
4 | A3 | B3 | K2 | K1 | NaN | NaN |
---|
5 | NaN | NaN | K2 | K0 | C3 | D3 |
---|
pd.merge(left, right, how='right', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
---|
1 | A2 | B2 | K1 | K0 | C1 | D1 |
---|
2 | A2 | B2 | K1 | K0 | C2 | D2 |
---|
3 | NaN | NaN | K2 | K0 | C3 | D3 |
---|
pd.merge(left, right, how='left', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
---|
1 | A1 | B1 | K0 | K1 | NaN | NaN |
---|
2 | A2 | B2 | K1 | K0 | C1 | D1 |
---|
3 | A2 | B2 | K1 | K0 | C2 | D2 |
---|
4 | A3 | B3 | K2 | K1 | NaN | NaN |
---|
Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
| A | B | C | D |
---|
K0 | A0 | B0 | C0 | D0 |
---|
K1 | A1 | B1 | NaN | NaN |
---|
K2 | A2 | B2 | C2 | D2 |
---|
left.join(right, how='outer')
| A | B | C | D |
---|
K0 | A0 | B0 | C0 | D0 |
---|
K1 | A1 | B1 | NaN | NaN |
---|
K2 | A2 | B2 | C2 | D2 |
---|
K3 | NaN | NaN | C3 | D3 |
---|
Operations
There are lots of operations with pandas that will be really useful to you, but don’t fall into any distinct category. Let’s show them here in this lecture:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
| col1 | col2 | col3 |
---|
0 | 1 | 444 | abc |
---|
1 | 2 | 555 | def |
---|
2 | 3 | 666 | ghi |
---|
3 | 4 | 444 | xyz |
---|
Info on Unique Values
df['col2'].unique()
array([444, 555, 666])
df['col2'].nunique()
3
df['col2'].value_counts()
444 2
555 1
666 1
Name: col2, dtype: int64
Selecting Data
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
Applying Functions
def times2(x):
return x*2
df['col1'].apply(times2)
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()
10
Permanently Removing a Column
del df['col1']
df
| col2 | col3 |
---|
0 | 444 | abc |
---|
1 | 555 | def |
---|
2 | 666 | ghi |
---|
3 | 444 | xyz |
---|
Get column and index names
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame
df
| col2 | col3 |
---|
0 | 444 | abc |
---|
1 | 555 | def |
---|
2 | 666 | ghi |
---|
3 | 444 | xyz |
---|
df.sort_values(by='col2') #inplace=False by default
| col2 | col3 |
---|
0 | 444 | abc |
---|
3 | 444 | xyz |
---|
1 | 555 | def |
---|
2 | 666 | ghi |
---|
Find Null Values or Check for Null Values
df.isnull()
| col2 | col3 |
---|
0 | False | False |
---|
1 | False | False |
---|
2 | False | False |
---|
3 | False | False |
---|
# Drop rows with NaN Values
df.dropna()
| col2 | col3 |
---|
0 | 444 | abc |
---|
1 | 555 | def |
---|
2 | 666 | ghi |
---|
3 | 444 | xyz |
---|
Filling in NaN values with something else
import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
| col1 | col2 | col3 |
---|
0 | 1.0 | NaN | abc |
---|
1 | 2.0 | 555.0 | def |
---|
2 | 3.0 | 666.0 | ghi |
---|
3 | NaN | 444.0 | xyz |
---|
df.fillna('FILL')
| col1 | col2 | col3 |
---|
0 | 1 | FILL | abc |
---|
1 | 2 | 555 | def |
---|
2 | 3 | 666 | ghi |
---|
3 | FILL | 444 | xyz |
---|
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
| A | B | C | D |
---|
0 | foo | one | x | 1 |
---|
1 | foo | one | y | 3 |
---|
2 | foo | two | x | 2 |
---|
3 | bar | two | y | 5 |
---|
4 | bar | one | x | 4 |
---|
5 | bar | one | y | 1 |
---|
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
| C | x | y |
---|
A | B | | |
---|
bar | one | 4.0 | 1.0 |
---|
two | NaN | 5.0 |
---|
foo | one | 1.0 | 3.0 |
---|
two | 2.0 | NaN |
---|
CSV Data
df = pd.read_csv('example')
df
| a | b | c | d |
---|
0 | 0 | 1 | 2 | 3 |
---|
1 | 4 | 5 | 6 | 7 |
---|
2 | 8 | 9 | 10 | 11 |
---|
3 | 12 | 13 | 14 | 15 |
---|
CSV Output
df.to_csv('example',index=False)
Excel Data
Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.
pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1')
| a | b | c | d |
---|
0 | 0 | 1 | 2 | 3 |
---|
1 | 4 | 5 | 6 | 7 |
---|
2 | 8 | 9 | 10 | 11 |
---|
3 | 12 | 13 | 14 | 15 |
---|
Excel Output
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
HTML Data
You may need to install htmllib5,lxml, and BeautifulSoup4. In your terminal/command prompt run:
conda install lxml
conda install html5lib
conda install BeautifulSoup4
Then restart Jupyter Notebook.
(or use pip install if you aren’t using the Anaconda Distribution)
Pandas can read table tabs off of html. For example:
Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
| Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date | Loss Share Type | Agreement Terminated | Termination Date |
---|
0 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | July 12, 2016 | none | NaN | NaN |
---|
1 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | August 4, 2016 | none | NaN | NaN |
---|
2 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 | none | NaN | NaN |
---|
3 | Hometown National Bank | Longview | WA | 35156 | Twin City Bank | October 2, 2015 | April 13, 2016 | none | NaN | NaN |
---|
4 | The Bank of Georgia | Peachtree City | GA | 35259 | Fidelity Bank | October 2, 2015 | April 13, 2016 | none | NaN | NaN |
---|
5 | Premier Bank | Denver | CO | 34112 | United Fidelity Bank, fsb | July 10, 2015 | July 12, 2016 | none | NaN | NaN |
---|
6 | Edgebrook Bank | Chicago | IL | 57772 | Republic Bank of Chicago | May 8, 2015 | July 12, 2016 | none | NaN | NaN |
---|
7 | Doral BankEn Espanol | San Juan | PR | 32102 | Banco Popular de Puerto Rico | February 27, 2015 | May 13, 2015 | none | NaN | NaN |
---|
8 | Capitol City Bank & Trust Company | Atlanta | GA | 33938 | First-Citizens Bank & Trust Company | February 13, 2015 | April 21, 2015 | none | NaN | NaN |
---|
9 | Highland Community Bank | Chicago | IL | 20290 | United Fidelity Bank, fsb | January 23, 2015 | April 21, 2015 | none | NaN | NaN |
---|
10 | First National Bank of Crestview | Crestview | FL | 17557 | First NBC Bank | January 16, 2015 | January 15, 2016 | none | NaN | NaN |
---|
11 | Northern Star Bank | Mankato | MN | 34983 | BankVista | December 19, 2014 | January 6, 2016 | none | NaN | NaN |
---|
12 | Frontier Bank, FSB D/B/A El Paseo Bank | Palm Desert | CA | 34738 | Bank of Southern California, N.A. | November 7, 2014 | January 6, 2016 | none | NaN | NaN |
---|
13 | The National Republic Bank of Chicago | Chicago | IL | 916 | State Bank of Texas | October 24, 2014 | January 6, 2016 | none | NaN | NaN |
---|
14 | NBRS Financial | Rising Sun | MD | 4862 | Howard Bank | October 17, 2014 | March 26, 2015 | none | NaN | NaN |
---|
15 | GreenChoice Bank, fsb | Chicago | IL | 28462 | Providence Bank, LLC | July 25, 2014 | July 28, 2015 | none | NaN | NaN |
---|
16 | Eastside Commercial Bank | Conyers | GA | 58125 | Community & Southern Bank | July 18, 2014 | July 11, 2016 | none | NaN | NaN |
---|
17 | The Freedom State Bank | Freedom | OK | 12483 | Alva State Bank & Trust Company | June 27, 2014 | March 25, 2016 | none | NaN | NaN |
---|
18 | Valley Bank | Fort Lauderdale | FL | 21793 | Landmark Bank, National Association | June 20, 2014 | June 29, 2015 | none | NaN | NaN |
---|
19 | Valley Bank | Moline | IL | 10450 | Great Southern Bank | June 20, 2014 | June 26, 2015 | none | NaN | NaN |
---|
20 | Slavie Federal Savings Bank | Bel Air | MD | 32368 | Bay Bank, FSB | May 30, 2014 | June 15, 2015 | none | NaN | NaN |
---|
21 | Columbia Savings Bank | Cincinnati | OH | 32284 | United Fidelity Bank, fsb | May 23, 2014 | May 28, 2015 | none | NaN | NaN |
---|
22 | AztecAmerica Bank En Espanol | Berwyn | IL | 57866 | Republic Bank of Chicago | May 16, 2014 | July 18, 2014 | none | NaN | NaN |
---|
23 | Allendale County Bank | Fairfax | SC | 15062 | Palmetto State Bank | April 25, 2014 | July 18, 2014 | none | NaN | NaN |
---|
24 | Vantage Point Bank | Horsham | PA | 58531 | First Choice Bank | February 28, 2014 | March 3, 2015 | none | NaN | NaN |
---|
25 | Millennium Bank, National Association | Sterling | VA | 35096 | WashingtonFirst Bank | February 28, 2014 | March 03, 2015 | none | NaN | NaN |
---|
26 | Syringa Bank | Boise | ID | 34296 | Sunwest Bank | January 31, 2014 | April 12, 2016 | none | NaN | NaN |
---|
27 | The Bank of Union | El Reno | OK | 17967 | BancFirst | January 24, 2014 | March 25, 2016 | none | NaN | NaN |
---|
28 | DuPage National Bank | West Chicago | IL | 5732 | Republic Bank of Chicago | January 17, 2014 | February 19, 2015 | none | NaN | NaN |
---|
29 | Texas Community Bank, National Association | The Woodlands | TX | 57431 | Spirit of Texas Bank, SSB | December 13, 2013 | December 29, 2014 | none | NaN | NaN |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
515 | ANB Financial, NA | Bentonville | AR | 33901 | Pulaski Bank and Trust Company | May 9, 2008 | August 28, 2012 | none | NaN | NaN |
---|
516 | Hume Bank | Hume | MO | 1971 | Security Bank | March 7, 2008 | August 28, 2012 | none | NaN | NaN |
---|
517 | Douglass National Bank | Kansas City | MO | 24660 | Liberty Bank and Trust Company | January 25, 2008 | October 26, 2012 | none | NaN | NaN |
---|
518 | Miami Valley Bank | Lakeview | OH | 16848 | The Citizens Banking Company | October 4, 2007 | August 28, 2012 | none | NaN | NaN |
---|
519 | NetBank | Alpharetta | GA | 32575 | ING DIRECT | September 28, 2007 | August 28, 2012 | none | NaN | NaN |
---|
520 | Metropolitan Savings Bank | Pittsburgh | PA | 35353 | Allegheny Valley Bank of Pittsburgh | February 2, 2007 | October 27, 2010 | none | NaN | NaN |
---|
521 | Bank of Ephraim | Ephraim | UT | 1249 | Far West Bank | June 25, 2004 | April 9, 2008 | none | NaN | NaN |
---|
522 | Reliance Bank | White Plains | NY | 26778 | Union State Bank | March 19, 2004 | April 9, 2008 | none | NaN | NaN |
---|
523 | Guaranty National Bank of Tallahassee | Tallahassee | FL | 26838 | Hancock Bank of Florida | March 12, 2004 | June 5, 2012 | none | NaN | NaN |
---|
524 | Dollar Savings Bank | Newark | NJ | 31330 | No Acquirer | February 14, 2004 | April 9, 2008 | none | NaN | NaN |
---|
525 | Pulaski Savings Bank | Philadelphia | PA | 27203 | Earthstar Bank | November 14, 2003 | July 22, 2005 | none | NaN | NaN |
---|
526 | First National Bank of Blanchardville | Blanchardville | WI | 11639 | The Park Bank | May 9, 2003 | June 5, 2012 | none | NaN | NaN |
---|
527 | Southern Pacific Bank | Torrance | CA | 27094 | Beal Bank | February 7, 2003 | October 20, 2008 | none | NaN | NaN |
---|
528 | Farmers Bank of Cheneyville | Cheneyville | LA | 16445 | Sabine State Bank & Trust | December 17, 2002 | October 20, 2004 | none | NaN | NaN |
---|
529 | Bank of Alamo | Alamo | TN | 9961 | No Acquirer | November 8, 2002 | March 18, 2005 | none | NaN | NaN |
---|
530 | AmTrade International BankEn Espanol | Atlanta | GA | 33784 | No Acquirer | September 30, 2002 | September 11, 2006 | none | NaN | NaN |
---|
531 | Universal Federal Savings Bank | Chicago | IL | 29355 | Chicago Community Bank | June 27, 2002 | April 9, 2008 | none | NaN | NaN |
---|
532 | Connecticut Bank of Commerce | Stamford | CT | 19183 | Hudson United Bank | June 26, 2002 | February 14, 2012 | none | NaN | NaN |
---|
533 | New Century Bank | Shelby Township | MI | 34979 | No Acquirer | March 28, 2002 | March 18, 2005 | none | NaN | NaN |
---|
534 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | March 1, 2002 | April 9, 2008 | none | NaN | NaN |
---|
535 | NextBank, NA | Phoenix | AZ | 22314 | No Acquirer | February 7, 2002 | February 5, 2015 | none | NaN | NaN |
---|
536 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | February 1, 2002 | October 25, 2012 | none | NaN | NaN |
---|
537 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | January 18, 2002 | November 6, 2003 | none | NaN | NaN |
---|
538 | Hamilton Bank, NAEn Espanol | Miami | FL | 24382 | Israel Discount Bank of New York | January 11, 2002 | September 21, 2015 | none | NaN | NaN |
---|
539 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | September 7, 2001 | February 10, 2004 | none | NaN | NaN |
---|
540 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | July 27, 2001 | August 19, 2014 | none | NaN | NaN |
---|
541 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | May 3, 2001 | November 18, 2002 | none | NaN | NaN |
---|
542 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | February 2, 2001 | February 18, 2003 | none | NaN | NaN |
---|
543 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 | March 17, 2005 | none | NaN | NaN |
---|
544 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | October 13, 2000 | March 17, 2005 | none | NaN | NaN |
---|
545 rows × 10 columns
SQL Data
- Note: If you are completely unfamiliar with SQL you can check out my other course: “Complete SQL Bootcamp” to learn SQL.
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respect the Python DB-API.
See also some cookbook examples for some advanced strategies.
The key functions are:
- read_sql_table(table_name, con[, schema, …])
- Read SQL database table into a DataFrame.
- read_sql_query(sql, con[, index_col, …])
- Read SQL query into a DataFrame.
- read_sql(sql, con[, index_col, …])
- Read SQL query or database table into a DataFrame.
- DataFrame.to_sql(name, con[, flavor, …])
- Write records stored in a DataFrame to a SQL database.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
sql_df
| index | a | b | c | d |
---|
0 | 0 | 0 | 1 | 2 | 3 |
---|
1 | 1 | 4 | 5 | 6 | 7 |
---|
2 | 2 | 8 | 9 | 10 | 11 |
---|
3 | 3 | 12 | 13 | 14 | 15 |
---|