This is an introduction to using NumPy and Pandas, based on the Udacity course Intro to Data Analysis. It covers NumPy and Pandas data structures, basic operations, and functions, with code examples. I learned a lot from this course and am sharing these notes for your reference.




1. Data Analysis Process

Introduction

  • Question
    • Characteristics of students who pass projects
    • How to stock my store with the products people most want to buy
  • Wrangling
    • Data acquisition
    • Data cleaning
  • Explore

    • Build intuition
    • Find patterns
  • Draw conclusion/or make prediction

    • Predict: Which movies users will like
    • Conclude: Users are less likely to click certain articles
    • Usually requires statistics or machine learning
  • Communication
    • Blog, paper, email, PowerPoint, etc.
    • Data visualization

The data wrangling phase is heavily intertwined with the exploration phase. In practice, data acquisition often comes before you pose a question.

2. NumPy and Pandas for 1D Data

In [1]:
import numpy as np
import pandas as pd
In [2]:
# read csv via pandas
data = pd.read_csv('gdp_per_capita.csv')

# check the unique items in a column
len(data['Country'].unique())
Out[2]:
275

One-dimensional data structures

  • Numpy (Numerical Python)
    • Array.
  • Pandas
    • Series. Built on NumPy's array, with more features.

Numpy array

  • Similarities between Numpy arrays and Python lists.
    • Access elements by position. a[0]
    • Access a range of elements. a[0:3] (note upper bound is not inclusive)
    • Use loops. for x in a:
  • Differences between Numpy arrays and Python lists.
    • In a NumPy array, all elements must have the same type (string, int, boolean, etc.). A Python list can mix types.
    • NumPy has convenient functions such as mean() and std(). Lists have similar functionality, but NumPy is faster.
    • Numpy can be multi-dimensional, similar to list of list in python.
In [3]:
# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])

# Employment data in 2007 for those 20 countries
employment = np.array([
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])

# try different expressions, access element and slicing
print(countries[0])
print(countries[0:3])
print(countries[:2])
print(countries[:])

# elements type
print('-'*100)
print(countries.dtype)
print(employment.dtype)

# looping
print('-'*100)
for i in range(3):
    country = countries[i]
    country_employment = employment[i]
    print('Country {} has employment {}'.format(country, country_employment))

# Numpy functions. np.argmax() returns the index of maximum
print('-'*100)
print(employment.mean(), employment.std(), employment.max(), employment.sum(), employment.argmax())
Afghanistan
['Afghanistan' 'Albania' 'Algeria']
['Afghanistan' 'Albania']
['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Argentina' 'Armenia'
 'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh'
 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina']
----------------------------------------------------------------------------------------------------

Vectorized Operations

A vector is a list of elements.

  • Vector addition.
  • Multiplying by a scalar
  • More vectorized operations.
    • Math operations: +, -, *, /, **
    • Logical operations: And: &, Or: |, Not: ~. These are only valid for booleans; applied to integers, they are bitwise operations.
    • Comparison operations: >, >=, <=, <, ==, !=.
In [4]:
# vector addition, linear algebra
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a + b)
print(a * b)

# list concatenation, note the difference of python list
print([1,2,3] + [4,5,6])

# multiplying by a scalar
print(a * 3)

# in python list
print([1,2,3]*3)

# arthmetic operations between a Numpy array and a single number
print(a+5)

# logical operations with Numpy arrays
print('-'*100)
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
print(a & b)
print(~a)
print(a & False)

# comparison operations between 2 NumPy arrays
print('-'*100)
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print(a > b)

# comparison operations between a Numpy array and a single number
print(a > 2)
[5 7 9]
[ 4 10 18]
[1, 2, 3, 4, 5, 6]
[3 6 9]
[1, 2, 3, 1, 2, 3, 1, 2, 3]
[6 7 8]
----------------------------------------------------------------------------------------------------
[ True False False False]
[False False  True  True]
[False False False False]
----------------------------------------------------------------------------------------------------
[False False False  True  True]
[False False  True  True  True]

NumPy index arrays

An array of booleans can be an index array: indexing with it keeps the elements where the boolean is True.

In [5]:
# using index arrays
a = np.array([1, 2, 3, 4])
b = np.array([True, True, False, False])

print(a[b])
print(a[a>2])

# creating the index array using vectorized operations
b = a > 2
print(b)

# array of booleans can be used with numpy functions
print(b.mean())
[1 2]
[3 4]
[False False  True  True]
0.5
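
Boolean conditions built from vectorized comparisons can also be combined with & and | before indexing. A small sketch (the parentheses are required because & binds more tightly than the comparisons):

a = np.array([1, 2, 3, 4])
print(a[(a > 1) & (a < 4)])  # elements strictly between 1 and 4: [2 3]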

+ vs. +=

+= operates in place while + does not. That is, += stores the values back into the original array rather than creating a new one.

In [6]:
# case 1: both a and b refer to the same array
a = np.array([1,2,3])
b = a
a += 1
print(b)

# case 2: both a and b refer to the same array, but later a is reassigned, which creates a new array. 
# Then a refers to a new array while b refers to the original array
a = np.array([1,2,3])
b = a
a = a + 1
print(b)
[2 3 4]
[1 2 3]

In-Place vs. Not In-Place

When you take a slice, NumPy doesn't create a new array. Instead, the slice refers to what's called a view of the original array. If the slice is modified, the original array is modified as well. This makes slicing in NumPy very fast.

Note this is different from Python lists, where a slice is a new list.

In [7]:
a = np.array([1,2,3,4])

# taking a slice: NumPy doesn't create a new array, it returns a view
Slice = a[:3]
Slice[0] = 10
print(a)

# try python list
a = [1,2,3,4]
Slice = a[:3]
Slice[0] = 10
print(a)
[10  2  3  4]
[1, 2, 3, 4]
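
If you need a slice that is independent of the original array, you can ask NumPy for an explicit copy. A quick sketch:

a = np.array([1, 2, 3, 4])
s = a[:3].copy()  # .copy() materializes a new array instead of a view
s[0] = 10
print(a)          # [1 2 3 4] -- the original is unchanged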

Pandas Series

Similar to a NumPy array, but with extra functionality, e.g. s.describe(). Similarities:

- Accessing elements, `s[0]`, `s[3:7]`
- Looping `for x in s`.
- Convenient functions. `s.mean()`, `s.max()`.
- Vectorized operations. `s1 + s2`
- Implemented in C, fast.
In [8]:
import pandas as pd

countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
             'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
             'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
             'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']

life_expectancy_values = [74.7,  75. ,  83.4,  57.6,  74.6,  75.4,  72.3,  81.5,  80.2,
                          70.3,  72.1,  76.4,  68.1,  75.2,  69.8,  79.4,  70.8,  62.7,
                          67.3,  70.6]

gdp_values = [ 1681.61390973,   2155.48523109,  21495.80508273,    562.98768478,
              13495.1274663 ,   9388.68852258,   1424.19056199,  24765.54890176,
              27036.48733192,   1945.63754911,  21721.61840978,  13373.21993972,
                483.97086804,   9783.98417323,   2253.46411147,  25034.66692293,
               3680.91642923,    366.04496652,   1175.92638695,   1132.21387981]

# Life expectancy and gdp data in 2007 for 20 countries
life_expectancy = pd.Series(life_expectancy_values)
gdp = pd.Series(gdp_values)
In [9]:
# pandas accessing elements and slicing
print(life_expectancy[0])
print(gdp[0:3])

# panda looping
print('-'*100)
for country_life_expectancy in life_expectancy[0:3]:
    print('Examining life expectancy {}'.format(country_life_expectancy))

# pandas functions
print('-'*100)
print(life_expectancy.mean(), life_expectancy.std(), gdp.max(), gdp.min())

# Vectorized operations and index arrays
print('-'*100)
a = pd.Series([1, 2, 3, 4])
b = pd.Series([1, 2, 1, 2])
print(a+b)
print(a*2)
print(a>=3)
print(a[a>=3])
74.7
0     1681.613910
1     2155.485231
2    21495.805083
dtype: float64
----------------------------------------------------------------------------------------------------
Examining life expectancy 74.7
Examining life expectancy 75.0
Examining life expectancy 83.4
----------------------------------------------------------------------------------------------------
72.86999999999999 6.213999474869968 27036.4873319 366.04496652
----------------------------------------------------------------------------------------------------
0    2
1    4
2    4
3    6
dtype: int64
0    2
1    4
2    6
3    8
dtype: int64
0    False
1    False
2     True
3     True
dtype: bool
2    3
3    4
dtype: int64

Series index

The main difference between a Pandas Series and a NumPy array is that a Pandas Series has something called an index.

NumPy arrays are like a souped-up version of Python lists. Similarly, a Pandas Series is like a cross between a list and a dictionary. With a list, elements are stored in order and accessed by position; with a dictionary, values are accessed by key. A Pandas Series lets you use both position and index. (I think the index here is more similar to a key in a dictionary.)

If you don't specify an index when creating a Pandas Series, then 0, 1, 2, 3, ... are used as the index by default.

In a Pandas Series, use s.loc[] to access elements by index. When you use s[], you are accessing elements by position (provided the index labels aren't integers themselves). The attribute s.iloc[] also accesses elements by position, the same as s[] but more explicit. E.g., we say s.iloc[0] is position zero rather than index 0, since the index could be some other label.

In [10]:
p = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(p)
print(p['a'])
print(p[0]) # since 0 is not in the index, bracket access [0] falls back to position
print(p.iloc[0]) # iloc is position-based indexing, which is more explicit than p[0],
# so we say it's position zero rather than index 0, as the index here is 'a'
a    1
b    2
c    3
dtype: int64
1
1
1
In [11]:
countries = ['Afghanistan', 'Albania', 'Algeria', 'Angola']

employment_values = [55.70000076,  51.40000153,  50.5       ,  75.69999695,]

# Employment data in 2007 for 4 of the countries
employment = pd.Series(employment_values, index=countries)
In [12]:
print(employment[0])
print(employment.iloc[0])
print(employment['Afghanistan'])
print(employment.idxmax()) # returns the index label of the maximum, which here is a country name (argmax() in older pandas)
55.70000076
55.70000076
55.70000076
Angola
In [13]:
def max_employment(employment):
    max_country = employment.idxmax()  # index label of the maximum value
    max_value = employment[max_country]
    return (max_country, max_value)
max_employment(employment)
Out[13]:
('Angola', 75.699996949999999)
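
Equivalently, idxmax() can be paired with the built-in max() to avoid the second lookup; a minimal sketch:

def max_employment_v2(employment):
    # same result: the index label of the maximum, and the maximum itself
    return (employment.idxmax(), employment.max())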

Vectorized Operations and Series Indexes

  • When you add two NumPy arrays, you are adding by position.
  • When you add two Series, values are matched by index, even if the indexes have different orders or different elements.
In [14]:
# Addition when indexes are the same
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1 + s2)

# Indexes have same elements in a different order
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
print(s1 + s2)

# Indexes overlap, but do not have exactly the same elements
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
print(s1 + s2) # only the overlapped indexes are added

print('-'*100)
print((s1 + s2).dropna()) # drop NA in a series

# Indexes do not overlap
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['e', 'f', 'g', 'h'])
print(s1 + s2) # no index is added
a    11
b    22
c    33
d    44
dtype: int64
----------------------------------------------------------------------------------------------------
a    31
b    12
c    43
d    24
dtype: int64
----------------------------------------------------------------------------------------------------
a     NaN
b     NaN
c    13.0
d    24.0
e     NaN
f     NaN
dtype: float64
----------------------------------------------------------------------------------------------------
c    13.0
d    24.0
dtype: float64
----------------------------------------------------------------------------------------------------
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
g   NaN
h   NaN
dtype: float64
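
If you would rather treat a missing index as zero instead of producing NaN, the add() method accepts a fill_value argument. A sketch using the non-overlapping s1 and s2 from the last case above:

print(s1.add(s2, fill_value=0))  # every label from both indexes appears; missing values are treated as 0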
In [15]:
# the default index of a pandas Series is 0, 1, 2, ...
p2 = pd.Series([1, 2, 3])
print(p2)
print(p2.iloc[0])  # position-based access
0    1
1    2
2    3
dtype: int64
1
In [16]:
# pandas idxmax() returns the index label of the first maximum value, which is 'c' here
p.idxmax()
Out[16]:
'c'

Pandas Series apply().

  • apply() takes a Series and a function and returns a new Series; it makes code concise.
  • For a DataFrame, apply() applies the function along an input axis. axis: {0 or ‘index’, 1 or ‘columns’}, default 0.
In [17]:
def reverse(name):
    s = name.split(' ')
    return s[1]+', '+s[0]

names = pd.Series([
    'Andre Agassi',
    'Barry Bonds',
    'Christopher Columbus',
    'Daniel Defoe'])
print(names.apply(reverse))
0            Agassi, Andre
1             Bonds, Barry
2    Columbus, Christopher
3            Defoe, Daniel
dtype: object
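
Note that reverse() assumes each name contains exactly one space. A slightly more defensive sketch (assuming the last token is the family name) also handles names with more than two words:

def reverse_robust(name):
    parts = name.split()
    # assume the final token is the surname; everything before it is the given name
    return parts[-1] + ', ' + ' '.join(parts[:-1])

print(names.apply(reverse_robust))  # same output as above for these two-word names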

Plotting in Pandas

In [18]:
from IPython.display import Image
#%pylab inline
%matplotlib inline 
# make plots appear within the IPython notebook.
import matplotlib.pyplot as plt
data = pd.Series(np.random.normal(0,1,1000))
plt.hist(data)
Out[18]:
(array([  13.,   49.,  165.,  258.,  269.,  178.,   51.,   14.,    2.,    1.]),
 array([-2.82878883, -2.1349477 , -1.44110657, -0.74726544, -0.05342431,
         0.64041682,  1.33425795,  2.02809908,  2.72194021,  3.41578134,
         4.10962247]),
 <a list of 10 Patch objects>)
In [19]:
# pandas also has built-in plotting using matplotlib
# The index of the Series will be used for the x-axis and the values for the y-axis
data = pd.Series([7,3,8,5], index=['a', 'b', 'c', 'd'])
data.plot()
Out[19]:
[line plot: the Series index 'a'-'d' on the x-axis, values on the y-axis]
3. NumPy and Pandas for 2D Data

Two-dimensional data:

  • Python: list of lists
  • NumPy: 2D array
  • Pandas: DataFrame

NumPy uses true 2D arrays, as opposed to arrays of arrays:

  • More memory efficient. See this page.
  • The element-access syntax is different: for a 2D array, use a[1,3] rather than the a[1][3] of an array of arrays. The row and column can also be slices using colon notation, e.g. a[1:3, 4:5].
  • mean(), std(), etc. operate on the entire array.
In [20]:
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparison.
    
    Hint: NumPy's argmax() function might be useful:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
    '''
    overall_mean = ridership.mean()
    max_station_first_day = ridership[0, :].argmax()  # column index of the busiest station on day 1
    mean_for_max = ridership[:, max_station_first_day].mean()
    
    return (overall_mean, mean_for_max)
print(mean_riders_for_max_station(ridership))
(2342.5999999999999, 3239.9000000000001)

NumPy Axis

E.g. array.mean(axis = 0) computes the mean of each column; array.mean(axis = 1) computes the mean of each row.

In [21]:
print(ridership.mean(axis = 0))
[ 1071.2  2814.9  2718.8  3239.9  1868.2]
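
And along the other axis, using the same ridership array:

print(ridership.mean(axis = 1))  # mean of each row (each day), e.g. 1.4 for the first day, 2779.2 for the second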

NumPy and Pandas Data Types

  • When not all elements are numbers (e.g., there are NaN values), a NumPy array's mean() returns NaN, while a Pandas DataFrame skips NaN by default (see the sketch after this list).
  • A Pandas DataFrame has indexes similar to a Pandas Series: an index value for each row and a name for each column. (I think of the index as a kind of row name.)
  • When taking the mean, Pandas computes the mean of each column rather than of the entire DataFrame, since each column is assumed to hold a different kind of data.
  • Use the axis argument to take the mean of each row instead: df.mean(axis=1). This doesn't make sense if the columns hold different data types.
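
A quick sketch of the NaN behavior mentioned in the first bullet:

arr = np.array([1.0, 2.0, np.nan])
print(arr.mean())             # nan -- NumPy propagates NaN
print(pd.Series(arr).mean())  # 1.5 -- pandas skips NaN by default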
In [22]:
print(np.array([1,2,3]).dtype)

# convert a dictionary to a DataFrame; the keys become the column names
dic = {'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]}
df = pd.DataFrame(dic, index=['x', 'y', 'z'])
print(df)
print(df.mean())
print(df.values.mean())

# convert a 2D numpy array or list of lists to pandas dataframe
print('-'*100)
df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])
print(df_2)
int32
   a  b  c
x  1  4  7
y  2  5  8
z  3  6  9
a    2.0
b    5.0
c    8.0
dtype: float64
5.0
----------------------------------------------------------------------------------------------------
   A  B  C
0  0  1  2
1  3  4  5

Access Elements of DataFrame

  • Accessing a single row of a DataFrame is similar to accessing a single element of a Series. Use loc[] to access rows by index, and iloc[] to access rows by position.
  • Access a single element: df.iloc[2, 3], or df.loc[index, colName].
  • Access columns using square brackets: df[colName].
  • Use df.values to get a 2D NumPy array of just the values, without column names and row indexes. This is useful if you want to compute the mean of all values, but be careful about the data types when doing so.
In [23]:
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)

print(ridership_df)

# access elements, first row
print('-'*100)
print(ridership_df.iloc[0]) # .iloc[0, :] is equivalent
print(type(ridership_df.iloc[0])) # note the first row comes back as a pandas Series, with the column names as its index

# row whose index is '05-02-11'
print('-'*100)
print(ridership_df.loc['05-02-11'])

# column whose column name is 'R003'
print('-'*100)
print(ridership_df['R003']) # note ridership_df[:, 'R003'] is wrong; use df['R003'] or df.loc[:, 'R003']

# access 2nd row, 4th column
print('-'*100)
print(ridership_df.iloc[1, 3])

# access multiple rows
print('-'*100)
print(ridership_df.iloc[1:3])

# access multiple columns
print('-'*100)
print(ridership_df[['R003', 'R005']])
          R003  R004  R005  R006  R007
05-01-11     0     0     2     5     0
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
05-04-11  1560  3392  3826  4787  2613
05-05-11  1608  4802  3932  4477  2705
05-06-11  1576  3933  3909  4979  2685
05-07-11    95   229   255   496   201
05-08-11     2     0     1    27     0
05-09-11  1438  3785  3589  4174  2215
05-10-11  1342  4043  4009  4665  3033
----------------------------------------------------------------------------------------------------
R003    0
R004    0
R005    2
R006    5
R007    0
Name: 05-01-11, dtype: int64
<class 'pandas.core.series.Series'>
----------------------------------------------------------------------------------------------------
R003    1478
R004    3877
R005    3674
R006    2328
R007    2539
Name: 05-02-11, dtype: int64
----------------------------------------------------------------------------------------------------
05-01-11       0
05-02-11    1478
05-03-11    1613
05-04-11    1560
05-05-11    1608
05-06-11    1576
05-07-11      95
05-08-11       2
05-09-11    1438
05-10-11    1342
Name: R003, dtype: int64
----------------------------------------------------------------------------------------------------
2328
----------------------------------------------------------------------------------------------------
          R003  R004  R005  R006  R007
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
----------------------------------------------------------------------------------------------------
          R003  R005
05-01-11     0     2
05-02-11  1478  3674
05-03-11  1613  3991
05-04-11  1560  3826
05-05-11  1608  3932
05-06-11  1576  3909
05-07-11    95   255
05-08-11     2     1
05-09-11  1438  3589
05-10-11  1342  4009
In [24]:
# the index label of the maximum
ridership_df.iloc[0].idxmax()
Out[24]:
'R006'

Loading Data into a DataFrame

df = pd.read_csv(filename)

df.head()

df.describe()

In [25]:
# read csv file
subway_df = pd.read_csv("nyc_subway_weather.csv")

# compute the mean of one column
print(np.mean(subway_df['ENTRIESn_hourly']))

# compute the correlation of two columns
def correlation(x, y):
    '''
    Fill in this function to compute the correlation between the two
    input variables. Each input is either a NumPy array or a Pandas
    Series.
    
    correlation = average of (x in standard units) times (y in standard units)
    
    Remember to pass the argument "ddof=0" to the Pandas std() function!
    '''
    std_x = (x-x.mean())/x.std(ddof=0)
    std_y = (y-y.mean())/y.std(ddof=0)
    
    return (std_x * std_y).mean()

entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
print(correlation(entries, cum_entries))
1886.5899552158317
0.5858954707662182
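
As a sanity check, pandas' built-in Pearson correlation should agree (the ddof choice cancels out in the ratio):

print(entries.corr(cum_entries))  # should print the same value, ~0.586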

Pandas Axis Name

axis : {0 or ‘index’, 1 or ‘columns’}, default 0. You can use axis='index' or axis='columns'.

For example, df.mean(axis='columns') takes the mean along the columns, which returns the mean of each row. Similarly, df.mean(axis='index') takes the mean along the index, which returns the mean of each column.
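
A minimal illustration:

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
print(df.mean(axis='index'))    # mean of each column: x -> 1.5, y -> 3.5
print(df.mean(axis='columns'))  # mean of each row: 0 -> 2.0, 1 -> 3.0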

DataFrame Vectorized Operations

In [26]:
# adding DataFrame with the column names
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1)
print(df2)
print(df1+df2)

# adding DataFrame with overlapping column names
print('-'*100)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)

# adding DataFrame with overlapping row indexes
print('-'*100)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                       index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                       index=['row4', 'row3', 'row2'])
print(df1 + df2)
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
    a   b   c
0  10  40  70
1  20  50  80
2  30  60  90
    a   b   c
0  11  44  77
1  22  55  88
2  33  66  99
----------------------------------------------------------------------------------------------------
    a   b   c   d
0 NaN  74  47 NaN
1 NaN  85  58 NaN
2 NaN  96  69 NaN
----------------------------------------------------------------------------------------------------
         a     b     c
row1   NaN   NaN   NaN
row2  32.0  65.0  98.0
row3  23.0  56.0  89.0
row4   NaN   NaN   NaN
In [27]:
# compute the entries per hour from cumulative entries
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})


# method 1:
# create a data frame with specific column names
result = pd.DataFrame(columns = ['entries', 'exits'])
result.loc[0] = 0
for i in range(entries_and_exits.shape[0]-1):
    result.loc[i+1] = (entries_and_exits.loc[i+1] - entries_and_exits.loc[i]).values # add rows to DataFrame using loc[]
print(result)

# method 2: using shift method, DataFrame minus DataFrame
print('-'*100)
print(entries_and_exits - entries_and_exits.shift(1))
   entries  exits
0      0.0    0.0
1     23.0    8.0
2     18.0   18.0
3     71.0   54.0
4    170.0   44.0
5    214.0   42.0
6     87.0   11.0
7     10.0    3.0
8     36.0   89.0
9    153.0  333.0
----------------------------------------------------------------------------------------------------
   ENTRIESn  EXITSn
0       NaN     NaN
1      23.0     8.0
2      18.0    18.0
3      71.0    54.0
4     170.0    44.0
5     214.0    42.0
6      87.0    11.0
7      10.0     3.0
8      36.0    89.0
9     153.0   333.0
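
Pandas also provides diff(), which is shorthand for the subtract-a-shift pattern of method 2:

# method 3: DataFrame.diff() computes the same row-to-row differences
print(entries_and_exits.diff())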
In [28]:
# add a row in a data frame
temp = pd.DataFrame(columns = ['col1', 'col2'])
temp.loc[2] = [1,2]
print(temp)
   col1  col2
2   1.0   2.0

DataFrame applymap() and apply()

  • applymap() applies a function to a DataFrame elementwise, like doing map() over each Series in the DataFrame.
  • apply() on a DataFrame is the direct analog of apply() on a Series, except the pieces are columns rather than single elements: you pass apply() a function that takes a Series (a column of the DataFrame) and returns a new Series (a new column).
  • The benefit of apply() over applymap() is that the operation on each element can depend on the entire column, whereas applymap() is just an element-to-element mapping.
  • apply() also takes an axis argument, so you can write a function that operates on one row at a time.
  • The function given to apply() can also take in one column and return a single value; then applying it to a DataFrame returns a Series.
In [29]:
# applymap()
df = pd.DataFrame({
        'a': [1, 2, 3],
        'b': [10, 20, 30],
        'c': [5, 10, 15]
    })
def add_one(x):
        return x + 1
print(df.applymap(add_one))
   a   b   c
0  2  11   6
1  3  21  11
2  4  31  16
In [30]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df
    
def convert_grades(grades):
    '''
    Fill in this function to convert the given DataFrame of numerical
    grades to letter grades. Return a new DataFrame with the converted
    grade.
    
    The conversion rule is:
        90-100 -> A
        80-89  -> B
        70-79  -> C
        60-69  -> D
        0-59   -> F
    '''
    def conversion(x):
        if 90 <= x <= 100:
            x = 'A'
        elif 80 <= x < 90:
            x = 'B'
        elif 70 <= x < 80:
            x = 'C'
        elif 60 <= x < 70:
            x = 'D'
        else:
            x = 'F'
        return x
    return grades.applymap(conversion)
convert_grades(grades_df)
Out[30]:
        exam1 exam2
Andre       F     F
Barry       B     D
Chris       C     F
Dan         C     F
Emilio      B     D
Fred        C     F
Greta       A     C
Humbert     D     F
Ivan        A     C
James       B     D
In [31]:
# use apply() to compute standarized value of each element
def standardize(df):
    '''
    Fill in this function to standardize each column of the given
    DataFrame. To standardize a variable, convert each value to the
    number of standard deviations it is above or below the mean.
    '''
    return df.apply(lambda x: (x-x.mean())/x.std(ddof = 0))
print(standardize(grades_df))
            exam1     exam2
Andre   -2.315341 -2.304599
Barry    0.220191  0.386400
Chris    0.020017 -0.096600
Dan     -0.180156 -0.096600
Emilio   0.753987  0.662400
Fred    -0.513779 -0.441600
Greta    0.887436  1.490400
Humbert -0.847401 -0.786600
Ivan     1.354508  1.007400
James    0.620538  0.179400
In [32]:
# case 2 of apply()
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
print(df.apply(np.max))

def second_largest(df):
    '''
    Fill in this function to return the second-largest value of each 
    column of the input DataFrame.
    '''
    def second_largest_column(column):
        # sort_values() returns a new sorted Series and does not modify the original,
        # so calling column.sort_values() on its own and then taking column.iloc[1] would be wrong
        return column.sort_values(ascending = False).iloc[1] # second largest element of each column 
    return df.apply(second_largest_column)

print(second_largest(df))
a     5
b    50
c    25
dtype: int64
a     4
b    40
c    20
dtype: int64
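
An alternative sketch using Series.nlargest(), which returns the n largest values in descending order:

def second_largest_v2(df):
    # nlargest(2) keeps the top two values; the last of them is the second largest
    return df.apply(lambda column: column.nlargest(2).iloc[-1])

print(second_largest_v2(df))  # same result: a 4, b 40, c 20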

Adding a DataFrame to a Series

It matches up the Series and the DataFrame using the index of the Series and the column names of the DataFrame.

In [33]:
# adding a Series to a square DataFrame
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })    
print(df)
print(df+s)

# adding a Series to a one-row DataFrame
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
print(df)
print(df+s)

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})

# adding a Series to a one-column DataFrame
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
print(df)
print(df+s)

# adding when DataFrame column names match Series index
# actually the first case is the same as this one, since the default Series index is 0, 1, 2, ..., which matches the column names
print('-'*100)
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
print(df)
print(df+s)

# adding when DataFrame column names don't match Series index
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
print(df)
print(df+s)
    0   1    2    3
0  10  50   90  130
1  20  60  100  140
2  30  70  110  150
3  40  80  120  160
    0   1    2    3
0  11  52   93  134
1  21  62  103  144
2  31  72  113  154
3  41  82  123  164
----------------------------------------------------------------------------------------------------
    0   1   2   3
0  10  20  30  40
    0   1   2   3
0  11  22  33  44
----------------------------------------------------------------------------------------------------
    0
0  10
1  20
2  30
3  40
    0   1   2   3
0  11 NaN NaN NaN
1  21 NaN NaN NaN
2  31 NaN NaN NaN
3  41 NaN NaN NaN
----------------------------------------------------------------------------------------------------
    a   b    c    d
0  10  50   90  130
1  20  60  100  140
2  30  70  110  150
3  40  80  120  160
    a   b    c    d
0  11  52   93  134
1  21  62  103  144
2  31  72  113  154
3  41  82  123  164
----------------------------------------------------------------------------------------------------
    a   b    c    d
0  10  50   90  130
1  20  60  100  140
2  30  70  110  150
3  40  80  120  160
    a   b   c   d   0   1   2   3
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
In [34]:
# using .add() method
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
print(df.add(s, axis='columns')) # the same as df+s, axis='columns' by default
print(df.add(s, axis='index')) 
    0   1    2    3
0  11  52   93  134
1  21  62  103  144
2  31  72  113  154
3  41  82  123  164
    0   1    2    3
0  11  51   91  131
1  22  62  102  142
2  33  73  113  153
3  44  84  124  164
In [35]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df
Out[35]:
         exam1  exam2
Andre       43     24
Barry       81     63
Chris       78     56
Dan         75     56
Emilio      89     67
Fred        70     51
Greta       91     79
Humbert     65     46
Ivan        98     72
James       87     60

Standardize Columns and Rows via Vectorized Operation

In [36]:
def standardize(df):
    '''
    Fill in this function to standardize each column of the given
    DataFrame. To standardize a variable, convert each value to the
    number of standard deviations it is above or below the mean.
    
    This time, try to use vectorized operations instead of apply().
    You should get the same results as you did before.
    '''
    return (df - df.mean())/df.std(ddof=0)

def standardize_rows(df):
    '''
    Optional: Fill in this function to standardize each row of the given
    DataFrame. Again, try not to use apply().
    
    This one is more challenging than standardizing each column!
    '''
    mean = df.mean(axis='columns')
    std = df.std(ddof=0, axis='columns')
    return df.add(-mean, axis='index').div(std, axis='index')   # note .div() is similar to .add()

print(standardize(grades_df))
print(standardize_rows(grades_df))
            exam1     exam2
Andre   -2.315341 -2.304599
Barry    0.220191  0.386400
Chris    0.020017 -0.096600
Dan     -0.180156 -0.096600
Emilio   0.753987  0.662400
Fred    -0.513779 -0.441600
Greta    0.887436  1.490400
Humbert -0.847401 -0.786600
Ivan     1.354508  1.007400
James    0.620538  0.179400
         exam1  exam2
Andre      1.0   -1.0
Barry      1.0   -1.0
Chris      1.0   -1.0
Dan        1.0   -1.0
Emilio     1.0   -1.0
Fred       1.0   -1.0
Greta      1.0   -1.0
Humbert    1.0   -1.0
Ivan       1.0   -1.0
James      1.0   -1.0
In [37]:
# show how .add() works
t= grades_df.add(-grades_df.mean(axis='columns'), axis='index') # .add(negative) is the same as .sub(positive)
print(t)

# show how .div() works
print(t.div(grades_df.std(ddof=0, axis='columns'), axis='index'))
         exam1  exam2
Andre      9.5   -9.5
Barry      9.0   -9.0
Chris     11.0  -11.0
Dan        9.5   -9.5
Emilio    11.0  -11.0
Fred       9.5   -9.5
Greta      6.0   -6.0
Humbert    9.5   -9.5
Ivan      13.0  -13.0
James     13.5  -13.5
         exam1  exam2
Andre      1.0   -1.0
Barry      1.0   -1.0
Chris      1.0   -1.0
Dan        1.0   -1.0
Emilio     1.0   -1.0
Fred       1.0   -1.0
Greta      1.0   -1.0
Humbert    1.0   -1.0
Ivan       1.0   -1.0
James      1.0   -1.0

Pandas groupby()

In [38]:
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
example_df
Out[38]:
   above_three   even  value
a        False  False      1
b        False  False      3
c        False   True      2
d         True   True      4
e        False  False      1
f         True   True      6
g         True   True      4
In [39]:
# get sum of each group
print(example_df.groupby(['even']).sum())

# take a subset of the groupby object then compute sum
print(example_df.groupby('even')['value'].sum())
       above_three  value
even                     
False          0.0      5
True           3.0     16
even
False     5
True     16
Name: value, dtype: int32
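
Other aggregations work the same way on the grouped object; a quick sketch:

print(example_df.groupby('even')['value'].mean())  # False -> 5/3, True -> 4.0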

Combining Pandas DataFrame

merge() in Pandas is very similar to a JOIN in SQL.

In [40]:
subway_df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
             'R004', 'R004'],
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
              '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                 14656174, 14660126, 14664247, 14668301],
    'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
               14451851, 14454734, 14457780, 14460818],
    'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
    'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                  -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})

weather_df = pd.DataFrame({
    'date': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
              '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                  40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
    'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                  -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
    'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                   30.01,  30.01],
    'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
    'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
})
print(subway_df.head(3))
print(weather_df.head(3))
      DATEn  ENTRIESn   EXITSn  UNIT  hour   latitude  longitude
0  05-01-11   4388333  2911002  R003     0  40.689945 -73.872564
1  05-02-11   4388348  2911036  R003     0  40.689945 -73.872564
2  05-03-11   4389885  2912127  R003     0  40.689945 -73.872564
       date  fog  hour   latitude  longitude  pressurei  rain  tempi  wspdi
0  05-01-11    0     0  40.689945 -73.872564      30.24     0   52.0    8.1
1  05-01-11    0     0  40.691320 -73.867135      30.24     0   52.0    8.1
2  05-02-11    0     0  40.689945 -73.872564      30.32     0   48.9    6.9
In [41]:
# by default `on=None` uses the intersection of the columns 
subway_df.merge(weather_df, how='inner', on=None).head(3)
Out[41]:
      DATEn  ENTRIESn   EXITSn  UNIT  hour   latitude  longitude      date  fog  pressurei  rain  tempi  wspdi
0  05-01-11   4388333  2911002  R003     0  40.689945 -73.872564  05-01-11    0      30.24     0   52.0    8.1
1  05-01-11   4388333  2911002  R003     0  40.689945 -73.872564  05-02-11    0      30.32     0   48.9    6.9
2  05-01-11   4388333  2911002  R003     0  40.689945 -73.872564  05-03-11    0      30.14     0   54.0    3.5
In [42]:
# specify left_on and right_on: the join columns for the left and right tables respectively
subway_df.merge(weather_df, how='inner', 
                left_on = ['DATEn', 'hour', 'latitude', 'longitude'],
                right_on = ['date', 'hour', 'latitude', 'longitude']).head(3)
Out[42]:
      DATEn  ENTRIESn   EXITSn  UNIT  hour   latitude  longitude      date  fog  pressurei  rain  tempi  wspdi
0  05-01-11   4388333  2911002  R003     0  40.689945 -73.872564  05-01-11    0      30.24     0   52.0    8.1
1  05-02-11   4388348  2911036  R003     0  40.689945 -73.872564  05-02-11    0      30.32     0   48.9    6.9
2  05-03-11   4389885  2912127  R003     0  40.689945 -73.872564  05-03-11    0      30.14     0   54.0    3.5
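
The how argument mirrors SQL join types. For instance, a left join keeps every row of subway_df even when no weather row matches (a sketch; with this data it returns the same rows as the inner join):

subway_df.merge(weather_df, how='left',
                left_on = ['DATEn', 'hour', 'latitude', 'longitude'],
                right_on = ['date', 'hour', 'latitude', 'longitude']).head(3)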
