This is an introduction of using
NumPy
and Pandas
based on the course Intro to Data Analysis
on Udacity. It includes NumPy
and Pandas
data structures, basic operations and functions with code examples.. I learnt a lot from this course and shared this notes for your reference.
Table of Contents¶
1. Data Analysis Process¶
Introduction¶
- Question
- Characteristics of students who pass projects
- How to stock my store with the products people most want to buy
- Wrangling
- Data acquisition
- Data cleaning
Explore
- Build intuition
- Find patterns
Draw conclusion/or make prediction
- Predict: Which movies users will like
- Conclude: Users are less likely to click certain articles
- Usually requres statistics or machine learning
- Communication
- Blog, paper, email, powerpoint, etc
- Data visualization
Data wrangling phase is very interwined with data explore phase. Data acquisition actually comes before you pose a question.
2. NumPy and Pandas for 1D Data¶
import numpy as np
import pandas as pd
# read csv via pandas
data = pd.read_csv('gdp_per_capita.csv')
# check the unique items in a column
len(data['Country'].unique())
One-dimensional data structures¶
- Numpy (Numerical Python)
- Array.
- Pandas
- Series. Built on Array of Numpy, with more features.
Numpy array¶
- Similarities between Numpy arrays and Python lists.
- Access elements by position.
a[0]
- Access a range of elements.
a[0:3]
(note upper bound is not inclusive) - Use loops.
for x in a:
- Access elements by position.
- Differences between Numpy arrays and Python lists.
- For Numpy, each elements should have same type (string, int, boolean, etc). For python list, you can put them together.
- Numpy has convenient functions.
mean()
,std()
. List also has similar functions, but Numpy is faster. - Numpy can be multi-dimensional, similar to list of list in python.
# First 20 countries with employment data
countries = np.array([
'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
'Belize', 'Benin', 'Bhutan', 'Bolivia',
'Bosnia and Herzegovina'
])
# Employment data in 2007 for those 20 countries
employment = np.array([
55.70000076, 51.40000153, 50.5 , 75.69999695,
58.40000153, 40.09999847, 61.5 , 57.09999847,
60.90000153, 66.59999847, 60.40000153, 68.09999847,
66.90000153, 53.40000153, 48.59999847, 56.79999924,
71.59999847, 58.40000153, 70.40000153, 41.20000076
])
# try different expressions, access element and slicing
print(countries[0])
print(countries[0:3])
print(countries[:2])
print(countries[:])
# elements type
print('-'*100)
print(countries.dtype)
print(employment.dtype)
# looping
print('-'*100)
for i in range(3):
country = countries[i]
country_employment = employment[i]
print('Country {} has employment {}'.format(country, country_employment))
# Numpy functions. np.argmax() returns the index of maximum
print('-'*100)
print(employment.mean(), employment.std(), employment.max(), employment.sum(), employment.argmax())
Vectorized Operations¶
A vector is a list of elements.
- Vector addition.
- Multiplying by a scalar
- More vectorized operations.
- Math operation:
+, -, *, /, **
, - Logical operations:
And:&, Or:|, Not:~
. Only valid for boolean. If use it on integers, then it's bitwise operation. - Comparison operations:
>, >=, <=, <, ==, !=
.
- Math operation:
# vector addition, linear algebra
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a + b)
print(a * b)
# list concatenation, note the difference of python list
print([1,2,3] + [4,5,6])
# multiplying by a scalar
print(a * 3)
# in python list
print([1,2,3]*3)
# arthmetic operations between a Numpy array and a single number
print(a+5)
# logical operations with Numpy arrays
print('-'*100)
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
print(a & b)
print(~a)
print(a & False)
# comparison operations between 2 NumPy arrays
print('-'*100)
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print(a > b)
# comparison operations between a Numpy array and a single number
print(a > 2)
NumPy index arrays¶
An array of booleans can be a index array.
# using index arrays
a = np.array([1, 2, 3, 4])
b = np.array([True, True, False, False])
print(a[b])
print(a[a>2])
# creating the index array using vectorized operations
b = a >2
print(b)
# array of booleans can be used with numpy functions
print(b.mean())
+ vs. +=¶
+=
operates in-place while +
does not. That is, +=
stores all the values in the same place as the original values rather than creating a new array.
# case 1: both a and b refer to the same array
a = np.array([1,2,3])
b = a
a += 1
print(b)
# case 2: both a and b refer to the same array, but later a is reassigned, which creates a new array.
# Then a refers to a new array while b refers to the original array
a = np.array([1,2,3])
b = a
a = a + 1
print(b)
In-Place vs. Not In-Place¶
When taking a slice, numpy doesn't create a new array. Instead it refers to a what's called a view of the original array. If the slice is modified, then the original array is modified as well. So slice in numpy is very fast.
Note this is different from python list.
a = np.array([1,2,3,4])
# take a slice, numpy doesn't create a new array. It
Slice = a[:3]
Slice[0] = 10
print(a)
# try python list
a = [1,2,3,4]
Slice = a[:3]
Slice[0] = 10
print(a)
Pandas Series¶
Similar to numpy array, but it has more extra functionality. e.g. s.describe()
Similarities:
- Accessing elements, `s[0]`, `s[3:7]`
- Looping `for x in s`.
- Convenient functions. `s.mean()`, `s.max()`.
- Vectorized operations. `s1 + s2`
- Implemented in C, fast.
import pandas as pd
countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']
life_expectancy_values = [74.7, 75. , 83.4, 57.6, 74.6, 75.4, 72.3, 81.5, 80.2,
70.3, 72.1, 76.4, 68.1, 75.2, 69.8, 79.4, 70.8, 62.7,
67.3, 70.6]
gdp_values = [ 1681.61390973, 2155.48523109, 21495.80508273, 562.98768478,
13495.1274663 , 9388.68852258, 1424.19056199, 24765.54890176,
27036.48733192, 1945.63754911, 21721.61840978, 13373.21993972,
483.97086804, 9783.98417323, 2253.46411147, 25034.66692293,
3680.91642923, 366.04496652, 1175.92638695, 1132.21387981]
# Life expectancy and gdp data in 2007 for 20 countries
life_expectancy = pd.Series(life_expectancy_values)
gdp = pd.Series(gdp_values)
# pandas accessing elements and slicing
print(life_expectancy[0])
print(gdp[0:3])
# panda looping
print('-'*100)
for country_life_expectancy in life_expectancy[0:3]:
print('Examining life expectancy {}'.format(country_life_expectancy))
# pandas functions
print('-'*100)
print(life_expectancy.mean(), life_expectancy.std(), gdp.max(), gdp.min())
# Vectorized operations and index arrays
print('-'*100)
a = pd.Series([1, 2, 3, 4])
b = pd.Series([1, 2, 1, 2])
print(a+b)
print(a*2)
print(a>=3)
print(a[a>=3])
Series index¶
The main difference between pandas series and numpy arrays is that pandas series has something called index.
NumPy arrays are like souped version of python lists. Similarly, a pandas series is like a cross between a list and a dictionary. For list, elements are sorted in order and you access each element or series by position. For dictionary, you access value by key. So for pandas series, you can use both position and index. (I think here index is more similar to key in dictioanry)
If you didn't specify index when creating a pandas series, then 0,1,2,3...
are used as index by default.
In pandas series, use s.loc[]
to access elements by index. When you use s[]
then you are accessing elements by position. Pandas series has another attribute s.iloc[]
to access element by position, which is the same as s[]
but more clear. E.g., we say s.iloc[0] is position zero rather than index 0, as the index could be other character.
p = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(p)
print(p['a'])
print(p[0]) # as 0 is not the index, so bracket access [0] is by position
print(p.iloc[0]) # iloc is location-based indexing, which is more explicit than p[0],
#so we should say it's position zero rather than index 0, as the index should be 'a'
countries = ['Afghanistan', 'Albania', 'Algeria', 'Angola']
employment_values = [55.70000076, 51.40000153, 50.5 , 75.69999695,]
# Employment data in 2007 for 20 countries
employment = pd.Series(employment_values, index=countries)
print(employment[0])
print(employment.iloc[0])
print(employment['Afghanistan'])
print(employment.argmax()) # as we know, this returns the index of maxmium, which is country name
def max_employment(employment):
max_country = employment.argmax()
max_value = employment[max_country]
return (max_country, max_value)
max_employment(employment)
Vectorized Operations and Series Indexes¶
- Add two Numpy arrays, you are adding by position
- Add tow series with different indexes.
# Addition when indexes are the same
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1 + s2)
# Indexes have same elements in a different order
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
print(s1 + s2)
# Indexes overlap, but do not have exactly the same elements
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
print(s1 + s2) # only the overlapped indexes are added
print('-'*100)
print((s1 + s2).dropna()) # drop NA in a series
# Indexes do not overlap
print('-'*100)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['e', 'f', 'g', 'h'])
print(s1 + s2) # no index is added
# default index of pandas series is 0,1,2
p2=pd.Series([1,2,3])
p2
p2.iloc()
# pandas argmax() function. It returns the index of first maximum values, which is 'c' here
p.argmax()
Pandas Series apply()
.¶
apply()
takes a series and a function and returns a new series. It makes code concise.- Applies function along input axis of DataFrame/Series. axis : {0 or ‘index’, 1 or ‘columns’}, default 0
def reverse(name):
s = name.split(' ')
return s[1]+', '+s[0]
names = pd.Series([
'Andre Agassi',
'Barry Bonds',
'Christopher Columbus',
'Daniel Defoe'])
print(names.apply(reverse))
Plotting in Pandas¶
from IPython.display import Image
#%pylab inline
%matplotlib inline
# make plots appear within the iPython notebook.
import matplotlib.pyplot as plt
data = pd.Series(np.random.normal(0,1,1000))
plt.hist(data)
# pandas also has build-in plotting using matplotlib
# The index of the Series will be used for the x-axis and the values for the y-axis
data = pd.Series([7,3,8,5], index=['a', 'b', 'c', 'd'])
data.plot()
3. NumPy and Pandas for 2D Data¶
Two-dimensional data:¶
- Python: list of lists
- NumPy: 2D array
- Pandas: Data Frame 2D arrays, as opposed to array of arrays:
- more memory efficient. See this page.
- The syntax to access the element is different. For 2D array, use
a[1,3]
rather thana[1][3]
of array of arrays. The row and column can also be a slice using colon notation.a[1:3, 4:5]
mean(), std(), etc
operate on entire array.
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
def mean_riders_for_max_station(ridership):
'''
Fill in this function to find the station with the maximum riders on the
first day, then return the mean riders per day for that station. Also
return the mean ridership overall for comparsion.
Hint: NumPy's argmax() function might be useful:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
'''
overall_mean = ridership.mean() # Replace this with your code
max_station_first_day = ridership[0,:].argmax()
mean_for_max = ridership[:, max_station_first_day].mean() # Replace this with your code
return (overall_mean, mean_for_max)
print(mean_riders_for_max_station(ridership))
NumPy Axis¶
E.g. array.mean(axis = 0)
, compute the mean of each column; array.mean(axis = 1)
, compute the mean of each row.
print(ridership.mean(axis = 0))
NumPy and Pandas Data Types¶
- When not all elements are numbers like there is NAN, numpy array can't take mean, while Pandas dataframe can.
- Pandas Dataframe has indexes similar to Pandas series. There is an index value for each row, and a name for column. (I think index is kind of row name)
- When taking mean, pandas computes the mean of each column rather than the mean of the entire data frame, since each column is assumed to be different types.
- Use axis argument to take the mean of each row.
df.mean(axis=1)
. But it doesn't work if the data types are different.
print(np.array([1,2,3]).dtype)
# convert a dictionary to dataframe, the key is converted as column names.
dic = {'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]}
df = pd.DataFrame(dic, index=['x', 'y', 'z'])
print(df)
print(df.mean())
print(df.values.mean())
# convert a 2D numpy array or list of lists to pandas dataframe
print('-'*100)
df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])
print(df_2)
Access Elements of DataFrame¶
- Access a single row of a data frame is similiar to asscess a single element of a series. Use
loc[]
to access rows by index, and useiloc[]
to access rows by position. - Access a single element.
df.iloc[2, 3]
, ordf.loc[index, colName]
- Access columns using square brackets.
df[colName]
- Use
df.values()
to get a 2D Numpy array. There is only values without column names and row indexes. If you want to compute the mean of all value you can use this. Be careful of the data types when doing this.
ridership_df = pd.DataFrame(
data=[[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]],
index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
print(ridership_df)
# access elements, first row
print('-'*100)
print(ridership_df.iloc[0]) # note .iloc[0, :] is wrong
print(type(ridership_df.iloc[0])) # note the first row is a pandas series, and the column names became series indexes.
# row whose index is '05-02-11'
print('-'*100)
print(ridership_df.loc['05-02-11'])
# column whose column name is 'R003'
print('-'*100)
print(ridership_df['R003']) # note [:, 'R003'] is wrong
# access 2nd row, 4th column
print('-'*100)
print(ridership_df.iloc[1, 3])
# access multiple rows
print('-'*100)
print(ridership_df.iloc[1:3])
# access multiple columns
print('-'*100)
print(ridership_df[['R003', 'R005']])
# the index of maximum
ridership_df.iloc[0].argmax()
# read csv file
subway_df = pd.read_csv("nyc_subway_weather.csv")
# compute the mean of one column
print(np.mean(subway_df['ENTRIESn_hourly']))
# compute the correlation of two columns
def correlation(x, y):
'''
Fill in this function to compute the correlation between the two
input variables. Each input is either a NumPy array or a Pandas
Series.
correlation = average of (x in standard units) times (y in standard units)
Remember to pass the argument "ddof=0" to the Pandas std() function!
'''
std_x = (x-x.mean())/x.std(ddof=0)
std_y = (y-y.mean())/y.std(ddof=0)
return (std_x * std_y).mean()
entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
print(correlation(entries, cum_entries))
Pandas Axis Name¶
axis : {0 or ‘index’, 1 or ‘columns’}, default 0. You can use axis='index' or axis='column'.
For example, pd.mean(axis='columns)
means take the means along the column, which returns the mean of each row. Similarily, pd.mean(axis='index)
means take the means along the index, which returns the mean of each column.
DataFrame Vectorized Operations¶
# adding DataFrame with the column names
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1)
print(df2)
print(df1+df2)
# adding DataFrame with overlapping column names
print('-'*100)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)
# adding DataFrame with overlapping row indexes
print('-'*100)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
index=['row4', 'row3', 'row2'])
print(df1 + df2)
# compute the entries per hour from cumulative entries
entries_and_exits = pd.DataFrame({
'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
3144808, 3144895, 3144905, 3144941, 3145094],
'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
1088317, 1088328, 1088331, 1088420, 1088753]
})
# method 1:
# create a data frame with specific column names
result = pd.DataFrame(columns = ['entries', 'exits'])
result.loc[0] = 0
for i in range(entries_and_exits.shape[0]-1):
result.loc[i+1] = (entries_and_exits.loc[i+1] - entries_and_exits.loc[i]).values # add rows to DataFrame using loc[]
print(result)
# method 2: using shift method, DataFrame minus DataFrame
print('-'*100)
print(entries_and_exits - entries_and_exits.shift(1))
# add a row in a data frame
temp = pd.DataFrame(columns = ['col1', 'col2'])
temp.loc[2] = [1,2]
print(temp)
DataFrame applymap() and apply()¶
applymap()
applys a function to a DataFrame that is intended to operate elementwise. Like doing map for each series in the DataFrame.apply()
on data frame is direct analog of apply on a series. Think a piece is a column rather than a single element. You passapply()
a pandas series(a column of data frame), and return a new pandas series(a new column of data frame).- The benefit of
apply()
overapplymap()
is that the operation you want to do on each element might depend on the entire column. Howeverapplymap()
is just element-to-element mapping. apply()
also takes an axis argument so you can write a function operates on one row each time.apply()
can also take in one column and return just one single value. Then if takes a data frame, it returns a series.
# applymap()
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [10, 20, 30],
'c': [5, 10, 15]
})
def add_one(x):
return x + 1
print(df.applymap(add_one))
grades_df = pd.DataFrame(
data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio',
'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df
def convert_grades(grades):
'''
Fill in this function to convert the given DataFrame of numerical
grades to letter grades. Return a new DataFrame with the converted
grade.
The conversion rule is:
90-100 -> A
80-89 -> B
70-79 -> C
60-69 -> D
0-59 -> F
'''
def conversion(x):
if 90 <= x <=100:
x = 'A'
elif 80 <= x < 90:
x = 'B'
elif 70 <= x < 80:
x = 'C'
elif 60 <= x < 70:
x = 'D'
else:
x = 'F'
return x
return grades.applymap(conversion)
convert_grades(grades_df)
# use apply() to compute standarized value of each element
def standardize(df):
'''
Fill in this function to standardize each column of the given
DataFrame. To standardize a variable, convert each value to the
number of standard deviations it is above or below the mean.
'''
return df.apply(lambda x: (x-x.mean())/x.std(ddof = 0))
print(standardize(grades_df))
# case 2 of apply()
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
print(df.apply(np.max))
def second_largest(df):
'''
Fill in this function to return the second-largest value of each
column of the input DataFrame.
'''
def second_largest_column(column):
# note the the sort_values() doesn't modify the original column, so sort first, column.iloc[1] is wrong.
return column.sort_values(ascending = False).iloc[1] # second largest element of each column
return df.apply(second_largest_column)
print(second_largest(df))
Adding a DataFrame to a Series¶
It matches up the series to the dataframe using index of the series and the column name of the dataframe.
# adding a Series to a square DataFrame
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
0: [10, 20, 30, 40],
1: [50, 60, 70, 80],
2: [90, 100, 110, 120],
3: [130, 140, 150, 160]
})
print(df)
print(df+s)
# adding a Series to a one-row DataFrame
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
print(df)
print(df+s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# adding a Series to a one-column DataFrame
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
print(df)
print(df+s)
# adding when DataFrame column names match Series index
# actually the first case is the same as this one. As the default Series index is 0,1,2..., which is the same as the colnames.
print('-'*100)
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df)
print(df+s)
# adding when DataFrame column names don't match Series index
print('-'*100)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df)
print(df+s)
# using .add() method
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
0: [10, 20, 30, 40],
1: [50, 60, 70, 80],
2: [90, 100, 110, 120],
3: [130, 140, 150, 160]
})
print(df.add(s, axis='columns')) # the same as df+s, axis='columns' by default
print(df.add(s, axis='index'))
grades_df = pd.DataFrame(
data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio',
'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df
Standardize Columns and Rows via Vectorized Operation¶
def standardize(df):
'''
Fill in this function to standardize each column of the given
DataFrame. To standardize a variable, convert each value to the
number of standard deviations it is above or below the mean.
This time, try to use vectorized operations instead of apply().
You should get the same results as you did before.
'''
return (df - df.mean())/df.std(ddof=0)
def standardize_rows(df):
'''
Optional: Fill in this function to standardize each row of the given
DataFrame. Again, try not to use apply().
This one is more challenging than standardizing each column!
'''
mean = df.mean(axis='columns')
std = df.std(ddof=0, axis='columns')
return df.add(-mean, axis='index').div(std, axis='index') # note .div() is similar to .add()
print(standardize(grades_df))
print(standardize_rows(grades_df))
# show how .add() works
t= grades_df.add(-grades_df.mean(axis='columns'), axis='index') # .add(negative) is the same as .sub(positive)
print(t)
# show how .div() works
print(t.div(grades_df.std(ddof=0, axis='columns'), axis='index'))
Pandas groupby()
¶
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
'value': values,
'even': values % 2 == 0,
'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
example_df
# get sum of each group
print(example_df.groupby(['even']).sum())
# take a subset of the groupby object then compute sum
print(example_df.groupby('even')['value'].sum())
Combining Pandas DataFrame¶
merge()
in Pandas is very similar to join
in SQL
subway_df = pd.DataFrame({
'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
'R004', 'R004'],
'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'ENTRIESn': [ 4388333, 4388348, 4389885, 4391507, 4393043, 14656120,
14656174, 14660126, 14664247, 14668301],
'EXITSn': [ 2911002, 2911036, 2912127, 2913223, 2914284, 14451774,
14451851, 14454734, 14457780, 14460818],
'latitude': [ 40.689945, 40.689945, 40.689945, 40.689945, 40.689945,
40.69132 , 40.69132 , 40.69132 , 40.69132 , 40.69132 ],
'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
-73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})
weather_df = pd.DataFrame({
'date': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
'05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'latitude': [ 40.689945, 40.69132 , 40.689945, 40.69132 , 40.689945,
40.69132 , 40.689945, 40.69132 , 40.689945, 40.69132 ],
'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
-73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
'pressurei': [ 30.24, 30.24, 30.32, 30.32, 30.14, 30.14, 29.98, 29.98,
30.01, 30.01],
'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'tempi': [ 52. , 52. , 48.9, 48.9, 54. , 54. , 57.2, 57.2, 48.9, 48.9],
'wspdi': [ 8.1, 8.1, 6.9, 6.9, 3.5, 3.5, 15. , 15. , 15. , 15. ]
})
print(subway_df.head(3))
print(weather_df.head(3))
# by default `on=None` uses the intersection of the columns
subway_df.merge(weather_df, how='inner', on=None).head(3)
# specify the left_on and right_on, meaning the columns for left table and right table seperately
subway_df.merge(weather_df, how='inner',
left_on = ['DATEn', 'hour', 'latitude', 'longitude'],
right_on = ['date', 'hour', 'latitude', 'longitude']).head(3)
Comments
comments powered by Disqus