PANDAS
The examples below are using the dataset(s) described by the snippit below.
In [1]:
data.head()
Out[1]:
room_id host_id room_type neighborhood reviews overall_satisfaction accommodates bedrooms price minstay
0 5453 8021 Private room Jamaica Plain 53 5.0 2 1.0 171.0 1
1 5506 8229 Private room Roxbury 30 4.5 2 1.0 165.0 3
2 6695 8229 Entire home/apt Roxbury 39 5.0 4 1.0 222.0 3
3 6976 16701 Private room Roslindale 26 5.0 2 1.0 74.0 1
4 8789 26988 Entire home/apt Downtown 1 5.0 2 1.0 165.0 5


DataFrame


Create DataFrame

There are multiple methods when it comes to reading files into a dataframe such as read_csv, read_table, and read_json. All have a number of arguments to deal with different problems when reading in data (no header, missing data, specific delimiter, etc). Below are a few these methods used with a variety of arguments.

pd.read_csv('file_name.csv') # clean comma separated file
pd.read_csv('file_name.csv', names= ['First', 'Second'])  # creating a header using the names argument
pd.read_csv('file_name.csv', header=None) # specifying there is no header
pd.read_json('data.txt', lines=True) # file containing one json object per line
pd.read_table('file.csv', sep="/") # specifying a slash as the delimiter

Basic Operations

Series and DataFrames allow for basic vector operations to be done on them such as multiplying or dividing something on each value of the DataFrame. There are also a couple of methods to give some statistical information about the DataFrame itself such as describe() and info().

# basic operation
df * 3 # multiply each element in the dataframe by 3 

# head(x) displays the first number of rows in the dataframe
df.head(10)

# describe gives you a number of statistics such as standard deviation, min, count, max,
# and three quartiles 25%, 50%, and 75 %
df.describe() 

# info gives general information about the dataframe like row counts and type
df.info()

Filtering

Filtering with pandas allows you to return specific rows of data using boolean operators on column values. Using either loc or bracket notation you can write a variety of filters that use multiple boolean operators and return only certain columns. Below are some example to who some ways to filter out rows of a dataframe.

df[df['price'] > 200] # return all rows that are greater than 200
df[df['price'] > 200]['room_type'] # same as above but only return room_type column for each row

# using loc, same two outputs as above
df.loc[df['price'] > 200]
df.loc[df['price'] > 200, 'room_type']

# multiple boolean operators 
# return rows that are greater than 200 and that have the room_type equal to 'Private room'
df.loc[(df['price'] > 200) & (df['room_type'] == 'Private room')] 

Row/Column Selection

Selecting rows from a dataframe can be done in a few ways. Pandas recommends using the loc or iloc methods when doing this. Using loc is for using the labels when specifying rows/columns and iloc is for using the integer index values of the row or column. Below is a variety of row selection and filtering. It is done by specifying a slice of the row and column you want to return. You do not have to specify the column slice if you are only filtering rows.

df2 = pd.DataFrame({'name' : ['Mike', 'Eric', 'Sam', 'Dan', 'Megan'],
                    'age' : [20, 33, 40, 13, 26],
                    'gender' : ['male', 'male', 'female', 'male', 'female']
})

df2.iloc[:3] # every row up to and including the 3rd row value
df2.iloc[3:] # every row from row 4 and up.  (doesnt include the 3rd row)
df2.iloc[:4 , :2] # [row, col]: first 4 rows and first two columns of the dataframe
df2.iloc[:,:] # all rows and columns

# same but using loc
# first need to make a labeled index
df2.index = df2['name'] # set the index as the name column
df2.loc[:'Sam']
df2.loc['Dan':] # unlike iloc the row value is included.
df2.loc[:'Dan', :'gender']
df2.loc[:,:]

# selecting single values
df2.iloc[1,2] # returns 'Eric'
df2.loc['Megan','age'] # returns 26

Missing Data

Dealing with missing data is something that cannot be avoided. Luckily Pandas makes it easy to deal with missing data in your datasets. Using methods such as notnull() or isnull() return a boolean value that tell you which values are null. Methods such as fillna() and dropna() allow to deal with the null values by either filling in values or dropping the row or column from the dataframe.

df.isnull().sum() # count for number of null values in each column

df.dropna() # drop row wise if there is a null value
df.dropna(axis=1) # drop column wise

# using fillna to fill empty values
# if the price column has any null values fill that value with the mean of the price column
df['price'] = df['price'].fillna(df['price'].mean()) 

GroupBy

The groupby method allows for grouping the data by the column or columns given to it. This then allows for different aggregate functions to be applied to these groupings.

by_neighborhood = df.groupby('neighborhood') # group the dataframe by neighborhood
by_neighborhood['price'].mean() # gives you the mean price for each neighborhood

# you can also group by multiple columns
# this grouping creates a nested index where it is grouped by the neighborhood
# and then by the room type
by_room_and_neighborhood = df.groupby(['neighborhood', 'room_type'])

# gives you the price averages for each room_type of every neighborhood
by_room_and_neighborhood['price'].mean() 

Apply/Map Functions

The apply/map functions allow for functions to be run on the dataframe data. The first method is apply which runs the given function on either the row or columns. Applymap is used to run the function on each individual element. There are also the functions apply and map that are used on a Series.

# for a series
# for each individual minstay row value and " days" to it,
df['minstay'].map(lambda x: str(x) + " days") # in this case .apply will have the same output

# for a dataframe
num_cols = df[['minstay', 'price', 'bedrooms']]
num_cols.apply(lambda x: x.max() - x.min()) # can use axis = 1 to apply function on each row instead of each column

Interactive Shell

Here is an interactive shell to try out the methods discussed on a dataframe

# this gets executed each time the exercise is initialized import pandas as pd data = pd.DataFrame({'price': {'35': 126.0, '64': 394.0, '109': 108.0, '117': 74.0, '146': 171.0, '148': 314.0, '165': 74.0, '262': 56.0, '281': 165.0, '311': 102.0, '325': 23.0, '343': 86.0, '356': 191.0, '358': 102.0, '405': 171.0, '414': 86.0, '426': 124.0, '449': 66.0, '483': 80.0, '511': 399.0, '517': 171.0, '529': 63.0, '555': 97.0, '585': 66.0, '586': 257.0, '623': 80.0, '643': 257.0, '649': 365.0, '656': 131.0, '666': 86.0, '726': 285.0, '733': 365.0, '741': 68.0, '747': 113.0, '749': 131.0, '764': 366.0, '768': 371.0, '816': 204.0, '883': 204.0, '914': 799.0, '929': 257.0, '1030': 80.0, '1038': 171.0, '1112': 183.0, '1143': 143.0, '1158': 148.0, '1166': 86.0, '1173': 97.0, '1175': 160.0, '1208': 227.0}, 'host_id': {'35': 85770, '64': 25188, '109': 1092168, '117': 1176995, '146': 1444340, '148': 411715, '165': 2053557, '262': 4480671, '281': 1590532, '311': 848706, '325': 1651480, '343': 2254462, '356': 1444340, '358': 6549153, '405': 7658308, '414': 7673571, '426': 8085264, '449': 6608084, '483': 9287212, '511': 9617517, '517': 6469268, '529': 9663343, '555': 1037913, '585': 2621102, '586': 10750832, '623': 4232125, '643': 10250257, '649': 12220952, '656': 5573822, '666': 3875741, '726': 12985903, '733': 13929879, '741': 14050476, '747': 324630, '749': 1362285, '764': 25188, '768': 50866, '816': 3629953, '883': 13325723, '914': 10502779, '929': 7629859, '1030': 18165984, '1038': 8812693, '1112': 9258077, '1143': 7000428, '1158': 1805868, '1166': 1480518, '1173': 1805868, '1175': 2830216, '1208': 21063555}, 'minstay': {'35': 1, '64': 4, '109': 2, '117': 2, '146': 2, '148': 1, '165': 3, '262': 7, '281': 2, '311': 1, '325': 1, '343': 1, '356': 1, '358': 1, '405': 1, '414': 2, '426': 1, '449': 1, '483': 5, '511': 2, '517': 2, '529': 1, '555': 2, '585': 1, '586': 1, '623': 2, '643': 3, '649': 1, '656': 1, '666': 1, '726': 1, '733': 2, '741': 2, '747': 1, '749': 1, '764': 7, '768': 2, '816': 1, '883': 2, '914': 4, '929': 2, '1030': 1, '1038': 1, '1112': 1, '1143': 3, '1158': 2, '1166': 1, '1173': 1, '1175': 1, '1208': 2}, 'reviews': {'35': 123, '64': 4, '109': 52, '117': 19, '146': 22, '148': 30, '165': 101, '262': 63, '281': 16, '311': 23, '325': 12, '343': 12, '356': 18, '358': 29, '405': 25, '414': 24, '426': 47, '449': 47, '483': 1, '511': 1, '517': 1, '529': 63, '555': 8, '585': 6, '586': 11, '623': 7, '643': 1, '649': 5, '656': 3, '666': 11, '726': 8, '733': 7, '741': 12, '747': 42, '749': 1, '764': 1, '768': 1, '816': 12, '883': 7, '914': 3, '929': 13, '1030': 4, '1038': 2, '1112': 5, '1143': 1, '1158': 1, '1166': 5, '1173': 2, '1175': 3, '1208': 2}, 'room_id': {'35': 22354, '64': 54215, '109': 211921, '117': 225834, '146': 349347, '148': 350205, '165': 447826, '262': 856876, '281': 935554, '311': 1067184, '325': 1106555, '343': 1166808, '356': 1197857, '358': 1198779, '405': 1422837, '414': 1471308, '426': 1514227, '449': 1584362, '483': 1767672, '511': 1840255, '517': 1860782, '529': 1884045, '555': 1977951, '585': 2088320, '586': 2108738, '623': 2268196, '643': 2376518, '649': 2392404, '656': 2426468, '666': 2473997, '726': 2694019, '733': 2722165, '741': 2754149, '747': 2776143, '749': 2777752, '764': 2821921, '768': 2831504, '816': 2979108, '883': 3244362, '914': 3351728, '929': 3377100, '1030': 3678429, '1038': 3704801, '1112': 3894320, '1143': 3938428, '1158': 3969867, '1166': 3987926, '1173': 3997572, '1175': 4004152, '1208': 4061059}, 'bedrooms': {'35': 1.0, '64': 2.0, '109': 1.0, '117': 1.0, '146': 1.0, '148': 1.0, '165': 1.0, '262': 1.0, '281': 1.0, '311': 1.0, '325': 1.0, '343': 1.0, '356': 1.0, '358': 1.0, '405': 0.0, '414': 1.0, '426': 1.0, '449': 1.0, '483': 1.0, '511': 2.0, '517': 0.0, '529': 1.0, '555': 1.0, '585': 1.0, '586': 0.0, '623': 1.0, '643': 1.0, '649': 2.0, '656': 0.0, '666': 1.0, '726': 1.0, '733': 2.0, '741': 1.0, '747': 0.0, '749': 1.0, '764': 2.0, '768': 2.0, '816': 1.0, '883': 1.0, '914': 5.0, '929': 1.0, '1030': 1.0, '1038': 1.0, '1112': 1.0, '1143': 1.0, '1158': 2.0, '1166': 1.0, '1173': 1.0, '1175': 1.0, '1208': 0.0}, 'room_type': {'35': 'Private room', '64': 'Entire home/apt', '109': 'Entire home/apt', '117': 'Private room', '146': 'Entire home/apt', '148': 'Entire home/apt', '165': 'Private room', '262': 'Shared room', '281': 'Private room', '311': 'Private room', '325': 'Shared room', '343': 'Shared room', '356': 'Entire home/apt', '358': 'Private room', '405': 'Entire home/apt', '414': 'Private room', '426': 'Private room', '449': 'Private room', '483': 'Private room', '511': 'Entire home/apt', '517': 'Entire home/apt', '529': 'Private room', '555': 'Private room', '585': 'Private room', '586': 'Entire home/apt', '623': 'Private room', '643': 'Entire home/apt', '649': 'Entire home/apt', '656': 'Entire home/apt', '666': 'Private room', '726': 'Entire home/apt', '733': 'Entire home/apt', '741': 'Private room', '747': 'Entire home/apt', '749': 'Private room', '764': 'Entire home/apt', '768': 'Entire home/apt', '816': 'Entire home/apt', '883': 'Private room', '914': 'Entire home/apt', '929': 'Entire home/apt', '1030': 'Shared room', '1038': 'Entire home/apt', '1112': 'Entire home/apt', '1143': 'Private room', '1158': 'Entire home/apt', '1166': 'Private room', '1173': 'Private room', '1175': 'Entire home/apt', '1208': 'Entire home/apt'}, 'accommodates': {'35': 1, '64': 4, '109': 5, '117': 2, '146': 4, '148': 5, '165': 2, '262': 2, '281': 2, '311': 2, '325': 16, '343': 2, '356': 4, '358': 3, '405': 2, '414': 2, '426': 2, '449': 3, '483': 1, '511': 4, '517': 2, '529': 2, '555': 4, '585': 1, '586': 2, '623': 2, '643': 4, '649': 6, '656': 2, '666': 3, '726': 3, '733': 4, '741': 2, '747': 4, '749': 4, '764': 4, '768': 6, '816': 2, '883': 2, '914': 8, '929': 2, '1030': 1, '1038': 4, '1112': 2, '1143': 2, '1158': 3, '1166': 2, '1173': 2, '1175': 2, '1208': 6}, 'neighborhood': {'35': 'South End', '64': 'Fenway', '109': 'Roslindale', '117': 'Roslindale', '146': 'South End', '148': 'Fenway', '165': 'Jamaica Plain', '262': 'Fenway', '281': 'South End', '311': 'Jamaica Plain', '325': 'South End', '343': 'South End', '356': 'Roxbury', '358': 'South Boston', '405': 'South End', '414': 'Mission Hill', '426': 'East Boston', '449': 'Dorchester', '483': 'Jamaica Plain', '511': 'Fenway', '517': 'Longwood Medical Area', '529': 'Roxbury', '555': 'Roxbury', '585': 'Jamaica Plain', '586': 'Back Bay', '623': 'East Boston', '643': 'North End', '649': 'Charlestown', '656': 'Beacon Hill', '666': 'Roxbury', '726': 'South End', '733': 'Chinatown', '741': 'Mission Hill', '747': 'Beacon Hill', '749': 'West Roxbury', '764': 'Downtown', '768': 'North End', '816': 'Fenway', '883': 'South End', '914': 'Beacon Hill', '929': 'Beacon Hill', '1030': 'Dorchester', '1038': 'Back Bay', '1112': 'South Boston', '1143': 'Beacon Hill', '1158': 'Jamaica Plain', '1166': 'Jamaica Plain', '1173': 'Jamaica Plain', '1175': 'South End', '1208': 'Allston'}, 'overall_satisfaction': {'35': 4.5, '64': 4.0, '109': 5.0, '117': 5.0, '146': 4.5, '148': 5.0, '165': 5.0, '262': 4.5, '281': 5.0, '311': 4.5, '325': 5.0, '343': 4.5, '356': 4.5, '358': 4.0, '405': 4.5, '414': 4.0, '426': 5.0, '449': 5.0, '483': 5.0, '511': 5.0, '517': 5.0, '529': 4.5, '555': 4.5, '585': 5.0, '586': 4.5, '623': 5.0, '643': 5.0, '649': 5.0, '656': 4.5, '666': 4.5, '726': 5.0, '733': 4.5, '741': 5.0, '747': 4.0, '749': 4.0, '764': 5.0, '768': 5.0, '816': 5.0, '883': 4.5, '914': 5.0, '929': 5.0, '1030': 5.0, '1038': 4.5, '1112': 4.5, '1143': 4.0, '1158': 5.0, '1166': 5.0, '1173': 5.0, '1175': 4.5, '1208': 5.0}}) # dataframe name is data and only the first dataframe is available # not needed right now