Lecture 07: Working with structured data in Python

Before we begin, let's first take a quick survey on the Inaugural assignment

By and large, handling data sets in Python means working with Pandas.

Pandas is a standard element in the Anaconda package, so you'll have it automatically.

The fact that Python is a general-purpose language with good support for handling data sets through pandas has helped make it such a popular language for scientific and general purposes.

Today, you will learn about:

  1. the pandas DataFrame object and the pandas Series,
  2. how to load and save data both to and from offline sources (e.g. CSV or Excel),
  3. how to clean, rename, structure and index your data.

Links:

  1. Official tutorials
  2. DataCamp's pandas' cheat sheet
  3. DataCamp has additional courses on pandas like Writing efficient code with pandas.
  4. About the pandas project
[ ]
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # note: on matplotlib >= 3.6 this style is called 'seaborn-v0_8-whitegrid'

1. Pandas dataframes

In Pandas, the fundamental object of interest is a pandas dataframe.

A pandas data frame is superficially like the data frames you know from Stata and SAS: it is 2-dimensional, and each column has a name.

The data type of a column in a pandas data frame is a pandas series.

A pandas series is a lot like a numpy array and they can be used in much the same way.
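
For instance, most numpy functions work directly on a series. A minimal sketch with made-up values:

[ ]
import numpy as np

s = pd.Series([1.0, 2.0, 3.0]) # made-up values
np.log(s) # numpy ufuncs apply elementwise and return a new Series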

A pandas data frame can be thought of as a dictionary of pandas series. (Keys are column names)

To create a DataFrame:

[ ]
ids = pd.Series([1, 2, 3])
incs = pd.Series([11.7, 13.9, 14.6])
names = pd.Series(['Vitus', 'Maximilian', 'Bo-bob'])

# Use data frame definition
X = pd.DataFrame({'id': ids, 'inc':incs, 'name': names})
display(X)

When creating a DataFrame, you can also rely on pandas to cast the variables into pandas series at creation.

[ ]
# Variables are cast into pandas series as the DataFrame is created
X = pd.DataFrame({'id': [1, 2, 3], 
                  'inc': [11.7, 13.9, 14.6], 
                  'name': ['Vitus', 'Maximilian', 'Bo-bob']}) 
type(X['id'])                    

You can also pass in the data as a list of lists and provide the column names as a separate argument:

[ ]
X = pd.DataFrame(data = [[1,11.7,'Vitus'],
                         [2,13.9,'Maximilian'],
                         [3,14.6,'Bo-Bob']], 
                 columns=['id','inc','name'])
display(X)

A dataframe is essentially a matrix.

  • rows = observations
  • columns = variables
  • the index = keeps track of the rows' locations

General information:

[ ]
X.info()

What does object mean? In practice such a column usually contains str values, but the imprecise type can give rise to difficulties.
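
A small sketch of the pitfall (made-up values): a column with mixed content also gets dtype object, and string methods then silently skip the non-strings.

[ ]
s = pd.Series(['a', 'b', 3]) # made-up mixed data -> dtype is object
display(s.str.upper()) # the integer 3 becomes NaN, no error is raised
display(s.astype(str).str.upper()) # cast to str first to avoid the silent NaN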

Note: You can also show a dataframe in the middle of some code.

[ ]
print('before')
display(X)
print('after')

1.1 Indexing ("subsetting")

Choosing a subset of the rows and/or columns of a dataframe is known as "indexing".

Recall the stuff about slicing and logical indices from previous lectures. Since pandas is built on top of NumPy, we can do the same here.

All pandas dataframes are born with the methods .loc[] and .iloc[]:

  1. .iloc[] is for numeric (positional) indexing
  2. .loc[] is for logical and name-based indexing

Examples

  • df.iloc[0:3,1] selects rows 0,1,2 and the column in position 1 (i.e. the second column).
  • df.loc[:, ['year']] selects all rows (indicated by :) but only the column (variable) year.
  • df.loc[df['year'] == 2002, :] selects the rows where the variable year is equal to 2002 and all columns (indicated by :)
  • df.loc[df['year'] == 2002, ['name']] selects the variable name and shows the rows where year is equal to 2002.

You cannot write:
df.iloc[0:2, ['year']]
(.iloc[] takes integer positions, not column names.)

You should not write:
df.loc[0:2, ['year']]
*It only works when the index is numerical, and the slice interval is then closed instead of half-open*
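
A small sketch of the difference, using the default numerical index (made-up values):

[ ]
df = pd.DataFrame({'year': [2000, 2001, 2002, 2003]}) # made-up values
display(df.iloc[0:2]) # positions 0 and 1: half-open interval
display(df.loc[0:2])  # labels 0, 1 AND 2: .loc slices are closed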

In general, the syntax is df.loc[CONDITION, [VARLIST]], where CONDITION is a vector of logical statements with the same length as the number of rows in the dataframe, and VARLIST is a list over variables.

[ ]
# Use logical indexing to subset from variable name based on id
X.loc[X['id'] > 1, ['name']]

Subset all variables:

[ ]
X.loc[X['id'] > 1]

Alternatives:

Create a boolean series

[ ]
I = X['id'] > 1
print(I)
X.loc[I, ['name']]

Use .VARIABLE notation

[ ]
X.loc[(X.id > 1) & (X.inc > 14), ['id','name']]

Why do you think the .VARIABLE notation works at all? What does it make you suspect a variable is to the DataFrame object?

Subsetting with numerical indexing works the same way as for lists and arrays.
Syntax: df.iloc[ROW INDICES, [COLUMN INDICES]]

[ ]
display(X.iloc[0:2,[0,2]])

Remember the half-open intervals!

1.2 Adding a variable

Variables are added with df['newvar'] = SOMETHING. The length of the RHS must match the number of rows, or the RHS must be a scalar (which is broadcast to all rows).

[ ]
X['year'] = [2003, 2005, 2010]
X['zone'] = 7
X

Note: You canNOT create a variable by writing df.newvar = SOMETHING. Some of you will forget. I promise.
Also: note that you could add the year variable from a plain list, even though a list has no explicit row dimension or index.
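
A sketch of why the first point fails: attribute assignment creates an attribute on the DataFrame object, not a column (recent pandas versions warn about it).

[ ]
X.newvar = 5 # sets an attribute, NOT a column (pandas emits a warning)
print('newvar' in X.columns) # False - no column was created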

The something can be an expression based on other variables.

[ ]
X['inc_adj'] = X.inc - X.inc.mean() + 0.1
X

1.3 Assignments to a subset of rows

LHS: Selected using logical statement.
RHS: Must either be:

  1. a single value (all rows are set to this)
  2. a list of values with same length as the number of selected rows

Multiple rows, one value:

[ ]
# Create copy of X to avoid overwriting it.
Y = X.iloc[:,0:4].copy()
Y.loc[Y.id > 1, ['name']] = 'no name'
print('Y After change in names:')
Y

Multiple rows, multiple values:

[ ]
print('Original df:')
Y = X.iloc[:,0:4].copy()
display(Y)

# Subset the rows where name is Vitus or year is 2010. The LHS is incidentally only 2 rows, which matches the RHS!
I = (Y.name == 'Vitus') | (Y.year == 2010)

# Print LHS
print('Subset of Y, LHS in assignment:')
display(Y.loc[I,:])

# Assignment
Y.loc[I, ['name']] = ['Bib', 'Peter']

print('Final Y:')
Y

1.4 Copies vs. views

Remember the stuff about references to objects from L02 and how making changes to a reference also causes changes in the "original" object? Pandas sort of shields you from that trap.
Here is how it arises: when just looking at the data, it is natural to skip the .loc (as in most other languages):

[ ]
# Here I'm NOT using the .loc function
Z = Y[['id','name']]
Z

You can even make subsets without it:

[ ]
I = Y['id'] > 1
Z[I]

Importantly, this does not work with assignment.

Case 1: It does not work with views, as they are references.

[ ]
display(X)
[ ]
Y = X.copy()          # Create Y as a new instance by copying
I = Y['id'] > 2       # Boolean index
Z1 = Y[['id','name']] # returns a VIEW; assigning to it below is 'chained assignment'

# We CANNOT change Z1 as it is a view of Y
Z1.loc[I, ['name']] = 'test'
[ ]
# But it works with Z2 
Z2 = Y.loc[:, ['id','name']] 
Z2.loc[I, ['name']] = 'test'
display(Z2)

# Importantly, we did not change names in Y
display(Y)

Case 2: Sometimes it works, but not how you want it to.

[ ]
#display(X)
Y = X.copy()

I = Y['id'] > 1
Z = Y['name'] # returns a view of the column (same with Y.name)
Z[I] = 'test' # Reassigning values to the view of name in Y

## WOOPS:
display(Y)
display(Z)
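
By default, pandas only warns about this kind of chained assignment. If you want it to fail loudly instead, there is an option for that (a sketch; the option exists in pandas before 3.0, where copy-on-write changes the rules):

[ ]
# Make chained assignment raise an error instead of a warning (pandas < 3.0)
# pd.set_option('mode.chained_assignment', 'raise')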

Solution: Do the assignment in one step.

[ ]
I = Y['id'] > 1
Y.loc[I, ['name']] = 'test'
Y

1.5 The index

The first "column" shown when displaying a dataframe is the index of the dataframe (it is not a regular column).
Baseline: If you haven't done anything, it is just [0, 1, 2, ....].

[ ]
X = pd.DataFrame({'id': [1, 2, 3], 
                  'inc': [11.7, 13.9, 14.6], 
                  'name': ['Vitus', 'Maximilian', 'Bo-bob'],
                  'year': [2010, 2010, 2019]}) 

# See the indices of X
print(X.index.values)

Custom: You can actually use any unique identifier. It does not have to be numbers. For example, you can assign the name column to be the index instead.

[ ]
Y = X.set_index('name') # returns a copy
Y # notice name is now below the other variables

We could also have specified an index at creation of X

[ ]
X = pd.DataFrame({'id': [1, 2, 3], 
                  'inc': [11.7, 13.9, 14.6],
                  'year': [2010, 2010, 2019]}, 
                  index= ['Vitus', 'Maximilian', 'Bo-bob'])
X
[ ]
# Use index of rows:
Y.loc['Vitus']
[ ]
# See the indices of Y
print(Y.index.values)

Let's have a quiz on subsetting.

1.6 Series and numpy arrays

When you select an individual variable, it has the data type Series. Some functions work on a pandas series (e.g. most numpy functions), but it is sometimes nice to extract the underlying numpy objects:

  • df: pandas dataframe
  • df['variable']: pandas series
  • df['variable'].values (or .to_numpy()): numpy array
[ ]
# One way to do it
X.inc.to_numpy()
[ ]
# Another way
display(X.inc.values)
type(X.inc.values)
[ ]
# Get a list instead
display([*X['id'].values]) # unpacking builds a new list (a copy, not a view)
display(type([*X['id'].values]))

1.7 Calling functions on a DataFrame

Row-by-row
Create a function that takes a row as an argument, and then apply it along the row dimension (axis=1).

[ ]
Y
[ ]
Y = pd.DataFrame({'id': [1, 2, 3], 
                  'inc': [11.7, 13.9, 14.6],
                  'year': [2010, 2010, 2019], 
                  'name': ['Vitus', 'Maximilian', 'Bo-bob']})

# Notice that row is an input argument here
def conc_row_wise(row):
    return str(row['year']) + ' - ' + row['name'] 

# The fact that row is an input argument in the conc_row_wise function is implicitly understood by .apply()
Y['year_name'] = Y.apply(conc_row_wise, axis=1)  # axis=1 applies the function to each row (i.e. across the columns)
Y

Function for numpy arrays:
Use the fact that a Pandas df is based on Numpy arrays to create a function that operates on whole columns at once.
This may involve broadcasting (see L03).

[ ]
def all_at_once(inc, year):
    return inc * year.max() # year is passed in as a numpy array, so it has a .max() method

Y['inc_adj_year'] = all_at_once(Y.inc.values, Y.year.values)
Y

Using the assign method of DataFrames
Apply the .assign() method coupled with a lambda function using the functionality of numpy arrays. Note that .assign() returns a new DataFrame, so we reassign it to Y:

[ ]
Y = Y.assign(inc_adj_inplace = lambda x: x.inc * x.year.max())
Y

2. Reading and writing data

Check: We make sure that we have the data/ subfolder, and that it has the datasets we need.

[ ]
import os 

# Using assert to check that paths exist on computer. See L05 for details.
assert os.path.isdir('data/')
assert os.path.isfile('data/RAS200.xlsx')
assert os.path.isfile('data/INDKP107.xlsx')

# Print everything in data
os.listdir('data/')

2.1 Reading in data

Pandas offers a lot of facilities for reading and writing to different formats. The functions have logical names:

  • CSV: pd.read_csv()
  • SAS: pd.read_sas()
  • Excel: pd.read_excel()
  • Stata: pd.read_stata()
  • Parquet: pd.read_parquet()
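
For delimited text files, pd.read_csv() takes a number of useful arguments. A hedged sketch (the file name is hypothetical, so the call is commented out; Danish exports often use ';' as separator and ',' as decimal point):

[ ]
# Hypothetical file name - a sketch of common read_csv arguments
# df = pd.read_csv('data/myfile.csv',
#                  sep=';',           # column separator
#                  decimal=',',       # decimal point character
#                  skiprows=2,        # skip metadata rows at the top
#                  na_values=['..'])  # treat '..' as missing (as in some DST tables)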

Inspecting:

  • df.head(10) is used to inspect the first 10 rows
  • df.sample(10) is used to look at 10 random rows

Example: Raw download from DST

Clearly not quite right!

[ ]
filename = 'data/RAS200.xlsx' # open the file and have a look at it
pd.read_excel(filename).head(5)

We need to clean this mess up.

Getting the right columns and rows

Skipping rows: Clearly, we should skip the rows at the top and get rid of the first four columns. With skiprows=2, the first two rows are skipped and the next row becomes the header.

[ ]
empl = pd.read_excel(filename, skiprows=2)
empl.head(5)

Dropping columns: The first couple of columns are not needed and contain only missing values (denoted by NaN (not-a-number)), so we will drop those.

Note: with inplace=True, df.drop() is applied by the data frame object to itself. Hence, no return value is stored.

[ ]
# These columns have to go: 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3'
drop_these = ['Unnamed: ' + str(num) for num in range(4)] # use list comprehension to create list of columns
print(drop_these)
empl.drop(drop_these, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made
empl.head(5)

Alternative: Use del empl['Unnamed: 0'], empl['Unnamed: 1']...

But! that borders on code repetition. If the data changed, it would give you four places to update the code, rather than one with the list comprehension above.

Renaming variables

We are not happy with the column containing the municipality names, which is currently called Unnamed: 4.

We rename using df.rename(columns=dict), where dict must be a Python dictionary. Why a dictionary? It is simply the most practical solution if you are renaming several columns at once.

[ ]
empl.rename(columns = {'Unnamed: 4':'municipality'}, inplace=True)
empl.head(5)

Rename all year columns: We also see that the employment rate in 2008 has been named 2008.

This is allowed as a pandas column name, but a name that is a number can cause problems with some functions (and many other programming languages do not even allow it), so let us change their names.

To change all the year columns, we need to create a dictionary that maps each of the years {2008, ..., 2020} to {empl2008, ..., empl2020}.

[ ]
col_dict = {}
for i in range(2008, 2020+1): # range goes from 2008 up to and including 2020
    col_dict[str(i)] = f'empl{i}' 
col_dict
[ ]
empl.rename(columns = col_dict, inplace=True)
empl.head(10)
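
The same mapping can be built in one line with a dictionary comprehension (the form we use in section 2.3 below):

[ ]
# Equivalent to the loop above
col_dict = {str(i): f'empl{i}' for i in range(2008, 2020+1)}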

A big NO-NO!! is to put white spaces in column names. You can theoretically have a column named 'empl 2017' in a pandas df, but this is very likely to get messy. And you can no longer use the .VARIABLE notation.

Extract: Now we can find the employment rate in the municipality where Christian grew up:

[ ]
empl.loc[empl.municipality == 'Hillerød']

Dropping observations that are not actually municipalities

The dataset contains observations like "Region Hovedstaden", which is not a municipality, so we want to drop such rows. To do this, we can use the df['var'].str functionalities: a whole family of string methods, in particular searching for specific content with df['var'].str.contains('PATTERN').

[ ]
# Build up a logical index I
I = empl.municipality.str.contains('Region')
I |= empl.municipality.str.contains('Province')
I |= empl.municipality.str.contains('All Denmark')
empl.loc[I, :]

Delete these rows:

[ ]
empl = empl.loc[~I] # keep everything else (~ negates the boolean series)
empl.head(10)

Very important: reset index

[ ]
empl.reset_index(inplace = True, drop = True) # Drop old index too
empl.iloc[0:4,:]

Summary statistics

To get an overview of the employment rates across municipalities we can use df.describe().

[ ]
empl.describe()

Single descriptive statistic: We can also just get the mean for each year:

[ ]
empl.iloc[:,1:].mean()

2.2 Long vs. wide datasets: pd.wide_to_long()

Often in economic applications, it can be useful to switch between wide and long formats (long is sometimes referred to as tall, e.g. in Stata). Going from wide to long is done with pd.wide_to_long(); there is no pd.long_to_wide(), but the reverse can be achieved with df.pivot() or .unstack() (see the sketch below). Many types of analysis are easier to do in one format than in another, so it is extremely useful to be able to switch comfortably between formats.

Common: Think of a dataset as having an ID and a PERIOD variable. In our dataset empl, the ID variable is municipality, and the PERIOD variable is year.

Wide dataset: The default from Statistics Denmark: 1 row in the data per ID and a variable for each PERIOD. If there is more than one variable that varies by period, a new block of period-wise columns must be created for each such variable.

Long dataset: There is one row for each combination of (ID, PERIOD). Vertical blocks of periods.

A long dataset is often easier to work with if you have more than one time-varying variable in the data set.

In general, Pandas will assume that the variables in the wide format have a particular structure: namely, they are of the form STUBPERIOD, where STUB is called the "stub". In our case, the variable names are e.g. empl2011, so the stub is empl and the period (for that variable) is 2011. You'll want to clean out the variable names if there is anything after the period part.
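
If the variable names had contained a separator, e.g. empl_2011, you could tell pandas about it (a sketch; pd.wide_to_long() accepts sep and suffix arguments):

[ ]
# Only needed if the columns were named like 'empl_2011' (hypothetical here)
# pd.wide_to_long(empl, stubnames='empl', i='municipality', j='year', sep='_')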

[ ]
empl
[ ]
empl_long = pd.wide_to_long(empl, stubnames='empl', i='municipality', j='year')
empl_long.head(10)

Note: The variables municipality and year are now in the index!! We see that because they are "below" empl in the head overview.

[ ]
# The index variable now consists of tuples. 
print(empl_long.index.values[0:8])

We can select a specific municipality using .xs:

[ ]
empl_long.xs('Roskilde',level='municipality')

Or .loc[] in a special way:

[ ]
empl_long.loc[empl_long.index.get_level_values('municipality') == 'Roskilde', :]

Alternative: Reset the index, and use .loc as normal.

[ ]
empl_long = empl_long.reset_index()
empl_long.loc[empl_long.municipality == 'Roskilde', :]
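
Going the other way, from long back to wide, can be done with df.pivot(). A sketch using the reset-index version of the data:

[ ]
# From long back to wide: one row per municipality, one column per year
empl_wide = empl_long.pivot(index='municipality', columns='year', values='empl')
empl_wide.head(5)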

Plotting interactively

A pandas DataFrame has built-in functions for plotting. Works a bit differently from matplotlib.

Example:

[ ]
# Data frame with roskilde
empl_roskilde = empl_long.loc[empl_long['municipality'] == 'Roskilde', :]

# Plot the content of the data frame
empl_roskilde.plot(x='year',y='empl',legend=False);

We can even do it interactively:

[ ]
import ipywidgets as widgets
def plot_e(df, municipality): 
    I = df['municipality'] == municipality
    ax=df.loc[I,:].plot(x='year', y='empl', style='-o', legend=False)
[ ]
widgets.interact(plot_e, 
    df = widgets.fixed(empl_long),
    municipality = widgets.Dropdown(description='Municipality', 
                                    options=empl_long.municipality.unique(), 
                                    value='Roskilde')
); 

2.3 Income

Next, we will read in the avg. disposable income for highly educated in each municipality. Here we do the cleaning, renaming and structuring in a few condensed lines.

[ ]
# a. load
inc = pd.read_excel('data/INDKP107.xlsx', skiprows=2)

# b. clean and rename
inc.drop([f'Unnamed: {i}' for i in range(4)], axis=1, inplace=True) # using list comprehension
inc.rename(columns = {'Unnamed: 4':'municipality'}, inplace=True) 
inc.rename(columns = {str(i): f'inc{i}' for i in range(2004,2020+1)}, inplace=True) # using dictionary comprehension

# c. drop rows with missing values. Denoted na
inc.dropna(inplace=True)

# d. remove non-municipalities. Notice how to avoid code repetition!
for val in ['Region','Province', 'All Denmark']: 
    I = inc.municipality.str.contains(val)
    inc.drop(inc[I].index, inplace=True) # .index -> get the indexes of the series
    
inc.head(5)

Convert wide -> long:

[ ]
inc_long = pd.wide_to_long(df=inc, stubnames='inc', i='municipality', j='year')
inc_long.reset_index(inplace=True)
inc_long.head(5)

2.4 Municipal area

Finally, let's read in a dataset on municipality areas in km².

[ ]
# a. load
area = pd.read_excel('data/areal.xlsx', skiprows=2)

# b. clean and rename
area.rename(columns = {'Unnamed: 0':'municipality','2019':'km2'}, inplace=True)

# c. drop rows with missing
area.dropna(inplace=True)

# d. remove non-municipalities
for val in ['Region','Province', 'All Denmark']: 
    I = area.municipality.str.contains(val)
    area.drop(area[I].index, inplace=True)
    
area.head(5)

2.5 Writing data

As with reading in data, there are corresponding methods for writing data. Note that they sit on the DataFrame itself:

  • CSV: df.to_csv()
  • Excel: df.to_excel()
  • Stata: df.to_stata()
  • Parquet: df.to_parquet()

(pandas can read SAS files, but it cannot write them, so there is no to_sas().)

Let's save our dataset to CSV form. We will set index=False to avoid saving the index (which does not mean anything here but can in other contexts be an annoying thing).

[ ]
empl_long.to_csv('data/RAS200_long.csv', index=False)
inc_long.to_csv('data/INDKP107_long.csv', index=False)
area.to_csv('data/area.csv', index=False)
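
A quick sanity check is to read one of the files back in:

[ ]
# Round-trip check: the saved CSV should load without further cleaning
pd.read_csv('data/RAS200_long.csv').head(5)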

2.6 Be cautious

Code for cleaning data tends to get long and repetitive. But remember DRY! Errors crop up in data cleaning when you just copy blocks of code around. Avoid repetition at all costs.

3. Summary

This lecture: We have discussed

  1. The general pandas framework (indexing, assignment, copies vs. views, functions)
  2. Loading and saving data
  3. Basic data cleaning (renaming, dropping etc.)
  4. Wide ↔ long transformations

Your work: Before solving Problem Set 3 read through this notebook and play around with the code.

Next lecture: Basic data analysis.

Data exploration?: Try out dtale.