Lecture 08: Basic data analysis

First, let's start with a survey on your background in data wrangling.

You may need to install the DST API data reader (pydst), the pandas_datareader package and the matplotlib_venn module. Uncomment the following cells and run them to install.
The ! in front of each command indicates a system command that could just as well have been executed in the terminal/command prompt of your computer.

[ ]
# The DST API wrapper
!pip install git+https://github.com/elben10/pydst
[ ]
# A wrapper for multiple APIs with a pandas interface
!pip install pandas-datareader
[ ]
# For Venn diagrams
!pip install matplotlib-venn
[ ]
import numpy as np
import pandas as pd
import datetime

import pandas_datareader # install with `pip install pandas-datareader`
import pydst # install with `pip install git+https://github.com/elben10/pydst`

import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
from matplotlib_venn import venn2 # `pip install matplotlib-venn`

1. Combining datasets (merging and concatenating)

When combining datasets there are a few crucial concepts:

  1. Concatenate (append): "stack" rows (observations) on top of each other. This works if the datasets have the same columns (variables).
  2. Merge: the two datasets have different variables, but may or may not have the same observations.

There are different kinds of merges depending on which observations you want to keep:

  1. Outer join (one-to-one): keep observations that are in either dataset (or in both).
  2. Inner join (one-to-one): keep observations that are in both datasets.
  3. Left join (many-to-one): keep all observations that are in the left dataset, whether or not they are also in the right dataset.

Keeping observations that are not in both datasets results in missing values for the variables coming from the dataset in which the observation does not exist.
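Before turning to the real data, here is a minimal toy sketch of the two merge types (the dataframes df_a and df_b and their numbers are made up for illustration):

[ ]
# two made-up dataframes sharing the merge keys 'municipality' and 'year'
df_a = pd.DataFrame({'municipality': ['A', 'B'], 'year': [2010, 2010], 'empl': [0.70, 0.75]})
df_b = pd.DataFrame({'municipality': ['B', 'C'], 'year': [2010, 2010], 'inc': [250, 300]})

# outer join: keeps A, B and C; empl is missing for C and inc is missing for A
print(pd.merge(df_a, df_b, on=['municipality', 'year'], how='outer'))

# inner join: keeps only B, the observation present in both dataframes
print(pd.merge(df_a, df_b, on=['municipality', 'year'], how='inner'))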

Read data:

[ ]
empl = pd.read_csv('../07/data/RAS200_long.csv') # .. -> means one folder up
inc = pd.read_csv('../07/data/INDKP107_long.csv')
area = pd.read_csv('../07/data/area.csv')

1.1 Concatenating datasets

Suppose we have two datasets that have the same variables and we just want to concatenate them.

[ ]
empl.head(5)
[ ]
N = empl.shape[0]

A = empl.loc[empl.index < N/2,:] # first half of observations
B = empl.loc[empl.index >= N/2,:] # second half of observations

print(f'A has shape {A.shape} ')
print(f'B has shape {B.shape} ')

Concatenation is done using the command pd.concat([df1, df2]).

[ ]
C = pd.concat([A,B])
print(f'C has shape {C.shape} (same as the original empl, {empl.shape})')
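In this particular split the indices of A and B already line up with the original, but in general pd.concat keeps the row labels of the inputs. If you want a fresh 0, 1, ..., N-1 index instead, you can pass ignore_index=True:

[ ]
# rebuild the row index as 0,1,...,N-1 instead of keeping the original labels
C_reindexed = pd.concat([A, B], ignore_index=True)
C_reindexed.tail(3)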

1.2 Merging datasets

Two datasets with different variables: empl and inc.

Central command: pd.merge(empl, inc, on=['municipality', 'year'], how=METHOD).

  1. The keyword on specifies the merge key(s). They should uniquely identify observations in at least one of the datasets (ideally in both).

  2. The keyword how specifies the merge method (taking values such as 'outer', 'inner', or 'left').
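A handy option not used below is the indicator argument of pd.merge, which adds a _merge column showing whether each row came from the left dataset, the right dataset, or both:

[ ]
# indicator=True adds a '_merge' column with values 'left_only', 'right_only' or 'both'
check = pd.merge(empl, inc, on=['municipality', 'year'], how='outer', indicator=True)
check['_merge'].value_counts()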

Look at datasets:

[ ]
print(f'Years in empl: {empl.year.unique()}')
print(f'Municipalities in empl = {len(empl.municipality.unique())}')
print(f'Years in inc: {inc.year.unique()}')
print(f'Municipalities in inc = {len(inc.municipality.unique())}')

Find differences:

[ ]
diff_y = [y for y in inc.year.unique() if y not in empl.year.unique()] 
print(f'years in inc data, but not in empl data: {diff_y}')

diff_m = [m for m in empl.municipality.unique() if m not in inc.municipality.unique()] 
print(f'municipalities in empl data, but not in inc data: {diff_m}')

Conclusion: inc has more years than empl, but empl has one municipality that is not in inc.

[ ]
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('Cristiansø')
v.get_label_by_id('010').set_text('2004-07' )
v.get_label_by_id('110').set_text('common observations')
plt.show()

Outer join: union

[ ]
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('included')
v.get_label_by_id('010').set_text('included')
v.get_label_by_id('110').set_text('included')
plt.title('outer join')
plt.show()
[ ]
outer = pd.merge(empl,inc,on=['municipality','year'],how='outer')

print(f'Number of municipalities = {len(outer.municipality.unique())}')
print(f'Number of years = {len(outer.year.unique())}')

We see that the outer join includes rows that exist in either dataframe and therefore includes missing values:

[ ]
I = (outer.year.isin(diff_y)) | (outer.municipality.isin(diff_m))
outer.loc[I, :].head(15)
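We can also count how many missing values the outer join created in each column:

[ ]
# missing values per column created by the outer join
outer.isna().sum()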

Inner join

[ ]
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('dropped'); v.get_patch_by_id('100').set_alpha(0.05)
v.get_label_by_id('010').set_text('dropped'); v.get_patch_by_id('010').set_alpha(0.05)
v.get_label_by_id('110').set_text('included')
plt.title('inner join')
plt.show()
[ ]
inner = pd.merge(empl,inc,how='inner',on=['municipality','year'])

print(f'Number of municipalities = {len(inner.municipality.unique())}')
print(f'Number of years          = {len(inner.year.unique())}')

We see that the inner join does not contain any rows that are not in both datasets.

[ ]
I = (inner.year.isin(diff_y)) | (inner.municipality.isin(diff_m))
inner.loc[I, :].head(15)

Left join

In my work, I most frequently use the left join. It is also known as a many-to-one join.

  • Left dataset: inner has many observations per municipality (one per year).
  • Right dataset: area has at most one observation per municipality and contributes the new variable (km2).
[ ]
inner_with_area = pd.merge(inner, area, on='municipality', how='left')
inner_with_area.head(10)
[ ]
print(f'inner has shape {inner.shape}')
print(f'area has shape {area.shape}')
print(f'merge result has shape {inner_with_area.shape}')
[ ]
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('inner', 'area'))
v.get_label_by_id('100').set_text('included:\n no km2'); 
v.get_label_by_id('010').set_text('dropped'); v.get_patch_by_id('010').set_alpha(0.05)
v.get_label_by_id('110').set_text('included:\n with km2')
plt.title('left join')
plt.show()

Intermezzo: Finding the non-overlapping observations

[ ]
not_in_area = [m for m in inner.municipality.unique() if m not in area.municipality.unique()]
not_in_inner = [m for m in area.municipality.unique() if m not in inner.municipality.unique()]

print(f'There are {len(not_in_area)} municipalities in inner that are not in area. They are:')
print(not_in_area)
print('')

print(f'There are {len(not_in_inner)} municipalities in area that are not in inner. They are:')
print(not_in_inner)
print('')

Check that km2 is never missing:

[ ]
inner_with_area.km2.isnull().sum()
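If you want pandas to check that the merge really is many-to-one, the validate argument of pd.merge raises an error if the right-hand keys are not unique (a small optional check, not used in the rest of the lecture):

[ ]
# validate='m:1' raises a MergeError if 'municipality' does not uniquely identify rows in area
inner_with_area_checked = pd.merge(inner, area, on='municipality', how='left', validate='m:1')
inner_with_area_checked.shape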

Alternative function for left joins: df.join(), which uses the index

To use the left-join function df.join(), we must first set the index. Strictly speaking, we do not need to here, but joining on more than one variable with df.join() requires working with indices, so we might as well learn it now.

[ ]
area.sample(10)
[ ]
inner.set_index(['municipality', 'year'], inplace=True)
area.set_index('municipality', inplace=True)
[ ]
inner.head()
final = inner.join(area)
print(f'final has shape: {final.shape}')
final.head(5)

1.3 Other programming languages

SQL (including SAS proc sql)

SQL is one of the most powerful database languages and many other programming languages embed a version of it. For example, SAS has proc sql, where you can use SQL syntax.

SQL is written in statements such as

  • left join: select * from empl left join inc on empl.municipality = inc.municipality and empl.year = inc.year
  • outer join: select * from empl full outer join inc on empl.municipality = inc.municipality and empl.year = inc.year

STATA

In Stata, the command merge nests many of the commands mentioned above. You specify merge 1:1 for a one-to-one merge or merge m:1 or merge 1:m for many-to-one or one-to-many merges, and you do not use merge m:m (until you are quite advanced).

2. Fetching data using an API

API stands for Application Programming Interface. An API is an interface through which we can directly ask for and receive data from an online source. We will be using packages for this and will not look at what is going on underneath.

  1. We use pandas_datareader to access many common international online data sources (install with pip install pandas-datareader)
  2. For Statistics Denmark, Jakob Elben has written the pydst package (install with pip install git+https://github.com/elben10/pydst)

Fetching data from an API requires an internet connection and works directly without saving data to your hard disk (unless you ask Python to do so afterwards). You can use it to automate tasks such as fetching the most recent data, doing some calculations and outputting the results in the same manner every time. This can be useful e.g. for quarterly reports. Remember to save the data on your computer if you really need it later, though: the administrators of the data may turn off the tap at some point.

Pros: Automatic; smart; everything is done from Python (so no need to remember steps in between).

Cons: The connection can be slow or drop out, which may lead to errors. If e.g. 100 students simultaneously fetch data (during, say, a lecture), the host server may not be able to service all the requests and may drop out.

The raw output data from an API could look like this: https://stats.oecd.org/SDMX-JSON/data/NAAG. It is a long list of non-human-readable gobbledygook in the so-called "JSON" format.

2.1 Import data from Denmark Statistics

Setup:
Create a DST API object that will allow us to interact with the DST server.

[ ]
Dst = pydst.Dst(lang='en') # set up the data loader with the language set to English
[ ]
# What is the Dst variable?
print(type(Dst))

Data from DST are organized into:

  1. Subjects: indexed by numbers. Use Dst.get_subjects() to see the list.
  2. Tables: with names like "INDKP107". Use Dst.get_tables(subjects=['X']) to see all tables in a subject.

Data is extracted with Dst.get_data(table_id = 'NAME', variables = DICT).

Subjects: With Dst.get_subjects() we can list all subjects.

[ ]
Dst.get_subjects()

Tables: With get_tables(), we can list all tables under a subject.

[ ]
tables = Dst.get_tables(subjects=['2'])
print(type(tables))
display(tables)

Variables in a dataset:

[ ]
tables[tables.id == 'INDKP107']
[ ]
indk_vars = Dst.get_variables(table_id='INDKP107')
indk_vars

We want to know the available levels of each conditioning variable that we may subset by. Use a loop to print out those levels.

Values of variables in a dataset:

[ ]
indk_vars = Dst.get_variables(table_id='INDKP107')

for id in ['ENHED','KOEN','UDDNIV','INDKOMSTTYPE']:
    print(id)
    values = indk_vars.loc[indk_vars.id == id,['values']].values[0,0]
    for value in values:      
        print(f' id = {value["id"]}, text = {value["text"]}')

There are quite a few levels to select from. We use a dictionary to specify the desired subset of the data. Note: a * indicates that you want all levels of a variable. For example, we select all time periods (TID) below.

Get data:

[ ]
variables = {'OMRÅDE':['*'],'ENHED':['110', '116'],'KOEN':['M','K'],'TID':['*'],'UDDNIV':['65'],'INDKOMSTTYPE':['100']}
inc_api = Dst.get_data(table_id = 'INDKP107', variables=variables)
inc_api.sort_values(by=['OMRÅDE', 'TID', 'KOEN'], inplace=True)
inc_api.head(5)

.. now you have a dataset ready for cleaning and renaming.
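As a sketch of what that cleaning could look like (the English column names are just suggestions mirroring section 1, and we assume the value column is called INDHOLD, as it typically is for DST tables):

[ ]
# rename the Danish columns to names matching the datasets from section 1 (suggested names only)
inc_api_clean = inc_api.rename(columns={'OMRÅDE': 'municipality', 'TID': 'year', 'INDHOLD': 'inc'})
inc_api_clean.head(3)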

2.2 FRED (Federal Reserve Economic Data)

GDP data for the US

[ ]
# First we need to encode dates in a Python-friendly format to specify the desired time period.
# Use the datetime module - it is the general way to handle dates in python. 
start = datetime.datetime(2005,1,1)
end = datetime.datetime(2017,1,1)
timespan = end - start # We can investigate the precise time span by subtracting two datetime variables.
print('total number of days:', timespan.days) # The timespan object has a days attribute.
[ ]
# Call the FRED api using pandas_datareader 
gdp = pandas_datareader.data.DataReader('GDP', 'fred', start, end)
gdp.head(10)

Finding data:

  1. go to https://fred.stlouisfed.org
  2. search for data in main bar, e.g. employment and unemployment
  3. click the first links
  4. the series ID is shown next to the header

We now want to pull down data on aggregate employment (PAYEMS) and unemployment (UNEMPLOY) levels and then plot the development.

Pulling the data:

[ ]
start = datetime.datetime(1939,1,1)
end = datetime.datetime(2021,12,1)

# We can pull from multiple sources in one go. Just combine them in a list.
empl_us = pandas_datareader.data.DataReader(['PAYEMS', 'UNEMPLOY'], 'fred', start, end)

Plot:

[ ]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

# Now we are just plotting directly from the pandas dataframe. Still using matplotlib under the hood.
empl_us.plot(ax=ax)

ax.legend(frameon=True)
ax.set_xlabel('')
ax.set_ylabel('employment (US)');

2.3 World Bank indicators: wb

Finding data:

  1. go to https://data.worldbank.org/indicator/
  2. search for GDP
  3. variable name ("NY.GDP.PCAP.KD") is in the URL

Pull GDP numbers:

[ ]
# Need a different module than in the FRED case
from pandas_datareader import wb
[ ]
wb_gdp = wb.download(indicator='NY.GDP.PCAP.KD', country=['SE','DK','NO'], start=1990, end=2017)

wb_gdp = wb_gdp.rename(columns = {'NY.GDP.PCAP.KD':'GDP'})
wb_gdp = wb_gdp.reset_index()
wb_gdp.sample(5)
[ ]
wb_gdp.info()

Problems:

  • It turns out that the dataframe has stored the variable year as an "object", meaning in practice that it is a string. This must be converted to an int, as we want to use it as a number.
  • country is in fact a text variable, so it is acceptable to have it as an object type. But pandas has its own string type, called 'string', while the built-in text type you normally encounter is str. Yes, confusing!! It is worth getting right, because an object column can mix numbers and text, which is bad.
  • Fortunately, GDP is a float (i.e. a number).
[ ]
wb_gdp.year = wb_gdp.year.astype(int) # convert year
wb_gdp.country = wb_gdp.country.astype('string') # convert country to the special pandas string type
wb_gdp.info()

Fetch employment-to-population ratio:

[ ]
wb_empl = wb.download(indicator='SL.EMP.TOTL.SP.ZS', country=['SE','DK','NO'], start=1990, end=2017) # don't need the special datetime here.
wb_empl.rename(columns = {'SL.EMP.TOTL.SP.ZS':'employment_to_pop'}, inplace=True) # Better col name
wb_empl.reset_index(inplace = True)
wb_empl.year = wb_empl.year.astype(int)
wb_empl.sample(3)

Merge:

[ ]
wb = pd.merge(wb_gdp, wb_empl, how='outer', on=['country','year']) # note: this reuses the name wb, overwriting the imported wb module
wb.head(5)
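Since this was an outer join, it is worth checking whether it introduced missing values, in the spirit of section 1:

[ ]
# an outer join can create missing values if the two downloads cover different country-years
wb.isna().sum()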

3. Split-apply-combine

One of the most useful skills to learn is the split-apply-combine process. For example, we may want to compute the average employment rate within a municipality over time and calculate whether the employment rate in each year is above or below the average. We calculate this variable using a split-apply-combine procedure:

  1. split: divide the dataset into units (one for each municipality)
  2. apply: compute the average employment rate for each unit
  3. combine: merge this new variable back onto the original dataset
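Here is the idea on a tiny made-up dataframe before we do it on the real data (the numbers are just for illustration; transform is explained below):

[ ]
# toy illustration of split-apply-combine
toy = pd.DataFrame({'municipality': ['A', 'A', 'B', 'B'],
                    'year': [2010, 2011, 2010, 2011],
                    'e': [0.70, 0.72, 0.60, 0.58]})

# split by municipality, apply the mean within each group, combine back onto the original rows
toy['e_mean'] = toy.groupby('municipality')['e'].transform('mean')
toy['above_mean'] = toy.e > toy.e_mean
toy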

3.1 Groupby

Example data:

[ ]
empl.head()
[ ]
empl.rename(columns={'empl':'e'}, inplace=True)
[ ]
# A simple, unconditional transformation of data. Works because of broadcasting. 
empl['empl_demean'] = empl.e - empl.e.mean()
empl.head()
[ ]
empl = empl.sort_values(['municipality','year']) # sort by first municipality then year
empl.head(5)

Use groupby to calculate within means:

[ ]
empl.groupby(['municipality'])['e'].mean().head(5)

Custom functions: the apply part can be specified using lambda notation.

Warning: lambda implementations will often be a pretty slow alternative to vectorized operations. More on that later.

An example with average change:

[ ]
# Define a lambda function to be applied to a column, in blocks of rows defined by the groupby.
avg_first_diff = lambda x: x.diff(1).mean() # A pd.Series has a function diff that does the job.

# Apply the lambda and print head of output
empl.groupby('municipality')['e'].apply(avg_first_diff).head(5)

Or:

[ ]
# We can also define our lambda with a numpy implementation. 
avg_first_diff = lambda x: np.mean(x[1:]-x[:-1])

# The extra lambda passes the underlying numpy array (x.values) of e to avg_first_diff.
empl.groupby('municipality')['e'].apply(lambda x: avg_first_diff(x.values)).head(5) 

Plot statistics: Dispersion in employment rate across Danish municipalities over time.

[ ]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

empl.groupby('year')['e'].std().plot(ax=ax,style='-o')

ax.set_ylabel('std. dev.')
ax.set_title('std. dev. across municipalities in the employment rate');

3.2 Split-Apply-Combine

Goal: Calculate within municipality difference to mean employment rate.

Start by splitting, applying and combining manually:

1. Split:

[ ]
e_grouped = empl.groupby('municipality')['e']

# The e_grouped object is a lazy GroupBy object - nothing is computed until we apply a function to it
print(e_grouped)

2. Apply:

[ ]
e_mean = e_grouped.mean() # mean employment rate
e_mean.head(10)

Change name of series:

[ ]
e_mean.name = 'e_mean' # necessary for join

3. Combine:

[ ]
empl_ = empl.set_index('municipality').join(e_mean, how='left')
empl_['e_demean'] = empl_.e - empl_.e_mean
empl_.xs('Copenhagen')

Plot:

[ ]
municipalities = ['Copenhagen','Roskilde','Lejre']

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

# .xs(m) selects all the rows (years) of municipality m, which makes for an elegant loop
for m in municipalities:
    empl_.xs(m).plot(x='year',y='e_demean',ax=ax,label=m)

ax.legend(frameon=True)
ax.set_ylabel('difference to mean')

Do the splitting and applying in one fell swoop

with agg()

Agg: returns one value per group (the same value for all observations in that group).
We can use lambdas or built-in functions for the operation. Use built-ins whenever you can! The lambda version is shown as a comment below for comparison.

[ ]
empl_ = empl.copy()

# a. Good use: a built-in function for mean rather than lambda.
e_mean = empl_.groupby('municipality')['e'].agg('mean')

# Same result with a lambda
#e_mean = empl_.groupby('municipality')['e'].agg(lambda x: x.mean())

e_mean.name = 'e_mean'

# b. combine
empl_ = empl_.set_index('municipality').join(e_mean, how='left')
empl_['diff'] = empl_.e - empl_.e_mean
empl_.xs('Copenhagen')

Note: Same result!!

Question: Are there any dangers with the variable name 'diff'?
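Hint: dataframes already have a diff() method, so attribute-style access to a column named 'diff' gives you the method instead of the column. A small illustration:

[ ]
# 'diff' clashes with the DataFrame method df.diff()
print(type(empl_['diff']))  # the column (a pandas Series)
print(type(empl_.diff))     # the bound method, not the column!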

This is pretty cumbersome, though: we create a new variable and then merge it in a separate step. We can do better with the tools in pandas.

Splitting, applying and combining all together

with apply() and transform() directly

Transform: when you are dealing with multiple variables, transform works on one variable/column at a time. In the case below, had we selected both e and year rather than just e, x.mean() would only have been applied within one column at a time. transform must return either a single value or an array of the same size as the original column.
Apply: the set of columns passed to the apply function is treated as a whole dataframe of its own. You can therefore write lambda functions that use several columns of data in each operation. That is not possible with transform.
More info: you can read more about the differences between transform and apply here.

Note: when you select only one variable, transform and apply behave similarly.
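For example, apply can combine several columns of a group in one calculation, which transform cannot. A small sketch (the first-to-last-year change in e is just for illustration):

[ ]
# apply passes each group as a dataframe, so the lambda can use several columns at once
first_to_last = lambda g: g.sort_values('year')['e'].iloc[-1] - g.sort_values('year')['e'].iloc[0]
empl.groupby('municipality')[['year', 'e']].apply(first_to_last).head(5)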

[ ]
empl_ = empl.copy()
empl_['e_demean'] = empl_.groupby('municipality')['e'].apply(lambda x:  x - x.mean())

# In this case, you could have used transform as well
# empl_['e_demean'] = empl_.groupby('municipality')['e'].transform(lambda x: x - x.mean())

empl_.set_index('municipality').xs('Copenhagen')

3.3 Optimizing performance

It is quite important for your own and others' productivity to implement efficient procedures when dealing with large datasets. The apply method (as well as transform) essentially loops over the groups in ordinary Python when applying a lambda function. This may be much slower than needed, because the built-in pandas implementations avoid that overhead. Using pandas functions without lambdas gets it right, which is important with large datasets.

[ ]
import time
N = 300

# a. Check performance with lambda function. Sooo slooow.. 
demean = lambda x: x - x.mean()
tic = time.perf_counter()
for i in range(N):
    d1 = empl.groupby('municipality')['e'].transform(demean)
toc = time.perf_counter()
print(f'Performance with lambda function {toc-tic: 5.3f}')

# b. Performance when relying on built-in pandas methods. It is not because we're using transform per se.
# It's much faster because the mean is computed in optimized (Cython) code rather than through a Python-level lambda.
tic = time.perf_counter()
for i in range(N):
    d2 = empl.e - empl.groupby('municipality')['e'].transform('mean') # Demean by subtracting grouped mean from e column.    
toc = time.perf_counter()
print(f'Performance with pandas vectorized: {toc-tic: 5.3f}')

print('Check of consistency: ', np.all(d1==d2))

We can also see that an explicit numpy implementation is faster than relying on pandas methods. Here we revisit the first-differencing example from above.

[ ]
# a. The pandas series implementation
avg_first_diff = lambda x: x.diff(1).mean()
tic = time.perf_counter()
for i in range(N):
    d1 = empl.groupby('municipality')['e'].apply(avg_first_diff)
toc = time.perf_counter()
print(f'Performance with pandas: {toc-tic: 3.6f}')

# b. The Numpy implementation
avg_first_diff = lambda x: np.mean(x.values[1:]-x.values[:-1])
tic = time.perf_counter()
for i in range(N):
    d2 = empl.groupby('municipality')['e'].apply(avg_first_diff)
toc = time.perf_counter()
print(f'Performance with numpy: {toc-tic: 3.6f}')
print('Is d1 == d2:', np.all(d1==d2))

Note: Same result!!

Need more complex group by stuff?

Look here.

Additional links

  • Do you have missing values in your data? Check here
  • About strings and the object type in pandas, here.
  • Comparison of SQL statements and pandas group by here
  • Optimizing pandas routines incl. apply, here. (less technical)
  • Stack Overflow musings on optimal use of apply() and its downsides. See also this. (both pretty technical)
  • About optimizing pandas with numpy and vectorization, here.

4. A few examples of open access APIs

As already demonstrated, you can pull data from DST using their API, and pandas_datareader gives access to sources such as FRED and the World Bank. Many other statistical agencies and international organizations offer open access to data through an API as well.


5. All built-in functions to aggregate data in Pandas

You can use these functions in apply(), transform() and agg() by passing their names as strings, as seen above. This will normally be the fastest implementation.
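agg can also take a list of such strings, producing one column per statistic:

[ ]
# several built-in aggregation functions at once, referenced by their string names
empl.groupby('municipality')['e'].agg(['mean', 'std', 'min', 'max']).head(5)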

Function Description

  • count: Number of non-null observations
  • sum: Sum of values
  • mean: Mean of values
  • mad: Mean absolute deviation
  • min: Minimum
  • max: Maximum
  • mode: Mode
  • abs: Absolute Value
  • prod: Product of values
  • std: Unbiased standard deviation
  • var: Unbiased variance
  • sem: Unbiased standard error of the mean
  • skew: Unbiased skewness (3rd moment)
  • kurt: Unbiased kurtosis (4th moment)
  • quantile: Sample quantile (value at %)
  • cumsum: Cumulative sum
  • cumprod: Cumulative product
  • cummax: Cumulative maximum
  • cummin: Cumulative minimum

There is also your DataCamp cheat sheet for pandas for reference.

6. Summary

This lecture: We have discussed

  1. Combining datasets (merging and concatenating)
  2. Fetching data using an API (DST, FRED, World Bank, etc.)
  3. Split-apply-combine (groupby, agg, transform)

Your work: Before solving Problem Set 4 read through this notebook and play around with the code.

Project 1: See the details in the projects folder of Lectures2021 or Project 1: Data analysis here.
Deadline: 11th of April.

Next lecture: Algorithms: Searching and sorting algorithms.
Remember no lecture Monday 5th of April! (it's a holiday)