Problem set 3: Loading and structuring data from Denmark Statistics

[ ]
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import pandas as pd
import ipywidgets as widgets

Tasks

Create a pandas DataFrame

Modify the code below such that income and consumption are variables in the dt DataFrame.

[ ]
np.random.seed(1999)
 
N = 100
mydata = {}
mydata['id'] = range(N)
income = np.exp(np.random.normal(size=N))
consumption = np.sqrt(income)

dt = pd.DataFrame(mydata)
dt.head()

Answer: see A01.py

Create new variable

1) Add a new variable ratio which is the ratio of consumption to income.

[ ]
# write your code here
dt.head()

Answer: See A02.py

Summary statistics

Produce summary statistics using .describe().

[ ]
# write your code here

Answer: See A03.py

Indexing

Select everybody with an income above 1.

[ ]
# write your code here
dt.head()

Answer: See A04.py

Select everybody with an income above 1 and a ratio above 0.7.

[ ]
# write your code here

Answer: See A05.py

Set consumption equal to 0.5 if income is less than 0.5.

[ ]
# write your code here
# dt['consumption'].mean() # <- compare with answer

Answer: See A06.py

Set consumption equal to income if income is less than 0.5.

[ ]
# write your code here
# dt['consumption'].mean() # <- compare with answer

Answer: See A07.py

Dropping

Drop the ratio variable and all rows with an income above 1.5. After this, also drop the first 5 rows.

[ ]
print(f'before: {dt.shape[0]} observations, {dt.shape[1]} variables')
# write your code here
print(f'after: {dt.shape[0]} observations, {dt.shape[1]} variables')

Answer: see A08.py

Renaming

Rename consumption to cons and income to inc.

[ ]
# write your code
dt.head()

Answer: see A09.py

Functions

Correct the wrong lines such that assets_1 = assets_2 = assets_3 = assets_4.

[ ]
def assets_row_by_row(x,R,Y):
    return 0 # wrong line
    
def assets_all_at_once(income,consumption,R,Y):
    return 0 # wrong line

def assets_adj(assets,R,Y):
    # missing lines
    pass

R = 1.2 # return rate
Y = 1 # income
try:
    dt['assets_1'] = R*(dt['inc']-dt['con'])+Y
    dt['assets_2'] = dt.apply(assets_row_by_row,axis=1,args=(R,Y))
    dt['assets_3'] = assets_all_at_once(dt['inc'].values(),dt['con'].values(),R,Y)
    dt['assets_4'] = dt['inc']-dt['con']
    assets_adj(dt['assets_4'],R,Y)
except:
    print('failed')
dt.head()    

Answer: see A10.py

Problem

Load the data set in data/NAH1_pivoted.xlsx and clean and structure it such that the plot_timeseries(dataframe) below can be run and produce an interactive figure.

[ ]
def _plot_timeseries(dataframe, variable, years):
    
    fig = plt.figure(dpi=100)
    ax = fig.add_subplot(1,1,1)
    
    dataframe.loc[:,['year']] = pd.to_numeric(dataframe['year'])
    I = (dataframe['year'] >= years[0]) & (dataframe['year'] <= years[1])
    
    x = dataframe.loc[I,'year']
    y = dataframe.loc[I,variable]
    ax.plot(x,y)
    
    ax.set_xticks(list(range(years[0], years[1] + 1, 5)))    
    
def plot_timeseries(dataframe):
    
    widgets.interact(_plot_timeseries, 
    dataframe = widgets.fixed(dataframe),
    variable = widgets.Dropdown(
        description='variable', 
        options=['Y','C','G','I','X','M'], 
        value='Y'),
    years=widgets.IntRangeSlider(
        description="years",
        min=1966,
        max=2018,
        value=[1980, 2018],
        continuous_update=False,
    )                 
); 

Hint 1: You can base your renaming on this dictionary:

[ ]
rename_dict = {}
rename_dict['P.1 Output'] = 'Y'
rename_dict['P.3 Final consumption expenditure'] = 'C'
rename_dict['P.3 Government consumption expenditure'] = 'G'
rename_dict['P.5 Gross capital formation'] = 'I'
rename_dict['P.6 Export of goods and services'] = 'X'
rename_dict['P.7 Import of goods and services'] = 'M'

Hint 2: You code should have the following structure:

[ ]
# a. load data set
# nah1 = ?

# b. rename variables

# c. remove rows where Y is nan

# d. correct year column data
# hint, nah1.loc[J,['year']] = nah1.loc[I,['year']].values

# e. only keep rows with '2010-prices, chained values'

# f. only keep renamed variables

# g. interactive plot
# plot_timeseries(nan)

Answer: see A11.py

[ ]
plot_timeseries(nah1)

Extra problems

Extend interactive plot

Extend the interactive plot with a choice of real vs nominal.

New data set

Load data from an Excel or CSV file you have downloaded from e.g. Statistikbanken.dk. Clean, structure and present the data as you see fit.