library(tidyverse) # includes tidyr and dplyr and ggplot
library(tidyquant) # includes lubridate
library(timetk)
library(readxl)
library(scales)
library(fredr)
library(riingo)
library(janitor)
library(gt)           # for quick display tables
library(geomtextpath) # for the labeled vertical lines (geom_textvline) on our charts
GDP
Introduction
Welcome to our introductory chapter on Gross Domestic Product (or GDP), a term with which our readers are probably familiar. GDP is widely considered to be a measure of The Economy - when pundits remark that the economy is slowing or at risk of recession, they are referring to the change in GDP.1
The official definition of GDP is “the value of goods and services produced in the United States”. That might sound like a difficult thing to encapsulate in one number and indeed it is. An important word to call out in the definition of GDP is the word ‘value’. We numerically measure the value of goods and services, of course, with their price. However, when we consider whether the economy has grown or not, we typically mean real growth, not growth that reflects an increase in prices only.
The distinction between ‘real growth’ and ‘price growth’ is reflected in two different measures of GDP: Nominal GDP and Real GDP. As those names imply, Real GDP is adjusted for increases in price, whereas Nominal GDP is not.
Before we get to our analysis, let’s spend a few minutes on our immediate goals, tools and R packages.
By the end of the next section on downloading GDP data via excel spreadsheet, you will be able to locate data on the internet, import it from a spreadsheet and work through some complex wrangling to get the data into a nice, tidy format. There is no theory in this section. It's meant to help you learn some excel wrangling tips and tricks, and to see a concrete example of tidying data.
We also get a chance to review the foundational data wrangling functions that will cover around 90% of our work in tidying data. Remember that data scientists and quants spend at least 50% of their time wrangling and cleaning data. These functions and concepts will stay with us for our entire career and even though we saw most of them in previous chapters, it doesn’t hurt to review them in a new context.
By way of brief level setting, here’s a list of those fundamental functions.
From the dplyr package:
- select() to choose columns
- rename() to rename columns
- filter() to choose rows based on values
- mutate() to add new columns
- slice() to choose rows based on position
- distinct() to get unique row values
- arrange() to change the order of rows
- group_by() to group data within data frames
From the lubridate package:
- parse_date_time() to change strings to dates
- ymd() to put dates into better format
From the tidyr package:
- pivot_longer() to make data tidy and long
Another important function is the pipe, which is this thing: %>%. You will notice it throughout our code even though it looks funny right now. It's called a pipe because we use it to pass data from one function to another, line by line, until we are done manipulating the data. Read more about the pipe here.
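As a minimal illustration (a toy example of our own, not from the GDP data), these two expressions are equivalent - the pipe simply feeds the left-hand result into the next function:

round(sqrt(25), 1)  # nested calls, read inside-out
[1] 5

25 %>% sqrt() %>% round(1)  # piped, read left to right
[1] 5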
Let’s load our packages and get to work importing a messy spreadsheet of economic data.
Real v. Nominal GDP
Both Real and Nominal GDP are reported on a quarterly basis by the Bureau of Economic Analysis (the BEA) and consist of four sub-components: Net Exports, Consumption, Government Spending and Investment. Those four sub-components consist of several sub-sub-components, which in turn consist of several components. When we analyze or discuss GDP, we are referring to the sum of these components. We could also analyze the sub-components individually, for example if we were particularly interested in personal consumption (we'll discuss this in our Inflation Chapter).
For those who are interested in a deep dive into how the BEA calculates Real v. Nominal GDP, the New York Fed has research on that available here: https://www.newyorkfed.org/medialibrary/media/research/current_issues/ci1-9.pdf.
Let’s take a quick peek at a chart of Real v. Nominal GDP.
gdp_nom_real <-
  tribble(~ symbol, ~ name,
"GDPC1", "real_gdp",
"GDP", "nominal_gdp") %>%
tq_get(get = "economic.data", from = "1979-01-01") %>%
group_by(name) %>%
  mutate(yoy_change = price / lag(price, 4) - 1) # quarterly series, so lag 4 = one year

gdp_nom_real %>%
  filter(date >= "2000-01-01") %>%
ggplot(aes(x = date, y = price, color = name)) +
geom_line(show.legend = T) +
geom_label(
aes(x = date,
label = str_glue("${round(price/1000)}T")),
data = . %>% filter(date == max(date)),
show.legend = F
  ) +
  geom_textvline(
xintercept =
as.integer(ymd("2017-06-01")),
label = "Base Date for Real Indexing",
hjust = 0.1,
linetype = 2,
size = 3,
vjust = 1.3,
color = "darkred"
  ) +
  scale_color_manual(values = c("darkgreen",
"darkblue")) +
scale_y_continuous(labels = dollar_format(scale = 1 / 1000,
suffix = "T"),
breaks = pretty_breaks(15)) +
scale_x_date(date_breaks = "2 years",
date_labels = "%Y") +
theme_minimal() +
labs(
x = "",
y = "",
title = " Real v. Nominal GDP",
color = ""
  ) +
  theme(
plot.title = element_text(hjust = .5),
plot.margin = margin(0.1, 1, 0.1, 0.1, "cm"),
legend.position = c(.5, .8),
legend.text = element_text(size = 12)
)
That chart shows us a few interesting points. The most recent readings of Real and Nominal GDP are separated by about $6 Trillion. If someone asks what the GDP of the United States is, is the answer $28 Trillion or $22 Trillion? It depends on whether we mean Real or Nominal.
Also, have a look at the red line labeled "Base Date for Real Indexing". That line occurs at June of 2017, the date at which Real and Nominal GDP are exactly equal, and that's due to how Real GDP is indexed against Nominal GDP as of this writing. 2017 is the indexing date for the latest iteration of Real v. Nominal - notice how quickly Nominal GDP has outstripped Real GDP since then.
The relationship between Real and Nominal GDP is captured by the Deflator, which is defined as: Nominal GDP divided by Real GDP, multiplied by 100.
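For a quick arithmetic check, using round hypothetical figures rather than exact BEA numbers:

nominal_gdp <- 28000 # billions, hypothetical round number
real_gdp    <- 22000 # billions, hypothetical round number
(nominal_gdp / real_gdp) * 100
[1] 127.2727

A deflator above 100 tells us prices have risen - here roughly 27% - since the base year.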
Visually, here’s how that looks:
gdp_nom_real %>%
  select(-symbol, -yoy_change) %>%
pivot_wider(
values_from = price,
names_from = name
  ) %>%
  mutate(deflator = (nominal_gdp/real_gdp) * 100) %>%
ggplot(aes(x = date,
y = deflator)) +
geom_line(color = "steelblue") +
geom_textvline(xintercept =
as.integer(ymd("2017-06-01")),
label = "Base Date for Real Indexing",
hjust = 0.1,
linetype = 2,
size = 3,
vjust = 1.3,
color = "darkred") +
scale_y_continuous(breaks = pretty_breaks(10)) +
scale_x_date(date_breaks = "4 years",
date_labels = "%Y") +
theme_minimal() +
labs(x = "", y = "Index 2017=100",
title = "GDP Implicit Deflator",
color = "") +
theme(plot.title = element_text(hjust = .5),
plot.margin = margin(0.1, 1, 0.1, 0.1, "cm"),
legend.position = c(.5, .8),
legend.text = element_text(size = 12))
Notice how the GDP deflator is equal to 100 in 2017 - that’s not a coincidence. It will always be equal to 100 in the indexing year.
Messy Data from the Bureau of Economic Analysis
We want this book to mirror reality, and the reality is that our first task (and many, many subsequent tasks) in industry is usually to work with a tangled, messy, impenetrable spreadsheet. Luckily, the BEA provides one such spreadsheet where it reports GDP data.
Starting with a messy spreadsheet will also be our chance to introduce foundational data wrangling functions that will cover around 90% of our work in tidying data.
It’s not the fun stuff, but we want to start with spreadsheet work and data wrangling.
By way of brief level setting, here’s a list of those fundamental functions that we’ll need to clean the BEA spreadsheets and organize data on the change in Real GDP.
From the dplyr package:
- select() to choose columns
- rename() to rename columns
- filter() to choose rows based on values
- mutate() to add new columns
- slice() to choose rows based on position
- distinct() to get unique row values
- arrange() to change the order of rows
- group_by() to group data within data frames
- tibble() to create a dataframe, also called a tibble in R
From the lubridate package:
- parse_date_time() to change strings to dates
- ymd() to put dates into better format
From the tidyr package:
- pivot_longer() to make data tidy and long
Retrieving GDP data from the Bureau of Economic Analysis (the “BEA”) requires a trip to the BEA website to figure out the exact URL where the data is stored.
We navigate to this URL https://apps.bea.gov/iTable/itable.cfm?isuri=1&reqid=19&step=4&categories=flatfiles&nipa_table_list=1, which shows the following:
The first list is Section 1: Domestic Product and Income
. If we right click on that link, we see this image:
Next click on Copy Link Address
, navigate back over to RStudio and paste that URL into a code chunk. We will assign this to a variable called url
.
<- "https://apps.bea.gov/national/Release/XLS/Survey/Section1All_xls.xlsx" url
We just copied the exact URL where the BEA houses its GDP data.
To download the entire data set, we use the following code:
<- "Section1All_xls.xlsx"
destfile ::curl_download(url, destfile) curl
After we run that code snippet, we have downloaded the Section 1
data from the BEA and it’s now sitting in a location called destfile
on our computer. That’s the end of our need for an internet connection. We could disconnect from Wifi and get on an airplane for the rest of this section.
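As an aside, since the file now lives on disk, a small convenience sketch of our own (not required by the BEA) is to skip the download when the file already exists:

# only re-download if the file isn't already sitting on disk
if (!file.exists(destfile)) {
  curl::curl_download(url, destfile)
}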
The downloaded excel file has more than one sheet (another popular industry trick is to hide data in different tabs or sheets within one excel workbook), and unless we already know which one we want, we can view the available sheets with the excel_sheets()
function from the readxl
package. We pipe the results of excel_sheets(destfile)
to the tibble()
function so we can store the results in a data frame. We can pipe to the head()
function to peek at the first few rows.
excel_sheets(destfile) %>%
tibble() %>%
head(3)
# A tibble: 3 × 1
.
<chr>
1 Contents
2 T10101-A
3 T10101-Q
How many sheets are available? We can use count()
to find out exactly.
excel_sheets(destfile) %>%
tibble() %>%
count()
# A tibble: 1 × 1
n
<int>
1 107
There are over 100 sheets available, and the first one is called Contents
. That’s a good place to start to find out what’s in this data. Not all spreadsheets will have a Contents
sheet, but the BEA has been kind enough to include one. We can look at the Contents
sheet by calling read_excel(...., sheet = "Contents")
, and then piping the results to gt()
for a quick table.
contents <-
  read_excel(destfile, sheet = "Contents")

contents %>%
  head() %>%
gt()
| Code | Title |
|---|---|
| NA | NA |
| T10101 | Table 1.1.1. Percent Change From Preceding Period in Real Gross Domestic Product |
| T10102 | Table 1.1.2. Contributions to Percent Change in Real Gross Domestic Product |
| T10103 | Table 1.1.3. Real Gross Domestic Product, Quantity Indexes |
| T10104 | Table 1.1.4. Price Indexes for Gross Domestic Product |
| T10105 | Table 1.1.5. Gross Domestic Product |
There are two columns in this sheet, Code
and Title
. The Title
column will let us know the names for each data set. The Code
column provides the same information if we have memorized the table codes for our data sets.
We can see that the first row contains an NA
and the second contains the information for Percent Change From Preceding Period in Real Gross Domestic Product
. How many possible Title
entries are there? We can count() them to find out.
read_excel(destfile, sheet = "Contents") %>%
select(Title) %>%
count()
# A tibble: 1 × 1
n
<int>
1 58
There are 58 rows or titles in Contents
here but we have 107 sheets, as we saw when we ran excel_sheets()
. What’s going on?
Those data codes and titles are both subdivided into annual and quarterly data. 58 titles, but annual and quarterly for each. Have a look back up to when we ran the excel_sheets()
function and notice how A
and Q
are appended to the codes, standing for “Annual” and “Quarterly”. We will need to deal with this when we decide exactly which data set to extract from this spreadsheet.
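For instance, here is a quick sketch (using str_detect() from stringr, which the tidyverse loads for us) that lists only the quarterly sheets by their -Q suffix:

excel_sheets(destfile) %>%
  tibble(sheet = .) %>%
  filter(str_detect(sheet, "-Q$")) %>% # keep sheet names ending in -Q
  head(3)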
Quarterly Change in Real GDP
We can see in the table above that the sheet T10101-Q
, holds data on the quarterly change in Real GDP, expressed as a percentage. That tends to be the headline GDP number reported in the media so let’s work with that data (NB: when the media report on change in GDP, they report on change in Real GDP, not nominal GDP).
Remember, we have already downloaded the spreadsheet with 107 worksheets to our machine - we grabbed a huge excel file from the BEA. Now we want to move just one of the sheets to our R environment and ultimately we want the resulting data to look like this:
read_excel(destfile,
sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  pivot_longer(-account:-code, names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq") %>%
ymd(),
percent_change = percent_change/100) %>%
arrange(desc(date)) %>%
relocate(date, everything()) %>%
filter(account == "Gross domestic product") %>%
head(3)
# A tibble: 3 × 4
date account code percent_change
<date> <chr> <chr> <dbl>
1 2024-04-01 Gross domestic product A191RL 0.03
2 2024-01-01 Gross domestic product A191RL 0.014
3 2023-10-01 Gross domestic product A191RL 0.034
We want 3 columns: date
, account
and percent_change
. Note that we chose the word account
as a good title for the column holding the description of this data. You might disagree or think there's a better column name, and we will cover how to change that.
The first step is to move data from that huge spreadsheet into our RStudio environment and that means a call to the read_excel()
function from the readxl
package. If the data is perfectly formatted, that’s the only step we’ll need. Let’s give it a shot:
read_excel(destfile,
sheet = "T10101-Q") %>%
head(3)
# A tibble: 3 × 312
Table 1.1.1. Percent C…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 [Percent] Seasonally ad… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 Quarterly data from 194… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Bureau of Economic Anal… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ℹ abbreviated name:
# ¹`Table 1.1.1. Percent Change From Preceding Period in Real Gross Domestic Product`
# ℹ 302 more variables: ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>,
# ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
# ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
# ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>, ...29 <chr>,
# ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>, ...34 <chr>, …
Have a look at the first 3 rows above, and notice that we don't need the first 7 rows. They contain metadata about when this file was created - interesting, but we don't want it to be part of our analysis. Let's add skip = 7
to the read_excel()
call.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
head(3)
# A tibble: 3 × 312
Line ...2 ...3 `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Gross domes… A191… -1 -0.8 6.4 6.2 6.8 2.29999…
2 2 Personal co… DPCE… 6.8 1.3 0.1 2 4.7 0.6
3 3 Goods DGDS… 7.4 2.7 1.5 0.6 3.7 -0.4
# ℹ 303 more variables: `1948Q4` <chr>, `1949Q1` <chr>, `1949Q2` <chr>,
# `1949Q3` <chr>, `1949Q4` <chr>, `1950Q1` <chr>, `1950Q2` <chr>,
# `1950Q3` <chr>, `1950Q4` <chr>, `1951Q1` <chr>, `1951Q2` <chr>,
# `1951Q3` <chr>, `1951Q4` <chr>, `1952Q1` <chr>, `1952Q2` <chr>,
# `1952Q3` <chr>, `1952Q4` <chr>, `1953Q1` <chr>, `1953Q2` <chr>,
# `1953Q3` <chr>, `1953Q4` <chr>, `1954Q1` <chr>, `1954Q2` <chr>,
# `1954Q3` <chr>, `1954Q4` <chr>, `1955Q1` <chr>, `1955Q2` <chr>, …
That looks much better, but the first three columns have small issues. The first column is called Line
and seems unnecessary. Let’s remove it with select(-Line)
. select()
is from the dplyr
package and is used to choose columns. If we wish to remove a column called Line
, we use the negative sign.
read_excel(destfile,
sheet = "T10101-Q",
skip = 7) %>%
select(-Line) %>%
head(3)
# A tibble: 3 × 311
...2 ...3 `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3` `1948Q4`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Gross do… A191… -1 -0.8 6.4 6.2 6.8 2.29999… 0.5
2 Personal… DPCE… 6.8 1.3 0.1 2 4.7 0.6 3.2
3 Goods DGDS… 7.4 2.7 1.5 0.6 3.7 -0.4 2.7
# ℹ 302 more variables: `1949Q1` <chr>, `1949Q2` <chr>, `1949Q3` <chr>,
# `1949Q4` <chr>, `1950Q1` <chr>, `1950Q2` <chr>, `1950Q3` <chr>,
# `1950Q4` <chr>, `1951Q1` <chr>, `1951Q2` <chr>, `1951Q3` <chr>,
# `1951Q4` <chr>, `1952Q1` <chr>, `1952Q2` <chr>, `1952Q3` <chr>,
# `1952Q4` <chr>, `1953Q1` <chr>, `1953Q2` <chr>, `1953Q3` <chr>,
# `1953Q4` <chr>, `1954Q1` <chr>, `1954Q2` <chr>, `1954Q3` <chr>,
# `1954Q4` <chr>, `1955Q1` <chr>, `1955Q2` <chr>, `1955Q3` <chr>, …
That looks better but the names of the first two columns are ...2
and ...3
. With rename()
, we can assign new names by position or by the original name. By position means we can tell the function to rename the first column, with rename(new_name = 1)
. By original name means we can assign new names by referring to the original name, with rename(new_name = ...2)
. Let’s use the position method, and rename the first two columns as account
and code
.
read_excel(destfile,
sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
head(3)
# A tibble: 3 × 311
account code `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3` `1948Q4`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Gross do… A191… -1 -0.8 6.4 6.2 6.8 2.29999… 0.5
2 Personal… DPCE… 6.8 1.3 0.1 2 4.7 0.6 3.2
3 Goods DGDS… 7.4 2.7 1.5 0.6 3.7 -0.4 2.7
# ℹ 302 more variables: `1949Q1` <chr>, `1949Q2` <chr>, `1949Q3` <chr>,
# `1949Q4` <chr>, `1950Q1` <chr>, `1950Q2` <chr>, `1950Q3` <chr>,
# `1950Q4` <chr>, `1951Q1` <chr>, `1951Q2` <chr>, `1951Q3` <chr>,
# `1951Q4` <chr>, `1952Q1` <chr>, `1952Q2` <chr>, `1952Q3` <chr>,
# `1952Q4` <chr>, `1953Q1` <chr>, `1953Q2` <chr>, `1953Q3` <chr>,
# `1953Q4` <chr>, `1954Q1` <chr>, `1954Q2` <chr>, `1954Q3` <chr>,
# `1954Q4` <chr>, `1955Q1` <chr>, `1955Q2` <chr>, `1955Q3` <chr>, …
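As an aside, we could get the same result by renaming on the original placeholder names readxl assigned (`...2` and `...3`, visible in the output above) rather than by position - a sketch for reference:

read_excel(destfile, sheet = "T10101-Q", skip = 7) %>%
  select(-Line) %>%
  rename(account = `...2`, code = `...3`) %>% # rename by original name
  head(3)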
The column names are more informative now, but look closely at the columns that hold the actual percent change values and notice that <chr>
is written underneath the column names. That tells us that the column has been read in as a character
rather than as a double
or something numeric. That was probably caused by the periods and negative signs. We want to change those to numeric columns.
To do so, we use the mutate()
function, which allows us to change an existing column’s content or add a new column.
Let’s change just the column called 1947Q2
to numeric. To do so, we call mutate(`1947Q2` = as.numeric(`1947Q2`)).
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(`1947Q2` = as.numeric(`1947Q2`)) %>%
head(3)
# A tibble: 3 × 311
account code `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3` `1948Q4`
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 Gross do… A191… -1 -0.8 6.4 6.2 6.8 2.29999… 0.5
2 Personal… DPCE… 6.8 1.3 0.1 2 4.7 0.6 3.2
3 Goods DGDS… 7.4 2.7 1.5 0.6 3.7 -0.4 2.7
# ℹ 302 more variables: `1949Q1` <chr>, `1949Q2` <chr>, `1949Q3` <chr>,
# `1949Q4` <chr>, `1950Q1` <chr>, `1950Q2` <chr>, `1950Q3` <chr>,
# `1950Q4` <chr>, `1951Q1` <chr>, `1951Q2` <chr>, `1951Q3` <chr>,
# `1951Q4` <chr>, `1952Q1` <chr>, `1952Q2` <chr>, `1952Q3` <chr>,
# `1952Q4` <chr>, `1953Q1` <chr>, `1953Q2` <chr>, `1953Q3` <chr>,
# `1953Q4` <chr>, `1954Q1` <chr>, `1954Q2` <chr>, `1954Q3` <chr>,
# `1954Q4` <chr>, `1955Q1` <chr>, `1955Q2` <chr>, `1955Q3` <chr>, …
That worked - notice how <dbl>
is now written under the column name. We want to convert all except the first two columns to numeric but we don’t want to type out the names of each. Fortunately, we can use across()
inside of mutate()
to accomplish this. With across()
, we can apply a function across multiple columns. Here we will tell across()
to apply the as.numeric()
function to all of our columns except the account
and code
columns. The full call is mutate(across(c(-account, -code), as.numeric))
.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
head(3)
# A tibble: 3 × 310
account `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3` `1948Q4`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gross domestic… -1 -0.8 6.4 6.2 6.8 2.3 0.5
2 Personal consu… 6.8 1.3 0.1 2 4.7 0.6 3.2
3 Goods 7.4 2.7 1.5 0.6 3.7 -0.4 2.7
# ℹ 302 more variables: `1949Q1` <dbl>, `1949Q2` <dbl>, `1949Q3` <dbl>,
# `1949Q4` <dbl>, `1950Q1` <dbl>, `1950Q2` <dbl>, `1950Q3` <dbl>,
# `1950Q4` <dbl>, `1951Q1` <dbl>, `1951Q2` <dbl>, `1951Q3` <dbl>,
# `1951Q4` <dbl>, `1952Q1` <dbl>, `1952Q2` <dbl>, `1952Q3` <dbl>,
# `1952Q4` <dbl>, `1953Q1` <dbl>, `1953Q2` <dbl>, `1953Q3` <dbl>,
# `1953Q4` <dbl>, `1954Q1` <dbl>, `1954Q2` <dbl>, `1954Q3` <dbl>,
# `1954Q4` <dbl>, `1955Q1` <dbl>, `1955Q2` <dbl>, `1955Q3` <dbl>, …
Et voilà! We just converted over 300 columns to numeric.
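As an aside, an equivalent spelling selects columns by type instead of by exclusion - a sketch using tidyselect's where() helper:

read_excel(destfile, sheet = "T10101-Q", skip = 7) %>%
  select(-1) %>%
  rename(account = 1, code = 2) %>%
  # convert every remaining character column except account and code
  mutate(across(where(is.character) & !c(account, code), as.numeric)) %>%
  head(3)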
But look at how many of the columns are dates - that's a wide format that spreadsheet people like but data scientists typically don't. We prefer long, tidy data to wide data, and it's not just personal preference. We are missing an important column here that should be called date. That missing column means this data is not tidy, which means more than just clean. It has a definition in the world of data science and we like this one:
Tidy data has three main elements:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
In our opinion, it is more efficient to examine this data in tidy format than to linger on the definition of tidy data.
Here is our same GDP data in tidy format (we convert it with the pivot_longer() function):
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <chr> <dbl>
1 Gross domestic product 1947Q2 -1
2 Gross domestic product 1947Q3 -0.8
3 Gross domestic product 1947Q4 6.4
Notice how we have gone from over 300 columns to just 3. Our data is much less wide now. At the same time, we have gone from 28 rows to more than 8,000 rows. Our data is longer. It is also tidy in the sense that we have a column for date and for percent_change. In the wide format, those two variables did not have their own columns; there was one column for each date. That wasn't "wrong", but it makes data science work harder and we could say we were "missing" a column called date.
The mechanics of how we reshaped that data are very important and not very intuitive. We will describe them in words and demonstrate the code, but we find the best way to get comfortable converting wide data to long, tidy data is to start experimenting.
Here are the steps:
+ move column names (which are dates like 1947Q2) into a column called `date`
+ move the values (the percent changes) into a new column called `percent_change`.
To do this, we use the pivot_longer()
function from tidyr
and we set names_to = "date"
to indicate we want to put the column names into a new column called date
, and we set values_to = "percent_change"
to indicate we want to put the values in a new column called percent_change
. Note, though, we don’t want to change our original columns called account
and code
. We want the pivot_longer()
function to ignore those columns.
Thus our full call is pivot_longer(-account:-code, names_to = "date", values_to = "percent_change")
. For the pivot_longer()
function, those negative signs mean to ignore the account
and code
columns, it does not mean to delete them.
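Before applying this to the full BEA table, here is the same reshape on a tiny toy tibble with made-up values, so the mechanics are easy to see:

toy_wide <- tibble(account = "Gross domestic product",
                   `1947Q2` = -1, `1947Q3` = -0.8)

toy_wide %>%
  pivot_longer(-account, names_to = "date", values_to = "percent_change")

# A tibble: 2 × 3
  account                date   percent_change
  <chr>                  <chr>           <dbl>
1 Gross domestic product 1947Q2           -1
2 Gross domestic product 1947Q3           -0.8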
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account, #ignore these columns
names_to = "date", # move column names to this column
values_to = "percent_change" # move column values to this column
  ) %>%
  head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <chr> <dbl>
1 Gross domestic product 1947Q2 -1
2 Gross domestic product 1947Q3 -0.8
3 Gross domestic product 1947Q4 6.4
Have a good look at that code, but also make sure the logic makes sense. Our new structure holds the exact same data as the BEA spreadsheet, but it’s in a different (and we think better) format.
We have one more important transformation to make in the date
column. We see that it is still in the <chr>
format, meaning it is a character column and not a <date>
class column.
There is a very useful package in R called lubridate
that can be used to convert and work with dates. In this case, we have a difficult situation. 2020Q3
does not actually refer to a date. It refers to a time period, the third quarter of 2020. But we want to anchor that period to the first date in Q3 2020. lubridate
has a function that will accomplish this for us, called parse_date_time()
.
We need to tell that function the format of the string that we want to parse, so we add orders = "Yq"
. The full call is mutate(date = parse_date_time(date, orders = "Yq")).
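To see what the parser does to a single quarter string before applying it to the whole column:

parse_date_time("2020Q3", orders = "Yq")
[1] "2020-07-01 UTC"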
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq")) %>%
head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <dttm> <dbl>
1 Gross domestic product 1947-04-01 00:00:00 -1
2 Gross domestic product 1947-07-01 00:00:00 -0.8
3 Gross domestic product 1947-10-01 00:00:00 6.4
The date
column is now in class <S3: POSIXct>
. That is a valid date format but we prefer to move it to class <date>
. We do that with another call to mutate and use the ymd()
function from lubridate
.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <date> <dbl>
1 Gross domestic product 1947-04-01 -1
2 Gross domestic product 1947-07-01 -0.8
3 Gross domestic product 1947-10-01 6.4
Let’s also add columns to store the month, quarter, year-month and year-quarter. To do this we use a few new functions:
- month() and year() from lubridate to extract the month and year.
- str_glue() from the stringr package to paste strings together. This is very useful for working with weird date strings and coercing them into the order we desire.
- as.yearmon() and as.yearqtr() from the zoo package.
Those two functions from zoo
allow us to create a time aware date label that uses just the month and year or just the quarter and year. We can break the rule
that a date needs a year, month and day. For example, November 2022 is not a date, exactly. It’s a time period. But as.yearmon
allows us to coerce it into a date period.
as.yearmon("2022-11")
[1] "Nov 2022"
zoo
has been in existence longer than lubridate
and was built specifically to deal with time series.2 We use lubridate
much more frequently, but these year-month and year-quarter functions are very useful.
Below is the full code chunk for wrangling into different date formats. Notice we can create many different columns around date
but need to call mutate
only once.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date),
year = year(date),
month_num = month(date),
month_label = month(date, label = T, abbr = F),
quarter = quarter(date),
month_year = str_glue("{month_label} {year}") %>%
as.yearmon(),
quarter_year= str_glue(" {year} Q{quarter}") %>%
as.yearqtr()) %>%
head(3)
# A tibble: 3 × 9
account date percent_change year month_num month_label quarter
<chr> <date> <dbl> <dbl> <dbl> <ord> <int>
1 Gross domestic … 1947-04-01 -1 1947 4 April 2
2 Gross domestic … 1947-07-01 -0.8 1947 7 July 3
3 Gross domestic … 1947-10-01 6.4 1947 10 October 4
# ℹ 2 more variables: month_year <yearmon>, quarter_year <yearqtr>
The dataframe is in good shape now. We have all the data we need and in a tidy format. Let’s do a bit more clean up and review a few more functions.
First, let’s keep just the rows that are equal to Gross domestic product
in the account
column. We need to filter that column to just our desired values with filter(account == "Gross domestic product")
.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
filter(account == "Gross domestic product") %>%
head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <date> <dbl>
1 Gross domestic product 1947-04-01 -1
2 Gross domestic product 1947-07-01 -0.8
3 Gross domestic product 1947-10-01 6.4
filter()
can be used on <date>
and <dbl>
columns as well. For example, suppose we wished to look at just the data for recent dates, those from “2020-01-01” onward. We would call filter(date >= "2020-01-01")
.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
filter(date >= "2020-01-01") %>%
head(3)
# A tibble: 3 × 3
account date percent_change
<chr> <date> <dbl>
1 Gross domestic product 2020-01-01 -5.3
2 Gross domestic product 2020-04-01 -28
3 Gross domestic product 2020-07-01 34.8
Another important function is slice()
. This lets us choose which rows to keep by number. It is quite useful when combined with group_by()
, which tells the data frame that we wish to create groups inside the data frame. Here, we probably want to treat each value in the account
column as a separate group.
If we want to view just the first and last row of each group, we call group_by(account)
and slice(1, n())
.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
group_by(account) %>%
slice(1, n()) %>%
head(3)
# A tibble: 3 × 3
# Groups: account [2]
account date percent_change
<chr> <date> <dbl>
1 Addendum: 1947-04-01 NA
2 Addendum: 2024-04-01 NA
3 Change in private inventories 1947-04-01 NA
That’s a good way to make sure we have the first and last date for each account, and a side benefit is we can notice some accounts that are not actually components of GDP, like addendum
. We exclude the rows labeled addendum
later in the code.
Lastly, we prefer to have the most recent date at the top, meaning the date in descending order. That necessitates a call to arrange(desc(date))
. And suppose we want the date
column to be the first column, on the far left? We can use relocate()
to change the column orders.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date),
month_year =
str_glue("{lubridate::month(date, label = T)} {year(date)}") %>%
as.yearmon(),
quarter_year= str_glue(" {year(date)} Q{quarter(date)} ") %>%
as.yearqtr()) %>%
filter(account == "Gross domestic product") %>%
arrange(desc(date)) %>%
relocate(date, month_year, quarter_year, everything()) %>%
head(3)
# A tibble: 3 × 5
date month_year quarter_year account percent_change
<date> <yearmon> <yearqtr> <chr> <dbl>
1 2024-04-01 Apr 2024 2024 Q2 Gross domestic product 3
2 2024-01-01 Jan 2024 2024 Q1 Gross domestic product 1.4
3 2023-10-01 Oct 2023 2023 Q4 Gross domestic product 3.4
Now that we have this data in good shape, we save it as a new object, called gdp_percent_change_from_spreadsheet
. That’s a verbose name that reminds us what the data holds and where it came from - the BEA spreadsheets.
gdp_percent_change_from_spreadsheet <-
  read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  select(account, contains("Q")) %>%
pivot_longer(-account,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date),
month_year =
str_glue("{lubridate::month(date, label = T)} {year(date)}") %>%
as.yearmon(),
quarter_year= str_glue(" {year(date)} Q{quarter(date)} ") %>%
as.yearqtr()) %>%
filter(account == "Gross domestic product") %>%
arrange(desc(date)) %>%
  relocate(date, month_year, quarter_year, everything())
That concludes our work wrangling this spreadsheet. For a quick recap, we started with this monstrosity of data criminality (with dates running across the spreadsheet!)
read_excel(destfile, sheet = "T10101-Q",
skip = 7)
# A tibble: 28 × 312
Line ...2 ...3 `1947Q2` `1947Q3` `1947Q4` `1948Q1` `1948Q2` `1948Q3`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Gross dome… A191… -1 -0.8 6.4 6.2 6.8 2.29999…
2 2 Personal c… DPCE… 6.8 1.3 0.1 2 4.7 0.6
3 3 Goods DGDS… 7.4 2.7 1.5 0.6 3.7 -0.4
4 4 Durable go… DDUR… 8.6 5.8 28.5 -1.9 2.7 6.8
5 5 Nondurable… DNDG… 7 1.9 -5.0999… 1.3 4 -2.4
6 6 Services DSER… 5.9 -0.8 -2.1 4.3 6.4 2.1
7 7 Gross priv… A006… -27.3 -11.3 103 47.8 27.1 4.2
8 8 Fixed inve… A007… -10 13.5 37.2999… 13.4 -0.6 -4
9 9 Nonresiden… A008… -6.1 -7.2 11.6 25.3 -12.7 1
10 10 Structures A009… -2.1 2.8 -5.6 8.6 11.6 12.5
# ℹ 18 more rows
# ℹ 303 more variables: `1948Q4` <chr>, `1949Q1` <chr>, `1949Q2` <chr>,
# `1949Q3` <chr>, `1949Q4` <chr>, `1950Q1` <chr>, `1950Q2` <chr>,
# `1950Q3` <chr>, `1950Q4` <chr>, `1951Q1` <chr>, `1951Q2` <chr>,
# `1951Q3` <chr>, `1951Q4` <chr>, `1952Q1` <chr>, `1952Q2` <chr>,
# `1952Q3` <chr>, `1952Q4` <chr>, `1953Q1` <chr>, `1953Q2` <chr>,
# `1953Q3` <chr>, `1953Q4` <chr>, `1954Q1` <chr>, `1954Q2` <chr>, …
And transformed it to this:
gdp_percent_change_from_spreadsheet %>%
  head(3)
# A tibble: 3 × 5
date month_year quarter_year account percent_change
<date> <yearmon> <yearqtr> <chr> <dbl>
1 2024-04-01 Apr 2024 2024 Q2 Gross domestic product 3
2 2024-01-01 Jan 2024 2024 Q1 Gross domestic product 1.4
3 2023-10-01 Oct 2023 2023 Q4 Gross domestic product 3.4
With the date running vertically along a column, we could now align this quarterly change in GDP data with other market and economic time series. Before we get to the market data, let’s examine another source for this data that will save us a lot of work (sometimes - as we said, a lot of internal data will be stored in spreadsheets like this and no clean versions will exist to help us).
That spreadsheet work was painful and we went through and repeated each step because they come up so frequently. We hope those code chunks can serve as a future reference for you when someone emails a spreadsheet over and asks if you could “just clean that up”.
Fortunately, we usually don't need a spreadsheet for macroeconomic variables. For those, we turn to FRED!
FRED data
FRED
stands for Federal Reserve Economic Data. It's a free repository of data that is maintained by the St. Louis Federal Reserve and even has a very helpful About
webpage available here.
The data available there goes far beyond the series produced by the Federal Reserve - there are currently more than 700,000 data series available. In this section we will import GDP data from FRED, which is substantially easier than doing so directly from the BEA.
There are many ways to import data from FRED - we prefer to use the tq_get()
function from the tidyquant package
. In order to use tq_get()
, we need to supply the FRED code
and we need to set get = "economic.data"
as an argument to the function. As we said, FRED has over 700,000 data sets and each has a unique code - some intuitive, some not at all. For example, the code for quarterly change in Real GDP is A191RL1Q225SBEA
.
A full successful call to download that data looks like this:
change_gdp_raw <-
  "A191RL1Q225SBEA" %>%
tq_get(get = "economic.data", from = "1979-01-01")
Here is what we just imported:
change_gdp_raw %>%
  head(3)
# A tibble: 3 × 3
symbol date price
<chr> <date> <dbl>
1 A191RL1Q225SBEA 1979-01-01 0.7
2 A191RL1Q225SBEA 1979-04-01 0.4
3 A191RL1Q225SBEA 1979-07-01 3
The call to FRED via tq_get()
is short and simple, and if we wish to download a different data set we simply change to a different FRED code.
A key, gateway question is: how did we know to use A191RL1Q225SBEA
?
Well, try googling "fred code gdp change quarterly" and scan the top 5 results. When we did so and clicked on the second link, it took us to the FRED website page that shows the code for "Change in GDP". We can see the FRED code on the screen just to the right of "Real Gross Domestic Product".
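As an aside, since we loaded the fredr package earlier, the same series can also be pulled with fredr() - a sketch that assumes you have registered for a (free) FRED API key:

# fredr_set_key("your_api_key_here") # placeholder - set your own key once per session
fredr(series_id = "A191RL1Q225SBEA",
      observation_start = as.Date("1979-01-01")) %>%
  select(date, series_id, value) %>%
  head(3)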
The data is imported, but it’s not quite how we want it yet:
change_gdp_raw %>%
  head()
# A tibble: 6 × 3
symbol date price
<chr> <date> <dbl>
1 A191RL1Q225SBEA 1979-01-01 0.7
2 A191RL1Q225SBEA 1979-04-01 0.4
3 A191RL1Q225SBEA 1979-07-01 3
4 A191RL1Q225SBEA 1979-10-01 1
5 A191RL1Q225SBEA 1980-01-01 1.3
6 A191RL1Q225SBEA 1980-04-01 -8
There are several changes we want to make:
+ rename a couple of columns with the `rename()` function.
+ make sure the data is the right order with `arrange()`
+ add new columns with `mutate()`
+ choose columns with `select()`
+ only keep rows after 1979 with `filter()`
+ reorder columns with `relocate()`
Here’s a quick code chunk utilizing those functions.
change_gdp_raw %>%
  rename(fred_code = symbol, percent_change = price) %>%
arrange(desc(date)) %>%
mutate(title = "Change in GDP",
percent_change = percent_change/100) %>%
select(-fred_code) %>%
filter(date >= "1979-01-01") %>%
relocate(title, date, percent_change) %>%
tail(3)
# A tibble: 3 × 3
title date percent_change
<chr> <date> <dbl>
1 Change in GDP 1979-07-01 0.03
2 Change in GDP 1979-04-01 0.004
3 Change in GDP 1979-01-01 0.007
Let’s save the result as a tibble
object called change_in_gdp_wrangled
.
change_in_gdp_wrangled <-
  change_gdp_raw %>%
  rename(fred_code = symbol, percent_change = price) %>%
arrange(desc(date)) %>%
mutate(title = "Change in Real GDP",
percent_change = percent_change/100) %>%
select(-fred_code) %>%
filter(date >= "1976-01-01") %>%
relocate(title, date, percent_change)
Here are the first few rows of the result:
change_in_gdp_wrangled %>%
  head(3)
# A tibble: 3 × 3
title date percent_change
<chr> <date> <dbl>
1 Change in Real GDP 2024-04-01 0.03
2 Change in Real GDP 2024-01-01 0.014
3 Change in Real GDP 2023-10-01 0.034
We now have our main GDP data object that we will work with for the rest of this section and have covered several workhorse functions for data wrangling.
Now let’s spend some time on data visualization and some fundamental skills in that area. If you’re already a ggplot()
advanced user, you can skim most of this.
Introduction to Visualization
Our GDP data objects are imported and wrangled, and that means it’s a good time to introduce data visualization.
In the next section, we review in detail how to build a nice chart with the ggplot2
package. It’s the most popular visualization package in R and there are a lot of good tutorials on it, but we want to include some common features when working with time series.
Let’s start with a line chart for percent change in GDP by quarter. Don’t get too bogged into the details of the code but have a look at how things proceed in steps.
change_in_gdp_wrangled %>%
  filter(date > "1990-01-01") %>%
ggplot(aes(x = date,
y = percent_change)) +
geom_line(show.legend = F,
color = "steelblue") +
labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: FRED data") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels =
function(x) zoo::format.yearqtr(x, "Q%q %Y"),
breaks = pretty_breaks(n = 7)) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
We will walk through the building of this chart step-by-step, not because it’s the most exciting chart in the world, but because the steps will serve as the foundation for our charting work - and it’s good to have a template to start when you want to get into more complex charts.
Let’s start with the first four lines of code used to generate the bones of the chart.
We are charting just our data since "2010-01-01" and that means we begin with change_in_gdp_wrangled %>% filter(date > "2010-01-01"). That has nothing to do with data visualization; it is a way to filter our data.
To start building a chart we pipe that data to the ggplot()
function from the ggplot2
package. That’s the base function of ggplot2
and it gets us started but it doesn’t actually draw us a nice chart. To do that, we need to add the keys to ggplot2
: 1) aesthetic mappings and 2) geoms
.
Very often, aesthetic mappings
is a precise way of saying: what goes on the x-axis and what goes on the y-axis.
We put the date on the x-axis by setting aes(x = date)
(this is aesthetically mapping the date
variable to the x-axis). We want percent change on the y-axis and thus we add aes(x = date, y = percent_change)
. That line of code - ggplot(aes(x = date, y = percent_change))
- has told the computer to draw a chart with date on the x-axis and percent change on the y-axis.
Let’s see what happens when we stop there. Remember, we need aesthetic mappings and geoms
but we haven’t added a geom
yet.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
ggplot(aes(x = date, y = percent_change))
ggplot()
has produced exactly what we specified - a chart with an x-axis mapped to date
and the y-axis mapped to percent_change
, but nothing drawn on the chart. That’s because we didn’t tell ggplot()
what to draw and it won’t guess at what we want - do we want a scatter plot, a line plot, a column chart? To build a line plot, we need to call geom_line()
. If we want that line to have a color, we need to specify that color, and we choose color = "steelblue"
.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
ggplot(aes(x = date, y = percent_change)) +
geom_line(color = "steelblue")
The job is done - the line chart is produced, and we could move on to different geom options for scatter and bar charts now, but first let's clean up some of the details on this chart.
Notice a few things:
+ there is no title
+ the label of the x-axis is "date", which seems superfluous
+ the title of the y-axis is "percent_change", perhaps unneeded
+ there is no data source caption

We can address these issues with the labs() function from ggplot2.
The code chunk below includes the labs()
function to change these features. Notice if we want no label for the x- or y-axis, we set x = ""
, i.e. the label is set to an empty string.
ggplot2
defaults to have the title at the left of the chart. There’s probably a theoretical data visualization reason for that, but we often want the title to be centered. To center it, use theme(plot.title = element_text(hjust = .5))
.
The final pieces of clean up involve the appearance of the x and y-axes. The x-axis is currently displaying just the year and the y-axis is not showing the %
sign. The y-axis also is showing only the numerals .2, -.2 and 0. We want more “breaks” to be shown. To address the y-axis first, we use the scale_y_continuous()
function and set labels = percent_format(accuracy = 1)
to add the %
sign to the y-axis and breaks = pretty_breaks(n = 10)
to add more breaks.
To make our date labels on the x-axis more informative, we have two options, both of which use the scale_x_date()
function.
Let’s start with the more common way to use this function. Since we are dealing with quarterly GDP change, we want date breaks that are more frequent than one per year. We change that with date_breaks = "6 months"
. We can also specify the format of the date to make sure it includes certain information. If we wish to show the year and month, we set date_labels = "%b %Y"
. This is a very common way to use the scale_x_date()
function.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
ggplot(aes(x = date, y = percent_change)) +
geom_line(show.legend = F,
color = "steelblue") +
labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: FRED data") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(date_breaks = "18 months",
date_labels = "%b %Y") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
The x-axis looks good, but what if we want to emphasize that these are quarterly measurements? We can turn to the format.yearqtr()
function from zoo
and call scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"))
. This is quite an uncommon way to use the scale_x_date()
function - it only arises when we have quarterly data. But it’s a good example of how we can use a function inside of our call to scale_x_date()
. And if you ever need to format an x-axis in a quarterly format again, the code is here.
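For reference, here is what that formatter does to a single date:

zoo::format.yearqtr(as.Date("2020-07-01"), "Q%q %Y")
[1] "Q3 2020"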
Finally, we like to add theme_minimal()
to give things a more spare look and remove the gray background, but it needs to be added before our centering of the title.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
ggplot(aes(x = date, y = percent_change)) +
geom_line(show.legend = F,
color = "steelblue") +
labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: FRED data") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_yearqtr(format = "%YQ%q", n = 5) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
That was quite a bit of work to create a line chart - but the good news is when we want to toggle to a column chart, we change only one line of the code: replace geom_line()
with geom_col()
.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
ggplot(aes(x = date, y = percent_change)) +
# geom_line()
geom_col(#color = "green",
fill = "steelblue",
width = 50) +
labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 10)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5))
Let’s imagine we wish to color negative changes in GDP differently from positive changes. We insert a call to mutate(fill_column = percent_change > 0)
to create a column that is “TRUE” if percent_change
is greater than 0 and “FALSE” otherwise. We can then tell the fill
aesthetic to be conditional on that column.
We also specify better colors with scale_fill_manual( values = c("darkred", "darkgreen"))
. The manual
part of that function indicates we can manually choose whatever colors we wish.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
mutate(fill_column = percent_change > 0) %>%
ggplot(aes(x = date,
y = percent_change,
fill = fill_column)) +
# geom_line()
geom_col(width = 35) +
scale_fill_manual(
values = c("darkred", "darkgreen")
  ) +
  labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 10)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5))
Have a look at the legend, it says TRUE
and FALSE
but perhaps we wish to conserve that space for our chart and let the user interpret the colors on their own, so we can add show.legend = F
to geom_col()
.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
mutate(fill_column = percent_change > 0) %>%
ggplot(aes(x = date, y = percent_change, fill = fill_column)) +
# geom_line()
geom_col(width = 35,
show.legend = F) +
scale_fill_manual(
values = c("darkred", "darkgreen")
  ) +
  labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 10)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) )
We like bar charts for showing percentage changes but it’s fast to experiment with a dot plot. We replace geom_col()
with geom_point()
and replace fill
with color
in the aes()
function, and replace scale_fill_manual()
with scale_color_manual()
.
change_in_gdp_wrangled %>%
  filter(date > "2010-01-01") %>%
mutate(color_column = percent_change > 0) %>%
ggplot(aes(x = date, y = percent_change, color = color_column)) +
# geom_line()
geom_point(
show.legend = F
  ) +
  # not scale_fill_manual
  scale_color_manual(
values = c("darkred", "darkgreen")
  ) +
  labs(title = "Percent Change GDP from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 10)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) )
Visualizing More than One Time Series
We have been visualizing change in overall GDP, but let’s suppose we wish to see how different components of GDP have changed on a quarterly basis. For that, we go back and re-wrangle the data we imported directly from the BEA. Let’s take a peek at the various GDP accounts.
read_excel(destfile,
sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
distinct(account)
# A tibble: 24 × 1
account
<chr>
1 Gross domestic product
2 Personal consumption expenditures
3 Goods
4 Durable goods
5 Nondurable goods
6 Services
7 Gross private domestic investment
8 Fixed investment
9 Nonresidential
10 Structures
# ℹ 14 more rows
Let’s filter()
to only the four top level components of GDP.
read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  pivot_longer(-account:-code,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
filter(account %in% c("Personal consumption expenditures",
"Exports",
"Imports",
"Gross private domestic investment",
"Government consumption expenditures and gross investment" )
  ) %>%
  arrange(desc(date))
# A tibble: 1,545 × 4
account code date percent_change
<chr> <chr> <date> <dbl>
1 Personal consumption expenditures DPCE… 2024-04-01 2.9
2 Gross private domestic investment A006… 2024-04-01 7.5
3 Exports A020… 2024-04-01 1.6
4 Imports A021… 2024-04-01 7
5 Government consumption expenditures and gros… A822… 2024-04-01 2.7
6 Personal consumption expenditures DPCE… 2024-01-01 1.5
7 Gross private domestic investment A006… 2024-01-01 4.4
8 Exports A020… 2024-01-01 1.6
9 Imports A021… 2024-01-01 6.1
10 Government consumption expenditures and gros… A822… 2024-01-01 1.8
# ℹ 1,535 more rows
Before we visualize, let’s save our new data as a data frame, clean up the column names with the very useful case_when()
and check our results with slice()
.
gdp_components <-
  read_excel(destfile, sheet = "T10101-Q",
skip = 7) %>%
select(-1) %>%
rename(account = 1, code = 2) %>%
mutate(
across(c(-account, -code), as.numeric)
  ) %>%
  pivot_longer(-account:-code,
names_to = "date",
values_to = "percent_change") %>%
mutate(date = parse_date_time(date, orders = "Yq"),
date = ymd(date)) %>%
filter(account %in% c("Personal consumption expenditures",
"Exports",
"Imports",
"Gross private domestic investment",
"Government consumption expenditures and gross investment")
  ) %>%
  arrange(desc(date)) %>%
relocate(date, everything()) %>%
mutate(
account =
case_when(
str_detect(account, "Personal") ~ "PCE",
str_detect(account, "Government") ~ "Government",
str_detect(account, "investment") ~ "Investment",
        T ~ account
      )
  ) %>%
  group_by(account) %>%
select(-code)
gdp_components %>%
  slice(1, n())
# A tibble: 10 × 3
# Groups: account [5]
date account percent_change
<date> <chr> <dbl>
1 2024-04-01 Exports 1.6
2 1947-04-01 Exports -4.9
3 2024-04-01 Government 2.7
4 1947-04-01 Government -0.3
5 2024-04-01 Imports 7
6 1947-04-01 Imports 8.1
7 2024-04-01 Investment 7.5
8 1947-04-01 Investment -27.3
9 2024-04-01 PCE 2.9
10 1947-04-01 PCE 6.8
Once our data is clean and organized, we are ready to start visualizing again.
Let’s start with a line chart that includes more than one time series.
gdp_components %>%
  filter(date >= "2006-01-01") %>%
ggplot(aes(x = date, y = percent_change, color = account)) +
geom_line() +
labs(title = "Percent Change Components from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(scale = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 5)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) )
Let’s toggle to a stacked bar chart with geom_col()
.
gdp_components %>%
  filter(date >= "2006-01-01") %>%
ggplot(aes(x = date, y = percent_change, fill = account)) +
geom_col() +
labs(title = "Percent Change Components from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(scale = 1),
breaks = pretty_breaks(n = 10)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 5)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) )
The stacked chart is misleading in this case, because the sum of the component percent changes does not equal the total percent change in GDP, due to different weightings. To account for this, we can break out each component to its own chart with facet_wrap()
.
gdp_components %>%
  filter(date >= "2006-01-01") %>%
ggplot(aes(x = date, y = percent_change, fill = account)) +
geom_col(
show.legend = F
  ) +
  labs(title = "Percent Change Components from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(scale = 1),
breaks = pretty_breaks(n = 5)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 5)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) ) +
facet_wrap(~account,
ncol = 2)
That faceted chart defaults to using a common scale for all the components, but let's observe how different the charts look when each component has its own scale. We do this by setting scales = "free_y"
inside facet_wrap()
.
gdp_components %>%
  filter(date >= "2013-01-01") %>%
ggplot(aes(x = date, y = percent_change, fill = account)) +
geom_col(
show.legend = F
  ) +
  labs(title = "Percent Change Components from Previous Quarter",
y = "",
x = "",
caption = "Source: BEA data",
fill = "") +
scale_y_continuous(labels = percent_format(scale = 1)) +
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
breaks = pretty_breaks(n = 5)
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = .5) ) +
facet_wrap(~account,
ncol = 2,
scales = "free_y")
Visualization is an important component of macroeconomic work, and so is interpreting visualizations. There are a lot of tools that will automatically create charts for us, but we find that creating our own charts lets us build more original interpretations, and gives us the freedom to work outside of automated tooling. In the next section, we will take on a more complicated use case, but the end result can be applied broadly to other macro situations.
Recession Use Case and Building our Own Function
Before we conclude our chapter on GDP, let’s create a function for adding recession shading to our charts. This involves building a custom function for use in our `ggplot` code flows and is a bit more involved than the previous section. The code to create this function is an interesting exercise, but we also wish to emphasize the importance of writing our own functions in R.
If we find ourselves copy-pasting complicated code more than a few times, we try to convert that code into a function. This recession shading is a great example because we add it to many charts, but the underlying code doesn’t change.
For a preview, our goal is to create the following function:
recession_shade_fun <- function(color_chosen = "darkgray",
                                start_date = "1979-01-01"){
  "USREC" %>%
    tq_get(get = "economic.data", from = start_date) %>%
    select(date, recession_flag = price) %>%
    mutate(recession_label =
             case_when(recession_flag == 1 &
                         lag(recession_flag) == 0 ~ str_glue("{year(date)} recession"),
                       TRUE ~ NA_character_)) %>%
    filter(recession_flag == 1) %>%
    fill(recession_label, .direction = "down") %>%
    group_by(recession_label) %>%
    slice(1, n()) %>%
    select(-recession_flag) %>%
    mutate(start = case_when(date == min(date) ~ date,
                             TRUE ~ NA_Date_),
           end = case_when(date == max(date) ~ date,
                           TRUE ~ NA_Date_),
           end = lead(end)) %>%
    filter(!is.na(start)) %>%
    select(-date) %>%
    geom_rect(data = .,
              inherit.aes = FALSE,
              aes(xmin = start,
                  xmax = end,
                  ymin = -Inf,
                  ymax = +Inf),
              fill = color_chosen,
              alpha = 0.4)
}
We can see why it is useful to turn that into a function: it involves a lot of code that we don’t want to copy and paste each time we want recession shading on a chart.
Here is how we can use that function: we simply include it at the end of our `ggplot()` code flow. If you wish to use that function and don’t really care about how to build it, you can copy the code above into your `setup` code chunk, then add it to any `ggplot()` code.
change_in_gdp_wrangled %>%
  ggplot(aes(x = date, y = percent_change)) +
  geom_point(color = "purple") +
  labs(title = "Percent Change GDP from Previous Quarter",
       y = "",
       x = "",
       caption = "Source: FRED data") +
  scale_y_continuous(labels = percent_format(accuracy = 1),
                     breaks = pretty_breaks(n = 10)) +
  scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
               breaks = pretty_breaks(n = 7)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5)) +
  recession_shade_fun(color_chosen = "pink")
For the bold who have stuck with us and wish to go through the steps of building that function, let’s dig into the recession shader. Here are the first four steps we need to encode:
1) Import the `USREC` data set from FRED. This data set flags recessionary periods based on official NBER data.
2) `rename()` the price column as `recession_flag`. This column holds a 1 if the economy was in recession and a 0 otherwise.
3) Use `case_when()` to add a `recession_label` column that glues together the year a recession started and the word "recession". We only want to add that label to the first month of each recession.
4) `filter()` down to periods where `recession_flag == 1`.
"USREC" %>%
tq_get(get = "economic.data", from = "1979-01-01") %>%
select(date, recession_flag = price) %>%
mutate(recession_label =
case_when(
== 1 & lag(recession_flag == 0) ~
recession_flag str_glue("{year(date)} recession"),
~ NA_character_)
T %>%
) filter(recession_flag == 1) %>%
head(3)
# A tibble: 3 × 3
date recession_flag recession_label
<date> <int> <glue>
1 1980-02-01 1 1980 recession
2 1980-03-01 1 <NA>
3 1980-04-01 1 <NA>
Next, things get a little more complicated. We need to create columns that tell us when a recession began and when it ended. Here are the steps:
1) `fill()` the NAs with the recession label.
2) `slice()` just the first and last date of each recession.
3) Use `mutate()` to create a `start` column and an `end` column to hold the start and end dates. We use `case_when()` to grab the `min(date)` and `max(date)` of each recession to populate the columns.
The result is a table identifying our recessions and their start and end dates.
"USREC" %>%
tq_get(get = "economic.data", from = "1979-01-01") %>%
select(date, recession_flag = price) %>%
mutate(recession_label = case_when(recession_flag == 1 &
lag(recession_flag == 0) ~
str_glue("{year(date)} recession"),
TRUE ~ NA_character_)) %>%
filter(recession_flag == 1) %>%
fill(recession_label, .direction = "down") %>%
group_by(recession_label) %>%
slice(1, n()) %>%
select(-recession_flag) %>%
mutate(start = case_when(date == min(date) ~ date,
TRUE ~ NA_Date_),
end = case_when(date == max(date) ~ date,
TRUE ~ NA_Date_),
end = lead(end)
%>%
) filter(!is.na(start)) %>%
select(-date)
# A tibble: 6 × 3
# Groups: recession_label [6]
recession_label start end
<glue> <date> <date>
1 1980 recession 1980-02-01 1980-07-01
2 1981 recession 1981-08-01 1982-11-01
3 1990 recession 1990-08-01 1991-03-01
4 2001 recession 2001-04-01 2001-11-01
5 2008 recession 2008-01-01 2009-06-01
6 2020 recession 2020-03-01 2020-04-01
Next we use that table to create shaded rectangles for use with `ggplot`. Those recession regions are `geom_rect()` objects: rectangular geoms. We also wrap the entire flow into a function, so we can choose how far back to search for recessions and which color to shade on our charts. That means we create a function that accepts two arguments: `start_date` and `color_chosen`.
recession_shade_fun <- function(color_chosen = "darkgray",
                                start_date = "1979-01-01"){
  "USREC" %>%
    tq_get(get = "economic.data", from = start_date) %>%
    select(date, recession_flag = price) %>%
    mutate(recession_label = case_when(recession_flag == 1 &
                                         lag(recession_flag) == 0 ~
                                         str_glue("{year(date)} recession"),
                                       TRUE ~ NA_character_)) %>%
    filter(recession_flag == 1) %>%
    fill(recession_label, .direction = "down") %>%
    group_by(recession_label) %>%
    slice(1, n()) %>%
    select(-recession_flag) %>%
    mutate(start = case_when(date == min(date) ~ date,
                             TRUE ~ NA_Date_),
           end = case_when(date == max(date) ~ date,
                           TRUE ~ NA_Date_),
           end = lead(end)) %>%
    filter(!is.na(start)) %>%
    select(-date) %>%
    geom_rect(data = .,
              inherit.aes = FALSE,
              aes(xmin = start,
                  xmax = end,
                  ymin = -Inf,
                  ymax = +Inf),
              fill = color_chosen,
              alpha = 0.4)
}
And we’re done! We can add recession-shaded regions to any `ggplot()` object by including this function in the code flow. Here’s an example where we want pink shaded regions instead of gray.
change_in_gdp_wrangled %>%
  ggplot(aes(x = date, y = percent_change)) +
  geom_point(color = "purple") +
  labs(title = "Percent Change GDP from Previous Quarter",
       y = "",
       x = "",
       caption = "Source: FRED data") +
  scale_y_continuous(labels = percent_format(accuracy = 1),
                     breaks = pretty_breaks(n = 10)) +
  scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"),
               breaks = pretty_breaks(n = 7)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5)) +
  recession_shade_fun(color_chosen = "pink")
That was quite a bit of work, but now we have that function in our toolkit and don’t need to revisit that code again.
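One convenient habit (our suggestion here, not something the chapter requires): save the function definition to its own script, say `recession_shade_fun.R` (a hypothetical file name), and `source()` it in any analysis that needs recession shading.

# Load the saved recession shader instead of copy-pasting its definition.
# Assumes the function above was saved to "recession_shade_fun.R".
source("recession_shade_fun.R")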
Realized vs. Expected GDP
GDP is a crucial macroeconomic indicator, but in financial markets expected GDP may be even more important. Market participants make decisions today based on where they believe the economy will be tomorrow: they invest according to their current expectations and then may revise their portfolios when the actual (or realized) data is released. How can we estimate expected GDP? There are typically two ways: we can survey economic experts and directly ask what they expect, or we can use a mathematical model that produces a (hopefully good) forecast. Fortunately for us, we can easily access both types of expected GDP.
Since 1968, the Federal Reserve Bank of Philadelphia has conducted a quarterly survey of macroeconomic forecasts, commonly called the Survey of Professional Forecasters (SPF) since all participants are market professionals. The results are posted on the Philadelphia Fed’s website (https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/survey-of-professional-forecasters) as easily accessible Excel spreadsheets.
In a more recent effort, the Federal Reserve Bank of Atlanta has created a mathematical model to forecast GDP that is updated in real time as other economic indicators are released. The Atlanta Fed calls its estimate GDPNow, since it incorporates the most up-to-date economic information. GDPNow is a purely mathematical model and as such is a great complement to the SPF; more information about its construction can be found on the Atlanta Fed’s website (https://www.atlantafed.org/cqer/research/gdpnow).
Before we move forward and pull data for the SPF and GDPNow, we have to make an important point about how GDP information is released. The BEA releases quarterly GDP data roughly one month after the quarter ends (for the first quarter of the year, the BEA would release GDP data in April), but this is called the advance estimate because some of the GDP components, such as imports and exports, have not yet been fully collected and are instead estimated. The advance estimate goes through two monthly revisions (May and June in our first-quarter example), called the second and third estimates, and is not considered final until a fourth monthly revision (July). Hence, the first estimate of GDP released to the market can go through multiple revisions that may raise or lower its value, and FRED holds only the GDP figure after its final (or most recent) revision. The number market participants saw at the first release may therefore be quite different from the one we have in FRED, an important point to keep in mind if we are trying to gauge the market reaction to a particular GDP release.
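If we did want the numbers as originally published, FRED’s archival counterpart ALFRED stores earlier vintages, and `fredr` exposes them through its `realtime_start` and `realtime_end` arguments. Here is a minimal sketch, assuming a FRED API key is set and that the advance estimate for 2020 Q2 was published on 2020-07-30:

# Sketch: retrieve real GDP for 2020 Q2 as it looked on the day of the
# advance release, versus the fully revised figure FRED serves today.
# Assumes fredr_set_key() has been called with a valid FRED API key.
library(fredr)

# the vintage available on the advance release date (assumed 2020-07-30)
fredr(
  series_id = "GDPC1",
  observation_start = as.Date("2020-04-01"),
  observation_end = as.Date("2020-04-01"),
  realtime_start = as.Date("2020-07-30"),
  realtime_end = as.Date("2020-07-30")
)

# the same quarter after all subsequent revisions
fredr(
  series_id = "GDPC1",
  observation_start = as.Date("2020-04-01"),
  observation_end = as.Date("2020-04-01")
)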
We will pull the GDPNow estimate from the Atlanta Fed website, which makes it available as an Excel file that also contains the advance estimate of realized GDP and the exact date it was released to the market. The Excel file lives at https://www.atlantafed.org/-/media/documents/cqer/researchcq/gdpnow/GDPTrackingModelDataAndForecasts.xlsx; let’s download it to a local file.
url <-
  "https://www.atlantafed.org/-/media/documents/cqer/researchcq/gdpnow/GDPTrackingModelDataAndForecasts.xlsx"

destfile <- "GDPTrackingModelDataAndForecasts.xlsx"

curl::curl_download(url, destfile)
Next we use `read_excel()` and specify the sheet as “TrackRecord”. This gives us data on the GDPNow estimate versus what was actually reported, the difference being the “surprise”.
gdp_now <-
  read_excel(destfile, sheet = "TrackRecord") %>%
  clean_names() %>%
  select(1:4) %>%
  mutate(forecast_period = ymd(quarter_being_forecasted),
         gdp_now = as.numeric(model_forecast_right_before_be_as_advance_estimate),
         gdp_advance = be_as_advance_estimate,
         release_date = ymd(release_date)) %>%
  select(forecast_period, gdp_now, gdp_advance, release_date)

gdp_now %>%
  head(5)
# A tibble: 5 × 4
forecast_period gdp_now gdp_advance release_date
<date> <dbl> <dbl> <date>
1 2024-06-30 2.61 2.8 2024-07-25
2 2024-03-31 2.70 1.59 2024-04-25
3 2023-12-31 2.38 3.3 2024-01-25
4 2023-09-30 5.41 4.88 2023-10-26
5 2023-06-30 2.41 2.41 2023-07-27
Here we have the period for which the forecast is made, the GDPNow forecast, the realized GDP advance estimate, and the day it was released to the market.
Let’s now pull the median forecast for GDP growth from the Philadelphia Fed’s SPF. The Excel file for this forecast is https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/survey-of-professional-forecasters/data-files/files/median_rgdp_growth.xlsx.
spf_url <-
  "https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/survey-of-professional-forecasters/data-files/files/median_rgdp_growth.xlsx"

# use a distinct file name so we don't overwrite the GDPNow download
spf_destfile <- "median_rgdp_growth.xlsx"

curl::curl_download(spf_url, spf_destfile)
The column strangely named `DRGDP2` represents the median forecast for GDP growth for the combination of `YEAR` and `QUARTER`. We need to construct a proper YYYY-MM-DD date variable that represents the end of each `YEAR` and `QUARTER` combination, and in addition give `DRGDP2` a proper name. The snippet below achieves that by transforming the `YEAR` and `QUARTER` variables to strings, concatenating them, and then adding an offset so the date falls on the last day of the quarter (otherwise it would be the first day of the quarter).
gdp_spf <-
  read_excel(spf_destfile) %>%
  clean_names() %>%
  mutate(
    forecast_period = str_glue("Q{quarter}/{year}"),
    forecast_period = as.Date(as.yearqtr(forecast_period, format = "Q%q/%Y")) + months(3) - days(1)
  ) %>%
  rename("gdp_philly" = "drgdp2") %>%
  select(forecast_period, gdp_philly)
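As a quick sanity check on that offset arithmetic (zoo and lubridate are already attached via our setup), a single quarter works out as expected:

# "Q2/2024" parses to the first day of the quarter, 2024-04-01; adding
# three months and subtracting one day lands on its last day, 2024-06-30.
as.Date(as.yearqtr("Q2/2024", format = "Q%q/%Y")) + months(3) - days(1)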
Finally, we join our GDPNow data with our SPF data.

gdp_now_spf <-
  gdp_now %>%
  left_join(gdp_spf, by = "forecast_period")
Let’s now make a plot where we can observe their time series behavior. We will first make the data long (recall that this is how `ggplot` prefers data) and then chart the series beginning in 2000.
gdp_now_spf %>%
  select(-release_date) %>%
  filter(forecast_period >= "2000-01-01") %>%
  pivot_longer(-forecast_period) %>%
  ggplot(aes(x = forecast_period,
             y = value,
             color = name)) +
  geom_line() +
  theme_minimal() +
  scale_y_continuous(labels = percent_format(scale = 1),
                     breaks = pretty_breaks(10)) +
  scale_x_date(
    date_breaks = "1 year",
    date_labels = "%Y"
  ) +
  theme(legend.position = c(.9, .8),
        plot.title = element_text(hjust = .5)) +
  labs(
    title = "GDP Forecasts",
    y = "",
    x = "",
    color = ""
  )
Let’s also take a quick look at the correlation between the realized GDP and its two expectations.

gdp_now_spf %>%
  corrr::correlate(diagonal = NA)
# A tibble: 3 × 4
term gdp_now gdp_advance gdp_philly
<chr> <dbl> <dbl> <dbl>
1 gdp_now NA 0.987 0.909
2 gdp_advance 0.987 NA 0.931
3 gdp_philly 0.909 0.931 NA
The two forecasts track the realized value of GDP quite closely, reassuring us that they are good measures of expectations. Nevertheless, the GDPNow estimate tracks the realized GDP more closely. The SPF estimate is just slightly off: it has not tracked realized GDP as well since the Covid-19 shock of 2020, whereas the GDPNow expectation stays very close to the realized value. We will return to this difference between expected and realized GDP growth when we analyze stock market returns and how they react to GDP releases that are surprising, that is, releases where the expected and realized values differ significantly.
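As a small preview of that analysis, here is one hedged way we might rank the most surprising releases, using only the columns we already built:

# Define the surprise as the advance estimate minus the GDPNow forecast,
# then inspect the quarters where the model missed by the most.
gdp_now_spf %>%
  mutate(surprise = gdp_advance - gdp_now) %>%
  arrange(desc(abs(surprise))) %>%
  select(forecast_period, gdp_now, gdp_advance, surprise) %>%
  head(5)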
International Comparison
So far we have looked at GDP for only one country, the United States, but this macroeconomic aggregate is fundamental for any economy. In principle, we can replicate the previous workflow for any country, as long as we can find the data.
Lucky for us, international organizations such as the World Bank, the International Monetary Fund, and the Bank for International Settlements collect a lot of information about the world economy and financial markets. Here we will access the World Bank’s World Development Indicators (WDI) to pull GDP growth data for international economies. We could access the WDI by visiting its website (https://databank.worldbank.org/home.aspx), but fortunately there is an R package called `WDI` that makes it seamless to pull World Bank data.
The WDI contains a variety of indicators, each with a unique ID, a mnemonic that rarely makes sense. We can navigate to the WDI website, search its toolbar for the series we need, and find the ID in the series details. The mnemonic for annual GDP growth is NY.GDP.MKTP.KD.ZG; we’ll pull this series for the United States, China, and the whole world. In the code snippet below, we first load the package that pulls data from the WDI and then download annual GDP growth from 1980 to 2022; note that we provide a list of three countries according to their two-digit ISO codes, where 1W is the code for the whole world.
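As an aside, if we would rather not hunt through the website, the `WDI` package also ships a `WDIsearch()` helper that queries the indicator catalog by keyword; a quick sketch:

# Search the WDI catalog for candidate GDP growth series by name.
library(WDI)
WDIsearch("gdp growth") %>% head()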
library(WDI)

us_china_world_gdp_growth <-
  WDI(
    country = c("US", "CN", "1W"),
    indicator = 'NY.GDP.MKTP.KD.ZG',
    start = 1980,
    end = 2022
  ) %>%
  mutate(date = str_glue("{year}-01-01") %>% ymd()) %>%
  select(country, date, gdp_growth = `NY.GDP.MKTP.KD.ZG`)
The data we pull is already in long format, and porting it to `ggplot()` to make a chart is quite smooth.
us_china_world_gdp_growth %>%
  filter(date >= "1971-01-01") %>%
  ggplot(aes(x = date,
             y = gdp_growth,
             color = country)) +
  geom_line() +
  scale_y_continuous(labels = percent_format(scale = 1),
                     breaks = pretty_breaks(10)) +
  scale_x_date(
    date_breaks = "5 years",
    date_labels = "%Y"
  ) +
  theme_minimal() +
  labs(x = "",
       y = "",
       title = "Annual GDP Growth",
       color = "") +
  theme(legend.position = c(.9, .9),
        plot.title = element_text(hjust = .5))
The U.S. growth rate tracks the world’s rate closely (a little faster in the 1990s and a little slower in the 2010s), which makes sense since the U.S. is the largest economy in the world. China, on the other hand, grew much faster from the early 1990s to the late 2010s, even reaching double-digit GDP growth in several years. Since the global financial crisis, however, its growth rate has decelerated noticeably.
Important market commentators, such as Michael Pettis and Adam S. Posen, believe that imbalances in the Chinese economy represent an important challenge for its future growth.3 In their view, a serious structural challenge facing China is that its economic model is based on fixed capital investment (think buildings, roads, bridges, and so on). As a percentage of GDP, China invests far more than the U.S. or the average country; for most advanced economies, consumption is the main driver of GDP, but not for China. The challenge is that over time fixed capital investment may suffer from decreasing marginal returns (i.e., bridges to nowhere), which severely impacts overall growth.
A proper analysis of the Chinese macroeconomy is well outside the scope of this book, but let’s make some quick charts along the lines suggested by Pettis and Posen, using data from the WDI. We will pull fixed capital investment as a percentage of GDP (series ID NE.GDI.FTOT.ZS) for China, the U.S., and the world, and for completeness we will also grab household consumption as a percentage of GDP (series ID NE.CON.PRVT.ZS) and chart both over the years. The first chart below makes the point that Pettis and others emphasize: China’s fixed capital investment, as a percent of GDP, is about twice the U.S. and world rates, and this imbalance seems to be growing over time.
us_china_world_gdp_components <-
  WDI(
    country = c("US", "CN", "1W"),
    indicator = c('NE.GDI.FTOT.ZS', 'NE.CON.PRVT.ZS')
  ) %>%
  rename("consumption_percent" = "NE.CON.PRVT.ZS",
         "fixed_capital_percent" = "NE.GDI.FTOT.ZS") %>%
  mutate(date = str_glue("{year}-01-01") %>% ymd()) %>%
  select(country, date, contains("percent"))
us_china_world_gdp_components %>%
  filter(date >= "1971-01-01") %>%
  ggplot(aes(x = date,
             y = fixed_capital_percent,
             color = country)) +
  geom_line() +
  scale_y_continuous(labels = percent_format(scale = 1),
                     breaks = pretty_breaks(10)) +
  scale_x_date(
    date_breaks = "5 years",
    date_labels = "%Y"
  ) +
  theme_minimal() +
  labs(x = "",
       y = "",
       title = "Fixed Investment Contribution to GDP",
       color = "") +
  theme(legend.position = c(.2, .8))
us_china_world_gdp_components %>%
  filter(date >= "1971-01-01") %>%
  ggplot(aes(x = date,
             y = consumption_percent,
             color = country)) +
  geom_line() +
  scale_y_continuous(labels = percent_format(scale = 1),
                     breaks = pretty_breaks(10)) +
  scale_x_date(
    date_breaks = "5 years",
    date_labels = "%Y"
  ) +
  theme_minimal() +
  labs(x = "",
       y = "",
       title = "Consumption Contribution to GDP",
       color = "") +
  theme(legend.position = c(.2, .2))
Here we observe that household consumption in the U.S. comprises around two-thirds of GDP and seems to be growing in importance over time, whereas in China it is close to one-third and shrinking over time. We won’t delve further into international data, but the `WDI` package makes it easy to run this same analysis across different countries and observe where GDP trends are headed.
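For instance, pulling one more economy is a one-line change to the country vector; here we add India (ISO code "IN") purely as an illustration:

# Same growth series, one more country; all the wrangling and charting
# code from above carries over unchanged.
WDI(
  country = c("US", "CN", "IN", "1W"),
  indicator = 'NY.GDP.MKTP.KD.ZG'
) %>%
  mutate(date = str_glue("{year}-01-01") %>% ymd()) %>%
  select(country, date, gdp_growth = `NY.GDP.MKTP.KD.ZG`) %>%
  head()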
Conclusion
That concludes our introduction to GDP data: importing from FRED, wrangling, and data visualization. The amazing thing to us is that FRED hosts more than 800,000 data series, and we can use the exact code flow we just reviewed to work with the majority of them: `tq_get()` + data wrangling + `ggplot2` opens up a huge world to us. We will revisit FRED data many more times, but for now let’s move to a different source and type of data and dig into the stock market.
Footnotes
For a primer on GDP, see https://www.bea.gov/resources/methodologies/nipa-handbook.
More background can be found in Michael Pettis’s blog post at https://carnegieendowment.org/chinafinancialmarkets/89466 and in the article “The End of China’s Economic Miracle” by Adam S. Posen in the September 2023 edition of Foreign Affairs.