In this post, we will explore some date and time parsing. As an example, we will work with weather data provided by City of Nuremberg, Environmental and Meteorological Data.

We will need these packages:

library(tidyverse)  # data reading and wrangling
library(lubridate)  # working with dates/times

First, let’s import some precipitation data:

file_name <- "~/Downloads/export-sun-nuremberg--flugfeld--airport--precipitation-data--1-hour--individuell.csv"
rain <- read_csv2(file_name,
                 skip = 13,
                 col_names = FALSE)
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 300 parsing failures.
## row # A tibble: 5 x 5 col     row   col expected actual expected   <int> <chr>    <chr>  <chr> actual 1  1643    X2 a double      - file 2  1644    X2 a double      - row 3  1645    X2 a double      - col 4  1646    X2 a double      - expected 5  1647    X2 a double      - actual # ... with 1 more variables: file <chr>
## ... ................. ... ............................. ........ ............................. ...... ............................. .... ............................. ... ............................. ... ............................. ........ ............................. ...... .......................................
## See problems(...) for more details.
colnames(rain) <- c("date_time", "precip")

As there was some strange, non-UTF8 character in line 12, we just skipped this line. As this was the row with the column names, we informed read_csv that there are no col_names (using col_names = FALSE). Also, some missing data occured. Luckily, readr takes care (despite of a lot of warning output).

Unfortunately, R has not yet reckognized that the first column consists of a date/time object. So let’s tell her:

rain$dt <- parse_date_time(rain$date_time, orders="dmy HM")

The parameter order tell R, that the format of date_time is day month year (hence dmy), followed by hour and minute (hence HM). Now let’s spread the date/time to several columns:

rain %>%
  select(-date_time) %>%
  mutate(day = day(dt),
         month = month(dt),
         year = year(dt),
         hour = hour(dt)) -> rain

Let’s give each day a number, first day in record is #1, second day #2, and so on.

rain$date <- date(rain$dt)

first_day <- min(rain$date)
last_day <- max(rain$date)

rain$day_ID <- rain$date - min(rain$date)

OK, now let’s look at the precipitation in Nuremberg.

How many dry days (no rain) have there been in 2016?

rain %>%
  group_by(day_ID) %>%
  summarise(precip_day_sum = sum(precip)) %>%
  filter(precip_day_sum == 0) %>%
  nrow -> n_dry_days

n_dry_days
## [1] 207

Oh, 207 out of 366 days no rain, ie., 0.55. BTW: It was a leap year, that’s why I put 366 days.

rain %>%
  group_by(day_ID) %>%
  slice(1) %>% nrow
## [1] 366

Easy enought, this little sweet function tells us whether a given year was a leap year or not.

leap_year(2016)
## [1] TRUE

Let’s visualize whether a day was dry or not. First, we compute a variable rain_today.

rain %>%
  group_by(year, month, day, day_ID) %>%
  summarise(rain_today = any(precip != 0)) %>%
  filter(!is.na(rain_today)) %>%
  group_by(month) %>%   
  summarise(prop_rain_days_per_month = sum(rain_today)/n()) %>%
  ggplot() +
  aes(x = month, y = prop_rain_days_per_month) %>%
  geom_col() +
  scale_x_continuous(breaks = 1:12)

plot of chunk unnamed-chunk-9

August and September then were quite dry, it appears. Let’s look at the amount of precipitation for comparison.

rain %>%
  group_by(month) %>%
  summarise(precip_mean_today = sum(precip, na.rm = T)) %>%
  ggplot(aes(x = month, y = precip_mean_today)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Total precipitation in liter per square meter")

plot of chunk unnamed-chunk-10

For the fun of it, let’s prepare a more nuanced picture.

rain %>%
  ggplot +
  aes(x = date, y = hour, fill = precip) +
  geom_tile() +
  scale_fill_distiller(palette = "Spectral") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

plot of chunk unnamed-chunk-11

Hey, in June there must have been some heavy raining. Let’s zoom in.

rain %>%
  filter(month == 6) %>%
  ggplot +
  aes(x = date, y = hour, fill = precip) +
  geom_tile() +
  scale_fill_distiller(palette = "Spectral") +
  scale_x_date(date_breaks = "1 day", date_labels = "%d") +
  scale_y_continuous(breaks = c(4, 8, 12, 16, 20, 24))

plot of chunk unnamed-chunk-12

Most of the days in June were dry, but on one day, heave rains occurred on the 12th or 13th in the late afternoon.

rain %>%
  filter(precip > 7)
## # A tibble: 4 x 8
##   precip                  dt   day month  year  hour       date   day_ID
##    <dbl>              <dttm> <int> <dbl> <dbl> <int>     <date>   <time>
## 1    7.4 2016-05-29 21:00:00    29     5  2016    21 2016-05-29 149 days
## 2    9.5 2016-06-12 17:00:00    12     6  2016    17 2016-06-12 163 days
## 3    8.7 2016-06-12 18:00:00    12     6  2016    18 2016-06-12 163 days
## 4    7.2 2016-07-13 15:00:00    13     7  2016    15 2016-07-13 194 days

On June, 12th between 17:00 and 19:00 CEDT, a lot of rain came down in Nuremburg.

Is there a preference for time in terms of precipitation? Maybe it rather rains in the late afternoon in general?

rain %>%
  group_by(hour) %>%
  summarise(precip_mean_per_hour = mean(precip, na.rm = T),
            precip_sd_per_hour = sd(precip, na.rm = T)) %>%
  ggplot +
  aes(x = hour, y = precip_mean_per_hour) +
  geom_line() +
  geom_errorbar(aes(ymin = precip_mean_per_hour - precip_sd_per_hour,
                    ymax = precip_mean_per_hour + precip_sd_per_hour)) +
  geom_point(color = "tomato4")

plot of chunk unnamed-chunk-14

Hm, not much difference between the hours. But the variability seems to differ quite some bit. Would that picture change if we consider only days with precipation?

rain %>%
  filter(precip != 0) %>%
  group_by(hour) %>%
  summarise(precip_mean_per_hour = mean(precip, na.rm = T),
            precip_sd_per_hour = sd(precip, na.rm = T)) %>%
  ggplot +
  aes(x = hour, y = precip_mean_per_hour) +
  geom_line() +
  geom_errorbar(aes(ymin = precip_mean_per_hour - precip_sd_per_hour,
                    ymax = precip_mean_per_hour + precip_sd_per_hour)) +
  geom_point(color = "tomato4")

plot of chunk unnamed-chunk-15

Not too much difference in the overall pattern. But what about if we look at the median and IQR, instead of mean and sd?

rain %>%
  filter(precip != 0) %>%
  group_by(hour) %>%
  summarise(precip_median_per_hour = median(precip, na.rm = T),
            precip_iqr_per_hour = IQR(precip, na.rm = T)) %>%
  ggplot +
  aes(x = hour, y = precip_median_per_hour) +
  geom_line() +
  geom_errorbar(aes(ymin = precip_median_per_hour - precip_iqr_per_hour,
                    ymax = precip_median_per_hour + precip_iqr_per_hour),
                color = "grey60") +
  geom_point(color = "tomato4")

plot of chunk unnamed-chunk-16

That the pattern breaks down essentially. Not that much difference between the hours left. Maybe 10:00 is rather dry, but that may be sampling errror of this year.

Wie gut schätzt eine Stichprobe die Grundgesamtheit?

# DatenSie arbeiten bei der Flughafenaufsicht von NYC. Cooler Job.```rlibrary(nycflights13)data(flights)```## Pakete laden```rlibrary(mos...… Continue reading

Some thoughts on tidyveal and environments in R

Published on November 16, 2017

Yart - Yet Another Markdown Report Template

Published on November 15, 2017