── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
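5.1.1 Importing a data set
Data frame: each column corresponds to a variable, and each row corresponds to one record.
The example data set describes 38 different car models and comes from the US Environmental Protection Agency.
displ: engine displacement, in litres.
hwy: fuel efficiency on the highway, in miles per gallon (mpg).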
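As a quick check of the "columns are variables, rows are records" idea, the sketch below pulls out the two variables described above and asks for the shape of the table. It assumes only that ggplot2 is installed, since that package ships the mpg data set.

```r
library(ggplot2)  # provides the mpg data set

# Columns are variables: keep only the two described above.
head(mpg[, c("displ", "hwy")], 3)

# Rows are records: dim() returns the number of rows, then columns.
dim(mpg)

# The variable names, one per column.
names(mpg)
```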
mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
# Use '?mpg' for more information about this data set.
# N.B. Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.
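A grouped summary on the mpg data might look like the sketch below. The choice of grouping by class and averaging hwy is only an illustration, and the names by_class, hwy_by_class and mean_hwy are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Step 1: mark the grouping variable (one group per vehicle class).
by_class <- group_by(mpg, class)

# Step 2: collapse each group to a single row of summary statistics.
hwy_by_class <- summarise(
  by_class,
  count = n(),                        # number of records in the group
  mean_hwy = mean(hwy, na.rm = TRUE)  # average highway mpg in the group
)

hwy_by_class
```

The result has one row per class, so the count column sums to the number of rows in mpg.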
5.3 Exploratory data analysis
5.3.1 Asking "good" questions
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
library(nycflights13)  # provides the flights data set

# Group flights by destination, then summarise each group.
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
# Drop destinations with few flights, and Honolulu (HNL).
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'