数据科学导论:理论、方法与R语言实现

1 背景

  • “大数据”时代:海量; 高维 (\(p \gg n\))。

  • 数字经济时代:数据成为生产要素; 《数字经济时代及其核心产业统计分类(2021)》(国家统计局令 第33号

  • 数据科学时代:现代统计推断\(\longrightarrow\) 高性能计算机时代:统计计算、统计学习 \(\longrightarrow\) 数据科学(Donoho 2017)

  • 从数据分析到数据科学:“数据科学”命名者,数据科学的主要奠基者——世界著名统计学家John W. Tukey (1915-2000). 代表贡献:探索性数据分析 (EDA), 快速傅里叶变换 (FFT), 箱线图 (box-plot).

2 数据科学工作流程

图片来源:数据科学工作流程 by Dr. Hadley Wickham
  1. 数据准备: 导入与整理 (数据清洗)

  2. 数据理解(数据分析)–“循环迭代”

    • 数据变换 (Data Transformation)

    • 数据可视化 (Data Visualization)

    • 数据建模 (Data Modeling)

  3. 信息交流: 数据科学产品, e.g., 分析报告, 交互式应用程序等。

3 理论与方法

3.1 理论基础

  • 概率论:极限定理

  • 统计学理论:参数估计与假设检验

3.2 方法

  • 应用统计模型:线性模型, 时间序列模型, 等等。

    说明:许多参考书的主题词不同,比如,应用统计方法、应用统计分析、统计建模、预测建模等,但涵盖的专题基本相同。

  • 统计计算:计算机时代的统计推断,包含优化、模拟、密度估计与平滑、交叉验证等。

    说明:统计学习/机器学习中的”算法”(algorithm),主要是指统计计算方法。

4 方法实现

4.1 实现工具

  • 编程语言: R

  • 集成开发环境 (IDE) : RStudio

  • 开源工具平台: POSIT

4.2 程序包

  1. 数据整理: tidyr

  2. 数据分析

  3. 展示交流: Quarto

5 启发式例题

注: 改编自Dr. Hadley Wickham著作 R for Data Science 第一章。

5.1 数据可视化

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

举例:拿破仑的俄罗斯远征——“有史以来最好的统计图形”

# install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

5.1.1 导入数据集 (data set)

数据框 (data frame): 每一列对应一个变量, 每一个行对应一个记录。

示例数据集是38种不同型号车辆的参数信息,来源于美国环保局。

  1. displ, 表示发动机排量, 单位是升.

  2. hwy, 表示在高速公路上的燃油效率, 以”英里/加仑” (mpg)为单位.

mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
# Use '?mpg' for more information of this data set. 

5.1.2 使用基础作图模板

ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) 

# Comments: negative relationship, that is, cars with big engines use more fuel
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# geometrical object
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy)) # loess: local polynomial regression smoother.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

5.1.3 层次化绘图

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

5.2 数据变换

library(nycflights13)
library(tidyverse)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# ?flights
# On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

5.2.1 取子集:使用filter()选择特定行

例如,选取1月1日的航班信息:

filter(flights, month == 1, day == 1) # for particular observation value
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

又如,选取11月或12月的航班信息:

filter(flights, month == 11 | month == 12)
# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# filter(flights, month %in% c(11, 12)) # equivalently

注:利用德摩根律,保证运算逻辑正确:!(x & y) 等价于 !x | !y, !(x | y) 等价于 !x & !y

5.2.2 取子集:使用select()选择特定列

# Select columns by name
select(flights, year, month, day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

5.2.3 总结信息

summarise()group_by() 搭配使用获取信息

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6
# get the average delay per date
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) # group mean
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# ℹ 355 more rows
# N.B. Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. 

5.3 探索性数据分析

5.3.1 提出”好”问题

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

5.3.2 关注点在于:

  1. 分布: 散点图,直方图 geom_histogram

  2. 典型数值特征:信息总结

  3. 异常值

  4. 缺失值:计算均值时需要 na.rm

  5. 模式与模型:进一步研究的方向

假设现在我们对航班延误数据好奇并提出问题:对于每一个目的地来说,航班的飞行距离和平均延误时间有没有什么关系,如果有那么关系是怎样的?

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

参考文献

Donoho, David. 2017. “50 Years of Data Science.” Journal of Computational and Graphical Statistics 26 (4): 745–66. https://doi.org/10.1080/10618600.2017.1384734.