
<!-- badges: start -->

[![Lifecycle:
stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![CRAN_Status](http://www.r-pkg.org/badges/version/CASIdata)](https://cran.r-project.org/package=CASIdata)
[![R-Universe](https://friendly.r-universe.dev/badges/CASIdata)](https://friendly.r-universe.dev)
[![Last
Commit](https://img.shields.io/github/last-commit/friendly/CASIdata)](https://github.com/friendly/CASIdata)
<!-- badges: end -->

# CASIdata <img src="man/figures/logo.jpg" style="float:right; height:180px;" alt="CASIdata logo"/>

CASIdata provides the datasets from Efron & Hastie (2016, ISBN:
9781108107952), *Computer Age Statistical Inference: Algorithms,
Evidence, and Data Science*, in an accessible R format for teaching,
study, or reproducing and extending analyses from the book. They were
downloaded from Trevor Hastie’s web site,
<https://hastie.su.domains/CASI_files/DATA/>, but quite a few files were
messy and required some processing to turn them into R datasets.

Even so, some of the datasets may require data cleaning, renaming of
variables, re-shaping or other tidying steps to be useful for analysis.
But that’s part of learning.

## Installation

This package is not yet on CRAN. You can install it from this
[GitHub](https://github.com/friendly/CASIdata/) repo or from
[R-universe](https://friendly.r-universe.dev/CASIdata):

``` r
# from GitHub (requires the remotes package)
remotes::install_github("friendly/CASIdata")
# or, from R-universe
install.packages('CASIdata', repos = c('https://friendly.r-universe.dev'))
```

## Datasets included here

| Dataset          | dim      | Title                                 |
|:-----------------|:---------|:--------------------------------------|
| DTI              | 15443x4  | DTI Brain Imaging Data                |
| als              | 1822x371 | ALS Data                              |
| baseball         | 18x3     | Baseball Batting Averages             |
| bivnorm          | 40x2     | Bivariate Normal Data                 |
| butterfly        | 24x2     | Butterfly Species Data                |
| cellinfusion     | 25x4     | Cell Infusion Data                    |
| cholesterol      | 164x2    | Cholesterol Data                      |
| diabetes         | 442x12   | Diabetes Data                         |
| doseresponse     | 11x2     | Dose Response Data                    |
| galaxy           | 270x3    | Galaxy Data                           |
| haplotype        | 197x102  | Human Ancestry Haplotype Data         |
| insurance        | 60x3     | Insurance Life Table Data             |
| leukemia_small   | 3571x72  | Leukemia Gene Expression Data (Small) |
| ncog             | 96x6     | NCOG Head and Neck Cancer Data        |
| nodes            | 844x2    | Lymph Nodes Cancer Data               |
| pediatric        | 1620x7   | Pediatric Cancer Survival Data        |
| police           | 2748x1   | Police Racial Bias Data               |
| prostz           | 6032x1   | Prostate Cancer Z-values              |
| student_score    | 22x5     | Student Score Data                    |
| supernova        | 39x11    | Type Ia Supernova Data                |
| vasoconstriction | 39x2     | Vasoconstriction Data                 |

## Missing Datasets

The following dataset appears in `data-raw/CASI-save.R` but is **not**
(yet) included in the package:

| Dataset | Reason |
|----|----|
| `SPAM` | Variable names need cleanup; requires mapping from UCI Spambase documentation |

See `data-raw/missing-datasets.md` for details on resolving this.

## External Datasets (Not Included)

These large datasets are referenced in the book but not included in the
package due to size constraints. They can be downloaded directly from
the sources listed below.

### CASI datasets (too large for CRAN)

- **protein_kernel**: 1708 x 1708 inner-product (kernel) matrix for
  human proteins (Section 19.6). Computed using a string kernel on
  bag-of-4-grams amino acid representations.
  - Source:
    <https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt>
  - Load in R:
    `protein_kernel <- matrix(scan("https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what=0), 1708, 1708)`
- **protein_label**: Response labels (-1/+1) for the 1708 proteins (45
  positives, 1663 negatives).
  - Source:
    <https://hastie.su.domains/CASI_files/DATA/protein_label.txt>
  - Load in R:
    `protein_label <- scan("https://hastie.su.domains/CASI_files/DATA/protein_label.txt", what=0)`
- **prostmat**: 6033 x 102 gene expression matrix comparing 50 controls
  vs 52 prostate cancer patients (Section 3.3).
  - Source: <https://hastie.su.domains/CASI_files/DATA/prostmat.csv>
  - Load in R:
    `prostmat <- read.csv("https://hastie.su.domains/CASI_files/DATA/prostmat.csv")`
  - Note: Column names need cleanup (see `data-raw/missing-datasets.md`
    for renaming code)
- **leukemia_big**: 7128 x 72 gene expression matrix (10MB). A larger
  version of `leukemia_small`.
  - Source: <https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv>
  - Load in R:
    `leukemia_big <- read.csv("https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv")`
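
The `protein_kernel` one-liner above trusts that the file holds exactly 1708 x 1708 values and that the fill order does not matter. A minimal sketch of a more defensive reader follows; `read_square_matrix()` is a hypothetical helper, not part of CASIdata, and the symmetry check assumes the file stores a symmetric inner-product matrix. It is demonstrated on a toy file so it runs offline.

``` r
# Hypothetical helper (not part of CASIdata): read n*n whitespace-separated
# numbers from a file or URL and return them as an n x n matrix.
read_square_matrix <- function(path, n) {
  x <- scan(path, what = numeric(), quiet = TRUE)
  if (length(x) != n * n) {
    stop("expected ", n * n, " values, found ", length(x))
  }
  m <- matrix(x, nrow = n, ncol = n)   # matrix() fills column by column
  if (!isSymmetric(m)) {
    warning("matrix is not symmetric; check whether the file is stored by row")
  }
  m
}

# Offline demonstration on a toy 3 x 3 symmetric matrix written to a temp file
toy <- matrix(c(2, 1, 0,
                1, 2, 1,
                0, 1, 2), nrow = 3, byrow = TRUE)
tmp <- tempfile(fileext = ".txt")
write(t(toy), file = tmp, ncolumns = 3)   # one matrix row per line of text
toy_back <- read_square_matrix(tmp, 3)
unlink(tmp)

# For the real data (1708^2 values, so the download takes a while):
# protein_kernel <- read_square_matrix(
#   "https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", 1708)
```

For a symmetric matrix, reading by column or by row gives the same result, which is why the plain `matrix(scan(...), 1708, 1708)` call is enough when the file is intact.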

### Image datasets (hosted externally)

- **CIFAR-100**: 100 image classes, 600 images each (32x32x3 color).
  Used in Chapter 18.
  - Source: <https://www.cs.toronto.edu/~kriz/cifar.html>
- **MNIST**: Handwritten digit database, 60K training + 10K test images
  (28x28 grayscale). Used in Chapter 18.
  - Source: <http://yann.lecun.com/exdb/mnist/>

## Variable Renaming

Some datasets had variables renamed, or were reshaped, for clarity:

| Dataset | Original | Renamed |
|----|----|----|
| `butterfly` | x, y | k, count |
| `police` | X2.411 | z |
| `prostz` | X1.47236666651029 | z |
| `galaxy` | wide format | Reshaped to long format with `mag`, `red`, `freq` |
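
To illustrate what a wide-to-long reshape of this kind looks like, here is a minimal base R sketch on a made-up table. The values and labels are invented, and this is not necessarily the code used to build `galaxy`; only the column names `mag`, `red`, and `freq` are taken from the table above.

``` r
# Toy wide table: rows index one variable, columns the other, cells are counts.
# Values and labels are made up for illustration.
wide <- matrix(c(1, 0, 3,
                 2, 5, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(mag = c("17.2", "17.4"),
                               red = c("1.22", "1.47", "1.71")))

# as.table() followed by as.data.frame() gives one row per (mag, red) cell
long <- as.data.frame(as.table(wide), responseName = "freq",
                      stringsAsFactors = FALSE)

# The row and column labels come back as character; convert them to numbers
long$mag <- as.numeric(long$mag)
long$red <- as.numeric(long$red)

long
```

The long form has one row per cell of the wide table, which is the layout most modeling and plotting functions expect.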

## Example

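There are no worked examples yet. To get started, list the datasets and
inspect one:

``` r
library(CASIdata)
data(package = "CASIdata")   # list the datasets in the package
data(baseball)               # load one dataset (18 x 3)
str(baseball)                # inspect its structure
```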
