| Type: | Package |
| Title: | Tools for an Introductory Class in Regression and Modeling |
| Version: | 1.7 |
| Date: | 2025-05-23 |
| Depends: | R (≥ 3.6), bestglm, leaps, VGAM, rpart, randomForest |
| Imports: | rpart.plot |
| Suggests: | stringr, multcompView |
| Description: | Contains basic tools for visualizing, interpreting, and building regression models. It has been designed for use with the book Introduction to Regression and Modeling with R by Adam Petrie, Cognella Publishers, ISBN: 978-1-63189-250-9. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| NeedsCompilation: | no |
| Encoding: | UTF-8 |
| Packaged: | 2025-05-26 18:25:06 UTC; adamp |
| RoxygenNote: | 7.3.2 |
| Author: | Adam Petrie [aut, cre] |
| Maintainer: | Adam Petrie <apetrie@utk.edu> |
| Repository: | CRAN |
| Date/Publication: | 2025-05-26 19:10:02 UTC |
Predicting whether a customer will open a new kind of account
Description
A bank marketed a new type of account to its customers. The goal is to model which factors are associated with the probability of opening the account, in order to tune the marketing strategy.
Usage
data("ACCOUNT")
Format
A data frame with 24242 observations on the following 8 variables.
Purchase: a factor with levels No and Yes
Tenure: a numeric vector, the number of years the customer has been with the bank
CheckingBalance: a numeric vector, amount currently held in checking (may be negative if overdrafted)
SavingBalance: a numeric vector, amount currently held in savings (0 or larger)
Income: a numeric vector, yearly income in thousands of dollars
Homeowner: a factor with levels No and Yes
Age: a numeric vector
Area.Classification: a factor with levels R, S, and U for rural, suburban, or urban
Details
Who is more likely to open a new type of account that a bank wants to sell its customers? Try logistic regression or partition models to see if you can develop a model that accurately classifies purchasers vs. non-purchasers, or one that succeeds in promoting the account to nearly all customers who would buy it.
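Examples
A minimal logistic regression sketch for this task; the predictor set shown is an illustrative choice, not one prescribed by the text.
library(regclass)
data("ACCOUNT")
# Model the probability of Purchase (level Yes) from a few candidate predictors
M <- glm(Purchase ~ Tenure + CheckingBalance + SavingBalance + Income,
         data = ACCOUNT, family = binomial)
summary(M)
head(predict(M, type = "response"))  # predicted probabilities of opening the account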
Appliance shipments
Description
Appliance shipments from 1960 to 1985
Usage
data("APPLIANCE")
Format
A data frame with 26 observations on the following 7 variables.
Year: a numeric vector
Dishwasher: a numeric vector, factory shipments (domestic) of dishwashers (thousands)
Disposal: a numeric vector, factory shipments (domestic) of disposers (thousands)
Refrigerator: a numeric vector, factory shipments (domestic) of refrigerators (thousands)
Washer: a numeric vector, factory shipments (domestic) of washing machines (thousands)
DurableGoodsExp: a numeric vector, durable goods expenditures (billions of 1972 dollars)
PrivateResInvest: a numeric vector, private residential investment (billions of 1972 dollars)
Details
From the (former) Data and Story library.
The file gives unit shipments of dishwashers, disposers, refrigerators, and washers in the United States from 1960 to 1985. This and other data are published currently in the Department of Commerce's Survey of Current Business, and are summarized from time to time in their publication, Business Statistics. Also included in the file are durable goods expenditures and private residential investment in the United States.
Attractiveness Score (female)
Description
The average attractiveness scores of 70 females along with physical attributes
Usage
data("ATTRACTF")
Format
A data frame with 70 observations on the following 21 variables.
Score: a numeric vector giving the average attractiveness score compiled from about 100 student ratings
Actual.Sexuality: a factor with levels Gay and Straight indicating the self-reported sexuality of the person in the picture
ApparentRace: a factor with levels black, other, and white indicating the consensus regarding the apparent race of the person
Chin: a factor with levels pointed and rounded indicating the consensus regarding the shape of the person's chin
Cleavage: a factor with levels no and yes indicating the consensus regarding whether the pictured woman was prominently displaying cleavage
ClothingStyle: a factor with levels conservative and revealing indicating the consensus regarding how the woman was dressed
FaceSymmetryScore: a numeric vector indicating the number of people (out of 2) who agreed the woman's face was symmetric
FashionScore: a numeric vector indicating the number of people (out of 4) who agreed the woman was fashionable
FitnessScore: a numeric vector indicating the number of people (out of 4) who agreed the woman was physically fit
GayScore: a numeric vector indicating the number of people (out of 16) who agreed the woman was a lesbian
Glasses: a factor with levels Glasses and No Glasses
GroomedScore: a numeric vector indicating the number of people (out of 4) who agreed the woman made a noticeable effort to look nice
HairColor: a factor with levels dark and light indicating the consensus regarding the woman's hair color
HairstyleUniquess: a numeric vector indicating the number of people (out of 2) who agreed the woman had an unconventional haircut
HappinessRating: a numeric vector indicating the number of people (out of 2) who agreed the woman looked happy in her photo
LookingAtCamera: a factor with levels no and yes
MakeupScore: a numeric vector indicating the number of people (out of 5) who agreed the woman was wearing a noticeable amount of makeup
NoseOddScore: a numeric vector indicating the number of people (out of 3) who agreed the woman had an unusually shaped nose
Selfie: a factor with levels no and yes
SkinClearScore: a numeric vector indicating the number of people (out of 2) who agreed the woman's complexion was clear
Smile: a factor with levels no and yes
Details
Students were asked to rate on a scale of 1 (very unattractive) to 5 (very attractive) the attractiveness of 70 college-aged women who had posted their photos on a dating website. Of the nearly 100 respondents, most were straight males. Score represents the average of these ratings.
In a separate survey, students (of both genders) were asked to rate characteristics of the woman by answering the questions: what is her race, is she displaying her cleavage prominently, is she a lesbian, is she physically fit, etc. The variables ending in "Score" represent the number of students who answered Yes to the question. Other variables (such as Selfie, Smile) represent the consensus among the students. The only attribute taken from the woman's profile was Actual.Sexuality.
Source
Students in BAS 320 at the University of Tennessee from 2013-2015.
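Examples
A simple regression sketch relating the average rating to a few of the recorded attributes; the predictors chosen here are illustrative only.
library(regclass)
data("ATTRACTF")
M <- lm(Score ~ Cleavage + FitnessScore + GroomedScore + Smile, data = ATTRACTF)
summary(M)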
Attractiveness Score (male)
Description
The average attractiveness scores of 70 males along with physical attributes
Usage
data("ATTRACTM")
Format
A data frame with 70 observations on the following 23 variables.
Score: a numeric vector giving the average attractiveness score compiled from about 60 student ratings
Actual.Sexuality: a factor with levels Gay and Straight indicating the self-reported sexuality of the person in the picture
ApparentRace: a factor with levels black, other, and white indicating the consensus regarding the apparent race of the person
Chin: a factor with levels pointed and rounded indicating the consensus regarding the shape of the person's chin
ClothingStyle: a factor with levels conservative and revealing indicating the consensus regarding how the man was dressed
FaceSymmetryScore: a numeric vector indicating the number of people (out of 7) who agreed the man's face was symmetric
FacialHair: a factor with levels no and yes indicating the consensus regarding whether the man appeared to maintain facial hair
FashionScore: a numeric vector indicating the number of people (out of 7) who agreed the man was fashionable
FitnessScore: a numeric vector indicating the number of people (out of 8) who agreed the man was physically fit
GayScore: a numeric vector indicating the number of people (out of 16) who agreed the man was gay
Glasses: a factor with levels no and yes
GroomedScore: a numeric vector indicating the number of people (out of 6) who agreed the man made a noticeable effort to look nice
HairColor: a factor with levels dark, light, and unseen indicating the consensus regarding the man's hair color
HairstyleUniquess: a numeric vector indicating the number of people (out of 4) who agreed the man had an unconventional haircut
HappinessRating: a numeric vector indicating the number of people (out of 6) who agreed the man looked happy in his photo
Hat: a factor with levels no and yes
LookingAtCamera: a factor with levels no and yes
NoseOddScore: a numeric vector indicating the number of people (out of 3) who agreed the man had an unusually shaped nose
Piercings: a factor with levels no and yes indicating whether the man had visible piercings
Selfie: a factor with levels no and yes
SkinClearScore: a numeric vector indicating the number of people (out of 2) who agreed the man's complexion was clear
Smile: a factor with levels no and yes
Tattoo: a factor with levels no and yes
Details
Students were asked to rate on a scale of 1 (very unattractive) to 5 (very attractive) the attractiveness of 70 college-aged men who had posted their photos on a dating website. Of the nearly 60 respondents, most were straight females. Score represents the average of these ratings.
In a separate survey, students (of both genders) were asked to rate characteristics of the man by answering the questions: what is his race, how symmetric does his face look, is he gay, is he physically fit, etc. The variables ending in "Score" represent the number of students who answered Yes to the question. Other variables (such as Hat, Smile) represent the consensus among the students. The only attribute taken from the man's profile was Actual.Sexuality.
Source
Students in BAS 320 at the University of Tennessee from 2013-2015.
AUTO dataset
Description
Characteristics of cars from 1991
Usage
data("AUTO")
Format
A data frame with 82 observations on the following 5 variables.
CabVolume: a numeric vector, cubic feet of cab space
Horsepower: a numeric vector, engine horsepower
FuelEfficiency: a numeric vector, average miles per gallon
TopSpeed: a numeric vector, miles per hour
Weight: a numeric vector, in units of 100 lbs
Details
Although this is a popular dataset, there is some question as to the units of the fuel efficiency. The source claims it to be in miles per gallon, but the numbers reported seem unrealistic. However, the units do not appear to be in km/gallon or km/L.
Source
Data provided by the U.S. Environmental Protection Agency and obtained from the (former) Data and Story library
References
R.M. Heavenrich, J.D. Murrell, and K.H. Hellman, Light Duty Automotive Technology and Fuel Economy Trends Through 1991, U.S. Environmental Protection Agency, 1991 (EPA/AA/CTAB/91-02)
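Examples
A quick sanity check of the units question raised above; the conversion factor 1 mpg = 0.425 km/L is ordinary arithmetic (1.609 km per mile, 3.785 L per gallon).
library(regclass)
data("AUTO")
summary(AUTO$FuelEfficiency)          # values as reported by the source (claimed mpg)
summary(AUTO$FuelEfficiency * 0.425)  # what the same numbers would mean in km/L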
BODYFAT data
Description
Popular Bodyfat dataset
Usage
data("BODYFAT")
Format
A data frame with 252 observations on the following 14 variables.
BodyFat: a numeric vector indicating the percentage body fat (0-100)
Age: a numeric vector, years
Weight: a numeric vector, lbs
Height: a numeric vector, inches
Neck: a numeric vector
Chest: a numeric vector
Abdomen: a numeric vector
Hip: a numeric vector
Thigh: a numeric vector
Knee: a numeric vector
Ankle: a numeric vector
Biceps: a numeric vector
Forearm: a numeric vector
Wrist: a numeric vector
Details
Bodyfat can be accurately measured by the hydrostatic technique, where someone is submerged in a tank of water. It would be useful to be able to predict body fat from measurements that are simpler to obtain. Unless otherwise specified, all physical measurements are in centimeters.
Source
This is a modified version of the data available in "Fitting Percentage of Body Fat to Simple Body Measurements" as appearing in the Journal of Statistics Education v4 n1 (1996).
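Examples
A sketch of such a prediction model using a few easily obtained measurements; the choice of predictors is illustrative.
library(regclass)
data("BODYFAT")
M <- lm(BodyFat ~ Abdomen + Weight + Wrist, data = BODYFAT)
summary(M)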
Secondary BODYFAT dataset
Description
Bodyfat dataset illustrating quirks of statistical significance
Usage
data("BODYFAT2")
Format
A data frame with 20 observations on the following 4 variables.
Triceps: a numeric vector, cm
Thigh: a numeric vector, cm
Midarm: a numeric vector, cm
BodyFat: a numeric vector, 0-100 representing percent
Details
The physical measurements are circumferences of body parts of 25-34 year-old healthy females.
Source
This is a classic dataset found in many textbooks and in many places online. The original source may be Neter, Kutner, Nachtsheim, and Wasserman, Applied Linear Statistical Models (4th edition, 1996), p. 261.
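Examples
A sketch of the quirk this dataset is known for: with highly correlated predictors, the overall F-test can be significant while the individual t-tests are not. Verify the classic pattern on the output.
library(regclass)
data("BODYFAT2")
M <- lm(BodyFat ~ Triceps + Thigh + Midarm, data = BODYFAT2)
summary(M)  # compare the overall F-test to the individual t-tests
cor(BODYFAT2[, c("Triceps", "Thigh", "Midarm")])  # strong correlations among predictors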
BULLDOZER data
Description
Predicting the sales price of a bulldozer at auction
Usage
data("BULLDOZER")
Format
A data frame with 924 observations on the following 6 variables.
SalePrice: a numeric vector
YearsAgo: a numeric vector, the number of years ago (before present) that the sale occurred
YearMade: a numeric vector, year of manufacture of the machine
Usage: a numeric vector, hours of usage at time of sale
Blade: a numeric vector, width of the bulldozer blade (feet)
Tire: a numeric vector, size of primary tires
Details
The goal is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data represents a heavily modified version of competition data found on kaggle.com. See the original source for the actual dataset.
References
https://www.kaggle.com/c/bluebook-for-bulldozers
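Examples
A minimal sketch of a pricing model; the predictors are illustrative, and a partition model via rpart would be an equally natural starting point.
library(regclass)
data("BULLDOZER")
M <- lm(SalePrice ~ YearsAgo + YearMade + Usage + Blade + Tire, data = BULLDOZER)
summary(M)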
Modified BULLDOZER data
Description
The BULLDOZER dataset but with the year the dozer was made as a categorical variable
Usage
data("BULLDOZER2")
Format
A data frame with 924 observations on the following 6 variables.
Price: a numeric vector
YearsAgo: a numeric vector
Usage: a numeric vector
Tire: a numeric vector
Decade: a factor with levels 1960s and 1970s, 1980s, 1990s, and 2000s
BladeSize: a numeric vector
Details
This is the BULLDOZER data except here YearMade has been coded into a four-level categorical variable called Decade.
CALLS dataset
Description
Summary of students' cell phone providers and relative frequency of dropped calls
Usage
data("CALLS")
Format
A data frame with 579 observations on the following 2 variables.
Provider: a factor with levels ATT, Sprint, USCellular, and Verizon
DropCallFreq: a factor with levels Occasionally, Often, and Rarely
Details
Data is self-reported by students. The dropped call frequency is based on individuals' perceptions and not any independent quantitative measure. The data is a subset of SURVEY09.
Source
Student survey from STAT 201, University of Tennessee Knoxville, Fall 2009
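Examples
Since both variables are categorical, a contingency table and chi-squared test are a natural first look; a sketch using base R.
library(regclass)
data("CALLS")
TAB <- table(CALLS$Provider, CALLS$DropCallFreq)
TAB
chisq.test(TAB)  # tests whether dropped-call frequency is associated with provider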
CENSUS data
Description
Information from the 2010 US Census
Usage
data("CENSUS")
Format
A data frame with 3534 observations on the following 39 variables.
ResponseRate: a numeric vector, 0-100 representing the percentage of households in a block group that mailed in the form
Area: a numeric vector, land area in square miles
Urban: a numeric vector, percentage of block group in an urbanized area (population 50000 or greater)
Suburban: a numeric vector, percentage of block group in an urban cluster area (population 2500 to 49999)
Rural: a numeric vector, percentage of block group outside urbanized areas and urban clusters
Male: a numeric vector, percentage of males
AgeLess5: a numeric vector, percentage of individuals aged less than 5 years old
Age5to17: a numeric vector
Age18to24: a numeric vector
Age25to44: a numeric vector
Age45to64: a numeric vector
Age65plus: a numeric vector
Hispanics: a numeric vector, percentage of individuals who identify as Hispanic
Whites: a numeric vector, percentage of individuals who identify as white (alone)
Blacks: a numeric vector
NativeAmericans: a numeric vector
Asians: a numeric vector
Hawaiians: a numeric vector
Other: a numeric vector, percentage of individuals who identify as another ethnicity
RelatedHH: a numeric vector, percentage of households where at least 2 members are related by birth, marriage, or adoption; same-sex couple households with no relatives of the householder present are not included
MarriedHH: a numeric vector, percentage of households in which the householder and his or her spouse are listed as members of the same household; does not include same-sex married couples
NoSpouseHH: a numeric vector, percentage of households with no spousal relationship present
FemaleHH: a numeric vector, percentage of households with a female householder and no husband of householder present
AloneHH: a numeric vector, percentage of households where the householder is living alone
WithKidHH: a numeric vector, percentage of households which have at least one person under the age of 18
MedianHHIncomeBlock: a numeric vector, median income of households in the block group (from American Community Survey)
MedianHHIncomeCity: a numeric vector, median income of households in the tract
OccupiedUnits: a numeric vector, percentage of housing units that are occupied
RentingHH: a numeric vector, percentage of housing units occupied by renters
HomeownerHH: a numeric vector, percentage of housing units occupied by the owner
MobileHomeUnits: a numeric vector, percentage of housing units that are mobile homes (from American Community Survey)
CrowdedUnits: a numeric vector, percentage of housing units with more than 1 person per room on average
NoPhoneUnits: a numeric vector, percentage of housing units without a landline
NoPlumbingUnits: a numeric vector, percentage of housing units without active plumbing
NewUnits: a numeric vector, percentage of housing units constructed in 2010 or later
Population: a numeric vector, number of people in the block group
NumHH: a numeric vector, number of households in the block group
NumUnits: a numeric vector, number of housing units in the block group
logMedianHouseValue: a numeric vector, the logarithm of the median home value in the block group
Details
The goal is to predict ResponseRate from the other predictors. ResponseRate is the percentage of households in a block group that mailed in the census forms. A block group is on average about 40 blocks, each typically bounded by streets, roads, or water. The number of block groups per county in the US is typically between about 5 and 165 with a median of about 20.
References
See https://www2.census.gov/programs-surveys/research/guidance/planning-databases/2014/pdb-block-2014-11-20a.pdf for variable definitions.
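Examples
A minimal starting model for ResponseRate; the handful of predictors shown is an arbitrary illustration of the task described above.
library(regclass)
data("CENSUS")
M <- lm(ResponseRate ~ Urban + Rural + Age65plus + MedianHHIncomeBlock + RentingHH,
        data = CENSUS)
summary(M)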
Subset of CENSUS data
Description
A portion of the CENSUS dataset used for illustration
Usage
data("CENSUSMLR")
Format
A data frame with 1000 observations on the following 7 variables.
Response: a numeric vector, percentage 0-100 of households that mailed in the census form
Population: a numeric vector, the number of people living in the census block based on the 2010 census
ACSPopulation: a numeric vector, the number of people living in the census block based on the American Community Survey
Rural: a numeric vector, the number of people living in a rural area (in that census block)
Males: a numeric vector, the number of males living in the census block
Elderly: a numeric vector, the number of people aged 65+ living in the census block
Hispanic: a numeric vector, the number of people who self-identify as Hispanic in the census block
Details
See CENSUS data for more information.
CHARITY dataset
Description
Charity data (adapted from a small section of a charity's donor database)
Usage
data("CHARITY")
Format
A data frame with 15283 observations on the following 11 variables.
Donate: a factor with levels Donate and No
Homeowner: a factor with levels No and Yes
Gender: a factor with levels F and M
UnlistedPhone: a factor with levels No and Yes
ResponseProportion: a numeric vector giving the fraction of solicitations that resulted in a donation
NumResponses: a numeric vector giving the number of past donations
CardResponseCount: a numeric vector giving the number of past solicitations
MonthsSinceLastResponse: a numeric vector giving the number of months since the last response to a solicitation (which may have been declining to give)
LastGiftAmount: a numeric vector giving the amount of the last donation
MonthSinceLastGift: a numeric vector giving the number of months since the last donation
LogIncome: a numeric vector giving the logarithm of a scaled and normalized yearly income
Details
This dataset is adapted from a real-world database of donors to a charity.
Source
Unknown
CHURN dataset
Description
Churn data (artificial, but based on claims similar to the real world) from the UCI data repository
Usage
data("CHURN")
Format
A data frame with 5000 observations on the following 18 variables.
churn: a factor with levels No and Yes
accountlength: a numeric vector
internationalplan: a factor with levels no and yes
voicemailplan: a factor with levels no and yes
numbervmailmessages: a numeric vector
totaldayminutes: a numeric vector
totaldaycalls: a numeric vector
totaldaycharge: a numeric vector
totaleveminutes: a numeric vector
totalevecalls: a numeric vector
totalevecharge: a numeric vector
totalnightminutes: a numeric vector
totalnightcalls: a numeric vector
totalnightcharge: a numeric vector
totalintlminutes: a numeric vector
totalintlcalls: a numeric vector
totalintlcharge: a numeric vector
numbercustomerservicecalls: a numeric vector
Details
This dataset is modified from the one stored at the UCI data repository (namely, the area code and phone number have been deleted). This is artificial data similar to what is found in actual customer profiles. Charges are in dollars.
Source
This dataset is modified from the one stored at the UCI data repository
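Examples
A classification tree is a natural fit here; a sketch using the rpart and rpart.plot packages (both installed with this package).
library(regclass)
library(rpart)
library(rpart.plot)
data("CHURN")
TREE <- rpart(churn ~ ., data = CHURN)
rpart.plot(TREE)  # visualize the fitted tree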
CUSTCHURN dataset
Description
Customer database describing customer churn (adapted from a former case study)
Usage
data("CUSTCHURN")
Format
A data frame with 500 observations on the following 11 variables.
Duration: a numeric vector giving the number of days that the company was considered a customer. Note: censored at 730 days, which is the value for someone who is currently a customer (not churned)
Churn: a factor with levels N and Y giving whether the customer has churned or not
RetentionCost: a numeric vector giving the average amount of money spent per year to retain the individual or company as a customer
EBiz: a factor with levels No and Yes giving whether the customer was an e-business or not
CompanyRevenue: a numeric vector giving the company's revenue
CompanyEmployees: a numeric vector giving the number of employees working for the company
Categories: a numeric vector giving the number of product categories from which the customer made a purchase over their lifetime
NumPurchases: a numeric vector giving the total amount of purchases over the customer's lifetime
Details
Each row corresponds to a customer of a Fortune 500 company. These customers are businesses, which may or may not be exclusively e-businesses. Whether a customer is still a customer (or has churned) after 730 days is recorded.
Source
Unknown
CUSTLOYALTY dataset
Description
Customer database describing customer value (adapted from a former case study) and whether they have a loyalty card
Usage
data("CUSTLOYALTY")
Format
A data frame with 500 observations on the following 9 variables.
Gender: a factor with levels Female and Male giving the customer's gender
Married: a factor with levels Married and Single giving the customer's marital status
Income: a factor with levels f0t30, f30t45, f45t60, f60t75, f75t90, and f90toINF giving the approximate yearly income of the customer. The first level corresponds to 30K or less, the second level corresponds to 30K to 45K, and the last level corresponds to 90K or above
FirstPurchase: a numeric vector giving the amount of the customer's first purchase
LoyaltyCard: a factor with levels No and Yes giving whether the customer has a loyalty card for the store
WalletShare: a numeric vector giving the percentage from 0 to 100 of purchases of similar products that the customer makes at this store. A value of 100 means the customer uses this store exclusively for such purchases.
CustomerLV: a numeric vector giving the lifetime value of the customer, reflecting the amount spent acquiring and retaining the customer along with the revenue brought in by the customer
TotTransactions: a numeric vector giving the total number of consecutive months the customer has made a transaction in the last year
LastTransaction: a numeric vector giving the number of months since the customer's last transaction
Details
Each row corresponds to a customer of a local chain. Does having a loyalty card increase the customer's value?
Source
Unknown
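Examples
A first pass at the loyalty-card question; a sketch only, and note that a difference here shows association, not a causal effect of the card.
library(regclass)
data("CUSTLOYALTY")
boxplot(CustomerLV ~ LoyaltyCard, data = CUSTLOYALTY)
t.test(CustomerLV ~ LoyaltyCard, data = CUSTLOYALTY)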
CUSTREACQUIRE dataset
Description
Customer reacquisition
Usage
data("CUSTREACQUIRE")
Format
A data frame with 500 observations on the following 9 variables.
Reacquire: a factor with levels No and Yes indicating whether a customer who has previously churned was reacquired
Lifetime2: a numeric vector giving the days that the company was considered a customer
Value2: a numeric vector giving the lifetime value of the customer (related to the amount of money spent on reacquisition and the revenue brought in by the customer; can be negative)
Lifetime1: a numeric vector giving the days that the company was considered a customer before churning the first time
OfferAmount: a numeric vector giving the money equivalent of a special offer given to the former customer in an attempt to reacquire
Lapse: a numeric vector giving the number of days between the customer churning and the time of the offer
PriceChange: a numeric vector giving the percentage by which the typical product purchased by the customer has changed from the time they churned to the time the special offer was sent
Gender: a factor with levels Female and Male giving the gender of the customer
Age: a numeric vector giving the age of the customer
Details
A company kept records of its success in reacquiring customers that had previously churned. Data is based on a previous case study.
Source
Unknown
CUSTVALUE dataset
Description
Customer database describing customer value (adapted from a former case study)
Usage
data("CUSTVALUE")
Format
A data frame with 500 observations on the following 11 variables.
Acquired: a factor with levels No and Yes indicating whether a potential customer was acquired
Duration: a numeric vector giving the days that the company was considered a customer
LifetimeValue: a numeric vector giving the lifetime value of the customer (related to the amount of money spent on acquisition and the revenue brought in by the customer; can be negative)
AcquisitionCost: a numeric vector giving the amount of money spent attempting to acquire the individual or company as a customer
RetentionCost: a numeric vector giving the average amount of money spent per year to retain the individual or company as a customer
NumPurchases: a numeric vector giving the total amount of purchases over the customer's lifetime
Categories: a numeric vector giving the number of product categories from which the customer made a purchase over their lifetime
WalletShare: a numeric vector giving the percentage of purchases of similar products the customer makes with this company; a few values exceed 100 for some reason
EBiz: a factor with levels No and Yes giving whether the customer was an e-business or not
CompanyRevenue: a numeric vector giving the company's revenue
CompanyEmployees: a numeric vector giving the number of employees working for the company
Details
Each row corresponds to a (potential) customer of a Fortune 500 company. These customers are businesses, which may or may not be exclusively e-businesses.
Source
Unknown
DIET data
Description
The weight of a person over time who is dieting and exercising
Usage
data("DIET")
Format
A data frame with 35 observations on the following 2 variables.
Weight: a numeric vector, lbs
Day: a numeric vector, the number of days after the diet started
Details
This data was collected by the author and consists of his weight measured first thing in the morning over the course of about a month. The scale rounds to the nearest 0.2 lbs.
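Examples
A quick sketch of the weight trend over time using a simple linear fit; the slope estimates the average pounds lost per day.
library(regclass)
data("DIET")
plot(Weight ~ Day, data = DIET)
M <- lm(Weight ~ Day, data = DIET)
abline(M)
coef(M)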
DONOR dataset
Description
Adapted from the KDD-CUP-98 dataset, concerning donations made to a national veterans organization.
Usage
data("DONOR")
Format
A data frame with 19372 observations on the following 50 variables.
Donate: a factor with levels No and Yes
Donation.Amount: a numeric vector
ID: a numeric vector
MONTHS_SINCE_ORIGIN: a numeric vector, number of months the donor has been in the database
DONOR_AGE: a numeric vector
IN_HOUSE: a numeric vector, 1 if the person has donated to the charity's "In House" program
URBANICITY: a factor with levels ?, C, R, S, T, and U
SES: a factor with levels ?, 1, 2, 3, and 4, one of five possible codes indicating socioeconomic status
CLUSTER_CODE: a factor with levels ., 01, 02, ..., 53, one of 54 possible cluster codes, which are unique in terms of socioeconomic status, urbanicity, ethnicity, and other demographic characteristics
HOME_OWNER: a factor with levels H and U
DONOR_GENDER: a factor with levels A, F, M, and U
INCOME_GROUP: a numeric vector, but in reality one of 7 possible income groups inferred from demographics
PUBLISHED_PHONE: a numeric vector, listed (1) vs. not listed (0)
OVERLAY_SOURCE: a factor with levels B, M, N, and P, the source from which the donor was matched; B is both sources and N is neither
MOR_HIT_RATE: a numeric vector, number of known times the donor has responded to a mailed solicitation from a group other than the charity
WEALTH_RATING: a numeric vector, but in reality one of 10 groups based on demographics
MEDIAN_HOME_VALUE: a numeric vector, inferred from other variables
MEDIAN_HOUSEHOLD_INCOME: a numeric vector, inferred from other variables
PCT_OWNER_OCCUPIED: a numeric vector, percent of owner-occupied housing near where the person lives
PER_CAPITA_INCOME: a numeric vector, of the neighborhood in which the person lives
PCT_ATTRIBUTE1: a numeric vector, percent of residents in the person's neighborhood that are male and active military
PCT_ATTRIBUTE2: a numeric vector, percent of residents in the person's neighborhood that are male and veterans
PCT_ATTRIBUTE3: a numeric vector, percent of residents in the person's neighborhood that are Vietnam veterans
PCT_ATTRIBUTE4: a numeric vector, percent of residents in the person's neighborhood that are WW2 veterans
PEP_STAR: a numeric vector, 1 if the donor has achieved STAR donor status and 0 otherwise
RECENT_STAR_STATUS: a numeric vector, 1 if achieved STAR within the last 4 years
RECENCY_STATUS_96NK: a factor with levels A (active), E (inactive), F (first time), L (lapsing), N (new), and S (star donor) as of 1996
FREQUENCY_STATUS_97NK: a numeric vector indicating the number of times donated in the last period (the period is determined by RECENCY_STATUS_96NK)
RECENT_RESPONSE_PROP: a numeric vector, proportion of responses by the individual to the number of (card or other) solicitations from the charitable organization since four years ago
RECENT_AVG_GIFT_AMT: a numeric vector, average donation from the individual to the charitable organization since four years ago
RECENT_CARD_RESPONSE_PROP: a numeric vector, proportion of card solicitations from the charitable organization since four years ago to which the individual has responded
RECENT_AVG_CARD_GIFT_AMT: a numeric vector, average donation from the individual in response to a card solicitation from the charitable organization since four years ago
RECENT_RESPONSE_COUNT: a numeric vector, number of times the individual has responded to a promotion (card or other) from the charitable organization since four years ago
RECENT_CARD_RESPONSE_COUNT: a numeric vector, number of times the individual has responded to a card solicitation from the charitable organization since four years ago
MONTHS_SINCE_LAST_PROM_RESP: a numeric vector, number of months since the individual has responded to a promotion by the charitable organization
LIFETIME_CARD_PROM: a numeric vector, total number of card promotions sent to the individual by the charitable organization
LIFETIME_PROM: a numeric vector, total number of promotions sent to the individual by the charitable organization
LIFETIME_GIFT_AMOUNT: a numeric vector, total lifetime donation amount from the individual to the charitable organization
LIFETIME_GIFT_COUNT: a numeric vector, total number of donations from the individual to the charitable organization
LIFETIME_AVG_GIFT_AMT: a numeric vector, lifetime average donation from the individual to the charitable organization
LIFETIME_GIFT_RANGE: a numeric vector, difference between the maximum and minimum donation amounts from the individual
LIFETIME_MAX_GIFT_AMT: a numeric vector
LIFETIME_MIN_GIFT_AMT: a numeric vector
LAST_GIFT_AMT: a numeric vector
CARD_PROM_12: a numeric vector, number of card promotions sent to the individual by the charitable organization in the last 12 months
NUMBER_PROM_12: a numeric vector, number of promotions (card or other) sent to the individual by the charitable organization in the last 12 months
MONTHS_SINCE_LAST_GIFT: a numeric vector
MONTHS_SINCE_FIRST_GIFT: a numeric vector
FILE_AVG_GIFT: a numeric vector, same as LIFETIME_AVG_GIFT_AMT
FILE_CARD_GIFT: a numeric vector, lifetime average donation from the individual in response to all card solicitations from the charitable organization
Details
Originally, this data was used with the 1998 KDD competition (https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). This particular version has been adapted from the version available in SAS Enterprise Miner (see Appendix 2 of http://support.sas.com/documentation/cdl/en/emgsj/61207/PDF/default/emgsj.pdf for descriptions of variable names). One goal is to determine whether a past donor donated in response to the 97NK mail solicitation and, if so, how much, based on age, gender, most recent donation amount, total gift amount, etc.
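Examples
A minimal sketch of the classification part of that goal; the predictors are illustrative, and glm drops rows with missing values by default.
library(regclass)
data("DONOR")
M <- glm(Donate ~ DONOR_AGE + LAST_GIFT_AMT + MONTHS_SINCE_LAST_GIFT,
         data = DONOR, family = binomial)
summary(M)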
EDUCATION data
Description
Data on the College GPAs of students in an introductory statistics class
Usage
data("EDUCATION")
Format
A data frame with 607 observations on the following 18 variables.
CollegeGPA: a numeric vector
Gender: a factor with levels Female and Male
HSGPA: a numeric vector, can range up to 5 if the high school allowed it
ACT: a numeric vector, ACT score
APHours: a numeric vector, number of AP hours the student took in HS
JobHours: a numeric vector, number of hours the student currently works on average
School: a factor with levels Private and Public, type of HS
LanguagesSpoken: a numeric vector
HSHonorsClasses: a numeric vector, number of honors classes taken in HS
SmokeInHS: a factor with levels No and Yes
PayCollegeNoLoans: a factor with levels No and Yes, can the student and his/her family pay for the University of Tennessee without taking out loans?
ClubsInHS: a numeric vector, number of clubs belonged to in HS
JobInHS: a factor with levels No and Yes, whether the student maintained a job at some point while in HS
Churchgoer: a factor with levels No and Yes, answer to the question "Do you regularly attend church?"
Height: a numeric vector (inches)
Weight: a numeric vector (lbs)
Family: what position they are in the family, a factor with levels Middle Child, Oldest Child, Only Child, and Youngest Child
Pet: favorite pet, a factor with levels Both, Cat, Dog, and Neither
Details
Responses are from students in an introductory statistics class at the University of Tennessee in 2010. One goal is to predict a student's college GPA from some of these characteristics. What information about a high school student could a college admissions counselor use to anticipate that student's performance in college?
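Examples
A minimal GPA model in the spirit of that question; the predictors chosen here are illustrative only.
library(regclass)
data("EDUCATION")
M <- lm(CollegeGPA ~ HSGPA + ACT + JobHours, data = EDUCATION)
summary(M)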
CENSUS data for Exercise 5 in Chapter 2
Description
CENSUS data for Exercise 5 in Chapter 2
Usage
data("EX2.CENSUS")
Format
A data frame with 3534 observations on the following 41 variables.
ResponseRate: a numeric vector
Area: a numeric vector
Urban: a numeric vector
Suburban: a numeric vector
Rural: a numeric vector
Male: a numeric vector
Female: a numeric vector
AgeLess5: a numeric vector
Age5to17: a numeric vector
Age18to24: a numeric vector
Age25to44: a numeric vector
Age45to64: a numeric vector
Age65plus: a numeric vector
Hispanics: a numeric vector
Whites: a numeric vector
Blacks: a numeric vector
NativeAmericans: a numeric vector
Asians: a numeric vector
Hawaiians: a numeric vector
Other: a numeric vector
RelatedHH: a numeric vector
MarriedHH: a numeric vector
NoSpouseHH: a numeric vector
FemaleHH: a numeric vector
AloneHH: a numeric vector
WithKidHH: a numeric vector
MedianHHIncomeBlock: a numeric vector
MedianHHIncomeCity: a numeric vector
OccupiedUnits: a numeric vector
VacantUnits: a numeric vector
RentingHH: a numeric vector
HomeownerHH: a numeric vector
MobileHomeUnits: a numeric vector
CrowdedUnits: a numeric vector
NoPhoneUnits: a numeric vector
NoPlumbingUnits: a numeric vector
NewUnits: a numeric vector
Population: a numeric vector
NumHH: a numeric vector
NumUnits: a numeric vector
logMedianHouseValue: a numeric vector
Details
See CENSUS for variable descriptions (this data is nearly identical). The goal is to predict ResponseRate from the other predictors. ResponseRate is the percentage of households in a block group that mailed in the census forms. A block group is on average about 40 blocks, each typically bounded by streets, roads, or water. The number of block groups per county in the US is typically between about 5 and 165 with a median of about 20.
TIPS data for Exercise 6 in Chapter 2
Description
TIPS data for Exercise 6 in Chapter 2
Usage
data("EX2.TIPS")
Format
A data frame with 244 observations on the following 8 variables.
Tip.Percentage: a numeric vector
Bill_in_USD: a numeric vector
Tip_in_USD: a numeric vector
Gender: a factor with levels Female and Male
Smoker: a factor with levels No and Yes
Weekday: a factor with levels Friday, Saturday, Sunday, and Thursday
Day_Night: a factor with levels Day and Night
Size_of_Party: a numeric vector
Details
See TIPS for more details. This is the same dataset except that the names of the variables are different.
ABALONE dataset for Exercise D in Chapter 3
Description
ABALONE dataset for Exercise D in Chapter 3
Usage
data("EX3.ABALONE")
Format
A data frame with 1528 observations on the following 7 variables.
Length: a numeric vector
Diameter: a numeric vector
Height: a numeric vector
Whole.Weight: a numeric vector
Meat.Weight: a numeric vector
Shell.Weight: a numeric vector
Rings: a numeric vector
Details
Abalone are sea creatures that are considered a delicacy and have very pretty iridescent shells. See https://en.wikipedia.org/wiki/Abalone. Predicting the age of the abalone from physical measurements could be useful for harvesting purposes. Dimensions are in mm and weights are in grams. Rings is an indicator of the age of the abalone (Age is about 1.5 plus the number of rings).
Source
Data is adapted from the abalone dataset on UCI Data Repository https://archive.ics.uci.edu/ml/datasets/Abalone. Only the male abalone are represented in this dataset.
References
See page on UCI for full details of owner and donor of this data.
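Examples
A sketch using the stated age relationship (Age is about Rings + 1.5) to build a simple predictive model from physical measurements; the predictor choice is illustrative.
library(regclass)
data("EX3.ABALONE")
EX3.ABALONE$Age <- EX3.ABALONE$Rings + 1.5  # per the Details above
M <- lm(Age ~ Shell.Weight + Diameter, data = EX3.ABALONE)
summary(M)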
Bodyfat data for Exercise F in Chapter 3
Description
Bodyfat data for Exercise F in Chapter 3
Usage
data("EX3.BODYFAT")
Format
A data frame with 20 observations on the following 4 variables.
Triceps: a numeric vector
Thigh: a numeric vector
Midarm: a numeric vector
Fat: a numeric vector
Details
Same data as BODYFAT2, which you can see for more details.
Housing data for Exercise E in Chapter 3
Description
Housing data for Exercise E in Chapter 3
Usage
data("EX3.HOUSING")
Format
A data frame with 522 observations on the following 2 variables.
AREA: a numeric vector, square area of house
PRICE: a numeric vector, selling price
Details
Selling prices of houses (perhaps in the Boston area in Massachusetts).
Source
Original source unknown, but it appears in many places around the internet, e.g., public.iastate.edu/~pdixon/stat500/data/realestate.txt
NFL data for Exercise A in Chapter 3
Description
NFL data for Exercise A in Chapter 3
Usage
data("EX3.NFL")
Format
A data frame with 352 observations on the following 137 variables.
Year: a numeric vector
Team: a factor with levels Arizona, Atlanta, Baltimore, Buffalo, Carolina, Chicago, Cincinnati, Cleveland, Dallas, Denver, Detroit, GreenBay, Houston, Indianapolis, Jacksonville, KansasCity, Miami, Minnesota, NewEngland, NewOrleans, NYGiants, NYJets, Oakland, Philadelphia, Pittsburgh, SanDiego, SanFrancisco, Seattle, St.Louis, TampaBay, Tennessee, and Washington
Next.Years.Wins: a numeric vector
Wins: a numeric vector
X1.Off.Tot.Yds: a numeric vector
X2.Off.Tot.Plays: a numeric vector
X3.Off.Tot.Yds.per.Ply: a numeric vector
X4.Off.Tot.1st.Dwns: a numeric vector
X5.Off.Pass.1st.Dwns: a numeric vector
X6.Off.Rush.1st.Dwns: a numeric vector
X7.Off.Tot.Turnovers: a numeric vector
X8.Off.Fumbles.Lost: a numeric vector
X9.Off.1st.Dwns.by.Penalty: a numeric vector
X10.Off.Pass.Comp: a numeric vector
X11.Off.Pass.Comp.: a numeric vector
X12.Off.Pass.Yds: a numeric vector
X13.Off.Pass.Tds: a numeric vector
X14.Off.Pass.INTs: a numeric vector
X15.Off.Pass.INT.: a numeric vector
X16.Off.Pass.Longest: a numeric vector
X17.Off.Pass.Yds.per.Att: a numeric vector
X18.Off.Pass.Adj.Yds.per.Att: a numeric vector
X19.Off.Pass.Yds.per.Comp: a numeric vector
X20.Off.Pass.Yds.per.Game: a numeric vector
X21.Off.Passer.Rating: a numeric vector
X22.Off.Pass.Sacks.Alwd: a numeric vector
X23.Off.Pass.Sack.Yds: a numeric vector
X24.Off.Pass.Net.Yds.per.Att: a numeric vector
X25.Off.Pass.Adj.Net.Yds.per.Att: a numeric vector
X26.Off.Pass.Sack.: a numeric vector
X27.Off.Game.Winning.Drives: a numeric vector
X28.Off.Rush.Yds: a numeric vector
X29.Off.Rush.Tds: a numeric vector
X30.Off.Rush.Longest: a numeric vector
X31.Off.Rush.Yds.per.Att: a numeric vector
X32.Off.Rush.Yds.per.Game: a numeric vector
X33.Off.Fumbles: a numeric vector
X34.Off.Punt.Returns: a numeric vector
X35.Off.PR.Yds: a numeric vector
X36.Off.PR.Tds: a numeric vector
X37.Off.PR.Longest: a numeric vector
X38.Off.PR.Yds.per.Att: a numeric vector
X39.Off.Kick.Returns: a numeric vector
X40.Off.KR.Yds: a numeric vector
X41.Off.KR.Tds: a numeric vector
X42.Off.KR.Longest: a numeric vector
X43.Off.KR.Yds.per.Att: a numeric vector
X44.Off.All.Purpose.Yds: a numeric vector
X45.X1.19.yd.FG.Att: a numeric vector
X46.X1.19.yd.FG.Made: a numeric vector
X47.X20.29.yd.FG.Att: a numeric vector
X48.X20.29.yd.FG.Made: a numeric vector
X49.X1.29.yd.FG.: a numeric vector
X50.X30.39.yd.FG.Att: a numeric vector
X51.X30.39.yd.FG.Made: a numeric vector
X52.X30.39.yd.FG.: a numeric vector
X53.X40.49.yd.FG.Att: a numeric vector
X54.X40.49.yd.FG.Made: a numeric vector
X55.X50yd.FG.Att: a numeric vector
X56.X50yd.FG.Made: a numeric vector
X57.X40yd.FG.: a numeric vector
X58.Total.FG.Att: a numeric vector
X59.Off.Tot.FG.Made: a numeric vector
X60.Off.Tot.FG.: a numeric vector
X61.Off.XP.Att: a numeric vector
X62.Off.XP.Made: a numeric vector
X63.Off.XP.: a numeric vector
X64.Off.Times.Punted: a numeric vector
X65.Off.Punt.Yards: a numeric vector
X66.Off.Longest.Punt: a numeric vector
X67.Off.Times.Had.Punt.Blocked: a numeric vector
X68.Off.Yards.Per.Punt: a numeric vector
X69.Fmbl.Tds: a numeric vector
X70.Def.INT.Tds.Scored: a numeric vector
X71.Blocked.Kick.or.Missed.FG.Ret.Tds: a numeric vector
X72.Total.Tds.Scored: a numeric vector
X73.Off.2pt.Conv.Made: a numeric vector
X74.Def.Safeties.Scored: a numeric vector
X75.Def.Tot.Yds.Alwd: a numeric vector
X76.Def.Tot.Plays.Alwd: a numeric vector
X77.Def.Tot.Yds.per.Play.Alwd: a numeric vector
X78.Def.Tot.1st.Dwns.Alwd: a numeric vector
X79.Def.Pass.1st.Dwns.Alwd: a numeric vector
X80.Def.Rush.1st.Dwns.Alwd: a numeric vector
X81.Def.Turnovers.Created: a numeric vector
X82.Def.Fumbles.Recovered: a numeric vector
X83.Def.1st.Dwns.Alwd.by.Penalty: a numeric vector
X84.Def.Pass.Comp.Alwd: a numeric vector
X85.Def.Pass.Att.Alwd: a numeric vector
X86.Def.Pass.Comp..Alwd: a numeric vector
X87.Def.Pass.Yds.Alwd: a numeric vector
X88.Def.Pass.Tds.Alwd: a numeric vector
X89.Def.Pass.TDAlwd: a numeric vector
X90.Def.Pass.INTs: a numeric vector
X91.Def.Pass.INT.: a numeric vector
X92.Def.Pass.Yds.per.Att.Alwd: a numeric vector
X93.Def.Pass.Adj.Yds.per.Att.Alwd: a numeric vector
X94.Def.Pass.Yds.per.Comp.Alwd: a numeric vector
X95.Def.Pass.Yds.per.Game.Alwd: a numeric vector
X96.Def.Passer.Rating.Alwd: a numeric vector
X97.Def.Pass.Sacks: a numeric vector
X98.Def.Pass.Sack.Yds: a numeric vector
X99.Def.Pass.Net.Yds.per.Att.Alwd: a numeric vector
X100.Def.Pass.Adj.Net.Yds.per.Att.Alwd: a numeric vector
X101.Def.Pass.Sack.: a numeric vector
X102.Def.Rush.Yds.Alwd: a numeric vector
X103.Def.Rush.Tds.Alwd: a numeric vector
X104.Def.Rush.Yds.per.Att.Alwd: a numeric vector
X105.Def.Rush.Yds.per.Game.Alwd: a numeric vector
X106.Def.Punt.Returns.Alwd: a numeric vector
X107.Def.PR.Tds.Alwd: a numeric vector
X108.Def.Kick.Returns.Alwd: a numeric vector
X109.Def.KR.Yds.Alwd: a numeric vector
X110.Def.KR.Tds.Alwd: a numeric vector
X111.Def.KR.Yds.per.Att.Alwd: a numeric vector
X112.Def.Tot.FG.Att.Alwd: a numeric vector
X113.Def.Tot.FG.Made.Alwd: a numeric vector
X114.Def.Tot.FG..Alwd: a numeric vector
X115.Def.XP.Att.Alwd: a numeric vector
X116.Def.XP.Made.Alwd: a numeric vector
X117.Def.XP..Alwd: a numeric vector
X118.Def.Punts.Alwd: a numeric vector
X119.Def.Punt.Yds.Alwd: a numeric vector
X120.Def.Punt.Yds.per.Att.Alwd: a numeric vector
X121.Def.2pt.Conv.Alwd: a numeric vector
X122.Off.Safeties: a numeric vector
X123.Off.Rush.Success.Rate: a numeric vector
X124.Head.Coach.Disturbance.: a factor with levels No and Yes
X125.QB.Disturbance: a factor with levels No and Yes
X126.RB.Disturbance: a factor with levels ?, No, and Yes
X127.Off.Run.Pass.Ratio: a numeric vector
X128.Off.Pass.Ply.: a numeric vector
X129.Off.Run.Ply.: a numeric vector
X130.Off.Yds.Pt: a numeric vector
X131.Def.Yds.Pt: a numeric vector
X132.Off.Pass.Drop.rate: a numeric vector
X133.Def.Pass.Drop.Rate: a numeric vector
Details
See NFL for more details. This dataset is actually a more complete version of NFL, containing additional variables such as the year, the team, and the team's win total the following year; it could be used in place of the NFL data.
Bike data for Exercise 1 in Chapter 4
Description
Bike data for Exercise 1 in Chapter 4
Usage
data("EX4.BIKE")
Format
A data frame with 414 observations on the following 5 variables.
Demand: a numeric vector, total number of rental bikes
AvgTemp: a numeric vector, average temperature of the day
EffectiveAvgTemp: a numeric vector, average temperature it feels like (taking into account dewpoint) for the day
AvgHumidity: a numeric vector, average humidity for the day
AvgWindspeed: a numeric vector, average wind speed for the day
Details
Adapted from the bike sharing dataset on the UCI data repository http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This concerns the demand for rental bikes in the DC area.
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike from one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. There is great interest in these systems today due to their important role in traffic, environmental, and health issues.
Apart from interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike-sharing system into a virtual sensor network that can be used to sense mobility in a city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
References
Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
Stock data for Exercise 2 in Chapter 4 (prediction set)
Description
Stock data for Exercise 2 in Chapter 4 (prediction set)
Usage
data("EX4.STOCKPREDICT")
Format
A data frame with 5 observations on the following 40 variables.
AAPLlag2: a numeric vector
AXPlag2: a numeric vector
BAlag2: a numeric vector
BAClag2: a numeric vector
CATlag2: a numeric vector
CSCOlag2: a numeric vector
CVXlag2: a numeric vector
DDlag2: a numeric vector
DISlag2: a numeric vector
GElag2: a numeric vector
HDlag2: a numeric vector
HPQlag2: a numeric vector
IBMlag2: a numeric vector
INTClag2: a numeric vector
JNJlag2: a numeric vector
JPMlag2: a numeric vector
KOlag2: a numeric vector
MCDlag2: a numeric vector
MMMlag2: a numeric vector
MRKlag2: a numeric vector
MSFTlag2: a numeric vector
PFElag2: a numeric vector
PGlag2: a numeric vector
Tlag2: a numeric vector
TRVlag2: a numeric vector
UNHlag2: a numeric vector
VZlag2: a numeric vector
WMTlag2: a numeric vector
XOMlag2: a numeric vector
Australialag2: a numeric vector
Copperlag2: a numeric vector
DollarIndexlag2: a numeric vector
Europelag2: a numeric vector
Exchangelag2: a numeric vector
GlobalDowlag2: a numeric vector
HongKonglag2: a numeric vector
Indialag2: a numeric vector
Japanlag2: a numeric vector
Oillag2: a numeric vector
Shanghailag2: a numeric vector
Details
The data frame for which you are to predict the closing price of Alcoa stock based on the model built using EX4.STOCKS. The actual closing prices are not given.
Stock data for Exercise 2 in Chapter 4
Description
Stock data for Exercise 2 in Chapter 4
Usage
data("EX4.STOCKS")
Format
A data frame with 216 observations on the following 41 variables.
AA: a numeric vector
AAPLlag2: a numeric vector
AXPlag2: a numeric vector
BAlag2: a numeric vector
BAClag2: a numeric vector
CATlag2: a numeric vector
CSCOlag2: a numeric vector
CVXlag2: a numeric vector
DDlag2: a numeric vector
DISlag2: a numeric vector
GElag2: a numeric vector
HDlag2: a numeric vector
HPQlag2: a numeric vector
IBMlag2: a numeric vector
INTClag2: a numeric vector
JNJlag2: a numeric vector
JPMlag2: a numeric vector
KOlag2: a numeric vector
MCDlag2: a numeric vector
MMMlag2: a numeric vector
MRKlag2: a numeric vector
MSFTlag2: a numeric vector
PFElag2: a numeric vector
PGlag2: a numeric vector
Tlag2: a numeric vector
TRVlag2: a numeric vector
UNHlag2: a numeric vector
VZlag2: a numeric vector
WMTlag2: a numeric vector
XOMlag2: a numeric vector
Australialag2: a numeric vector
Copperlag2: a numeric vector
DollarIndexlag2: a numeric vector
Europelag2: a numeric vector
Exchangelag2: a numeric vector
GlobalDowlag2: a numeric vector
HongKonglag2: a numeric vector
Indialag2: a numeric vector
Japanlag2: a numeric vector
Oillag2: a numeric vector
Shanghailag2: a numeric vector
Details
The goal is to predict the closing price of Alcoa stock (AA) from the closing prices of other stocks and commodities two days prior (IBMlag2, HongKonglag2, etc.). If this were possible, and if the association between the prices continued into the future, it would be possible to use this information to make smart trades.
Source
Compiled from various sources on the internet, e.g., Yahoo historical prices.
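Examples
A sketch tying the two stock datasets together: fit a model on EX4.STOCKS, then predict the five unlabeled days in EX4.STOCKPREDICT. Using all 40 lagged predictors at once is one choice, not the only one.
library(regclass)
data("EX4.STOCKS")
data("EX4.STOCKPREDICT")
M <- lm(AA ~ ., data = EX4.STOCKS)
predict(M, newdata = EX4.STOCKPREDICT)  # predicted closing prices of AA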
BIKE dataset for Exercise 4 Chapter 5
Description
BIKE dataset for Exercise 4 Chapter 5
Usage
data("EX5.BIKE")
Format
A data frame with 413 observations on the following 9 variables.
Demand: a numeric vector
Day: a factor with levels Friday, Monday, Saturday, Sunday, Thursday, Tuesday, and Wednesday
Workingday: a factor with levels no and yes
Holiday: a factor with levels no and yes
Weather: a factor with levels No rain and Rain
AvgTemp: a numeric vector
EffectiveAvgTemp: a numeric vector
AvgHumidity: a numeric vector
AvgWindspeed: a numeric vector
Details
Adapted from the bike sharing dataset on the UCI data repository http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This concerns the demand for rental bikes in the DC area. This is an expanded version of EX4.BIKE with more variables and without the row containing bad data.
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike from one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. There is great interest in these systems today due to their important role in traffic, environmental, and health issues.
Apart from interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike-sharing system into a virtual sensor network that can be used to sense mobility in a city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
References
Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
DONOR dataset for Exercise 4 in Chapter 5
Description
DONOR dataset for Exercise 4 in Chapter 5
Usage
data("EX5.DONOR")
Format
A data frame with 8132 observations on the following 18 variables.
Donate: a factor with levels No and Yes
LastAmount: a numeric vector
AccountAge: a numeric vector
Age: a numeric vector
Setting: a factor with levels Rural, Suburban, and Urban
Homeowner: a factor with levels No and Yes
Gender: a factor with levels Female, Male, and Unknown
Phone: a factor with levels Listed and Unlisted
Source: a factor with levels B, M, N, and P, the source from which the donor was matched; B is both sources and N is neither
MedianHomeValue: a numeric vector
MedianIncome: a numeric vector
PercentOwnerOccupied: a numeric vector, of the neighborhood in which the donor lives
Recent: a factor with levels No and Yes
RecentResponsePercent: a numeric vector
RecentAvgAmount: a numeric vector
MonthsSinceLastGift: a numeric vector
TotalAmount: a numeric vector
TotalDonations: a numeric vector
Details
See DONOR for details. This data is a subset, though attributes have been renamed.
CLICK data for Exercise 2 in Chapter 6
Description
CLICK data for Exercise 2 in Chapter 6
Usage
data("EX6.CLICK")
Format
A data frame with 13594 observations on the following 15 variables.
Click: a factor with levels No and Yes
BannerPosition: a factor with levels Pos1 and Pos2, location of the ad
SiteID: a factor with levels S1, S2, S3, S4, S5, S6, S7, and S8
SiteDomain: a factor with levels SD1, SD2, SD3, SD4, SD5, SD6, SD7, and SD8
SiteCategory: a factor with levels SCat1, SCat2, SCat3, SCat4, and SCat5
AppDomain: a factor with levels AD1, AD2, and AD3
AppCategory: a factor with levels AC1 and AC2
DeviceModel: a factor with levels D1 through D18
x1: a numeric vector
x2: a factor with levels A through R
x3: a factor with levels a, b, c, d, e, and f
x4: a factor with levels val1, val2, and val3
x5: a factor with levels type1, type2, type3, and type4
x6: a factor with levels class1, class2, class3, and class4
x7: a factor with levels AA, BB, CC, DD, and EE
Details
Inspired by a competition to predict the click-thru rates of ads displayed on mobile devices https://www.kaggle.com/c/avazu-ctr-prediction. Does the click-thru rate vary based on where the ad is placed, what kind of site and device is used to view the ad, or something else? All variables are anonymized.
DONOR dataset for Exercise 1 in Chapter 6
Description
DONOR dataset for Exercise 1 in Chapter 6
Usage
data("EX6.DONOR")
Format
A data frame with 8132 observations on the following 18 variables.
Donate: a factor with levels No and Yes
LastAmount: a numeric vector
AccountAge: a numeric vector
Age: a numeric vector
Setting: a factor with levels Rural, Suburban, and Urban
Homeowner: a factor with levels No and Yes
Gender: a factor with levels Female, Male, and Unknown
Phone: a factor with levels Listed and Unlisted
Source: a factor with levels B, M, N, and P
MedianHomeValue: a numeric vector
MedianIncome: a numeric vector
PercentOwnerOccupied: a numeric vector
Recent: a factor with levels No and Yes
RecentResponsePercent: a numeric vector
RecentAvgAmount: a numeric vector
MonthsSinceLastGift: a numeric vector
TotalAmount: a numeric vector
TotalDonations: a numeric vector
Details
Identical to EX5.DONOR; see that entry for details.
WINE data for Exercise 3 Chapter 6
Description
WINE data for Exercise 3 Chapter 6
Usage
data("EX6.WINE")
Format
A data frame with 2700 observations on the following 12 variables.
Quality: a factor with levels High and Low
fixed.acidity: a numeric vector
volatile.acidity: a numeric vector
citric.acid: a numeric vector
residual.sugar: a numeric vector
free.sulfur.dioxide: a numeric vector
total.sulfur.dioxide: a numeric vector
density: a numeric vector
pH: a numeric vector
sulphates: a numeric vector
alcohol: a numeric vector
chlorides: a factor with levels Little and Lots
Details
Adapted from the wine quality dataset at the UCI data repository. In this case, the original quality metric has been recoded from a score between 0 and 10 to either High or Low, and chlorides is treated here as a categorical variable instead of a quantitative one.
Source
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
BIKE dataset for Exercise 1 Chapters 7 and 8
Description
BIKE dataset for Exercise 1 Chapters 7 and 8
Usage
data("EX7.BIKE")
Format
A data frame with 410 observations on the following 9 variables.
Demand: a numeric vector
Day: a factor with levels Friday, Monday, Saturday, Sunday, Thursday, Tuesday, and Wednesday
Workingday: a factor with levels no and yes
Holiday: a factor with levels no and yes
Weather: a factor with levels No rain and Rain
AvgTemp: a numeric vector
EffectiveAvgTemp: a numeric vector
AvgHumidity: a numeric vector
AvgWindspeed: a numeric vector
Details
Identical to EX5.BIKE except with three additional rows deleted. See that dataset for details.
CATALOG data for Exercise 2 in Chapters 7 and 8
Description
CATALOG data for Exercise 2 in Chapters 7 and 8
Usage
data("EX7.CATALOG")
Format
A data frame with 4000 observations on the following 7 variables.
Buy: a factor with levels No and Yes, whether the customer made a purchase through the catalog next quarter
QuartersWithPurchase: a numeric vector, number of quarters in which the customer made a purchase through the catalog
PercentQuartersWithPurchase: a numeric vector, percentage of quarters in which the customer made a purchase through the catalog
CatalogsReceived: a numeric vector, total number of catalogs the customer has received
DaysSinceLastPurchase: a numeric vector, number of days since the customer placed his or her last order
AvgOrderSize: a numeric vector, the typical number of items per order when the customer buys through the catalog
LifetimeOrder: a numeric vector, the number of orders the customer has placed through the catalog
Details
The original source of this data is lost, but it is likely adapted from real data.
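Examples
A minimal sketch (not part of the original documentation) of a partition (tree) model for Buy; rpart and rpart.plot are among this package's dependencies.
library(rpart)
library(rpart.plot)
data("EX7.CATALOG")
# fit a classification tree for Buy and display it
TREE <- rpart(Buy ~ ., data = EX7.CATALOG)
rpart.plot(TREE)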
Birthweight dataset for Exercise 1 in Chapter 9
Description
Birthweight dataset for Exercise 1 in Chapter 9
Usage
data("EX9.BIRTHWEIGHT")
Format
A data frame with 553 observations on the following 13 variables.
Birthweight: a numeric vector, grams
Gestation: a numeric vector, weeks
MotherRace: a factor with levels Asian, Black, Mexican, Mixed, White; self-reported
MotherAge: a numeric vector, self-reported
MotherEducation: a factor with levels below HS, College, HS; self-reported
MotherHeight: a numeric vector, inches
MotherWeight: a numeric vector, pounds
FatherRace: a factor with levels Asian, Black, Mexican, Mixed, White; self-reported
FatherAge: a numeric vector, self-reported
Father_Education: a factor with levels below HS, College, HS; self-reported
FatherHeight: a numeric vector, inches
FatherWeight: a numeric vector, pounds
Smoking: a factor with levels never, now; self-reported
Details
An examination of birthweights and their link to gestation, mother and father characteristics, and whether the mother smoked during pregnancy.
Source
Adapted from a subset of a study from Nolan and Speed (2000) consisting of male, single births which survived for at least 28 days. Some rows that contained bad data have been omitted. http://had.co.nz/stat645/week-05/birthweight.txt
NFL data for Exercise 2 Chapter 9
Description
NFL data for Exercise 2 Chapter 9
Usage
data("EX9.NFL")
Format
A data frame with 352 observations on the following 26 variables.
Wins: a numeric vector
X1.OffTotPlays: a numeric vector
X2.OffTotYdsperPly: a numeric vector
X3.OffPass1stDwns: a numeric vector
X4.OffRush1stDwns: a numeric vector
X5.OffFumblesLost: a numeric vector
X6.OffPassComp: a numeric vector
X7.OffPassINT: a numeric vector
X8.OffPassLongest: a numeric vector
X9.OffPassYdsperAtt: a numeric vector
X10.OffPassYdsperComp: a numeric vector
X11.OffPassSackYds: a numeric vector
X12.OffPassSack: a numeric vector
X13.OffRushLongest: a numeric vector
X14.OffRushYdsperAtt: a numeric vector
X15.OffRushYdsperGame: a numeric vector
X16.OffFumbles: a numeric vector
X17.1to29ydFG: a numeric vector
X18.30to39ydFG: a numeric vector
X19.40.ydFG: a numeric vector
X20.TotalFGAtt: a numeric vector
X21.OffTimesPunted: a numeric vector
X22.OffTimesHadPuntBlocked: a numeric vector
X23.OffYardsPerPunt: a numeric vector
X24.Off2ptConvMade: a numeric vector
X25.OffSafeties: a numeric vector
Details
A subset of the NFL data (see entry for details) containing statistics on the offense.
Data for Exercise 3 Chapter 9
Description
Data for Exercise 3 Chapter 9
Usage
data("EX9.STORE")
Format
A data frame with 1500 observations on the following 68 variables.
Store1 through Store68: each a factor with levels Buy, No (one variable per store)
Details
The data consists of a random sample of 1500 credit card customers and their shopping habits regarding 68 different stores (whether they did or did not make a purchase in the last 90 days). Shoppers don't pick and choose places to shop at random, so it is interesting to study which stores appear together in a customer's history.
Source
Consultation with an anonymous client. Stores have been anonymized to protect the source.
Friendship Potential vs. Attractiveness Ratings
Description
Examining how the likelihood of being friends with a person relates to that person's level of attractiveness
Usage
data("FRIEND")
Format
A data frame with 54 observations on the following 2 variables.
Attractiveness: a numeric vector, the average scores (1-5) from about 80 male students who rated the attractiveness of the woman in each picture
FriendshipPotential: a numeric vector, the average scores (1-5) from about 30 female students who rated how likely they would be to be friends with the pictured woman
Details
The data contain information on 54 pictures of women posted on the (now defunct/renamed) site hotornot.com. The women in two classes of introductory statistics at the University of Tennessee rated how likely they would be friends with the pictured women (on a scale of 1-5, 1 being very unlikely and 5 being very likely). The men in three (different) classes of introductory statistics gave an attractiveness score to each woman (on a scale of 1-5, 1 being very unattractive and 5 being very attractive). The numbers presented are the averages over all student ratings.
Source
Surveys administered to introductory statistics students at the University of Tennessee from 2008-2010.
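Examples
A minimal sketch (not part of the original documentation): a simple linear regression of friendship potential on attractiveness.
data("FRIEND")
plot(FriendshipPotential ~ Attractiveness, data = FRIEND)
M <- lm(FriendshipPotential ~ Attractiveness, data = FRIEND)
abline(M)  # add the fitted line to the scatterplot
summary(M)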
Wins vs. Fumbles of an NFL team
Description
Wins vs. Fumbles of an NFL team
Usage
data("FUMBLES")
Format
A data frame with 352 observations on the following 2 variables.
Wins: a numeric vector, number of wins (0-16) of an NFL team over the course of a season
FumblesLost: a numeric vector, the number of fumbles lost by that team over the course of the season
Details
This is a subset of the NFL data. Data is from the 2002-2012 seasons.
Source
Collected by an undergraduate student from available web data in 2013.
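Examples
A minimal sketch (not part of the original documentation) regressing Wins on FumblesLost.
data("FUMBLES")
M <- lm(Wins ~ FumblesLost, data = FUMBLES)
summary(M)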
Junk-mail dataset
Description
Building a junk mail classifier based on word and character frequencies
Usage
data("JUNK")
Format
A data frame with 4601 observations on the following 58 variables.
Junk: a factor with levels Junk, Safe
make: a numeric vector, the percentage (0-100) of words in the email that are the word "make"
address: a numeric vector
all: a numeric vector
X3d: a numeric vector, the percentage (0-100) of words in the email that are the word "3d"
our: a numeric vector
over: a numeric vector
remove: a numeric vector
internet: a numeric vector
order: a numeric vector
mail: a numeric vector
receive: a numeric vector
will: a numeric vector
people: a numeric vector
report: a numeric vector
addresses: a numeric vector
free: a numeric vector
business: a numeric vector
email: a numeric vector
you: a numeric vector
credit: a numeric vector
your: a numeric vector
font: a numeric vector
X000: a numeric vector, the percentage (0-100) of words in the email that are the word "000"
money: a numeric vector
hp: a numeric vector
hpl: a numeric vector
george: a numeric vector
X650: a numeric vector
lab: a numeric vector
labs: a numeric vector
telnet: a numeric vector
X857: a numeric vector
data: a numeric vector
X415: a numeric vector
X85: a numeric vector
technology: a numeric vector
X1999: a numeric vector
parts: a numeric vector
pm: a numeric vector
direct: a numeric vector
cs: a numeric vector
meeting: a numeric vector
original: a numeric vector
project: a numeric vector
re: a numeric vector
edu: a numeric vector
table: a numeric vector
conference: a numeric vector
semicolon: a numeric vector, the percentage (0-100) of characters in the email that are semicolons
parenthesis: a numeric vector
bracket: a numeric vector
exclamation: a numeric vector
dollarsign: a numeric vector
hashtag: a numeric vector
capital_run_length_average: a numeric vector, average length of uninterrupted sequences of capital letters
capital_run_length_longest: a numeric vector, length of the longest uninterrupted sequence of capital letters
capital_run_length_total: a numeric vector, total number of capital letters in the email
Details
The collection of junk emails came from the postmaster and from individuals who classified the email as junk; the collection of safe emails came from work and personal emails. Note that most of the variables are percentages and can range from 0-100, though most values are much less than 1 (i.e., 1%).
Source
Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
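Examples
A minimal sketch (not part of the original documentation) of one way to build a classifier here: a random forest (randomForest is among this package's dependencies).
library(randomForest)
data("JUNK")
set.seed(1)  # the forest is random, so fix the seed for reproducibility
RF <- randomForest(Junk ~ ., data = JUNK)
RF  # prints the out-of-bag confusion matrix and error rate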
Interest in frequent flier program (large version)
Description
Interest in frequent flier program (artificial)
Usage
data("LARGEFLYER")
Format
A data frame with 100000 observations on the following 2 variables.
Gender: a factor with levels Female, Male
Interest: a factor with levels No, Yes
Details
This artificial dataset tabulates interest in a new frequent flyer program by gender. It illustrates that a statistically significant association may have absolutely no practical significance.
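Examples
A minimal sketch (not part of the original documentation) illustrating the point: with n = 100000, even a trivial difference in rates can produce a small p-value.
data("LARGEFLYER")
TAB <- table(LARGEFLYER$Gender, LARGEFLYER$Interest)
prop.table(TAB, 1)  # row proportions: compare interest rates by gender
chisq.test(TAB)     # a small p-value here need not imply a meaningful difference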
New product launch data
Description
The profit of newly released products over the first few months of their release
Usage
data("LAUNCH")
Format
A data frame with 652 observations on the following 420 variables.
Profit: an anonymized numeric vector, the profit from the product over the first few months of release
x1 through x419: anonymized numeric vectors (the 419 anonymized predictors x1, x2, ..., x419)
Details
This example is inspired by the Online Product Sales competition on kaggle.com. The goal is to isolate the minimum number of predictors required to accurately predict Profit. Since the data is based on an actual case, all predictors are anonymized (some were originally categorical but are treated as numeric for this example).
Source
Inspired by https://www.kaggle.com/c/online-sales
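Examples
A minimal sketch (not part of the original documentation): forward selection with leaps (a dependency of this package), since an exhaustive search over 419 predictors is infeasible.
library(leaps)
data("LAUNCH")
FWD <- regsubsets(Profit ~ ., data = LAUNCH, method = "forward", nvmax = 10)
coef(FWD, 10)  # coefficients of the 10-predictor model found by forward selection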
Movie grosses
Description
Movie grosses from the late 1990s
Usage
data("MOVIE")
Format
A data frame with 309 observations on the following 3 variables.
Movie: a factor giving the name of the movie
Weekend: a numeric vector, the opening weekend gross (millions of dollars)
Total: a numeric vector, the total US gross (millions of dollars)
Details
The goal is to predict the total gross of a movie based on its opening weekend gross.
Source
Scraped from the Internet Movie Database in early 2010.
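Examples
A minimal sketch (not part of the original documentation) predicting total gross from opening weekend gross.
data("MOVIE")
M <- lm(Total ~ Weekend, data = MOVIE)
summary(M)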
NFL database
Description
Statistics for NFL teams from the 2002-2012 seasons
Usage
data("NFL")
Format
A data frame with 352 observations on the following 113 variables.
X4.Wins: a numeric vector, number of wins (0-16) of an NFL team for the season
X5.OffTotPlays: a numeric vector, number of total plays made on offense for the season
X6.OffTotYdsperPly: a numeric vector
X7.OffTot1stDwns: a numeric vector
X8.OffPass1stDwns: a numeric vector
X9.OffRush1stDwns: a numeric vector
X10.OffFumblesLost: a numeric vector
X11.OffPassComp: a numeric vector
X12.OffPassComp: a numeric vector
X13.OffPassYds: a numeric vector
X14.OffPassTds: a numeric vector
X15.OffPassTD: a numeric vector
X16.OffPassINTs: a numeric vector
X17.OffPassINT: a numeric vector
X18.OffPassLongest: a numeric vector
X19.OffPassYdsperAtt: a numeric vector
X20.OffPassAdjYdsperAtt: a numeric vector
X21.OffPassYdsperComp: a numeric vector
X22.OffPasserRating: a numeric vector
X23.OffPassSacksAlwd: a numeric vector
X24.OffPassSackYds: a numeric vector
X25.OffPassNetYdsperAtt: a numeric vector
X26.OffPassAdjNetYdsperAtt: a numeric vector
X27.OffPassSack: a numeric vector
X28.OffRushYds: a numeric vector
X29.OffRushTds: a numeric vector
X30.OffRushLongest: a numeric vector
X31.OffRushYdsperAtt: a numeric vector
X32.OffFumbles: a numeric vector
X33.OffPuntReturns: a numeric vector
X34.OffPRYds: a numeric vector
X35.OffPRTds: a numeric vector
X36.OffPRLongest: a numeric vector
X37.OffPRYdsperAtt: a numeric vector
X38.OffKRTds: a numeric vector
X39.OffKRLongest: a numeric vector
X40.OffKRYdsperAtt: a numeric vector
X41.OffAllPurposeYds: a numeric vector
X42.1to19ydFGAtt: a numeric vector
X43.1to19ydFGMade: a numeric vector
X44.20to29ydFGAtt: a numeric vector
X45.20to29ydFGMade: a numeric vector
X46.1to29ydFG: a numeric vector
X47.30to39ydFGAtt: a numeric vector
X48.30to39ydFGMade: a numeric vector
X49.30to39ydFG: a numeric vector
X50.40to49ydFGAtt: a numeric vector
X51.40to49ydFGMade: a numeric vector
X52.50ydFGAtt: a numeric vector
X53.50ydFGAtt: a numeric vector
X54.40ydFG: a numeric vector
X55.OffTotFG: a numeric vector
X56.OffXP: a numeric vector
X57.OffTimesPunted: a numeric vector
X58.OffPuntYards: a numeric vector
X59.OffLongestPunt: a numeric vector
X60.OffTimesHadPuntBlocked: a numeric vector
X61.OffYardsPerPunt: a numeric vector
X62.FmblTds: a numeric vector
X63.DefINTTdsScored: a numeric vector
X64.BlockedKickorMissedFGRetTds: a numeric vector
X65.Off2ptConvMade: a numeric vector
X66.DefSafetiesScored: a numeric vector
X67.DefTotYdsAlwd: a numeric vector
X68.DefTotPlaysAlwd: a numeric vector
X69.DefTotYdsperPlayAlwd: a numeric vector
X70.DefTot1stDwnsAlwd: a numeric vector
X71.DefPass1stDwnsAlwd: a numeric vector
X72.DefRush1stDwnsAlwd: a numeric vector
X73.DefFumblesRecovered: a numeric vector
X74.DefPassCompAlwd: a numeric vector
X75.DefPassAttAlwd: a numeric vector
X76.DefPassCompAlwd: a numeric vector
X77.DefPassYdsAlwd: a numeric vector
X78.DefPassTdsAlwd: a numeric vector
X79.DefPassTDAlwd: a numeric vector
X80.DefPassINTs: a numeric vector
X81.DefPassINT: a numeric vector
X82.DefPassYdsperAttAlwd: a numeric vector
X83.DefPassAdjYdsperAttAlwd: a numeric vector
X84.DefPassYdsperCompAlwd: a numeric vector
X85.DefPasserRatingAlwd: a numeric vector
X86.DefPassSacks: a numeric vector
X87.DefPassSackYds: a numeric vector
X88.DefPassNetYdsperAttAlwd: a numeric vector
X89.DefPassAdjNetYdsperAttAlwd: a numeric vector
X90.DefPassSack: a numeric vector
X91.DefRushYdsAlwd: a numeric vector
X92.DefRushTdsAlwd: a numeric vector
X93.DefRushYdsperAttAlwd: a numeric vector
X94.DefPuntReturnsAlwd: a numeric vector
X95.DefPRTdsAlwd: a numeric vector
X96.DefKickReturnsAlwd: a numeric vector
X97.DefKRTdsAlwd: a numeric vector
X98.DefKRYdsperAttAlwd: a numeric vector
X99.DefTotFGAttAlwd: a numeric vector
X100.DefTotFGAlwd: a numeric vector
X101.DefXPAlwd: a numeric vector
X102.DefPuntsAlwd: a numeric vector
X103.DefPuntYdsAlwd: a numeric vector
X104.DefPuntYdsperAttAlwd: a numeric vector
X105.Def2ptConvAlwd: a numeric vector
X106.OffSafeties: a numeric vector
X107.OffRushSuccessRate: a numeric vector
X108.OffRunPassRatio: a numeric vector
X109.OffRunPly: a numeric vector
X110.OffYdsPt: a numeric vector
X111.DefYdsPt: a numeric vector
X112.HeadCoachDisturbance: a factor with levels No, Yes; whether the head coach changed between this season and the last
X113.QBDisturbance: a factor with levels No, Yes; whether the quarterback changed between this season and the last
X114.RBDisturbance: a factor with levels ?, No, Yes; whether the running back changed between this season and the last
X115.OffPassDropRate: a numeric vector
X116.DefPassDropRate: a numeric vector
Details
Data was collected from many sources on the internet by a student for use in an independent study in the spring of 2013. Abbreviations for predictor variables typically follow the full name in prior variables, e.g., KR = kick returns, PR = punt returns, XP = extra point. Data is organized by year, so rows 1-32 are from 2002, rows 33-64 are from 2003, etc.
Source
Contact the originator Weller Ross (jwellerross@gmail.com) for further details.
Some offensive statistics from NFL dataset
Description
A subset of the NFL dataset containing some statistics of teams on offense
Usage
data("OFFENSE")
Format
A data frame with 352 observations on the following 10 variables.
Win: a numeric vector, number of wins of the team over the season (0-16)
FirstDowns: a numeric vector, number of first downs made over the season
PassingYards: a numeric vector, number of passing yards over the season
Interceptions: a numeric vector, number of times the ball was intercepted on offense
RushingYards: a numeric vector, number of rushing yards over the season
Fumbles: a numeric vector, number of fumbles made on offense
X1to19FGAttempts: a numeric vector, number of field goal attempts made from 1-19 yards
X20to29FGAttempts: a numeric vector, number of field goal attempts made from 20-29 yards
X30to39FGAttempts: a numeric vector
X40to50FGAttempts: a numeric vector
Details
A small subset of the NFL dataset containing select statistics. Seasons are from 2002-2012.
Pima Diabetes dataset
Description
Diabetes among women aged 21+ with Pima heritage
Usage
data("PIMA")
Format
A data frame with 392 observations on the following 8 variables.
Pregnant: a numeric vector, number of times the woman has been pregnant
Glucose: a numeric vector, plasma glucose concentration
BloodPressure: a numeric vector, diastolic blood pressure in mm Hg
BodyFat: a numeric vector, a measurement of the triceps skinfold thickness, an indicator of body fat percentage
Insulin: a numeric vector, 2-hour serum insulin
BMI: a numeric vector, body mass index
Age: a numeric vector, years
Diabetes: a factor with levels No, Yes
Details
Data on 768 women belonging to the Pima tribe. The purpose is to study the associations between having diabetes and various physiological characteristics. Although there are surely other factors (including genetic ones) that influence the chance of having diabetes, the hope is that, because the women are genetically similar (all from the Pima tribe), these other factors are naturally accounted for.
Source
Adapted from the UCI data repository. A variable measuring the "diabetes pedigree function" has been omitted.
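Examples
A minimal sketch (not part of the original documentation): logistic regression of diabetes status on the physiological measurements.
data("PIMA")
# glm() with a two-level factor response models the probability of the
# second level, here "Yes"
M <- glm(Diabetes ~ ., data = PIMA, family = binomial)
summary(M)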
Cockroach poisoning data
Description
Dosages and mortality of cockroaches
Usage
data("POISON")
Format
A data frame with 481 observations on the following 2 variables.
Dose: a numeric vector indicating the dosage of the poison administered to the cockroach
Outcome: a factor with levels Die, Live
Details
Artificial data illustrating a dose-response curve. The probability of dying is well-modeled by a logistic regression model.
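Examples
A minimal sketch (not part of the original documentation) fitting the dose-response curve.
data("POISON")
# Outcome has levels Die, Live; glm() models the probability of the
# second level, so the fit describes the chance of living
M <- glm(Outcome ~ Dose, data = POISON, family = binomial)
summary(M)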
Sales of a product one quarter after release
Description
Sales of a product two quarters after release
Usage
data("PRODUCT")
Format
A data frame with 2768 observations on the following 4 variables.
Outcome: a factor with levels fail, success; indicating whether the product was deemed a success or failure
Category: a factor with levels A, B, C, D; the type of item (e.g., kitchen, toys, consumables)
Trend: a factor with levels down, up; indicating whether the sales over the first 13 weeks had an upward trend or downward trend according to a simple linear regression
SoldWeek13: a numeric vector, the number of items sold 13 weeks after release
Details
Inspired by the dunnhumby hackathon hosted at https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon. The goal is to predict whether a product will be a success or failure half a year after its release based on its characteristics and performance during the first quarter after its release.
Source
Adapted from https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon
PURCHASE data
Description
Purchase habits of customers
Usage
data("PURCHASE")
Format
A data frame with 27723 observations on the following 6 variables.
Purchase: a factor with levels Buy, No; whether the customer made a purchase in the following 30 days
Visits: a numeric vector, number of visits the customer has made to the chain in the last 90 days
Spent: a numeric vector, amount of money the customer has spent at the chain in the last 90 days
PercentClose: a numeric vector, the percentage of the customer's purchases that occur within 5 miles of their home
Closest: a numeric vector, the distance between the customer's home and the nearest store in the chain
CloseStores: a numeric vector, the number of stores in the chain within 5 miles of the customer's home
Details
A nationwide chain wants to know whether it can predict if a former customer will make a purchase at one of its stores in the next 30 days based on the customer's spending habits. Some variables are known by the chain (e.g., Visits) and some are available for purchase from credit card companies (e.g., PercentClose). Is purchasing this additional information about the customer worth it?
Source
Adapted from real data on the condition that neither the name of the chain nor other parties be disclosed.
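Examples
A minimal sketch (not part of the original documentation): logistic regression of Purchase on all available information.
data("PURCHASE")
# Purchase has levels Buy, No; releveling makes the model describe the
# probability of buying rather than of not buying
PURCHASE$Purchase <- relevel(PURCHASE$Purchase, ref = "No")
M <- glm(Purchase ~ ., data = PURCHASE, family = binomial)
summary(M)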
Harris Bank Salary data
Description
Harris Bank Salary data
Usage
data("SALARY")
Format
A data frame with 93 observations on the following 5 variables.
Salary: a numeric vector, starting monthly salary in dollars
Education: a numeric vector, years of schooling at the time of hire
Experience: a numeric vector, number of years of previous work experience
Months: a numeric vector, number of months after January 1, 1969 that the individual was hired
Gender: a factor with levels Female, Male
Details
Real data used in a court lawsuit: 93 randomly selected employees of Harris Bank Chicago in 1977. Values in this data have been scaled from the originals (e.g., Experience is in years instead of months, Education starts at 0 instead of 8, etc.).
Source
Adapted from the case study at http://www.stat.ualberta.ca/statslabs/casestudies/sexdiscrimination.htm
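Examples
A minimal sketch (not part of the original documentation): does Gender remain associated with starting salary after adjusting for qualifications?
data("SALARY")
M <- lm(Salary ~ Education + Experience + Months + Gender, data = SALARY)
summary(M)  # examine the Gender coefficient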
Interest in a frequent flier program (small version)
Description
Interest in a frequent flier program (artificial)
Usage
data("SMALLFLYER")
Format
A data frame with 100 observations on the following 2 variables.
Gender: a factor with levels Female, Male
Interest: a factor with levels No, Yes
Details
This artificial dataset tabulates interest in a new frequent flyer program by gender. A larger version of the same data is in LARGEFLYER.
Predicting future sales
Description
Predicting future sales based on sales data in first quarter after release
Usage
data("SOLD26")
Format
A data frame with 2768 observations on the following 16 variables.
SoldWeek26: a numeric vector, the number of items sold 26 weeks after release; the quantity to predict
StoresSelling1: a numeric vector, the number of stores selling the item 1 week after release
StoresSelling3: a numeric vector
StoresSelling5: a numeric vector
StoresSelling7: a numeric vector
StoresSelling9: a numeric vector
StoresSelling11: a numeric vector
StoresSelling13: a numeric vector
StoresSelling26: a numeric vector, the planned number of stores selling the item 26 weeks after release
Sold1: a numeric vector, the number of items sold 1 week after release
Sold3: a numeric vector
Sold5: a numeric vector
Sold7: a numeric vector
Sold9: a numeric vector
Sold11: a numeric vector
Sold13: a numeric vector, the number of items sold 13 weeks after release
Details
Inspired by the dunnhumby hackathon hosted at https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon. The goal is to predict the number of items sold 26 weeks after release based on the characteristics of the product's sales during the first 13 weeks after release (along with information about how many stores are planning to sell the product 26 weeks after release).
Source
Adapted from https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon
Speed vs. Fuel Efficiency
Description
Speed vs. Fuel Efficiency
Usage
data("SPEED")
Format
A data frame with 40 observations on the following 2 variables.
AverageSpeed: a numeric vector, the average speed at which the vehicle was driven
FuelEfficiency: a numeric vector, the measured fuel efficiency
Details
The relationship between fuel efficiency and speed is non-monotonic.
Source
Artificial
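Examples
A minimal sketch (not part of the original documentation): a quadratic term accommodates the non-monotonic relationship.
data("SPEED")
plot(FuelEfficiency ~ AverageSpeed, data = SPEED)
M <- lm(FuelEfficiency ~ poly(AverageSpeed, 2), data = SPEED)
summary(M)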
STUDENT data
Description
Data on the College GPAs of students in an introductory statistics class
Usage
data("STUDENT")
Format
A data frame with 607 observations on the following 19 variables.
CollegeGPA: a numeric vector
Gender: a factor with levels Female, Male
HSGPA: a numeric vector, can range up to 5 if the high school allowed it
ACT: a numeric vector, ACT score
APHours: a numeric vector, number of AP hours the student took in HS
JobHours: a numeric vector, number of hours the student currently works on average
School: a factor with levels Private, Public; type of HS
Languages: a numeric vector
Honors: a numeric vector, number of honors classes taken in HS
Smoker: a factor with levels No, Yes
AffordCollege: a factor with levels No, Yes; can the student and his/her family pay for the University of Tennessee without taking out loans?
HSClubs: a numeric vector, number of clubs belonged to in HS
HSJob: a factor with levels No, Yes; whether the student maintained a job at some point while in HS
Churchgoer: a factor with levels No, Yes; answer to the question "Do you regularly attend church?"
Height: a numeric vector (inches)
Weight: a numeric vector (lbs)
Class: a factor with levels Junior, Senior, Sophomore
Family: what position they are in the family, a factor with levels Middle Child, Oldest Child, Only Child, Youngest Child
Pet: favorite pet, a factor with levels Both, Cat, Dog, Neither
Details
Same data as EDUCATION with the addition of the Class variable and with slightly different names for variables.
Source
Responses are from students in an introductory statistics class at the University of Tennessee in 2010.
Student survey 2009
Description
Characteristics of students in an introductory statistics class at the University of Tennessee in 2009
Usage
data("SURVEY09")
Format
A data frame with 579 observations on the following 47 variables.
X01.ID: a numeric vector
X02.Gender: a factor with levels Female, Male
X03.Weight: a numeric vector, estimated weight
X04.DesiredWeight: a numeric vector
X05.Class: a factor with levels Freshman, Junior, Senior, Sophmore
X06.BornInTN: a factor with levels No, Yes
X07.Greek: a factor with levels No, Yes; if the student belongs to a fraternity/sorority
X08.UTFirstChoice: a factor with levels No, Yes
X09.Churchgoer: a factor with levels No, Yes; does the student attend a religious service once a week
X10.ParentsMarried: a factor with levels No, Yes
X11.GPA: a numeric vector
X12.SittingLocation: a factor with levels Back, Front, Middle, Varies
X13.WeeklyHoursStudied: a numeric vector
X14.Scholarship: a factor with levels No, Yes
X15.FacebookFriends: a numeric vector
X16.AgeFirstKiss: a numeric vector, age at which the student had their first romantic kiss
X17.CarYear: a numeric vector
X18.DaysPerWeekAlcohol: a numeric vector, how many days a week the student typically drinks
X19.NumDrinksParty: a numeric vector, how many drinks the student typically has when he or she goes to a party
X20.CellProvider: a factor with levels ATT, Sprint, USCellar, Verizon
X21.FreqDroppedCalls: a factor with levels Occasionally, Often, Rarely
X22.MarriedAt: a numeric vector, age by which the student hopes to be married
X23.KidsBy: a numeric vector, age by which the student hopes to have kids
X24.Computer: a factor with levels Mac, Windows
X25.FastestDrivingSpeed: a numeric vector
X26.BusinessMajor: a factor with levels No, Yes
X27.Major: a factor with levels Business, NonBusiness
X28.TxtsPerDay: a numeric vector
X29.FootballGames: a numeric vector, games the student hopes to attend
X30.HoursWorkOut: a numeric vector, per week
X31.MilesToSchool: a numeric vector, each day
X32.MoneyInBank: a numeric vector
X33.MoneyOnHaircut: a numeric vector
X34.PercentTuitionYouPay: a numeric vector
X35.SongsDownloaded: a numeric vector, songs typically downloaded (legally/illegally) a month
X36.ParentCollegeGraduate: a factor with levels No, Yes
X37.HoursSleepPerNight: a numeric vector
X38.Last2DigitsPhone: a numeric vector
X39.NumClassesMissed: a numeric vector
X40.BooksReadThisYear: a numeric vector
X41.UseChopsticks: a factor with levels No, Yes
X42.YourAttractiveness: a numeric vector, 1 (unattractive) to 5 (very attractive)
X43.Obama: a factor with levels No, NotVote, Yes
X44.HoursWorkedPerWeek: a numeric vector, at a job outside of school
X45.MoviesInTheater: a numeric vector, number watched in a theater this year
X46.KnowSomeoneH1N1: a factor with levels No, Yes
X47.ReadBeacon: a factor with levels No, Yes; the school newspaper
Details
Students answered 47 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2009. The responses here have only had minimal cleaning (negative numbers omitted) so some data is bad (e.g., a weight of 16). The questions were:
Stat 201 Fall 2009 Survey Questions 1. What section are you in? 2. Gender [Male, Female] 3. Your weight (in pounds) [0 to 500] 4. What is your desired weight (in pounds)? [0 to 1000] 5. What year are you? [Freshman, Sophomore, Junior, Senior, Other] 6. Were you born in Tennessee? [Yes, No] 7. Are you a member of a Greek social society (i.e., a Fraternity/Sorority)? [Yes, No] 8. Was UT your first choice? [Yes, No] 9. Do you usually attend a religious service once a week? [Yes, No] 10. Are your parents married? [Yes, No] 11. Thus far, what is your GPA (look up on CPO if you need to)? [0 to 4] 12. Given a choice, where do you like to sit in class? [The front row, Near the front, Around the middle, Near the back, The back row, Somewhere different all the time] 13. On average, how many hours per day do you study/do homework? [0 to 24] 14. Do you receive one or more scholarships? [Yes, No] 15. How many Facebook friends do you have? Type -1 if you don't use Facebook. [-1 to 5000] 16. How old were you when you had your first romantic kiss? Type -1 if it has not happened yet. [-1 to 100] 17. What is the year of the car you drive most often? Type a four digit number. Enter 1908 if you never drive a car. [1908 to 2011] 18. On average, how many days per week do you consume one or more alcoholic beverage? Type -1 if you never drink alcoholic beverages. [-1 to 7] 19. On average, how many alcoholic drinks do you have when you party? Type -1 if you never drink alcoholic beverages. [-1 to 100] 20. Which cell phone provider do you use (the most, if you have multiple services)? [ATT (Cingular), Cricket, Sprint, T-Mobile, U.S. Cellular, Verizon, Other, I don't use a cell phone] 21. How often do you have dropped calls? [Never, Rarely, Sometimes, Often, Constantly] 22. What is the age at which you hope to be married? Type -1 if you are already married and type -2 if you never want to get married. [-2 to 100] 23. What is the age at which you hope to have your first child? Type -1 if you already have one or more children, type -2 if you never want to have children. [-2 to 100] 24. What type of computer do you use most often? [PC running Windows, PC running linux, Mac running Mac OS, Mac running linux, Mac running Windows, Other, I don't understand the choices above] 25. What is the fastest speed (in miles per hour) you have ever achieved while driving a car? [0 to 300] 26. Do you plan on going into the Business School? [Yes, No] 27. What is your desired (or actual) major? [Accounting, Economics, Finance, Logistics, Marketing, Statistics, Other] 28. How many text messages do you typically send on any given day? Type -1 if you never send text messages. [-1 to 1000] 29. How many UT football games do you hope to attend this year? (Include games already attended this year. Do not include scrimmages.) [0 to 14] 30. How many hours a week do you work out/play sports/exercise, etc.? [0 to 168] 31. How many miles do you drive to school on a typical day? [0 to 500] 32. How much money do you have in your bank account? Type -999 if you think it's none of our business. [-999 to 10000000] 33. How much do you typically spend on a hair cut? [0 to 1000] 34. What percent of tuition are you personally responsible for? Type a number between 0 and 100. [0 to 100] 35. Typically, how many songs do you download a month (both legally and/or illegally)? [0 to 10000] 36. Did at least one of your parents graduate from college? [Yes, No] 37. On average, how many hours do you sleep a night? [0 to 24] 38. What are the last two digits of your phone number? (Type 0 for 00, 1 for 01, 2 for 02, etc.) [0 to 99] 39. Approximately how many classes have you missed/skipped so far this semester? (For all your courses, including absences for legitimate excuses) [0 to 150] 40. How many books (other than textbooks) have you read so far this year? [0 to 1000] 41. Are you proficient with a pair of chopsticks? [Yes, No] 42. How would you rate your attractiveness on a scale of 1 to 5, with 5 being the most attractive? [1 to 5] 43. Did you vote for Barack Obama in last November's election? [Yes, No I voted for someone else, No I didn't vote at all] 44. On average, how many hours do you work at a job per week? [0 to 168] 45. How many movies have you watched in theaters this year? [0 to 1000] 46. Do you personally know someone who has come down with H1N1 virus? [Yes, No] 47. Do you read the Daily Beacon on a regular basis? [Yes, No]
Student survey 2010
Description
Characteristics of students in an introductory statistics class at the University of Tennessee in 2010
Usage
data("SURVEY10")
Format
A data frame with 699 observations on the following 20 variables.
Gender: a factor with levels Female, Male
Height: a numeric vector
Weight: a numeric vector
DesiredWeight: a numeric vector
GPA: a numeric vector
TxtPerDay: a numeric vector
MinPerDayFaceBook: a numeric vector
NumTattoos: a numeric vector
NumBodyPiercings: a numeric vector
Handedness: a factor with levels Ambidextrous, Left, Right
WeeklyHrsVideoGame: a numeric vector
DistanceMovedToSchool: a numeric vector
PercentDateable: a numeric vector
NumPhoneContacts: a numeric vector
PercMoreAttractiveThan: a numeric vector
PercMoreIntelligentThan: a numeric vector
PercMoreAthleticThan: a numeric vector
PercFunnierThan: a numeric vector
SigificantOther: a factor with levels No, Yes
OwnAttractiveness: a numeric vector
Details
Students answered 50 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2010. The data here represent a selection of the questions. The responses have been somewhat cleaned (unlike SURVEY09) in that obviously bogus responses have been omitted, but there may still be issues.
The selected questions were:
Gender: Gender [Male, Female]
Height: Your height (in inches) [48 to 96]
Weight: Your weight (in pounds) [0 to 500]
DesiredWeight: What is your desired weight (in pounds)? [0 to 1000]
GPA: Thus far, what is your GPA (look up on CPO if you need to)? [0 to 4]
TxtPerDay: How many text messages do you typically send on any given day? Type 0 if you never send text messages. [0 to 1000]
MinPerDayFaceBook: On average, how many minutes per day do you spend on internet social networks (such as Facebook, MySpace, Twitter, LinkedIn, etc.)? [0 to 1440]
NumTattoos: How many tattoos do you have? [0 to 100]
NumBodyPiercings: How many body piercings do you have (do not include piercings you have let heal up and are gone)? Count each piercing separately (i.e., pierced ears counts as 2 piercings). [0 to 100]
Handedness: Are you right-handed, left-handed, or ambidextrous? [Right-Handed, Left-Handed, Ambidextrous]
WeeklyHrsVideoGame: About how many hours a week do you play video games? This includes console games like Wii, Playstation, Xbox, as well as gaming apps for your phone, online games in Facebook, general computer games, etc. [0 to 168]
DistanceMovedToSchool: Go to maps.google.com or another website that provides maps. Get directions from your home address (the house/apartment/etc. you most recently lived in before coming to college) and the zip code 37996. How many miles does it say the trip is? Type the smallest number if offered multiple routes. Type 0 if you are unable to get driving directions for any reason. [0 to 5000]
PercentDateable: What percentage of people around your age in your preferred gender do you consider dateable? [0 to 100]
NumPhoneContacts: How many contacts do you have in your cell phone? Answer 0 if you don't use a cell phone, or have no contacts in your cell phone. [0 to 1000]
PercMoreAttractiveThan: What percentage of people at UT of your own gender and class level do you think you are more attractive than? [0 to 100]
PercMoreIntelligentThan: What percentage of people at UT of your own gender and class level do you think you are more intelligent than? [0 to 100]
PercMoreAthleticThan: What percentage of people at UT of your own gender and class level do you think you are more athletic than? [0 to 100]
PercFunnierThan: What percentage of people at UT of your own gender and class level do you think you are funnier than? [0 to 100]
SigificantOther: Do you have a significant other? [Yes, No]
OwnAttractiveness: On a scale of 1-100, with 100 being the most attractive, rate your own attractiveness. [1 to 100]
Student survey 2011
Description
Characteristics of students in an introductory statistics class at the University of Tennessee in 2011
Usage
data("SURVEY11")
Format
A data frame with 628 observations on the following 51 variables.
X01.ID: a numeric vector
X02.Gender: a factor with levels F, M
X03.Height: a numeric vector
X04.Weight: a numeric vector
X05.SatisfiedWithWeight: a factor with levels No I Wish I Weighed Less, No I Wish I Weighed More, Yes
X06.Class: a factor with levels Freshman, Junior, Senior, Sophomore
X07.GPA: a numeric vector
X08.Greek: a factor with levels No, Yes
X09.PoliticalBeliefs: a factor with levels Conservative, Liberal, Mix
X10.BornInTN: a factor with levels No, Yes
X11.HairColor: a factor with levels Black, Blonde, Brown, Red
X12.GrowUpInUS: a factor with levels No, Yes
X13.NumberHousemates: a numeric vector
X14.FacebookFriends: a numeric vector
X15.NumPeopleTalkToOnPhone: a numeric vector
X16.MinutesTalkOnPhone: a numeric vector
X17.PeopleSendTextsTo: a numeric vector
X18.NumSentTexts: a numeric vector
X19.Computer: a factor with levels Mac, PC
X20.Churchgoer: a factor with levels No, Yes
X21.HoursAtJob: a numeric vector
X22.FastestCarSpeed: a numeric vector
X23.NumTimesBrushTeeth: a numeric vector
X24.SleepPerNight: a numeric vector
X25.MinutesExercisingDay: a numeric vector
X26.BooksReadMonth: a numeric vector
X27.ShowerLength: a numeric vector
X28.PercentRecordedTV: a numeric vector
X29.MostMilesRunOneDay: a numeric vector
X30.MorningPerson: a factor with levels No, Yes
X31.PercentStudentsDateable: a numeric vector
X32.PercentYouAreMoreAttractive: a numeric vector
X33.PercentYouAreSmarter: a numeric vector
X34.RelationshipStatus: a factor with levels Complicated, Dating, Married, Single
X35.AgeFirstKiss: a numeric vector
X36.WeaponAttractMate: a factor with levels Humor, Intelligence, Looks, Other
X37.NumSignificantOthers: a numeric vector
X38.WeeksLongestRelationship: a numeric vector
X39.NumDrinksWeek: a numeric vector
X40.FavAlcohol: a factor with levels Beer, Liquor, None, Wine
X41.SpeedingTickets: a numeric vector
X42.Smoker: a factor with levels No, Yes
X43.IllegalDrugs: a factor with levels No, Yes
X44.DefendantInCourt: a factor with levels No, Yes
X45.NightInJail: a factor with levels No, Yes
X46.BrokenBone: a factor with levels No, Yes
X47.CentsCarrying: a numeric vector
X48.SawLastHarryPotter: a factor with levels No, Yes
X49.NumHarryPotterRead: a numeric vector
X50.HoursContinuouslyAwake: a numeric vector
X51.NumCountriesVisited: a numeric vector
Details
Students answered 51 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2011. The responses have been minimally modified or cleaned. The questions were:
1. What section are you in? (To be viewed only by the Stat 201 coordinator, and removed prior to distributing the data.)
2. What is your gender? [M, F]
3. What is your height (in inches)? [0, 100]
4. What is your weight (in pounds)? [0, 1000]
5. Are you satisfied with your current weight? [Yes, No I wish I weighed less, No I wish I weighed more]
6. What is your class level? [Freshman, Sophomore, Junior, Senior, 5+ year senior, Non-traditional]
7. What is your current GPA? [0, 4]
8. Are you a member of a fraternity/sorority? [Yes, No]
9. Overall, do you consider your social/political beliefs to be: [more liberal, more conservative, a mix of liberal and conservative views]
10. Were you born in Tennessee? [Yes, No]
11. What is your natural hair color? [Black, Brown, Red, Blond, Gray] (Note: a database error required Blond and Gray to be combined into one category.)
12. Did you grow up in the US? [Yes, No, Some time in the US but a significant time in another country]
13. How many people share your current residence? Count yourself, so if you live alone, answer 1. Also, if you live in a dorm, count yourself plus just your roommates/suitemates. [1, 1000]
14. How many Facebook friends do you currently have? (To see how many friends you have in Facebook, open a new tab or browser window and log in to Facebook, click the down arrow next to Account, select Edit Friends, and on the left of your screen your friends count is in parentheses.) [0, 10000]
15. How many people do you talk to on the phone in a typical day? [0, 1000]
16. How many MINUTES a day do you typically spend on the phone talking to people? [0, 1440]
17. How many different people do you typically send text messages to on a typical day? [0, 1000]
18. How many total texts do you think you send to people on a typical day? [0, 5000]
19. What type of computer do you use the most? [Mac, PC, Linux]
20. Do you currently attend religious services at least once a month? [Yes, No]
21. About how many HOURS PER WEEK do you work at a job? [0, 168]
22. What is the fastest speed you have achieved while driving a car (in miles per hour)? [0, 500]
23. How many times per day do you typically brush your teeth? [0, 100]
24. On a typical school night, how many HOURS do you sleep? [0, 24]
25. How many MINUTES PER DAY do you typically engage in physical activity (e.g., walking to and from class, working out at the gym, sports practice, etc.)? [0, 1440]
26. How many books have you read from cover to cover over the last month for pleasure? [0, 1000]
27. How many MINUTES do you typically spend when you take a shower? [0, 1440]
28. Advertisers are concerned that people are "fast forwarding" past their TV commercials, because more and more people are recording broadcast television and watching it later (for example, on a DVR). Approximately what percent of the TV that you watch (that HAS commercials in it) is something you recorded, and therefore you can "fast forward" past the commercials? [0, 100]
29. What is the longest that you've ever walked/run/hiked in a single day (in MILES)? [0, 189]
30. Do you consider yourself a "morning person"? [Yes, No]
31. What percentage of UT students in your preferred gender do you think are dateable? [0, 100]
32. What percentage of UT students do you think you are more attractive than? [0, 100]
33. What percentage of UT students do you think you are more intelligent than? [0, 100]
34. What is your relationship status? [Single, Casually dating one or more people, Dating someone regularly, Engaged, Married, It's complicated]
35. How old were you when you had your first romantic kiss? (Enter 0 if this has not yet happened.) [0, 99]
36. Which of the following would you consider to be your main weapon for attracting a potential mate? [Looks, Intelligence, Sense of Humor, Other]
37. How many boyfriends/girlfriends have you had? (We'll leave it up to you as to what constitutes a boyfriend or girlfriend.) [0, 1000]
38. What is the longest amount of time (in WEEKS) that you have been in a relationship with a significant other? (A shortcut: take the number of months and multiply by 4, or the number of years and multiply by 52.) [0, 4000]
39. How many alcoholic beverages do you typically consume PER WEEK? (Consider 1 alcoholic beverage a 12 oz. beer, a 4 oz. glass of wine, a 1 oz. shot of liquor, etc.) [0, 200]
40. What is your favorite kind of alcoholic beverage? [I don't drink alcoholic beverages, Beer, Wine, Whiskey, Vodka, Gin, Tequila, Rum, Other]
41. How many speeding tickets have you received? [0, 500]
42. Do you consider yourself a "smoker"? [Yes, No]
43. Have you ever used an illegal/controlled substance? (Exclude alcohol/cigarettes consumed when underaged.) [Yes, No]
44. Have you ever appeared before a judge/jury as a defendant? (Exclude speeding or parking tickets.) [Yes, No]
45. Have you ever spent the night in a jail cell? [Yes, No]
46. Have you ever broken a bone that required surgery or a cast (or both)? [Yes, No]
47. Check your pockets and/or purse and report how much money in coins (in CENTS) you are currently carrying. For example, if you have one quarter and one penny, type 26, not 0.26. [0, 1000]
48. Have you seen the latest Harry Potter movie that came out in July 2011? [Yes, No]
49. How many of the seven Harry Potter books have you completely read? [0, 7]
50. Estimate the longest amount of time (in HOURS) that you have continuously stayed awake. [0, 450]
51. How many countries have you ever stepped foot in outside an airport (include the US in your count)? [1, 196]
TIPS dataset
Description
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
Usage
data("TIPS")
Format
A data frame with 244 observations on the following 8 variables.
TipPercentage - a numeric vector, the tip written as a percentage (0-100) of the total bill
Bill - a numeric vector, the bill amount (dollars)
Tip - a numeric vector, the tip amount (dollars)
Gender - a factor with levels Female and Male, the gender of the payer of the bill
Smoker - a factor with levels No and Yes, whether the party included smokers
Weekday - a factor with levels Friday, Saturday, Sunday, and Thursday, the day of the week
Time - a factor with levels Day and Night, the rough time of day
PartySize - a numeric vector, the number of people in the party
Source
This is the Tips dataset in package reshape, modified to include the tip percentage.
Variance Inflation Factor
Description
Calculates the variance inflation factors of all predictors in regression models
Usage
VIF(mod)
Arguments
mod |
A linear or logistic regression model |
Details
This function is a simple port of vif from the car package. The VIF of a predictor is a measure of how easily it is predicted from a linear regression using the other predictors. Taking the square root of the VIF tells you how much larger the standard error of the estimated coefficient is with respect to the case when that predictor is independent of the other predictors.
A general guideline is that a VIF larger than 5 or 10 is large, indicating that the model has problems estimating the coefficient. However, this in general does not degrade the quality of predictions. If the VIF is larger than 1/(1-R2), where R2 is the Multiple R-squared of the regression, then that predictor is more related to the other predictors than it is to the response.
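As a concrete sketch of this definition (using the SALARY data from the examples below), the VIF of a single quantitative predictor can be reproduced by hand from the R-squared of regressing that predictor on the other predictors; this illustrates the idea rather than the function's exact internals:
#VIF of Education by hand: regress Education on the remaining predictors
data(SALARY)
r2 <- summary( lm(Education~.-Salary,data=SALARY) )$r.squared
1/(1-r2) #should agree with the Education entry reported by VIF() below
sqrt(1/(1-r2)) #inflation of the standard error of its coefficient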
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling with R
Examples
#A case where the VIFs are small
data(SALARY)
M <- lm(Salary~.,data=SALARY)
VIF(M)
#A case where (some of) the VIFs are large
data(BODYFAT)
M <- lm(BodyFat~.,data=BODYFAT)
VIF(M)
WINE data
Description
Predicting the quality of wine based on its chemical characteristics
Usage
data("WINE")
Format
A data frame with 2700 observations on the following 12 variables.
Quality - a factor with levels high and low
fixed.acidity - a numeric vector
volatile.acidity - a numeric vector
citric.acid - a numeric vector
residual.sugar - a numeric vector
chlorides - a numeric vector
free.sulfur.dioxide - a numeric vector
total.sulfur.dioxide - a numeric vector
density - a numeric vector
pH - a numeric vector
sulphates - a numeric vector
alcohol - a numeric vector
Details
This is the famous wine dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) with some modifications. Namely, the quality in the original data was a score between 0 and 10; these scores have been coded as either high or low. See the description on UCI for details about the variables.
References
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Pairwise correlations between quantitative variables
Description
This function gives a list of all pairwise correlations between quantitative variables in a dataframe. Alternatively, it can provide all pairwise correlations with just a particular variable.
Usage
all_correlations(X,type="pearson",interest=NA,sorted="none")
Arguments
X |
A data frame |
type |
Either "pearson" (the default) or "spearman", giving the type of correlation to compute. |
interest |
If specified, returns only pairwise correlations with this variable. Argument should be in quotes and must give the exact name of the column of the variable of interest. |
sorted |
How to sort the output: "none" (the default) leaves the correlations unsorted, while "significance" sorts them by p-value (see the examples). |
Details
This function filters out any non-numerical variables in the data frame and provides correlations only between quantitative variables. It is useful for quickly glancing at the size of the correlations between many pairs of variables or all correlations with a particular variable. Further analysis should be done on pairs of interest using associate.
Note: if Spearman's rank correlations are computed, warning messages will result indicating that the exact p-value cannot be computed in the presence of ties. Running associate will give you an approximate p-value using the permutation procedure.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
#all pairwise (Pearson) correlations between all quantitative variables
data(STUDENT)
all_correlations(STUDENT)
#Spearman correlations between all quantitative variables and CollegeGPA, sorted by pvalue.
#Gives warnings due to ties
all_correlations(STUDENT,interest="CollegeGPA",type="spearman",sorted="significance")
Association Analysis
Description
This function takes two variables and computes relevant numerical measures of association. The p-values of the associations are estimated via permutation tests. Diagnostic plots are provided as well, with optional arguments that allow for classic tests.
Usage
associate(formula, data, permutations = 500, seed=NA, plot = TRUE, classic = FALSE,
cex.leg=0.7, n.levels=NA,prompt=TRUE,color=TRUE,...)
Arguments
formula |
A standard R formula written as y~x, where y is the name of the variable playing the role of y and x is the name of the variable playing the role of x. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
permutations |
The number of permutations for Monte Carlo estimation of the p-value. If 0, function defaults to reporting classic results. |
seed |
An optional argument specifying the random number seed for permutations. |
plot |
If TRUE (the default), plots illustrating the association are provided. |
classic |
If TRUE, the results of the classic tests of association (and plots checking their assumptions) are reported in addition to the permutation results. |
cex.leg |
Scale factor for the size of legends in plots. Larger values make legends bigger. |
n.levels |
An optional argument of interest only when y is categorical and x is quantitative. It specifies the number of levels when converting x to a categorical variable during the analysis. Each level will have the same number of cases. If this does not work out evenly, some levels are randomly picked to have one more case than the others. If unspecified, the default is to pick the number of levels so that there are 10 cases per level or a maximum of 6 levels (whichever is smaller). |
prompt |
If TRUE (the default), the user is prompted before additional plots are displayed. |
color |
If TRUE (the default), plots are drawn in color. |
... |
Additional arguments related to plotting, e.g., pch, lty, lwd |
Details
This function uses Monte Carlo simulation (permutation procedure) to approximate the p-value of an association. Only complete cases are considered in the analysis.
Valid formulas may include functions of the variables, e.g., y^2, log10(x), or more complicated functions like I(x1/(x2+x3)). In the latter case, I() must surround the function of interest for it to be computed correctly.
When both x and y are quantitative variables, an analysis of Pearson's correlation and Spearman's rank correlation is provided. Scatterplots and histograms of the variables are provided. If classic is TRUE, the QQ-plots of the variables are provided along with tests of assumptions.
When x is categorical and y is quantitative, the averages (as well as mean ranks and medians) of y are compared between levels of x. The "discrepancy" is the F statistic for averages, Kruskal-Wallis statistic for mean ranks, and the chi-squared statistic for the median test. Side-by-side boxplots are also provided. If classic is TRUE, the QQ-plots of the distribution of y for each level of x are provided.
When x is quantitative and y is categorical, x is converted to a categorical variable with n.levels levels with equal numbers of cases. A chi-squared test is performed for the association. The classic approach assumes a multinomial logistic regression to check significance. A mosaic plot showing the distribution of y for each induced level of x is provided as well as a probability "curve". If classic is TRUE, the multinomial logistic curves for each level are provided versus x.
When both x and y are categorical, a chi-squared test is performed. The contingency table, table of expected counts, and conditional distributions are also reported along with a mosaic plot.
If the permutation procedure is used, the sampling distribution of the measure of association over the requested number of permutations is displayed along with the value observed in the actual data (except when y is categorical and x is quantitative).
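As a minimal sketch of the permutation idea (illustrative only, not the exact internals of associate), the p-value for Pearson's correlation can be approximated by shuffling y to destroy any relationship with x:
#Approximate permutation p-value for a correlation (500 shuffles)
data(SALARY)
obs <- cor(SALARY$Salary,SALARY$Education)
perm <- replicate(500, cor(SALARY$Salary,sample(SALARY$Education)))
mean( abs(perm) >= abs(obs) ) #fraction of shuffles at least as extreme as observed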
If classic results are desired, then plots and tests to check assumptions are supplied. white.test from package bstats (version 1.1-11-5) and mshapiro.test from package mvnormtest (version 0.1-9) are built into the function to avoid directly referencing the libraries (which sometimes causes problems).
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
lm, glm, anova, cor, chisq.test, vglm
Examples
#Two quantitative variables
data(SALARY)
associate(Salary~Education,data=SALARY,permutations=1000)
#y is quantitative while x is categorical
data(SURVEY11)
associate(X07.GPA~X40.FavAlcohol,data=SURVEY11,permutations=0,classic=TRUE)
#y is categorical while x is quantitative
data(WINE)
associate(Quality~alcohol,data=WINE,classic=TRUE,n.levels=5)
#Two categorical variables (many cases, turns off prompt asking for user input)
data(ACCOUNT)
set.seed(320)
#Work with a smaller subset
SUBSET <- ACCOUNT[sample(nrow(ACCOUNT),1000),]
associate(Purchase~Area.Classification,data=SUBSET,classic=TRUE,prompt=FALSE)
Variable selection for descriptive or predictive linear and logistic regression models
Description
This function uses bestglm to consider an extensive array of models and makes recommendations on what set of variables is appropriate for the final model. Model hierarchy is not preserved. Interactions and multi-level categorical variables are allowed.
Usage
build_model(form,data,type="predictive",Kfold=5,repeats=10,
prompt=TRUE,seed=NA,holdout=NA,...)
Arguments
form |
A model formula giving the most complex model to consider (often predicting y from all other variables, i.e., y~.). |
data |
Name of the data frame that contains all variables specified by form. |
type |
Either "predictive" or "descriptive". If |
Kfold |
The number of folds for repeated K-fold cross-validation for predictive model building |
repeats |
The number of repeats for repeated K-fold cross-validation for predictive model building |
seed |
If specified, the random number seed used to initialize the repeated K-fold cross-validation procedure so that results can be reproduced. |
prompt |
If TRUE (the default), the user is asked to confirm before a potentially lengthy calculation begins. |
holdout |
An optional data frame to serve as a holdout sample. The generalization error on the holdout sample will be calculated and displayed for the best model at each number of predictors. |
... |
Additional arguments passed to bestglm. |
Details
This procedure takes the formula specified by form and the original dataframe and simply converts it into a form that bestglm (which normally cannot do cross-validation when categorical variables are involved) can use by adding in columns to represent interactions and categorical variables.
Once the data frame has been generated, a warning is given to the user if the procedure may take too long (many rows or many potential predictors), and then bestglm is run. A plot and table of the models' performances are given, as well as a recommendation for a final set of variables (the model with the lowest AIC/estimated generalization error, or a simpler model that is more or less equivalent).
The command returns a list with bestformula (the formula of the model with the lowest AIC or the model chosen by the one standard deviation rule), bestmodel (the fitted model that had the lowest AIC or the one chosen by the one standard deviation rule), predictors (a list giving the predictors that appeared in the best model with 1 predictor, with 2 predictors, etc).
If a descriptive model is sought, the last component of the returned list is AICtable (a data frame containing the number of predictors and the AIC of the best model with that number of predictors; a * denotes the model with the lowest AIC while a + denotes the simplest model whose AIC is within 2 of the lowest).
If a predictive model is sought, the last component of the returned list is CVtable (a data frame containing the number of predictors and the estimated generalization error of the best model with that number of predictors along with the SD from repeated K-fold cross validation; a * denotes the model with the lowest error while the + denotes the model selected with the one standard deviation rule). Note that the generalization error in the second column of this table is the squared error if the response is quantitative and is another measure of error (not the misclassification rate) if the response is categorical. Additional columns are provided to give the root mean squared error or misclassification rate.
Note: bestmodel is the one selected by the one standard deviation rule or the simplest one whose AIC is no more than 2 above the model with the lowest AIC. Because the procedure does not respect model hierarchy and can include interactions, the formula returned may not be immediately usable if it involves a categorical variable, since the names returned are the ones R gives to the individual indicator variables. You may have to manually fit the model based on the selected predictors.
If holdout is given, a plot of the error on the holdout sample versus the number of predictors (for the best model at each number of predictors) is provided along with the estimated generalization error from the training set. This can be used to see whether the models generalize well, but it is in general not used to tune which model is selected.
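The one standard deviation rule mentioned above can be sketched on a toy cross-validation table. The numbers below are made up purely for illustration, and the column layout is assumed from the description of CVtable:
#Toy illustration of the one standard deviation rule (made-up numbers)
CV <- data.frame(predictors=1:4,error=c(10.2,9.1,9.0,9.4),SD=c(0.5,0.4,0.4,0.5))
best <- which.min(CV$error)
cutoff <- CV$error[best] + CV$SD[best]
CV$predictors[ min( which(CV$error <= cutoff) ) ] #simplest model within 1 SD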
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling with R
See Also
bestglm, regsubsets, see.models, generalization.error.
Examples
#Descriptive model. Note: Tip and Bill should not be used simultaneously as
#predictors of TipPercentage, so leave Tip out since it's not known ahead of time
data(TIPS)
MODELS <- build_model(TipPercentage~.-Tip,data=TIPS,type="descriptive")
MODELS$AICtable
MODELS$predictors[[1]] #Variables in the best model with a single predictor
MODELS$predictors[[2]] #Variables in best model with two predictors
summary(MODELS$bestmodel) #Summary of best model, in this case with two predictors
#Another descriptive model (large dataset so changing prompt=FALSE for documentation)
data(PURCHASE)
set.seed(320)
#Take a subset of full dataframe for quick illustration
SUBSET <- PURCHASE[sample(nrow(PURCHASE),500),]
MODELS <- build_model(Purchase~.,data=SUBSET,type="descriptive",prompt=FALSE)
MODELS$AICtable #Model with 1 or 2 variables look pretty good
MODELS$predictors[[2]]
#Predictive model.
data(SALARY)
set.seed(2010)
train.rows <- sample(nrow(SALARY),0.7*nrow(SALARY),replace=TRUE)
TRAIN <- SALARY[train.rows,]
HOLDOUT <- SALARY[-train.rows,]
MODELS <- build_model(Salary~.^2,data=TRAIN,holdout=HOLDOUT)
summary(MODELS$bestmodel)
M <- lm(Salary~Gender+Education:Months,data=TRAIN)
generalization_error(M,HOLDOUT)
#Predictive model for WINE data, takes a while. Misclassification rate on holdout sample is 18%.
data(WINE)
set.seed(2010)
train.rows <- sample(nrow(WINE),0.7*nrow(WINE),replace=TRUE)
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
## Not run: MODELS <- build_model(Quality~.,data=TRAIN,seed=1919,holdout=HOLDOUT)
## Not run: MODELS$CVtable
Exploratory building of partition models
Description
A tool to choose the "correct" complexity parameter of a tree
Usage
build_tree(form, data, minbucket = 5, seed=NA, holdout, mincp=0)
Arguments
form |
A formula describing the tree to be built |
data |
Data frame containing the variables to build the tree |
minbucket |
The minimum number of cases allowed in any leaf in the tree |
seed |
If given, specifies the random number seed so the cross-validation error can be reproduced. |
holdout |
If given, the error on the holdout sample is calculated and given in the cp table. |
mincp |
The smallest value of the complexity parameter cp to consider when growing the tree (passed to rpart); the default of 0 grows the tree to its maximum possible extent. |
Details
This command combines the action of building a tree to its maximum possible extent using rpart and looking at the results using getcp. A plot of the estimated relative generalization error (as determined by 10-fold cross validation) versus the number of splits is provided. In addition, the complexity parameter table giving the cp of the tree with the lowest error (and of the simplest tree with an error within one standard deviation of the lowest error) is reported.
If holdout is given, the RMSE/misclassification rate on the training and holdout samples are provided in the cp table.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
data(JUNK)
build_tree(Junk~.,data=JUNK,seed=1337)
data(CENSUS)
build_tree(ResponseRate~.,data=CENSUS,seed=2017,mincp=0.001)
data(OFFENSE)
build_tree(Win~.,data=OFFENSE[1:200,],seed=2029,holdout=OFFENSE[201:352,])
Linear and Logistic Regression diagnostics
Description
If the model is a linear regression, obtain tests of linearity, equal spread, and Normality as well as relevant plots (residuals vs. fitted values, histogram of residuals, QQ plot of residuals, and predictor vs. residuals plots). If the model is a logistic regression model, a goodness of fit test is given.
Usage
check_regression(M,extra=FALSE,tests=TRUE,simulations=500,n.cats=10,seed=NA,prompt=TRUE)
Arguments
M |
A linear regression model fitted with lm or a logistic regression model fitted with glm. |
extra |
If TRUE, the predictor vs. residuals plots are also provided for linear regression models. |
tests |
If TRUE (the default), the statistical tests of the assumptions are conducted and their results reported. |
simulations |
The number of artificial samples to generate for estimating the p-value of the goodness of fit test for logistic regression models. These artificial samples are generated assuming the fitted logistic regression is correct. |
n.cats |
Number of (roughly) equal sized categories for the Hosmer-Lemeshow goodness of fit test for logistic regression models |
seed |
If specified, sets the random number seed before generation of artificial samples in the goodness of fit tests for logistic regression models. |
prompt |
For documentation only; if FALSE, the user is not prompted between plots. |
Details
This function provides standard visual and statistical diagnostics for regression models.
For linear regression, tests of linearity, equal spread, and Normality are performed and residuals plots are generated.
The test for linearity (a goodness of fit test) is an F-test. A simple linear regression model predicting y from x is fit and compared to a model treating each value of the predictor as a level of a categorical variable. If this more sophisticated model does not offer a significant improvement in the sum of squared errors, the linearity assumption in that predictor is reasonable. If the p-value is larger than 0.05, then statistically we can consider the relationship to be linear. If the p-value is smaller than 0.05, check the residuals plot and the predictor vs. residuals plots for signs of obvious curvature (the test can be overly sensitive to inconsequential violations for larger sample sizes). The test can only be run if there are two or more individuals that share a common value of x. A test of the model as a whole is run similarly if at least two individuals have identical combinations of all predictor variables.
Note: if categorical variables, interactions, polynomial terms, etc., are present in the model, the test for linearity is conducted for each term even when it does not necessarily make sense to do so.
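When x has repeated values, the goodness of fit test for linearity can be sketched by hand by comparing the line to a model with one mean per value of x (a sketch of the idea, not necessarily check_regression's exact implementation):
#Lack-of-fit F test: a line vs. one mean per value of PartySize
data(TIPS)
linear <- lm(Tip~PartySize,data=TIPS)
saturated <- lm(Tip~factor(PartySize),data=TIPS)
anova(linear,saturated) #large p-value suggests linearity is reasonable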
The test for equal spread is the Breusch-Pagan test. If the p-value is larger than 0.05, then statistically we can consider the residuals to have equal spread everywhere. If the p-value is smaller than 0.05, check the residuals plot for obvious signs of unequal spread (the test can be overly sensitive to inconsequential violations for larger sample sizes).
The test for Normality is the Shapiro-Wilk test when the sample size is smaller than 5000, or the KS-test for larger sample sizes. If the p-value is larger than 0.05, then statistically we can consider the residuals to be Normally distributed. If the p-value is smaller than 0.05, check the histogram and QQ plot of residuals for obvious signs of non-Normality (e.g., skewness or outliers). The test can be overly sensitive to inconsequential violations for larger sample sizes.
The first three plots displayed are the residuals plot (residuals vs. fitted values), histogram of residuals, and QQ plot of residuals. The function gives the option of pressing Enter to display additional predictor vs. residual plots if extra=TRUE, or to terminate by typing 'q' in the console and pressing Enter. If polynomial or interactions terms are present in the model, a plot is provided for each term. If categorical predictors are present, plots are provided for each indicator variable.
For logistic regression, two goodness of fit tests are offered.
Method 1 is a crude test that assumes the fitted logistic regression is correct, then generates an artificial sample according to the predicted probabilities. A chi-squared test is conducted that compares the observed levels to the predicted levels. The test is failed if the p-value is less than 0.05. The test is not sensitive to departures from the logistic curve unless the sample size is very large or the logistic curve is a really bad model.
Method 2 is a Hosmer-Lemeshow type goodness of fit test. The observations are put into 10 groups according to the probability predicted by the logistic regression model. For example, if there were 200 observations, the first group would have the cases with the 20 smallest predicted probabilities, the second group would have the cases with the 20 next smallest probabilities, etc. The number of cases with the level of interest is compared with the expected number given the fitted logistic regression model via a chi-squared test. The test is failed if the p-value is less than 0.05.
Note: for both methods, the p-values of the chi-squared tests are estimated via Monte Carlo simulation instead of via any asymptotic results.
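The grouping step of Method 2 can be sketched as follows (illustrative only; the function's internal grouping and Monte Carlo p-value may differ in detail):
#Put cases into 10 roughly equal groups by predicted probability
data(WINE)
M <- glm(Quality~alcohol,data=WINE,family=binomial)
p <- fitted(M)
g <- cut( rank(p,ties.method="first"), breaks=10, labels=FALSE )
table(g,WINE$Quality) #observed counts per group, to compare with expected counts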
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
lm, glm, shapiro.test, ks.test, bptest (in package lmtest). The goodness of fit test for logistic regression is further detailed and implemented in package 'rms' using the commands lrm and residuals.
Examples
#Simple linear regression where everything looks good
data(FRIEND)
M <- lm(FriendshipPotential~Attractiveness,data=FRIEND)
check_regression(M)
#Multiple linear regression (prompt is FALSE only for documentation)
data(AUTO)
M <- lm(FuelEfficiency~.,data=AUTO)
check_regression(M,extra=TRUE,prompt=FALSE)
#Multiple linear regression with a categorical predictors and an interaction
data(TIPS)
M <- lm(TipPercentage~Bill*PartySize*Weekday,data=TIPS)
check_regression(M)
#Multiple linear regression with polynomial term (prompt is FALSE only for documentation)
#Note: in this example only plots are provided
data(BULLDOZER)
M <- lm(SalePrice~.-YearMade+poly(YearMade,2),data=BULLDOZER)
check_regression(M,extra=TRUE,tests=FALSE,prompt=FALSE)
#Simple logistic regression. Use 8 categories since only 8 unique values of Dose
data(POISON)
M <- glm(Outcome~Dose,data=POISON,family=binomial)
check_regression(M,n.cats=8,seed=892)
#Multiple logistic regression
data(WINE)
M <- glm(Quality~.,data=WINE,family=binomial)
check_regression(M,seed=2010)
Choosing order of a polynomial model
Description
This function takes a simple linear regression model and displays the adjusted R^2 and AICc for the original model (order 1) and for polynomial models up to a specified maximum order and plots the fitted models.
Usage
choose_order(M,max.order=6,sort=FALSE,loc="topleft",show=NULL,...)
Arguments
M |
A simple linear regression model fitted with lm() |
max.order |
The maximum order of the polynomial model to consider. |
sort |
How to sort the results. If TRUE, "R2", "r2", "r2adj", or "R2adj", results are sorted from highest to lowest adjusted R^2. If "AIC", "aic", "AICC", or "AICc", results are sorted by AICc. |
loc |
Location of the legend. Can also be "top", "topright", "bottomleft", "bottom", "bottomright", "left", "right", "center" |
show |
An optional vector of orders to examine instead of 1 through max.order. |
... |
Additional arguments to plot(), e.g., pch |
Details
The function outputs a table of the order of the polynomial and the corresponding adjusted R^2 and AICc. One strategy for picking the best order is to find the highest value of adjusted R^2, then to choose the smallest order (simplest model) whose adjusted R^2 is within 0.005 of it. Another strategy is to find the lowest value of AICc, then to choose the smallest order that has an AICc no more than 2 higher.
The scatterplot of the data is provided and the fitted models are displayed as well.
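A rough hand version of the comparison (using plain AIC rather than the AICc that choose_order reports) might look like:
#Fit polynomials of orders 1-4 and compare information criteria
data(BULLDOZER)
fits <- lapply(1:4, function(k) lm(SalePrice~poly(YearMade,k),data=BULLDOZER))
sapply(fits,AIC) #pick the smallest order within about 2 of the minimum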
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(BULLDOZER)
M <- lm(SalePrice~YearMade,data=BULLDOZER)
#Unsorted list, messing with plot options to make it look alright
choose_order(M,pch=20,cex=.3)
#Sort by R2adj. A 10th order polynomial is highest, but this seems overly complex
choose_order(M,max.order=10,sort=TRUE)
#Sort by AICc. 4th order is lowest, but 2nd order is simpler and within 2 of lowest
choose_order(M,max.order=10,sort="aic")
Combines rare levels of a categorical variable
Description
This function takes a categorical variable and combines all levels that appear a total of threshold times or fewer into a single level, named Combined by default
Usage
combine_rare_levels(x,threshold=20,newname="Combined")
Arguments
x |
a vector of categorical values |
threshold |
levels that appear a total of threshold times or fewer are combined into a single level. |
newname |
defaults to "Combined"; the name given to the newly formed level. |
Details
Returns a list of two objects:
values - The recoded values of the categorical variable. All levels which appeared threshold times or fewer are now known as Combined
combined - The levels that have been combined together
If, after being combined, the newname level has threshold or fewer instances, the remaining level that appears least often is combined as well.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(EX6.CLICK)
x <- EX6.CLICK[,15]
table(x)
#Combine all levels which appear 700 or fewer times (AA, CC, DD)
y <- combine_rare_levels(x,700)
table( y$values )
#Combine all levels which appear 1350 or fewer times. This forces BB (which
#occurs 2422 times) into the Combined level since the three levels that appear
#fewer than 1350 times do not appear more than 1350 times combined
y <- combine_rare_levels(x,1350)
table( y$values )
Confusion matrix for logistic regression models
Description
This function takes the output of a logistic regression created with glm and returns the confusion matrix.
Usage
confusion_matrix(M,DATA=NA)
Arguments
M |
A logistic regression model created with glm. |
DATA |
A data frame on which the confusion matrix will be made. If omitted, the confusion matrix is computed on the data used to fit M. |
Details
This function makes classifications on the data used to build a logistic regression model (or on new data, if DATA is provided) by predicting the "level of interest" (the level that comes last alphabetically) when the predicted probability exceeds 50%.
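The classification rule can be sketched by hand (a sketch of the idea; confusion_matrix formats its output differently):
#Classify as the level of interest (last alphabetically) when p > 0.5
data(WINE)
M <- glm(Quality~alcohol,data=WINE,family=binomial)
pred <- ifelse( fitted(M)>0.5, levels(WINE$Quality)[2], levels(WINE$Quality)[1] )
table(Actual=WINE$Quality,Predicted=pred)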
Author(s)
Adam Petrie
See Also
Examples
#On WINE data as a whole
data(WINE)
M <- glm(Quality~.,data=WINE,family=binomial)
confusion_matrix(M)
#Calculate generalization error using training/holdout
set.seed(1010)
train.rows <- sample(nrow(WINE),0.7*nrow(WINE),replace=TRUE)
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
M <- glm(Quality~.,data=TRAIN,family=binomial)
confusion_matrix(M,HOLDOUT)
#Predicting donation
#Model predicting from recent average gift amount is significant, but its
#classifications are the same as the naive model (majority rules)
data(DONOR)
M.naive <- glm(Donate~1,data=DONOR,family=binomial)
confusion_matrix(M.naive)
M <- glm(Donate~RECENT_AVG_GIFT_AMT,data=DONOR,family=binomial)
confusion_matrix(M)
Correlation demo
Description
This function shows the correlation and coefficient of determination as user interactively adds datapoints. Useful for seeing what different values of correlation look like and seeing the effect of outliers.
Usage
cor_demo(newplot=FALSE,cex.leg=0.8)
Arguments
newplot |
If |
cex.leg |
A number specifying the magnification of legends inside the plot. Smaller numbers mean smaller font. |
Details
This function allows the user to generate data by clicking on a plot. Once two points are added, the correlation (r) and coefficient of determination (r^2) are displayed. When an additional point is added, these values are updated in the upper left, with previous values displayed in the upper right. The effect of outliers on the correlation and coefficient of determination can easily be illustrated. Pressing the red UNDO button on the plot allows you to take away recently added points for further exploration.
Note: To end the demo, you MUST click on the red box labeled "End" (or press Escape, which will return an error)
Author(s)
Adam Petrie
Correlation Matrix
Description
This function produces the matrix of correlations between all quantitative variables in a dataframe.
Usage
cor_matrix(X,type="pearson")
Arguments
X |
A data frame |
type |
Either "pearson" (the default) or "spearman". |
Details
This function filters out any non-numerical variables and provides correlations only between quantitative variables. Best for datasets with only a few variables. The correlation matrix is returned (with class matrix).
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
data(TIPS)
cor_matrix(TIPS)
data(AUTO)
cor_matrix(AUTO,type="spearman")
Main driver analysis when Y is a categorical quantity
Description
This function provides a "main driver analysis" on the association between a categorical y variable and the "driver" x. A visualization (mosaic plot) of the strength of the relationship is provided as well as numerical output to help quantify the variation of y across possible values of the "driver" x.
Usage
examine_driver_Ycat(formula,data,sort=TRUE,inside=TRUE,equal=TRUE)
Arguments
formula |
A standard R formula written as y=="Yes"~x, where y is the variable of interest and "Yes" is the level of interest (you need to pick one of the levels of y to be the level of interest) and x is the driver. |
data |
An argument giving the name of the data frame that contains x and y. |
sort |
If TRUE (the default), the values of x are sorted by the estimated probability that y has the level of interest. |
equal |
If TRUE (the default), the bars of the mosaic plot are drawn with equal widths rather than with widths proportional to the frequency of each value of x. |
inside |
If TRUE (the default), the labels of the levels are drawn inside the bars of the mosaic plot. |
Details
Main driver analysis is a cornerstone of business analytics where we identify and quantify the key factors (drivers) that most strongly influence a business outcome or performance metric.
This function handles the case when y (the outcome variable) is categorical and you want to analyze the chance that an entity has a specific value of y (the level of interest). See examine_driver_Ynumeric when y is numeric.
This function works best if x is a categorical variable (with multiple examples of each level of x), since the probability that y equals the level of interest is estimated for each unique value of x.
A mosaic plot (see mosaic and its associated arguments) is presented to visualize the relationship between y and the driver.
A table giving the estimated probability that y has the level of interest for each value of x is provided. A "connecting letters report" shows which levels have statistically significant differences in the probability that y has the level of interest (if ANY letters are in common between two values of x, there is not a statistically significant difference in the probability that y has the level of interest between those two values of x; if ALL letters are different, the difference is statistically significant).
The function also provides a "Driver Score" (a value between 0 and 1; larger driver scores indicate stronger associations between the chance that y has the level of interest and x). This driver score is the R-squared of a simple linear regression predicting 1 (y has the level of interest) or 0 (y does not have the level of interest) from x, and is best treated as a relative score indicating the strength of the relationship (the value itself does not hold any practical significance).
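Since the driver score is described as the R-squared of a 0/1 regression, it can be sketched by hand:
#R-squared of regressing the 0/1 indicator of the level of interest on x
data(EX6.CLICK)
y01 <- as.numeric( EX6.CLICK$Click=="Yes" )
summary( lm(y01~EX6.CLICK$DeviceModel) )$r.squared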
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
examine_driver_Ynumeric,mosaic
Examples
#No statistically significant differences in levels
data(CUSTLOYALTY)
examine_driver_Ycat(Married=="Single"~Income,data=CUSTLOYALTY)
#Some statistically significant differences in levels
data(EX6.CLICK)
examine_driver_Ycat(Click=="Yes"~SiteID,data=EX6.CLICK)
examine_driver_Ycat(Click=="Yes"~DeviceModel,data=EX6.CLICK)
Main driver analysis when Y is a numeric quantity
Description
This function provides a "main driver analysis" on the association between a numeric y variable and the "driver" x. A visualization of the strength of the relationship is provided as well as numerical output to help quantify the variation of y across possible values of the "driver" x.
Usage
examine_driver_Ynumeric(formula,data,sort=TRUE)
Arguments
formula |
A standard R formula written as y~x, where y is the variable of interest and x is the driver. |
data |
An argument giving the name of the data frame that contains x and y. |
sort |
If TRUE (the default), the values of x are sorted by the average value of y. |
Details
Main driver analysis is a cornerstone of business analytics where we identify and quantify the key factors (drivers) that most strongly influence a business outcome or performance metric.
This function handles the case when y (the outcome variable) is numeric (see examine_driver_Ycat when y is categorical).
If the driver x is numeric, a scatterplot is presented along with a trend line (in blue; a black line for the average value of y is added). A summary of a simple linear regression model is also provided.
If the driver x is categorical, side-by-side boxplots of the distribution of y for each value of x are provided (a black line gives the average value of y in the data). A table giving the average value of y for each value of x is provided along with a "connecting letters report" to discern which levels have statistically significant differences in the average value of y (if ANY letters are in common between two values of x, there is not a statistically significant difference in the average value of y between those two values of x; if ALL letters are different, the difference in the average value of y is statistically significant).
The function also provides a "Driver Score" (a value between 0 and 1 which is simply the R-squared of a simple linear regression predicting y from x). Larger driver scores indicate stronger associations between y and x.
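Since the driver score is simply the R-squared of a simple linear regression, it can be reproduced by hand:
#Driver score of Income for CustomerLV
data(CUSTLOYALTY)
summary( lm(CustomerLV~Income,data=CUSTLOYALTY) )$r.squared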
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
#X numeric
data(CUSTLOYALTY)
examine_driver_Ynumeric(CustomerLV~WalletShare,data=CUSTLOYALTY)
#X categorical (no statistically significant differences in levels)
data(CUSTLOYALTY)
examine_driver_Ynumeric(CustomerLV~Married,data=CUSTLOYALTY)
#X categorical (statistically significant differences in levels)
data(CUSTLOYALTY)
examine_driver_Ynumeric(CustomerLV~Income,data=CUSTLOYALTY)
A crude check for extrapolation
Description
This function computes the Mahalanobis distance of points as a check for potential extrapolation.
Usage
extrapolation_check(M,newdata)
Arguments
M |
A fitted model that uses only quantitative variables |
newdata |
A data frame that has the exact same columns as the predictors used to fit the model. |
Details
This function computes the shape of the predictor data cloud and calculates the distances of points from the center (with respect to the shape of the data cloud). Extrapolation occurs at a combination of predictors that is far from combinations used to build the model. An observation with a large Mahalanobis distance MAY be far from the observations used to build the model and thus MAY require extrapolation.
Note: the analysis assumes the predictor data cloud is roughly elliptical (this may not be a good assumption).
The function reports the percentiles of the Mahalanobis distances of the points in newdata. The percentile is the fraction of observations used in the model that are CLOSER to the center than the point in question. Large values of these percentages indicate a greater risk of extrapolation: if the percentile is about 99 or higher, you may be extrapolating.
The method is sensitive to outliers and clusters of outliers, and it gives only a crude idea of the potential for extrapolation.
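The underlying computation can be sketched with mahalanobis() (a sketch of the idea; the function's exact output may differ):
#Percentile of a new point's Mahalanobis distance among the model's data
data(SALARY)
X <- SALARY[,c("Education","Experience","Months")]
d2 <- mahalanobis(X,colMeans(X),cov(X))
new <- data.frame(Education=5,Experience=15,Months=0)
d2.new <- mahalanobis(new,colMeans(X),cov(X))
mean(d2 < d2.new) #fraction of observations closer to the center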
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
data(SALARY)
M <- lm(Salary~Education*Experience+Months,data=SALARY)
newdata <- data.frame(Education=c(0,5,10),Experience=c(15,15,15),Months=c(0,0,0))
extrapolation_check(M,newdata)
#Individuals 1 and 3 are rather unusual (though not terribly) while individual 2 is typical.
Transformations for simple linear regression
Description
This function takes a simple linear regression model and finds the transformation of x and y that results in the highest R2
Usage
find_transformations(M,powers=seq(from=-3,to=3,by=.25),threshold=0.02,...)
Arguments
M |
A simple linear regression model fitted with lm. |
powers |
A sequence of powers to try for x and y. By default this ranges from -3 to 3 in steps of 0.25. If 0 is a valid power, then the logarithm is used instead. |
threshold |
Report all models whose R2 is within threshold of the highest R2 found during the search. |
... |
Additional arguments passed to plot(), e.g., pch and cex. |
Details
The relationship between y and x may not be linear. However, some transformation of y may have a linear relationship with some transformation of x. This function considers simple linear regression with x and y raised to powers between -3 and 3 (in 0.25 increments) by default. The function outputs a list of the top models as gauged by R^2 (all models within 0.02 of the highest R^2). Note: there is no guarantee that these "best" transformations are actually good, since a large R^2 can be produced by outliers created during transformations. A plot of the transformation is also provided.
It is exceedingly rare that the "best" transformation is raising x and y to the first power (i.e., leaving the original variables untouched). Transformations are typically used only when there are issues in the residuals plots, highly skewed variables, or physical/logical justifications.
Note: if a variable has 0s or negative numbers, only integer transformations are considered.
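One cell of this search grid can be evaluated by hand; for example, y raised to the 0.5 power vs. the logarithm of x (the logarithm standing in for the power 0):
#R-squared of one candidate transformation
data(BULLDOZER)
summary( lm(sqrt(SalePrice)~log(YearMade),data=BULLDOZER) )$r.squared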
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
#Straightforward example
data(BULLDOZER)
M <- lm(SalePrice~YearMade,data=BULLDOZER)
find_transformations(M,pch=20,cex=0.3)
#Results are very misleading since selected models have high R2 due to outliers
data(MOVIE)
M <- lm(Total~Weekend,data=MOVIE)
find_transformations(M,powers=seq(-2,2,by=0.5),threshold=0.05)
Calculating the generalization error of a model on a set of data
Description
This function takes a linear regression from lm, logistic regression from glm, partition model from rpart, or random forest from randomForest and calculates the generalization error on a dataframe.
Usage
generalization_error(MODEL,HOLDOUT,Kfold=FALSE,K=5,R=10,seed=NA)
Arguments
MODEL |
A linear regression model created using lm, a logistic regression created using glm, a partition model created using rpart, or a random forest created using randomForest. |
HOLDOUT |
A dataset for which the generalization error will be calculated. If not given, the error on the data used to build the model (the training error) is reported. |
Kfold |
If TRUE, the generalization error is additionally estimated via repeated K-fold cross-validation on the data used to fit the model (regression models only). |
K |
The number of folds used in repeated K-fold cross-validation for the estimation of the generalization error for the model |
R |
The number of repeats used in repeated K-fold cross-validation. |
seed |
an optional argument priming the random number seed for estimating the generalization error |
Details
This function calculates the error of MODEL on the data used to fit it, its estimated generalization error from repeated K-fold cross-validation (for regression models only), and the actual generalization error on HOLDOUT. If the response is quantitative, the RMSE is reported. If the response is categorical, confusion matrices and misclassification rates are returned.
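For a quantitative response, the RMSE reported on a holdout sample can be reproduced by hand (a sketch mirroring the first example below):
#RMSE on a holdout sample, computed directly
data(STUDENT)
set.seed(1010)
train.rows <- sample(1:nrow(STUDENT),0.5*nrow(STUDENT))
TRAIN <- STUDENT[train.rows,]; HOLDOUT <- STUDENT[-train.rows,]
M <- lm(CollegeGPA~.,data=TRAIN)
sqrt( mean( (HOLDOUT$CollegeGPA - predict(M,newdata=HOLDOUT))^2 ) )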
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
#Education analytics
data(STUDENT)
set.seed(1010)
train.rows <- sample(1:nrow(STUDENT),0.5*nrow(STUDENT))
TRAIN <- STUDENT[train.rows,]
HOLDOUT <- STUDENT[-train.rows,]
M <- lm(CollegeGPA~.,data=TRAIN)
#Also estimate the generalization error of the model
generalization_error(M,HOLDOUT,Kfold=TRUE,seed=5020)
#Try partition and randomforest, though they do not perform as well as regression here
TREE <- rpart(CollegeGPA~.,data=TRAIN)
FOREST <- randomForest(CollegeGPA~.,data=TRAIN,ntree=50)
generalization_error(TREE,HOLDOUT)
generalization_error(FOREST,HOLDOUT)
#Wine
data(WINE)
set.seed(2020)
train.rows <- sample(1:nrow(WINE),0.5*nrow(WINE))
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
M <- glm(Quality~.^2,data=TRAIN,family=binomial)
generalization_error(M,HOLDOUT)
#Try a partition model as well
TREE <- rpart(Quality~.,data=TRAIN)
generalization_error(TREE,HOLDOUT)
Complexity Parameter table for partition models
Description
A simple function that takes the output of a partition model created with rpart and returns information about the complexity parameter and the performance of various models.
Usage
getcp(TREE)
Arguments
TREE |
An object of class rpart created with rpart. |
Details
This function prints out a table of the complexity parameter, number of splits, relative error, cross-validation error, and standard deviation of the cross-validation error for a partition model. It adds helpful advice giving the value of CP for the tree that had the lowest cross-validation error, and also the value of CP for the simplest tree with a cross-validation error at most 1 standard deviation above the lowest.
Further, a plot is made of the estimated generalization error (xerror) versus the number of splits to illustrate when the tree stops improving. Vertical lines are drawn at the number of splits corresponding to the lowest estimated generalization error and at the number of splits of the tree selected by the one standard deviation rule.
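The one standard deviation rule can be applied to the cp table by hand (a sketch using the same tree as the example below):
#Pick the simplest tree whose xerror is within 1 SD of the lowest
data(JUNK)
TREE <- rpart(Junk~.,data=JUNK,control=rpart.control(cp=0,xval=10,minbucket=5))
tab <- TREE$cptable
best <- which.min( tab[,"xerror"] )
tab[ min( which( tab[,"xerror"] <= tab[best,"xerror"]+tab[best,"xstd"] ) ), "CP" ]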
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
data(JUNK)
TREE <- rpart(Junk~.,data=JUNK,control=rpart.control(cp=0,xval=10,minbucket=5))
getcp(TREE)
Influence plot for regression diagnostics
Description
This function plots leverage vs. deleted studentized residuals for a regression model, highlighting points that are influential based on these two factors as well as on Cook's distance
Usage
influence_plot(M,large.cook,cooks=FALSE,label=FALSE)
Arguments
M |
A linear regression model fitted with lm() |
large.cook |
The threshold for a "large" Cook's distance. If not specified, a default of 4/n is used. |
cooks |
|
label |
If TRUE, influential points are labeled with their row numbers. |
Details
A point is influential if its addition to the data changes the regression substantially. One way of measuring influence is by looking at the point's leverage (its distance from the center of the predictor data cloud with respect to its shape) and its deleted studentized residual (the relative size of the residual from a regression made without that point). Points with leverages larger than 2(k+1)/n (where k is the number of predictors) and deleted studentized residuals larger than 2 in magnitude are considered influential.
Influence can also be measured by Cook's distance, which essentially combines the above two measures. This function considers a Cook's distance to be large when it exceeds 4/n, but the user can specify another cutoff.
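These criteria can be checked by hand (a sketch; influence_plot conveys the same information graphically):
#Row numbers flagged by the leverage/residual and Cook's distance criteria
data(TIPS)
M <- lm(TipPercentage~.-Tip,data=TIPS)
k <- length(coef(M))-1 #number of predictors, counting indicator variables
n <- nobs(M)
which( hatvalues(M) > 2*(k+1)/n & abs(rstudent(M)) > 2 )
which( cooks.distance(M) > 4/n )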
The radius of a point is proportional to the square root of the Cook's distance. Influential points according to leverage/residual criteria have an X through them while influential points according to Cook's distance are bolded.
The function returns the row numbers of influential observations.
Value
A list with the row numbers of influential points according to Cook's distance ($Cooks) and according to leverage/residual criteria ($Leverage).
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
cooks.distance, hatvalues, rstudent
Examples
data(TIPS)
M <- lm(TipPercentage~.-Tip,data=TIPS)
influence_plot(M)
Find the mode of a categorical variable
Description
This function finds the mode of a categorical variable
Usage
mode_factor(x)
Arguments
x |
a factor |
Details
The mode is the most frequently occurring level of a categorical variable. This function returns the mode of a categorical variable. If there is a tie for the most frequent level, all modes are returned.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(EX6.CLICK)
mode_factor(EX6.CLICK$DeviceModel)
#To see how often it appears try sorting a table
sort( table(EX6.CLICK$DeviceModel),decreasing=TRUE )
x <- c( rep(letters[1:4],5), "e", "f" ) #multimodal
mode_factor(x)
Mosaic plot
Description
Provides a mosaic plot to visualize the association between two categorical variables
Usage
mosaic(formula,data,color=TRUE,labelat=c(),xlab=c(),ylab=c(),
magnification=1,equal=FALSE,inside=FALSE,ordered=FALSE)
Arguments
formula |
A standard R formula written as y~x, where y is the name of the variable playing the role of y and x is the name of the variable playing the role of x. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
color |
If TRUE (the default), the plot is drawn in color. |
labelat |
a vector of factor levels of x to label on the plot (useful when x has many levels and the labels would otherwise overlap). |
xlab |
Label of the horizontal axis if you want something different than the name of the x variable. |
ylab |
Label of the vertical axis if you want something different than the name of the y variable. |
magnification |
Magnification of the labels of the x variable (values less than 1 shrink the labels). |
equal |
If TRUE, all bars are drawn with equal widths rather than with widths proportional to the frequency of each level of x. |
inside |
If TRUE, the labels of the levels of x are drawn inside the bars. |
ordered |
If |
Details
This function shows a mosaic plot to visualize the conditional distributions of y for each level of x, along with the marginal distribution of y to the right of the plot. The widths of the segmented bar charts are proportional to the frequency of each level of x. These plots are the same as those that appear when using associate.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
Examples
data(ACCOUNT)
mosaic(Area.Classification~Purchase,data=ACCOUNT,color=TRUE)
data(EX6.CLICK)
#Default presentation: not very useful
mosaic(Click~DeviceModel,data=EX6.CLICK)
#Better presentation
mosaic(Click~DeviceModel,data=EX6.CLICK,equal=TRUE,inside=TRUE,magnification=0.8)
Interactive demonstration of the effect of an outlier on a regression
Description
This function shows regression lines on user-defined data before and after adding an additional point.
Usage
outlier_demo(newplot=FALSE,cex.leg=0.8)
Arguments
newplot |
If |
cex.leg |
A number specifying the magnification of legends inside the plot. Smaller numbers mean smaller font. |
Details
This function allows the user to generate data by clicking on a plot. Once two points are added, the least squares regression line is drawn. When an additional point is added, the regression line updates while also showing the line without that point. The effect of outliers on a regression line can easily be illustrated. Pressing the red UNDO button on the plot allows you to take away recently added points for further exploration.
Note: To end the demo, you MUST click on the red box labeled "End" (or press Escape, which will return an error)
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Demonstration of overfitting
Description
This function gives a demonstration of how overfitting occurs on a user-inputted dataset by showing the estimated generalization error as additional variables are added to the regression model (up to all two-way interactions).
Usage
overfit_demo(DF,y=NA,seed=NA,aic=TRUE)
Arguments
DF |
The data frame where the demonstration will occur. |
y |
The response variable (in quotes) |
seed |
Optional argument setting the random number seed if results need to be reproduced |
aic |
logical; if TRUE (the default), the AIC on the training set is plotted, otherwise the RMSE on the training set is plotted. |
Details
This function splits DF in half to obtain training and holdout samples. Regression models are constructed using a forward selection procedure (adding the variable that decreases the AIC the most on the training set), starting at the naive model and terminating at the full model with all two-way interactions.
The generalization error of each model is computed on the holdout sample. The AIC (or RMSE on the training set) and generalization errors are plotted versus the number of variables in the model to illustrate overfitting. Typically, the generalization error decreases at first as useful variables are added to the model, then increases once the newly added variables start to fit the quirks present only in the training data. When this happens, the model is said to be overfit.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
#Overfitting occurs after about 10 predictors (AIC begins to increase after 12/13)
data(BODYFAT)
overfit_demo(BODYFAT,y="BodyFat",seed=1010)
#Overfitting occurs after about 5 predictors
data(OFFENSE)
overfit_demo(OFFENSE,y="Win",seed=1997,aic=FALSE)
Illustrating how a simple linear/logistic regression could have turned out via permutations
Description
This function gives a demonstration of what simple linear or logistic regression lines could have looked like "by chance" had x and y been unrelated. A scatterplot and fitted regression line are displayed along with the regression lines produced when x and y are unrelated via the permutation procedure. The reductions in the sum of squared errors for all lines (for linear regressions) are also displayed for an informal assessment of significance.
Usage
possible_regressions(M,permutations=100,sse=TRUE,reduction=TRUE)
Arguments
M |
A simple linear regression model from lm or a simple logistic regression model from glm. |
permutations |
The number of artificial samples generated with the permutation procedure to consider (each will have y and x be independent by design). |
sse |
Optional argument to either show or hide the histogram of sum of squared errors of the regression lines. |
reduction |
Optional argument that, if TRUE (the default), displays the reduction in the sum of squared errors rather than the raw sum of squared errors. |
Details
This function gives a scatterplot and fitted regression line for M (in red) for a linear regression, or the fitted logistic curve (in black) for a logistic regression. Then, via the permutation procedure, it generates artificial samples where the observed values of x and y are paired up at random, ensuring that no relationship exists between them. A regression is fit on each permutation sample, and its regression line is drawn in grey to illustrate how lines may look "by chance" when x and y are unrelated.
If requested, a histogram of the reductions in the sum of squared errors of each of the regressions on the permutation datasets (with the original regression's reduction in red) is displayed to allow for an informal assessment of the statistical significance of the regression.
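A single "by chance" line can be sketched by hand (illustrative only; possible_regressions automates this over many permutations):
#Original line in red, one permutation line in grey
data(TIPS)
M <- lm(TipPercentage~Bill,data=TIPS)
plot(TipPercentage~Bill,data=TIPS)
abline(M,col="red")
yshuffled <- sample(TIPS$TipPercentage)
abline( lm(yshuffled~TIPS$Bill), col="grey" )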
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
#A weak but statistically significant relationship
data(TIPS)
M <- lm(TipPercentage~Bill,data=TIPS)
possible_regressions(M)
#A very strong relationship
data(SURVEY10)
M <- lm(PercMoreIntelligentThan~PercMoreAttractiveThan,data=SURVEY10)
possible_regressions(M,permutations=1000)
#Show raw SSE instead of reductions
M <- lm(TipPercentage~PartySize,data=TIPS)
possible_regressions(M,reduction=FALSE)
QQ plot
Description
A QQ plot designed with statistics students in mind
Usage
qq(x,ax=NA,leg=NA,cex.leg=0.8)
Arguments
x |
A vector of data |
ax |
The name you want to call x (used to label the axis). |
leg |
Optional argument that places a legend in the top left of the plot with the text given by leg. |
cex.leg |
Optional argument that gives the magnification of the text in the legend |
Details
This function gives a "QQ plot" that is more easily interpreted than the standard QQ plot. Instead of plotting quantiles, it plots the observed values of x versus the values expected had x come from a Normal distribution.
The distribution can be considered approximately Normal if the points stay within the upper/lower dashed red lines (with the possible exception at the far left/right) and if there is no overall global curvature.
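The expected values can be sketched with qnorm (a rough version of the points qq() draws, without the reference bands):
#Observed values vs. values expected under a fitted Normal
data(ATTRACTF)
x <- sort(ATTRACTF$Score)
expected <- qnorm( ppoints(length(x)), mean=mean(x), sd=sd(x) )
plot(expected,x)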
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
#Distribution does not resemble a Normal
data(TIPS)
qq(TIPS$Bill,ax="Bill")
#Distribution resembles a Normal
data(ATTRACTF)
qq(ATTRACTF$Score,ax="Attractiveness Score")
Replaces rare levels of a categorical variable
Description
This function takes a categorical variable and replaces all levels that appear a total of threshold times or fewer with a single level, named Other by default
Usage
replace_rare_levels(x,threshold=20,newname="Other")
Arguments
x |
a vector of categorical values |
threshold |
levels that appear a total of threshold times or fewer are replaced. |
newname |
defaults to "Other"; the name given to the replacement level. |
Details
Returns the recoded values of the categorical variable. All levels which appeared threshold times or fewer are now known as Other
If, after being combined, the newname level has threshold or fewer instances, the remaining level that appears least often is combined as well.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(EX6.CLICK)
x <- EX6.CLICK[,15]
table(x)
#Replace all levels which appear 700 or fewer times (AA, CC, DD)
y <- replace_rare_levels(x,700)
table( y )
#Replace all levels which appear 1350 or fewer times. This forces BB (which
#occurs 2422 times) into the Other level since the three levels that appear
#fewer than 1350 times do not appear more than 1350 times combined
y <- replace_rare_levels(x,1350)
table( y )
Examining pairwise interactions between quantitative variables for a fitted regression model
Description
Plots all pairwise interactions present in a regression model to allow for an informal assessment of their strength. When both variables are quantitative, the implicit regression lines of y vs. x1 for a small, the median, and a large value of x2 are provided (and vice versa). If one of the variables is categorical, the implicit regression lines of y vs. x are displayed for each level of the categorical variable.
Usage
see_interactions(M,pos="bottomright",many=FALSE,level=0.95,...)
Arguments
M |
A fitted linear regression model with interactions between quantitative variables. |
pos |
Where to put the legend, one of "topleft", "top", "topright", "left","center","right","bottomleft","bottom","bottomright" |
many |
If TRUE, the user is prompted between plots (useful when the model contains many interactions). |
level |
Defines what makes a "small" and "large" value of x1 and x2. By default, level=0.95, so "small" and "large" correspond to the 2.5th and 97.5th percentiles, respectively. |
... |
Additional arguments passed to plot(), e.g., cex. |
Details
When determining the implicit regression lines, all variables not involved in the interaction are assumed to equal 0 (if quantitative) or the level that comes first alphabetically (if categorical). Tick marks on the y axis are thus irrelevant and are not displayed.
The plots allow an informal assessment of the presence of an interaction between the variables x1 and x2 in the model, after accounting for the other predictors. If the implicit regression lines are nearly parallel, then the interaction is weak if it exists at all. If the implicit regression lines have noticeably different slopes, then the interaction is strong.
When an interaction is present, then the strength of the relationship between y and x1 depends on the value of x2. In other words, the difference in the average value of y between two individuals who differ in x1 by 1 unit depends on their (common) value of x2 (sometimes the expected difference is large; sometimes it is small).
If one of the variables in the interaction is categorical, the presence of an interaction implies that the strength of the relationship between y and x differs between levels of the categorical variable. In other words, sometimes the difference in the expected value of y between an individual with level A and an individual with level B is large and sometimes it is small (and this depends on the common value of x of the individuals being compared).
The command visualize_model gives a better representation when only two predictors are in the model.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
visualize_model
Examples
data(SALARY)
M <- lm(Salary~.^2,data=SALARY)
#see_interactions(M,many=TRUE) #not run since it requires user input
data(STUDENT)
M <- lm(CollegeGPA~(Gender+HSGPA+Family)^2+HSGPA*ACT,data=STUDENT)
see_interactions(M,cex=0.6)
Examining model AICs from the "all possible" regressions procedure using regsubsets
Description
This function takes the output of regsubsets and prints a table of the top performing models according to the AIC (or AICc) criterion.
Usage
see_models(ALLMODELS,report=0,aicc=FALSE,reltomin=FALSE)
Arguments
ALLMODELS |
An object of class regsubsets created by regsubsets in package leaps. |
report |
An optional argument specifying the number of top models to print out. If left at a default of 0, the function reports all models whose AICs are within 4 of the lowest overall AIC. |
aicc |
Either TRUE or FALSE (the default). If TRUE, the corrected AIC (AICc) is reported instead of the AIC. |
reltomin |
Either TRUE or FALSE (the default). If TRUE, the criterion is reported relative to the minimum, so the best model shows a value of 0. |
Details
This function uses the summary function applied to the output of regsubsets. The AIC is calculated to match the value obtained via extractAIC, allowing easy comparison with build.model and step.
Although the model with the lowest AIC is typically chosen when making a descriptive model, models whose AICs are within about 2 of the minimum are functionally equivalent: there is no statistical reason to prefer one over another, so any of them is a reasonable choice. The function returns a data frame of the AIC (or AICc), the number of variables, and the predictors in the "best" models.
Recall that regsubsets by default considers up to 8 predictors and does not preserve model hierarchy: interactions may appear without both component terms, and only a subset of the indicator variables used to represent a categorical variable may appear.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
regsubsets, extractAIC, build.model
Examples
data(SALARY)
ALL <- regsubsets(Salary~.^2,data=SALARY,method="exhaustive",nbest=4)
see_models(ALL)
#By default, regsubsets considers up to 8 predictors, here it looks at up to 15
data(ATTRACTF)
ALL <- regsubsets(Score~.,data=ATTRACTF,nvmax=15,nbest=1)
see_models(ALL,aicc=TRUE,report=5)
Segmented barchart
Description
Produces a segmented barchart of the input variable, forcing it to be categorical if necessary
Usage
segmented_barchart(x)
Arguments
x |
A vector. If numerical, it is treated as a categorical variable by conversion to a factor |
Details
Standard segmented barchart. Shaded areas are labeled with the levels they represent, and the percentage of cases with that level is labeled on the axis to the right.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(STUDENT)
segmented_barchart(STUDENT$Family) #Categorical variable
data(TIPS)
segmented_barchart(TIPS$PartySize) #Numerical variable treated as categorical
Combining levels of a categorical variable
Description
This function determines levels that are similar to each other either in terms of their average value of some quantitative variable or the percentages of each level of a two-level categorical variable. Use it to get a rough idea of what levels are "about the same" with regard to some variable.
Usage
suggest_levels(formula,data,maxlevels=NA,target=NA,recode=FALSE,plot=TRUE,...)
Arguments
formula |
A standard R formula written as y~x. Here, x is the variable whose levels you wish to combine, and y is the quantitative or two-level categorical variable. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
maxlevels |
The maximum number of combined levels to consider (cannot exceed 26). |
target |
The number of resulting levels into which the levels of x will be combined. Defaults to the suggested value: the fewest number of levels whose BIC is no more than 4 above the lowest BIC of any combination scheme. |
recode |
Either TRUE or FALSE (the default). If TRUE, a list containing conversion tables between the old and new levels and the recoded values of x is returned; see Details. |
plot |
Either TRUE or FALSE. If TRUE (the default), a plot illustrating the combination schemes is produced. |
... |
Additional graphical arguments used to make the plot. |
Details
This function calculates the average value (or percentage of each level) of y for each level of x. It then builds a partition model taking y to be this average value (or percentage) with x being the predictor variable. The first split yields the "best" scheme for combining levels of x into 2 values. The second split yields the "best" scheme for combining levels of x into 3 values, etc.
The argument maxlevels specifies the maximum number of levels in the combination scheme. By default, it will use the number of levels of x (i.e., no combination). Setting this to a lower number saves time, since most likely a small number of combined levels is desired, and is useful for seeing how different combination schemes compare.
The argument target forces the algorithm to produce exactly this number of combined levels. This is useful once you have determined how many levels of x you want.
If recode is FALSE, a table is printed showing the combined levels along with the "BIC" of the combination scheme (lower is better, but a difference of around 4 or less is negligible). The suggested combination is the fewest number of levels whose BIC is no more than 4 above the scheme that gave the lowest BIC.
If recode is TRUE, a list of three elements is produced. $Conversion1 gives a table of the Old and New levels alphabetized by Old, while $Conversion2 gives a table of the Old and New levels alphabetized by New. $newlevels gives a factor of the cases' levels under the new combination scheme. If target is not set, the suggested number of levels is used.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(DONOR)
#Can levels of URBANICITY be treated the same with regards to probability of donation?
#Analysis suggests yes (all levels in one)
suggest_levels(Donate~URBANICITY,data=DONOR)
#Can levels of URBANICITY be treated the same with regards to donation amount?
#Analysis suggests yes, but perhaps there are four "effective levels"
suggest_levels(Donation.Amount~URBANICITY,data=DONOR)
SL <- suggest_levels(Donation.Amount~URBANICITY,data=DONOR,target=4,recode=TRUE)
SL$Conversion1
#Add a column to the DONOR dataframe that contains the new combined levels
DONOR$newURBANICITY <- SL$newlevels
Useful summaries of partition models from rpart
Description
Reports the RMSE, AIC, and variable importances for a partition model or the variable importances from a random forest.
Usage
summarize_tree(TREE)
Arguments
TREE |
A partition model created with rpart or a random forest created with randomForest. |
Details
Extracts the RMSE and AIC of a partition model and the variable importances of partition models or random forests.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
rpart, randomForest
Examples
data(WINE)
set.seed(2025); SUBSET <- WINE[sample(1:nrow(WINE),size=500),]
TREE <- rpart(Quality~.,data=SUBSET,control=rpart.control(cp=0.01,xval=10,minbucket=5))
summarize_tree(TREE)
RF <- randomForest(Quality~.,data=SUBSET,ntree=50)
summarize_tree(RF)
data(NFL)
SUBSET <- NFL[,1:10]
TREE <- rpart(X4.Wins~.,data=SUBSET,control=rpart.control(cp=0.002,xval=10,minbucket=5))
summarize_tree(TREE)
RF <- randomForest(X4.Wins~.,data=SUBSET,ntree=50)
summarize_tree(RF)
Visualizations of one or two variable linear or logistic regressions or of partition models
Description
Provides useful plots to illustrate the inner workings of regression models with one or two predictors or a partition model with relatively few branches.
Usage
visualize_model(M,loc="topleft",level=0.95,cex.leg=0.7,midline=TRUE,...)
Arguments
M |
A linear or logistic regression model with one or two predictors (not all categorical) produced by lm or glm, or a partition model produced by rpart. |
loc |
The location for the legend, if one is to be displayed. Can also be "top", "topright", "left", "center", "right", "bottomleft", "bottom", or "bottomright". |
level |
The level of confidence for confidence and prediction intervals for the case of simple linear regression. |
cex.leg |
Magnification factor for text in legends. Smaller numbers indicate smaller text. Default is 0.7. |
midline |
logical, either TRUE (the default) or FALSE. For logistic regressions, controls whether the dotted line at a predicted probability of 50% is drawn. |
... |
Additional arguments passed to plot, e.g., xlim to expand the x limits. |
Details
If M is a simple linear regression model, this provides a scatter plot, fitted line, and confidence/prediction intervals.
If M is a simple logistic regression model, this provides the fitted logistic curve.
If M is a regression with two quantitative predictors, this provides the implicit regression lines when one of the variables equals its 5th (small), 50th (median), and 95th (large) percentiles. The model may have interaction terms. In this case, the p-value of the interaction is output. The definition of small and large can be changed with the level argument.
If M is a regression with a quantitative predictor and a categorical predictor (with or without interactions), this provides the implicit regression lines for each level of the categorical predictor. The p-value of the effect test is displayed if an interaction is in the model.
If M is a partition model from rpart, this shows the tree.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
lm, glm, rpart
Examples
data(SALARY)
#Simple linear regression with 90% confidence and prediction intervals
M <- lm(Salary~Education,data=SALARY)
visualize_model(M,level=0.90,loc="bottomright")
#Multiple linear regression with two quantitative predictors (no interaction)
M <- lm(Salary~Education+Experience,data=SALARY)
visualize_model(M)
#Multiple linear regression with two quantitative predictors (with interaction)
#Take small and large to be the 25th and 75th percentiles
M <- lm(Salary~Education*Experience,data=SALARY)
visualize_model(M,level=0.75)
#Multiple linear regression with one categorical and one quantitative predictor
M <- lm(Salary~Education*Gender,data=SALARY)
visualize_model(M)
data(WINE)
#Simple logistic regression with expanded x limits
M <- glm(Quality~alcohol,data=WINE,family=binomial)
visualize_model(M,xlim=c(0,20))
#Multiple logistic regression with two quantitative predictors
M <- glm(Quality~alcohol*sulphates,data=WINE,family=binomial)
visualize_model(M,loc="left",midline=FALSE)
data(TIPS)
#Multiple logistic regression with one categorical and one quantitative predictor
#expanded x-limits to see more of the curve
M <- glm(Smoker~PartySize*Weekday,data=TIPS,family=binomial)
visualize_model(M,loc="topright",xlim=c(-5,15))
#Partition model predicting a quantitative response
TREE <- rpart(Salary~.,data=SALARY)
visualize_model(TREE)
#Partition model predicting a categorical response
TREE <- rpart(Quality~.,data=WINE)
visualize_model(TREE)
Visualizing the relationship between y and x in a partition model
Description
Attempts to show how the relationship between y and x is being modeled in a partition or random forest model
Usage
visualize_relationship(TREE,interest,on,smooth=TRUE,marginal=TRUE,nplots=5,
seed=NA,pos="topright",...)
Arguments
TREE |
A partition or random forest model (though it works with many regression models as well) |
interest |
The name of the predictor variable for which the plot of y vs. x is to be made. |
on |
A dataframe giving the values of the other predictor variables for which the relationship is to be visualized. Typically this is the dataframe on which the partition model was built. |
smooth |
If TRUE (the default), a smooth curve summarizing the modeled relationship is drawn; if FALSE, the predicted values are connected directly. |
marginal |
If TRUE (the default), a single curve showing the overall (marginal) relationship is drawn; if FALSE, separate curves are drawn for individual rows of on, holding the other predictors fixed at each row's values. |
nplots |
The number of rows of on for which curves are drawn when marginal is FALSE. |
seed |
the seed for the random number generator, if reproducibility is required |
pos |
the location of the legend |
... |
additional arguments passed to plot, e.g., xlim and ylim |
Details
The function shows a scatterplot of y vs. x in the on dataframe, then shows how TREE models the relationship between y and x: the predicted value of y is displayed for each row in the data, along with a curve illustrating the relationship. It is useful for seeing what the relationship between y and x, as modeled by TREE, "looks like", both as a whole and for particular combinations of other variables. If marginal is FALSE, then differences between the curves indicate the presence of an interaction between x and another variable.
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
See Also
rpart, randomForest
Examples
data(SALARY)
FOREST <- randomForest(Salary~.,data=SALARY)
visualize_relationship(FOREST,interest="Experience",on=SALARY)
visualize_relationship(FOREST,interest="Months",on=SALARY,xlim=c(1,15),ylim=c(2500,4500))
data(WINE)
TREE <- rpart(Quality~.,data=WINE)
visualize_relationship(TREE,interest="alcohol",on=WINE,smooth=FALSE)
visualize_relationship(TREE,interest="alcohol",on=WINE,marginal=FALSE,nplots=7,smooth=FALSE)