Some information about the book Analyzing Baseball Data With R, 3rd edition by Max Marchi, Jim Albert, and Ben Baumer:
Some useful links for the book.
- The Amazon page for the book
- The online Quarto version of the book
- The GitHub repository containing the datasets and the scripts used in the book.

Jim,
I hope all is well. I have continued working through the text and just have a question regarding Chapter 5, Exercise 3.
Is there any way you can confirm the run values for Rickie Weeks and Michael Bourn? I have 9.029 for Weeks and 7.360 for Bourn. I just want to be sure I am following along with the code properly. I used your code from the in-chapter exercises with Pujols.
Thanks so much,
Lloyd
Lloyd:
Here’s what I have for Exercise 3 of Chapter 5:
d2016 %>%
  filter(BAT_ID %in% c("eatoa002", "marts002"),
         BAT_EVENT_FL == TRUE) %>%
  group_by(BAT_ID) %>%
  summarize(N = n(),
            M = mean(run_value),
            S = sum(run_value))
## # A tibble: 2 x 4
## BAT_ID N M S
##
## 1 eatoa002 706 0.0188 13.3
## 2 marts002 529 0.0179 9.48
Hope this helps.
Jim
Hi Jim,
Hope all is well. I am working through Chapter 4 Exercise 3 (Manager Effect in Baseball) and ran into an issue running the solution. I get the following error running the code:
Error: Problem with `summarise()` input `Mean_Residual`.
x object ‘.resid’ not found
i Input `Mean_Residual` is `mean(.resid)`.
i The error occurred in group 1: playerID = “actama99”.
It seems that, for whatever reason, the augment() function is not adding the .resid column. Instead I get only the following:
> out out
# A tibble: 345 x 10
yearID teamID R RA .fitted .hat .sigma .cooksd .std.resid playerID
I am using the solutions dated 1/10/2019.
Any help or guidance on what is going wrong would be greatly appreciated.
Thanks,
Brandon
Hi Brandon:
It appears that the broom package has changed what the augment() function returns. I haven’t looked at it carefully, but I see that the data frame out has the variable .std.resid instead of .resid. So I think if you replace mean(.resid) with mean(.std.resid) it should work fine.
I’ll make a correction on those solutions.
Thanks.
Jim
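If the columns returned by augment() differ across broom versions, the raw residuals can also be computed directly as observed minus fitted values. Note that .std.resid holds standardized residuals, which are on a different scale from raw residuals. The sketch below uses base R only; the data frame teams and its values are made up for illustration, and the variable names (Wpct, RD, playerID) are assumptions rather than the ones in the posted solutions.

```r
# Hypothetical data: one row per team-season, with the manager's ID.
teams <- data.frame(
  playerID = c("mgr01", "mgr01", "mgr02", "mgr02", "mgr03", "mgr03"),
  RD       = c(120, -45, 10, 80, -100, 30),              # run differential (R - RA)
  Wpct     = c(0.580, 0.470, 0.500, 0.560, 0.420, 0.530) # winning percentage
)

fit <- lm(Wpct ~ RD, data = teams)

# Raw residual = observed - fitted; this does not rely on a .resid column.
teams$resid <- teams$Wpct - fitted(fit)

# Mean residual per manager.
mean_residual <- tapply(teams$resid, teams$playerID, mean)
print(round(mean_residual, 4))
```

The same idea carries over to a dplyr pipeline by computing the residual column with mutate() before group_by() and summarize().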
Thanks Jim! Greatly appreciate the quick response. Loving the book!
I’m in Section 6.2.3 and getting the error below. Is a missing package driving this?
count_plot %+% run_value_by_count +
+ scale_fill_gradient2("xRV", low = grey10, high = crcblue,
+ mid = white)
Error in count_plot %+% run_value_by_count :
is.character(lhs) is not TRUE
Robert, I just tried running the Chapter 6 code from the script posted on our GitHub site and I couldn’t reproduce your error. We are using the tidyverse suite of packages, but nothing else. Unfortunately, without having your computer in front of me, I am not sure what is creating the issue. Sorry not to be of more help. Jim
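One possible cause, offered as a guess rather than a diagnosis: the message "is.character(lhs) is not TRUE" looks like it comes from a %+% operator defined by some package other than ggplot2 that was attached later and masks ggplot2's version. The helper below (operator_owner is a hypothetical name) reports which namespace supplies the currently visible definition of an operator. It is demonstrated on the base operator %in% so that it runs without extra packages.

```r
# Report the namespace that supplies the currently visible
# definition of a function or operator.
operator_owner <- function(name) {
  f <- get(name, mode = "function")
  environmentName(environment(f))
}

operator_owner("%in%")
```

With ggplot2 attached, operator_owner("%+%") should name ggplot2. If it names another package, calling ggplot2::`%+%`(count_plot, run_value_by_count) uses the ggplot2 version explicitly.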
Can someone help me figure out how to make sense of this book? Chapter 2 is allegedly “Introduction to R,” and yet I can’t seem to find any instructions on how to actually access/import the data necessary to work alongside each exercise.
All of the datasets described in the book are found in the data folder at https://github.com/beanumber/baseball_R. There is also a package, ABWRdata, that contains most of the datasets. You can install it by following the instructions at https://github.com/bayesball/ABWRdata
Good luck — I know it can be challenging to get started.
This book has been a great help, but I’ve got stuck in section 3.7.1. I don’t believe I have the same type as the person above, but any help is appreciated. Here is what I’m typing:
get_birthyear<-function(Name){
Names%
filter(nameFirst==Names[1],
nameLast==Names[2])%>%
mutate(birthyear=ifelse(birthMonth>=7,
birthYear+1,birthYear),
Player=paste(nameFirst,nameLast))%>%
select(playerID,Player,birthyear)
}
It seems to me like this part of the code isn’t working. When I go to the next steps in the chapter for setting up the table, it comes up with 0 observations of 3 variables. I’ve read the Lahman database into R, so I’m not exactly sure what’s not connecting. Thanks!
Nick:
It seems that you didn’t type in the get_birthyear() function correctly — it should be as I’ve copied below.
By the way, all of the code for the chapters can be found at https://github.com/beanumber/baseball_R/tree/master/chapter_code
Best: Jim
get_birthyear <- function(Name) {
  # Split "First Last" into its two parts
  Names <- unlist(strsplit(Name, " "))
  Master %>%
    filter(nameFirst == Names[1],
           nameLast == Names[2]) %>%
    mutate(birthyear = ifelse(birthMonth >= 7,
                              birthYear + 1, birthYear),
           Player = paste(nameFirst, nameLast)) %>%
    select(playerID, Player, birthyear)
}
Can anyone explain this message when trying to install a package: “Do you want to install from sources the package which needs compilation?”
Thanks in advance.
Robert, there are two kinds of packages you can install: precompiled binaries, and source packages containing code (like C++ or Fortran) that needs to be compiled. Most packages come precompiled, but if you have C++ and Fortran compilers on your computer, you can compile the source packages. Usually the packages that need to be compiled are the ones that were recently released. If you wait a few days, they will be available in the precompiled version.
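For readers who would rather skip the prompt, base R has settings that favor precompiled binaries. A minimal sketch, assuming a Windows or macOS setup where CRAN binaries exist; install_binary is a hypothetical wrapper that is defined here but not called.

```r
# Never offer to compile a newer source version when a binary is available.
old <- options(install.packages.compile.from.source = "never")

# Hypothetical wrapper: request the binary build explicitly (Windows/macOS).
install_binary <- function(pkg) {
  install.packages(pkg, type = "binary")
}

getOption("install.packages.compile.from.source")
```

Calling options(old) afterwards restores the previous setting.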
Excellent book, it is a great tool for the study of statistics, and to make every baseball game even more interesting. I have been trying to follow the example of the Pythagorean expectation formula and how to obtain its exponent (pages 94-102), but I wonder what to do when in the final score a team made no runs, that is, ended with zero runs. In such cases, for example, for the calculation of logRratio (page 95), log(0) is -Inf.
> log(0)
[1] -Inf
What is the way to deal with this data: remove this log from the data, or calculate, for example, log(0.1)?
Thanks
Alfredo, thanks for the kind comments on the book. Usually we apply the Pythagorean expectation formula for a collection of games where the runs scored for and against are positive. I don’t think we use it for a single game.
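As a sketch of what using a collection of games looks like, the runs can be accumulated over games before taking the ratio, so that a single shutout does not produce log(0). The game scores below are made up, and the exponent 2 is the classical choice, not a value estimated from these data.

```r
# Hypothetical game-by-game runs scored (R) and allowed (RA) for one team.
games <- data.frame(
  R  = c(0, 5, 3, 7, 2, 4),
  RA = c(3, 2, 4, 1, 2, 6)
)

# Cumulative totals after each game.
games$cumR  <- cumsum(games$R)
games$cumRA <- cumsum(games$RA)

# Log ratio of cumulative runs; set to NA until both totals are positive.
ok <- games$cumR > 0 & games$cumRA > 0
games$logRratio <- ifelse(ok, log(games$cumR / games$cumRA), NA)

# Pythagorean expected winning percentage with exponent k = 2.
k <- 2
games$pyth <- games$cumR^k / (games$cumR^k + games$cumRA^k)

print(games)
```

Only the first row, where the team has not yet scored, has a missing log ratio; every later row is finite because the totals are positive.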
Thanks a lot for your response.
What I was trying to do, as a statistical exercise, is to apply the formula to several games in a season to observe how the expectation of games won changes over the dates. My intention was to find the time when the prediction most closely matched the final outcome. I am looking at a short 49-game season in my country’s professional league. Currently, they are going through game 20. My first choice was to think of a sample size for all 49 games. But doing the exercise by dates, I plan to find the point at which the prediction was most closely matched. Kindly, I would like to ask if you can think of any recommendations for my exercise? Thank you.
I apologize as I am only on chapter 2 but already having a problem while following along with Warren Spahn’s csv. As I run this code, I get to
install.packages("tidyverse")
library(tidyverse)
library(Lahman)
getwd()
spahn <- read_csv("data/spahn.csv")
While read.csv works for getting data/spahn.csv, read_csv produces this error:
Error in `vec_as_location()`:
! `...` must be empty.
x Problematic argument:
* call = call
Run `rlang::last_error()` to see where the error occurred.
I am fairly sure my working directory is set up correctly so I thought it might be something else wrong. I am completely lost and am just getting back to coding so any help would be greatly appreciated. Thank you!
Justin, sorry, but there is no simple answer to that particular error message, and I can’t reproduce it on my laptop. I’d suggest using read.csv() instead of read_csv() to read in data files. Both functions read a CSV file into a data frame (read_csv() returns a tibble). Jim
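A minimal sketch of the base R route, which needs no extra packages. The file is a temporary one with made-up contents standing in for data/spahn.csv; the column names and values are placeholders, not those of the real file.

```r
# Write a small placeholder CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("Year,W,L", "1901,10,5", "1902,12,8"), path)

# read.csv() returns a base data.frame.
d <- read.csv(path)
str(d)
```

Most dplyr verbs used in the book accept a plain data frame, so the later steps should generally work on the result of read.csv() as well.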
Hello!
I was wondering how to get the hofbaseball.csv file to use for Chapter 3? I know it wasn’t in the original files when I downloaded the csv files for 2017, but I can’t seem to figure out how to get a hold of it.
In Chapter 3, I don’t believe there is a hofbaseball.csv file mentioned, but there is hofbatting.csv and hofpitching.csv. These two files are available in the data folder on GitHub: https://github.com/beanumber/baseball_R/tree/master/data
Hi,
I recently bought the book, and love the work you’ve done. However, I am running into many issues because I believe the latest version of the Lahman package in R removed the Master table. Is there a workaround for this, or am I doing something incorrect?
Jayson, in the Lahman package, the Master data frame has been renamed as the People data frame. Jim
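One way to keep older code that refers to Master running is to create an alias for People. This is a sketch, assuming an installed Lahman version that provides People; the block does nothing if Lahman is not installed.

```r
# If Lahman is available, alias People as Master so that code written
# against the older table name still finds it.
if (requireNamespace("Lahman", quietly = TRUE)) {
  Master <- Lahman::People
}
```

The alternative is simply to replace Master with People wherever it appears in the code.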
You’re a lifesaver. Thanks again!
Hi Jim!
Love the book!
I wanted to ask for some clarification on the first “Baseball Question” in section 1.2.8.
What I wanted to ask was how the “average number of home runs per game recorded in each decade” is calculated. Specifically, how are you obtaining the values 0.3, 0.8, and 2.2 in the paragraph below the question?
My approach was to use the Teams dataset and group_by the variable year_id. I thought that “home runs per game” could be calculated by taking the total number of home runs and dividing by the total number of games. But my results didn’t match what you had.
Thanks!
Addison:
What you did was fine. But you were computing the average number of home runs per team per game. Since two teams are playing, the average number of home runs per game would be double what you are finding.
Jim
Hi Jim, Thanks for the quick reply!
Is this correct, or do I need to further group by decade?
View(
Teams %>%
group_by(yearID) %>%
summarise(total_games = sum(G),
total_hr = sum(HR),
hr_pg = 2 * total_hr / total_games)
)
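One way the decade grouping asked about above could be sketched in base R. The rows below are made up and stand in for the Lahman Teams table (columns yearID, G, HR); the factor of 2 reflects the point that each game is counted in G once for each of the two teams.

```r
# Hypothetical team-season rows.
teams <- data.frame(
  yearID = c(1921, 1921, 1925, 1925, 1955, 1955),
  G      = c(154, 154, 154, 154, 154, 154),
  HR     = c(40, 60, 50, 70, 130, 150)
)

# Map each season to the first year of its decade, e.g. 1925 -> 1920.
teams$decade <- 10 * floor(teams$yearID / 10)

# Total games and home runs per decade, then home runs per game.
totals <- aggregate(cbind(G, HR) ~ decade, data = teams, FUN = sum)
totals$hr_pg <- 2 * totals$HR / totals$G
print(totals)
```

In a dplyr pipeline the same step would be a mutate() that creates decade, followed by group_by(decade) in place of group_by(yearID).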
Hi Jim.
I am trying to replicate your analysis on catcher framing in section 7.5.
I am struggling with matching the player ids to their names. Where does the masterid.csv file come from?
This is my only problem. I love the book and have used for answering many of my own baseball questions.
Thank you!
Kyle:
That particular masterid.csv file can be found in the folder at https://github.com/beanumber/baseball_R/tree/master/data
We have a new edition of this book available at https://beanumber.github.io/abdwr3e/ . In the new edition, we use a function from the baseballr package to get the names for those player ids.
Glad you enjoy the book.
Jim
Thanks for your help! I appreciate the quick reply!
Hello all!
I am very excited about the new edition of the book, Analyzing Baseball Data with R (3e)! I previously worked through the 2nd edition and primarily focused on ideas with the Lahman database. This edition has motivated me to step my game up a notch, and I wanted to start working with the Retrosheet data to work through the exercises in the book. However, even after working in R for years, I am having a hard time understanding and working through the instructions in the appendix for setting up the ‘Chadwick’ component that working with Retrosheet entails.
Are there additional resources/videos one can look at to get Chadwick properly installed and working with R? I am working in Windows. Any help is greatly appreciated! Thank you!
Manny, I understand that it can be problematic getting the Chadwick files to work. If you are on a Windows laptop, then you can download the executable versions of cwevent, etc. If you are in the same folder as those programs, then you should be able to use them to extract the Retrosheet variables. Our best advice is given at https://beanumber.github.io/abdwr3e/A_retrosheet.html. Jim
Hello again!
Thank you for your help! I have worked on and off with the link/page that you suggested. Following the code in the book in Appendix A:
retro_data <- baseballr::retrosheet_data(
  here::here("data_large/retrosheet"), c(1992, 1996, 1998, 2016)
)
I have downloaded the ‘chadwick files’, specifically the cwevent.exe file, into an unzipped folder for the above. However, I continue to receive this error in R:
“‘cwevent’ is not recognized as an internal or external command, operable program or batch file. Error in `purrr::map()`:……………… “
In this regard – I was able to create the ‘Runs Expectancy Matrix’ using the code from the older version of the book. Very, very cool!!!!!! Nevertheless – I would like to keep up-to-date with the new edition. Any help with the above error is greatly appreciated!
Otherwise – I am not sure if this is the place for it w/r/t feedback, but I was able to replicate and create a database following Chapter 11 – ‘Using a Database to Compute Park Factors’, which was also very, very cool and encouraging w/r/t working through other parts of the book! 🙂
Thank you for all your help! It is appreciated!
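The "not recognized" message is what Windows prints when it cannot find a program, here cwevent, in the current folder or on the PATH. A sketch of a check from within R, using base functions only; on_path and add_to_path are hypothetical helper names, and the folder C:/chadwick is a hypothetical location that should be replaced with wherever cwevent.exe was unzipped.

```r
# Return TRUE if a command-line program can be found on the PATH that R sees.
on_path <- function(program) {
  nzchar(unname(Sys.which(program)))
}

# Prepend a folder to PATH for the current R session only.
add_to_path <- function(folder) {
  Sys.setenv(PATH = paste(folder, Sys.getenv("PATH"), sep = .Platform$path.sep))
}

if (!on_path("cwevent")) {
  chadwick_dir <- "C:/chadwick"  # hypothetical folder holding cwevent.exe
  if (dir.exists(chadwick_dir)) add_to_path(chadwick_dir)
}

on_path("cwevent")
```

If the last line returns TRUE, R can locate cwevent; whether that alone resolves the error above is not something this sketch can confirm.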
Hello again!
I hope I am not being a pest, but here is a quick update. I happened to have a Mac “Mini-me” (or at least I think that is what it is called). I followed the instructions to run the Appendix and Chapter 5 code using Homebrew(?) and, well – wow – it ran. Isn’t this a conundrum? 🙂
So – I can run the book’s new edition code, at least Chapter 5, in R-Studio on a Mac (keeping my fingers crossed for the rest of the book). Nevertheless, and as always, thank you for your help and assistance with everything. I appreciate it!
Holiday Greetings –
In Section 10.4 (Local Patterns of Statcast Launch Velocity) in the 3rd Edition of the book, the sample code uses a file named “sc_2017_ls.rds” which is loaded from the data directory.
I can’t seem to find the datafile, nor can I find any directions on how to create the file. I was hoping someone could point me in the right direction. Really enjoying the book.
Warm regards, Murray
Hi Murray:
We were not able to include all of the datasets used in the book in the abdwr3edata package, primarily because they were so large. But Section 12.2 does provide advice for downloading Statcast data with the baseballr package.
Glad you are enjoying the book and don’t hesitate to ask any other questions.
Best:
Jim
Thanks very much for the prompt reply.
Regards, Murray
Hi,
I love your book. I am learning a lot about baseball and R.
I’m struggling with section 9.3.6 Function to simulate one season.
“It is convenient to place all of these commands including the functions make.schedule and win.league in a single function one.simulation.68.”
I cannot figure out how to do this. I appreciate any assistance.
Ralph:
We’re happy you enjoy the book.
The one.simulation.68() function is actually contained in the abdwr3edata package. If you install and load the package, then you can implement a simulation by typing
S <- one_simulation_68(s.talent = 0.3)
s.talent is the standard deviation of the talent distribution of teams. If you want to view the function, just type
View(one_simulation_68)
Hope this helps.
Jim
Thank you, Jim!