Analyzing Baseball Data with R

Some information about the book Analyzing Baseball Data With R, 3rd edition by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.

The Amazon page for the book
The online Quarto version of the book
The GitHub repository containing the datasets and the scripts used in the book.

163 responses

Lloyd Hill January 21, 2021 at 4:49 pm | Reply

Jim,

I hope all is well. I have continued working through the text and just have a question regarding Chapter 5, Exercise 3.

Is there any way you can confirm the run values for Rickie Weeks and Michael Bourn? I have 9.029 for Weeks and 7.360 for Bourn. I just want to be sure I am following along with the code properly. I used your code from the in-chapter exercises with Pujols.

Thanks so much,
Lloyd
1. Jim Albert January 21, 2021 at 7:46 pm | Reply
  
  Lloyd:
  
  Here’s what I have for Exercise 3 of Chapter 5:
  
  d2016 %>% filter(BAT_ID %in% c("eatoa002", "marts002"),
                   BAT_EVENT_FL == TRUE) %>%
    group_by(BAT_ID) %>%
    summarize(N = n(),
              M = mean(run_value),
              S = sum(run_value))
  ## # A tibble: 2 x 4
  ##   BAT_ID       N      M     S
  ##   <chr>    <int>  <dbl> <dbl>
  ## 1 eatoa002   706 0.0188 13.3
  ## 2 marts002   529 0.0179  9.48
  
  Hope this helps.
  
  Jim
Brandon Alfond February 19, 2021 at 7:40 pm | Reply

Hi Jim,

Hope all is well. I am working through Chapter 4 Exercise 3 (Manager Effect in Baseball) and ran into an issue running the solution. I get the following error running the code:
Error: Problem with `summarise()` input `Mean_Residual`.
x object ‘.resid’ not found
i Input `Mean_Residual` is `mean(.resid)`.
i The error occurred in group 1: playerID = “actama99”.

It seems that for whatever reason the augment() function is not adding the .resid column. Instead I only get the following:
> out out
# A tibble: 345 x 10
yearID teamID R RA .fitted .hat .sigma .cooksd .std.resid playerID

I am using the solutions dated 1/10/2019.

Any help or guidance on what is going wrong would be greatly appreciated.

Thanks,
Brandon
1. Jim Albert February 19, 2021 at 10:58 pm | Reply
  
  Hi Brandon:
  
  It appears that the broom package has changed what happens with the augment() function. I haven’t looked at it carefully, but I see that the data frame out has the variable .std.resid instead of .resid. So I think if you replace mean(.resid) with mean(.std.resid) it should work fine.
  
  I’ll make a correction on those solutions.
  
  Thanks.
  
  Jim
  1. Brandon Alfond February 20, 2021 at 11:32 pm
    
    Thanks Jim! Greatly appreciate the quick response. Loving the book!
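
Standardized residuals are rescaled versions of the raw residuals, so averaging .std.resid will not give the same numbers as averaging .resid. If the raw residuals are wanted regardless of the installed broom version, one option is to take them directly from the fitted model with base R's residuals(). A minimal sketch, assuming a hypothetical toy data frame `teams` in place of the exercise data and an illustrative linear fit of winning percentage on run differential:

```r
# Hypothetical toy data: three managers (playerID), four team-seasons each.
set.seed(1)
teams <- data.frame(
  playerID = rep(c("mgr_a", "mgr_b", "mgr_c"), each = 4),
  RD = round(rnorm(12, mean = 0, sd = 80))   # run differential, R - RA
)
teams$Wpct <- 0.5 + 0.0006 * teams$RD + rnorm(12, sd = 0.02)

# Fit the linear model and attach raw residuals without relying on broom.
fit <- lm(Wpct ~ RD, data = teams)
teams$resid <- residuals(fit)

# Mean raw residual per manager.
mean_residual <- tapply(teams$resid, teams$playerID, mean)
print(mean_residual)
```

The column name `resid` and the manager labels are made up for illustration; the same two lines (lm() followed by residuals()) should carry over to the real data frame.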
Robert H Carden February 20, 2021 at 9:47 pm | Reply

I am in Section 6.2.3 and getting the error below. Could a missing package be causing this?

count_plot %+% run_value_by_count +
+ scale_fill_gradient2("xRV", low = grey10, high = crcblue,
+ mid = white)
Error in count_plot %+% run_value_by_count :
is.character(lhs) is not TRUE
1. Jim Albert February 21, 2021 at 1:30 pm | Reply
  
  Robert, I just tried running that Chapter 6 from the script posted on our Github site and I couldn’t reproduce your error. We are using the tidyverse suite of packages, but nothing else. Unfortunately, without having your computer in front of me, I am not sure what is creating the issue. Sorry not to be of more help. Jim
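
One possible explanation, offered as a guess rather than a diagnosis: the message "is.character(lhs) is not TRUE" is what a string-concatenation version of `%+%` would report when handed a plot object, which could happen if another attached package defines its own `%+%` and masks the one from ggplot2. The base-R sketch below only imitates that masking mechanism; both operators in it are hypothetical stand-ins, not the real package code.

```r
# Stand-in for a plotting package's %+% (hypothetical): swaps the data in a "plot".
plot_env <- new.env()
assign("%+%", function(lhs, rhs) { lhs$data <- rhs; lhs }, envir = plot_env)

# Stand-in for a string package's %+% (hypothetical), defined later so it wins.
`%+%` <- function(lhs, rhs) {
  stopifnot(is.character(lhs), is.character(rhs))
  paste0(lhs, rhs)
}

p <- list(data = 1:3)

# The masking operator is found first and rejects the non-character left side.
res <- tryCatch(p %+% 4:6, error = function(e) conditionMessage(e))
print(res)

# Calling the intended operator explicitly avoids the clash.
fixed <- get("%+%", envir = plot_env)(p, 4:6)
```

If this is indeed the cause, the analogous fix with the real packages would be to call the operator with its namespace, for example ggplot2::`%+%`(count_plot, run_value_by_count), or to check what search() lists ahead of ggplot2.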
BENZI BLATMAN June 3, 2021 at 12:08 am | Reply

Can someone help me figure out how to make sense of this book? Chapter 2 is allegedly “Introduction to R,” and yet I can’t seem to find any instructions on how to actually access/import the data necessary to work alongside each exercise.
1. Jim Albert June 3, 2021 at 2:02 am | Reply
  
  All of the datasets described in the book are found in the data folder at https://github.com/beanumber/baseball_R. Also there is a package ABWRdata that contains most of the datasets. You can install this package by following the instructions at https://github.com/bayesball/ABWRdata
  
  Good luck — I know it can be challenging to get started.
Nick June 19, 2021 at 11:55 pm | Reply

This book has been a great help, but I’ve got stuck in section 3.7.1. I don’t believe I have the same typo as the person above, but any help is appreciated. Here is what I’m typing:

get_birthyear<-function(Name){
Names%
filter(nameFirst==Names[1],
nameLast==Names[2])%>%
mutate(birthyear=ifelse(birthMonth>=7,
birthYear+1,birthYear),
Player=paste(nameFirst,nameLast))%>%
select(playerID,Player,birthyear)
}

It seems to me like this part of the code isn’t working. When I go to the next steps in the chapter for setting up the table, it comes up with 0 observations of 3 variables. I’ve read the Lahman database into R, so I’m not exactly sure what’s not connecting. Thanks!
1. Jim Albert June 20, 2021 at 12:08 am | Reply
  
  Nick:
  It seems that you didn’t type in the get_birthyear() function correctly — it should be as I’ve copied below.
  By the way, all of the code for the chapters can be found at https://github.com/beanumber/baseball_R/tree/master/chapter_code
  Best: Jim
  
  get_birthyear <- function(Name) {
    Names <- unlist(strsplit(Name, " "))
    Master %>%
      filter(nameFirst == Names[1],
             nameLast == Names[2]) %>%
      mutate(birthyear = ifelse(birthMonth >= 7,
                                birthYear + 1, birthYear),
             Player = paste(nameFirst, nameLast)) %>%
      select(playerID, Player, birthyear)
  }
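
A result with 0 observations of 3 variables usually means the filter matched no rows, for instance when the first or last name does not match the table exactly. The base-R sketch below mirrors the logic of get_birthyear() on a hypothetical three-row stand-in for the Lahman player table, so the matching and the birth-year rule can be checked without the full database. The names `players` and `get_birthyear_base` are made up for this illustration.

```r
# Hypothetical three-row stand-in for the Lahman player table.
players <- data.frame(
  playerID   = c("mantlmi01", "ruthba01", "mayswi01"),
  nameFirst  = c("Mickey", "Babe", "Willie"),
  nameLast   = c("Mantle", "Ruth", "Mays"),
  birthYear  = c(1931, 1895, 1931),
  birthMonth = c(10, 2, 5)
)

get_birthyear_base <- function(Name, table = players) {
  # Split "First Last" into its two parts.
  Names <- unlist(strsplit(Name, " "))
  hit <- table[table$nameFirst == Names[1] & table$nameLast == Names[2], ]
  # Players born in July or later are assigned to the following year.
  hit$birthyear <- ifelse(hit$birthMonth >= 7, hit$birthYear + 1, hit$birthYear)
  hit$Player <- paste(hit$nameFirst, hit$nameLast)
  hit[, c("playerID", "Player", "birthyear")]
}

mantle <- get_birthyear_base("Mickey Mantle")   # one row, birthyear 1932
nobody <- get_birthyear_base("mickey mantle")   # case mismatch: zero rows
```

The second call shows how a small difference in spelling or capitalization yields an empty result rather than an error.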
Robert Carden June 20, 2021 at 1:53 pm | Reply

Can anyone explain this message when trying to install a package: “Do you want to install from sources the package which needs compilation?”

Thanks in advance.
1. Jim Albert June 20, 2021 at 10:38 pm | Reply
  
  Robert, there are two ways to install packages, precompiled and those that consist of source programs (like C++ or Fortran) that need to be compiled. Most packages come precompiled, but if you have C++ and Fortran on your computer, you can compile the source packages. Usually the need-to-be compiled packages are the ones that are recently released — if you wait a few days, they will be available in the precompiled version.
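
For readers who would rather not be asked each time, R has an option that controls this prompt. A minimal sketch, assuming the goal is to always take the precompiled version when one exists; the package name in the commented lines is only an example:

```r
# Record the previous setting so it can be restored with options(old).
old <- options(install.packages.compile.from.source = "never")

# With this option set, install.packages() should skip the source prompt and
# use the precompiled binary when one is available (Windows and macOS).
# install.packages("dplyr")                    # not run here: needs a network connection
# install.packages("dplyr", type = "binary")   # alternative: request a binary explicitly

current <- getOption("install.packages.compile.from.source")
print(current)
```

The setting lasts only for the current session unless it is placed in a startup file such as .Rprofile.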
Alfredo November 13, 2021 at 12:02 pm | Reply

Excellent book, it is a great tool for the study of statistics, and to make every baseball game even more interesting. I have been trying to follow the example of the Pythagorean expectation formula and how to obtain its exponent (pages 94-102), but I wonder what to do when a team finishes a game with zero runs. In such cases, for example, for the calculation of logRratio (page 95), log(0) is -Inf.
> log(0)
[1] -Inf
What is the way to deal with this data: remove this log from the data, or calculate, for example, log(0.1)?
Thanks
1. Jim Albert November 15, 2021 at 2:38 pm | Reply
  
  Alfredo, thanks for the kind comments on the book. Usually we apply the Pythagorean expectation formula for a collection of games where the runs scored for and against are positive. I don’t think we use it for a single game.
  1. Alfredo November 16, 2021 at 7:46 am
    
    Thanks a lot for your response.
  2. Alfredo November 16, 2021 at 7:56 am
    
    What I was trying to do, as a statistical exercise, is to apply the formula to several games in a season to observe how the expectation of games won changes over the dates. My intention was to find the time when the prediction most closely matched the final outcome. I am looking at a short 49-game season in my country’s professional league. Currently, they are going through game 20. My first choice was to think of a sample size for all 49 games. But doing the exercise by dates, I plan to find the point at which the prediction was most closely matched. Kindly, I would like to ask if you can think of any recommendations for my exercise? Thank you.
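
One way to follow the expectation through a season while avoiding log(0) is to apply the formula to cumulative run totals instead of single games, in line with the reply above. A minimal sketch with hypothetical game-by-game results and an assumed exponent of 2; the data frame `games` and its values are invented for illustration:

```r
# Hypothetical game-by-game results for one team (note the zero-run games).
games <- data.frame(
  game = 1:8,
  R  = c(3, 0, 5, 2, 7, 0, 4, 6),   # runs scored
  RA = c(2, 4, 1, 3, 3, 1, 0, 5)    # runs allowed
)

k <- 2  # assumed Pythagorean exponent

# Running totals through each game.
games$cumR  <- cumsum(games$R)
games$cumRA <- cumsum(games$RA)

# Pythagorean winning percentage based on all games played so far.
games$pyth_wpct <- games$cumR^k / (games$cumR^k + games$cumRA^k)

# Actual winning percentage so far, for comparison.
games$actual_wpct <- cumsum(games$R > games$RA) / games$game

print(games)
```

Because the totals are positive once a team has both scored and allowed a run, the log ratio log(cumR / cumRA) is also finite from that point on, so the same running totals could be used when estimating the exponent.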
Justin Cassidy October 8, 2022 at 11:25 pm | Reply

I apologize as I am only on chapter 2 but already having a problem while following along with Warren Spahn’s csv. As I run this code, I get to
install.packages("tidyverse")
library(tidyverse)
library(Lahman)
getwd()
spahn <- read_csv("data/spahn.csv")
While read.csv works for getting data/spahn.csv, read_csv produces this error:
Error in `vec_as_location()`:
! `...` must be empty.
x Problematic argument:
* call = call
Run `rlang::last_error()` to see where the error occurred.

I am fairly sure my working directory is set up correctly so I thought it might be something else wrong. I am completely lost and am just getting back to coding so any help would be greatly appreciated. Thank you!
1. Jim Albert October 10, 2022 at 12:03 am | Reply
  
  Justin, sorry, but there is no simple answer to that particular error message. I can’t reproduce that on my laptop. I’d suggest using read.csv() instead of read_csv() to read in data files — both functions do the same thing. Jim
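
For reference, read.csv() returns a base data frame while read_csv() returns a tibble, but both should give the same columns for a plain CSV file. A minimal sketch of the read.csv() route, using a temporary file with a few illustrative columns in place of the real data/spahn.csv:

```r
# Write a tiny illustrative CSV to a temporary file (stands in for data/spahn.csv).
path <- tempfile(fileext = ".csv")
writeLines(c("Year,W,L", "1946,8,5", "1947,21,10"), path)

# Confirm the file is where R expects it before reading.
stopifnot(file.exists(path))

# Read the file into a base data frame and inspect its structure.
spahn_demo <- read.csv(path)
str(spahn_demo)
```

With the real file, the same file.exists("data/spahn.csv") check is a quick way to rule out a working-directory problem.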
aves25 March 15, 2023 at 12:56 am | Reply

Hello!
I was wondering how to get the hofbaseball.csv file to use for Chapter 3? I know it wasn’t in the original files when I download the csv files for 2017, but I can’t seem to figure out how to get a hold of it?
1. Jim Albert March 16, 2023 at 7:59 pm | Reply
  
  In Chapter 3, I don’t believe there is a hofbaseball.csv file mentioned, but there are hofbatting.csv and hofpitching.csv files. These two files are available in the data folder on GitHub: https://github.com/beanumber/baseball_R/tree/master/data
Jayson Stancil May 22, 2023 at 1:32 pm | Reply

Hi,
I recently bought the book, and love the work you’ve done. However, I am running into many issues because I believe the latest version of the Lahman package in R removed the Master table. Is there a workaround for this, or am I doing something incorrect?
1. Jim Albert May 22, 2023 at 1:34 pm | Reply
  
  Jayson, in the Lahman package, that Master data frame has been renamed as the People data frame. Jim
  1. Jayson Stancil May 26, 2023 at 12:50 pm
    
    You’re a lifesaver. Thanks again!
Addison McGhee February 27, 2024 at 6:17 pm | Reply

Hi Jim!

Love the book!

I wanted to ask for some clarification on the first “Baseball Question” in section 1.2.8.

What I wanted to ask was how the “average number of home runs per game recorded in each decade” is calculated. Specifically, how are you obtaining the values 0.3, 0.8, and 2.2 in the paragraph below the question?

My approach was to use the Teams dataset and group_by the variable year_id. I thought that “home runs per game” could be calculated by taking the total number of home runs and dividing by the total number of games. But my results didn’t match what you had.

Thanks!
1. Jim Albert February 27, 2024 at 9:14 pm | Reply
  
  Addison:
  
  What you did was fine. But you were computing the average number of home runs per team per game. Since two teams are playing, the average number of home runs per game would be double what you are finding.
  
  Jim
  1. addisonmcg99 February 27, 2024 at 9:39 pm
    
    Hi Jim, Thanks for the quick reply!
    
    Is this correct, or do I need to further group by decade?
    
    View(
    Teams %>%
    group_by(yearID) %>%
    summarise(total_games = sum(G),
    total_hr = sum(HR),
    hr_pg = 2 * total_hr / total_games)
    )
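
The code above gives one row per season. To get decade-level figures, one option is to add a decade variable and sum games and home runs within each decade before dividing. A minimal base-R sketch on a hypothetical data frame that borrows the column names of the Teams table; the rows are invented for illustration:

```r
# Hypothetical team-season rows using the Teams column names yearID, G, HR.
teams_demo <- data.frame(
  yearID = c(1961, 1961, 1968, 1968, 1972, 1972),
  G      = c(162, 162, 162, 162, 156, 156),
  HR     = c(240, 150, 100, 110, 120, 90)
)

# Map each season to the first year of its decade, e.g. 1968 -> 1960.
teams_demo$decade <- 10 * floor(teams_demo$yearID / 10)

# Total games and home runs per decade.
totals <- aggregate(cbind(G, HR) ~ decade, data = teams_demo, FUN = sum)

# Each game appears once for each of the two teams, hence the factor of 2.
totals$hr_pg <- 2 * totals$HR / totals$G
print(totals)
```

Summing first and dividing afterwards weights every game equally, which differs slightly from averaging the yearly rates.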
Kyle April 28, 2024 at 9:46 pm | Reply

Hi Jim.

I am trying to replicate your analysis on catcher framing in section 7.5.

I am struggling with matching the player ids to their names. Where does the masterid.csv file come from?

This is my only problem. I love the book and have used for answering many of my own baseball questions.

Thank you!
1. Jim Albert April 29, 2024 at 12:39 pm | Reply
  
  Kyle:
  
  That particular masterid.csv file can be found in the folder at https://github.com/beanumber/baseball_R/tree/master/data
  
  We have a new edition of this book available at https://beanumber.github.io/abdwr3e/ . In the new edition, we use a function from the baseballr package to get the names for those player ids.
  
  Glad you enjoy the book.
  
  Jim
  1. Kyle April 29, 2024 at 2:24 pm
    
    Thanks for your help! I appreciate the quick reply!
Manny September 3, 2024 at 12:46 am | Reply

Hello all!

I am very excited about the new edition of the book, Analyzing Baseball Data with R (3e)! I previously worked through the 2nd edition and primarily focused on ideas with the Lahman database. This edition has motivated me to step my game up a notch and I wanted to start working with the Retrosheet data to work through the exercises in the book. However, even after working in R for years, I am having a hard time understanding and working through the instructions in the appendix in order to set up the ‘Chadwick’ component that working with Retrosheet entails.

Are there additional resources/videos one can look at to get Chadwick properly installed and working with R? I am working in Windows. Any help is greatly appreciated! Thank you!
1. Jim Albert September 3, 2024 at 12:40 pm | Reply
  
  Manny, I understand that it can be problematic getting the Chadwick files to work. If you are on a Windows laptop, then you can download the executable versions of cwevent, etc. If you are in the same folder as those programs, then you should be able to use them to extract the Retrosheet variables. Our best advice is given at https://beanumber.github.io/abdwr3e/A_retrosheet.html. Jim
  1. exactlyhideoutcef8839f5c October 17, 2024 at 9:35 pm
    
    Hello again!
    
    Thank you for your help! I have worked on and off with the link/page that you suggested. Following the code in the book in Appendix A:
    
    retro_data <- baseballr::retrosheet_data(
      here::here("data_large/retrosheet"), c(1992, 1996, 1998, 2016)
    )
    
    I have downloaded the ‘chadwick files’, specifically the cwevent.exe file, into an unzipped folder for the above. However, I continue to receive this error in R:
    
    “‘cwevent’ is not recognized as an internal or external command, operable program or batch file. Error in `purrr::map()`:……………… “
    
    In this regard – I was able to create the ‘Runs Expectancy Matrix’ using the code from the older version of the book. Very, very cool!!!!!! Nevertheless – I would like to keep up-to-date with the new edition. Any help with the above error is greatly appreciated!
    
    Otherwise, I am not sure if this is the place for feedback, but I was able to replicate and create a database following Chapter 11, ‘Using a Database to Compute Park Factors’, which was also very, very cool and encouraging for working through other parts of the book! 🙂
    
    Thank you for all your help! It is appreciated!
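
The "is not recognized as an internal or external command" message is what the Windows command shell prints when it cannot find a program on its search path. One thing to try, as a guess rather than a confirmed fix, is to add the folder that holds cwevent.exe to PATH for the current R session and then check whether R can see the program. In the sketch below the folder is a hypothetical placeholder created under tempdir(); replace it with the real location of the Chadwick executables.

```r
# Hypothetical folder holding cwevent.exe; replace with the real location.
chadwick_dir <- file.path(tempdir(), "chadwick")
dir.create(chadwick_dir, showWarnings = FALSE)

# Prepend the folder to PATH for the current R session only.
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(chadwick_dir, old_path, sep = .Platform$path.sep))

# Sys.which() returns "" when the program cannot be found on PATH,
# and the full path to the executable when it can.
found <- Sys.which("cwevent")
print(found)
```

If Sys.which("cwevent") still returns an empty string after pointing chadwick_dir at the real folder, the executable is probably not in that folder.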
exactlyhideoutcef8839f5c October 17, 2024 at 9:39 pm | Reply

Hello again!

I hope I am not being a pest, but here is a quick update. I happened to have a Mac “Mini-me” (or at least I think that is what it is called). I followed the instructions to run the Appendix and Chapter 5 code using Homebrew(?) and, well, wow, it ran. Isn’t this a conundrum? 🙂

So – I can run the book’s new edition code, at least Chapter 5, in R-Studio on a Mac (keeping my fingers crossed for the rest of the book). Nevertheless, and as always, thank you for your help and assistance with everything. I appreciate it!
Murray Sondergard December 22, 2024 at 5:10 pm | Reply

Holiday Greetings –

In Section 10.4 (Local Patterns of Statcast Launch Velocity) in the 3rd Edition of the book, the sample code uses a file named “sc_2017_ls.rds” which is loaded from the data directory.

I can’t seem to find the datafile, nor can I find any directions on how to create the file. I was hoping someone could point me in the right direction. Really enjoying the book.

Warm regards, Murray
1. Jim Albert December 22, 2024 at 5:35 pm | Reply
  
  Hi Murray:
  
  We were not able to include all datasets used in the book in the abdwr3edata package primarily since they were so large. But we do provide advice for downloading Statcast data in Section 12.2 from the baseballr package.
  
  Glad you are enjoying the book and don’t hesitate to ask any other questions.
  
  Best:
  
  Jim
  1. Murray Sondergard December 22, 2024 at 8:48 pm
    
    Thanks very much for the prompt reply.
    
    Regards, Murray
Ralph Zuccarino January 15, 2025 at 3:35 am | Reply

Hi,

I love your book. I am learning a lot about baseball and R.

I’m struggling with section 9.3.6 Function to simulate one season.

“It is convenient to place all of these commands including the functions make.schedule and win.league in a single function one.simulation.68.”

I cannot figure out how to do this. I appreciate any assistance.
1. Jim Albert January 15, 2025 at 12:11 pm | Reply
  
  Ralph:
  
  We’re happy you enjoy the book.
  
  The one.simulation.68() function is actually contained in the abdwr3edata package. If you install and load the package, then you can implement a simulation by typing
  
  S <- one_simulation_68(s.talent = 0.3)
  
  s.talent is the standard deviation of the talent distribution of teams. If you want to view the function, just type
  
  View(one_simulation_68)
  
  Hope this helps.
  
  Jim
  1. Ralph Zuccarino January 15, 2025 at 11:18 pm
    
    Thank you, Jim!
