Baseball Hall of Fame Predictions - PowerPoint Presentation
Uploaded by min-jolicoeur on 2019-11-18

Presentation Transcript

Baseball Hall of Fame Predictions
Brent Belanger, Dave Matrisciano II, Phillip McLaurin

""You miss 100% of the shots you don't take" - Wayne Gretzky" - Michael Scott

Background
The tension between public perception and statistics is one of the most hotly contested topics in sports. In Major League Baseball (MLB), the vote of the Baseball Writers' Association of America (BBWAA) is the only deciding factor in whether a player is elected to the Baseball Hall of Fame. Is the perception of the BBWAA accurate when it comes to judging players' eligibility for one of the most honored fraternities in all of sports?

Some Terminology First:
Hall of Fame (HOF): An American history museum dedicated to the greatest baseball players in history. Members of the sports media vote to determine whether a player is inducted into the Hall of Fame.
Fielding Independent Pitching (FIP): Similar to ERA, but it focuses solely on the events a pitcher has the most control over: strikeouts, unintentional walks, hit-by-pitches, and home runs. It entirely removes results on balls hit into the field of play.
FIP-: A normalized modification of the FIP statistic accounting for differences between parks (stadiums) and between leagues.
Earned Run (ER): Any run that scores against a pitcher without the benefit of an error or a passed ball.
Earned Run Average (ERA): The average number of earned runs given up per nine innings pitched.

Terminology Part 2:
ERA-: A normalized modification of the ERA statistic taking into account park factor and differences between leagues.
Base on Balls (BB): When a pitcher throws four pitches outside of the strike zone in an at-bat, the batter is awarded first base.
Hit By Pitch (HBP): When a pitcher hits a batter with a pitch, the batter is automatically awarded first base.
Strikeout (K): When a batter accumulates three strikes, they are out.
Park Factor (PF): Park factor takes the runs scored by Team X (and its competitors) in Team X's home ballpark and divides that figure by the runs scored by Team X and its competitors in Team X's road contests. For example: in 2014, 642 runs were scored at Kauffman Stadium, and 633 runs were scored in Royals games away from Kauffman Stadium. That gives Kauffman Stadium a park factor of 1.014 for runs scored. The number is adjusted if a team doesn't play the same opponents at home as on the road.
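The Kauffman Stadium example above can be checked with a one-line computation (numbers taken from the slide):

```python
# Park factor = runs scored at home / runs scored on the road
# (2014 Kauffman Stadium figures quoted in the slide above).
home_runs_scored = 642
road_runs_scored = 633

park_factor = home_runs_scored / road_runs_scored
print(round(park_factor, 3))  # 1.014
```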

Formal Definition
Given two sets of pitchers, S_HOF and S_Rest, each element of each set corresponds to a starting pitcher represented by a 3-dimensional vector of that pitcher's average Earned Run Average- (ERA-), Fielding Independent Pitching- (FIP-), and Walks plus Hits per Inning Pitched (WHIP) from his prime seasons (ages 25 through 31). S_HOF contains starting pitchers currently in the Hall of Fame; S_Rest contains all other starting pitchers. Each MLB starting pitcher from 1960 to 2010 who pitched at least 1,000 innings over those prime seasons can be classified into one of these sets. The task is then to determine whether public perception of these pitchers matches their statistical career performance, and whether or not they belong in the Hall of Fame.

Natural Language
Based on three reliable sabermetrics for Hall of Fame pitchers, we can use this data to predict who future Hall of Fame players might be, regardless of public opinion of those players.

Context
Websites such as FanGraphs and Baseball Reference aggregate data from every baseball game ever played. Newscasters, sports commentators, and even gambling sites such as FanDuel use this data to make predictions or to weigh players against each other in order to pick the best player possible. Members of the BBWAA vote on players once they are eligible for the Hall of Fame. Our algorithms will be used to predict whether past players and currently ineligible players will make the Hall of Fame based on their prime stats.

Pitchers Used for the Hall of Fame Set
Sandy Koufax, Pedro Martinez, Tom Seaver, Juan Marichal, Greg Maddux, Roger Clemens, Gaylord Perry, Fergie Jenkins, Nolan Ryan, Don Sutton, Bob Gibson, Curt Schilling, Bert Blyleven, Steve Carlton, John Smoltz, Phil Niekro, Jim Palmer, Robin Roberts, Randy Johnson, Tom Glavine

Experimental Procedure
Using the ERA-, FIP-, and WHIP values for our list of Hall of Fame pitchers, we applied a Gaussian Naive Bayesian classifier, a K-Nearest Neighbor classifier, and a naive brute-force method involving linear regression. These classification algorithms were chosen so we could evaluate the efficacy of each, compare their runtimes to decide which method is not only fastest but most accurate, and compare the results to public opinion about which players are considered "Hall of Fame worthy".

What Is a Gaussian Naive Bayesian Classifier?
A Gaussian naive Bayesian classifier is a classifying function that uses Bayes' Theorem to predict the class (HOF or not) of an input. It makes a "naive" assumption that the features within each class are independent, and it assumes each feature follows a normal (Gaussian) distribution.
Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
Probability Density Function: f(x) = (1 / sqrt(2πσ²)) exp(-(x - μ)² / (2σ²))
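A minimal sketch of the classifier described above, not the authors' actual code; the per-class means, variances, and priors below are made-up illustrative values for (WHIP, FIP-, ERA-):

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian probability density function used by the classifier."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_score(features, class_stats, prior):
    """Score one class: the prior times the product of per-feature Gaussian
    densities (the 'naive' independence assumption)."""
    score = prior
    for x, (mean, var) in zip(features, class_stats):
        score *= gaussian_pdf(x, mean, var)
    return score

# Hypothetical per-class (mean, variance) for (WHIP, FIP-, ERA-):
hof_stats  = [(1.12, 0.01), (77.4, 60.0), (74.4, 110.0)]
rest_stats = [(1.35, 0.01), (101.5, 60.0), (102.2, 110.0)]

pitcher = (1.10, 80.0, 76.0)  # a hypothetical prime-years stat line
hof = naive_bayes_score(pitcher, hof_stats, prior=0.1)
rest = naive_bayes_score(pitcher, rest_stats, prior=0.9)
print("HOF" if hof > rest else "Not HOF")  # prints "HOF"
```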

Is It Normal?
          WHIP    FIP-    ERA-
Mean      1.354   101.5   102.2
Median    1.35    101     102
Mode      1.35    102     103

Bayesian Classifier Results
Avg time for all real pitchers (232): 0.009232 seconds
Avg time for 1,000 pitchers: 0.015626 seconds
Avg time for 10,000 pitchers: 0.192508 seconds
Avg time for 100,000 pitchers: 1.637600 seconds
Avg time for 1 million pitchers: 16.24630 seconds
(These times are an average of ten runs.)

Pitchers found eligible: Adam Wainwright, Ben Sheets, Bret Saberhagen, Camilo Pascual, CC Sabathia, Cole Hamels, David Cone, Erik Bedard, Gary Peters, Jake Peavy, Jim Bunning, Jim Kaat, Johan Santana, Jon Lester, Jose Rijo, Josh Johnson, Kevin Brown, Kevin Millwood, Mike Cuellar, Mike Mussina, Orel Hershiser, Ron Guidry, Roy Oswalt, Teddy Higuera, Tim Lincecum, Ubaldo Jimenez, Wilbur Wood, Zack Greinke
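The "average of ten runs" timing methodology used throughout these result slides can be sketched as follows; the classifier here is a trivial stand-in, and the synthetic stat ranges are assumptions:

```python
import random
import time

def classify(pitchers):
    # Stand-in for one of the classifiers: a trivial WHIP threshold check.
    return [whip < 1.2 for (whip, fip_minus, era_minus) in pitchers]

# Synthetic pitcher stat lines, as in the slide's scaling tests.
random.seed(0)
pitchers = [(random.uniform(0.9, 1.8), random.uniform(60, 130),
             random.uniform(60, 130)) for _ in range(10_000)]

runs = []
for _ in range(10):  # average over ten runs, as the slides describe
    start = time.perf_counter()
    classify(pitchers)
    runs.append(time.perf_counter() - start)

print(f"avg time for {len(pitchers)} pitchers: {sum(runs) / len(runs):.6f} s")
```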

Bayesian Sources of Error
Certain individuals whose statistics merit the Hall of Fame may have been snubbed by voters. The training set needs more Hall of Fame pitchers.

Bayesian Classifier Interpretation
Overall algorithm complexity is O(n). Slightly more pitchers made the cut with this algorithm than with Linear Regression and kNN.

K Nearest Neighbor Algorithm

How Does It Work?
K Nearest Neighbor uses three inputs:
K, a user-defined constant giving the number of nearest training points that vote on the classification
Two labeled training sets, for comparison
One test point, whose distances to points in the two training sets are measured
The algorithm classifies the test point according to which training set dominates among its k nearest neighbors.
Aside: our K Nearest Neighbor operates on three-dimensional points rather than the traditional two-dimensional points.

Pitcher Object
A CSV file containing four variables is passed through to a pitcher object:
Pitcher name
Pitcher WHIP
Pitcher ERA-
Pitcher FIP-
3D points: the points used were WHIP (x variable), ERA- (y variable), and FIP- (z variable).
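A minimal sketch of the 3-D nearest-neighbor classification the two slides above describe; this is not the authors' code, and the training stat lines are hypothetical:

```python
import math
from collections import Counter, namedtuple

# Each pitcher is a point in (WHIP, ERA-, FIP-) space, as on the slide.
Pitcher = namedtuple("Pitcher", ["name", "whip", "era_minus", "fip_minus"])

def distance(a, b):
    """Euclidean distance between two pitchers' 3-D stat vectors."""
    return math.sqrt((a.whip - b.whip) ** 2 +
                     (a.era_minus - b.era_minus) ** 2 +
                     (a.fip_minus - b.fip_minus) ** 2)

def knn_classify(test, labeled, k=3):
    """Label `test` by majority vote among its k nearest labeled pitchers."""
    nearest = sorted(labeled, key=lambda pl: distance(test, pl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (pitcher, label) pairs drawn from both sets.
training = [
    (Pitcher("HOF-A", 1.10, 75, 78), "HOF"),
    (Pitcher("HOF-B", 1.15, 72, 80), "HOF"),
    (Pitcher("Rest-A", 1.40, 105, 103), "Rest"),
    (Pitcher("Rest-B", 1.45, 110, 108), "Rest"),
]

candidate = Pitcher("Candidate", 1.12, 74, 79)
print(knn_classify(candidate, training, k=3))  # prints "HOF"
```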

K-Nearest Neighbor Results
Avg time for all real pitchers (232): 0.007753 seconds
Avg time for 1,000 pitchers: 0.033932 seconds
Avg time for 10,000 pitchers: 0.358340 seconds
Avg time for 100,000 pitchers: 3.494425 seconds
Avg time for 1 million pitchers: 34.915516 seconds

Pitchers found eligible: Mike Cuellar, Teddy Higuera, Tim Lincecum, Jake Peavy, Josh Johnson, Adam Wainwright, Zack Greinke, Ben Sheets, Wilbur Wood, Roy Oswalt, Jose Rijo, Brandon Webb, Ron Guidry, Kelvim Escobar, Orel Hershiser, Dan Haren, Bret Saberhagen, Erik Bedard, David Cone, Kevin Appier, Jon Lester, Mike Mussina, Camilo Pascual, Tim Hudson, Ubaldo Jimenez, CC Sabathia, Johan Santana

Conclusion
The algorithm as a whole has a Big-O of O(n*m). Because our small training set was used across all trials, the effective Big-O of this experiment was O(n), as the run times show.
Errors: Due to the small number of starting pitchers who qualified for this experiment, the accuracy of this algorithm cannot be verified without adding normalized sets of 1,000, 10,000, 100,000, and 1,000,000 pitchers. There are certain members of the Hall of Fame who do not deserve to be there based on their statistics, as well as certain pitchers the BBWAA doesn't believe belong in the Hall of Fame even though their statistics say they should.
Results: In the end, this experiment found that the algorithm elected a similar percentage of pitchers to the Hall of Fame as the BBWAA; however, it did not elect the same members.

What Is Linear Regression?
Linear regression is the analysis of an independent variable, which in our case was the statistics of the Hall of Fame players, against a dependent variable, which in our case was the rest of the pitchers in the MLB. The analysis creates a map showing the trend of each data point; using this map, one can determine which data from the dependent variable closely relates to the independent variable. In short: how far does the dependent variable stray from the line of best fit that the independent variable generated?

Hall of Fame Data Used
FIP- mean: 77.37, FIP- SDev: 77.372
WHIP mean: 1.12, WHIP SDev: 0.104
ERA- mean: 74.36, ERA- SDev: 10.474

Linear Regression (Naive Brute Force)
Obtained the mean and standard deviation for the Hall of Fame pitchers.
Ran every pitcher through a comparison.
If a pitcher's stats fell within ±1 SD of the HOF pitchers' mean, he was deemed HOF-worthy.
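The brute-force comparison above can be sketched like this; the HOF means and SDevs are taken from the "Hall of Fame Data Used" slide, while the candidate's numbers are hypothetical:

```python
# Mean and standard deviation of HOF pitchers for each stat,
# as quoted on the "Hall of Fame Data Used" slide.
HOF_STATS = {
    "whip": (1.12, 0.104),
    "era_minus": (74.36, 10.474),
    "fip_minus": (77.37, 77.372),  # SDev as quoted on the slide
}

def hof_worthy(pitcher):
    """A pitcher is HOF-worthy if every stat is within +/- 1 SD of the HOF mean."""
    return all(abs(pitcher[stat] - mean) <= sdev
               for stat, (mean, sdev) in HOF_STATS.items())

candidate = {"whip": 1.10, "era_minus": 76.0, "fip_minus": 80.0}
print(hof_worthy(candidate))  # True
```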

L.R. Results
Avg time for all real pitchers (232): 0.000204 seconds
Avg time for 1,000 pitchers: 0.000848 seconds
Avg time for 10,000 pitchers: 0.008479 seconds
Avg time for 100,000 pitchers: 0.084114 seconds
Avg time for 1 million pitchers: 0.882123 seconds
(The algorithm was run 10 times to produce these averages.)

Pitchers found eligible: Jose Rijo, CC Sabathia, Johan Santana, Ron Guidry, Bret Saberhagen, Mike Mussina, Roy Oswalt, Wilbur Wood, Orel Hershiser, Dan Haren, David Cone, Camilo Pascual, Teddy Higuera, Jim Bunning, Gary Peters, Steve Rogers

L.R. Sources of Error
The method is not accurate enough: some HOF players did not make the cut. The sample size for HOF players is too small; a bigger sample would be better.

L.R. Interpretation
Overall algorithm complexity: Big-O of O(n), Big-Omega of Ω(1). Runtime increases by a factor of 10 as the number of pitchers increases tenfold.

Conclusions
Although the three algorithms approached this problem with different methods, all three produced very similar lists, making all three viable options for identifying potential Hall of Fame players in the MLB. The biggest difference is the complexity of each algorithm and how that affects its overall processing speed and accuracy.

Future Study
Since baseball will exist for the foreseeable future, we can continue to use this approach to predict who may be great for fantasy teams, as well as potential Hall of Fame nominees. With data from more pitchers in the future, we may be able to verify whether our predictions actually come true.

Questions
How does the K-Nearest Neighbor algorithm classify data? It classifies data based on the K variable, measuring the closest points within that K-neighborhood; the training set the test data is closest to is the one the test data is classified into.
What makes it a Naive Bayesian classifier? It makes the naive assumption that the features of each class are independent of each other.
What type of distribution is required for a Gaussian naive Bayesian classifier? A normal (Gaussian) distribution.
Why does the Linear Regression algorithm have a Big-O of O(n)? Each individual comparison is at worst O(1) to calculate, but the algorithm needs to iterate through n items in the test set.
What happens to a Linear Regression if there is very little data to compare against? The algorithm isn't as helpful, because a small base set leads to inaccurate measurements to compare against.

Questions, comments, concerns?