One day while I was driving, I asked myself: what is the probability of anyone getting into an accident? Hmm, I’m a data scientist, so I should be able to get the numbers fairly easily. As soon as I got home I started analyzing the problem.
I thought to myself, let’s take the same approach that we use on the job. The strategy I use is taken from the book “How to Solve It” by G. Polya. In this book Polya outlines the steps required to solve any problem, based on a mathematical method.
The first step is to figure out the unknown, get the data, find out the condition and see whether it is redundant, insufficient, or contradictory. In my problem the unknown is the probability of me getting into an accident. For the data I used a publication from NHTSA (https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812783). It provides quarterly aggregates of the number of accidents, accidents per 100 million miles driven, total miles driven and more. The last piece in the first step is the condition, and this made me think. Hmm, what is the condition here? The condition could be that I keep driving every other weekend through all four quarters of the year (which I do). Every step here is an iterative step, so let’s go back and think about what else we are missing. We are missing the data about me, which needs to be translated into numbers. I drive every other weekend, let’s say about 6 weekends in a quarter, which comes to around 4000 miles in a quarter on average.
We’ve come to the second step now, which is to devise a plan. We need to find the connection between the data and the unknown, and we may need to consider auxiliary problems. One of the main action items in step two is to find an analogous problem and see whether it has been solved before, or whether part of its solution can be reused here. The analogous problem that comes to mind is the insurance industry. They must have solved this problem, and there is probably a solution on the internet already. But since I was personally interested in solving the problem myself, I decided not to “google” the solution. The other part of the plan is to see what type of distribution best describes this data. When we are counting events over an interval, the first distribution that comes to mind is the Poisson distribution, which can be used to find the probability of 0 events in a unit of time. This also leads us to the exponential distribution, which describes the amount of time left until I meet an accident from any given point in time. On top of that, the exponential distribution is memoryless: the time remaining until an accident does not depend on how long I have already driven without one, which makes it a convenient fit for this problem.
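To make the memoryless property concrete, here is a minimal sketch. The rate is an assumption on my part: it takes the 1.1 accidents per 100 million miles quoted later in this post and scales it to 4000 miles per quarter. The function and variable names are only illustrative.

```python
import math

# ASSUMPTION: 1.1 accidents per 100 million miles (figure quoted later in the
# post), scaled to 4000 miles driven per quarter.
lam = 1.1 * 4_000 / 100_000_000  # expected accidents per quarter


def survival(rate, t):
    """P(T > t): probability that the exponential waiting time exceeds t quarters."""
    return math.exp(-rate * t)


# Memorylessness: having already driven s quarters without an accident does
# not change the probability of getting through the next t quarters.
s, t = 8, 4
conditional = survival(lam, s + t) / survival(lam, s)  # P(T > s + t | T > s)
unconditional = survival(lam, t)                       # P(T > t)

# Mean waiting time of an exponential distribution is 1 / rate.
mean_wait_quarters = 1 / lam

print(f"P(T > s+t | T > s) = {conditional:.8f}")
print(f"P(T > t)           = {unconditional:.8f}")
print(f"mean wait          = {mean_wait_quarters:.0f} quarters")
```

The two probabilities agree up to floating-point error, which is exactly what memoryless means here.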
Now we go to the third step. Here we just have to find the event rate (the average for a quarter) and apply the Poisson formula: the probability of zero events happening in one quarter is exp(-lambda), where lambda is the rate. The number comes to 0.99, which is the probability of me not getting into an accident. I used the average number of accidents per quarter from the data.
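The calculation can be sketched in a few lines. The exact rate I plugged in is not spelled out above, so the inputs here are assumptions: 1.1 accidents per 100 million miles (the figure quoted later in this post) and 4000 miles driven per quarter. The helper names are hypothetical.

```python
import math

# ASSUMPTIONS: illustrative inputs, not necessarily the exact ones used above.
ACCIDENTS_PER_100M_MILES = 1.1
MILES_PER_QUARTER = 4_000


def quarterly_rate(accidents_per_100m_miles, miles_per_quarter):
    """Scale the per-100-million-mile rate to the miles one driver covers in a quarter."""
    return accidents_per_100m_miles * miles_per_quarter / 100_000_000


def prob_no_accident(lam):
    """Poisson probability of zero events: P(X = 0) = exp(-lambda)."""
    return math.exp(-lam)


lam = quarterly_rate(ACCIDENTS_PER_100M_MILES, MILES_PER_QUARTER)
p_none = prob_no_accident(lam)

print(f"lambda per quarter: {lam:.6f}")
print(f"P(no accident in a quarter): {p_none:.6f}")
print(f"P(at least one accident):    {1 - p_none:.6f}")
```

Under these assumed inputs the probability of no accident comes out above 0.99; a different rate taken from the NHTSA tables would shift the exact value.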
And finally, we are at the fourth step. This step is to examine the result, check the argument and see whether we can obtain the result differently. Hmm, my intuition says: why not find a population of people similar to me and compute the probability on that? We don’t have that data, so we need to assume a lot. OK, let’s take one quarter’s worth of 100,000,000 miles (in which 1.1 accidents happened) and say everyone in this set is driving to see their girlfriend and each of them drives 4000 miles in a quarter, so that’s 25,000 people. We will round the accidents up to 2, since we can’t have 1.1 accidents and, considering the importance of the problem, it is safer to round up. So the odds of me getting into an accident are 1 in 12,500.
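The arithmetic of this cross-check, using only the numbers above, looks like this as a small sketch (the variable names are mine):

```python
import math

TOTAL_MILES = 100_000_000    # miles in the reference set
ACCIDENTS = 1.1              # accidents over those miles
MILES_PER_DRIVER = 4_000     # assumed miles per driver per quarter

# Number of drivers like me needed to cover the reference miles.
drivers = TOTAL_MILES // MILES_PER_DRIVER

# Round the accident count up, to stay on the cautious side.
rounded_accidents = math.ceil(ACCIDENTS)

# Odds expressed as "1 in N".
one_in = drivers / rounded_accidents

print(f"drivers: {drivers}")
print(f"accidents (rounded up): {rounded_accidents}")
print(f"odds: 1 in {one_in:.0f}")
```

Without the rounding, 1.1 accidents among 25,000 drivers would be roughly 1 in 22,727, so rounding up to 2 makes the estimate deliberately pessimistic.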
The conclusion is that the probability is low, but an accident is still possible.