Probability of 10 random digits from an array of [0,1,2,3,4,5,6,7,8,9] occuring twice or more
Sebastian Wright
It's probably something trivial for you guys, but I am quite bad at math, so ;).
BACKGROUND:
I want to create a system of "ID's" in the URL like:
instead of
this pattern is easily predictable and you just need to replace the value after the "article" and inrease in every cycle by 1.
or
because the article name could change and there are other problems involved, like security (digits are easier to validate and trim etc. than if you allow other letters.) and in this case I don't want Google or other search engines to index my articles, because they will be private anyway.
Where every article is accessible like:
or
etc.
I think you get the idea now.
MATHEMATICAL PROBLEM:
I know how to check if the 10 digits number generated by the script is already used in my database for any record or not, however, I would like to know what is the probability of this event happening ;D.
I can easily test if my "if the record exist in the database" condition works by manually set the same value to the variable instead of generating a random number.
However, I am still interested how do you compute that using mathematics, I mean the probability of the same 10 digits occurring twice. If every digit can be repeated in that number and the array of digits is 0,1,2,3,4,5,6,7,8,9.
IMPORTANT! With every new number stored in my database this number needs to be considered as a new possible duplicate. I don't know how it's called in Math, but I guess the possibility of generating a duplicate rises with every new generated number in my database.
Is there some formula how to calculate that?
What is the chance of having the same number (with the conditions mentioned above) within e.g. 1 000 000 db records?
Is there any formula how to get the result using mathematics?
When will the second occurrence will happen (what is the probability)?
Is there a formula for that? Do I need to take in count the processor unit in the server or it's not important in this case and the formula is "universal"?
$\endgroup$ 42 Answers
$\begingroup$I saw this in Mathematica and I had half written an answer so it would be a shame to not post it.
This is the same as the birthday problem which is normally phrased as "how many people do you need to put in a room before the probability of any two of them having their birthday on the same day exceeds x% ?". To solve this you think the negation of what is asked as in what is the probability of no two people having their birthday on the same day. Then you start putting people in an imaginary room. For 1 person we are guaranteed there is noone else in the room with the same birthday. When we add another person, to guarantee she won't have the same birthday as the first we need to demand that her birthday is in any of the other 364 days of the year. For the next person the same and so on for the n-th person: $$ P=1 \times \frac{365-1}{365} \times \frac{365-2}{365}\times\dots \times\frac{365-n}{365} $$ Notice this is the probability that no two people will have the same birthday so its complement $ (1 - P) $ is what we were asked in the first place.
Now in your problem, you are asking what is the probability that you will have reproduced the 10-digit number by using a random selection of 10-digit numbers. The probability of you not reproducing the same 10-digit number is calculated with the same reasoning. The first number you pick will be unique. The second can be any number but for the initial one and so on. Given that the numbers you can make with 10 digits repeating is $ 10^{10} $ you have:
$$ P=1 \times \frac{10^{10}-1}{10^{10}} \times \frac{10^{10}-2}{10^{10}}\times\dots \times\frac{10^{10}-n}{10^{10}}. $$
As before, $ P $ is the probability that no two numbers will have been repeated after $ n $ have been chosen. Mathematica calculated that product analytically as
$$ \left(-\frac{1}{10000000000}\right)^n (-9999999999)_n $$
where the underscore denotes the Pochhammer function. Calculating this for 10000 numbers gives me a probability of $ 0.5 \% $ that any two of your numbers will be reproduced.
Unless you are going to have many more than that, you need not worry :)
$\endgroup$ $\begingroup$As pointed out by @gpap, this is the Birthday Problem. More generally, let's say you have M possible ID's, so your particular case is M = 10^10. According to [1], an asymptotic estimate of the expected number of ID's generated until the first collision occurs is
$$\sqrt{\frac{\pi M}{2}} + \frac{2}{3}$$
which works out to be approximately 125,332 when M = 10^10.
- Robert Sedgewick and Philippe Flajolet, An Introduction to the Analysis of Algorithms, Second Edition, Theorem 9.1