A browser cookie contains a 128 bit "unique" identifier. 128 bits is an inconvenient size, being longer than native integer size on nearly all CPUs. The 128 bit number can be split into two 64-bit numbers; are both needed to correlate user records?
In my sample of 103 million cookie identifiers, there were only 2 collisions when grouping by the left half of the identifier. For statistical calculations, this 0.000002% error rate is negligible.
The expected number of collisions among k samples out of n possible codes is:
k ( 1 − ( 1 − 1/n )^{k−1} ) |
In order to keep the numbers within double-precision range, use the fact that ln(1+ε) ≈ ε for ε near 0:
k ⋅ ( 1 − ( 1 − 1/n )^{k−1} ) | ≈ | k ⋅ ( 1 − e^{(1−k)/n} ) |
Solving the expected-collisions formula for n will give an estimate of the entropy (c is the number of collisions):
n ≈ | 1 − k
ln( 1−c/k ) |
M N 3001457e-e07a-4b20-ba16-f3f8624da98f 73f9436b-2868-4325-a62c-bf5a4f38dfdf 1c4df5b6-210b-4aba-9660-bdb129ea6f31 34c3fcb1-bf16-4a96-a7e9-ec377bface30 a1ddb5fc-4bb8-4321-a1bf-753668387197 3edd32e0-5bad-476d-bc1c-8cd8c80b88f8 2ee886c8-d877-480c-b109-455cd3c1b099 5c76c27d-7caf-4adf-8aec-65fee5ce281a 93af502b-61db-4596-bcfe-188c352296b5 ed981151-6179-45cf-8e0d-e4293caf62a0 54ea634b-ebfc-4371-97a7-5b1123264e4b 9a3cce97-2a4b-4a56-8fd9-85d9f07abce4 0bfdbd6f-5fd8-4c74-ab7e-1ccb1a81fd70 84ac5d8a-c64f-48a3-8d11-653aea56e2b4 f5e3afb2-e87b-45e1-9b1c-a49d3e106124 b5f832f5-0771-4554-8a00-da142aab1832 ca025994-9412-4311-bb62-c7e6458a94dc cf1fc6a6-b3d4-47a5-8aa5-063120d5b8c4 83add10c-b266-4995-8803-805d32c58147 d85c3ec5-3d77-4767-b87b-a42a20dc9f74 b6104d07-b29e-4b07-b3c8-65efd08bd003 21111ab5-83ab-4af8-a39f-1824a4055a9a 78346096-3202-48d8-9ed8-797a0f233804 1d124d3d-8ac2-405b-9a8a-afb70dac4720 e187bd94-c091-4fb5-a3c4-2a50e2fe4e35 8dca4524-cb3d-4a40-ae4d-881976a6173f 94e458b1-04c8-4495-b3fb-4cbbc4c37561 7d0819df-4a08-4553-b10b-379433b09e35 74332f07-3183-479b-a285-04c1a9e61bc4 4ac04ba2-950e-4315-bb14-eb3369dd1591 f8a20163-1db1-47e4-a6b5-d8fa4519eb79 ba778464-0bf4-4575-93e9-ac5b41923e3a 6dda046c-651d-4d9c-9482-4d3c1247a14aThis turns out to be RFC-4122 A Universally Unique IDentifier (UUID) URN Namespace format version 4. The left half of the UUID should have 60 bits of entropy if the random or pseudo-random source is uniformly distributed. The left half having only 52 bits of entropy indicates that the source quality could be improved.
Note that a linearly increasing sequence would not have any collisions although, as a sequence, it has low sequence-entropy. Although having the expected hash entropy is a test which a pseudo-random-number-generator should pass, PRNG sequences must also be "random". So low collision counts are a necessary, but not sufficient condition for PRNG quality.
This collision test's insensitivity to sequence makes it useful for testing hash functions. A good hash function will have have entropy near the number of codes it can output.
Sarah V. Jaffer has a much simpler way to compute the number of collisions. k is much smaller than n. The probability of one choice hitting another is k/n. There are k chances to hit another choice; so the expected number of collisions is roughly k^{2}/n. In the cookie calculation, Sarah's formula matches within one part-per-million.
How are these formulas related? Because ln(1+ε)≈ε, it follows that e^{ε}−1≈ε (for small ε). Thus the expected number of collisions is:
Realizing that k≫1 leads to Sarah's formula k^{2}/n.
k ⋅ ( 1 − e^{(1−k)/n} ) ≈ k ⋅ (k−1)
n