php – Typical URL lengths for storage calculation purposes (URL-shortener)

Exception or error:

After reading several of the hits from a quick Google search, it seems there is not much consistency when it comes to determining average URL length.

I know IE has a maximum URL length of 2083 characters, so I have a good upper bound to work with.

My concern is that I am writing a URL-shortener in PHP (similar to some other questions on SO), and want to make sure I am not likely to exceed the storage capability of the server hosting it.

If all URLs are the IE maximum, then 2^32 of them won’t fit comfortably anywhere: it’d take 2K x 4B ~= 8TB of storage, an unrealistic expectation.

Without adding a trimming function (i.e., purging “old” shortened URLs), what is the safest way to calculate the storage usage of the app?

Is ~34 characters a safe guess? If so, then a fully populated database (using an int type for the primary key, so 2^32 rows) would chew up about 146GB for the URLs alone, or 292GB if that figure is doubled to allow for any metadata that may need to be stored.
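
The arithmetic above can be checked with a short sketch. Python is used here only to do the sums (they carry over directly to PHP). The function name `estimate_storage_bytes` and the `overhead_factor` parameter are illustrative assumptions, and the result counts one byte per URL character, ignoring real database overhead such as indexes and row headers.

```python
def estimate_storage_bytes(avg_url_len, rows=2**32, overhead_factor=1.0):
    """Rough storage estimate: one byte per URL character, times a fudge factor.

    overhead_factor is an assumed multiplier (e.g. 2.0 to double the raw
    figure for metadata); it is not a measured database overhead.
    """
    return int(rows * avg_url_len * overhead_factor)


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 10**9


if __name__ == "__main__":
    # 34 is the guess from the question, 2083 is the IE maximum.
    for avg_len in (34, 2083):
        raw = estimate_storage_bytes(avg_len)
        doubled = estimate_storage_bytes(avg_len, overhead_factor=2.0)
        print(f"{avg_len:>5} chars: {to_gb(raw):,.0f} GB raw, "
              f"{to_gb(doubled):,.0f} GB doubled")
```

For the 34-character guess this reproduces the roughly 146GB and 292GB figures in the question.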

What is the best-guess for an application such as this?

How to solve:

Well, you don’t need to know the average URL length. It is a guess, but I’d figure that a URL shortener is mainly used to shorten long URLs. Why bother shortening one that is short already? 🙂

That said, there’s another issue. A database will have some overhead too, so you can’t just calculate an average and say that is the average byte size per row.

I’ve written a URL shortener myself and it already contains about 45 items. So I’d suggest you write yours, and by the time it actually contains 2^32 URLs, buying an 8TB hard disk will probably not pose a problem anymore. 😉

Answer:

This is probably unknowable without indexing the entire Internet, but according to an analysis by Kelvin Tan on a dataset of 6,627,999 unique URLs from 78,764 unique domains, the mean length is 76.97 characters:

Mean: 76.97

Standard Deviation: 37.41

95th% confidence interval: 157

99.5th% confidence interval: 218

Answer:

I’m not sure what is typical, but of 11,000 URLs in our request database, the average length is 62 characters. We may be an exception because every month we receive hundreds of requests from our customers for items from Japan. Our database includes hundreds of URLs that are several hundred characters long. The longest is a Google Translate link at 1689 characters.

Top 10 len(producturl):
1689
792
707
693
647
606
574
569
562
560

Sample URL (647 characters):

http://www.amazon.co.jp/%E9%AD%94%E7%95%8C%E6%88%A6%E8%A8%98%E3%83%87%E3%82%A3%E3%82%B9%E3%82%AC%E3%82%A4%E3%82%A24-%E5%88%9D%E5%9B%9E%E9%99%90%E5%AE%9A%E7%89%88-%E5%A0%95%E5%A4%A9%E4%BD%BF%E3%83%95%E3%83%AD%E3%83%B3-%E3%83%97%E3%83%AD%E3%83%80%E3%82%AF%E3%83%88%E3%82%B3%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D%E7%89%B9%E8%A3%BD%E3%82%AB%E3%83%BC%E3%83%89-%E3%83%88%E3%83%AC%E3%83%BC%E3%83%87%E3%82%A3%E3%83%B3%E3%82%B0%E3%82%AB%E3%83%BC%E3%83%89%E3%80%8C%E3%83%B4%E3%82%A1%E3%82%A4%E3%82%B9%E3%82%B7%E3%83%A5%E3%83%B4%E3%82%A1%E3%83%AB%E3%83%84%E3%80%8D%E9%99%90%E5%AE%9APR%E3%82%AB%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D/dp/B0043RT8UO/ref=pd_rhf_p_t_1

For estimation purposes, you should extrapolate from a dataset after using the standard deviation to throw out the outliers that could distort your mean.
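
One way to read that advice is sketched below: compute the mean and standard deviation, drop values further than a chosen number of standard deviations from the mean, then average what is left. The function name `mean_without_outliers`, the 2-sigma cutoff, and the sample lengths are all assumptions made for illustration (only 1689 comes from the list above); they are not the method actually used on our database.

```python
from statistics import mean, pstdev


def mean_without_outliers(lengths, max_sigma=2.0):
    """Mean of the values within max_sigma standard deviations of the mean."""
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        return mu  # all values identical, nothing to trim
    kept = [n for n in lengths if abs(n - mu) <= max_sigma * sigma]
    # Fall back to the plain mean if the cutoff was so tight it removed everything.
    return mean(kept) if kept else mu


if __name__ == "__main__":
    # Hypothetical sample: mostly short URLs plus one very long outlier.
    sample = [40, 55, 60, 62, 70, 75, 80, 90, 1689]
    print("raw mean:    ", round(mean(sample), 1))
    print("trimmed mean:", round(mean_without_outliers(sample), 1))
```

A single pass like this is crude: with several extreme values the standard deviation itself is inflated, so a median or percentile-based cutoff may be a better fit for heavily skewed URL data.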

Answer:

From RFC 2068 section 3.2.1:

The HTTP protocol does not place any a priori limit on the length of
a URI. Servers MUST be able to handle the URI of any resource they
serve, and SHOULD be able to handle URIs of unbounded length if they
provide GET-based forms that could generate such URIs. A server
SHOULD return 414 (Request-URI Too Long) status if a URI is longer
than the server can handle (see section 10.4.15).

Note: Servers should be cautious about depending on URI lengths
above 255 bytes, because some older client or proxy implementations
may not properly support these lengths.

Although IE (and probably most other browsers) supports much longer URIs, I don’t believe most forms or client-side apps rely on anything above 255 bytes working. Your server logs should provide some statistics about what kind of URLs you are seeing.
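
As a rough sketch of pulling such statistics from logs, the snippet below extracts the request URI from lines in Common Log Format and summarizes the lengths. The log format, the names `request_uri_lengths` and `summarize`, and the sample lines are assumptions; note too that a log records only the path and query string, not the scheme and host, so the figures are a proxy for full URL length.

```python
import re
from statistics import mean

# Matches the quoted request line in Common Log Format, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(
    r'"(?:GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) (\S+) HTTP/[\d.]+"'
)


def request_uri_lengths(log_lines):
    """Return the length of the request URI found on each matching log line."""
    lengths = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            lengths.append(len(match.group(1)))
    return lengths


def summarize(lengths):
    """Return (count, mean, max) for a non-empty list of lengths."""
    return len(lengths), mean(lengths), max(lengths)


if __name__ == "__main__":
    # Hypothetical log lines; in practice, iterate over an opened access log file.
    sample = [
        '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /search?q=url+length HTTP/1.1" 200 512',
        'malformed line without a request',
    ]
    print(summarize(request_uri_lengths(sample)))
```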
