Request URI Handling in HTTP: A Primer
There’s a pretty simple convention in place for handling requests for files, directories, and indexes on a web server. It works pretty well, and it is definitely worth understanding what happens and why, if you don’t already.
This came up today in the context of a question about Snap, so I figured I should write about it. This topic isn’t particularly advanced, so it is likely to be nothing new if you already know HTTP fairly well. But I couldn’t find a reference to point to, so I’m writing it down.
Web servers have to be pretty flexible when it comes to handling requests for files. People might type any of the following URLs:
So what should happen in response to each? And why?
This post covers the basics. Remember that we’re only talking about basic, straightforward file serving here. Once you throw in custom code and server-side programming, things can change a lot. But a solid understanding of the basics is a stepping stone to understanding the things built on it… and all of HTTP is built on the metaphor of serving files from a web server. So here we go.
This is the easy one, because it’s not up to the web server at all. If a user types the URL http://example.com, their web browser is required to send the request with a request URI of “/”, exactly as if they had typed the trailing slash themselves. This is a matter of HTTP syntax; because request URIs are not quoted in HTTP, there’s no way to send an empty one.
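As a rough illustration of that rule, here is a minimal Python sketch of how a client might turn a typed URL into a request line. The function names `request_target` and `request_line` are made up for this example; only `urlsplit` comes from the standard library.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (and query, if any) a client would put on the request line."""
    parts = urlsplit(url)
    # An empty path cannot be sent, so the client substitutes "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def request_line(url: str, method: str = "GET") -> str:
    """Build the first line of an HTTP/1.1 request for the given URL."""
    return f"{method} {request_target(url)} HTTP/1.1"


print(request_line("http://example.com"))   # GET / HTTP/1.1
print(request_line("http://example.com/"))  # GET / HTTP/1.1
```

Both URLs produce the same request line, which is why the server never sees a difference between them.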
In this case, as well, the request URI sent by the web browser will be “/”. So what is a web server going to do with that?
Most people, I think, know this answer. The server finds the root directory of content it expects to serve. (On many UNIX web servers, for example, this might be called /var/www, but server configuration is a whole different topic, and there are many reasons it might be something else.) Since the request was for the entire directory, it typically tries to find a file that represents an index of all the content in that directory.
Web servers can be configured to look for this content under many names, but a common one is “index.html”. A server will generally have a list of such index names, in order of preference: perhaps index.html, followed by index.htm, and so on. If one of these files exists, the server sends its contents as a response to the request.
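The index lookup can be sketched in a few lines of Python. This is only an illustration: the list `INDEX_NAMES` and the function `find_index` are names invented here, and a real server reads its preference list from configuration.

```python
import os

# Assumed preference order; real servers make this configurable.
INDEX_NAMES = ["index.html", "index.htm"]


def find_index(directory, names=INDEX_NAMES):
    """Return the path of the first index file present in directory, or None."""
    for name in names:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If `find_index` returns a path, the server sends that file's contents; if it returns None, the server has to fall back to something else.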
Notice that typically, the client (that is, the web browser) never actually gets told that it is receiving a file named index.html. Instead, it just knows that it sent a request for the root directory (the request URI “/”) and got back some HTML. It does know that it’s HTML, because of the Content-Type header in the response, but it doesn’t get sent the file name.
If no index.html or similar file exists, then the server has a choice about what to do next. Historically, web servers generated on the fly a list of all the files in the directory, formatted as HTML, and served that in response to the request for the directory. However, in more recent years, the obfuscation-based security movement has made that less common, so it’s more common to get back an error message letting you know that the server doesn’t want to tell you what’s in the directory. The error is often 403 (Forbidden), though it’s sometimes 404 (Not Found) as well.
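For the curious, generating such a listing is not much work. The following Python sketch is one possible way to do it, not how any particular server does it; `directory_listing` is a hypothetical name, and the HTML layout is an assumption.

```python
import html
import os
from urllib.parse import quote


def directory_listing(directory, request_path):
    """Build a simple HTML page linking to every entry in directory."""
    items = []
    for name in sorted(os.listdir(directory)):
        # Subdirectories get a trailing slash so their links point at a directory URI.
        if os.path.isdir(os.path.join(directory, name)):
            name += "/"
        items.append(f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>')
    title = html.escape("Index of " + request_path)
    body = "\n".join(items)
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<ul>\n{body}\n</ul></body></html>"
    )
```

The links are relative, so they only resolve correctly when the listing is served for a request URI that ends in a slash.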
Here, the request is for “/somedirectory”, a directory inside of the top-level.
(Note that in practice, the decision of what to do should be made based on whether there’s really a directory and/or file with that name, not the presence or absence of an extension. I’ve included the extension above just to give a hint to the reader — you — that one request is for a directory while the other is for a file.)
Ideally, the server would like to find an index file of some sort to serve for this subdirectory, just as it did in the previous case. But we have a problem. If the server gives back some HTML in response to the request URI “/somedirectory”, the user’s web browser will think that’s a file! Remember, we don’t rely on extensions to decide what’s a file or a directory. Also, remember that the web server doesn’t ever tell the client the name of the file it’s actually sending; it just sends the content.
Now, maybe you think it doesn’t matter if the web browser thinks it’s asked for a file and is really getting a directory. But in fact it does, and the reason is relative URLs. Suppose you were writing the index.html file that belongs in somedirectory, and you wanted to refer to an image alongside the file. You’d probably drop that image inside of somedirectory as well. But if the web browser gets back this HTML and thinks that it requested a file called somedirectory instead, then it will not know to ask for “/somedirectory/image.png”. Instead, it will ask for “/image.png”, thinking the image is sitting in the directory alongside a file called “somedirectory”. This is not going to work.
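You can see this resolution rule in action with Python's standard `urljoin`, which follows the same relative-reference rules a browser does. The URLs are the example ones used throughout this post.

```python
from urllib.parse import urljoin

# The page was fetched without a trailing slash: "somedirectory" looks like a file.
without_slash = urljoin("http://example.com/somedirectory", "image.png")

# The page was fetched with a trailing slash: "somedirectory" is a directory.
with_slash = urljoin("http://example.com/somedirectory/", "image.png")

print(without_slash)  # http://example.com/image.png
print(with_slash)     # http://example.com/somedirectory/image.png
```

The only difference between the two base URLs is the final slash, and that alone changes where the image is looked for.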
For this reason, the web server does not immediately send back the index.html or similar file in somedirectory. Instead, it sends a redirect. This response (using response code 302 here, though many servers send 301) tells the browser to come back and ask again, but this time use “/somedirectory/” as its request URI. Note the slash at the end: that’s what tells the web browser that this is a directory, and not a file.
The server sends the redirect and is done; it’s now the browser’s job to repeat the request with the right URI.
In this case, the browser has sent a request for “/somedirectory/”. Perhaps it got here by following the redirect from the previous request, or perhaps the request URI was correct from the beginning. In either case the response is just like Case 2. The server will look for a file called something like “index.html”… but this time, it will look inside of somedirectory to find it. If it doesn’t find an appropriate index file, once again it may or may not try to generate one.
Now the request is for “/somefile.html”, a file. This is the easy one. The server will find the file with that exact name, and send its contents back in the response. (Note that I’m assuming somefile.html is a file. It’s entirely possible to create a directory with that name, in which case you’d follow the instructions for case 3 instead, and send the redirect.)
Finally, suppose a user requests “/somefile.html/”. That is, the request is for a file, but there is a slash at the end of the path.
You might think we should just send the file; it’s pretty obvious what the user wanted, even if they’ve said it oddly. But there’s a problem with that. Just like in case 3, the web browser looks at that trailing slash and decides that somefile.html is a directory name. Then if it refers to images, style sheets, etc., the browser will try to request something like “/somefile.html/image.png”. That’s not going to accomplish anything useful. So it’s incorrect to serve a file with this URL.
We could send a redirect again, like we did for directories, but that’s less common in this case. Servers tolerate directory URLs typed without a trailing slash because people do that a lot. On the other hand, hardly anyone habitually adds a trailing slash after their file names. So instead, we should just fail with a 404 (Not Found) response code.
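Putting all of the cases together, the decision logic might look roughly like the Python sketch below. Treat it as an illustration under stated assumptions rather than a real server: `resolve`, `INDEX_NAMES`, and `REDIRECT_STATUS` are names invented here, the function only decides what to do (it sends nothing), it answers 403 instead of generating a listing, and its handling of ".." is a bare-minimum precaution, not real path sanitization.

```python
import os

INDEX_NAMES = ["index.html", "index.htm"]  # assumed preference order
REDIRECT_STATUS = 302  # as in the text above; many servers use 301 instead


def resolve(root, request_path):
    """Decide how to answer request_path; returns (status, detail).

    detail is the file to send for 200, the new URI for a redirect,
    and None for errors.
    """
    has_slash = request_path.endswith("/")
    segments = [s for s in request_path.split("/") if s]
    if ".." in segments:
        return (404, None)  # crude guard against escaping the root
    fs_path = os.path.join(root, *segments)

    if os.path.isdir(fs_path):
        if not has_slash:
            # Directory requested without a slash: ask the browser to retry.
            return (REDIRECT_STATUS, request_path + "/")
        for name in INDEX_NAMES:
            candidate = os.path.join(fs_path, name)
            if os.path.isfile(candidate):
                return (200, candidate)
        return (403, None)  # no index file, and no listing in this sketch

    if os.path.isfile(fs_path):
        if has_slash:
            return (404, None)  # a file requested as if it were a directory
        return (200, fs_path)

    return (404, None)
```

Notice that the decision depends on what actually exists on disk plus the trailing slash, never on the file extension.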
Understanding the above conventions is important because when people design URLs for web applications, they should keep in mind how browsers expect pages to behave. Remember that the trailing slash decides how a web browser will resolve relative URLs. As such, it’s actually a significant part of the public interface of your application.
People get this wrong quite often. Snap, for example (which by the way is my current favorite foundation for building web applications, so don’t think I have anything against it), got this wrong in its file serving code, and the conversation I mentioned above was with regard to making it work. Even more significantly, a large number of web frameworks provide request routing, but ignore the question of whether content should be served as a file or a directory. For dynamic content, either one works… but you do need to keep in mind which one you choose and how your static resources are chosen on that basis. Though you can work around this using all absolute paths or the HTML “base” tag, this throws away a good bit of how relative URIs are supposed to work.
So it’s good to understand these conventions, and understand how web browsers are interacting with your application.