URL Encoded Strings
Author: David Zimmer
Site: http://sandsprite.com/Sleuth
-----------------------------------------------------------------------------------

This document is broken up into two distinct sections so as not to assume any knowledge.

  1) The ABC's of URL Encoding
  2) How URL encoding affects Web Applications


1) The ABC's of URL Encoding

We all recognize the term URL as a web address. URL formally stands for 'Uniform Resource Locator', which hints that there are rules deciding whether a URL is valid or not. RFC 1738 (http://www.faqs.org/rfcs/rfc1738.html) goes into quite some depth about exactly what is allowed and what is not.

As with all things programmed, data is expected to conform to some form of order so that it can be parsed, and a URL's format is no different. For a basic overview, we will start with the definition that a URL is one long continuous block of text. From there we can jump into an explanation of the overall structure of a URL. For example's sake we will enclose all mandatory data <like this>, and all optional data [like this].

  <protocol>://[username:password@]<server>[:port]/[path]

  http://fred:smith@www.microsoft.com/index.asp

This specifies the page index.asp from www.microsoft.com, authenticating with the login 'fred:smith', over the http protocol. Phew, got all that? What happened to the port? Well, http (HyperText Transfer Protocol) has a standard default port of 80, so it isn't necessary to specify it. In practice almost all of the URLs you will see will simply be in the protocol://server/path format. The reason I include the rest will become apparent in a moment.

One other quick detour I would like to make is a quick expansion of the [path] value above. The path is slightly more complex than shown above; it too can contain subsections. How often have you seen a link like

  http://blah.com/index.html#section1

In this URL you notice the '#' character. This denotes an internal link within the page.
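
The structure described so far can be sketched in a few lines of Python using the standard urllib.parse module. This is only an illustration: the choice of Python, the describe helper, and the DEFAULT_PORTS table are assumptions made for this example and are not part of the original text.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table for this sketch; the text only mentions http's port 80.
DEFAULT_PORTS = {"http": 80}


def describe(url):
    """Split a URL into the pieces named in the template above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username,   # None when no login is given
        "password": parts.password,   # None when no login is given
        "server": parts.hostname,
        # urlsplit reports None when no port is written, so fall back to the default
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "fragment": parts.fragment,   # the part after '#'
    }


login = describe("http://fred:smith@www.microsoft.com/index.asp")
anchor = describe("http://blah.com/index.html#section1")

print(login)
print(anchor)
```

For the first URL the helper reports the server www.microsoft.com, the login fred/smith, and the defaulted port 80. For the second it reports the fragment section1 and no login at all.
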
When you go to a URL containing a '#', the browser first loads the page, then skips down the document to the specified section.

Yet another incarnation of the [path] section looks like this:

  http://blah.com/search.asp?search=taffy&maxresults=100

What this means is that the specified page, search.asp, is actually a script to execute on the server, and you are starting that script with the parameters search=taffy and maxresults=100. These are the equivalent of the command line parameters you might use when running a DOS program.

OK, enough already, what does this have to do with URL encoding! Well, from the quick refresher course in URL syntax, hopefully you noticed a couple of key things. The URL is a string that has to be parsed programmatically, which means that it has to abide by certain rules. The key sections of a URL are divided up by special delimiter characters. If these special characters were to show up in unexpected places, the parsing would fail, and the browser would have no idea what to do and would just say nope, sorry, couldn't find it.

Ahhh, the light at the end of the tunnel. So for the sake of argument let's say our username is still 'fred' but our password is ':0', so our full URL from before would look like this:

  http://fred::0@www.microsoft.com/index.asp

You can try it, but I guarantee the browser won't even understand that you want to go to www.microsoft.com. Because of reserved characters such as this, the people who designed URLs realized that they needed an alternative way to represent some special characters so that their programs would not get confused and would be able to do their jobs.

When a string is said to be URL encoded, it means that these special reserved characters have been replaced with an alternative representation that will not interfere with the operation of things.

So what are these bad characters to be replaced? First off, as already explained, the characters " : , / , @ , ? , # , % " all denote markers in the URL string, so they cannot appear unencoded unless they are intended to denote a specific section of a URL as defined above.

The next set of characters not allowed unencoded are those that may have special meanings to other programs that you may interact with on the net, such as proxies or server scripts and applications. This set consists of the " { , } , | , \ , ^ , ~ , [ , ] , ` " characters.

The last set of characters we have to encode are those that are not printable ASCII. This includes the control character codes 00-1F and 7F, and the codes 80-FF, which fall outside the ASCII range. [1]

The bottom line I need you to come away with is that every character has a numeric value associated with it. It naturally follows that these character codes are the perfect way to encode these special characters in URLs.

For an example, say we need to include the space character in a URL. The hexadecimal character code for space is 20, so to represent this character in a URL we simply replace it with '%20'. So everywhere in a URL you see a %xx, that is where one of these special characters was and had to be replaced. In each case you can determine what character is actually there by looking up which character is assigned the hex value xx.

Now that we understand how it all works and why, we can move on to the more pertinent question: why should I care!


2) How URL encoding affects Web Applications

  a) data hiding/obfuscation
  b) server misdirection (using the @ to fool them)
  c) filter bypassing
  d) double escape vulnerabilities


[1] These values are represented in hexadecimal (Base 16) format. The hex system is an alternative way of representing numbers that uses the characters 0-9 and A-F. In this way numbers can be represented more compactly and also more intuitively from the computer's/programmer's point of view. Since it is Base 16, the number we would commonly think of as 10 is equal to 'A' in the hex system, and the number we recognize as 16 is equal to 10 in hex.
Likewise, 32 decimal = 20 hex, and so on. If you would like to learn more about the hexadecimal system, check out this great tutorial: http://www.thirdm.com/web/hex.htm
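
To tie the %xx rule from section 1 to the hex values in note [1], here is a small sketch in Python. It is an illustration under stated assumptions rather than anything from the original text: encode_char is a hypothetical helper written for this example, while quote, unquote, and urlsplit come from Python's standard urllib.parse module.

```python
from urllib.parse import quote, unquote, urlsplit


def encode_char(ch):
    """Return the %xx form of one character: '%' plus its code in two hex digits."""
    return "%{:02X}".format(ord(ch))


# The hex conversions from note [1].
print(format(10, "X"), format(16, "X"), format(32, "X"))  # A 10 20

# The space character has code 32 decimal, which is 20 hex.
space = encode_char(" ")

# The troublesome password ':0' from section 1. safe="" asks quote to
# encode every reserved character (by default it leaves '/' alone).
password = quote(":0", safe="")
url = "http://fred:" + password + "@www.microsoft.com/index.asp"

# Decoding reverses the substitution.
decoded = unquote("search%20taffy")

print(space)                            # %20
print(url)                              # http://fred:%3A0@www.microsoft.com/index.asp
print(urlsplit(url).hostname)           # www.microsoft.com
print(unquote(urlsplit(url).password))  # :0
print(decoded)                          # search taffy
```

With the colon in the password written as %3A, the only ':' and '@' left in the login portion are the real delimiters, so the parser can find the server name again and the original password is recovered by decoding.
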