This is a blog post summarising a few notes I’ve gathered around the internet, with the purpose of cementing them in my mind rather than adding anything new or attempting to broadcast them to a wider crowd. If you find it useful, that’s great, but it’s nothing original and it’s been pulled from several sources noted at the end of the post.
The post is organised into two categories: general attacks against URL parsers and implementations, and specific attacks against parsers in specific languages, with the intention of highlighting differences in how URL strings are interpreted.
Discrepancies between parsers and HTTP libraries
This is interesting. These three Python libraries all interpret the same URL differently:
url = "http://18.104.22.168 &@22.214.171.124# @126.96.36.199/"
urllib.urlopen(url) # 188.8.131.52
urllib2.urlopen(url) # 184.108.40.206
requests.get(url) # 220.127.116.11
urlparse(url) # ParseResult(scheme='http', netloc='18.104.22.168 &@22.214.171.124', path='', params='', query='', fragment=' @126.96.36.199/')
In PHP versions from a few years ago and earlier, there was an inconsistency between parse_url and readfile which could lead to a URL parser bypass using URLs with several colons, such as
http://127.0.0.1:11211:80/, or URLs with extraneous characters, such as
http://firstname.lastname@example.org/. In both examples parse_url and readfile interpreted the URL differently. In the second case, readfile interprets the host as evil.com, whereas parse_url interprets the host as google.com.
This may affect other programming languages.
Curl is in widespread usage, and there are curl bindings in every language under the sun. Discrepancies between language URL parsers and Curl could lead to SSRF. Consider the following URL, as interpreted by PHP:
<?php print_r(parse_url("http://email@example.com @google.com/"));
[scheme] => http
[host] => google.com
[user] => firstname.lastname@example.org
[path] => /
And as interpreted by cURL:
curl 'http://email@example.com @google.com/'
curl: (7) Failed to connect to 127.0.0.1 @google.com port 80: Connection refused
As you can see, cURL attempts to connect to 127.0.0.1, while PHP believes the host is google.com. This example uses PHP, but these discrepancies have been identified in other languages as well, and more are bound to exist. Example vulnerabilities have been found in WordPress, vBulletin and MyBB utilising this technique.
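One way to reduce exposure to this kind of disagreement is to refuse URLs that contain the features parsers tend to disagree on, rather than trying to guess which interpretation the HTTP client will pick. The following is a minimal sketch under that assumption; the function name looks_ambiguous and the exact set of rejected features are my own choices for illustration, not something taken from the sources.

```python
from urllib.parse import urlsplit


def looks_ambiguous(url: str) -> bool:
    """Return True if the URL has features that URL parsers are known to disagree on."""
    # Whitespace, control characters or backslashes anywhere in the raw URL.
    if any(ch.isspace() or ord(ch) < 0x20 or ch == "\\" for ch in url):
        return True
    try:
        parts = urlsplit(url)
    except ValueError:
        # If the standard parser refuses the URL, treat it as suspicious.
        return True
    # Userinfo ("user@host") is rarely needed and is at the root of the confusion above.
    if "@" in parts.netloc:
        return True
    # More than one colon in a non-bracketed netloc, e.g. host:port:port.
    if not parts.netloc.startswith("[") and parts.netloc.count(":") > 1:
        return True
    return False
```

With this sketch, both http://127.0.0.1:11211:80/ and any URL containing a space or an @ in the authority are rejected before they reach the HTTP client. It is deliberately strict and will also reject some legitimate URLs that carry credentials.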
Path traversal bypasses are possible in Node.js with the special Unicode character U+FF2E. This happens because Node’s internal Unicode handling treats this multibyte character as two separate bytes and then discards one of them, leaving \x2E, a dot.
Similar results can be observed by injecting U+FF0D and U+FF0A, which are truncated to \x0D and \x0A, allowing newline injection.
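The truncation itself is easy to reproduce outside of Node. The sketch below is not Node's code; it only mimics the effect described above by keeping the low byte of each code point, and the helper name truncate_to_low_byte is hypothetical.

```python
def truncate_to_low_byte(text: str) -> bytes:
    """Keep only the low 8 bits of every code point, mimicking a lossy single-byte encoding."""
    return bytes(ord(ch) & 0xFF for ch in text)


# U+FF2E loses its high byte (0xFF) and becomes 0x2E, an ASCII dot.
traversal = truncate_to_low_byte("\uff2e\uff2e/")

# U+FF0D and U+FF0A become 0x0D and 0x0A, a carriage return and a line feed.
newline = truncate_to_low_byte("\uff0d\uff0a")
```

Any filter that looks for ".." or "\r\n" before this truncation happens will not see them, which is why the check passes and the payload still works.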
On Linux, hostname resolution is generally done with gethostbyname. As per RFC 1035, it supports escaping of values with \DDD notation. This may allow for additional parser confusion.
# echo or\\097nge.tw
# nslookup or\\097nge.tw
Interestingly enough, gethostbyname will remove all backslashes that are not followed by a digit.
# nslookup or\\an\\g\\e.tw
gethostbyname will also pass input to getaddrinfo at times, which means that it will ignore invalid input as long as it is preceded by a valid address. Examples below:
>>> import socket
>>> socket.gethostbyname("127.0.0.1 foo")
The ability to append invalid trailing content means an attacker can perform HTTP header smuggling attacks if they are able to inject encoded newlines, such as
Additionally, an attacker can smuggle other protocols (such as SMTP) thanks to TLS’ SNI.
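Since the resolver is lenient about backslash escapes and trailing content, one defensive option is to validate the hostname strictly before it is ever handed to the resolver. The sketch below assumes plain letters-digits-hyphen labels are all you need to accept; the name is_strict_hostname is hypothetical.

```python
import re

# One DNS label: letters, digits and hyphens, 1 to 63 characters, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_strict_hostname(host: str) -> bool:
    """Accept only hostnames made of plain LDH labels, with no escapes, spaces or newlines."""
    if not host or len(host) > 253:
        return False
    if host.endswith("."):
        host = host[:-1]  # allow a single trailing dot (fully qualified form)
    # fullmatch avoids the "$ matches before a trailing newline" pitfall.
    return all(_LABEL.fullmatch(label) is not None for label in host.split("."))
```

Note that this also accepts dotted IPv4 literals such as 127.0.0.1, so it complements an address range check rather than replacing one.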
Internationalizing Domain Names in Applications (IDNA) is a standard, or rather a set of standards, that allows characters outside the ASCII set to be used in domain names. There are two differing standards, IDNA2003 and IDNA2008, which are difficult for client implementations to transition between. This led the Unicode Consortium to release UTS46.
Different HTTP libraries and URL parsing libraries implement different versions of this standard, and implement it in different ways. This can be useful for evading blacklists of disallowed hosts. An example of this is an inconsistency between PHP’s gethostbynamel function and cURL’s resolver: PHP’s gethostbynamel fails when provided with a domain containing a special character, which can lead to bypasses. cURL will then retrieve the URL and resolve it successfully.
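To make the blacklist problem concrete, here is a small sketch using Python's built-in "idna" codec, which implements the older IDNA2003 rules. The helper names and the blacklist contents are hypothetical. The point is only that a comparison on the raw string and a comparison on the normalised string can disagree.

```python
def to_ascii_idna2003(host: str) -> str:
    """Convert a hostname to its ASCII form using Python's built-in IDNA2003 codec."""
    return host.encode("idna").decode("ascii")


def is_blacklisted(host: str, blacklist: set) -> bool:
    """Compare the normalised hostname, not the raw string, against the blacklist."""
    return to_ascii_idna2003(host).lower() in blacklist


# Hypothetical blacklist entry, written with fullwidth letters by the attacker.
fullwidth_evil = "\uff45\uff56\uff49\uff4c.com"

raw_check = fullwidth_evil in {"evil.com"}  # naive comparison misses it
normalised_check = is_blacklisted(fullwidth_evil, {"evil.com"})  # normalised comparison catches it
```

Under IDNA2003 the German sharp s is mapped to "ss", so "faß.de" becomes "fass.de" here. An implementation following IDNA2008 keeps the character and produces a different, punycode-encoded name, which is exactly the kind of disagreement between libraries described above.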
Values synonymous to localhost
Besides the obvious examples, the following URLs will all attempt to retrieve localhost.
Several mechanisms exist for bypassing SSRF protections through DNS shenanigans. I will cover these at a high level below:
Host that resolves to a malicious IP
DNS records may point to an internal IP address (such as 127.0.0.1). This frequently works because developers check whether the IP matches an address range but accept arbitrary DNS names regardless of what they resolve to.
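The fix for this class of bug is to run the range check on what the name resolves to, not on the string the user supplied. A minimal sketch using the standard ipaddress and socket modules follows; the function names are my own, and the set of ranges treated as internal is an assumption you may want to tighten or loosen.

```python
import ipaddress
import socket


def is_internal_ip(ip: str) -> bool:
    """Return True for loopback, private, link-local, reserved or unspecified addresses."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_unspecified
    )


def host_resolves_internally(host: str) -> bool:
    """Resolve the host and check every returned address, not just the first one."""
    infos = socket.getaddrinfo(host, None)
    # sockaddr[0] is the address; strip any IPv6 scope id such as "%eth0".
    return any(is_internal_ip(info[4][0].split("%")[0]) for info in infos)
```

This check on its own is still subject to the time of check, time of use problem covered next, because the HTTP client may resolve the name again.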
Time of check, time of use vulnerabilities (TOCTOU)
A TOCTOU vulnerability can occur if the target application implements host whitelisting or host blacklisting. Imagine the following pseudo-code:
download_url = request.get('target')
target_host = urlparse.urlparse(download_url).netloc
target_ip = socket.gethostbyname(target_host) #resolve
if target_ip in blacklist:
A TOCTOU vulnerability allows for a bypass of the blacklist check if DNS resolution occurs twice: once for the check, and once more for the retrieval. An attacker-controlled DNS server could resolve the first time to a good address and the second time to a malicious IP.
An SSRF protection bypass may also occur if an attacker creates a malicious site that redirects to an internal IP, because the check is performed on the initial address and not on the address the HTTP client gets redirected to. This works against most HTTP clients, as they tend to follow HTTP redirects by default. Here’s an example:
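A common way to close the gap between check and use is to resolve the name exactly once, validate that answer, and then connect to the validated IP directly while sending the original name in the Host header. The sketch below shows only the resolve-and-check half. The function name and the injectable resolver parameter are hypothetical, added so the rebinding scenario can be simulated without a real DNS server.

```python
import ipaddress
from typing import Callable


def resolve_once_and_check(host: str, resolver: Callable[[str], str]) -> str:
    """Resolve host a single time and return the IP only if it is not internal."""
    ip = resolver(host)  # the only DNS lookup performed for this request
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback or addr.is_private or addr.is_link_local:
        raise ValueError(f"{host} resolves to internal address {ip}")
    # The caller should connect to this IP, not to the hostname.
    return ip
```

In real use you might pass socket.gethostbyname as the resolver and then open the connection with something like http.client.HTTPConnection(ip), supplying the original hostname in the Host header. HTTPS needs extra care with certificate and SNI handling, which this sketch does not cover.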
http://victim.org?url=http://127.0.0.1 -> FAIL, not allowed
http://victim.org?url=http://attacker.com/ -> 301 REDIRECT TO 127.0.0.1 -> 200 OK.
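The usual countermeasure is to turn off automatic redirect following in the HTTP client and validate every hop yourself. The sketch below keeps the HTTP client abstract: fetch_one and is_allowed are hypothetical callables you would supply, where fetch_one performs a single request without following redirects and returns the status, the Location header (if any) and the body.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

FetchResult = Tuple[int, Optional[str], bytes]


def fetch_following_redirects(
    url: str,
    fetch_one: Callable[[str], FetchResult],
    is_allowed: Callable[[str], bool],
    max_hops: int = 5,
) -> bytes:
    """Follow redirects manually, re-running the allow check on every hop."""
    for _ in range(max_hops + 1):
        if not is_allowed(url):
            raise ValueError(f"blocked: {url}")
        status, location, body = fetch_one(url)
        if status in (301, 302, 303, 307, 308) and location:
            # Resolve relative Location values against the current URL.
            url = urljoin(url, location)
            continue
        return body
    raise ValueError("too many redirects")
```

With this structure the redirect to 127.0.0.1 in the example above is checked just like the initial URL and is rejected on the second hop.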
The ability to inject \r\n in combination with either spaces or \t characters allows you to inject new headers into the request, which may allow other attacks. Imagine a request to `victim.com?url=yourserver.com/aa` results in the following request:
GET /aa HTTP/1.1
A request that looks like victim.com?url=yourserver.com/aa%20HTTP/1.1%0Ainjected-header: true%0Ax-aa: could result in the following:
GET /aa HTTP/1.1
This would allow you to attack a lot of plaintext protocols.
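To see why this works, it helps to look at what a naive client does with the decoded path. The sketch below is not taken from any particular library; naive_request_line and has_injected_control_chars are hypothetical helpers that show the injection and one simple way to reject it.

```python
from urllib.parse import unquote


def naive_request_line(path: str) -> str:
    """Build a request line the unsafe way, by pasting the path straight into the text."""
    return f"GET {path} HTTP/1.1\r\n"


def has_injected_control_chars(value: str) -> bool:
    """Return True if the value, raw or percent-decoded, contains whitespace or control characters."""
    for candidate in (value, unquote(value)):
        if any(ch == " " or ord(ch) < 0x20 or ord(ch) == 0x7F for ch in candidate):
            return True
    return False


payload = "/aa%20HTTP/1.1%0Ainjected-header: true%0Ax-aa:"

# The decoded payload terminates the request line early and adds a header of its own.
smuggled_lines = naive_request_line(unquote(payload)).splitlines()
```

Running the check before the request is built rejects the payload, while a plain path such as /aa passes. Rejecting is generally safer than trying to strip the offending characters.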