在http请求中,空格被encode成'+' or '%20的历史

这是一段让人纠结的历史

Posted by 夏敏的博客 on January 27, 2019

序言

在http请求中,有时候我们的请求参数会带一些特殊符号,因此需要对请求进行encode,以方便其传输。 而’ ‘即空格,有被encode成’+’,有些地方也encode成,’%20’。因此本篇博客主要探讨一下这个 加号与%20的历史

日常用法

打开百度,输入“hello world java”,回车,随后查看地址栏
发现 hello 与 world java中间的空格,被翻译成了 %20 地址为 https://www.baidu.com/s?wd=hello%20world%20java

打开Google,输入“hello world java”,回车,随后查看地址栏
发现 hello 与 world java中间的空格,被翻译成了 + 地址为 https://www.google.com/search?q=hello+world+java

好了, 问题来了。为啥会有这两种情况,并且都work

甚至我们这样请求都ok,https://www.google.com/search?q=hello+world%20java

+与%20 都会被decode成空格, 那么为啥呢?

历史

stackoverflow有一篇讨论贴
https://stackoverflow.com/questions/2678551/when-to-encode-space-to-plus-or-20

即,encode成‘+’ 的原因来自于 https://tools.ietf.org/html/rfc1866 的 8.2.2节 提到,在get请求的中,URL串应该被encode成 ‘application/x-www-form-urlencoded’ 格式的。在这个格式中,空格会被特殊替换成 +

而Java库中也有一个负责encode 请求参数的类, 来自 java.net 包的 URLEncoder,其中的encode方法有如下代码

if (c == ' ') {
    c = '+';
    needToChange = true;
}

这也就解释了,为什么使用 URLEncoder.encode("hello world", "utf-8"), 空格会被替换成 + 了。

那么encode成 %20呢?

很简单,很多的http协议说明里,并没有要求需要特殊处理空格’ ‘,那么就会遵循一般的编码要求, 即Non-alphanumeric characters are replaced by %HH,a percent sign and two hexadecimal digits representing the ASCII code of the character 将非字母字母字符,encode成百分号+其ASCII码的十六进制即可。

那么空格的 ASCII码为32,十六进制 0x20, 因此会被decode成%20

https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 中,(注:这是http4.1协议 是比较通用的协议)提到如下一段话,

application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:

Control names and values are escaped. Space characters are replaced by +', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH’, a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as “CR LF” pairs (i.e., %0D%0A'). The control names/values are listed in the order they appear in the document. The name is separated from the value by =’ and name/value pairs are separated from each other by `&’. multipart/form-data
Note. Please consult [RFC2388] for additional information about file uploads, including backwards compatibility issues, the relationship between “multipart/form-data” and other content types, performance issues, etc.

Please consult the appendix for information about security issues for forms.

The content type “application/x-www-form-urlencoded” is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type “multipart/form-data” should be used for submitting forms that contain files, non-ASCII data, and binary data.

The content “multipart/form-data” follows the rules of all multipart MIME data streams as outlined in [RFC2045]. The definition of “multipart/form-data” is available at the [IANA] registry.

A “multipart/form-data” message contains a series of parts, each representing a successful control. The parts are sent to the processing agent in the same order the corresponding controls appear in the document stream. Part boundaries should not occur in any of the data; how this is done lies outside the scope of this specification.

As with all multipart MIME types, each part has an optional “Content-Type” header that defaults to “text/plain”. User agents should supply the “Content-Type” header, accompanied by a “charset” parameter.

上面首先讲解了 application/x-www-form-urlencoded 协议,即 使用 + 替换 空格, 将非字母字母字符,encode成百分号+其ASCII码的十六进制即可。

随后讲解了,对于发送大量包含非ASCII字符的数据,application/x-www-form-urlencoded 效率低下, 我们应该使用 “multipart/form-data”来提交表单,

而目前看来,除了 application/x-www-form-urlencoded 协议, 其他协议并没有明确规定,使用 + 替换 空格

因此使用其他协议的程序按照正常编码即可,默认将空格编码成 %20

+与%20的解码

上面写到

甚至我们这样请求都ok,https://www.google.com/search?q=hello+world%20java +与%20 都会被decode成空格

那么我们来看为什么会这样。 这样混乱的编码不会引起问题么?

直接打开java.net.URLDecoder类,其实现decode方法,看其如何解码的

public static String decode(String s, String enc) {
   ...
   while (i < numChars) {
    c = s.charAt(i);
    switch (c) {
    case '+':      // 针对+ 进行特殊解码,解码为空格
        sb.append(' ');
        i++;
        needToChange = true;
    case '%':     //  针对%开头的,解码其后两位成ascii码。那么 %20 也会解码成 空格
        ...
        int v = Integer.parseInt(s.substring(i+1,i+3),16);
        ...
        bytes[pos++] = (byte) v;
     ....
}

看了上面的代码。很简单,针对+ 进行特殊解码,解码为空格, 针对%开头的,解码其后两位成ascii码。那么 %20 也会解码成 空格。 因此解释了 为什么+能与 %20混用。

为什么我的请求没有进行encode也能成功?

首先要说的是,http请求一定要进行encode,不然直接传输的是空格,会让服务器截断这个http url,认为其非法。报出400错误。

那么为什么大家使用http请求的时候不需要encode,且携带了空格也能成功呢?

在Android开发中,http使用的都是okhttp,因为Android从系统层的网络请求就使用的是okhttp。那么作为三方应用,可能使用的是自己依赖的okhttp库,也可能是用的系统的okhttp库。

我们来看下okhttp是如何请求的。(Retrofit等框架只是动态生成okhttp的请求)

我们请求最后使用的类一定是 HttpURLConnection,而其实现类为com.squareup.okhttp.internal.huc.HttpURLConnectionImpl

而http请求时最后我们一定会去进行connect操作。 则,走到了HttpURLConnectionImplconnect方法

@Override public final void connect() throws IOException {
  initHttpEngine();
  boolean success;
  do {
    success = execute(false);
  } while (!success);
}

其中会去initHttpEngine,然后调用newHttpEngine方法, 而其编码正是在这个里面getHttpUrlChecked

  private HttpEngine newHttpEngine(String method, StreamAllocation streamAllocation,
      RetryableSink requestBody, Response priorResponse)
      throws MalformedURLException, UnknownHostException {
    // OkHttp's Call API requires a placeholder body; the real body will be streamed separately.
    RequestBody placeholderBody = HttpMethod.requiresRequestBody(method)
        ? EMPTY_REQUEST_BODY
        : null;
    URL url = getURL();
    // 对url进行encode操作
    HttpUrl httpUrl = Internal.instance.getHttpUrlChecked(url.toString());
    Request.Builder builder = new Request.Builder()
        .url(httpUrl)
        .method(method, placeholderBody);
    Headers headers = requestHeaders.build();
    for (int i = 0, size = headers.size(); i < size; i++) {
      builder.addHeader(headers.name(i), headers.value(i));
    }

因此我们的请求到这边,会被进行编码。空格被替换成了%20

@Override public HttpUrl getHttpUrlChecked(String url)
    throws MalformedURLException, UnknownHostException {
  return HttpUrl.getChecked(url);
}

再看HttpUrl类

public final class HttpUrl {
  static final String USERNAME_ENCODE_SET = " \"':;<=>@[]^`{}|/\\?#";

  static HttpUrl getChecked(String url) throws MalformedURLException, UnknownHostException {
    Builder builder = new Builder();
    Builder.ParseResult result = builder.parse(null, url);  // 进行编码
    switch (result) {
      case SUCCESS:
        return builder.build();
      case INVALID_HOST:
        throw new UnknownHostException("Invalid host: " + url);
      case UNSUPPORTED_SCHEME:
      case MISSING_SCHEME:
      case INVALID_PORT:
      default:
        throw new MalformedURLException("Invalid URL: " + result + " for " + url);
    }
  }

  // 编码实现方法,最后会走到这个方法. input为 USERNAME_ENCODE_SET等
  static void canonicalize(Buffer out, String input, int pos, int limit, String encodeSet,
      boolean alreadyEncoded, boolean strict, boolean plusIsSpace, boolean asciiOnly) {
    Buffer utf8Buffer = null; // Lazily allocated.
    int codePoint;
    for (int i = pos; i < limit; i += Character.charCount(codePoint)) {
      codePoint = input.codePointAt(i);
      if (alreadyEncoded
          && (codePoint == '\t' || codePoint == '\n' || codePoint == '\f' || codePoint == '\r')) {
        // Skip this character.
      } else if (codePoint == '+' && plusIsSpace) {
        // Encode '+' as '%2B' since we permit ' ' to be encoded as either '+' or '%20'.
        out.writeUtf8(alreadyEncoded ? "+" : "%2B");
      } else if (codePoint < 0x20
          || codePoint == 0x7f
          || codePoint >= 0x80 && asciiOnly
          || encodeSet.indexOf(codePoint) != -1   // 空格符合这个条件,在set中位置为1, 因此会encode成%20
          || codePoint == '%' && (!alreadyEncoded || strict && !percentEncoded(input, i, limit))) {
        // Percent encode this character.
        if (utf8Buffer == null) {
          utf8Buffer = new Buffer();
        }
        utf8Buffer.writeUtf8CodePoint(codePoint);
        while (!utf8Buffer.exhausted()) {
          int b = utf8Buffer.readByte() & 0xff;
          out.writeByte('%');
          out.writeByte(HEX_DIGITS[(b >> 4) & 0xf]);
          out.writeByte(HEX_DIGITS[b & 0xf]);
        }
      } else {
        // This character doesn't need encoding. Just copy it over.
        out.writeUtf8CodePoint(codePoint);
      }
    }
  }

}

因此上面的代码可以解释为什么okhttp能够自己帮我们编码了。好了,皆大欢喜?

No!

如果应用不是集成自己的okhttp版本,而使用的是系统的版本,那么。Android6.0上的okhttp版本是2.3的。

在这个版本的代码上。

newHttpEngine的地方。并没有进行checkURL。会直接传过去。因此,我们需要自己进行encode

而okhttp接下来两个版本的更新对其都进行了改动,其更新日志也在纠结+的问题

okhttp  Version 2.4.0-RC1
2015-05-16

FormEncodingBuilder now uses %20 instead of + for encoded spaces. Both are permitted-by-spec, but %20requires fewer special cases.


okhttp  Version 2.6.0

2015-11-22

Fix: Don't re-encode + as %20 in encoded URL query strings. OkHttp prefers %20 when doing its own encoding, but will retain + when that is provided.

总结

至此,我们对+与%20的原因都了解了。其实没有必要纠结+与%20.如果纠结,那么就选%20吧。因为这个更加通用。而+在decode时候需要特殊处理。

作者:Anderson大码渣,欢迎关注我的简书: Anderson大码渣