Originally, CDX files were only used to index web archives containing GET requests. Because browser-based capture methods can record non-GET requests, such as those generated by JavaScript, CDX records need a way to differentiate captures based on request method and request body. This document describes the mechanism for encoding the request method and body in the CDX key by appending additional query parameters, as originally implemented by pywb.
Compatibility Note
This document aims to describe the behaviour of pywb 2.6.7 running on Python 3.7 or later. Older versions of pywb or Python can produce different output.
[TODO: To be written]
If the request method is not GET, it must be appended as the value of the query parameter __wb_method.
If the URL does not have a query string, a ? must be added:
http://example.org/ => http://example.org/?__wb_method=POST
If the URL already has a query string, the __wb_method parameter must be added at the end, after a & separator:
http://example.org/?page=1 => http://example.org/?page=1&__wb_method=POST
Even if the query string already ends in &, another separator must still be added:
http://example.org/?foo& => http://example.org/?foo&&__wb_method=POST
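The three cases above can be sketched in Python as follows. This is a minimal illustration rather than pywb's actual code: the function name append_method is hypothetical, and the sketch assumes that the presence of a ? in the URL means it already has a query string. Fragments are not handled.

```python
def append_method(url, method):
    """Append the __wb_method parameter for non-GET requests (illustrative sketch)."""
    if method == "GET":
        # GET requests are left unchanged.
        return url
    # Assumption: any '?' in the URL marks the start of a query string.
    separator = "&" if "?" in url else "?"
    return url + separator + "__wb_method=" + method


print(append_method("http://example.org/", "POST"))
print(append_method("http://example.org/?page=1", "POST"))
print(append_method("http://example.org/?foo&", "POST"))
```

The separator is always added, which is why a query string that already ends in & produces && in the result.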
Encoding the request body depends on the content-type.
Content-Type | Primary Encoding | Fallback Encoding |
---|---|---|
application/json | JSON | |
application/x-amf | AMF | |
application/x-www-form-urlencoded | urlencoded form | binary |
multipart/* | multipart form | binary |
text/plain | JSON | binary |
* | binary | |
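As a rough illustration, the table could be expressed as a lookup like the one below. The names ENCODINGS and select_encoding are hypothetical, and stripping parameters such as charset and lowercasing the media type before matching are assumptions of this sketch, not rules stated in this document.

```python
# (primary, fallback) pairs taken from the table; None means no fallback is listed.
ENCODINGS = {
    "application/json": ("JSON", None),
    "application/x-amf": ("AMF", None),
    "application/x-www-form-urlencoded": ("urlencoded form", "binary"),
    "text/plain": ("JSON", "binary"),
}


def select_encoding(content_type):
    """Return the (primary, fallback) encoding names for a Content-Type value."""
    # Assumption: parameters after ';' are ignored and matching is case-insensitive.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ENCODINGS:
        return ENCODINGS[media_type]
    if media_type.startswith("multipart/"):
        return ("multipart form", "binary")
    # The '*' row: anything else uses the binary encoding.
    return ("binary", None)


print(select_encoding("application/json; charset=utf-8"))
print(select_encoding("multipart/form-data; boundary=x"))
print(select_encoding("image/png"))
```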
[TODO: To be written]
The request body is encoded as Base64 (RFC 4648) and appended to the query string as the __wb_post_data parameter.
Example
Original request:
POST /chat HTTP/1.0
Host: example.org
Content-Length: 5

hello
Encoded URL:
http://example.org/chat?__wb_method=POST&__wb_post_data=aGVsbG8=
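The binary encoding in this example can be reproduced with Python's standard base64 module. The function name encode_binary_body is hypothetical, and the sketch assumes the Base64 output is appended as is, without further percent encoding, as the example URL suggests.

```python
import base64


def encode_binary_body(body):
    """Return the __wb_post_data parameter for a raw request body (sketch)."""
    # Standard Base64 alphabet with '=' padding, per RFC 4648.
    encoded = base64.b64encode(body).decode("ascii")
    return "__wb_post_data=" + encoded


url = "http://example.org/chat?__wb_method=POST"
print(url + "&" + encode_binary_body(b"hello"))
```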
Decode the body to a string using UTF-8, percent decode the string, percent plus encode it, and then append the result to the output. If a UTF-8 decoding error occurs, the binary encoding method must be used instead.
[TODO: example]
The body must be decoded as form data per RFC 2388 and then percent plus encoded. If the body is not a valid multipart/form-data message then the binary encoding method must be used instead.
[TODO: example]
The request body must be parsed as JSON (RFC 8259). The following algorithm must then be applied to the parsed value, with an empty string as the initial value of name.
To encode a JSON value, given a name and an initially-empty map nameCounts of strings to integers:
Example
Original request:
POST /events HTTP/1.0
Host: example.org
Content-Type: application/json

{
  "type": "event",
  "id": 44.0,
  "values": [true, false, null],
  "source": {
    "type": "component",
    "id": "a+b&c= d",
    "values": [3, 4]
  }
}
Encoded URL (wrapped for readability):
http://example.org/events?__wb_method=POST&type=event&id=44.0&values=True
&values.2_=False&values.3_=None&type.2_=component&id.2_=a%2Bb%26c%3D+d
&values.4_=3&values.5_=4
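The following Python sketch reproduces the encoded query parameters of this example. It is inferred from the example output, not taken from the normative algorithm steps: the names encode_json_body and visit are hypothetical. It assumes that object members recurse with the member key as the name, that array items reuse the current name, that scalar values are stringified with Python's str(), and that repeated names receive a .N_ suffix from the second occurrence onward.

```python
import json
from urllib.parse import urlencode


def encode_json_body(body):
    """Flatten a JSON body into percent plus encoded query parameters (sketch)."""
    pairs = []
    name_counts = {}  # how many scalar values have been emitted under each name

    def visit(value, name):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, key)
        elif isinstance(value, list):
            for child in value:
                visit(child, name)
        elif name:
            # Scalars with an empty name (e.g. a top-level scalar) are skipped here.
            count = name_counts.get(name, 0) + 1
            name_counts[name] = count
            key = name if count == 1 else "%s.%d_" % (name, count)
            pairs.append((key, str(value)))

    visit(json.loads(body), "")
    return urlencode(pairs)


body = ('{"type": "event", "id": 44.0, "values": [true, false, null], '
        '"source": {"type": "component", "id": "a+b&c= d", "values": [3, 4]}}')
print(encode_json_body(body))
```

Using str() on the parsed values is what yields the Python-style spellings True, False and None seen in the encoded URL.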
To percent plus encode a string, first encode it as UTF-8 and then percent plus encode the resulting byte sequence.
To percent plus encode a byte sequence, for each byte in the input sequence:
If the byte falls within the following ASCII character ranges, append it to the output as is:
'0'-'9', 'a'-'z', 'A'-'Z', '-', '.', '_', '~'
If the byte is the ASCII space character (' '), append the ASCII plus character ('+') to the output.
Otherwise, append the ASCII percent character ('%') to the output, followed by the value of the byte formatted as two uppercase hexadecimal digits.
Compatibility Note
Prior to Python 3.7 the character “~” was percent encoded.
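The byte-wise percent plus encoding rules can be sketched directly in Python. The names UNRESERVED, percent_plus_encode_bytes and percent_plus_encode are hypothetical and chosen for illustration only.

```python
# Bytes that are copied to the output unchanged.
UNRESERVED = frozenset(
    b"0123456789"
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"-._~"
)


def percent_plus_encode_bytes(data):
    """Percent plus encode a byte sequence."""
    out = []
    for byte in data:
        if byte in UNRESERVED:
            out.append(chr(byte))
        elif byte == 0x20:
            # ASCII space becomes '+'.
            out.append("+")
        else:
            # Everything else becomes '%' plus two uppercase hex digits.
            out.append("%%%02X" % byte)
    return "".join(out)


def percent_plus_encode(text):
    """Percent plus encode a string by first encoding it as UTF-8."""
    return percent_plus_encode_bytes(text.encode("utf-8"))


print(percent_plus_encode("a+b&c= d"))
```

On Python 3.7 or later, urllib.parse.quote_plus with its default arguments should give the same result for string input.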