Introduction to web archive formats


Archiving a URL


What’s in the WARC?


WARC-Type: warcinfo
WARC-Type: request
WARC-Type: response
WARC-Type: metadata
WARC-Type: resource
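
As a rough illustration of how these record types sit one after another in a file, here is a minimal sketch that walks an uncompressed WARC and reports each record's WARC-Type. It assumes every record starts with a `WARC/` version line, carries a Content-Length header, and is followed by blank separator lines. The `make_record` helper and the sample bytes are invented for the demonstration, and header continuation lines are not handled.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line.strip() == b"":
            continue  # blank lines separating one record from the next
        if not line.strip().startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length tells us how many bytes of payload follow the headers.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def make_record(warc_type, body):
    """Hypothetical helper: build a tiny record for this demonstration only."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(body))
    return head.encode("utf-8") + body + b"\r\n\r\n"


sample = (make_record("request", b"GET / HTTP/1.1\r\n\r\n")
          + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nHello World\n"))
types = [headers["WARC-Type"] for headers, _ in iter_warc_records(io.BytesIO(sample))]
print(types)  # ['request', 'response']
```

Reading the payload by Content-Length, rather than scanning for a blank line, matters because the HTTP message inside a record contains blank lines of its own.
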
Finding individual records

$ jwattools cdx hello-world.warc


$ cdx-indexer hello-world.warc > hello-world.warc.cdx

which means we can pick out the response record like this:

$ tail -c +1261 hello-world.warc | head -c 1085
WARC-Type: response
WARC-Record-ID: <urn:uuid:3C74F309-6B37-461C-B982-1B5C447C3C0E>
WARC-Warcinfo-ID: <urn:uuid:B8FDDD7C-DBB0-4EC4-BC7E-AA0B21749707>
WARC-Concurrent-To: <urn:uuid:8DCD2661-1B5A-445C-B4F4-2ACEB69A900B>
WARC-Date: 2015-07-08T21:55:13Z
Content-Type: application/http;msgtype=response
Content-Length: 494

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Last-Modified: Wed, 08 Jul 2015 21:53:08 GMT
Access-Control-Allow-Origin: *
Expires: Wed, 08 Jul 2015 22:05:13 GMT
Cache-Control: max-age=600
Content-Length: 13
Accept-Ranges: bytes
Date: Wed, 08 Jul 2015 21:55:13 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-lcy1127-LCY
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1436392513.648949,VS0,VE165
Vary: Accept-Encoding

Hello World
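
The same byte-range extraction can be sketched in Python. Note that `tail -c +1261` counts bytes from one, so it corresponds to a zero-based offset of 1260. The `read_record` function and the stand-in file below are illustrative assumptions, not part of any WARC tool.

```python
import os
import tempfile


def read_record(path, offset, length):
    """Return `length` bytes starting at zero-based byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in file for the demo: 1260 bytes of filler, then one small record.
record = (b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 12\r\n\r\n"
          b"Hello World\n\r\n\r\n")
with tempfile.NamedTemporaryFile(suffix=".warc", delete=False) as tmp:
    tmp.write(b"x" * 1260 + record + b"bytes belonging to the next record")
    demo_path = tmp.name

snippet = read_record(demo_path, 1260, len(record))
os.remove(demo_path)
print(snippet.decode("utf-8").splitlines()[1])  # WARC-Type: response
```

With the real file, the call would be `read_record("hello-world.warc", 1260, 1085)`.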

How playback works

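Playback starts with a search of the sorted CDX index, which returns the filename and byte offset of the matching capture. The playback tool then grabs that WARC record, modifies it as needed, and plays it back.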
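
The lookup step can be sketched as a binary search over a sorted index. This is a simplified illustration, not how any particular playback tool is implemented. The row layout (URL key, 14-digit timestamp, filename, offset, length), the `example.org` keys and every row except the `hello-world.warc` one are assumptions made up for the example. Real CDX files use their own field order and URL canonicalisation.

```python
import bisect
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit timestamp such as 20150708215513."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def find_captures(index, url_key):
    """Binary-search a sorted index for all rows with the given URL key."""
    lo = bisect.bisect_left(index, (url_key,))
    hi = bisect.bisect_right(index, (url_key, "\uffff"))
    return index[lo:hi]


def closest_capture(index, url_key, wanted_ts):
    """Return the capture of url_key nearest in time to wanted_ts, or None."""
    captures = find_captures(index, url_key)
    if not captures:
        return None
    wanted = parse_ts(wanted_ts)
    return min(captures, key=lambda row: abs(parse_ts(row[1]) - wanted))


# Hypothetical index rows; sorting is what makes the binary search valid.
index = sorted([
    ("example.org/zebra", "20150301000000", "other.warc", 1490, 300),
    ("example.org/hello", "20150801000000", "other.warc", 400, 1090),
    ("example.org/hello", "20150708215513", "hello-world.warc", 1260, 1085),
    ("example.org/about", "20150101000000", "other.warc", 0, 400),
])

hit = closest_capture(index, "example.org/hello", "20150709000000")
print(hit[2:])  # ('hello-world.warc', 1260, 1085)
```

The filename, offset and length returned here are exactly what is needed to seek into the WARC and read out the record.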

Comparison with ARC files

Further Reading

The official WARC specification is maintained by ISO. Draft versions of the specification are also hosted and mirrored online.

For introductory information about the WARC format, see:

Appendix: Tools used

This section outlines the tools and commands that were used to generate the example files.

Making the WARC

To create a WARC, we used wget:

$ wget --warc-file hello-world

…which created the compressed hello-world.warc.gz file. These special block-compressed files are often used directly, but in this primer we uncompress the file so we can see what’s going on:

$ gunzip hello-world.warc.gz

…leaving us with hello-world.warc.
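
To illustrate why block-compressed files can be used directly, the sketch below assumes that each record is stored as its own gzip member and that the members are simply concatenated. Under that assumption, a reader can seek to the offset of one member and decompress only that record. The two tiny records and the `read_member` helper are invented for the example.

```python
import gzip
import zlib

# Two stand-in records, each compressed as a separate gzip member.
records = [
    b"WARC/1.0\r\nWARC-Type: request\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
]
members = [gzip.compress(r) for r in records]
archive = b"".join(members)

# Decompressing the whole file (what gunzip does) yields every record in order.
whole = gzip.decompress(archive)


def read_member(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    decompressor = zlib.decompressobj(wbits=31)  # 31 selects the gzip wrapper
    # Decompression stops at the end of this member; any later bytes are
    # left untouched in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


second = read_member(archive, len(members[0]))
print(second.split(b"\r\n")[1].decode("utf-8"))  # WARC-Type: response
```

In this arrangement the offsets stored in an index would point at the compressed members rather than at positions in the uncompressed data.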

Making the CDX

To generate a content index (CDX) file, we have at least two options. There’s JWATTools:

…(which created cdx.unsorted.out), or the cdx-indexer from OpenWayback:

…(which created hello-world.warc.cdx).

Extracting a WARC record

Once we’ve identified the offset and length of a particular record (in this case, an offset of 1260 bytes and a length of 1085 bytes), we can snip out an individual record like this:

Making a Memento

To create an archived version of the page that could be played back properly, we used the Internet Archive’s “Save” feature by going to this URL in a web browser:

…which created this snapshot:

From here, we can use wget to look at what gets played back:

$ wget --server-response


  HTTP/1.0 200 OK
  Server: Tengine/2.1.0
  Date: Thu, 09 Jul 2015 10:41:38 GMT
  Content-Type: text/plain;charset=utf-8
  Content-Length: 13
  Set-Cookie: wayback_server=19;; Path=/; Expires=Sat, 08-Aug-15 10:41:38 GMT;
  Memento-Datetime: Thu, 09 Jul 2015 10:40:19 GMT
  Link: <>; rel="original", <>; rel="timemap"; type="application/link-format", <>; rel="timegate", <>; rel="first last memento"; datetime="Thu, 09 Jul 2015 10:40:19 GMT"
  X-Archive-Orig-x-cache-hits: 0
  X-Archive-Orig-x-served-by: cache-sjc3122-SJC
  X-Archive-Orig-cache-control: max-age=600
  X-Archive-Orig-content-type: text/plain; charset=utf-8
  X-Archive-Orig-age: 0
  X-Archive-Orig-x-timer: S1436438419.302921,VS0,VE141
  X-Archive-Orig-access-control-allow-origin: *
  X-Archive-Orig-last-modified: Wed, 08 Jul 2015 22:33:03 GMT
  X-Archive-Orig-expires: Thu, 09 Jul 2015 10:50:19 GMT
  X-Archive-Orig-accept-ranges: bytes
  X-Archive-Orig-vary: Accept-Encoding
  X-Archive-Orig-connection: close
  X-Archive-Orig-date: Thu, 09 Jul 2015 10:40:19 GMT
  X-Archive-Orig-via: 1.1 varnish
  X-Archive-Orig-content-length: 13
  X-Archive-Orig-x-cache: MISS
  X-Archive-Wayback-Perf: {"IndexLoad":359,"IndexQueryTotal":359,"RobotsFetchTotal":1,"RobotsRedis":1,"RobotsTotal":1,"Total":371,"WArcResource":10}
  X-Archive-Playback: 1
  X-Page-Cache: MISS