Release notes

Full listing of changes and bug fixes are not available prior to release 1.2.0 and between release 1.6.0 and OpeWayback 2.0.0 BETA 1 release.

OpenWayback 2.1.0 Release

Features

  • Synchronised with latest changes from the Internet Archive fork. #195
  • URL-decode timestamp segment of replay URL. #195, internetarchive#23
  • Revisits can be resolved with excluded capture #195, internetarchive#65
  • Added rudimentary mime-type sniffing (work in progress). #195, internetarchive#46
  • Timestamp-collapsing can be configured to return the last best capture in eacho collapse group. #195, internetarchive#64
  • UIResults now has makePlainCaptureQueryUrl() method for generating clean, short URL for capture query links. #195, internetarchive#60
  • MultipleRegexReplaceStringTransformer may also be used as RewriteRule. #195, internetarchive#54
  • Allow for using different collapseTime for replay and capture search #195, internetarchive#49
  • Make collection-dependent exclusion configurable. #195, internetarchive#48
  • Removed CustomUserResourceIndex class, which does not appear to have broad utility. #195
  • Performance information in response header can now be in JSON format. #195, internetarchive#69
  • FastArchivalUrlReplayParseEventHandler no longer rewrite relative URLs for better replay quality. #195
  • Made start date configurable (defaults to old value of 1996), end date dynamic to current year.#51

Bug Fixes

  • Fixed issue #196 to allow running under Tomcat 8. #198
  • Fixed incorrect Content-Type in replay of resource record with JWATFlexResource #195, internetarchive#68
  • Fixed ClassCastException when JWATFlexResourceStore is in use #195, internetarchive#67
  • Pass-through Content-Range header field for audio playback to work #195, internetarchive#66
  • Fixed undesirable rewrite of in-page (fragment-only) links. #195, internetarchive#63
  • Fixed XHTML parse error due to banner insert before XML declaration. #195, internetarchive#61
  • Fixed PrivTokenAuthChecker resetting ignoreRobots. #195, internetarchive#51
  • Made CharsetDetector adher to WHAT-NG recommendation. #195, internetarchive#47
  • Fixed building with JDK 8. #141
  • NullPointerException for RemoteResourceIndex #193
  • Removed direct references to Unix specific TMP paths /tmp and /var/tmp. #172
  • Initial thread-safety fix for Memento from Luda. #180
  • Fixed xml-markup in Toolbar.jsp which caused probelsm on some sites. #171, #60
  • Fixed some @import url's in <style> section of html are not rewritten. #131
  • Fixed issue #48 jQuery getting stomped on.
  • Support for loading resources from S3 buckets. #189
  • Refactored CDX Server into a war and jar module. #164

OpenWayback 2.0.0 Release

Features

  • Fixed URL resolution in ServerRelativeArchivalRedirect in non-ROOT context. #92
  • Deprecated use of bean name in spring to provide configuration. #94
  • Updated and improved documentation. #125 #121 #133
  • Reviewed and updated mailing lists. #126 #127
  • Added Java cross-reference and updated site generation with dependencies. #128 #30
  • Fixed Javadoc output in Java8. #136
  • Updated 'developers' and 'contributors' lists in POM. #137
  • Cleaned up the Memento configuration. #150
  • Added new logos to project. #100
  • Cleaned up default config file. #144
  • Updated and improved documentation.
  • Updated dependency on Webarchive-commons 1.1.4. #157
  • Added 'accessPointPath' to default proxy config. #158

Bug Fixes

  • Fixed the date locale issue. Creations of java.text.SimpleDateFormat now independent of local setting. #157 #148 #154
  • Fixed support for uncompressed ARCs files #101

OpenWayback 2.0.0 BETA 2 release

Features

  • Added PrefixFieldCollapser and RegexFieldMatcher to CDX server. #7
  • Added support for WARC metadata records. #23
  • Added support for WARC resource records. #24
  • Removed Internet Archive defaults and branding. #45
  • Integrated JWAT ResorceStore. #54
  • Provided an OpenWayback Sample Overlay.
  • Carried out and documented manual testing. #80
  • Updated and improved documentation (as on the wiki.
  • Renamed artefacts and repositories “webarchive-commons” and updated POMs. #90

Bug Fixes

  • Query string being stripped from Memento queries. #106
  • Support for uncompressed ARC files. #101

OpenWayback 2.0.0 BETA 1 Release

Features

  • Added livewebPrefix to wayback.xml. #3
  • Removed dependencies on Internet Archive’s Maven artefacts, enabling Tarvis CI builds and clean releases. #10
  • Moved critical code for OpenWayback from the heritrix-commons codebase into webarchive-commons. #4

Bug Fixes

  • Dependency on heritrix-commons SNAPSHOT release. #11

The following releases of the Open Source Wayback Machine (OSWM) were made by the Internet Archive. The on-going development of Wayback was handed over to the International Internet Preservation Consortium (IIPC) in October 2013. For more details please see General overview.

release 1.8.0

Features

  • Introduced the wayback-cdx-server.

No further release notes available.

Release 1.7.0

Release notes not available.

Release 1.6.0

Major Features

  • Memento integration.
  • Improved live-web fetching, enabling simpler external caching of robots.txt documents, or other arbitrary content used to improve function of a replay session.
  • Customizable logging, via a logging.properties configuration file.
  • Vastly improved Server-side HTML rewriting capabilities, including customizable rewriting of specific tags and attributes, rewriting of (some easily recognizable) URLs within JavaScript and CSS.
  • Snazzy embedded toolbar with "sparkline" indicating the distribution of captures for a given HTML page, control elements enabling navigation between various versions of the current page, and a search box to navigate to other URLs directly from a replay session.
  • Improved hadoop CDX generation capabilities for large scale indexes.
  • SWF (Flash) rewriting, to contextualize absolute URLs embedded within flash content.
  • ArchivalUrl mode now accepts identity ("id_") flag to indicate transparent replaying of original content.
  • NotInArchive can now optionally trigger an attempt to fill in content from the live web, on the fly.
  • Updated license to Apache 2.

Major Bug Fixes

  • More robust handling of chunk encoded resources.
  • Fixed problem with improperly resolving path-relative URLs found in HTML, CSS, Javascript, SWF content.
  • Fixed problem with improperly escaping URLs within HTML when rewriting them.
  • Fixed problem where a misconfigured or missing administrative exclusion file was allowing results to be returned, instead of returning and appropriate error.
  • No longer extracts resources from the ResourceStore before redirecting to the closest version, which was a major inefficiency.

Minor Features

  • Now provide closeMatches list of search results which were not applicable given the users request, but that may be useful for followup requests.
  • Archival Url mode now allows rotating through several character encoding detection schemes.
  • Proxy Replay mode now accepts ArchivalURL format requests, allowing dates to be explicitly requested via proxy mode.
  • AccessPoints can be now configured to optional require strict host matching for queries and replay requests.
  • Now filters URLs which contain user-info (USER:PASSWORD@example.com) from the ResourceIndex
  • ArchivalURL mode requests without a datespec are now interpreted as a request for the most recent capture of the URL.
  • Improvements in mapping incoming requests to AccessPoints, to allow virtual hosts to target specific AccessPoints.
  • ResourceNotAvailable exceptions now include other close search results, allowing the UI to offer other versions which may be available.
  • ArchivalURL mode now forwards request flags (cs_, js_, im_, etc) when redirecting to a closer date.
  • ResourceStore implementation now allows retrying when confronted with possibly-transient HTTP 502 errors.

Minor Bug Fixes

  • cdx-indexer (replacement for arc-indexer and warc-indexer) tool now returns accurate error code on failure.
  • No longer sets JVM-wide default timezone to GMT - now it is set appropriately on Calendars when needed.
  • Hostname comparison is now case-insensitive.
  • Server-relative archival url redirects now include query arguments when redirecting.
  • Server-relative archival url redirects now include a Vary HTTP header, to fix problems when a cache is used between clients and the Wayback service.
  • Fixed problem with robots.txt caching within a single request, which caused serious inefficiency.
  • Fixed problem with resources redirecting to alternate HTTP/HTTPS version of themselves.
  • Fixed problem with accurately converting 14-digit Timestamps into Date objects for later comparison.
  • Automatically remaps the oft-misused charset "iso-8859-1" to the superset "cp1252".

Release 1.4.2

Features

  • Added exactSchemeOnly configuration to AccessPoint, allowing explicit distinction between http:// and https://(ACC-32)
  • Now times out requests to a slow/non-responsive RemoteResourceIndex and remote(HTTP 1.1) ResourceStore nodes.(ACC-38)
  • experimental OpenSearchQuery .jsp implementations(ACC-56)
  • FileProxyServlet now accepts /OFFSET trailing path in addition to Content-Range HTTP header.(ACC-74)
  • warc-indexer now has -all option to produce a CDX line for ALL records, not just captures and revisits(ACC-75)
  • now includes file+offset for all records, keying off mime-time of warc/revist to determine revisits at query time.(ACC-76)
  • Allow prefixing of original HTTP headers with a fixed string. (ACC-77)
  • Now Wayback rewrites Content-Base HTTP headers.(ACC-78)
  • Timeline.jsp improvements which prevent Timeline from being severely distorted on some pages.
  • Improvement to ArchivalUrl client-rewrite.js to preserve link text, working around a bug in Internet Explorer.

Bug Fixes

  • Now all mime-types are escaped to prevent spaces from getting into the CDX files.(ACC-45)
  • Some CSS URLs were being rewritten twice. (ACC-53)
  • No longer writing original pages Content-Length HTTP header to output, which caused original pages with Lower-Case "L" in "Content-length" to return wrong length, truncating replayed documents. This caused some replayed pages to not have embedded disclaimers, nor javascript rewriting of links and images. (ACC-60)
  • Fixed severe problem with live web robots.txt retrieval where wrong offset was being writting into the live web ResourceIndex. (ACC-62)
  • Charset extraction from HTTP headers is now case-insensitive. (ACC-63)
  • No longer adding content to HTML pages with FrameSet tags, as they were being broken.(ACC-65)
  • No longer set GMT as default timezone for entire JVM.(ACC-70)

Release 1.4.1

Features

  • Index filter which allows including/excluding records based on HTTP response code field.(ACC-43)
  • Outputs log message instead of stack dump when failing to access a Resource.

Bug Fixes

  • Some redirect records were not being located in index due to bad logic in Duplicate record filter.(ACC-30)
  • Wayback was not throwing a NotInArchiveException when Self-Redirect replay filter removes all records. (unreported)
  • Location HTTP header values were not being escaped before placing in CDX, causing some records to have too many columns. (ACC-31)
  • Search Result summary counts were incorrect in Url Prefix searches.(ACC-33)
  • Implemented NoCache.jsp, a replay insert which adds a Cache-Control: no-cache HTTP header to all replayed documents.(ACC-34)
  • Timeline.jsp was using Request Date, not Capture date, which caused Proxy Mode Timeline to show the wrong date. (ACC-36)
  • Advanced Search reference implementation .jsp was broken. (ACC-37)
  • AnchorDate and AnchorWindow functionality is now disabled by default, and can be enabled via configuration on an AccessPoint. (ACC-46)

Release 1.4.0

Features

  • @ Completely new implementation of ResourceStore classes, including recursive local directory scanning, scanning multiple local directories, an experimental remote directory scanning capability, and groundwork for future support of both non ARC/WARC file formats and large scale automatic indexing.
  • @ Complete overhaul of the Replay system, allowing jspInserts within ArchivalUrl, DomainPrefix, and Proxy replay modes. Also includes groundwork for future fine-grained mime-type and url-based Replay customizations.
  • Added capability to explicitly set Locale to use for an AccessPoint, overriding the default behavior of using the user agents specified preferred language.
  • New flat file implementation of FileLocationDB. See CDXCollection.xml within the .war file for and example usage.
  • AnchorDate feature, tracking the date with which a user begins a replay session. During this session, wayback will always attempt to remain near this date, preventing time-drift within a replay session.
  • AnchorWindow feature, which allows users to specify a maximum time window in either direction of the AnchorDate that they wish to view replayed content. When a user has set this option, Wayback will not display captures outside the specified window.
  • New command line tool location-db to create a location DB offline, populating with lines read from STDIN.
  • Added new AccessControlSettingOperation authentication control component, allowing the configuration of the appropriate Exclusion system per-request, as defined by arbitrary BooleanOperators. See ComplexAccessPoint.xml within the .war file for an example usage.
  • Added .asx archival URL replay, which rewrites links inside archived .asx files, attempting to make them point back into the Wayback service.
  • Now accept "http:/" as identical to "http://" in the beginning of a URL, working around a browser bug which stripped multiple "/"s in URL paths.
  • @ Refactoring of ResourceIndex interfaces, to allow for future update-able ResourceIndex implementations beyond BDBIndex based ResourceIndexes.
  • * Major internal refactoring of WaybackRequest object, providing more stable get/set methods for accessing the standard internal fields with type-safety.
  • * Major internal refactoring of SearchResults into CaptureSearchResults and UrlSearchResults, which was previously under-specified and often confusing. These new classes provide more stable get/set methods for accessing the standard internal fields with type-safety.
  • * Changed locations of replay, query, and exception .jsp files within .war file to underneath WEB-INF, so they are not directly accessible via HTTP.
  • German translation of default Wayback UI. Thanks Andreas!
  • Czech translation of default Wayback UI. Thanks Lukáš Matějka! (<< ACC-29)
  • All threads now notified of shut downs, allowing resources to be released cleanly.
  • *Refactor of all Request and Result related constants from WaybackConstants to WaybackRequest and the *SearchResult(s) classes.
  • * Refactor of the various UI*Results classes, which are used by Query, Replay, and Exception .jsp files to access context information into the single class, UIResults, which has a more stable interface.
  • New AccessPoint.urlRoot optional configuration, enabling explicit control over URLs generated for the UI.

Bug Fixes

  • (ACC-24) Fixed bug in Proxy mode which prevented the correct number of results from being returned from the index during Replay.
  • (ACC-21) fixed bug where some CSS import declarations where not being correctly rewritten.
  • (ACC-26) fixed rare String OOB exception when marking up pages with some forms of Javascript generated HTML.
  • (ACC-28) verifies that detected encoding is supported in local JVM before attempting to decode a resource into a String.
  • (unreported) fixed declared page encoding of help, advanced search and index page to UTF-8.
  • Explicitly set character encoding on returned documents, instead of relying on Tomcat to return the correct encoding.

Migration notes to 1.4.0 from 1.2.X

Wayback 1.4.0 includes substantial code changes aimed at extending current capabilities, enabling planned future features, and stabilizing interfaces used in .jsp customizations. Since these changes would already require a significant update of existing customizations made to .jsp files, many non-vital cleanups to the source tree were included. The goal of implementing all of these features within this single release is to minimize future required updates.

Below is a somewhat inclusive list of changes that will be required when upgrading to Wayback 1.4.0 from 1.2.X, divided into two main categories: changes required to Spring configuration, and changes required for .jsp customizations. Depending on the scope of the existing customizations in your installations, it may be simpler to modify your existing customizations to conform to new interfaces and packages, and in other cases, it may be simpler to begin with the new reference implementations and modify them to meet your needs.

If there are changes not addressed here, or if you have questions regarding specific issues when upgrading, please direct these questions to the archive-access-discuss forum.

Spring upgrade information

New features with the @ mark indicate features that will directly impact Spring XML configuration files used with 1.2.X.

  • org.archive.wayback.resourcestore.http.FileLocationDB now: org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB
  • org.archive.wayback.resourcestore.http.FileLocationDBServlet now: org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet
  • org.archive.wayback.resourcestore.http.ArcProxyServlet now: org.archive.wayback.resourcestore.locationdb.FileProxyServlet
  • All ReplayUI implementations changed completely, now located in: ArchivalUrlReplay.xml, DomainPrefixReplay.xml, ProxyReplay.xml. Customizations to jspInserts should be straightforward on inspecting these files.
  • org.archive.wayback.resourcestore.Http11ResourceStore now: org.archive.wayback.resourcestore.SimpleResourceStore. See RemoteCollection.xml for configuration example.
  • The new automatic indexing is most simply upgraded by modifying the new example in BDBCollection.xml with your custom paths.

.jsp upgrade information

New features with the * mark indicate features that will directly impact customizations made to .jsp files used with 1.2.X. The bulk of the changes fit three categories:

  • class name and package changes requiring import tag updates. Please see .jsps in new distribution for updated packages.
  • .jsp path changes due to webapp directory tree cleanup. Again, please see the current locations in the new distribution.
  • Java changes within .jsp files due to UIResults refactoring. Previously each type of response page had a unique class used to marshal context information to the .jsp files. These have all been refactored into a single class, org.archive.wayback.core.UIResults which has methods to access the appropriate data in each case. Additionally, many convenience methods that were present on the various UI*Results classes have been removed, since convenience methods are now available on the core classes:

    • WaybackRequest
    • CaptureSearchResult
    • CaptureSearchResults
    • UrlSearchResult
    • UrlSearchResults

    As an example, the Timestamp class is no longer used in the .jsp files, since all time information uses the Date class for localization. All of the above classes now have methods to directly return Dates.

    For specific examples, please see the reference .jsp files included with the new distribution.

Release 1.2.1

Features

  • Now explicitly sets the charset component of replayed HTML page Content-Type HTTP headers in Archival URL mode. This overrides Tomcat's default behavior of explicitly setting this value to Tomcat's default encoding character set, if a document does not set it explicitly. The original Content-Type HTTP header value is now returned as HTTP header X-Wayback-Orig-Content-Type.

Bug Fixes

  • added getter/setter for replay image, css, javascript, and html error handling .jsps
  • now returns "closest" indicator on XML query results, fixing problem with WAXToolbar/Proxy mode.(ACC-11)
  • auto-indexer now closes ARC/WARC files after indexing, fixing out-of-filehandle problem(ACC-12)
  • location-client now syncs .warc and .warc.gz files with locationDB, in addition to .arc and .arc.gz files.(ACC-13)
  • fixed problem which prevented captures archived after webapp was deployed from being returned. Now captures up to the current moment are returned. (ACC-14)
  • changed all .jsp files to return UTF-8(ACC-18)
  • now sending correct end Date to remote NutchWAX index. (ACC-20)
  • fixed String OOB exception when attempting to rewrite some CSS text (ACC-17)
  • now updates CSS "import 'URL';" and 'import "URL";' content. Previously only updated "import url(URL);" content.
  • fixed Replay redirect loop when using RemoteResourceIndex (ACC-15)

Release 1.2.0

Features

  • now supports compressed and uncompressed ARC and WARC files.
  • initial revision of "deduplicated" WARC record handling, which returns the last version that was actually stored when subsequent captures are not saved because they have not changed.
  • now filters (literal) duplicate records from the ResourceIndex, in case the same capture (url + date) appears twice, or in two CDX files.
  • UrlCanonicalizer is now pluggable, current functionality is now implemented in AggressiveUrlCanonicalizer. Added IdentityUrlCanonicalizer, which performs no canonicalization.
  • bin-search command line tool now outputs a single stream of sorted results from multiple files, instead of returning matches from each file sequentially.
  • extracted several replay features into separate jspInserts that can now be mixed and matched.
  • now handles most text/css URL rewriting, both inside HTML pages, and in externally linked .css files.
  • externalized comment embedded inside replayed HTML pages into jspInsert: ArchiveComment.jsp.
  • non-javascript Archival URL replay mode, where all URL rewriting occurs on the server. This includes a non-javascript Timeline jspInsert.
  • added two-month timeline partition.
  • root page of webapp now lists access points, when users make a request that does not specify one. Also, now access point "slash-pages" are available "without the slash".

Bug Fixes

  • Now rewrite Location and Content-Base HTTP headers in non-HTML Archival URL replayed documents.
  • now rewrites all background attributes found in returned pages (archival URL mode only) instead of just on BODY tags.
  • now rewrites src attributes on INPUT tags.
  • command line tools now allow whitespace arguments, important for tools accepting delimiter arguments.
  • replay URLs in query results now include non-standard ports, if needed.
  • Timezone is now explicitly set to GMT/UTC, fixing a Calendar result partitioning problem.
  • uncaught character-encoding exceptions now handled, plus slightly improved detection of correct character encoding by removing internal whitespace in declared encoding names.
  • archival URL parsing of query end-date now assumes latest possible date given a partial end-date, instead of earliest possible date.
  • re-implemented lost "closest" indicator for XML results.
  • now supports multiple auto index threads, one per ResourceStore, and also multiple auto index merge threads, one per BDB ResourceIndex.
  • fixed hard-coded maximum year issue.
  • reimplemented NotInArchive logging, which was lost in 1.0.0.