Release notes
Full listings of changes and bug fixes are not available for releases prior to 1.2.0, or for releases between 1.6.0 and OpenWayback 2.0.0 BETA 1.
OpenWayback 2.1.0 Release
Features
- Synchronised with latest changes from the Internet Archive fork. #195
- URL-decode timestamp segment of replay URL. #195, internetarchive#23
- Revisits can be resolved with excluded capture. #195, internetarchive#65
- Added rudimentary mime-type sniffing (work in progress). #195, internetarchive#46
- Timestamp-collapsing can be configured to return the last best capture in each collapse group. #195, internetarchive#64
- UIResults now has makePlainCaptureQueryUrl() method for generating clean, short URL for capture query links. #195, internetarchive#60
- MultipleRegexReplaceStringTransformer may also be used as RewriteRule. #195, internetarchive#54
- Allow for using a different collapseTime for replay and capture search. #195, internetarchive#49
- Make collection-dependent exclusion configurable. #195, internetarchive#48
- Removed CustomUserResourceIndex class, which does not appear to have broad utility. #195
- Performance information in response header can now be in JSON format. #195, internetarchive#69
- FastArchivalUrlReplayParseEventHandler no longer rewrites relative URLs, for better replay quality. #195
- Made start date configurable (defaults to the old value of 1996) and end date dynamic (the current year). #51
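The "URL-decode timestamp segment" item above can be illustrated with a minimal sketch. It assumes the timestamp segment may arrive percent-encoded (for example `2014%2A` for `2014*`); the class name `TimestampSegmentDecoder` is hypothetical and is not part of OpenWayback.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, for illustration only.
final class TimestampSegmentDecoder {
    /** Percent-decodes the timestamp segment of a replay URL path. */
    static String decode(String segment) {
        try {
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory for every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Plain 14-digit timestamps pass through unchanged, so decoding can be applied unconditionally before the segment is parsed.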
Bug Fixes
- Fixed issue #196 to allow running under Tomcat 8. #198
- Fixed incorrect Content-Type in replay of resource record with JWATFlexResource. #195, internetarchive#68
- Fixed ClassCastException when JWATFlexResourceStore is in use. #195, internetarchive#67
- Pass through the Content-Range header field so that audio playback works. #195, internetarchive#66
- Fixed undesirable rewrite of in-page (fragment-only) links. #195, internetarchive#63
- Fixed XHTML parse error due to banner insert before XML declaration. #195, internetarchive#61
- Fixed PrivTokenAuthChecker resetting ignoreRobots. #195, internetarchive#51
- Made CharsetDetector adhere to WHATWG recommendation. #195, internetarchive#47
- Fixed building with JDK 8. #141
- Fixed NullPointerException for RemoteResourceIndex. #193
- Removed direct references to Unix specific TMP paths /tmp and /var/tmp. #172
- Initial thread-safety fix for Memento from Luda. #180
- Fixed XML markup in Toolbar.jsp which caused problems on some sites. #171, #60
- Fixed some @import URLs in the <style> section of HTML not being rewritten. #131
- Fixed issue #48 jQuery getting stomped on.
- Support for loading resources from S3 buckets. #189
- Refactored CDX Server into a war and jar module. #164
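The fragment-only link fix listed above (internetarchive#63) concerns links such as `#top`, which point within the current page and need no replay prefix. A minimal sketch of such a guard follows; the class `FragmentLinkCheck` and its `rewrite` step are hypothetical and do not reproduce OpenWayback's actual rewriting code.

```java
// Hypothetical helper, for illustration only.
final class FragmentLinkCheck {
    /** Returns true when the link only names an anchor within the current page. */
    static boolean isFragmentOnly(String href) {
        return href != null && href.startsWith("#");
    }

    /** Simplified rewrite step: prefixes the replay path unless the link is fragment-only. */
    static String rewrite(String replayPrefix, String href) {
        if (href == null || isFragmentOnly(href)) {
            return href; // leave in-page links untouched
        }
        return replayPrefix + href;
    }
}
```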
OpenWayback 2.0.0 Release
Features
- Fixed URL resolution in ServerRelativeArchivalRedirect in non-ROOT context. #92
- Deprecated use of bean name in spring to provide configuration. #94
- Updated and improved documentation. #125 #121 #133
- Reviewed and updated mailing lists. #126 #127
- Added Java cross-reference and updated site generation with dependencies. #128 #30
- Fixed Javadoc output in Java8. #136
- Updated 'developers' and 'contributors' lists in POM. #137
- Cleaned up the Memento configuration. #150
- Added new logos to project. #100
- Cleaned up default config file. #144
- Updated and improved documentation.
- Updated dependency on Webarchive-commons 1.1.4. #157
- Added 'accessPointPath' to default proxy config. #158
Bug Fixes
- Fixed the date locale issue. Creations of java.text.SimpleDateFormat now independent of local setting. #157 #148 #154
- Fixed support for uncompressed ARC files. #101
OpenWayback 2.0.0 BETA 2 release
Features
- Added PrefixFieldCollapser and RegexFieldMatcher to CDX server. #7
- Added support for WARC metadata records. #23
- Added support for WARC resource records. #24
- Removed Internet Archive defaults and branding. #45
- Integrated JWAT ResourceStore. #54
- Provided an OpenWayback Sample Overlay.
- Carried out and documented manual testing. #80
- Updated and improved documentation (as on the wiki).
- Renamed artefacts and repositories to “webarchive-commons” and updated POMs. #90
Bug Fixes
- Fixed query string being stripped from Memento queries. #106
- Support for uncompressed ARC files. #101
OpenWayback 2.0.0 BETA 1 Release
Features
- Added livewebPrefix to wayback.xml. #3
- Removed dependencies on Internet Archive’s Maven artefacts, enabling Travis CI builds and clean releases. #10
- Moved critical code for OpenWayback from the heritrix-commons codebase into webarchive-commons. #4
Bug Fixes
- Dependency on heritrix-commons SNAPSHOT release. #11
The following releases of the Open Source Wayback Machine (OSWM) were made by the Internet Archive. The on-going development of Wayback was handed over to the International Internet Preservation Consortium (IIPC) in October 2013. For more details please see General overview.
Release 1.8.0
Features
- Introduced the wayback-cdx-server.
No further release notes available.
Release 1.7.0
Release notes not available.
Release 1.6.0
Major Features
-
Memento integration.
-
Improved live-web fetching, enabling simpler external caching of
robots.txt documents, or other arbitrary content used to improve
function of a replay session.
-
Customizable logging, via a logging.properties configuration file.
-
Vastly improved Server-side HTML rewriting capabilities, including
customizable rewriting of specific tags and attributes, rewriting
of (some easily recognizable) URLs within JavaScript and CSS.
-
Snazzy embedded toolbar with "sparkline" indicating the distribution
of captures for a given HTML page, control elements enabling
navigation between various versions of the current page, and a
search box to navigate to other URLs directly from a replay session.
-
Improved hadoop CDX generation capabilities for large scale indexes.
-
SWF (Flash) rewriting, to contextualize absolute URLs embedded
within flash content.
-
ArchivalUrl mode now accepts identity ("id_") flag to indicate
transparent replaying of original content.
-
NotInArchive can now optionally trigger an attempt to fill in
content from the live web, on the fly.
-
Updated license to Apache 2.
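The identity flag mentioned above is appended to the timestamp in an Archival URL, as in `20100101000000id_`. A minimal sketch of separating the flag from the timestamp follows, assuming a single two-letter flag followed by an underscore; the class `DatespecFlags` is hypothetical and Wayback's own parser may accept more forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class DatespecFlags {
    // Up to 14 timestamp digits, optionally followed by a two-letter flag and an underscore.
    private static final Pattern DATESPEC = Pattern.compile("^(\\d{1,14})([a-z]{2}_)?$");

    /** Returns the flag (such as "id_") in a datespec, or the empty string if none is present. */
    static String flagOf(String datespec) {
        Matcher m = DATESPEC.matcher(datespec);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a datespec: " + datespec);
        }
        return m.group(2) == null ? "" : m.group(2);
    }

    /** True when the datespec asks for transparent replay of the original content. */
    static boolean isIdentity(String datespec) {
        return "id_".equals(flagOf(datespec));
    }
}
```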
Major Bug Fixes
-
More robust handling of chunk encoded resources.
-
Fixed problem with improperly resolving path-relative URLs found
in HTML, CSS, Javascript, SWF content.
-
Fixed problem with improperly escaping URLs within HTML when
rewriting them.
-
Fixed problem where a misconfigured or missing administrative
exclusion file was allowing results to be returned, instead of
returning an appropriate error.
-
No longer extracts resources from the ResourceStore before
redirecting to the closest version, which was a major inefficiency.
Minor Features
-
Now provides a closeMatches list of search results which were not
applicable given the user's request, but that may be useful for
follow-up requests.
-
Archival Url mode now allows rotating through several character
encoding detection schemes.
-
Proxy Replay mode now accepts ArchivalURL format requests, allowing
dates to be explicitly requested via proxy mode.
-
AccessPoints can now be configured to optionally require strict host
matching for queries and replay requests.
-
Now filters URLs which contain user-info (USER:PASSWORD@example.com)
from the ResourceIndex.
-
ArchivalURL mode requests without a datespec are now interpreted as
a request for the most recent capture of the URL.
-
Improvements in mapping incoming requests to AccessPoints, to allow
virtual hosts to target specific AccessPoints.
-
ResourceNotAvailable exceptions now include other close search
results, allowing the UI to offer other versions which may be
available.
-
ArchivalURL mode now forwards request flags (cs_, js_, im_, etc)
when redirecting to a closer date.
-
ResourceStore implementation now allows retrying when confronted
with possibly-transient HTTP 502 errors.
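The user-info filter in the list above can be sketched with `java.net.URI`, which exposes the user-info component of a URL directly. The class `UserInfoFilter` is hypothetical, and treating unparseable URLs as "no user-info" is an assumption of this sketch rather than documented Wayback behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only.
final class UserInfoFilter {
    /** Returns true when the URL carries a user-info part such as "USER:PASSWORD@". */
    static boolean hasUserInfo(String url) {
        try {
            return new URI(url).getRawUserInfo() != null;
        } catch (URISyntaxException e) {
            return false; // assumption: unparseable URLs are left for other filters
        }
    }
}
```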
Minor Bug Fixes
-
cdx-indexer (replacement for arc-indexer and warc-indexer) tool now
returns accurate error code on failure.
-
No longer sets JVM-wide default timezone to GMT - now it is set
appropriately on Calendars when needed.
-
Hostname comparison is now case-insensitive.
-
Server-relative archival url redirects now include query arguments
when redirecting.
-
Server-relative archival url redirects now include a Vary HTTP
header, to fix problems when a cache is used between clients and
the Wayback service.
-
Fixed problem with robots.txt caching within a single request,
which caused serious inefficiency.
-
Fixed problem with resources redirecting to alternate HTTP/HTTPS
version of themselves.
-
Fixed problem with accurately converting 14-digit Timestamps into
Date objects for later comparison.
-
Automatically remaps the oft-misused charset "iso-8859-1" to the
superset "cp1252".
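Two of the fixes above concern time handling: converting 14-digit timestamps into Date objects, and setting GMT on the objects that need it rather than JVM-wide. A minimal sketch combining both ideas follows; the class `TimestampParser` is hypothetical and is not the code used in Wayback.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, for illustration only.
final class TimestampParser {
    /** Parses a 14-digit timestamp (yyyyMMddHHmmss) as GMT without touching the JVM default timezone. */
    static Date parse14(String timestamp) {
        if (timestamp == null || !timestamp.matches("\\d{14}")) {
            throw new IllegalArgumentException("Expected 14 digits: " + timestamp);
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT")); // set on this formatter only
        format.setLenient(false);                          // reject values such as month 13
        try {
            return format.parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid timestamp: " + timestamp, e);
        }
    }
}
```

A new formatter is created per call because `SimpleDateFormat` is not thread-safe.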
Release 1.4.2
Features
-
Added exactSchemeOnly configuration to AccessPoint, allowing
explicit distinction between http:// and https://. (ACC-32)
-
Now times out requests to a slow/non-responsive RemoteResourceIndex
and remote (HTTP 1.1) ResourceStore nodes. (ACC-38)
-
Experimental OpenSearchQuery .jsp implementations. (ACC-56)
-
FileProxyServlet now accepts /OFFSET trailing path in addition to
Content-Range HTTP header. (ACC-74)
-
warc-indexer now has -all option to produce a CDX line for ALL
records, not just captures and revisits. (ACC-75)
-
Now includes file+offset for all records, keying off mime-type of
warc/revisit to determine revisits at query time. (ACC-76)
-
Allow prefixing of original HTTP headers with a fixed string.
(ACC-77)
-
Wayback now rewrites Content-Base HTTP headers. (ACC-78)
-
Timeline.jsp improvements which prevent Timeline from being severely
distorted on some pages.
-
Improvement to ArchivalUrl client-rewrite.js to preserve link text,
working around a bug in Internet Explorer.
Bug Fixes
-
Now all mime-types are escaped to prevent spaces from getting into
the CDX files. (ACC-45)
-
Some CSS URLs were being rewritten twice. (ACC-53)
-
No longer writing the original page's Content-Length HTTP header to
output, which caused original pages with a lower-case "l" in
"Content-length" to return the wrong length, truncating replayed
documents. This caused some replayed pages to lack embedded
disclaimers and JavaScript rewriting of links and images.
(ACC-60)
-
Fixed severe problem with live web robots.txt retrieval where the wrong
offset was being written into the live web ResourceIndex.
(ACC-62)
-
Charset extraction from HTTP headers is now case-insensitive.
(ACC-63)
-
No longer adding content to HTML pages with FrameSet tags, as they
were being broken. (ACC-65)
-
No longer set GMT as default timezone for entire JVM. (ACC-70)
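The case-insensitive charset extraction fix (ACC-63) can be illustrated with a small regular expression over the Content-Type header value. This is a sketch under the assumption that the charset appears as a `charset=` parameter, optionally quoted; the class `CharsetExtractor` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class CharsetExtractor {
    // CASE_INSENSITIVE lets "charset", "Charset" and "CHARSET" all match.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is declared. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```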
Release 1.4.1
Features
-
Index filter which allows including/excluding records based on HTTP
response code field. (ACC-43)
-
Outputs log message instead of stack dump when failing to access
a Resource.
Bug Fixes
-
Some redirect records were not being located in index due to bad
logic in Duplicate record filter. (ACC-30)
-
Wayback was not throwing a NotInArchiveException when
Self-Redirect replay filter removes all records. (unreported)
-
Location HTTP header values were not being escaped before
placing in CDX, causing some records to have too many columns.
(ACC-31)
-
Search Result summary counts were incorrect in Url Prefix
searches. (ACC-33)
-
Implemented NoCache.jsp, a replay insert which adds a
Cache-Control: no-cache HTTP header to all replayed
documents. (ACC-34)
-
Timeline.jsp was using Request Date, not Capture date, which
caused Proxy Mode Timeline to show the wrong date.
(ACC-36)
-
Advanced Search reference implementation .jsp was broken.
(ACC-37)
-
AnchorDate and AnchorWindow functionality is now disabled by
default, and can be enabled via configuration on an AccessPoint.
(ACC-46)
Release 1.4.0
Features
-
@ Completely new implementation of ResourceStore classes,
including recursive local directory scanning, scanning multiple
local directories, an experimental remote directory scanning
capability, and groundwork for future support of both non ARC/WARC
file formats and large scale automatic indexing.
-
@ Complete overhaul of the Replay system, allowing
jspInserts within ArchivalUrl, DomainPrefix, and Proxy replay
modes. Also includes groundwork for future fine-grained mime-type
and url-based Replay customizations.
-
Added capability to explicitly set the Locale to use for an
AccessPoint, overriding the default behavior of using the user
agent's specified preferred language.
-
New flat file implementation of FileLocationDB. See
CDXCollection.xml within the .war file for an example usage.
-
AnchorDate feature, tracking the date with which a user begins a
replay session. During this session, wayback will always attempt to
remain near this date, preventing time-drift within a replay
session.
-
AnchorWindow feature, which allows users to specify a maximum time
window in either direction of the AnchorDate that they wish to view
replayed content. When a user has set this option, Wayback will not
display captures outside the specified window.
-
New command line tool location-db to create a location DB
offline, populating with lines read from STDIN.
-
Added new AccessControlSettingOperation authentication control
component, allowing the configuration of the appropriate Exclusion
system per-request, as defined by arbitrary BooleanOperators. See
ComplexAccessPoint.xml within the .war file for an example usage.
-
Added .asx archival URL replay, which rewrites links inside
archived .asx files, attempting to make them point back into the
Wayback service.
-
Now accept "http:/" as identical to "http://" in the beginning of
a URL, working around a browser bug which stripped multiple "/"s in
URL paths.
-
@ Refactoring of ResourceIndex interfaces, to allow for
future update-able ResourceIndex implementations beyond BDBIndex
based ResourceIndexes.
-
* Major internal refactoring of WaybackRequest object,
providing more stable get/set methods for accessing the standard
internal fields with type-safety.
-
* Major internal refactoring of SearchResults into
CaptureSearchResults and UrlSearchResults, which was previously
under-specified and often confusing. These new classes provide more
stable get/set methods for accessing the standard internal fields
with type-safety.
-
* Changed locations of replay, query, and exception .jsp
files within .war file to underneath WEB-INF, so they are not
directly accessible via HTTP.
-
German translation of default Wayback UI. Thanks Andreas!
-
Czech translation of default Wayback UI. Thanks Lukáš Matějka!
(ACC-29)
-
All threads now notified of shut downs, allowing resources to be
released cleanly.
-
* Refactor of all Request and Result related constants from
WaybackConstants to WaybackRequest and the *SearchResult(s)
classes.
-
* Refactor of the various UI*Results classes, which are used
by Query, Replay, and Exception .jsp files to access context
information into the single class, UIResults, which has a more
stable interface.
-
New AccessPoint.urlRoot optional configuration, enabling explicit
control over URLs generated for the UI.
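The "http:/" workaround in the list above amounts to restoring a slash that a browser collapsed. A minimal sketch follows; the class `SchemeSlashRepair` is hypothetical, and extending the repair to "https:/" is an assumption of this sketch, since the note above only mentions "http:/".

```java
// Hypothetical helper, for illustration only.
final class SchemeSlashRepair {
    /** Restores the second slash when a URL starts with "http:/" followed by a non-slash. */
    static String repair(String url) {
        // The lookahead (?!/) leaves already-correct "http://" prefixes alone.
        // Handling "https" as well is an assumption, not stated in the release note.
        return url.replaceFirst("^(?i)(https?):/(?!/)", "$1://");
    }
}
```

URLs that do not start with an HTTP scheme are returned unchanged.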
Bug Fixes
-
(ACC-24) Fixed bug in Proxy mode which prevented the correct number
of results from being returned from the index during Replay.
-
(ACC-21) fixed bug where some CSS import declarations were not
being correctly rewritten.
-
(ACC-26) fixed rare String OOB exception when marking up pages with
some forms of Javascript generated HTML.
-
(ACC-28) verifies that detected encoding is supported in local JVM
before attempting to decode a resource into a String.
-
(unreported) fixed declared page encoding of help, advanced search
and index page to UTF-8.
-
Explicitly set character encoding on returned documents, instead of
relying on Tomcat to return the correct encoding.
Migration notes to 1.4.0 from 1.2.X
Wayback 1.4.0 includes substantial code changes aimed at extending
current capabilities, enabling planned future features, and
stabilizing interfaces used in .jsp customizations. Since these
changes would already require a significant update of existing
customizations made to .jsp files, many non-vital cleanups to the
source tree were included. The goal of implementing all of these
features within this single release is to minimize future required
updates.
Below is a somewhat inclusive list of changes that will be required
when upgrading to Wayback 1.4.0 from 1.2.X, divided into two main
categories: changes required to Spring configuration, and changes
required for .jsp customizations. Depending on the scope of the
existing customizations in your installations, it may be simpler
to modify your existing customizations to conform to new interfaces
and packages, and in other cases, it may be simpler to begin with the
new reference implementations and modify them to meet your needs.
If there are changes not addressed here, or if you have questions
regarding specific issues when upgrading, please direct these
questions to the archive-access-discuss forum.
Spring upgrade information
New features with the @ mark indicate features that will directly
impact Spring XML configuration files used with 1.2.X.
-
org.archive.wayback.resourcestore.http.FileLocationDB now:
org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB
-
org.archive.wayback.resourcestore.http.FileLocationDBServlet now:
org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet
-
org.archive.wayback.resourcestore.http.ArcProxyServlet now:
org.archive.wayback.resourcestore.locationdb.FileProxyServlet
-
All ReplayUI implementations changed completely, now located in:
ArchivalUrlReplay.xml, DomainPrefixReplay.xml, ProxyReplay.xml.
Customizations to jspInserts should be straightforward on
inspecting these files.
-
org.archive.wayback.resourcestore.Http11ResourceStore now:
org.archive.wayback.resourcestore.SimpleResourceStore. See
RemoteCollection.xml for configuration example.
-
The new automatic indexing is most simply upgraded by modifying
the new example in BDBCollection.xml with your custom paths.
.jsp upgrade information
New features with the * mark indicate features that will directly
impact customizations made to .jsp files used with 1.2.X. The bulk of
the changes fit three categories:
-
class name and package changes requiring import tag updates.
Please see .jsps in new distribution for updated packages.
-
.jsp path changes due to webapp directory tree cleanup. Again,
please see the current locations in the new distribution.
-
Java changes within .jsp files due to UIResults refactoring.
Previously each type of response page had a unique class used
to marshal context information to the .jsp files. These have all
been refactored into a single class,
org.archive.wayback.core.UIResults which has methods to
access the appropriate data in each case. Additionally, many
convenience methods that were present on the various UI*Results
classes have been removed, since convenience methods are now
available on the core classes:
- WaybackRequest
- CaptureSearchResult
- CaptureSearchResults
- UrlSearchResult
- UrlSearchResults
As an example, the Timestamp class is no longer used in the .jsp
files, since all time information uses the Date class for
localization. All of the above classes now have methods to
directly return Dates.
For specific examples, please see the reference .jsp files
included with the new distribution.
Release 1.2.1
Features
-
Now explicitly sets the charset component of replayed HTML
page Content-Type HTTP headers in Archival URL mode. This
overrides Tomcat's default behavior of explicitly setting this value
to Tomcat's default encoding character set, if a document
does not set it explicitly. The original Content-Type HTTP
header value is now returned as HTTP header
X-Wayback-Orig-Content-Type.
Bug Fixes
-
added getter/setter for replay image, css, javascript, and html
error handling .jsps
-
now returns "closest" indicator on XML query results, fixing problem
with WAXToolbar/Proxy mode. (ACC-11)
-
auto-indexer now closes ARC/WARC files after indexing, fixing
out-of-filehandle problem. (ACC-12)
-
location-client now syncs .warc and .warc.gz files with
locationDB, in addition to .arc and .arc.gz files. (ACC-13)
-
fixed problem which prevented captures archived after webapp was
deployed from being returned. Now captures up to the current moment
are returned. (ACC-14)
-
changed all .jsp files to return UTF-8. (ACC-18)
-
now sending correct end Date to remote NutchWAX index.
(ACC-20)
-
fixed String OOB exception when attempting to rewrite some CSS text
(ACC-17)
-
now updates CSS "import 'URL';" and 'import "URL";' content.
Previously only updated "import url(URL);" content.
-
fixed Replay redirect loop when using RemoteResourceIndex
(ACC-15)
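The CSS import fix above distinguishes three syntaxes: a single-quoted string, a double-quoted string, and the url(...) form. A minimal sketch of a pattern that recognises all three follows. It assumes the rules are written with the usual `@import` keyword; the class `CssImportFinder` is hypothetical and only locates the URLs, it does not rewrite them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class CssImportFinder {
    // Alternatives: url(...) with optional quotes, a double-quoted string, or a single-quoted string.
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)|\"([^\"]+)\"|'([^']+)')",
            Pattern.CASE_INSENSITIVE);

    /** Returns the URLs referenced by @import rules, in document order. */
    static List<String> findImports(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            // Exactly one of the three capture groups is set for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g).trim());
                }
            }
        }
        return urls;
    }
}
```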
Release 1.2.0
Features
-
now supports compressed and uncompressed ARC and WARC files.
-
initial revision of "deduplicated" WARC record handling, which
returns the last version that was actually stored when
subsequent captures are not saved because they have not changed.
-
now filters (literal) duplicate records from the ResourceIndex,
in case the same capture (url + date) appears twice, or in two
CDX files.
-
UrlCanonicalizer is now pluggable, current functionality is now
implemented in AggressiveUrlCanonicalizer. Added
IdentityUrlCanonicalizer, which performs no canonicalization.
-
bin-search command line tool now outputs a single stream of
sorted results from multiple files, instead of returning matches
from each file sequentially.
-
extracted several replay features into separate jspInserts that
can now be mixed and matched.
-
now handles most text/css URL rewriting, both inside HTML pages,
and in externally linked .css files.
-
externalized comment embedded inside replayed HTML pages into
jspInsert: ArchiveComment.jsp.
-
non-javascript Archival URL replay mode, where all URL rewriting
occurs on the server. This includes a non-javascript
Timeline jspInsert.
-
added two-month timeline partition.
-
root page of webapp now lists access points, when users make
a request that does not specify one. Also, now access point
"slash-pages" are available "without the slash".
Bug Fixes
-
Now rewrite Location and Content-Base HTTP headers in non-HTML
Archival URL replayed documents.
-
now rewrites all background attributes found in returned
pages (archival URL mode only) instead of just on BODY tags.
-
now rewrites src attributes on INPUT tags.
-
command line tools now allow whitespace arguments, important for
tools accepting delimiter arguments.
-
replay URLs in query results now include non-standard ports, if
needed.
-
Timezone is now explicitly set to GMT/UTC, fixing a Calendar
result partitioning problem.
-
uncaught character-encoding exceptions now handled, plus
slightly improved detection of correct character encoding by
removing internal whitespace in declared encoding names.
-
archival URL parsing of query end-date now assumes latest
possible date given a partial end-date, instead of earliest
possible date.
-
re-implemented lost "closest" indicator for XML results.
-
now supports multiple auto index threads, one per ResourceStore,
and also multiple auto index merge threads, one per BDB
ResourceIndex.
-
fixed hard-coded maximum year issue.
-
reimplemented NotInArchive logging, which was lost in 1.0.0.
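The end-date fix in the list above means that a partial end-date such as "2008" is read as the last second of 2008 rather than the first. A minimal sketch of that padding follows; the class `EndDatePadder` is hypothetical, and restricting input to even-length prefixes of a 14-digit timestamp is an assumption of this sketch.

```java
import java.time.YearMonth;

// Hypothetical helper, for illustration only.
final class EndDatePadder {
    /** Expands a partial timestamp (4 to 14 digits, even length) to the latest instant it could denote. */
    static String padToLatest(String partial) {
        if (partial == null || !partial.matches("\\d{4}(\\d{2}){0,5}")) {
            throw new IllegalArgumentException("Unsupported partial timestamp: " + partial);
        }
        StringBuilder sb = new StringBuilder(partial);
        if (sb.length() == 4) {
            sb.append("12"); // latest month
        }
        if (sb.length() == 6) {
            int year = Integer.parseInt(sb.substring(0, 4));
            int month = Integer.parseInt(sb.substring(4, 6));
            // lengthOfMonth() accounts for leap years and is always two digits (28 to 31).
            sb.append(YearMonth.of(year, month).lengthOfMonth());
        }
        if (sb.length() == 8) {
            sb.append("23"); // latest hour
        }
        if (sb.length() == 10) {
            sb.append("59"); // latest minute
        }
        if (sb.length() == 12) {
            sb.append("59"); // latest second
        }
        return sb.toString();
    }
}
```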