Introducing SWORD 3

Richard Jones, Cottage Labs
richard [at] cottagelabs [dot] com
@richard_d_jones @cottagelabs

SWORD 3.0 is a protocol enabling clients and servers to communicate around complex digital objects.

It defines semantics for creating, appending, replacing, deleting, and retrieving information about these complex resources.

Working Principles

  • The more optional features, the harder true interoperability
  • The simpler the better - aim to remove any unused features from SWORDv2
  • Research data support is key, though not at the expense of existing features
  • Make it easy for the community to engage and developers to pick up
  • Make it easy to maintain and extend
  • Be clear about the distinction between protocol and implementation
  • One single simple (as possible) document describing the protocol
  • Pay attention to anti-patterns: only one file, only one metadata schema, etc.
  • Prioritise current, validated and pressing use cases
  • Make it easy to relate implementations to the parts of the protocol
  • Minimise the effort to implement against a repository (as few special features as possible)

Usage Patterns

A usage pattern is our name for a single unit of functionality that we wanted the protocol to support.

Full set here (41 in total)

Some examples below.

Research data deposit
Researchers should be able to easily deposit data for publication, discovery, safe storage, long-term archiving and preservation.
Transmission of data meeting metadata standards
The protocol should support the transfer of well-understood data formats and profiles such as PCDM, METS, RIOXX, etc.
Automated machine-to-machine deposit
Autonomous systems should be able to communicate with each other as needed via the protocol.
Man-in-the-middle broker
Deposits should be possible via a brokerage service or other intermediary, which stands between the depositor and the target archive(s).
Real-time file-storage
The protocol should offer the facilities to enable the repository to behave like a real-time file store for user-facing systems.
Monitor workflow progress
Be able to track the state of an item in the repository - whether it is in a workflow, in the archive, or whether other actions have happened to it.
Arbitrarily large files
Deposited files may be very large.
Send files by reference
Send one or more links to files to be ingested and attached to an item. Some links may not need to be ingested; for those, only a reference needs to be created.

Differences to SWORDv2 and New Features

JSON instead of XML

No more AtomPub. Was:

<?xml version="1.0" ?>
<service xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns="http://www.w3.org/2007/app">

    <sword:version>2.0</sword:version>
    <sword:maxUploadSize>16777216</sword:maxUploadSize>

    <workspace>
        <atom:title>Main Site</atom:title>

        <collection href="http://swordapp.org/col-iri/43">
            <atom:title>Collection 43</atom:title>
            <accept>*/*</accept>
            <accept alternate="multipart-related">*/*</accept>
            <sword:collectionPolicy>Collection Policy</sword:collectionPolicy>
            <dcterms:abstract>Collection Description</dcterms:abstract>
            <sword:mediation>false</sword:mediation>
            <sword:treatment>Treatment description</sword:treatment>
            <sword:acceptPackaging>http://purl.org/net/sword/package/SimpleZip</sword:acceptPackaging>
            <sword:acceptPackaging>http://purl.org/net/sword/package/METSDSpaceSIP</sword:acceptPackaging>
            <sword:service>http://swordapp.org/sd-iri/e4</sword:service>
        </collection>
    </workspace>
</service>

Now

{
  "@context" : "https://swordapp.github.io/swordv3/swordv3.jsonld",

  "@id" : "http://example.com/service-document",
  "@type" : "ServiceDocument",

  "dc:title" : "Site Name",
  "dcterms:abstract" : "Site Description",

  "root" : "http://example.com/service-document",
  "acceptDeposits": true,

  "version": "http://purl.org/net/sword/3.0",
  "maxUploadSize" : 16777216000,
  "maxByReferenceSize" : 30000000000000000,
  "maxAssembledSize" : 30000000000000,
  "maxSegments" : 1000,

  "accept" : ["*/*"],
  "acceptArchiveFormat" : ["application/zip"],
  "acceptPackaging" : ["*"],
  "acceptMetadata" : ["http://purl.org/net/sword/3.0/types/Metadata"],

  "collectionPolicy" : {
    "@id" : "http://www.myorg.ac.uk/collectionpolicy",
    "description" : "...."
  },
  "treatment" : {
    "@id" : "http://www.myorg.ac.uk/treatment",
    "description" : "..."
  },

  "staging" : "http://example.com/staging",
  "stagingMaxIdle" : 3600,

  "byReferenceDeposit" : true,
  "onBehalfOf" : true,

  "digest" : ["SHA-256", "SHA", "MD5"],
  "authentication": ["Basic", "OAuth", "Digest", "APIKey"],

  "services" : [
    {
      "@id": "http://swordapp.org/deposit/43",

      "dc:title" : "Deposit Service Name",
      "dcterms:abstract" : "Deposit Service Description",

      "root" : "http://example.com/service-document",
      "parent" : "http://example.com/service-document",
      "acceptDeposits": true,

      "services" : []
    }
  ]
}
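
As a rough illustration of how a client might consume the JSON service document, the sketch below parses a trimmed copy of the example above and walks the nested "services" lists to find every URL that accepts deposits. The function name deposit_urls is hypothetical and not part of any SWORD library; only the field names are taken from the example.

```python
import json

# A trimmed copy of the service document shown above.
SERVICE_DOCUMENT = """
{
  "@type": "ServiceDocument",
  "maxUploadSize": 16777216000,
  "maxSegments": 1000,
  "byReferenceDeposit": true,
  "services": [
    {
      "@id": "http://swordapp.org/deposit/43",
      "acceptDeposits": true,
      "services": []
    }
  ]
}
"""


def deposit_urls(service):
    """Collect the @id of every nested service that accepts deposits.

    Hypothetical helper: it assumes each entry in "services" carries
    "@id", "acceptDeposits" and its own (possibly empty) "services" list,
    as in the example document.
    """
    urls = []
    for child in service.get("services", []):
        if child.get("acceptDeposits"):
            urls.append(child["@id"])
        # Services can nest, so recurse into the child's own list.
        urls.extend(deposit_urls(child))
    return urls


doc = json.loads(SERVICE_DOCUMENT)
print(deposit_urls(doc))
print("by-reference supported:", doc.get("byReferenceDeposit", False))
```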

Support for Arbitrary Metadata

SWORDv2 provided implicit support for arbitrary metadata formats, with no standard way to indicate to the server what you were sending it.

SWORDv3 provides explicit support for arbitrary metadata formats, via the Metadata-Format header.

SWORDv2: Create a resource with metadata only

POST Col-IRI HTTP/1.1
Host: example.org
Authorization: Basic ZGFmZnk6c2VjZXJldA==
Content-Length: [content length]
Content-Type: application/atom+xml;type=entry
In-Progress: true
On-Behalf-Of: jbloggs
Slug: [suggested identifier]

<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
        xmlns:dcterms="http://purl.org/dc/terms/">
    <title>Title</title>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2005-10-07T17:17:08Z</updated>
    <author><name>Contributor</name></author>
    <summary type="text">The abstract</summary>

    <!-- some embedded metadata -->
    <dcterms:abstract>The abstract</dcterms:abstract>
    <dcterms:accessRights>Access Rights</dcterms:accessRights>
    <dcterms:alternative>Alternative Title</dcterms:alternative>
    <dcterms:available>Date Available</dcterms:available>
    <dcterms:bibliographicCitation>Bibliographic Citation</dcterms:bibliographicCitation>
    <dcterms:contributor>Contributor</dcterms:contributor>
    <dcterms:description>Description</dcterms:description>
    <dcterms:hasPart>Has Part</dcterms:hasPart>
    <dcterms:hasVersion>Has Version</dcterms:hasVersion>
    <dcterms:identifier>Identifier</dcterms:identifier>
    <dcterms:isPartOf>Is Part Of</dcterms:isPartOf>
    <dcterms:publisher>Publisher</dcterms:publisher>
    <dcterms:references>References</dcterms:references>
    <dcterms:rightsHolder>Rights Holder</dcterms:rightsHolder>
    <dcterms:source>Source</dcterms:source>
    <dcterms:title>Title</dcterms:title>
    <dcterms:type>Type</dcterms:type>

</entry>

SWORDv3: Create a resource with metadata only

(using a metadata format that is not the default)

POST Service-URL
Content-Type: application/xml
Content-Disposition: attachment; metadata=true
Digest: SHA-256=74b2851bd2760785b0987ba219debea69c228353f7ccc67a2bdcd9819f97fc71
Metadata-Format: http://www.loc.gov/mods/v3

<mods xmlns="http://www.loc.gov/mods/v3">
  <originInfo>
    <place>
      <placeTerm type="code" authority="marccountry">nyu</placeTerm>
      <placeTerm type="text">Ithaca, NY</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <copyrightDate>1999</copyrightDate>
  </originInfo>
</mods>
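
A client could assemble the headers for the request above along these lines. This is a minimal sketch: the helper name build_metadata_headers and the example URL are hypothetical, and the hex encoding of the digest is an assumption that simply mirrors the form shown in the example request. The request object is built but never sent.

```python
import hashlib
import urllib.request


def build_metadata_headers(body, metadata_format):
    """Return headers for a metadata-only deposit in a non-default format.

    Hypothetical helper. The digest is hex-encoded to match the example
    request above; check the specification for the required encoding.
    """
    return {
        "Content-Type": "application/xml",
        "Content-Disposition": "attachment; metadata=true",
        "Digest": "SHA-256=" + hashlib.sha256(body).hexdigest(),
        "Metadata-Format": metadata_format,
    }


body = b'<mods xmlns="http://www.loc.gov/mods/v3"><originInfo/></mods>'
headers = build_metadata_headers(body, "http://www.loc.gov/mods/v3")

# Build (but do not send) the POST; the Service-URL here is a placeholder.
request = urllib.request.Request(
    "http://example.com/service-url", data=body, headers=headers, method="POST"
)
print(request.get_method(), request.full_url)
for name, value in headers.items():
    print(f"{name}: {value}")
```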

Concurrency Control

SWORDv2 did not have the concept of concurrency control.

SWORDv3 provides Optimistic Concurrency Control via the use of ETag and If-Match headers.
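
The ETag / If-Match pattern can be illustrated with a small in-memory simulation. This is only a sketch of the general HTTP mechanism, not of any SWORD server: the Resource class is hypothetical, and deriving the ETag from a content hash is one possible choice among many.

```python
import hashlib


class Resource:
    """Hypothetical in-memory resource guarded by ETag / If-Match."""

    def __init__(self, content):
        self.content = content

    @property
    def etag(self):
        # Assumption: the ETag is derived from the current content.
        return '"' + hashlib.sha256(self.content).hexdigest()[:16] + '"'

    def put(self, new_content, if_match):
        """Replace the content only if the caller saw the latest version."""
        if if_match != self.etag:
            return 412  # Precondition Failed: the resource changed underneath us
        self.content = new_content
        return 200


resource = Resource(b"version 1")

# Two clients read the resource and remember the same ETag.
etag_seen_by_a = resource.etag
etag_seen_by_b = resource.etag

status_a = resource.put(b"version 2 from A", if_match=etag_seen_by_a)
status_b = resource.put(b"version 2 from B", if_match=etag_seen_by_b)
print(status_a, status_b)
```

Client B's update is rejected because its ETag is stale; it would need to re-fetch the resource, reconcile, and retry with the new ETag.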

Segmented File Upload

SWORDv2 dealt only in full by-value deposits of files, which could be problematic if the files are very large.

In SWORDv3, to transfer a large file, the client can break it down into a number of equally sized segments of binary data (the final segment may be a different size to the rest). It can then initialise a Segmented File Upload with the server, and then transfer the segments. The server will reconstitute these segments into a single file, and then the client may deposit this file by-reference.
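
The client-side splitting and server-side reassembly described above can be sketched as follows. The function names are hypothetical, the segment size is arbitrary, and the max_segments check assumes the client honours the maxSegments value from the service document; the actual upload requests are left out.

```python
def split_into_segments(data, segment_size, max_segments=None):
    """Split data into equally sized segments; only the last may be shorter.

    Hypothetical helper. max_segments stands in for the server's
    advertised maxSegments limit.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = [
        data[start:start + segment_size]
        for start in range(0, len(data), segment_size)
    ]
    if max_segments is not None and len(segments) > max_segments:
        raise ValueError("too many segments; choose a larger segment_size")
    return segments


def reassemble(segments):
    """What the server does conceptually: join the segments back in order."""
    return b"".join(segments)


data = bytes(range(256)) * 10          # 2560 bytes of sample data
segments = split_into_segments(data, segment_size=1000, max_segments=1000)
print([len(s) for s in segments])
print(reassemble(segments) == data)
```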

By-Reference Deposit

SWORDv2 did not have any formal mechanism for depositing files by-reference (although some workarounds existed).

SWORDv3 provides explicit support for By-Reference deposit, where the client provides the server with URLs for Files which it would like the server to retrieve asynchronously.
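
As a rough sketch of what a client might send, the code below builds a JSON document listing the URLs the server should retrieve. The field names used here (byReferenceFiles, contentType, contentLength, dereference) and the helper name are assumptions made for illustration and should be checked against the specification. The dereference flag models the usage pattern in which some links are ingested and others are only recorded as references.

```python
import json


def by_reference_document(files):
    """Build an illustrative By-Reference deposit document.

    Hypothetical helper with assumed field names. Each entry in `files`
    is a tuple of (url, content_type, content_length, dereference).
    """
    return {
        "@type": "ByReference",
        "byReferenceFiles": [
            {
                "@id": url,
                "contentType": content_type,
                "contentLength": content_length,
                # True: fetch and ingest the file; False: keep the link only.
                "dereference": dereference,
            }
            for url, content_type, content_length, dereference in files
        ],
    }


document = by_reference_document([
    ("http://www.otherorg.ac.uk/by-reference/file.zip", "application/zip", 123456789, True),
    ("http://www.otherorg.ac.uk/by-reference/dataset.csv", "text/csv", 2048, False),
])
print(json.dumps(document, indent=2))
```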

More Advanced Packaging

There has been a lot of pressure on the SWORD team to provide more detail about actual packaging formats. We have resisted for a long time, but for SWORDv3 we have introduced a BagIt profile which is slightly more advanced than the package formats required by SWORDv2.

SwordBagIt
| -- bag-info.txt
| -- bagit.txt
| -- data
|    | -- bitstreams ...
|    \ -- directories ...
|         \ -- bitstreams ...
| -- manifest-sha-256.txt
| -- metadata
|    \ -- sword.json
\ -- tagmanifest-sha-256.txt

This allows us to represent the item as a combination of an arbitrary structure of bitstreams in the data directory (similar to SimpleZip), and the metadata in the SWORD default format in metadata/sword.json. A manifest (and tagmanifest) of SHA-256 checksums is required, as well as the bagit.txt file and a bag-info.txt file.

Implementation Plans

We are currently waiting to hear the outcome of a proposal for onward funding for implementation.

Below is what we're hoping to get funding to cover:

We want to implement all the aspects of the specification which are identified as MUST in the documentation. This means we would be doing the smallest complete implementation possible, which will provide significant insight into the implementability of the full specification.

Client

Implement a full client which can carry out all of the operations available to it in the specification.

This would be a fully re-usable code library for anyone else wanting to work with SWORDv3.

Test Suite

Provide a test suite which can drive the client to carry out all the actions defined in the specification, in each case with a selection of data, including error cases.

This would also be fully re-usable so anyone else implementing a SWORDv3 server would be able to run our test suite to validate their work.

Invenio Back-End

Implement Invenio support for all of the features of the specification which are identified as MUST in the documentation, and any optional features which are required to support those features in this environment.

This would make Invenio3 a SWORDv3 compliant repository and also provide a reference implementation for SWORDv3 servers. Some components will be re-usable outside Invenio.

Last But Not Least

Thanks to all of the following who were involved in the Technical Advisory Board:

Adam Rehin, Adrian Stevenson, Alan Stiles, Catherine Jones, Claire Knowles, David Moles, David Wilcox, Eoghan Ó Carragáin, Erick Peirson, Gertjan Filarski, Goosyara Kovbasniy, Graham Triggs, Hideaki Takeda, Jan van Mansum, Jauco Noordzij, Jochen Schirrwagen, John Chodacki, Justin Simpson, Lars Holm Nielsen, Marisa Strong, Martin Wrigley, Masaharu Hayashi, Masud Khokhar, Mike Jackson, Morane Gruenpeter, Neil Chue Hong, Paul Walk, Peter Sefton, Ralf Claussnitzer, Ricardo Otelo Santos Saraiva Cruz, Richard Rodgers, Scott Wilson, Shannon Searle, Stephanie Taylor, Stuart Lewis, Tomasz Parkola, Vitali Peil

Thanks for Listening

Richard Jones, Cottage Labs
richard [at] cottagelabs [dot] com
@richard_d_jones @cottagelabs