Simple URF (SURF) Specification

Author: Garret Wilson (GlobalMentor, Inc.)
Version: Draft 2020-06-20

Introduction

Simple URF (SURF) is a compact, text-based, human-readable persistence format for a directed graph of data values. It is similar in purpose to JSON yet less verbose and more expressive, supporting, for example, a greater number of types. Moreover, as one of the primary serializations (along with TURF) of the Uniform Resource Framework (URF), it rigorously represents semantic data in a simpler manner than serializations of the Resource Description Framework (RDF).

Definitions

SURF parser: Any software component that interprets SURF syntax and produces an appropriate data model according to this specification.
SURF serializer: Any software component that produces SURF syntax complying with this specification to reflect some data model.

Design Constraints

This section is non-normative.

The following considerations were used to guide the creation of this specification:

SURF must support the Unicode character set for text values.
SURF should support the Unicode character set for identifiers and comments.
SURF must use only ASCII characters for delimiters.
All valid JSON documents must also be valid SURF documents.
All valid SURF documents must also be valid TURF documents.
SURF must allow for distinctions among vocabularies of identifiers.
SURF must not require namespace IRIs to be declared.

Conventions Used in this Document

The key words “must”, “must not”, “required”, “shall”, “shall not”, “should”, “should not”, “recommended”, “may”, and “optional” in this document are to be interpreted as described in RFC 2119. Parts of this specification marked as notes and annotations are non-normative.

Internet Media Type

The Internet media type (RFC 6838) of a SURF document shall be text/simple-urf. A SURF document must be encoded in UTF-8 and must not begin with a so-called byte order mark (BOM) or UTF-8 signature.
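This example is non-normative. The following Python sketch shows one way a SURF parser might enforce the encoding requirements above before parsing begins. The function name decode_surf_document is hypothetical and is not defined by this specification.

```python
# The three-octet UTF-8 signature, i.e. U+FEFF encoded in UTF-8.
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def decode_surf_document(data: bytes) -> str:
    """Decode the octets of a SURF document, rejecting a leading BOM."""
    if data.startswith(UTF8_SIGNATURE):
        raise ValueError("a SURF document must not begin with a byte order mark")
    # Strict decoding raises UnicodeDecodeError if the input is not valid UTF-8.
    return data.decode("utf-8")
```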

Structure

The content of a SURF document encodes a graph of resources defined by the Uniform Resource Framework (URF) with a single resource as the root of the graph. A SURF document may be empty, representing no resources. A SURF parser or a SURF serializer may represent a SURF document as a graph of URF resources. Nevertheless, although SURF syntax maintains compliance with the URF model, the implementation and use of SURF does not require use of the URF model.

Whitespace

SURF considers the following characters to be whitespace, including all characters in the Unicode Space_Separator (Zs) category.

whitespace ⇒ tab | vtab | ff | sp | nbsp | zwnbspr | Space_Separator
tab ⇒ CHARACTER TABULATION (U+0009)
vtab ⇒ LINE TABULATION (U+000B)
ff ⇒ FORM FEED (FF) (U+000C)
sp ⇒ SPACE (U+0020)
nbsp ⇒ NO-BREAK SPACE (U+00A0)
zwnbspr ⇒ ZERO WIDTH NO-BREAK SPACE (U+FEFF)

This specification uses the MIDDLE DOT character · to represent zero or more whitespace characters.

· ⇒ whitespace*
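This example is non-normative. The whitespace production can be checked with the Unicode character database, as in the following Python sketch. The name is_surf_whitespace is hypothetical; the set of explicitly listed characters is taken from the productions above.

```python
import unicodedata

# Characters named explicitly by the whitespace production:
# tab, vtab, ff, sp, nbsp, zwnbspr.
_EXPLICIT_WHITESPACE = frozenset("\t\u000b\u000c \u00a0\ufeff")


def is_surf_whitespace(ch: str) -> bool:
    """Return True if the single character ch is SURF whitespace."""
    return ch in _EXPLICIT_WHITESPACE or unicodedata.category(ch) == "Zs"
```

Note that U+FEFF is in the Format (Cf) category rather than Space_Separator (Zs), which is why it must be listed explicitly, and that line ending characters are not whitespace under this production.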

Line Endings

SURF recognizes the CARRIAGE RETURN (CR) character (U+000D), the LINE FEED (LF) character (U+000A), and any Unicode Line_Separator (Zl) or Paragraph_Separator (Zp) character as marking the end of a line. A SURF parser must behave as if every CRLF sequence, as well as every CR not followed by an LF, were normalized to a single LF. A SURF serializer should use the conventional line ending sequence supported by the platform on which it is running if that sequence is allowed by this specification.

eol ⇒ cr | lf | Line_Separator | Paragraph_Separator
cr ⇒ CARRIAGE RETURN (CR) (U+000D)
lf ⇒ LINE FEED (LF) (U+000A)
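This example is non-normative. The required normalization of CRLF and lone CR can be sketched in Python with a single substitution. The function name normalize_line_endings is hypothetical. Line_Separator and Paragraph_Separator characters are left untouched here, because the normalization rule above mentions only CR and LF.

```python
import re

# A CR optionally followed by an LF: matches both CRLF and a lone CR.
_CR_SEQUENCE = re.compile(r"\r\n?")


def normalize_line_endings(text: str) -> str:
    """Replace every CRLF and every lone CR with a single LF."""
    return _CR_SEQUENCE.sub("\n", text)
```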

Comments

Line Comments

A line comment may appear before the end of any line. A line comment begins with the EXCLAMATION MARK character ! (U+0021) and proceeds to the next line ending character.

line_comment ⇒ '!' [^eol]*

Filler

Some structures allow the addition of whitespace, line comments, and/or line endings; these are collectively referred to as filler.

filler ⇒ (whitespace | line_comment | eol)*
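This example is non-normative. A hand-written parser typically skips filler between tokens. The Python sketch below follows the filler production directly; the name skip_filler and the choice to return an index are assumptions of this example.

```python
import unicodedata

EOL_CHARS = frozenset("\r\n\u2028\u2029")
EXPLICIT_WHITESPACE = frozenset("\t\u000b\u000c \u00a0\ufeff")


def skip_filler(text: str, pos: int = 0) -> int:
    """Return the index of the first character at or after pos that is not filler."""
    while pos < len(text):
        ch = text[pos]
        if ch == "!":
            # A line comment runs up to, but does not include, the next line ending.
            while pos < len(text) and text[pos] not in EOL_CHARS:
                pos += 1
        elif (ch in EOL_CHARS or ch in EXPLICIT_WHITESPACE
              or unicodedata.category(ch) == "Zs"):
            pos += 1
        else:
            break
    return pos
```

For instance, given the text "  ! note" followed by a line feed, a tab, and "*Foo", the function returns the index of the ASTERISK character.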

Sequences

Several SURF types allow components to be presented in a sequence. A sequence is a syntactical construct indicated by the form item-sequence, where item is the construct that may appear zero or more times in the sequence.

Any two items in a sequence are separated by a sequence separator, which is either a COMMA character , (U+002C) optionally surrounded by filler; or filler with at least one line break but without a COMMA character. If a COMMA character is present, an item must follow. If no COMMA character or filler is present, an item must not follow. This means that filler may end a sequence or appear in an empty sequence.

item-sequence ⇒ filler [ item (sequence_next_comma_separated | sequence_next_break_separated)* filler ]
sequence_next_comma_separated ⇒ filler ',' filler item
sequence_next_break_separated ⇒ · line_comment? eol filler item
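This example is non-normative. The separator rule can be illustrated by examining the text found between two adjacent items. The Python sketch below assumes that this text consists only of filler and COMMA characters; the name is_sequence_separator is hypothetical.

```python
import re

# A line comment extends to the next line ending, so commas inside it are ignored.
_LINE_COMMENT = re.compile(r"![^\r\n\u2028\u2029]*")
_EOL = re.compile(r"[\r\n\u2028\u2029]")


def is_sequence_separator(between: str) -> bool:
    """Check whether the text between two items forms a valid sequence separator."""
    stripped = _LINE_COMMENT.sub("", between)
    commas = stripped.count(",")
    if commas == 1:
        # A single comma, optionally surrounded by filler.
        return True
    if commas == 0:
        # Without a comma, at least one line break is required.
        return _EOL.search(stripped) is not None
    return False
```

Under this sketch a lone space between two items is not a separator, while a line break, a comma, or a comma followed by a comment and a line break each separate two items.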

Handles

Example SURF handles.

foo
fooBar
foo_bar
touché
काम
chem-salt
crypto-salt
User
chem-Molecule

A name token in SURF must begin with a character from the Unicode Letter (L) category; followed by zero or more characters each from the Letter (L) category, from the Mark (M) category, from the Decimal_Number (Nd) category, or from the Connector_Punctuation (Pc) category. The sequence of Unicode code points in a name must follow Normalization Form C (NFC) as per UAX #15.

name_token ⇒ Letter (Letter | Mark | Decimal_Number | Connector_Punctuation)*

In URF each SURF handle represents a name inside the ad-hoc namespace https://urf.name/, with each prefix indicating an informal subnamespace of that IRI. For instance the handle example-FooBar represents the name FooBar in the namespace https://urf.name/example/, that is the tag https://urf.name/example/FooBar. Nevertheless a SURF parser may treat SURF names as opaque identifiers. The separate TURF format allows defining custom namespaces.
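This example is non-normative. The mapping from a handle to a tag in the ad-hoc namespace might be sketched in Python as follows. The function name handle_to_tag is hypothetical. The single-prefix result matches the example-FooBar case given above; treating several prefixes as nested path segments is an assumption of this sketch, as the text above only describes each prefix as an informal subnamespace.

```python
URF_AD_HOC_NAMESPACE = "https://urf.name/"


def handle_to_tag(handle: str) -> str:
    """Map a SURF handle to a tag IRI in the URF ad-hoc namespace."""
    # Everything before the last HYPHEN-MINUS is a prefix; the rest is the name.
    *prefixes, name = handle.split("-")
    # Assumption: each prefix becomes one path segment of the namespace IRI.
    return URF_AD_HOC_NAMESPACE + "".join(prefix + "/" for prefix in prefixes) + name
```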

The SURF prefix delimiter - is meant to parallel that used by HTML5 data-* attributes, that used by HTML Custom Elements, as well as naming conventions such as used for npm packages and Maven artifacts. The SURF informal namespace prefix mechanism attempts to strike a balance between the complex, draconian URL associations of Namespaces in XML 1.0 and a lackadaisical, free-for-all identifier situation such as in JSON.

A name, which is a name token, may be introduced by one or more prefixes, each itself a name token. These segments are separated by the HYPHEN-MINUS character - (U+002D), and together are referred to as a handle. An example of a handle is example-FooBar.

handle ⇒ (segment '-')* name
segment ⇒ name_token
name ⇒ name_token
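This example is non-normative. The name_token and handle productions can be checked against Unicode general categories, as in the following Python sketch. The names is_name_token and is_handle are hypothetical. The sketch covers only the productions and the NFC requirement; it does not check any reserved tokens or prefixes.

```python
import unicodedata


def is_name_token(token: str) -> bool:
    """Check a token against the name_token production and the NFC requirement."""
    if not token or not unicodedata.category(token[0]).startswith("L"):
        return False
    for ch in token[1:]:
        category = unicodedata.category(ch)
        # Letter (L*), Mark (M*), Decimal_Number (Nd), Connector_Punctuation (Pc).
        if not (category[0] in "LM" or category in ("Nd", "Pc")):
            return False
    return unicodedata.normalize("NFC", token) == token


def is_handle(text: str) -> bool:
    """Check that every HYPHEN-MINUS separated segment is a name token."""
    return all(is_name_token(segment) for segment in text.split("-"))
```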

Authors of SURF documents should use prefixes corresponding to a reverse series of domain name components for a domain that the author controls or has authority to use, starting with either the top-level domain or the second-level domain. The owner of the example.com domain, for example, might create a handle com-example-FooBar or example-FooBar.

SURF documents must not use handles beginning with the urf- prefix unless defined by one of the URF specifications. The example- prefix is reserved for use as examples in documentation or for private testing. There are no restrictions on using SURF handles with no prefixes, although authors should follow conventions that may develop associating semantics with certain names.

The tokens false and true must not appear as handles in a SURF document.

Resources

A SURF document must contain at most a single resource, which may recursively contain other resources. A resource consists of an optional label followed by a resource representation.

document ⇒ filler resource? filler
resource ⇒ label? · resource_representation | label
described_resource ⇒ label? · resource_representation · description? | label
resource_representation ⇒ object | literal | collection
literal ⇒ binary | boolean | character | email | iri | media_type | number | regex | string | telephone | temporal | uuid
collection ⇒ list | map | set

A label consists of an identifier (a SURF name, a string, or an IRI) surrounded by matching VERTICAL LINE characters | (U+007C). The first occurrence of a label with a particular identifier may include a resource representation; if no resource representation is present at the first appearance of a label with some identifier, an object with no type and no description is implied. Subsequent appearances of a label with the same identifier must not include a resource representation. A nested resource representation may refer to the label of an outer resource in the graph.

label ⇒ '|' (alias | id | tag) '|'
alias ⇒ name_token
id ⇒ string
tag ⇒ iri

If a label uses a SURF name as its identifier, it indicates an alias for referencing resources only within the confines of the SURF document. If the identifier is an IRI, it is a tag and provides a unique identifier for the resource across all SURF documents. A SURF tag must not contain an IRI fragment. A string as the identifier functions as an ID for an object, unique only for a certain object type.

A tag or an ID must not appear in front of any resource representation other than an object. An ID must not appear in front of a resource representation without an indicated type. A SURF parser must provide tags and IDs as part of the parsed data.

Objects

Objects are general resources that have an optional type and may be described by a description.

object ⇒ '*' · type?
type ⇒ handle

Descriptions

A description must not follow any resource representation other than an object. A description must not contain more than one property with the same handle, and a SURF parser must consider such a condition as a non-recoverable error.

description ⇒ ':' property-sequence ';'
property ⇒ handle filler '=' filler resource
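This example is non-normative. The rule that a duplicate property handle is a non-recoverable error might be enforced while collecting parsed properties, as in this Python sketch. The name build_description and the use of (handle, value) pairs as input are assumptions of this example.

```python
def build_description(properties):
    """Collect (handle, value) pairs into a dict, failing on a repeated handle."""
    description = {}
    for handle, value in properties:
        if handle in description:
            # Treated as non-recoverable: the parser stops rather than choosing a value.
            raise ValueError(f"duplicate property in description: {handle}")
        description[handle] = value
    return description
```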

Literals

SURF literals are lexical representations of resources.

Binary

SURF uses the binary delimiter % because it resembles 0 and 1.

URF allows the encoding of an arbitrary sequence of octets. Zero or more bytes must be encoded using the “Base 64 Encoding” defined in RFC 4648, beginning with the PERCENT SIGN character % (U+0025). The encoding must use the “base64url” alphabet and must not include Base 64 padding.

binary ⇒ '%' rfc_4648_base64url
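This example is non-normative. The binary literal can be produced and consumed with the Python standard library, as sketched below. The names encode_binary and decode_binary are hypothetical. Because base64.urlsafe_b64decode expects padding, the sketch restores it before decoding.

```python
import base64


def encode_binary(data: bytes) -> str:
    """Serialize octets as a SURF binary literal: '%' plus unpadded base64url."""
    return "%" + base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def decode_binary(literal: str) -> bytes:
    """Parse a SURF binary literal back into octets."""
    if not literal.startswith("%"):
        raise ValueError("a binary literal must begin with '%'")
    body = literal[1:]
    if "=" in body:
        raise ValueError("a binary literal must not include Base 64 padding")
    # Restore the padding that the standard library decoder expects.
    # Note: this decoder is lenient about characters outside the alphabet;
    # a strict parser would validate the body first.
    return base64.urlsafe_b64decode(body + "=" * (-len(body) % 4))
```

For instance, the two octets FB FF are serialized as %-_8, using the HYPHEN-MINUS and LOW LINE characters of the “base64url” alphabet, and zero octets are serialized as a lone PERCENT SIGN.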

Boolean

A Boolean is either of the tokens true or false.

boolean ⇒ "false" | "true"

Character

A SURF character is a representation of a Unicode code point, delimited on both sides by the APOSTROPHE character ' (U+0027). The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The APOSTROPHE, REVERSE SOLIDUS, and control characters must not appear in a character unless they are escaped. The following escape sequences are allowed:

\\: REVERSE SOLIDUS (U+005C)
\/: SOLIDUS (U+002F)
\': APOSTROPHE (U+0027)
\b: BACKSPACE (U+0008)
\f: FORM FEED (FF) (U+000C)
\n: LINE FEED (LF) (U+000A)
\r: CARRIAGE RETURN (CR) (U+000D)
\t: CHARACTER TABULATION (U+0009)
\v: LINE TABULATION (U+000B)
\uXXXX: Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.

A SURF parser must correctly interpret characters outside the Basic Multilingual Plane, whether represented as a literal character or as an escaped Unicode code point.

TODO production

Email Address

An email address in SURF begins with the CIRCUMFLEX ACCENT character ^ (U+005E) commonly known as a “caret”, followed by the “addr-spec” format specified in RFC 5322. The representation must not include any obsolete elements (those starting with the prefix “obs-”) in RFC 5322. The representation must not include any “comments” or “folding white space” as defined by RFC 5322.

email ⇒ '^' rfc_5322_addr_spec

IRI

An Internationalized Resource Identifier (IRI) is a sequence of Unicode characters for identifying a resource as defined in RFC 3987. In SURF an IRI is placed between a LESS-THAN SIGN character < (U+003C) and a GREATER-THAN SIGN character > (U+003E).

iri ⇒ '<' ( rfc_3987_IRI | email | telephone | uuid ) '>'

If an email address, telephone number, or UUID appears between the delimiters, it represents an “IRI short form” that is equivalent to a literal IRI according to the following rules:

email: The email address is converted into an IRI with a scheme of mailto according to RFC 6068.
telephone: The telephone is converted into an IRI with a scheme of tel according to RFC 3966.
uuid: The UUID is converted into an IRI with a scheme of urn and a URN namespace of uuid according to RFC 4122.
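This example is non-normative. The three short form conversions can be sketched in Python by dispatching on the introducing delimiter. The function name expand_iri_short_form and the address jdoe@example.com are illustrative only. The sketch assumes an email address containing no characters that RFC 6068 would require to be percent-encoded.

```python
def expand_iri_short_form(short_form: str) -> str:
    """Convert the content of an IRI short form into the equivalent literal IRI."""
    if short_form.startswith("^"):
        # Email address: drop the CIRCUMFLEX ACCENT and use the mailto scheme.
        return "mailto:" + short_form[1:]
    if short_form.startswith("+"):
        # Telephone number: the PLUS SIGN is part of the global number itself.
        return "tel:" + short_form
    if short_form.startswith("&"):
        # UUID: drop the AMPERSAND and use the urn:uuid namespace.
        return "urn:uuid:" + short_form[1:]
    raise ValueError("not an IRI short form")
```

Under these assumptions <^jdoe@example.com> is equivalent to <mailto:jdoe@example.com>, and <+12015550123> is equivalent to <tel:+12015550123>.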

Media Type

A media type, sometimes referred to as a “content type”, indicates the type of content contained in a resource and is essential for navigating the World Wide Web. It consists of a type and subtype, optionally followed by one or more parameters.

One of the most common media types is text/html;charset=UTF-8 indicating an HTML document using the UTF-8 charset. In SURF this can be represented using >html;charset=utf-8<.

SURF places the media type between the GREATER-THAN SIGN character > (U+003E) and the LESS-THAN SIGN character < (U+003C), in that order. This representation is not to be confused with that of an IRI, which uses the same delimiters but in a different order.

media_type ⇒ '>' rfc_6838_media_type '<'

Example SURF media types with meanings.

>xml<: text/xml
>markdown;charset=utf-8<: text/markdown;charset=UTF-8
>text/markdown;charset=UTF-8<: text/markdown;charset=UTF-8
>image/png<: image/png

The syntax of the media type is that prescribed by RFC 6838 with the following additional restrictions and recommendations:

The type, subtype, and parameter name(s) should be in lowercase.
If the top-level type is text, the top-level type and its following SOLIDUS (U+002F) delimiter may be left out. For example, the media type text/plain may be indicated as simply >plain< in SURF.
The value of the charset parameter should be in lowercase.
If any value for any parameter other than charset is case-insensitive, it must be in lowercase.
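This example is non-normative. Restoring an omitted text top-level type might be sketched in Python as follows. The function name expand_media_type is hypothetical. The sketch only restores the omitted type; it does not validate the RFC 6838 syntax or change the case of any component.

```python
def expand_media_type(literal: str) -> str:
    """Return the full media type for a SURF media type literal such as '>xml<'."""
    if len(literal) < 2 or not (literal.startswith(">") and literal.endswith("<")):
        raise ValueError("a media type must be delimited by '>' and '<'")
    body = literal[1:-1]
    # Look for the SOLIDUS only before the first parameter, because a
    # parameter value may itself contain a SOLIDUS.
    type_part, separator, parameters = body.partition(";")
    if "/" not in type_part:
        type_part = "text/" + type_part
    return type_part + separator + parameters
```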

Number

A number represents a numerical value in base 10 that may be negative and may be fractional. If the number begins with the DOLLAR SIGN character $ (U+0024), it is considered a decimal regardless of the presence or absence of a fraction and/or exponent component, and a SURF parser must represent the value using a construct that exactly represents the fractional part without rounding within the supported range.

If the number does not begin with the DOLLAR SIGN character $ (U+0024) and contains neither a fraction nor an exponent component, it is considered an integer. A SURF parser may represent non-decimal numbers using IEEE 754, but it must maintain a distinction between general numbers and integers.

number ⇒ ['$'] ['-'] whole [fraction] [exponent]
whole ⇒ digit+
fraction ⇒ '.' digit+
exponent ⇒ ('e' | 'E') ['-' | '+'] digit+
digit ⇒ '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
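This example is non-normative. The distinction among decimals, integers, and general numbers can be sketched in Python using a regular expression that mirrors the number production. The function name parse_number and the choice of int, float, and decimal.Decimal as target types are assumptions of this example.

```python
import re
from decimal import Decimal

# Groups: '$' marker, sign, whole, fraction, exponent.
_NUMBER = re.compile(r"(\$)?(-)?([0-9]+)(\.[0-9]+)?([eE][-+]?[0-9]+)?")


def parse_number(text: str):
    """Parse a SURF number into a Decimal, an int, or a float."""
    match = _NUMBER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a SURF number: {text!r}")
    dollar, _sign, _whole, fraction, exponent = match.groups()
    if dollar:
        # Decimal keeps the fractional part exactly, without binary rounding.
        return Decimal(text[1:])
    if fraction is None and exponent is None:
        # int() always reads base 10 here, even with leading zeros.
        return int(text)
    return float(text)
```

With this sketch 123 yields an integer, -12.5 and 1e3 yield floating point values, and $1.10 yields an exact decimal.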

A number should be in its canonical form:

No leading zeros in the whole component except to meet the requirement of at least one digit.
No trailing zeros in the digit(s) in the fraction component except to meet the requirement of at least one digit.
No leading zeros in the digit(s) in the exponent component.
A lowercase 'e' in the exponent component.
No '+' sign in the exponent component if the exponent is non-negative.

Nevertheless the presence of any leading zero(s) in the whole component shall not be interpreted as indicating any number base other than base 10.

Regular Expression

A regular expression is surrounded by the slash or SOLIDUS character / (U+002F). The backslash or REVERSE SOLIDUS \ (U+005C) is interpreted as an escape character only if followed by a slash character /.

TODO decide on whether and how to allow flags; TODO reference regular expression standard

regex ⇒ '/' regular_expression '/'

String

A string represents a sequence of Unicode code points, delimited on both sides by the QUOTATION MARK character " (U+0022). The sequence of Unicode code points in a string should follow Normalization Form C (NFC) as per UAX #15. The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The QUOTATION MARK, REVERSE SOLIDUS, and control characters must not appear in a string unless they are escaped. The following escape sequences are allowed:

\\: REVERSE SOLIDUS (U+005C)
\/: SOLIDUS (U+002F)
\": QUOTATION MARK (U+0022)
\b: BACKSPACE (U+0008)
\f: FORM FEED (FF) (U+000C)
\n: LINE FEED (LF) (U+000A)
\r: CARRIAGE RETURN (CR) (U+000D)
\t: CHARACTER TABULATION (U+0009)
\v: LINE TABULATION (U+000B)
\uXXXX: Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.
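This example is non-normative. Decoding the escape sequences listed above, including a pair of escaped UTF-16 surrogates, might be sketched in Python as follows. The function name unescape_string_body is hypothetical, and the input is assumed to be the text between the delimiting QUOTATION MARK characters.

```python
import re

_SIMPLE_ESCAPES = {
    "\\": "\\", "/": "/", '"': '"',
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
# A backslash followed by either 'u' plus four hex digits or any single character.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.)", re.DOTALL)


def unescape_string_body(body: str) -> str:
    """Decode the escape sequences in the body of a SURF string."""
    def replace(match):
        escape = match.group(1)
        if len(escape) == 5:
            # \uXXXX: may produce a lone surrogate, combined in the final step.
            return chr(int(escape[1:], 16))
        if escape in _SIMPLE_ESCAPES:
            return _SIMPLE_ESCAPES[escape]
        raise ValueError(f"unsupported escape sequence: \\{escape}")

    decoded = _ESCAPE.sub(replace, body)
    # Round-trip through UTF-16 so each surrogate pair becomes one code point;
    # an unpaired surrogate makes the decode step raise UnicodeDecodeError.
    return decoded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

The round trip through UTF-16 is one way to satisfy the requirement that code points outside the Basic Multilingual Plane be interpreted correctly when written as two escaped surrogates.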

TODO production

Telephone

In SURF the representation of a telephone number follows the “global number” format prescribed by RFC 3966, which is a PLUS SIGN + (U+002B) followed by at least one digit. The representation must not include any “visual separators” as defined by RFC 3966.

Example SURF telephone number.

+12015550123

telephone ⇒ '+' digit+

Temporal

Although JSON does not support dates, the SURF instant format is compatible with the output of JavaScript Date.prototype.toJSON().

The SURF temporal representation encodes date and/or time information based on ISO 8601. A temporal starts with the COMMERCIAL AT character @ (U+0040). Time zone names tz are from the IANA TZ database and are case-sensitive. The format for month_day conforms to an older version of ISO 8601; the most recent version does not mention a month+day format.

The format for zoned_date_time is an extension to the ISO 8601 specification, and follows Java java.time.format.DateTimeFormatter.ISO_ZONED_DATE_TIME.

Example SURF temporal values.

@2017-02-12T23:29:18.829Z
@2017-02-12T15:29:18.829-08:00[America/Los_Angeles]
@2017-02-12T15:29:18.829-08:00
@2017-02-12-08:00
@15:29:18.829-08:00
@2017-02-12T15:29:18.829
@2017-02-12
@15:29:18.829
@2017-02
@--02-12
@2017

temporal ⇒ '@' (instant | zoned_date_time | offset_date_time | offset_date | offset_time | local_date_time | local_date | local_time | year_month | month_day | year)
instant ⇒ date 'T' time 'Z'
zoned_date_time ⇒ offset_date_time '[' tz ']'
offset_date_time ⇒ date 'T' time offset
offset_date ⇒ date offset
offset_time ⇒ time offset
local_date_time ⇒ date 'T' time
local_date ⇒ date
local_time ⇒ time
year ⇒ YYYY
year_month ⇒ year '-' MM
month_day ⇒ '-' '-' MM '-' DD
date ⇒ YYYY '-' MM '-' DD
time ⇒ hh ':' mm ':' ss ['.' s]
offset ⇒ ('+' | '-') hh ':' mm
YYYY ⇒ digit digit digit digit
MM ⇒ digit digit
DD ⇒ digit digit
hh ⇒ digit digit
mm ⇒ digit digit
ss ⇒ digit digit
s ⇒ digit digit digit [ digit digit digit [ digit digit digit ] ]
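This example is non-normative. The temporal productions can be transcribed into regular expressions to determine which form a temporal literal takes, as in the Python sketch below. The function name classify_temporal is hypothetical, and the pattern used for tz (any characters other than square brackets) is an assumption, because the productions above do not define it. The sketch checks only the lexical shape, not whether the field values are in range.

```python
import re

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
# Fractional seconds have three, six, or nine digits.
TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]{3}(?:[0-9]{3}(?:[0-9]{3})?)?)?"
OFFSET = r"[+-][0-9]{2}:[0-9]{2}"
TZ = r"[^\[\]]+"  # assumption: tz is not defined by the productions

TEMPORAL_FORMS = {
    "instant": DATE + "T" + TIME + "Z",
    "zoned_date_time": DATE + "T" + TIME + OFFSET + r"\[" + TZ + r"\]",
    "offset_date_time": DATE + "T" + TIME + OFFSET,
    "offset_date": DATE + OFFSET,
    "offset_time": TIME + OFFSET,
    "local_date_time": DATE + "T" + TIME,
    "local_date": DATE,
    "local_time": TIME,
    "year_month": r"[0-9]{4}-[0-9]{2}",
    "month_day": r"--[0-9]{2}-[0-9]{2}",
    "year": r"[0-9]{4}",
}


def classify_temporal(literal: str) -> str:
    """Return the name of the production matched by a SURF temporal literal."""
    if not literal.startswith("@"):
        raise ValueError("a temporal must begin with '@'")
    for name, pattern in TEMPORAL_FORMS.items():
        if re.fullmatch(pattern, literal[1:]):
            return name
    raise ValueError(f"not a SURF temporal: {literal!r}")
```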

TODO add support for durations

UUID

A Universally Unique IDentifier (UUID) must adhere to the requirements of RFC 4122. The SURF representation of a UUID must be introduced by the AMPERSAND character & (U+0026) and be followed by the “UUID” production given in RFC 4122.

Example SURF UUID.

&f81d4fae-7dec-11d0-a765-00a0c91e6bf6

uuid ⇒ '&' hex*8 '-' hex*4 '-' hex*4 '-' hex*4 '-' hex*12
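This example is non-normative. A UUID literal might be parsed in Python by checking its shape against the production and handing the result to the standard uuid module. The function name parse_uuid is hypothetical, and hex is assumed to mean a hexadecimal digit in either case.

```python
import re
import uuid

_HEX = "[0-9A-Fa-f]"
# '&' followed by five hyphen-separated groups of 8, 4, 4, 4, and 12 hex digits.
_UUID_LITERAL = re.compile(
    "&(" + "-".join(_HEX + "{%d}" % count for count in (8, 4, 4, 4, 12)) + ")"
)


def parse_uuid(literal: str) -> uuid.UUID:
    """Parse a SURF UUID literal such as '&f81d4fae-7dec-11d0-a765-00a0c91e6bf6'."""
    match = _UUID_LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a SURF UUID: {literal!r}")
    return uuid.UUID(match.group(1))
```

The urn attribute of the resulting uuid.UUID object gives the urn:uuid form of the identifier.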

Collections

Collection resources represent abstract data types that can hold other resources.

List

A SURF list is an ordered sequence of zero or more resources with optional descriptions, beginning with a LEFT SQUARE BRACKET character [ (U+005B) and ending with a RIGHT SQUARE BRACKET character ] (U+005D). A SURF parser should represent a SURF list using a corresponding language construct that represents list semantics.

list ⇒ '[' described_resource-sequence ']'

Map

A SURF map is a sequence of associations between a key and a value. A map begins with a LEFT CURLY BRACKET character { (U+007B) and ends with a RIGHT CURLY BRACKET character } (U+007D). Keys and values can be any resources. If a key is an object with a description, the key must be surrounded by the REVERSE SOLIDUS character \ (U+005C). The key and value in each association or entry are separated by a COLON character : (U+003A).

A map should not have entries with duplicate keys, and a SURF serializer must not produce a map with duplicate-key entries. A SURF parser must ignore all but one of each entry with the same key. TODO revisit; this brings JSON compatibility, but could cause problems with tags if a duplicate entry is ignored; also address key equality

A SURF parser should represent a SURF map using a corresponding language construct that represents map semantics.

map ⇒ '{' entry-sequence '}'
entry ⇒ key filler ':' filler value
key ⇒ '\' described_resource '\' | resource
value ⇒ described_resource

Set

A set in SURF is an unordered sequence of zero or more resources with optional descriptions, beginning with a LEFT PARENTHESIS character ( (U+0028) and ending with a RIGHT PARENTHESIS character ) (U+0029). The same resource must not appear more than once in a set. A SURF parser should represent a SURF set using a corresponding language construct that represents set semantics.

set ⇒ '(' described_resource-sequence ')'

References

IEEE 754-2008: IEEE Standard for Floating-Point Arithmetic. IEEE.
ISO 8601:2004: Data elements and interchange formats — Information interchange — Representation of dates and times, third edition, 2004-12-01. ISO.
RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner (Harvard University). IETF.
RFC 3966: The tel URI for Telephone Numbers, H. Schulzrinne (Columbia University). IETF.
RFC 3986: Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee (W3C/MIT), R. Fielding (Day Software), L. Masinter (Adobe Systems). IETF.
RFC 3987: Internationalized Resource Identifiers (IRIs), M. Duerst (W3C), M. Suignard (Microsoft Corporation). IETF.
RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace, P. Leach (Microsoft Corporation), M. Mealling (Refactored Networks, LLC), R. Salz (DataPower Technology, Inc.). IETF.
RFC 4648: The Base16, Base32, and Base64 Data Encodings, S. Josefsson (SJD). IETF.
RFC 5322: Internet Message Format, P. Resnick, Ed. (Qualcomm Incorporated). IETF.
RFC 6068: The 'mailto' URI Scheme, M. Duerst (Aoyama Gakuin University), L. Masinter (Adobe Systems Incorporated), J. Zawinski (DNA Lounge). IETF.
RFC 6838: Media Type Specifications and Registration Procedures, N. Freed (Oracle), J. Klensin, T. Hansen (AT&T Laboratories). IETF.
RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray (Google, Inc.). IETF.
TZ: Time Zone Database. IANA.
UAX #15: Unicode® Standard Annex #15: Unicode Normalization Forms, Mark Davis, Ken Whistler. The Unicode Consortium.