Simple URF (SURF) Specification

Author
Garret Wilson (GlobalMentor, Inc.)
Version
Draft 2020-06-20

Introduction

Simple URF (SURF) is a compact, text-based, human-readable persistence format for a directed graph of data values. It is similar in purpose to JSON yet less verbose and more expressive, supporting, for example, a greater number of types. Moreover, as one of the primary serializations (along with TURF) of the Uniform Resource Framework (URF), it rigorously represents semantic data in a simpler manner than serializations of the Resource Description Framework (RDF).

Definitions

SURF parser
Any software component that interprets SURF syntax and produces an appropriate data model according to this specification.
SURF serializer
Any software component that produces SURF syntax complying with this specification to reflect some data model.

Design Constraints

This section is non-normative.

The following considerations were used to guide the creation of this specification:

Conventions Used in this Document

The key words “must”, “must not”, “required”, “shall”, “shall not”, “should”, “should not”, “recommended”, “may”, and “optional” in this document are to be interpreted as described in RFC 2119. Parts of this specification marked as notes and annotations are non-normative.

Internet Media Type

The Internet media type (RFC 6838) of a SURF document shall be text/simple-urf. A SURF document must be encoded in UTF-8 and must not begin with a so-called byte order mark (BOM) or UTF-8 signature.

Structure

The content of a SURF document encodes a graph of resources defined by the Uniform Resource Framework (URF) with a single resource as the root of the graph. A SURF document may be empty, representing no resources. A SURF parser or a SURF serializer may represent a SURF document as a graph of URF resources. Nevertheless, although SURF syntax maintains compliance with the URF model, the implementation and use of SURF do not require use of the URF model.

Whitespace

SURF considers the following characters as whitespace, including characters in the Unicode Space_Separator (Zs) category.

This specification uses the MIDDLE DOT character · to represent zero or more whitespace characters.

Line Endings

SURF recognizes the CARRIAGE RETURN (CR) character (U+000D), the LINE FEED (LF) character (U+000A), and any Unicode Line_Separator (Zl) or Paragraph_Separator (Zp) character as marking the end of a line. A SURF parser must behave as if every CRLF sequence as well as every CR not followed by a LF were normalized to a single LF. A SURF serializer should use the conventional line ending sequence supported by the platform on which it is running if that sequence is allowed by this specification.
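The normalization rule for parsers can be illustrated with a minimal Python sketch. The function name normalize_line_endings is hypothetical and not part of this specification; the sketch only covers the CRLF and lone CR cases and leaves Zl and Zp characters untouched.

```python
import re

# Match CRLF first so that the pair collapses to one LF, then any lone CR.
_CR_ENDINGS = re.compile(r"\r\n|\r")


def normalize_line_endings(text: str) -> str:
    """Return text with every CRLF and every CR not followed by LF replaced by LF."""
    return _CR_ENDINGS.sub("\n", text)
```

A parser could apply such a step before tokenizing, or fold the same logic into its character reader.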

Comments

Line Comments

A line comment may appear before the end of any line. A line comment begins with the EXCLAMATION MARK character ! (U+0021) and proceeds to the next line ending character.

Filler

Some structures allow the addition of whitespace, line comments, and/or line endings; these are collectively referred to as filler.

Sequences

Several SURF types allow components to be presented in a sequence. A sequence is a syntactical construct indicated by the form item-sequence, where item is the construct that may appear zero or more times in the sequence.

Any two items in a sequence are separated by a sequence separator, which is either a COMMA character , (U+002C) optionally surrounded by filler; or filler with at least one line break but without a COMMA character. If a COMMA character is present, an item must follow. If no COMMA character or filler is present, an item must not follow. This means that filler may end a sequence or appear in an empty sequence.

Handles

Example SURF handles.

A name token in SURF must begin with a character from the Unicode Letter (L) category; followed by zero or more characters each from the Letter (L) category, from the Mark (M) category, from the Decimal_Number (Nd) category, or from the Connector_Punctuation (Pc) category. The sequence of Unicode code points in a name must follow Normalization Form C (NFC) as per UAX #15.

A name, which is a name token, may be introduced by one or more prefixes, each itself a name token. These segments are separated by the HYPHEN-MINUS character - (U+002D), and together are referred to as a handle. An example of a handle is example-FooBar.

Authors of SURF documents should use prefixes corresponding to a reverse series of domain name components for a domain that the author controls or has authority to use, starting with either the top-level domain or the second-level domain. The owner of the example.com domain, for example, might create a handle com-example-FooBar or example-FooBar.

SURF documents must not use handles beginning with the urf- prefix unless defined by one of the URF specifications. The example- prefix is reserved for use as examples in documentation and for private testing. There are no restrictions on using SURF handles with no prefixes, although authors should follow conventions that may develop associating semantics with certain names.

The tokens false and true must not appear as handles in a SURF document.
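The handle rules above can be checked with a short Python sketch built on the standard unicodedata module. The function names is_name_token and is_handle are hypothetical. The sketch assumes the whole input is a candidate handle, and it does not enforce the restriction on the urf- prefix, which depends on what the URF specifications define.

```python
import unicodedata


def is_name_token(token: str) -> bool:
    """Check the name token rules: a Letter first, then L, M, Nd, or Pc, all in NFC."""
    if not token or not unicodedata.category(token[0]).startswith("L"):
        return False
    for ch in token[1:]:
        category = unicodedata.category(ch)
        if not (category[0] in "LM" or category in ("Nd", "Pc")):
            return False
    # The code point sequence must already be in Normalization Form C.
    return unicodedata.normalize("NFC", token) == token


def is_handle(text: str) -> bool:
    """Check a handle: name tokens separated by HYPHEN-MINUS, excluding true and false."""
    if text in ("true", "false"):
        return False
    return all(is_name_token(segment) for segment in text.split("-"))
```

With these definitions, is_handle("example-FooBar") holds, while is_handle("true") and is_handle("9lives") do not.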

Resources

A SURF document must contain at most a single resource, which may recursively contain other resources. A resource consists of an optional label followed by a resource representation.

A label consists of an identifier (either a SURF name, a string, or an IRI) surrounded by matching VERTICAL LINE characters | (U+007C). The first occurrence of a label with a particular identifier may include a resource representation; if no resource representation is present at the first appearance of a label with some identifier, an object with no type and no description is implied. Subsequent appearances of a label with the same identifier must not include a resource representation. A nested resource representation may refer to the label of an outer resource in the graph.

If a label uses a SURF name as its identifier, it indicates an alias for referencing resources only within the confines of the SURF document. If the identifier is an IRI, it is a tag and provides a unique identifier for the resource across all SURF documents. A SURF tag must not contain an IRI fragment. A string as the identifier functions as an ID for an object, unique only for a certain object type.
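As an illustration of the three kinds of label, the following Python sketch classifies an already isolated label by the form of its identifier. The function name classify_label and the returned kind strings are hypothetical. The sketch assumes the IRI identifier is written between < and > and the string identifier between quotation marks, as in their literal forms; it does not validate the name or decode string escapes.

```python
from typing import Tuple


def classify_label(label: str) -> Tuple[str, str]:
    """Return (kind, identifier), where kind is "alias", "tag", or "id"."""
    if len(label) < 2 or label[0] != "|" or label[-1] != "|":
        raise ValueError("a label must be surrounded by VERTICAL LINE characters")
    identifier = label[1:-1]
    if len(identifier) >= 2 and identifier[0] == "<" and identifier[-1] == ">":
        iri = identifier[1:-1]
        if "#" in iri:
            raise ValueError("a SURF tag must not contain an IRI fragment")
        return ("tag", iri)
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        return ("id", identifier[1:-1])
    # Anything else is treated as a SURF name, that is, a document-local alias.
    return ("alias", identifier)
```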

A tag or an ID must not appear in front of any resource representation other than an object. An ID must not appear in front of a resource representation without an indicated type. A SURF parser must provide tags and IDs as part of the parsed data.

Objects

Objects are general resources that have an optional type and may be described by a description.

Descriptions

A description must not follow any resource representation other than an object. A description must not contain more than one property with the same handle, and a SURF parser must consider such a condition as a non-recoverable error.

Literals

SURF literals are lexical representations of resources.

Binary

URF allows the encoding of an arbitrary sequence of octets. Zero or more bytes must be encoded using the “Base 64 Encoding” defined in RFC 4648, beginning with the PERCENT SIGN character % (U+0025). The encoding must use the “base64url” alphabet and must not include Base 64 padding.
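A minimal Python sketch of the binary literal, using the standard base64 module. The function names encode_binary and decode_binary are hypothetical. Note that base64.urlsafe_b64decode is lenient and silently discards characters outside the alphabet, so a strict parser would need additional validation.

```python
import base64


def encode_binary(data: bytes) -> str:
    """Encode octets as a SURF binary literal: % plus unpadded base64url."""
    return "%" + base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_binary(literal: str) -> bytes:
    """Decode a SURF binary literal back into octets."""
    if not literal.startswith("%"):
        raise ValueError("a binary literal must begin with PERCENT SIGN")
    body = literal[1:]
    if "=" in body:
        raise ValueError("Base 64 padding is not allowed")
    # Restore the padding that the Python decoder expects.
    padded = body + "=" * (-len(body) % 4)
    return base64.urlsafe_b64decode(padded)
```

An empty sequence of octets encodes as the lone character %.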

Boolean

A Boolean is either of the tokens true or false.

Character

A SURF character is a representation of a Unicode code point, delimited on both sides by the APOSTROPHE character ' (U+0027). The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The APOSTROPHE, REVERSE SOLIDUS, and control characters must not appear in a character unless they are escaped. The following escape sequences are allowed:

\\
REVERSE SOLIDUS (U+005C)
\/
SOLIDUS (U+002F)
\'
APOSTROPHE (U+0027)
\b
BACKSPACE (U+0008)
\f
FORM FEED (FF) (U+000C)
\n
LINE FEED (LF) (U+000A)
\r
CARRIAGE RETURN (CR) (U+000D)
\t
CHARACTER TABULATION (U+0009)
\v
LINE TABULATION (U+000B)
\uXXXX
Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.

A SURF parser must correctly interpret characters outside the Basic Multilingual Plane, whether represented as a literal character or as an escaped Unicode code point.

TODO production

Email Address

An email address in SURF begins with the CIRCUMFLEX ACCENT character ^ (U+005E) commonly known as a “caret”, followed by the “addr-spec” format specified in RFC 5322. The representation must not include any obsolete elements (those starting with the prefix “obs-”) in RFC 5322. The representation must not include any “comments” or “folding white space” as defined by RFC 5322.

IRI

An Internationalized Resource Identifier (IRI) is a sequence of Unicode characters for identifying a resource as defined in RFC 3987. In SURF an IRI is placed between a LESS-THAN SIGN character < (U+003C) and a GREATER-THAN SIGN character > (U+003E).

If an email address, telephone number, or UUID appears between the delimiters, it represents an “IRI short form” that is equivalent to a literal IRI according to the following rules:

email
The email address is converted into an IRI with a scheme of mailto according to RFC 6068.
telephone
The telephone number is converted into an IRI with a scheme of tel according to RFC 3966.
uuid
The UUID is converted into an IRI with a scheme of urn and a URN namespace of uuid according to RFC 4122.
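The conversion rules can be sketched in Python as follows. The function name expand_iri_short_form is hypothetical. The sketch assumes the content between the delimiters carries the introducing character of the corresponding SURF literal (^ for an email address, + for a telephone number, & for a UUID). The percent-encoding applied to the email address is a conservative simplification of RFC 6068 rather than a complete implementation.

```python
from urllib.parse import quote


def expand_iri_short_form(content: str) -> str:
    """Expand the text found between < and > into a full IRI."""
    if content.startswith("^"):
        # Email address: percent-encode everything except unreserved characters and @.
        return "mailto:" + quote(content[1:], safe="@")
    if content.startswith("+"):
        # Telephone number: the global number keeps its PLUS SIGN.
        return "tel:" + content
    if content.startswith("&"):
        return "urn:uuid:" + content[1:]
    # Anything else is taken to be a literal IRI already.
    return content
```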

Media Type

A media type, sometimes referred to as a “content type”, indicates the type of content contained in a resource and is essential for navigating the World Wide Web. It consists of a type and subtype, optionally followed by one or more parameters.

SURF places the media type between the GREATER-THAN SIGN character > (U+003E) and the LESS-THAN SIGN character < (U+003C), in that order. This representation is not to be confused with that of an IRI, which uses the same delimiters but in a different order.

Example SURF media types with meanings.
>xml<
text/xml
>markdown;charset=utf-8<
text/markdown;charset=UTF-8
>text/markdown;charset=UTF-8<
text/markdown;charset=UTF-8
>image/png<
image/png

The syntax of the media type is that prescribed by RFC 6838 with the following additional restrictions and recommendations:

Number

A number represents a numerical value in base 10 that may be negative and may be fractional. If the number begins with the DOLLAR SIGN character $ (U+0024), it is considered a decimal regardless of the presence or absence of a fraction and/or exponent component, and a SURF parser must represent the value using a construct that exactly represents the fractional part without rounding within the supported range.

If the number does not begin with the DOLLAR SIGN character $ (U+0024) and contains neither a fraction nor an exponent component, it is considered an integer. A SURF parser may represent non-decimal numbers using IEEE 754, but it must maintain a distinction between general numbers and integers.
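The distinction between decimals, integers, and general numbers might be mapped onto Python types as in the sketch below. The function name parse_number is hypothetical, and the sketch assumes the exponent is introduced by e or E. Python's own conversions accept more forms than SURF presumably allows, such as a leading plus sign or surrounding whitespace, so a strict parser would validate the lexical form first.

```python
from decimal import Decimal
from typing import Union


def parse_number(literal: str) -> Union[Decimal, float, int]:
    """Map a SURF number literal onto Decimal, float, or int."""
    if literal.startswith("$"):
        # Decimal: keep the fractional part exactly, without binary rounding.
        return Decimal(literal[1:])
    if any(ch in literal for ch in ".eE"):
        # General number with a fraction or exponent: IEEE 754 double.
        return float(literal)
    # No fraction and no exponent: an integer, always read as base 10.
    return int(literal)
```

Keeping int and float as separate result types preserves the required distinction between integers and general numbers.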

A number should be in its canonical form:

Nevertheless, the presence of any leading zero(s) in the whole component shall not be interpreted as indicating any number base other than base 10.

Regular Expression

A regular expression is delimited on both sides by the slash or SOLIDUS character / (U+002F). The backslash or REVERSE SOLIDUS \ (U+005C) is interpreted as an escape character only if followed by a slash character /.

TODO decide on whether and how to allow flags; TODO reference regular expression standard

String

A string represents a sequence of Unicode code points, delimited on both sides by the QUOTATION MARK character " (U+0022). The sequence of Unicode code points in a string should follow Normalization Form C (NFC) as per UAX #15. The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The QUOTATION MARK, REVERSE SOLIDUS, and control characters must not appear in a string unless they are escaped. The following escape sequences are allowed:

\\
REVERSE SOLIDUS (U+005C)
\/
SOLIDUS (U+002F)
\"
QUOTATION MARK (U+0022)
\b
BACKSPACE (U+0008)
\f
FORM FEED (FF) (U+000C)
\n
LINE FEED (LF) (U+000A)
\r
CARRIAGE RETURN (CR) (U+000D)
\t
CHARACTER TABULATION (U+0009)
\v
LINE TABULATION (U+000B)
\uXXXX
Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.
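The escape sequences listed above could be decoded as in the following Python sketch, which operates on the text between the quotation marks. The function name unescape_string_body is hypothetical. The sketch does not reject unescaped control characters, and it rejects any escape sequence not in the list.

```python
import re

# Single-character escapes permitted inside a SURF string.
_SIMPLE_ESCAPES = {
    "\\": "\\", "/": "/", '"': '"',
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
# A backslash followed by uXXXX, by any single character, or by nothing at all.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.?)", re.DOTALL)


def unescape_string_body(body: str) -> str:
    """Decode the escape sequences found between the string delimiters."""
    def replace(match):
        escape = match.group(1)
        if len(escape) == 5:
            # \uXXXX: one 16-bit code unit, possibly half of a surrogate pair.
            return chr(int(escape[1:], 16))
        if escape in _SIMPLE_ESCAPES:
            return _SIMPLE_ESCAPES[escape]
        raise ValueError("unsupported escape sequence: " + repr(match.group(0)))

    decoded = _ESCAPE.sub(replace, body)
    # Join escaped surrogate pairs into single code points; a lone surrogate
    # fails here with UnicodeDecodeError, a subclass of ValueError.
    return decoded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

The round trip through UTF-16 at the end is one way to satisfy the requirement that code points outside the Basic Multilingual Plane be written as two escaped surrogates.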

TODO production

Telephone

In SURF the representation of a telephone number follows the “global number” format prescribed by RFC 3966, which is a PLUS SIGN + (U+002B) followed by at least one digit. The representation must not include any “visual separators” as defined by RFC 3966.

Example SURF telephone number.

Temporal

The SURF temporal representation encodes date and/or time information based on ISO 8601. A temporal starts with the COMMERCIAL AT character @ (U+0040). Time zone names tz are from the IANA TZ database and are case-sensitive. The format for month_day conforms to an older version of ISO 8601; the most recent version does not mention a month+day format. The format for zoned_date_time is an extension to the ISO 8601 specification.

Example SURF temporal values.

TODO add support for durations

UUID

A Universally Unique IDentifier (UUID) must adhere to the requirements of RFC 4122. The SURF representation of a UUID must be introduced by the AMPERSAND character & (U+0026) and be followed by the “UUID” production given in RFC 4122.
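A minimal Python sketch of reading such a literal with the standard uuid module. The function name parse_uuid is hypothetical. The explicit pattern mirrors the 8-4-4-4-12 hexadecimal layout of the RFC 4122 production, because uuid.UUID on its own also accepts forms such as braces or missing hyphens.

```python
import re
import uuid

# AMPERSAND followed by the 8-4-4-4-12 hexadecimal layout of RFC 4122.
_UUID_LITERAL = re.compile(
    r"&([0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12})"
)


def parse_uuid(literal: str) -> uuid.UUID:
    """Parse a SURF UUID literal into a uuid.UUID value."""
    match = _UUID_LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError("not a SURF UUID literal: " + repr(literal))
    return uuid.UUID(match.group(1))
```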

Example SURF UUID.

Collections

Collection resources represent abstract data types that can hold other resources.

List

A SURF list is an ordered sequence of zero or more resources with optional descriptions, beginning with a LEFT SQUARE BRACKET character [ (U+005B) and ending with a RIGHT SQUARE BRACKET character ] (U+005D). A SURF parser should represent a SURF list using a corresponding language construct that represents list semantics.

Map

A SURF map is a sequence of associations between a key and a value. A map begins with a LEFT CURLY BRACKET character { (U+007B) and ends with a RIGHT CURLY BRACKET character } (U+007D). Keys and values can be any resources. If a key is an object with a description, the key must be surrounded by the REVERSE SOLIDUS character \ (U+005C). The key and value in each association or entry are separated by a COLON character : (U+003A).

A map should not have entries with duplicate keys, and a SURF serializer must not produce a map with duplicate-key entries. A SURF parser must ignore all but one of each entry with the same key. TODO revisit; this brings JSON compatibility, but could cause problems with tags if a duplicate entry is ignored; also address key equality

A SURF parser should represent a SURF map using a corresponding language construct that represents map semantics.

Set

A set in SURF is an unordered sequence of zero or more resources with optional descriptions, beginning with a LEFT PARENTHESIS character ( (U+0028) and ending with a RIGHT PARENTHESIS character ) (U+0029). The same resource must not appear more than once in a set. A SURF parser should represent a SURF set using a corresponding language construct that represents set semantics.

References

IEEE 754-2008
IEEE Standard for Floating-Point Arithmetic. IEEE.
ISO 8601:2004
Data elements and interchange formats — Information interchange — Representation of dates and times, third edition, 2004-12-01. ISO.
RFC 2119
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner (Harvard University). IETF.
RFC 3966
The tel URI for Telephone Numbers, H. Schulzrinne (Columbia University). IETF.
RFC 3986
Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee (W3C/MIT), R. Fielding (Day Software), L. Masinter (Adobe Systems). IETF.
RFC 3987
Internationalized Resource Identifiers (IRIs), M. Duerst (W3C), M. Suignard (Microsoft Corporation). IETF.
RFC 4122
A Universally Unique IDentifier (UUID) URN Namespace, P. Leach (Microsoft Corporation), M. Mealling (Refactored Networks, LLC), R. Salz (DataPower Technology, Inc.). IETF.
RFC 4648
The Base16, Base32, and Base64 Data Encodings, S. Josefsson (SJD). IETF.
RFC 5322
Internet Message Format, P. Resnick, Ed. (Qualcomm Incorporated). IETF.
RFC 6068
The 'mailto' URI Scheme, M. Duerst (Aoyama Gakuin University), L. Masinter (Adobe Systems Incorporated), J. Zawinski (DNA Lounge). IETF.
RFC 6838
Media Type Specifications and Registration Procedures, N. Freed (Oracle), J. Klensin, T. Hansen (AT&T Laboratories). IETF.
RFC 7159
The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray (Google, Inc.). IETF.
TZ
Time Zone Database. IANA.
UAX #15
Unicode® Standard Annex #15: Unicode Normalization Forms, Mark Davis, Ken Whistler. The Unicode Consortium.