Simple URF (SURF) Specification
- Author
- Garret Wilson (GlobalMentor, Inc.)
- Version
- Draft 2020-06-20
Introduction
Simple URF (SURF) is a compact, text-based, human-readable persistence format for a directed graph of data values. It is similar in purpose to JSON yet less verbose and more expressive, supporting for example a greater number of types. Moreover, as one of the primary serializations (along with TURF) of the Uniform Resource Framework (URF), it rigorously represents semantic data in a simpler manner than serializations of the Resource Description Framework (RDF).
Definitions
- SURF parser
- Any software component that interprets SURF syntax and produces an appropriate data model according to this specification.
- SURF serializer
- Any software component that produces SURF syntax complying with this specification to reflect some data model.
Design Constraints
This section is non-normative.
The following considerations were used to guide the creation of this specification:
- SURF must support the Unicode character set for text values.
- SURF should support the Unicode character set for identifiers and comments.
- SURF must use only ASCII characters for delimiters.
- All valid JSON documents must also be valid SURF documents.
- All valid SURF documents must also be valid TURF documents.
- SURF must allow for distinctions among vocabularies of identifiers.
- SURF must not require namespace IRIs to be declared.
Conventions Used in this Document
The key words “must”, “must not”, “required”, “shall”, “shall not”, “should”, “should not”, “recommended”, “may”, and “optional” in this document are to be interpreted as described in RFC 2119. Parts of this specification marked as notes and annotations are non-normative.
Internet Media Type
The Internet media type (RFC 6838) of a SURF document shall be text/simple-urf. A SURF document must be encoded in UTF-8 and must not begin with a so-called byte order mark (BOM) or UTF-8 signature.
Structure
The content of a SURF document encodes a graph of resources defined by the Uniform Resource Framework (URF) with a single resource as the root of the graph. A SURF document may be empty, representing no resources. A SURF parser or a SURF serializer may represent a SURF document as a graph of URF resources. Nevertheless, although SURF syntax maintains compliance with the URF model, the implementation and use of SURF does not require use of the URF model.
Whitespace
SURF considers the following characters as whitespace, including characters in the Unicode Space_Separator (Zs) category.
whitespace ⇒ tab | vtab | ff | sp | nbsp | zwnbspr | Space_Separator
tab ⇒ CHARACTER TABULATION (U+0009)
vtab ⇒ LINE TABULATION (U+000B)
ff ⇒ FORM FEED (FF) (U+000C)
sp ⇒ SPACE (U+0020)
nbsp ⇒ NO-BREAK SPACE (U+00A0)
zwnbspr ⇒ ZERO WIDTH NO-BREAK SPACE (U+FEFF)
This specification uses the MIDDLE DOT character · to represent zero or more whitespace characters.
- · ⇒ whitespace*
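The whitespace production can be sketched as a small predicate. This is only an illustrative sketch: the function name is_surf_whitespace is hypothetical, and the check relies on Python's unicodedata module to identify the Space_Separator (Zs) category.

```python
import unicodedata

# Code points listed explicitly by the whitespace production:
# tab, vtab, ff, sp, nbsp, zwnbspr.
_EXPLICIT_WHITESPACE = {"\u0009", "\u000B", "\u000C", "\u0020", "\u00A0", "\uFEFF"}


def is_surf_whitespace(ch: str) -> bool:
    """Return True if the single character ch matches the whitespace production.

    Hypothetical helper for illustration; not part of any SURF library.
    """
    return ch in _EXPLICIT_WHITESPACE or unicodedata.category(ch) == "Zs"
```

Note that ZERO WIDTH NO-BREAK SPACE (U+FEFF) is not in the Zs category, which is why it has to be listed explicitly; line ending characters are deliberately excluded.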
Line Endings
SURF recognizes the CARRIAGE RETURN (CR) character (U+000D), the LINE FEED (LF) character (U+000A), and any Unicode Line_Separator (Zl) or Paragraph_Separator (Zp) character as marking the end of a line. A SURF parser must behave as if every CRLF sequence as well as every CR not followed by a LF were normalized to a single LF. A SURF serializer should use the conventional line ending sequence supported by the platform on which it is running if that sequence is allowed by this specification.
eol ⇒ cr | lf | Line_Separator | Paragraph_Separator
cr ⇒ CARRIAGE RETURN (CR) (U+000D)
lf ⇒ LINE FEED (LF) (U+000A)
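The normalization rule for CRLF and lone CR can be sketched as follows. The helper names normalize_line_endings and is_eol are hypothetical, and the sketch assumes the whole document is available as a decoded string.

```python
import re
import unicodedata

# Matches a CR optionally followed by LF, i.e. CRLF or a lone CR.
_CR_OR_CRLF = re.compile(r"\r\n?")


def normalize_line_endings(text: str) -> str:
    """Replace every CRLF and every CR not followed by LF with a single LF."""
    return _CR_OR_CRLF.sub("\n", text)


def is_eol(ch: str) -> bool:
    """Return True if ch matches the eol production (CR, LF, Zl, or Zp)."""
    return ch in "\r\n" or unicodedata.category(ch) in ("Zl", "Zp")
```

Only CR and CRLF are rewritten here, because the text above requires normalization for those sequences alone; Line_Separator and Paragraph_Separator characters are recognized as line endings but left unchanged.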
Comments
Line Comments
A line comment may appear before the end of any line. A line comment begins with the EXCLAMATION MARK
character !
(U+0021
) and proceeds to the next line ending character.
line_comment
⇒ '!' [^eol
]*
Filler
Some structures allow the addition of whitespace, line comments, and/or line endings; these are collectively referred to as filler.
filler
⇒ (whitespace
|line_comment
|eol
)*
Sequences
Several SURF types allow components to be presented in a sequence. A sequence is a syntactical construct indicated by the form item-sequence
, where item
is the construct that may appear zero or more times in the sequence.
Any two items in a sequence are separated by a sequence separator, which is either a COMMA
character ,
(U+002C
) optionally surrounded by filler; or filler with at least one line break but without a COMMA
character. If a COMMA
character is present, an item must follow. If no COMMA
character or filler is present, an item must not follow. This means that filler may end a sequence or appear in an empty sequence.
item-sequence ⇒ filler [item (sequence_next_comma_separated | sequence_next_break_separated)* filler]
sequence_next_comma_separated ⇒ filler ',' filler item
sequence_next_break_separated ⇒ · line_comment? eol filler item
Handles
A name token in SURF must begin with a character from the Unicode Letter
(L
) category; followed by zero or more characters each from the Letter
(L
) category, from the Mark
(M
) category, from the Decimal_Number
(Nd
) category, or from the Connector_Punctuation
(Pc
) category. The sequence of Unicode code points in a name must follow Normalization Form C
(NFC
) as per UAX #15.
name_token
⇒Letter
(Letter
|Mark
|Decimal_Number
|Connector_Punctuation
)*
A name, which is a name token, may be introduced by one or more prefixes, each itself a name token. These segments are separated by the HYPHEN-MINUS
character -
(U+002D
), and together are referred to as a handle. An example of a handle is example-FooBar
.
handle
⇒ (segment
'-')*name
segment
⇒name_token
name
⇒name_token
Authors of SURF documents should use prefixes corresponding to a reverse series of domain name components for a domain that the author controls or has authority to use, starting with either the top-level domain or the second-level domain. The owner of the example.com domain, for example, might create a handle com-example-FooBar or example-FooBar.
SURF documents must not use handles beginning with the urf- prefix unless defined by one of the URF specifications. The example- prefix is reserved for use as examples in documentation for private testing. There are no restrictions on using SURF handles with no prefixes, although authors should follow conventions that may develop associating semantics with certain names.
The tokens false
and true
must not appear as handles in a SURF document.
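A handle check along these lines could be sketched as below. The function names is_name_token and is_handle are hypothetical, and the sketch treats only the complete tokens false and true as forbidden, which is one reading of the rule above.

```python
import unicodedata


def is_name_token(token: str) -> bool:
    """Check a token against the name_token production, including the NFC rule."""
    if not token or unicodedata.normalize("NFC", token) != token:
        return False
    # The first character must be in the Letter (L) category.
    if not unicodedata.category(token[0]).startswith("L"):
        return False
    for ch in token[1:]:
        category = unicodedata.category(ch)
        # Letter (L*), Mark (M*), Decimal_Number (Nd), Connector_Punctuation (Pc).
        if not (category[0] in "LM" or category in ("Nd", "Pc")):
            return False
    return True


def is_handle(text: str) -> bool:
    """Check text against the handle production: name tokens joined by '-'."""
    if text in ("false", "true"):  # reserved Boolean tokens
        return False
    return all(is_name_token(segment) for segment in text.split("-"))
```

For instance, is_handle("com-example-FooBar") is accepted, while a trailing hyphen or a segment starting with a digit is rejected.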
Resources
A SURF document must contain at most a single resource, which may recursively contain other resources. A resource consists of an optional label followed by a resource representation.
document ⇒ filler resource* filler
resource ⇒ label? · resource_representation | label
described_resource ⇒ label? · resource_representation · description? | label
resource_representation ⇒ object | literal | collection
literal ⇒ binary | boolean | character | email | iri | media_type | number | regex | string | telephone | temporal | uuid
collection ⇒ list | map | set
A label consists of an identifier (which is either a SURF name, a string, or an IRI) surrounded by matching VERTICAL LINE characters | (U+007C). The first occurrence of a label with a particular identifier may include a resource representation; if no resource representation is present at the first appearance of a label with some identifier, an object with no type and no description is implied. Subsequent appearances of a label with the same identifier must not include a resource representation. A nested resource representation may refer to the label of an outer resource in the graph.
label ⇒ '|' alias | id | tag '|'
alias ⇒ name_token
id ⇒ string
tag ⇒ iri
If a label uses a SURF name as its identifier, it indicates an alias for referencing resources only within the confines of the SURF document. If the identifier is an IRI, it is a tag and provides a unique identifier for the resource across all SURF documents. A SURF tag must not contain an IRI fragment. A string as the identifier functions as an ID for an object, unique only for a certain object type.
A tag or an ID must not appear in front of any resource representation other than an object. An ID must not appear in front of a resource representation without an indicated type. A SURF parser must provide tags and IDs as part of the parsed data.
Objects
Objects are general resources that have an optional type and may be described by a description.
object ⇒ '*' · type?
type ⇒ handle
Descriptions
A description must not follow any resource representation other than an object. A description must not contain more than one property
with the same handle
, and a SURF parser must consider such a condition as a non-recoverable error.
description ⇒ ':' property-sequence ';'
property ⇒ handle filler '=' filler resource
Literals
SURF literals are lexical representations of resources.
Binary
URF allows the encoding of an arbitrary sequence of octets. Zero or more bytes must be encoded using the “Base 64 Encoding” defined in RFC 4648, beginning with the PERCENT SIGN
character %
(U+0025
). The encoding must use the “base64url” alphabet and must not include Base 64 padding.
binary
⇒ '%'rfc_4648_base64url
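As a minimal sketch of the binary literal, the following uses Python's standard base64 module with the URL-safe alphabet and strips or restores the padding. The names encode_binary and decode_binary are hypothetical.

```python
import base64


def encode_binary(data: bytes) -> str:
    """Serialize bytes as a SURF binary literal: '%' + unpadded base64url."""
    return "%" + base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_binary(literal: str) -> bytes:
    """Parse a SURF binary literal back into bytes."""
    if not literal.startswith("%"):
        raise ValueError("binary literal must begin with '%'")
    body = literal[1:]
    if "=" in body:
        raise ValueError("Base 64 padding is not allowed")
    # Restore the padding that the decoder expects.
    padded = body + "=" * (-len(body) % 4)
    return base64.urlsafe_b64decode(padded)
```

The decoder in the standard library is lenient about characters outside the alphabet, so a strict parser would need an additional check on the body before decoding.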
Boolean
A Boolean is either of the tokens true
or false
.
boolean
⇒ "false" | "true"
Character
A SURF character is a representation of a Unicode code point, delimited on both sides by the APOSTROPHE character ' (U+0027). The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The APOSTROPHE, REVERSE SOLIDUS, and control characters must not appear in a character unless they are escaped. The following escape sequences are allowed:
\\
- REVERSE SOLIDUS (U+005C)
\/
- SOLIDUS (U+002F)
\'
- APOSTROPHE (U+0027)
\b
- BACKSPACE (U+0008)
\f
- FORM FEED (FF) (U+000C)
\n
- LINE FEED (LF) (U+000A)
\r
- CARRIAGE RETURN (CR) (U+000D)
\t
- CHARACTER TABULATION (U+0009)
\v
- LINE TABULATION (U+000B)
\uXXXX
- Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.
A SURF parser must correctly interpret characters outside the Basic Multilingual Plane, whether represented as a literal character or as an escaped Unicode code point.
TODO production
Email Address
An email address in SURF begins with the CIRCUMFLEX ACCENT
character ^
(U+005E
) commonly known as a “caret”, followed by the “addr-spec” format specified in RFC 5322. The representation must not include any obsolete elements (those starting with the prefix “obs-”) in RFC 5322. The representation must not include any “comments” or “folding white space” as defined by RFC 5322.
email
⇒ '^'rfc_5322_addr_spec
IRI
An Internationalized Resource Identifier (IRI) is a sequence of Unicode characters for identifying a resource as defined in RFC 3987. In SURF an IRI is placed between a LESS-THAN SIGN character < (U+003C) and a GREATER-THAN SIGN character > (U+003E).
iri
⇒ '<' (rfc_3987_IRI
|email
|telephone
|uuid
) '>'
If an email address, telephone number, or UUID appears between the delimiters, it represents an “IRI short form” that is equivalent to a literal IRI according to the following rules:
email
- The email address is converted into an IRI with a scheme of mailto according to RFC 6068.
telephone
- The telephone number is converted into an IRI with a scheme of tel according to RFC 3966.
uuid
- The UUID is converted into an IRI with a scheme of urn and a URN namespace of uuid according to RFC 4122.
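The short form rules can be sketched as a dispatch on the first character of the content between the delimiters. The function name expand_iri_short_form is hypothetical, and the sketch assumes the content has already been validated; in particular it does not apply the percent-encoding that RFC 6068 requires for some characters in a mailto IRI.

```python
def expand_iri_short_form(content: str) -> str:
    """Expand the content between '<' and '>' into a full IRI.

    Hypothetical helper; assumes the content is already syntactically valid.
    """
    if content.startswith("^"):  # email short form
        return "mailto:" + content[1:]
    if content.startswith("+"):  # telephone short form (global number keeps '+')
        return "tel:" + content
    if content.startswith("&"):  # UUID short form
        return "urn:uuid:" + content[1:]
    return content  # already a literal IRI
```

Because an IRI must begin with a scheme, which starts with a letter, the three introducing characters cannot be confused with the start of a literal IRI.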
Media Type
A media type, sometimes referred to as a “content type”, indicates the type of content contained in a resource and is essential for navigating the World Wide Web. It consists of a type and subtype, optionally followed by one or more parameters.
SURF places the media type between the GREATER-THAN SIGN
character >
(U+003E
) and the LESS-THAN SIGN
character <
(U+003C
), in that order. This representation is not to be confused with that of an IRI, which uses the same delimiters but in a different order.
media_type
⇒ '>'rfc_6838_media_type
'<'
The syntax of the media type is that prescribed by RFC 6838 with the following additional restrictions and recommendations:
- The type, subtype, and parameter name(s) should be in lowercase.
- If the top-level type is text, the top-level type and its following SOLIDUS (U+002F) delimiter may be left out. For example, the media type text/plain may be indicated as simply >plain< in SURF.
- The value of the charset parameter should be in lowercase.
- If any value for any parameter other than charset is case-insensitive, it must be in lowercase.
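The rule for omitting the text top-level type can be sketched as follows. The function name expand_media_type is hypothetical, and the sketch performs no validation of the RFC 6838 syntax beyond checking the delimiters.

```python
def expand_media_type(literal: str) -> str:
    """Return the full media type for a SURF media type literal such as '>plain<'."""
    if len(literal) < 2 or not (literal.startswith(">") and literal.endswith("<")):
        raise ValueError("media type must be delimited by '>' and '<'")
    body = literal[1:-1]
    # Only inspect the part before any parameters, as a parameter
    # value could itself contain a SOLIDUS.
    type_part = body.split(";", 1)[0]
    if "/" not in type_part:
        body = "text/" + body  # the top-level type 'text' was left out
    return body
```

With this sketch, both >plain< and >text/plain< yield text/plain.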
Number
A number represents a numerical value in base 10 that may be negative and may be fractional. If the number begins with the DOLLAR SIGN character $ (U+0024), it is considered a decimal regardless of the presence or absence of a fraction and/or exponent component, and a SURF parser must represent the value using a construct that exactly represents the fractional part without rounding within the supported range.
If the number does not begin with the DOLLAR SIGN
character $
(U+0024
) and contains neither a fraction nor an exponent component, it is considered an integer. A SURF parser may represent non-decimal numbers using IEEE 754, but it must maintain a distinction between general numbers and integers.
number ⇒ ['$'] ['-'] whole [fraction] [exponent]
whole ⇒ digit+
fraction ⇒ '.' digit+
exponent ⇒ ('e' | 'E') ['-' | '+'] digit+
digit ⇒ '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
A number should be in its canonical form:
- No leading zeros in the whole component except to meet the requirement of at least one digit.
- No trailing zeros in the digit(s) in the fraction component except to meet the requirement of at least one digit.
- No leading zeros in the digit(s) in the exponent component.
- A lowercase 'e' in the exponent component.
- No '+' sign in the exponent component if the exponent is non-negative.
Nevertheless, the presence of any leading zero(s) in the whole component shall not be interpreted as indicating any number base other than base 10.
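One way a parser might keep the three kinds of number apart is sketched below. The function name parse_number is hypothetical, and the mapping to Python types (Decimal for decimals, int for integers, float for other numbers) is an assumption of this sketch rather than a requirement.

```python
import re
from decimal import Decimal

# Mirrors the number production: ['$'] ['-'] whole [fraction] [exponent].
_NUMBER = re.compile(r"(\$)?(-)?([0-9]+)(\.[0-9]+)?([eE][-+]?[0-9]+)?")


def parse_number(literal: str):
    """Parse a SURF number into a Decimal, an int, or a float."""
    match = _NUMBER.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a SURF number: {literal!r}")
    dollar, sign, whole, fraction, exponent = match.groups()
    text = (sign or "") + whole + (fraction or "") + (exponent or "")
    if dollar:
        return Decimal(text)  # exact decimal, no binary rounding
    if fraction is None and exponent is None:
        return int(text)  # leading zeros are still read as base 10
    return float(text)  # general number, IEEE 754 double
```

For example, parse_number("007") returns the integer 7, and parse_number("$1.10") returns a Decimal that preserves the fractional digits exactly.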
Regular Expression
A regular expression is surrounded by the slash or SOLIDUS character / (U+002F). The backslash or REVERSE SOLIDUS \ (U+005C) is interpreted as an escape character only if followed by a slash character /.
TODO decide on whether and how to allow flags; TODO reference regular expression standard
regex
⇒ '/'regular_expression
'/'
String
A string represents a sequence of Unicode code points, delimited on both sides by the QUOTATION MARK character " (U+0022). The sequence of Unicode code points in a string should follow Normalization Form C (NFC) as per UAX #15. The backslash or REVERSE SOLIDUS \ (U+005C) is used as an escape character. The QUOTATION MARK, REVERSE SOLIDUS, and control characters must not appear in a string unless they are escaped. The following escape sequences are allowed:
\\
- REVERSE SOLIDUS (U+005C)
\/
- SOLIDUS (U+002F)
\"
- QUOTATION MARK (U+0022)
\b
- BACKSPACE (U+0008)
\f
- FORM FEED (FF) (U+000C)
\n
- LINE FEED (LF) (U+000A)
\r
- CARRIAGE RETURN (CR) (U+000D)
\t
- CHARACTER TABULATION (U+0009)
\v
- LINE TABULATION (U+000B)
\uXXXX
- Any 16-bit Unicode code point encoding, where XXXX is four hexadecimal digits in any case. Escaped Unicode code points outside the Basic Multilingual Plane must be represented as two UTF-16 surrogate characters.
TODO production
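Decoding the escape sequences of a string, including surrogate pairs, might look like the following sketch. The function name unescape_string_body is hypothetical; it operates on the text between the quotation marks and does not check for unescaped control characters.

```python
import re

_SIMPLE_ESCAPES = {
    "\\": "\\", "/": "/", '"': '"', "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
# A backslash followed by either uXXXX, any single character, or nothing.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.)?", re.DOTALL)


def unescape_string_body(body: str) -> str:
    """Decode the escape sequences in the content of a SURF string."""
    def replace(match):
        code = match.group(1)
        if code is None:
            raise ValueError("dangling escape character")
        if code[0] == "u" and len(code) == 5:
            return chr(int(code[1:], 16))  # may be a lone surrogate for now
        if code in _SIMPLE_ESCAPES:
            return _SIMPLE_ESCAPES[code]
        raise ValueError(f"unknown escape sequence: \\{code}")

    decoded = _ESCAPE.sub(replace, body)
    # Join adjacent UTF-16 surrogates into a single code point;
    # an unpaired surrogate raises an error here.
    return decoded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

The final round trip through UTF-16 is what turns an escaped pair such as \uD83D\uDE00 into the single code point U+1F600.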
Telephone
In SURF the representation of a telephone number follows the “global number” format prescribed by RFC 3966, which is a PLUS SIGN
+
(U+002B
) followed by at least one digit. The representation must not include any “visual separators” as defined by RFC 3966.
telephone
⇒ '+'digit
+
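The telephone production is small enough to sketch as a single regular expression. The function name is_telephone is hypothetical.

```python
import re

# A PLUS SIGN followed by at least one ASCII digit, with no visual separators.
_TELEPHONE = re.compile(r"\+[0-9]+")


def is_telephone(literal: str) -> bool:
    """Return True if literal matches the telephone production."""
    return _TELEPHONE.fullmatch(literal) is not None
```

A number written with separators, such as +1-201-555-0123, is rejected by this check.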
Temporal
The SURF temporal representation encodes date and/or time information based on ISO 8601. A temporal starts with the COMMERCIAL AT
character @
(U+0040
). Time zone names tz
are from the IANA TZ database and are case-sensitive. The format for month_day
conforms to an older version of ISO 8601; the most recent version does not mention a month+day format. The format for zoned_date_time
is an extension to the ISO 8601 specification.
temporal ⇒ '@' (instant | zoned_date_time | offset_date_time | offset_date | offset_time | local_date_time | local_date | local_time | year_month | month_day | year)
instant ⇒ date 'T' time 'Z'
zoned_date_time ⇒ offset_date_time '[' tz ']'
offset_date_time ⇒ date 'T' time offset
offset_date ⇒ date offset
offset_time ⇒ time offset
local_date_time ⇒ date 'T' time
local_date ⇒ date
local_time ⇒ time
year ⇒ YYYY
year_month ⇒ year '-' MM
month_day ⇒ '-' '-' MM '-' DD
date ⇒ YYYY '-' MM '-' DD
time ⇒ hh ':' mm ':' ss ['.' s]
offset ⇒ ('+' | '-') hh ':' mm
YYYY ⇒ digit digit digit digit
MM ⇒ digit digit
DD ⇒ digit digit
hh ⇒ digit digit
mm ⇒ digit digit
ss ⇒ digit digit
s ⇒ digit digit digit [digit digit digit [digit digit digit] ]
TODO add support for durations
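The temporal productions can be sketched as a table of regular expressions that identifies which form a literal takes. The function name classify_temporal is hypothetical, the pattern used for tz (any characters other than square brackets) is an assumption, and the sketch checks only the lexical shape, not whether the date or time values are in range.

```python
import re

_D2 = "[0-9]{2}"
_YYYY = "[0-9]{4}"
_DATE = _YYYY + "-" + _D2 + "-" + _D2
# Fractional seconds come in groups of three digits: 3, 6, or 9 digits.
_TIME = _D2 + ":" + _D2 + ":" + _D2 + r"(?:\.[0-9]{3}(?:[0-9]{3}(?:[0-9]{3})?)?)?"
_OFFSET = "[+-]" + _D2 + ":" + _D2
_TZ = r"[^\[\]]+"  # assumption: any non-bracket characters

_FORMS = [
    ("instant", _DATE + "T" + _TIME + "Z"),
    ("zoned_date_time", _DATE + "T" + _TIME + _OFFSET + r"\[" + _TZ + r"\]"),
    ("offset_date_time", _DATE + "T" + _TIME + _OFFSET),
    ("offset_date", _DATE + _OFFSET),
    ("offset_time", _TIME + _OFFSET),
    ("local_date_time", _DATE + "T" + _TIME),
    ("local_date", _DATE),
    ("local_time", _TIME),
    ("year_month", _YYYY + "-" + _D2),
    ("month_day", "--" + _D2 + "-" + _D2),
    ("year", _YYYY),
]


def classify_temporal(literal: str) -> str:
    """Return the name of the temporal production that literal matches."""
    if not literal.startswith("@"):
        raise ValueError("temporal must begin with '@'")
    body = literal[1:]
    for name, pattern in _FORMS:
        if re.fullmatch(pattern, body):
            return name
    raise ValueError(f"not a SURF temporal: {literal!r}")
```

A parser would typically follow this classification with a conversion to the date and time types of the host platform.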
UUID
A Universally Unique IDentifier (UUID) must adhere to the requirements of RFC 4122. The SURF representation of a UUID must be introduced by the AMPERSAND character & (U+0026) and be followed by the “UUID” production given in RFC 4122.
uuid ⇒ '&' hex*8 '-' hex*4 '-' hex*4 '-' hex*4 '-' hex*12
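Parsing a UUID literal can be sketched with Python's standard uuid module. The function name parse_uuid is hypothetical; the regular expression is applied first because uuid.UUID on its own accepts several other textual forms.

```python
import re
import uuid

# '&' followed by hex groups of 8-4-4-4-12 digits.
_UUID = re.compile(
    r"&([0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12})"
)


def parse_uuid(literal: str) -> uuid.UUID:
    """Parse a SURF UUID literal into a uuid.UUID value."""
    match = _UUID.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a SURF UUID: {literal!r}")
    return uuid.UUID(match.group(1))
```

A serializer could produce the literal form with "&" + str(value).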
Collections
Collection resources represent abstract data types that can hold other resources.
List
A SURF list is an ordered sequence of zero or more resources with optional descriptions, beginning with a LEFT SQUARE BRACKET character [ (U+005B) and ending with a RIGHT SQUARE BRACKET character ] (U+005D). A SURF parser should represent a SURF list using a corresponding language construct that represents list semantics.
list
⇒ '['described_resource-sequence
']'
Map
A SURF map is a sequence of associations between a key and a value. A map begins with a LEFT CURLY BRACKET character { (U+007B) and ends with a RIGHT CURLY BRACKET character } (U+007D). Keys and values can be any resources. If a key is an object with a description, the key must be surrounded by the REVERSE SOLIDUS character \ (U+005C). The key and value in each association or entry are separated by a COLON character : (U+003A).
A map should not have entries with duplicate keys, and a SURF serializer must not produce a map with duplicate-key entries. A SURF parser must ignore all but one of each entry with the same key. TODO revisit; this brings JSON compatibility, but could cause problems with tags if a duplicate entry is ignored; also address key equality
A SURF parser should represent a SURF map using a corresponding language construct that represents map semantics.
map ⇒ '{' entry-sequence '}'
entry ⇒ key filler ':' filler value
key ⇒ '\' described_resource '\' | resource
value ⇒ described_resource
Set
A set in SURF is an unordered sequence of zero or more resources with optional descriptions, beginning with a LEFT PARENTHESIS character ( (U+0028) and ending with a RIGHT PARENTHESIS character ) (U+0029). The same resource must not appear more than once in a set. A SURF parser should represent a SURF set using a corresponding language construct that represents set semantics.
set
⇒ '('described_resource-sequence
')'
References
- IEEE 754-2008
- IEEE Standard for Floating-Point Arithmetic. IEEE.
- ISO 8601:2004
- Data elements and interchange formats — Information interchange — Representation of dates and times, third edition, 2004-12-01. ISO.
- RFC 2119
- Key words for use in RFCs to Indicate Requirement Levels, S. Bradner (Harvard University). IETF.
- RFC 3966
- The tel URI for Telephone Numbers, H. Schulzrinne (Columbia University). IETF.
- RFC 3986
- Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee (W3C/MIT), R. Fielding (Day Software), L. Masinter (Adobe Systems). IETF.
- RFC 3987
- Internationalized Resource Identifiers (IRIs), M. Duerst (W3C), M. Suignard (Microsoft Corporation). IETF.
- RFC 4122
- A Universally Unique IDentifier (UUID) URN Namespace, P. Leach (Microsoft Corporation), M. Mealling (Refactored Networks, LLC), R. Salz (DataPower Technology, Inc.). IETF.
- RFC 4648
- The Base16, Base32, and Base64 Data Encodings, S. Josefsson (SJD). IETF.
- RFC 5322
- Internet Message Format, P. Resnick, Ed. (Qualcomm Incorporated). IETF.
- RFC 6068
- The 'mailto' URI Scheme, M. Duerst (Aoyama Gakuin University), L. Masinter (Adobe Systems Incorporated), J. Zawinski (DNA Lounge). IETF.
- RFC 6838
- Media Type Specifications and Registration Procedures, N. Freed (Oracle), J. Klensin, T. Hansen (AT&T Laboratories). IETF.
- RFC 7159
- The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray (Google, Inc.). IETF.
- TZ
- Time Zone Database. IANA.
- UAX #15
- Unicode® Standard Annex #15: Unicode Normalization Forms, Mark Davis, Ken Whistler. The Unicode Consortium.