SIST ISO 24617-10:2025
Language resource management — Semantic annotation framework (SemAF) — Part 10: Visual information
This document specifies an annotation language for visual information, based on VoxML (visual object concept structure modelling language), a modelling language for the visualizations of concepts and actions denoted by natural language (NL) expressions in three dimensions (3D).
The specification of the VoxML-based annotation scheme conforms to the requirements given in ISO 24617-1, ISO 24617-7 and ISO 24617-14. The adoption of VoxML, specified in ISO 24617-14 as a semantic basis, is necessary for the 3D simulation and visualization of actions and motions taken by both human and artificial agents in real-life situations.
Gestion des ressources linguistiques - Cadre d'annotation sémantique — Partie 10: informations visuelles (VoxML)
Upravljanje jezikovnih virov - Ogrodje za semantično označevanje (SemAF) - 10. del: Vizualne informacije
SLOVENIAN STANDARD
1 June 2025
This Slovenian standard is identical to: ISO 24617-10:2024
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of all or part of this standard is not permitted.
International Standard
ISO 24617-10
First edition
2024-08
© ISO 2024
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Basic semantic assumptions — Habitats and affordances
6 VoxML specification
6.1 Metamodel and VoxML elements
6.2 Representation of VoxML structures
6.3 Objects
6.4 Actions as programs
6.5 Relations
6.5.1 General
6.5.2 Properties (Attributes)
6.5.3 Relations
6.5.4 Functions
7 Examples of voxemes
7.1 General
7.2 Objects
7.3 Eventualities as programs
7.4 Properties
7.5 Relations
7.6 Functions
8 Using VoxML for simulation modelling of language
9 VoxML-based annotation scheme
9.1 Overview
9.2 Annotation scheme
9.2.1 Abstract specification
9.2.2 Concrete syntax for the representation of annotation structures
9.3 Semantic representation and interpretation
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 4, Language resource management.
A list of all parts in the ISO 24617 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
This document standardizes the specification of a semantic annotation scheme for visual information, based
on a modelling language for constructing three-dimensional (3D) visualizations of concepts denoted by
natural language (NL) expressions. This modelling language serves as a semantic basis of interpreting the
semantic forms of annotation structures model-theoretically by constraining the models for interpretation.
This document focuses on the introduction of the modelling language as a semantic basis for interpretation,
since the syntactic specification of the annotation scheme for visual information is a simplified formulation
based on the abstract specification of the spatio-temporal annotation schemes, such as those specified in
ISO 24617-1, ISO 24617-7 and ISO 24617-14. These three standards lay a theoretical basis for this document,
which specifies ways of annotating visual information involving motions and actions that are spatio-
temporally characterized.
The modelling language, named “VoxML” (visual object concept structure modelling language), where “Vox”
abbreviates “visual object concept structure” (VOCS), can be used as the platform for creating multimodal
semantic simulations in the context of human-computer communication. VoxML encodes semantic knowledge
of real-world objects represented as 3D models, and of events and attributes related to and enacted over
these objects. VoxML is intended to overcome the limitations of existing 3D visual markup languages by
allowing for the encoding of a broad range of semantic knowledge that can be exploited by a variety of
systems and platforms, leading to multimodal simulations of real-world scenarios using conceptual objects
that represent their semantic values.
NOTE 1 The main content of this document is based on References [1] and [2]. Reference [1] was developed by the
Brandeis University Computer Science Department in the context of communicating with computers (CwC), a Defence
Advanced Research Projects Agency (DARPA) effort to identify and construct computational semantic elements, for
the purpose of carrying out joint plans between a human and computer through NL discourse.
NOTE 2 This document adopts VoxML as a semantic basis for enriching the model for interpreting the descriptions
of objects, actions and relations involving dynamic visual information.
This document outlines a specification:
a) to formulate the annotation scheme for visual information;
b) to represent semantic knowledge of real-world objects represented as 3D models.
It uses a combination of parameters that can be determined from the object’s geometrical properties as
well as lexical information from NL, with methods of correlating the two where applicable. This information
allows for visualization and simulation software to fill in information missing from the NL input and
allows the software to render a functional visualization of programs being run over objects in a robust and
extensible way. Currently, a voxicon, which is the structured repository of visual object concepts, contains
500 object (noun) voxemes (the lexemes or entries of the voxicon) and 10 program (verb) voxemes.
NOTE 3 As this library of available voxemes continues to grow, the specification elements will operationalize an
increasingly large library of various and more complicated programs. A voxeme library and visualization software
where users will be able to conduct visualizations of available behaviours driven by VoxML after parsing and
interpretation is available from Reference [25].
International Standard ISO 24617-10:2024(en)
1 Scope
This document specifies an annotation language for visual information, based on VoxML (visual object
concept structure modelling language), a modelling language for the visualizations of concepts and actions
denoted by natural language (NL) expressions in three dimensions (3D).
The specification of the VoxML-based annotation scheme conforms to the requirements given in ISO 24617-1,
ISO 24617-7 and ISO 24617-14. The adoption of VoxML, specified in ISO 24617-14 as a semantic basis, is
necessary for the 3D simulation and visualization of actions and motions taken by both human and artificial
agents in real-life situations.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1:2006, Language resource management — Feature structures — Part 1: Feature structure
representation
ISO 24617-1, Language resource management — Semantic annotation framework (SemAF) — Part 1: Time and
events (SemAF-Time, ISO-TimeML)
ISO 24617-7, Language resource management — Semantic annotation framework — Part 7: Spatial information
ISO 24617-14, Language resource management — Semantic annotation framework (SemAF) — Part 14: Spatial
semantics
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
affordance
affordance structure
set of specific actions, described along with the requisite conditions, that the object may take part in
3.1.1
Gibsonian affordance
GA
set of specific actions that an agent can perform with an object that is presented to the agent
EXAMPLE Hold, grasp, move.
3.1.2
telic affordance
set of goal-oriented or intentionally situated actions of an agent on an object presented to the agent
EXAMPLE An agent eating an apple when it is presented to the agent.
3.2
habitat
representation of an object situated within a partial minimal model
3.3
minimal embedding space
MES
three-dimensional (3D) region within which the state is configured, or the event unfolds
3.4
qualia
qualia structure
QS
relational forces or aspects of a lexical item or concept
3.5
telic
purpose or function qualia (3.4) of an object
3.6
voxeme
basic entry in a voxicon (3.7)
3.7
voxicon
lexicon or list of basic visual object concepts of VoxML (visual object concept structure modelling language)
4 Abbreviated terms
3D three-dimensional
A agentive role
ARG argument
AS atomic structure
AS_visML annotation scheme for visual information markup language
ASyn_visML abstract syntax for visual information markup language
CSyn_visML concrete syntax for visual information markup language
C constitutive property
F formal property
GA Gibsonian affordance
ID identifier
MES minimal embedding space
NL natural language
NLP natural language processing
QS qualia structure
T telic role
Vox visual object concept structure
VoxML visual object concept structure modelling language
XML extensible markup language
5 Basic semantic assumptions — Habitats and affordances
Before introducing the VoxML specification, this document reviews two basic assumptions regarding the
semantics underlying the model. Following the Generative Lexicon [3], lexical entries in the object language
are given a feature structure consisting of a word’s basic type, its parameter listing, its event typing and its
qualia structure. In accordance with ISO 24610-1:2006, each feature structure shall be typed, consisting of
pairs of features (attributes) and values, either atomic or complex. If a value is a variable, then it is bound
either universally, existentially, or by the lambda operator, as shown in Example 1.
The semantic structure of an object shall be analysed into the following four sub-structures:
a) atomic structure (formal): objects expressed as basic nominal types;
b) subatomic structure (constitutive): mereo-topological structure of objects;
c) event structure (telic) and (agentive): origin and functions associated with an object;
d) macro-object structure: how objects fit together in space and through coordinated activities.
Objects can be partially contextualized through their qualia structure. For example, a food item has a telic
value of “eat”; an instrument for writing has a telic value of “write”; a cup has a telic value of “hold”, etc. As a
further example, the lexical semantics for the noun “chair” carries a telic value of “sit_in”:
EXAMPLE 1
where
AS is an atomic structure;
QS is a qualia structure;
ARG1 is argument 1;
F is a formal property;
T is a telic role.
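The shape of such a lexical entry can be sketched as a nested mapping. This is only an illustrative encoding, not the notation of this document: the dictionary layout, the variable name "x" and the formal value "phys(x)" are assumptions; only the telic value “sit_in” and the attribute names AS, QS, ARG1, F and T come from the text.

```python
# Hypothetical encoding of a typed feature structure as nested dicts.
# Only the attribute names (AS, QS, ARG1, F, T) and the telic value
# "sit_in" come from the text; all other values are assumed placeholders.
chair = {
    "PRED": "chair",
    "AS": {"ARG1": "x"},      # atomic structure: argument 1 (assumed variable name)
    "QS": {                   # qualia structure
        "F": "phys(x)",       # formal property (assumed value)
        "T": "sit_in",        # telic role, as stated in the text
    },
}


def telic_value(entry):
    """Return the telic role (T) of an entry's qualia structure, or None."""
    return entry.get("QS", {}).get("T")


print(telic_value(chair))
```

Nested dictionaries are used here because a feature structure is a set of feature-value pairs whose values may themselves be feature structures.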
While an artefact is designed for a specific purpose (its telic role), this can only be achieved under
specific circumstances. Reference [4] introduces the notion of an object’s “habitat”, which encodes these
circumstances. References [5] and [6] further define the notion of habitat and how it interacts with
affordances. It is assumed that for an artefact, x, given the appropriate context C, performing the action π
will result in the intended or desired resulting state, R, i.e. C → [π]R. That is, if a context C (a set of contextual
factors) is satisfied, then every time the activity of π is performed, the resulting state R will occur. It is
necessary to specify the precondition context C since this enables the local modality to be satisfied.
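The conditional C → [π]R can be illustrated with a small sketch, under stated assumptions. The condition names ("oriented_up", "seat_free"), the function names and the representation of states as dictionaries are hypothetical; the sketch only shows that the result R is guaranteed when the context C holds.

```python
# Hypothetical context C for a chair, expressed as required conditions.
CHAIR_HABITAT = {"oriented_up": True, "seat_free": True}


def sit(state):
    """The action pi: return a new state in which the agent is seated (R)."""
    return {**state, "agent_seated": True}


def perform(action, state, habitat):
    """Apply `action` only if every condition in `habitat` holds in `state`.

    If the context C is satisfied, the result state R is returned.
    Otherwise nothing is guaranteed; this sketch returns the state unchanged.
    """
    if all(state.get(key) == value for key, value in habitat.items()):
        return action(state)
    return dict(state)


upright = {"oriented_up": True, "seat_free": True}
tipped = {"oriented_up": False, "seat_free": True}
print(perform(sit, upright, CHAIR_HABITAT))
print(perform(sit, tipped, CHAIR_HABITAT))
```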
Using this notion, a habitat is defined as representing an object situated within a partial minimal model; it is
a directed enhancement of the qualia structure. Multi-dimensional affordances determine how habitats are
deployed and how they modify or augment the context, and compositional operations include procedural
(simulation) and operational (selection, specification, refinement) knowledge.
The habitat for an object is built by first placing it within an embedding space and then contextualizing it.
For example, to use a table, the top must be oriented upward, the surface must be accessible, etc. A chair
also must be oriented up, the seat must be free and accessible, it must be able to support the user, etc. An
illustration of the resulting knowledge structure for the habitat of a chair is shown in Example 2.
EXAMPLE 2
where
F is a formal property;
C is a constitutive property;
T is a telic role;
A is an agentive role.
As described in more detail in 6.4, event or action simulations are constructed from the composition of
object habitats, along with some constraints imposed by the dynamic event structure inherent in the verb
itself, when interpreted as a program.
The final step in contextualizing the semantics of an object is to operationalize the telic value in its habitat.
This effectively means identifying the “affordance structure” [7][8] for the object. The affordance structure
available to an agent, when presented with an object, is the set of actions that can be performed with it.
These are referred to as “Gibsonian affordances” and they include “grasp”, “move”, “hold”, “turn”, etc.
This is to distinguish them from more goal-directed, intentionally situated activities, referred to as “telic
affordances”.
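The distinction between the two kinds of affordance can be sketched as a simple record. The class and field names below are assumptions made for illustration; the action names are the ones given as examples in this document (grasp, hold, move for Gibsonian affordances; eat for the telic affordance of an apple).

```python
from dataclasses import dataclass, field


@dataclass
class AffordanceStructure:
    """Hypothetical container separating Gibsonian from telic affordances."""
    gibsonian: set = field(default_factory=set)  # actions available on presentation
    telic: set = field(default_factory=set)      # goal-directed, intentional actions

    def all_actions(self):
        """Every action the object may take part in."""
        return self.gibsonian | self.telic


apple = AffordanceStructure(gibsonian={"grasp", "hold", "move"}, telic={"eat"})
print(sorted(apple.all_actions()))
```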
6 VoxML specification
6.1 Metamodel and VoxML elements
The spatio-temporal annotation schemes given in ISO 24617-1, ISO 24617-7 and ISO 24617-14 shall apply.
The metamodel, graphically depicted by Figure 1, represents a small world of basic elements modelled in
VoxML. These elements form a set of categories:
a) event (program);
b) entity (object);
c) relation over them.
Events, especially actions, work as programs while taking simple objects or spatio-temporally localized
objects as arguments. Entities as objects are individuals or groups that may behave as agents. Relations can
be divided into properties, often referred to as “attributes”, and functions as subcategories. Attributes and
relations evaluate to states, and functions evaluate to geometric regions. These elements can then compose
into visualizations of NL concepts and expressions.
The metamodel of VoxML, presented in Figure 1, has no regions or times. These are introduced by functions
such as loc and τ. The function loc, for instance, maps an object x to the region loc(x) to which it is anchored.
Likewise, τ(x) maps an event to an event time, the time of its occurrence. Similarly, the function seq or the
function vec maps a set of regions to a path or a vector. Thereby, the ontology of VoxML is enriched with
spatio-temporal entities and dynamic paths.
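A minimal sketch of these functions follows. It is not the formal definition: regions are simplified to single 3D points, event times to (start, end) pairs, and vec is assumed to be the displacement from the first to the last region. The lookup tables and their contents are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A region, simplified here to a single 3D point (an assumption)."""
    x: float
    y: float
    z: float


# Hypothetical anchoring tables for one object and one event.
LOCATIONS = {"cup": Region(0.0, 1.0, 0.0)}
EVENT_TIMES = {"e1": (0.0, 2.5)}


def loc(obj):
    """Map an object to the region to which it is anchored."""
    return LOCATIONS[obj]


def tau(event):
    """Map an event to the time of its occurrence."""
    return EVENT_TIMES[event]


def seq(regions):
    """Map regions to a path, modelled as an ordered sequence."""
    return list(regions)


def vec(regions):
    """Map regions to a vector: displacement from first to last (assumed)."""
    start, end = regions[0], regions[-1]
    return tuple(e - s for s, e in zip(start, end))


path = seq([Region(0.0, 0.0, 0.0), Region(1.0, 0.0, 0.0), Region(1.0, 2.0, 3.0)])
print(loc("cup"), tau("e1"), vec(path))
```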
NOTE 1 The empty triangular head of an arrow represents a subcategorization relation. Each directed arrow with
a smaller filled-in arrowhead relates one element to one or more other elements while its labelling specifies such a
relation. An entity as an agent, for example, triggers intentionally an action, while the action is a subcategory of an
event, treated as a program.
NOTE 2 SOURCE: Reference [2], reproduced with the permission of the authors.
Figure 1 — Metamodel
6.2 Representation of VoxML structures
This document follows the convention of the current version of VoxML and Voxicon (see Reference [1]). Basic
VoxML structures called “voxemes” are conventionally represented as feature structures, each consisting of
a set of attribute-value specifications, conforming to ISO 24610-1. Voxemes are mostly formed by complex
feature structures, having at least one of their substructures embedded in them as a feature structure, as
illustrated in this clause.
NOTE 1 ISO 24610-1 avoids the use of the term “attribute-value”. Instead, it uses the term “feature-value”, thus
defining a feature structure as a function from a set of features to a set of values.
In the concrete syntax, adopted for representing these feature structures of VoxML in this document, the
names of its attributes are represented in all uppercase characters, while the names of elements start with
their first character in upper case (e.g. the attribute LEX for the element Object as in Figure 2).
NOTE 2 This document follows the convention of the current version of VoxML and Voxicon for representing
attribute names in upper case characters.
Figure 2 — Voxeme structure of a wall
6.3 Objects
The element Object in VoxML is used for modelling nouns. The current set of Object attributes is shown in
Table 1.
Table 1 — Object attributes
LEX Object’s lexical information
TYPE Object’s geometrical typing
HABITAT Object’s habitat for actions
AFFORD_STR Object’s affordance structure
EMBODIMENT Object’s agent-relative embodiment
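The attributes in Table 1 can be sketched as a record type. The class names and the sample values for a wall are hypothetical placeholders (the actual values are those of Figure 2); only the five attribute names, the PRED and TYPE attributes of LEX, and the upper-case naming convention come from this document.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Lex:
    pred: str   # predicate lexeme denoting the object
    type: str   # lexical type of the object


@dataclass
class ObjectVoxeme:
    lex: Lex
    type: dict = field(default_factory=dict)        # geometrical typing
    habitat: dict = field(default_factory=dict)     # habitat for actions
    afford_str: list = field(default_factory=list)  # affordance structure
    embodiment: dict = field(default_factory=dict)  # agent-relative embodiment


def to_feature_structure(voxeme):
    """Render a voxeme as nested dicts with upper-case attribute names."""
    def upper(mapping):
        return {key.upper(): upper(value) if isinstance(value, dict) else value
                for key, value in mapping.items()}
    return upper(asdict(voxeme))


# Placeholder values only; they are not taken from Figure 2.
wall = ObjectVoxeme(
    lex=Lex(pred="wall", type="physobj"),
    type={"HEAD": "rectangular_prism", "COMPONENTS": None},
)
print(to_feature_structure(wall))
```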
The attribute LEX in Table 1 contains a substructure, specified by two attributes: PRED and TYPE. The
attribute PRED in the substructure specifies the predicate lexeme denoting the Object, and the attribute
TYPE in the substructure specifies the Object’s type according to the Generative Lexicon [3] (see Figure 2).
There are two different sorts of the attribute TYPE, as shown in Figure 2. The first sort refers to the attribute
TYPE of the element Object. In contrast, the second sort refers to the attribute TYPE of the substructure
of the attribute LEX, which contains information to define the object geometry in terms of primitives.
This attribute TYPE has an attribute HEAD in its substructure, which specifies a primitive 3D shape that
roughly describes the object’s form (such as calling an apple an “ellipsoid”), or the form of the object’s most
semantically salient subpart. Possible values for the attribute HEAD are grounded in, for completeness,
mathematical formalism defining families of polyhedra [9], and, for the annotator’s ease, common primitives
found across the “corpus” of 3D artwork and 3D modelling software.
NOTE Mathematically curved surfaces such as spheres and cylinders are in fact represented, computed and
rendered as polyhedra by most modern 3D software [10].
Using common 3D modelling primitives as convenience definitions provides some built-in redundancy to
VoxML, as is found in an NL description of structural forms. For example, a “rectangular_prism” is the same
as a “parallelepiped” that has at least two defined planes of reflectional symmetry, meaning that an object
whose HEAD is a rectangular_prism can be defined in two ways, an association which a reasoner can unify
axiomatically. Possible values for the attribute HEAD are given in Table 2.
Table 2 — Possible values for the attribute HEAD
HEAD prismatoid, pyramid, wedge, parallelepiped, cupola, frustum, cylindroid, ellipsoid,
hemiellipsoid, bipyramid, rectangular_prism, toroid, sheet
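A hedged sketch of how a tool might check HEAD values and apply the equivalence between rectangular_prism and parallelepiped described above. The function names are hypothetical, and the unification rule is a simplified reading of the text: a parallelepiped with at least two planes of reflectional symmetry is treated as a rectangular_prism.

```python
# The thirteen HEAD values listed in Table 2.
HEAD_VALUES = frozenset({
    "prismatoid", "pyramid", "wedge", "parallelepiped", "cupola", "frustum",
    "cylindroid", "ellipsoid", "hemiellipsoid", "bipyramid",
    "rectangular_prism", "toroid", "sheet",
})


def is_valid_head(value):
    """Check that a HEAD value is one of the permitted primitives."""
    return value in HEAD_VALUES


def unify_head(head, reflect_planes):
    """Normalize a parallelepiped with two or more reflection planes.

    Simplified rule (an assumption of this sketch): such a parallelepiped
    is reported as a rectangular_prism; any other value is returned as-is.
    """
    if head == "parallelepiped" and len(set(reflect_planes)) >= 2:
        return "rectangular_prism"
    return head


print(is_valid_head("ellipsoid"), unify_head("parallelepiped", ["XY", "YZ"]))
```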
These values are not intended to reflect the exact structure of a particular geometry, but rather a cognitive
approximation of its shape, as is used in some image-recognition work [11].
The substructures of an object are enumerated in its attribute COMPONENTS. In Figure 2, the attribute
COMPONENTS embedded in the attribute TYPE has its value nil. CONCAVITY can be concave, flat or convex and
refers to any concavity that deforms the HEAD shape. ROTATSYM, or rotational symmetry, defines any of the
world’s three orthogonal axes around which the object’s geometry may be rotated for an interval of less than
360° and retain identical form as the unrotated geometry. A sphere may be rotated at any interval around
any of the three axes and retain the same form. A rectangular prism may be rotated 180° around any of the
three axes and retain the same shape. An object such as a ceiling fan would only have rotational symmetry
around the y-axis. Reflectional symmetry, or REFLECTSYM, is defined similarly. If an object can be bisected
by a plane defined by two of the world’s three orthogonal axes and then reflected across that plane to obtain
the same geometric form as the original object, it is considered to have reflectional symmetry across that
plane. A sphere or rectangular prism has reflectional symmetry across the XY, XZ and YZ planes. A wine
bottle only has reflectional symmetry across the XY and YZ planes.
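The two symmetry tests can be illustrated on a set of vertices. This is a simplified sketch, assuming the object is centred on the world origin and checking only 180° rotations (the definition allows any interval of less than 360°). The function names and the sample shapes are hypothetical.

```python
from itertools import product


def rotate180(point, axis):
    """Rotate a point by 180 degrees around a world axis through the origin."""
    x, y, z = point
    return {"X": (x, -y, -z), "Y": (-x, y, -z), "Z": (-x, -y, z)}[axis]


def reflect(point, plane):
    """Reflect a point across a world plane through the origin."""
    x, y, z = point
    return {"XY": (x, y, -z), "XZ": (x, -y, z), "YZ": (-x, y, z)}[plane]


def rotatsym(points):
    """Axes around which a 180-degree rotation maps the vertex set to itself."""
    pts = set(points)
    return [a for a in ("X", "Y", "Z") if {rotate180(p, a) for p in pts} == pts]


def reflectsym(points):
    """Planes across which reflection maps the vertex set to itself."""
    pts = set(points)
    return [pl for pl in ("XY", "XZ", "YZ") if {reflect(p, pl) for p in pts} == pts]


# A rectangular prism centred on the origin, and a square pyramid whose
# apex lies on the positive y-axis.
prism = list(product((-1, 1), (-2, 2), (-3, 3)))
pyramid = [(-1, 0, -1), (-1, 0, 1), (1, 0, -1), (1, 0, 1), (0, 2, 0)]

print(rotatsym(prism), reflectsym(prism))
print(rotatsym(pyramid), reflectsym(pyramid))
```

The pyramid behaves like the upright objects discussed above: its only rotational axis is the world's y-axis, and it is reflected onto itself across the XY and YZ planes but not across XZ.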
The possible values of ROTATSYM and REFLECTSYM are intended to be world-relative, not object-relative.
That is, because objects are only being discussed when situated in a minimal embedding space (MES), even
an otherwise empty one, wherein all coordinates are given Cartesian values, the axis of rotational symmetry
or plane of reflectional symmetry are those denoted in the world, not of the object. Thus, a tetrahedron
(which in isolation has seven axes of rotational symmetry, no two of which are orthogonal) when placed in
the MES such that it cognitively satisfies all “real-world” constraints, is situated with one base downward
(a tetrahedron placed any other way will fall over), which reduces the salient in-world axes of rotational
symmetry to one: the world’s y-axis. When the orientation of the object is ambiguous relative to the world,
the world is assumed to provide the grounding value.
The Habitat element defines habitats “intrinsic” to the object, regardless of what action it participates in,
such as intrinsic orientations or surfaces, as well as “extrinsic” habitats which must be satisfied for some
specified actions to take place. Intrinsic faces of an object can be defined in terms of its geometry and axes.
The model of a computer monitor, when axis-aligned according to 3D modelling convention, aligns the screen
with the world’s Z-axis facing the direction of increasing Z values. When discussing the object “computer
monitor”, the lexeme “front” singles out the screen of the monitor as opposed to any other p
...
International
Standard
ISO 24617-10
First edition
Language resource management —
2024-08
Semantic annotation framework
(SemAF) —
Part 10:
Visual information
Gestion des ressources linguistiques - Cadre d'annotation
sémantique —
Partie 10: informations visuelles (VoxML)
Reference number
© ISO 2024
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii
Contents Page
Foreword .iv
Introduction .v
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Abbreviated terms . 2
5 Basic semantic assumptions — Habitats and affordances . 3
6 VoxML specification . 4
6.1 Metamodel and VoxML elements . .4
6.2 Representation of VoxML structures .5
6.3 Objects .6
6.4 Actions as programs .7
6.5 Relations .8
6.5.1 General .8
6.5.2 Properties (Attributes) .8
6.5.3 Relations .9
6.5.4 Functions .9
7 Examples of voxemes . 9
7.1 General .9
7.2 Objects .10
7.3 Eventualities as programs . 13
7.4 Properties .14
7.5 Relations . 15
7.6 Functions . 15
8 Using VoxML for simulation modelling of language .16
9 VoxML-based annotation scheme .18
9.1 Overview .18
9.2 Annotation scheme .18
9.2.1 Abstract specification .18
9.2.2 Concrete syntax for the representation of annotation structures .19
9.3 Semantic representation and interpretation . 20
Bibliography .22
iii
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 4, Language resource management.
A list of all parts in the ISO 24617 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
iv
Introduction
This document standardizes the specification of a semantic annotation scheme for visual information, based
on a modelling language for constructing three-dimensional (3D) visualizations of concepts denoted by
natural language (NL) expressions. This modelling language serves as a semantic basis of interpreting the
semantic forms of annotation structures model-theoretically by constraining the models for interpretation.
This document focuses on the introduction of the modelling language as a semantic basis for interpretation,
since the syntactic specification of the annotation scheme for visual information is a simplified formulation
based on the abstract specification of the spatio-temporal annotation schemes, such as those specified in
ISO 24617-1, ISO 24617-7 and ISO 24617-14. These three standards lay a theoretical basis for this document,
which specifies ways of annotating visual information involving motions and actions that are spatio-
temporally characterized.
The modelling language, named “VoxML” (visual object concept structure modelling language), where “Vox”
abbreviates “visual object concept structure” (VOCS), can be used as the platform for creating multimodal
semantic simulations in the context of human-computer communication. VoxML encodes semantic knowledge
of real-world objects represented as 3D models, and of events and attributes related to and enacted over
these objects. VoxML is intended to overcome the limitations of existing 3D visual markup languages by
allowing for the encoding of a broad range of semantic knowledge that can be exploited by a variety of
systems and platforms, leading to multimodal simulations of real-world scenarios using conceptual objects
that represent their semantic values.
NOTE 1 The main content of this document is based on References [1] and [2]. Reference [1] was developed by the
Brandeis University Computer Science Department in the context of communicating with computers (CwC), a Defence
Advanced Research Projects Agency (DARPA) effort to identify and construct computational semantic elements, for
the purpose of carrying out joint plans between a human and computer through NL discourse.
NOTE 2 This document adopts VoxML as a semantic basis for enriching the model for interpreting the descriptions
of objects, actions and relations involving dynamic visual information.
This document outlines a specification:
a) to formulate the annotation scheme for visual information;
b) to represent semantic knowledge of real-world objects represented as 3D models.
It uses a combination of parameters that can be determined from the object’s geometrical properties as
well as lexical information from NL, with methods of correlating the two where applicable. This information
allows for visualization and simulation software to fill in information missing from the NL input and
allows the software to render a functional visualization of programs being run over objects in a robust and
extensible way. Currently, a voxicon, which is the structured repository of visual object concepts, contains
500 object (noun) voxemes and 10 program (verb) voxemes, where voxemes are the lexemes or entries of the voxicon.
NOTE 3 As this library of available voxemes continues to grow, the specification elements will operationalize an
increasingly large library of various and more complicated programs. A voxeme library and visualization software
with which users can visualize available behaviours driven by VoxML after parsing and
interpretation are available from Reference [25].
International Standard ISO 24617-10:2024(en)
Language resource management — Semantic annotation
framework (SemAF) —
Part 10:
Visual information
1 Scope
This document specifies an annotation language for visual information, based on VoxML (visual object
concept structure modelling language), a modelling language for the visualizations of concepts and actions
denoted by natural language (NL) expressions in three dimensions (3D).
The specification of the VoxML-based annotation scheme conforms to the requirements given in ISO 24617-1,
ISO 24617-7 and ISO 24617-14. The adoption of VoxML, specified in ISO 24617-14 as a semantic basis, is
necessary for the 3D simulation and visualization of actions and motions taken by both human and artificial
agents in real-life situations.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1:2006, Language resource management — Feature structures — Part 1: Feature structure
representation
ISO 24617-1, Language resource management — Semantic annotation framework (SemAF) — Part 1: Time and
events (SemAF-Time, ISO-TimeML)
ISO 24617-7, Language resource management — Semantic annotation framework — Part 7: Spatial information
ISO 24617-14, Language resource management — Semantic annotation framework (SemAF) — Part 14: Spatial
semantics
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
affordance
affordance structure
set of specific actions, described along with the requisite conditions, that the object may take part in
3.1.1
Gibsonian affordance
GA
set of specific actions that an agent can perform with an object that is presented to the agent
EXAMPLE Hold, grasp, move.
3.1.2
telic affordance
set of goal-oriented or intentionally situated actions of an agent on an object presented to the agent
EXAMPLE An agent eating an apple when it is presented to the agent.
3.2
habitat
representation of an object situated within a partial minimal model
3.3
minimal embedding space
MES
three-dimensional (3D) region within which the state is configured, or the event unfolds
3.4
qualia
qualia structure
QS
relational forces or aspects of a lexical item or concept
3.5
telic
purpose or function qualia (3.4) of an object
3.6
voxeme
basic entry in the voxicon (3.7)
3.7
voxicon
lexicon or list of basic visual object concepts of VoxML (visual object concept structure modelling language)
4 Abbreviated terms
3D three-dimensional
A agentive role
ARG argument
AS atomic structure
AS_visML annotation scheme for visual information markup language
ASyn_visML abstract syntax for visual information markup language
CSyn_visML concrete syntax for visual information markup language
C constitutive property
F formal property
GA Gibsonian affordance
ID identifier
MES minimal embedding space
NL natural language
NLP natural language processing
QS qualia structure
T telic role
Vox visual object concept structure
VoxML visual object concept structure modelling language
XML extensible markup language
5 Basic semantic assumptions — Habitats and affordances
Before introducing the VoxML specification, this document reviews two basic assumptions regarding the
semantics underlying the model. Following the Generative Lexicon [3], lexical entries in the object language
are given a feature structure consisting of a word’s basic type, its parameter listing, its event typing and its
qualia structure. In accordance with ISO 24610-1:2006, each feature structure shall be typed, consisting of
pairs of features (attributes) and values, either atomic or complex. If a value is a variable, then it is bound
either universally, existentially, or by the lambda operator, as shown in Example 1.
The semantic structure of an object shall be analysed into the following four sub-structures:
a) atomic structure (formal): objects expressed as basic nominal types;
b) subatomic structure (constitutive): mereo-topological structure of objects;
c) event structure (telic) and (agentive): origin and functions associated with an object;
d) macro-object structure: how objects fit together in space and through coordinated activities.
Objects can be partially contextualized through their qualia structure. For example, a food item has a telic
value of “eat”; an instrument for writing has a telic value of “write”; a cup has a telic value of “hold”, etc. As a
further example, the lexical semantics for the noun “chair” carries a telic value of “sit_in”:
EXAMPLE 1
where
AS is an atomic structure;
QS is a qualia structure;
ARG1 is argument 1;
F is a formal property;
T is a telic role.
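As an informal illustration of such a typed feature structure, the following Python sketch encodes an entry for “chair” as nested feature-value pairs, with the telic value bound by the lambda operator. The class name Bound, the dictionary layout, the variable names and the value given for F are assumptions made for illustration only; the telic predicate “sit_in” is the only value taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """A feature value whose variables are bound by an operator."""
    binder: str        # "lambda", "exists" or "forall"
    variables: tuple   # names of the bound variables
    body: str          # predicate expression, kept as plain text here


# Hypothetical entry: only the telic predicate "sit_in" comes from the text;
# every other value is an illustrative assumption.
chair = {
    "PRED": "chair",
    "AS": {"ARG1": "x"},
    "QS": {
        "F": "physical_object(x)",
        "T": Bound("lambda", ("y", "e"), "sit_in(e, y, x)"),
    },
}


def telic_predicate(entry):
    """Return the predicate name of the telic role (T) in the qualia structure."""
    return entry["QS"]["T"].body.split("(", 1)[0]
```

Under these assumptions, telic_predicate(chair) retrieves the predicate name stored under the feature path QS, T.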
While an artefact is designed for a specific purpose (its telic role), this can only be achieved under
specific circumstances. Reference [4] introduces the notion of an object’s “habitat”, which encodes these
circumstances. References [5] and [6] further define the notion of habitat and how it interacts with
affordances. It is assumed that for an artefact, x, given the appropriate context C, performing the action π
will result in the intended or desired resulting state, R, i.e. C → [π]R. That is, if a context C (a set of contextual
factors) is satisfied, then every time the activity of π is performed, the resulting state R will occur. It is
necessary to specify the precondition context C since this enables the local modality to be satisfied.
Using this notion, a habitat is defined as representing an object situated within a partial minimal model; it is
a directed enhancement of the qualia structure. Multi-dimensional affordances determine how habitats are
deployed and how they modify or augment the context, and compositional operations include procedural
(simulation) and operational (selection, specification, refinement) knowledge.
The habitat for an object is built by first placing it within an embedding space and then contextualizing it.
For example, to use a table, the top must be oriented upward, the surface must be accessible, etc. A chair
also must be oriented up, the seat must be free and accessible, it must be able to support the user, etc. The
resulting knowledge structure for the habitat of a chair is illustrated in Example 2.
EXAMPLE 2
where
F is a formal property;
C is a constitutive property;
T is a telic role;
A is an agentive role.
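The conditional C → [π]R and the chair conditions listed above can be sketched informally in Python. The class ChairState, the function names and the choice to model a failed context as “no change” are assumptions for illustration; this document does not prescribe any particular implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ChairState:
    oriented_up: bool
    seat_free: bool
    supports_user: bool
    occupied_by: Optional[str] = None


def habitat_holds(state):
    """Context C: the conditions under which the chair can be used."""
    return state.oriented_up and state.seat_free and state.supports_user


def sit_in(agent, state):
    """Action pi: yields the result R (agent seated) whenever C holds."""
    if not habitat_holds(state):
        return state  # C fails, so R is not guaranteed; modelled as no change
    return replace(state, seat_free=False, occupied_by=agent)
```

In this sketch, an upright, free and sturdy chair always ends up occupied after sit_in, whereas a tipped-over chair is returned unchanged, which mirrors the role of the precondition context C.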
As described in more detail in 6.4, event or action simulations are constructed from the composition of
object habitats, along with some constraints imposed by the dynamic event structure inherent in the verb
itself, when interpreted as a program.
The final step in contextualizing the semantics of an object is to operationalize the telic value in its habitat.
This effectively means identifying the “affordance structure” for the object [7][8]. The affordance structure
available to an agent, when presented with an object, is the set of actions that can be performed with it.
These are referred to as “Gibsonian affordances” and they include “grasp”, “move”, “hold”, “turn”, etc.
This is to distinguish them from more goal-directed, intentionally situated activities, referred to as “telic
affordances”.
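The distinction between the two kinds of affordance can be sketched as follows. The data layout and the names Kind, Affordance and APPLE are hypothetical; the actions are those given as examples in 3.1.1 and 3.1.2.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    GIBSONIAN = "gibsonian"
    TELIC = "telic"


@dataclass(frozen=True)
class Affordance:
    action: str
    kind: Kind


# Illustrative affordance structure for an apple, using the actions named
# in 3.1.1 and 3.1.2; the data layout itself is an assumption.
APPLE = (
    Affordance("grasp", Kind.GIBSONIAN),
    Affordance("hold", Kind.GIBSONIAN),
    Affordance("move", Kind.GIBSONIAN),
    Affordance("eat", Kind.TELIC),
)


def actions(affordances, kind):
    """List the actions of one kind, in the order in which they are stored."""
    return [a.action for a in affordances if a.kind is kind]
```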
6 VoxML specification
6.1 Metamodel and VoxML elements
The spatio-temporal annotation schemes given in ISO 24617-1, ISO 24617-7 and ISO 24617-14 shall apply.
The metamodel, graphically depicted by Figure 1, represents a small world of basic elements modelled in
VoxML. These elements form a set of categories:
a) event (program);
b) entity (object);
c) relation over them.
Events, especially actions, work as programs while taking simple objects or spatio-temporally localized
objects as arguments. Entities as objects are individuals or groups that may behave as agents. Relations can
be divided into properties, often referred to as “attributes”, and functions as subcategories. Attributes and
relations evaluate to states, and functions evaluate to geometric regions. These elements can then compose
into visualizations of NL concepts and expressions.
The metamodel of VoxML, presented in Figure 1, has no regions or times. These are introduced by functions
such as loc and τ. The function loc, for instance, maps an object x to the region loc(x) to which it is anchored.
Likewise, τ(x) maps an event to an event time, the time of its occurrence. Similarly, the function seq or the
function vec maps a set of regions to a path or a vector. Thereby, the ontology of VoxML is enriched with
spatio-temporal entities and dynamic paths.
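A minimal Python sketch of these functions is given below. It assumes, for illustration only, that a region is reduced to the coordinates of its centre, that an event time is a (start, end) pair, that the regions passed to seq and vec are already ordered, and that vec returns the displacement from the first region to the last. None of these representational choices is specified by this document.

```python
def loc(anchoring, obj):
    """Map an object to the region to which it is anchored."""
    return anchoring[obj]


def tau(occurrences, event):
    """Map an event to the time of its occurrence."""
    return occurrences[event]


def seq(regions):
    """Map ordered regions to a path, here a tuple of region centres."""
    return tuple(regions)


def vec(regions):
    """Map ordered regions to a vector from the first region to the last."""
    path = seq(regions)
    start, end = path[0], path[-1]
    return tuple(e - s for s, e in zip(start, end))
```

For example, with regions centred at (0, 0, 0), (1, 0, 0) and (1, 2, 0), seq keeps all three as a path, while vec reduces them to the single displacement (1, 2, 0).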
NOTE 1 The empty triangular head of an arrow represents a subcategorization relation. Each directed arrow with
a smaller filled-in arrowhead relates one element to one or more other elements, while its label specifies the
relation. An entity as an agent, for example, intentionally triggers an action, while the action is a subcategory of an
event, treated as a program.
NOTE 2 SOURCE: Reference [2], reproduced with the permission of the authors.
Figure 1 — Metamodel
6.2 Representation of VoxML structures
This document follows the convention of the current version of VoxML and Voxicon (see Reference [1]). Basic
VoxML structures called “voxemes” are conventionally represented as feature structures, each consisting of
a set of attribute-value specifications, conforming to ISO 24610-1. Voxemes are mostly formed by complex
feature structures, having at least one of their substructures embedded in them as a feature structure, as
illustrated in this clause.
NOTE 1 ISO 24610-1 avoids the use of the term “attribute-value”. Instead, it uses the term “feature-value”, thus
defining a feature structure as a function from a set of features to a set of values.
In the concrete syntax, adopted for representing these feature structures of VoxML in this document, the
names of its attributes are represented in all uppercase characters, while the names of elements start with
their first character in upper case (e.g. the attribute LEX for the element Object as in Figure 2).
NOTE 2 This document follows the convention of the current version of VoxML and Voxicon for representing
attribute names in upper case characters.
Figure 2 — Voxeme structure of a wall
6.3 Objects
The element Object in VoxML is used for modelling nouns. The current set of Object attributes is shown in
Table 1.
Table 1 — Object attributes
LEX Object’s lexical information
TYPE Object’s geometrical typing
HABITAT Object’s habitat for actions
AFFORD_STR Object’s affordance structure
EMBODIMENT Object’s agent-relative embodiment
The attribute LEX in Table 1 contains a substructure, specified by two attributes: PRED and TYPE. The
attribute PRED in the substructure specifies the predicate lexeme denoting the Object, and the attribute
TYPE in the substructure specifies the Object’s type according to the Generative Lexicon [3] (see Figure 2).
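The attribute inventory of Table 1 and the substructure of LEX can be mirrored by a simple container type, sketched below. The class names Lex and VoxObject, the Python types chosen for each attribute and the values given for the wall are placeholders assumed for illustration; they are not taken from Figure 2.

```python
from dataclasses import dataclass, field


@dataclass
class Lex:
    PRED: str   # predicate lexeme denoting the object
    TYPE: str   # type according to the Generative Lexicon


@dataclass
class VoxObject:
    LEX: Lex
    TYPE: dict = field(default_factory=dict)        # geometrical typing
    HABITAT: dict = field(default_factory=dict)     # habitat for actions
    AFFORD_STR: list = field(default_factory=list)  # affordance structure
    EMBODIMENT: dict = field(default_factory=dict)  # agent-relative embodiment


# Placeholder values; they are not taken from Figure 2.
wall = VoxObject(LEX=Lex(PRED="wall", TYPE="physobj"))
```

Keeping the attribute names in upper case follows the concrete-syntax convention described in 6.2.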
There are two different sorts of the attribute TYPE, as shown in Figure 2. The first sort refers to the attribute
TYPE of the substructure of the attribute LEX. In contrast, the second sort refers to the attribute TYPE of the
element Object, which contains information to define the object geometry in terms of primitives.
This attribute TYPE has an attribute HEAD in its substructure, which specifies a primitive 3D shape that
roughly describes the object’s form (such as calling an apple an “ellipsoid”), or the form of the object’s most
semantically salient subpart. Possible values for the attribute HEAD are grounded in, for completeness,
mathematical formalism defining families of polyhedra [9], and, for the annotator’s ease, common primitives
found across the “corpus” of 3D artwork and 3D modelling software.
NOTE Mathematically curved surfaces such as spheres and cylinders are in fact represented, computed and
rendered as polyhedra by most modern 3D software [10].
Using common 3D modelling primitives as convenience definitions provides some built-in redundancy to
VoxML, as is found in an NL description of structural forms. For example, a “rectangular_prism” is the same
as a “parallelepiped” that has at least two defined planes of reflectional symmetry, meaning that an object
whose HEAD is a rectangular_prism can be defined in two ways, an association which a reasoner can unify
axiomatically. Possible values for the attribute HEAD are given in Table 2.
Table 2 — Possible values for the attribute HEAD
HEAD prismatoid, pyramid, wedge, parallelepiped, cupola, frustum, cylindroid, ellipsoid,
hemiellipsoid, bipyramid, rectangular_prism, toroid, sheet
These values are not intended to reflect the exact structure of a particular geometry, but rather a cognitive
approximation of its shape, as is used in some image-recognition work [11].
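The redundancy between “rectangular_prism” and “parallelepiped” described above can be handled by a small normalization step, sketched here. The function name canonical_head and the representation of the reflection planes as a collection of plane labels are assumptions for illustration; this is one possible way a reasoner could unify the two descriptions, not a prescribed rule.

```python
def canonical_head(head, reflection_planes):
    """Map a parallelepiped with two or more reflection planes to rectangular_prism."""
    if head == "parallelepiped" and len(set(reflection_planes)) >= 2:
        return "rectangular_prism"
    return head
```

Any other HEAD value, or a parallelepiped with fewer than two defined planes of reflectional symmetry, is returned unchanged.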
The substructures of an object are enumerated in its attribute COMPONENTS. In Figure 2, the attribute
COMPONENTS embedded in the attribute TYPE has its value nil. CONCAVITY can be concave, flat or convex and
refers to any concavity that deforms the HEAD shape. ROTATSYM, or rotational symmetry, defines any of the
world’s three orthogonal axes around which the object’s geometry may be rotated for an interval of less than
360° and retain identical form as the unrotated geometry. A sphere may be rotated at any interval around
any of the three axes and retain the same form. A rectangular prism may be rotated 180° around any of the
three axes and retain the same shape. An object such as a ceiling fan would only have rotational symmetry
around the y-axis. Reflectional symmetry, or REFLECTSYM, is defined similarly. If an object can be bisected
by a plane defined by two of the world’s three orthogonal axes and then reflected across that plane to obtain
the same geometric form as the original object, it is considered to have reflectional symmetry across that
plane. A sphere or rectangular prism has reflectional symmetry across the XY, XZ and YZ planes. A wine
bottle only has reflectional symmetry across the XY and YZ planes.
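The two symmetry definitions can be checked numerically for an object given as a set of vertices, as in the following sketch. It assumes that the object is centred on the world origin, that the y-axis points upward, and that testing the rotation angles 90°, 120° and 180° is sufficient; these are simplifications for illustration, and a curved object such as a sphere cannot be tested in this way. The function names are hypothetical.

```python
import math
from itertools import product


def _norm(p):
    """Round coordinates so that floating-point noise does not break equality."""
    return tuple(round(float(v), 9) for v in p)


def rotate(p, axis, degrees):
    """Rotate point p about the world axis "X", "Y" or "Z"."""
    x, y, z = p
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "X":
        return _norm((x, c * y - s * z, s * y + c * z))
    if axis == "Y":
        return _norm((c * x + s * z, y, c * z - s * x))
    return _norm((c * x - s * y, s * x + c * y, z))


def reflect(p, plane):
    """Reflect point p across the world plane "XY", "XZ" or "YZ"."""
    x, y, z = p
    return _norm({"XY": (x, y, -z), "XZ": (x, -y, z), "YZ": (-x, y, z)}[plane])


def rotatsym(points, angles=(90, 120, 180)):
    """Axes about which some tested rotation maps the vertex set onto itself."""
    pts = {_norm(p) for p in points}
    return {a for a in "XYZ"
            if any({rotate(p, a, d) for p in pts} == pts for d in angles)}


def reflectsym(points):
    """Planes across which reflection maps the vertex set onto itself."""
    pts = {_norm(p) for p in points}
    return {pl for pl in ("XY", "XZ", "YZ")
            if {reflect(p, pl) for p in pts} == pts}


# A rectangular prism, and a square pyramid standing upright on the XZ plane.
box = list(product((-1, 1), (-2, 2), (-3, 3)))
pyramid = [(1, 0, 1), (1, 0, -1), (-1, 0, 1), (-1, 0, -1), (0, 2, 0)]
```

Under these assumptions, the rectangular prism is symmetric about all three axes and across all three planes, whereas the upright pyramid, like the wine bottle and the ceiling fan above, retains only the y-axis for ROTATSYM and the XY and YZ planes for REFLECTSYM.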
The possible values of ROTATSYM and REFLECTSYM are intended to be world-relative, not object-relative.
That is, because objects are only being discussed when situated in a minimal embedding space (MES), even
an otherwise empty one, wherein all coordinates are given Cartesian values, the axis of rotational symmetry
or plane of reflectional symmetry are those denoted in the world, not of the object. Thus, a tetrahedron
(which in isolation has seven axes of rotational symmetry, no two of which are orthogonal) when placed in
the MES such that it cognitively satisfies all “real-world” constraints, is situated with one base downward
(a tetrahedron placed any other way will fall over), thus reducing the salient in-world axes of rotational
symmetry to one: the world’s y-axis. When the orientation of the object is ambiguous relative to the world,
the world is assumed to provide the grounding value.
The Habitat element defines habitats “intrinsic” to the object, regardless of what action it participates in,
such as intrinsic orientations or surfaces, as well as “extrinsic” habitats which must be satisfied for some
specified actions to take place. Intrinsic faces of an object can be defined in terms of its geometry and axes.
The model of a computer monitor, when axis-aligned according to 3D modelling convention, aligns the screen
with the world’s Z-axis facing the direction of increasing Z values. When discussing the object “computer
monitor”, the lexeme “front” singles out the screen of the monitor as opposed to any other part. The lexeme
can therefore be correlated with the geometrical representation by establishing an intrinsic habitat of the
computer monitor of front(+Z). The terminology of “alignment” of an object dimension, d ∈ {x, y, z}, is adopted
with the dimension, d', of its embedding space, Ԑ, as follows: align(d, Ԑ, d').
The attribute AFFORD_STR describes the set of specific actions, along with the requisite conditions, that
the object can potentially take part in. There are low-level affordances, called “Gibsonian”, which involve
manipulation or manoeuv
...









