ETSI TR 104 062 V1.2.1 (2024-07)
TECHNICAL REPORT
Securing Artificial Intelligence;
Automated Manipulation of Multimedia Identity Representations
Reference
RTR/SAI-0010
Keywords
artificial intelligence, identity
ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - APE 7112B
Association à but non lucratif enregistrée à la
Sous-Préfecture de Grasse (06) N° w061004871
Important notice
The present document can be downloaded from:
https://www.etsi.org/standards-search
The present document may be made available in electronic versions and/or in print. The content of any electronic and/or
print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any
existing or perceived difference in contents between such versions and/or in print, the prevailing version of an ETSI
deliverable is the one made publicly available in PDF format at www.etsi.org/deliver.
Users of the present document should be aware that the document may be subject to revision or change of status.
Information on the current status of this and other ETSI documents is available at
https://portal.etsi.org/TB/ETSIDeliverableStatus.aspx
If you find errors in the present document, please send your comment to one of the following services:
https://portal.etsi.org/People/CommiteeSupportStaff.aspx
If you find a security vulnerability in the present document, please report it through our
Coordinated Vulnerability Disclosure Program:
https://www.etsi.org/standards/coordinated-vulnerability-disclosure
Notice of disclaimer & limitation of liability
The information provided in the present deliverable is directed solely to professionals who have the appropriate degree of
experience to understand and interpret its content in accordance with generally accepted engineering or
other professional standard and applicable regulations.
No recommendation as to products and services or vendors is made or should be implied.
No representation or warranty is made that this deliverable is technically accurate or sufficient or conforms to any law
and/or governmental rule and/or regulation and further, no representation or warranty is made of merchantability or fitness
for any particular purpose or against infringement of intellectual property rights.
In no event shall ETSI be held liable for loss of profits or any other incidental or consequential damages.
Any software contained in this deliverable is provided "AS IS" with no warranties, express or implied, including but not
limited to, the warranties of merchantability, fitness for a particular purpose and non-infringement of intellectual property
rights and ETSI shall not be held liable in any event for any damages whatsoever (including, without limitation, damages
for loss of profits, business interruption, loss of information, or any other pecuniary loss) arising out of or related to the use
of or inability to use the software.
Copyright Notification
No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and
microfilm except as authorized by written permission of ETSI.
The content of the PDF version shall not be modified without the written authorization of ETSI.
The copyright and the foregoing restriction extend to reproduction in all media.
© ETSI 2024.
All rights reserved.
Contents
Intellectual Property Rights
Foreword
Modal verbs terminology
1 Scope
2 References
2.1 Normative references
2.2 Informative references
3 Definition of terms, symbols and abbreviations
3.1 Terms
3.2 Symbols
3.3 Abbreviations
4 Introduction
4.1 Problem Statement
5 Deepfake methods
5.1 Video
5.1.1 General
5.1.2 Face swapping
5.1.3 Face reenactment
5.1.4 Synthetic faces
5.2 Audio
5.3 Text
5.4 Combinations
6 Attack scenarios
6.1 Attacks on media and societal perception
6.1.1 Influencing public opinion
6.1.2 Personal defamation
6.2 Attacks on authenticity
6.2.1 Attacking biometric authentication methods
6.2.2 Social Engineering
6.3 Digression: Benign use of deepfakes
7 State of the art
7.1 Data
7.1.1 Data required for Video Manipulation
7.1.2 Data required for Audio Manipulation
7.1.3 Data required for Text Manipulation
7.2 Tools
7.2.1 Tools for Video Manipulation
7.2.2 Tools for Audio Manipulation
7.2.3 Tools for Text Manipulation
7.3 Latency
7.3.1 Latency in Video Manipulation
7.3.2 Latency in Audio Manipulation
7.3.3 Latency in Text Manipulation
7.4 Distinguishability
7.4.1 Distinguishability of Video Manipulation
7.4.2 Distinguishability of Audio Manipulation
7.4.3 Distinguishability of Text Manipulation
8 Countermeasures
8.1 General countermeasures
8.2 Attack-specific countermeasures
8.2.1 Influencing public opinion
8.2.2 Social Engineering
8.2.3 Attacks on authentication methods
History
Intellectual Property Rights
Essential patents
IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The declarations
pertaining to these essential IPRs, if any, are publicly available for ETSI members and non-members, and can be
found in ETSI SR 000 314: "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to
ETSI in respect of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the
ETSI Web server (https://ipr.etsi.org/).
Pursuant to the ETSI Directives including the ETSI IPR Policy, no investigation regarding the essentiality of IPRs,
including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not
referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become,
essential to the present document.
Trademarks
The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners.
ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no
right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does
not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
DECT™, PLUGTESTS™, UMTS™ and the ETSI logo are trademarks of ETSI registered for the benefit of its
Members. 3GPP™ and LTE™ are trademarks of ETSI registered for the benefit of its Members and of the 3GPP
Organizational Partners. oneM2M™ logo is a trademark of ETSI registered for the benefit of its Members and of the
oneM2M Partners. GSM® and the GSM logo are trademarks registered and owned by the GSM Association.
Foreword
This Technical Report (TR) has been produced by ETSI Technical Committee Securing Artificial Intelligence (SAI).
NOTE: The present document updates and replaces ETSI GR SAI 011.
Modal verbs terminology
In the present document "should", "should not", "may", "need not", "will", "will not", "can" and "cannot" are to be
interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).
"must" and "must not" are NOT allowed in ETSI deliverables except when used in direct citation.
1 Scope
The present document covers AI-based techniques for automatically manipulating existing or creating fake identity data
represented in different media formats, such as audio, video and text (deepfakes). The present document describes the
different technical approaches and analyses the threats posed by deepfakes in different attack scenarios. It then provides
technical and organizational measures to mitigate these threats and discusses their effectiveness and limitations.
2 References
2.1 Normative references
Normative references are not applicable in the present document.
2.2 Informative references
References are either specific (identified by date of publication and/or edition number or version number) or
non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the
referenced document (including any amendments) applies.
NOTE: While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee
their long term validity.
The following referenced documents are not necessary for the application of the present document but they assist the
user with regard to a particular subject area.
[i.1] Reuters, 2020: "Fact check: "Drunk" Nancy Pelosi video is manipulated".
[i.2] Karras et al., 2019: "Analyzing and Improving the Image Quality of StyleGAN".
[i.3] Gu et al., 2021: "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image
Synthesis".
[i.4] Abdal et al., 2020: "StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images
using Conditional Continuous Normalizing Flows".
[i.5] Roich et al., 2021: "Pivotal Tuning for Latent-based Editing of Real Images".
[i.6] Zhang et al., 2020: "MIPGAN - Generating Robust and High Quality Morph Attacks Using
Identity Prior Driven GAN".
[i.7] Tan et al., 2021: "A Survey on Neural Speech Synthesis".
[i.8] Qian et al., 2020: "Unsupervised Speech Decomposition via Triple Information Bottleneck".
[i.9] Casanova et al., 2021: "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice
Conversion for everyone".
[i.10] VICE, 2017: "AI-Assisted porn has arrived - and Gal Gadot has been made its victim".
[i.11] NYTimes, 2020: "Deepfake Technology Enters the Documentary World".
[i.12] BuzzFeedVideo, 2018: "You Won't Believe What Obama Says In This Video!".
[i.13] C. Chan et al., 2019: "Everybody Dance Now".
[i.14] Adobe®, 2021: "Roto Brush and Refine Matte".
[i.15] Prajwal et al., 2020: "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the
Wild".
[i.16] Fried et al., 2019: "Text-based Editing of Talking-head Video".
[i.17] Zhou et al., 2021: "Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-
Visual Representation".
[i.18] Hwang, 2020: "Deepfakes - A grounded threat assessment", Center for Security and Emerging
Technology.
[i.19] Reuters, 2022: "Deepfake footage purports to show Ukrainian president capitulating".
[i.20] Forbes, 2021: "Fraudsters Cloned Company Director's Voice In $35 Million Bank Heist, Police
Find".
[i.21] Forbes, 2019: "Deepfakes, Revenge Porn, And The Impact On Women".
[i.22] A. Vaswani et al., 2017: "Attention is all you need", Advances in Neural Information
Processing Systems, 30.
[i.23] Irene Solaiman et al., 2019: "Release Strategies and the Social Impacts of Language Models".
[i.24] Vincenzo Ciancaglini et al., 2020: "Malicious Uses and Abuses of Artificial Intelligence", Trend
Micro Research.
[i.25] Eugene Lim, Glencie Tan, Tan Kee Hock, 2021: "Hacking Humans with AI as a Service", DEF
CON 29.
[i.26] Susan Zhang, 2022: "OPT: Open Pre-trained Transformer Language Models".
[i.27] Karen Hao, 2021: "The race to understand the exhilarating, dangerous world of language AI", MIT
Technology Review.
[i.28] Ben Buchanan et al., 2021: "Truth, Lies, and Automation How Language Models Could Change
Disinformation", Center for Security and Emerging Technology.
[i.29] Cooper Raterink, 2021: "Assessing the risks of language model "deepfakes" to democracy".
[i.30] Li Dong et al., 2019: "Unified Language Model Pre-training for Natural Language Understanding
and Generation", Advances in Neural Information Processing Systems, Curran Associates, Inc.
[i.31] Almira Osmanovic Thunström: "We Asked GPT-3 to Write an Academic Paper about Itself-Then
We Tried to Get It Published".
[i.32] Tom B. Brown et al, 2020: "Language Models are Few-Shot Learners", Advances in Neural
Information Processing Systems, Curran Associates, Inc.
[i.33] OpenAI, 2019: "Better Language Models and Their Implications".
[i.34] David M. J. Lazer et al., 2018: "The science of fake news".
[i.35] Mark Chen et al., 2021: "Evaluating Large Language Models Trained on Code".
[i.36] Chaos Computer Club, 2022: "Chaos Computer Club hacks Video-Ident".
[i.37] European Commission, 2021: "Proposal for a Regulation of the European Parliament and of the
Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and
amending certain Union legislative acts".
[i.38] Alexandre Sablayrolles et al., 2020: "Radioactive data: tracing through training".
[i.39] Zen et al., 2019: "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech".
[i.40] Kim et al., 2022: "Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech
with Untranscribed Data".
[i.41] Watanabe et al., 2018: "ESPnet: End-to-End Speech Processing Toolkit".
[i.42] Hayashi et al., 2020: "Espnet-TTS: Unified, reproducible, and integratable open source end-to-end
text-to-speech toolkit".
[i.43] Chen et al., 2022: "Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-
streaming Teacher Guidance".
[i.44] Ronssin et al., 2021: "AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice
Conversion".
[i.45] Tan et al., 2022: "NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level
Quality".
[i.46] Liu et al., 2022: "ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild".
[i.47] Müller et al., 2021, ASVspoof 2021: "Speech is Silver, Silence is Golden: What do ASVspoof-
trained Models Really Learn?".
[i.48] Müller et al., 2022, ASVspoof 2021: "Does Audio Deepfake Detection Generalize?".
[i.49] Gölge Eren, 2021: "Coqui TTS - A deep learning toolkit for Text-to-Speech, battle-tested in
research and production".
[i.50] Min et al., 2021, Meta-StyleSpeech: "Multi-Speaker Adaptive Text-to-Speech Generation".
[i.51] Keith Ito, Linda Johnson, 2017: "The LJ Speech Dataset".
[i.52] Ganesh Jawahar, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, 2020: "Automatic
Detection of Machine Generated Text: A Critical Survey".
[i.53] Rowan Zellers et al., 2019: "Defending Against Neural Fake News", Advances in Neural
Information Processing Systems, Curran Associates, Inc.
[i.54] Original Deepfake Code, 2017.
[i.55] Matt Tora, Bryan Lyon, Kyle Vrooman, 2018: "Faceswap".
[i.56] Ivan Perov et al., 2020: "DeepFaceLab: A simple, flexible and extensible face swapping
framework".
[i.57] Yuval Nirkin et al., 2019: "FSGAN: Subject Agnostic Face Swapping and Reenactment".
[i.58] Lingzhi Li et al., 2020: "FaceShifter: Towards High Fidelity and Occlusion Aware Face
Swapping".
[i.59] Renwang Chen et al., 2021: "SimSwap: An Efficient Framework for High Fidelity Face
Swapping".
[i.60] Jiankang Deng et al., 2018: "ArcFace: Additive Angular Margin Loss for Deep Face Recognition".
[i.61] Aliaksandr Siarohin et al., 2020: "First Order Motion Model for Image Animation".
[i.62] Justus Thies et al., 2020: "Face2Face: Real-time Face Capture and Reenactment of RGB Videos".
[i.63] Guy Gafni et al., 2021: "Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar
Reconstruction".
[i.64] Andreas Rössler et al., 2019: "FaceForensics++: Learning to Detect Manipulated Facial Images".
[i.65] TheVerge, 2021: "Tom Cruise deepfake creator says public shouldn't be worried about 'one-click
fakes'".
[i.66] Matt Tora, 2019: "[Guide] Training in Faceswap".
[i.67] J. Naruniec et al., 2020: "High-Resolution Neural Face Swapping for Visual Effects".
[i.68] H. Khalid et al., 2021: "FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset".
[i.69] W. Paier et al., 2021: "Example-Based Facial Animation of Virtual Reality Avatars Using Auto-
Regressive Neural Networks".
[i.70] L. Ouyang et al., 2022: "Training language models to follow instructions with human feedback"
(GPT35).
[i.71] P. Christiano et al., 2017: "Deep reinforcement learning from human preferences"
(RLHFOriginal).
[i.72] OpenAI, 2022: "Introducing ChatGPT" (ChatGPT).
[i.73] A. Glaese et al., 2022: "Improving alignment of dialogue agents via targeted human judgements"
(Sparrow).
[i.74] J. Menick et al., 2022: "Teaching language models to support answers with verified quotes"
(GopherCite).
[i.75] Emily M. Bender et al., 2021: "On the Dangers of Stochastic Parrots: Can Language Models Be
Too Big?".
[i.76] J. Devlin et al., 2019: "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding".
[i.77] G. Lopez, 08.12.2022: "A Smarter Robot", The New York Times.
[i.78] P. Mukherjee et al., 2021: "Real-Time Natural Language Processing with BERT Using NVIDIA
TensorRT (Updated)".
[i.79] F. Nonato de Paula and M. Balasubramaniam, 2021: "Achieve 12x higher throughput and lowest
latency for PyTorch Natural Language Processing applications out-of-the-box on AWS
Inferentia".
[i.80] F. Matern et al., 2019: "Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations",
IEEE™ Winter Applications of Computer Vision Workshops.
[i.81] A. Azmoodeh and A. Dehghantanha, 2022: "Deep Fake Detection, Deterrence and Response:
Challenges and Opportunities".
[i.82] N. Yu et al., 2021: "Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution
in Training Data", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[i.83] B. Guo et al., 2023: "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation,
and Detection".
[i.84] Insikt Group, 2023: "I, Chatbot", Recorded Future.
[i.85] Cade Metz, 2023: "OpenAI to Offer New Version of ChatGPT for a $20 Monthly Fee", NYT.
[i.86] Joseph Cox, 2023: "How I Broke Into a Bank Account With an AI-Generated Voice", Vice.
[i.87] C. Wang et al., 2023: "Neural Codec Language Models are Zero-Shot Text to Speech
Synthesizers".
[i.88] Coalition for Content Provenance and Authenticity, 2023: "Overview".
[i.89] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
protection of natural persons with regard to the processing of personal data and on the free
movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
3 Definition of terms, symbols and abbreviations
3.1 Terms
For the purposes of the present document, the following terms apply:
deepfake: manipulation of existing or creation of fake multimedia identity representation
face reenactment: method for creating deepfakes in which the facial expressions of a person in a video are changed
face swap: method for creating deepfakes in which the face of a person in a video is exchanged
meme: cultural item that is spread via the Internet, often through social media platforms to give a falsified or amusing
representation of a person or thing
multimedia identity representation: data representing a person's identity or linked to it in different media formats
such as video, audio and text
Text-To-Speech (TTS): method for creating deepfakes in which text (or a phoneme sequence) is converted into an
audio signal
voice conversion: method for creating deepfakes in which the style of an audio sequence (e.g. speaker characteristic) is
changed without altering its semantic content
3.2 Symbols
Void.
3.3 Abbreviations
For the purposes of the present document, the following abbreviations apply:
AI Artificial Intelligence
AML Anti-Money Laundering
API Application Programming Interface
BEC Business E-mail Compromise
CEO Chief Executive Officer
DNN Deep Neural Network
GAN Generative Adversarial Network
GDPR General Data Protection Regulation
HTML Hyper Text Markup Language
ID Identity
KYC Know Your Customer
MOS Mean Opinion Score
NLP Natural Language Processing
RLHF Reinforcement Learning from Human Feedback
TTS Text-To-Speech
VC Voice Conversion
4 Introduction
4.1 Problem Statement
The present document addresses the problems and concerns of AI-based manipulation of multimedia identity
representations. Due to significant progress in applying AI to the problem of generating or modifying data represented
in different media formats (in particular, audio, video and text), new threats have emerged that can lead to substantial
risks in various settings, ranging from personal defamation and opening bank accounts under false identities (by
attacking biometric authentication procedures) to placing false or manipulated information into campaigns intended to
influence public opinion. AI techniques can be used to manipulate authentic multimedia identity representations or to
create fake ones. The possible output of such manipulations includes, among other things, video or audio files that show
people doing or saying things they never did or said in reality. Since Deep Neural Networks (DNNs) are usually used to
generate such outputs, they are commonly referred to as "deepfakes".
In principle, this phenomenon is not entirely new, since somewhat similar attacks have been possible for an extended
period of time. Falsely associating people with text they have never uttered does not require complex technology and
has been done for millennia. Similarly, photos, audio and video files can be used out of their original context and
attributed to a completely different one. Although this technique is unsophisticated, it can be remarkably successful, and
is still routinely used, e.g. in today's social networks in the form of memes. The rapid advance of computer technology
in recent decades has also made the manipulation of photos, audio and video files increasingly easy. Editing programs
allow cropping and rearranging audio and video files or changing their speed. Since photo-editing programs became
widespread in the 2000s, the possibilities for manipulating photos have been practically unlimited.
EXAMPLE: In 2020, a video showing US Speaker of the House Nancy Pelosi circulated on social media. The
video had been slowed down to give the impression of Mrs. Pelosi being drunk [i.1].
Nevertheless, AI techniques allow going one step further in many respects and can have adverse effects in a larger array
of situations. AI techniques allow automating manipulations that previously required a substantial amount of manual
work, creating fake multimedia data from scratch and manipulating audio and video files in a targeted way while
preserving high acoustic and visual quality of the result, which was infeasible using previous technology. AI techniques
can also be used to manipulate audio and video files in a broader sense, e.g. by applying changes to the visual or
acoustic background. However, such manipulations do not target the identity representations of the persons involved.
The present document focuses on the use of AI for manipulating multimedia identity representations and illustrates the
consequential risks and measures to mitigate them.
5 Deepfake methods
5.1 Video
5.1.1 General
This clause discusses the methods available for the manipulation of image sequences from video data. The audio part of
video data is discussed separately in clause 5.2, and the combination of manipulated image sequences with audio data
is discussed in clause 5.4. Multiple methods based on deep neural networks exist for the editing of image sequences.
These methods were developed for achieving various objectives. They include methods for "face swapping" and "face
reenactment" / "puppeteering". Beyond face swapping and reenactment, further AI-assisted video editing methods are
available or actively researched, but not yet as popular. Full-body puppeteering [i.13] methods aim to transfer the body
movement of a person to another person. In addition to the aforementioned methods, which generally use identity
attributes from another existing person to perform the manipulation of image sequences, fully synthetic data can also be
created.
5.1.2 Face swapping
Face swapping is possibly the method best known to social media users and the general public, and also the one which
coined the term "deepfake". The term became popular in 2017 when a user with the pseudonym "deepfakes" started to
insert faces of celebrities into pornographic material using a neural network as an autoencoder model and posted the
results on the web platform reddit [i.10]. The aim in face swapping is to change the identity of a person by changing
either the core part of the face or the entire head. In this context, the neural network is trained to extract relevant
information such as the face identity, expression and lighting conditions from an input image, and to generate a facial
image of the target identity with the same expression and lighting conditions for seamless insertion into the frame.
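The underlying idea can be illustrated with a minimal, untrained sketch of the shared-encoder autoencoder used by early
face-swapping tools: one encoder learns a representation of expression, pose and lighting that is common to both
identities, while a separate decoder per identity reconstructs the corresponding face. All class names, layer sizes and the
toy data below are illustrative assumptions and do not correspond to any specific tool; production frameworks such as
Faceswap [i.55] or DeepFaceLab [i.56] add face alignment, masking and more elaborate losses on top of this idea.

    # Minimal sketch of the shared-encoder face-swap autoencoder (illustrative only).
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Maps an aligned 64x64 RGB face crop to a shared latent code."""
        def __init__(self, latent_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
                nn.Flatten(),
                nn.Linear(128 * 8 * 8, latent_dim),
            )
        def forward(self, x):
            return self.net(x)

    class Decoder(nn.Module):
        """Reconstructs a face of one specific identity from the shared code."""
        def __init__(self, latent_dim=256):
            super().__init__()
            self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
            self.net = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
            )
        def forward(self, z):
            return self.net(self.fc(z).view(-1, 128, 8, 8))

    # One shared encoder, one decoder per identity.
    encoder, decoder_a, decoder_b = Encoder(), Decoder(), Decoder()

    # Training: each decoder learns to reconstruct faces of "its" identity from the
    # shared code, which captures expression, pose and lighting.
    faces_a = torch.rand(8, 3, 64, 64)  # placeholder batch of identity A
    reconstruction_loss = nn.functional.mse_loss(decoder_a(encoder(faces_a)), faces_a)

    # Swap at inference time: encode a frame of identity A and decode it with B's
    # decoder, so that B's face appears with A's expression and lighting.
    with torch.no_grad():
        swapped = decoder_b(encoder(faces_a))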
The purpose of a face swap can be either entertainment, for example when inserting a popular celebrity's face into a
movie scene that he/she originally did not participate in, or nefarious activities as in the case of non-consensual
pornography (for details see clause 6.1.2). It can also be used for other purposes, such as a more natural de-identification
(as opposed to face blurring) within a documentary film. This allows keeping the respective persons' emotional
expressions while protecting them from persecution [i.11].
5.1.3 Face reenactment
If one does not aim to manipulate the identity of a speaker but for example to alter a spoken message, face reenactment
methods can be used for editing a given video.
EXAMPLE: In an early video from 2018, former US president Barack Obama appears to warn of an upcoming
era of disinformation and to insult the then-incumbent president Donald Trump, only to reveal afterwards
that the video was manipulated all along [i.12].
As the identity of the person in the video is preserved in this method, only subtle changes need to be made in the facial
expression or in the region of the mouth. This manipulated content can then be inserted seamlessly, and can achieve
higher quality in comparison to face swapping methods as differences in skin color or texture do not need to be
considered. However, the general setting of the video is mostly determined by the original source material that is being
manipulated, unless further manipulation steps are applied to the body of the manipulated person or the background.
5.1.4 Synthetic faces
Using techniques such as StyleGAN2 [i.2], it is possible to create 2D pictures of synthetic faces at a resolution of
1024 × 1024 pixels, which show faces of people that might not exist in reality. On the technical level, the goal of these
systems is usually to map a simple random distribution, such as a multivariate Gaussian distribution, onto the
distribution of natural faces. For creating a new face, a vector is first sampled from the simple distribution, which is
then converted by the system into a two-dimensional image. The mapping of the two distributions is generally modelled
with a deep neural network. Usually Generative Adversarial Networks (GANs) or Variational Autoencoders are used
for this task.
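As an illustration of this sampling process, the following sketch uses a tiny untrained multilayer perceptron as a
stand-in for a real generator such as StyleGAN2 [i.2]; the network, its dimensions and the attribute direction are
illustrative assumptions and only demonstrate the interface of latent vector in, image out.

    # Illustrative sketch of latent sampling for synthetic faces.
    import torch
    import torch.nn as nn

    latent_dim, img_size = 512, 64  # real systems generate images up to 1024x1024

    generator = nn.Sequential(      # untrained stand-in for a trained GAN generator
        nn.Linear(latent_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 3 * img_size * img_size), nn.Tanh(),
    )

    # 1) Sample a vector from the simple prior distribution (multivariate Gaussian).
    z = torch.randn(1, latent_dim)

    # 2) Map it through the generator onto the (learned) distribution of natural
    #    faces, yielding an image of a person who need not exist in reality.
    img = generator(z).view(1, 3, img_size, img_size)

    # 3) Attribute editing: shifting the latent code along a learned direction
    #    (e.g. an "age" direction as found by tools such as StyleFlow [i.4]) changes
    #    the corresponding attribute. The direction used here is random and purely
    #    illustrative.
    age_direction = torch.randn(1, latent_dim)
    img_older = generator(z + 2.0 * age_direction).view(1, 3, img_size, img_size)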
Modern methods based on this technology are also capable of creating three-dimensional representations of random
pseudo-identities [i.3]. Furthermore, these systems can also be used to manipulate facial attributes of the created faces.
The input vector or an intermediate representation of it is often changed in a controlled manner, which results in the
change of the specified attribute in the output of the system. In some cases, however, methods for changing attributes
still have the problem that other attributes are also changed during this process.
EXAMPLE: The age, facial expression, or hair color of a pseudo-identity can be controlled and manipulated
using software tools such as StyleFlow [i.4]. However, if an attribute is changed too much, it can
have the side effect of also altering the perceived gender of the person, for example.
In addition, those systems also provide the ability to generate facial images of real people, whose attributes can in turn
be manipulated [i.5]. The ability to manipulate real faces by means of these methods even allows the morphing of
several faces into one face, which contains biometric characteristics of all the original faces [i.6].
On the one hand, synthetic faces can be used by attackers to conceal their identity or to create fake profiles on social
media in the scope of disinformation operations. On the other hand, synthetic faces can also be used for anonymization
for legitimate purposes.
5.2 Audio
Methods for the creation of manipulated audio data have the goal of creating audio data that contain a given semantic
content and have a specified style.
This class of manipulation methods can be divided into two main categories: Text-To-Speech (TTS) methods, which
can be used to generate synthetic audio data, and Voice Conversion (VC) methods, which can be used to manipulate
existing audio data.
Text-To-Speech methods can be used to convert a certain semantic content, which is specified by a text or a phoneme
sequence, into an audio signal. The generated audio signal should contain the specified semantic content and be
perceived to be as natural as possible by a human listener [i.7].
Frequently, TTS methods also have the option of controlling the style of the generated audio signal. This can be used,
for example, to control the speaker, the emotion, or the speech rate of the audio signal. Modern TTS methods are
usually designed as multi-speaker systems, which makes it possible to define the speaker whose characteristics are to be
included in the generated audio signal at inference time. In some cases, it is also possible to generate forgeries of
speakers who were not present during the training phase of the TTS method by providing the TTS system with real
audio material as a reference at the inference phase ("one-shot") [i.9]. However, if high-quality fakes which
approximate the speaker characteristics of the target speaker as well as possible are to be generated, it is necessary that
data on the target speaker is contained in the training set of the system.
Usually, a lot of audio data and the corresponding transcription are needed to train such models. Furthermore, in
addition to multi-speaker methods, there are also multi-language methods, which make it possible to specify at system
runtime which language the given text is in. This makes it possible to achieve better results for languages for which only
little training data is available. Most TTS methods consist of two components: a "text-to-spectrogram" module and a
vocoder, which are usually both modeled with the help of deep neural networks.
The former is used to convert a text, or other representation of semantic content, into a lossy spectral representation,
which is usually a mel spectrogram. The vocoder, on the other hand, is used to generate an audio signal from this
representation.
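The two-stage structure described above can be sketched as follows. Both modules are untrained stand-ins with
illustrative names and dimensions; real systems implement them with large trained networks (see, for example,
ESPnet-TTS [i.42] or Coqui TTS [i.49]).

    # Sketch of the two-stage TTS structure (illustrative stand-ins only).
    import torch
    import torch.nn as nn

    class TextToSpectrogram(nn.Module):
        """Maps a phoneme/character ID sequence to a mel spectrogram."""
        def __init__(self, vocab_size=100, n_mels=80, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, n_mels)
        def forward(self, token_ids, speaker_embedding=None):
            h, _ = self.rnn(self.embed(token_ids))
            if speaker_embedding is not None:        # multi-speaker conditioning
                h = h + speaker_embedding.unsqueeze(1)
            return self.proj(h)                      # (batch, frames, n_mels)

    class Vocoder(nn.Module):
        """Maps a mel spectrogram to a waveform (here: 256 samples per frame)."""
        def __init__(self, n_mels=80, hop=256):
            super().__init__()
            self.net = nn.Linear(n_mels, hop)
        def forward(self, mel):
            return self.net(mel).flatten(1)          # (batch, samples)

    acoustic_model, vocoder = TextToSpectrogram(), Vocoder()
    tokens = torch.randint(0, 100, (1, 20))  # toy "phoneme" sequence for the text
    speaker = torch.randn(1, 256)            # speaker embedding, e.g. from reference audio
    waveform = vocoder(acoustic_model(tokens, speaker))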
Voice conversion techniques can be used to convert a source audio signal into another audio signal in such a way that
the semantic content remains, but the style of the audio is changed according to the given specification. Such style
changes could be a change of the speaker characteristic, a change of emotion, or a change of speech rate.
EXAMPLE: The most common application of voice conversion methods is to convert one audio file into a new
file by changing the voice of the source speaker to a specified target speaker. The output audio file
contains the same semantic content as the source audio but sounds like the target speaker's voice.
In addition to the two common components ("text-to-spectrogram" and vocoder) used in TTS systems, VC methods
usually have a component that decomposes the source audio signal into different representations, such as the semantic
content, timbre, or prosody [i.8].
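A corresponding sketch of this decomposition idea is given below, again with untrained stand-in components and
illustrative dimensions: the source audio provides the content representation, a reference recording of the target
speaker provides the style representation, and a decoder recombines the two into mel frames that a vocoder would turn
into a waveform.

    # Sketch of content/style decomposition for voice conversion (illustrative only).
    import torch
    import torch.nn as nn

    content_encoder = nn.GRU(input_size=80, hidden_size=128, batch_first=True)
    speaker_encoder = nn.Sequential(nn.Linear(80, 64), nn.ReLU())
    decoder = nn.Linear(128 + 64, 80)       # produces mel frames for a vocoder

    source_mel = torch.randn(1, 200, 80)    # utterance of the source speaker
    target_mel = torch.randn(1, 150, 80)    # reference audio of the target speaker

    content, _ = content_encoder(source_mel)                 # what is being said
    target_style = speaker_encoder(target_mel).mean(dim=1)   # how the target sounds
    converted_mel = decoder(torch.cat(
        [content, target_style.unsqueeze(1).expand(-1, content.size(1), -1)], dim=-1))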
5.3 Text
In the past years, the area of Natural Language Processing (NLP) has evolved steadily. NLP includes several tasks like
question-answering, machine translation, summarization and also text generation. Due to the success of several
so-called language models (roughly speaking, models that are trained to predict the likelihood of a word or sentence, given
a context), NLP is receiving increasing attention from scientists as well as the public [i.27]. There is no clear definition
of deepfakes in the text domain; however, in the present document the term "deepfake" is used when a text is
machine-generated with the intention to appear human and to spoof an entity (e.g. a specific person, company or
organization). Moreover, the term is mostly used in the context of targeted or untargeted deceptive attacks. Other
possibilities of malicious use of language models, e.g. polymorphic malware generation [i.35], also exist. They are out
of scope for the present document. The following text is focused only on the threats posed by automatically generated
human-like text with the intention to spoof an entity.
In recent years, concerns have grown that language models could be misused either to harm individual people
with fraudulent texts (e.g. phishing, spam or CEO fraud) or to fool people or society at large by generating misleading
or fake content (e.g. fake news). Besides that, NLP models provide further use cases to deceive human individuals.
Recently published models are able to generate literature or scientific research papers, raising questions of
responsibility for the content as well as legal issues in terms of authorship and copyright [i.31].
Due to their ability to write highly convincing human-like texts, several tech companies prevented, limited or delayed
access to, or the release of, their models to impede misuse [i.33], [i.26] and [i.23]. In order to understand
how these models work and why they perform well in various text-based tasks, the next paragraphs give a brief
theoretical overview.
Most state-of-the-art large language models are based on the transformer architecture that was presented in 2017 by
Vaswani et al. [i.22]. Transformer models use word embeddings combined with a distinct positional encoding as
input. Word embeddings are a vector-based representation of words, whereas the positional encoding contains
information about the position of each word within the input. The original transformer architecture presented in 2017
consists of an encoding and a decoding block. The main intention behind this architecture is to reduce the input within
the encoder to a lower-dimensional space (e.g. reducing a word to its meaning) and reconstruct it via the decoder
(e.g. translation into a different language). Transformer models differ from former language models in the use of
so-called self-attention as the core of their architecture. This self-attention mechanism represents the relationship between
each word of the input text and every other word within the text [i.22].
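The self-attention operation of [i.22] can be written down compactly; the following sketch uses random projection
matrices and tiny dimensions purely for illustration.

    # Minimal scaled dot-product self-attention, the core operation of the transformer [i.22].
    import math
    import torch

    seq_len, d_model = 5, 16                # 5 "words", 16-dimensional embeddings
    x = torch.randn(seq_len, d_model)       # word embeddings plus positional encoding

    W_q = torch.randn(d_model, d_model)     # learned projections in a real model
    W_k = torch.randn(d_model, d_model)
    W_v = torch.randn(d_model, d_model)

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Each row of `scores` relates one word to every other word in the input text.
    scores = torch.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)
    attended = scores @ V                   # contextualized representation per word
    # A unidirectional (decoder-only) model would additionally mask positions to the
    # right before the softmax, so that each word attends only to preceding words.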
There are various types of transformer models, two of which are often discussed in the context of generation and
detection of fake content. Bidirectional language models, on the one hand, are transformers consisting of the encoder
part only. Among other things, this transformer architecture shows good results in question-answering or in detecting
certain automatically generated texts [i.23], which will be further discussed in clause 8. On the other hand,
unidirectional transformer architectures are solely based on the decoder module of the original transformer presented in
[i.22]. They process the text from left to right and have to predict the next word. Therefore, they are extraordinarily
good at generating texts [i.30].
Training transformer models involves an unsupervised or self-supervised pre-training step on unlabeled data. After
that, the model can either be fine-tuned for a specific task (which can make it less universal but well suited for the
trained use-case), or directly used via zero-shot transfer, one-shot or few-shot learning. To use the model directly, it is
sufficient to provide it with a description of a task, written in natural language, followed by either no (zero-shot), one or
few examples. This makes these models extremely easy to use. The authors of [i.32] state that when the size of the
model is large, increasing the number of shots will increase t
...







