Are Classification Deep Neural Networks Good for Blind Image Watermarking?

Image watermarking is usually decomposed into three steps: (i) a feature vector is extracted from an image; (ii) it is modified to embed the watermark; (iii) and it is projected back into the image space while avoiding the creation of visual artefacts. This feature extraction is usually based on a classical image representation given by the Discrete Wavelet Transform or the Discrete Cosine Transform for instance. These transformations require very accurate synchronisation between the embedding and the detection and usually rely on various registration mechanisms for that purpose. This paper investigates a new family of transformation based on Deep Neural Networks trained with supervision for a classification task. Motivations come from the Computer Vision literature, which has demonstrated the robustness of these features against light geometric distortions. Also, adversarial sample literature provides means to implement the inverse transform needed in the third step above mentioned. As far as zero-bit watermarking is concerned, this paper shows that this approach is feasible as it yields a good quality of the watermarked images and an intrinsic robustness. We also tests more advanced tools from Computer Vision such as aggregation schemes with weak geometry and retraining with a dataset augmented with classical image processing attacks.


Introduction
Deep Learning (DL) has completely revolutionized the field of Computer Vision. It started with image classification [1] where a DL network is trained to recognize classes of images and it is now spreading to any task of Computer Vision-object recognition [2], instance search [3], localization [4], similar image retrieval [5,6].
The transfer property is underlying this versatility-a DL network trained on some supervised task (e.g., classification) has been shown to perform well on other applications, even non-supervised tasks like similarity search.
A DL network is comprised of several layers. Supervised training learns the best parameters of each layer in order to minimize a loss function counting the prediction errors over an annotated image set. While the last layer of a classifier network provides a vector of probabilities that the input image belongs to the classes, the internal auxiliary data (the output of the previous layers) also happen to contain very relevant information about the input image. Other tasks can indeed tap these data. Many Computer Vision works take a trained classification network off the shelf, keep the first layers as is and only re-train the deepest layers for their specific task [5].
Following this trend, state-of-the-art image search algorithms tap the output of an internal layer and use them to derive a global descriptor of the image [5,6]. This way, finding similar images boils down to finding close vectors in a Euclidean high dimensional space, where efficient fast search engines exist. It has been shown that these global descriptors are compact and discriminative while at the same time robust to valuemetric (e.g., JPEG, noise, filtering) and light geometric (e.g., cropping, rotation, scaling) distortions [7].

Problem Formulation
This paper investigates whether this approach is also suitable for zero-bit watermarking, with the hope of benefiting from this apparent robustness. The layers of a DL network play the role of the extraction function yielding a vector to be watermarked. This raises the following challenges: • How to invert this highly non-linear extraction function? Once the feature vector is watermarked, how to map it back into the image space to create an image whose feature vector equals a given target vector? • How to take into account a perceptual model ensuring the quality of the watermarked image?
• How to guarantee a required false positive level as there is no probabilistic modeling of these features? • Which network architecture provides the best extraction function? Is it possible to robustify this extraction by fine-tuning the underlying network?

Prior Works
The main trend dealing with DL and watermarking is actually the protection of DL networks. It aims at proving the ownership of a network by embedding a watermark into the learned parameters of the model [8,9].
The connection between machine learning and image watermarking is surprisingly old. References [10,11] protect images with a classical watermarking technique and a neural network is only used at the decoding side in place of a maximum likelihood decoder based on a statistical model, which may not be accurate. J. Stolarek and P. Lipiński learn a transformation [12] as proposed in this paper. However, this transform is dedicated to one specific host image. References [13,14] have a similar position than ours in the sense that they use a single network at the detection side. Reference [14] needs to train another network for embedding, whereas [13] uses back-propagation as we do (see Section 4.3). There are also shortcomings that watermarking experts would point out. These are multi-bit watermarking schemes that fail guaranteeing a probability of false alarm. Indeed, Reference [14] briefly looks at turning their multi-bit embedding into a zero-bit scheme. They report a detection rate lower than 10% for a JPEG compression with quality factor Q ≤ 90 (see [14], Figure 11-combined). There is no secret key preventing renewability when compromised. There is no simple control of the embedding distortion. The embedding distortion reported in both papers is indeed quite high with an average PSNR around 35 dB . They are not robust to geometric transformations (unless a registration is performed first [13]).
As a related work, we mention Reference [15] which makes a very good comparison between attacks in watermarking and attacks on DL networks (a.k.a. adversarial sample-see Section 3.2).
This article is the journal version of the conference paper [16] proposing improvements to robustify the feature vector extraction. This is achieved by using aggregation schemes with weak geometry (RMAC and WELDON-see Section 3.1) and by an adversarial retraining of the network (see Section 4.4). This last idea is coming from References [13,14].
The contribution of the study is to assess the relevance of Deep Neural Networks as a feature extraction when watermarking requirements are prioritized: embedding and detection use a secret key which is easily renewed if compromised, the embedding distortion is under control, the probability of false alarm is guaranteed and the test images are large.

Structure of the Paper
Section 2 presents the state-of-the-art in zero-bit watermarking when combined with a linear extraction function. Section 3 presents elements of DL networks with a focus on adversarial examples. Section 4 details our new watermarking scheme and a re-training process to increase the robustness by strengthening the invariance of the network. Section 5 reports our experimental work following by a discussion in Section 6.

Zero-Bit Watermarking
Our scheme belongs to the TEMIT approach (a wording coined by Ton Kalker.): Transform, EMbed, Inverse Transform. We denote by Tr(·) the transformation extracting a feature vector, so-called the host signal x o = Tr(I o ) of dimension n, from an image I o . The embedding modifies it into x m = Emb(x o ). This function implicitly depends on a secret key. For simplicity, the paper focuses on a zero-bit watermarking scheme [17][18][19]: we hide the presence of a secret mark, not a message. The watermarked image is given by At the detection side, the image under scrutiny is I a with x a = Tr(I a ). It is deemed as a watermarked image if x a belongs to the acceptance region D ⊂ R n . The use of a secret key is implicit in these notations.

The Hypercone Detector
This paper focuses on one specific zero-bit watermarking scheme: the hypercone detector [17]. This scheme is provably good [19] and the detector is blind and oblivious to the watermark, host and noise powers.
The acceptance region D is a dual hypercone of axis a ∈ R n ( a = 1) and half angle 0 < θ < π/2 defined as: Vector a indeed plays the role of the secret key. We now define the following function R(·) : R n → R with ρ(θ) := (tan 2 (θ) + 1). This quantity is negative when x / ∈ D, null when x lies on the boundary of D and positive inside the dual hypercone. Indeed, when R(x) > 0, this quantity is the amount of power of Gaussian white noise to be added onto x in order to push it outside the hypercone with high probability [19].
This gives a rationale for watermarking vectors at the embedding side. We push the host vector x o to a location x m deep inside D s.t. R(x m ) is as big as possible. This provably increases the robustness against a white Gaussian noise addition in the feature space.

Linear and Invertible Extraction Function
The hypercone detector is at the core of some image watermarking techniques [18] where the extraction function is (almost) a Parseval tight frame T (e.g., selection of some DCT or DWT coefficients). Once the embedding modifies x o in x m , the watermarked image is obtained as A constraint on the image Euclidean distance I m − I o 2 ≤ C (e.g., expressed in terms of MSE or PSNR) is ensured if x m − x o 2 ≤ C because Parseval frame preserves Euclidean distance. In the feature domain, the embedding then amounts to solve the following problem: x m = arg max It is known that x m belongs to the 2D-hyperplane containing the origin x o and the axis of the hypercone a [17,20], so that (4) boils down to a line search over angle β ∈ [0, π/2] defining: with (u, v) a basis of this hyperplane: Vectors x o and a are statistically independent, so that for large n, x o a ≈ 0. In that case, the close form solution is [19]: if C ≥ x o cos 2 θ, otherwise x o cannot be watermarked (i.e., pushed into region D) because C is too small. The embedding is thus given by a close form solution or a simple line search over β. This is thanks to an invertible extraction function Tr(·) preserving the Euclidean distance.

Deep Learning Feature Extraction
This section considers a neural network used as an image classifier over n c classes. It is denoted by function F that computes a vector from an image: p = F(I ) ∈ [0, 1] n c . Once the parameters of the classifier have been learned from a training set, the network predicts classes of images. The output p is a vector of estimated probabilities p k =P(k|I) that content I belongs to class k. In the end, the class of the query image is given by

Architecture
The function F is decomposed as F = S • N, where S is the softmax function ensuring that p is a probability distribution (i.e., ∑ n c k=1 p k = 1 and 0 ≤ p k ≤ 1, ∀k ∈ {1, . . . , n c }). Function N is the network per se, a composition of C convolutional layers and F fully connected layers.
Each layer is a non-linear process as it encompasses an activation function such as the well-known ReLU. Function V is a layer flattening the 3D activation map of the last convolutional layer into a vector suitable for the first fully connected layer.
The convolutional layers output data living in very high-dimensional spaces that are not practical for direct representation. The output of a convolutional layer is a three-dimensional tensor of size W × H × K, where K is the number of feature maps and (H, W) are the dimensions of each feature map. The total dimensionality is in an order of tens or hundreds of thousands, especially in higher layers of deeper network architectures. Our first approach watermarks the output of one layer. If it is a convolutional layer, the flattening function V produces a vector. It is used directly or by performing a dimensionality reduction [16]. In this work, a second approach considers a non-linear pooling strategy. It aggregates the 3D tensor in a specific way for the retention of localization information. It has been shown that geometry even weakly preserved improves a lot the transferability of a classification network to tasks like image search [21]. This work evaluates two aggregation methods in the context of watermarking with neural architectures-RMAC [21] and WELDON [22]. That aggregation is not part of the classification network (9). It is just added after a convolutional layer for the purpose of watermarking only. RMAC, namely Regional Maximum Activation of Convolutions, is a multi-scale aggregation method. It retains locality information by utilizing region-specific feature vectors for multiple, overlapping, multi-scale regions. It was developed primarily for content based image retrieval (CBIR) applications of convolutional neural networks. Let . Then 2 normalization and PCA-whitening is applied to each region feature vector. The global (for the whole image I) RMAC descriptor is the sum of the region feature vectors followed by a final 2 normalization. Its dimension is thus K. The second level is denoted in red and consists of 4 regions (one of which is bolded for emphasis) of size 9 × 9. The third level, denoted in green, consists of 9 regions (one of which is again bolded for emphasis) of size 7 × 7 that are offset so that they have at least 40% overlap with the regions of the previous layer.
WELDON, namely WEakly supervised Learning of Deep cOnvolutional neural Networks, is another method of regional pooling incorporating negative evidences. For a given feature i ∈ [K], it looks for the k coordinates having the largest activation. This is written as: where In the same way, it sums the m lowest responses: The authors state that introducing this kind of negative instances (regions that best supports the absence of a class) improves the avoidance of local minima and improves the overall classification results. The final descriptor for image I has size K and is obtained by summing the two regional scores:

Adversarial Samples
Adversarial sampling is a recent trend showing that DL classifiers can easily be fooled [23,24]. The attack forges an image I a visually similar to an original image I o but with a different prediction In a targeted attack, a specific class is given, say k t , together with an original image I o not belonging to this class. The attack finds I a as close as possible to I o s.t. C(I a ) = k t : with Dist a measure of distortion (usually, the L -norm of (I − I o ) with ∈ {0, 1, 2, +∞}). In a white box scenario, the attacker knows the network architecture and parameters.
Problem (14) is impossible to be solved as the region of images classified as k t is unknown. It is replaced by where λ ∈ R + and Loss is a loss function measuring how far I is from being classified as class k t . For instance, the loss is a divergence (e.g., Cross-Entropy) between the predicted probability distribution F(I ) and the targeted one-shot vector p , i.e., p k = δ k t (k), or more simply: with |a| + = a if a > 0 and 0 otherwise. This way

Practical Solutions
The minimization problem (15) implicitly uses the domain of images of the same size h × l and same number of channels c as the original image I, say {0, 1, . . . , 255} h×l×c . A relaxation looks for a solution over the continuous set [0, 255] h×l×c . Quantization onto integers is applied on the continuous solution to obtain an image. This box-constrained optimization often uses a differentiable mapper m which is a monotonic bijection from R to [0, 255], e.g., m(x) = 255(tanh(x) + 1)/2 as in Reference [24]. This changes the box-constrained optimization problem (14) into an unconstrained minimization of J λ (m(X)) over X ∈ R h×l×c .
Practical attacks are mostly based on a gradient descent [25,26]. The computation of the gradient of the objective function is not so difficult because one can back-propagate the computation of the gradient through the layers of the network. The attack initializes a random starting point X (0) = m −1 (I o ) and iteratively computes It stops when the objective function no longer decreases substantially. This gradient descent goes down to a local minimum when converging. Several rounds with different initialization are competing. Advanced numerical solvers have been also used in the literature, for example, ADAM [24]. The main difficulty is to set λ to a well chosen constant. Theoretically, there exists a range of λ values where the solution of (15) coincides with the solution of (14). In practice, this range is unknown and we apply a line search over λ on top of the gradient descent.
Simpler attacks start with a given distortion budget, Dist(I a , I o ) ≤ C, and set λ = 0. The gradient descent starts at the original image, that is, X (0) = m −1 (I o ) and iterates until either C(m(X (j) )) = k t or Dist(m(X (j) ), I o ) > C. In the latter case, the attack failed finding an adversarial sample within the distortion budget.

Application to Zero-Bit Watermarking
The key idea of this paper is (i) to use the network N or part of it (i.e., the first layers) as the extraction function Tr(·), (ii) to modify the extracted vector as explained in Section 2, (iii) to use an adversarial sample mechanism of Section 3 as Tr −1 (·) in order to put back the marked vector into the image. This raises several issues cleared in this section.

Need of a Locality Transform
Our proposed system can function with any neural architecture. Denote by E the set of the first layers and the aggregation (be it the flattening function, RMAC, or WELDON) used for extracting an embedding of an image: e = E(I ) ∈ R K . Yet, the representational coordinate system of the embedding space does not depict the feasible locations (i.e., the typical values of these extracted vectors). A neural network never provides a zero-mean isotropic embedding e. The usual culprit being the asymmetrical activation functions, such as ReLU, that have a R ≥0 codomain.
This rises the need for a mapping from the original coordinate system to a local coordinate system. We empirically model a local coordinate system by providing images to the DL network and analyzing their extracted feature vectors in the representation space. We use two possible mappings: • Centering: x = L(e) = e −ē, • PCA: x = L(e) =Σ −1/2 (e −ē), whereē is the empirical mean andΣ is the empirical covariance matrix. The first option preserves the dimensionality n = K, while the second may operate a reduction by keeping the n < K biggest eigenvalues ofΣ. In the end, Tr = L • E.

False Alarm Probability
The false alarm probability is usually defined as P fa := P(X ∈ D) where the random vector X models the feature vector of an original image. For the hypercone detector, under the weak assumption that X has an isotropic distribution (e.g., a Gaussian white distribution), this probability has a close form expression: where I is the regularized Beta incomplete function. This assumption does not hold for feature vectors extracted by the network. Even the empirical centering and whitening doesn't guarantee a distribution exactly isotropic.
Instead, we prefer modifying the definition of the probability of false alarm. For a given image under scrutiny, the extracted features are seen as a fixed vector. It is the acceptance region which is random: the axis direction of the hypercone is a random vector A uniformly distributed on the unit hypersphere. It happens that this probability of false alarm has the exact same expression as (21). For a required false positive level, (21) is inverted so as to find the corresponding half angle value θ.
Another use of this equation is to compute the p-value: for given x and a, the p-value is the probability (21) computed for a cosine equalling cos(θ) = |x a|/ x . It means that if we were drawing O(1/p) random secret keys A, on expectation, only one of them would give a normalized correlation bigger or equal than that cos(θ). The smaller the p-value is, the more convinced we are that a was the secret key used at the embedding.

The Objective Function and Imposed Constraints
A bad idea is to stick too closely to the watermarking described in Section 2.2. It amounts to first find x m (4) and then to use the back propagation to craft an image whose feature vector is as close as possible to x m .
Watermarking in the feature space and crafting the watermarked image are better done jointly by defining: Loss(I ) = −R(Tr(I)), (23) and letting an adversarial sample mechanism solving this optimization problem. The distortion function Dist is for instance the Mean Square Error (expressed in dB as a PSNR). Note that with these notations, a watermark is detected in image I if Loss(I ) < 0. While minimizing the loss and thus iteratively generating the watermarked image, additional constraints can be applied. For instance, the watermark can be attenuated selectively in the spatial domain by a perceptual mask to make it less perceivable. We use the Structural Similarity SSIM heatmap for that purpose. The watermarking is then performed by a simple gradient descent (see Algorithm 1).

Improving Robustness by Retraining
The introduction explains our motivation of benefiting of the inherent robustness of known classification DL networks in the very same way as it happened in image search. Retraining aims at increasing this robustness against typical attacks in watermarking applications. These are image to image transformations like translation, rotation, cropping or JPEG compression. Therefore building an augmented dataset is made easy by applying these transformations on the original training set. Note that watermarking is kept out of the retraining process. Our goal is to strengthen the invariance of the original classification network without any bias to the feature extraction method for watermarking. Later on, watermarking will naturally inherit this invariance.
The network is trained from scratch on a series of attacks of increasing difficulty. This is driven by the classification loss on the original network architecture (without the watermark feature extraction based on aggregation and whitening). For training, the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset [27] was used. It consists of 1.2 million images of 1000 classes. Attacks to the images were applied dynamically through a data augmentation pipeline making attacks of increasing difficulty as an input to the network (elaborated in Section 5.5). i ← i + 1; until stopping criterion met; I w ← I i ;

Experiments
The watermarking system is both architecture and layer agnostic, meaning that any layer can a priori be used for watermarking. This paper considers one specific architecture at different layers, exploring their goodness for watermarking by evaluating robustness and perceptual quality.

Experimental Protocol
The experimental part evaluates our proposed system based on the off-the-shelf VGG19 [28] convolutional neural network pretrained on ImageNet (available at https://keras.io/applications/ #vgg19). We perform all the experiments on photos from the test set P "professional" provided as part of the CLIC challenge [29]. This dataset consists of 117 professionally taken photos. The typical size is 2048 × 1350 pixels.

Image Quality
When the stopping criterion is expressed in terms of PSNR, the quality of the watermarked image is roughly acceptable above 40 dB. Figure 2 illustrates this with the reference Lenna image on its top row. The watermark signal in the image space looks like uniform noise when watermarking lower layers and gets more structured when processing deeper layers, as shown in Figure 3.
A psycho-visual attenuation as specified in Algorithm 1 improves the quality of the image. Our experiments used the structural similarity (SSIM) heatmap, but any other psycho-visual model (e.g., Butteraugli [30]) can be integrated transparently. Figure 2 shows that the watermark is less perceivable on the right column thanks to attenuation in flat regions (skin, hat, wall) for a given PSNR target. This almost does not hurt robustness R(Tr(I w )).
The downside is the increased number of steps (and thus time) necessary to obtain a watermark of the same robustness. Figure 4 shows a 3D representation of the watermarking process. The first two axis corresponds to the projection of Tr(I i ) onto the basis (u, v) of Equation (6), while the third component accounts for the energy remaining in the complimentary space. The stopping criterion was met in 10 or 13 iterations, respectively without or with perceptual attenuation.    Table 1 gives the log 10 p-value for the watermarked image and after a rotation of 5 • . For the moment, no aggregation scheme is used. The first part of the table illustrates the behaviour when centering is used as a locality transform, thus retaining the original dimensionality. The second part of the table illustrates the behaviour with PCA with a reduction to 256 dimensions.

From One Layer to Another
In general, watermarking at deeper layers offers more robustness. Each convolutional-pooling block adding some invariance to rotation and scaling, the best layer is indeed the first fully-connected layer.

Robustness without Aggregation
We evaluate the robustness of the first fully-connected fc1 layer of VGG19 at the expense of a lower dimensionality (n = 4096). From now on, the stopping criterion is defined as PSNR = 42 dB and the probability of false alarm is fixed to P fa = 10 −3 . To define a baseline, we watermark the same images with the linear wavelet transform as described in Section 2.2 with the same distortion constraint. Figure 5 illustrates the robustness against rotation. Wavelets drop to a detection rate of around 25% after a rotation of only 0.5 • and it ceases to detect watermarks after a rotation of 1 • . On contrary, the robustness of our technique smoothly degrades with the strength of the attack. Figure 6 illustrates the robustness against cropping. The performances again smoothly degrade with the strength of the attack: 50% of watermark detection when cropped to 90% of their original content and continues to detect some watermarks down to a crop of 50%. The system outperforms the classic approach. This is not surprising as DWT was not designed for geometric invariance. Yet, it shows that accurate synchronization is of utmost importance with linear transforms (e.g., DCT, DWT, random projection), whereas it is far less stringent in the new approach.
On top of this, we observe that the system is robust to horizontal flips! We were not expecting this feature. After investigations, we provide the following explanation. Since horizontal flips are naturally and artificially occurring in the training dataset of ImageNet, the neural network seems to have learned a transform invariant to this.

Training Against Attacks
A new classification network is now trained from scratch. Its architecture is the same as the pretrained one used in the first part of our experiments, that is,VGG19. The ILSVRC2012 dataset that consists of 1.2 million images is augmented with attacked versions by the training pipeline. This pipeline introduces attacks of increasing strength. The network is trained for 1000 epochs which takes 3 weeks on an Nvidia GeForce GTX 1060 GPU with an Adam optimizer, a learning rate of 0.001, and a learning rate decay on plateau. The progression of the attacks is scheduled as follows: • Initially, no attacks. Images come from the ILSVRC2012 dataset.
• After 20 epochs, vertical flips and rotation of up to 3 • (both clockwise and anticlockwise) are introduced. Scaling is extended to the range from 70% to 130%. • After 35 epochs, rotation is extended up to ±5 • . • After 50 epochs, horizontal and vertical stretching up to 10% are introduced. Rotation is extended to ±10 • , scaling to the ranges from 50% to 150%. • From epoch 150 on, attacks are performed with the following parameters: rotation up to ±45 • , scaling from 10% to 190%, horizontal and vertical stretching up to 50% and with horizontal and vertical flipping.
Training was performed in a standard classification setup with non-watermarked ImageNet images and is not related to the watermarking evaluation that follows. Figure 8 illustrates the robustness to rotation of all the convolutional and pooling layers of the two VGG19 networks-the pretrained one (in black) and the one trained from scratch with progressive attacks (in red). Although there is a bit of overlapping for few layers, a clear separation is globally visible, indicating the better performances of the network trained with attacks. The best robustness is given by the pool4 layer of the network trained with attacks. conv3+rmac (pretrained) pool3+rmac (pretrained) conv4+rmac (pretrained) pool4+rmac (pretrained) conv5+rmac (pretrained) pool5+rmac (pretrained) conv3+rmac (with attacks) pool3+rmac (with attacks) conv4+rmac (with attacks) pool4+rmac (with attacks) conv5+rmac (with attacks) pool5+rmac (with attacks)

Robustness with Aggregation
Aggregation methods improve the performance of convolutional neural networks in classification and content based image retrieval tasks. Is this still the case for watermarking?
When evaluating against rotation, WELDON and RMAC aggregations improve the robustness compared to the feature extraction without aggregation. Figure 9 moreover shows that RMAC clearly outperforms WELDON for large angles. Its robustness is more stable, smoothly degrading as the rotation angle increases. As for cropping, RMAC consistently outperforms WELDON and no aggregation at all as shows in Figure 10. For small cropping factors, the robustness of WELDON collapses more quickly than using no aggregation at all. However, RMAC is not always the best aggregation. It is less robust than WELDON against JPEG compression (by a small amount-see Figure 11) and against a gamma correction (see Figure 12). Indeed, no aggregation at all seems to perform best against a gamma correction. Note that Figures 9-12 are not using the same scale on the y-axis. Overall, RMAC is the obvious choice especially against geometric attacks. Its relative weakness against valuemetric attacks is certainly due to the different 2 normalizations. Appendix A shows some watermarked images with this option.

Conclusions
This work shows that watermarking features extracted from a classification network is feasible and relevant. Feasible because the framework of adversarial sample provides a way to create watermarked image of good quality. Relevant because the DL network provides a transformation which strikes a good trade-off between invariance to geometric attacks and robustness to valuemetric attacks.
This robustness is further improved thanks to mechanisms borrowed from other Computer Vision task like RMAC in image search. The training dataset plays also a big role. Augmentation with specific image transformations increases the invariance of the network although the training is driven by the classification performance and oblivious to watermarking.
From the application point of view, this study opens the door to a single image descriptor good for both image search and watermarking. In copy detection and copyright infringement applications, image search alone yields many false positives. A watermark detection would drastically decrease the number of false recognition cases.
The proposed approach has some drawbacks. First, embedding is roughly 20 times slower than detection-it is iterative and each iteration computes a gradient by back-propagation, whose complexity is the double of one inference. Reference [14] does not have this shortcoming because a generative network is in charge of the embedding. On the other hand, this generative network is not secret-keyed. Second, watermarking is not purely invariant to geometric transformation. This is fine in some applications where watermarked images face mild geometric attacks. For other applications, it will need a registration mechanism, yet that mechanism does not need to be perfectly accurate. Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
This appendix shows some samples taken from the CLIP image dataset [29] (validation dataset 'professional' subset) watermarked with the following parameters: The network is the retrained model (Section 4.4) using the last VGG19 pooling layer (namely block pool4), followed by the aggregation RMAC (10) and a PCA reduction to n = 256 dimensions and normalization (Section 4.1). The target P fa is 10 −8 , thus giving an angle θ = 69.7 • in (21). The perceptual model based on SSIM attenuation is used (Section 4.3). The iterations use the step η = 1.0 and the stopping criterion was PSNR < 43 dB in Algorithm 1.
The figures display the original image on the left, its watermarked version on the right hand side, the PSNR in dB between the two, and the p-value measured when querying the watermarked version (see Section 4.2). Indeed, a low p-value reveals that the embedding succeeded in creating a watermarked image deep inside the detection region. Thus, for a given distortion budget, the lower this p-value, the easier to be watermarked is the image.
The images are grouped in three categories (highly- Figure A1, moderately- Figure A2, and almost not- Figure A3 textured images). The watermark is invisible except when the image is very flat like davide-ragusa-716 in Figure A3. This image deserves a more advanced Human Visual System model than SSIM.
Surprisingly, moderately textured images are easier to be watermarked as their p-values suggest. We propose the following explanation. The embedding is very constrained on flat images because the perceptual model restricts the space (i.e., the number of pixels) where to hide the watermark signal. This concentrates all the distortion budget on few pixels and creates visible artefacts. For highly textured images, the inverse extraction function Tr −1 (·) struggles to deviate the extracted feature strongly reflecting the textures present in the image within the low distortion budget. This is a pity because highly textured images have great masking capabilities. It means that they can bear large embedding distortions without perceptual degradation. A more advanced Human Visual System model adapts the distortion budget depending on the masking capabilities of the host image (i.e., the 'amount' of textures). This increases the perceptual quality of almost not textured images and at the same time it increases the robustness for highly textured images. We observe this effect when simply replacing the stopping condition based on the PSNR by SSIM > 0.998 in Algorithm 1.