The goal of many machine learning tasks is to learn an unknown function which has some known symmetries. For example, consider the task of classifying 3D objects which are discretized as point clouds. In this scenario, the unknown function maps a point cloud to a discrete set of labels. Depending on the type of 3D objects we are interested in, this function could be invariant to translations, rotations, reflections, scaling, and/or affine transformations. It is also typically invariant to permutation—reordering the points does not impact the underlying object.

Another example comes from physics simulations, where we aim to learn a function which maps a point cloud at time \(t=0\), representing the positions of a collection of physical entities such as particles or planets, to their positions at time \(t=1\). This map is equivariant with respect to translations, orthogonal transformations (and Lorentz transformations in the relativistic setting), and permutations: Applying these transformations to the input will lead to a corresponding transformation of the output.

In machine learning approaches, an approximation to the unknown function is sought within a parametric hypothesis class of functions. Equivariant machine learning considers hypothesis classes which by construction have some or all of the symmetries of the unknown function. For example, there are multiple neural network architectures for point clouds which are equivariant (or invariant) to permutations [47, 59, 63] and/or geometric transformations such as rotations [17, 34, 55], orthogonal transformations [8, 51, 61], or Lorentz transformations [10, 24, 28].

Perhaps the most famous result in the theoretical study of general neural networks is the fact that fully connected ReLU networks can approximate any continuous function [45]. Analogously, it is desirable to devise invariant (or equivariant) neural networks which are universal in the sense that they can approximate all continuous invariant (or equivariant) functions. This question has been studied in many works such as [11, 22, 25, 33, 39, 46, 50, 52, 62].

Universality results for invariant networks typically consider functions which can be written as

$$\begin{aligned} f=f^{general}\circ f^{inv}, \text { where } f^{inv}:V\rightarrow \mathbb {R}^m, f^{general}:\mathbb {R}^m\rightarrow \mathbb {R}, \end{aligned}$$
(1)

where V is a Euclidean space, G is a group acting on it, \(f^{inv}\) is a G-invariant continuous function, \(f^{general}\) is continuous, and so, f is G-invariant and continuous. If \(f^{general}\) comes from a function space rich enough to approximate all continuous functions (typically a fully connected neural network), and \(f^{inv}\) is a continuous invariant mapping which separates orbits—which means that \(f^{inv}(x)=f^{inv}(y)\) if and only if \(x=gy\) for some \(g \in G\)—then invariant universality is guaranteed (see, e.g., Proposition 2.3). As discussed in [57], orbit separating invariants can also be used to construct universal equivariant networks. Thus, the question of invariant and equivariant universality is to a large extent reduced to the related question of invariant separation, which is the focus of this paper.

We note that invariant mappings on V induce well-defined mappings from V/G to some \(\mathbb {R}^m\), and separation of invariant mappings is equivalent to injectivity of this map. In practice, our mappings will all be continuous, and often, their inverses will also be continuous (though we do not explore this issue in the current paper). For this reason, we will informally refer to separating invariant mappings as invariant embeddings.

In the equivariant learning literature, orbit separating invariant functions are typically supplied by classical invariant theory results, which in many cases can characterize all polynomial invariants of a given group action through a finite set of invariant polynomial generators. However, the number of generators is often unrealistically large. For example, the classic PointNet architecture [47] is a permutation-invariant neural network which operates on point clouds in \(\mathbb {R}^{d\times n}\). Proofs of the universality of this network are typically based on its ability to approximate power sum polynomials, which are generators of the ring of permutation-invariant polynomials (see, e.g., [38, 52, 63]). However, while in the experimental setup reported in [47], \(n=1024,d=3\) and the network creates 1024 invariant features, the number of power sum polynomials is \(\binom{n+d}{n}\), which amounts to over 180 million invariant features in this case.

There are various mathematical results from different disciplines which suggest that the number of separating invariants need not be larger than roughly twice the dimension of the domain. (In particular, in the example discussed above this would mean that only \(2\dim (\mathbb {R}^{d\times n})=2n\cdot d=6144\) permutation-invariant features would be needed, rather than 180 million.) For example, in invariant theory it is known that (for groups acting on affine varieties over algebraically closed fields by automorphisms), while the number of generators of the invariant ring is difficult to control, there are separating sets whose size is \(2D+1\), where D is the Krull dimension of the invariant ring [18, 20]. Additionally, continuous orbit separating mappings on V can be identified with continuous injective mappings on the quotient space V/G. Thus, if V/G is a smooth manifold, Whitney’s embedding theorem shows that it can easily be (smoothly) embedded into \(\mathbb {R}^{2D+1}\), where D is the dimension of V/G. (With more work, this can be reduced to 2D.) A similar theorem holds under the weaker assumption that V/G is a compact metric space with topological dimension D ([44], page 309). Additionally, it is a common assumption (e.g., [2, 36, 53]) that the data of interest in a machine learning task typically reside in a ‘data manifold’ \(\mathcal {M}\subseteq V\) whose dimension \(D_{\mathcal {M}}\) (the ‘intrinsic dimension’) is significantly smaller than the dimension of V (the ‘ambient dimension’). This assumption is at the root of many dimensionality reduction techniques such as auto-encoders, random linear projections, and PCA. In this scenario, we expect to achieve orbit separation with only \(\approx 2D_{\mathcal {M}}\) invariants.

Fig. 1

a Two lines in \(\mathbb {R}^{3\times n}\) and their images under (some) random \(S_{n}\) permutations, visualized by projecting into \(\mathbb {R}^3\); b the images of these lines (dimension \(D=1\)) under the mapping we describe in (9) with \(m=2D+1=3\). As guaranteed by Proposition 3.1, these images do not intersect; c tables showing the training and test error when training an MLP to classify the curves based on the embedding in b, for various values of D and m

Based on the above discussion, we formulate the first goal of this paper:

First Goal: Provide an algorithm for computing \(\approx 2D_{\mathcal {M}}\) separating invariants for group actions on a ‘data manifold’ of intrinsic dimension \(D_{\mathcal {M}}\).

Figure 1 illustrates our first goal and our discussion so far. (Full details of the experiment are given in Sect. 6.) We take two lines in high-dimensional space, \(\mathbb {R}^{3\times n}\), (colored orange and blue) and apply many random permutations from \(S_n\) to the points on both lines, thus obtaining many orange and blue lines. (In (a), we visualize these data projected down to \(\mathbb {R}^3\).) In (b), we see the image of these lines under a permutation-invariant separating mapping to \(\mathbb {R}^3\). Due to invariance, all orange (respectively, blue) lines are mapped to a single orange (or blue) curve in \(\mathbb {R}^3\). Due to separation, these curves do not intersect. It is possible to achieve separation with only three coordinates since the intrinsic dimension of the lines is \(D_{\mathcal {M}}=1\).

The first row in the tables in Fig. 1c shows that applying a standard neural network architecture to the invariant curves in (b) can accurately classify points according to the line from which they originated. The other rows show that when the intrinsic dimension \(D_{\mathcal {M}}\) of the data is increased, more invariant features are needed, where \(2D_{\mathcal {M}}+1\) features typically give accurate results.
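To make the illustration concrete, the following is a minimal numpy sketch of this experiment, assuming the sort-based embedding (9) of Sect. 3.1; the sizes, the seed, and the sampling choices are illustrative stand-ins, not the exact settings of Sect. 6.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, D = 3, 32, 1            # ambient shape d x n; a line has intrinsic dimension D = 1
m = 2 * D + 1                 # number of invariant features, as in Proposition 3.1

# A random line in R^{d x n}: X(t) = X0 + t * X1, sampled at 100 points.
X0, X1 = rng.normal(size=(d, n)), rng.normal(size=(d, n))
line = [X0 + t * X1 for t in np.linspace(-1.0, 1.0, 100)]

# The sort-based permutation-invariant embedding (9): X -> L_B(colsort(X^T A)).
A = rng.normal(size=(d, m))
B = rng.normal(size=(n, m))

def embed(X):
    Y = np.sort(X.T @ A, axis=0)        # beta_A: sort each of the m columns of X^T A
    return np.sum(B * Y, axis=0)        # L_B: one weight per entry, summed column-wise

# All column permutations of a point cloud are mapped to the same point in R^m.
sigma = rng.permutation(n)
assert all(np.allclose(embed(X), embed(X[:, sigma])) for X in line)

curve = np.stack([embed(X) for X in line])  # the image curve in R^3, as in Fig. 1b
```

Training an MLP to classify points of such curves, as in the tables of Fig. 1c, is then a standard supervised learning step on the m-dimensional embedded data.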

Dimensionality reduction by linear projections In Corollary 2.9, we provide a very general tool to achieve our first goal: with some minor assumptions, we show that if a finite set of N separators is known, then the number of separators can be reduced to \(2D_{\mathcal {M}}+1\) by taking \(2D_{\mathcal {M}}+1\) random linear combinations of the original separators.

While we are not aware of a result as general as Corollary 2.9 in terms of the domains and groups which it can handle, we note that the idea of dimension reduction of separators using random linear projections is not new. It is used in many of the proofs mentioned above which guarantee \(\approx 2D_{\mathcal {M}}\) separators in the algebraic setting (see, e.g., [20]), and was used for specific group actions in [5] and [13]. Random projections are also used in the Whitney embedding theorem.

A significant computational drawback of the linear projection technique is that it is still necessary to compute the original large set of separators before the linear projection. Thus, the total complexity of computing the separators is not improved by taking linear projections. Accordingly, we formulate our second goal, which is in fact a refinement of the first goal.

Second goal (refinement of first goal): Provide an efficient, polynomial time algorithm for computing \(\approx 2D_{\mathcal {M}}\) separating invariants for group actions on a ‘data manifold’ of intrinsic dimension \(D_{\mathcal {M}}\). In our setting, we do this by starting with a continuous family of maps which is separating as a whole, and showing that \(\approx 2D_{\mathcal {M}}\) randomly chosen elements from this family already suffice for separation.

1 Main Results

Our main result in this paper is a methodology for addressing the second goal and efficiently computing \(2D_{\mathcal {M}}+1\) separating invariants for several classical group actions which have been studied in the invariant machine learning community. Our methodology is inspired by the results in [4], which use tools from real algebraic geometry to show orbit separation in the context of the phase retrieval problem. Our results, formally stated in Theorem 2.7, can be seen as a generalization of these results to general groups. To illustrate the theorem and its usefulness, we will give an example which will later be discussed in more detail in Sect. 3.5.

Example 1.1

Let \(\mathcal {M}\) denote the collection of all \(d\times n\) matrices which have rank d (assume that \(d\le n\)). This is a set of dimension \(D_{\mathcal {M}}=n\cdot d\). Consider the action of \(SL(d)=\{A\in \mathbb {R}^{d\times d}| \quad \textrm{det}(A)=1 \} \) on \(\mathcal {M}\) by multiplication from the left. A natural way to construct invariants for this group action is to pick a subset \(I\subseteq \{1,\ldots ,n\} \) of d indices and consider the function \(X\mapsto \textrm{det}(X_I) \), where \(X_I\) is the \(d\times d\) matrix obtained by choosing the d columns of X indexed by I. In fact, it is known that the functions \(\textrm{det}(X_I)\) are generators of the invariant ring, and as a result are separators. The trouble is that the number of subsets of size d, and so the number of generators, is a prohibitive \(\binom{n}{d}\). Using Corollary 2.9, we can reduce the number of separators by choosing random vectors \(w^{(1)},\ldots ,w^{(2nd+1)} \) in \(\mathbb {R}^{\binom{n}{d}} \) and considering functions of the form

$$\begin{aligned} X\mapsto \sum _{I\subseteq \{1,\ldots ,n\}, |I|=d} w_I^{(j)} \textrm{det}(X_I), \quad j=1,\ldots ,2nd+1. \end{aligned}$$
(2)

However, computing each one of these invariants still has complexity \(\sim \binom{n}{d}\).

Our methodology to improve upon this issue uses a family of invariants parameterized by the continuous ‘weight matrix’ W: We note that for every \(W\in \mathbb {R}^{n\times d} \) the function

$$\begin{aligned} X\mapsto \textrm{det}(XW) \end{aligned}$$

is SL(d) invariant. Additionally, we note that every generator \(\textrm{det}(X_I)\) is of the form \(\textrm{det}(X_I)=\textrm{det}(XW_I) \) for an appropriate choice of an \(n\times d \) matrix \(W_I\). In particular, this means that the values obtained by \(\textrm{det}(XW)\) for all possible W determine X uniquely, up to SL(d) equivalence. Theorem 2.7 shows that in this scenario, if we simply choose random \(W^{(1)},\ldots ,W^{(2nd+1)} \) then the functions

$$\begin{aligned} X\mapsto \textrm{det}(XW^{(j)}), \quad j=1,\ldots ,2nd+1 \end{aligned}$$

will be invariant and separating. In contrast to the separating invariants in (2), the complexity of computing each one of these invariants is polynomial in n and d.
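To make this concrete, the following is a minimal numpy sketch of these invariants; the dimensions are illustrative, and the final assertion only checks invariance (separation for almost every draw is what Theorem 2.7 guarantees).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 10
m = 2 * n * d + 1                      # 2*dim(M) + 1 invariants

Ws = rng.normal(size=(m, n, d))        # random weight matrices W^(1), ..., W^(m)

def invariants(X):
    """The m SL(d)-invariant features X -> det(X W^(j))."""
    return np.array([np.linalg.det(X @ W) for W in Ws])

# Sanity check: det(A X W) = det(A) det(X W) = det(X W) whenever det(A) = 1.
X = rng.normal(size=(d, n))
A = rng.normal(size=(d, d))
A /= np.cbrt(np.linalg.det(A))         # rescale so that det(A) = 1 (here d = 3)
assert np.isclose(np.linalg.det(A), 1.0)
assert np.allclose(invariants(X), invariants(A @ X))
```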

Informally, Theorem 2.7 can be stated as follows: Suppose we can find a family of invariants \(p(\cdot ,w)\) parameterized by the continuous ‘weight vector’ w, such that every orbit pair is separated by some w. Then for almost every random selection of vectors \(w_i\) with \(i=1,\ldots ,2D_{\mathcal {M}}+1\), the invariants \(x\mapsto p(x,w_i)\) will separate orbits. In a sense, the usefulness of Theorem 2.7 is that it reduces the problem of separating all points with a finite number of invariants to the problem of separating pairs of points with a continuous family of invariants. One could draw a vague analogy here to the usefulness of the Stone–Weierstrass theorem in reducing approximation proofs to separation of pairs of points.

Roughly speaking, the assumptions needed for this theorem are (i) that \(\mathcal {M}\) is a semi-algebraic set—that is, a set defined by polynomial equality and inequality constraints, and (ii) that the function p(x, w) is a semi-algebraic function. This is a rather large class of functions which includes polynomials, rational functions, and continuous piecewise linear functions.

The usefulness of Theorem 2.7 for obtaining efficient separating invariants is illustrated in Example 1.1. In Sect. 3, we show similar applications for several other group actions: We find \(2D_{\mathcal {M}}+1\) orbit separating invariants for the action of permutations on point clouds which we discussed above. These invariants are continuous piecewise linear functions obtained as a composition of the ‘sort’ function with a linear function from the left and right. The complexity of computing each invariant is (up to a logarithmic factor) linear in the ambient dimension. Even when we are interested in separation in all of \(\mathcal {M}=\mathbb {R}^{d\times n} \), so that \(D_{\mathcal {M}}=n\cdot d\), the number of invariants we need for separation is significantly lower than the \(\binom{n+d}{n}\) power sum polynomials discussed above. This advantage becomes more pronounced when the ‘data manifold’ \(\mathcal {M}\) is low-dimensional.

Similarly, we construct \(2D_{\mathcal {M}}+1\) separating invariants for the action of orthogonal transformations O(d) and special orthogonal transformations SO(d) on d by n point clouds, which can be compared with the standard separating invariants used for these group actions, which have cardinality \(\binom{n}{2}\) and \(\sim \binom{n}{d}\), respectively. We also construct \(2D_{\mathcal {M}}+1 \) separating invariants for Lorentz transformations (and other isometry groups), the general linear group, and for scaling and translation. For several of the latter results, we need to assume that the ‘data manifold’ \(\mathcal {M}\) contains only full-rank point clouds. A summary of these results, and the complexity of computing each invariant, is given in Table 1.

Generic orbit separation As suggested in several works (e.g., [46, 49, 57]), the notion of orbit separating invariants can be replaced with a weaker notion of generically orbit separating invariants—that is, invariants defined on \(\mathcal {M}\) which are separating outside a subset \(\mathcal {N}\) of strictly smaller dimension. This weaker notion of separation is sufficient to prove universality only for compact sets in \(\mathcal {M}{\setminus } \mathcal {N}\). Another disadvantage is that it is not inherited by subsets: Generic orbit separation on \(\mathcal {M}\) is not necessarily preserved on a subset \(\mathcal {M}' \subseteq \mathcal {M}\), since it is even possible that \(\mathcal {M}'\) is completely contained in \(\mathcal {N}\). The advantage of generic separation is that it is generally easier to achieve. Indeed, in Theorem 2.7 we show that while we need \(2D_{\mathcal {M}}+1\) random measurements for orbit separation, only \(D_{\mathcal {M}}+1\) measurements are sufficient for generic separation. These results resemble classical results which show that for an irreducible algebraic variety of dimension D embedded in high dimension, almost every linear projection down to dimension \(D+1\) will be generically one-to-one. (For example, see Example 7.15 and Exercise 11.23 in [29].)

A more significant computational advantage in settling for generic invariants is that each invariant can usually be computed more efficiently. We exemplify this by considering permutation-invariant functions on graphs: There is no known algorithm to separate graphs (up to permutation equivalence) in polynomial time [27]. Correspondingly, while we can easily use our methodology to find a small number of separating invariants, the computational price of computing these invariants is prohibitive. On the other hand, it is known that separating ‘generic graphs’ [1, 21] is not difficult, and correspondingly, we are able to find a small number of generically separating invariants which can be computed efficiently.

Table 1 Summary of the results in Sect. 3

1.1 Related Work

Phase retrieval and generalizations As mentioned above, our results were inspired by orbit separation results in the phase retrieval literature. Phase retrieval is the problem of reconstructing a signal x in \(\mathbb {C}^n\) (or \(\mathbb {R}^n\)), up to a global phase factor, from magnitudes of linear measurements without phase information \(|\langle x,w_i \rangle |, i=1,\ldots ,m\), where \(w_i\) are complex (or real) n-dimensional vectors. In our terminology, the goal is to find when these measurements are separating invariants with respect to the action of the group of unimodular complex numbers \(S^1\) (or real unimodular numbers \(\{-1,1\}\)).

In [4], it is shown that \(m=2n-1\) random linear measurements in the real setting, or \(m=4n-2\) random linear measurements in the complex setting, are sufficient to define a unique reconstruction of the signal up to global phase. In [16], the number of measurements needed for attaining separation in the complex setting is slightly reduced to \(m=4n-4\).

In [23], conjugate phase retrieval is discussed: Here, the vectors \(w_i\) are real, and the signals measured are complex signals in \(\mathbb {C}^n\). These measurements are invariant with respect to global phase multiplication and also complex conjugation, and it is shown that \(4n-6\) generic measurements are separating with respect to this group action.

The separation results obtained for real and conjugate phase retrieval are equivalent to saying that on the space of \(d\times n\) real matrices X, \(\sim 2dn\) generic measurements of the form \(\Vert Xw_i\Vert \) are separating with respect to the action of O(d), for the cases \(d=1\) (real phase retrieval) and \(d=2\) (conjugate phase retrieval). In Sect. 3.2, we use our methodology to show that for \(d\ge 2\), separation with respect to the action of O(d) can be obtained by choosing \(2nd+1 \) random measurements of the form \(\Vert Xw_i\Vert ^2, \quad i=1,\ldots ,2nd+1\). We note that this result is in essence not new, as it can be derived easily from results on the recovery of rank-one matrices from rank-one linear measurements (see [49], Theorem 4.9).
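For illustration, here is a minimal numpy sketch of these \(\Vert Xw_i\Vert ^2\) measurements (sizes are illustrative, and the assertion checks O(d) invariance only, not separation):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 8
m = 2 * n * d + 1

ws = rng.normal(size=(m, n))           # random measurement vectors w_i in R^n

def measurements(X):
    """O(d)-invariant features X -> ||X w_i||^2, i = 1, ..., m."""
    return np.array([np.sum((X @ w) ** 2) for w in ws])

# Invariance: ||Q X w|| = ||X w|| for any orthogonal Q.
X = rng.normal(size=(d, n))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose(measurements(X), measurements(Q @ X))
```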

There are two natural ways to generalize the separation results of complex phase retrieval: The first is to consider measurements \(\Vert Xw_i\Vert \) with \(w_i\in \mathbb {C}^d\) and \(X\in \mathbb {C}^{d\times n}\). These measurements are invariant with respect to the action of \(d\times d\) unitary matrices on \(\mathbb {C}^{d\times n}\) by multiplication from the left. In [32], it is shown that \(4d(n-d)\) such generic measurements are separating with respect to the unitary action. For \(d=1\), this coincides with the \(4n-4\) invariants needed for complex phase retrieval.

An alternative generalization of complex phase retrieval is searching for separating invariants for the action of SO(d) on \(\mathbb {R}^{d\times n} \). Complex phase retrieval gives us separating invariants for the special case \(d=2\), using the identifications \(S^1 \cong SO(2) \) and \(\mathbb {C}\cong \mathbb {R}^2 \). In Sect. 3.3, we note that when \(d>2\) separation is not possible with measurements which involve only norms and linear operations, but can be achieved by adding an additional ‘determinant term’ to the norm-based measurements.

While our results on invariant separation for SO(d) are new (to the best of our knowledge), we feel that the main novelty in this work is in the generalization of the basic real algebraic geometry arguments used in the phase retrieval proofs to a very general class of real group actions (Theorem 2.7) and invariant functions (semi-algebraic mappings), and in the ability to apply this same methodology to multiple group actions (see Table 1).

Finally, we note that the results quoted above show that it is often possible to get generic separation even when the number of invariants is slightly smaller than the \(2D_{\mathcal {M}}+1\) promised by our Theorem 2.7. We do not pursue the optimal cardinality for separation in this paper for two reasons. First, this typically requires a case-by-case analysis, and in this paper, we try to focus on highlighting a general principle that can be applied to a wide number of settings. Second, the number of separating invariants is not likely to be smaller than the dimension of \(\mathcal {M}/G\), and in most applications, we are interested in \(\dim (\mathcal {M}/G)\approx \dim (\mathcal {M}) \), so that ultimately we do not expect an improvement in separation cardinality by more than a factor of 2.

Permutation-Invariant Machine Learning Obtaining \(\mathcal {O}(n\cdot d)\) separating invariants for the action of the permutation group \(S_n\) on \(\mathbb {R}^{d\times n}\) is trivial when \(d=1\), as discussed in, e.g., [58]. When \(d>1 \), a separating set of invariants is suggested in [5], which is a composition of row-wise sorting with linear transformations from the left and the right. They show that this construction is separating when the intermediate dimension is very high (larger than n!), and that whenever separation is achieved, this map is also bi-Lipschitz. We will show that in fact the intermediate dimension need only be \(2nd+1\) (or lower when the ‘data manifold’ has lower dimension), and that the second, larger matrix in their construction can have a certain sparse structure which was previously used in [65] (see Remark 3.3). We note that when \(d=2\) the sufficiency of \(\sim n\) and even \(\sim \log (n)\) measurements was shown in [40, 48].

Recent results Shortly after the first preprint of this manuscript appeared, the authors of [14, 41, 42] proposed invariants called max filters, which are defined for all Hilbert spaces with isometric group actions. For finite-dimensional Hilbert spaces, these invariants were shown to be separating using our Theorem 2.7. Max filters can thus be suggested as alternative separating invariants to those we suggest in Sect. 3 for the actions of \(S_n\), O(d) and SO(d), which are all groups of linear isometries. An attractive attribute of this approach is that the same invariant fits all cases, while an advantage of the invariants we present here is that they are more efficient to compute. (Computing a max filter in these examples is done by solving an optimization problem with relatively high complexity.)

Another recent application of our work here was a derivation of (relatively) efficient separating invariants for the joint action of \(SO(d)\times S_n \) or \(O(d)\times S_n \) on \(\mathbb {R}^{d\times n}\). This is described in [30].

Paper organization The structure of the paper is as follows: In Sect. 2, we provide some mathematical background and use it to state and prove our main theorem. We then discuss in general terms the ways in which this theorem can be used to devise separating invariants of low complexity for various group actions and relationships to concepts from invariant theory such as generating invariants and polarization.

In Sect. 3, we describe several applications of the theorem, showing in several examples of interest how a low-dimensional set of separating invariants, which can be computed efficiently, can be obtained using our methodology. These examples include point clouds acted on by permutation matrices via multiplication from the right, or by orthogonal transformations, rotations, volume-preserving linear transformations, and general linear transformations via multiplication from the left. We also discuss the more trivial scaling and translation actions.

In Sect. 4, we discuss generic separation. We show that generic separation can be obtained using only \(D_{\mathcal {M}}+1\) invariants, and show that generic separation can be computed efficiently for weighted graphs while full separation is unlikely due to the fact that the graph isomorphism problem has no known polynomial time algorithm.

In Sect. 5, we give an outline of an argument that shows that separation can be obtained also if the parameters of the separating invariants we consider have finite precision. This argument is applicable only for some of the polynomial invariants we consider here. Finally, in Sect. 6 we provide some initial experiments, showing that a simple permutation-invariant classification problem on point clouds in \(\mathbb {R}^{d\times n}\) with high ambient dimension \(n\cdot d\) and low intrinsic dimension \(D_{\mathcal {M}}\) can be efficiently solved using \(2D_{\mathcal {M}}+1 \) of our separating invariants, as our theory predicts.

2 Definitions and Main Theorem

2.1 Notation

We denote matrices X in \(\mathbb {R}^{d\times n}\) by capital letters and refer to them as point clouds. The columns of a point cloud are denoted by lowercase letters \(X=(x_1,x_2,\ldots ,x_n)\). We use \(1_n\) to denote the constant vector \((1,1,\ldots ,1)\in \mathbb {R}^n\) and \(e_i\in \mathbb {R}^n \) to denote the n-dimensional vector with 1 in the ith coordinate and zero in the remaining coordinates.

2.2 Mathematical Background

We begin this section by explaining how continuous separating invariant functions are used to characterize all continuous invariant functions. We then lay out some definitions we need for our discussion and prove our main theorem regarding the construction of low-dimensional continuous separating invariant functions (Theorem 2.7).

Universality and orbit separation We begin with defining invariant functions and orbit separation.

Definition 2.1

Let G be a group acting on a set \(\mathcal {M}\), let \(\mathcal {Y}\) be a set, and let \(f:\mathcal {M}\rightarrow \mathcal {Y}\). We say that f is invariant if \(f(x)=f(gx)\) for all \(g\in G, x\in \mathcal {M}\). We say that a subset \(\mathcal {M}'\) of \(\mathcal {M}\) is stable under the action of G if \(gm\in \mathcal {M}'\) for all \(g\in G\) and \(m\in \mathcal {M}'\).

Definition 2.2

Let G be a group acting on a set \(\mathcal {M}\), let \(\mathcal {Y}\) be a set, and let \(f:\mathcal {M}\rightarrow \mathcal {Y}\) be an invariant function. We say that f separates orbits if \(f(x)=f(y)\) implies that \(x=gy\) for some \(g\in G\). We say that a finite collection of invariant functions \(f_i:\mathcal {M}\rightarrow \mathcal {Y}, i=1,\ldots ,N\) separates orbits, if the concatenation \((f_1(x),\ldots ,f_N(x))\) separates orbits.

The following proposition, proved in the appendix, shows that every continuous invariant function can be written as a composition of an orbit separating, continuous invariant function with a continuous (non-invariant) function.

Proposition 2.3

Let \(\mathcal {M}\) be a topological space, and G a group which acts on \(\mathcal {M}\). Let \(K\subset \mathcal {M}\) be a compact set, and let \(f^{inv}:\mathcal {M}\rightarrow \mathbb {R}^N\) be a continuous G-invariant map that separates orbits. Then for every continuous invariant function \(f:\mathcal {M}\rightarrow \mathbb {R}\), there exists some continuous \(f^{general}:\mathbb {R}^N \rightarrow \mathbb {R}\) such that

$$\begin{aligned} f(x)=f^{general}(f^{inv}(x)), \quad \forall x\in K \end{aligned}$$

Somewhat more complicated characterizations of equivariant functions via separating invariants are described in [57].

Now that we have established the importance of continuous separating invariants for the approximation of continuous invariant (or equivariant) functions, we will exclusively focus on the topic of finding a small collection of such continuous separating invariant functions. Our technique for doing so relies on several concepts from real algebraic geometry which we will now introduce.

Real algebraic geometry Unless stated otherwise, the background on real algebraic geometry presented here is from [7].

Definition 2.4

(Semi-algebraic sets) A real semi-algebraic set in \(\mathbb {R}^k\) is a finite union of sets of the form

$$\begin{aligned} \{x\in \mathbb {R}^k| p_i(x)=0 \text { and } q_j(x)>0 \text { for } i=1,\ldots ,N \text { and } j=1,\ldots ,m\} \end{aligned}$$

where \(p_i\) and \(q_j\) are multivariate polynomials with real coefficients.

Semi-algebraic sets are closed under finite unions, finite intersections, and complements. We next define semi-algebraic functions.

Definition 2.5

(Semi-algebraic functions) Let \(S\subseteq \mathbb {R}^\ell \) and \(T\subseteq \mathbb {R}^k\) be semi-algebraic sets. A function \(f:S\rightarrow T\) is semi-algebraic if

$$\begin{aligned} Graph(f)=\{(s,t)\in S \times T| \quad t=f(s) \} \end{aligned}$$

is a semi-algebraic set in \(\mathbb {R}^{\ell +k}\).

Polynomials \(f:\mathbb {R}^\ell \rightarrow \mathbb {R}^k \) are obviously semi-algebraic functions. Similarly, given two polynomials \(p_1,p_2:\mathbb {R}^{\ell } \rightarrow \mathbb {R}\), the rational function \(q(x)=\frac{p_1(x)}{p_2(x)}\) is well defined and semi-algebraic on the semi-algebraic set \(\{x\in \mathbb {R}^\ell | \, p_2(x)\ne 0\}\).

In addition, assume we are given a collection of semi-algebraic sets \(S_1,\ldots ,S_n\subseteq \mathbb {R}^\ell \) whose union is all of \(\mathbb {R}^\ell \), and a function f whose restriction to each \(S_i\) is a polynomial \(f_i\). We call such functions piecewise polynomial functions. Piecewise polynomials are semi-algebraic functions since

$$\begin{aligned} Graph(f)=\cup _{i=1}^n \{(s,t)| \, s\in S_i \text { and } t=f_i(s)\} \end{aligned}$$

In particular, this class of functions includes ReLU neural networks and the sorting function we will use in Sect. 3.1, which are continuous piecewise linear functions. Piecewise linear functions are a special case of piecewise polynomial functions where each semi-algebraic set \(S_i\) is a closed convex polyhedron, and each \(f_i\) is an affine function.

Fig. 2

Stratification of a semi-algebraic set. See explanation in main text

Stratification and dimension A semi-algebraic set S can be written as a finite union of pairwise disjoint sets \(S_1,\ldots ,S_n\) such that each \(S_i\) is a \(C^{\infty }\) manifold of dimension \(r_i\), and the closure of each \(S_i\) in S contains, besides \(S_i\) itself, only sets \(S_j\) with \(r_j<r_i \). This decomposition is called a stratification (see [7], page 177). The dimension of S is the maximal dimension \(\max _{1\le i \le n}r_i \) of the manifolds in the decomposition. (This definition of dimension can be shown to be independent of the stratification chosen.)

Figure 2 shows a stratification of the semi-algebraic set

$$\begin{aligned} S=\{(x,y)| \quad 1-x^2-y^2>0\}\cup \{(x,y)| \quad 1-x^2-y^2=0\} \cup \{(x,y)| \quad xy=0\}. \end{aligned}$$

The set is shown on the left of the figure, and the stratification is visualized on the right. It includes a single two-dimensional open disk, eight open curves (dimension 1), and four points (dimension 0). Hence, the dimension of S is two.

Families of invariant separators We now introduce some definitions needed for discussion of group actions and separation using a real algebraic geometry framework.

Assume that G is a group acting on a set \(\mathcal {M}\). The orbit of \(x\in \mathcal {M}\) under the action of a group G is the set

$$\begin{aligned}{}[x]=\{y\in \mathcal {M}| \, \exists g\in G \text { such that } y=gx \} \end{aligned}$$

When y is in the orbit of x, we use the notation \(x \sim _Gy\), and when it is not in the orbit of x, we use the notation \(x \not \sim _Gy\).

Definition 2.6

Let G be a group acting on a semi-algebraic set \(\mathcal {M}\) and \(D_w\) be an integer greater than or equal to one. We say that a semi-algebraic function

$$\begin{aligned} p:\mathcal {M}\times \mathbb {R}^{D_w}\rightarrow \mathbb {R}\end{aligned}$$

is a family of G-invariant semi-algebraic functions, if for every \(w\in \mathbb {R}^{D_w} \) the function \(p(\cdot ,w)\) is G invariant.

We say a family of G-invariant semi-algebraic functions separates orbits in \(\mathcal {M}\), if for all \(x,y\in \mathcal {M}\) such that \(x \not \sim _G y\) there exists \(w\in \mathbb {R}^{D_w} \) such that \(p(x,w)\ne p(y,w) \).

We say a family of G-invariant semi-algebraic functions strongly separates orbits in \(\mathcal {M}\), if for all \(x,y \in \mathcal {M}\) with \(x \not \sim _G y\), the set

$$\begin{aligned} \{w\in \mathbb {R}^{D_w}| \, p(x,w)= p(y,w)\} \end{aligned}$$

has dimension \(\le D_w-1\).

We note that if p(xw) is polynomial in w for every fixed x, then separation implies strong separation, since the set of zeros of a polynomial which is not identically zero is always dimensionally deficient.

2.3 Main Theorem

We now have all we need to state our main theorem:

Theorem 2.7

Let G be a group acting on a semi-algebraic set \(\mathcal {M}\) of dimension \(dim(\mathcal {M})=D_{\mathcal {M}}\). Let \(p:\mathcal {M}\times \mathbb {R}^{D_w} \rightarrow \mathbb {R}\) be a family of G-invariant semi-algebraic functions. If p strongly separates orbits in \(\mathcal {M}\), then for Lebesgue almost every \(w_1,\ldots ,w_{2D_{\mathcal {M}}+1}\in \mathbb {R}^{D_w} \) the \(2D_{\mathcal {M}}+1\) G-invariant semi-algebraic functions

$$\begin{aligned} p(\cdot ,w_i), i=1,\ldots ,2D_{\mathcal {M}}+1 \end{aligned}$$

separate orbits in \(\mathcal {M}\).

The remainder of this subsection is devoted to proving this theorem. At a first reading, we recommend skipping to Sect. 2.4 at this point.

We begin by recalling some additional real algebraic geometry facts we will need for the proof, also taken from [7]. We first recall some basic properties of real algebraic dimension:

  1. If \(A\subseteq B \subseteq \mathbb {R}^\ell \) are semi-algebraic sets, then

     $$\begin{aligned} \dim (A)\le \dim (B) \end{aligned}$$

  2. If \(A \subseteq \mathbb {R}^\ell \) and \(B \subseteq \mathbb {R}^m\) are semi-algebraic sets, then

     $$\begin{aligned} \dim (A\times B)=\dim (A)+\dim (B) \end{aligned}$$

  3. If \(S\subseteq \mathbb {R}^k\) is a semi-algebraic set and \(f:S\rightarrow \mathbb {R}^\ell \) is a semi-algebraic function, then f(S) is a semi-algebraic set and

     $$\begin{aligned} \dim (f(S))\le \dim (S) \end{aligned}$$

     If f is a diffeomorphism, then we have equality \(\dim (f(S))=\dim (S)\).

  4. If \(A \subseteq \mathbb {R}^\ell \) is a semi-algebraic set of dimension strictly smaller than \(\ell \), then it has Lebesgue measure zero.

Another useful fact we will use is that the projection of a semi-algebraic set is also a semi-algebraic set.

We next state and prove the following lemma.

Lemma 2.8

Let \(S\subseteq \mathbb {R}^{D_1}\) be a semi-algebraic set and \(f:\mathbb {R}^{D_1}\rightarrow \mathbb {R}^{D_2}\) a polynomial. Assume that \(\dim (f^{-1}(t))\le \Delta \) for all \(t\in f(S)\). Then

$$\begin{aligned} \dim (S)\le \dim (f(S))+\Delta \end{aligned}$$

Proof

Denote \(\Delta _S=\dim (S)\). Let \(S_i, i=1,\ldots ,N\) be a stratification of S. Without loss of generality, \(\dim (S_1)=\Delta _S\); because of this equality, we can argue about \(\Delta _S\) by arguing only about \(\dim (S_1)\).

Fix some \(s_0\in S_1\) so that the differential of \(f_{|S_1}\) at \(s_0\) has maximal rank r. The set of \(s\in S_1\) whose differential has rank r is open, and so, there is a neighborhood of \(s_0\) on which f has constant rank. By the rank theorem [37], f is locally a projection: This means that there exists a diffeomorphism \(\psi \) which maps an open set U with \(s_0\in U \subseteq S_1 \) to \((0,1)^{\Delta _S}\) and a diffeomorphism \(\phi :V \rightarrow \mathbb {R}^{D_2} \), where V is open in \(\mathbb {R}^{D_2}\) and contains f(U), such that the function \({\tilde{f}}=\phi \circ f \circ \psi ^{-1}\) is a projection:

$$\begin{aligned} {\tilde{f}} (s_1,s_2,\ldots ,s_r,\ldots ,s_{\Delta _S})=(s_1,s_2,\ldots ,s_r,0,0,\ldots ,0), \quad \forall (s_1,\ldots ,s_{\Delta _S}) \in (0,1)^{\Delta _S}. \end{aligned}$$

For the projection \({\tilde{f}}\), we have for every t in the image of \({\tilde{f}}\) the equality

$$\begin{aligned} \Delta _S=r+(\Delta _S-r)=\dim \tilde{f}((0,1)^{\Delta _S})+\dim \tilde{f}^{-1}(t). \end{aligned}$$
(3)

We can now get our result by exploiting the relationship between f and \({\tilde{f}}\). Since \({\tilde{f}}\) and the restriction of f to U have the same image, we have

$$\begin{aligned} \dim \tilde{f}((0,1)^{\Delta _S})=\dim f(U)\le \dim (f(S)) \end{aligned}$$

and

$$\begin{aligned} \dim \tilde{f}^{-1}(t)=\dim (\psi (f^{-1}(\phi ^{-1}(t))\cap U))=\dim (f^{-1}(\phi ^{-1}(t))\cap U)\le \dim f^{-1}(\phi ^{-1}(t))\le \Delta \end{aligned}$$

Plugging the last two inequalities into (3) concludes the proof. \(\square \)

We can now prove Theorem 2.7. Our bundle-based proof presented below was inspired by ideas in [4].

Proof of Theorem 2.7

The set

$$\begin{aligned} \{(x,y)\in \mathcal {M}\times \mathcal {M}| \, x\not \sim _G y \} \end{aligned}$$

is semi-algebraic as it is the projection onto the (x, y) coordinates of the semi-algebraic set

$$\begin{aligned} \{(x,y,w)| \, p(x,w)\ne p(y,w) \} \end{aligned}$$

It follows that for every \(m\in \mathbb {N}\) the set

$$\begin{aligned} \mathcal {B}_m=\{(x,y,w_1,\ldots ,w_m)\in \mathcal {M}\times \mathcal {M}\times \mathbb {R}^{D_w\times m}| \, x \not \sim _G y \text { but } p(x,w_i)=p(y,w_i), i=1,\ldots ,m \} \end{aligned}$$
(4)

is semi-algebraic. We will sometimes refer to the set \(\mathcal {B}_m\) as the ‘bad set.’

Let \(\pi \) and \(\pi _W\) denote the projections

$$\begin{aligned} \pi (x,y,w_1,\ldots ,w_m)=(x,y), \quad \pi _W(x,y,w_1,\ldots ,w_m)=(w_1,\ldots ,w_m). \end{aligned}$$

The set \(\pi _W(\mathcal {B}_m)\) is precisely the set of \(w_1,\ldots ,w_m\) which are not separating. Our goal is to show that, when m is big enough, the dimension of \(\pi _W(\mathcal {B}_m)\) is less than \(mD_w\), and so it has Lebesgue measure zero.

Let us start by bounding the dimension of \(\mathcal {B}_m\). For every \((x,y) \in \pi (\mathcal {B}_m)\), we have

$$\begin{aligned} \pi ^{-1}(x,y)=\{(x,y)\}\times \underbrace{W_{(x,y)}\times W_{(x,y)}\times \ldots \times W_{(x,y)} }_{m \text { times }}, \text { where } W_{(x,y)}=\{w\in \mathbb {R}^{D_w}| \, p(x,w)=p(y,w)\}. \end{aligned}$$

By assumption, p is strongly separating, and thus, \(\dim (W_{(x,y)})\le D_w-1\). Therefore,

$$\begin{aligned} \dim \pi ^{-1}(x,y)\le mD_w-m. \end{aligned}$$

It follows from Lemma 2.8 that when \(m\ge 2D_{\mathcal {M}}+1\)

$$\begin{aligned} \dim (\mathcal {B}_m)\le \dim (\pi (\mathcal {B}_m))+mD_w-m\le 2D_{\mathcal {M}}-m+mD_w \le mD_w-1, \end{aligned}$$
(5)

and since applying \(\pi _W\) to \(\mathcal {B}_m\) can only decrease its dimension, we obtain that \(\dim (\pi _W(\mathcal {B}_m))\le mD_w-1 \) as required. \(\square \)

2.4 Using the Main Theorem

The goal of this subsection is to describe in general terms how Theorem 2.7 can be used to achieve low-dimensional orbit separating invariants (which can be computed efficiently). In the next section, we will apply Theorem 2.7 to find a small, efficiently computed collection of separating invariants for several classical group actions, many of which have been studied in the context of invariant machine learning. These results will be presented as a case-by-case elementary analysis, which requires only a combination of Theorem 2.7 with elementary linear algebra arguments. The purpose of this subsection is to provide a general explanation of the results in the next section, based on known results from invariant theory.

A methodological application of Theorem 2.7 can be achieved by searching for polynomial invariants, and using known results from classical invariant theory which studies these invariants. In particular, for the classical group actions we discuss in the next section, we typically have an available first fundamental theorem (FFT) for this group action: that is, a finite set of invariant polynomials \(f_1,\ldots ,f_N\) which are called generators, such that for every invariant polynomial p there exists a polynomial \(q:\mathbb {R}^N\rightarrow \mathbb {R}\) such that

$$\begin{aligned} p(x)=q(f_1(x),\ldots ,f_N(x)). \end{aligned}$$

The generators of the invariant polynomial ring are algebraic separators—that is, any two distinct orbits which can be separated by any invariant polynomial will be separated by one of the generators. Let us for now assume that on our ‘data manifold’ \(\mathcal {M}\) the generators do indeed separate orbits. (This is often, but not always, the case. We will return to this issue in a few paragraphs.) Typically, the cardinality N of the generators is much larger than what we would like. The easiest (but not recommended) method for achieving a smaller collection of polynomials is by starting with a generating set (or some other possibly large known set of semi-algebraic separating invariants) and applying linear projection.

Corollary 2.9

Let G be a group acting on a semi-algebraic set \(\mathcal {M}\) of dimension \(D_{\mathcal {M}}\). Assume that \(f_i:\mathcal {M}\rightarrow \mathbb {R}, i=1,\ldots ,N\) are semi-algebraic mappings which separate orbits. Then for almost all \(w^{(1)},\ldots ,w^{(2D_{\mathcal {M}}+1)}\in \mathbb {R}^N\), the functions

$$\begin{aligned} p(x,w^{(j)})=\sum _{i=1}^N w^{(j)}_if_i(x), j=1,\ldots ,2D_{\mathcal {M}}+1 \end{aligned}$$

separate orbits.

Proof

Since \(p(x,w)=\sum _{i=1}^N w_if_i(x)\) is polynomial in w for fixed x, if we can show that p is a family which separates orbits then it also strongly separates orbits, and so, Theorem 2.7 gives us separation. Given x and y in \(\mathcal {M}\) whose orbits do not intersect, we know that there is some i such that \(f_i(x)\ne f_i(y)\), and so, \(p(x,w=e_i)\ne p(y,w=e_i) \). \(\square \)

The dimensionality reduction technique described in Corollary 2.9 is essentially a random linear projection from \(\mathbb {R}^N\) to \(\mathbb {R}^{2D_{\mathcal {M}}+1}\). This method was used for generating a small number of separating invariants in [5] and [13], and is at the heart of the proofs mentioned earlier for the existence of a small set of separating invariants (see [20, 31]). From a computational perspective, this approach is sub-optimal as it requires a full computation of all N separating invariants as an intermediate step.
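For concreteness, here is a minimal numpy sketch of the projection step of Corollary 2.9, using the multivariate power sums (6) as a stand-in for the N known separators; note that all N separators are still evaluated before projecting, which is exactly the inefficiency discussed above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 6

# All multi-indices alpha in Z_{>=0}^d with 1 <= |alpha| <= n, as in (6).
alphas = [a for k in range(1, n + 1)
          for a in itertools.product(range(k + 1), repeat=d) if sum(a) == k]
N = len(alphas)                                  # here N = 83, versus 2*D_M + 1 = 37

def power_sums(X):
    """Multivariate power sums phi_alpha(X) = sum_j prod_i x_{ij}^{alpha_i}."""
    return np.array([np.prod(X.T ** a, axis=1).sum() for a in alphas])

# Corollary 2.9: replace the N separators by 2*D_M + 1 random linear combinations.
D_M = d * n                                      # the 'data manifold' is all of R^{d x n}
W = rng.normal(size=(2 * D_M + 1, N))
projected = lambda X: W @ power_sums(X)

X = rng.normal(size=(d, n))
assert np.allclose(projected(X), projected(X[:, rng.permutation(n)]))
```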

In many of the examples we discuss in the next section, a significantly more efficient approach is provided by the fact that the generating invariants \(f_1,\ldots ,f_N\) are obtained from a single invariant via polarization. In our context, polarization can be described as follows: Assume that G is a subgroup of \(GL(\mathbb {R}^d)\) acting on \(\mathbb {R}^{d\times n}\) and \(\mathbb {R}^{d\times n'}\) by multiplication from the left. If \(f:\mathbb {R}^{d\times n'}\rightarrow \mathbb {R}\) is G-invariant, then we can combine f, and any linear \(W\in \mathbb {R}^{n \times n'}\), to create invariants on \(\mathbb {R}^{d\times n}\) of the form

$$\begin{aligned} p(X,W)=f(XW), W\in \mathbb {R}^{n\times n'} \end{aligned}$$

If our original generating invariants \(f_1,\ldots ,f_N \) were all of the form \(f(XW_i), i=1,\ldots ,N\), then p is a separating family of semi-algebraic mappings, and so, we obtain \(2D_{\mathcal {M}}+1\) separating invariants \(p(X,W_i), i=1,\ldots ,2D_{\mathcal {M}}+1 \) without needing to compute all of the original generators. For more on the relationship between polarization and separation, see [19].

Algebraic separation vs. orbit separation We now return to discuss a question we touched upon previously: When do invariant polynomials, and thus the generators, separate orbits? In general, a group action can have two distinct orbits which cannot be separated by any invariant polynomial. The main obstruction is that a continuous function which is constant on G orbits is also constant on the orbits’ closure. Thus, two orbits which do not intersect cannot be separated if their closures do intersect. The classical example for this is the action of \(G=\{x>0\}\) on \(\mathbb {R}\) via multiplication. This action has three orbits: positive numbers, negative numbers, and zero. The closures of these orbits all intersect zero, and hence, the only invariant functions which are continuous on all of \(\mathbb {R}\) are the constant functions. We will find similar issues occurring in the next section for the action on \(\mathbb {R}^{d\times n}\) by scaling or multiplication from the left by \(GL(\mathbb {R}^d)\): In both cases, there are no non-constant invariant continuous functions, and thus, we will rely on separating rational invariants for these examples (which are not continuous on all \(\mathbb {R}^{d\times n}\) since they have singularities).

The scaling group and \(GL(\mathbb {R}^d) \) are open subsets of Euclidean spaces. For compact groups, the orbits of the group action will be compact and thus equal to their closures, and so, the closures of disjoint orbits will remain disjoint. For closed (non-compact) groups acting on \(\mathbb {R}^{d\times n}, d\le n\), the orbit of every full-rank matrix X under the group action is homeomorphic to G and thus closed. Thus, for such closed non-compact groups, orbit separation and algebraic separation are identical on the space of d by n full-rank matrices which we denote by \(\mathbb {R}^{d\times n}_{full}\). As we will see, when X is not full rank its orbit’s closure will often intersect other orbits.

In the examples in the next section, we will achieve separation of orbits on all of \(\mathbb {R}^{d\times n}\) for actions of compact groups, and separation of orbits on \(\mathbb {R}^{d\times n}_{full}\) for actions of closed non-compact groups. That is, when we can guarantee that orbit closures do not intersect, we are able to achieve orbit separation by polynomials. Indeed, for complex linear reductive groups, orbits whose closures do not intersect can always be separated by polynomials [18]. These results can be adapted to achieve the separation results we show here: The real groups we discuss are subgroups of complex linear reductive groups, and they share the same set of generators. As such, the separation of orbits for the complex groups implies separation for the real subgroups.

We stress again that in practice, the proofs that we use for separation of our continuous family of functions rely only on elementary linear algebra and not on the first fundamental theorem and other invariant theory results noted above. We discuss these results in the next section.

3 Separating Invariants for Point Clouds

In this section, we will use Theorem 2.7 to obtain a collection of \(2D_{\mathcal {M}}+1\) separating invariants (or \(D_{\mathcal {M}}+1\) generically separating invariants) on the data manifold \(\mathcal {M}\subseteq \mathbb {R}^{d\times n}\), for several classical group actions which are of interest in the context of invariant machine learning. For non-compact group actions, we will need to assume that \(\mathcal {M}\) contains only full-rank matrices. The group actions we consider are multiplication by permutation matrices from the right, or multiplication from the left by: orthogonal transformations, generalized orthogonal transformations, special orthogonal transformations, volume preserving transformations, or general linear transformations. We will also show that these group actions can be combined with translation and scaling at no additional cost. The complexity of computing the invariants is rather moderate, as given in Table 1, which summarizes the results of this section.

3.1 Permutation Invariance

We begin by considering the action of the group of permutations on n points, denoted by \(S_n\), on \(\mathbb {R}^{d\times n}\) by swapping the order of the points. This group action has been studied extensively in the recent invariant learning literature (e.g., [47, 52, 58, 63]). In particular, the approach we suggest here is strongly related to recent results obtained in [5]. This relationship will be discussed in Remark 3.3.

Let us first discuss the simple case where \(d=1\). Interestingly, in this case the ring of polynomial invariants on \(\mathbb {R}^{1 \times n}\) is generated by only n invariants, known as the elementary symmetric polynomials. An alternative choice of generators ([35], exercise 8) which can be computed more efficiently is the power sum polynomials:

$$\begin{aligned} \phi _k(x)=\sum _{j=1}^n x_j^k, k=1,\ldots ,n. \end{aligned}$$

Let \(\Phi :\mathbb {R}^n\rightarrow \mathbb {R}^n\) denote the map whose coordinates are the power sum polynomials, that is \(\Phi (x)=(\phi _1(x),\ldots ,\phi _n(x))\). It is known that the power sum polynomials separate orbits (for an elementary proof of this see [63]).

An alternative way of achieving n-dimensional separation is by sorting: let \(\varvec{\textrm{sort}}:\mathbb {R}^n\rightarrow \mathbb {R}^n \) be the map which sorts a vector in ascending order. This map is invariant to permutations and separates orbits. It is a continuous piecewise linear map (and so a semi-algebraic map), but is not a polynomial. Note that \(\varvec{\textrm{sort}}(x)\) can be computed in \(O(n \log (n))\) operations while computing \(\Phi (x)\) requires \(O(n^2)\) operations. Additionally, sorting has been successfully used for permutation-invariant machine learning [9, 64] while power sum polynomials are discussed as a theoretical tool [52, 63] but are not used in practice. Finally, in [5] it is shown that \(\varvec{\textrm{sort}}\) is an isometry (with respect to the Euclidean metric on the output space and a natural metric on the input quotient space \(\mathbb {R}^n/S_n \)) while \(\Phi \) is not even bi-Lipschitz.
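The comparison between these two \(d=1\) maps can be summarized in a short sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
x = rng.normal(size=n)
sigma = rng.permutation(n)

# Two permutation-invariant separating maps on R^n:
Phi = lambda x: np.array([np.sum(x ** k) for k in range(1, n + 1)])  # power sums, O(n^2)
Sort = np.sort                                                       # sorting, O(n log n)

assert np.allclose(Phi(x), Phi(x[sigma]))      # invariance of the power sums
assert np.allclose(Sort(x), Sort(x[sigma]))    # invariance of sorting
```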

For \(d>1\), separation by polynomials is achievable by the multi-dimensional power sum polynomials, defined as

$$\begin{aligned} \phi _\alpha (X)=\sum _{j=1}^n x_j^\alpha , \quad \alpha \in \mathbb {Z}_{\ge 0}^d, |\alpha |\le n. \end{aligned}$$
(6)

The multi-dimensional power sum polynomials are also generators of the invariant ring. They are used in many papers which prove universality for permutation-invariant constructions [22, 52, 63]. However, the number of power sum polynomials is \(\binom{n+d}{n}\): when \(d>1\) and \(n\gg d \) this is significantly larger than the dimension of \(\mathbb {R}^{n\times d}\).

Generalizing the success of the function \(\varvec{\textrm{sort}}\) in separating orbits to the case \(d>1\) is less straightforward: it is possible to consider lexicographical sorting: this mapping separates orbits but is not continuous. An alternative generalization could be to sort each row independently—this gives a continuous mapping, but it does not separate orbits. We now use Theorem 2.7 to propose a low-dimensional set of invariants for the case \(d>1\) by polarizing a \(d=1\) separating invariant mapping \(\Psi \) (which could, for example, be \(\varvec{\textrm{sort}}\) or the one-dimensional power sum mapping \(\Phi \)):

Proposition 3.1

Let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of \(S_n\) by multiplication from the right. Let \(\Psi :\mathbb {R}^n \rightarrow \mathbb {R}^n\) be a permutation-invariant semi-algebraic mapping which separates orbits, and denote

$$\begin{aligned} f(X,w^{(1)},w^{(2)})=\langle w^{(2)},\Psi (X^Tw^{(1)})\rangle , X\in \mathbb {R}^{d \times n},w^{(1)} \in \mathbb {R}^d, w^{(2)}\in \mathbb {R}^n \end{aligned}$$
(7)

If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \((w_1^{(1)},w_1^{(2)}),\ldots ,(w_m^{(1)},w_m^{(2)}) \) in \(\mathbb {R}^d\times \mathbb {R}^{n}\), the invariant functions

$$\begin{aligned} f(\cdot ,w_i^{(1)},w_i^{(2)}), \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of \(S_n\).

Proof

The permutation invariance of f for every fixed choice of parameters follows from the invariance of \(\Psi \). By Theorem 2.7, it is sufficient to show that the family of semi-algebraic invariant mappings f strongly separates orbits. Fix some \(X,Y\in \mathbb {R}^{d\times n}\) with disjoint \(S_n\) orbits. We need to show that the dimension of the semi-algebraic set

$$\begin{aligned} B=\{(w^{(1)},w^{(2)})\in \mathbb {R}^d \times \mathbb {R}^n| f(X,w^{(1)},w^{(2)})=f(Y,w^{(1)},w^{(2)}) \} \end{aligned}$$

is strictly smaller than \(n+d\). Since X cannot be reordered to be equal to Y, it follows that the set

$$\begin{aligned} B_1=\{w^{(1)}\in \mathbb {R}^d | X^Tw^{(1)} \text { is equal to } Y^Tw^{(1)}\text { up to reordering} \} \end{aligned}$$

has dimension at most \(d-1\). Thus, it is sufficient to show that the set

$$\begin{aligned} \tilde{B}=\{(w^{(1)},w^{(2)})\in \mathbb {R}^d \times \mathbb {R}^n| f(X,w^{(1)},w^{(2)})=f(Y,w^{(1)},w^{(2)}) \text { and } w^{(1)}\not \in B_1 \} \end{aligned}$$

has dimension \(\le n+d-1\). For fixed \(w^{(1)}\not \in B_1\), the orbit separation of \(\Psi \) implies that \(\Psi (X^Tw^{(1)})\ne \Psi (Y^Tw^{(1)})\), and so, the set of \(w^{(2)}\) for which \(\langle w^{(2)}, \Psi (X^Tw^{(1)}) \rangle =\langle w^{(2)}, \Psi (Y^Tw^{(1)}) \rangle \) has dimension \(n-1\). Denoting by \(\pi \) the projection of \(\tilde{B}\) onto the first coordinate, this means that for every \(w^{(1)}\in \pi ({\tilde{B}})\) we have that \(\dim (\pi ^{-1}(w^{(1)}))=n-1\), and from Lemma 2.8, this implies

$$\begin{aligned} \dim (\tilde{B})\le \dim (\pi (\tilde{B}))+n-1\le n+d-1 \end{aligned}$$

Thus, f is strongly separating which concludes the proof. \(\square \)

We conclude this subsection with some remarks on the significance of this result in the context of the existing literature. Firstly, we note that characterizations of permutation-invariant mappings on \(\mathbb {R}^{d\times n}\) which use separating mappings of the form

$$\begin{aligned} (x_1,\ldots ,x_n)\in \mathbb {R}^{d\times n}\mapsto \sum _{j=1}^nF(x_j) \end{aligned}$$

are common in the literature investigating the expressive power of neural networks for sets and graphs (see, for example, Lemma 5 in [43]). However, these are typically based on the multivariate power sum polynomials, so that the output dimension of F is the unrealistically high \(\binom{n+d}{n}\) as discussed above. In contrast, we can obtain separation on all of \(\mathbb {R}^{d\times n}\) with \(2n\cdot d+1 \) invariants, or an even smaller number of invariants when restricting to a lower-dimensional \(S_n\) stable set \(\mathcal {M}\), by choosing \(\Psi \) to be the univariate power sum mapping \(\Phi \) defined above:

Corollary 3.2

Let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of \(S_n\) by multiplication from the right. Then there exists a polynomial mapping \(F:\mathbb {R}^d\rightarrow \mathbb {R}^{2D_{\mathcal {M}}+1} \) such that the function

$$\begin{aligned} \mathcal {M}\ni X=(x_1,\ldots ,x_n) \mapsto \sum _{j=1}^n F(x_j) \end{aligned}$$
(8)

is invariant and separating.

Proof of Corollary 3.2

Denote

$$\begin{aligned} {\hat{\Phi }}(t)=(t,t^2,\ldots ,t^n) \end{aligned}$$

so that we have

$$\begin{aligned} \Phi (t_1,\ldots ,t_n)=\sum _{i=1}^n {\hat{\Phi }}(t_i). \end{aligned}$$

Taking \(\Psi =\Phi \) in Proposition 3.1, we obtain that for \(m=2D_{\mathcal {M}}+1\), and Lebesgue almost every choice of parameters, the mapping \(f(X,w_i^{(1)},w_i^{(2)}), i=1,\ldots ,m \) is invariant and separating. Note that the ith coordinate of this map is given by

$$\begin{aligned} f(X,w_i^{(1)},w_i^{(2)})&=\langle w_i^{(2)},\Phi (X^Tw_i^{(1)})\rangle =\langle w_i^{(2)},\sum _{j=1}^n{\hat{\Phi }}(x_j^Tw_i^{(1)})\rangle \\&=\sum _{j=1}^n \langle w_i^{(2)},\hat{\Phi }(x_j^Tw_i^{(1)})\rangle =\sum _{j=1}^n F_i(x_j) \end{aligned}$$

where we define \(F_i:\mathbb {R}^d \rightarrow \mathbb {R}\) by

$$\begin{aligned} F_i(x)=\langle w_i^{(2)},{\hat{\Phi }}(x^Tw_i^{(1)})\rangle . \end{aligned}$$

Thus, the mapping as in (8) with \(F=(F_i)_{i=1}^m \) is invariant and separating. \(\square \)
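As a sanity check of Corollary 3.2, the following short sketch (our own illustration, with hypothetical names) builds the sum-pooled invariant (8) from the univariate power sums \(\hat{\Phi }\) and verifies permutation invariance:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 8
m = 2 * d * n + 1  # 2*D_M + 1 for M = R^{d x n}
W1 = rng.standard_normal((m, d))
W2 = rng.standard_normal((m, n))

def hat_phi(t):
    # hat{Phi}(t) = (t, t^2, ..., t^n) for a scalar t.
    return t ** np.arange(1, n + 1)

def F(x):
    # F(x) = (<w_i^{(2)}, hat{Phi}(x^T w_i^{(1)})>)_{i=1..m}: R^d -> R^m.
    return np.array([W2[i] @ hat_phi(x @ W1[i]) for i in range(m)])

def pooled(X):
    # X = (x_1,...,x_n) -> sum_j F(x_j), the invariant of (8).
    return sum(F(X[:, j]) for j in range(n))

X = rng.standard_normal((d, n))
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(pooled(X), pooled(X @ P))
```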

Remark 3.3

When choosing \(\Psi =\varvec{\textrm{sort}}\) in the formulation of Proposition 3.1, we obtain invariants which are closely related to those discussed in [5]. To describe the results in that paper and the relationship to our results here, let us first rewrite our results with \(\Psi =\varvec{\textrm{sort}}\) in matrix notation: Denote by \(\varvec{\textrm{colsort}}:\mathbb {R}^{n\times m} \rightarrow \mathbb {R}^{n\times m}\) the continuous piecewise linear function which independently sorts each of the m columns of an \(n\times m\) matrix in ascending order. Let \(A\in \mathbb {R}^{d\times m} \) be a matrix whose m columns correspond to \(w^{(1)}_1,\ldots , w^{(1)}_m\), and let \(B\in \mathbb {R}^{n \times m} \) be a matrix whose m columns correspond to \(w^{(2)}_1,\ldots , w^{(2)}_m\). Proposition 3.1 can be restated in matrix form as saying that on \(\mathcal {M}=\mathbb {R}^{d\times n}\), for \(m=2nd+1\), and Lebesgue almost every \(A,B\), the mapping

$$\begin{aligned} \mathbb {R}^{d\times n} \ni X\mapsto L_B\circ \beta _A(X) \end{aligned}$$
(9)

is invariant and separating, where \(L_B:\mathbb {R}^{n\times m} \rightarrow \mathbb {R}^m \) and \(\beta _A:\mathbb {R}^{d\times n} \rightarrow \mathbb {R}^{n \times m} \) are defined by

$$\begin{aligned}{}[L_B(Y)]_j=\sum _{i=1}^n B_{ij}Y_{ij} \text { and } \beta _A(X)=\varvec{\textrm{colsort}}(X^TA). \end{aligned}$$

Note that it follows that \(\beta _A\) is invariant and separating as well.
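In this matrix form, the computation is particularly compact; the following sketch (ours, with randomly drawn parameters) implements (9) and verifies invariance:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 8
m = 2 * n * d + 1
A = rng.standard_normal((d, m))   # columns play the role of w_1^{(1)},...,w_m^{(1)}
B = rng.standard_normal((n, m))   # columns play the role of w_1^{(2)},...,w_m^{(2)}

def beta_A(X):
    # beta_A(X) = colsort(X^T A): sort each column of the n x m matrix X^T A.
    return np.sort(X.T @ A, axis=0)

def L_B(Y):
    # Sparse linear map [L_B(Y)]_j = sum_i B_ij * Y_ij.
    return (B * Y).sum(axis=0)

X = rng.standard_normal((d, n))
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(L_B(beta_A(X)), L_B(beta_A(X @ P)))
```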

In [5], Balan et al. consider invariant maps which are compositions of \(\beta _A\) as defined above with general linear maps \(L:\mathbb {R}^{n \times m} \rightarrow \mathbb {R}^{2nd} \). Under the assumption that \(m>(d-1)n!\), and that the parameters defining A and L are generic, they show that these maps are separating, and moreover, bi-Lipschitz (with respect to the Euclidean metric on the output space and a natural metric on the input quotient space \(\mathbb {R}^{d\times n}/S_n \)). Thus, the main differences between Balan’s results and the results here are

  1. Balan’s proof requires \(>n! \) measurements to guarantee separation of \(\beta _A\), while we only require \(2nd+1\) measurements.

  2. We consider compositions of \(\beta _A\) with sparse linear mappings \(L_B\) (these same mappings are suggested in [65]). In contrast, Balan considers general linear mappings L which are defined by n times more parameters than \(L_B\).

  3. Balan’s results show that \(\beta _A \) and \(L\circ \beta _A\) are bi-Lipschitz. We do not consider this important aspect in this paper. We note that Balan shows that \(\beta _A\) is bi-Lipschitz whenever \(\beta _A\) is separating. Thus, their results coupled with our own show that \(\beta _A\) is bi-Lipschitz even when \(m=2nd+1 \). The bi-Lipschitzness of our sparse \(L_B\) was not directly addressed in [5], and we leave this question to future work.

3.2 Orthogonal Invariance

We now consider the action of the group of orthogonal matrices O(d) on \(\mathbb {R}^{d\times n}\) via multiplication from the left.

We consider a polynomial family of invariants of the form

$$\begin{aligned} p(X,w)=\Vert Xw\Vert ^2, X\in \mathbb {R}^{d \times n}, w\in \mathbb {R}^n. \end{aligned}$$
(10)

For fixed \(X,w\), the cost of computing this invariant is \(\mathcal {O}(n\cdot d) \). This choice of invariants is a natural generalization of the type of invariants encountered in phase retrieval (see discussion in Sect. 1.1 and Remark 3.6). It can also be seen as a realization of the invariant theory-based methodology discussed in Sect. 2.4. The ring of invariant polynomials is generated by the inner product polynomials \(\langle x_i,x_j\rangle , 1\le i\le j \le n\) [60]. It is thus also generated by the polynomials

$$\begin{aligned} \Vert x_i\Vert ^2, \ i=1,\ldots ,n \text { and } \Vert x_i-x_j\Vert ^2, \ 1\le i<j\le n \end{aligned}$$
(11)

since these polynomials have the same linear span as the inner product polynomials. These invariants are obtained from the squared norm invariant on \(\mathbb {R}^d\) by polarization, and so are all of the form (10) for an appropriate choice of \(w\in \mathbb {R}^n\), that is,

$$\begin{aligned} p(X,w=e_i)=\Vert x_i\Vert ^2 \text { and } p(X,w=e_i-e_j)=\Vert x_i-x_j\Vert ^2. \end{aligned}$$

Our result is now an easy consequence of the discussion so far and Theorem 2.7:

Proposition 3.4

Let \(n\ge d\), let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of O(d). If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \(w_1,\ldots ,w_m \) in \(\mathbb {R}^n\) the invariant polynomials

$$\begin{aligned} X\mapsto \Vert Xw_i\Vert ^2, \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of O(d).

Proof

By Theorem 2.7, it is sufficient to show that the family of invariant functions p is strongly separating, and as they are polynomials, we only need to show separation. It is sufficient to show that the finite collection of polynomials in (11) is separating, which as mentioned above is equivalent to showing that the inner product polynomials \(\langle x_i,x_j \rangle \) are separating. This is just the known fact that the Gram matrix \(X^TX \) determines X uniquely up to orthogonal transformation. See, e.g., Lemma 3.7 and its proof in the appendix. \(\square \)

We note that Proposition 3.4 (with a slightly smaller number of separating invariants) can also be deduced immediately from Theorem 4.9 in [49] which discusses the equivalent problem of separating rank-one matrices using rank-one linear measurements.
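For completeness, a minimal sketch of the invariants of Proposition 3.4 (again illustrative; here \(\mathcal {M}=\mathbb {R}^{d\times n}\), so \(m=2nd+1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 8
m = 2 * d * n + 1
W = rng.standard_normal((m, n))   # w_1,...,w_m in R^n

def embed(X):
    # X -> (||X w_i||^2)_i; O(d)-invariant since ||R X w|| = ||X w||.
    return np.array([np.sum((X @ w) ** 2) for w in W])

X = rng.standard_normal((d, n))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
assert np.allclose(embed(X), embed(R @ X))
```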

3.3 Special Orthogonal Invariance

We now turn to the action of the special orthogonal group \(SO(d)=\{R\in O(d)| \, \textrm{det}(R)=1\}\) on \(\mathbb {R}^{d\times n}\) by multiplication from the left. The invariant ring for this group action is generated by the polynomials in (11) together with the invariant polynomials

$$\begin{aligned}{}[i_1,i_2,\ldots ,i_d](X)=det(x_{i_1},x_{i_2},\ldots ,x_{i_d}), \quad 1\le i_1<i_2<\ldots <i_d\le n. \end{aligned}$$
(12)

Accordingly, the generators can all be realized by specific choices of \((w,W)\) from the family of polynomial invariants

$$\begin{aligned} p(X,w,W)=\Vert Xw\Vert ^2+det(XW), X\in \mathbb {R}^{d\times n}, w\in \mathbb {R}^n, W\in \mathbb {R}^{n\times d}. \end{aligned}$$
(13)

The complexity of calculating each invariant (for fixed \(w,W\)) is dominated by the matrix product XW, which with the standard method for matrix multiplication requires \(\mathcal {O}(n\cdot d^2)\) operations. We can easily prove that this family of invariants separates orbits:

Proposition 3.5

Let \(n\ge d\), and let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of SO(d). If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \((w_1,W_1),\ldots ,(w_m,W_m) \) in \(\mathbb {R}^n\times \mathbb {R}^{n\times d}\), the invariant polynomials

$$\begin{aligned} X\mapsto \Vert Xw_i\Vert ^2+det(XW_i), \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of SO(d).

Proof

By Theorem 2.7, it is sufficient to show that the family of invariant functions p is strongly separating, and as they are polynomials, we only need to show separation. Let \(X,Y\in \mathbb {R}^{d\times n}\) be matrices which do not have the same orbit. If X and Y are not related by any orthogonal transformation, then we already showed that they can be separated by the ‘norm polynomials.’ We now need to consider the case where X and Y are not related by a rotation, but are related by \(X=RY\) where R is orthogonal with \(det(R)=-1\). In this case, we see that X and Y have the same rank. Moreover, they must be full rank, since otherwise we could multiply R by an orthogonal transformation \(R_0\) with \(det(R_0)=-1\) which fixes the column span of Y and obtain \(X=RR_0Y\) with \(det(RR_0)=1\), in contradiction to the fact that X and Y do not have the same SO(d) orbit. Since X is full rank, we can choose \(1\le i_1<\ldots <i_d\le n\) such that \([i_1,\ldots ,i_d](X)\ne 0\). This polynomial will separate X and Y since

$$\begin{aligned} -[i_1,\ldots ,i_d](X)\ne [i_1,\ldots ,i_d](X)=[i_1,\ldots ,i_d](RY)=-[i_1,\ldots ,i_d](Y) \end{aligned}$$

\(\square \)
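The following sketch (ours) illustrates Proposition 3.5 numerically, drawing a random rotation and checking invariance of \(\Vert Xw\Vert ^2+det(XW)\):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 8
m = 2 * d * n + 1
ws = rng.standard_normal((m, n))      # w_i in R^n
Ws = rng.standard_normal((m, n, d))   # W_i in R^{n x d}

def embed(X):
    # X -> (||X w_i||^2 + det(X W_i))_i; the determinant term detects reflections.
    return np.array([np.sum((X @ w) ** 2) + np.linalg.det(X @ W)
                     for w, W in zip(ws, Ws)])

X = rng.standard_normal((d, n))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                     # flip one column so that det(R) = +1
assert np.allclose(embed(X), embed(R @ X))
```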

Remark 3.6

For \(d=2\), there are more efficient invariants than the ones we suggest here: As mentioned previously, known results [4, 16] on complex phase retrieval state that for generic \(m=4n-4\) complex vectors \(w^{(1)},\ldots ,w^{(m)}\) in \(\mathbb {C}^n\), the maps

$$\begin{aligned} \mathbb {C}^n\ni z\mapsto |\langle z, w_j \rangle | \end{aligned}$$
(14)

separate orbits of the action of \(S^1\) on \(\mathbb {C}^n\). Note that the linear map \(\langle z,w_j \rangle \) is \(S^1\) equivariant, that is,

$$\begin{aligned} \langle \xi z,w_j \rangle =\xi \langle z,w_j \rangle , \forall \xi \in S^1 \end{aligned}$$

Identifying \(\mathbb {C}^n \cong \mathbb {R}^{2\times n}\) and \(S^1\cong SO(2) \), we see that there are m linear SO(2) equivariant maps \(W^{(1)},\ldots ,W^{(m)}\) from \(\mathbb {R}^{2\times n}\) to \(\mathbb {R}^2\), modeling multiplication in \(\mathbb {C}\), and

$$\begin{aligned} \mathbb {R}^{2\times n} \ni X \mapsto \Vert XW^{(j)}\Vert \end{aligned}$$
(15)

is SO(2) invariant, and together these maps separate orbits. Each one of these linear maps \(W^{(j)} \) is parameterized by 2n real numbers, while our invariants in (13) are parameterized by 3n parameters (when \(d=2\)).

When \(d\ne 2\), it would be natural to look for separating invariants of the form (15), where the \(W^{(j)}\) are SO(d) equivariant linear maps from \(\mathbb {R}^{d\times n}\) to \(\mathbb {R}^d\), avoiding the additional determinant term we use in (13). However, in Proposition A.1 in the appendix we show that when \(d\ne 2\), the only linear SO(d) equivariant maps are of the form \(X\mapsto Xw\) with \(w\in \mathbb {R}^n\). These maps are also O(d) equivariant, and as a result, \(\Vert Xw\Vert \) is O(d) invariant. It follows that these maps cannot separate point clouds which are related by reflections but do not have the same SO(d) orbit.

3.4 Isometry Groups for Non-degenerate Bilinear Forms

The next examples we consider are isometry groups for non-degenerate bilinear forms. As usual, we assume \(n\ge d\), and we are given a symmetric invertible \(Q\in \mathbb {R}^{d\times d}\) which induces a symmetric bilinear form

$$\begin{aligned} \langle x,Qy\rangle , x,y\in \mathbb {R}^d. \end{aligned}$$

We define a Q-isometry as a matrix \(U\in \mathbb {R}^{d\times d}\) such that \(U^TQU=Q\), and thus, the symmetric bilinear form defined by Q is preserved by U:

$$\begin{aligned} \langle Ux,QUy\rangle =\langle x,U^TQUy\rangle =\langle x,Qy\rangle \end{aligned}$$

The set of Q-isometries is a subgroup of \(GL(\mathbb {R}^d)\) which we denote by \(O_Q(d)\). The orthogonal group O(d) discussed earlier corresponds to \(Q=I_d\). Indefinite orthogonal groups \(O(s,d-s) \) correspond to diagonal Q matrices with s positive unit entries and \(d-s\) negative unit entries. In particular, the Lorenz group O(3, 1) (together with translations) is an important symmetry group in special relativity and has been discussed in the context of invariant machine learning for physics simulations [10, 57].

We consider the task of finding separating invariants for the action of \(O_Q(d)\) on \(\mathbb {R}^{d\times n}\) by multiplication from the left. A natural place to start is the Q-Gram matrix \(X^TQX \), whose coordinates are the Q-inner products \(\langle x_i,Qx_j\rangle , 1\le i\le j \le n \). Indeed, at least for the Lorenz group it is known that the inner product polynomials are generators [57]. When Q is positive definite, the Q-inner products do indeed separate orbits. When Q is not positive definite, this is no longer always true: Consider the following example with \(d=2, n=2\) and

$$\begin{aligned} Q=\begin{bmatrix} 1 &{} 0\\ 0 &{} -1 \end{bmatrix}, \quad X=\begin{bmatrix} 0 &{} 0\\ 0 &{} 0 \end{bmatrix}, \quad Y=\begin{bmatrix} 1 &{} 1\\ 1 &{} 1 \end{bmatrix}. \end{aligned}$$

We see that the Q-Gram matrices of X and Y are both zero, while X and Y cannot be related by a Q-isometry since Q-isometries are invertible. However, the Q-inner products do separate orbits when restricted to the set of full-rank matrices \(\mathbb {R}^{d\times n}_{full}\). The following lemma, proved in the appendix, formulates this claim. The proof is essentially taken from [26, Corollary 8].

Lemma 3.7

Assume \(Q\in \mathbb {R}^{d\times d}\) is a symmetric invertible matrix and \(X,Y\in \mathbb {R}^{d\times n}\) have the same Q-Gram matrix. If (i) X has rank d or (ii) Q is positive definite, then X and Y are related by a Q-isometry.

Once we know that the Q-inner products separate orbits on \(\mathbb {R}^{d\times n}_{full}\), we proceed as we did for O(d). We see that the ‘Q-norm polynomials’

$$\begin{aligned} \langle x_i,Qx_i \rangle , i=1,\ldots ,n \text { and } \langle x_i-x_j,Q(x_i-x_j) \rangle , 1\le i <j\le n \end{aligned}$$

span the Q-inner product polynomials and hence are also separating on \(\mathbb {R}_{full}^{d\times n}\). We can then prove an analogue of Proposition 3.4 using the same arguments used there:

Proposition 3.8

Let \(n\ge d\), let \(Q\in \mathbb {R}^{d\times d}\) be symmetric and invertible, and let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}_{full}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of \(O_Q(d)\). If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \(w_1,\ldots ,w_m \) in \(\mathbb {R}^n\) the invariant polynomials

$$\begin{aligned} X\mapsto \langle Xw_i,QXw_i\rangle , \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of \(O_Q(d)\).
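As an illustration of Proposition 3.8, here is a sketch under our own choice of Q: an indefinite diagonal form, with a hyperbolic rotation (a ‘boost’) as the Q-isometry:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 8
m = 2 * d * n + 1
Q = np.diag([1.0, 1.0, -1.0])   # an indefinite form; O_Q(d) is then O(2, 1)
W = rng.standard_normal((m, n))

def embed(X):
    # X -> (<X w_i, Q X w_i>)_i, invariant under U with U^T Q U = Q.
    return np.array([(X @ w) @ Q @ (X @ w) for w in W])

# A Q-isometry: a hyperbolic rotation in the last two coordinates.
t = 0.7
U = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cosh(t), np.sinh(t)],
              [0.0, np.sinh(t), np.cosh(t)]])
assert np.allclose(U.T @ Q @ U, Q)   # U is indeed a Q-isometry
X = rng.standard_normal((d, n))
assert np.allclose(embed(X), embed(U @ X))
```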

3.5 Special Linear Invariance

We now give a full treatment for the group action described in Example 1.1. We consider the action of the special linear group \(SL(d)=\{A\in \mathbb {R}^{d\times d}| \, \textrm{det}(A)=1 \} \) on \(\mathbb {R}^{d \times n}\) by multiplication from the left. The generators for the ring of invariants are given by the determinant polynomials [35]

$$\begin{aligned}{}[i_1,\ldots ,i_d](X)=\textrm{det}(x_{i_1},\ldots ,x_{i_d}), 1\le i_1<i_2<\ldots <i_d\le n, \end{aligned}$$
(16)

which we have already encountered in Sect. 3.3. The generators cannot separate matrices in \(\mathbb {R}^{d\times n}\) which are not full rank, since for such matrices we will always get zero determinants. In Proposition A.2 in the appendix, we give an elementary proof that the determinant polynomials from (16) separate orbits on \(\mathbb {R}_{full}^{d \times n}\).

The separation of the determinant polynomials in (16) together with Theorem 2.7 implies

Proposition 3.9

Let \(n\ge d\), and let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}^{d\times n}_{full}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of SL(d). If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \(W_1,\ldots ,W_m \) in \(\mathbb {R}^{n\times d}\) the invariant polynomials

$$\begin{aligned} X\mapsto \textrm{det}(XW_i), \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of SL(d).
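A minimal sketch of Proposition 3.9 (illustrative; the normalization of A to determinant one below is just one way to sample an element of SL(d)):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 8
m = 2 * d * n + 1
Ws = rng.standard_normal((m, n, d))

def embed(X):
    # X -> (det(X W_i))_i; det(A X W) = det(A) det(X W) = det(X W) for A in SL(d).
    return np.array([np.linalg.det(X @ W) for W in Ws])

X = rng.standard_normal((d, n))
A = rng.standard_normal((d, d))
if np.linalg.det(A) < 0:
    A[[0, 1]] = A[[1, 0]]                  # swap two rows to make det(A) positive
A /= np.linalg.det(A) ** (1.0 / d)         # rescale so that det(A) = 1
assert np.allclose(embed(X), embed(A @ X))
```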

3.6 Translation

We consider the action of \(\mathbb {R}^d\) on \(\mathbb {R}^{d\times n}\) by translation:

$$\begin{aligned} t_*(X)=X+t1_n^T. \end{aligned}$$

We can easily compute \(n\cdot d\) separating invariants for this action: Examples include the mapping \( X \mapsto X-x_11_n^T \) suggested in [57] or the centralization mapping \(cent(X)= X-\frac{1}{n}X1_n1_n^T\). The centralization mapping is equivariant w.r.t. the action of multiplication by a matrix \(A\in GL(\mathbb {R}^d)\) from the left and a permutation matrix P from the right. That is,

$$\begin{aligned} cent(AXP^T)&=AXP^T-\frac{1}{n}AXP^T1_n1_n^T=AXP^T-\frac{1}{n}AX1_n1_n^T\\&=AXP^T-\frac{1}{n}AX1_n1_n^TP^T=A\,cent(X)P^T. \end{aligned}$$

It follows that if \(f:\mathbb {R}^{d\times n}\rightarrow \mathbb {R}^m\) is invariant with respect to some group G which is a subgroup of \(GL(\mathbb {R}^d)\times S_n\), then f(cent(X)) will be invariant with respect to the group \(\langle G,\mathbb {R}^d \rangle \) generated by G and the translation group. Additionally, if f separates orbits w.r.t. the action of G, then f separates orbits with respect to the action of \(\langle G,\mathbb {R}^d \rangle \). To see this, note that if \(X,Y\in \mathbb {R}^{d\times n}\) and \(f(cent(X))=f(cent(Y)) \), then since f separates orbits there exist some \((A,P)\in G\le GL(\mathbb {R}^d)\times S_n\) such that \(cent(X)=A\,cent(Y)P^T \), and so, X is obtained from Y by translation by the mean of Y, followed by a G action, and then translation by the mean of X.

3.7 Scaling

We consider the action of \(\mathbb {R}_{>0}=\{x>0\}\) on \(\mathbb {R}^{d\times n}\) by scaling (scalar–matrix multiplication). In this case, there are no non-constant invariant polynomials, or in fact any non-constant invariants which are continuous on all of \(\mathbb {R}^{d\times n}\). This is because the orbit of each \(X\in \mathbb {R}^{d\times n}\) contains the zero matrix \(0\in \mathbb {R}^{d\times n}\) in its closure. However, we can easily come up with non-polynomial separating invariants with singularities at zero, such as \(X\mapsto \Vert X\Vert ^{-1} X\), where \(\Vert \cdot \Vert \) denotes some norm on \(\mathbb {R}^{d\times n}\). If we choose the Frobenius norm \(\Vert \cdot \Vert _F\), this mapping is equivariant with respect to multiplication by an orthogonal matrix from the left and a permutation matrix from the right. As a result, if \(f:\mathbb {R}^{d\times n}\rightarrow \mathbb {R}^m\) is invariant with respect to some group G which is a subgroup of \(O(d)\times S_n\), then \(X\mapsto f(\Vert X\Vert _F^{-1} X)\) will be invariant with respect to the group generated by G and the scaling group. Additionally, if f separates orbits with respect to the G action, then \(X\mapsto f(\Vert X\Vert _F^{-1} X)\) separates orbits with respect to the group generated by G and the scaling group.
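The following sketch (ours) illustrates how centralization and scale-normalization compose with an \(O(d)\times S_n\)-invariant f, as described in Sects. 3.6 and 3.7; here f is the sorted column norms, a simple (not necessarily separating) invariant chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 8

def cent(X):
    # Centralization: subtract the mean column; kills translations.
    return X - X.mean(axis=1, keepdims=True)

def normalize(X):
    # X / ||X||_F; kills positive rescalings (X must be nonzero).
    return X / np.linalg.norm(X)

def f(X):
    # An O(d) x S_n-invariant map: the sorted column norms.
    return np.sort(np.linalg.norm(X, axis=0))

X = rng.standard_normal((d, n))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal matrix
P = np.eye(n)[rng.permutation(n)]                  # permutation matrix
t = rng.standard_normal((d, 1))                    # translation vector
s = 2.5                                            # positive scale
Y = s * (R @ X @ P + t @ np.ones((1, n)))
assert np.allclose(f(normalize(cent(X))), f(normalize(cent(Y))))
```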

3.8 General Linear Invariance

We consider the problem of finding separating invariants for the action of the general linear group \(GL(\mathbb {R}^d) \) on \(\mathbb {R}^{d\times n}\) by multiplication from the left. There are no non-constant polynomial invariants for this action, since this is the case even for the scaling group which is a subgroup of \(GL(\mathbb {R}^d)\). We consider a family of rational invariants

$$\begin{aligned} q(X,W)=\frac{det^2(XW)}{det(XX^T)}, \quad X\in \mathbb {R}^{d\times n}, W\in \mathbb {R}^{n\times d} \end{aligned}$$

The function q is well defined on \(\mathbb {R}^{d\times n}_{full} \times \mathbb {R}^{n\times d}\), and for fixed W, the function \(X\mapsto q(X,W)\) is \(GL(\mathbb {R}^d) \)-invariant. We prove:

Proposition 3.10

Let \(n\ge d\), and let \(\mathcal {M}\) be a semi-algebraic subset of \(\mathbb {R}_{full}^{d\times n}\) of dimension \(D_{\mathcal {M}}\), which is stable under the action of \(GL(\mathbb {R}^d)\). If \(m\ge 2D_{\mathcal {M}}+1\), then for Lebesgue almost every \(W_1,\ldots ,W_m \) in \(\mathbb {R}^{n\times d}\) the invariant rational functions

$$\begin{aligned} X\mapsto q(X,W_i), \quad i=1,\ldots ,m \end{aligned}$$

are separating with respect to the action of \(GL(\mathbb {R}^d)\).

Proof

By Theorem 2.7, it is sufficient to show that the family of rational functions q is strongly separating. In fact, since \(q(X,W)\) is polynomial in W for every fixed X, it is sufficient to show orbit separation.

Let \(X,Y\in \mathbb {R}_{full}^{d\times n}\) be two full-rank point clouds whose orbits do not intersect. Since X is full rank, it has d columns which are linearly independent; for simplicity of notation, we assume these are the first d columns. If the first d columns of Y are not linearly independent, then by choosing \(W_0=[I_d, \, 0]^T\) we get that

$$\begin{aligned} q(X,W_0)= \frac{\left[ det(x_1,\ldots ,x_d) \right] ^2}{det(XX^T)} \end{aligned}$$

is zero on Y and not on X, and so, it separates the two points. Thus, we can assume that the first d columns of Y are linearly independent. It follows that the matrix A defined uniquely by the equations \(Ax_i=y_i, i=1,\ldots ,d\) is non-singular.

By assumption, \(AX\ne Y\) so there exists some index \(j, d<j\le n\) such that \(Ax_j\ne y_j\). Since \(y_1,\ldots ,y_d\) span \(\mathbb {R}^d\), there exist \(\alpha _1,\ldots ,\alpha _d\) and \(\beta _1,\ldots ,\beta _d\) such that

$$\begin{aligned} Ax_j=\sum _{i=1}^d \alpha _iy_i, \quad y_j=\sum _{i=1}^d \beta _iy_i \end{aligned}$$

and since \(Ax_j\ne y_j\) there exists some \(k, 1\le k \le d\) such that \(\alpha _k\ne \beta _k\). Let \(W_1\in \mathbb {R}^{n\times n}\) be a matrix such that for all \(Z=[z_1,\ldots ,z_n]\in \mathbb {R}^{d\times n}\) we have

$$\begin{aligned} ZW_1=[z_1,z_2,\ldots ,z_{k-1},\beta _k z_k-z_j,z_{k+1},\ldots ,z_n]. \end{aligned}$$

Then the first d columns of \(YW_1\) have rank \(d-1\), while the first d columns of \(AXW_1\), and therefore also of \(XW_1\), have rank d. Since the vanishing of \(q(\cdot ,W)\) is determined by the numerator alone, it follows that

$$\begin{aligned} q(X,W_1W_0)=\frac{det^2(XW_1W_0)}{det(XX^T)}\ne 0= \frac{det^2(YW_1W_0)}{det(YY^T)}=q(Y,W_1W_0). \end{aligned}$$

Thus, we have shown that q separates orbits. \(\square \)
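A short numerical illustration of Proposition 3.10 (a sketch; full rank of X and invertibility of A hold with probability one for the random draws used here):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 3, 8
m = 2 * d * n + 1
Ws = rng.standard_normal((m, n, d))

def q(X, W):
    # q(X, W) = det^2(X W) / det(X X^T); well defined for full-rank X.
    return np.linalg.det(X @ W) ** 2 / np.linalg.det(X @ X.T)

X = rng.standard_normal((d, n))   # full rank with probability one
A = rng.standard_normal((d, d))   # invertible with probability one
assert np.allclose([q(X, W) for W in Ws], [q(A @ X, W) for W in Ws])
```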

3.9 Intractable Separation for Permutation Actions on Graphs

Consider the action of the permutation group \(S_n\) on the vector space \(\mathbb {R}^{n\times n}\) by conjugation: Given a permutation matrix \(P\in S_n\), and a matrix \(X\in \mathbb {R}^{n\times n} \), this action is defined as

$$\begin{aligned} P_*X=PXP^T. \end{aligned}$$

If A is an adjacency matrix of a graph, applying a relabeling \(\sigma \) to the vertices creates a new graph, isomorphic to the previous one, whose adjacency matrix \(A'\) is equal to \(A'= PAP^T \) for P the matrix representation of the permutation \(\sigma \). We are thus interested in studying this action on the set of weighted adjacency matrices \(\mathcal {M}_{weighted}\) defined as

$$\begin{aligned} \mathcal {M}_{weighted}=\{A\in \mathbb {R}^{n\times n}| A=A^T, A_{ii}=0 \text { and } A_{ij}\ge 0, \forall i,j=1,\ldots ,n\}. \end{aligned}$$
(17)

We note that \(\mathcal {M}_{weighted}\) is stable under the action of \(S_n\), and has dimension \(n_{weighted}=(n^2-n)/2 \).

More generally, we will want to think of this action of \(S_n\) on \(S_n\) stable semi-algebraic subsets \(\mathcal {M}\) of \(\mathcal {M}_{weighted}\) of arbitrary dimension \(D_{\mathcal {M}}\). For example, the collection of all binary (unweighted) graphs can be parameterized by the finite \(S_n\) stable set

$$\begin{aligned} \mathcal {M}_{binary}=\{A\in \mathcal {M}_{weighted}| \, A_{ij}\in \{0,1 \}, \forall i,j=1,\ldots ,n \}. \end{aligned}$$

Another natural example includes (weighted or unweighted) graphs of bounded degree.

Let us now consider the task of constructing separating invariants for the action of \(S_n\) on a semi-algebraic stable subset \(\mathcal {M}\subseteq \mathcal {M}_{weighted}\) of dimension \(D_{\mathcal {M}}\). As our discussion suggests, we will be able to find such separating invariants of dimension \(2D_{\mathcal {M}}+1\). However, the computational effort involved in computing the invariants in our constructions grows superpolynomially in n. This is not surprising, as a polynomial time algorithm for computing separating invariants for the action of \(S_n\) on \(\mathcal {M}_{binary}\) would lead to a polynomial time algorithm for the notoriously hard graph isomorphism problem (see [27]).

One simple separating family for the action of \(S_n\) on \(\mathcal {M}_{weighted}\) is polynomials of the form

$$\begin{aligned} p(X,W)=\prod _{P\in S_n} \Vert PXP^T-W\Vert _F^2, \, X\in \mathcal {M}_{weighted}, W\in \mathbb {R}^{n\times n}. \end{aligned}$$

Clearly, for fixed W the polynomial \(X\mapsto p(X,W) \) is permutation-invariant, and separation follows from the fact that if \(X,Y \in \mathcal {M}_{weighted}\) and \(X \not \sim Y \), then taking \(W=X\) we obtain

$$\begin{aligned} p(X,W)=0\ne p(Y,W). \end{aligned}$$

Thus, by Theorem 2.7, we can obtain \(m=2D_{\mathcal {M}}+1\) separating invariants for the action of \(S_n\) on \(\mathcal {M}\): for Lebesgue almost every \((W_1,\ldots ,W_m) \) in \(\left( \mathbb {R}^{n \times n}\right) ^m \), the functions

$$\begin{aligned} X \mapsto p(X,W_i), i=1,\ldots ,m \end{aligned}$$

are invariant and separating. Note, however, that the degree of these polynomials is \(2\cdot n!\), and so, computing these invariants is not tractable.
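To emphasize the intractability, here is a direct implementation of p(X,W) (a sketch of ours, feasible only for very small n since the product ranges over n! permutations):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(9)
n = 4   # already n! = 24 factors; the degree 2*n! grows super-exponentially

def p(X, W):
    # p(X, W) = product over all P in S_n of ||P X P^T - W||_F^2.
    out = 1.0
    for perm in permutations(range(n)):
        idx = list(perm)
        out *= np.sum((X[np.ix_(idx, idx)] - W) ** 2)
    return out

X = np.triu(rng.random((n, n)), 1)
X = X + X.T                                 # a weighted adjacency matrix
W = rng.standard_normal((n, n))
sigma = list(rng.permutation(n))
assert np.isclose(p(X, W), p(X[np.ix_(sigma, sigma)], W))  # S_n-invariance
assert p(X, X) == 0.0                                      # separation via W = X
```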

4 Generic Separation

Generic separation is a relaxed notion of separability which is often easier to achieve than full separation:

Definition 4.1

(Generic separation) Let G be a group acting on a semi-algebraic set \(\mathcal {M}\). Let \(\mathcal {Y}\) be a set, and let \(f:\mathcal {M}\rightarrow \mathcal {Y}\) be an invariant function. We say that f is generically separating on \(\mathcal {M}\), with singular set \(\mathcal {N}\), if \(\mathcal {N}\subseteq \mathcal {M}\) is a semi-algebraic set which is stable under the action of G, satisfies \(\dim (\mathcal {N})<\dim {\mathcal {M}}\), and for every \(x\in \mathcal {M}{\setminus } \mathcal {N}\), if there exists some \(y\in \mathcal {M}\) such that \(f(x)=f(y) \), then \(x \sim _G y \).

Note that being generically separating on \(\mathcal {M}\) is slightly stronger than being separating on \(\mathcal {M}\setminus \mathcal {N}\), since the latter would only consider \(y\in \mathcal {M}\setminus \mathcal {N}\).

Some possible practical disadvantages of generic separating invariants in comparison to fully separating invariants were discussed in Sect. 1. Our purpose in this section is to show that achieving generic separation is easier than achieving full separation in two respects:

  1. While \(2D_{\mathcal {M}}+1 \) separating invariants can be obtained by randomly choosing parameters of families of strongly separating invariants, for generic separation \(D_{\mathcal {M}}+1 \) invariants suffice. We discuss this next in Sect. 4.1.

  2. More importantly, for some group actions it is easy to come up with generic separators which can be computed efficiently, while obtaining true separators with low complexity seems out of reach. This is discussed in Sect. 4.2.

4.1 Generic Separation from Generic Families of Separators

In this section, we prove an analogous theorem to Theorem 2.7 where now we discuss generic invariants. The notion of generic separating invariants was defined in Definition 4.1. We now define this notion for families of invariants:

Definition 4.2

(Strong generic separation for invariant families) Let G be a group acting on a semi-algebraic set \(\mathcal {M}\). We say that a family of semi-algebraic functions \(p:\mathcal {M}\times \mathbb {R}^{D_w} \rightarrow \mathbb {R}\) strongly separates orbits generically on \(\mathcal {M}\), with respect to a singular set \(\mathcal {N}\), if \(\mathcal {N}\subseteq \mathcal {M}\) is a semi-algebraic set which is stable under the action of G, satisfies \(\dim (\mathcal {N})<\dim {\mathcal {M}}\), and for every \(x\in \mathcal {M}{\setminus } \mathcal {N}\) and \(y\in \mathcal {M}\) with \(x \not \sim _G y \), the set

$$\begin{aligned} \{w\in \mathbb {R}^{D_w}| \, p(x,w)=p(y,w)\} \end{aligned}$$

has dimension \(\le D_w-1 \).

We can now state an analogue of Theorem 2.7 for generically separating invariants. As mentioned above, the cardinality for generic separation is \(D_{\mathcal {M}}+1\) rather than the \(2D_{\mathcal {M}}+1\) of Theorem 2.7 for full separation.

Theorem 4.3

Let G be a group acting on a semi-algebraic set \(\mathcal {M}\) of dimension \(D_{\mathcal {M}}\). Let \(p:\mathcal {M}\times \mathbb {R}^{D_w} \rightarrow \mathbb {R}\) be a family of G-invariant semi-algebraic functions. If p strongly separates orbits generically in \(\mathcal {M}\), then for Lebesgue almost every \(w_1,\ldots ,w_{D_{\mathcal {M}}+1}\in \mathbb {R}^{D_w} \) the \(D_{\mathcal {M}}+1\) G-invariant semi-algebraic functions

$$\begin{aligned} p(\cdot ,w_i), i=1,\ldots ,D_{\mathcal {M}}+1 \end{aligned}$$

generically separate orbits in \(\mathcal {M}\).

Proof of Theorem 4.3

Similarly to the proof of Theorem 2.7, we can consider the ‘bad set’

$$\begin{aligned} \mathcal {B}_m=\{(x,y,w_1,\ldots ,w_m)\in (\mathcal {M}\setminus \mathcal {N})\times \mathcal {M}\\ \times \mathbb {R}^{D_w\times m}| \quad x \not \sim _G y \text { but } p(x,w_i)=p(y,w_i), i=1,\ldots ,m \} \end{aligned}$$

and repeat the dimension argument used there, together with our requirement that \(m\ge D_{\mathcal {M}}+1 \), to obtain

$$\begin{aligned} \dim (\mathcal {B}_m)\le 2D_{\mathcal {M}}+m(D_w-1)\le D_{\mathcal {M}}+mD_w-1. \end{aligned}$$

Denote

$$\begin{aligned} W=(w_1,\ldots ,w_m) \text { and } \pi _W(x,y,W)=W. \end{aligned}$$

Our goal next is to bound the dimension of the fiber \(\pi _W^{-1}(W) \) over W. Let \(U_1,\ldots ,U_K\) be a stratification of \(\mathcal {B}_m\) so that each \(U_k\) is a manifold and \(\cup _{k=1}^K U_k=\mathcal {B}_m\). For every fixed k, if the dimension of \(\pi _W(U_k)\) is less than \(mD_w\), then almost all W will not be in the projection, and so, the intersection of the fiber over these W with \(U_k\) will be empty. Now let us assume that the dimension of \(\pi _W(U_k)\) is \(mD_w\). By Sard’s theorem [37], almost all W in \(\pi _W(U_k)\) are regular values of the restriction of \(\pi _W\) to \(U_k\). By the pre-image theorem [56], every regular value W is either not in the image, or the dimension of its fiber \(\pi _W^{-1}(W)\cap U_k \) is precisely

$$\begin{aligned} \dim (U_k)-\dim (\mathbb {R}^{D_w\times m})\le \dim (\mathcal {B}_m)-mD_w\le D_{\mathcal {M}}-1 \end{aligned}$$

It follows that for almost all \(W=(w_1,\ldots ,w_m)\), the fiber over W

$$\begin{aligned} \pi _W^{-1}(W)=\cup _{k=1}^K\left( \pi _W^{-1}(W)\cap U_k \right) \end{aligned}$$

has dimension strictly smaller than \(D_{\mathcal {M}}\). Thus, this is also true for the projection of the fiber onto the x coordinate, which is the set

$$\begin{aligned} \mathcal {N}_W=\{x\in \mathcal {M}{\setminus } \mathcal {N}| \, \exists y\in \mathcal {M} \text { with } x \not \sim _G y \text { and } p(x,w_i)=p(y,w_i), i=1,\ldots ,m \}. \end{aligned}$$

Since \(\dim (\mathcal {N}_W)<D_{\mathcal {M}}\), it follows that for such \(W=(w_1,\ldots ,w_m)\), the invariants \(p(\cdot ,w_i), i=1,\ldots ,m \) are generically separating on \(\mathcal {M}\) with singular set \(\mathcal {N}\cup \mathcal {N}_W \). \(\square \)

4.2 Generic Separation for Graphs

We now return to discuss the graph separation result we discussed in Sect. 3.9. We will show that while computing true separating invariants on \(\mathcal {M}_{weighted}\) in polynomial time seems out of reach, generically separating invariants can be computed in polynomial time. We note that this is hardly surprising: The fact that graph isomorphism is not hard for generic weighted [21] or unweighted [3] graphs is well known.

Proposition 4.4

For every natural \(n\ge 2\), the mapping

$$\begin{aligned} F(A)=\left( \varvec{\textrm{sort}}(A1_n),\varvec{\textrm{sort}}\left\{ A_{ij}, 1\le i <j \le n \right\} \right) \end{aligned}$$

is generically separating and invariant with respect to the action of \(S_n\) on \(\mathcal {M}_{weighted}\).

Proof

When \(n=2\), a matrix in \(\mathcal {M}_{weighted}\) is determined by a single off-diagonal element, and thus, F is easily seen to be (globally) separating.

Now assume \(n>2\). A generic \(A\in \mathcal {M}_{weighted}\) has the following property: any two different subsets of the sub-diagonal elements of A have different sums. In particular, (i) two different rows of A sum to different values and (ii) the sub-diagonal elements are pairwise distinct.

Now let B be a graph in \( \mathcal {M}_{weighted}\) with \(F(A)=F(B) \). Since \(\varvec{\textrm{sort}}(A1_n)=\varvec{\textrm{sort}}(B1_n)\), we can without loss of generality assume that the vertices of B are ordered so that all row sums of A and B agree. We now claim that \(A=B\). Indeed, we know that \(\varvec{\textrm{sort}}\left\{ A_{ij}, 1\le i<j \le n \right\} =\varvec{\textrm{sort}}\left\{ B_{ij}, 1\le i <j \le n \right\} \), and if there is some \(i<j\) for which \(A_{ij}\ne B_{ij} \), then either the \(i\)th or the \(j\)th row of B consists solely of elements of A but does not contain \(A_{ij}\); by the assumption on A, this contradicts the fact that the rows of A and B sum to the same numbers. \(\square \)
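The generic separator of Proposition 4.4 is, by contrast, computable in \(\mathcal {O}(n^2\log n)\) time; a minimal sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 6

def F(A):
    # F(A) = (sort of row sums, sort of upper-triangular entries);
    # computable in O(n^2 log n), generically separating on M_weighted.
    return np.concatenate([np.sort(A.sum(axis=1)),
                           np.sort(A[np.triu_indices(n, k=1)])])

A = np.triu(rng.random((n, n)), 1)
A = A + A.T                                  # a generic weighted adjacency matrix
sigma = rng.permutation(n)
B = A[np.ix_(sigma, sigma)]                  # an isomorphic relabeled copy
assert np.allclose(F(A), F(B))               # invariance under relabeling
```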

5 Computer Numbers

Theorem 2.7 is a statement about ‘almost every’ choice for \(w_1,\ldots ,w_{m}\in \mathbb {R}^{D_w}\) in a measure theoretic sense, over the reals. A slight disadvantage of these separators is that while we have separation for almost all \(w_1,\ldots ,w_{m}\), we cannot point at any specific \(w_1,\ldots ,w_{m}\) for which we can absolutely guarantee separation. We do not address this difficult problem in this paper.

A second issue is that in the computational setting, where each \(w_i\) is represented using a finite number of bits, it is conceivable a priori that almost all real \(w_1,\ldots ,w_{m}\) are separating, and yet all \(w_1,\ldots ,w_{m}\) which can be represented as computer numbers are not separating. Our goal is to show that even when the \(w_i\) are represented with a finite number of bits, we can still obtain separation with high probability.

Our strategy for achieving this goal is as follows: We first show that all bad \(w_1,\ldots ,w_m\) are in the zero locus of some polynomial f whose degree is at most R. We will then use the Schwartz–Zippel lemma:

Lemma 5.1

(Schwartz, Zippel, DeMillo, Lipton). Let f be a nonzero polynomial of degree R in l variables over a field k. Select \(x_1,\ldots ,x_l\) uniformly at random from a finite subset X of k. Then the probability that \(f(x_1,\ldots ,x_l)=0\) is at most R/|X|.

This lemma implies that if we use more than \(\log _2 (\epsilon ^{-1}R)\) bits to represent our real numbers, so that they are selected from a finite alphabet X with \(>\epsilon ^{-1}R \) elements, then the probability of picking a bad collection of \((x_1,\ldots ,x_l)\) is less than \(\epsilon \). In our setting, the \(x_i\) will comprise the coordinates of our \((w_1,\ldots ,w_m)\).
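As a quick illustration of the resulting bit counts (a sketch; the helper name is ours):

```python
import math

def sufficient_bits(R, eps):
    # Bits b with |X| = 2^b > R / eps: by Schwartz-Zippel, sampling each
    # coordinate uniformly from X then fails with probability < eps.
    return math.floor(math.log2(R / eps)) + 1

# E.g., a degree bound R = 1000 and failure probability 1e-9 need ~40 bits:
print(sufficient_bits(1000, 1e-9))   # 40
```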

The challenge in this strategy is finding an appropriate nonzero f with bounded degree. In the following, we will replace the real semi-algebraic requirements of Theorem 2.7 with the stronger algebraic requirement:

Lemma 5.2

Let G be a group acting on \(\mathbb {R}^D\), let \(p:\mathbb {R}^D \times \mathbb {R}^{D_w} \rightarrow \mathbb {R}\) be a family of G-invariant separating polynomials of degree r, and let \(m\ge 2D+1\). Then the ‘bad set’ of \((w_1,\ldots ,w_m)\) which do not form a separating mapping is contained in the zero set of a nonzero polynomial f of degree at most \((mD_w-1)(2D+mD_w-m) (r^m)\).

Applying Schwartz–Zippel to this bound, we see that we need \(O(D\log (r))\) bits to obtain a formal proof that we are unlikely to pick a bad set of \(w_i\). Note that in our applications, r is quite low.

Proof (sketch) of Lemma 5.2

We will first move to the complex setting, where we can use Bezout’s theorem. Then we will restrict our attention to the real locus. To move to the complex setting, we now think of \(p(x,w)\) as a polynomial over complex variables. We define \((x \sim _p y)\) for \(x,y \in \mathbb {C}^D\) if \(p(x,w)=p(y,w)\) for all \(w \in \mathbb {C}^{D_w}\). Note that under this definition, there may be \(x,y \in \mathbb {R}^D\) such that \((x \sim _G y)\) while \((x \not \sim _p y)\) (due to some non-real valued, separating w). What is true is that \((x \not \sim _G y)\) implies \((x \not \sim _p y)\).

Define the polynomial over \(\mathbb {C}^{D + D + D_w}\): \(q(x,y,w)= p(x,w)-p(y,w)\). This has degree r. For \(i=1,\ldots ,m\), we can define the polynomials \(q_i(x,y,w_1,\ldots ,w_m):=q(x,y,w_i)\). Together, these \(q_i\) define a variety V in \(\mathbb {C}^{D + D + mD_w}\), which contains the bundle

$$\begin{aligned} \mathcal {B}=\{(x,y,w_1,\ldots ,w_m)\in \mathbb {C}^D \times \mathbb {C}^D \times \mathbb {C}^{D_w\times m}| \\ \quad x \not \sim _p y \text { but } p(x,w_i)=p(y,w_i), i=1,\ldots ,m \} \end{aligned}$$

By assumption, the set of \((x \sim _p y)\) must satisfy \(q(x,y,w)=0\) for all w. Taking the intersection over a sufficiently large finite collection of such w must stabilize and define a strict sub-variety U of p-equivalent pairs. \(\mathcal {B}\) is obtained from V by removing U. This can remove some of the components of V as well as some nowhere dense subsets of other components. Thus, the Zariski closure \(\bar{\mathcal {B}}\) must consist of some subset of the components of V. Note that the Zariski closure does not increase the dimension of \(\mathcal {B}\), which is bounded from above by \(2D+mD_w-m\) (using the argument from Theorem 2.7 in the complex setting). Let us stratify \(\bar{\mathcal {B}}\) into pure dimensional algebraic sets \(V_i\). From Bezout’s theorem (see [29, Chapter 18] and [54], especially Remark 2), and the fact that V is defined using the intersection of m varieties, each of degree r, each of these \(V_i\) (made up of components of V) is of degree at most \(r^m\). There are at most \(2D+mD_w-m\) such i.

Next we project each of these \(V_i\) onto \(\mathbb {C}^{mD_w}\). By our assumption that \(m\ge 2D+1\), we know that this projection is of dimension less than \(mD_w\). The image of a fixed \(V_i\) can be stratified into constructible sets \(V_{ij}\) of pure dimension. There are at most \(mD_w-1\) of these. From Bezout, the closure of each such \(V_{ij}\) is a variety of degree at most \(r^m\) (and of this same pure dimension). Each \(V_{ij}\) is contained in an algebraic hypersurface of at most the same degree. Taking the union over these \((mD_w-1)(2D+mD_w-m)\) hypersurfaces shows that the image of this projection must satisfy a single non-trivial polynomial equation F, where that polynomial’s degree is at most \((mD_w-1)(2D+mD_w-m)(r^m)\).

Let us define a set of \(w_1,\ldots ,w_m\), each in \(\mathbb {C}^{D_w}\), to be p-bad if there are \(x,y\), each in \(\mathbb {C}^D\), with \((x \not \sim _p y)\) but such that, for all i, we have \(p(x,w_i)=p(y,w_i)\). By construction, the p-bad set lies in the zero set of F. Let us define a set of \(w_1,\ldots ,w_m\), each in \(\mathbb {R}^{D_w}\), to be G-bad if there are \(x,y\), each in \(\mathbb {R}^D\), with \((x \not \sim _G y)\) but such that, for all i, we have \(p(x,w_i)=p(y,w_i)\). The G-bad set lies in the real locus of the p-bad set. Thus, it lies in the zero sets of \(F_r\) and \(F_i\), the real polynomials defined by taking, respectively, the real/imaginary components of the coefficients of F. At least one of these two polynomials is nonzero. Such a nonzero polynomial gives us our f in the statement of the lemma.\(\square \)

It is less clear how to extend this argument to the full real semi-algebraic setting.

6 A Toy Experiment

To visualize the possible implications of our results for invariant machine learning, we consider the following toy example: We create random high-dimensional point clouds in \(\mathbb {R}^{3\times 1024}\) which reside in an \(S_n\)-invariant ‘manifold’ \(\mathcal {M}\) (with \(n=1024\)) of low intrinsic dimension \(D_{\mathcal {M}}\). In fact, \(\mathcal {M}\) is a union of two invariant ‘manifolds’ \(\mathcal {M}=\mathcal {M}_0\cup \mathcal {M}_1 \) of dimension \(D_{\mathcal {M}}\), and we consider the problem of learning the resulting binary classification task.

The binary classification task is visualized in Fig. 1a. In this figure, \(D_{\mathcal {M}}=1\), each \(\mathcal {M}_0,\mathcal {M}_1\) is a line in \(\mathbb {R}^{3\times 1024}\) and all its possible permutations, and points in \(\mathcal {M}_0,\mathcal {M}_1\) are projected onto \(\mathbb {R}^3\) for visualization. While these data may appear hopelessly entangled, using the permutation-invariant mapping we describe in (9) with \(m=2D+1=3 \) to embed \(\mathcal {M}_0,\mathcal {M}_1\) into \(\mathbb {R}^3\) we obtain very good separation of the initial curves as shown in Subplot (b). Note that the non-intersection of the images of \(\mathcal {M}_0,\mathcal {M}_1\) is guaranteed by Proposition 3.1.

In Fig. 1c, we show the results obtained for the binary classification task by first computing the invariant embedding in (9) with randomly chosen weights, and then applying an MLP (multilayer perceptron) to the resulting embedding. The results on train and test data are shown for various choices of intrinsic dimension \(D=D_{\mathcal {M}}\) and embedding dimension m. In particular, for the \(D=1,m=3\) case visualized in Fig. 1a, b we get \(98 \% \) accuracy on the test dataset.

The diagonal entries in the tables show the accuracy obtained for varying intrinsic dimensions D and embedding dimension \(m=2D+1 \). Recall that by Proposition 3.1 our embedding is separating for these D, m values, and thus theoretically perfect separation can be obtained by applying an MLP to the embedding. The diagonal entries in the tables show that indeed high accuracy can be obtained for these (D, m) pairs. At the same time, we also see that taking higher-dimensional embeddings \(m>2D+1 \) leads to improved accuracy. This is consistent with the common observation that deep learning algorithms are more successful in the over-parameterized regime, as well as with results on phase retrieval [15] and random linear projections [6], where the embedding dimension needed for stable recovery is typically larger than the minimal dimension needed for injective embedding. In any case, we note that in all cases we obtain high accuracy with embedding dimension much smaller than the extrinsic dimension \(3\times 1024=3072 \).

Additional details on the experimental setup can be found in Appendix B. Code for reconstructing our experiment can be found in [12].

7 Conclusion and Future Work

The main result of this paper is the construction of a small number of efficiently computable separating invariants for various group actions on point clouds. Many interesting questions remain. One example is studying the optimal cardinality necessary for separation. As mentioned above, in phase retrieval it is known that the number of invariants needed for separation is slightly less than twice the dimension, and we believe this is the case for the other invariant separation problems we discuss here as well. Another important question, which is discussed, e.g., in [5, 14], is understanding how stable given separating invariants are: Separating invariants are essentially an injective mapping from the quotient space into \(\mathbb {R}^m\). Stability in this context means that the natural metric on the quotient space should not be severely distorted by the injective mapping.

Perhaps the most important challenge is translating the theoretical insights presented here into invariant learning algorithms with strong empirical performance, provable separation and universality properties, and reasonable computational complexity.

A useful direction for reducing computational complexity is ‘settling’ for generic separation, which, as we saw, can often be achieved with a small computational burden. In general, the downside of this is that there is a low-dimensional singular set on which there is no separation. This disadvantage will only be significant, for a given learning task, if a significant percentage of the data resides on or near the singular set. Therefore, it could be useful to understand what the singular sets of various generic separators are, and how likely they are to be encountered in specific data.

We hope to address these questions in future work, and hope others are inspired to do so as well.