

Information Geometry

“One geometry cannot be more true than another; it can only be more convenient.” - Henri Poincaré


This name identifies the pattern and should be representative of the concept it describes. The name should be a noun that is easily usable within a sentence, so that the pattern can be readily referenced in conversation between practitioners.


Provide a unifying treatment of non-parametric exponential models and parametric models.


How can we describe the evolution path of models? How can we unify parametric models with non-parametric models? How can we generalize the concept of distance? How can we generalize the idea of clustering? Are there more efficient ways to evolve our models?


This section provides alternative descriptions of the pattern in the form of an illustration or an alternative formal expression. By looking at the sketch, a reader may quickly understand the essence of the pattern.


Information geometry is a formal mathematical framework that studies the evolution of information in geometric terms.

The benefit of understanding information geometry is that it provides an intuitive approach to exploring information spaces. We are, of course, assuming that information spaces conform to some geometric logic. Geometry makes assumptions about how space is constructed, and these assumptions may not necessarily hold in the domain of deep learning systems. At best, we can employ geometry to reason and build arguments supporting the construction of our architectures. The main benefit is that it provides a unifying umbrella around different mechanisms for defining models.

Essentially, we want a framework that ties together the different models we use as building blocks. These involve not just parametric models such as those we find in a finite vector dot product (see Similarity), but also non-parametric models like those found in the exponential family of distributions. This at least gives us a principled way to coordinate our use of neural network constructions and probabilistic constructions.
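As an illustrative sketch (the function name here is our own, not part of the pattern), the exponential family makes the dot-product connection explicit: the density is the exponential of a dot product between natural parameters and sufficient statistics, the same bilinear form that appears in a neural network's linear layer. A minimal example with the Bernoulli distribution:

```python
import math

def bernoulli_exp_family(x, theta):
    """Bernoulli density written in exponential-family form:
    p(x | theta) = exp(theta * T(x) - A(theta)),
    with sufficient statistic T(x) = x and
    log-partition A(theta) = log(1 + e^theta).
    theta is the natural parameter (the log-odds)."""
    return math.exp(theta * x - math.log(1.0 + math.exp(theta)))

# The natural parameter is theta = log(p / (1 - p)); for p = 0.7:
p = 0.7
theta = math.log(p / (1 - p))
assert abs(bernoulli_exp_family(1, theta) - p) < 1e-12
assert abs(bernoulli_exp_family(0, theta) - (1 - p)) < 1e-12
```

The dot product theta * T(x) is the one-dimensional case of the general inner product between a parameter vector and a statistic vector, which is what lets parametric and exponential-family views sit in one framework.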

Furthermore, abstractions of the concept of distance give us more general ways of performing object comparisons. Our models always need a mechanism for measuring similarity. The standard linear models of neural networks may not be the final say in how we define similarity. In fact, the Adversarial Features problem hints at a major deficiency of any machine learning algorithm that employs linear matching elements. Perhaps this problem is solvable with an alternative similarity measure.
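One concrete alternative to a linear similarity measure is to compare distributions rather than raw parameter vectors. As a sketch (assuming univariate Gaussian models; the closed-form KL expression below is standard), note that the result is asymmetric, so it is a divergence rather than a true distance:

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    """Closed-form KL divergence KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Unlike Euclidean distance on parameters, KL is asymmetric: it measures
# how distinguishable one distribution is from another, not a plain gap.
a = kl_gaussian(0.0, 1.0, 1.0, 2.0)
b = kl_gaussian(1.0, 2.0, 0.0, 1.0)
assert a != b                                   # asymmetry: not a metric
assert kl_gaussian(0.0, 1.0, 0.0, 1.0) == 0.0   # zero only for identical models
```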

Information Geometry is built around a metric (i.e., Fisher information) that permits us to treat information in geometric terms. Knowledge of this gives us an abstraction for building more efficient learning algorithms. In fact, there is plenty of research on Natural Gradient Descent as an alternative to the less efficient gradient descent method. In our search for faster learning we need to discover shortcuts. Our human experience suggests that learning may not entirely comply with Computational Irreducibility: humans can somehow perform learning shortcuts such that a time- and data-consuming training effort is not needed.
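A minimal sketch of the idea behind Natural Gradient Descent, in the simplest case where the Fisher information is a scalar (a single Bernoulli parameter; the function names are illustrative). The ordinary gradient is preconditioned by the inverse Fisher information, so step sizes adapt to the local geometry of the distribution space:

```python
def fisher_bernoulli(p):
    """Fisher information of Bernoulli(p) in its mean parameter p."""
    return 1.0 / (p * (1.0 - p))

def natural_gradient_step(p, grad, lr=0.1):
    """One natural gradient descent update: p - lr * F(p)^-1 * grad.
    Near p = 0 or p = 1 the Fisher information is large, so the
    effective step shrinks -- the update respects how sensitive the
    distribution is to parameter changes, not just Euclidean distance."""
    return p - lr * grad / fisher_bernoulli(p)

# The same Euclidean gradient produces different effective step sizes:
step_mid  = 0.5  - natural_gradient_step(0.5,  1.0)   # F(0.5) = 4
step_edge = 0.99 - natural_gradient_step(0.99, 1.0)   # F(0.99) ~ 101
assert step_mid > step_edge   # smaller steps where the metric is large
```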

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns

In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be precise or fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but work well in combination with this pattern.

Relationship to Canonical Patterns:

  • Entropy is a divergence measure that does not comply with the triangle inequality.
  • Distance Measure in the form of the Fisher Information Matrix gives us a way to compare models.
  • Disentangled Basis: Exponential forms are used to construct generative models.
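The first bullet's claim can be checked numerically. A small sketch (using relative entropy between Bernoulli distributions; the function name is ours) showing that KL divergence fails the triangle inequality:

```python
import math

def kl_bernoulli(p, q):
    """Relative entropy (KL divergence) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Three Bernoulli distributions along a line of parameters:
p, q, r = 0.1, 0.5, 0.9
direct = kl_bernoulli(p, r)
via_q  = kl_bernoulli(p, q) + kl_bernoulli(q, r)
# The triangle inequality would require direct <= via_q; KL violates it:
assert direct > via_q
```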

Relationship to other Patterns

Cited by these patterns:

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.


I wanted to add two possible reasons why Information Geometry has not taken hold in Machine Learning:

1. The parameter space of non-trivial models (i.e. not in the exponential family) is not necessarily a manifold and can have singularities that require concepts from algebraic geometry. I'm thinking here of the work of Sumio Watanabe in Japan and Pachter and Sturmfels in the USA ('Algebraic Statistics in Computational Biology'). The implication is that Information Geometry as synthesized by Amari in the early 80's is incomplete for Machine Learning problems in general. Although I have no reference to Amari saying this directly, his collaboration with Watanabe implies it.

2. Once you leave the exponential family of models, you are almost guaranteed that the Fisher metric cannot be calculated in closed form. Furthermore, the dimensionality of the space under consideration can be huge even in the most trivial departures from the exactly-solvable models. For example, a mixture of one-dimensional Gaussian models requires carrying out differential geometry calculations in a (3m - 1)-dimensional space, where m is the number of mixture components.
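The dimension count in the comment above can be made explicit. A trivial sketch (the function name is ours) of the parameter bookkeeping for a mixture of m one-dimensional Gaussians:

```python
def gmm_1d_param_dim(m):
    """Parameter count for a mixture of m one-dimensional Gaussians:
    m means + m variances + (m - 1) free mixture weights
    (the weights sum to 1, removing one degree of freedom) = 3m - 1."""
    return m + m + (m - 1)

assert gmm_1d_param_dim(1) == 2   # a single Gaussian: mean and variance
assert gmm_1d_param_dim(3) == 8   # already an 8-dimensional space
```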

In other words, it is very appealing to think in geometric terms, but the dimensionality of the space and the lack of closed-form solutions for the metric make it very hard to actually calculate the geometry of the space, never mind obtain insights from such calculations.

Classification with Mixtures of Curved Mahalanobis Metrics

The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

Relative Natural Gradient for Learning Large Complex Models

Fisher information and the natural gradient have provided deep insights and powerful tools for artificial neural networks. However, the related analysis becomes more and more difficult as the learner's structure grows large and complex. This paper makes a preliminary step in a new direction: we extract a local component of a large neuron system and define its relative Fisher information metric, which describes this small component accurately and is invariant to the other parts of the system. This concept is important because the geometric structure is much simplified and can easily be applied to guide the learning of neural networks.

  • Pattern learning and recognition on statistical manifolds: An information-geometric review
  • An Information-Geometric Characterization of Chernoff Information
  • Determination of the edge of criticality in echo state networks through Fisher information maximization
  • How deep learning works — The geometry of deep learning