**A** most unusual definition (?) of sufficiency came up on X validated this morn, as stated in Koller and Friedman’s Probabilistic Graphical Models. But as reported, it is quite restrictive, apparently limited to the natural statistic of an exponential family with conditionally Uniform ancillary (since the likelihood functions are *equal* rather than *proportional*). Even more strangely, with this formulation, the Normal sample size *n* *[typo on the last line of the question]* appears as a component of the sufficient statistic (Example 17.4). While not being random.
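To make the Normal example concrete, here is a minimal sketch in Python (my own illustration, not code from Koller and Friedman; the function names and the two samples are made up for the purpose). It checks that two different samples sharing the triple (n, Σx, Σx²) have exactly equal Normal likelihoods at any (μ, σ), and that the likelihood can be rebuilt from that triple alone, the non-random n included.

```python
import math

def normal_loglik(xs, mu, sigma):
    """Log-likelihood of an i.i.d. N(mu, sigma^2) sample, summed point by point."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )

def summary(xs):
    """The triple (n, sum of x, sum of x^2); hypothetical helper name."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def loglik_from_summary(stat, mu, sigma):
    """Same log-likelihood, computed from the triple only."""
    n, s1, s2 = stat
    quad = s2 - 2 * mu * s1 + n * mu**2  # equals sum of (x - mu)^2
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - quad / (2 * sigma**2)

# Two made-up samples with the same n, the same sum and the same sum of squares.
sample_a = [0.0, 3.0, 3.0]
sample_b = [1.0, 1.0, 4.0]

for mu, sigma in [(0.0, 1.0), (2.0, 0.5)]:
    print(mu, sigma,
          normal_loglik(sample_a, mu, sigma),
          normal_loglik(sample_b, mu, sigma),
          loglik_from_summary(summary(sample_a), mu, sigma))
```

In this sketch n is carried along in the summary simply because the likelihood formula needs it, which is one way to read its appearance as a component of the statistic.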

## Archive for graphical models

## a most unusual definition of sufficiency

Posted in Books, Kids, Statistics with tags ancillary statistics, cross validated, graphical models, sufficient statistics on January 13, 2021 by xi'an

## ISBA@NIPS

Posted in Statistics, Travel, University life with tags ABC in Montréal, Canada, graphical models, ISBA, machine learning, Montréal, NIPS 2014, Québec, travel award, variational Bayes methods on September 2, 2014 by xi'an

*[An announcement from ISBA about sponsoring young researchers at NIPS that links with my earlier post that our ABC in Montréal proposal for a workshop had been accepted and a more global feeling that we (as a society) should do more to reach towards machine-learning.]*

**T**he International Society for Bayesian Analysis (ISBA) is pleased to announce its new initiative *ISBA@NIPS*, an initiative aimed at highlighting the importance and impact of Bayesian methods in the new era of data science.

Among the first actions of this initiative, ISBA is endorsing a number of *Bayesian satellite workshops* at the Neural Information Processing Systems (NIPS) Conference, that will be held in Montréal, Québec, Canada, December 8-13, 2014.

Furthermore, a special ISBA@NIPS Travel Award will be granted to the best Bayesian invited and contributed paper(s) among all the ISBA endorsed workshops.

ISBA endorsed workshops at NIPS

- ABC in Montréal. This workshop will include topics on: Applications of ABC to machine learning, e.g., computer vision, other inverse problems (RL); ABC Reinforcement Learning (other inverse problems); Machine learning models of simulations, e.g., NN models of simulation responses, GPs etc.; Selection of sufficient statistics and massive dimension reduction methods; Online and post-hoc error; ABC with very expensive simulations and acceleration methods (surrogate modelling, choice of design/simulation points).
- Networks: From Graphs to Rich Data. This workshop aims to bring together a diverse and cross-disciplinary set of researchers to discuss recent advances and future directions for developing new network methods in statistics and machine learning.
- Advances in Variational Inference. This workshop aims at highlighting recent advancements in variational methods, including new methods for scalability using stochastic gradient methods, extensions to the streaming variational setting, improved local variational methods, inference in non-linear dynamical systems, principled regularisation in deep neural networks, and inference-based decision making in reinforcement learning, amongst others.
- Women in Machine Learning (WiML 2014). This is a day-long workshop that gives female faculty, research scientists, and graduate students in the machine learning community an opportunity to meet, exchange ideas and learn from each other. Under-represented minorities and undergraduates interested in machine learning research are encouraged to attend.

## JSM 2014, Boston

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags ABC, Abraham De Moivre, Boston, Cambridge, graphical models, Harvard University, JSM 2014, likelihood, Massachusset, residuals, Stephen Stigler, T.E. Lawrence, trek rule theorem on August 6, 2014 by xi'an

**A** new Joint Statistical meeting (JSM), first one since JSM 2011 in Miami Beach. After solving [or not] a few issues on the home front (late arrival, one lost bag, morning run, flat in a purely residential area with no grocery store nearby and hence no milk for tea!), I “trekked” to [and then through] the faraway and sprawling Boston Convention Centre and was there in (plenty of) time for Mathias Drton’s Medallion Lecture on linear structural equations. (The room was small and crowded and I was glad to be there early enough!, although there were no Cerberus [Cerberi?] to prevent additional listeners from sitting on the ground, as in Washington D.C. a few years ago.) The award was delivered to Mathias by Nancy Reid from Toronto (and reminded me of my Medallion Lecture in exotic Fairbanks ten years ago). I had alas missed Gareth Roberts’ Blackwell Lecture on Rao-Blackwellisation, as I was still on the plane from Paris, trying to cut on my slides and to spot known Icelandic locations from glancing sideways at the movie *The Secret Life of Walter Mitty* played on my neighbour’s screen. (Vik?)

**M**athias started his wide-ranging lecture by linking linear structural models with graphical models and specific features of covariance matrices. I did not spot a motivation for the introduction of confounding factors, a point that always puzzles me in this literature [as I must have repeatedly mentioned here]. The “reality check” slide made me hopeful but it was mostly about causality [another of or the same among my stumbling blocks]… What I have trouble understanding is how much results from the modelling and how much follows from this “reality check”. A notion novel to me and revealed by the talk was the “trek rule”, expressing the covariance between two variables as a sum over the “treks” (sequences of edges) linking those variables, each trek contributing the product of its edge coefficients. This is not a new notion, since it was introduced by Wright (1921), but it is a very elegant representation of the matrix inversion of (I-Λ) as a power series. Mathias made it sound quite intuitive even though I would have difficulties rephrasing the principle solely from memory! It made me [vaguely] wonder at computational implications for simulation of posterior distributions on covariance matrices. Although I missed the fundamental motivation for those mathematical representations. The last part of the talk was a series of mostly open questions about the maximum likelihood estimation of covariance matrices, from existence to unimodality to likelihood-ratio tests. And an interesting instance of favouring bootstrap subsampling. As in random forests.
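As a rough illustration of the power series behind the trek rule, here is a minimal sketch in plain Python (not taken from the lecture; the three-node graph, its edge weights and the helper names are assumptions of mine). It uses the convention X = ΛᵀX + ε with Cov(ε) = Ω, so that the covariance matrix is (I-Λ)⁻ᵀ Ω (I-Λ)⁻¹, and it relies on Λ being nilpotent for an acyclic graph, which makes the series I + Λ + Λ² + … finite.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(p):
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def neumann_inverse(lam):
    """I + lam + ... + lam^(p-1), equal to (I - lam)^(-1) when lam is nilpotent."""
    p = len(lam)
    total, power = identity(p), identity(p)
    for _ in range(p - 1):
        power = matmul(power, lam)
        total = [[total[i][j] + power[i][j] for j in range(p)] for i in range(p)]
    return total

# Hypothetical acyclic graph 1 -> 2 -> 3 plus 1 -> 3; lam[i][j] weights edge i -> j.
lam = [[0.0, 0.5, 0.25],
       [0.0, 0.0, 2.0],
       [0.0, 0.0, 0.0]]
omega = identity(3)  # assumed independent, unit-variance errors

inv = neumann_inverse(lam)
i_minus_lam = [[identity(3)[i][j] - lam[i][j] for j in range(3)] for i in range(3)]
check = matmul(i_minus_lam, inv)                     # should be the identity
sigma = matmul(matmul(transpose(inv), omega), inv)   # implied covariance matrix

# Treks between nodes 2 and 3, enumerated by hand for this graph:
# 2 -> 3, then 2 <- 1 -> 3, then 2 <- 1 -> 2 -> 3 (all error variances equal 1).
trek_sum_23 = 2.0 + 0.5 * 0.25 + 0.5 * (0.5 * 2.0)
print(sigma[1][2], trek_sum_23)
```

The (1,3) entry of the series collects the two directed paths from node 1 to node 3, and the covariance between nodes 2 and 3 matches the hand enumeration of treks, at least on this toy graph.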

**I** also attended the ASA Presidential address of Stephen Stigler on the seven pillars of statistical wisdom. In connection with T.E. Lawrence’s 1927 book. (Actually, 1922.) Itself in connection with Proverbs IX:1. Unfortunately wrongly translated as *seven pillars* rather than *seven sages*. Here are Stephen’s pillars:

1. *aggregation*, which leads to gaining information by throwing away information, aka the sufficiency principle [one may wonder at the extension of this principle to non-exponential families]
2. *information* accumulating at the √n rate, aka precision of statistical estimates, aka CLT confidence [quoting our friend de Moivre at the core of this discovery]
3. *likelihood* as the right calibration of the amount of information brought by a dataset [including Bayes’ essay]
4. *intercomparison* [i.e. scaling procedures from variability within the data, sample variation], eventually leading to the bootstrap
5. *regression* [linked with Darwin’s evolution of species, albeit paradoxically] as conditional expectation, hence as a Bayesian tool
6. *design of experiment* [enters Fisher, with his revolutionary vision of changing all factors in Latin square designs]
7. *residuals* [aka goodness of fit but also ABC!]

**M**aybe missing the positive impact of the arbitrariness of picking or imposing a statistical model upon an observed dataset. Maybe not as it is somewhat covered by #3, #4 and #7. The reliance on the reproducibility of the data could be the ground on which those pillars stand.

## Bayesian programming [book review]

Posted in Books, Kids, pictures, Statistics, University life with tags artificial intelligence, Bayesian inference, Bayesian programming, CHANCE, conjugate priors, E.T. Jaynes, graphical models, maximum entropy, Python, robots on March 3, 2014 by xi'an

“We now think the Bayesian Programming methodology and tools are reaching maturity. The goal of this book is to present them so that anyone is able to use them. We will, of course, continue to improve tools and develop new models. However, pursuing the idea that probability is an alternative to Boolean logic, we now have a new important research objective, which is to design specific hardware, inspired from biology, to build a Bayesian computer.” (p.xviii)

**O**n the plane to and from Montpellier, I took an extended look at Bayesian Programming, a CRC Press book recently written by Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha. *(Very nice picture of a fishing net on the cover, by the way!)* Despite the initial excitement at seeing a book whose final goal was to achieve a Bayesian computer, as demonstrated by the above quote, I however soon found the book too arid to read due to its highly formalised presentation… The contents are clear indications that the approach is useful as they illustrate the use of Bayesian programming in different decision-making settings, including a collection of Python codes, so it brings an answer to the *what* but it somehow misses the *how* in that the construction of the priors and the derivation of the posteriors are not explained in a way one could replicate.

“A modeling methodology is not sufficient to run Bayesian programs. We also require an efficient Bayesian inference engine to automate the probabilistic calculus. This assumes we have a collection of inference algorithms adapted and tuned to more or less specific models and a software architecture to combine them in a coherent and unique tool.” (p.9)

**F**or instance, all models therein are described via the curly brace formalism summarised by

which quickly turns into an unpalatable object, as in this example taken from the online PhD thesis of Gabriel Synnaeve (where he applied Bayesian programming principles to a MMORPG called StarCraft and developed an AI (or bot) able to play BroodwarBotQ)

thesis that I found most interesting!

“Consequently, we have 21 × 16 = 336 bell-shaped distributions and we have 2 × 21 × 16 = 772 free parameters: 336 means and 336 standard deviations.” (p.51)

**N**ow, getting back to the topic of the book, I can see connections with statistical problems and models, and not only via the application of Bayes’ theorem, when the purpose (or *Question*) is to take a decision, for instance in a robotic action. I still remain puzzled by the purpose of the book, since it starts with very low expectations on the reader, but hurries past notions like Kalman filters and Metropolis-Hastings algorithms in a few paragraphs. I do not get some of the details, like this notion of a discretised Gaussian distribution (I eventually found the place where the 772 prior parameters are “learned” in a phase called “identification”.)

“Thanks to conditional independence the curse of dimensionality has been broken! What has been shown to be true here for the required memory space is also true for the complexity of inferences. Conditional independence is the principal tool to keep the calculation tractable. Tractability of Bayesian inference computation is of course a major concern as it has been proved NP-hard (Cooper, 1990).”(p.74)

**T**he final chapters (Chap. 14 on “Bayesian inference algorithms revisited”, Chap. 15 on “Bayesian learning revisited” and Chap. 16 on “Frequently asked questions and frequently argued matters” [!]) are definitely those I found easiest to read and relate to. With mentions made of conjugate priors and of the EM algorithm as a (Bayes) classifier. The final chapter mentions BUGS, Hugin and… Stan! Plus a sequence of 23 PhD theses defended on Bayesian programming for robotics in the past 20 years. And explains the authors’ views on the difference between Bayesian programming and Bayesian networks (“any Bayesian network can be represented in the Bayesian programming formalism, but the opposite is not true”, p.316), between Bayesian programming and probabilistic programming (“we do not search to extend classical languages but rather to replace them by a new programming approach based on probability”, p.319), between Bayesian programming and Bayesian modelling (“Bayesian programming goes one step further”, p.317), with a further (self-)justification of why the book sticks to discrete variables, and further more philosophical sections referring to Jaynes and the principle of maximum entropy.

“The “objectivity” of the subjectivist approach then lies in the fact that two different subjects with same preliminary knowledge and same observations will inevitably reach the same conclusions.”(p.327)

Bayesian Programming thus provides a good snapshot of (or window on) what one can achieve in uncertain environment decision-making with Bayesian techniques. It shows a long-term reflection on those notions by Pierre Bessière, his colleagues and students. The topic is most likely too remote from my own interests for the above review to be complete. Therefore, if anyone is interested in reviewing any further this book for CHANCE, before I send the above to the journal, please contact me. (Usual provisions apply.)

## cut, baby, cut!

Posted in Books, Kids, Mountains, R, Statistics, University life with tags BUGS, Chamonix, CREST, cut models, decompression, flu, graphical models, JAGS, Martyn Plummer, MCMC, MCMSki IV, Monte Carlo Statistical Methods, OpenBUGS, The BUGS book on January 29, 2014 by xi'an

**A**t MCMSki IV, I attended (and chaired) a session where Martyn Plummer presented some developments on cut models. As I was not sure I had gotten the idea *[although this happened to be one of those few sessions where the flu had not yet completely taken over!]* and as I wanted to check about a potential explanation for the lack of convergence discussed by Martyn during his talk, I decided to (re)present the talk at our “MCMSki decompression” seminar at CREST. Martyn sent me his slides and also kindly pointed me to the relevant section of the BUGS book, reproduced above. *(Disclaimer: do not get me wrong here, the title is a pun on the infamous “drill, baby, drill!” and not connected in any way to Martyn’s talk or work!)*

**I** cannot say I get the idea any clearer from this short explanation in the BUGS book, although it gives a literal meaning to the word “cut”. From this description I only understand that a *cut* is the removal of an edge in a probabilistic graph; however, there must/may be some arbitrariness in building the wrong conditional distribution. In the Poisson-binomial case treated in Martyn’s talk, I interpret the cut as simulating from

π(φ|z) π(θ|φ,y)

instead of

π(φ|z,y) π(θ|φ,y)

hence losing some of the information about φ… Now, this cut version is a function of φ and θ that can be fed to a Metropolis-Hastings algorithm. Assuming we can handle the posterior on φ and the conditional on θ given φ. If we build a Gibbs sampler instead, we face a difficulty with the normalising constant m(y|φ). Said Gibbs sampler thus does not work in generating from the “cut” target. Maybe an alternative borrowing from the rather large if disparate missing constant toolbox. (In any case, we *do not* simulate from the original joint distribution.) The natural solution would then be to make an independent proposal on φ with target the posterior given z and then any scheme that preserves the conditional of θ given φ and y; “any” is rather wishful thinking at this stage since the only practical solution that I see is to run a Metropolis-Hastings sampler long enough to “reach” stationarity… I also remain with a lingering although not life-threatening question of whether or not the BUGS code using cut distributions provides the “right” answer. Here are my five slides used during the seminar (with a random walk implementation that did not diverge from the true target…):
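Separately from the slides, here is a minimal sketch in Python of what sampling from a cut target could look like. It rests on my reading that the cut draws φ from its posterior given z alone and then θ from its conditional given φ and y. The model is a toy assumption of mine, neither Martyn's example nor BUGS code: z ~ Bin(n, φ) with a uniform prior on φ, and y ~ Poisson(θ φ T) with a Gamma(a, b) prior on θ, where the exposure T, the hyperparameters and the function name are all hypothetical.

```python
import random

def sample_cut(z, n, y, exposure, a=1.0, b=1.0, size=10_000, seed=0):
    """Draw (phi, theta) pairs from the assumed cut target pi(phi | z) * pi(theta | phi, y)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        # phi | z ~ Beta(1 + z, 1 + n - z): y is deliberately ignored, this is the cut.
        phi = rng.betavariate(1 + z, 1 + n - z)
        # theta | phi, y ~ Gamma(shape a + y, rate b + phi * exposure);
        # random.gammavariate takes a scale, hence the inverse of the rate.
        theta = rng.gammavariate(a + y, 1.0 / (b + phi * exposure))
        draws.append((phi, theta))
    return draws

draws = sample_cut(z=7, n=20, y=15, exposure=100.0)
phi_mean = sum(phi for phi, _ in draws) / len(draws)
print(phi_mean)  # should be close to the Beta mean (1 + z) / (2 + n)
```

Direct simulation only works here because both conditionals are available in closed form in this toy model. In the full posterior, φ would also be updated through y via a term proportional to m(y|φ), and this is the piece of information the cut discards.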