Generative models of natural images

University of Minnesota, Spring Semester, 2015

Psy 8036 (001 59790)

Dan Kersten (
with guest instructors, Sheng He, ...

While computer vision has made substantial progress in developing algorithms for various restricted visual tasks, including object detection and recognition, human-like visual capabilities remain elusive. Likewise, while there has been substantial progress in understanding human vision and its relation to cortical activity, we do not understand the brain’s algorithms underlying functional behaviors. One can identify three fundamental problems that the human visual system has solved but whose computational solutions remain a challenge: 1) Managing uncertainty. The visual system can do robust scene inference with multiple features whose causes are locally uncertain. 2) Scalability. Humans can deal with the enormous space of possible objects as they appear in different contexts in natural images. 3) Task flexibility. Humans rapidly adapt to new visual tasks and novel environments.

These are “big” problems, but there is an argument that solutions will rely on advances in modeling the generation of natural images. A generative model provides “explanations”, conscious or unconscious, of incoming image information. Such explanations can range from shallow to deep. At one extreme, there is a good understanding of low-level “natural image statistics” as (shallow) summaries. At the other extreme, there is also a good understanding of how 3D graphics can provide high-level (deep) explanations of how images result from physical scenes. Deep, causal models provide, in principle, knowledge that can be used to deal with local uncertainty, scalability and task flexibility. However, there are substantial unsolved problems as to how to structure and use deep generative knowledge for image interpretation. For example, to date it has not been feasible to interpret natural images by fitting a 3D graphics model.

In this seminar we will review research relevant to the development and application of generative models in several specific domains, including human faces and expressions, visual character and word forms, human body poses and actions, and dynamic flows.


Meeting time: First meeting Tuesday, Jan 20th, 3:00 pm. Regular time to be decided.
Place: Elliott S204

Schedule and Readings (under construction)

Background material Discussion papers
Lecture notes



Theory: Image models. Natural image statistics features and dictionaries.

Relation to feedforward models

Discriminative vs. generative models

Lecture notes

Xie, J., Hu, W., Zhu, S.-C., & Wu, Y. N. (2014). Learning Sparse FRAME Models for Natural Image Patterns. International Journal of Computer Vision. doi:10.1007/s11263-014-0757-x

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), 1097–1105.




Theory: 3D graphics, physics-based modeling, dynamics

Lecture notes

Thompson, W., Fleming, R., Creem-Regehr, S., & Stefanucci, J. K. (2011). Visual Perception from a Computer Graphics Perspective (1st ed.). A K Peters/CRC Press.




Theory: Appearance-based vs. structural models. Image grammars & gestalt principles.

Lecture notes

Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115–147.

Jin, Y., & Geman, S. (2006). Context and hierarchy in a probabilistic image model. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2, 2145–2152.

Zhu, L., Chen, Y., & Yuille, A. (2011). Recursive Compositional Models for Vision: Description and Review of Recent Work. Journal of Mathematical Imaging and Vision, 41(1-2), 122–146.

Feldman, J. (2009). Bayes and the simplicity principle in perception. Psychological Review, 116(4), 875–887.




Faces I

Lecture notes on early work on face modeling: 2D vs. 3D eigenfaces, Perrett,...

Terzopoulos, D., & Waters, K. (1993). Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 15(6), 569–579.

Cunningham, D. W., Kleiner, M., Bülthoff, H. H., & Wallraven, C. (2004). The components of conversational facial expressions. APGV '04 Proceedings of the 1st Symposium on Applied Perception in Graphics and Visualization, 143–150.





Faces II

Leopold, D. A., O'Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4(1), 89–94. doi:10.1038/82947

Fang, F., & He, S. (2005). Viewer-Centered Object Representation in the Human Visual System Revealed by Viewpoint Aftereffects. Neuron, 45(5), 793–800. doi:10.1016/j.neuron.2005.01.037





Human body pose/actions I

Lecture notes

Troje, N. F. (2002). Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2(5).

Blake, R., & Shiffrar, M. (2007). Perception of Human Motion. Annual Review of Psychology, 58(1), 47–73. doi:10.1146/annurev.psych.57.102904.190152


Spring Break


Human body pose/actions II

Toshev, A., & Szegedy, C. (2013). DeepPose: Human pose estimation via deep neural networks. arXiv preprint arXiv:1312.4659.

Chen, X., & Yuille, A. L. (2014). Articulated Pose Estimation with Image-Dependent Preference on Pairwise Relations. NIPS 2014.


Orlov, T., Makin, T. R., & Zohary, E. (2010). Topographic Representation of the Human Body in the Occipitotemporal Cortex. Neuron, 68(3), 586–600. doi:10.1016/j.neuron.2010.09.032

Weiner, K. S., & Grill-Spector, K. (2011). Not one extrastriate body area: Using anatomical landmarks, hMT+, and visual field maps to parcellate limb-selective activations in human lateral occipitotemporal cortex. NeuroImage, 56(4), 2183–2199. doi:10.1016/j.neuroimage.2011.03.041



Characters, words


LeCun, Hinton on single digits...

Visual word forms



Human hands, gestures

Grenander and more recent...



Dynamic flows I


Physics vs. appearance models, Kinematics, dynamics and photometric flow

Doretto, G., Chiuso, A., Wu, Y. N., & Soatto, S. (2003). Dynamic textures. International Journal of Computer Vision, 51(2), 91–109.

Bergou, M., Audoly, B., Vouga, E., Wardetzky, M., & Grinspun, E. (2010). Example-based wrinkle synthesis for clothing animation. ACM Transactions on Graphics (TOG), 29(4), 107. doi:10.1145/1833349.1778844




Dynamic flows II



Object physics

Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences of the United States of America, 110(45), 18327–18332. doi:10.1073/pnas.1306572110





Scenes and places


Han, F., & Zhu, S.-C. (2009). Bottom-up/top-down image parsing with attribute grammar. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(1), 59–73. doi:10.1109/TPAMI.2008.55



Deeper causal models: Modeling and recognizing intentions