Instructors:
Dan Kersten (kersten@umn.edu)
with guest instructors, Sheng He, ...
While computer vision has made substantial progress in developing algorithms for various restricted visual tasks, including object detection and recognition, human-like visual capabilities remain elusive. Likewise, while there has been substantial progress in understanding human vision and its relation to cortical activity, we do not understand the brain’s algorithms underlying functional behaviors. One can identify three fundamental problems that the human visual system has solved but whose computational solutions remain a challenge:
1) Managing uncertainty. The visual system can perform robust scene inference from multiple features whose causes are locally uncertain.
2) Scalability. Humans can deal with the enormous space of possible objects as they appear in different contexts in natural images.
3) Task flexibility. Humans rapidly adapt to new visual tasks and novel environments.

These are “big” problems, but there is an argument that solutions will rely on advances in modeling the generation of natural images. A generative model provides “explanations”, conscious or unconscious, of incoming image information. Such explanations can range from shallow to deep. For example, there is a good understanding of low-level “natural image statistics” as (shallow) summaries. At the other extreme, there is also a good understanding of how 3D graphics can provide high-level (deep) explanations of how images result from physical scenes.

Deep, causal models provide, in principle, knowledge that can be used to deal with local uncertainty, scalability and task flexibility. However, there are substantial unsolved problems as to how to structure and use deep generative knowledge for image interpretation. For example, to date it has not been feasible to interpret natural images by fitting a 3D graphics model.
In this seminar we will review research relevant to the development and application of generative models in several specific domains including: human faces and expressions, visual character and word forms, human body poses and actions, and dynamic flows.
Meeting time: First meeting Tuesday, Jan 20th, 3:00 pm. Regular time to be decided.
Place: Elliott S204
Week | Topics | Background material | Discussion papers |
1. | Introduction |
Lecture notes |
|
2. | Theory: Image models. Natural image statistics, features, and dictionaries. Relation to feedforward models. Discriminative vs. generative models. |
Xie, J., Hu, W., Zhu, S.-C., & Wu, Y. N. (2014). Learning Sparse FRAME Models for Natural Image Patterns. International Journal of Computer Vision. doi:10.1007/s11263-014-0757-x
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), 1097–1105.
|
|
3. | Theory: 3D graphics, physics-based modeling, dynamics |
|
|
4. | Theory: Appearance-based vs. structural models. Image grammars & Gestalt principles. |
Lecture notes
Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115–147.
Jin, Y., & Geman, S. (2006). Context and hierarchy in a probabilistic image model. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2, 2145–2152. (pdf)
Zhu, L., Chen, Y., & Yuille, A. (2011). Recursive Compositional Models for Vision: Description and Review of Recent Work. Journal of Mathematical Imaging and Vision, 41(1-2), 122–146. (pdf)
Feldman, J. (2009). Bayes and the simplicity principle in perception. Psychological Review, 116(4), 875–887. (pdf)
|
|
5. | Faces I |
Lecture notes on early work on face modeling: 2D vs. 3D eigenfaces, Perrett,...
Terzopoulos, D., & Waters, K. (1993). Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 15(6), 569–579.
Cunningham, D. W., Kleiner, M., Bülthoff, H. H., & Wallraven, C. (2004). The components of conversational facial expressions. APGV '04: Proceedings of the 1st Symposium on Applied Perception in Graphics and Visualization, 143–150.
|
|
6. | Faces II |
Leopold, D. A., O'Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4(1), 89–94. doi:10.1038/82947
Fang, F., & He, S. (2005). Viewer-Centered Object Representation in the Human Visual System Revealed by Viewpoint Aftereffects. Neuron, 45(5), 793–800. doi:10.1016/j.neuron.2005.01.037
Webster... Tsao...
|
|
7. | Human body pose/actions I |
Lecture notes
Troje, N. F. (2002). Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2(5).
Blake, R., & Shiffrar, M. (2007). Perception of Human Motion. Annual Review of Psychology, 58(1), 47–73. doi:10.1146/annurev.psych.57.102904.190152 |
|
8. | Spring Break |
|
|
Human body pose/actions II |
Toshev, A., & Szegedy, C. (2013). DeepPose: Human pose estimation via deep neural networks. arXiv preprint arXiv:1312.4659.
Chen, X., & Yuille, A. L. (2014). Articulated Pose Estimation with Image-Dependent Preference on Pairwise Relations. NIPS 2014.
Orlov, T., Makin, T. R., & Zohary, E. (2010). Topographic Representation of the Human Body in the Occipitotemporal Cortex. Neuron, 68(3), 586–600. doi:10.1016/j.neuron.2010.09.032
|
||
9. | Characters, words |
LeCun, Hinton on single digits... Visual word forms
|
|
10. | Human hands, gestures |
Grenander and more recent... |
|
11. | Dynamic flows I |
Physics vs. appearance models. Kinematics, dynamics and photometric flow.
Doretto, G., Chiuso, A., Wu, Y. N., & Soatto, S. (2003). Dynamic textures. International Journal of Computer Vision, 51(2), 91–109.
Bergou, M., Audoly, B., Vouga, E., Wardetzky, M., Grinspun, E., et al. (2010). Example-based wrinkle synthesis for clothing animation. ACM Transactions on Graphics (TOG), 29(4), 107. doi:10.1145/1833349.1778844
|
|
12. | Dynamic flows II |
|
|
13. | Object physics |
Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences of the United States of America, 110(45), 18327–18332. doi:10.1073/pnas.1306572110 (pdf) |
|
14. | Scenes and places |
Han, F., & Zhu, S.-C. (2009). Bottom-up/top-down image parsing with attribute grammar. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(1), 59–73. doi:10.1109/TPAMI.2008.55
|
|
15. | Deeper causal models: Modeling and recognizing intentions |
|
|