Bootstrapped Learning of Novel Camouflaged Objects

Mark Brady

September 22, 1998


Introduction

In the biological world, the means by which a species can achieve camouflage are many. One of the more familiar methods, and one of the first to be imitated in camouflaged clothing design, is disruptive camouflage. In disruptive camouflage, the animal is covered with lines and/or colored patches. These lines and patches break the animal's image up into smaller parts, each of which has a chance of being integrated into the background. In order to see the camouflaged animal, an observer must determine whether each patch border or line belongs to an object boundary or to a reflectance boundary, and then determine which boundary fragments go with which other boundary fragments. In addition, when a color patch on an animal matches an adjacent color patch in the background, the union of these patches is a new patch which traverses the object boundary. As a result of all this, the observer sometimes fails to detect an animal with disruptive camouflage.

 

Figure 1: A scene from the peak of Mount Elbert, Colorado. The subject demonstrates disruptive camouflage, cryptic coloration, countershading, and perhaps even behavioral camouflage.

 

In cryptic camouflage, the somewhat random background is statistically modeled by the animal's coloration. Exactly what this statistical model consists of remains an intriguing mystery. Yet when one looks at an animal with cryptic coloration, one sees immediately that it matches the background. Look, for example, at the dark spots on the ptarmigan's wing in figure 1 and notice the similarity with the moss on the rock to its lower right. This similarity is perceived even though no specific shapes are shared between the pattern on the animal and the pattern on the background.

Another form of camouflage is countershading. Countershading defeats an observer's ability to discern shape from shading by covering the lower portion of the animal with lighter coloration. This counteracts the normal distribution of luminance on objects that are illuminated from above, as they generally are by the Sun or Moon. Examples of countershaded animals include the pronghorn antelope, the white-tailed deer, the killer whale, and many others.

In addition to countershading, animals sometimes utilize behavior to eliminate shading clues. For example, by crouching close to the ground or a branch, a creature can hide the shading differential between its upper and lower body. The same behavior can also eliminate the animal's shadow, the shadow being another clue to its shape. In another sort of behavioral camouflage, an animal may position itself near patterns which resemble its own.

Whereas camouflage helps an animal blend with the background, mimicry allows one species to masquerade as another. If the second species is dangerous, such mimicry is called aposematic. Aposematic mimics may use coloration to portray the eyes, teeth, etc. of some predator. Some even go so far as to use brightly colored spots to simulate specular reflections on their false eyes. In this process, the observer thinks it is seeing some surface S(u,v) with specularity map P(u,v) and brightness map B(u,v), when in actuality it is being presented with some other surface S'(u,v), some other brightness map B'(u,v), and a specularity map which might actually be zero everywhere.

In the case of the ptarmigan, camouflage carries out the function of concealment because it arises from evolutionary pressures. Whenever camouflage serves such a purpose, it may be referred to as intentional camouflage, even though the term is somewhat anthropomorphic. However, camouflage is not always intentional. Take, for example, the dog in figure 2. His coloration is the result of changes in melanin production and distribution which coincidentally accompanied changes sought during the domestication process. His coloration is not the result of any evolutionary pressure, and it serves no purpose. Hence, one might call this sort of camouflage accidental camouflage.

Figure 2: A dog whose coat demonstrates accidental disruptive camouflage. Given adequate distance, a proper background, and lack of movement, this dog can easily disappear. Unlike James' famous Dalmatian image, this color image is not binarized or otherwise modified.

 

 

Figure 3: A book on a shelf.

 

One might think of this dog as a special case, and of accidental camouflage as somehow a rare phenomenon. Therefore, let us next consider a most mundane example: a book on a shelf. Figure 3 shows such a scene, and figure 4 shows the same scene partially processed by a machine vision system. Before the book can be recognized by any vision system, the relationships between the various edges and the surfaces to which they are attached must be sorted out. For example, which of the edges shown are reflectance edges, which are generated by shadows, and which are caused by 3D discontinuities? Of those edges which are 3D, which belong to the book and which belong to the background?

Another type of problem is demonstrated by edge A, which is fragmented due to coincidental matching of the book color to the background color. Before the book can be recognized, a vision algorithm will likely need to bind these fragments back together into a single edge feature.

 

Figure 4: An oriented contrast image of the book in figure 3. Colors indicate the direction of edge element orientation, except for blue, which indicates low contrast. Edge A is a 3D book edge which is fragmented, edge B is a 2D reflectance edge on the book, edge C is another 3D edge on the book, edge D is the border of a shadow, and edge E is a 3D background edge.
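
As a concrete illustration, the sketch below shows one minimal way an oriented contrast image like figure 4 might be computed. The central-difference gradient and the 0.05 contrast threshold are my illustrative assumptions, not a description of the system actually used to produce the figure.

```python
# Minimal oriented-contrast sketch.  Assumes a grayscale image
# normalized to [0, 1]; the filter choice and threshold are
# illustrative, not the paper's actual pipeline.
import numpy as np

def oriented_contrast(image, contrast_threshold=0.05):
    """Per-pixel edge orientation in [0, pi); NaN where contrast is low
    (NaN pixels correspond to the blue regions in figure 4)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # An edge element is oriented perpendicular to the luminance gradient.
    orientation = (np.arctan2(gy, gx) + np.pi / 2.0) % np.pi
    orientation[magnitude < contrast_threshold] = np.nan
    return orientation
```

Orientation can then be mapped to hue for display, with the NaN (low-contrast) pixels drawn in blue.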

 

Due to the prevalence of intentional and accidental camouflage, in natural scenes camouflage is the rule rather than the exception. In spite of this, there have been relatively few studies on the effects of camouflage and realistic backgrounds.

The problem of object learning from natural scenes is especially relevant for machine vision engineers. When designing an object recognition system, how should the objects be presented to the system for learning? Should the backgrounds and camouflage be erased manually? Surely this is a laborious and unnatural solution. Alternatively, one might show the system a motion sequence in which the object moves in front of the background. This would be more natural and easier. Or, one might spray-paint the object some unique color and design the system so that it can segment according to color.

Human observers utilize a variety of modalities in order to overcome the many ambiguities encountered while segmenting and recognizing objects in natural scenes. These include form, motion, depth, and color. However, it is also well known that humans can recognize objects in drawings and photographs of natural scenes when such images contain only form information. It has been hypothesized that this is accomplished with the help of a top-down mechanism, whereby a stored model of some object is used to constrain the interpretation of an otherwise ambiguous raw image. In fact, one might expect that if one could find sufficiently novel camouflaged objects, presented against sufficiently complex backgrounds, observers would be unable to segment the objects from the backgrounds. In the course of the present experiment, we shall see that this is the case.

An outstanding difficulty with the top-down hypothesis is this: if top-down mechanisms are needed to disambiguate raw image data, how do models form from raw data in the first place? In other words, if an observer is presented with an image of a novel object against a novel background, how are the object parts bound together, and separated from background elements, so that a model can be formed? One obvious solution would be for the observer to await opportunities where other modalities make segmentation easy, and to develop models during these opportunities. Motion and color information, for example, can make the task of segmentation relatively straightforward. This study investigates the role of motion and color during the learning of novel objects. In particular, the working hypothesis is that high level models form when segmentation clues from other modalities are present, and that when such clues are absent, the formation of high level models fails or is severely limited.


Purpose of the Experiment and Summary of Methods

The purpose of this experiment is to determine the extent to which the formation of high level object models depends on motion and color as segmentation clues.

In the experiment, there are two phases, a training phase and a testing phase. During the training phase, subjects are presented with camouflaged novel objects against backgrounds, and these objects are to be learned. The training phase stimuli may include segmentation clues such as color or motion, or the stimuli may have no segmentation clues other than form. In the test phase, subjects are shown scenes of multiple camouflaged objects. There are no segmentation clues. These test scenes may or may not contain one of the objects which appeared in training. The subjects' task is to determine whether a trained object is in the scene and, if so, which object it is. The percent correct is then measured for each subject and clue type. Accuracy as a function of clue type is the primary measure sought.



Methods

Creation of Novel Objects

Previously, investigators have used a variety of methods to generate novel objects. Rock used smoothly curved wire objects (Rock, DiVita et al. 1981), Farah used clay interpolations of Rock's forms (Farah, Rochlin et al. 1994), Bulthoff used wire and spheroid objects (Bulthoff and Edelman 1992), Tarr used stick figures composed of cubes (Tarr 1995), Humphrey used clay shapes (Humphrey and Khan 1992), Sakai used 2D Fourier descriptors (Sakai and Miyashita 1991), and Miyashita used fractals (Miyashita, Higuchi et al. 1991).

Before designing a means for generating novel objects, one requires a set of criteria to be met. The criteria appropriate to the present experiment are as follows. The objects should be truly novel, in that they do not contain elements of known objects, are not distortions of known objects, and are not molded by a human artist; any of these three characteristics could potentially detract from the novelty of the object. At the same time, the objects should be visually relevant to the observers. In other words, humans have evolved to recognize certain classes of objects but not others. To fulfill these criteria, I have attempted to produce objects which appear like plants or animals, but not like any particular plant or animal. For example, these novel objects might consist of a body with a number of limbs protruding. So that the shapes are as general as possible, without violating the requirement of biological relevance, limb and body cross sections should take on a variety of shapes: flat, circular, concave, etc. The limb terminations should also take a variety of forms, as do the limbs of real plants and animals. The formation of each object should be directed by a random process so that the particular features of the object are not influenced by a human artist.

The method used to produce such objects mimics an embryological process; hence, the objects are called digital embryos. Each digital embryo begins as a regular polyhedron, representing a ball of cells or, in the parlance of developmental biology, a zygote. Cell division is regulated by a hormone gradient, where the hormone is secreted by one or more cells and diffuses along the edges connecting the cells. Hormone-generating cells arise at random and persist for random periods, thus directing the growth of the object. Physical forces of attraction and repulsion are simulated among cells, determining the ultimate position of each cell. Computer graphically, the result is a polyhedron composed of a large number of small polygons. These small polygons merge to form a number of surfaces, which in turn constitute the exterior surfaces of the object. Objects are rendered using Phong shading. Fully grown digital embryos are shown in figures 5a and 5b.
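
For illustration only, here is a greatly simplified sketch of such a growth process. It substitutes a k-nearest-neighbor graph for the actual polyhedral mesh, collapses hormone persistence into an independent per-step chance, and uses invented constants throughout; the real digital embryo software operates on a mesh and renders the result with Phong shading.

```python
# Caricature of digital embryo growth: random hormone sources, hormone
# diffusion along edges, hormone-triggered cell division, and simulated
# attraction/repulsion among cells.  All parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def neighbors(cells, k=4):
    """Indices of each cell's k nearest neighbors (a stand-in for the
    mesh edges along which hormone diffuses)."""
    d = np.linalg.norm(cells[:, None] - cells[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]

def grow_embryo(n_steps=60, k=4):
    # Begin as a regular polyhedron (here, an octahedron): the "zygote".
    cells = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                      [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
    hormone = np.zeros(len(cells))
    for _ in range(n_steps):
        # Hormone-generating cells arise at random...
        hormone[rng.random(len(cells)) < 0.05] += 1.0
        # ...and the hormone diffuses along edges to neighboring cells.
        hormone = 0.5 * hormone + 0.5 * hormone[neighbors(cells, k)].mean(axis=1)
        # Cells with enough hormone divide; each daughter lands nearby.
        for i in np.flatnonzero(hormone > 0.4):
            cells = np.vstack([cells, cells[i] + 0.1 * rng.normal(size=3)])
            hormone = np.append(hormone, 0.0)
            hormone[i] = 0.0
        # Springs with a rest distance (attraction beyond it, repulsion
        # within it) among neighbors determine each cell's position.
        nbrs = neighbors(cells, k)
        delta = cells[nbrs] - cells[:, None]
        dist = np.linalg.norm(delta, axis=-1, keepdims=True) + 1e-9
        cells += 0.1 * (delta * (1.0 - 0.5 / dist)).sum(axis=1)
    return cells   # a point cloud; meshing and Phong rendering would follow

embryo = grow_embryo()
```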


Figure 5a: A digital embryo.

 

Figure 5b: Another digital embryo.

 

Scene Construction

Each scene consists of a collection of background objects and a single foreground object. Each scene contains 13 background objects, selected from a pool of 60 objects and placed, rotated, and camouflaged at random. The foreground object is approximately centered in front of the background objects, is camouflaged, and always has the same orientation. All objects, background or foreground, are digital embryos. Foreground objects may move during training presentation, they may be colored, or they may be static grayscale. Background objects are always static grayscale. Object camouflage consists of texture maps which are wrapped around each object. The texture maps are images of scenes of other digital embryos, selected from a pool and placed at random. The resulting stimuli appear as in figure 6.
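
The following sketch records the scene-construction parameters just described. The dictionary fields and coordinate conventions are hypothetical bookkeeping for the example; the rendering step itself (Phong-shaded embryos with wrapped texture maps) is omitted.

```python
# Minimal scene-construction sketch based on the description above.
import random

N_BACKGROUND = 13        # background objects per scene
POOL = list(range(60))   # pool of 60 digital embryos

def build_scene(foreground_id, texture_pool, rng=random.Random(0)):
    """13 random background objects plus one camouflaged,
    approximately centered foreground object."""
    scene = []
    for obj in rng.sample(POOL, N_BACKGROUND):
        scene.append({
            "object": obj,
            "position": (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)),
            "rotation": rng.uniform(0.0, 360.0),       # random pose
            "camouflage": rng.choice(texture_pool),    # embryo-scene texture
        })
    scene.append({
        "object": foreground_id,
        "position": (0.0, 0.0),        # approximately centered
        "rotation": 0.0,               # always the same orientation
        "camouflage": rng.choice(texture_pool),
        "foreground": True,
    })
    return scene

scene = build_scene(foreground_id=99,
                    texture_pool=["tex%d" % i for i in range(20)])
```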

Figure 6a: A novel camouflaged foreground object with novel camouflaged objects as background. Camouflage patterns consist of scenes of other novel objects. Illumination direction changes at random from scene to scene. There is no occlusion of the foreground object. Due to the lack of a high level object model, untrained observers are unable to reliably segment the foreground objects from the background.

 

Figure 6b

 

Figure 6c

 

Observers

There were five observers, four female and one male, aged 16 to 31. All had 20/20 vision, or vision corrected to 20/20.

 

Training and Testing Design

The five observers were trained and tested on four data sets. Each data set included three novel objects which were to be learned, giving a total of twelve novel objects of interest in the experiment as a whole. The first training data set contained no segmentation clues (NO CLUE 1), the second had motion segmentation clues (MOTION), the third had color segmentation clues (COLOR), and the fourth had no segmentation clues (NO CLUE 2). See figure 7 for an example of color segmentation clues. The set order was varied among subjects in order to control for order effects. Three observers used the order NO CLUE 1, MOTION, COLOR, NO CLUE 2, while the other two observers used the order NO CLUE 2, COLOR, MOTION, NO CLUE 1. For each data set (three objects), observers were trained for two consecutive days and tested on the third consecutive day.

 

Figure 7: The object of interest is camouflaged as usual and colored green as a clue to segmentation. The object boundaries are plainly visible.

 

Training

There was a single training session per training day. Observers were shown each scene for 10 seconds. The first scene had object A in the center foreground, the second scene had object B, and the third had object C. This sequence was repeated (A, B, C, A, B, C, ...) until each object had been presented five times. Thus each object was seen for 50 seconds per day and a total of 100 seconds over two days. Observers viewed the screen from a distance of 1.5 to 2.0 feet. They were not required to perform any task during training other than to view the scenes. A sound effect accompanied each scene to identify the object. Lighting was from a single directional source, simulated to be above the viewer; however, the left-right position of the light varied at random between scenes. Every scene had different background and object camouflage.
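
The daily schedule can be summarized in a short sketch; the field names are hypothetical, but the ordering and timing follow the description above.

```python
# One day's training schedule: objects A, B, C cycled five times at
# 10 s per scene, each scene with a fresh random left-right light
# position.  Field names are invented for illustration.
import random

def training_day(objects=("A", "B", "C"), repeats=5, seconds=10, seed=0):
    rng = random.Random(seed)
    schedule = []
    for _ in range(repeats):
        for obj in objects:                        # A, B, C, A, B, C, ...
            schedule.append({
                "object": obj,
                "duration_s": seconds,
                "light_side": rng.choice(("left", "right")),
                "sound": obj,                      # sound identifies the object
            })
    return schedule

day = training_day()
# 5 presentations x 10 s = 50 s per object per day (100 s over two days).
assert sum(s["duration_s"] for s in day if s["object"] == "A") == 50
```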

The method of training was intended to simulate natural visual learning as much as possible. Objects in natural scenes most often appear against varied backgrounds, under changing celestial lighting, and with changing reflectance patterns. Appearances of objects under natural conditions are often separated by various intervals, from seconds to days, with other stimuli being processed in the interim. In general, animal vision does not rely on language understanding, yet initial identification via some other sensory modality is often possible.

Instructions to the observers informed them that the objects would appear in the center of the scene, that the objects would be camouflaged, that there were three different objects per session, and that a sound would be used to identify each object. No other information was given about the scenes.

 

Testing

Each test session consisted of 30 scenes. Each scene was similar to a training scene, except that there were never any segmentation clues. Backgrounds and camouflage varied, and there were no identification sounds. Half of the scenes contained objects from the training set and half did not. Observers did not know what percentage of the scenes contained trained objects. Each scene was presented until the observer gave his or her response. The task was four-alternative forced choice: "object A", "object B", "object C", or "no trained object". However, observers did not know the objects as A, B, or C, so they referred to them by names of their own choosing.
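
A sketch of one test session's design follows; the names are again hypothetical, but the trial composition and four-alternative scoring follow the description above.

```python
# One test session: 30 scenes, half containing a trained object and
# half not, presented in an order unknown to the observer.
import random

def test_session(n_scenes=30, seed=1):
    rng = random.Random(seed)
    targets = [rng.choice(("A", "B", "C")) for _ in range(n_scenes // 2)]
    targets += ["no trained object"] * (n_scenes // 2)
    rng.shuffle(targets)
    return targets

def proportion_correct(targets, responses):
    """Score a session of four-alternative forced-choice responses."""
    return sum(t == r for t, r in zip(targets, responses)) / len(targets)
```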

Following the recognition-identification task, each observer was shown three scenes, one for each object in the test set. Using the computer's mouse, they were asked to trace the outline of the object in each scene. The purpose of this test was to uncover the post-learning relationship between recognition and segmentation.

 


Results

Contrary to the experimental hypothesis, subjects were able to recognize trained objects without the assistance of segmentation clues. Figure 8 shows the main result. This is quite a surprising result, since there is no obvious way in which segmented examples of the objects could have arrived at the model level. Apparently, observers are able to bootstrap the learning process using unsegmented data for model building. I shall use the term bootstrapped learning to describe this sort of model building.

There was a significant amount of subject variability, yet all subjects performed well above chance, including JA, who did relatively poorly. See figure 9. Perhaps more interesting is the near-perfect performance of MB (not the author) and MN, demonstrating the ultimate potential of bootstrapped learning algorithms. Given more than the 100 seconds of training per object, perhaps JA would also reach these levels.
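
To make "well above chance" concrete: with the 600 trials of figure 8 divided among five observers (120 trials each) and a chance level of 0.25, an exact binomial tail shows how quickly even modest proportions correct become implausible under guessing. The example score below is illustrative only; the actual per-subject scores appear in figure 9.

```python
# Exact binomial tail under guessing (p = 0.25), using the figure 8
# totals: 600 trials over five observers is 120 trials each.
from math import comb

def p_above_chance(n_correct, n_trials=120, p=0.25):
    """P(X >= n_correct) for X ~ Binomial(n_trials, p)."""
    return sum(comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
               for k in range(n_correct, n_trials + 1))

# 45/120 correct (0.375) already gives p on the order of 0.001.
print(p_above_chance(45))
```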

There are three types of errors which subjects could make, and they made all three with some regularity. See figure 10. Imagining a trained object when there was none was the most common. This is perhaps due to the strong influence of top-down models, which imposed some order on the camouflaged jumble of novel objects.

Figure 8: Proportion correct as a function of clue type. Data are averaged over subjects. The total number of trials run was 600. Performance at chance is 0.25.

 

Figure 9: Proportion correct as a function of subject. Data are averaged over clue type.

 

Figure 10: Distribution of error types.

 

Tracing results indicate that an ability to segment the objects did develop along with the ability to recognize them. See figure 11. However, this ability was not complete, since the subjects were typically able to trace only part of the object boundary. This partial knowledge of object contours is apparently sufficient for recognition.

An ability to trace may be based either on object knowledge, as represented by a high level object model, or on an understanding of the surface information in the scene being presented, independent of knowledge gained during training. Figures 11a-c demonstrate that neither source of information is sufficient to completely overcome the effects of camouflage. Figure 11c demonstrates that, even when understanding of the surfaces in the given scene has failed completely, model knowledge serves as a means for producing a reasonable object outline. Thus, object model knowledge plays a dominant role in object tracing ability.


Figure 11a: MN's tracing of object C. Her knowledge of the object's shape appears to be good, except that she is unaware of the object's "ventral fins."


Figure 11b: MB's tracing of object B. MB recognized this object on all but one occasion during the recognition trials. In the tracing, she is unaware of three object limbs. An observer, given the uncamouflaged version at right, might still have trouble finding the object outline at left, although observers in the experiment had no such hint.

 

Figure 11c: AM's tracing of object A. The tracing is essentially correct but is in the wrong position! In the drawing at left, the true object position is immediately below the tracing. Obviously, she is tracing based on model knowledge, not according to information in the given image.


Discussion

Simple inspection of the NO CLUE stimuli tells us that there exist cases where object segmentation is impossible without high level models. Yet, after repeated exposure to different scenes, a model is somehow formed at some object-related level, such as IT (inferotemporal cortex). How can this bootstrapped learning occur? There must be some sort of image data buffer which stores the scenes containing the objects of interest, so that they can be compared with later scenes. This buffer must be capable of resisting masking by other images and must be capable of resisting erasure by intervening tasks for at least 20 seconds.

Figure 12 illustrates the role of the buffer in a visual learning system. There are two modes, a learning mode and a recognition mode. In either mode, image data undergoes early processing. In the learning mode, scenes of partially processed data are collected in a buffer. Two or more scenes from the buffer are then presented simultaneously to a hypothesis engine. The hypothesis engine compares the scenes, looking for common features. Common features are then bound, along with their relationships, into a model and passed to a model and recognition module. Similar mechanisms could also be used to form models of surfaces, edges, etc. However, subjects in the present experiment already have extensive knowledge at these levels, so the object level is the only level where there is significant potential for novelty.

During recognition, the model has already been established. Therefore, high level information is available to guide the interpretation of incoming surface data as possible object features.
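
The scheme of figure 12 can be caricatured in a few lines of code. Features are abstracted here to discrete tokens and the hypothesis engine to a set intersection; how the visual system actually represents features and extracts their common structure is, of course, the open question. The feature names below are invented for the demonstration.

```python
# Caricature of the buffer / hypothesis-engine scheme of figure 12.
class BootstrappedLearner:
    def __init__(self, buffer_size=5):
        self.buffer = []        # image data buffer (partially processed scenes)
        self.buffer_size = buffer_size
        self.models = []        # bound object models

    def learn(self, scene_features):
        """Learning mode: buffer the scene, then hypothesize a model
        from the features common to the buffered scenes."""
        self.buffer.append(set(scene_features))
        self.buffer = self.buffer[-self.buffer_size:]
        if len(self.buffer) >= 2:
            common = set.intersection(*self.buffer)   # hypothesis engine
            if common:
                self.models.append(common)            # bind into a model

    def recognize(self, scene_features):
        """Recognition mode: top-down match of stored models against
        incoming data; returns the best-matching model, if any."""
        feats = set(scene_features)
        return max(self.models, key=lambda m: len(m & feats), default=None)

learner = BootstrappedLearner()
learner.learn({"limb_T", "notch_3", "bg_moss"})   # scene 1 of the object
learner.learn({"limb_T", "notch_3", "bg_rock"})   # scene 2, new background
print(learner.recognize({"limb_T", "notch_3", "bg_tree"}))  # object features
```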

Figure 12: Model explaining the phenomenon of bootstrapped learning. Gray arrows indicate pathways active during learning. Thin black arrows indicate pathways active during recognition. The thick black arrow indicates a shared pathway. "Object Models & Recognition" is shown as one module, yet recognition may occur at a single level of processing while the model actually exists between levels of processing (via binding connections). The same holds true for "Surface Models & Recognition." The diagram therefore indicates functionally defined modules rather than exact anatomical regions.

 

If such buffers and hypothesis engines do exist, where in the brain might they reside? Perirhinal cortex has been implicated as a region concerned with short-term visual memory (Meunier, Bachevalier et al. 1993; Eacott, Gaffan et al. 1994). Eacott found that perirhinal ablations interfered with cueing tasks only when the cue was unfamiliar, which is precisely the case in this experiment. In learning a new object, subjects begin with novel stimuli and subsequently attempt to find similar features in other images.

A candidate region for the hypothesis engine is V4, or some human homologue of it. In his experiment, Haenny modulated the responses of V4 neurons using cues (Haenny, Maunsell et al. 1988). Such would be the characteristics of neurons within any comparison engine, like the one in the bootstrapped learning model.

There are a number of secondary conclusions which can be drawn from this experiment. For instance, the experiment tells us something about figure completion, residuals, and their role in object recognition. In this experiment, subjects developed a high level of performance while not even knowing what the complete figure looked like. Obviously, completion is neither possible nor necessary. This is not to say that completion cannot play some role in helping to decipher the image and deduce its proper segmentation. However, given the difficulty designed into these images and the high level of performance which is still possible, one must conclude that the contribution of figure completion is minor.

Regarding the processing of unexplained image residuals, one can see that object recognition proceeds quite well even when the residual is large. After some training, the entire background remains largely uninterpretable, while the object of interest is successfully segmented and recognized.

Then there is the matter of recognition by form and its relation to color, disparity, and motion information. The brain is usually characterized as a highly integrated device, which it is. Therefore, one must ask whether it even makes sense to investigate one modality in isolation from other supporting modalities. However, if isolated form processing can proceed from a learning stage through to a recognition stage without any assistance from other modalities, then one can conclude that it may indeed make sense to study recognition by form in isolation, depending on the scientific question, of course.

Finally, regarding machine vision, one can conclude that an artificial object recognition system can be designed which does not depend on manual isolation of the object of interest, motion, or any other segmentation clues. It may be beyond the current state of the art, but some day we should be able to present a vision system with multiple natural scenes of some object, and the system will determine for itself what is and what is not part of the object.

 


 


Bibliography


Bulthoff, H. H. and S. Edelman (1992). "Psychophysical support for a two-dimensional view interpolation theory of object recognition." Proceedings of the National Academy of Sciences 89: 60-64.

Eacott, M. J., D. Gaffan, et al. (1994). "Preserved recognition memory for small sets, and impaired stimulus identification for large sets, following rhinal cortex ablations in monkeys." European Journal of Neuroscience 6: 1466-1478.

Farah, M. J., R. Rochlin, et al. (1994). "Orientation invariance and geometric primitives in shape recognition." Cognitive Science 18: 325-344.

Haenny, P. E., J. H. R. Maunsell, et al. (1988). "State dependent activity in monkey visual cortex." Experimental Brain Research 69: 245-259.

Humphrey, G. K. and S. C. Khan (1992). "Recognizing novel views of three-dimensional objects." Canadian Journal of Psychology 46: 170-190.

Meunier, M., J. Bachevalier, et al. (1993). "Effects on visual recognition of combined and separate ablations of the entorhinal and perirhinal cortex in rhesus monkeys." The Journal of Neuroscience 13(12): 5418-5432.

Miyashita, Y., S. Higuchi, et al. (1991). "Generation of fractal patterns for probing the visual memory." Neuroscience Research 12: 307-311.

Rock, I., J. DiVita, et al. (1981). "The effect on form perception of change of orientation in the third dimension." Journal of Experimental Psychology: Human Perception and Performance 7: 719-732.

Sakai, K. and Y. Miyashita (1991). "Neural organization for the long-term memory of paired associates." Nature 354: 152-155.

Tarr, M. J. (1995). "Rotating objects to recognize them: a case study of the role of mental transformations in the recognition of three-dimensional objects." Psychonomic Bulletin & Review 2: 55-82.