Mode of Vision, and Process of Seeing
Gift of Seeing
Optics of the Eye
Sensing and Vision: From the Retina to the Perception of Vision
Gaze Movements
Video Codecs: Technology Decisions Behind Transmitting Video
Visual Impairments, Disability, and Access/Accommodation Strategies
Gift of Seeing
Optics of Light, Anatomical Apparatus and Optics-Related Neuromuscular Control Systems
As you know, light rays are electromagnetic waves. Light rays refract (bend) when passing between media of different densities, such as at a curved surface (e.g., the cornea) or through a biconvex shape (the lens). Light rays that are close to parallel when arriving at such structures can be, if the shapes are near-perfect, refracted to a point. This is called the principal focus, and the distance between the lens and the principal focus is called the principal focal distance. The typical unit of measure of refractive power, termed the diopter, is the reciprocal of this focal distance in meters. A person with good vision and a typical-size eyeball will have a refractive power of about 67 diopters at rest, which corresponds to a focal distance of about 15 mm. Rays from closer objects are diverging upon entry, and will come to a focus at a longer distance behind the lens. This would seem a problem. But it's been solved: we have access to an effective accommodation control system (see below) that helps us maintain acuity by using muscles to cause subtle changes in the shape of the lens, resulting in up to 12 diopters of change in young individuals. Apparently nature has been fine-tuning eyeball optics for a while, given that the biological design seems optimal in the sense of having the principal focus, at any moment, be on or near the retina. Interestingly, this is a different strategy from that of a camera, which focuses by changing the distance between the lens and the film; biology has also discovered this strategy, as some fish actually change the shape of their eyes rather than the curvature of their lens. But our human biological system is not always perfect, and often corrective lenses (eyeglasses or contact lenses) are needed:
There are two important neuromuscular control systems related to optics:
Thus the above discussion shows tight ties between biological control systems and optics for both acuity and light intensity. Another key quality of light - color - is dealt with via photoreceptor cell sensitivity, as will be seen in the next section.
Sensing/Vision: From the Retina to the Perception of Vision
In the previous web page we studied eye optics, including several neuromuscular control systems that help focus light rays and regulate light intensity. These systems subserve the retina, which is where we will start as we investigate the sensory mechanisms responsible for visual perception. The material on this web page could easily be the subject of a 3-credit course (e.g., one taken by this instructor while a graduate student at UC Berkeley). Our aim is to briefly summarize the process of vision, from a "systems" perspective. You are encouraged to augment this summary with illustrations of the eye anatomy, found in any good physiology book. If you prefer, good web summaries of the basic anatomy and function include those at The Vision Channel and at www.tedmontgomery.com.
Processing at the Retina
We start by assuming that light arrives at the retina, a sheet of cells on the posterior part of the eyeball that extends nearly to the ciliary body. This sheet is organized into 10 layers, and includes sensory receptors (rods and cones) and four types of interneurons: bipolar, ganglion, horizontal and amacrine. The photosensitive rods and cones synapse with the bipolar cells, which in turn synapse with ganglion cells, which send out long axons that leave the eye via the optic nerve. Horizontal cells connect receptor cells to other receptor cells, while amacrine cells connect between ganglion cells. The details of retinal connectivity are well worked out, but beyond the scope of this class. But there are several items of special relevance for this class:
This is a remarkably well-designed visual sensing and tracking system. For example, consider the ability of the rods and subsequent circuitry to locate objects in the periphery:
While processing for such exquisite capabilities starts in the retinal neurocircuitry, this is just the first stage of a systematic process.
Beyond the Retina
An optic nerve fiber that leaves the retina will either cross over the midline (if nasal) and target the contralateral lateral geniculate body, or will connect with the ipsilateral lateral geniculate body (if lateral). This neural structure, part of the thalamus, serves mostly as an important relay "way station" that coordinates an integrated retinotopic spatial mapping of the two eyes. From here information passes on to other brain structures, most notably the occipital (visual) cortex. The visual cortex possesses the 6-layer columnar structure that is common for cortical tissue. It is here that each fiber, and various collections of fibers, are processed in many ways. Receptive fields in the visual cortex can become remarkably selective, for instance to certain shapes traveling in a certain direction. But this is just the beginning of the story, as there is a degree of image processing via neurocircuitry that remains, despite decades of study, mind-boggling to scientists. In particular, the robustness of pattern recognition is impressive. Consider, for example, that you can often recognize people and objects from many distances and orientations, including a friend after a haircut or a change in clothing. Also, the system is actively engaged in recognizing and classifying objects.
Summary
For the purposes of this class, a key observation is that the visual system enables a remarkable capacity to actively adapt to new settings and recognize persons and objects, if the frequency and intensity of the incoming light are within the sensitive ranges. But this ability is a function of many factors, ranging from optics to the effectiveness of physiological control systems. There are ways of designing environments that are more accommodating for anybody, such as providing adequate lighting.
There are also sometimes accommodations that can be made for specific persons with sensory dysfunction, but these start with an understanding of the underlying sources of the visual dysfunction. "Seeing" is an active process, and before considering these sources, we will first develop an understanding of the eye movement and gaze control systems that are intricately intertwined with the sensory apparatus.
Gaze
Mechanics of Eyeball and Extraocular Eye Muscles
The eyeball can be thought of as a suspended sphere that is held in place by viscoelastic tissue that is grounded in a skeletal socket. This arrangement makes it relatively easy for a strategically placed muscle to rotate the eyeball. Since muscles only pull, and since there is a desire to rotate the eyes both medio-laterally (left-right) and superior-inferiorly (up-down), one might expect that there would be two pairs of antagonistic muscles that tug on either side of the eyeball, one pair in the medio-lateral direction and the other in the superior-inferior direction. Indeed, this is the case, giving us the following four muscles:
As suggested by their actions, these muscles insert on the eyeball almost exactly 90 deg from each other. Their origin sites are close to each other, with insertions making tangential connections on the eyeball. Thus they are nearly in parallel with each other, with each having a maximal moment arm relative to the axis of rotation (which is about the center of the sphere). [There are also two other, angled extraocular muscles that are nearly in an orthogonal plane, called the superior oblique (moves eye inward and downward) and inferior oblique (moves eye outward and upward), that play more minor axial stabilizing roles.] The inertia of the eyeball is very small, and there is normally no external load on the eye other than perhaps a small contact lens. This has two implications: not much muscle force is needed to rotate the eyeball, and the speed of rotation is rate-limited by the mechanics of muscles. Thus it is not surprising that the extraocular muscles are very slender, since the required force is small, and somewhat long with fast muscle fibers, since fast rotations are often desired. Indeed, these muscles have the highest proportion of fast muscle fibers of any in the human. The result is a very fast, predictable musculoskeletal system. As with all skeletal muscles, these have some key mechanical properties that as "systems engineers" we capture as a "tension-length" relation, a force-velocity property, a series elastic property, and a parallel elastic property. JW side note: as a graduate student, I published papers modeling this neuromuscular system, using six nonlinear differential equations in each plane: 2 for each muscle and 2 for the eyeball. The bottom line on these properties is that the parallel elastic and tension-length properties are tuned for an operating range of about ±60 deg, and that the force-velocity properties enable eyeball speeds of over 10 rad/sec (about 570 deg/sec).
That's fast, and can occur by maximally exciting one muscle while relaxing its antagonist. Remember from the previous section that we see only about 2 deg of arc with high clarity. Thus there is a need to rotate the eyeball to fixate on targets of interest. When the eyes are fixated straight ahead, the motoneuronal drive to the antagonistic muscles is about 10% of maximum. To hold fixation at an angle of, say, 10 deg laterally, the drive to the lateral muscle must be a few percent greater, and that of the medial muscle less. But we don't move from location to location with such step-like shifts in activation; if we did, our visual world would spin while we moved! Rather, nature has come up with a wonderful collection of stereotyped eye movements, each controlled by different parts of the brain that converge on the oculomotor nuclei in the brainstem. Thus we have the following four classic types of eye movements:
Integrated Use of These Movements
These four classic eye movements are fairly easy to recognize during inspection of experimental angle versus time data. Indeed, saccades are identified by high-speed "jumps" between regions of no movement or smooth movement. If head movement is also measured, one can also easily distinguish between VOR and smooth pursuit movements within the trace. With appropriate mathematical mapping, one can also overplot eye movements onto the spatial images that the individual was looking at, such as a piece of art or a page that is to be read. These are commonly called scan paths, and often are displayed as lines connecting dots. This tells us something about the sampling/processing part of the brain, and where the person chooses to focus their attention. For instance, an individual looking at a picture of a face will tend to focus their gaze primarily on key facial features such as the eyes and mouth, while occasionally jumping to seemingly random locations for a greater sampling of the image. In contrast, the gaze of many persons with aphasia displays what appear to be suboptimal strategies, focusing on regions of contrast that are of less functional significance, such as ears or clothes. This is one of many examples where eye movements provide a "window to the brain"; another example is that persons with schizophrenia often display double-saccades.
Video/Codecs: Technological Video
Both the eye and the camera have i) a variable "aperture" for controlling the intensity of light, ii) a lens that includes mechanisms for focusing an image, and iii) photosensitive elements that can encode both intensity and color. In both, the density of the photosensitive elements, called "pixels" in a digital image from a camera, is a measure of resolution. In both, higher-level spatial and temporal filters are used to help remap the image to extract certain features, and intelligent algorithms are often used to recognize patterns.
Furthermore, both have gone through an evolutionary process that yields multiple solutions: while different animals have different eye properties, camera resolutions and storage protocols also tend to be based on an evolutionary process, one that can be documented through the evolution of consensus standards that reflect a mixture of performance capabilities and a quasi-random economic process, similar to natural selection, that helps determine "winners" and "losers" by their success in the field. There are, of course, also many subtle differences. For instance, in the still or video camera, the resolution is uniform across the field of view, unlike the strategy of a dense region of foveal cones and peripheral rods found in the eye, which integrates with the eye movements that we studied in the previous web page. This is an important difference, and ironically the images and video are seen through a pair of eyes that make saccadic eye movements to determine the clarity of the image.
Technical Building Blocks for Digital Image Representation
The building block for digital images is the pixel (picture element). An image is a grid of pixels, normally described by the horizontal by vertical number, for instance 640 x 480. Each pixel has a state that relates to brightness, color, etc. There are several common schemes, all having to do with the number of bits of information being coded to describe the state of the pixel. This can range from 1 bit (e.g., black or white) to very high numbers of bits, such as 24 or more. Very common is 8 bits, which gives 256 shades. For instance, using 8 bits (1 byte) gives nice "black-and-white" resolution, and good representation of "intensity" of light through shading. Pixels needn't be square in shape, but usually are.
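As a rough illustration of the pixel arithmetic above, the following sketch computes the number of distinct states per pixel for a given bit depth and the raw (uncompressed) size of one image. The function names are hypothetical, chosen only for this example.

```python
def shades(bits_per_pixel):
    """Number of distinct states a pixel can take at the given bit depth."""
    return 2 ** bits_per_pixel


def raw_image_bytes(width, height, bits_per_pixel):
    """Uncompressed size of one image in bytes, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Bit depths mentioned in the text: 1 (black or white), 8, and 24.
    for bits in (1, 8, 24):
        print(bits, "bits/pixel:", shades(bits), "states,",
              raw_image_bytes(640, 480, bits), "bytes for a 640 x 480 image")
```

For the 8-bit case this gives the 256 shades quoted above, and a single 640 x 480 image already takes 307,200 bytes before any compression.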
For image color representation, a common approach is to use 8 bits (1 byte) for each of three "RGB" (red-green-blue) colors, where each color has 256 shades. The three colors are combined to give a truly rich variety of colors, with the number of colors depending on the standard. For instance, there is RGB8 with 256 total colors, RGBH with 32768 colors (15 bits), and RGBT with over 16 million colors (24 bits). Many packages in Windows allow you to set colors, and if you try this out, you'll see that for shades of grey each of R, G and B are the same, e.g. (255,255,255) for pure white, and (0,0,0) for pure black. Pure red is (255,0,0), pure yellow combines red and green (255,255,0), and somewhat dark purple combines some red and blue (64,0,64). These are among the 48 "basic colors" that you can select from, and that you can depend on any monitor or program to reliably reproduce. But you can set each of these three to any value between 0 and 255, and furthermore, Windows also gives you "Hue," "Sat" and "Lum" settings to help with interactive RGB setting. For instance, "Lum" is tightly tied to the degree of white, since most people will naturally associate white with brightness or intensity. Of note is that designers often use only 48 or 256 colors simply because they want to assure that what they see is what their customer will see as well. There is another standard called the natural color (YUV) format that tries to separate brightness information from color information. The Y values are for brightness (luminance) and, in the common 8-bit digital form, range from 16 to 235, and the U and V values are for color (chrominance) and range from 16 to 240. As with RGB, there are several variants on the format. There is also a mathematical mapping between the two standards:
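The mapping itself is not reproduced in this copy of the page. As a stand-in, here is a minimal sketch of one widely used form: the ITU-R BT.601 luminance weights combined with the analog U and V scale factors. Treat the choice of this particular variant as an assumption, since other variants (for instance, digital forms with offsets) use different constants. The function names are hypothetical.

```python
# Assumed constants: BT.601 luminance weights and analog U/V scale factors.
KR, KG, KB = 0.299, 0.587, 0.114
U_SCALE, V_SCALE = 0.492, 0.877


def rgb_to_yuv(r, g, b):
    """Map RGB components (any common scale, e.g. 0-255) to (Y, U, V)."""
    y = KR * r + KG * g + KB * b   # brightness (luminance)
    u = U_SCALE * (b - y)          # blue-difference color signal
    v = V_SCALE * (r - y)          # red-difference color signal
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv by solving the same three equations for R, G, B."""
    r = y + v / V_SCALE
    b = y + u / U_SCALE
    g = (y - KR * r - KB * b) / KG
    return r, g, b


if __name__ == "__main__":
    # Shades of grey carry no color information: U and V come out near zero.
    print(rgb_to_yuv(128, 128, 128))
    # Pure red keeps its value after a round trip, up to rounding error.
    print(yuv_to_rgb(*rgb_to_yuv(255, 0, 0)))
```

Note that this analog form allows negative U and V values; it does not produce the offset 8-bit ranges quoted above.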
Grid sizes also tend to follow standards. Since pixels are usually square, the aspect ratio typically represents both the ratio of horizontal to vertical pixels and the shape of the whole field of view. Common for monitors is 4:3, such as 320x240 or 640x480 or 1280x960. But there is complexity, due to human-inspired technical evolution. Let's start with TV. The U.S. and Japan use the NTSC standard, which started at 352x240, with a 1.47 aspect ratio and a sampling rate of 30 fps. Most of Europe uses the PAL standard, at 352x288, with a 1.22 aspect ratio, a lower sampling rate (25 fps), and YUV. In both standards the picture quality is pretty good. Some people from the U.S. feel that European TV seems a bit choppy, but clearly images look smooth for most people at 25 fps, and even 15 fps is pretty good. Videoconferencing systems use the CIF standard, which is 352x288 (like PAL) but 30 fps (like NTSC). A good choice. For lower-bandwidth videoconferencing such as H.324-compliant videophone systems, QCIF (quarter CIF, 176x144) is common, and typically peak sampling is at 15 fps. You'll see the difference in the lab. What about DVDs and high-definition TV? DVDs roughly double these dimensions, with the NTSC format being 720x480 or 704x480, and the PAL format being 720x576 or 704x576. What if one wants to shift between formats? There are several options. One is to cut the sides and include the "common" pixels. Another is to warp the pixels, typically stretching the vertical dimension. Still another is to mathematically re-calculate the pixels, possibly causing a decrease in resolution. You've probably seen all of these. By the way, cameras of higher resolution, such as one of our Sony cameras, are often used to collect data at lower resolution. Mapping then needs to be done using an algorithm, and this helps explain why hi-resolution cameras sometimes give just average quality for a given application.
If you are planning to collect at, say, common web-cam grids of 640x480 or 320x240, sometimes a cheaper camera that is tuned to this protocol will actually provide a better image.
Digital Filters and CODECs
There are many types of digital filters for images. Such filters may extract and emphasize features, such as seen for some neurons in the visual cortex or in medical imaging products, or may provide another representation of the data. One obvious observation from the above is that digital images and video can lead to large files for storage, and a lot of information to transfer. Consider that if we multiply 352x288 pixels (CIF) by 24 bits/pixel (RGB) and then by 30 fps, without compression we'd need to send 73 Mbits/sec of information. Within 15 sec we'd have sent, and perhaps stored, over 1 Gbit of data. That's a lot. In reality it is not really necessary to store every pixel. Often the goal is to reduce the number of bits necessary to capture the essence of the image. This is called compression.
Spatial CODECs. As an example, an image often has regions with little change in color (e.g., a wall). One can smooth over the region, perhaps via a mathematical transformation. The end result is a smaller file. If one tried doing the reverse operation, the resulting "decompressed" file would not be quite the original, but might be awfully close, and the difference perhaps imperceptible to the eye. The algorithm that is involved in this process of compression-decompression is called a CODEC. There are many codecs; two common ones are GIF and JPEG. GIF is an example of a loss-less algorithm in that the original image quality can be recovered, while JPEG is an example of a "lossy" algorithm that assumes that some details matter less than others.
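The uncompressed-video arithmetic above can be checked with a short sketch. The function name is hypothetical; the numbers are the CIF values from the text.

```python
def raw_video_bits_per_sec(width, height, bits_per_pixel, fps):
    """Bit rate of uncompressed video: pixels x bits per pixel x frames per second."""
    return width * height * bits_per_pixel * fps


if __name__ == "__main__":
    rate = raw_video_bits_per_sec(352, 288, 24, 30)   # CIF, 24-bit RGB, 30 fps
    print(rate)        # 72990720 bits/sec, i.e. about 73 Mbits/sec
    print(rate * 15)   # 1094860800 bits in 15 sec, i.e. just over 1 Gbit
```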
An example of a "loss-less" algorithm is Compuserve's GIF (Graphics Interchange Format), covered by a patent from Unisys, that is based on the LZW algorithm of the 1970's and 80's, which added the ability to use variable-length codes for compression translations and has roots in classic information theory. This algorithm, implemented in the 1980's with the web in mind and broadly supported by Internet browsers and development environments such as Microsoft's .Net, has constraints such as a maximum color palette of 256, and flexibility in areas such as selecting quality vs file size, implementing transparency, single-file animation and looping, and interlacing (loading a coarse image, then progressively improving the display with more detail). An alternative without the proprietary concerns is PNG (Portable Network Graphic), which has greater functionality in terms of image size and colors but doesn't implement animation and looping. An example of an effective "lossy" algorithm is JPEG (from the Joint Photographic Experts Group) - files with a .jpg extension are associated with this CODEC. JPEG files essentially apply a mathematical transformational model to 8x8 pixel blocks of the image, using the Discrete Cosine Transform (DCT) and a quantization scheme that gets rid of higher frequency content. The degree of compression ranges from about 2 to 17 times, depending on settings for the algorithm and the type of image, and of course the more compressed, the more risk of a worse representation of the original image (i.e., "lossy" compression). Each CODEC has advantages and disadvantages. For instance, JPEG is great at keeping representations of subtle shades in color, but not so good with sharp boundaries in images. JPEG falls under the collection of approaches that are part of the international SPIFF (Still Picture Interchange File Format, .spf files) standard (ISO/IEC 10918-3). Other newer spatial coding schemes use wavelets and the Discrete Wavelet Transform.
The latter are robust, but like JPEG blur edges. Many other algorithms, including fractals, are being tried. Often these are offshoots of the work of EE faculty and students, who love this challenge. So undoubtedly improvements (mostly incremental) will continue to be on the horizon.
Spatiotemporal CODECs. The popular AVI file format usually uses motion JPEG, or MJPEG. This is rather conservative, in that often only a small part of a video image may actually change between two frames. It's crazy to keep storing the same information on frame after frame. Thus combining spatial filtering with temporal filtering makes sense. Addressing this need was critical to the emerging videoconferencing field, and in 1990 the H.261 standard was approved. It targeted video operating on multiples of 64 Kbits/sec transmission (i.e., the maximum capacity of a dedicated phone line), and was based on the Discrete Cosine Transform for spatial compression and a block-based motion algorithm for temporal compression; this formed the basis for all subsequent video algorithms. Two bodies have been involved in forming video compression standards: the ITU-T (for the H.26x standards) and the ISO/IEC (for the MPEG (Moving Picture Experts Group) standards). While they overlap, the ITU-T is a bit more targeted on video for teleconferencing, the ISO/IEC group towards video for multimedia (both transmission and storage). The ITU standards also tend to target the CIF standard (352x288), and more aggressive compression. For instance, the ITU realized the need for a standard that worked well below 64 Kbits/sec, and hence came H.263 (1995) and H.263+ (1998), which have much more flexibility (e.g., the algorithms adjust modes with different connection speeds) and have replaced H.261 for most videoconferencing products.
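The observation that only a small part of the picture may change between two frames can be illustrated with a minimal sketch that compares two greyscale frames block by block and reports which blocks would need to be re-sent. This is a simplified illustration (an approach sometimes called conditional replenishment), not the H.261 motion algorithm itself; the function name, the 8-pixel block size, and the threshold parameter are assumptions made for the example.

```python
def changed_blocks(prev, curr, block=8, threshold=0):
    """Return (block_row, block_col) indices of blocks that differ between frames.

    prev and curr are equally sized lists of rows of grey levels (0-255).
    A block counts as changed when the summed absolute pixel difference
    inside it exceeds threshold.
    """
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            diff = 0
            for y in range(top, min(top + block, height)):
                for x in range(left, min(left + block, width)):
                    diff += abs(curr[y][x] - prev[y][x])
            if diff > threshold:
                changed.append((top // block, left // block))
    return changed


if __name__ == "__main__":
    # Two 16 x 16 frames that are identical except for one pixel.
    frame_a = [[100] * 16 for _ in range(16)]
    frame_b = [row[:] for row in frame_a]
    frame_b[10][3] = 200   # falls in block row 1, block column 0
    print(changed_blocks(frame_a, frame_b))   # [(1, 0)]
```

In this toy case only one of the four blocks would have to be transmitted for the second frame; raising the threshold trades fidelity for fewer re-sent blocks.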
The mathematical foundation for the compression algorithm is a hybrid space-time filter: it does some preliminary work (e.g., maps to YUV if initially in RGB, and subsamples U and V), then uses the DCT in 8x8 image blocks (as does JPEG), but then it checks correlations between subsequent frames and implements an algorithm across time for motion-compensated prediction between frames. When any of the DCT-based algorithms have trouble, for instance at low bandwidths, the familiar "checkerboard" effect of 8x8 blocks forms around movement transitions. You'll see that this happens often during human movement when using our lower-bandwidth H.324 systems. The rule of thumb is that for near-TV quality videoconferencing, the bandwidth needs to be at least 384 Kbits/sec (i.e., three ISDN lines), at which point checkerboard and blurring effects become rare. This is what we usually use, though we can go higher in that we have four ISDN lines in both of our videoconferencing rooms and thus can go as high as 512 Kbits/sec. Our Polycom systems also support IP conferencing, which we use mainly with a group from UC Berkeley; for IP the calls are free but there is not a dedicated line with guaranteed quality of service, and while our meetings are normally good-quality, on several occasions the transmission of both video and audio has been choppy.
The above lack the multimedia flexibility of the MPEG standards, which are now widely used for streaming video, DVDs, etc. For a summary of the well-known collection of MPEG standards, see UC Berkeley's Multimedia Lab. These also add smoothing through time to smoothing through space, and in our group we routinely see compression ratios of well over 30x. MPEG-1, approved as a standard in 1992, was the original standard for storage and retrieval of moving pictures, for rates up to 1.5 Mbits/sec, i.e. compression of about 50 times (but still a high rate compared to H.261).
MPEG-2 added considerably in the area of scalability, motivated by digital TV. MPEG-4 is an impressive extension that includes a new video codec that is also known as the ITU's H.264. It adds some universal accessibility features and robustness (for a huge range of bandwidths), high multimedia interactive functionality, and compression efficiency. MPEG-2 is also the ITU's H.262, and MPEG-4 is also the ITU's H.264. This latter codec has taken on great significance, and most of the key companies have implemented it, or are about to implement it, in all or most of their product line. The recent H.264/MPEG-4 standard, the result of combined efforts by these two key standards bodies, is the newest and is currently being rapidly implemented. It roughly doubles the quality for a given bandwidth, which obviously represents a significant improvement. All of the key videoconferencing companies (e.g., Polycom, Tandberg, VCON) now have H.264 embedded in some of their suite of products, with more to come. In addition to some videoconferencing systems, we now use a form of this standard (the DivX implementation of MPEG-4) for compressing our digital video for our Mobile Usability Lab (MU-Lab) system, which you will be using for the lab associated with this module. Microsoft is systematically supporting more and more video codecs, for instance as alternatives for encoding video within AVI files. The Windows Media Video 8/9 encoder, freely available, offers real-time encoding, accepts formats such as .avi and .mpg, and offers a variety of scalable capabilities, with default profiles such as DSL/Cable delivery at 250-500 Kbits/sec with 320x240 pixels at 30 fps, and 56 Kbits/sec at 160x120 pixels at 15 fps. You'll see many of these in action in the Telerehab and Human Performance Lab, for Module 2 (Sensorimotor) and especially Module 3 (Telerehab, see especially the section on videoconferencing standards).
Visual Impairments, Disability and Access/Accommodation Strategies
Now that we've developed a background on the visual system and the gift of sight, let's go through what can go wrong with the system (impairments), and possible accommodations. A good source for eye disease/dysfunction/disability is at McMaster. Of note is that the incidence of visual impairment in individuals with severe physical disabilities is higher than is often recognized, with these difficulties often not treated (e.g., most children with cerebral palsy have visual impairment). These impairments can affect visual acuity (due to motor and/or sensory sources), visual field (e.g., sampling surround, center), visual tracking and scanning (e.g., faulty saccades, smooth pursuit), and visual accommodation (poor focusing). Here are a few highlights that you should know:
1. Dysfunction within the eye:
2. Neural:
For a more general (but common) term, poor visual development (with poor visual acuity) is called amblyopia. The source is usually within the eye itself, although "wandering" eyes are also often classified here.
Access/Accommodation Strategies
Strategies for accommodating visual impairment tend to fall into two categories: those that augment existing capabilities (e.g., assuming partial sight) and those that use an alternative form of communication that is intended to replace missing function (e.g., assuming the person is blind). In fact there can be a continuum between mild impairment and blindness, and many people are legally blind but have partial sight. Such a person may use both a technology that augments residual sensory abilities (e.g., a magnifier) and a technology that is intended for the blind (e.g., a screen reader). As another example, many persons with partial sight have a seeing eye dog. Here we classify technologies into those that are augmentative (beyond glasses or contact lenses) and those that are replacements for lost vision. As we have seen, some people have poor foveal vision but adequate surround, others have significant cloudiness in all or certain parts of their visual field, others cannot see certain colors, others have variable vision that changes with disease changes or fatigue, others are hypersensitive to light, others see double, and still others cannot make normal voluntary eye movements. Thus the appropriate augmentative technology depends on the person's abilities and their needs in life. Examples of augmentative technologies include hand-held magnifiers, options for changing text size and/or colors on a computer monitor, environmental changes in room/device colors or lighting so that recognition is more likely, and tactile cues for button controls on devices, transitions on walkways and other physical objects.
Technologies that are intended as replacements for lost vision include Braille, canes, screen readers for text and other visual content (typically with audio used to communicate content), and specialized audio signals (e.g., orientation cues, communication cues). In many public areas there are now audio cues that communicate the color of a traffic light or the location of a bus stop. Non-technological approaches include human companions who make a special effort to describe visual events, and seeing eye dog companions. Of note is that some persons who are blind have compensated by developing remarkable capacities with other senses, such as hearing and localizing sound, or touching. Thus another strategy is to take advantage of these other sensory abilities through alternative modes for providing information, working to make the alternative mode as equitable as possible.