The title equation, in which the unknown vector is supposedly sparse, arises in two different settings, with quite different meanings yet similar solutions:
The first is in signal processing, where the left-hand side is the result of measurements obtained from an object of interest using a linear operator, the measurement matrix. In this setting one is looking to recover the original signal from a small number of measurements. To achieve this goal we need to constrain both the signal and the measurement matrix. The signal is usually considered to be sparse, either itself or in some known domain. The measurement matrix is usually selected so that it is incoherent with the sparsifying basis of the signal. The solution then is to find the signal estimate that minimizes the cardinality of its support, subject to the constraint that the measurement matrix maps it exactly onto the observed measurements.
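To make the recovery problem concrete, here is a minimal sketch of the support-minimization search in plain Python. It simply tries supports of increasing size and returns the first one that reproduces the measurements, so it is only usable for toy problems (the search is combinatorial). The names `A`, `y`, `sparsest_solution` and `solve_least_squares` are placeholders chosen for this sketch, and the small matrix is made up for illustration.

```python
from itertools import combinations


def solve_least_squares(cols, y):
    """Least-squares fit of y by the given columns, via the normal equations.

    Returns None when the columns are linearly dependent. Normal equations
    are fine for a toy example but are numerically poor for real problems.
    """
    k = len(cols)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    M = [row + [r] for row, r in zip(gram, rhs)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for t in range(c, k + 1):
                M[r][t] -= f * M[c][t]
    # Back substitution.
    z = [0.0] * k
    for c in reversed(range(k)):
        s = M[c][k] - sum(M[c][t] * z[t] for t in range(c + 1, k))
        z[c] = s / M[c][c]
    return z


def sparsest_solution(A, y, tol=1e-9):
    """Smallest-support x with A x = y (up to tol), by exhaustive search."""
    m, n = len(A), len(A[0])
    columns = [[A[i][j] for i in range(m)] for j in range(n)]
    for k in range(n + 1):                      # smallest supports first
        for support in combinations(range(n), k):
            cols = [columns[j] for j in support]
            z = solve_least_squares(cols, y)
            if z is None:                       # dependent columns: a smaller
                continue                        # support was already tried
            residual = [y[i] - sum(z[t] * cols[t][i] for t in range(k))
                        for i in range(m)]
            if max(abs(r) for r in residual) <= tol:
                x = [0.0] * n
                for j, v in zip(support, z):
                    x[j] = v
                return x
    return None


# Toy example: 3 measurements of a 5-dimensional, 1-sparse signal.
A = [[1.0, 0.0, 0.0, 1.0, 1.0],
     [0.0, 1.0, 0.0, 1.0, -1.0],
     [0.0, 0.0, 1.0, 1.0, 0.0]]
y = [3.0, 3.0, 3.0]
x_hat = sparsest_solution(A, y)
print(x_hat)  # [0.0, 0.0, 0.0, 3.0, 0.0]
```

In practice nobody solves the problem this way. Greedy pursuits or convex relaxations are used instead, but the brute-force version shows exactly what is being asked for.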
The second setting is in machine learning, where the left-hand side denotes the original signal and we are looking to represent it as a combination of a few sources. Some ideas are important here and I will try to point them out. First and foremost, the general object (perhaps a natural scene) represented by the signal may be the result of many different causes in nature. So we need a largely redundant dictionary to represent all such signals in a sparse and linear manner. Perhaps the signal represents a face, in which case we need at least as many atoms as there are people. Secondly, an instance of the signal, when captured by sensors, is usually the result of only a few active causes and may thus be represented sparsely using the dictionary. This is the main idea behind finding sparse representations, or features, for our signals. The solution, though, is the same as in the signal processing setting.
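As a rough illustration of representing a signal with a few atoms from a redundant dictionary, here is a sketch of plain matching pursuit, a simple greedy method that I am swapping in for illustration (it is not the only, or the best, way to compute sparse codes). It assumes the atoms have unit norm. The dictionary, the signal and the function name `matching_pursuit` are all made up for this example.

```python
import math


def matching_pursuit(atoms, signal, max_atoms=10, tol=1e-9):
    """Greedy sparse coding; atoms are assumed to be unit-norm vectors.

    Returns a dict {atom index: coefficient} and the final residual.
    """
    residual = list(signal)
    coeffs = {}
    for _ in range(max_atoms):
        if math.sqrt(sum(r * r for r in residual)) <= tol:
            break
        # Correlation of every atom with what is still unexplained.
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(corr[j]))
        coeffs[best] = coeffs.get(best, 0.0) + corr[best]
        # Remove the chosen atom's contribution from the residual.
        residual = [r - corr[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual


# A redundant dictionary: four unit-norm atoms in a 2-dimensional space.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s], [s, -s]]

# A signal caused by a single "active" atom (index 2).
signal = [2.0 * s, 2.0 * s]
coeffs, residual = matching_pursuit(atoms, signal)
print(coeffs)
```

With only the first two atoms (a plain basis) this signal would need two nonzero coefficients. The redundant dictionary explains it with one, which is the whole point of redundancy here.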
In signal processing, where the matrix is a measurement matrix, we are interested in matrices that capture the information in the signal using very few measurements and convey that information in the measurement vector. In machine learning we are interested in learning dictionaries that result in very sparse yet information-preserving features. Now consider the notion of mutual information: the amount of information that one side of the equation conveys about the other is the same in both directions. One is therefore led to believe that the same matrix should be used in both cases. The state-of-the-art solutions, though, are quite different. The signal processing community searches for matrices (with a fixed number of columns) that are maximally incoherent yet have the minimum number of rows (two obviously contradictory constraints). The machine learning community is looking for maximally redundant dictionaries with as few columns as possible and a fixed number of rows (again, two contradictory constraints).
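One common way to quantify incoherence is the mutual coherence of a matrix: the largest absolute normalized inner product between two distinct columns. The sketch below computes it and compares it with the Welch bound, a known lower bound on the coherence of n columns in m dimensions, which grows as rows are removed. The function names and the small example matrix are my own choices for illustration.

```python
import math


def mutual_coherence(A):
    """Largest |<a_i, a_j>| / (|a_i| |a_j|) over distinct columns of A."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    mu = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(u * v for u, v in zip(cols[i], cols[j]))
            mu = max(mu, abs(dot) / (norms[i] * norms[j]))
    return mu


def welch_bound(m, n):
    """Lower bound on the coherence of n columns in m dimensions (n >= m, n > 1)."""
    return math.sqrt((n - m) / (m * (n - 1)))


# Three unit vectors in the plane, 120 degrees apart.
h = math.sqrt(3.0) / 2.0
A = [[0.0, -h, h],
     [1.0, -0.5, -0.5]]

mu = mutual_coherence(A)
print(mu, welch_bound(2, 3))  # both are 0.5 up to rounding
print(welch_bound(1, 3))      # with one row fewer the bound rises to 1.0
```

The bound makes the tension explicit: for a fixed number of columns, every row you remove pushes the best achievable coherence up.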