In one of the great intellectual debates of the twentieth century, Niels Bohr and Albert Einstein went back and forth on the philosophical foundations of quantum mechanics. A key issue that emerged was the relationship between the observer and the system being observed. In quantum theory, these two entities are intimately linked. When an observer makes a measurement on a system, he or she fundamentally influences the system being measured. In classical physics, by contrast, the observer is not connected to the system being measured; the two entities remain independent.
Let's take this idea and apply it to the world of modeling and analysis. In a fundamental sense, there is some intrinsic structure in the data that we are trying to discover by building a model. However, the model itself has a structure - for example, linear regression assumes that the data fits a straight line. So, there is an inevitable "collision of the geometries" when the structure of the model interferes with the structure of the data that is being modeled. This can be an ultimate source of information loss during the modeling process. The modeler is now connected to the system that he or she is modeling!
An interesting question to ask is how different modeling methods interfere with the system being modeled. As discussed above, regression methods are pretty interfering! A functional form is imposed on the data and we try to find the best fit. Neural networks do the same thing, although the specific manifold that they create remains hidden from the user.
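To make the "collision" concrete, here is a minimal sketch in plain Python. It fits a straight line by ordinary least squares to two made-up data sets, one that really is linear and one that is quadratic. The data, the helper names `fit_line` and `r_squared`, and the choice of R-squared as a rough measure of how much structure the line captures are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Fraction of the variance in ys explained by the line a + b*x."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: x values symmetric around zero.
xs = [x / 10 for x in range(-20, 21)]
ys_linear = [2 * x + 1 for x in xs]      # structure matches the model
ys_quadratic = [x * x for x in xs]       # structure does not match

a_lin, b_lin = fit_line(xs, ys_linear)
a_quad, b_quad = fit_line(xs, ys_quadratic)

r2_linear = r_squared(xs, ys_linear, a_lin, b_lin)
r2_quadratic = r_squared(xs, ys_quadratic, a_quad, b_quad)

print("R^2 on linear data:   ", round(r2_linear, 6))
print("R^2 on quadratic data:", round(r2_quadratic, 6))
```

On the linear data the line recovers essentially everything. On the symmetric quadratic data the best-fit line is nearly flat and explains almost none of the variance, even though the data is perfectly deterministic. The structure is there; the imposed geometry simply cannot see it.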
In Pattern Based Analytics, the primary collision of the geometries occurs when we categorize continuous data into discrete bins. Our choice of binning strategy amounts to us imposing a structure on the data. Pattern discovery is fundamentally influenced by the mapping of continuous data into discrete states. Conversely, if the data is intrinsically categorical, pattern based analytics can be minimally invasive, approaching the classical ideal in physics.
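As a rough illustration of how the binning choice imposes structure, the sketch below discretizes the same made-up, skewed values with two common strategies: equal-width and equal-frequency bins. The data and the helper names are hypothetical, and neither strategy is claimed to be the one any particular pattern based analytics tool uses.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_edges(values, k):
    """Interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior cut points that put roughly len(values)/k items in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


def bin_counts(values, edges, k):
    """Number of values landing in each of the k bins."""
    tally = Counter(assign_bins(values, edges))
    return [tally[i] for i in range(k)]


# Hypothetical skewed data: nine small values and one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
k = 2

width_counts = bin_counts(values, equal_width_edges(values, k), k)
freq_counts = bin_counts(values, equal_frequency_edges(values, k), k)

print("equal-width counts:    ", width_counts)   # [9, 1]
print("equal-frequency counts:", freq_counts)    # [5, 5]
```

The same ten numbers become two quite different sets of discrete states. Equal-width binning lumps everything except the outlier together, while equal-frequency binning splits the small values down the middle. Any patterns discovered afterwards inherit that choice.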
An interesting discussion topic can thus emerge from an analysis of how different modeling approaches collide with the data that they are modeling and the resulting implications in terms of information loss. Decision trees, support vector machines, neural networks, regression methods of all sorts, all come to mind. It will be interesting to see what lessons emerge from this idea...
The strength of linear models is also their weakness: simplicity. They are easy to understand, and when the underlying phenomenon being modeled actually is fairly linear, linear models do not interfere much with the data being modeled. Linear models are the hammer in a modeler’s toolbox. Everyone (even non-carpenters) knows how a hammer works, and hammers do a tremendous job when the problem happens to be a nail. However, not all modeling problems are nails. Many modelers insist on using their beloved hammer when faced with a screw because they have never heard of (or do not understand how to use) the screwdriver. They might manage to bang that screw in there eventually with their hammer, but using a screwdriver would have been cleaner and produced better results. It is of course important to use the right tool for the job when possible.
That said, oftentimes it is not at all obvious what the right tool for the job is. In such instances it is useful to have a tool that works well across a range of possibilities. Pattern based analytics is useful here because it interferes only minimally when the data is categorical. It also provides an acceptable level of interference for many problems involving truly continuous data, so it is a useful approach when the perfect tool for the job is not known.
When we build a model, we always rely on assumptions in some way. When the assumptions are valid (or at least reasonable) for a given problem, they help us significantly. When they are violated, they can be poisonous. We apply assumptions, intentionally or unintentionally, almost everywhere. Even a simple data preprocessing step, such as discretizing a numeric attribute, may introduce information loss if we apply it inappropriately. For example, given an integer attribute in the range [1,365], blindly discretizing it into bins may be dangerous if the attribute represents the day of the year. Weekends and/or holidays may have special meaning for a sales problem, and treating every day as equivalent and then applying an equal-width or equal-frequency discretization algorithm will cause significant information loss.
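The day-of-year example can be sketched in a few lines of Python. The sketch assumes a specific non-leap calendar year (2023, chosen arbitrarily) so that day numbers can be mapped to weekdays, and it uses twelve equal-width bins. Both choices, along with the function names, are assumptions for illustration only.

```python
from datetime import date, timedelta

YEAR = 2023  # assumption: an arbitrary non-leap year, so days run 1..365


def equal_width_bin(day_of_year, n_bins=12, n_days=365):
    """Blind equal-width discretization of a day number into n_bins bins."""
    return (day_of_year - 1) * n_bins // n_days


def is_weekend(day_of_year, year=YEAR):
    """Domain-aware feature: True if the day falls on a Saturday or Sunday."""
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return d.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend


days = [6, 7, 8, 9]  # Friday through Monday in the assumed year
bins = [equal_width_bin(d) for d in days]
weekend_flags = [is_weekend(d) for d in days]

for d, b, w in zip(days, bins, weekend_flags):
    print(f"day {d}: bin {b}, weekend={w}")
```

All four days land in the same bin, so a model built on the binned attribute cannot tell the weekend from the weekdays around it. The weekend flag keeps exactly the distinction that matters for a sales problem, but producing it requires knowing what the attribute means.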
As Michael McGowan mentioned, an assumption may be strong or weak. A strong assumption works for a narrow range of problems and may significantly help in building a high-quality model. A weak assumption may work for a wide range of problems, but the quality of the model might be weak as well. Basically, there is no free lunch [1]. Finding the right tool for a job is a real challenge because we need to understand both the domain (the meaning behind the data, as in the day-of-year example above) and the science behind a modeling approach! One way to mitigate the challenge is to let the user focus on domain knowledge to prepare the data appropriately, and let a computer system adapt different modeling approaches to different problems based on the data information provided by the user. Meta-learning is one such approach: it learns which kinds of modeling algorithms are good at which kinds of data [2,3,4]. Pattern based analytics (PBA) can help with this challenge in another way. Instead of facing the whole complex problem directly, we look at patterns, each of which is relatively simple, and then build a model based on these patterns. This reduces the modeling knowledge required, since we are looking at simpler problems, and it makes meta-learning more practical, because automatically selecting an algorithm for a complex problem may be extremely hard. It is also worth noting that finding patterns usually requires weaker assumptions than building a model, so PBA is a more practical and user-friendly approach for a typical user, especially for a complex problem.
[1] David H. Wolpert. The supervised learning no-free-lunch theorems. In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, pages 25–42, 2001.
[2] Kate A. Smith-Miles. Cross-Disciplinary Perspectives on Meta-Learning for Algorithm Selection. ACM Computing Surveys (CSUR), 41(1):25, 2008.
[3] Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119–138, 2006.
[4] M. Hilario, A. Kalousis, P. Nguyen, and A. Woznica. A Data Mining Ontology for Algorithm Selection and Meta-Mining. In ECML/PKDD Workshop on Third-Generation Data Mining: Towards Service-Oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, 2009.