As the name implies, a lurking variable is a variable that’s hidden, or rather, something that you don’t notice right away. It’s that creepy guy in a trench coat and bowler hat standing in the alleyway watching you walk down the street…he’s “lurking”, get it? Officially, a lurking variable is a variable that affects a response variable (also called a dependent variable) but is not included among the explanatory variables in the analysis, so its effect can be mistaken for theirs.
Lurking variables are a big issue when analyzing data, especially when trying to use that data to create predictive algorithms. They can create spurious trends in your data, skew your results, impede an effective analysis and end the world as we know it (not really, but you get the idea). Think about it like this: your significant other comes home from work, walks through the door, stomps right past you, goes straight upstairs and slams the door. Your paranoid human nature tells you that you did something wrong and you start to think, “Did I forget their birthday? Our anniversary?” In reality they could be mad for any number of reasons. Maybe their boss yelled at them today, maybe they sat in traffic and got annoyed, or maybe they’re just tired.
My point is that sometimes you just don’t know. Your scope of knowledge is limited to your interactions with your significant other, so you’re not aware of other possibilities. Similarly, your analysis of data is limited to the independent variables that you identify, but there are almost always going to be unidentified variables affecting the dependent variables.
So how do you identify and control these lurking variables? One answer: ask why.
Examining graphs is the best place to start when attempting to identify lurking variables. You can visually see inconsistencies in the data such as outliers or any variation from trends. These are the data points you should question. If you graphed your significant other’s mood on a scale of 1-10, you may see trends that correlate with the day of the week and maybe the hours of sleep they got. (What if you could use analytics to predict what mood your spouse is going to be in? I think a fair number of people would be interested in that…) However, perhaps the day they came home upset was a Friday and they got eight hours of sleep the night before. Both of these things tell you they should be in a good mood. Because they are not, this instance would be an outlier. Now you have to ask yourself why. Maybe there is another variable or trend you haven’t identified yet. For instance, something else that could affect their mood is whether or not they ate breakfast that morning. The same thing can happen in other data sets. There are a few things you can do to explain these outliers and/or find lurking variables.
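If you would rather let the computer point at the suspicious days than squint at a chart, here is a minimal sketch of the same idea. It fits a straight line of mood against hours of sleep and flags any day that sits far from that line. Everything in it is made up for illustration: the numbers, the `ate_breakfast` column, the helper names `fit_line` and `flag_outliers`, and the cutoff of two standard deviations, which is a common rule of thumb rather than a law.

```python
from statistics import mean, pstdev

# Hypothetical observations: (hours_of_sleep, mood on a 1-10 scale, ate_breakfast).
# The breakfast column stands in for a variable we did not think to plot at first.
observations = [
    (5.0, 4, True),
    (6.0, 5, True),
    (6.5, 6, True),
    (7.0, 6, True),
    (7.5, 7, True),
    (8.0, 3, False),  # the "upset despite eight hours of sleep" day
    (8.0, 8, True),
    (8.5, 8, True),
    (9.0, 9, True),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def flag_outliers(xs, ys, threshold=2.0):
    """Return the indices whose residual is more than `threshold`
    residual standard deviations away from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


sleep = [row[0] for row in observations]
mood = [row[1] for row in observations]
outliers = flag_outliers(sleep, mood)

# Print the flagged days together with the variable we had been ignoring.
for i in outliers:
    hours, score, breakfast = observations[i]
    print(f"sleep={hours}h mood={score} ate_breakfast={breakfast}")
```

With this toy data the only flagged day is the one with eight hours of sleep and no breakfast. That does not prove breakfast is the lurking variable, but it tells you exactly which row to ask "why?" about.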
You might want to start by looking at your data gathering technique. Maybe your sampling or measurement technique was flawed. Maybe your data preprocessing wasn’t complete. Do you have any dates in a foreign format (day/month/year instead of month/day/year, say) that you still have to convert? If none of these are the issue, look at the graphs. It could be that a trend is emerging but you lack enough data to support it, so you have to keep gathering more. Whether it’s a trend or a one-time occurrence, you are going to have to do some digging. Ask relevant parties if they have any explanation. If there is a time ordering, see if your outlier can be explained by the time it occurred.
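The date conversion and the time-ordering check can be sketched together in a few lines. This assumes the raw dates arrive as day/month/year strings, which is only a guess about what your source looks like. The entries and the helper name `parse_day_first` are hypothetical.

```python
from datetime import datetime

# Hypothetical log entries: (date as a day/month/year string, mood on a 1-10 scale).
raw_entries = [
    ("03/04/2023", 7),
    ("07/04/2023", 3),
    ("01/04/2023", 6),
]


def parse_day_first(text):
    """Parse a day/month/year string; the format is an assumption about the source."""
    return datetime.strptime(text, "%d/%m/%Y").date()


# Convert the strings to real dates, then sort so the time ordering is visible.
entries = sorted((parse_day_first(text), mood) for text, mood in raw_entries)

# Show each day in ISO form with its weekday name next to the mood score.
for day, mood in entries:
    print(day.isoformat(), day.strftime("%A"), mood)
```

The format matters more than it looks: read month-first, "07/04/2023" would land on 4 July instead of 7 April, and the low mood score would be attached to the wrong day of the week entirely.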
Identifying outliers and deviations from trends is important if you want the most effective analysis. It challenges you to think about variables you have forgotten or any defective processes you may have used. It cleans up the data and allows you to make a more confident analysis.
A final note for educational purposes: a lurking variable may also be called a hidden variable, but that’s just not as fun.