“What follows is ever closely linked to what precedes; it is not a procession of isolated events, merely obeying the laws of sequence, but a rational continuity.” — Marcus Aurelius, Meditations, Book IV, 45
I am now working with TCS on a project on forecasting demand (I will write more details later). It is a great opportunity to learn (and apply) more statistics and time series analysis, and the project also has connections to lots of interesting areas, from dynamical systems to machine learning. I suspect I will learn more statistics than I would in a course, and it will be more fun too. (I think college statistics is generally taught extremely poorly and could be improved in many, many ways, but going into the details would require another post.)
The problem: given that we know the past history of some variable(s), what is the best way to predict the future value(s)? What “best” means depends on the application. Usually it means low expected out-of-sample error (measured as RMSE, for instance), though in a business context I think it is better to use loss functions suited to the area of application.
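To make the contrast concrete, here is a small sketch. The `asymmetric_loss` function and its penalty weights are hypothetical illustrations, not anything from the project: the idea is that in demand forecasting, under-forecasting (a stockout) can cost more per unit than over-forecasting (holding inventory), which RMSE ignores because it penalizes both directions symmetrically.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error: penalizes over- and under-forecasts symmetrically."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def asymmetric_loss(actual, forecast, under_penalty=3.0, over_penalty=1.0):
    """A toy business loss (hypothetical weights): missing demand
    costs more per unit than carrying excess stock."""
    total = 0.0
    for a, f in zip(actual, forecast):
        if f < a:   # under-forecast: lost sales
            total += under_penalty * (a - f)
        else:       # over-forecast: holding cost
            total += over_penalty * (f - a)
    return total / len(actual)

actual = [100, 120, 110]
forecast = [90, 125, 105]  # one large under-forecast, two smaller misses
print(rmse(actual, forecast))            # ≈ 7.07
print(asymmetric_loss(actual, forecast)) # ≈ 16.67
```

A forecast that looks fine under RMSE can look much worse once the asymmetry of the business cost is built into the loss.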
I am particularly interested now in learning more about the problem of model selection and how to prevent over-fitting. How do you find a model that will give the lowest out-of-sample error? Validation or cross-validation is the most obvious and least sophisticated approach and is very commonly used, but there are some issues about it that worry me (for example: when partitioning a data set into a training set and a test set, how does one decide how much data to include in the training set, and how much in the test set? Surely this is an important decision! If anyone knows anything about this, please let me know.) There are lots of other things I should mention: how prediction markets work, exponential smoothing…
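One twist with time series is that an ordinary random train/test split leaks the future into the training set, so validation is usually done with a rolling origin: fit on everything up to time t, score the forecast of the next point, then advance t. The sketch below (the function names and the `alpha` values are my own, and the model is the simplest possible exponential smoother) shows how this can be used to compare models by out-of-sample error rather than in-sample fit.

```python
import math

def exp_smooth_forecast(history, alpha=0.3):
    """Simple exponential smoothing: the one-step-ahead forecast is a
    geometrically weighted average of past observations."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def rolling_origin_rmse(series, min_train=5, alpha=0.3):
    """Rolling-origin validation: repeatedly fit on series[:t] and score
    the one-step forecast of series[t], so every split respects time order."""
    sq_errors = []
    for t in range(min_train, len(series)):
        forecast = exp_smooth_forecast(series[:t], alpha)
        sq_errors.append((series[t] - forecast) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]
# Compare smoothing parameters by out-of-sample error:
for alpha in (0.2, 0.5, 0.8):
    print(alpha, rolling_origin_rmse(series, alpha=alpha))
```

Note that `min_train` plays the same role as the training/test split size question above: a small value gives more evaluation points but noisier early fits, and I do not know of a principled rule for choosing it.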