Thursday, June 14, 2007

Server Logs and Security

I recently posted a blog entry explaining why Google retains search server logs for 18 months before anonymizing them.
http://googleblog.blogspot.com/2007/06/how-long-should-google-remember.html
Security is one of the important factors that went into that decision. Google uses logs to help defend its systems from malicious access and exploitation attempts. You cannot have privacy without adequate security. I've heard from many people, all of whom agree that server logs are useful tools for security, though some ask why 18 months of logs are necessary. One of my colleagues at Google, Daniel Dulitz, explained it this way:

"1. Some variations are due to cyclical patterns. Some patterns operate on hourly cycles, some daily, some monthly, and others...yearly. In order to detect a pattern, you need more data than the length of the pattern.

2. It is always difficult to detect illicit behavior when bad actors go to great lengths to avoid detection. One method of detecting _new_ illicit behaviors is to compare old data with new data. If at time t all their known characteristics are similar, then you know that there are no _new_ illicit behaviors visible in the characteristics known at time t. So you need "old" data that is old enough to not include the new illicit behaviors. The older the better, because in the distant past illicit behaviors weren't at all sophisticated.

3. Another way of detecting illicit behaviors is to look at old data along new axes of comparison, new characteristics, that you didn't know before. But the "old" data needs to run for a long interval because of (1). So its oldest sample needs to be Quite Old. The older the data, the more previously undetected illicit behaviors you can detect.

4. Some facts can be learned from new data, because they weren't true before. Other facts have been true all along, but you didn't know they were facts because you couldn't distinguish them from noise. Noise comes in various forms. Random noise can be averaged out if you have more data in the same time interval. That's nice, because our traffic grows over time; we don't need old data for that. But some noise is periodic. If there is an annual pattern, but there's a lot of noise that also has an annual period, then the only way you'll see the pattern over the noise is if you have a lot of instances of the period: i.e. a lot of years.

This probably isn't very surprising. If you're trying to learn about whether it's a good idea to buy or rent your house, you don't look only at the last 24 months of data. If you're trying to figure out what to pay for a house you're buying, you don't just look at the price it sold for in the last 24 months. If you have a dataset of house prices associated with cities over time, and someone comes along and scrubs the cities out of the data, it hasn't lost all its value, but it's less useful than it was."
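
To make points 1 and 4 of Daniel's explanation concrete, here is a minimal sketch in Python. It is an illustration only and not a description of how Google analyzes its logs. The "traffic" is synthetic (a yearly sine wave plus random noise), and the names synthetic_traffic, lag_correlation, seasonal_error and the 365-day PERIOD are all assumptions made for the example. The first function pair shows that a yearly cycle cannot even be tested for unless the data runs longer than a year. The last function shows that the estimate of the yearly pattern gets better as more years are averaged together.

```python
import math
import random

PERIOD = 365  # assumed length of the yearly cycle, in days


def synthetic_traffic(n_days, noise=1.0, seed=0):
    """Hypothetical daily metric: a yearly sine wave plus random noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * d / PERIOD) + rng.gauss(0, noise)
            for d in range(n_days)]


def lag_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`.

    Returns None when there are too few overlapping samples, which is
    always the case when the series is not longer than the lag.
    """
    n = len(series) - lag
    if n < 2:
        return None
    a, b = series[:n], series[lag:]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def seasonal_error(n_years, noise=1.0, seed=0):
    """RMS error of the yearly profile estimated by averaging n_years of data."""
    series = synthetic_traffic(n_years * PERIOD, noise, seed)
    total = 0.0
    for day in range(PERIOD):
        # Average every sample that falls on the same day of the year.
        estimate = sum(series[day::PERIOD]) / n_years
        total += (estimate - math.sin(2 * math.pi * day / PERIOD)) ** 2
    return math.sqrt(total / PERIOD)


if __name__ == "__main__":
    # Ten months of data: the yearly lag cannot be computed at all.
    print(lag_correlation(synthetic_traffic(300), PERIOD))
    # Eighteen months: about half a year of overlapping samples to compare.
    print(lag_correlation(synthetic_traffic(547), PERIOD))
    # More years of data give a cleaner estimate of the yearly pattern.
    for years in (1, 2, 4, 9):
        print(years, round(seasonal_error(years), 3))
```

The sketch uses independent random noise, which is the easy case Daniel mentions. Noise that itself repeats annually is harder to separate from a real annual pattern, and that only strengthens the case for having many years of data.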
