Infrastructure is often the last thing big data users think about. Playing with numbers and discovering insights is so much more exciting. But before any analysis can take place, IT directors need to work out where the data will be held – in-house or in the cloud. The decision must weigh cost, security, scalability and the ability to give third parties access to the data.
Even getting data where you want it is tricky. If the information isn’t already online, companies that want to use a cloud storage service can face a huge operation to move their existing data out of their data centres, according to Kirk Dunn, chief operating officer of big data software provider Cloudera.
“The notion of taking a petabyte of data and moving it from the data centre to the cloud and then to a managed provider and then back to the data centre is just not practical,” he says. “You can’t move that amount of data very effectively or very fast.”
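Mr Dunn’s point is easy to check with back-of-the-envelope arithmetic. The sketch below is illustrative only – the link speeds are assumptions, and real transfers would carry protocol overhead on top:

```python
# Back-of-the-envelope: how long to push one petabyte over a network link?
# Link speeds are illustrative assumptions, and the sums ignore overhead.
PETABYTE_BITS = 8 * 10**15          # 1 PB = 10^15 bytes = 8 * 10^15 bits

for gbps in (1, 10, 40):            # assumed link speeds in gigabits/second
    seconds = PETABYTE_BITS / (gbps * 10**9)
    print(f"{gbps:>2} Gbps link: {seconds / 86_400:6.1f} days")

# 1 Gbps: ~92.6 days   10 Gbps: ~9.3 days   40 Gbps: ~2.3 days
# (assuming the link runs flat out, around the clock, with zero overhead)
```

Even a dedicated 10Gbps line kept saturated for over a week only moves a single petabyte, which is why shipping physical drives remains a serious option.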
But the director of Amazon Web Services, Iain Gavin, doesn’t think that getting the data into the cloud should be an obstacle, no matter what you might have to do to get it there. “People do send us drives – truckloads of drives,” he happily admits.
Few of the questions concerning infrastructure have easy answers. “A lot of commentators, who ought to know better, talk about big data as if it is a single, homogeneous problem-space, but it’s not,” Martin Willcox, director of marketing at Teradata Corporation, points out. “Because big data problems come in different shapes and sizes, big data solutions do too,” he says.
That said, the in-house versus cloud debate is moving sharply in favour of the latter. Marketing firm Razorfish, for example, is an experienced advocate of hosting data in the cloud.
“Razorfish was already in the business of big data before the term was coined,” says the digital marketing agency’s UK technical director Mandhir Gidda. “Ultimately we needed a platform with scalable characteristics for dealing with naturally increasing data and processing load, which could be dynamically scaled up and down to deal with seasonal or peak traffic, and that was more cost-effective than running a data centre operation of our own.”
Cloudera’s Mr Dunn thinks pragmatism might be the key consideration for firms, particularly when they’re starting out: keep the massive volumes of data already in-house where they sit, and use the cloud for additional projects. But there is a sense that once firms have signed up to the cloud, few go back to in-house.
James Mucklow, an IT expert at PA Consulting, recalls a large-scale health project he was involved in that explored the potential correlation between anti-ulcer medication and pneumonia.
“We loaded published prescription data into the cloud – several billion items – which is great because we could then build systems based on that. We did that in less than a week,” he says.
“What my team did is they basically took the two data sets – anti-ulcer drug prescriptions and admissions to A&E for pneumonia – and they found a very simple correlation. Now there are a few caveats, we’re not epidemiologists and we’re not clinicians, but the point was we did that in an hour, so that suggests it’s worth exploring in more detail with more effort.”
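The article doesn’t describe PA Consulting’s method, but a “very simple correlation” of this kind can be sketched in a few lines. The figures below are invented, purely to show the shape of the calculation:

```python
# Illustrative only: made-up regional aggregates standing in for the two
# data sets - anti-ulcer prescriptions and pneumonia A&E admissions.
# This is not PA Consulting's actual data or method.
from scipy.stats import pearsonr

prescriptions = [1200, 3400, 2100, 5600, 4300, 2900]   # per region (invented)
admissions    = [  80,  210,  150,  340,  260,  190]   # per region (invented)

r, p_value = pearsonr(prescriptions, admissions)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
# A strong r with a small p flags the link as worth deeper study -
# but, as Mr Mucklow's caveat makes clear, correlation alone is not
# a clinical finding.
```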
And the cost? Again, it is tough to find a consensus on whether the cloud is really the cheaper option. Mr Mucklow thinks the cloud is the more economical choice for most datasets, but Teradata’s Mr Willcox reckons it’s a bit more complicated than that.
“When you get it all done, very often the costs are neither here nor there,” he says. “The principal cost benefit of a cloud approach is the consolidation of computing resources, so that we don’t have multiple dedicated systems, each only running at 15 per cent capacity, where we could have a single system running at closer to 100 per cent.
“Centralising and integrating computing resources in that way is something that we have been preaching and practising at Teradata for over 30 years; an enterprise data warehouse is a private cloud if it’s done right and built on appropriate technology.”
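The utilisation argument reduces to simple arithmetic. In the sketch below, the 15 per cent figure comes from Mr Willcox’s example; the number of dedicated systems is an assumption for illustration:

```python
# Willcox's consolidation argument in numbers. The per-system utilisation
# comes from his quote; the count of dedicated systems is assumed.
dedicated_systems = 7        # assumed: seven silos, one per workload
utilisation = 0.15           # each running at 15% capacity (per the quote)

useful_work = dedicated_systems * utilisation      # 1.05 machines' worth
print(f"Useful work: {useful_work:.2f} machines' worth of capacity")
print(f"Paid-for capacity: {dedicated_systems} machines")
# Seven boxes at 15% do roughly the work of one box at 100% -
# consolidation recovers the other six machines' worth of spend.
```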
Then there’s monetising the data for more than just the business at hand, and that means third-party access. Big firms, such as Barclays and O2 parent Telefonica, have recently announced that they will sell anonymised metadata gleaned from their customers. It is a popular cash-generator for data owners.
Once information is stored in the cloud, it becomes easier to let other firms access it, while making sure they’re only allowed into the data they have permission to see – a concept that Razorfish’s business model is built on.
“Razorfish provides dashboards to our clients where they can run their own reports, slice and dice their data by whatever dimensions they choose, and make use of some of the predictive processing features to see visual extrapolations of trends being surfaced from the data,” Mr Gidda explains.
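The article doesn’t describe how Razorfish enforces those boundaries, but the general pattern – filter every request against the client’s entitlements before any rows come back – can be sketched briefly. All of the names below are hypothetical:

```python
# Hypothetical sketch of permission-scoped access: every request is
# filtered by the caller's entitlements before any rows are returned.
# None of these names or structures come from the article.
RECORDS = [
    {"client": "acme",   "campaign": "spring", "clicks": 1043},
    {"client": "acme",   "campaign": "summer", "clicks": 2219},
    {"client": "globex", "campaign": "spring", "clicks": 577},
]

ENTITLEMENTS = {"acme-analyst": {"acme"}, "globex-analyst": {"globex"}}

def query(user: str, **filters) -> list[dict]:
    """Return only the rows the user's client entitlements allow."""
    allowed = ENTITLEMENTS.get(user, set())
    rows = [r for r in RECORDS if r["client"] in allowed]
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

print(query("acme-analyst", campaign="spring"))   # acme rows only
print(query("globex-analyst"))                    # never sees acme's data
```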
Other firms have to be sure that they’re not making money at the expense of their own business. “There’s enormous value in data. But what does it do to your relationship with your customer if you’re selling data about them?” asks Mr Mucklow. “You have to be very clear on not compromising that relationship.”
So what are the best practices when it comes to choosing infrastructure for big data projects? There are no simple answers, but that’s not unusual for a young industry. Although businesses have been working with customer information in one way or another for decades, the “big” in big data is still relatively new and the technology to meet that scale is still maturing.
“Let’s say today the majority of data is sitting in the data centre and all of a sudden, five years from now, everybody wants to move it into a cloud service provider,” says Cloudera’s Mr Dunn. “Well, necessity is the mother of innovation. So if there’s a need that’s strong enough for a business to do that, I’m sure somebody will find a way to move five or ten petabytes over a network efficiently.”