What is Big Data? How should it be used? And is it right for you?
2013 saw the rise of Big Data, with almost every major software vendor championing it as the latest revolution in IT and business. However, like any technology, Big Data is a tool to be used where appropriate, and in this article I will discuss how to answer the question: “Is Big Data right for me?”
I will begin by defining what Big Data is, along with a brief overview of the principles and technologies it employs. Following this, I will discuss how and why Big Data is used, concluding with a list of questions to ask when deciding if Big Data is right for your company.
What is Big Data?
This is the most common question in the field and, to some, the most difficult one to answer. If you were to ask a cross-section of IT professionals, you would probably get as many different answers as people you asked. The problem is that the term itself is vague, and what constitutes 'big' with regard to data size is constantly changing. Twenty years ago a one-gigabyte hard drive was considered huge, whereas today most smartphones can store many times more than this. Not only that, but the technology used to organise and access data has changed dramatically. Relational databases and SQL have existed since the 1970s, but today’s businesses are also adopting newer technologies such as NoSQL (Not only SQL) and MPP (Massively Parallel Processing) databases, which use radically different data structures and management techniques to allow huge amounts of data to be exploited in new ways.
With this in mind, Jonathan Stuart Ward and Adam Barker of the University of St Andrews conducted a survey of Big Data definitions. They approached industry leaders including IBM, Oracle and Microsoft, and received a multitude of responses attempting to define it, including definitions such as:
“Traditional relational database-driven business decision making, augmented with new sources of unstructured data” (Oracle)
“Big Data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information.” (Microsoft)
and even simply as:
“[data which] exceed(s) the capacity or capability of current or conventional methods and systems” (National Institute of Standards and Technology)
They soon realised that Big Data is a general term whose interpretation changes with the times and the technology of its users, and that it cannot be associated with a single database methodology or business case. Big Data is a generic principle that promotes novel uses of existing technologies, to allow organisations to effectively utilise datasets which are outside the scope of conventional data storage and analysis methods.
Principles and Technologies
The core principles of Big Data, and the reason for its existence, are ‘the three Vs’: velocity, volume and variety.
Velocity: the speed at which the data is created, stored, analysed and visualised.
Business grows more competitive every year, and with the increasingly rapid pace of both technology and commerce, companies need information faster and in greater detail to be able to make effective management decisions. Here at Waterstons, we aim to achieve performance through technology for our clients in one or more of the following ‘Five Ways’:
- Raising quality and lowering costs
- Acquiring and retaining customers
- Providing timely and accurate information
- Improving teamwork and communication
- Reducing risk and increasing security
If a project or technology provides a measurable return in at least one of the five ways, it can be considered viable. This is where velocity comes into its own: by increasing the velocity of data, timely management information becomes more readily available, allowing a business to analyse both itself and its market and make informed decisions with the latest data. Big Data tends to focus on real-time information and reporting, so answers to difficult questions can be revealed almost instantly rather than in weekly or monthly reports. This contrasts with the daily or weekly builds common in Business Intelligence systems, which rely on OLAP cubes that must be rebuilt regularly.
Volume: the size of the data created, and the resources required for its storage.
Big Data, as the name suggests, focuses on huge amounts of data. What many don’t realise is just how huge these datasets tend to be. To put this into perspective, 90% of all data that has ever been created was generated in the last two years; in fact the world’s data volume approximately doubles every two years. Aeroplanes generate approximately 2.5 billion terabytes of data every year due to the sensors installed within them for safety and tracking purposes. To store and analyse such monumental amounts of data effectively, the traditional relational database model becomes impractical; instead, more efficient data structures are needed to traverse this data within a realistic timescale.
Variety: the variation in data sources and formats.
Relational databases are collections of structured, tabulated data. Due to its extreme volumes, Big Data does not have the luxury of this structure. For example, social media sites such as Twitter and Facebook have enormous databases of status updates, tweets, images, videos and user profile information. These will be stored in a number of different locations and in a huge variety of formats, making database structures varied and the navigation of them complex. Thus a challenge for Big Data technologies is to allow these sources to be pooled together and analysed effectively, without limiting the volume or velocity of these varying types of data.
One technique used to accommodate these principles is ‘MapReduce’. The core idea is that there are two operations, a ‘Map’ and a ‘Reduce’. The map operation takes a dataset; sorts, filters and divides it; and effectively creates a list of items to process. The reduce operation then performs this processing, which could be a sum of the items, a sum of the items that match a particular case, or some more complex operation. One application would be determining how many citizens in a specific area have a specific name. For example, the map operation could query a dataset and return a list of citizens who live in the North East of England. The reduce operation could then iterate over these items, check whether each surname is ‘Smith’ and, if so, increase the sum by one. The beauty of this approach is that once the data has been mapped, the list can be divided and distributed across a huge number of ‘Reduce’ nodes, each of which processes its section of the list and returns a count; these counts are summed when the operation is complete. This scales easily: to make the operation faster, you divide the list into smaller sections and distribute them across more nodes. Combined with virtualisation and load balancing, tasks can be distributed to more or fewer nodes as required, meaning resources (and the resulting costs) can scale according to the demands of the business.
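The surname count described above can be sketched in a few lines of Python. This is only an illustration of the idea as this article describes it, not Hadoop's actual API: the sample records, the function names and the way the list is split are all assumptions, and the "nodes" are simulated by processing each section in turn on one machine.

```python
# Hypothetical sample data; a real dataset would be far too large for one machine.
CITIZENS = [
    {"surname": "Smith", "region": "North East"},
    {"surname": "Jones", "region": "North East"},
    {"surname": "Smith", "region": "North East"},
    {"surname": "Smith", "region": "London"},
    {"surname": "Brown", "region": "North East"},
    {"surname": "Smith", "region": "North East"},
]


def map_phase(records, region):
    """Map: filter the dataset down to the list of items to process."""
    return [r for r in records if r["region"] == region]


def split(items, n_nodes):
    """Divide the mapped list into one section per (simulated) reduce node."""
    return [items[i::n_nodes] for i in range(n_nodes)]


def reduce_section(section, surname):
    """Reduce: count the matching surnames within a single section."""
    return sum(1 for r in section if r["surname"] == surname)


def count_surname(records, region, surname, n_nodes=3):
    mapped = map_phase(records, region)
    # Each section could be handed to a different machine; here they run in turn.
    partial_counts = [reduce_section(s, surname) for s in split(mapped, n_nodes)]
    # The partial counts are summed once every section has been processed.
    return sum(partial_counts)


print(count_surname(CITIZENS, "North East", "Smith"))  # prints 3
```

Changing n_nodes changes only how the work is divided, not the answer, which is the property that lets the operation be spread across more machines as the data grows.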
Perhaps the most notable implementation of this is Apache ‘Hadoop’, an open source framework built on numerous technologies that handles the distribution and storage of data and ‘MapReduce’ tasks, as well as resource management and fault tolerance in the computing clusters that process them. Due to its scalability and functionality, it has become the de facto implementation of Big Data, with more than half of Fortune 50 companies employing it within their business. In addition, companies such as Amazon offer EC2 (Elastic Compute Cloud) services that can host and maintain virtual Hadoop clusters, using a pay-as-you-go business model where the cluster scales dynamically to demand. This allows businesses to scale their costs as they need to, rather than investing huge amounts in the infrastructure required for a Hadoop cluster.
How and Why is Big Data Used?
One of the most universally applicable uses of Big Data is advertising and market research. As mentioned, social media sites such as Facebook and Twitter hold a huge amount of data generated by everyday people. Companies can use this data to understand their customers, both in terms of the products and services they desire and their opinions on a company and its competitors. They can also target specific demographics with certain advertisements, to maximise the effectiveness of their marketing investment. An example of this is Facebook’s ‘Ad Audience’ service, with which companies can specify criteria for the profiles they wish to target with their advertisements. These criteria can be almost any information that users provide in their profile, so attributes such as geographical location, age, ‘likes’ and even keywords within the contents of their status updates can be used to trigger specific advertising. The service even predicts how many users your advert could reach and updates this prediction in real time as the criteria change, a feature made possible by the velocity and volume principles of Big Data.
Another common use for Big Data is within the energy market. Industry leaders are using huge networks of sensors and Big Data technologies to detect and predict where oil reserves are hidden, so that they can extract them safely and in greater quantities to meet consumer demand. They are also using it to optimise their business operations, analysing logistics and refining processes to identify inefficiencies that can be streamlined to reduce costs. These savings can then be passed on to the consumer or reinvested in new technologies.
However, there have been many more unusual uses, including fraud prevention, distributed IT intrusion detection and prevention systems, and even crime prediction and terrorist tracking. The last of these uses social media, existing crime reports, electronic communication and even weather patterns to predict the rise and fall of criminal activities, as well as to correlate the links within criminal and terrorist organisations. Big Data has even been used to predict elections, using analysis of Twitter feeds via its publicly accessible API (see the references at the end of this article for more information).
Why and When Not To Use Big Data
With all of these incredible possibilities in mind, it is easy to become entranced by the allure of Big Data. However, there are significant drawbacks that must be considered by any organisation wishing to employ it in their day-to-day business. The first is cost: Big Data is very expensive. More than half of Fortune 50 companies use it not only because they can afford it, but because the sheer size and expenditure of their organisations make the significant costs small compared with the money saved through increased market awareness and optimised business operations. They can also afford to hire from the extremely limited pool of ‘data scientists’, a niche but necessary profession dominated by a select group of companies, a large number of which reside in the Silicon Valley area of California.
However, perhaps the single biggest reason not to use Big Data is necessity: most companies simply do not need it. The costs of infrastructure, support and knowledgeable personnel usually far outweigh the benefits and, more importantly, most companies do not have the amount of information that actually requires Big Data technologies to store and analyse it. The majority of corporate databases range from hundreds of megabytes up to tens of gigabytes, whereas Big Data is intended for terabytes, petabytes and far beyond. Below this scale, the techniques used to navigate such huge data structures actually become a hindrance, because they are deliberately limited to allow extremely fast processing over a distributed computing cluster. Hadoop is a prime example of this, as Chris Stucchio states:
“In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation you can write in Hadoop which you cannot write more easily in either SQL, or with a simple Python script that scans your files.”
The ‘MapReduce’ approach that Hadoop uses is extremely limited in its expressive abilities, and dedicated query languages such as SQL are both more flexible and easier to read. SQL also has a significantly larger talent pool, with highly experienced individuals readily available; practically every developer today has an understanding of SQL and, here at Waterstons, Microsoft SQL Server is one of our core technologies and areas of expertise.
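To make the comparison concrete, the surname count used earlier as the map and reduce example fits in a single SQL statement when the data sits comfortably on one machine. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module with an in-memory database rather than Microsoft SQL Server, and the citizens table, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (surname, region).
ROWS = [
    ("Smith", "North East"),
    ("Jones", "North East"),
    ("Smith", "North East"),
    ("Smith", "London"),
    ("Brown", "North East"),
    ("Smith", "North East"),
]


def count_surname_sql(rows, region, surname):
    """Count matching citizens with one declarative query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE citizens (surname TEXT, region TEXT)")
        conn.executemany("INSERT INTO citizens VALUES (?, ?)", rows)
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM citizens WHERE region = ? AND surname = ?",
            (region, surname),
        ).fetchone()
        return count
    finally:
        conn.close()


print(count_surname_sql(ROWS, "North East", "Smith"))  # prints 3
```

The query states what is wanted and leaves the how to the database engine, which is much of what makes SQL easier to read and write at this scale.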
Big Data is a buzzword that is sweeping the internet with numerous stories of its success and almost mystical predictive possibilities. The problem is that this gives it an allure which entices businesses to dedicate huge amounts of time and money towards it, without truly determining an actual justifiable return on investment. They identify Big Data and try to find a business case for it, rather than finding a business case that needs Big Data.
Big Data is a revolution, a shining example of technological achievement, but it is also a tool built for a specific job. We’ve sent men to the Moon, but that doesn’t mean we travel to work in rocket ships.
The key lesson is not to get caught up in the romance of this technology, but to actively analyse your business and see how technology can improve it. If its implementation delivers tangible and measurable benefits, for example against one of the five ways described earlier, then it is most likely a viable option for your business. Business justification and return on investment are the litmus tests that must be passed before it should be considered.
So, if you ever find yourself looking at the possibility of implementing Big Data in your business, ask yourself the following questions:
- Do you know what Big Data is?
- Do you have access to a Big Data scale of information?
If the answer is yes to all of those questions, then maybe Big Data is the right thing for you.
Summary
- Blogs and technical news agencies evangelise the predictive and analytical benefits of Big Data everywhere.
- A trend is emerging where companies are seeing how they can apply Big Data before they determine a need for it.
- Big Data is exciting and incredible, and with the ever-expanding amount of information being generated every day, it is increasingly relevant to today’s businesses.
- However, it is a tool to be used only where necessary, with a valid and cost-effective use case.
References
Undefined By Data: A Survey of Big Data Definitions: http://arxiv.org/pdf/1309.5821v1.pdf
3D Data Management: Controlling Data Volume Velocity and Variety: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-ControllingData-Volume-Velocity-and-Variety.pdf
Chris Stucchio: Don’t use Hadoop – your data isn’t that big: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
BBC News – Crime Prediction and Terrorist Tracking using Big Data: http://www.bbc.co.uk/news/technology-22008497
What the Frack: U.S. Energy Prowess With Shale, Big Data Analytics: http://www.wired.com/insights/2014/01/big-data-analytics-can-deliver-u-s-energy-independence/
Nate Silver’s Election Predictions a Win for Big Data, The New York Times: http://adage.com/article/campaign-trail/nate-silver-s-election-predictions-a-win-big-datayork-times/238182/