Introduction
Social Media can be defined as a set of channels through which people communicate in a many-to-many relationship, as opposed to the one-to-many relationship of traditional media like radio, television or magazines. Broadly speaking, Social Media consists of micro-blogging and social networking (like Twitter and Facebook), blogging (like WordPress and Blogger), forums (where users ask questions and post complaints), photo sharing (like Flickr) and video sharing (like YouTube). As Social Media has grown, so has the demand for extracting meaningful information from the raw data. To get this done, two different kinds of problems need to be solved: data collection is the first, and information retrieval the second. It is the second kind of problem that this article tries to address – the challenges faced in Social Media Data Analytics.
What data analytics are we talking about?
Let’s take a case where data for a new brand is collected from Twitter, Facebook, Blogger and YouTube. When this data is viewed in its raw form, it appears like a mosaic. A trained analyst can immediately see patterns in it and form conclusions and business decisions from it. This exercise can be carried out once in a while, but owing to its high cost it is not always feasible. Instead, a system can be built that partially emulates the work of the analyst. The system can do things like estimating the polarity of a message (finding the author’s sentiment), listing what authors in a particular country are saying about the brand, listing the popular positive and negative topics related to the brand, and even offering some fancy albeit very speculative features like guessing the age group and gender of the posters.
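As a rough illustration, the output of such a system for each message might look like the record below. This is only a sketch; the field names and types are assumptions for illustration, not part of any particular product.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnalysedMessage:
    """One analysed social media post, as such a system might emit it."""
    source: str                        # e.g. "twitter", "facebook", "blogger", "youtube"
    text: str                          # raw message text
    polarity: float                    # sentiment score, e.g. -1.0 (negative) to +1.0 (positive)
    country: Optional[str] = None      # where the author appears to be posting from
    topics: List[str] = field(default_factory=list)  # topics mentioned in the post
    age_group: Optional[str] = None    # speculative, e.g. "18-24"
    gender: Optional[str] = None       # speculative
```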
Machine Learning – Supervised or Unsupervised
Most of the challenges can be solved with standard machine learning techniques, but the problems faced here differ from typical machine learning problems. For instance, getting good labelled training data is extremely difficult, which makes a fully supervised approach hard to apply, and labelling enough data by hand is tedious and time consuming. One good solution to this problem is shown in [1], where the authors collect positive-sentiment tweets by searching for the emoticon :) and negative-sentiment tweets by searching for the emoticon :(. Once the data problem has been solved, training algorithms can then be chosen.
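A minimal sketch of this distant-supervision idea, assuming the raw tweets have already been collected into an iterable of strings (the search step itself depends on the Twitter API and is omitted here):

```python
def label_by_emoticon(tweets):
    """Assign noisy sentiment labels using emoticons, in the spirit of [1]."""
    labelled = []
    for text in tweets:
        if ":)" in text or ":-)" in text:
            label = "positive"
        elif ":(" in text or ":-(" in text:
            label = "negative"
        else:
            continue  # skip tweets without a clear emoticon signal
        # Strip the emoticon so the classifier cannot simply memorise it
        cleaned = text.replace(":)", "").replace(":-)", "").replace(":(", "").replace(":-(", "")
        labelled.append((cleaned.strip(), label))
    return labelled

# Example usage
tweets = ["loving the new phone :)", "battery died again :(", "just got home"]
print(label_by_emoticon(tweets))
```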
At this point, it is also important to note that rather than using a very complex algorithm with a long training time and high accuracy, it might be better to use a simple algorithm that trains fast, has moderate accuracy and runs quickly. An even more desirable trait is that the algorithm should adapt to user feedback on incorrectly tagged messages, and for that it has to be quick to retrain. Hence simple algorithms like Naïve Bayes and Maximum Entropy are good candidates, while most tree-based algorithms are not so well suited, as they are overly sensitive to bad data.
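For example, a simple Naïve Bayes sentiment classifier can be trained in a few lines with scikit-learn. This is only a sketch; the tiny hard-coded training set stands in for the emoticon-labelled tweets from the previous snippet.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in training set; in practice use the emoticon-labelled tweets
texts = ["loving the new phone", "battery died again", "great camera", "screen cracked already"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features + Naïve Bayes: fast to train and cheap to retrain on user feedback
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["the camera on this phone is great"]))
```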
Storage Requirements – Speed and Volume
There are systems that solve the data-volume problem very elegantly, like Google’s BigTable design and its open-source counterparts Hadoop and HBase. There are other systems that solve the speed problem through redundant storage and indexing, like RDBMSs such as MySQL. However, neither is quite suitable on its own for Social Media data storage, which needs both speedy retrieval and support for large volumes. It is quite common for such systems to be required to handle a rate of a hundred data entries per second along with their realtime retrieval. Indexing solutions like Lucene and Solr handle this quite well and are well suited for the task. Expensive RDBMSs like Oracle and Microsoft SQL Server might also solve it.
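As a rough sketch, analysed messages can be pushed to a Solr index over its HTTP update handler. The host, collection name and field names below are assumptions for illustration, not part of any particular setup.

```python
import requests

# Assumed local Solr core named "social_media"; commit immediately for realtime retrieval
SOLR_UPDATE_URL = "http://localhost:8983/solr/social_media/update?commit=true"

docs = [
    {"id": "tw-1001", "source": "twitter", "text": "loving the new phone :)", "polarity": 0.8},
    {"id": "tw-1002", "source": "twitter", "text": "battery died again :(", "polarity": -0.7},
]

# Solr accepts a JSON array of documents on its update handler
resp = requests.post(SOLR_UPDATE_URL, json=docs)
resp.raise_for_status()
print("Indexed", len(docs), "documents")
```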
Language Handling
This is not a difficult problem to solve: once a system has been prepared for English, the same can be repeated for other languages. The task is, however, tedious and large. The training models have to be prepared in such a way that for any message the language is determined first, and then the model corresponding to the detected language is applied to the message. This means that every learning model in the system should be trained for all the required languages.
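A minimal sketch of that routing step, assuming one trained model per language. It uses the langdetect package for detection; the per-language models here are placeholders.

```python
from langdetect import detect  # pip install langdetect

# One sentiment model per supported language; placeholders stand in for trained models
models = {
    "en": lambda text: "positive",   # stand-in for an English model
    "es": lambda text: "negative",   # stand-in for a Spanish model
}

def analyse(text, default_lang="en"):
    """Detect the language first, then apply the matching model."""
    try:
        lang = detect(text)
    except Exception:
        lang = default_lang              # fall back when detection fails
    model = models.get(lang, models[default_lang])
    return lang, model(text)

print(analyse("the camera on this phone is great"))
```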
Picture and Video
Getting meaningful data out of pictures and video is a desirable feature of such a system. It poses a new challenge because this kind of processing is CPU intensive and cannot be done in realtime. It can instead be handled by a separate subsystem, possibly built on a Map-Reduce architecture. YouTube has solved a similar problem wonderfully to check for copyright infringement in uploaded videos, see [2].
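A very rough sketch of such an offline subsystem: the expensive per-file analysis is the "map" step, parallelised across workers, and the aggregation of results is the "reduce" step. The analyse_media_file function below is a placeholder for whatever image or video processing is actually needed.

```python
from multiprocessing import Pool
from collections import Counter

def analyse_media_file(path):
    """Placeholder for CPU-intensive image/video analysis of one file."""
    # A real system might run object detection, logo recognition, fingerprinting, etc.
    return {"path": path, "labels": ["logo"] if path.endswith(".png") else ["clip"]}

def reduce_results(results):
    """Aggregate per-file results into overall label counts."""
    counts = Counter()
    for r in results:
        counts.update(r["labels"])
    return counts

if __name__ == "__main__":
    paths = ["brand_post_1.png", "brand_post_2.png", "review_video.mp4"]
    with Pool() as pool:                  # map step, spread over CPU cores
        results = pool.map(analyse_media_file, paths)
    print(reduce_results(results))        # reduce step
```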
In summary, Social Media data offers huge scope for innovation. A lot can be done in this field to connect social media data to real business value. We are only seeing the start of it.
References:
[1] A. Go, R. Bhayani and L. Huang, “Twitter Sentiment Classification using Distant Supervision”, http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf
[2] M. Gould Stewart, “How YouTube thinks about copyright”, TED Talk, http://www.ted.com/talks/margaret_stewart_how_youtube_thinks_about_copyright.html