Introduction!

My Interest

I am very interested in the way big data is being used to answer questions that have so far resisted quantitative analysis. For example, big data can help organizations better understand their customers' behaviors and preferences. I am particularly inspired by the work of theoreticians such as Viktor Mayer-Schönberger and Kenneth Cukier, whose Big Data: A Revolution That Will Transform How We Live, Work, and Think has truly transformed my own thinking about myself and our society. Their discussion of the growing effect of big data on every field of study, and on everyday beliefs, depicts an ideological and sociological evolution that has already begun. I yearn to participate fully in these sweeping changes, and the ideal next step is Quantitative Social Sciences with a Concentration in Data Science.

To take just one example, I am interested in exploring the factors that contribute to youth homelessness; using hard data to shed light on social patterns may help alleviate the problems of teenagers and their families. Big data techniques can draw patterns out of rich data sets that were previously considered too large to use: Facebook, newspaper databases, telephone records, and so on. By using these sources we are able to find new types of information, new patterns and sometimes surprising insights.

My Thesis

As part of my M.Sc. in Management Engineering from the University of Waterloo, I wrote a thesis on “Detecting Weak Signals by Internet-Based Environmental Scanning.” This was an opportunity to apply data mining, computer tools and human judgement to predict the market potential of a new product. I used both programming and human analysis to retrieve 40,000 HTML pages, analyze the data, and produce information that was relevant for the strategic marketing department of Christie Digital. The retrieved information was useful to the company’s executives, and contributed to the proof of concept of a disruptive product idea. What was interesting to me was that we were able to blend human judgement with programming techniques to understand and predict the behavior of the target group of customers. Rather than following preconceived notions of what the relevant factors would be, we allowed the data to lead us, and the results were fascinating.

To succeed in my thesis, I used a novel methodology that enabled me to consolidate data from a large sample of web pages and convert it into actionable information. The goal was to extract weak signals and find behavior patterns in the 40,000 retrieved HTML pages, so it was essential to cluster the pages according to their content. I chose a combination of AppleScript (in the pre-processing phase, to convert the pages from HTML to plain text), the k-means algorithm (with each cluster accounting for no more than 10% of the overall page count) and CLUTO for the clustering function. This produced clusters of pages with similar content (matching keywords), which were ready for human analysis.
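
For readers who want to see the idea in code, here is a minimal sketch of the same steps in plain Python. It is not the thesis code: Python's standard HTML parser stands in for the AppleScript conversion, a small cosine-based k-means stands in for CLUTO, and the function names, the farthest-first choice of starting centroids and the sample pages are all assumptions made for illustration. The sketch only reports clusters that exceed the 10% cap; how the cap was enforced in the thesis is not covered here.

```python
import math
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Pre-processing step: convert one HTML page to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def normalize(weights):
    """Scale a word -> weight dict to unit length."""
    norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
    return {w: x / norm for w, x in weights.items()}


def vectorize(text):
    """Bag-of-words vector of a page (hypothetical, very simple tokenizer)."""
    counts = {}
    for word in text.lower().split():
        if word.isalpha():
            counts[word] = counts.get(word, 0.0) + 1.0
    return normalize(counts)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def kmeans(vectors, k, iterations=10):
    """Cosine-based k-means; returns one cluster label per vector."""
    # Assumption: farthest-first starting centroids, to keep runs deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign every page to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Recompute each centroid as the normalized sum of its members.
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for w, x in v.items():
                        total[w] = total.get(w, 0.0) + x
            if total:
                centroids[j] = normalize(total)
    return labels


def oversized_clusters(labels, max_share=0.10):
    """Return the clusters holding more than max_share of all pages."""
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    return sorted(c for c, n in sizes.items() if n / len(labels) > max_share)


# Hypothetical sample pages, two topics.
pages = [
    "<p>projector cinema display</p>",
    "<p>cinema projector screen</p>",
    "<p>laser projector cinema</p>",
    "<p>garden soil plants</p>",
    "<p>plants garden seeds</p>",
    "<p>soil seeds garden</p>",
]
labels = kmeans([vectorize(html_to_text(p)) for p in pages], k=2)
```

With only two clusters the 10% cap cannot be met, so `oversized_clusters(labels)` flags both of them here; on a real collection the number of clusters would have to be large enough (at least ten) for the cap to be achievable.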

Samples of documents were given to readers who were experts in the company’s niche. They removed the irrelevant documents, reducing the clustered set to 20,000 pages, and then repeated this process twice to arrive at a final set of the 2,000 most relevant pages. These pages included new and relevant information that the experts and company executives had not expected to find.

Written on January 15, 2016