Date: 2022-07-03 Page is: DBtxt001.php txt00008596

Dr. Vincent Granville

From the trenches: 360-degree data science


Peter Burgess


This is data science from the trenches: both a case study and a tutorial for data scientist candidates. Here I illustrate how gut feelings, carefully selected data (rather than granular data), a full understanding of the business (horizontal knowledge), high-level vision, and outsourcing (to make data science almost free) combine to make a data science project successful. I also share with you the data set used in this project: the top 5,000 webpages from our network in the first few weeks of 2014, with rather detailed metrics; the time period is less than 3 months, but more than 1 month.

The goal here is to assess the effectiveness of our Google advertising campaigns, and how to better shift and optimize our traffic sources. This data science analysis is about our own Data Science Central websites and business model, but you will learn trends that apply to many other businesses, regarding Google, LinkedIn, Facebook, Twitter and Google+ traffic.

[Figure: example of a complex optimization problem, with multiple maxima]

1. Fun facts

Here are a few fun facts, as an appetizer:

  • The page with the 2nd highest time spent per pageview (out of 41,009 pages with at least 1 visit during the time period in question) is the blocked members page, with visitors spending an average of 40 minutes on the page. It's indeed an outlier, but not an error: our software engineer (a contractor working remotely in Eastern Europe) regularly spends 40 minutes on this admin page to detect fake profiles and spammers, and ban them. This page accounts for about 0.1% of all time spent on the network. The 1st page is an outlier (a data glitch resulting in unusually long visits).

  • The total time spent by all 300,000 users during the time period in question is 4.14 years; that's about 1.5 minutes per user.

  • We did not know that we had so many active pages, though many pages are counted twice: one for the web version, one for the mobile version.

  • The front page accounts for only 8.6% of all page views (not surprising that this number is so low, since we have more than 40k active pages). The front page accounts for 6.8% of all time spent on the network.

  • The top 5,000 pages account for 88% of all page views; the top 111 pages account for 50% of all page views.
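
Concentration figures like the ones in the last bullet can be computed from a plain list of page-view counts. The following is a minimal sketch, not the method used for this article: the function names and the toy numbers are made up for illustration, and the real data set would be loaded from the shared spreadsheet instead.

```python
def top_n_share(page_views, n):
    """Fraction of all page views captured by the n most-viewed pages."""
    ranked = sorted(page_views, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


def pages_needed_for_share(page_views, target):
    """Smallest number of top pages whose combined views reach `target` (0..1)."""
    ranked = sorted(page_views, reverse=True)
    total = sum(ranked)
    running = 0
    for count, views in enumerate(ranked, start=1):
        running += views
        if running >= target * total:
            return count
    return len(ranked)


# Hypothetical toy data, NOT the article's data set.
views = [500, 300, 100, 50, 30, 20]
print(top_n_share(views, 2))                # share held by the two biggest pages
print(pages_needed_for_share(views, 0.85))  # pages needed to reach 85% of views
```

On the real report, `top_n_share(views, 5000)` and `pages_needed_for_share(views, 0.5)` would be the two quantities quoted in the bullet above.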

2. Data science with no data scientist, no data silo

The data scientist in charge of this study is the co-founder of the company. As a lean start-up with zero employees (generating 10 times more revenue, with an 80% margin, than my previous money-losing VC-funded startup that had 20 employees), we just don't hire a data scientist, though we have the money to do so if we wanted. Instead, an executive familiar with all business aspects (me) spends 5% of his time on this type of investigation, and the money saved on a data scientist goes to profit sharing. Many executives nowadays, especially in technology companies like ours, have strong analytic acumen (and background) and can do just as we do.

All the data collected comes from vendors: reporting is 100% outsourced and costs little to nothing. We get

  • detailed traffic numbers from Google Analytics (we'll share with you some exciting data),

  • newsletter statistics: open rate, open rate by ISP (we created segments to track this data) and clicks broken down per link. These stats are provided by our vendor VerticalResponse.

  • statistics from external sources: and URL shorteners, providing info on outbound clicks. We've done some A/B testing comparing these two sources and came to the conclusion that both are subject to the same massive errors (traffic inflation for traffic coming from newsletters, with no referrer).

  • data from our clients: number of leads attached to a particular campaign, and number of qualified leads in some cases,

  • DoubleClick numbers: impressions for each banner ad.

There's no silo: one guy (me) has access to, understands, and blends all the data, including financial data (revenue, costs) broken down per product.

Here we will focus on Google Analytics data, as well as financial data, at a rather high level. The roles of data scientist and business analyst overlap here: one person wears both (and many more) hats, saving a lot of money in payroll to allow us to better compete with other, over-staffed companies.

3. No need to create your own big data: instead, leverage external big data, at no cost

I've worked only with the top 5,000 webpages but it would have been very easy to download the Google Analytics data for the 41,009 active pages. The reason to work with 5,000 is to get you familiar with sampling, and prove that you can get great results with just sample data. This being a tutorial part of our data science apprenticeship, it is important that you get familiarized with sample data.

Regarding our Google advertising campaigns, we selected keywords suggested by Google itself rather than doing our own time-consuming research. This is actually leveraging Google's massive multi-billion database of priced keywords, at no cost. A next step would be to customize bids per keyword and ad group, but I believe it won't provide much added value. Finding the top 20 keywords that need customized bids and optimizing them is good enough. More than that, and we might be doing data science that produces negative ROI after factoring in the cost of data science. Of course it's a different story if you are eBay and manage 10 million keywords.

Another weakness is that we don't yet track conversions (new members) in Google Analytics. But we have a pretty good idea about conversions, and we even created a metric called value, attached to each web page in the data set that we share with you in the last section. We will soon track conversions in Google Analytics, as this will help us drop poor-performing keywords: it is worth the effort (it also helps Google optimize our campaigns for conversions rather than for clicks, a much better solution).

I bought a book on Google AdWords (cost $50) and learned one great thing: how to set up display campaigns where your ads show up only on websites that you have selected (such as our competitors that accept Google ads, or other data science websites). This saved me a lot of money compared with attending classes or hiring a Google expert. Also, we might use the services of an SEM/SEO company in the future, but again it will be outsourced (vendor relationship). And for now, since our network is sitting on the Ning platform (saving sys admin and server costs), we automatically benefit from Ning's SEO efforts. The reason I mention this is to show how analytic thinking and gut feeling help decide how deep you want to go with data science. As a small, lean start-up, we don't want to over-spend, and we have a pretty good idea of when we spend too much (for instance, if all this activity eats more than 20% of our budget, unless it boosts total revenue). One of the nice things is that all reporting activities are automated.

4. Interpreting results, transforming insights into actions

Our problem is complex. We don't have a dollar amount attached to a conversion, and in general, we don't charge clients by number of impressions or clicks: we typically offer fixed fees with guaranteed numbers in terms of leads, impressions or clicks.

We want to keep ad spend below 10% of our budget (our margin is currently 80%). Currently, ad spend is about 4% of our gross revenue. We can't easily increase this 4% figure, because to get more traffic, we would need to increase our bids, which would eventually generate negative ROI. It is important that you know your break-even point. For us, the maximum cost of acquisition of a conversion (to maximize revenue) has not been fully calculated yet, but it is below $10; in other words, $3 per Google paid click maximum.
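
The link between a cap on cost per conversion and a cap on cost per click can be sketched with a one-line model: cost per conversion = cost per click / conversion rate. This is only an illustration under that simplifying assumption. The function names are hypothetical, and the 30% click-to-conversion rate shown below is merely what the $10 and $3 figures would imply under this model, not a number reported in the article.

```python
def max_cost_per_click(max_cost_per_conversion, conversion_rate):
    """Highest CPC that keeps the cost of one conversion at or below the cap.

    Assumes cost per conversion = CPC / conversion_rate (a simplification).
    """
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return max_cost_per_conversion * conversion_rate


def implied_conversion_rate(max_cost_per_conversion, max_cpc):
    """Click-to-conversion rate implied by a conversion cap and a CPC cap."""
    if max_cost_per_conversion <= 0:
        raise ValueError("max_cost_per_conversion must be positive")
    return max_cpc / max_cost_per_conversion


# Figures from the text: $10 cap per conversion, $3 cap per paid click.
print(round(implied_conversion_rate(10.0, 3.0), 2))
# Assumed conversion rate of 30% (an inference, not a measured value).
print(round(max_cost_per_click(10.0, 0.30), 2))
```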

The impact of ad spend on our traffic (page views) is small: less than 3%. But it is much bigger on conversions (our main source of revenue), accounting for 25% of all conversions, and it diversifies our conversion sources to minimize risks. An easy way to measure the 25% is to use a different landing page for each source (combined with proper tagging of the conversion URL) so that we can identify the origin of the conversion (Google AdWords, direct traffic, LinkedIn, etc.). Or you can turn Google AdWords on and off and see the impact.
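
One common way to tag URLs per source is with the `utm_source`, `utm_medium` and `utm_campaign` query parameters that Google Analytics recognizes. The sketch below assumes that convention. The helper names, the `example.com` URLs and the counts are hypothetical and are not taken from our setup.

```python
from collections import Counter
from urllib.parse import parse_qs, urlencode, urlparse


def tagged_url(base_url, source, medium, campaign):
    """Append UTM campaign parameters to a landing-page URL."""
    params = {"utm_source": source, "utm_medium": medium, "utm_campaign": campaign}
    separator = "&" if urlparse(base_url).query else "?"
    return f"{base_url}{separator}{urlencode(params)}"


def conversions_by_source(conversion_urls):
    """Count conversions per utm_source; untagged URLs go in their own bucket."""
    counts = Counter()
    for url in conversion_urls:
        query = parse_qs(urlparse(url).query)
        counts[query.get("utm_source", ["(untagged)"])[0]] += 1
    return counts


def share_from_source(counts, source):
    """Fraction of all counted conversions that came from `source`."""
    total = sum(counts.values())
    return counts[source] / total if total else 0.0


# Hypothetical conversion log: one AdWords conversion out of four.
log = [
    tagged_url("https://example.com/join", "adwords", "cpc", "spring"),
    tagged_url("https://example.com/join", "linkedin", "social", "spring"),
    tagged_url("https://example.com/join", "newsletter", "email", "spring"),
    "https://example.com/join",
]
counts = conversions_by_source(log)
print(share_from_source(counts, "adwords"))
```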

Note that we purchase mostly US / UK / Canada / Australia traffic and avoid midnight to 4am traffic, in order to increase the quality of the paid traffic that we receive from Google: this is another way to leverage a vendor's (Google) big data capabilities without incurring big data costs. Indeed, now our Google paid traffic is better than our Google organic traffic, as it is well targeted and focused on driving traffic to the conversion page.


For every $100 of revenue that we make, $15 comes from impressions (page views) and $70 is linked in some way to the number of active subscribers and members. Google ad spend eats $4 (of this $100) but produces only 2% of impressions. We haven't done survival analysis to assess how many page views a user generates over his lifetime, broken down by acquisition channel. Plus, attribution modeling would suggest that some of the new users coming from Google ads would still be acquired through a cost-free channel if we did not use any Google advertising.

Nevertheless, it is clear that Google ad spend has negative ROI with respect to page views, but the total dollar amount is small. Since we don't operate in silos, we also check the impact on conversions (subscribers, members). We estimate that 25% of our new members come from Google ads, that is, 25% of the $70 in revenue can be attributed to Google ads (though some would join via a free channel if we did not advertise, and users acquired through Google ads have higher churn - just a wild guess). So I'll reduce the 25% to 15%. In short, $10.50 (15% of the $70 revenue) costs us $4 (Google ad spend). Thus, Google ad spend works for us, and we can even increase our CPC and budget.
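
The arithmetic in this paragraph can be checked in a few lines. This is a minimal sketch using only the figures quoted above (per $100 of revenue); the function names are illustrative.

```python
def attributable_revenue(revenue_pool, attributed_share):
    """Revenue credited to a channel, given its share of a revenue pool."""
    return revenue_pool * attributed_share


def return_per_dollar(revenue, cost):
    """Revenue generated per dollar spent."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return revenue / cost


MEMBER_REVENUE = 70.0  # dollars per $100 of revenue linked to subscribers/members
AD_SPEND = 4.0         # dollars per $100 of revenue spent on Google ads

# Raw estimate (25%) and the discounted estimate used in the text (15%).
for share in (0.25, 0.15):
    revenue = attributable_revenue(MEMBER_REVENUE, share)
    print(share, round(revenue, 2), round(return_per_dollar(revenue, AD_SPEND), 3))
```

With the discounted 15% share, the $4 of ad spend is credited with $10.50 of revenue, about 2.6 dollars back per dollar spent.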

However, the situation is more complicated than it seems at first glance. Getting more traffic makes sense if we get more revenue. We could increase the fee for our services (email blasts) if we deliver to more subscribers, resulting in more clicks and more leads for the clients. But this is not obvious: increasing prices can deter clients, and clients also have fixed budgets. We can easily get more clients, but we cannot send more than one blast per day: at some point, our inventory is fully booked. We could segment our member database and send more blasts to smaller, more targeted groups of people. That's the way to go to grow revenue along with traffic. Another way is to reach an equilibrium, have our company run on auto-pilot, and start another one (maybe a community for astronomers) and then another one. We are definitely contemplating this option.

Note: Google Analytics reports contain a column called Page Value, based on conversion and revenue per page, for each of the 40,009 active pages. We don't track conversions yet in Google Analytics, but we've found a good proxy for page value, using two other columns from Google Analytics reports, Entrances and % Exit. Then Page Value = Entrances * (2 - % Exit). Entrances is the number of times the page in question is an entrance page; % Exit is the proportion of times it's an exit page. The factor 2 - % Exit counts 1 page view for the entrance itself, plus 1 - % Exit for the visits that continue. If Entrances = 1,000 and % Exit = 30%, you can expect at least 0.7 extra page views (0.7 = 1 - 30%) after the entrance, providing a conservative page value of 1,000 * (2 - 0.30) = 1,700 single page views attributable to the page in question.
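
The proxy is easy to apply to every row of the report. Here is a minimal sketch of the formula above, with the exit rate expressed as a fraction between 0 and 1; the function name is illustrative.

```python
def page_value(entrances, exit_rate):
    """Proxy page value: Entrances * (2 - % Exit), with exit_rate as a fraction."""
    if not 0.0 <= exit_rate <= 1.0:
        raise ValueError("exit_rate must be a fraction between 0 and 1")
    return entrances * (2.0 - exit_rate)


# Worked example from the text: 1,000 entrances, 30% exit rate.
print(round(page_value(1000, 0.30)))  # 1700
```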


Based on this analysis, we decided to:

  • Identify, among the top 150 pages, those that are too often exit pages. Add links to keep visitors on the website. Expected impact: 5% increase in total page views.

  • Increase ad spend (both number of clicks and cost per click) by 20%. The impact on total page views will be negligible, but the impact on conversions will be a 5% boost (that is, 100 extra new members per month - worth the extra cost).

  • Add conversion tracking in Google Analytics, and optimize for conversions rather than clicks.
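
The first action item amounts to a simple filter on the report. The sketch below assumes each page is a dict with hypothetical keys `path`, `page_views` and `exit_rate`. The 50% cut-off for "too often an exit page" is an arbitrary assumption for illustration, since no threshold is stated above.

```python
def frequent_exit_pages(pages, top_n=150, exit_threshold=0.5):
    """Among the top_n pages by page views, return paths whose exit rate
    exceeds exit_threshold (a fraction between 0 and 1)."""
    top = sorted(pages, key=lambda p: p["page_views"], reverse=True)[:top_n]
    return [p["path"] for p in top if p["exit_rate"] > exit_threshold]


# Hypothetical rows, NOT taken from the shared data set.
report = [
    {"path": "/", "page_views": 9000, "exit_rate": 0.20},
    {"path": "/blog/a", "page_views": 4000, "exit_rate": 0.70},
    {"path": "/blog/b", "page_views": 3000, "exit_rate": 0.40},
    {"path": "/blog/c", "page_views": 10, "exit_rate": 0.90},
]
print(frequent_exit_pages(report, top_n=3))
```

Here `/blog/c` has the highest exit rate but is ignored because it falls outside the top 3 pages by page views.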

Future steps will involve automated content syndication and content mix optimization. In particular, detecting how to optimize the following mix:

  • very few extremely popular articles with a very long lifetime (> 25,000 page views), maybe bringing in external bloggers to have more of these

  • a few very popular articles like this one that you are reading right now (> 10,000 page views)

  • some popular articles with > 2,000 page views

  • tons of good articles (salary surveys, new books, new training; each generating > 300 page views)

  • other articles that are less successful

  • forum questions

5. Year-to-year comparison

The year-over-year tab in the spreadsheet (see next section) shows spectacular growth (> 80%) in incoming traffic, for both Google organic traffic and direct traffic (driven by email campaigns). Google organic and direct traffic together represent 66% of the visits: 33% for Google organic (a perfectly normal number, especially since we don't do any SEO) and 33% for direct traffic (quite good, with growth driven by membership growth after factoring in churn). LinkedIn, although its traffic is of better quality, with more page views per visit, is barely growing, which is good since our reliance on LinkedIn was too high in the past, representing a risk. Twitter is very promising and will eventually surpass LinkedIn in terms of share of incoming traffic. Our Twitter advertising campaigns contribute to this shift. Facebook and Google+ bring modest contributions, and we don't expect spectacular growth from these traffic sources, though Facebook advertising has gotten better over time (less fake traffic, more reasonable CPC).

Finally, we've noticed that LinkedIn and Google organic traffic sources are negatively correlated. The more we get from LinkedIn (by posting on LinkedIn), the less we get from Google, as the LinkedIn links to our articles show up above our internal links, on Google. This is actually an incentive for us to either do better SEO (to beat LinkedIn and the fact that Google wrongly attributes our articles to LinkedIn), or to post less on LinkedIn. We've chosen the latter. However, our posts on LinkedIn get re-tweeted or re-posted outside LinkedIn, resulting mostly in direct traffic to our website. We haven't quantified the amount of traffic indirectly generated via LinkedIn, but it might represent 10% of our direct traffic (based on some statistics). The same applies to Facebook, Google+ and Twitter, but not to Google organic or paid traffic.
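
A negative correlation between two traffic sources can be checked with a plain Pearson coefficient on per-period visit counts. The following is a minimal sketch; the weekly numbers are invented purely to show the call and are not our traffic data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly visits, for illustration only.
linkedin_visits = [1200, 900, 1500, 700, 1100]
google_organic_visits = [3100, 3500, 2800, 3700, 3200]
print(pearson(linkedin_visits, google_organic_visits) < 0)
```

A coefficient close to -1 would support the pattern described above, while a value near 0 would not. Correlation alone does not establish that posting on LinkedIn causes the drop.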

6. Get the data set

Click here to download the Google Analytics report, with traffic metrics for the 5,000 top pages and estimates for all 40,009 active pages on our websites, during the time period in question.

7. Other links

  • Source code for our Big Data keyword correlation API
  • Great statistical analysis: forecasting meteorite hits
  • Fast clustering algorithms for massive datasets
  • 53.5 billion clicks dataset available for benchmarking and testing
  • Over 5,000,000 financial, economic and social datasets
  • New pattern to predict stock prices, multiplies return by factor 5
  • 3.5 billion web pages
  • Another large data set - 250 million data points - available for do...
  • 125 Years of Public Health Data Available for Download
  • Two big datasets to challenge your data science expertise
  • Data Science Certification
  • Update about our Data Science Apprenticeship
  • Our Wiley Book on Data Science
  • Data Science Articles
  • Our Data Science Weekly Newsletter

Comment by Vincent Granville on April 13, 2014 at 11:10am

I did run a global optimization model on my data, but not with a computer, mostly with my brain instead. That's the takeaway of this story - you can do great optimization with little (or no) data, if you have vision and data/business acumen - and manage to get a little data, just the right data you need for optimization. Why top 150 pages? Why not top 200? It's an arbitrary choice, but it required 2 seconds of my time to make the decision based on domain expertise. Maybe top 200 is better, but if you spend a whole day deciding on whether it should be 200, 375, 100 or whatever ideal number rather than 150, you've wasted $4,000 of your time to get an improvement (over my arbitrary 150) worth less than $2,000. Another takeaway of my story - the fact that domain expertise, analytics acumen and good judgment more than compensate for using light or no analytics.

Comment by Thia Kai Xin on April 12, 2014 at 5:15pm

Quick comments:

  • The multi-maxima graph is misleading: unless you really ran a global optimization model on your data and that was the graph you got, it seems unnecessary. On the other hand, simple bar charts to summarize your findings should make your message clearer.

  • I find it hard to follow the numbers and understand the key takeaways, especially from the 'interpretation' and 'insight' section.

  • How were the 'actions' derived? In particular, why were those numbers chosen: Why 150 top pages, not 100 or 200? Why a 5% increase in total page views, not 10%? Why is a 5% conversion boost worth the extra cost?

  • Why not investigate the topics / articles that people spent time on (besides the blocked members page and the outlier)? More time spent on articles leads to greater interaction and possibly a larger click-through rate. If you can find a cluster of topics that interest people and increase the number of articles related to those topics, it might have a larger effect than some of the estimations made in this article.

Posted by Vincent Granville
on March 27, 2014 at 10:30am
TrueValueMetrics (TVM) is an Open Source / Open Knowledge initiative. It has been funded by family and friends. TVM is a 'big idea' that has the potential to be a game changer. The goal is for it to remain an open access initiative.
The information on this website may only be used for socio-enviro-economic performance analysis, education and limited low profit purposes
Copyright © 2005-2021 Peter Burgess. All rights reserved.