Begin­ning April 26th we intro­duced a new, improved method­ol­ogy for the way we han­dle high car­di­nal­ity reports in Site­Cat­a­lyst and Dis­cover. This change greatly improves your abil­ity to under­stand the sig­nif­i­cant trends in your data. But before I dive into the nature of these changes and how they work, I want to give you a lit­tle background.

What is High Cardinality?

For the purposes of this blog post, a report has high cardinality when a "high number" of distinct values are passed in for a given variable within a specific time frame. The variable may be page name, a prop, an eVar or any other standard or custom SiteCatalyst reporting dimension. For instance, if you pass millions of page names into SiteCatalyst each month, the page name variable has high cardinality. If you pass in millions of search terms via a Custom Traffic Variable (prop) each month, that variable also has high cardinality. There's no specific line in the sand where a variable becomes highly cardinal per se, but for historical reasons we'll state that any time a variable has more than 500,000 unique values in a month, the variable has high cardinality.

What does the data typ­i­cally look like?

The graph in the fig­ure below depicts a fic­tional (but real­is­tic) exam­ple of the typ­i­cal nature of high car­di­nal­ity vari­ables. This exam­ple uses the page names variable.

In this exam­ple, a cus­tomer passed in 1.1 mil­lion page names in a given month with a total of 560 mil­lion page views. I grouped the page names in buck­ets of var­i­ous sizes based on the num­ber of page views per page.  The blue bars rep­re­sent the count of page names for each bucket. For exam­ple, roughly 500 thou­sand of the 1.1 mil­lion page names had only one or two page views dur­ing the month. Like­wise, about 19 thou­sand pages had 1,000 or more page views.

The green line rep­re­sents the total page views for each bucket. You can see that the 19 thou­sand page names with greater than 1,000 page views accounted for 540 mil­lion of the 560 mil­lion total page views for the month! The other buck­ets accounted for the remain­ing 20 mil­lion page views.
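If you want to look at your own data this way, the grouping can be reproduced with a few lines of code. The following is only a minimal sketch: the bucket edges, the function names, and the sample data are my own illustrative assumptions, not the exact buckets used in the chart.

```python
from collections import Counter

def bucket_label(page_views: int) -> str:
    """Map a page's monthly page-view count to a bucket (edges are illustrative)."""
    if page_views <= 2:
        return "1-2"
    if page_views < 10:
        return "3-9"
    if page_views < 100:
        return "10-99"
    if page_views < 1000:
        return "100-999"
    return "1000+"

def summarize(views_by_page: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {bucket: (number of page names, total page views)}."""
    pages = Counter()   # the "blue bars": how many page names fall in each bucket
    views = Counter()   # the "green line": total page views per bucket
    for count in views_by_page.values():
        label = bucket_label(count)
        pages[label] += 1
        views[label] += count
    return {label: (pages[label], views[label]) for label in pages}

# Made-up sample data: two big rocks, one medium rock, two grains of sand.
sample = {"home": 5000, "search": 1200, "promo": 40, "old-page-1": 1, "old-page-2": 2}
summary = summarize(sample)
```

Even in this tiny sample the pattern is the same: a couple of pages carry almost all of the page views, while the low-traffic pages are as numerous but contribute next to nothing.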

I like to use the anal­ogy of big rocks, medium-sized rocks, and sand. The items on the left-hand side of the chart are the big, impor­tant rocks. These are the pages you pay par­tic­u­lar atten­tion to. The items on the right-hand side of the chart are the sand. There are many, many pages in this group and each page has very lit­tle traf­fic.  The medium-sized rocks fall some­where in between.

What causes high cardinality?

High car­di­nal­ity can be caused by a num­ber of factors:

  1. Sometimes it is the nature of the data. High cardinality may be a natural by-product of the type of data you are collecting in the variable. For example, if you are collecting customer IDs in a prop, you may have millions of customers to track in a given month. Likewise, if you are using a variable to collect search terms, you may have millions of distinct search terms that are used during the month.
  2. Often a vari­able has unnec­es­sary car­di­nal­ity. Unnec­es­sary or sur­plus car­di­nal­ity can be caused in a num­ber of ways. Poor/outdated imple­men­ta­tions, lack of or non-adherence to nam­ing stan­dards, CMS prob­lems, insuf­fi­cient pre-processing of data and other fac­tors can all con­tribute to unwanted car­di­nal­ity. Page names are par­tic­u­larly sus­cep­ti­ble to this prob­lem, espe­cially if you are using the URL as the page name. Query string para­me­ters and ses­sion vari­ables in a URL can quickly cause extreme car­di­nal­ity and make the data very dif­fi­cult to interpret.
  3. Some­times Site­Cat­a­lyst itself con­tributes to high car­di­nal­ity. His­tor­i­cally, traf­fic vari­ables such as props have been treated as case-sensitive vari­ables in Site­Cat­a­lyst. If you pass in the val­ues “Home”, “home”, “HOME”, and “HomE” into a prop, these are con­sid­ered four sep­a­rate line items in Site­Cat­a­lyst reports. This cre­ates unnec­es­sary car­di­nal­ity. eVars, on the other hand, are case-insensitive (i.e. case is ignored.) “Home”, “HOME”, “home”, and “HomE” are con­sid­ered a sin­gle line item in reports and their met­rics are aggre­gated together.

What is the impact of high cardinality?

High car­di­nal­ity can have unwanted side-effects. First, reports and searches are slower, espe­cially in Site­Cat­a­lyst 14 and ear­lier.  But per­haps more impor­tantly, unnec­es­sary high car­di­nal­ity can make it dif­fi­cult to inter­pret and use the data. It’s hard to take action when there is too much sand in the data. Trends become less mean­ing­ful and pre­dic­tive mod­el­ing won’t work. It is impor­tant to keep the level of gran­u­lar­ity where you can make the best use of the data.

How have we han­dled this in the past? Enter “Uniques Exceeded.” 

Because of the negative impact on reporting performance, V14 and previous versions of SiteCatalyst have historically limited the number of reportable line items that can show up in reports during a given month. The historical algorithm works this way:

At the begin­ning of the month, all incom­ing val­ues (big-traffic, medium-traffic, and low-traffic val­ues) flow into reports:

Later in the month when the car­di­nal­ity of the report reaches a pre-determined thresh­old (500,000 items by default), all new incom­ing val­ues (regard­less of pop­u­lar­ity) are fun­neled into a sin­gle bucket called “Uniques Exceeded” :
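The behavior described above can be sketched in a few lines. This is a simplified illustration rather than the production code: the function name `legacy_report` and the sample hits are hypothetical, and the limit is set to 2 instead of 500,000 so the effect is easy to see.

```python
from collections import Counter

UNIQUES_EXCEEDED = "Uniques Exceeded"

def legacy_report(hits, limit=500_000):
    """Count instances per value; once `limit` line items exist, new values overflow."""
    line_items = Counter()
    overflow = 0
    for value in hits:
        if value in line_items or len(line_items) < limit:
            # Known value, or still room for a new line item.
            line_items[value] += 1
        else:
            # New value after the limit: buried regardless of popularity.
            overflow += 1
    report = dict(line_items)
    if overflow:
        report[UNIQUES_EXCEEDED] = overflow
    return report

# "launch-page" is a big rock, but it arrives after the limit has been reached.
hits = ["home", "search", "home"] + ["launch-page"] * 1000
report = legacy_report(hits, limit=2)
```

In this run "launch-page" never appears as its own line item, even though it has far more traffic than anything else in the report.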

This Uniques Exceeded algo­rithm achieves the goal of keep­ing report­ing speed rea­son­able but has the nasty side-effect that impor­tant “big rocks” in your data that hap­pen to come along late in the month are buried in the Uniques Exceeded bucket. You can’t see them as indi­vid­ual line items in your reports, at least as far as Site­Cat­a­lyst and Dis­cover are con­cerned. Boo! For­tu­nately Data Ware­house (with a few excep­tions) stores all the unique val­ues so you can find the late-in-the-month big rocks that way, but Data Ware­house reports take much longer to gen­er­ate than do reports in Site­Cat­a­lyst or Discover.

What can be done?

Some­times the best way to rid your­self of unnec­es­sary car­di­nal­ity is to improve your imple­men­ta­tion. From time to time you should take a look at the val­ues you are cap­tur­ing in high car­di­nal­ity reports and ask your­self the ques­tion, “What actions do I want to take with this data? What ques­tions do I need to answer? What level of gran­u­lar­ity or car­di­nal­ity will best help me achieve these goals?” VISTA rules can help you clean up data before it is sent in for report­ing. For exam­ple, you could use VISTA to strip query string para­me­ters from a URL.
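To illustrate the kind of cleanup meant here, the sketch below strips the query string and fragment from a URL before it is used as a page name. VISTA rules are not written in Python, so treat this purely as a picture of the transformation; the function name `strip_query` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    # Keep scheme, host and path; drop the query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

raw = "http://example.com/products/list?sessionid=abc123&sort=price#top"
clean = strip_query(raw)
```

Every visitor with a different session ID would otherwise produce a different page name; after the cleanup they all collapse into one line item.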

For the remain­der of this post though I will focus on the prod­uct enhance­ments we have just intro­duced that improve the usabil­ity of high car­di­nal­ity data.

How does the new “high uniques” algo­rithm work?

The new high uniques algo­rithm (which applies to both V14 and V15) works as follows:

At the begin­ning of the month, all incom­ing val­ues (big-traffic, medium-traffic, and low-traffic) flow into reports:

Later in the month when the car­di­nal­ity of the report reaches a pre-determined low thresh­old, all low-traffic val­ues are fun­neled into a sin­gle bucket:

As the month pro­gresses, if the car­di­nal­ity of the report reaches a higher thresh­old, we begin fun­nel­ing some medium-traffic val­ues into the bucket.

If you think about this in the con­text of the chart at the begin­ning of this post you’ll see the net result is that much of the sand will be grouped together. You likely will never reach the high thresh­old. But most impor­tantly, the “big rocks” in your data (the left-hand side of the graph) will always show up in your reports as indi­vid­ual line items, regard­less of when they occur in the month!

The thresh­olds men­tioned above will depend on your cur­rent uniques limit:

From a tech­ni­cal per­spec­tive the algo­rithm is based on the num­ber of times a value for a par­tic­u­lar vari­able is passed in from your web site (i.e. the instances metric.) An exam­ple will illus­trate how this works.

An exam­ple

Let’s say you are using a prop to cap­ture a color, and at some time dur­ing a month the value “blue” is passed in:

  • If “blue” is passed in early in the month before the report has 500 thou­sand line items, “blue” will show up in reports regard­less of its traf­fic level.
  • If “blue” is passed in after the report already has 500 thou­sand  line items but “blue” has very low traf­fic each day for the rest of the month, “blue” will not show up as a line item in reports.
  • If “blue” is passed in after the report already has 500 thou­sand line items, has low traf­fic for a few days, and then becomes wildly pop­u­lar on the 20th of the month (for exam­ple), “blue” will start show­ing up in reports from that time for­ward. The total instances for “blue” will be slightly under­stated since it had low traf­fic for a while that was caught with the other sand, but start­ing on the 20th “blue” will show up as an indi­vid­ual line item in reports and will include the bulk of its traf­fic. It doesn’t really mat­ter how much traf­fic “blue” gets for the remain­der of the month.  Once it has seen sig­nif­i­cant traf­fic it will show up in reports for the remain­der of the month.
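The three cases above can be mimicked with a small simulation. Please read this as a rough sketch under my own simplifying assumptions, not as the actual algorithm: the class name, the single `promote_at` instance count, and the tiny thresholds are all hypothetical, and the second (higher) threshold for medium-traffic values is left out.

```python
from collections import Counter

class HighUniquesSketch:
    """Toy model: big rocks are promoted to line items even late in the month."""

    def __init__(self, low_threshold: int, promote_at: int):
        self.low_threshold = low_threshold  # cardinality at which filtering begins
        self.promote_at = promote_at        # assumed instance count needed for promotion
        self.report = Counter()             # individual line items
        self.pending = Counter()            # instances seen for values not yet promoted
        self.overflow = 0                   # the low-traffic ("Uniques Exceeded") bucket

    def record(self, value: str) -> None:
        if value in self.report or len(self.report) < self.low_threshold:
            self.report[value] += 1
            return
        self.pending[value] += 1
        if self.pending[value] >= self.promote_at:
            # Significant traffic: from here on the value is its own line item.
            self.report[value] += 1
            del self.pending[value]
        else:
            # Still low traffic: this instance stays with the sand.
            self.overflow += 1

tracker = HighUniquesSketch(low_threshold=2, promote_at=3)
for value in ["red", "green"]:          # the report fills up early in the month
    tracker.record(value)
for _ in range(7):                      # "blue" arrives late, then becomes popular
    tracker.record("blue")
tracker.record("teal")                  # a late value that never takes off
```

Here "blue" is reported with 5 of its 7 instances, because the first two were caught with the sand before it was promoted, while "teal" stays in the overflow bucket.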

Typ­i­cally you will see that any value with a hun­dred or more instances will show up in reports, although this depends a bit on the dis­tri­b­u­tion of instances per dis­tinct vari­able value. Most impor­tantly, high-traffic val­ues will always come through to reports regard­less of when they occur in the month.

Is this the death of Uniques Exceeded?

Will the term “Uniques Exceeded” still show up in your reports?  The answer, in the short term, is yes. But using the new high uniques algo­rithm it will be far less likely in most cases that Uniques Exceeded will show up in the top items in your report for date ranges after April 26. Given that data prior to April 26 used the old algo­rithm we decided to leave “Uniques Exceeded” as the name of the over­flow bucket for now. Even­tu­ally we will rename this line item to “(Low-traffic)”.

Note that the coun­ters start over at the begin­ning of each month, so most of you will start to see the net impact of the new algo­rithm through­out the month of May.

Case-insensitive traf­fic variables

Beyond the new “high uniques” algo­rithm, on April 26th we intro­duced the con­cept of case-insensitivity for traf­fic vari­ables for all new report suites. That is, case will be ignored. This impacts the fol­low­ing reports:

  • all props, page name, chan­nel, server, cus­tom links, down­load links, exit links

Using the exam­ple cited ear­lier, if the val­ues “Home”, “home”, “HOME”, and “HomE” are passed into a prop they will show up as one line item in reports (usu­ally the first ver­sion that was passed in dur­ing the month.) The met­rics for all four ver­sions will be aggre­gated together. Data Ware­house will use the all-lowercase ver­sion (“home”). The post col­umn in data feeds will also use the all-lowercase version.
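As a rough illustration of this aggregation, the sketch below groups values by their lowercase form and labels each group with the first spelling seen. The function name is hypothetical, and the "first spelling wins" rule is a simplification of the "usually the first version" behavior described above.

```python
def aggregate_case_insensitive(values):
    """Aggregate instances ignoring case; label each group by its first spelling."""
    counts = {}    # lowercase key -> number of instances
    display = {}   # lowercase key -> first spelling seen
    for value in values:
        key = value.lower()
        display.setdefault(key, value)
        counts[key] = counts.get(key, 0) + 1
    return {display[key]: n for key, n in counts.items()}

line_items = aggregate_case_insensitive(["Home", "home", "HOME", "HomE", "About"])
```

Four case-sensitive line items become one, with their instances added together.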

Later in the year we plan to add the option to enable case-insensitivity to exist­ing traf­fic vari­ables in exist­ing report suites.

Con­clu­sion

I hope you have found the con­tents of this post help­ful. The new fea­tures I have described will greatly improve the value of reports with high car­di­nal­ity. You will no longer need to worry about high-traffic val­ues that don’t hap­pen until late in the month!

I’d love to hear your feed­back. Please feel free to post your com­ments below.

Post­script

I neglected to men­tion a nuance that cer­tain “Traf­fic Sources” reports in Site­Cat­a­lyst V14 are lim­ited to 25,000 unique items per day when the report is run with traf­fic met­rics. This applies to the fol­low­ing reports: Domains, Search Key­words (includ­ing Paid & Nat­ural), the Search Key­words by URL break­down, Refer­ring Domain, Refer­rer Type, and Referrers. The new “high uniques” algo­rithm does not impact this Site­Cat­a­lyst 14 lim­i­ta­tion. Site­Cat­a­lyst 15 does not have this limitation.
