Rendering schema.org microdata for content behind a paywall

 

One of the challenges with writing moments to a user’s Google+ history page is that the API requires a target URL, and the page at that URL must contain appropriate microdata for Google+ to use when working out what appears in history. For example, suppose I’m a publisher whose articles are deliberately kept behind a sign-in page. When a user reads an article, I still want to write a moment to their history, but I don’t have a readable page for Google to view. What do I do?

In this post, I will describe two approaches to this problem that I have been experimenting with. In this example scenario, either one would give me target URLs to pass to the history API. Note: although I have tested these approaches, neither is a complete solution that is right for everybody (anybody?). Consider these ideas as starting points for thinking about your content for Google+ history rather than as production-ready means of supporting target content for history.

A solution: Custom login pages with microdata synopses

In this strategy, you create a custom login page that carries a short synopsis of the article as microdata, either in the content markup or in meta tags. I’ll describe two ways of rendering the data to illustrate both a lazy approach and a more robust approach. The advantage of this strategy is that visitors who reach your site see a teaser for the content you offer to premium subscribers, which may draw them into registering or signing up for your service.

What not to do: a lazy approach to programmatic preview schema markup

In our first approach, the lazy one, we render the content dynamically from input parameters. You simply pass parameters to your login page and the page uses them to render its schema.org markup. Moments written to history carry these parameters in their target URL, so the appropriate data renders in history. I’ll warn you beforehand that this is shown primarily as a counterexample: it is insecure and illustrates what not to do. A malicious user could easily make it appear as though your site references virtually any content, simply by tricking your page into rendering it.

A reminder about the following example: do not do this. It’s a good prototype for a more comprehensive way of surfacing preview content on a login page, but it is insecure. That said, I have created (and hosted) a Perl CGI script that dynamically renders Schema.org markup for the Article type. The gist can be determined from the following snippet:

#!/usr/bin/perl
use CGI;

my $q = CGI->new;

my $img = $q->param("i");
my $title = $q->param("t");
my $description = $q->param("d");

print qq|
<!DOCTYPE html>
...
...
...

Within the quoted block, I will render some simple HTML for the schema content that is passed as parameters to the URL.

<div itemscope itemtype="http://schema.org/Article" class="hero-unit">
  <h1 itemprop="name">$title</h1>
  <p itemprop="description">$description</p>
  <img itemprop="image" src="$img">
  ...
</div>

At this point, the script dynamically renders schema.org markup based on the parameters passed to it. A sign-in and sign-up prompt is rendered on the page because the user isn’t signed in. You can see this example in practice on this page, which gives a preview of an article on fixie bikes. The keen-eyed will also notice some Twitter Bootstrap code that I used to pretty up the page. You can use the Google Webmaster tools to see how the schema is being parsed by search engines. This approach has a number of security, content, and conciseness issues that I will try to address in an improved approach.
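One of those security issues can be shown in isolation. The following is a minimal sketch, not part of the hosted script: `escape_html` and `render_preview` are hypothetical helpers, and the sample values are made up for illustration. It escapes untrusted values before placing them in the microdata, which stops script injection. It does not stop a malicious user from making the page describe arbitrary content, which is why the parameter-driven design remains a counterexample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): escape the characters
# that are significant in HTML text and in quoted attribute values.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: build the microdata block from untrusted values.
sub render_preview {
    my (%arg) = @_;
    my $title       = escape_html($arg{title});
    my $description = escape_html($arg{description});
    my $img         = escape_html($arg{img});
    return join "\n",
        qq|<div itemscope itemtype="http://schema.org/Article" class="hero-unit">|,
        qq|  <h1 itemprop="name">$title</h1>|,
        qq|  <p itemprop="description">$description</p>|,
        qq|  <img itemprop="image" src="$img">|,
        qq|</div>|;
}

# Made-up input containing markup that must not reach the page verbatim.
print render_preview(
    title       => 'Fixies <script>alert(1)</script>',
    description => 'Tom & Jerry "review"',
    img         => 'http://wheresgus.com/fixie.jpg',
), "\n";
```

Even with escaping in place, the title, description, and image are still whatever the caller supplies, so the page can be made to vouch for content you never published.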

A better approach: explicitly rendering content based on your article content

If you’re providing premium content behind a paywall or registration, you probably already have a database that contains attributes for your content. If the data is already there, why not reuse it for the preview? As an example, I have created a database with the following rows to demonstrate this:

| id | title | description | imageURL | body |
|----|-------|-------------|----------|------|
| 1 | Look a fixie | You clearly aren’t cool enough to read this article unless you sign up | http://wheresgus.com/fixie.jpg | Glorious article content that I have painstakingly authored |

As you can see, there’s an article id, title, description, imageURL, and body. A more complex example could include other interesting attributes, such as the author. Next, we will use this content from the database to render both the preview and the article itself. Starting from the Perl code, let’s see how the database (MySQL in my case) populates the variables in place of the CGI parameters. This is exactly the same as the previous example with one exception: the variables are populated from the database fields, and the content is shared between the article and the preview / paywall:

use DBI;

# Connection settings (placeholders).
my $db       = "dbname";
my $host     = "host";
my $port     = 3306;
my $username = "username";
my $pass     = "pass";

# DBD::mysql data source name in its documented key=value form.
my $dbpath = "DBI:mysql:database=$db;host=$host;port=$port";

my $artID = $q->param("id");
my $skipPaywall = $q->param("skippaywall");

my $dbh = DBI->connect($dbpath, $username, $pass
             ) || die "Could not connect to database: $DBI::errstr";

my $sth = $dbh->prepare("select title, description, imgURL, content from article where id=?");

$sth->execute($artID);
my @result = $sth->fetchrow_array();

my $title       = $result[0];
my $description = $result[1];
my $img         = $result[2];
my $content     = $result[3];

# Release the statement handle before disconnecting; only one row was fetched.
$sth->finish();
$dbh->disconnect();

print $q->header;
print $q->start_html("");

Now the variables are populated from the database as opposed to the URL. You can see the teaser page in action, and you can also pass a debug flag to test rendering the actual content.

A few important things to note in this improvement to the first example:

  • The schema that is generated for the article is consistent with the article and is also rendered within the article.  This is in the spirit of Schema.org: every entity is represented as faithfully as possible wherever that content appears.
  • The schema that is rendered is fixed to that specific content.  As such, the parameters passed to the script cannot be manipulated to generate arbitrary microdata.
  • You can still write moments to a user’s history that the history API can read and that the user can safely share if they choose to, without the risk of customers circumventing your paywall.
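The first and third points can be sketched in code. This is a minimal illustration under stated assumptions, not the hosted script: `render_article` and `escape_html` are hypothetical helpers, the `/login` link is a placeholder, and the hash keys simply mirror the variable names used in the snippet above. The same microdata is emitted for the teaser and for the full article, and only the body depends on whether the reader may see it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal HTML escaping so database values cannot break the markup.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: identical schema.org properties for both views;
# the article body is included only when $show_body is true.
sub render_article {
    my ($article, $show_body) = @_;
    my %safe = map { $_ => escape_html($article->{$_}) } keys %$article;
    my @html = (
        qq|<div itemscope itemtype="http://schema.org/Article">|,
        qq|  <h1 itemprop="name">$safe{title}</h1>|,
        qq|  <p itemprop="description">$safe{description}</p>|,
        qq|  <img itemprop="image" src="$safe{img}">|,
    );
    if ($show_body) {
        push @html, qq|  <div itemprop="articleBody">$safe{content}</div>|;
    }
    else {
        # Placeholder sign-in link; the real login URL is site-specific.
        push @html, qq|  <p><a href="/login">Sign in</a> to read the full article.</p>|;
    }
    push @html, qq|</div>|;
    return join "\n", @html;
}

# Sample record using the values from the table earlier in the post.
my %article = (
    title       => 'Look a fixie',
    description => "You clearly aren't cool enough to read this article unless you sign up",
    img         => 'http://wheresgus.com/fixie.jpg',
    content     => 'Glorious article content that I have painstakingly authored',
);

print render_article(\%article, 0), "\n";   # teaser for signed-out readers
```

Passing a true second argument renders the full article with the same name, description, and image properties, so a moment that points at the teaser describes the same entity a subscriber would see.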

Best practices and considerations

Although this example approach works, I’m still seeing a number of potential issues with it. When taking an approach like this, make sure you think about various things that malicious users could do. For example, what if a malicious user:

  • Discovers how you are writing these moments and authors a target page that is inappropriate for your site (spam, mature content, etc.)
  • Spoofs forms to generate moments that are not intended to be rendered by your service
  • Spoofs the path to your moment for an account that isn’t their own (e.g. generates a moment for their friend’s account and writes it to their history)
  • Works out the conditions under which a moment is rendered and uses them to expose (partially) private information from your site
  • Manipulates parameters to generate moments that are not intended to be created with your service (e.g. check-ins for places they haven’t been to)

All of these risks can be reduced through careful design and thoughtful consideration around how people potentially could be accessing and writing moments based on your content.
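As one small example of that kind of design, the article id taken from the URL can be checked before it is used for anything. The sketch below is an assumption of mine rather than part of the original script: `valid_article_id` is a hypothetical helper, and the rule it applies (a positive integer of at most nine digits) is only one plausible choice. It complements the SQL placeholder used earlier; it does not replace it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: accept only a plain positive integer as an article id
# and return undef for anything else.
sub valid_article_id {
    my ($raw) = @_;
    return unless defined $raw && $raw =~ /\A[1-9][0-9]{0,8}\z/;
    return $raw;
}

# Try a few made-up inputs, including ones a malicious user might send.
for my $candidate ('1', '42', '0', '1 OR 1=1', '../etc/passwd', '') {
    my $id = valid_article_id($candidate);
    printf "%-16s => %s\n", "'$candidate'", defined $id ? "ok ($id)" : 'rejected';
}
```

Rejecting malformed ids up front means the teaser page only ever renders microdata for articles that actually exist in your database.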

A final note: enabling schema.org markup in your page has another benefit beyond letting you write moments specific to the content. Search engines that crawl your site will be able to parse the schema markup to create rich experiences in search results. When considering how you approach supporting a preview, remember the other ways that this content will be used and rendered.