EPiServer  /  CMS April 29, 2013

Building large scale EPiServer sites

It has been proven by numerous sites that EPiServer CMS can handle huge amounts of content. Doing so does bring a few challenges though. Here are a few patterns that I've identified when it comes to building large scale EPiServer sites with great performance.

Last week I received an e-mail with the subject ”Huge number of pages (>500 000) in EPiServer?” As you can imagine the e-mail contained questions related to whether EPiServer CMS can handle sites with A LOT of content.

It has been proven on a number of occasions that it indeed can. For instance, several of the biggest newspapers in Sweden, with millions of pages, run EPiServer. There are also government sites with several hundreds of thousands of pages.

That doesn’t mean that we’re not faced with challenges when building large EPiServer sites though.

In my experience, what those challenges are differs depending on the type of site, or rather, depending on how the content is structured on the site. While an individual site may be a mix of the two, large scale sites fall into two categories in terms of how they structure their content.

Content that fits naturally in a deep hierarchy

This is common for government sites and the like. Such sites may publish a lot of content that is organized and exposed to visitors in a hierarchy based on topics, subtopics and so on.

Alternatively the content may fit naturally in a date-based hierarchy, such as an archive with publications.

Content that isn’t hierarchical, or where the hierarchy is shallow

This is very common on media sites such as newspapers. Those sites typically have a shallow hierarchy made up of sections. There may for instance be a first level section called "Sport" and a subsection to that called "Football".

An article about one of the Champions League semifinals 2013 is displayed in the context of Sport/Football, but beyond that it has no natural place in a tree-like structure. On a site with a lot of content this in turn means that an article, in the context of its place in the hierarchy, may have thousands, or even millions, of siblings.

Finding content based on non-hierarchical criteria

Given that we’re dealing with a site that has a lot of content that fits nicely into a tree-based structure, EPiServer CMS works great out-of-the-box for editors. The CMS stores content in a tree, the content tree, and exposes that to editors using UI components, such as the Page tree and the Block gadget, that also let editors work with content in a tree.

While there is a lot of content, meaning that the content tree has a lot of branches and leaves, editors can easily find the right place to publish new content. They can also find old content simply by navigating the content tree, or the site, the same way a public visitor would.

In cases where the standard navigation doesn’t suffice, such as when an editor or visitor needs to find content that, in their view, isn’t placed where it should be on the site, basic free text search functionality can typically handle that.

As for developers, building navigation components is typically easy, as all they have to do is utilize the page tree. EPiServer's methods for doing that, such as Get, GetChildren and GetAncestors, are highly optimized and aggressively cached, with clever dependencies for releasing their caches when needed and only then.

However, no matter how natural the content hierarchy is, there are usually a number of requirements for components that list content in a way that isn’t based on the hierarchy. Examples of such components could for instance be the most recently published pages of a certain type, all articles published by a certain author or department, and all publications categorized with a certain keyword.

For such requirements EPiServer CMS only has the method FindPagesWithCriteria (FPWC) to offer. Besides obvious usability issues for developers, FPWC has some serious performance issues, especially on a site with a large volume of content.
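To illustrate the usability part, here's a rough sketch of what finding all pages of a given page type below the start page looks like with FPWC. The page type ID is a hypothetical value, and this is a sketch rather than production code:

```csharp
// Hedged sketch: find all pages of a given page type below the start page
// using FindPagesWithCriteria. The page type ID (42) is hypothetical.
var criterias = new PropertyCriteriaCollection();
criterias.Add(new PropertyCriteria
{
    Name = "PageTypeID",
    Type = PropertyDataType.PageType,
    Condition = CompareCondition.Equal,
    Value = "42",
    Required = true
});

// The query traverses the database rather than a dedicated index,
// which is why it scales poorly with large volumes of content.
var pages = DataFactory.Instance.FindPagesWithCriteria(
    PageReference.StartPage, criterias);
```

Note how verbose the criteria API is for even a trivial query, and that everything is stringly typed.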

In other words, on a large scale EPiServer site with content fitting naturally into a deep hierarchy we’re faced with the challenge of finding content based on non-hierarchical criteria.

There are two common solutions to this problem. One is to somehow store the answer to such questions whenever we know that the answer changes. For instance, this may mean storing a list of the ten last published pages of a given type, serialized into a property somewhere or in the Dynamic Data Store, updated each time a page is published. This requires quite a lot of development time and, worse, it requires us to know beforehand what questions will need answering. It’s also rather error prone.

The second, and much better, common solution to this is to use a search engine. This has been done on a number of large EPiServer sites using different search engines. Today though the obvious solution is to use EPiServer Find, the search and content retrieval product that EPiServer offers. Find was in many ways built exactly to address this problem in a way that offers great usability for developers and short development time.

Solution: Use EPiServer Find to create navigations and listings of content that are not based on the content's place in the content tree.

Non-hierarchical content

When the content can’t naturally be fitted into a deep hierarchy, additional challenges arise. First, EPiServer’s API and editorial interface are designed for sites organizing content in a tree. If the content can’t be organized into a deep hierarchy, performance will suffer. Here’s how I put it in my reply to the e-mail:

“The content tree can handle millions of items BUT if those items aren't stored in a deep hierarchy there will be performance problems. That is, if you have a page with ten thousand children you have a problem. If you have a page with a hundred children and each of those have a hundred children you won't have a problem.”

While I knew this from experience, after sending that reply, I decided to conduct a few experiments to prove it.

In the context of a scheduled job I wrote code that created ten thousand pages below the same parent. It also created a hundred pages below another common parent and then a hundred pages below each of those pages. Everything was done in batches of a hundred pages and the mean time for creating a page during each batch was logged.
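The core of the measurement code looked roughly like this. StandardPage, parentLink and logger are placeholders, while the repository calls are EPiServer 7's standard content API:

```csharp
var repository = ServiceLocator.Current.GetInstance<IContentRepository>();
var stopwatch = new Stopwatch();

for (var batch = 0; batch < 100; batch++)
{
    stopwatch.Restart();
    for (var i = 0; i < 100; i++)
    {
        // Create and publish a new page below the common parent
        var page = repository.GetDefault<StandardPage>(parentLink);
        page.Name = "Page " + (batch * 100 + i + 1);
        repository.Save(page, SaveAction.Publish, AccessLevel.NoAccess);
    }
    stopwatch.Stop();

    // Log the mean time per published page for this batch of a hundred
    logger.Log(string.Format("Batch {0}: {1} ms/page",
        batch + 1, stopwatch.ElapsedMilliseconds / 100));
}
```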

Let’s look at the results for creating 100x100 pages in a hierarchy first.

Batch        Avg. time per published page (ms)
1-100        22
201-300      22
501-600      22
901-1000     24
1901-2000    24
4901-5000    33
9901-10000   30

There are of course some variances which would likely even themselves out with a larger sample (I ran the test only four times), but it's pretty clear that it takes almost exactly the same amount of time to create page number ten thousand as page number one when storing pages in a hierarchy where each page has ninety-nine siblings.

Now, let’s compare that to creating 10000 pages below the same parent.

Batch        Avg. time per published page (ms)
1-100        22
201-300      26
501-600      34
901-1000     50
1901-2000    83
4901-5000    180
9901-10000   414

As we can see, the time required to publish a page grows with the number of existing pages below the page’s parent. Plotted in a diagram, we can see that this growth is linear.

Beyond API performance issues, expanding the tree node for a page, revealing its ten thousand children in edit mode, takes time. Below is what Firebug reported when I tried to expand a node with ten thousand children.

After receiving the response from the server Firefox reported an unresponsive JavaScript on the page and it took several minutes before I actually got to see the pages in the page tree. 

Of course, even if the page tree didn't have any issues with displaying thousands of children for a node, such a list would hardly be useful for editors.

Conclusion: EPiServer is built and optimized for sites that store content in a hierarchy in which each node has hundreds, not thousands, of child nodes.

Clearly, EPiServer's page tree doesn't work well for large scale sites with huge volumes of non-hierarchical content. Not in terms of performance and not in terms of editor usability.

Luckily there's a fairly easy solution that has proven to work very well. In fact, I've seen it done so many times that I'd call it a pattern. What is it? Faking it!

Structuring bulk content in arbitrary hierarchies

We know that EPiServer needs, or prefers, pages organized in such a way that each node in its content tree doesn't have more than hundreds of immediate children. EPiServer does not, however, care about why a certain page belongs in a certain place in the tree. Therefore we can automatically place pages in a hierarchy based on some arbitrary criteria.

For articles on a media site this is commonly done by placing them in a structure based on publish date.

There are a number of ways to implement such functionality but it typically involves:

  1. Defining a root node for a certain type of content.
  2. Hooking into events from EPiServer's API to listen for when content is created.
  3. When a page of a matching type is created, ensuring that there is a place for it in the date structure below the root, creating it if necessary.
  4. Moving the newly created content to its parent in the date structure.
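Sketched in EPiServer 7 terms, steps 2-4 could look something like this. ArticlePage and GetOrCreateDateContainer are hypothetical names, while the initialization module pattern and the CreatedContent event are standard EPiServer 7 APIs:

```csharp
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class ArticleStructureInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // Step 2: listen for newly created content
        ServiceLocator.Current.GetInstance<IContentEvents>()
            .CreatedContent += OnCreatedContent;
    }

    private void OnCreatedContent(object sender, ContentEventArgs e)
    {
        var article = e.Content as ArticlePage;
        if (article == null)
        {
            return;
        }

        // Step 3: ensure year/month containers exist below the article root
        var parent = GetOrCreateDateContainer(article.StartPublish);

        // Step 4: move the article to its place in the date structure
        ServiceLocator.Current.GetInstance<IContentRepository>()
            .Move(article.ContentLink, parent);
    }

    private ContentReference GetOrCreateDateContainer(DateTime published)
    {
        // Hypothetical helper: look up or create the year and month pages
        // below the article root and return the month page's reference.
        throw new NotImplementedException();
    }

    public void Preload(string[] parameters) { }

    public void Uninitialize(InitializationEngine context) { }
}
```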

Of course, when content resides in a structure whose only purpose is to work well with EPiServer performance wise we can't utilize the content hierarchy when building navigations. The solution to that typically involves four things:

  1. Defining one or several properties on the content types that will be used for "bulk content". These properties are typically of type PageReference, ContentReference or ContentArea. For instance, an article may have a property named Section of type PageReference which points to the Football section.
  2. Utilizing the above mentioned property/properties when rendering pages to determine in what context they should be shown. For instance, an article about a Champions League game may have a Section property pointing to a page named Football which in turn is a child of a page named Sport. Based on that the article's content is displayed framed by a header, navigation elements and right column from Football or Sport. 
  3. Using a search engine to create listings. Essentially this is the same problem we looked at before, finding content based on non-hierarchical criteria. The only difference is that we now need to apply the same solution in more places, as the majority of the site's content is organized in such a way in the content tree that it can't be used to build navigations and listings.
    With that said, we can still utilize EPiServer's standard API methods for components such as the main navigation, as those pages aren't the "bulk content" and therefore work well with the page tree. Again, while we can use pretty much any search engine that offers good performance and scalability, EPiServer Find is the best option as it was born out of these specific needs.
  4. Rewriting URLs. By default, URLs on an EPiServer site are built up using the page's name prefixed by the names of its ancestors in the page tree. When storing articles or other content in a structure that has nothing to do with how visitors see the page on the site, URLs won't seem very logical. For instance a URL like /2013/04/23/champions-league.. is hardly helpful and doesn't look very good. With older versions of EPiServer CMS we typically handled that using a custom URL rewriter. Nowadays, with EPiServer 7, we do it using custom routing.
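Points 1 and 2 above could be sketched like this on an EPiServer 7 content type. The type and property names are made up for illustration:

```csharp
[ContentType(DisplayName = "Article")]
public class ArticlePage : PageData
{
    // Ties the article to the section page (e.g. Football) that provides
    // its context, independently of where it's stored in the content tree.
    public virtual PageReference Section { get; set; }

    public virtual XhtmlString MainBody { get; set; }
}
```

When rendering, breadcrumbs and surrounding navigation are then built from the page referenced by Section, and its ancestors, rather than from the article's actual ancestors in the tree.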

Solution: Automatically organize non-hierarchical content in arbitrary hierarchies based on creation date, the first letter of their names or some other criteria. Use properties on the content to tie it to pages in the page tree which provide the context in which the content should be rendered. Use EPiServer Find to build listings.

Using the above approach we solve the performance issues when dealing with non-hierarchical content of the type often found on media sites, blogs and the like. While that works great, we have one more problem to solve: how editors create and find the content for which the page tree is more or less useless.

Custom components for editorial workflows

We could tell editors: "To create a new article, click on the Articles node in the page tree. We have some code that will automatically move it a few levels down in a date based structure. Oh, and don't forget to set the Section property." However, odds are we wouldn't find many jobs afterwards, and we'd see a bunch of angry comments about EPiServer on Twitter.

Clearly, if we're dealing with the type of site that the page tree doesn't work well with, we can't just be content with solving performance issues. We'll also need to extend EPiServer's edit mode to provide good workflows for editors.

Exactly how such components should work and be implemented differs from site to site but it typically involves functionality to:

  • Create "bulk content" without having to use the page tree gadget.
  • Automatically populate "infrastructure properties" such as Section, or make it very easy for editors to do so.
  • List the most recent content. Especially on media sites it's a very common requirement to have a list that displays all articles that have been published today, have not yet been published, or are scheduled to be published.
  • Find content based on criteria such as author, publish date and section.

I've built such functionality both in EPiServer 6 and EPiServer 7 and based on those experiences I've created the PowerSlice project. PowerSlice is one way of addressing several of the above requirements and may solve all needs for some sites. For other sites it may be used for inspiration.

Either way, it's very much possible to build the components needed by editors. With EPiServer 7 it typically involves creating custom Dojo/Dijit widgets and utilizing EPiServer Find.

Solution: Extend EPiServer's edit mode with custom components tailor made for the needs of the editors. PowerSlice may be an option and/or used for inspiration.

Are you still with me? Perhaps it's about time we looked at an example.

An example - this site

This site doesn't exactly fit the "large scale" description. However, as it's primarily a blog, it, along with certain parts of many other "small" sites, does have non-hierarchical content. Therefore I applied the above mentioned techniques to it, meaning that we can look at it as an example of how a large scale site, such as a media site, can be built.

Organizing pages

In terms of hierarchy there are two types of pages on the site. Articles and tags are automatically organized in two separate structures. Sections and standard pages are not.

All articles reside under a node below the start page. Under that node they are grouped first by year and then by month.

Articles have a property named MainCategory edited using a drop down from which it's possible to select one of the categories (sections) on the site.

This property is used for breadcrumbs and context specific things when rendering an article.

Articles also have a content area property, AdditionalCategories, to which other categories can be added.

MainCategory and any categories added to AdditionalCategories are combined by a code-only property on articles named AllCategories.
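A sketch of what that combination could look like on the article content type. The property names are taken from the description above; that ContentArea exposes its content through a Contents collection is an assumption about the EPiServer 7 API of the time:

```csharp
public virtual PageReference MainCategory { get; set; }

public virtual ContentArea AdditionalCategories { get; set; }

// Code-only property combining the main category with any additional ones.
// As a public property it gets serialized and indexed by EPiServer Find,
// so listings can filter on a single field covering all categories.
public IEnumerable<ContentReference> AllCategories
{
    get
    {
        yield return MainCategory;

        if (AdditionalCategories == null)
        {
            yield break;
        }

        foreach (var content in AdditionalCategories.Contents)
        {
            yield return content.ContentLink;
        }
    }
}
```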

New articles can be created using a UI component that's created using PowerSlice.

When an article is created it's initial parent is the root node for articles. Using a modified version of my old open source project PageStructureBuilder the article as automatically moved to a year/month structure.

Listings and routing

Categories/sections list articles "belonging" to them, ordered by publication date. This is done using a fairly simple EPiServer Find search query that filters on the articles' AllCategories property.

_searchClient.Search<Article>()
    .Filter(article => article.AllCategories.Match(currentPage.ContentLink))
    .CurrentlyPublished()
    .FilterOnReadAccess()
    .OrderByDescending(article => article.StartPublish)
    .GetContentResult();

As for URLs I use a custom partial router, which I've previously described in great detail.

Dealing with traffic

So far this article has been about how to build sites with large volumes of content on EPiServer CMS. There's of course another way to interpret "large scale sites" - sites with a lot of traffic.

Again, it's already been proven by a number of existing EPiServer customers that EPiServer can handle huge volumes of traffic. With that said, it's of course also very much possible to build a site on EPiServer that crumbles once it's hit with more than a couple of concurrent requests.

A robust EPiServer site that can handle a lot of traffic and let editors work efficiently at the same time requires a good implementation. And a good implementation requires skilled and experienced developers who know what they are doing.

In general, EPiServer CMS's API is highly optimized and the most significant methods for getting content based on the hierarchy are cached. As for EPiServer Find, it's fast, highly scalable and also has mechanisms for caching that can be used when needed. So, as a first step in hardening a site for production it's absolutely vital to use the API methods wisely. Having done so, the site will hold up well on its own.

Sometimes though, we're dealing with a site that has so much traffic that it's not enough to just use caching to prevent database calls. The actual rendered HTML output needs to be cached as well. For that, EPiServer offers a fairly basic output cache that can be used.
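In an MVC solution the built-in output cache can be applied per controller action with an attribute. A minimal sketch, assuming the [ContentOutputCache] attribute from EPiServer.Web.Mvc and a hypothetical ArticlePageController:

```csharp
public class ArticlePageController : PageController<ArticlePage>
{
    // Caches the rendered output according to the HTTP cache settings
    // (such as httpCacheExpiration) configured for the site.
    [ContentOutputCache]
    public ActionResult Index(ArticlePage currentPage)
    {
        return View(currentPage);
    }
}
```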

While that may be an option, for sites with really, really high traffic we may need even more efficient output caching. In those cases we can either construct a custom output cache to use on the web servers or, which I prefer, use a web content accelerator/caching reverse proxy such as Varnish. We may also want to look into using a CDN instead, or as a complement.

One thing to beware of though is that any form of output caching, with the exception of partial caching, will limit editors' ability to use some of the functionality built into the CMS, such as personalization. If that's a problem, we can often work around it by loading parts of pages using JavaScript and not caching such requests. Of course, that means more requests to the site.

In general, my philosophy is to build the site as robust and performant as possible, so that it can handle the traffic without any other form of caching. After that, if it's needed or economically motivated, some sort of cache can be put in front of the site.

This approach has two benefits. First of all, we can choose to use output caching for the right reasons. Second, using output caching tends to hide performance problems in the application, and while those may not be a problem at first, they may be whenever the cache is released or if there's an issue with the output cache. Then it's very valuable if the web application can hold up on its own.

Summary

  •  EPiServer CMS can handle sites with millions of pages.
  •  For large scale sites FindPagesWithCriteria doesn't work for non-hierarchical queries. Use EPiServer Find for that.
  •  EPiServer relies on content being split up into a hierarchy. If the content doesn't fit naturally into such a hierarchy, make one up and use a combination of properties, Find and edit mode extensions for creating new content and building listings.

PS. For updates about new posts, sites I find useful and the occasional rant you can follow me on Twitter. You are also most welcome to subscribe to the RSS-feed.

Joel Abrahamsson


I'm a passionate web developer and systems architect living in Stockholm, Sweden. I work as CTO for a large media site and enjoy developing with all technologies, especially .NET, Node.js, and ElasticSearch.


