Alfresco High Availability Infrastructure in Real Life

http://alfrescoshare.wordpress.
com/tag/infrastructure/
Alfresco high availability infrastructure in real life (Part1)

In our company, as the Alfresco DM service becomes more and more popular everyday (+ 2 GB of documents added each day), we are currently studying the implementation of a high availability (H.A) infrastructure (cluster, failover, etc). So I will start this series of post to share my comments, questions and recommendations with you, because implementing such a H.A is a somewhat complex projectI should say that this complexity is not due to Alfresco software itself, but to the common underlying infrastructure concerns you have when you need to install any big scale architecture, with huge volume of documents to backup and hundreds or thousands of users. I think most of the companies which have deployed Alfresco a few years ago, did start the project infrastructure with a pragmatic approach, i.e. using a more or less basic standalone server, as we did. So after a few years of production deployment, Im sure they are in the same situation as we are: due to the success of the project, you now need to build a more industrialized architecture to better manage DRP, failover, load-balancing, backup, etc. As I said, we have almost completed the design of our H.A infrastructure. I will give you more details about it once the final design will be validated (see next post), and also why we have taken these decisions, but basically here are the main concepts which have been selected so far: Failover (YES): 2 distinct Alfresco instances located in 2 distinct Data-Centers geographically separated. Clustering (YES/NO): yes, because we will have a 2 cluster nodes. But this will be a hot standby system (only one of the 2 Alfresco instances is exposed to end-users). The standby instance is mainly used for backup purpose (and of course failover). Load-balancing (NO): after long discussion we have decided not to address load-balancing for the first release of the project. The main constraint is that load-balancing requires the 2 nodes of the cluster to be in synch in real time which can be complex to achieve (see next post). Also, managing CIFS and load-balancing is something we did not have the time to study for the first release. Please note that these criteria does NOT illustrate the perfect infra which can match every company needs, but this is rather a pragmatic approach which address our functional constraints (RTO, RPO), leverage existing infra components (datacenter, backup robot), and which takes into account the cost restrictions . At this step of the project study, I can suggest you the following recommendations: 1 Read the Alfresco wiki and the editor technical documentation to give you an overview of what is and what is not possible with the latest version of the product. 2 Work with Alfresco service team at the early beginning of the study, 3 Define and validate the SLA with your functional team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). 4 Estimate the number of concurrent users you plan to have.
5 Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy). 6 Study carefully the backup and the restore strategy (DRP). Also define if you want to be able to restore a single file. 7 Involve your internal infrastructure team (and Alfresco service team) to finalize the design of your infrastructure. I will detail each of these items in the next blog post. But the conclusion for us, at this step of the H.A design study, is clearly that there is no perfect architecture, but that the final infrastructure design is only the best compromise between all your internal company constraints (functional and technical). Once again, I can say that most of these compromises are not due to the Alfresco software limitation itself, but rather to the huge volume of documents that we have to manage. Sogoto thenext blog postif you want to know the next episode of the story
Alfresco high availability infrastructure in real life (Part2)

As mentioned in my previous post (part 1), the purpose of this series of articles is to provide some recommendations to help you build your own alfresco high-availability (H.A) infrastructure. So lets review more in details the first 3 advices I think you should consider: 1. Read the Alfresco wiki and the editor technical documentation to give you an overview of what is and what is not possible with the latest version of the product, 2. Work with Alfresco service team at the early beginning of the study, 3. Define and validate the SLA with your functional team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). 1 Read the Alfresco wiki and the editor technical documentation to give you an overview of what is and what is not possible with the latest version of the product. To prepare your study, I think the first thing to do is to review all the existing documentation related to Alfresco clustering and H.A. This will help you to choose the best approach for your design, at least the high-level principles. Whatever your Alfresco version, here is a list of documents and articles (provided by Alfresco editor) which could be of interest: Alfresco Backup and HA Cluster Configuration V2.1.3 and Later High Availability (Disaster Recovery) Backup and Restore Also, you should carefully choose the target version of Alfresco you will use to build your H.A architecture. In our case, we are currently using v2.1.1 E in production, but we are waiting for the latest v3.2 E version (end of this year) to build the new H.A infra.
Of course, in most cases, using the latest version is likely to be the best choice. But for some technical reasons (e.g CPU/memory limitation of your current production infra), you might have to migrate as soon as possible to the current available version (which is 3.1). The purpose of this post is not to do a comparative study of 3.1 and 3.2, but you should be aware that even between 2 minor releases there might be significant evolutions (as Alfresco developers are working very hard). For instance, changes you can do through the JMX console (e.g changing any server configuration parameter) can be done at runtime with both 3.1 and 3.2, but changes are persistent only in 3.2So if you need more advanced administration capabilities (like sub-system, JMX, LDAP synchronization, etc) 3.2 is clearly the best target. Please note however that most of the major clustering features and options are already included in 3.0 version, and that there are no significant changes between 3.0 to 3.2 regarding clustering. The next major evolutions (sharing lucene indexes between cluster nodes) is planned for 3.3 (or even 4.0) only. 2 Work with Alfresco service team at the early beginning of the study: To start our study, we have involved the Alfresco service team. This is clearly the best advice I can give you for your own projectIndeed, even after a single consulting day (or 2), you will not only clarify the main Alfresco cluster principles, but also (and this is more important) learn that some of the most common documentation on this topic does not necessarily reflect the limitation of the clustering or HA methodology in a real life context. I especially do think about the Replicating Content Store (RCS), which initially seems to be the most common way to replicate your documents between cluster nodes (assuming you need to have 2 content stores for failover), but which is actually a very limited process. In our case, we plan to build a failover site, which must be the mirror of the live site in term of data (content store and meta-data) to avoid loosing data as much as possible when switching to the failover site is required. At the beginning of the project we thought that using the Replicating Content Store could be a good option to perform the document (content store) synchronization. But the problem is that the RCS component does not ensure that the failover site will contain the exact same documents. For instance, if failover site is unavailable (e.g maintenance operation), then the RCS will not be able to re-run the transaction which have not been executed during downtime period. This is not the only limitation of RCS, so clearly it is not well suited in our context. So the service team clearly told us that the RCS is usually not used by customers for documents replication purpose. Insteed, a tool like rsync is usually recommended by the Alfresco editor. So we will study if rsync can be a good fit for our datacenter infrastructure. But, with this advice at the very beginning of the project, we have avoided a big design mistake So once again, involving Alfresco service team is a good investment ! 3 Define and validate the SLA with your functional team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective): One of the main technical constraints that will drive the design of your HA infrastructure are RTO (Recovery Time Objective) and RPO (Recovery Point Objective) . Basically RTO is the time required to re-activate the service in case of major crash of your production system. So lets assume your Alfresco server crashed for any reason (unexpected power shutdown, corruption of database, etc, etc), you usually need to have a DRP (Disaster Recovery Plan) to be able to restore the system as soon as possible. In our case, we will use a failover site (in another
geographical location) and we will switch the users on this new system. Our functional team has specified that our RTO must not be more than 2 hours, so that means we must be able to start the failover Alfresco instance and switch the load-balancer in less than 2 hours. Once your failover site is started, another indicator is how much data have been lost during the crash and the switch operation ? This usually depends on the frequency of your backup. In our case, our plan is to synchronize the documents content store every 30 mn (using rsync), and the database every 60 mn (using Oracle archive log feature). That means that we should not lost more than 60 mn of data creation time. Choosing a RTO of 2 hours and RPO of 1 hour is already a challenge, but I think this is something technically (and humanly) feasible. Depending on the criticity of your documents, this might of course not be enough, or not acceptable for your end-usersbut you should know that the highest your expectation are, and the biggest the infrastructure cost will beFor instance, being able to reduce RTO to less than 30 mn might need to involve some monitoring solution coupled to advanced mechanism to automatically start your Alfresco instance (and database) on the failover site. Such solution already exist (likeVeritas Storage Foundation HA), but the price of such solution is usually significant. We have studied this kind of very advanced solution, but finally we have decided (mainly for cost reason) to start with a pragmatic and more simple HA solution. To give you an overview of what our Alfresco HA infrastructure will be, here is a logical diagram:
The final design is still not completely validated but basically here are the main concepts which have been selected so far: Failover(YES): 2 distinct Alfresco instances located in 2 distinct Data-Centers geographically separated. Site A is the live site exposed to end-users. Site B is used only for failover (only in case of crash of Site A). Clustering(YES/NO): yes, because we will simulate a 2 cluster nodes. But this will be more a hot standby system rather than a real cluster. Here the standby instance is mainly used for backup purpose (and of course failover). The failover site will be in read only mode and it will not be synchronized in real time (the data are replicated only every 1 hour). Data replication(YES): rsync will be used for files synchronization. The archive log export/import mechanism (Oracle) will be used to replicate the database meta-data. One very
important constraint is that Content Store replication must be performed before the DB replication (I will detail that later on). Lucene indexes management: even with 3.2 version of Alfresco, each node of the cluster needs to have a local copy of the indexes. Site A indexes are updated in real time during each transaction. On Site B, we might need to enable the lucene index tracking process to make sure the lucene indexes are up to date. Load-balancing(NO): we have decided not to address load-balancing for the first release of the project. The main difficulty is that load-balancing requires the 2 nodes of the cluster to be in synch in real time which can be complex to achieve (especially if you consider the limitations of the RCS component). Also, managing CIFS and load-balancing is something we did not have the time to study for the first release. Hope that you begin to have a good overview of what infra we currently try to setup in our case. In a next post, I will continue to describe the following recommendation: 4 Estimate the number of concurrent users you plan to have. 5 Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy). 6 Study carefully the backup and the restore strategy (DRP). Also define if you want to be able to restore a single file. 7 Involve your internal infrastructure team (and Alfresco service team) to finalize the design of your infrastructure.

Alfresco High Availability Infrastructure in Real Life

Transféré par

Informations du document

Description originale:

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Alfresco High Availability Infrastructure in Real Life

Transféré par

Droits d'auteur :

Formats disponibles

http://alfrescoshare.wordpress.

Alfresco high availability infrastructure in real life (Part1)

Alfresco high availability infrastructure in real life (Part2)

Vous aimerez peut-être aussi