Oleg Fedorov
Project Manager, Linxdatacenter
30.09.2020

Network-as-a-Service for a large enterprise: an unconventional case

How do you upgrade the network equipment of a large enterprise without halting production? Oleg Fedorov, Project Manager at Linxdatacenter, talks about a large-scale project carried out in "open-heart surgery" mode.

Over the past few years, we have seen growing customer demand for services related to the network component of IT infrastructure. The need to interconnect IT systems, services, applications, monitoring, and operational business-management tasks in almost any industry is forcing companies to pay closer attention to their networks.

Requests range from network resiliency to the creation and management of a client's autonomous system, including the purchase of IP address blocks, the setup of routing protocols, and traffic management according to the organization's policies.

There is also growing demand for integrated solutions for building and maintaining network infrastructure, primarily from customers whose networks are being built from scratch or are obsolete and require major modification.

This trend coincided with a period of growth and increasing complexity in Linxdatacenter's own network infrastructure. We expanded our geographic presence in Europe by connecting remote sites, which in turn required improvements to our network.

In response, we launched a new service for clients, Network-as-a-Service: we take care of all of a client's networking needs, allowing them to focus on their core business.

In the summer of 2020, we completed the first big project in this area, and it is the one I would like to tell you about.

At the start

A large industrial complex approached us to modernize the network portion of the infrastructure at one of its plants. The old equipment, including the network core, had to be replaced with new hardware.

The last equipment upgrade at the company had taken place about 10 years earlier. The company's new management decided to improve connectivity, starting with an upgrade of the infrastructure at the most basic, physical level.

The project was divided into two parts: upgrading the server fleet and network equipment. We were responsible for the second part.

The basic requirements included minimizing downtime of the company's production lines during the work (and in some areas eliminating downtime altogether). Any stoppage is a direct monetary loss for the client and must not happen under any circumstances. Given the facility's 24x7x365 operating mode and the complete absence of planned maintenance windows in the company's practice, we were, in effect, tasked with performing open-heart surgery. This was the main distinguishing feature of the project.

Let's go

The work was planned to move from the network nodes farthest from the core toward those closer to it, and from nodes with the least impact on the production lines toward those that affect them directly.

For example, take a network node in the sales department: a connection failure caused by work in that department will not affect production. At the same time, such an incident helps us, as the contractor, to verify that the chosen approach to such nodes is correct and, after adjusting our actions, to move on to the next stages of the project.

We had to not only replace the nodes and cabling but also configure all the components correctly so that the solution worked as a whole. It was these configurations that were tested in this way: by starting the work at a distance from the core, we effectively gave ourselves the "right to make a mistake" without putting the enterprise's critical areas at risk.

We identified the areas that do not affect the production process, as well as the critical ones: workshops, the loading and unloading unit, warehouses, and so on. For the key areas, we agreed with the client on the permissible downtime for each network node individually: from 1 to 15 minutes. It was impossible to avoid disconnecting individual network nodes entirely, because the cables had to be physically switched from the old equipment to the new, and in the process we also had to untangle the "beard" of wires that had formed over several years of operation without proper maintenance (one of the consequences of outsourcing the cable installation work).

The work was divided into several stages.

Stage 1 – Audit. Preparation and agreement of the approach to planning the work, and an assessment of the readiness of the teams: the client's, the installation contractor's, and ours.

Stage 2 – Development of the format for carrying out the work, with deep, detailed analysis and planning. We chose a checklist format specifying the exact order and sequence of actions, down to the sequence in which patch cords were switched from port to port.

Stage 3 – Work in the cabinets that do not affect production. Assessment and adjustment of the downtime estimates for the subsequent stages.

Stage 4 – Work in the cabinets that directly affect production. Assessment and adjustment of the downtime estimates for the final stage.

Stage 5 – Work in the server room to switch over the remaining equipment. Bringing up routing on the new core.

Stage 6 – Sequential switchover of the system core from the old network configurations to the new ones for a smooth transition of the entire system (VLANs, routing, etc.). At this stage we connected all users and moved all services to the new equipment, verified that everything was connected correctly, made sure that none of the enterprise's services had stopped, and ensured that any problems that did arise would be related directly to the core, which simplified troubleshooting and final tuning.

A haircut for the wire beard

The project was also complicated by the challenging initial conditions.

First of all, there were a huge number of network nodes and segments, with a confusing topology and wires classified inconsistently by purpose. These "beards" had to be pulled out of the cabinets and painstakingly "combed", figuring out which wire led from where to where.

(Photos: the tangled cabling in the cabinets before the work.)

Second, for each such task we had to prepare a file describing the process: "take wire X from port 1 of the old equipment and plug it into port 18 of the new equipment." It sounds simple, but when the starting point is 48 fully occupied ports and there is no idle option (remember the 24x7x365 schedule), the only way out is to work in blocks. The more wires you can pull out of the old equipment in one go, the faster you can comb them and plug them into the new network hardware while avoiding network failures and downtime.
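To give a sense of what each line of such a file boiled down to, here is a minimal, purely illustrative Python sketch that turns a port mapping into a checklist of switching instructions. The port numbers and labels are hypothetical examples, not data from the actual project.

```python
# Illustrative sketch only: generate per-patch-cord switching instructions
# of the kind described above. All port numbers and labels are made up.

# old switch port -> (new switch port, purpose/VLAN label)
PORT_MAP = {
    1:  (18, "users-sales"),
    2:  (19, "users-sales"),
    24: (40, "production-line-2"),
    48: (47, "uplink-core"),
}

def migration_checklist(port_map):
    """Yield one human-readable instruction per patch cord."""
    for old_port, (new_port, label) in sorted(port_map.items()):
        yield (f"Take the wire from port {old_port} of the old switch "
               f"and plug it into port {new_port} of the new switch ({label}).")

for step, line in enumerate(migration_checklist(PORT_MAP), start=1):
    print(f"{step:02d}. {line}")
```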

Therefore, during the preparatory phase we broke the network down into blocks, each belonging to a particular VLAN. Each port (or subset of ports) on the old equipment corresponds to one of the VLANs in the new network topology. We grouped them as follows: the first ports of a switch carried the user networks, the middle ones the production networks, and the last ones the access points and uplinks.

This approach made it possible to pull out and comb 10-15 wires from the old equipment in one go, which sped up the work several times over.
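As a rough illustration of how such blocks can be formed and how their downtime adds up, here is a small Python sketch; the VLAN names, port numbers, and the assumed 30 seconds per patch cord are invented for the example rather than taken from the project.

```python
# Illustrative sketch: group old-switch ports into per-VLAN migration blocks
# and estimate each block's downtime. All values below are assumptions.
from collections import defaultdict

SECONDS_PER_CORD = 30            # assumed time to move and label one patch cord

PORT_VLANS = {                   # old port -> VLAN in the new topology
    1: "users", 2: "users", 3: "users",
    20: "production", 21: "production",
    47: "access-points", 48: "uplink",
}

def migration_blocks(port_vlans, seconds_per_cord=SECONDS_PER_CORD):
    """Return one block per VLAN with its ports and a rough downtime estimate."""
    grouped = defaultdict(list)
    for port, vlan in port_vlans.items():
        grouped[vlan].append(port)
    return {
        vlan: {
            "ports": sorted(ports),
            "est_downtime_min": round(len(ports) * seconds_per_cord / 60, 1),
        }
        for vlan, ports in grouped.items()
    }

for vlan, info in migration_blocks(PORT_VLANS).items():
    print(f"{vlan}: ports {info['ports']} -> ~{info['est_downtime_min']} min downtime")
```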

(Photos: the cabling in the same cabinets after "combing".)

After completing stage 2, we paused to analyze mistakes and the project's dynamics. For example, minor flaws immediately surfaced due to inaccuracies in the network diagrams provided to us (a wrong connector on the diagram meant a wrong patch cord had been purchased and needed to be replaced).

The pause was necessary because, when working in the server room, even a small disruption in the process was unacceptable. If the goal was to keep the downtime of a network section under 5 minutes, that limit could not be exceeded. Any possible deviation from the schedule had to be coordinated with the client.

However, the careful planning and the breakdown of the project into blocks made it possible to stay within the planned downtime at all sites, and in most cases to avoid downtime altogether.

A challenge of the times: a project under COVID

However, it was not without additional complications. Of course, one of the obstacles was the coronavirus.

The work was further complicated by the onset of the pandemic: not all the specialists involved could be present on the client's site while the work was being carried out. Only the installation contractor's employees were allowed on site, and supervision was carried out via a Zoom room that included a network engineer from Linxdatacenter, me as the project manager, the client's network engineer responsible for the work, and the team performing the installation.

In the course of the work, unforeseen problems arose and we had to make adjustments on the fly. This allowed us to quickly catch human errors (mistakes in the diagrams, mistakes in determining whether an interface was active, and so on).

Although the remote format of the work seemed unusual at the beginning of the project, we quickly adapted to the new conditions and reached the final stage of work.

We ran a temporary network configuration to operate the two network cores, the old and the new, in parallel in order to make a smooth transition. However, it turned out that one extra line in the new core's configuration file had not been removed, and the switchover did not happen, so we had to spend some time looking for the problem.

It turned out that the main traffic was being forwarded correctly, while the control traffic was not reaching the node through the new core. Thanks to the clear division of the project into stages, we were able to quickly identify the network segment where the problem arose, pinpoint it, and fix it.
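The specific configurations are not reproduced here, but the kind of check that catches a stray leftover line is easy to illustrate: a sketch that diffs the intended core configuration against the running one using Python's standard difflib. The configuration lines below are invented for the example.

```python
# Illustrative sketch: compare the intended core configuration with the one
# actually applied, to spot leftover lines. The config lines are hypothetical.
import difflib

intended = """\
vlan 110 name production
vlan 120 name users
interface uplink-1 trunk allowed vlan 110,120
"""

running = """\
vlan 110 name production
vlan 120 name users
ip route 10.10.0.0/16 10.0.0.254
interface uplink-1 trunk allowed vlan 110,120
"""

# Lines prefixed with '+' exist only in the running config and are candidates
# for the "extra line" that blocked the switchover.
diff = difflib.unified_diff(intended.splitlines(), running.splitlines(),
                            fromfile="intended", tofile="running", lineterm="")
for line in diff:
    print(line)
```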

And as a result

Technical results of the project

First of all, a new core was built for the enterprise network, around which we created physical and logical rings so that every switch in the network had a "second arm". In the old network, many switches were connected to the core by a single path, a single uplink: if it broke, the switch became completely unavailable, and if several switches were connected through one uplink, a failure took out an entire department or production line.

In the new network, even a fairly serious incident will not, under any scenario, bring down the entire network or a significant section of it.
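To make the "second arm" idea concrete, here is a small, made-up sketch that walks a simple adjacency-list model of the network and flags switches with only one link; the switch names and topology are hypothetical, not the enterprise's actual layout.

```python
# Illustrative sketch: find access switches that have only one path ("one arm")
# in a simple adjacency-list model of the network. The topology is invented.
TOPOLOGY = {
    "core-1":       ["core-2", "sw-sales", "sw-warehouse"],
    "core-2":       ["core-1", "sw-sales", "sw-line-1"],
    "sw-sales":     ["core-1", "core-2"],   # two uplinks: has a "second arm"
    "sw-warehouse": ["core-1"],             # single uplink: single point of failure
    "sw-line-1":    ["core-2"],             # single uplink: single point of failure
}
CORE_SWITCHES = {"core-1", "core-2"}

def single_homed(topology, cores):
    """Return access switches connected by fewer than two links."""
    return [switch for switch, neighbors in topology.items()
            if switch not in cores and len(neighbors) < 2]

print("Switches without a second arm:", single_homed(TOPOLOGY, CORE_SWITCHES))
```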

90% of all network equipment was replaced, media converters (devices that convert the signal transmission medium) were retired, and the need for dedicated power lines for the equipment was eliminated by connecting it to PoE switches, which supply power over the Ethernet cabling.

In addition, all optical connections in the server room and in the field cabinets - at all key communication nodes - were labeled. This made it possible to prepare a topological diagram of the equipment and connections that reflects the network's actual current state.

(Figure: network diagram.)

The most important technical result: fairly extensive infrastructure work was completed quickly, without interfering with the operation of the enterprise and almost unnoticed by its staff.

Business results of the project

In my opinion, this project is interesting not so much from the technical side as from the organizational one. The main difficulty lay in planning and thinking through the steps needed to implement the project tasks.

The success of the project confirms that developing the networking area within the Linxdatacenter service portfolio is the right development vector for the company. A responsible approach to project management, a competent strategy, and accurate planning allowed us to perform the work at the proper level.

A confirmation of the quality of the work is the client's request that we continue providing network modernization services at its remaining sites in Russia.

 


How we optimized customer data center management

A data center is a complex IT and engineering facility that requires professionalism at every level of management: from managers to technical specialists and the people performing maintenance work. Here is how we helped our client put operational management of its corporate data centers in order.
 

Taras Chirkov, Head of Data Center, St. Petersburg

Konstantin Nagorny, Chief Engineer, Data Center in St. Petersburg


Management is in the lead 

The most advanced and expensive IT equipment will not deliver the expected economic benefits unless proper operating processes are established for the engineering systems of the data center that hosts it.

The role of reliable and productive data centers in today's economy is constantly growing along with the requirements for their uninterrupted operation. However, there is a big systemic problem on this front.  

A high level of "uptime" - failure-free operation of a data center without downtime - depends very much on the engineering team that manages the site. And there is no single formalized school of data center management.  


Nationwide  

In practice, the situation with the operation of data centers in Russia is as follows.  

Data centers in the commercial segment usually have certificates confirming their management competence. Not all of them do, but the very nature of the business model - in which the provider answers to the client for service quality with its money and market reputation - obliges providers to master the subject.

The segment of corporate data centers serving companies' own needs lags far behind commercial data centers in operational quality. The internal customer is not treated as carefully as the external one, and not every company understands the potential of well-configured management processes.

Finally, there are government departmental data centers - often unknown territory due to their closed nature. An international audit of such facilities is understandably impossible, and Russian state standards are only now being developed.

All this translates into a "who knows what" situation: a motley mix of operations teams made up of specialists with different backgrounds, different approaches to organizing corporate architecture, and different views on and requirements for IT departments.

Many factors lead to this state of affairs; one of the most important is the lack of systematic documentation of operational processes. There are a couple of introductory articles by the Uptime Institute that give an idea of the problem and how to overcome it, but beyond that the system has to be built through your own efforts, and not every business has enough resources and competence for that.

Meanwhile, even a modest systematization of management processes according to industry best practices always yields excellent results in terms of improving the resilience of engineering and IT systems.

A case study: through thorns to relative order

Let us illustrate with an implemented project. A large international company with its own network of data centers approached us, asking for help in optimizing management processes at three sites hosting IT systems and business-critical applications.

The company had recently undergone an audit by its headquarters and received a list of non-conformities with corporate standards, along with orders to eliminate them. We were brought in as a consultant and a bearer of industry competence: for several years we have been developing our own data center management system and promoting the role of quality in operational processes.

Communication with the client's team began. The specialists wanted a well-established system for operating the data centers' engineering systems, with documented processes for monitoring, maintenance, and troubleshooting. All of this was meant to optimize the infrastructure component of IT equipment continuity.

And here began the most interesting part.  

Know thyself 

To assess how well a data center complies with standards, you need to know the business's exact requirements for its IT systems: the internal SLA level, the permissible period of equipment downtime, and so on.

It became clear right away that the IT department did not know exactly what the business wanted. There were no internal service quality criteria and no understanding of the logic of their own infrastructure.

Colleagues simply had no idea what the permissible downtime for IT-dependent operations was, what the optimal system recovery time after a disaster was, or how the architecture of their own applications was structured. For example, we had to figure out whether the "crash" of one of the data centers would be critical for an application, or whether no components affecting that application were hosted there.

Without knowing such things, it is impossible to calculate any specific operational requirements. The client acknowledged the problem and improved coordination between IT and the business in order to develop internal requirements and establish the working relationships needed to align operations.
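As a simple illustration of the kind of figure that was missing, an availability target translates directly into a yearly downtime budget. The sketch below shows the arithmetic for a few common SLA levels; the percentages are generic examples, not the client's actual targets.

```python
# Illustrative sketch: convert an availability target into the unplanned
# downtime it allows per year. The SLA levels below are generic examples.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability_pct):
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> about {downtime_budget_minutes(sla):,.0f} minutes per year")
# 99.0%  -> ~5,256 min/year (about 3.7 days)
# 99.9%  -> ~526 min/year   (about 8.8 hours)
# 99.99% -> ~53 min/year    (under an hour)
```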

Once an understanding of the IT systems architecture was achieved, the team was able to summarize the requirements for operations, contractors, and equipment reliability levels.  

Improvements in the process 

Our specialists traveled to the sites to assess the infrastructure, read the existing documentation, and check how closely the data center designs matched what had actually been built.

A separate area of focus was interviews with the responsible employees and their managers, who described what they do in various work situations and how the key processes of operating the engineering systems are organized.

After the work started and the specifics of the task became clear, the client "gave up" a little: we heard a request to "just write all the necessary documentation", quickly and without a deep dive into the processes.

However, proper optimization of data center engineering management means teaching people to assess the processes properly and to write documentation tailored to the specifics of the facility.

It is impossible to produce a working document for a specific maintenance area manager unless you work with them at the site continuously for several months. So this approach was rejected: we found local leaders who were willing to learn themselves and to lead their subordinates.

Having explained the algorithm for creating the documents, the requirements for their content, and the principles of organizing the ecosystem of instructions, we spent the next six months overseeing the detailed writing of the documentation and the staff's step-by-step transition to the new way of working.

This was followed by a phase of initial support for working under the updated regulations, which lasted a year in a remote format. Then we moved on to training and drills - the only way to put the new material into practice.

What's been done 

In the process, we were able to resolve several serious issues.  

First of all, we avoided duplicate documentation, which the client's employees had feared. To this end, the new regulations combined the standard regulatory requirements applied to the various engineering systems (electrical, cooling, access control) with industry best practices, creating a transparent documentation structure with simple and logical navigation.

The principle of "easy to find, easy to understand, easy to remember" was reinforced by linking the new information to the employees' existing experience and knowledge.

Next, we reshuffled the staff of service engineers: several people turned out to be completely unprepared for the changes. The resistance of some was overcome during the project by demonstrating the benefits, but a certain percentage of employees proved untrainable and unreceptive to new things.

We were also surprised by the company's casual attitude toward its IT infrastructure: from the lack of redundancy for critical systems to the chaos in its structure and management.

Within a year and a half, the engineering systems management processes were brought up to a level that allowed the company's specialists to successfully report on quality to the auditors from headquarters.

If the company maintains this pace of development of its operational component, it will be able to pass any existing data center certification from the leading international agencies.

Summary 

In general, the prospects for consulting in the field of data center operational management are, in our view, very bright.

The digitalization of the economy and the public sector is in full swing. Yes, there will be many adjustments to the launch of new projects and to the plans for developing old ones, but this does not change the essence: operations need to be improved, if only to increase the efficiency of the sites that have already been built.

The main problem here is that many managers do not realize what thin ice they are walking on when they fail to pay proper attention to this area. The human factor is still the main source of the most unpleasant accidents and failures, and this needs to be explained.

Government data center projects are also becoming more relevant and require increased attention in terms of operations: the scope of government IT systems is growing. Here, too, a system of standardization and certification of sites will need to be developed and introduced.

Once the requirements for public data centers in Russia are consolidated into a standard at the legislative level, it can also be applied to commercial data centers, including those hosting public IT resources.

Work in this area is ongoing; we are taking part in the process by consulting with the Ministry of Digital Development and by building up competencies for teaching data center operation courses at ANO Data Center. There is not much experience with such tasks in Russia, and we believe it should be shared with colleagues and clients.
