IT has changed over the past 10 years with the adoption of cloud computing, continuous delivery, and significantly better telemetry tools. These technologies have spawned an entirely new container ecosystem, demonstrated the importance of strong security practices, and have been a catalyst for a new world of big data. Small and midsize businesses, or SMBs, and enterprises alike now likely need to employ data engineers, data scientists, and security specialists. These roles may be siloed right now, but history tells us there can be a more collaborative path.
DevOps broke down the barrier between development and operations to create the best methodology for building and shipping software. InfoSec is the latest member to join the DevOps value stream. Take a look around the internet and you’ll find plenty of posts on integrating security and compliance concerns into a continuous delivery pipeline, including The DevOps Handbook, which dedicates an entire chapter to the topic. I’ve also written about continuous security on this blog. While Dev, InfoSec, and Ops have become the “DevSecOps” you see splashed around on the internet, we need a new movement that’s rooted in DevOps philosophy to bring in data workers.
Calling for DevSecDataOps
So the name may not be great (you may have even seen it coming), but I’ll make my case for integrating all data-related activities into the DevOps value stream. My case begins with the second DevOps principle: The Principle of Feedback.
The Principle of Feedback requires teams to use automated telemetry to identify new production issues and verify the correctness of any new release. Let’s put aside the first clause and instead focus on the arguably more important second clause. First, I must clarify a common shortcoming. Many teams ship changes to production and consider that “done”. That’s not “done”; it is still “work in progress”. “Done” means delivering the expected business value in production.
Imagine your team ships feature X to production. The product manager expects to see Y engagement on feature X and possible changes in business KPIs A, B, and C. The Principle of Feedback requires telemetry for Y, A, B, and C so that the team can confirm or deny, within a reasonable time, that feature X produced the expected outcome. Businesses live and die with this type of telemetry, and data engineers and data teams are increasingly responsible for providing it. Thus, data workers are part of the critical path from business idea to delivering business value in production. Moreover, their workflows and processes must move at the speed of continuous delivery. In other words: it’s time to bring the data side into the DevOps value stream.
InfoSec and Compliance for Data Pipelines
I think it’s clear that the business KPIs and user engagement telemetry collected by data teams are critical to the business. It’s also clear to me that the principle of continuous security connects to compliance in data pipelines. In my previous post, I predicted that the GDPR (General Data Protection Regulation) would be a big deal in 2019. Data warehouses and data lakes are potential sources of GDPR and other regulatory infractions; consider a retention policy, adopted to satisfy the GDPR’s storage limitation and erasure requirements, under which all user data must be deleted within 90 days of terminating service.
One solution is to deploy time-to-live telemetry on different types of data, creating alerts for violations. Another solution is to add automated tests for scripts that scrub user data and run the tests as part of the automated deployment pipeline. Hopefully, there’s already a set of automated tests for whatever transformation and munging goes on. If not, then this is a place to create and start building a deployment pipeline for the data processing system. Plus, a deployment pipeline is required for data teams to move at the speed of DevOps.
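The time-to-live check described above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the record layout (`user_id`, `terminated_at`), the `find_retention_violations` name, and the 90-day window are assumptions chosen to match the example in this post.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window, matching the 90-day example in the text.
RETENTION = timedelta(days=90)


def find_retention_violations(records, now, retention=RETENTION):
    """Return user IDs whose data outlived the retention window.

    Each record is a dict with a hypothetical layout:
    {"user_id": str, "terminated_at": datetime or None}.
    Records for active users (terminated_at is None) are skipped.
    """
    violations = []
    for record in records:
        terminated_at = record.get("terminated_at")
        if terminated_at is None:
            continue  # user still has service; nothing to delete yet
        if now - terminated_at > retention:
            violations.append(record["user_id"])
    return violations


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"user_id": "old-user", "terminated_at": now - timedelta(days=120)},
        {"user_id": "active-user", "terminated_at": None},
    ]
    # A real pipeline would raise an alert here instead of printing.
    print(find_retention_violations(sample, now))
```

The same function can back both ideas from the paragraph above: run it on a schedule to drive an alert, or call it from an automated test that asserts the scrub script leaves no violations behind.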
Converge Data Engineering and DevOps Practices
Data teams can benefit from DevOps practice. Consider what happens when a new data scientist joins the team. That person needs an environment to build and test their models. This requires test data, workstation setup, and even cloud infrastructure setup. This also calls for automation backed by infrastructure-as-code, a key DevOps practice (along with the management of test data, and any other artifact required to bootstrap a new environment). The environment may be something simple like a dedicated EC2 instance or a more complex pipeline of data streams and serverless Lambda functions. Regardless, the setup can and should be automated.
Consider an architecture with a transactional system and a separate data processing system. The data system ingests data from the transactional system to produce reports, KPIs, and other real-time telemetry. Our imaginary feature X spans both systems: functional implementation changes in the transactional system, and processing or analysis changes in the data processing system. Both systems need to be developed simultaneously, tested alongside each other, and ultimately promoted together to production. Note the relationship between these two systems. The data system should not be tested in production, especially if its outputs drive business decisions. Technical issues should not prevent the team from achieving this. It just requires some automated elbow grease and collaboration. If both systems are encapsulated in infrastructure-as-code, then it should be possible to deploy each system into an isolated and dedicated test environment, enabling smoke testing across both systems. A simple test triggers feature X and asserts on the availability of telemetry Y, A, B, and C. This small test eliminates an entire class of costly regressions like misconfigured integration points and flat-out broken implementations. If the automated tests pass, then both systems can be promoted into production. That’s continuous delivery in a nutshell, or The Principle of Flow. The Principle of Flow leads us back to where we started: The Principle of Feedback.
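Here is a rough sketch of that cross-system smoke test. Everything in it is hypothetical: `FakeTransactionalSystem` and `FakeDataSystem` are in-memory stand-ins for the two deployed test environments, and the metric names are simply the Y, A, B, and C placeholders from this post. In a real pipeline the two classes would be thin clients that talk to the environments created from infrastructure-as-code.

```python
class FakeTransactionalSystem:
    """Hypothetical stand-in for the transactional system under test."""

    def __init__(self, event_sink):
        self._event_sink = event_sink

    def trigger_feature_x(self, user_id):
        # Using the feature emits an event that the data system ingests.
        self._event_sink.append({"type": "feature_x_used", "user_id": user_id})


class FakeDataSystem:
    """Hypothetical stand-in for the data processing system under test."""

    METRICS = ("Y", "A", "B", "C")

    def __init__(self, events):
        self._events = events

    def available_metrics(self):
        # Metrics only exist once at least one feature X event has arrived.
        if any(event["type"] == "feature_x_used" for event in self._events):
            return set(self.METRICS)
        return set()


def smoke_test_feature_x(transactional, data_system, expected=("Y", "A", "B", "C")):
    """Trigger feature X, then assert every expected metric is available."""
    transactional.trigger_feature_x(user_id="smoke-test-user")
    missing = set(expected) - data_system.available_metrics()
    assert not missing, f"missing telemetry: {sorted(missing)}"
    return True


if __name__ == "__main__":
    events = []  # shared list plays the role of the integration point
    smoke_test_feature_x(FakeTransactionalSystem(events), FakeDataSystem(events))
    print("smoke test passed")
```

The test deliberately checks only that the telemetry is available, not that its values are right. That is enough to catch a misconfigured integration point before promotion.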
Earlier, we set aside the first part of The Principle of Feedback. Now we must return and apply it to data pipelines. Data pipelines are just like any other IT component. At runtime, they can be impacted by operational conditions such as memory limits, CPU thrashing, network latency, disk capacity, or bandwidth saturation. There are known telemetry playbooks for common data pipeline components such as Kafka or Hadoop. There are also known abnormal operational conditions, application-specific failure modes, and tripwires. Consider a data pipeline using Kafka. If there are no messages across the ingestion stream, then something is wrong. That’s a simple tripwire, and it covers integration points. Data stores and processing systems also require standard USE (Utilization, Saturation, Errors) metrics and relevant alerts. One example is the disk capacity inside a data warehouse system. A known limit, say 85% utilization, can be defined to trigger an alert and a documented resolution. Again, applying these telemetry practices is a core DevOps concept.
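Both checks from the paragraph above fit in a small sketch. The function names, the `Alert` type, and the idea of passing in pre-collected numbers are assumptions for illustration; in practice the inputs would come from whatever metrics system already watches the pipeline, and the alerts would go to a paging or chat tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    detail: str


def check_ingestion(messages_in_window: int) -> Optional[Alert]:
    """Tripwire: no messages on the ingestion stream means something is wrong."""
    if messages_in_window == 0:
        return Alert("ingestion_stalled", "no messages seen in the last window")
    return None


def check_disk(used_bytes: int, total_bytes: int, threshold: float = 0.85) -> Optional[Alert]:
    """Utilization check: alert once disk usage reaches the threshold (85% by default)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilization = used_bytes / total_bytes
    if utilization >= threshold:
        return Alert("disk_utilization_high", f"{utilization:.0%} of capacity used")
    return None


if __name__ == "__main__":
    # Sample readings; a real check would pull these from a metrics store.
    for alert in (check_ingestion(0), check_disk(870, 1000)):
        if alert is not None:
            print(f"ALERT {alert.name}: {alert.detail}")
```

Keeping each check a pure function of its inputs makes the alert rules themselves testable in the deployment pipeline, alongside the rest of the data system.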
There’s one remaining DevOps principle: The Principle of Continuous Learning and Experimentation. The entire IT organization must experiment and learn together. Integrating all members of the value stream is only possible if it’s attempted. Teams have to start somewhere. It may be by asking questions like “How can we test and deploy our product and data systems together?” or “How can we get more real-time data from our data pipeline?” Both are valid questions with many possible solutions. The best outcomes involve collaboration and experimentation. Your organization will make progress when proper leadership and learning are applied.
How to Shift to DevSecDataOps
Cloud Academy has a deep training catalog for anyone interested in development, security, operations and/or data. You can lead the convergence in your team or organization with a strong knowledge mix across these areas.
The DevOps Culture learning path teaches you to see things from a DevOps perspective and how to bridge gaps in your organization. There’s also a lab on Building a Data Pipeline with DCOS that connects data pipelines to ops and infrastructure. The library of data-oriented courses can get you started with AWS, Google Cloud Platform, or Azure.
Cloud Academy provides in-depth courses which will take you from zero to hero on infrastructure-as-code and configuration management tools:
- Terraform (developed in coordination with HashiCorp)
- Puppet (developed in coordination with Puppet)
- Chef (developed in coordination with Chef)
- Ansible (developed in coordination with Ansible — and my personal favorite!)
There’s also an introduction to continuous delivery course.
With Cloud Academy, you can learn the skills to make essential changes, converging development, operations, InfoSec, and data engineering. Here’s my last question to you: How will you lead DevSecDataOps in your company?