AWS Lambdas for system testing of VPC restricted services

Or; How to drink the serverless Kool-Aid when your services are too high volume

Contents

  1. Context
  2. Getting Into It
  3. Epilogue

Context

I’ve been told that in the current meta Serverless is the cure to all software engineering ills. I’m not sure that’s quite the case yet, nor am I completely clear on what exactly people are referring to when they talk about the “current meta”, but it is safe to say that a not insignificant number of problems (at a certain scale) have been reduced to near triviality by the emergence of serverless infrastructure. Or, to be rather more precise: a lot of smart people who help build and design serverless systems have allowed many of us less infrastructure-savvy people to avoid ever having to deal with a number of infrastructure problems.

I’m a big fan of the flexibility of systems built atop serverless technologies. The ability to iterate on business problems without worrying about optimizing for current scaling needs is very empowering at both an engineering and product level. For all of its flexibility, there are several situations in which serverless may not fit the problem space: security concerns that necessitate a harder segregation between the public internet and your network, call volume that makes the pricing problematic, and the need (real or imagined) to use technologies not yet supported in serverless (e.g. gRPC), among others.

It was a combination of these first three that led the team I’m a part of at SendGrid to decide to build certain of our systems as VPC-ful, ECS-hosted, gRPC-over-Envoy microservices. Exhausted of both buzzwords and hyphens, but very happy with the performance of our system, we still found ourselves in a spot of discomfort. We’re in the habit of automating our system tests, and given the closed nature of several of these intra-VPC services, we didn’t have a great way to run these tests as part of our CI/CD pipeline. scp and ssh were the order of the day, until one of the new services had particularly good coverage as a result of its simplicity: a dual gRPC and HTTP API with a DynamoDB backend. dynamodb-local had allowed us to write fairly comprehensive tests that could be run locally, but there was no reason why these couldn’t double as system tests.

Getting Into It

A seldom used feature of go test (at least by yours truly) is the -c flag, which compiles a package's tests into a standalone binary instead of running them. It struck me that, given the flexibility of the Lambda runtime used for Go-based Lambdas (and the newer Lambda Layers), it might be feasible to compile a static binary of our tests that could be run by Lambda.

A few changes to TestMain made sure we could run the tests both locally and remotely:

var (
	addr         = flag.String("addr", "", "Address to run grpc tests against. If left empty, local-dynamo is assumed and a server is started locally")
	httpaddr     = flag.String("http-addr", "", "Address to run http tests against. If left empty, local-dynamo is assumed and a server is started locally")
	...
)

...

func TestMain(m *testing.M) {
	...
	signer = iamsign.NewStaticSigner(sess.Config.Credentials, cfg.AWSRegion)

	runLocally = *addr == "" && *httpaddr == ""
	if runLocally {
		signer = &iamsign.TestSigner{}

		// some setup for locally running dependencies
		...

		service, err = New(dynamo, tableName)
		if err != nil {
			// Include the underlying error so a failed setup is diagnosable
			logrus.WithError(err).Fatal("failed to initialize local service")
		}

		port := freeport.GetPort()
		*addr = fmt.Sprintf("localhost:%d", port)

		// Create a raw listener
		lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
		if err != nil {
			logrus.Fatal(err)
		}

		// Start the GRPC server
		grpcSrv = grpc.NewServer()
		api.RegisterAPIServer(grpcSrv, service)
		go func() {
			logrus.WithField("port", port).Info("starting grpc server")
			if err := grpcSrv.Serve(lis); err != nil {
				logrus.WithError(err).Error("failed to serve grpc")
			}
		}()

		ts = httptest.NewServer(service.Gin())
		*httpaddr = ts.URL
	}

	...
}

Feeling confident in these changes, I went ahead and ran

GOOS=linux GOARCH=amd64 go test -c -o system-tests -mod vendor
zip system-tests.zip system-tests

Obviously, this led to immediate fun and profit, right? If you’ve been an engineer for any amount of time, or at least one of my caliber (read: as bad as I am), you know that statement to be a lie. There were any number of IAM permission hiccups, Lambda VPC misconfigurations, and missing environment variables. Past all of these missteps there remained one glorious word: PASS. Unfortunately, it was in a red box in the AWS Console and returned 1 when invoked with the AWS CLI.

Oh gosh, to stumble so close to the finish line! I must gather my ideas back together and consider a way around this predicament! - Me [1], a couple of days ago

The issue wasn’t particularly hard to figure out: Lambda expects to receive a certain payload in response to its invocation of the underlying functions. We needed to call our binaries (our automated destructive tests being in a separate package, which necessitated two different test binaries), and wrap their execution in a neat little Lambda package.

Determined to find a solution before I left town for my wedding and honeymoon, and completely devoid of any clever names, I came up with lamex. It’s a very simple (with much room to improve) exercise in Go os/exec Command invocation and output redirection. Basically, given a newline-separated file with multiple commands and flags, we could report the status of our test binaries back to Lambda!

All that was left was a little Makefile wizardry (I am insufficiently adept in Makefile to make the distinction), Terraforming the test Lambda, and having the build pipeline invoke the test Lambda with

aws --region us-east-2 lambda invoke --function-name system-tests-staging out --log-type Tail --query 'LogResult' --output text | base64 -d

Epilogue

These are early days, and we’ve yet to fully plumb the depths of the successes and failures of this approach. Clearly, our system tests are now time limited by the Lambda execution timeout, which isn’t too bad for our normal happy path and negative testing, but which may become an issue with longer running destructive tests. At any rate, it was a fun approach to try out, and one that may have some ROI for those of you whose services are in an isolated network and clamoring for a better CI/CD automated test solution.

1: Those who know me, and perhaps in particular those who work with me, may object to the veracity of the specific wording of the quote.
