Architecture

Raccoon, written in Go, is a high-throughput, low-latency service that provides an API to ingest streaming data from mobile apps and websites and publish it to Kafka. Raccoon supports WebSocket, REST and gRPC protocols for clients to send events. With WebSockets it provides long-lived persistent connections, without the overhead of the additional header sizes incurred by the HTTP protocol. Raccoon supports Protocol Buffers and JSON as serialization formats. WebSocket and REST API support both, whereas gRPC supports only Protocol Buffers. It provides an event-type-agnostic API that accepts a batch (array) of events in protobuf format. Refer here for the data definition format that Raccoon accepts.

Raccoon was built with the primary purpose of collecting user behaviour data in near real time. User behaviour data is a stream of events that occur as users traverse a mobile app or website. Raccoon powers analytics systems, big data pipelines and other disparate consumers by providing high-volume, high-throughput ingestion APIs for real-time data. Raccoon's key architecture principle is an event-agnostic backend: it accepts events of any type without being aware of the type. It is this capability that enables Raccoon to evolve into a strong player in the ingestion/collector ecosystem for real-time streaming and analytical needs.

System Design

HLD

At a high level, the following sequence details the architecture.

  • Raccoon accepts events through one of the supported protocols.
  • The events are deserialized using the correct deserializer and then forwarded to a buffered channel.
  • A pool of worker goroutines works off the buffered channel.
  • Each worker iterates over the event batch, determines the topic based on the event type, and sends the serialized bytes to the Kafka producer synchronously (a sketch of this pattern follows the list).
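The following is a minimal sketch of the buffered-channel/worker pattern described above. The EventsBatch type, topic pattern and publish function are illustrative assumptions, not Raccoon's actual implementation.

package main

import (
	"fmt"
	"sync"
)

// EventsBatch is a hypothetical stand-in for a deserialized SendEventRequest.
type EventsBatch struct {
	Events []Event
}

type Event struct {
	Type       string
	EventBytes []byte
}

// publish is a placeholder for the synchronous Kafka producer call.
func publish(topic string, payload []byte) {
	fmt.Printf("publishing %d bytes to %s\n", len(payload), topic)
}

func main() {
	// The buffered channel decouples request handlers from Kafka publishing;
	// in Raccoon its size comes from configuration.
	bufferCh := make(chan EventsBatch, 100)

	var wg sync.WaitGroup
	const workers = 4
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker drains the channel and publishes events per type.
			for batch := range bufferCh {
				for _, e := range batch.Events {
					topic := fmt.Sprintf("clickstream-%s-log", e.Type)
					publish(topic, e.EventBytes)
				}
			}
		}()
	}

	bufferCh <- EventsBatch{Events: []Event{{Type: "viewed", EventBytes: []byte("...")}}}
	close(bufferCh)
	wg.Wait()
}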

Note: The internals of each of these components, such as channel size, buffer sizes and publisher properties, are configurable, enabling Raccoon to be provisioned according to the system/event characteristics and load.

Connections

Websockets

Raccoon supports long-running persistent WebSocket connections with clients. Once a client makes an HTTP request with a WebSocket upgrade header, Raccoon upgrades the HTTP request to a WebSocket connection, at the end of which a persistent connection is established with the client.

The following sequence outlines the connection handling by Raccoon:

  • Clients make WebSocket connections to Raccoon by performing an HTTP GET call with the headers required to upgrade to a WebSocket connection.
  • Raccoon uses gorilla/websocket handlers; for each WebSocket connection, the handler spawns a goroutine to handle incoming requests.
  • After the websocket connection has been established, clients can send the events.
  • Raccoon constructs a connection identifier from the request headers. The identifier is built from the value of the SERVER_WEBSOCKET_CONN_ID_HEADER header. For example, if Raccoon is configured with SERVER_WEBSOCKET_CONN_ID_HEADER=X-User-ID, Raccoon reads the value of the X-User-ID header and uses it as the identifier. Raccoon then uses this identifier to check whether a connection with the same identifier already exists. If one does, Raccoon disconnects the new connection with an appropriate error message in the response proto.
    • Optionally, you can also configure SERVER_WEBSOCKET_CONN_GROUP_HEADER to support multi-tenancy. For example, if you want to use one instance of Raccoon with multiple mobile clients, you can configure Raccoon with SERVER_WEBSOCKET_CONN_GROUP_HEADER=X-Mobile-Client. Raccoon will then use the value of X-Mobile-Client along with X-User-ID as the identifier, so uniqueness becomes the combination of the X-User-ID and X-Mobile-Client values. This way, Raccoon can maintain the same X-User-ID across different X-Mobile-Client groups.
  • Raccoon verifies whether the total number of connections has reached the limit configured by SERVER_WEBSOCKET_MAX_CONN. If the limit has been reached, Raccoon disconnects the connection with an appropriate error message in the response proto.
  • Upgrade the connection and persist the identifier.
  • Add ping/pong handlers and a read timeout deadline on this connection. More about these handlers in the following sections.
  • At this point, the connection is completely upgraded and Raccoon is ready to accept SendEventRequest messages. The handler handles each SendEventRequest by sending it to the events channel. The events can be published by the publisher either synchronously or asynchronously, based on the configuration.
  • When the connection is closed, Raccoon cleans up the connection along with the identifier. The same identifier can then be reused by an upcoming connection.
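For illustration, here is a minimal client-side sketch that opens a WebSocket connection to Raccoon with the identifier headers discussed above, using gorilla/websocket. The endpoint path, address and header names are assumptions based on the examples in this section.

package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

func main() {
	// Headers assumed to match SERVER_WEBSOCKET_CONN_ID_HEADER and
	// SERVER_WEBSOCKET_CONN_GROUP_HEADER configured on the server.
	header := http.Header{}
	header.Set("X-User-ID", "user-1234")
	header.Set("X-Mobile-Client", "android")

	// The address and path are illustrative; use your deployment's endpoint.
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/api/v1/events", header)
	if err != nil {
		log.Fatalf("websocket dial failed: %v", err)
	}
	defer conn.Close()

	// A serialized SendEventRequest is written as a binary message.
	payload := []byte{} // placeholder for proto.Marshal(&SendEventRequest{...})
	if err := conn.WriteMessage(websocket.BinaryMessage, payload); err != nil {
		log.Fatalf("write failed: %v", err)
	}

	// Read the SendEventResponse acknowledgement from the server.
	_, resp, err := conn.ReadMessage()
	if err != nil {
		log.Fatalf("read failed: %v", err)
	}
	log.Printf("received %d response bytes", len(resp))
}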

REST

Clients connect to the server using the same endpoint but with the HTTP POST method. Since it is a REST endpoint, each request is handled independently.

  • The connection identifier is constructed from the values of the SERVER_WEBSOCKET_CONN_ID_HEADER and SERVER_WEBSOCKET_CONN_GROUP_HEADER headers here too.
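A minimal sketch of sending the JSON request shown later in this document over REST follows; the address, endpoint path and header names are assumptions.

package main

import (
	"bytes"
	"log"
	"net/http"
)

func main() {
	// JSON body matching the SendEventRequest shape shown in
	// "Request and Response Models"; eventBytes is base64-encoded.
	body := []byte(`{
	  "req_guid": "1234abcd",
	  "sent_time": {"seconds": 1638154927, "nanos": 376499000},
	  "events": [{"eventBytes": "...", "type": "booking"}]
	}`)

	// The address and path are illustrative; use your deployment's endpoint.
	req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/api/v1/events", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// Identifier header, assuming the same configuration as the WebSocket example.
	req.Header.Set("X-User-ID", "user-1234")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("status: %s", resp.Status)
}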

gRPC

It is recommended to generate the gRPC client for Raccoon's EventService and use that client to make gRPC requests. Currently, only unary requests are supported.

  • The client's SendEvent method is called to send the events.
  • The connection identifier is constructed from the values of SERVER_WEBSOCKET_CONN_ID_HEADER and SERVER_WEBSOCKET_CONN_GROUP_HEADER in the gRPC metadata.

Clients can send requests at any time as long as the WebSocket connection is alive, whereas with REST and gRPC each request is a one-shot call.
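A minimal sketch of the unary gRPC call is shown below, assuming a generated EventService client in a hypothetical pb package and the identifier headers passed as gRPC metadata.

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/metadata"

	pb "example.com/yourapp/raccoonpb" // hypothetical generated package for EventService
)

func main() {
	conn, err := grpc.Dial("localhost:8080", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	client := pb.NewEventServiceClient(conn)

	// Identifier values travel as gRPC metadata, mirroring the HTTP headers.
	ctx := metadata.AppendToOutgoingContext(context.Background(),
		"X-User-ID", "user-1234",
		"X-Mobile-Client", "android",
	)

	resp, err := client.SendEvent(ctx, &pb.SendEventRequest{
		ReqGuid: "1234abcd",
		Events:  []*pb.Event{{Type: "booking", EventBytes: []byte("...")}},
	})
	if err != nil {
		log.Fatalf("SendEvent failed: %v", err)
	}
	log.Printf("status=%v code=%v", resp.Status, resp.Code)
}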

Event Delivery Guarantee (at-least-once most of the time)

The server provides an at-least-once event delivery guarantee most of the time.

Event data loss happens in the following scenarios:

  • When the server shuts down, events in flight in the Kafka buffer or stored in the internal channels are potentially lost. The server sends, on a best-effort basis, all the events to Kafka within the configured shutdown time WORKER_BUFFER_FLUSH_TIMEOUT_MS. The default is 5000 ms, within which all the events are expected to be sent.

  • When the upstream Kafka cluster is facing downtime

    Every event sent from the client is stored in memory in the buffered channels (explained in the Acknowledging events section). The workers pull the events from this channel and publish them to Kafka. The server does not maintain any event persistence. This is a conscious decision to enable a simpler, more performant ingestion design for the server. Buffering and retries of failed events are delegated to Kafka's internal buffer and retries respectively. In future, the server can be augmented for zero data loss or at-least-once guarantees through intermediate event persistence.

Acknowledging events

Event acknowledgements are designed to signify whether a batch of events was received and sent to Kafka successfully. This enables clients to retry on failed event delivery. Raccoon chooses when to send the event acknowledgement based on the configuration parameter EVENT_ACK.

EVENT_ACK = 0

Raccoon sends the acknowledgement as soon as it receives and successfully deserializes the events from the SendEventRequest proto. This configuration is recommended when low latency takes precedence over end-to-end acknowledgement. The acks are sent even before the events are produced to Kafka. The following picture depicts the sequence of the event ack.

Pros:

  • Performant, as it does not wait for a Kafka/network round trip for each batch of events.

Cons:

  • Potential data loss, and clients do not get a chance to retry/resend the events. The possibility of data loss arises when the Kafka broker cluster is facing downtime.

EVENT_ACK = 1

Raccoon sends the acknowledgement after the events are successfully acknowledged by the Kafka brokers. This configuration is recommended when reliable end-to-end acknowledgements are required. Here, the underlying publisher's acknowledgement is leveraged.

Pros:

  • Minimal data loss, clients can retry/resend events in case of downtime/broker failures.

Cons:

  • Increased end-to-end latency, as clients need to wait for the events to be published.

Considering that Kafka is set up in a clustered, cross-region, cross-zone environment, the chances of it going down are unlikely. Even if it does go down, the number of events lost is negligible considering it is a streaming system expected to forward millions of events per second.

When a SendEventRequest is sent to Raccoon over any connection, be it WebSocket/HTTP/gRPC, a corresponding response is sent by the server indicating whether the events were consumed successfully or not.

Supported Protocols and Data formats

Protocol    Data Format   Version
WebSocket   Protobufs     v0.1.0
WebSocket   JSON          v0.1.2
REST API    JSON          v0.1.2
REST API    Protobufs     v0.1.2
gRPC        Protobufs     v0.1.2

Request and Response Models

Protobufs

When a SendEventRequest proto like the one below, containing events, is sent over the wire,

message SendEventRequest {
  // unique guid generated by the client for this request
  string req_guid = 1;
  // time probably when the client sent it
  google.protobuf.Timestamp sent_time = 2;
  // actual events
  repeated Event events = 3;
}

a corresponding SendEventResponse is sent by the server.

message SendEventResponse {
  Status status = 1;
  Code code = 2;
  /* time when the response is generated */
  int64 sent_time = 3;
  /* failure reasons if any */
  string reason = 4;
  /* Usually detailing the success/failures */
  map<string, string> data = 5;
}

JSON

When a JSON event like the one shown below is sent, a corresponding JSON response is sent by the server.

Request

{
  "req_guid": "1234abcd",
  "sent_time": {
    "seconds": 1638154927,
    "nanos": 376499000
  },
  "events": [
    {
      "eventBytes": "Cg4KCHNlcnZpY2UxEgJBMRACIAEyiQEKJDczZTU3ZDlhLTAzMjQtNDI3Yy1hYTc5LWE4MzJjMWZkY2U5ZiISCcix9QzhsChAEekGEi1cMlNAKgwKAmlkEgJpZBjazUsyFwoDaU9zEgQxMi4zGgVBcHBsZSIDaTEwOiYKJDczZTU3ZDlhLTAzMjQtNDI3Yy1hYTc5LWE4MzJjMWZkY2U5Zg==",
      "type": "booking"
    }
  ]
}

Response

{
  "status": 1,
  "code": 1,
  "sent_time": 1638155915,
  "data": {
    "req_guid": "1234abcd"
  }
}
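As an illustration, a client could construct and serialize a SendEventRequest as sketched below, assuming Go types generated from the protos above in a hypothetical pb package.

package main

import (
	"log"
	"time"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/timestamppb"

	pb "example.com/yourapp/raccoonpb" // hypothetical generated package
)

func main() {
	// The inner event payload is an opaque serialized event; Raccoon never
	// inspects these bytes, it only routes them based on `type`.
	innerEvent := []byte("...") // e.g. proto.Marshal of your own event proto

	req := &pb.SendEventRequest{
		ReqGuid:  "1234abcd",
		SentTime: timestamppb.New(time.Now()),
		Events: []*pb.Event{
			{EventBytes: innerEvent, Type: "booking"},
		},
	}

	payload, err := proto.Marshal(req)
	if err != nil {
		log.Fatalf("marshal failed: %v", err)
	}
	// payload is what gets written over WebSocket, POSTed over REST,
	// or passed to the generated gRPC client's SendEvent call.
	log.Printf("serialized request is %d bytes", len(payload))
}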

Event Distribution

Event distribution works by finding the type of each event in the batch and sending the events to the appropriate Kafka topic. The topic name is determined by the following code:

topic := fmt.Sprintf(pr.topicFormat, event.Type)

where
  • topicFormat is the configured pattern EVENT_DISTRIBUTION_PUBLISHER_PATTERN
  • type is the type set by the client when the event proto is generated

For example, setting

EVENT_DISTRIBUTION_PUBLISHER_PATTERN=topic-%s-log

and a type such as type=viewed in the event format

message Event {
  /*
    `eventBytes` is where you put bytes serialized event.
  */
  bytes eventBytes = 1;
  /*
    `type` denotes an event type that the producer of this proto message may set.
    It is currently used by raccoon to distribute events to respective Kafka topics. However the
    users of this proto can use this type to set strings which can be processed in their
    ingestion systems to distribute or perform other functions.
  */
  string type = 2;
}

will have the event sent to a topic like

topic-viewed-log

The event distribution does not depend on any partitioning logic, so events can be randomly distributed to any Kafka partition.

Event Deserialization

The top-level wrapper SendEventRequest is deserialized, which provides a list of events of type Event proto. This event wrapper is composed of serialized bytes, which is the actual event, set in the eventBytes field inside the Event proto. Raccoon does not open these underlying bytes. The deserialization is used only to unwrap the event type and determine the topic that the eventBytes (an event) need to be sent to.
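A minimal sketch of this unwrapping step, assuming generated Go types for the protos above in a hypothetical pb package; only the wrapper is deserialized and the inner eventBytes are passed through untouched.

package main

import (
	"fmt"
	"log"

	"google.golang.org/protobuf/proto"

	pb "example.com/yourapp/raccoonpb" // hypothetical generated package
)

// unwrap deserializes the wrapper and groups the opaque payloads by topic
// without ever deserializing the inner eventBytes.
func unwrap(raw []byte, topicFormat string) (map[string][][]byte, error) {
	req := &pb.SendEventRequest{}
	if err := proto.Unmarshal(raw, req); err != nil {
		return nil, err
	}
	byTopic := make(map[string][][]byte)
	for _, e := range req.Events {
		topic := fmt.Sprintf(topicFormat, e.Type)
		byTopic[topic] = append(byTopic[topic], e.EventBytes)
	}
	return byTopic, nil
}

func main() {
	byTopic, err := unwrap([]byte{}, "clickstream-%s-log")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("events grouped into %d topics", len(byTopic))
}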

Channels

Buffered channels are used to store the incoming event batches. The channel sizes can be configured based on load and capacity.

Keeping connections alive

The server ensures that connections are recyclable. It adopts mechanisms to check connection idle time. The handlers ping clients every 30 seconds (configurable). If a client does not respond within a stipulated time, the connection is marked as corrupt; every subsequent read/write on this connection fails, and Raccoon then removes the connection. Clients can also ping the server, and the server responds with pongs to these pings. Clients can programmatically reconnect on failed or corrupt server connections.
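A minimal sketch of this keepalive mechanism using gorilla/websocket; the interval and timeout values are illustrative rather than Raccoon's actual configuration.

package keepalive

import (
	"time"

	"github.com/gorilla/websocket"
)

// pingLoop periodically pings the client and extends the read deadline
// whenever a pong is received; a missed pong eventually fails the reads.
func pingLoop(conn *websocket.Conn, interval, timeout time.Duration) {
	conn.SetReadDeadline(time.Now().Add(timeout))
	conn.SetPongHandler(func(string) error {
		// Pong received: the connection is healthy, push the deadline out.
		return conn.SetReadDeadline(time.Now().Add(timeout))
	})
	// Respond to client-initiated pings with pongs.
	conn.SetPingHandler(func(appData string) error {
		return conn.WriteControl(websocket.PongMessage, []byte(appData), time.Now().Add(time.Second))
	})

	ticker := time.NewTicker(interval) // e.g. 30s, as described above
	defer ticker.Stop()
	for range ticker.C {
		if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)); err != nil {
			// Ping failed: the connection is considered corrupt; subsequent
			// reads/writes fail and the caller can remove the connection.
			return
		}
	}
}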

Components

Kafka producer

Raccoon uses confluent-kafka-go as the producer client to publish events. Publishing events is lightweight and relies on the Kafka producer's retries. Confluent internally uses librdkafka, which produces events asynchronously. The application writes messages using a function-based producer API:

Produce(message, deliveryChannel) -- deliveryChannel is where the delivery reports or acknowledgements are received.

Raccoon internally checks for these delivery reports before pulling the next batch of events. On failed deliveries, the appropriate metrics are updated. This mechanism makes event delivery synchronous and reliable.
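A minimal sketch of this produce-then-wait-for-delivery-report pattern with confluent-kafka-go; the broker address and topic name are illustrative.

package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	producer, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}
	defer producer.Close()

	topic := "clickstream-viewed-log" // illustrative topic name
	deliveryChan := make(chan kafka.Event, 1)

	// Produce is asynchronous; the delivery report arrives on deliveryChan.
	err = producer.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Value:          []byte("serialized event bytes"),
	}, deliveryChan)
	if err != nil {
		log.Fatalf("produce failed: %v", err)
	}

	// Waiting on the delivery report makes the publish effectively synchronous.
	e := <-deliveryChan
	m := e.(*kafka.Message)
	if m.TopicPartition.Error != nil {
		log.Printf("delivery failed: %v", m.TopicPartition.Error)
	} else {
		log.Printf("delivered to %s [%d] at offset %v",
			*m.TopicPartition.Topic, m.TopicPartition.Partition, m.TopicPartition.Offset)
	}
}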

Observability Stack

Raccoon internally uses a StatsD Go client module to export metrics in the StatsD line protocol format. A recommended observability stack is Telegraf as the receiver of these measurements, exporting them to InfluxDB; InfluxDB to store the metrics; and Grafana to build dashboards using InfluxDB as the source.
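For reference, the StatsD line protocol mentioned above is a plain-text UDP format. The sketch below emits a hypothetical counter metric to a local Telegraf/StatsD listener; the metric name and address are illustrative.

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Telegraf's statsd input commonly listens on UDP 8125.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// StatsD line protocol: <metric.name>:<value>|<type>
	// "c" is a counter; gauges ("g") and timers ("ms") follow the same shape.
	if _, err := fmt.Fprintf(conn, "raccoon.events.received:1|c"); err != nil {
		log.Fatalf("write failed: %v", err)
	}
}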