De-serializing Kafka Messages With a Union-Defined Field

I was writing a Kafka consumer application for a proof-of-concept (POC) project recently when I ran into some weird de-serialization behavior: whenever the consumer reached a certain field, it would always throw an error. This was despite the fact that, as I checked repeatedly, the Kafka producer was serializing and publishing the object to the topic correctly.

Initially, I thought the JSON library I was using was parsing it incorrectly. Both GSON and Jackson, however, failed across several parsing attempts. This made me think it was not the JSON libraries' fault, although I was not totally convinced, so I searched the Internet to see whether anybody had encountered anything similar and how they had fixed it.

To give some context on the issue, the error I got said that a boolean was expected for that field but an object was found instead.

java.lang.IllegalStateException: Expected a boolean but was BEGIN_OBJECT at line XX column YY path...

And for further context, I defined that particular field (well, there were 2 fields like this in the Avro schema) with a union type. In Avro, every field must have a type, and that type can be defined as a union of more than one datatype when a single type is not enough for the field.

In my case, I wanted a boolean field that could also hold a null value.

 {
 	"name": "foobar",
 	"type": ["boolean", "null"]
 }
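For context, a minimal complete schema around this field might look like the following sketch. The record name is my own invention; the field names are the ones that appear in the debug output later in this post.

```json
{
	"type": "record",
	"name": "FoobarEvent",
	"fields": [
		{ "name": "someField", "type": "string" },
		{ "name": "dependencyField", "type": "boolean" },
		{ "name": "foobar", "type": ["boolean", "null"] }
	]
}
```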

Yes I know, boolean with a null value? What was I thinking?

As quirky as this may sound, it is allowed in Java by using the Boolean wrapper class instead of the boolean primitive.

Running in debug mode, this is what happened:

{
	"someField": "hello world",
	"dependencyField": true,
	"foobar": {
		"boolean": false
	}
}

But the expected format is supposed to be (and this is how the Avro message is published to the Kafka topic before consumption):

{
	"someField": "hello world",
	"dependencyField": true,
	"foobar": false
}
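What the debug output shows is Avro's JSON encoding at work: a non-null union value gets wrapped in a single-entry object keyed by its type name. As a sketch (not the approach we actually used), a consumer that parses the JSON generically could unwrap such values before binding them. The helper below assumes the parsed field arrives as a Map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UnionUnwrap {

    // Unwrap an Avro-JSON union value: a non-null value arrives as a
    // single-entry object like {"boolean": false}; a null arrives as-is.
    static Object unwrapUnion(Object fieldValue) {
        if (fieldValue instanceof Map && ((Map<?, ?>) fieldValue).size() == 1) {
            return ((Map<?, ?>) fieldValue).values().iterator().next();
        }
        return fieldValue; // plain null, or a value that was never wrapped
    }

    public static void main(String[] args) {
        Map<String, Object> wrapped = new LinkedHashMap<>();
        wrapped.put("boolean", false); // as seen in the debug output
        System.out.println(unwrapUnion(wrapped)); // false
        System.out.println(unwrapUnion(null));    // null
    }
}
```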

The reason behind this design is that the field, which I called foobar in our example, depends on another boolean field: if the latter is true, then and only then should foobar hold either a true or false value. Simply defaulting it to false might convey the wrong state to further processors down the line.
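In Java terms, this tri-state semantics looks roughly like the sketch below. The class and method names are my own illustration of the dependency just described, not code from the actual project:

```java
public class FoobarState {

    // foobar is a Boolean (not boolean) so it can be null when not applicable.
    // It may carry true/false only when dependencyField is true; otherwise
    // it must stay null, meaning "not applicable".
    static boolean isConsistent(boolean dependencyField, Boolean foobar) {
        return dependencyField ? foobar != null : foobar == null;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(true, Boolean.FALSE)); // true
        System.out.println(isConsistent(false, null));         // true
        System.out.println(isConsistent(false, Boolean.FALSE)); // false
    }
}
```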

At some point I even toyed with the idea of converting the field to a String with 3 possible values: “YES”, “NO” and “N/A”. That would have been more logical. But I did not. I stuck with boolean.

Ultimately, I knew I had to change the schema to get past that error. I did not want to, but since this was still a POC project, I figured I could live with it. I took out the union type and defined the field as boolean only. Then the issue went away.
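In schema terms, the reworked field definition was simply:

```json
{
	"name": "foobar",
	"type": "boolean"
}
```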

This did not answer the question, though, of why that particular field, with a union of boolean and null as its type, was being interpreted differently than expected.

UPDATE:

So I just found out that there is a JIRA issue about this: https://issues.apache.org/jira/browse/AVRO-1582

It looks like this issue has been around for some time and is still unresolved. From the timestamp, it was reported in September 2014.

2nd UPDATE:

It’s been months since I last bothered about producing records to Kafka. For almost the entire time since then, my team’s focus has been on consuming records instead.

So anyway, I would just like to add a few things about defining a field type as a union in an Avro schema. I might have picked up a thing or two over the course of time. (Experience really is a great teacher!)

Here are some useful hints:

  1. If the type can be null, list "null" first. This is not a strict rule, but if you want the field to have a default value, that default must match the first type in the union, so a null default requires "null" to come first.
    • e.g. { "name": "foobar", "type": ["null", "int"] }
    • e.g. { "name": "foobar", "type": ["null", "int"], "default": null }
  2. A literal null value can be assigned directly to that field.
    • Correct: e.g. { "foobar": null }
  3. When creating a record with the other type, int, the value has to be wrapped in an object keyed by the type name rather than assigned directly. Thus,
    • Wrong: e.g. { "foobar": 21 }
    • Correct: e.g. { "foobar": { "int": 21 } }
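To make hints #2 and #3 concrete, here is a minimal sketch of how a hand-rolled producer could render a ["null", "int"] field into Avro's JSON encoding. The helper name is hypothetical:

```java
public class UnionJson {

    // Render a nullable int field value in Avro's JSON encoding:
    // null is written literally (hint #2), while a non-null value
    // is wrapped in an object keyed by its type name (hint #3).
    static String encodeNullableInt(Integer value) {
        return value == null ? "null" : "{\"int\": " + value + "}";
    }

    public static void main(String[] args) {
        System.out.println(encodeNullableInt(null)); // null
        System.out.println(encodeNullableInt(21));   // {"int": 21}
    }
}
```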

For several days back then, as I have explained in this article, my team and I, who were practically getting our feet wet for the first time in the world of Kafka, were perplexed as to why our consumer could not de-serialize the record properly. We did not know about hint #3 then. We thought it was a bug. To clarify, we knew the union field was where the exception was happening; we just did not know why de-serialization behaved that way.

Lastly, being able to produce Kafka records to a topic is still important for Kafka consumer development. When you are on the other end of the flow, the receiving end, you definitely want to keep that flow coming; being data-starved is bad from a consumer's perspective. How else would you test your consumer client code without records coming in from the topic? Of course, there is unit testing, but it is much nicer to test the consumer running under load, with hundreds of thousands of records being consumed. With your own producer, you can also manipulate the data being sent to the topic to your heart's content, which is an easy way to force and test errors on the consumer side.
