Check the errors coming back on your poll/commit. The kafka stack should tell yo...

Check the errors coming back on your poll/commit. The kafka stack should tell you when you can retry items. If it is in the middle of something sometimes it does not always fail nicely but you can retry and it is usually fine. Usually I see that sort of behavior if the whole cluster just 'goes away' (reboots, upgrades, etc). It will yeet out a network error and then just stop doing anything. You have to watch for it and recreate your kafka object (sometimes, sometimes retry is fine). If they are bouncing the whole cluster on you each broker can take a decent amount of time before they are alive again. So if you have 3 and they restart all 3 in quick succession all at once you will see some nasty behavior out of the kafka stack. You can fiddle your retries and timeouts. However, if that is lower than it takes for the cluster to come back you can end up with what looks like a busted kafka stream. I have seen it take anywhere from 3-10 mins for a single broker to restart sometimes (other times it is like 10 seconds). So depending on the upgrade/patch script that can be a decent outage. It goes really sideways if the cluster has a lot of volume to replicate between topics on each broker (your replica factor).