GitHub only keeps 14 days of traffic data. After that it's gone, which left a lot of our questions unanswered:
— Was there a spike in clones last month, and where did it come from?
— Which docs pages are actually being read?
— How many people came to the repo over the last 3 months?
We were flying blind. Not the best way to maintain an open source repo.
So we built zingg-stats — a small daily automation that runs three Python scripts every morning and posts everything to Slack:
→ GitHub traffic (views, clones, referrers) saved to CSV before the 14-day window closes
→ Charts showing trends over the last 3 months
→ Google Analytics summary from yesterday
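The first script is the important one: GitHub's `/repos/{owner}/{repo}/traffic/views` and `/traffic/clones` REST endpoints only return the last 14 days, so each morning's numbers get appended to a CSV before they roll off. A minimal sketch of that step (the function names and CSV layout here are illustrative, not the actual zingg-stats code; the endpoints and payload shape are GitHub's real traffic API):

```python
# Daily GitHub-traffic snapshot sketch. Appends (date, kind, count, uniques)
# rows to a CSV so history survives past GitHub's 14-day window.
import csv
import json
import os
import urllib.request

API = "https://api.github.com/repos/{repo}/traffic/{kind}"

def traffic_to_rows(payload, kind):
    """Flatten a GitHub traffic payload into (date, kind, count, uniques) rows.

    The API returns e.g. {"count": 120, "uniques": 30,
      "views": [{"timestamp": "2024-05-01T00:00:00Z", "count": 7, "uniques": 3}, ...]}
    """
    series = payload.get("views") or payload.get("clones") or []
    return [(item["timestamp"][:10], kind, item["count"], item["uniques"])
            for item in series]

def fetch_traffic(repo, kind, token):
    # Traffic endpoints require push access, hence the token.
    req = urllib.request.Request(
        API.format(repo=repo, kind=kind),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def append_csv(path, rows):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "kind", "count", "uniques"])
        writer.writerows(rows)  # a real run should de-duplicate by date first

if os.environ.get("GITHUB_TOKEN"):  # only hit the API when a token is configured
    for kind in ("views", "clones"):
        payload = fetch_traffic("zinggAI/zingg", kind, os.environ["GITHUB_TOKEN"])
        append_csv("traffic.csv", traffic_to_rows(payload, kind))
```

Run it once a day (cron, GitHub Actions, whatever) and the overlapping 14-day windows stitch into a continuous history, which is what the 3-month trend charts are built from.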
We built it with Claude and kept it generic enough that any open source maintainer should be able to adapt it to their own project in an afternoon.
If you run an open source project and have the same blind spots, give it a try.
As part of my data consulting, I struggled with identity resolution and started working on scalable, no-code identity resolution - https://github.com/zinggAI/zingg/ . It has pushed my limits as a software engineer and product builder, and I had to do a lot of learning to build it. It's cool to see people use Zingg in their workflows and save months of work on custom solutions. A big highlight has been North Carolina Open Campaign Data https://crossroads-cx.medium.com/building-open-access-to-nc-...
Thanks for your support. Yes, we ship with some examples and their models, which can be run out of the box. We have three customer demographic datasets and an ecommerce item-matching example across Google and Amazon. You can check them here https://github.com/zinggAI/zingg/tree/main/examples
We see that performance varies by
a) Number of attributes to match
b) Size of data
c) Type of matching and the features we compute for each
d) Hardware and cluster size
Although we do not do matching across languages, like English with Chinese, we have tested Zingg quite rigorously with Chinese, Japanese, Hindi, German, and other languages, and it seems to work out of the box, likely thanks to Java's built-in Unicode support and the ML-based learning.
You make a great point about continuous variables like lat/long, age, etc. Age seems to work, again thanks to integer differences and the learning. We have not tried lat/long yet. Would you have any dataset you could recommend for testing?