Hacker News | codesoap's comments

A while ago I wrote https://github.com/codesoap/osmar, a tool for searching PostgreSQL databases containing OpenStreetMap data from the command line. Now I have rewritten the tool to read PBF files directly instead of querying a database. This article gives some insight into the performance optimizations needed to make this approach viable.


Only briefly. HAR is extremely verbose, which makes it impractical for storing large numbers of requests and responses. It's also not line-based and not designed for easy streaming/filtering (it has header/config fields), and its more granular structure (e.g. HTTP headers as separate JSON objects) leaves less freedom for creating malformed requests, which can be desirable when trying to find bugs in web applications.


It was a tradeoff. The solution with "base" is more compact and arguably easier to read. On the other hand, httpipe is easier to filter or manipulate if the fields are separate. For example, to pick out all requests to a certain host, independent of port or protocol, one could just use this:

    jq -c 'select(.host == "test.net")' my_stored_results.jsonl
The port of some stored requests can be swapped out like this:

    jq -c '.port = 8080' my_stored_requests.jsonl
In other words: I chose to sacrifice compactness for "ease of tinkering".

However, I was also thinking about specifying an optional, informational "url" field or something similar, which also includes the path: "https://test.net/src" for your example. This field would be ignored by tools, but it would make it easier to copy things over into curl or a browser for further investigation.


Hm, yeah, I wondered if it might be something like that. I would say that adding an additional field might be the worst option, as then the info could get out of sync somehow. I'd just pick whichever; if it's the combined one, have some code to explode it into its parts, or keep what you have and have some code to combine it, and maybe print the combined one in a log line.


Good point. Maybe a small 'hp2url' or 'hp2curl' tool, which takes httpipe input and prints URLs or curl commands for each given request, would be a better solution.
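To illustrate, a minimal Python sketch of such an 'hp2curl' conversion might look like the following. Note that the field names used here (host, port, tls, path) are illustrative guesses based on this thread, not the actual httpipe specification:

```python
import json
import shlex

def hp2curl(line):
    """Turn one httpipe-style JSON record into a curl command.

    The field names (host, port, tls, path) are assumptions for
    illustration, not the real httpipe field names.
    """
    rec = json.loads(line)
    scheme = "https" if rec.get("tls") else "http"
    url = f"{scheme}://{rec['host']}:{rec['port']}{rec.get('path', '/')}"
    return "curl " + shlex.quote(url)

print(hp2curl('{"host": "test.net", "port": 443, "tls": true, "path": "/src"}'))
# curl https://test.net:443/src
```

A real tool would read one JSON line at a time from stdin and print one command per record.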


I haven't been on the hunt much since I wrote the tools, but the first two bug bounties I ever received were issued for things I found with a web fuzzer. No doubt I would have made that money with pfuzz, if I'd had it back then :-)


Wow, didn't expect you to see this, but glad to have you here :-) Don't hesitate to contact me if you have any feedback!


To be frank, I hadn't even considered looking at Wireshark's file formats when thinking of existing file formats that I could reuse. I've now taken a brief look, and it seems like Wireshark supports quite a lot of different formats [1], but the preferred one seems to be PcapNG [2]. At first glance, there are several attributes that make them less suitable for my purposes:

1. PcapNG, as well as the other file formats, appears to store packets, which is a lower level than HTTP requests and unnecessarily verbose for my intended purposes.

2. They are binary formats, which makes them less suitable for printing to stdout. It also means they are not line-based, so UNIX tools like grep cannot be used effectively.

3. They are not designed for streaming. The httpipe format is line-based and contains no header/global fields. Thus it is trivial to, for example, build a filtering program: it would just read one line at a time and print it again if it matches the filter criteria; the output would automatically be valid httpipe again.

4. Lastly, parsing and composing JSON is something most developers have done before and basically every programming language has libraries for it. This makes it easy for the ecosystem to grow and enables users to build custom tools without too much initial effort.

[1] https://wiki.wireshark.org/FileFormatReference

[2] https://pcapng.com/
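As a sketch of how simple such a streaming filter could be (assuming a "host" field, as in the jq examples above; this is not an official httpipe tool):

```python
import json

def filter_stream(lines, host):
    """Yield only those JSON-line records whose "host" field matches.

    Matching lines are re-emitted unchanged, so the output is valid
    input for the next tool in the pipe. The "host" field name is an
    assumption for illustration.
    """
    for line in lines:
        line = line.strip()
        if line and json.loads(line).get("host") == host:
            yield line

records = [
    '{"host": "test.net", "port": 80}',
    '{"host": "other.org", "port": 80}',
]
for rec in filter_stream(records, "test.net"):
    print(rec)
```

In a real pipeline, `lines` would simply be `sys.stdin`.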


Fair enough. This does seem like one of those things that's so simple that no one thinks "oh, I will grab the library to work with it". That's kind of a good thing. I mean, JSON is so simple, but I still use a library for reading and writing it.


I implemented my own packet capture stuff for Wireshark; the format is pretty straightforward. My use case was different and I reimplemented the rpcapd protocol, but the packets themselves are easy to dump, assuming you have access to the raw packet information (Ethernet headers and whatnot). You can of course also synthesize that information if needed.


I have thought about this a lot, but came to the conclusion that people are most likely to write tools if the format is easy to parse and construct in many programming languages. I think it's hard to find an encoding that fits these criteria better than JSON.

It also has the benefit of being single-line, since newline characters are escaped, which is necessary for a line-based format. This, in turn, allows the use of many UNIX tools, like grep.

However, it certainly is not the most compact format when encoding large amounts of binary data. Gzipping for storage will alleviate the overhead somewhat. Overall, dealing with large amounts of binary data seems like a less common use case in web application testing, so I felt I shouldn't put too much focus on it.
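A tiny Python demonstration of the newline point: a record whose body contains newlines and quotes still serializes to a single line, which is what makes one-record-per-line tooling work:

```python
import json

# A request body containing a newline and a quote character.
record = {"host": "test.net", "body": "line one\nline \"two\""}

encoded = json.dumps(record)
# json.dumps escapes the newline as \n, so the whole record
# stays on one line and grep/jq can treat it as one unit.
assert "\n" not in encoded
# Decoding restores the original body, newline included.
assert json.loads(encoded)["body"] == record["body"]
print(encoded)
```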


It seems to me like "fuzzing" has a different meaning in web application penetration testing. Here, "fuzzer" is a term for tools that just generate different requests using wordlists, without adding any mutations. For example, the two popular tools ffuf [1] and wfuzz [2] also call themselves fuzzers.

I see how reusing a term for a different concept is bothersome, but I feel like "fuzzer" is the term that people learning about bug bounty hunting are familiar with.

[1] https://github.com/ffuf/ffuf

[2] https://wfuzz.readthedocs.io/en/latest/
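To make the distinction concrete, here is a toy sketch of "fuzzing" in this wordlist sense: no mutation, just substituting wordlist entries into a FUZZ placeholder, the way ffuf and wfuzz do. This is not pfuzz's actual implementation:

```python
def expand(template, wordlist, marker="FUZZ"):
    """Generate one request path per wordlist entry by replacing
    the marker; a web "fuzzer" then sends one request per path."""
    return [template.replace(marker, word) for word in wordlist]

paths = expand("/FUZZ/index.html", ["admin", "backup", "test"])
print(paths)
# ['/admin/index.html', '/backup/index.html', '/test/index.html']
```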


Yeah, this is generally what people mean by "web fuzzing".


> go is the worst because the unix-hater club at the root of their development team [...]

I was surprised to find this sentiment about Go. I always felt Go focused on simplicity, modularity, good tooling and even good integration with C. Those are qualities that seem UNIX-affirmative to me.

Where does De Raadt's opinion come from? Could it be referring to Rob Pike's blasphemous Plan 9 adventure, or is it just a provocative rambling about a certain ioctl incident?


Author here. I've heard that before. Maybe I should really rename. Got any suggestions? "osms" (OSM search) might be misleading because it reminds me of SMS. "osmfind" is a little long for my taste, but well, maybe it's better than "osmf".


osmq where the 'q' is for query?


Oh, I like that one. I'll think about it for a bit, but maybe I'll take it. Thanks!


I just discovered https://osmq.eu and https://osmq.org. I guess this name is taken :(


May I suggest osmfd? I use fd instead of find on the command line. Also, it just means adding one letter, and there's no ambiguity.


Good idea. I'll think about it. Thanks!


XOSM - as in eXplore

CLOSM - might be better as a horror movie


Hmm, I'm afraid I'm a bit biased against anything "explore", because I always associate it with Microsoft (Internet Explorer and File Explorer).

