I actually came there via two other aggregators (Twitter and Prismatic) and thought that I had found the source. Will be more careful next time - sorry about that. In any case it was an interesting read.
I have seen many people try and shoehorn various problems into MongoDB, when in most cases a relational database would have been better suited.
I have yet to see a real use case for Mongo unless you are building a Facebook clone. Can someone suggest when it is actually useful over a properly tuned relational database?
I guess I kind of reached the irrelevance stage just thinking about the sort of problems it would be suited to rather than actually building anything.
> I have yet to see a real use case for Mongo unless you are building a Facebook clone
People seem to keep making the mistake of building a social network in mongo. It's entirely unsuitable for that.
Source: I've helped 3 companies move their social networks from MongoDB to OrientDB when they figured out that MongoDB prevents them from shipping features that users expect, e.g. friend-of-a-friend style queries
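For what it's worth, friend-of-a-friend is exactly the access pattern that hurts in a document store: each hop means another round of queries from the application, while a graph database runs the whole traversal server-side. A rough sketch of the pattern in plain Python, with made-up data:

```python
# Friend-of-a-friend over an in-memory adjacency map.
# In a document store each hop is a separate query per user;
# in a graph database the traversal is one server-side operation.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "eve"},
    "dave": {"bob"},
    "eve": {"carol"},
}

def friends_of_friends(user):
    """Second-degree connections, excluding the user and direct friends."""
    direct = friends.get(user, set())
    result = set()
    for friend in direct:  # one "query" per friend in a doc store
        result |= friends.get(friend, set())
    return result - direct - {user}

print(sorted(friends_of_friends("alice")))  # → ['dave', 'eve']
```

In OrientDB the equivalent is a single traversal query, which is why moving off MongoDB unlocks this kind of feature.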
What has your experience been like in deploying and working with OrientDB? This is a bit of really cool tech that I've been keeping an eye on for some time but haven't gotten around to really playing with it.
What server language were you using and where were you deploying?
OrientDB is very cool software, I find it pretty hard to go back to traditional databases now that I've seen how powerful graphs are, but the cool thing about it is that it's still a document store at heart, so you get all the same advantages of mongo, but with the graph awesomeness on top. It's a fantastic tool, but there are a few quirks that can catch beginners out and the documentation is not stellar. It also requires some configuration tweaking to get the best performance for your workload.
I've been using node.js (I develop the official driver - https://github.com/codemix/oriento), but there's pretty good libraries emerging for other languages too. Most people I've worked with are deploying to AWS, one company was running on the bare metal.
I actually JUST discovered oriento and was absolutely delighted to see a bluebird promise-based api. API looks fantastic. Thanks a ton for creating the lib.
Do you have any advice for someone thinking of deploying on GCE?
My use-case would be for an online code editor (Plunker, if you've heard of it) with users, projects, packages, collections (of projects), comments and project versions (stored as content-addressable git-compatible objects).
I'm also interested in understanding if there is any built-in compression mechanism because I will be storing a large volume of very similar text files. Any hints?
Have you used the lucene indexes much? If so, can you do any of the crazy faceting delivered by ElasticSearch?
> Do you have any advice for someone thinking of deploying on GCE?
No, sorry, I've never used that. However, generally - like all dbs, Orient is happiest when it has access to a lot of RAM. Also the Write Ahead Log can take up a significant amount of disk space, those are two things to be immediately aware of.
> I'm also interested in understanding if there is any built-in compression mechanism because I will be storing a large volume of very similar text files. Any hints?
Sadly it doesn't yet do LevelDB-style document compression, and I've seen no hints that it's on the horizon, but the OrientDB guys are pretty responsive and would probably be open to the suggestion.
> Have you used the lucene indexes much? If so, can you do any of the crazy faceting delivered by ElasticSearch?
I'm just starting to use the lucene indexes now so I can't give much feedback on those yet. It should be possible to do faceting using SQL to a limited degree, but I don't think there's native support for it yet. I think that will get improved in the next few versions because people are crying out for it.
actually both, in many cases. Neo4j does have quite a hostile license, but its biggest downside is that it is graph-only, whereas OrientDB (and ArangoDB, which is also worth looking at) are "multi-model" - you can use them as pure document (or even key/value) stores, and the graph is just another way of viewing/representing that data. This is really powerful and means that you don't have the problem of duplicating data between different database systems or needing to do cross-data-store joins in your application.
I don't want to create a schema in a relational database for that kind of use case. MongoDB fits well there.
Edit: Sorry for my short and imprecise comment. I regret posting it, losing all my karma. Yes, of course I would use a schema, but I may need hundreds of attributes to get my products presented with all attributes and variants available. There is a set of attributes that will remain for all products; others will vary, some will change and will be dropped. Some may be conditional, depending on size or color. I could model that in an RDBMS as well, adding those properties in an additional JSON column or something else.
Yes, I'm also mainly using relational databases and just wanted to give an example of a use case for MongoDB. Sorry for the confusion.
You are always going to end up creating a schema, whether it's explicit in your tables or implicit within your code. Otherwise you end up with completely heterogeneous data which is impossible to query in any useful way.
Not really. Maybe some products have an attribute that others don't. Maybe you want to query by that attribute. The query won't turn up products without that attribute. Maybe that's ok. Like if you wanted to find all shoes with black laces you can just query for that. Tractors don't even have laces and so they don't come up in the results.
In a SQL DB, every single thing needs to fit into a flat schema with nesting done by relations. Additionally, it's very rigid. Maybe you just don't want to set up a laces table, and 4000 other tables for every little attribute of every product. Of course this is not how you would do it in real life; there are better ways. Mongo is one of those ways.
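To illustrate the matching semantics with a plain-Python sketch (made-up product data, not real driver code): documents that lack an attribute simply never match a query on it, which is roughly what `find({"laces.color": "black"})` does in Mongo:

```python
# MongoDB-style matching semantics in plain Python: documents that
# lack an attribute simply don't match a query on that attribute.
products = [
    {"type": "shoe", "name": "Runner", "laces": {"color": "black"}},
    {"type": "shoe", "name": "Loafer"},                # no laces at all
    {"type": "tractor", "name": "T-800", "wheels": 4},
]

def find(collection, criteria):
    """Rough analogue of find() with dotted-path criteria."""
    def matches(doc, path, expected):
        for key in path.split("."):
            if not isinstance(doc, dict) or key not in doc:
                return False
            doc = doc[key]
        return doc == expected
    return [d for d in collection
            if all(matches(d, path, v) for path, v in criteria.items())]

hits = find(products, {"laces.color": "black"})
print([d["name"] for d in hits])  # → ['Runner']
```

The tractor never comes up: it has no laces, so the query silently skips it, which may or may not be what you want.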
Saying that Mongo is good for nothing is just as dumb as saying that Mongo is good for everything. In 2015, the Mongo backlash is just as tired as the Mongo hype.
I would probably solve this problem in a relational DB by having an ItemHeader table with all the common attributes. Then I'd make an ItemAttributes table that uses the foreign key of the first table and just holds key/value pairs for all the unique attributes you want to give to the item. Any sort of query on any of the attributes can be done via a simple join.
Though I suppose it gets trickier if attributes may have sub-attributes etc....
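A minimal sketch of that ItemHeader/ItemAttributes (entity-attribute-value) layout, using sqlite3 so it runs anywhere; the table and column names here are made up:

```python
# EAV layout: common attributes in ItemHeader, everything else as
# key/value rows in ItemAttributes, queryable via one join.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ItemHeader (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ItemAttributes (
        item_id INTEGER REFERENCES ItemHeader(id),
        key TEXT, value TEXT
    );
""")
db.execute("INSERT INTO ItemHeader VALUES (1, 'Runner shoe')")
db.execute("INSERT INTO ItemHeader VALUES (2, 'Tractor')")
db.executemany("INSERT INTO ItemAttributes VALUES (?, ?, ?)",
               [(1, "lace_color", "black"), (2, "wheels", "4")])

# Any attribute is queryable with a simple join:
rows = db.execute("""
    SELECT h.name FROM ItemHeader h
    JOIN ItemAttributes a ON a.item_id = h.id
    WHERE a.key = 'lace_color' AND a.value = 'black'
""").fetchall()
print(rows)  # → [('Runner shoe',)]
```

The usual caveat with EAV is that everything becomes TEXT and the database can no longer type-check or constrain individual attributes, which is part of why it gets tricky with sub-attributes.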
Are you saying you're using a single MongoDB collection as a dumping ground for all your entities? I hope you're not building a real product that someone has to maintain.
I don't actually use Mongo at all, I prefer Postgres. But for loosely structured data, it has a use case. Just like dynamic vs typed languages, there are pros and cons.
Agreed. That said, optional schema enforcement would be really nice (in my experience, most of the data has a predefined format; so why not define schema and enforce it?).
Not quite. The 'type' in the example is a label for what it contains - it's wholly semantic - as opposed to a field that would exert some sort of constraint such as describing the data type, collation, restrictions, etc. Schemaless databases don't completely abandon all forms of organisation because that'd be unusable. They just make the organisation looser. For example, if the user wanted to they could add a record to the products in that example that doesn't have a 'type' (although in the case of that example that probably wouldn't be useful).
The thing is that in order to use that /label/ you need to write logic to handle it in your code; ultimately you end up implementing a schema anyway, the only difference being that you enforce it in your application.
This label becomes not much different from a nullable column in a database.
Now things become hairier when you realize that perhaps you want to keep more information about the actor, so for example you want to change details.actor to details.actor.name. You have two choices: either run through your database and modify all documents (which is similar to what an RDBMS migration does) or write code to handle both cases. The second one seems easier, but as you accumulate more changes it'll come back and bite you hard.
Later you might realize that by repeating details about the actor for every single movie you're not only wasting a lot of resources (your database is bigger and slower) but also hurting integrity (in one movie perhaps you include the actor's middle name, or maybe you have a typo).
At that point you'll start creating a collection that holds actors and then only store its key in "details.actors". You'll soon realize that you're basically reimplementing a relational database on top of Mongo, except it's not only way slower, it also makes your application more complex.
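The "write code to handle both cases" option, sketched in plain Python (field names from the comment, data made up) - this is the kind of branching that accumulates with every schema change:

```python
# Reading details.actor when old documents store a plain string and
# newer ones store a nested object with a name field.
old_doc = {"title": "Movie A", "details": {"actor": "Jane Doe"}}
new_doc = {"title": "Movie B", "details": {"actor": {"name": "John Roe"}}}

def actor_name(doc):
    actor = doc["details"]["actor"]
    if isinstance(actor, dict):  # new shape: details.actor.name
        return actor["name"]
    return actor                 # legacy shape: details.actor is a string

print(actor_name(old_doc), "/", actor_name(new_doc))
```

One branch is harmless; a dozen of them, for a dozen historical shapes of the same document, is where the pain described above comes from.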
There are uses for NoSQL, but whenever you have to ask yourself which model you should go with pretty much always the answer will be: relational.
Have a category field or link in the table. Join to it. Joins are not evil, they are damn useful. And well done if your product has become large enough to have scaling problems that can't be solved by a bit of indexing on the joins.
Diaspora is a social network that made the mistake of choosing mongodb for their first iteration. Here's a blog post where one of the involved developers explains why it's not good for that, even: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never...
> I have yet to see a real use case for Mongo unless you are building a Facebook clone. Can someone suggest when it is actually useful over a properly tuned relational database?
Seamless auto-sharding, supported as a first-class use case. Particularly in a cloud environment where you can have your system automatically spin up more hosts as load increases (and take them down as load drops) with zero human intervention, which is really nice.
There are plenty of alternatives in that space nowadays (more than when mongo first came out), and reasonable people can disagree over whether to use mongo rather than e.g. cassandra, riak, or shonky third-party clustering addons for mysql/postgresql, but I've yet to see an affordable relational database with out-of-the-box auto-sharding support that comes anywhere close to what mongo offers.
(Also, first-class support for async-io clients. This is about client libraries rather than the server itself, and purely an artefact of when mongo was released, but if you're calling e.g. PostgreSQL from the JVM, most of the libraries are oriented towards blocking JDBC which is "good enough" (there are valiant efforts like https://github.com/mauricio/postgresql-async but you can't use them with established higher-level libraries). Whereas every mongo library offers a callback- or future-based API, and the higher-level libraries are built around this)
Sounds interesting, but frankly I wouldn't trust anything so young in production yet. For all mongo's faults, at this point the pitfalls are reasonably well-known and there are enough serious organizations using it at scale to give a certain level of confidence.
I generally agree as regards shoehorning data into MongoDB. I don't agree that a Facebook clone is an appropriate fit. When thinking through a project, my mental model generally self-configures into third normal form, which is to say I literally think in the manner of a relational database.
That said, I have found one specific use-case where it has worked incredibly well, which is as a JSON-based file system in which I can query into the document structure to find the records and data I need en masse with very simple queries. A project I've been working on for a few years has its own custom schema that fits far better into a JSON hierarchy than a relational one. In this specific system there are very few records (compared to our primary RDBMS which has millions), and when not being explicitly maintained, the files are generally static. The single-collection database in question with a few hundred JSON objects would likely have to be split into at least 10 tables with thousands of records each.
When it comes to using said "files" in production, I index the objects into memory for quick retrieval. In order to populate said cache, MongoDB allows me to query into the JSON file structure and retrieve the necessary "files" very simply into however many indices I need. And that's where it has shined in my experience; Not as a live read-write datastore for production use, but as read-heavy storage for a relatively small amount of JSON data with a deep hierarchy that would require an annoying amount of joins and subqueries for otherwise simple queries.
I've also tried it out with our multi-system logging system using capped collections (JSON formatted messages transported and collected with rsyslog). But we grew beyond its limits almost immediately and ended up in far more manageable territory with kibana / elasticsearch.
Hopefully a useful example, would love some comment.
I've been building https://rwt.to and https://movingjoburg.co.za with storage backed by MongoDB. Both are transit apps/websites in South Africa, could be more useful, except that we spend most of our time collecting and capturing data.
At the time I was very inexperienced (I guess I'm a bit better now?), but these are the reasons why I chose MongoDB.
1. Geospatial query support out of the box (this is during 2.0-2.2). It's now great with GeoJSON, I store routes and stations as GeoJSON, and querying them is very easy. Compare with PostgreSQL + PostGIS.
2. Some bus services have weird schedule structures, and I needed to be able to generate those schedules. I store schedules as a giant JSON file, and I can query that file to find when the next bus/train is. It's now possible with JSONB in PostgreSQL, but trying to fit that into a relational model wasn't worth the pain at the time, especially since there are always edge cases I have to cater for.
3. To create a pricing engine on rwt.to, I needed a very flexible schema, because I have to cater for many cases (are there bus/train transfers, discounts, and quite different rules for those transfers and discounts). MongoDB gave me the ability to vary my schema in the instances where I needed to store different types of data. I'd rather contend with that than have a table with 100 columns to do the same. To head off on a tangent, in GTFS this data (fares) is computed and stored in the table, but I calculate fares at query-time because there's sometimes a lot of rules to consider, making calculating them and caching them like GTFS unpleasant.
4. Another reason was that it was pleasant to work with MongoDB + Mongoose.js. This is very important as I'm the sole developer, and due to my line of work, I don't get all the time I need to work on both projects.
For my core transit data, I won't reach >20GB even when I manage to collect all the data for South Africa, so I won't have to contend with the 'big data' issues that other users would face.
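For anyone curious about point 1, this is roughly the shape of a GeoJSON station document and a `$near` query filter as you'd hand them to a driver such as pymongo; the names and coordinates below are made up, and a 2dsphere index is required (note GeoJSON order is [longitude, latitude]):

```python
# A station stored as a GeoJSON Point, plus a $near filter that finds
# stations within 2 km of a query point.
station = {
    "name": "Park Station",
    "location": {"type": "Point", "coordinates": [28.0436, -26.1952]},
}

nearby_filter = {
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [28.05, -26.20]},
            "$maxDistance": 2000,  # metres; requires a 2dsphere index
        }
    }
}

# Against a live database this would be roughly:
#   db.stations.create_index([("location", "2dsphere")])
#   db.stations.find(nearby_filter)
print(station["location"]["type"])  # → Point
```

The PostGIS equivalent (ST_DWithin over a geography column) is arguably more capable, but this is the "out of the box" experience being compared above.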
We are building a music streaming website. Think of it as a Spotify for Pakistan, and Mongo has been wonderful for the kind of data structure we have. It's perfect for a catalogue-like structure where objects can very easily nest inside each other. Each artist has multiple albums, and multiple artists have multiple songs. All of this nests very neatly, and when I want a song, I almost always need its album and artist data, so I can very easily get that. Mongo is blazing fast on this kind of nested data structure. The only slight problem is where you have playlists and need to reference songs for a certain artist. That can be countered by data duplication, since the song and artist data, once entered, will very rarely change. So for us it makes perfect sense. You really need to understand what your data model is, and what you want from it, and then Mongo will be your friend; otherwise it will cause you a world of pain.
How does that hierarchical model work when songs or albums have more than one artist?
Why does a traditional relational DB fail for such a well-defined data model? I mean, there's nothing about 'given a song (id), retrieve artist and album' that would make Mongo inherently better.
Lately (and not so lately ;) there has been a lot of bad press about MongoDB. We use it extensively as part of our product (on single servers) and it fills this role nicely. It has a few drawbacks, most notably huge disk space requirements (MongoDB has no compression) which we are hoping to solve with TokuMX (haven't tried it yet). It has some other quirks too, but in general it just... works. And I love using a document DB because it lets you use relations just when they are needed. With relational DBs you often have to use complex joins even for things which should be incredibly simple. On the other hand, I do miss JOINs when I need them (though we solved that on app level). And I would appreciate a way to define schema... :) But would I exchange document DB for relational DB for that? No (for the current project).
Maybe I'm getting old, but I really want DB to just work (unlike HBase, at least a few years ago). Other than that, I'm flexible.
The good news is, you can just wait a few weeks and get the possibility to use WiredTiger inside the upcoming 2.8.0. That will solve a few pain points that we've been dragging along for years (namely, document-level locking, disk requirements, multi-document transactions).
We don't use node.js. Also, we have app-level schema, but enforcing should (IMHO) be done on DB level so as to avoid any chance of invalid data.
Joins: easy, we specify which fields are connected (on app level), then fetching goes to connected collections and fetches data from there too. Not ideal, but it works. We made a similar solution for foreign keys.
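That app-level join pattern looks something like this sketch in plain Python (the collection and field names are hypothetical): the application knows which fields connect the collections and stitches the documents together itself:

```python
# App-level "join": movies.actor_id points into the actors collection,
# and the application fetches and attaches the connected documents.
actors = {1: {"_id": 1, "name": "Jane Doe"}}
movies = [{"title": "Movie A", "actor_id": 1}]

def fetch_movies_with_actors():
    results = []
    for movie in movies:                                  # first query
        joined = dict(movie)
        joined["actor"] = actors.get(movie["actor_id"])   # follow-up fetch
        results.append(joined)
    return results

print(fetch_movies_with_actors()[0]["actor"]["name"])  # → Jane Doe
```

It works, as the comment says, but every join you do this way is extra round trips and extra code that a relational database would handle in one query.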
Transactions: no need for any further guarantees. Our system mostly works on a single record at a time in all critical components. If this is not possible, we have app-level locking to avoid conflicts. No issues or wishes here.
> If this is not possible, we have app-level locking to avoid conflicts.
app-level locking is a recipe for disaster if data matters and more than 1 process/user/client/etc can access your data. if the data doesn't matter then I guess you can do whatever you want.
web "developers" have a bad reputation for a reason. so many amateurs amongst web developers...
I will assume the last part is just a generic statement and is not directed towards me. You are a Java dev I presume?
As for app-level locking, I am talking about sacrificing performance, not safety. We just make sure that some piece of code runs exactly once. Since the need for this is very rare and the places are not performance critical, we can live with that. So no, we have no need for additional transactional guarantees on DB level.
> I will assume the last part is just a generic statement and is not directed towards me.
It's most definitely directed at you.
> You are a Java dev I presume?
C#/C++. Though I've worked with java before amongst others.
> As for app-level locking, I am talking about sacrificing performance, not safety.
You are sacrificing both. Only idiots truly depend on app-level locking.
> We just make sure that some piece of code runs exactly once.
Amateur hour...
> Since the need for this is very rare and the places are not performance critical, we can live with that. So no, we have no need for additional transactional guarantees on DB level.
So it's a useless pointless trivial application...
MongoDB is not at all good for Social networks or anything resembling a graph. You are better off with Titan (horizontal scaling) or Neo4j (vertical scaling). Neo4j offers this great ability to query by a path, which no other database offers.
> Can someone suggest when it is actually useful over a properly tuned relational database?
In your own question you kind of hint at it. You need a tuned database. NoSQL lowers the barrier to entry for fast persistent data storage with replication. NoSQL doesn't replace SQL databases, it's looking to optimise a use case where transactions are not required.
Personally I use Mongo a bit like a cache, sitting in front of a SQL DB. Things that need to be ACID are handed back and forth. Those that don't are kept in Mongo and I get the best of both worlds.
> NoSQL doesn't replace SQL databases, it's looking to optimise a use case where transactions are not required.
I'd be curious to know in what contexts anyone would not want transactions.
I readily see the case for memcached or equivalent when it comes to caching -- it is very valid, and indeed useful, when slightly outdated data is perfectly acceptable. I can even picture how you might be using MongoDB to do the same, even if you admittedly have me wondering how you're invalidating the mess.
For a persistent data store, however, I'm honestly at a loss. In my own experience, transactions are needed as soon as there's a remote possibility that a concurrent write may occur. Even embedded systems need them, when you're threading statements concurrently for performance reasons, or when you're subsequently merging local data with another node. (See CoreData/iCloud bugs for what occurs when you ignore ACID in the latter case.)
Document databases normally have atomic updates at the document level. A document may contain many parts that would require a transaction in a relational DB, meaning you kind of get some transactions automatically. Other things are not possible to wrap in transactions, and then you must design the app to handle the resulting problems. And some document databases can actually use transactions, like FoundationDB.
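To make that concrete, here's a sketch (using sqlite3 so it's runnable) of the multi-statement transaction that a single atomic document write replaces; the table names and data are made up:

```python
# An order plus its line items spans two tables relationally and needs
# a transaction, but can be one nested document in a document store.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE line_items (order_id INTEGER, sku TEXT, qty INTEGER);
""")

with db:  # BEGIN ... COMMIT: both inserts succeed or neither does
    db.execute("INSERT INTO orders VALUES (1, 'alice')")
    db.executemany("INSERT INTO line_items VALUES (1, ?, ?)",
                   [("apples", 3), ("pears", 2)])

# Document-store equivalent: one atomic write, no transaction needed.
order_doc = {"_id": 1, "customer": "alice",
             "line_items": [{"sku": "apples", "qty": 3},
                            {"sku": "pears", "qty": 2}]}

count = db.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]
print(count)  # → 2
```

The catch, of course, is that the "free" atomicity ends at the document boundary: touch two documents and you're back to application-level coordination.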
> Can someone suggest when it is actually useful over a properly tuned relational database?
Maybe when developers don't want to bother to learn SQL?
During my CS degree we got a proper introduction to SQL and the relational algebra behind it, so for me it is just a tool, not a scary monster.
I really love the data integrity options I have at my disposal with relational databases.
So for me this whole NoSQL fad never made much sense. Then again, I never had to deal with a Facebook-scale problem, only daily reports from mobile network operators across all their network elements.
I'd say a graph database would work much better for building a Facebook clone compared to a document store like MongoDB... And a relational database would still be better than the document store.
I've been using Mongo happily to provide online analytics solutions, but the advantage is mostly from the development side, not really from the performance side.
On the other hand, this kind of approach is great for attribute matching, which is usually a nightmare to do properly with a RDBMS.
For some systems the development time is much shorter with a document DB (or a graph DB) than with a relational DB. That can be more important than pure performance. Basic scalability is also usually better, since it's often easy to shard data across servers.
I just like being able to insert a random JSON into a collection and query it by any of its properties. Not sure how I would do that with a relational database.
The reliability and speed advantages of PostgreSQL over MongoDB. Plus the fact that sometimes KV/document is better and sometimes normalized relational. Is nice if one tool solves both.
You can do this in Oracle 12 and I think in Postgres. You might want to think about why you need to store unstructured data as part of your application though and query it. If it's expected to live a long time and be queried it should probably be structured. If not you could store it (a serialized object) as a blob if you are just caching some state.
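As a sketch of the "query unstructured JSON inside a relational database" idea - shown here with SQLite's `json_extract` because it runs anywhere; in Postgres you'd use a `jsonb` column and the `->>` operator, and Oracle 12c has its own JSON query syntax:

```python
# Insert arbitrary JSON into a relational table and query it by any
# of its properties, without declaring those properties up front.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (body TEXT)")  # JSON stored as text
db.execute("INSERT INTO events VALUES (?)",
           ('{"kind": "click", "user": "alice"}',))
db.execute("INSERT INTO events VALUES (?)",
           ('{"kind": "view", "user": "bob"}',))

rows = db.execute(
    "SELECT json_extract(body, '$.user') FROM events "
    "WHERE json_extract(body, '$.kind') = 'click'"
).fetchall()
print(rows)  # → [('alice',)]
```

So "insert a random JSON and query it by any property" is available in relational databases too; Postgres additionally lets you index `jsonb` expressions for performance.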
I've used it and been happy. If you want something that's simple to use, know your schema partly but not entirely, and you don't mind losing a few writes, then mongo fits.
Relational databases place such emphasis on reliability. Mongo is generally reliable but only two or three nines, and it's simple.
For example, if you're collecting sub-cent line items for invoices, you might prefer to collect 99.x% of the line items simply over 100% at greater complication and expense, particularly if you can measure x.
What? When would it ever be OK to lose parts of an invoice? Mongo apologists baffle me sometimes. To anyone considering MongoDB: don't be fooled, you probably want to stick with an RDBMS.
When the parts are cheap compared to the cost of storing them in a real database.
I even know someone who stores invoice items in memcached. Whenever memcached evicts something that hasn't made it onwards to real storage, they lose the ability to invoice the right customer for an ad impression.
Not at all. Let me run an example with simple and made-up numbers.
Consider two storage solutions used to gather line items to invoices. The line items are ten cents on average.
One system is properly ACID, and running it costs one cent per line item on the invoice. The other loses 1% of writes randomly, but its hardware/backup/ops requirements are lower, so it costs just 0.1c/item.
The ACID way gives you complete invoices, but you spend 10% of the invoiced amount on the invoices. The lossy way makes your invoices 1% smaller, randomly, but the loss+cost adds up to about 2% instead of 10%.
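The arithmetic, checked with the same made-up numbers:

```python
# 10-cent line items; ACID storage costs 1c/item, the lossy store
# costs 0.1c/item but randomly drops 1% of writes.
item_value = 0.10

acid_overhead = 0.01 / item_value           # cost as share of revenue: 10%
lossy_overhead = 0.001 / item_value + 0.01  # storage cost + lost revenue: ~2%

print(f"ACID: {acid_overhead:.0%}, lossy: {lossy_overhead:.0%}")
```

Whether those cost figures are realistic is the real argument, of course; the point is only that the comparison is cost-per-item versus loss-rate, not "lossy is always wrong."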
It's like returns on physical goods, really. A random percentage of customers will return goods, you can't control that but you can estimate and monitor it, and optimize if the numbers aren't good enough.
Unfortunately I didn't find much information in the article.
Basically I got out of it:
- he likes JS, and was comfortable with MEAN-stack (MongoDB/Express web framework/AngularJS/Node.js)
- he found that for document oriented purposes MongoDB could sometimes be a nice fit
- he found that by replacing MongoDB with the drop-in replacement TokuMX (similar in spirit to the MySQL/MariaDB relationship), he could get a big performance increase
- he found that with Postgres 9.2+ using the JSON/document storage there he could get some of the relational benefits and some of the document benefits, and ACID
- he likes ACID because of transactions/handling multiple documents at once
Ok, so I guess I was wrong, there is a bit there. But: this is all stuff that I've heard several times about MongoDB.
One of the things he got into was that he learned when MongoDB was appropriate and when it wasn't -- but IMO he didn't really go into detail here other than to say: joins and transactions. I wish there was more substance than that.
I also seem to recall reading recently that even supposedly ACID-compliant databases have problems with transactions under load, and that the only way to really get full ACID compliance is to use serializable isolation - perhaps I am wrong, but I think that's the gist of what I've read.
Personally, I'm working on a project right now that uses MongoDB and I definitely do miss joins, transactions, and schemas. I am not even remotely sure where one would benefit from a system that had documents with arbitrary fields in it. (I didn't pick the technologies for my project.)
I would really like to know when to choose which database - I feel like I know the bare basics of several, and none of them in depth. I really don't have time to learn them all. I feel like even when I'm building one application it's only some months after deployment that if I'm lucky or made a mistake that I will run into some actual performance constraint.
>I would really like to know when to choose which database
I have asked this question before - under what situations is Mongo not just equally good as Postgres - but actually better?
The only really coherent answer was that it was easier to configure replication (presumably because it chooses a lot of defaults for you). I'm not even sure that was a good thing given the number of obscure bugs that can arise from incorrectly configured replication.
Postgres even seems to be more performant at NoSQL use cases (using the JSON store) than Mongo, which is frankly embarrassing.
MongoDB 2.8 (in RC4 right now, due out "in early January" last I heard) has the WiredTiger storage engine, which as well as high performance and compression also brings document-level locking and multi-document transactions. The roadmap says this'll become the default storage engine in 3.0 (3rd quarter 2015). I think the blog author's observation that for basically those omissions "MongoDB can still beat TokuMX on a future release. But only in a future release. Today it can’t." is only true if you ignore the development releases completely, and will be incorrect within a week or two.
Sure, if you want SQL or a traditional relational database, a traditional relational SQL database is a better choice. Using Postgres or Oracle for workloads better suited to tied hashes or BerkeleyDB doesn't automatically become "right" though.
MongoDB will always be a relevant example of false advertising and over-marketing (to the point of 10gen probably opening themselves to litigation) and how we all need to stop drinking the kool-aid.
There is LITERALLY no reason to use MongoDB today. If you're thinking of using MongoDB, for the love of god just try PostgreSQL.
we use a mongodb cluster for a multi-hundred GB document store that powers ~50 various worker instances, 2 sites and an API with sub 100ms response times. we have never lost data or experienced downtime worse than other dbs i've used in the past.
at the end of the day you just need to do your job properly and read the documentation, not blame the tool when you can't use it properly.
*disclaimer - this is not to say postgres isn't awesome.
It may be hype, but when MongoDB was released, what other mainstream database was offering the same functionality? I see PostgreSQL + JSON mentioned, but when was JSON support added?
I personally like MongoDB for:
- flexible schema (less migration pain)
- easy tags implementation
- product attributes (list of name-value pairs)
- GridFS - store binary files
- nested documents for analytics (ex: a record for each day with a nested doc for each hour)
I wish MongoDB had
- some support of join
- multi-master replication
I personally used MongoDB as primary storage (yeay dot me), but currently prefer it as secondary storage.
As any product it evolves and I expect to see more improvements. I think MongoDB brought NoSQL to the masses :)
If transactions are the biggest problem, there are driver solutions like http://godoc.org/labix.org/v2/mgo/txn for that. I am not sure how they compare to "real" transactions.
Still missing "joining" in some cases though. Otherwise I am OK with MongoDB so far.
Jeez. Please do not mention client-library-managed transactions and then put "real" transactions in scare quotes. At least read the Wikipedia article first.
I just wanted to distinguish between the two. I wasn't scared in any way but read up on scare quotes now - so thanks for that.
http://en.wikipedia.org/wiki/Scare_quotes
While it does work, it's still a workaround for the lack of first-class transactions in MongoDB. It offers a more limited API and requires care on the developer side.
Despite being the author, I do hope it gets obsoleted by more convenient first-class transaction support in MongoDB itself at some point in the near future.
I never understood the hype around MongoDB. Maybe it's the fact that I started programming when SQL DBs were the prevalent/mainstream option, and got too used to them... Anyway, I'm glad people are recognizing the hype and realizing that RDBMSs have enough to offer.
Nice writeup. I always saw MongoDB as a database for really fast product prototyping. Once you get to a certain point in web scale, I'd guess you'd move to one of these alternative solutions.