June 25, 2016

Text colorizers

JS Bin on jsbin.com (for some reason jsbin's babel option doesn't work)

May 29, 2015

March 18, 2015

February 26, 2015

February 13, 2015

2048 AI

JS Bin
It currently reaches 4096 at most; it's still too inefficient/slow (no bitboards used) and not tuned enough.

January 26, 2015

Maze solver

Solving mazes: http://caub.github.io/misc/maze, shamelessly relying on this great pathfinding library.
It searches paths through image pixels, here's a fiddle illustrating it http://jsfiddle.net/crl/6amoefe0/

January 17, 2015

January 14, 2015

Ruzzle solver

A clone and solver of this addictive game (my username is crl12) http://caub.github.io/misc/ruzzle

December 23, 2014

DBify

A simple app that lets you call your DB directly; it's meant for playing and testing only.
Examples:
post("/sql", ["select * from test where hello=?", 1])
post("/mongo", ["find", {"hello": 1}])
Demo: https://dbify.herokuapp.com/
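The `post` helper used in the examples isn't shown; here's a minimal sketch of what it could look like, assuming the endpoints accept a JSON array body (the base URL and body shape are assumptions):

```javascript
// Hypothetical sketch of the `post` helper above (not part of DBify itself).
const BASE = 'https://dbify.herokuapp.com';

// Build the request separately so it's easy to inspect/test.
function buildRequest(path, payload) {
  return {
    url: BASE + path,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    },
  };
}

function post(path, payload) {
  const { url, options } = buildRequest(path, payload);
  return fetch(url, options).then(res => res.json());
}

// post('/sql', ['select * from test where hello=?', 1]).then(console.log);
```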

June 11, 2014

February 22, 2014

MongoCli: MongoDB + websocket + knockoutJS

Mongo-cli is a web 'swiss-knife' that I made last year, for conveniently saving any type of data from JavaScript, with per-person Read/Write/Modify permissions and a notification sent to the group of people concerned (the allowed readers).
So it's a minimal Content Management System (in both features and code size/complexity).

It's named mongo-cli (Command Line Interface) because it's used as an interactive shell to MongoDB, directly from JavaScript:
send({fn:'insert', args:[{a: 2, hello: 'world'}]}, console.log) 
send({fn:'find', args:[{a: 2}]}, console.log)

Demos:
 - heroku nodejs (PaintChatTodos)
 - heroku java (PaintChatTodos)
 - cloudbees java

It's on github:
https://github.com/n11/mongo-cli-nodejs
https://github.com/n11/mongo-cli-java

February 4, 2013

Key-Value Indexation on Cassandra

During my work at Inria Sophia-Antipolis, there was an exciting challenge: adding search logic on top of the Cassandra database, which is a bit 'raw' and designed for efficiency.
Cassandra tables are structured in a key-value scheme like DHTs, except that values are themselves ordered* maps of inner key-value tuples called columns; such a table is called a column-family. (*the ordering is specified at creation)
CQL3 index creation would be an alternative, but it is not really adequate here, since any possible field name may need indexing.

A realistic approach would be tree structures such as B-trees; however, for our use case we chose to explore a very simple method: inverted indexes, which act like a flat directory lookup.
Let's illustrate it with an example:
{  
  'a5e9c146f': {
    'text': 'The quick brown fox jumps over lazy dog',
    'title': 'Fox',
    'author': 'mike123@gmail.com',
    'date': '2012-11-12'
  }
}

For indexing this small data chunk, we will store in a different table the so-called 'inverted indexes', which are rows pointing to a particular data id:
{
  'author=mike123%40gmail.com'                   : {'a5e9c146f': null}, 
  'text=quick'                                   : {'a5e9c146f': null},
  'text=brown'                                   : {'a5e9c146f': null},
  'date=2012-11-12'                              : {'a5e9c146f': null},
  //...
}

key=value pairs are generated from the original data, either with the exact value or post-processed in special cases (in this example, the 'text' value is split into single words).
Column entries are simply formed with a unique data id and an empty value (we don't need it).
Row keys can also be compositions of name=value pairs, sorted alphabetically and concatenated:
{
  //...
  'author=mike123%40gmail.com&text=quick'        : {'a5e9c146f': null},
  'author=mike123%40gmail.com&text=brown'        : {'a5e9c146f': null},
  'text=quick&title=Fox'                         : {'a5e9c146f': null},
  //..
  'date=2012-11-12&text=fox&text=quick'          : {'a5e9c146f': null},
  //...
  //...
}

You can see how this grows: with k pairs we would end up with 2^k insertions. This exponential growth is not acceptable, so we can limit the length of composites to min(3, k), for instance. This way we don't exceed Sum(i=0..min(3,k)) C(k, i) insertions, which scales much better, in O(k³). In practice there's a good chance that rows like 'text=the' or 'text=dog' already exist in the database, in which case we just append this data id column to those rows.
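A sketch of this key generation under the min(3, k) cap (illustrative only; the 'text'-splitting rule is hard-coded here, and percent-encoding is done with encodeURIComponent as in the 'mike123%40gmail.com' row above):

```javascript
// Generate inverted-index row keys for a data item: single terms plus
// alphabetically sorted composites of at most `maxLen` terms.
function indexKeys(fields, maxLen = 3) {
  const terms = [];
  for (const [k, v] of Object.entries(fields)) {
    // special case: 'text' values are split into single lowercase words
    const values = k === 'text' ? String(v).toLowerCase().split(/\W+/).filter(Boolean) : [v];
    for (const val of values) terms.push(`${k}=${encodeURIComponent(val)}`);
  }
  terms.sort();
  const keys = [];
  const combine = (start, combo) => {
    if (combo.length) keys.push(combo.join('&'));
    if (combo.length === maxLen) return; // cap composite length
    for (let i = start; i < terms.length; i++) combine(i + 1, combo.concat(terms[i]));
  };
  combine(0, []);
  return keys;
}
```

Each returned key becomes a row in the inverted-index table, with the data id as an empty-valued column.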

Search queries

{
  'title': ['Fox'], 'text': ['brown', 'quick']
}
will be a query matching our example above, thanks to the index row 'text=brown&text=quick&title=Fox'.

For larger composites exceeding 3 elements, the first 3 elements are queried directly against the database, then the next 3, and so on until the last ones (in order to reduce result counts), before intersecting these different result sets.
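A sketch of that chunk-and-intersect step (`lookup` stands in for fetching a row of the inverted-index table and returning its column ids):

```javascript
// Query sorted terms in chunks of `chunkSize`, then intersect the id sets.
function searchLarge(terms, lookup, chunkSize = 3) {
  const sorted = [...terms].sort();
  const sets = [];
  for (let i = 0; i < sorted.length; i += chunkSize) {
    const rowKey = sorted.slice(i, i + chunkSize).join('&');
    sets.push(new Set(lookup(rowKey)));
  }
  // intersect all per-chunk result sets
  return sets.reduce((acc, s) => new Set([...acc].filter(id => s.has(id))));
}
```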

We can also form 'OR' queries, which run the sub-queries in parallel and merge the results into the same map:
[
  {'title':['Fox'], 'text':['brown','quick']},
  {'title':['Fox'], 'text':['test']},
  {'title':['Fox'], 'date':['2012-11-12']}
]

Range queries


With numbers, we usually want to search over a range. Cassandra has a built-in column-key comparator (range queries are allowed within a row), which could serve for 1-dimensional ranges. But we preferred a DB-agnostic method, using k-ary trees (in particular decimal trees).
For instance, to index '2012-11-12' (1352678400 in unix seconds), we add every leading prefix of this value to the rest of the indexed pairs of this data:
{
  '_t=1'                                         : {'a5e9c146f': null},
  //..
  '_t=1352678400'                                : {'a5e9c146f': null},
  //..
  '_t=1&_t=13&text=quick'                        : {'a5e9c146f': null},
  '_t=13&_t=135&text=quick'                      : {'a5e9c146f': null},
  '_t=135&_t=1352&text=quick'                    : {'a5e9c146f': null},
  //...
  '_t=135267&text=quick'                         : {'a5e9c146f': null},
  //...
  '_t=135&_t=13526&_t=1352678400'                : {'a5e9c146f': null},
  //...
}

Searching a range, e.g. the whole year 2012, [1325376000, 1356998400], is done by finding the 'best' covering prefixes. The more accurate the covering, the larger the OR query will be.
[132,133,134,135] has a resolution of only 3 (each prefix representing an interval of 1e7 seconds ≈ 115 days), [1325,1326,1327,1328,1329,133,134,1350,1351,1352,1353,1354,1355,1356] has a resolution of 4 (thus at worst 1e6 seconds ≈ 11 days of error at the boundaries), etc.
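The single-resolution case can be sketched as follows for 10-digit unix timestamps (the mixed-length covering shown above is a refinement of this; function name and the fixed total length are assumptions):

```javascript
// Cover [lo, hi] with every decimal prefix of length `res` that overlaps it.
// Coarser resolutions give shorter OR queries but more overshoot at the edges.
function coverRange(lo, hi, res, totalLen = 10) {
  const block = 10 ** (totalLen - res); // values spanned by one prefix
  const out = [];
  for (let p = Math.floor(lo / block); p <= Math.floor(hi / block); p++) {
    out.push(String(p).padStart(res, '0'));
  }
  return out;
}
```

Each returned prefix maps to one '_t=prefix' index row, and the prefixes are OR-ed together in the query.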

For 2D, we can continue to apply this method for each coordinate:
{
  '_lat=8e21&_lat=8e21f&_lon=1a1242'             : {'a5e9c146f': null},
  //...
}

Searching in a bounding box is similar, by covering the ranges of each dimension.
Note: there exist alternatives like space-filling curves to go down to 1D (see this blog).

Conclusion

  • Searches are fast (mainly under 3 'AND' searched terms)

Management is more complex:
  • Insertion, update and deletion need to regenerate those indexes each time.
  • We index (and search) K-V terms in different ways: either directly (exact value of V), by range, or with semantic parsing of V (e.g. stemming, de-conjugating verbs, eliminating plurals... similar to Lucene's text processing).
  • An optional object in the query maps a specific field in the data to a defined handler in our backend, e.g. range1|range2|...|range15|text|binary (rangeN generates indexes with the first N prefixes)
ex:
 {type: 'create', data: {t: 958, name:'Bolt'}, handler: {t: 'range4'}} // generates indexes [0,09,095,0958]
 {type: 'search', data: {t: [900,1000]}, handler: {t: 'range4'}}  //finds [09] as the optimal list of indexes to search, "handler: {t: 'range4'}" is optional, it will enforce a maximal range value of length 4
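What a rangeN handler could generate, sketched to match the comment above (the zero-padding to N digits is an assumption inferred from the [0,09,095,0958] example):

```javascript
// Hypothetical rangeN handler: zero-pad the value to `n` digits,
// then emit every leading prefix as an index term.
function rangeN(value, n) {
  const s = String(value).padStart(n, '0');
  return Array.from(s, (_, i) => s.slice(0, i + 1));
}
```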

February 2, 2013

An example of KeyValue indexation for graphs

The DB architecture was defined in the previous post; let's illustrate it with a graph example:
{  
  'graph:1:A': {
    'graph:1:D': 'knows',
    'graph:1:B': 'knows'
  },'graph:1:B': {
    'graph:1:C': 'knows'
  },'graph:1:C': {
    'graph:1:D': 'knows',
    'graph:1:E': 'knows'
  },'graph:1:D': {
  },'graph:1:E': {
  }
}
These data items can be created with
{type: 'create', data: {'graph:1:D': 'knows', 'graph:1:B': 'knows'}, id:'graph:1:A'}
//etc..

which produces in the inverted-index table, the rows
{  
  'graph:1:D=knows': {
    'graph:1:A': null,
    'graph:1:C': null
  },'graph:1:B=knows': {
    'graph:1:A': null
  },'graph:1:B=knows&graph:1:D=knows': {
    'graph:1:A': null
  },'graph:1:E=knows': {
    'graph:1:C': null
  },'graph:1:D=knows&graph:1:E=knows': {
    'graph:1:C': null
  },'graph:1:C=knows': {
    'graph:1:B': null
  }
}
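These rows can be derived mechanically from each create call; a sketch, using the same composite scheme as the previous post (composites capped at 2 here to match the rows above):

```javascript
// Build the inverted-index rows for one data item: each sorted combination
// of up to `maxLen` key=value terms maps the data id to an empty column.
function invertedRows(id, data, maxLen = 2) {
  const terms = Object.entries(data).map(([k, v]) => `${k}=${v}`).sort();
  const rows = {};
  const combine = (start, combo) => {
    if (combo.length) {
      const key = combo.join('&');
      if (!rows[key]) rows[key] = {};
      rows[key][id] = null; // empty column value, only the id matters
    }
    if (combo.length === maxLen) return;
    for (let i = start; i < terms.length; i++) combine(i + 1, combo.concat(terms[i]));
  };
  combine(0, []);
  return rows;
}
```

In a real insert, rows produced for different ids would be merged into the same index table, as in the example above where 'graph:1:D=knows' collects both 'graph:1:A' and 'graph:1:C'.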
The keys contain ':'. This notation, called composite column keys, is actually useful inside rows: since the columns in a row are sorted, searching by a composite prefix is possible, so Cassandra allows searching 'graph:1:*' or a more specific range query like ['graph:1:D', 'graph:1:J']. But none of this applies to row keys directly: the usual row-key partitioners don't preserve ordering, so on row keys this technique is simply a way to prefix and partition data when wanted.

The above example actually models outgoing connections in the data items, while the incoming ones live in the inverted-index items.

We can read the outgoing connections of a node with:
{type: 'read', id:'graph:1:C'}
or read its incoming connections with:
{type: 'search', data: {'graph:1:C': 'knows'}}
search for nodes with A and C as predecessors:
{type: 'search', data: {'graph:1:C': 'knows', 'graph:1:A': 'knows'}}
or for nodes with A or C as predecessor:
{type: 'search', data: [{'graph:1:C': 'knows'}, {'graph:1:A': 'knows'}]}

Note that there is an asymmetry here: incoming-connection (predecessor) queries can be richer than direct reads of the data (successors); you can't readily search for nodes that would have both A and C as successors. The way to deal with this is to also store the reverse direction in the data, e.g. 'graph:1:B': {'graph:1:A': 'isKnownBy'}...
More complex queries remain the responsibility of the app, although we plan to extend the API to execute a sequence of searches instead of multiple API calls.
ex:
{
  type: 'search',
  data: {'graph:1:C': 'knows'},
  each_i: {
    type: 'search',
    data: {i: 'knows'},
    each_i: {type: 'read', id: i}
  }
}
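A client could resolve that nested each_i query along these lines, assuming `search` returns matching ids and `read` fetches a node (both stubbed here for illustration):

```javascript
// Two-hop traversal: nodes knowing `start`, then nodes knowing each of those,
// reading every node found at the second level.
function twoHop(start, search, read) {
  const result = {};
  for (const id of search({ [start]: 'knows' })) {     // predecessors of start
    for (const id2 of search({ [id]: 'knows' })) {     // predecessors of those
      result[id2] = read(id2);                         // each_i: {type:'read', id: i}
    }
  }
  return result;
}
```

The planned API extension would just move this loop server-side, saving the intermediate round-trips.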