Randomized Probabilistic Bag of Objects Data-Structure using Map

"Probabilistic Bag data-structure"

Recently I was working on a NLP problem, wherein I needed this (this data-structure is a pretty simple thing which is already being used in industry also, I am sure). So here few important features or requirements that I had :

NOTE: Any randomized data-structure works well only when it uses random-number generators which generate uniformly distributed values over the range.

A probabilistic, randomized bag of objects data-structure with -
a) Each object having some weight,
b) Selection from this structure should be random but should be influenced by their weights
c) The weights are positive and multiple objects can have same weight (there is no upper bound on weights but a lower bound of 0).

So I searched online and there were some real great responses - I am basically consolidating them -

There are many ways you can build such structures -
1) Using Lists: There are 2 main approaches using list

Storing unique objects in a list
Storing objects multiple times based on object's weight value

2) Using Map (Map with weight as a key)

A) Storing unique objects in a list :

Basically you maintain a list of objects (lets call it main-list), and you add each new object to a main list. Each object contains the value as well as the probability (or weight), so no object is stored multiple times, but there can be objects with same value and different probability.
You also keep track of total probability or total weight.

Operations:
a) Insert objects into bag : You add to the total probability and append the object in the main list.
b) Get object from bag : We generate a uniformly distributed random number (with max-range = total weight), and iterate over the main-list and add-up each individual elements probability until we reach the random number generated (i.e. probability-sum >= random-number).

Advantages:
a) Consumes less space, stores elements uniquely i.e. only once
b) pretty easy to implement (but mind the multi-threading scenarios)

Disadvantages:
a) Less efficient in retrieval as it takes O(n) time

Order of complexity: considering complexity of java.util.LinkedList in Java
Insertion - O(1)
Deletion - O(1)
Get / Retrieve - O(n) - as you have to iterate over main-list to add-up probabilities until you reach (i.e. equals or greater than) generated random number.

B) Storing objects multiple times based on object's weight value:

This is similar to above list of objects approach but with one significant difference, rather than storing unique objects, here we simply insert values in the list those number of times (i.e. number of times the weight).

For example, lets say there are four values i.e. format (value - weight) -> (1 - 5) , (5 - 3), (8 - 1), (9 - 2)

Then after adding each of these (in the same order) to this bag-structure - our main list would look like,

1, 1, 1, 1, 1, 5, 5, 5, 8, 9, 9

Meaning literally add elements to list - those number of times determined by their weights.
Only caveat is when you have fractional values for weights (or 0-to-1 probabilities), then you need to convert them into integers and add elements those number of times into the list.

Operations:
a) Insert objects into bag: You add or append new element into the list multiple times determined by its weight. e.g. if weight is 34 then you add that element 34 times into the list in the end.
b) Get object from bag: We generate a uniformly distributed random number (with max-range = total list size), and get the element at generated random number -index in the list. e.g. if list-size = 120 then generate random number between 0 and 119, lets say it generated 34 then select the element at index 34 from the list.

Advantages:
a) Very easy to understand and implement (but mind the multi-threading scenarios)

Disadvantages:
a) Consumes more space, stores elements multiple times so unnecessary redundancy
b) Can be less efficient in retrieval if list doesn't support O(1) index based access, e.g. ArrayList gives O(1) index based element access while LinkedList takes O(n) time.

Order of complexity: considering complexity of java.util.ArrayList in Java
Insertion - When inserting at the end -> amortized cost O(1) else O(n) when inserting in-between or at start of list.
Deletion - O(n)
Get / Retrieve - O(1) - as you use index based access to get value at generated random number index in list.
Note: For java.util.LinkedList, the values differ as.

C) Using Map (Map with weight as a key): Using java.util.NavigableMap in java

You use a navigable map - in-order to remember the insertion-order of elements and also to get the O(1) - constant time performance.
We use weight as a key - but not directly. Instead we increment the total_weight value every time new element is added, so global total-bag-weight variable value is maintained.

Operations:
a) Insert objects into bag: You add new element's weight value to total-bag-weight value and this new total-bag-weight is used as a key for new entry into map.
b) Get object from bag: We generate a uniformly distributed random number (with max-range = total-bag-weight value), and get the element from map which has key (take such last key) value greater than or equals to the generated random number value. e.g. lets say there two elements to be inserted (4.5, 15) and (45, 24) - i.e. (value - weight). Then first element will be inserted at key - 0 and second at 15. While getting an element lets say random number generated was

Advantages: This is better approach than any of above two -
a) Because it doesn't store the values multiple times, so storage space is optimized
b) Because it takes amortized O(1) access time to get the values from Bag.

Disadvantages:
All operations have an amortized time-cost and it all depends upon how-good-is-your hash-function if you have any custom hash-function. Because if it causes many key collisions then it would give O(n) worst case performance.

Order of complexity: considering complexity of java.util.NavigableMap in Java
Insertion - amortized cost O(1)
Deletion - amortized cost O(1)
Get / Retrieve - amortized cost O(1)

Here is the java code - which I found out from stack-overflow - and this guy gave a brilliant answer.

/**
 * RandomCollection class
 * @param 
 */
class RandomCollection {
  private final NavigableMap map = new TreeMap();
  private final Random random;
  private int total = 0;

  public RandomCollection() {
    this(new Random());
  }

  public RandomCollection(Random random) {
    this.random = random;
  }

  public void add(int weight, E result) {
    if (weight <= 0)
      return;
    map.put(total, result);
    total += weight;
  }

  public E next() {
    int value = (int) (random.nextDouble() * total);
    return map.ceilingEntry(value).getValue();
  }
}

Credits to Peter Lawrey- http://stackoverflow.com/questions/6409652/random-weighted-selection-java-framework

Reference Links:

1) http://en.wikipedia.org/wiki/Fitness_proportionate_selection#Java
2) http://stackoverflow.com/questions/9330394/how-to-pick-an-item-by-its-probability Simplest
3) http://bukkit.org/threads/help-with-fake-random.85467/
4) https://bukkit.org/threads/solved-randomly-picking-from-a-list-of-items-with-probability.85508/
5) https://github.com/andfRa/Saga/blob/b012377d1ebe3db384d0ea41ae707edb5601b23c/src/org/saga/dependencies/EconomyDependency.java
6) http://stackoverflow.com/questions/6409652/random-weighted-selection-java-framework
7) http://docs.oracle.com/javase/7/docs/api/java/util/NavigableMap.html

Search This Blog

Mixture (Sanskrit - मिश्रण) of Interesting Topics