Data Structure for Spreadsheet / Excel

Internship at OpenGenus

Get this book -> Problems on Array: For Interviews and Competitive Programming

In this article, we have explored several Data Structures that are used in Spreadsheet / Excel Sheet. These involve ideas like Sparse Matrix, AVL Tree and much more.

Table of contents:

  1. Introduction to Spreadsheet / Problem Statement
  2. simple bruteforce way
  3. a lazy way
  4. sparse matrix
  5. openoffice's implementation
  6. using AVL Tree

Pre-requisites:

Introduction to Spreadsheet / Problem Statement

If you have used Excel Sheet from Microsoft or Calc from OpenOffice then you must have seen that the whole Spreadsheet looks like a table,
Where it has columns as A, B, C, ... and row as 1, 2, 3, ...

Let's say you want to make your Spreadsheet Application and you simply took the whole table as a
2D Array!!.
The problem with this is that you would be occupying lots of memory that you don't even require because lots of your cells are just empty.

That's why you could use many other ways to store your cells and data without occupying lots of unnecessary memories. And we will be seeing one of those ways in this article, going to go through how to implement it, especially one method that is known to be used by Excel Sheet is using Binary Tree.

Notes:
Access word refers to accessing a cell in the spreadsheet using its row and column.

Index
The Bruteforce Way

Using directly 2D Matrix would be simple and straightforward to store the whole Spreadsheet because you don't have to every time create or delete a piece of memory.
But this will be a really bruteforced method to store the values since, normally no one will be using that whole reserved memory.
You might have another reason to not use this is that you will be defining a static 2D array.
But let's say you're using the whole space, and you're not deleting or inserting any new values at random positions, then this would be a great choice ( simple and fast, but for only this case).

Inserting

Since, we're defining a static 2D Array here, therefore, you cannot extend it, but let's say you did define a dynamic array then, you may perform two kinds of insertion, i.e.

1. Inserting a column

Doing this will require you to first decide at what place you want to insert a new column(by index), and then you need to shift the whole array of a column into a new array while creating a new empty row of the same size as the other rows.
Complexity: O(N)

2. Inserting a row

Now this could be done in two ways,

  1. insert one row in one column, hence, losing the rule of the same size of the column and same size of the row.
  2. insert one row in every column, hence, replacing every column array with a bigger one, So, it could fit a whole new row, while maintaining the Square Matrix.
    Complexity: O(N2)

Inserting a cell would shift all other values which will deform the integrity of your data, therefore, inserting a cell in this data structure is not possible.

Deleting

Actually while deleting a cell in this Matrix will shift the other values unlike the other data structure we're going to see afterward, therefore a cell isn't possible.
But if you need you can delete a column or a row.

1. Deleting a column

When we delete a whole column ( so, deleting one array of rows ), we will require to shift the whole columns after that to fill the place.
Complexity: O(N)

2. Deleting a row

The whole sequence of a row is separated throughout the columns, therefore we would have to delete a row and shift the whole other rows to fill the space for every column.
Complexity: O(N2)

Searching

When searching a value in this data structure we would require to look at every (both empty and non-empty) cell.
Therefore, the cells we need to check will be row * col.
i.e. Searching Complexity will be: O(N2)

Complexities
  • Space : O(N2)
  • Insert : Cannot be done.
  • Delete : Cannot be done.
  • Search : O(N2)
  • Access : O(1)

N is the number of rows/columns considering that the row and column have the same length and form Square Matrix.

The Lazy Way

Will discuss this one first because it's simple and easy, and it provides an idea to make it easy to continue to a sparse matrix.
But I'm not sure if it should be used anywhere.

In this approach, whenever their is value available for a cell then we store both the value and the position of the cell together, this could be done using JSON as mentioned in above link. Or you could try some of your other ways such as HashMap also known as Dictionary.

Few examples:
1. In JSON with Cell and Value.

lazy_method_1

2. A little variation

Here, we're using HashMap as above, except instead of storing it by Cell (position value), first, we have HashMap for columns, and then each column has another HashMap for a cell.
This solution is more like a variant of Sparse Matrix that we will see next.
lazy_method_2

Inserting

If you have used HashMap before then you may know that you require a value for your position/index ( to generate a hash value ) and then a value that you want to store at that index.
So, if we're using the above method of nested column and row, then,
Steps to insert a value will be:

  1. Check if the column you need exists or not, if it doesn't then make an entry for the column with the position values such as A, B, C, ... or you can also use 1, 2, 3, ..., and create a HashMap to store rows.
  2. Then Check for the row position value and make Entry if not available.
    Complexity: O(1)
Deleting

Check like for the Entry as above we did while inserting the value and delete the entry for the row and if there aren't any available rows in that column then also delete the entry for the column.

Complexity: O(1)

Searching

Searching will require us to look through all the entries, So:

  1. we would have to go through every Entry for rows and check if the value is the one we're searching for
  2. and step one would have to be performed on every column.
    Complexity: O(N)
Complexities
  • Space : O(N)
  • Insert : O(1)
  • Delete : O(1)
  • Search : O(N)
  • Access : O(1)

N is the number of entries made.

Sparse Matrix

The concept of Sparse Matrix is used when most of the element available in a matrix is zero, And when we're dealing with a Spreadsheet then this is the case for most of the time.

In Sparse Matrix instead of storing the values in 2D Array, we take three different arrays to store the values of the column, row, and the values that reside there.

Three Arrays:

  1. Values => Stores the user given values that you see in cells.
  2. Row => Stores the row value of the cell.
  3. Column => Here every index represents the column, and this array stores the index value at which the rows start, in the Row array.

Something like:
sparse_matrix_1

Before we proceed, let's assume you have col and row as your position for Column and Row respectively and value that you need to add on the Value array.

Inserting

To insert a value in Sparse Matrix we will need to alter all three arrays in one order,

  • Column
  1. First we will check if is there any value available at Column[col] that is not null or any value that you decided to use to show that "this Column has no value",
  2. If Column[col] is empty then find the next column that is available take its value, let's call it row_i and set Column[col] = row_i (skip 5th step).
  3. And Column[col] have any value then increment values of the Column[col] and for every values after that.
  • Row
  1. Now, if you found not NULL value at Column[col], then go to Row[Column[col]] ( let's take Row[Column[col]] as row_i ).
  2. Now, increment row_i and find the value which is greater than your row or else your row_i becomes greater than Column[next_column].
  3. Shift the whole values starting from row_i to make a space in Row array and then do Row[row_i] = row.
  • Values
  1. Now only insertion is remaining, take row_i from there shift every value towards the end of the array by 1, and now put the value at Value[row_i].

Pseudocode

def next_column(pre):
    next_col = 0
    for i in range(pre, len(Column)):
        if Column[i] is not Null:
            next_col = Column[i]
            break;
    return next_col

get col, row, value;
row_i = 0
if Column[col] is Null:
    row_i = next_column(col)
    Column[col] = row_i
else:
    row_i = Column[col]
    next_col = next_column(row_i)
    for i in range(row_i, Column[next_col] + 1):
        if Row[i] > row:
            row_i = i
            break;
        elif i == Column[next_col]:
            row_i = Column[next_col] + 1
    
for i in range(col, len(Column)):
    Column[i] += 1 if Column[i] is not Null    

Row.insert(row_i, row)
Value.insert(row_i, value)
Deleting

Let's look at steps for deleting a value from the Matrix here again we would be using the row and col

  • Column
  1. Get the row_i from Column[col].
  2. Now, decrement the values in the Column array starting from col to the end of Column.
  • Row
  1. Find the row by incrementing row_i and checking the value in the Row array until you reach Column[next_column].
  2. Now shift all row starting from the row_i to the end of the Row array, towards left by 1.
  • Value
  1. And then shift all the value in the Value array towards the left by 1, from row_i to the end of the Value array.

Pseudocode

get row, col;
row_i = 0

if Column[col] is not Null:
    # step 1
    row_i = Column[col]
    
    # step 2
    for i in range(col, len(Column)):
        Column[col] -= 1 if Column[col] is not Null
    
    # step 3
    next_col = Column[next_column(col)]
    for i in range(row_i, next_col + 1):
        if Row[i] == row:
            row_i = i
        elif i == next_col:
            return;

    Row.delete(row_i)
    Value.delete(row_i)
Searching

Searching in Sparse Matrix should be possible through the Linear Search.
This time we will be going from the array Value and then to the Row and then Column.

  • Value

    Here you can perform a Linear Search to find the index at which our value resides.

  • Row

    Now, after we know index for value, we know index for the row as well, therefore, row = Row[index].

  • Column

    Also, here do Linear Search and find a value that is equal to index and return that as a col, and if you find a value that is greater than index then return the previous index which was not Null.

Pseudocode

get value;

index = 0
row = 0
col = 0

# searching the value
for i in range(len(Value)):
    if value == Value[i]:
        index = i
        # got the row
        row = Row[i]
        break;

# searching for the col
pre_i = 0
for i in range(len(Column)):
    if Column[i] is not Null:
        if Column[i] > index:
            col = pre_i
            break;
        elif Column[i] == index:
            col = i
            break;
        pre_i = i

return (col, row)

Complexities
  • Space : O(N)
  • Insert : O(N)
  • Delete : O(N)
  • Search : O(N)
  • Access : O(1)
OpenOffice's Implementation

Mentioning this here only as an example to the Sparse Matrix.

We're not going to go through the whole thing but only the idea that they have used here to store the cells because a lot of things happen in a Spreadsheet than just storing the values like, you have to provide support for multiple kinds of data types and also a possibility to perform the operations regarding them.

If we look at the OpenOffice doc for data model of cell and sheet, we notice that they use 1024 columns and that's the max value for a column, and after this, the column stores only values for non-empty rows.

To imagine all this.
sparse_matrix_2

You can notice that the column is going in one order, but that is not compulsory with the rows.

AVL Tree

AVL Tree is a Binary Search Tree with some extra steps, which allows it to balance itself.
I recommend you to read about AVL and BST before you proceed on this topic of using AVL as a data structure for a Spreadsheet.

In this approach we will use Nodes of the AVL Tree store three values in tree row, column and value, and in that we will use the values of row and col to make comparisons in the AVL Tree.
We can have two variants regarding this:

1. One-Tree

One Tree is not an official name or anything, but just a name popped up in my head while organizing this article.

In this variant, we will store all three col, row and value, in one Node and only one AVL Tree will be created ( therefore, every value will be stored by one AVL Tree).
The comparison between Nodes can be done using both col and row values while giving more priority to one than the other,

Giving more priority to:

  • col

If we look at some values while giving priority to col for comparisons, then it looks something like,
A1 < A2 < A3 ... < B1 ... < C1 ...
So, you see no matter what row you select in A it will always be less than the Bn.

  • row

Like col, if we give priority to row then all A1, B1, C1, ... will be less than the A2, B2, C2,...
And if said in short any x1 will be less than any x2 and so on.

Well, here I'm not sure what kind of difference there will be in performance, but we're going to choose to give more priority to the column because it would separate every row into different Sub-Tree of a column.

2. Nested-Tree

Another way to do this is to have a different AVL Tree for column and row. Every Node in AVL Tree for column need to contain a pointer to the AVL Tree of rows and this Tree for row contains Cells of the Spreadsheet.
You might notice that structure of this is not different than the above One Tree while we're giving priority to the col.

Operations

Now let's discuss how to implement this thing, well, we're going to look at the Nested-Tree, because if you want to implement the One-Tree method then we would need to implement the whole AVL Tree instead of using an already built one or maybe you could try features like such as operator overloading or any feature that a programming language might provide.
Therefore, for now, let's just only focus on Nested one.

Inserting a value

For this, we would need row, col, and value
the first step would be, to see if col is available or not,

  • if it is available then, use the pointer to the Tree of row and insert the Node for the row with the value
  • And if it's not available then, create an AVL Tree with root Node of row which contains value and then insert the Tree pointer in the Tree of column.

pseudocode

get row col value

if Columns.has(col):
    Rows = Columns.get(col)
    Rows.insert(Node(row, Cell(value)))
else:
    Rows = Tree()
    Rows.insert(Node(row, Cell(value))
    Columns.insert(Node(col, Row))
Deleting a value

Deleting a value in this is simpler than Inserting a value,
We will only need row and col of the Cell we need to delete.
First, we need to search the Node for col in Columns's Tree and then check if the Rows Tree have only one row that we need to delete then, delete the Node for col in Columns Tree directly or else only delete the Node Rows's Tree.

Pseudocode

get row col
global Columns

if Columns.has(col):
    Rows = Columns.get(col)
    if Rows.len() == 1:
        Columns.delete(col)
    else:
        Rows.delete(row)
Searching a value

Since we're finding a value, therefore, we need the value for this operation.
After we have value we need to look at every Node in every Rows Tree and check if the Node has the value we need.
This operation would be easier to do, if your AVL Tree is made up of an array since, you have to only iterate through it, else if it's made up of Nodes pointing to their children then you will require more codes to do this.
We will look at both of these.

  1. If Tree is Array
    Pseudocode
get value
global Columns

for col in range(len(Columns)):
    for row in range(len(Columns[col].rows)):
        if Columns[col].rows[row].value == value:
            return(col, row)

  1. If Tree is Nodes
    Pseudocode
get value
global Columns

class Tree:
    # This is generator which will run recursively
    # provides the next node everytime it's called.
    def next(self):
        if self.left is not Null:
            yield self.left
            self.left.next()
        if self.right is not Null:
            yield self.right
            self.right.next()
            
for _ in range(Columns.len()):
    node = Columns.next()
    (col, Rows) = node.col, node.rows
    for _ in range(Rows.len()):
        node = Rows.next()
        if value == node.value:
            return (col, node.row)
Complexities
  • Space : O(N)
  • Insert : O(logN))
  • Delete : O(logN)
  • Search : O(N)
  • Access : O(logN)
Extras

Well, we're done here, but I will consider the above concepts are hard to understand, and for that, I have written examples in python for all the concepts mentioned above in this repo.