Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
In this article, we have explored several Data Structures that are used in Spreadsheet / Excel Sheet. These involve ideas like Sparse Matrix, AVL Tree and much more.
Table of contents:
- Introduction to Spreadsheet / Problem Statement
- simple bruteforce way
- a lazy way
- sparse matrix
- openoffice's implementation
- using AVL Tree
Pre-requisites:
Introduction to Spreadsheet / Problem Statement
If you have used Excel Sheet
from Microsoft or Calc
from OpenOffice then you must have seen that the whole Spreadsheet looks like a table,
Where it has columns as A, B, C, ...
and row as 1, 2, 3, ...
Let's say you want to make your Spreadsheet Application and you simply took the whole table as a
2D Array!!
.
The problem with this is that you would be occupying lots of memory that you don't even require because lots of your cells are just empty.
That's why you could use many other ways to store your cells and data without occupying lots of unnecessary memories. And we will be seeing one of those ways in this article, going to go through how to implement it, especially one method that is known to be used by Excel Sheet
is using Binary Tree.
Notes:
Access
word refers to accessing a cell in the spreadsheet using its row and column.
Index
The Bruteforce Way
Using directly 2D Matrix
would be simple and straightforward to store the whole Spreadsheet because you don't have to every time create or delete a piece of memory.
But this will be a really bruteforced method to store the values since, normally no one will be using that whole reserved memory.
You might have another reason to not use this is that you will be defining a static 2D array
.
But let's say you're using the whole space, and you're not deleting or inserting any new values at random positions, then this would be a great choice ( simple and fast, but for only this case).
Inserting
Since, we're defining a static 2D Array
here, therefore, you cannot extend it, but let's say you did define a dynamic array then, you may perform two kinds of insertion, i.e.
1. Inserting a column
Doing this will require you to first decide at what place you want to insert a new column(by index), and then you need to shift the whole array of a column into a new array while creating a new empty row of the same size as the other rows.
Complexity: O(N)
2. Inserting a row
Now this could be done in two ways,
- insert one row in one column, hence, losing the rule of the same size of the column and same size of the row.
- insert one row in every column, hence, replacing every column array with a bigger one, So, it could fit a whole new row, while maintaining the Square Matrix.
Complexity: O(N2)
Inserting a cell would shift all other values which will deform the integrity of your data, therefore, inserting a cell in this data structure is not possible.
Deleting
Actually while deleting a cell in this Matrix
will shift the other values unlike the other data structure we're going to see afterward, therefore a cell isn't possible.
But if you need you can delete a column or a row.
1. Deleting a column
When we delete a whole column ( so, deleting one array of rows ), we will require to shift the whole columns after that to fill the place.
Complexity: O(N)
2. Deleting a row
The whole sequence of a row is separated throughout the columns, therefore we would have to delete a row and shift the whole other rows to fill the space for every column.
Complexity: O(N2)
Searching
When searching a value in this data structure we would require to look at every (both empty and non-empty) cell.
Therefore, the cells we need to check will be row * col
.
i.e. Searching Complexity will be: O(N2)
Complexities
- Space :
O(N2)
- Insert :
Cannot be done.
- Delete :
Cannot be done.
- Search :
O(N2)
- Access :
O(1)
N
is the number of rows/columns considering that the row and column have the same length and form Square Matrix.
The Lazy Way
Will discuss this one first because it's simple and easy, and it provides an idea to make it easy to continue to a sparse matrix.
But I'm not sure if it should be used anywhere.
In this approach, whenever their is value available for a cell then we store both the value and the position of the cell together, this could be done using JSON
as mentioned in above link. Or you could try some of your other ways such as HashMap
also known as Dictionary
.
Few examples:
1. In JSON with Cell
and Value
.
2. A little variation
Here, we're using HashMap as above, except instead of storing it by Cell (position value), first, we have HashMap for columns, and then each column has another HashMap for a cell.
This solution is more like a variant ofSparse Matrix
that we will see next.
Inserting
If you have used HashMap
before then you may know that you require a value for your position/index ( to generate a hash value ) and then a value that you want to store at that index.
So, if we're using the above method of nested column and row, then,
Steps to insert a value will be:
- Check if the column you need exists or not, if it doesn't then make an entry for the column with the position values such as
A, B, C, ...
or you can also use1, 2, 3, ...
, and create aHashMap
to store rows.- Then Check for the row position value and make Entry if not available.
Complexity: O(1)
Deleting
Check like for the Entry as above we did while inserting the value and delete the entry for the row and if there aren't any available rows in that column then also delete the entry for the column.
Complexity: O(1)
Searching
Searching will require us to look through all the entries, So:
- we would have to go through every Entry for rows and check if the value is the one we're searching for
- and step one would have to be performed on every column.
Complexity: O(N)
Complexities
- Space :
O(N)
- Insert :
O(1)
- Delete :
O(1)
- Search :
O(N)
- Access :
O(1)
N
is the number of entries made.
Sparse Matrix
The concept of Sparse Matrix is used when most of the element available in a matrix is zero, And when we're dealing with a Spreadsheet then this is the case for most of the time.
In Sparse Matrix instead of storing the values in 2D Array,
we take three different arrays to store the values of the column, row, and the values that reside there.
Three Arrays:
- Values => Stores the user given values that you see in cells.
- Row => Stores the row value of the cell.
- Column => Here every index represents the column, and this array stores the index value at which the rows start, in the Row array.
Something like:
Before we proceed, let's assume you have
col
androw
as your position forColumn
andRow
respectively andvalue
that you need to add on theValue
array.
Inserting
To insert a value in Sparse Matrix
we will need to alter all three arrays in one order,
- Column
- First we will check if is there any value available at
Column[col]
that is notnull
or any value that you decided to use to show that "this Column has no value",- If
Column[col]
is empty then find the next column that is available take its value, let's call itrow_i
and setColumn[col] = row_i
(skip 5th step).- And
Column[col]
have any value then increment values of theColumn[col]
and for every values after that.
- Row
- Now, if you found not NULL value at
Column[col]
, then go toRow[Column[col]]
( let's takeRow[Column[col]]
asrow_i
).- Now, increment
row_i
and find the value which is greater than yourrow
or else yourrow_i
becomes greater thanColumn[next_column]
.- Shift the whole values starting from
row_i
to make a space inRow
array and then doRow[row_i] = row
.
- Values
- Now only insertion is remaining, take
row_i
from there shift every value towards the end of the array by 1, and now put thevalue
atValue[row_i]
.
Pseudocode
def next_column(pre):
next_col = 0
for i in range(pre, len(Column)):
if Column[i] is not Null:
next_col = Column[i]
break;
return next_col
get col, row, value;
row_i = 0
if Column[col] is Null:
row_i = next_column(col)
Column[col] = row_i
else:
row_i = Column[col]
next_col = next_column(row_i)
for i in range(row_i, Column[next_col] + 1):
if Row[i] > row:
row_i = i
break;
elif i == Column[next_col]:
row_i = Column[next_col] + 1
for i in range(col, len(Column)):
Column[i] += 1 if Column[i] is not Null
Row.insert(row_i, row)
Value.insert(row_i, value)
Deleting
Let's look at steps for deleting a value from the Matrix
here again we would be using the row
and col
- Column
- Get the
row_i
fromColumn[col]
.- Now, decrement the values in the
Column
array starting fromcol
to the end ofColumn
.
- Row
- Find the row by incrementing
row_i
and checking the value in the Row array until you reachColumn[next_column]
.- Now shift all
row
starting from therow_i
to the end of theRow
array, towards left by 1.
- Value
- And then shift all the
value
in the Value array towards the left by 1, fromrow_i
to the end of theValue
array.
Pseudocode
get row, col;
row_i = 0
if Column[col] is not Null:
# step 1
row_i = Column[col]
# step 2
for i in range(col, len(Column)):
Column[col] -= 1 if Column[col] is not Null
# step 3
next_col = Column[next_column(col)]
for i in range(row_i, next_col + 1):
if Row[i] == row:
row_i = i
elif i == next_col:
return;
Row.delete(row_i)
Value.delete(row_i)
Searching
Searching in Sparse Matrix
should be possible through the Linear Search.
This time we will be going from the array Value
and then to the Row
and then Column
.
- Value
Here you can perform a Linear Search to find the index at which our
value
resides. - Row
Now, after we know index for value, we know index for the row as well, therefore,
row = Row[index]
. - Column
Also, here do Linear Search and find a value that is equal to
index
and return that as acol
, and if you find a value that is greater thanindex
then return the previousindex
which was notNull
.
Pseudocode
get value;
index = 0
row = 0
col = 0
# searching the value
for i in range(len(Value)):
if value == Value[i]:
index = i
# got the row
row = Row[i]
break;
# searching for the col
pre_i = 0
for i in range(len(Column)):
if Column[i] is not Null:
if Column[i] > index:
col = pre_i
break;
elif Column[i] == index:
col = i
break;
pre_i = i
return (col, row)
Complexities
- Space :
O(N)
- Insert :
O(N)
- Delete :
O(N)
- Search :
O(N)
- Access :
O(1)
OpenOffice's Implementation
Mentioning this here only as an example to the Sparse Matrix.
We're not going to go through the whole thing but only the idea that they have used here to store the cells because a lot of things happen in a Spreadsheet than just storing the values like, you have to provide support for multiple kinds of data types and also a possibility to perform the operations regarding them.
If we look at the OpenOffice doc for data model of cell and sheet, we notice that they use 1024 columns and that's the max value for a column, and after this, the column stores only values for non-empty rows.
To imagine all this.
You can notice that the column is going in one order, but that is not compulsory with the rows.
AVL Tree
AVL Tree is a Binary Search Tree with some extra steps, which allows it to balance itself.
I recommend you to read about AVL and BST before you proceed on this topic of using AVL as a data structure for a Spreadsheet.
In this approach we will use Nodes of the AVL Tree store three values in tree row
, column
and value
, and in that we will use the values of row
and col
to make comparisons in the AVL Tree.
We can have two variants regarding this:
1. One-Tree
One Tree is not an official name or anything, but just a name popped up in my head while organizing this article.
In this variant, we will store all three col
, row
and value
, in one Node and only one AVL Tree will be created ( therefore, every value will be stored by one AVL Tree).
The comparison between Nodes can be done using both col
and row
values while giving more priority to one than the other,
Giving more priority to:
- col
If we look at some values while giving priority to
col
for comparisons, then it looks something like,
A1 < A2 < A3 ... < B1 ... < C1 ...
So, you see no matter whatrow
you select inA
it will always be less than theBn
.
- row
Like
col
, if we give priority torow
then allA1
,B1
,C1
, ... will be less than theA2
,B2
,C2
,...
And if said in short anyx1
will be less than anyx2
and so on.
Well, here I'm not sure what kind of difference there will be in performance, but we're going to choose to give more priority to the column because it would separate every row into different Sub-Tree of a column.
2. Nested-Tree
Another way to do this is to have a different AVL Tree for column and row. Every Node in AVL Tree for column need to contain a pointer to the AVL Tree of rows and this Tree for row contains Cells of the Spreadsheet.
You might notice that structure of this is not different than the above One Tree while we're giving priority to the col.
Operations
Now let's discuss how to implement this thing, well, we're going to look at the Nested-Tree, because if you want to implement the One-Tree method then we would need to implement the whole AVL Tree instead of using an already built one or maybe you could try features like such as operator overloading or any feature that a programming language might provide.
Therefore, for now, let's just only focus on Nested one.
Inserting a value
For this, we would need row
, col
, and value
the first step would be, to see if col
is available or not,
- if it is available then, use the pointer to the Tree of row and insert the Node for the
row
with thevalue
- And if it's not available then, create an AVL Tree with root Node of
row
which containsvalue
and then insert the Tree pointer in theTree of column
.
pseudocode
get row col value
if Columns.has(col):
Rows = Columns.get(col)
Rows.insert(Node(row, Cell(value)))
else:
Rows = Tree()
Rows.insert(Node(row, Cell(value))
Columns.insert(Node(col, Row))
Deleting a value
Deleting a value in this is simpler than Inserting a value,
We will only need row
and col
of the Cell we need to delete.
First, we need to search the Node for col
in Columns's Tree and then check if the Rows Tree have only one row
that we need to delete then, delete the Node for col
in Columns Tree directly or else only delete the Node Rows's Tree.
Pseudocode
get row col
global Columns
if Columns.has(col):
Rows = Columns.get(col)
if Rows.len() == 1:
Columns.delete(col)
else:
Rows.delete(row)
Searching a value
Since we're finding a value, therefore, we need the value
for this operation.
After we have value
we need to look at every Node in every Rows Tree and check if the Node has the value we need.
This operation would be easier to do, if your AVL Tree is made up of an array since, you have to only iterate through it, else if it's made up of Nodes pointing to their children then you will require more codes to do this.
We will look at both of these.
- If Tree is Array
Pseudocode
get value
global Columns
for col in range(len(Columns)):
for row in range(len(Columns[col].rows)):
if Columns[col].rows[row].value == value:
return(col, row)
- If Tree is Nodes
Pseudocode
get value
global Columns
class Tree:
# This is generator which will run recursively
# provides the next node everytime it's called.
def next(self):
if self.left is not Null:
yield self.left
self.left.next()
if self.right is not Null:
yield self.right
self.right.next()
for _ in range(Columns.len()):
node = Columns.next()
(col, Rows) = node.col, node.rows
for _ in range(Rows.len()):
node = Rows.next()
if value == node.value:
return (col, node.row)
Complexities
- Space :
O(N)
- Insert :
O(logN))
- Delete :
O(logN)
- Search :
O(N)
- Access :
O(logN)
Extras
Well, we're done here, but I will consider the above concepts are hard to understand, and for that, I have written examples in python for all the concepts mentioned above in this repo.