Embeddings, Vector Search & BM25

A computer cannot understand text, or the semantic relationships and meaning between words. It can only work with numbers. We solve this with embeddings.

An embedding is a representation of text (as numbers) in a vector space. This lets AI models compare and operate on the meaning of words.

flowchart TD
    A["dog"] --> B
    B --> C["[-0.003, 0.043, ..., -0.01]"]

    N1["(text we want to convert)"]:::note --> A
    N2["(vectors with semantic content)"]:::note --> C
    
    classDef note fill:none,stroke:none,color:#777;    

The vectors for each word or document capture the semantic meaning of the text.

  • dog will be close to pet
  • contract will be far from beach
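To make "close" and "far" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure proximity. The 3-dimensional vectors are invented by hand for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption: NOT output of a real embedding model).
toy_vectors = {
    "dog": [0.9, 0.0, 0.3],
    "pet": [0.8, 0.1, 0.2],
    "contract": [0.0, 0.9, 0.0],
    "beach": [0.1, 0.0, 0.9],
}

dog_pet = cosine_similarity(toy_vectors["dog"], toy_vectors["pet"])
contract_beach = cosine_similarity(toy_vectors["contract"], toy_vectors["beach"])

print(f"dog vs pet:        {dog_pet:.3f}")
print(f"contract vs beach: {contract_beach:.3f}")
```

A score near 1 means the two vectors point in almost the same direction (similar meaning), while a score near 0 means they are unrelated.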

Vector vs SQL databases

The problem with typical databases is that they only look for exact matches. If I search for car, I only get the entries that contain car.

Vector databases, in contrast, can interpret the semantics of words through their vectors, so a search for car can also return values such as sedan, SUV, Land Rover, etc.

Vector databases are very good when we need to find similar items by their proximity to one another. One use case is finding similar movies (Netflix). Another is recommending similar items in online stores (Amazon).

How to run a search (query) using vectors

(You can see the code here)

We need:

  • A vector database (CosmosDB)
  • A model to generate the embeddings (text-embedding-3-large)

The full flow is as follows:

  1. Use an embedding model to get the vectors for the content we want to index
  2. Insert the original text and the vectors of the content into a vector database
  3. When we want to run a query, pass the query text through the same embedding model as before. With the resulting embedding we search for similar vectors in the database and read the original text from original_text

Inserting vectors into CosmosDB

To be able to search, we first need to fill the database with content. We keep it simple and insert:

  • a hand-written ID
  • the original text
  • the vectors that result from embedding the original text

The pseudocode looks like this and runs one item at a time:

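text = "A shiba walks alone in the park"
# createEmbeddingsForText is a placeholder helper (pseudocode, not an SDK call):
# it sends the text to the model text-embedding-3-large and returns its vector
vectors = createEmbeddingsForText(text)
item = {
    "id": "1",
    "original_text": text,
    "vectors": vectors
}
# uploadToCosmosDB is also a placeholder: it stores the item in the database
uploadToCosmosDB(item)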

Example of the data I store:

{
	"id": "1",
	"original_text": "A shiba walks alone in the park",
	"vectors": [-0.003, 0.043, ..., -0.001]
}
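The whole flow (embed, store, query) can be sketched in plain Python without any external service. Everything here is an assumption made to keep the example runnable: fake_embed is a toy word-count function over a tiny fixed vocabulary that stands in for text-embedding-3-large, and a Python list stands in for CosmosDB. Only the item shape (id, original_text, vectors) comes from the example above.

```python
import math

# Tiny fixed vocabulary (assumption): one vector dimension per word.
VOCAB = ["shiba", "dog", "park", "walks", "contract", "signed", "beach", "sunny"]


def fake_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words.
    It matches shared words only and captures no real semantics."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vectors match nothing


store = []  # in-memory stand-in for the vector database


def index_text(item_id, text):
    # Steps 1 and 2: embed the content, keep text and vectors together.
    store.append({"id": item_id, "original_text": text, "vectors": fake_embed(text)})


def search(query, top_k=1):
    # Step 3: embed the query with the same function, rank by similarity.
    query_vector = fake_embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vectors"]),
        reverse=True,
    )
    return [item["original_text"] for item in ranked[:top_k]]


index_text("1", "A shiba walks alone in the park")
index_text("2", "The contract was signed yesterday")
index_text("3", "A sunny day at the beach")

print(search("dog in the park"))
```

With this toy embedder the query only matches through the shared word park. A real embedding model would additionally place dog close to shiba, which is the whole point of using embeddings.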

Docker-compose how to export volumes to another machine

Data written inside a Docker container does not persist once the container is removed. We use Docker volumes to solve this issue, either to persist a container’s data or to share data between containers.

Volumes are the preferred mechanism for persisting data generated and used by Docker containers. Volumes are easier to back up or migrate.

Volumes are often a better choice than persisting data in a container’s writable layer, because a volume does not increase the size of the containers using it. The volume’s contents exist outside the lifecycle of a given container.

We have 3 volume types:

Anonymous volumes

Helpful to persist data temporarily. If we restart our container, the data is still visible. It doesn’t persist when we remove the container. Not accessible by other containers. They’re created inside /var/lib/docker/volumes.

Example file:

version: '3.8'  
services:  
  db:  
    image: mysql  
    restart: always  
    environment:  
      MYSQL_ROOT_PASSWORD: root  
      MYSQL_DATABASE: test_db  
    ports:  
      - "3306:3306"  
    volumes:  
      - /var/lib/mysql

Migrate docker-compose to k8s

Steps to go from a docker-compose file to a running Kubernetes deployment: build an image, upload it to DockerHub, and run it with Kubernetes (minikube).

Set up

Install Kompose

# linux
curl -L https://github.com/kubernetes/kompose/releases/download/v1.22.0/kompose-linux-amd64 -o kompose
chmod +x kompose
sudo mv ./kompose /usr/local/bin/kompose

Clone App and Host It into DockerHub

Clone the files, go to the directory where your Dockerfile is and run:

# mariocodes is my DockerHub username
# kubernetes-custom-java-maven-app is the name of the repo I created at DockerHub
docker build -f Dockerfile -t mariocodes/kubernetes-custom-java-maven-app .

K8s install Minikube

(Official Doc)

Minikube is a tool that quickly sets up a local Kubernetes cluster, which makes it handy for learning Kubernetes.

Installation

How to install on a fresh Ubuntu installation:

Install prerequisites and kubectl

sudo apt-get update && sudo apt-get install -y apt-transport-https curl

curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list

sudo apt-get update
sudo apt-get install -y kubectl

Common errors in k8s

STATUS CrashLoopBackOff

The pod shows CrashLoopBackOff and no logs appear. This happens when the container has no command to run, so it starts up and then exits immediately.

Give your image a command.

more here

apiVersion: v1
kind: Pod
metadata:
  name: twocontainers
spec:
  containers:
  - name: container1
    image: python:3.6-alpine
    command: ['sh', '-c', 'echo cont1 > index.html && python -m http.server 8082']
  - name: container2
    image: python:3.6-alpine
    command: ['sh', '-c', 'echo cont1 > index.html && python -m http.server 8083']

i18n & l10n

i18n is an abbreviation of Internationalization (18 letters sit between the i and the n). i18n involves, among other things, the ability to display translated content. It prepares a digital product for localization by, for example, separating the content into strings so they are ready to be translated and delivered.

The same goes for L10n, an abbreviation of Localization. L10n involves translating content, adapting graphics and finalizing the product for each regional market.
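As a rough illustration of "separating the content into strings", here is a minimal sketch. The catalog layout, the keys and the translate helper are all hypothetical names chosen for this example; real projects usually rely on a dedicated i18n library.

```python
# Hypothetical message catalogs: one dict of translated strings per locale.
CATALOGS = {
    "en": {"greeting": "Hello, {name}!", "cart_empty": "Your cart is empty."},
    "es": {"greeting": "¡Hola, {name}!", "cart_empty": "Tu carrito está vacío."},
}
DEFAULT_LOCALE = "en"


def translate(key, locale, **params):
    """Look up a string by key, falling back to the default locale."""
    catalog = CATALOGS.get(locale, CATALOGS[DEFAULT_LOCALE])
    template = catalog.get(key, CATALOGS[DEFAULT_LOCALE][key])
    return template.format(**params)


print(translate("greeting", "es", name="Ana"))
print(translate("greeting", "fr", name="Ana"))  # no "fr" catalog: falls back to "en"
```

The code never contains user-facing text itself (that is the i18n part), so adding a new market only means adding a new catalog (the L10n part).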

Reference(s)

https://www.oneskyapp.com/blog/i18n-need-to-know-internationalization/

SQL Indexes

Indexes are data structures built on one or more columns of a table to speed up queries that filter, sort or join on those columns.

For example, this index may improve performance for a query that uses last_name in a WHERE clause or an ORDER BY:

CREATE INDEX idx_last_name ON employees (last_name);

You can also create composite indexes:

CREATE INDEX idx_composite ON employees (last_name, first_name);

And a partial index that covers only active employees (supported in PostgreSQL and SQLite, but not in MySQL):

CREATE INDEX idx_active_employees ON employees (status) WHERE status = 'active';

When to use indexes

Frequent WHERE filters - if you usually filter by a specific column (last_name for example), an index should improve performance.

JOIN on big tables - when you join big tables through PKs or FKs.

ORDER BY or GROUP BY - queries that sort or group by a specific column also benefit from indexes.

Best practices

Don’t create indexes on every column. Each index slows down insert, delete and update operations.

Indexes are best used on columns with many distinct values, such as ids, names or surnames. Don’t use them on columns with very few distinct values, such as male/female (or boolean) fields.

Keep indexes optimized: operations that mass update or mass delete rows may fragment your indexes. You may need to check them periodically and REINDEX them.

ALWAYS MEASURE PERFORMANCE before and after creating an index. If the index doesn’t improve performance, remove it, as it only adds overhead.
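One way to check whether an index is actually used is to compare the query plan before and after creating it. This is a minimal sketch that assumes SQLite (through Python's built-in sqlite3 module) and made-up sample data. Other engines expose plans differently, for example through EXPLAIN in PostgreSQL, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
# Made-up sample rows, just to have something to query.
conn.executemany(
    "INSERT INTO employees (first_name, last_name) VALUES (?, ?)",
    [(f"name{i}", f"surname{i % 500}") for i in range(5000)],
)

QUERY = "SELECT * FROM employees WHERE last_name = 'surname42'"


def query_plan(connection, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


plan_before = query_plan(conn, QUERY)
conn.execute("CREATE INDEX idx_last_name ON employees (last_name)")
plan_after = query_plan(conn, QUERY)

print("before:", plan_before)  # a full table scan
print("after: ", plan_after)   # a search that uses idx_last_name
```

The plan only tells you whether the index is picked up. To decide if it is worth keeping, also time the real queries against realistic data volumes.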

Reference(s)

https://stackoverflow.com/questions/7744038/decision-when-to-create-index-on-table-column-in-database
https://stackoverflow.com/questions/52444912/how-to-find-out-fragmented-indexes-and-defragment-them-in-postgresql
https://chatgpt.com/

Advanced SQL

UNION

The UNION operator combines the results of two SELECT statements into a single result set and removes duplicate rows.

SELECT column1, column2 FROM table1
UNION
SELECT column1, column2 FROM table2

We have the following tables

company1

per  name     surname
1    ANTONIO  PEREZ
2    PEDRO    RUIZ

company2

per  name     surname
1    LUIS     LOPEZ
2    ANTONIO  PEREZ
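To try this on the two tables above, here is a small sketch that assumes SQLite through Python's built-in sqlite3 module (the column types are my assumption). It selects only name and surname, so the ANTONIO PEREZ row that exists in both companies shows up once with UNION. UNION ALL is included only as a contrast, since it keeps duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company1 (per INTEGER, name TEXT, surname TEXT);
    CREATE TABLE company2 (per INTEGER, name TEXT, surname TEXT);
    INSERT INTO company1 VALUES (1, 'ANTONIO', 'PEREZ'), (2, 'PEDRO', 'RUIZ');
    INSERT INTO company2 VALUES (1, 'LUIS', 'LOPEZ'), (2, 'ANTONIO', 'PEREZ');
    """
)

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT name, surname FROM company1 UNION SELECT name, surname FROM company2"
).fetchall()

# UNION ALL keeps every row, duplicates included.
union_all_rows = conn.execute(
    "SELECT name, surname FROM company1 UNION ALL SELECT name, surname FROM company2"
).fetchall()

print(len(union_rows), sorted(union_rows))
print(len(union_all_rows))
```

Note that if per were part of the SELECT list, the two ANTONIO PEREZ rows would differ (1 vs 2) and UNION would keep both.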

Count number of entries in filtered table

(For this post, some formulas and menu names are in Spanish, because my Excel and computer are set to Spanish and Excel formula names depend on the language.)

full list

The formula is:

=AGREGAR(3;3;J:J)-1

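The first two parameters configure the function itself: the first 3 selects the counting function (count non-empty cells) and the second 3 tells it to ignore hidden rows and error values. The important one is J:J, which marks the column to count. The key point is that rows hidden by a filter are not counted. AGREGAR is the Spanish name of the AGGREGATE function.

Watch out with headers! If your table has a header row, it gets counted too, so subtract 1 (the -1 at the end of the formula).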