• Featured post

Embeddings, Vector Search & BM25

A computer cannot understand text, nor the semantic relationships or meaning between words. It can only understand numbers. We solve this by using embeddings.

An embedding is a representation of text, as numbers, in a vector space. This lets AI models compare and operate on the meaning of words.

flowchart TD
    A["dog"] --> B
    B --> C["[-0.003, 0.043, ..., -0.01]"]
    
    N1["(text we want to convert)"]:::note --> A
    N2["(vectors with semantic content)"]:::note --> C
    
    classDef note fill:none,stroke:none,color:#777;    

The vectors for each word or document capture the semantic meaning of the text.

  • dog will be close to pet
  • contract will be far from beach
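As a rough illustration of "close" and "far", here is a minimal sketch that compares vectors with cosine similarity, one common way to measure how similar two embeddings are. The 3-dimensional vectors are invented by hand for this example; a real embedding model produces far longer vectors, and the numbers here do not come from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for illustration, not real embeddings).
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "pet": [0.8, 0.9, 0.2],
    "contract": [0.1, 0.2, 0.9],
    "beach": [0.9, 0.1, 0.1],
}

dog_pet = cosine_similarity(toy_vectors["dog"], toy_vectors["pet"])
contract_beach = cosine_similarity(toy_vectors["contract"], toy_vectors["beach"])

print(f"dog vs pet: {dog_pet:.3f}")
print(f"contract vs beach: {contract_beach:.3f}")
```

With these toy values the first pair scores much higher than the second, which is the behaviour the list above describes.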

Vector vs SQL databases

The problem with typical databases is that they only look for exact matches. If I search for car, I will only get the entries that contain car.

Vector databases, in contrast, can interpret the semantics of words through their vectors, so if I search for car I can get results such as sedan, SUV, Land Rover, etc.

Vector databases are very good when we need to find similar items by how close they are to one another. One use case is finding similar movies (Netflix). Another is recommending similar items in online stores (Amazon).
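The contrast can be sketched in a few lines. This is only a toy comparison: the rows and their 2-dimensional vectors are made up by hand, and a plain substring check stands in for an exact-match query.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# (text, toy vector) pairs. The vectors are invented for illustration only.
rows = [
    ("sedan", [0.9, 0.1]),
    ("SUV", [0.8, 0.2]),
    ("car", [0.95, 0.05]),
    ("beach", [0.1, 0.9]),
]

query_text = "car"
query_vector = [0.95, 0.05]  # pretend this is the embedding of "car"

# Exact matching: only rows that literally contain the query text.
exact_matches = [text for text, _ in rows if query_text in text]

# Vector search: rank every row by similarity to the query vector, keep the top 3.
ranked = sorted(rows, key=lambda row: cosine_similarity(query_vector, row[1]), reverse=True)
nearest = [text for text, _ in ranked[:3]]

print("exact:", exact_matches)
print("nearest:", nearest)
```

The exact match only finds the literal word, while the similarity ranking also surfaces the related vehicles and leaves the unrelated row out.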

How to run a search (query) using vectors

(You can see the code here)

We need:

  • A vector database (CosmosDB)
  • A model to generate the embeddings (text-embedding-3-large)

The complete flow is as follows:

  1. Use an embedding model to obtain the vectors for the content we want to index
  2. Insert the original text and the content's vectors into a vector database
  3. When we want to run a query, use the same embedding model as before on the query text. With the resulting embedding, we look for similar vectors in the database and retrieve the original text from original_text
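The three steps can be sketched end to end without any external service. Everything below is a stand-in: `toy_embed` is a hypothetical word-count function over a tiny fixed vocabulary (it is not semantic and is not text-embedding-3-large), and `InMemoryVectorStore` is a hypothetical list-backed class, not the CosmosDB API. Only the field names `id`, `original_text` and `vectors` come from this post.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: counts words from a tiny fixed vocabulary."""
    vocabulary = ["shiba", "dog", "park", "walks", "contract", "signed", "lawyer"]
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self, embed):
        self.embed = embed  # the SAME function is used for indexing and querying
        self.items = []

    def insert(self, item_id, text):
        # Steps 1 and 2: embed the content, store text and vectors together.
        self.items.append({"id": item_id, "original_text": text, "vectors": self.embed(text)})

    def query(self, text, top_k=1):
        # Step 3: embed the query, rank stored items by similarity, return the original text.
        query_vector = self.embed(text)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item["vectors"]),
            reverse=True,
        )
        return [item["original_text"] for item in ranked[:top_k]]


store = InMemoryVectorStore(toy_embed)
store.insert("1", "A shiba walks alone in the park")
store.insert("2", "The lawyer signed the contract")

results = store.query("dog walks in the park")
print(results)
```

Passing the embedding function into the store is a design choice made for this sketch: it makes it hard to index with one model and query with another by accident, which would make the similarity scores meaningless.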

    Inserting vectors into CosmosDB

    Before we can search, we need to fill the database with content. We keep it simple. We insert

    • a hand-written ID
    • the original text
    • the vectors produced by embedding the original text

The pseudocode looks like this and is run one item at a time

text = "A shiba walks alone in the park"
# this sends the text to the model text-embedding-3-large 
vectors = createEmbeddingsForText(text)
item = {
	"id": "1",
	"original_text": text,
	"vectors": vectors
}
uploadToCosmosDB(item)

Example of the data I store:

{
	"id": "1",
	"original_text": "A shiba walks alone in the park",
	"vectors": [-0.003, 0.043, ..., -0.001]
}

Read More

Find differences for big dynamic lists in Excel

(For this post, some formulas and menu names are in Spanish, because my Excel and computer are set to Spanish and Excel formula names depend on the language.)

Here is how to find and mark differences between unequal, really long lists or tables in Excel. In my example, one list is a partial copy of the other: some items are missing and you have to find out which ones.

This is the full list.

full list

Read More

OAuth 2.0

Authentication is the process of verifying an identity: we confirm users are who they say they are (for example, with a username and password).

Authorization is the process of verifying what someone is allowed to do (permissions and access control).
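To make the distinction concrete, here is a minimal sketch with the two checks as separate functions. The user, password and permission names are hypothetical, and the unsalted SHA-256 hash only keeps the example short; a real system would use a dedicated salted password-hashing scheme.

```python
import hashlib
import hmac

# Hypothetical in-memory data, for illustration only.
USERS = {"alice": hashlib.sha256(b"s3cret").hexdigest()}
PERMISSIONS = {"alice": {"read:notes"}}


def authenticate(username, password):
    """Authentication: is this really the user they claim to be?"""
    expected = USERS.get(username)
    if expected is None:
        return False
    supplied = hashlib.sha256(password.encode()).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, supplied)


def authorize(username, permission):
    """Authorization: is this (already identified) user allowed to do this?"""
    return permission in PERMISSIONS.get(username, set())


print(authenticate("alice", "s3cret"))
print(authorize("alice", "read:notes"))
print(authorize("alice", "delete:notes"))
```

A user can pass the first check and still fail the second: being who you say you are does not mean you may do everything.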

Past solutions

From worst to best, with the problems each one causes:

Credential Sharing

The worst one. An app is not able to differentiate between real user access and programmatic access.
Permissions are typically too broad, so the app also has the ability to access more content than it should.

We could redirect the user off to the API where they could enter their credentials and get a cookie. This allows an app to access the API.

This is dangerous because of CSRF attacks: we've authorised the whole browser, not just the app.

Read More

How to solve VirtualBox disk has run out of space


How to solve the problem “Low disk space on ‘Filesystem root’. The volume has only xMB disk space remaining” when you completely fill a virtual disk in VirtualBox.

(You have to delete all your snapshots first)

Open a cmd terminal and run the following command (all on one line):

"c:\Program Files\Oracle\VirtualBox\VBoxManage.exe" modifymedium "c:\Users\mario\VirtualBox VMs\Ubuntu OTAN\Ubuntu OTAN.vdi" --resize 30000

The first path is an executable included with VirtualBox.
The second one is where your VDI actually is. --resize takes the new size in MB.
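If you would rather script this, the same call can be assembled as an argument list, which avoids the quoting problems caused by the spaces in both paths. This is only a sketch: `build_resize_command` is a hypothetical helper, and the paths are the ones from the example above, so adjust them to your own machine.

```python
def build_resize_command(vboxmanage_path, vdi_path, size_mb):
    """Return the VBoxManage resize call as a list of arguments (size in MB)."""
    if size_mb <= 0:
        raise ValueError("size_mb must be a positive number of megabytes")
    return [vboxmanage_path, "modifymedium", vdi_path, "--resize", str(size_mb)]


command = build_resize_command(
    r"c:\Program Files\Oracle\VirtualBox\VBoxManage.exe",
    r"c:\Users\mario\VirtualBox VMs\Ubuntu OTAN\Ubuntu OTAN.vdi",
    30000,
)
print(command)

# To actually run it (not executed here):
#   import subprocess
#   subprocess.run(command, check=True)
```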

Open GParted and resize the partition to use the new space.

SCRUM PSM1 Certification - Index


Status: Certified!

These notes are my watered-down, personal version of The Scrum Guide 2020 and the following Udemy course: “Preparation For Professional Scrum Master Level 1 (PSM1)” by Vladimir Raykov.

If you want to get ready for the certification exam, I fully recommend buying his course on Udemy and watching it several times.

Scrum Guide 2020
1. Scrum Guide 2020 Notes
2. Scrum Glossary

“Preparation For Professional Scrum Master Level 1 (PSM1)” by Vladimir Raykov
1. Scrum Introduction
2. The Scrum Team
3. Scrum Events
4. Scrum Artifacts
5. Scrum Practices and Charts
6. A few words before the Exam
7. Recap of key concepts
8. Possible exam questions

Java Index

These are my Java-related notes. Here I keep everything I refer back to when I have doubts about how to use or implement a framework or feature I’ve already implemented once.

Version changes

Interesting changes, new functionality and APIs that come to Java with each new version. These notes don’t cover every change, only the ones I found most useful or most interesting.

From Java 8 to Java 11
Java12
Java13

Experience

Small, functional snippets on how to implement a specific feature.

Java experience sheet
How to create a database intermediate table
Java date time API
New script files in Java

Frameworks

How to use and implement specific frameworks in a Java project (using Maven).

Spring in Action (Book)
Spring Cache
Spring Beans
Thymeleaf
Spring Cors

Maven (builder)
Testing (JUnit, TestNG, Mockito)
Vert.x (microservices)
Lombok (builder)
MapStruct (mapper)
Liquibase (database version control)

Splunk

Splunk takes any type of data, even millions of entries, and lets you process it into reports, dashboards and alerts.

It’s great at parsing machine data. We can train Splunk to look for certain patterns in data and label those patterns as fields.

Planning Splunk Deployments

A note on config files

Everything Splunk does is governed by configuration files. They’re stored under /etc (inside the Splunk installation directory) and have a .conf extension.

They’re layered. You can have files with the same name in several directories. You might have a global-level conf file and an app-specific conf file. Splunk checks which one to use based on the current app.
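The layering idea can be sketched as a simple override of settings. This is a simplification under stated assumptions: it models only the two levels mentioned above (global and app-specific), whereas Splunk's real precedence rules have more layers, and the setting names and the `resolve_settings` helper are hypothetical.

```python
def resolve_settings(global_conf, app_confs, current_app):
    """Start from the global settings, then let the current app's settings override them."""
    resolved = dict(global_conf)
    resolved.update(app_confs.get(current_app, {}))
    return resolved


# Hypothetical settings, invented for illustration.
global_conf = {"max_results": "100", "timezone": "UTC"}
app_confs = {"search": {"max_results": "500"}}

in_search = resolve_settings(global_conf, app_confs, "search")
elsewhere = resolve_settings(global_conf, app_confs, "other_app")

print(in_search)
print(elsewhere)
```

Inside the app that defines its own value, the app-level setting wins; everywhere else the global value applies unchanged.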

Read More

Oracle 1Z0-819 (Java11) Certification - Index

The new 1Z0-819 certification combines the two previous exams (1Z0-815 and 1Z0-816) into one.

OCP Java SE 11 Programmer I - Study guide for 1Z0-815

Welcome to Java
Java Building Blocks
Java Operators
Making Decisions
Core Java APIs
Lambdas and Functional Interfaces
Methods and Encapsulation
Class Design
Advanced Class Design
Exceptions
Java Modules

OCP Java SE 11 Programmer II - Study guide for 1Z0-816

Java Fundamentals
Java Annotations
Generics and Collections