This section is aimed at students in upper secondary education in the Danish school system; some concepts will be simplified and some details omitted.

Definition of a Function

A function is a mathematical object that relates pairs of other objects. This can be done in multiple ways, but the most immediately obvious way, which even young children could manage, is simply to place the two objects near each other. You can also connect the two objects by a string or by some description. Since we will mostly be working with mathematical abstractions, most of our relations will be described mathematically. As an example, consider the following sets of objects $$X=\{car_1, car_2\}\text{, }Y=\{car_3,car_4\}\text{ and }Z=\{red,green,blue\}$$ We could make a function that relates a car to its colour; examples of such functions could be the following $$f:X\to Z\text{, and }g:X\cup Y\to Z$$ where we call \(X=dom(f)\) and \(f(X)=ran(f)\) the "domain" and "range" of \(f\) respectively, and \(Z\) is the "co-domain" of both. Let these functions be defined as follows $$f(car_1)=green,\ f(car_2)=blue$$ and $$g(car_1)=red,\ g(car_2)=green,\ g(car_3)=blue,\ g(car_4)=red$$ where the first function represents the car's interior colour and the second represents the exterior paint. The first function is called "injective", since each car in the domain is assigned a separate colour in the co-domain. This is not true for the second function, since two cars are red, but it has another property, namely that for each colour in the co-domain, there is a car with that colour. This property is called being "surjective", which the first function is not, since neither of its cars is red. A function that is both injective and surjective is particularly nice, since each object in the domain and co-domain is assigned a unique object in the other, and it is called "bijective".
These can be skipped on a first reading

Rigorous Definitions of
Injectivity, Surjectivity and Bijectivity

For a function \(g:X\to Z\) and a subset \(Y\subset X\) we define $$g(Y)=\{z\in Z:z=g(y)\text{ for some }y\in Y\}$$ In relation to our example, it's all the colours that are used as exterior colours for the last two cars. A function \(f:X\to Z\) is injective if $$f(x_i)=f(x_j)\implies x_i=x_j$$ It's surjective if $$f(X)=Z$$ and it's bijective if it's both.
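For finite functions these definitions can be checked mechanically. Below is a small sketch where the car example is encoded as Python dictionaries; the helper names are my own:

```python
# Encode the example functions as dictionaries: keys form the domain,
# values are the assigned colours in the co-domain Z.
f = {"car1": "green", "car2": "blue"}
g = {"car1": "red", "car2": "green", "car3": "blue", "car4": "red"}
Z = {"red", "green", "blue"}

def is_injective(func):
    # No colour may be used twice: the number of distinct values
    # must equal the number of domain elements.
    return len(set(func.values())) == len(func)

def is_surjective(func, codomain):
    # Every colour in the co-domain must be hit by some car.
    return set(func.values()) == codomain

print(is_injective(f), is_surjective(f, Z))  # f is injective but not surjective
print(is_injective(g), is_surjective(g, Z))  # g is surjective but not injective
```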
These properties have some consequences for the so-called inverse function.

Inverse Functions

Relating pairs of objects is in principle a symmetric process, but it is very useful, and common practice, to consider functions as directed relations, going from one object to the other. The opposite direction is then called the "inverse" function. The function \(f:X\to Y\) will then have the inverse function \(f^{-1}:Y\to X\) with the property that $$f^{-1}(y)=\{x\in X:f(x)=y\}$$
This can lead to the uncomfortable situation where the inverse function has multiple values for some objects in its domain. To be honest, this is not unique to the inverse function: if you consider our example from before, we could imagine a car having two colours, i.e. $$g(car_1)=\{red,blue\}$$ which could represent it being purple, or simply that it is not monochromatic. These types of functions are called "multi-valued" functions, and you may encounter them again in the future, but in the current treatment I will not be analysing multi-valued functions in depth. Instead I will mostly consider single-valued inverse functions, and this can be accomplished by making sure that the original function is injective.
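For a finite function, the set-valued inverse from the definition above can be computed directly, which makes the multi-valuedness visible; a small sketch with my own helper name:

```python
# Exterior paint colours from the example.
g = {"car1": "red", "car2": "green", "car3": "blue", "car4": "red"}

def preimage(func, y):
    # f^{-1}(y) = {x in X : f(x) = y}, returned as a set.
    return {x for x, value in func.items() if value == y}

print(preimage(g, "red"))    # two cars are red, so the inverse is multi-valued here
print(preimage(g, "green"))  # exactly one car, so the inverse is single-valued here
```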

Inverse functions

An injective function \(f:X\to Y\) will have a single-valued inverse function, with the property that $$x=f^{-1}(f(x))$$ for all \(x\in X\). If it is bijective we also have that $$y=f(f^{-1}(y))$$ for all \(y\in Y\).

Proof

Let \(y\in ran(f)\), and assume that we have two values, \(x_1\) and \(x_2\), that get mapped to it. Then we have \begin{align} f^{-1}(y)&=\{x\in X:f(x)=y\}\supset\{x_1,x_2\}\\ \implies f(x_1)&=f(x_2)=y\\ \implies x_1&=x_2 \end{align} by injectivity, which shows that the inverse is single-valued. Let's drop the subscript, since the two values are equal; then we get $$f^{-1}(y)=f^{-1}(f(x))=x$$ For the second identity, we just use surjectivity, i.e. that \(ran(f)=Y\), which allows us to choose any \(y\in Y\) and perform the following argument $$y=f(x)=f(f^{-1}(y))$$

Examples of bijective functions are non-constant linear functions \(f(x)=ax+b\) with \(X=Y=\mathbb{R}\) and inverse $$x=f^{-1}(y)=\frac{y-b}{a}$$ Exponential functions \(f(x)=ba^x\) with \(a,b>0\), \(a\neq1\), \(X=\mathbb{R}\), \(Y=\mathbb{R}_+\), and $$x=f^{-1}(y)=\frac{\log(y)-\log(b)}{\log(a)}$$ specifically, if \(b=1\) and \(a=e\), then \(\log(y)\) is the inverse function of the natural exponential function. We also have power functions \(f(x)=bx^a\) with \(X=Y=\mathbb{R}_+\) and $$x=f^{-1}(y)=\sqrt[a]{\frac{y}{b}}$$ which is why we require \(y,b>0\), so that we don't get complex numbers for now.
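The three inverse formulas above can be checked numerically by composing each function with its claimed inverse; a small sketch with illustrative parameter values of my own choosing:

```python
import math

# Illustrative parameters: a != 0 for the linear case,
# a, b > 0 and a != 1 for the exponential and power cases.
a, b = 2.0, 3.0

f_lin = lambda x: a * x + b
f_lin_inv = lambda y: (y - b) / a

f_exp = lambda x: b * a**x
f_exp_inv = lambda y: (math.log(y) - math.log(b)) / math.log(a)

f_pow = lambda x: b * x**a
f_pow_inv = lambda y: (y / b) ** (1 / a)  # the a-th root of y/b

# x = f^{-1}(f(x)) should hold for every x in the domain.
for x in [0.5, 1.0, 4.0]:
    assert math.isclose(f_lin_inv(f_lin(x)), x)
    assert math.isclose(f_exp_inv(f_exp(x)), x)
    assert math.isclose(f_pow_inv(f_pow(x)), x)
```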

Discrete Functions

So far I have allowed functions to act on arbitrary (non-mathematical) objects, which will be useful for probability theory, since you can consider random processes with all sorts of weird objects and have the probability function relate each of them to a number, namely its probability. On a first approach, you will probably start an analysis of probabilities by counting the sizes of events, and this will produce so-called "discrete" functions, where the function values are separated by some non-zero value, which will also produce a discrete probability function. These discrete functions are arguably the simplest class of functions where we can start doing actual mathematical analysis. One characteristic that allows this is that it is at this point that we can actually start systematizing the relation into functional expressions, e.g. $$f(n)=\frac{n}{12}$$
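A functional expression like the one above can be tabulated directly, and the tabulation makes the "separated by some non-zero value" property visible; a small sketch using exact fractions:

```python
from fractions import Fraction

def f(n):
    # The discrete function f(n) = n/12 from the text,
    # kept as an exact fraction rather than a float.
    return Fraction(n, 12)

values = [f(n) for n in range(1, 6)]
# Consecutive function values are separated by the fixed non-zero step 1/12.
steps = [values[k + 1] - values[k] for k in range(len(values) - 1)]
print(values)
print(steps)
```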

Real Functions

On further analysis, we introduce random variables, where discrete probability functions will eventually be replaced by probability density functions, which allow both the domain and co-domain to be real intervals, with the latter usually being the unit interval \([0,1]\). These real functions are generally much more powerful than discrete functions and allow for much broader analysis. Most of this analysis will depend on being able to represent these functions with functional expressions of various complexities. In fact, when you collect a discrete data-set of the relationship between two quantities, the first analysis you perform is often a regression, which creates a real function that interpolates the data-set into all the intermediary points by some functional expression. With the domain and co-domain being such similar objects, it is often not necessary to distinguish between the two, so we denote by \(x\) the value being related, regardless of whether it is related by the function or its inverse. With this convention we can write $$x=f(f^{-1}(x))=f^{-1}(f(x))$$
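The regression step described above can be sketched concretely. The data below is made up for illustration; the code fits a line \(f(x)=ax+b\) by the standard least-squares formulas, producing a real function defined at all the intermediate points:

```python
# A made-up discrete data set of two related quantities.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Standard least-squares formulas for slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def f(x):
    # The fitted real function, defined for all intermediate
    # points as well, e.g. f(2.5), not just the measured xs.
    return a * x + b

print(a, b, f(2.5))
```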

Integrals

The arguably most immediate characteristic is the area under the graph, which we call the definite integral, with the following notation $$\int_a^b f(x)dx$$ which denotes the area between the graph of \(f\) and the x-axis in the interval \([a,b]\). The wavy symbol is an old version of the letter "S" and the "dx" is just a delimiter that indicates where the function expression ends and which letter is the variable of our function. There are multiple ways to define this area, but the simplest procedure is to split the interval into sub-intervals $$a=x_0< x_1< x_2<\cdots< x_n< x_{n+1}=b$$ and calculate the areas of the rectangles that have these sub-intervals as bases and some function value, in the sub-interval, as the height. For low values of \(n\) this will often deviate greatly from the true area, depending on the choice of function value, but by decreasing the widths of the sub-intervals, the estimate becomes better. I.e. $$\int_a^bf(x)dx=\lim_{n\to\infty}\sum_{k=0}^nf(t_k)\Delta x_k$$ where \(t_k\in[x_k,x_{k+1}]\), \(\Delta x_k=x_{k+1}-x_k\), and \(\lim\) stands for the Latin word "limes", which represents the value of the expression under the condition indicated underneath, in this case as \(n\) increases to infinity, with the widths of all sub-intervals approaching 0. This is a pretty vague and limited definition, but it will be sufficient for our current purposes.
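The rectangle procedure above is easy to carry out numerically. Below is a sketch, using equally wide sub-intervals and the left endpoint as the sampled function value \(t_k\), applied to \(f(x)=x^2\) on \([0,1]\), whose true area is \(1/3\):

```python
def riemann_sum(f, a, b, n):
    # Split [a, b] into n equally wide sub-intervals and sum the
    # rectangle areas f(t_k) * dx, sampling at the left endpoint.
    dx = (b - a) / n
    return sum(f(a + k * dx) * dx for k in range(n))

f = lambda x: x * x
for n in [10, 100, 1000]:
    # The estimate approaches the true area 1/3 as n grows.
    print(n, riemann_sum(f, 0.0, 1.0, n))
```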

Continuity

The next characteristic that is novel to real functions is continuity, which in everyday terms refers to the "connectedness" of the graph. At university you will eventually see continuity defined in a quite different manner, namely in terms of topologies, but this will suffice for now.
These can be skipped on a first reading

Rigorous Definitions of Continuity

A function \(f:X\to Y\) is called continuous if for each \(x_0\in X\) and \(\varepsilon>0\) there exists a \(\delta>0\) such that $$|x-x_0|<\delta\implies|f(x)-f(x_0)|<\varepsilon$$
This just means that you can make two function values arbitrarily close to each other by making the two variables sufficiently close to each other, which corresponds to the graph having no jumps or breaks.

Derivative

The last novel characteristic is the slope of the graph, which requires the graph to be connected at the point in question to be meaningfully defined; a real function with this characteristic is called "differentiable". As opposed to linear functions, most functions don't have constant slope, so we will need another way to define the slope of an arbitrary function. To achieve this we draw inspiration from linear functions, by choosing two points on the graph and drawing the line through these two points. This line is called the secant line and we can calculate its slope by the two-point formula $$a=\frac{\Delta y}{\Delta x}=\frac{f(x_2)-f(x_1)}{x_2-x_1}=\frac{f(x+h) -f(x)}{h}$$ where we've written the horizontal distance \(\Delta x=x_2-x_1=h\), and omitted the subscript since we've expressed the second point in terms of the first. The slope of the secant line will approximate the slope of the function well for most nice functions, assuming that the points are sufficiently close to each other, i.e. for sufficiently small \(h\). In fact, if we let the two points "collide", we arrive at the so-called tangent line, which locally intersects the graph in only a single point, and this is what we define as the slope of the graph at that point, also known as the derivative. We use the following notation $$f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$$ where \(f'(x)\) denotes the slope, or derivative, of \(f\) at the point \(x\). This time the limit is for a vanishing value of \(h\), i.e. the horizontal distance between the points.
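The collapse of the secant slope towards the tangent slope can be observed numerically. A sketch for \(f(x)=x^2\) at \(x=1\), where the derivative is \(2\); here the secant slope works out exactly to \(2+h\), so the pattern is clearly visible:

```python
def secant_slope(f, x, h):
    # The slope of the secant line through (x, f(x)) and (x+h, f(x+h)).
    return (f(x + h) - f(x)) / h

f = lambda x: x * x
for h in [0.1, 0.01, 0.001]:
    # For f(x) = x^2 the secant slope at x = 1 equals 2 + h,
    # so these values approach the derivative f'(1) = 2.
    print(h, secant_slope(f, 1.0, h))
```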
The Danish school system teaches the so-called three step rule, that I personally find rather rigid and restrictive, so I will try to formulate a more robust rule with the same name:

Three Step Rule

\begin{align} 1.&\text{ Write out }\Delta y=f(x+h)-f(x)\\ &\text{ (rewrite and reduce as necessary)}\\ 2.&\text{ Divide by }h\\ &\text{ (rewrite and reduce as necessary)}\\ 3.&\text{ Let }h\text{ approach }0\\ &\text{ (use any tricks or previous results to make a conclusion)} \end{align}
The main difference is that the Danish rule merges steps 1. and 2. and adds the reductions as the second step, but it's much more convenient not to divide by \(h\) until you've reduced the numerator.
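As a worked example (my own, not taken from the Danish curriculum materials), the rule applied to \(f(x)=x^2\) gives:

\begin{align} 1.&& \Delta y&=f(x+h)-f(x)=(x+h)^2-x^2=2xh+h^2\\ 2.&& \frac{\Delta y}{h}&=\frac{2xh+h^2}{h}=2x+h\\ 3.&& f'(x)&=\lim_{h\to0}(2x+h)=2x \end{align}

Note how the numerator in step 1 is reduced before dividing by \(h\) in step 2, which makes the cancellation immediate.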

Vector Functions

Sometimes we are interested in relating pairs, and even larger sets, of numbers to one another. An example is the ideal gas law from physics $$PV=nRT$$ where we can calculate the pressure, if we know the other variables, by the following function $$P(V,n,T)=R\frac{nT}{V}$$ To avoid having to deal with infinities we can restrict the variables to be strictly positive $$dom(P)=\mathbb{R}_+\times\mathbb{R}_+\times\mathbb{R}_+=\mathbb{R}_+^3$$ This is pretty much the same thing as three-dimensional vectors with positive coordinates. In general, the most common domains are the so-called Euclidean spaces, $$X=\mathbb{R}\times\cdots\times\mathbb{R}=\mathbb{R}^n$$ where \(n\in\mathbb{N}\) is called the dimension. It might not be clear how dimensions higher than 3 or 4 might be useful, but in AI research you often deal with functions whose domains have dimensions in the dozens, hundreds or higher. In the case of the ideal gas law function, the domain consisted of three-dimensional vectors while the co-domain was just a single number, i.e. $$P:\mathbb{R}_+^3\to\mathbb{R}_+$$ In general a vector function will be defined as $$f:\mathbb{R}^n\to\mathbb{R}^m$$ where \(n,m\in\mathbb{N}\).
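The ideal gas law function can be sketched directly, including the restriction of its domain to strictly positive inputs; the example inputs below are my own illustrative values:

```python
# Molar gas constant in SI units, J/(mol*K).
R = 8.314

def P(V, n, T):
    # P(V, n, T) = R * n * T / V with domain restricted to R_+ x R_+ x R_+.
    if min(V, n, T) <= 0:
        raise ValueError("all of V, n, T must be strictly positive")
    return R * n * T / V

# One mole at 298 K in 0.0248 m^3 gives a pressure close to
# atmospheric pressure (on the order of 10^5 Pa).
print(P(0.0248, 1.0, 298.0))
```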