This section is aimed at students in upper secondary education in the Danish
school system, so some concepts will be simplified and some details will be omitted.
Definition of a Function
A function is a mathematical object that relates pairs of other objects.
This can be done in multiple ways, but the most immediately obvious way, one
that even young children could manage, is simply to place the two objects near
each other. You can also connect two objects with a string or with some
description. Since we will mostly be working with mathematical abstractions,
most of our relations will be described mathematically. As an example,
consider the following sets of objects
$$X=\{car_1, car_2\}$$
and
$$Y=\{car_3,car_4\}$$
and
$$Z=\{red,green,blue\}$$
We could make a function that relates each car to its colour; examples of
such functions could be the following
$$f:X\to Z\text{, and }g:X\cup Y\to Z$$
where we call \(X=dom(f)\) and \(f(X)=ran(f)\) the
"domain" and "range" of \(f\) respectively, and \(Z\) is the "co-domain" of both.
Let these functions be defined as follows
$$f(car_1)=green, f(car_2)=blue$$
and
$$g(car_1)=red,g(car_2)=green,g(car_3)=blue,g(car_4)=red$$
where the first function represents the car's interior and the second
represents the exterior paint. The first function is called
"injective", since each car in the domain is assigned a
separate colour in the co-domain. This is not true for the second function,
since two cars are red, but it has another property, namely that for each
colour in the co-domain, there is a car with that colour. This property is
called being "surjective", which the first function is not, since
neither of the cars is red. If a function is both injective and surjective
it is really nice, since each object in the domain and co-domain is paired
with a unique object in the other, and it is called "bijective".
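The car-and-colour example above can be sketched in a few lines of Python; the dictionary and function names are my own illustrative choices, not part of the text.

```python
# Sketch of the car/colour example: a function as a dictionary
# mapping each car (domain) to a colour (co-domain).
f = {"car_1": "green", "car_2": "blue"}            # interior colours
g = {"car_1": "red", "car_2": "green",
     "car_3": "blue", "car_4": "red"}              # exterior colours
codomain = {"red", "green", "blue"}                # the set Z

def is_injective(func):
    # Injective: no colour is assigned to two different cars.
    return len(set(func.values())) == len(func)

def is_surjective(func, codomain):
    # Surjective: every colour in the co-domain is used.
    return set(func.values()) == codomain

print(is_injective(f), is_surjective(f, codomain))  # True False
print(is_injective(g), is_surjective(g, codomain))  # False True
```

As in the text, `f` is injective but not surjective (no car is red), while `g` is surjective but not injective (two cars are red).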
These can be skipped on first approach
Rigorous Definitions of Injectivity, Surjectivity and Bijectivity
For a function \(g:X\to Z\) and \(Y\subset X\) we define
$$g(Y)=\{z\in Z:z=g(y)\text{ for some }y\in Y\}$$
In relation to our example, \(g(Y)\) is the set of all the colours that are
used as exterior colours for the last two cars.
A function \(f:X\to Z\) is injective if, for all \(x_i,x_j\in X\),
$$f(x_i)=f(x_j)\implies x_i=x_j$$
It's surjective if
$$f(X)=Z$$
and it's bijective if it's both.
This property has some consequences for the so-called inverse function.
Inverse Functions
Relating pairs of objects is theoretically a symmetric process but it
is very useful, and common practice, to consider functions as directed
relations, going from one object to the other. The opposite direction
is then called the "inverse" function. The function \(f:X\to Y\)
will then have the inverse function \(f^{-1}:Y\to X\) with the
property that
$$f^{-1}(y)=\{x\in X:f(x)=y\}$$
This can lead to the uncomfortable situation where the inverse function
has multiple values for some objects in its domain. To be honest,
this is not unique to inverse functions; considering our
example from before, we could imagine a car having two colours, i.e.
$$g(car_1)=\{red,blue\}$$
which could represent it being purple, or simply that it's not
monochromatic. These types of functions are called "multi-valued"
functions and you may encounter them again in the future, but for the
current treatment I will not be analysing multi-valued
functions in depth. Instead I will mostly consider single-valued inverse
functions, and this can be accomplished by making sure that the original
function is injective.
Inverse functions
An injective function \(f:X\to Y\) will have a single-valued
inverse function, and the property that
$$x=f^{-1}(f(x))$$
for all \(x\in X\). If it's bijective we also have that
$$y=f(f^{-1}(y))$$
for all \(y\in Y\).
Proof
Let \(y\in ran(f)\), and assume that we have two values, \(x_1\) and
\(x_2\) that get mapped to it. Then we have
\begin{align}
f^{-1}(y)&=\{x\in X:f(x)=y\}\supset\{x_1,x_2\}\\
\implies f(x_1)&=f(x_2)=y\\
\implies x_1&=x_2
\end{align}
by injectivity, which shows the inverse is single-valued. Let's drop
the subscripts since the two values are equal; then we get
$$f^{-1}(y)=f^{-1}(f(x))=x$$
For the second identity, we just use surjectivity, i.e. that
\(ran(f)=Y\), which allows us to choose any \(y\in Y\) and perform the
following argument
$$y=f(x)=f(f^{-1}(y))$$
∎
Examples of bijective functions are non-constant linear functions
\(f(x)=ax+b\) with \(X=Y=\mathbb{R}\) and inverse
$$x=f^{-1}(y)=\frac{y-b}{a}$$
Exponential
functions \(f(x)=ba^x\) with \(X=\mathbb{R}\),
\(Y=\mathbb{R}_+\), and
$$x=f^{-1}(y)=\frac{\log(y)-\log(b)}{\log(a)}$$
specifically, if \(b=1\) and \(a=e\) then we get that \(\log(y)\) is
the inverse function to the natural exponential function. We also
have power functions \(f(x)=bx^a\) with
\(X=Y=\mathbb{R}_+\) and
$$x=f^{-1}(y)=\sqrt[a]{\frac{y}{b}}$$
which is why we restrict both \(y,b>0\), so that we avoid complex
numbers for now.
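The inverse-function identity \(x=f^{-1}(f(x))\) can be checked numerically for all three families; the parameter values \(a=2\), \(b=3\) below are arbitrary choices of mine.

```python
import math

# Sketch: checking x = f^{-1}(f(x)) for the three function families
# from the text; the parameter values a, b are arbitrary.
a, b = 2.0, 3.0

pairs = [
    (lambda x: a * x + b,                                   # linear
     lambda y: (y - b) / a),
    (lambda x: b * a**x,                                    # exponential
     lambda y: (math.log(y) - math.log(b)) / math.log(a)),
    (lambda x: b * x**a,                                    # power
     lambda y: (y / b) ** (1 / a)),
]

x = 1.5
for f, f_inv in pairs:
    assert math.isclose(f_inv(f(x)), x)  # x = f^{-1}(f(x)) holds
```

The `assert` passing for each pair confirms that applying a function and then its inverse returns the original input, up to floating-point rounding.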
Discrete Functions
So far I have allowed functions to act on arbitrary (non-mathematical)
objects, which will be useful for
probability theory, since you can consider random processes with all
sorts of weird objects and have the probability function relate them to
a number, namely their probability. On first approach, you will probably start
an analysis of probabilities by counting the sizes of events, and this will
produce so-called "discrete" functions, where the function values are separated
by some non-zero amount, which will also produce a discrete probability function.
These discrete functions are arguably the simplest class of functions where
we can start doing actual mathematical analysis. One characteristic that allows
this is that, at this point, we can actually start systematising the
relation into functional expressions, e.g.
$$f(n)=\frac{n}{12}$$
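"Counting the sizes of events" can be made concrete with a small sketch; the choice of experiment (the sum of two fair dice) is my own example, not one from the text.

```python
from fractions import Fraction
from itertools import product

# Sketch: a discrete probability function built by counting event sizes,
# here for the sum of two fair six-sided dice (an assumed example).
outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

def p(n):
    # Count the outcomes in the event "the sum equals n".
    favourable = sum(1 for a, b in outcomes if a + b == n)
    return Fraction(favourable, len(outcomes))

print(p(7))   # 1/6, the most likely sum
print(p(2))   # 1/36
```

The function values jump in steps of \(1/36\), so \(p\) is a discrete probability function in the sense above.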
Real Functions
On further analysis, we introduce random variables,
where discrete probability functions are eventually replaced by probability
density functions, which allow both the domain and co-domain to be real intervals
(accumulated probabilities, for instance, lie in the unit interval \([0,1]\)).
These real functions are generally much more powerful than discrete functions
and allow for much broader analysis. Most of this analysis will depend on being able
to represent these functions with functional expressions, of various complexities.
In fact, when you collect a discrete data set of the relationship between two
quantities, the first analysis you perform is often a
regression, which creates a real function that interpolates the data set
through all the intermediary points by some functional expression. With the domain
and co-domain being such similar objects, it's often not necessary to distinguish
between the two, so we denote by \(x\) the value being related, regardless of
whether it is related by the function or by its inverse. With this convention we can write
$$x=f(f^{-1}(x))=f^{-1}(f(x))$$
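The regression step mentioned above can be sketched as an ordinary least-squares line fit; the data points are made up for illustration.

```python
# Sketch: a least-squares linear regression turning a discrete data set
# into a real function f(x) = a*x + b; the data points are made up.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Standard least-squares formulas for slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def f(x):
    # The fitted real function, defined for every x,
    # not just at the measured data points.
    return a * x + b

print(f(2.5))  # an interpolated value between the measured points
```

Note how `f` is defined for every real \(x\), not just the four measured points, which is exactly the interpolation the text describes.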
Integrals
Arguably the most immediate characteristic is the area under the graph,
which we call the definite integral, written with the following notation
$$\int_a^b f(x)dx$$
which denotes the area between the graph of \(f\) and the x-axis in the interval
\([a,b]\). The wavy symbol is an old version of the letter "S" and the "dx"
is just a delimiter that indicates where the function ends and which letter
is the variable of our function. There are multiple ways to define this area
but the simplest procedure is to split the interval into sub-intervals
$$a=x_0< x_1< x_2<\cdots< x_n< x_{n+1}=b$$
and calculating the areas of the rectangles that have these sub-intervals
as bases and some function value, from within the sub-interval, as the height.
For low values of \(n\) this will often deviate greatly from the true area,
depending on the choice of the function value, but by decreasing the widths
of the sub-intervals, the estimate becomes better. I.e.
$$\int_a^bf(x)dx=\lim_{n\to\infty}\sum_{k=0}^nf(t_k)\Delta x_k$$
where \(t_k\in[x_k,x_{k+1}]\) and \(\lim\) stands for the Latin word "limes",
which represents the value of the expression under the condition indicated
underneath, in this case as \(n\) increases to infinity and the widths of
all the sub-intervals approach 0. This is a fairly vague and limited definition,
but it will be sufficient for our current purposes.
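The rectangle construction above can be sketched directly in code; I use equal sub-interval widths and left endpoints as sample points, and \(f(x)=x^2\) on \([0,1]\) as an assumed test function, whose exact area is \(1/3\).

```python
# Sketch: approximating a definite integral by a sum of rectangle areas,
# with equal widths and left endpoints t_k = x_k as sample points.
def riemann_sum(f, a, b, n):
    dx = (b - a) / n                       # equal sub-interval widths
    return sum(f(a + k * dx) * dx for k in range(n))

# The estimate improves as n grows; the exact area is 1/3.
for n in (10, 100, 1000):
    print(n, riemann_sum(lambda x: x**2, 0.0, 1.0, n))
```

Running this shows the sums creeping up towards \(1/3\) as the rectangles get narrower, which is the limit in the definition.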
Continuity
The next characteristic that is novel to real functions, is continuity, which
in common terms refers to the "connectedness" of the graph. At university
you will eventually define continuity in a much different manner, namely
in terms of topologies, but this will suffice for now.
These can be skipped on first approach
Rigorous Definitions of Continuity
A function \(f:X\to Y\) is called continuous if, for each \(x_0\in X\)
and \(\varepsilon>0\), there exists a \(\delta>0\) such that
$$|x-x_0|<\delta\implies|f(x)-f(x_0)|<\varepsilon$$
This just means that you can make two function values arbitrarily close
to each other by making the two variables sufficiently close to each other,
which matches the intuition of the graph being connected.
Derivative
The last novel characteristic is the slope of the graph, which requires the
graph to be connected at that point to be meaningfully defined, and a real
function with this characteristic is called "differentiable". As opposed
to linear functions, most functions don't have constant slope, so we will
need another way to define the slope of an arbitrary function. To achieve
this we do draw inspiration from linear functions, by choosing two points
on the graph and drawing the line through these two points. This line is
called the secant line, and we can calculate its slope by the two-point formula
$$a=\frac{\Delta y}{\Delta x}=\frac{f(x_2)-f(x_1)}{x_2-x_1}=\frac{f(x+h)-f(x)}{h}$$
where we've written the horizontal distance as \(\Delta x=x_2-x_1=h\), and omitted
the subscript since we've expressed the second point in terms of the first.
The slope of the secant line will approximate the slope of the function well
for most nice functions assuming that the points are sufficiently close to
each other, i.e. for sufficiently small \(h\). In fact, if we let the two points
"collide", we arrive at the so-called tangent line, which locally intersects
the graph in only a single point, and its slope is what we define as the slope
of the graph at that point, also known as the derivative. We use the following
notation
$$f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$$
where \(f'(x)\) denotes the slope, or derivative, of \(f\) at the point \(x\).
This time the limit is for a vanishing value of \(h\), i.e. the horizontal
distance between the points.
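The limit can be seen numerically by evaluating the difference quotient for smaller and smaller \(h\); the test function \(f(x)=x^2\), with \(f'(x)=2x\), is my own choice of example.

```python
# Sketch: the difference quotient (f(x+h) - f(x)) / h, i.e. the slope
# of the secant line, approaching the derivative as h shrinks.
def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x**2   # an assumed test function with f'(x) = 2x
x = 3.0
for h in (1.0, 0.1, 0.001):
    print(h, difference_quotient(f, x, h))  # approaches f'(3) = 6
```

The printed values approach \(6=f'(3)\) as \(h\) decreases, mirroring the limit in the definition.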
The Danish school system teaches the so-called three step rule, that I
personally find rather rigid and restrictive, so I will try to formulate
a more robust rule with the same name:
Three Step Rule
1. Write out \(\Delta y=f(x+h)-f(x)\) (rewrite and reduce as necessary).
2. Divide by \(h\) (rewrite and reduce as necessary).
3. Let \(h\) approach \(0\) (use any tricks or previous results to make a conclusion).
The main difference is that the Danish rule merges steps 1 and 2 and
adds the reductions as its second step, but it's much more convenient not
to divide by \(h\) until you've reduced the numerator.
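As a worked example of the three steps, take \(f(x)=x^2\) (my own choice of function):
\begin{align}
\Delta y&=f(x+h)-f(x)=(x+h)^2-x^2=2xh+h^2\\
\frac{\Delta y}{h}&=2x+h\\
f'(x)&=\lim_{h\to0}(2x+h)=2x
\end{align}
Note how the numerator is reduced before dividing by \(h\), exactly as the rule recommends.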
Vector Functions
Sometimes we are interested in relating pairs, or even larger sets, of numbers
to one another. An example is the ideal gas law from physics
$$PV=nRT$$
where we can calculate the pressure if we know the other variables by
the following function
$$P(V,n,T)=R\frac{nT}{V}$$
To avoid having to deal with infinities we can restrict the variables to
be strictly positive
$$dom(P)=\mathbb{R}_+\times\mathbb{R}_+\times\mathbb{R}_+=\mathbb{R}_+^3$$
This is pretty much the same thing as three dimensional vectors with
positive coordinates. In general, the most common domains are the
so-called Euclidean Spaces,
$$X=\mathbb{R}\times\cdots\times\mathbb{R}=\mathbb{R}^n$$
where \(n\in\mathbb{N}\) is called the dimension. It might not be clear
how dimensions higher than 3 or 4 might be useful, but in AI research
you often deal with functions whose domains have dimensions in the
dozens, hundreds or higher. In the case of the ideal gas law function,
the domain consisted of three dimensional vectors while the range was
just a single number, i.e.
$$P:\mathbb{R}_+^3\to\mathbb{R}_+$$
In general a vector function will be defined as
$$f:\mathbb{R}^n\to\mathbb{R}^m$$
where \(n,m\in\mathbb{N}\).
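The ideal gas law function \(P:\mathbb{R}_+^3\to\mathbb{R}_+\) can be sketched directly; the numeric test values (one mole at 0 °C in 22.4 L) are my own illustration.

```python
# Sketch of the ideal gas law as a function P: R_+^3 -> R_+.
R = 8.314  # the gas constant in SI units, J/(mol*K)

def P(V, n, T):
    # Pressure (Pa) from volume (m^3), amount (mol) and temperature (K),
    # with the domain restricted to strictly positive values.
    if min(V, n, T) <= 0:
        raise ValueError("domain is restricted to strictly positive values")
    return R * n * T / V

print(P(0.0224, 1.0, 273.15))  # roughly 101 kPa, about one atmosphere
```

The explicit domain check mirrors the restriction \(dom(P)=\mathbb{R}_+^3\) above: the formula would divide by zero, or give unphysical values, outside it.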